unhosted web apps

freedom from web 2.0's monopoly platforms

22. How to locate resources

Last week, Adrian suggested that this blog really needs an episode about search. I totally agree, so here it is! :)

URLs, DNS, and DNR

The DNS system is a database that is distributed in operation but centrally governed, and it maps easy-to-remember domain names to hard-to-remember IP addresses. In the process, it provides a level of indirection whereby the IP address of a given service may change while the domain name stays the same.
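
To make that indirection concrete, here is a minimal sketch of a lookup using Node's built-in resolver (the domain is just an example; any domain with an A record would do):

```typescript
// A minimal sketch of the indirection DNS provides, using Node's built-in
// resolver: the name stays stable even when the A records behind it change.
import { resolve4 } from "node:dns/promises";

async function whereIs(domain: string): Promise<void> {
  const addresses = await resolve4(domain); // current IPv4 addresses
  console.log(`${domain} currently resolves to`, addresses);
}

whereIs("example.com").catch(console.error);
```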

These two characteristics make DNS+HTTPS the perfect basis for weaving together hypertext documents. One downside of DNS is that it is subject to attacks from (especially) nation-state governments. Another is that registering a domain name costs money. The money you pay for a domain name is not so much a contribution to the cost of running the world's DNS infrastructure as a proof-of-work threshold, which curbs domain name squatting to some extent. If registering a domain name were cheaper, it would be even harder to find a domain name that is not yet taken.

Portals and search engines

Domain names are sufficient for keeping track of the SMTP, FTP, SSH and other accounts you connect to, but to look up general information on the web, you often need to connect to servers whose domain name you don't know by heart, or may not even have heard of before starting a given lookup. Using just DNS to locate resources on the web was never going to work: you would only be able to find information you had already seen before. So portals were invented, effectively as an augmentation of DNS.

A portal effectively replaces the need to remember domain names: instead of remembering that the information you were looking for is hosted on www.myfriendtheplatypus.com, you would typically have dir.yahoo.com as your home page, and roughly remember that you could get to it by clicking Science -> Biology -> Zoology -> Animals, Insects, and Pets -> Mammals -> Monotremes -> Platypus.
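
As a toy illustration, a portal's directory is just a tree that you walk one click at a time. Here is a sketch, with the category names copied from the hypothetical example above:

```typescript
// A toy model of a portal directory: a tree of categories with page URLs at
// the leaves. The structure mirrors the hypothetical Yahoo path above.
type Directory = { [category: string]: Directory } | string[];

const dir: Directory = {
  Science: {
    Biology: {
      Zoology: {
        "Animals, Insects, and Pets": {
          Mammals: {
            Monotremes: {
              Platypus: ["http://www.myfriendtheplatypus.com"],
            },
          },
        },
      },
    },
  },
};

// Each click in the portal is one step down the tree:
function navigate(node: Directory, path: string[]): Directory | undefined {
  for (const step of path) {
    if (Array.isArray(node)) return undefined; // reached a leaf too early
    const next = node[step];
    if (next === undefined) return undefined;  // no such category
    node = next;
  }
  return node;
}

console.log(navigate(dir, ["Science", "Biology", "Zoology",
  "Animals, Insects, and Pets", "Mammals", "Monotremes", "Platypus"]));
// -> ["http://www.myfriendtheplatypus.com"]
```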

The big advantage of this was of course that you only have to remember these strings receptively, not productively: you are presented with multiple-choice questions rather than one open question. And apart from that, it is interactive: you take multiple smaller steps and can reach your target step by step.

A big disadvantage is of course that it takes one or two minutes to traverse Yahoo's directory until you get to the page that contains the information you need. Also, you have to know that a platypus is a monotreme. In this case, while writing out this example, I actually used a search engine to look that up. ;)

So, search engines then. They became better, and over the decades we arrived at a web where "being on the web" means you have not only done your Domain Name Registration (DNR), but also your Search Engine Optimization (SEO). The similarity between domain names and search terms is emphasized by the fact that several browsers now provide one input field as a combined address bar and search widget.

Bookmarks and shared links

For general information, portals and search engines make the most sense, but the web is also often used for communication that is "socially local": viewing photos that your friends uploaded, for instance, or essays on specific advanced topics that circulate within tiny specialist communities.

For this, links are often shared through channels that are socially local to the recipient and the publisher of the link, and maybe even the publisher of the resource.

These channels can be email messages, mailing lists, people who follow each other on Twitter or are friends on Facebook, etcetera. You can even propagate links to web resources via voice-over-air (in real life, that is).

It is also quite common for (power) users of the web to keep bookmarks as a sort of address book of content they may want to find again in the future.

All these "out of band" methods of propagating links to web resources constitute decentralized alternatives to the portals and search engines as augmentations of DNS.

Link quality, filter bubbles, collaborative indexes and web-of-trust

Portals used to be edited largely by humans, who somehow guarded the quality threshold a web page had to meet to make it into the directory. Search engines, in contrast, are mostly populated by mechanical crawling, where algorithms like PageRank extract quality information from the web's hyperlink graph. With Graph Search, Facebook proposes to offer a search engine that is biased towards the searcher's reported interests and those of their friends.

In the case of bookmarks and shared links, the quality of links is likewise guarded by a human choice: to retweet something, or to forward a certain email. In a way, this principle allows the blogosphere to take over some of the roles of journalism. It is also interesting how this blurs the boundary between objective knowledge and socially local opinions or viewpoints.

If you use Google in Spanish, from Spain, while logged in with your Google account, you will not see the same information that an anonymous searcher in a different country, searching in a different language, may find. This effect has been called the Filter Bubble.

Whether you see the filter bubble as an enhancement of your search experience or as a form of censorship may depend on your personal preference, but in any case it is good to be aware of it, so you can judge the results a search engine gives you a bit more accurately. In real life you would also always take into account the source of the information you consume, and memes generally gain value by travelling (or failing to travel) through the web of trust and through collaborative indexes.

Socially local effects on search results are often implemented in centralized ways. For instance, Amazon was one of the first websites to pioneer the "people who bought this product also bought this other one" suggestions.
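
For illustration, here is a minimal sketch of the co-occurrence counting behind such suggestions. The baskets are made-up data, and a real system would add normalization and scaling tricks on top:

```typescript
// A minimal sketch of "people who bought this also bought": count how often
// other items co-occur with a given item across purchase baskets, then rank.
function alsoBought(baskets: string[][], item: string): string[] {
  const counts = new Map<string, number>();
  for (const basket of baskets) {
    if (!basket.includes(item)) continue;
    for (const other of basket) {
      if (other !== item) {
        counts.set(other, (counts.get(other) ?? 0) + 1);
      }
    }
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])   // most frequently co-bought first
    .map(([name]) => name);
}

const baskets = [
  ["platypus-book", "field-guide"],
  ["platypus-book", "field-guide", "binoculars"],
  ["platypus-book", "binoculars"],
];
console.log(alsoBought(baskets, "platypus-book"));
// -> ["field-guide", "binoculars"] (each co-occurs twice; insertion order breaks the tie)
```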

Don't track us

Offering good suggestions to users from a centralized database requires building up this database first, by tracking people. There are several ways to do this; in order to protect people's privacy, it is important to anonymize the logs of their activity before adding them to the behavioral database. It is also questionable whether such information should be allowed to be federated between unrelated services, since this leads to the build-up of very strong concentrations of power.

Whereas Amazon's suggestion service mainly indexes products that Amazon itself sells to its users, Google and Facebook additionally track what their users do on unrelated websites. When a website includes Google Ads or Google Analytics, information about that website's users is leaked to Google. If a website displays a 'Like' button, it leaks information about its visitors to Facebook.

All this tracking information is not only used to improve the search results, but also to improve the targeting of the advertising that appears on websites. If you read The Google Story, it becomes clear how incredibly lucrative it is to spy on your users. Microsoft even crawls hyperlinks which it obtains by spying on your Skype chat logs. In my opinion it is not inherently evil for a service to spy on its users, as long as the user has the option to choose an alternative that doesn't. A restaurant that serves meat is not forcing anyone to eat meat (unless it sneaks the meat into your food without telling you).

The free technology movement is currently not really able to offer viable alternatives to Google and Facebook. A lot of projects that try come and go, or don't hit the mark, and it sometimes feels like an uphill struggle. But I'm confident that we'll get on top of this again, just like the free-tech alternatives to all parts of the Microsoft stack have eventually matured. Ironically, one of the main nails in the coffin of the Microsoft monopoly, Firefox, was largely supported with money from the Google monopoly. Placing Google as the default search engine in the branded version of Firefox generated so much revenue that it allowed the project to become as successful as it is today.

I don't know which part of Google's revenue comes purely from search-term-related advertising, and which part comes from Google tracking its users, but in any case users of Firefox, and users of the web in general, have the option to use a different search engine. The leading non-tracking search engine currently seems to be DuckDuckGo. Although its results for rare specialist searches are not as good as those of market leader Google, I find that for 95% of the searches I do, it gives me the results I was looking for, which for me is good enough to prefer it over Google. On occasions where you still want to check Google's results, you can add a "!g" bang at the front of the search term, and you will be directed to a Google results page.
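
As a sketch of what the bang does from the user's point of view (this mimics the behaviour described above; it is not DuckDuckGo's actual implementation):

```typescript
// Strip the "!g " bang and build a Google results URL instead of a
// DuckDuckGo one.
function searchUrl(query: string): string {
  if (query.startsWith("!g ")) {
    const rest = query.slice("!g ".length);
    return `https://www.google.com/search?q=${encodeURIComponent(rest)}`;
  }
  return `https://duckduckgo.com/?q=${encodeURIComponent(query)}`;
}

console.log(searchUrl("!g monotreme egg laying"));
// -> https://www.google.com/search?q=monotreme%20egg%20laying
```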

DuckDuckGo doesn't track you, and also does not currently display any advertising. Instead, it gets its revenue from affiliate sales, which seems to me like a good option for financing freedom-respecting products. I would say it's basically comparable to how Mozilla finances Firefox. And another nice feature of DuckDuckGo is of course that they sponsor us and other free software projects. :)

There are also decentralized search engine projects like YaCy, but they have a barrier to use in that you need to install them first, and when I tried YaCy just now, searching for a few random terms showed that it is unfortunately not yet usable as an everyday primary search engine.

Leaking the social graph into the public domain

Locating information on the web is one thing, but what about contacting other web users? When meeting people in real life, you can exchange phone numbers, email addresses, or personal webpage URLs. Most people who have a personal webpage have it on Facebook (their profile page). This has led to a situation where instead of asking "Are you on the web?", many people ask "Are you on Facebook?". There are two reasons for this.

First of all, of course, the Facebook application allows me to send and receive messages and updates in a namespace-locked way: as a Facebook user, I can only communicate with users whose web presence is on Facebook.

The second reason is that Facebook's people search only indexes personal web pages within its own namespace, not outside it. This means that if I use Facebook as a social web application, searching for you on the web will fail, unless your webpage is within the Facebook namespace.

This is of course a shortcoming of the Facebook user search engine, with the effect that people whose personal webpage is outside Facebook are not only harder to communicate with, but even harder to find.

And Facebook is not the only namespace-locked user search engine; there are also LinkedIn, Twitter, Google+, and Skype. Searching for a user is a different game from searching for content. The "social graph", that is, the list of all personal web pages and their inter-relations, is not that big in terms of data size. Let's make a quick calculation: say the full name on a user profile is about 20 characters long on average, with about 5 bits of entropy per character; that would require 100 bits.

Let's estimate the same for the profile URL and the avatar URL. If we want to accommodate 1 billion users, a friend link would occupy about 30 bits (since log2 of 1 billion is roughly 30), and even that is compressible, because the bulk of a user's connections will be local to them in some way. So to store three 100-bit attributes plus 150 connections for each user, we would need about 3*100 + 150*30 = 4800 bits, i.e. 600 bytes, less than one kilobyte. That means the Earth's social graph, including 150 friendships per human, plus a full name, photo, and profile URL for each, would occupy about 1 billion kilobytes, or roughly one terabyte. That easily fits on a commodity hard disk nowadays.
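
Here is that back-of-envelope estimate as code; all the numbers are the rough assumptions stated above, not measurements:

```typescript
// The back-of-envelope estimate above, as code.
const bitsPerName = 20 * 5;        // ~20 characters at ~5 bits of entropy each
const bitsPerProfileUrl = 100;     // same rough estimate as the name
const bitsPerAvatarUrl = 100;      // ...and again for the avatar URL
const bitsPerFriendLink = 30;      // log2(1e9) ≈ 30 bits to address 1 of 1 billion users
const friendsPerUser = 150;

const bitsPerUser = bitsPerName + bitsPerProfileUrl + bitsPerAvatarUrl
  + friendsPerUser * bitsPerFriendLink;
console.log(bitsPerUser);          // 4800 bits, i.e. 600 bytes, under 1 KB

const users = 1e9;
const totalTerabytes = (bitsPerUser / 8) * users / 1e12;
console.log(totalTerabytes);       // 0.6 -- which rounds up to roughly one terabyte
```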

A search engine that indexes information on the web needs to be pretty huge in data size and crawler traffic in order to be useful. And the web's hypertext architecture keeps it more or less open anyway, as far as findability of information is concerned. In a way, a search engine is just a website that links to other websites.

But for user search, the situation is very different. If you are not findable on the web as a user, then you cannot play along; it's as if you are not "on" the web. The current situation is that you have to say which namespace your web presence falls under in order to be findable. So instead of saying "Yes, I am on the web", we end up saying "Yes, I am on Facebook", so that people know that the html page constituting your web presence can be found using that specific namespace-locked user search engine.

I think it is our duty as defenders of free technology to build that one-Terabyte database which indexes all public user profiles on the web.
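
As a sketch, one record in such a database might look like this. The field names are hypothetical; only the three attributes and the friend links from the estimate above are assumed:

```typescript
// A hypothetical record shape for the public social graph database.
interface PublicProfile {
  fullName: string;     // e.g. "Jane Doe"
  profileUrl: string;   // canonical URL of the public profile page
  avatarUrl: string;    // URL of the public profile photo
  friends: string[];    // profile URLs of publicly listed connections
}

const example: PublicProfile = {
  fullName: "Jane Doe",
  profileUrl: "https://social.example/~jane",
  avatarUrl: "https://social.example/~jane/avatar.png",
  friends: ["https://social.example/~joe"],
};
```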

It is important, of course, to make sure only publicly accessible web pages appear in this database. If you happen to know someone's private email address, you shouldn't submit it, because that would only lead to a breach of that person's privacy. They would probably start receiving spam pretty soon as a result.

Since a lot of websites now put even their public information behind a login page, it may be necessary to log in as a random "dummy user" to see a certain page. That is still a public page, though. Examples are my Facebook page and the Quora page I linked to earlier; and although a Twitter profile page is accessible without login, a list of followees is not. Such pages can however be accessed if you create a dummy account for the occasion, so I still classify them as "public content on the web".

Independent of whether people's user profiles appear in one of the big namespaces, or on a user's own Indie Web domain name, we should add them to our database. Constructing this may technically breach the terms of service of some social network hosting companies, but I don't think that they can claim ownership over such user content if that went to court. So once we have this dump of the web's social graph, we can simply seed it on bittorrent, and "leak" it into the public domain.

This is an irreversible project though, so please comment if you think it's a bad idea, or have any other reactions to this episode.

Next: Network neutrality, ubiquitous wifi, and DRM