unhosted web apps

freedom from web 2.0's monopoly platforms

18. Distributed hash tables

Content-addressable data

The simplest type of database is probably the key-value store: it couples a potentially long string (the value) to a potentially short string (the key), so that you can later retrieve the value that was stored, by presenting its key. Generally, it is possible to store any value under any key. But if for a key-value store, the key of each item is fully determined by the value, we say the data is content-addressable: if you know the content (the value), then you know its address (the key).

Content-addressable data stores are efficient when the same value is stored many times in the same store: each duplicate entry will map to the same key, meaning the data is automatically de-duplicated. On the other hand, the limitation that you cannot choose the key at the time of storing some data, means that in practice content-addressable data stores are only useful when combined with a generic key-value store.

Storing something in such a "blobs+index" store means that large data items are first stored as blobs in a content-addressable data store, mapping them onto short strings which are deterministically derived from the data item in question. Usually some sort of hashing function is used for this.

And then the second step is to store the hash of the blob in the "index", where you can give it the variable name under which you will want to find it back later.

A ring of hashes

When storing content-addressable data redundantly on a cluster of servers, you probably want to be able to add and remove servers without taking the cluster down. If you generate the keys from the values using a hashing function that more or less randomly spreads the resulting keys over their namespace, then you can use that property to easily and fairly determine which servers should store which item.

Imagine the namespace of hashes as a circle, where each point on the circle corresponds to a certain item key. Then you can deploy the servers you have available to each serve a section of that circle.

You may introduce redundancy by making the sections overlap. Then, if a node fails, you can redistribute the nodes by moving them gradually, so that they elastically adapt. Likewise, if you want to add more servers, you can insert them at a random position on the circle, and gradually let the new set of servers settle to their optimal distribution.

Moving a server clockwise or anti-clockwise on the ring means forgetting items on one end of its circle section, while at the same time obtaining new ones at the other end. If you make sure that each item is always stored at least twice (by overlapping each section 50% with each of its neighbors), then if one node fails, the nodes that neighbored it can immediately start moving towards each other to get full duplicated coverage of the section again.

This doesn't help of course when two adjacent nodes fail, and if you lose a server, then the load on the remaining servers of course becomes proportionally bigger. But running a cluster like this can be a lot easier for a sysadmin then running, say, a master-slave cluster of database servers.

Erasure coding

Taking this concept to a more generic level, leads to erasure coding, a strategy implemented by for instance tahoe-lafs. Here, each blob is stored n times instead of necessarily exactly two times. Advantages of storing data multiple times include:

disaster recovery, including easy organization of maintenance windows,
easy expansion and reduction of the cluster size in response to throughput and storage size demand,
increased throughput when the traffic is not uniform (for instance, if only one item is being retrieved at a given time, you can use all servers that store it to serve a part of the item's value),
possibility of using low-quality servers, or when encryption is used even semi-trusted servers, thanks to the inherent robustness of the architecture.

Convergent encryption

Encrypted blobs can easily be stored as content-addressable data. Thus, you can send a short string to the recipient of your message, which would then translate to a potentially very long string on the DHT to which both sender and receiver have access.

To set this up you need a trusted server that does the encryption, and a set of semi-trusted servers that do the actual blob storage.

Do keep in mind that if two people both encrypt the same file (say, a specific webm-file, or a popular linux install CD), then they will not generally result in the same blob, and thus the system will not be able to take advantage of de-duplication.

Convergent encryption fixes this. It uses a different encryption key for each blob, and derives that encryption key from the blob's entropy, such that it's not derivable from the item key (the blob's hash, which is used as the item's address)

Bitcasa, the company that was the subject of the TechCrunch scandal a couple of years ago, claimed that it could do both encryption and de-duplication, using convergent encryption as their clever trick. For Mega.co.nz it is believed that they may also be using convergent encryption, although this has apparently not been confirmed by the company itself.

It is important to note, as this last link also explains, that de-duplication exposes some information about the data people are storing. For instance, a copyright holder could upload their own movie to Mega.co.nz, and ask Mega to report which users own a file that deduplicates against that.

So depending on the reasons you had to encrypt data in the first place, convergent encryption may not be what you are looking for. You may need to use an encryption technique that makes de-duplication impossible.

Indexes to the data

As already stated earlier, content-addressable data by itself is in a way useless by definition. Nobody wants to use hashes as variable names. You need an extra layer that translates ultimately human-memorable, or at least human-intelligible labels to the machine-generated hashes that allow looking up the actual full data items in the DHT.

A good example of this is git: it stores blobs, and then trees of commits that reference these blobs. This is also how git does de-duplication, which can be quite important depending on how you branch and move files around in your version control repo.

Zooko's triangle

A naming scheme can be (source: wikipedia):

Decentralized and human-meaningful (this is true of nicknames people choose for themselves),
Secure and human-meaningful (this is the property that domain names and URLs aim for), or
Secure and decentralized (this is a property of OpenPGP key fingerprints)

However, Zooko's triangle states that is impossible to have all three at the same time. A naming scheme cannot be decentralized, human-meaningful, and secure, all at the same time.

Don't confuse the word "distributed" in the abbreviation "DHT" with "decentralized". For instance, bittorrent.com publishes the Mainline DHT in which bittorrent peers can participate voluntarily, but this DHT is fundamentally centralized in the domain name of its publisher.

The fact that DHTs derive item keys from item values means that the "secure" part of Zooko's triangle can at least be verified: if you arrive at the wrong item value for a given item key, then you can at least know that this happened.

At the same time, it means that item keys in a DHT are not human-meaningful (unless the human in question is able to calculate hash functions in their head).

Uniqueness through dominance

There is in practice a way out of the dilemma introduced by Zooko's triangle: since our planet has a finite size, and a finite population, it is possible to populate a vast majority of all instances of a certain system with the same data, thus in practice (though not in theory) removing the need to choose one authorative instance.

DNS, TLS, PGP keyservers, and Bittorrent's Mainline DHT all use this approach. Even if the authorative single point of failure of these systems goes down, the data still "floats around" in enough places.

If the data is additionally verifiable, like DHT data is, then you can in practice make data blobs globally available in a way that becomes independent of the original publisher.

Conclusion

When storing small items of data, you need an actual key-value store. When the items are big, though, you can combine a first-level key-value store with a second-level content-addressable store, such that the first forms an index to the latter. Doing this also opens up opportunities to use semi-trusted servers on the second level. The first-level index can also be stored as a blob on the second level, meaning your client will only need to remember one hash string - the hash of the index.

Truly decentralized storage of generic key-value data (with human-memorable item names) is impossible, but decentralized storage of content-addressable data can be achieved "in practice", by dominating the finite set of Earthly servers that claim to participate in your DHT, and if clients verify that the values they obtain actually hash back to the item keys they requested.

As always, comments welcome!

Next: BGP, IP, DNS, HTTP, TLS, and NAT