unhosted web apps

freedom from web 2.0's monopoly platforms

28. Synchronizing browser storage with server storage

When you use multiple devices, it's practical to have your data automatically copied from one device to another. Also, when you want to publish data on your web site to share it with other people, it's necessary to copy it from your device to your server.

Going one step further, you can use a setup in which the server, as well as each device, is considered to contain a version of essentially one and the same data set, instead of each holding its own independent one. In this case, copying data from device to server or from server to device always amounts to bringing the data set version at the target location up to date with the data set version at the source location.

When this happens continuously in both directions, the locations stay up-to-date with each other; the version of the data set each contains is always either the same, or in the process of becoming the same; the timing at which the data changes to a new version is almost the same at both locations. This is where the term "synchronization" (making events in multiple systems happen synchronously, i.e. at the same time) comes from.

In practice, there can be significant delays in updating the data version in all locations, especially if some of the synchronized devices are switched off or offline. Also, there can be temporary forks which get merged in again later, meaning the data set doesn't even necessarily go through the same linear sequence of versions at each location. We still call this coordination "sync", even though what we mean by it is more the coordination of the eventual version than of the timing at which the version transitions take place.

Star-shaped versus mesh-shaped sync

In the remoteStorage protocol there is only one server, and an unlimited number of clients. Versions in the form of ETags are generated on the server, and the server also offers a way of preventing read-then-write glitches with conditional writes that will fail if the data changed since a version you specify in the If-Match condition.

This means a client keeps track of which version it thinks is on the server, and sends a conditional request to update it whenever its own data set has been changed. The client can also use conditional GETs that make sure it receives newer versions from the server whenever they exist.
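As a rough illustration at the HTTP level, a conditional write could look like the sketch below; the URL, bearer token and document body are placeholders, and the If-Match and ETag headers are standard HTTP conditional-request machinery. Conditional GETs work analogously with If-None-Match (see the caching section below).

    // Sketch of a conditional write to a remoteStorage server. `url`,
    // `token`, `knownETag` and `body` are placeholders. The write only
    // succeeds if the server still holds the version we think it holds.
    async function pushDocument(url, token, knownETag, body) {
      const response = await fetch(url, {
        method: 'PUT',
        headers: {
          'Authorization': 'Bearer ' + token,
          'Content-Type': 'application/json',
          'If-Match': knownETag // fail if the server version moved on
        },
        body: JSON.stringify(body)
      });
      if (response.status === 412) {
        // precondition failed: the server has a newer version; remember
        // its ETag and schedule a GET to retrieve that version
        return { conflict: true, remoteETag: response.headers.get('ETag') };
      }
      // success: the response carries the ETag of the newly stored version
      return { conflict: false, newETag: response.headers.get('ETag') };
    }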

Essentially, the server is playing a passive role in the sense that it never initiates a data transfer. It is always the client that takes the initiative. The server always keeps only the latest version of the data, and this is the authoritative "current" version of the data set. It is up to each client to stay in sync with the server. The client can be said to keep a two-dimensional vector clock, with one dimension being its local version, and the other one being the latest server version that this client is aware of.
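As a purely illustrative sketch (these field names are not part of the protocol or of remotestorage.js), the bookkeeping a client keeps for each document then looks something like this:

    // Illustrative per-document bookkeeping for star-shaped sync: one
    // dimension for the local version, one for the last server version
    // this client has seen.
    const localNode = {
      path: '/myapp/shopping-list',
      body: { items: ['bread', 'milk'] },
      localVersion: 7,                  // bumped on every local edit
      lastSeenServerETag: '"a1b2c3"',   // what we think the server has
      outgoingChange: true              // local edits not yet pushed out
    };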

In mesh-shaped sync, each node plays an equivalent role, and no particular node is the server or master copy. Each node will typically keep a vector clock representing which versions it thinks various other nodes are at. See Dominic Tarr's ScuttleButt for an example.
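By way of contrast, a node in a mesh could keep a clock like the one sketched below and compare it with a peer's to decide which feeds to request (purely illustrative; this is not Scuttlebutt's actual wire format):

    // Illustrative vector clock for mesh-shaped sync: for each node id,
    // the highest sequence number of that node's feed we have seen.
    const myClock    = { alice: 12, bob: 7, carol: 3 };
    const theirClock = { alice: 10, bob: 9, carol: 3 };

    // Determine which feeds the peer has newer entries for than we do:
    function feedsToRequest(mine, theirs) {
      return Object.keys(theirs).filter(function (id) {
        return (mine[id] || 0) < theirs[id];
      });
    }

    console.log(feedsToRequest(myClock, theirClock)); // -> ['bob']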

Tree-based versus journal-based sync

In CouchDB, each node keeps a log (journal) of which changes the data set went through as a whole, and other nodes can subscribe to this stream. The destination node can poll the source node for new changes periodically, or the source node can pro-actively contact the destination node to report a new entry in the journal. This works particularly well for master-slave replication, where the source node is authoritative, and where both nodes keep a copy of the whole data set.
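For example, the destination node could poll a CouchDB changes feed roughly like this (the database URL is a placeholder and authentication is omitted):

    // Poll a CouchDB changes feed, starting from the last sequence
    // number we processed; `dbUrl` points at the source node's database.
    async function pollChanges(dbUrl, since) {
      const response = await fetch(dbUrl + '/_changes?since=' + encodeURIComponent(since));
      const feed = await response.json();
      feed.results.forEach(function (entry) {
        console.log('document changed:', entry.id, entry.changes);
      });
      return feed.last_seq; // starting point for the next poll
    }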

When some parts of the data set are more important than others, it can help to use tree-based sync: give the data set a tree structure, and you can sync any subtree without worrying about other parts of the tree at that time.

It's also possible to simultaneously copy one part of the tree from node A to node B, while copying a different part from B to A. Tree-based sync is also more efficient when a long chain of changes has only a small effect - for instance because one value gets changed back and forth many times. Tree-based sync only compares subtrees between the two nodes, and narrows down efficiently where the two data sets currently differ, regardless of how both datasets got into that state historically.
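A sketch of that comparison step, assuming each node can report a version tag for any path and list the children of any folder (getVersion and listChildren are hypothetical helpers, not an existing API):

    // Illustrative recursive comparison for tree-based sync: descend only
    // into subtrees whose version tags differ between the two nodes.
    async function findDifferingDocuments(pathToCompare, nodeA, nodeB) {
      const differing = [];
      async function descend(path) {
        const [versionA, versionB] = await Promise.all([
          nodeA.getVersion(path),
          nodeB.getVersion(path)
        ]);
        if (versionA === versionB) {
          return; // identical subtrees can be skipped entirely
        }
        if (!path.endsWith('/')) {
          differing.push(path); // a document whose versions differ
          return;
        }
        // a folder: compare the union of children on both sides
        const children = new Set([
          ...(await nodeA.listChildren(path)),
          ...(await nodeB.listChildren(path))
        ]);
        for (const child of children) {
          await descend(path + child);
        }
      }
      await descend(pathToCompare);
      return differing;
    }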

The remoteStorage protocol offers folder listings, which allow a client to quickly check whether the ETag of any of the documents or subfolders in a certain folder tree has changed, by querying only the root folder (a GET request to a URL that ends in a slash). This principle is used even more elegantly in Merkle trees, where the version hash of each subtree can be computed deterministically by any client, without relying on one authoritative server to apply tags to subtree versions.
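To sketch the Merkle-tree idea: the version of a folder can be derived purely from the names and versions of its children, for instance by hashing them together, so every node arrives at the same subtree versions independently. This is a toy construction using Node's crypto module, not what remoteStorage servers actually do:

    // Toy Merkle-style folder hash: a folder's version is a hash over
    // the sorted names and hashes of its children, so a change anywhere
    // in the subtree changes the hash of every folder above it.
    const crypto = require('crypto');

    function hashDocument(body) {
      return crypto.createHash('sha256').update(body).digest('hex');
    }

    function hashFolder(childHashes) {
      // childHashes maps child names (e.g. 'doc.txt', 'subfolder/') to hashes
      const canonical = Object.keys(childHashes).sort()
        .map(function (name) { return name + ':' + childHashes[name]; })
        .join('\n');
      return crypto.createHash('sha256').update(canonical).digest('hex');
    }

    const docHash = hashDocument('hello world');
    const folderHash = hashFolder({ 'greeting.txt': docHash });
    console.log(folderHash); // changes whenever any document below it changes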

Asynchronous synchronization and keep/revert events

When all nodes are constantly online, it's possible to synchronize all changes in real-time. This avoids the possibility of forks and conflicts occurring. But often, the network is slow or unreliable, and it's better for a node to get on with what it's doing, even though not all changes have been pushed out or received yet. I call this asynchronous synchronization, because the synchronization process is not synchronized with the local process that is interacting with the data set.

When remote changes happen while local changes are pending to be pushed out, conflicts may occur. When this happens in the remoteStorage protocol, the precondition of the PUT or DELETE request will fail, and the client is informed of what the ETag of the current remote version is. The client saves this in its vector clock representation, and schedules a GET for the retrieval of the remote version.

Since the app will already have moved on, taking its local version (with the unpushed changes) as its truth, the app code needs to be informed that a change happened remotely. The way we signal this in the 0.10 API of remotestorage.js is as an incoming change event with origin 'conflict'.

The app code may treat all incoming changes the same, simply updating the view to represent the new situation. In this case, it will look to the user like they first successfully changed the data set from version 1 to version 2, and that it was then changed from version 2 to version 3 from another device.

The app may also give extra information to the user, saying "Watch out! Your version 2 was overwritten; click here to revert, or here to keep the change to version 3". This is why I call these events "keep/revert" events: you simply keep the change to let the server version win, or revert it (with a new outgoing change, going from version 3 to version 2) to let the client version win.
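In remotestorage.js terms, a keep/revert handler could look roughly like the sketch below. It assumes a BaseClient in the variable client, app-defined renderDocument and askUser functions, and a declared 'note' type; the exact event field names (relativePath, oldValue, newValue) should be checked against the 0.10 documentation.

    // Sketch of a keep/revert handler. For origin 'conflict', the event
    // is assumed to carry the local value that was overruled (oldValue)
    // and the remote value that currently stands (newValue).
    client.on('change', function (event) {
      if (event.origin !== 'conflict') {
        // local, window or remote change: just update the view
        renderDocument(event.relativePath, event.newValue);
        return;
      }
      // conflict: show the server version, which has won for now
      renderDocument(event.relativePath, event.newValue);
      askUser('Your version was overwritten. Revert to it?', function (revert) {
        if (revert) {
          // writing the old local value back creates a new outgoing change,
          // so the client version wins after all
          client.storeObject('note', event.relativePath, event.oldValue);
        }
        // keeping the change requires no action: the server version stays
      });
    });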

Of course, from the server's point of view, the sequence of events is different: it went from version 1 to version 3, and then a failed attempt came from the client to change to version 2.

Caching strategies

Since the remoteStorage protocol was designed to support tree-based sync, whenever a change happens in a subtree, the ETag of the root of that subtree also changes. The folder description which a client receives when doing a GET to a folder URL also conveniently contains all the ETags of subfolders and documents in the folder. That way, a client only needs to poll the root of a subtree to find out if any of potentially thousands of documents in that subtree have changed or not.
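Concretely, a client can poll just the root of the subtree it cares about and only descend further when that ETag has changed. In the sketch below, the folder URL and token are placeholders, and the exact shape of the folder description (a plain name-to-ETag map in older spec versions, an "items" member in newer ones) is treated as an assumption:

    // Poll one folder of a remoteStorage tree. A 304 means nothing in
    // this subtree changed; otherwise the listing shows which children
    // (documents or subfolders) have new ETags and need a closer look.
    async function pollFolder(folderUrl, token, knownFolderETag) {
      const response = await fetch(folderUrl, {
        headers: {
          'Authorization': 'Bearer ' + token,
          'If-None-Match': knownFolderETag
        }
      });
      if (response.status === 304) {
        return null; // the whole subtree is unchanged
      }
      const listing = await response.json();
      return {
        etag: response.headers.get('ETag'),
        items: listing.items || listing // name -> ETag map of the children
      };
    }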

In remotestorage.js we support three caching strategies: FLUSH, SEEN, and ALL. FLUSH means a node is forgotten (deleted from local storage) as soon as it is in sync with the server. Only nodes with unpushed outgoing changes are stored locally.

In the SEEN strategy (the default), all nodes that have passed through the library (that is, everything the app has read or written at least once) will be cached locally, and the library will keep that copy up-to-date until the user disconnects their remote storage.

In the ALL strategy, remotestorage.js will pro-actively retrieve all nodes, and keep them in sync. Caching strategies can be set per subtree, and the strategy set on the smallest containing subtree of a node is the one that counts.
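In code, picking a strategy per subtree could look roughly like this; the caching.set call follows the interface documented for later remotestorage.js versions, so treat the exact method name as an assumption for 0.10:

    // Sketch: choose caching strategies per subtree. For any document,
    // the strategy set on its smallest containing subtree is the one
    // that applies.
    remoteStorage.caching.set('/myapp/', 'SEEN');          // the default, made explicit
    remoteStorage.caching.set('/myapp/outbox/', 'FLUSH');  // drop local copies once pushed
    remoteStorage.caching.set('/myapp/settings/', 'ALL');  // pre-fetch and keep everything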

I realize this episode was pretty dense, with lots of complicated information in a short text! Sync and caching are complex problems, about which many books have been written. If you plan to write offline-first apps professionally, then it's a good idea to read and learn more about how sync works, and how conflict resolution affects your application. Next week we'll talk about other aspects of offline-first web apps. So stay tuned for that!

As always, comments are very welcome, either in person tomorrow at Decentralize Paris, or through the Unhosted Web Apps mailing list.

Next: Offline-first web app design