unhosted web apps

freedom from web 2.0's monopoly platforms

27. Persisting data in browser storage

Today we'll look at storing and caching data. As we said in the definition of an unhosted web app, it does not send the user's data to its server (the one from which its source code was served). It stores data either locally inside the browser, or at the user's own server.

The easiest way to store data in an unhosted web app is in the DOM, or in JavaScript variables. However, this technique has one big dangerous enemy: the page refresh.

Refreshing a page in a hosted web app usually means the data that is displayed on the page, is retrieved again from the server. But while using an unhosted web app, no user data is stored on the server, so unless the app takes care to store it locally or at the user's server, if the user refreshes the page, all data that is stored in JavaScript variables, is removed from memory. This does not happen when pressing the "back" or "forward" button, although in Chrome (not in Firefox), pressing "back", "refresh", "forward" does flush the memory, also for the forward page.

It's possible to warn the user when they are about to leave the page, using the beforeunload event. Also, the current URL can be used as a tiny data storage which is page refresh resistant, and it is often used for storing the current state of menu and dialog navigation. However, in-memory data as well as the current URL will still be lost when the user shuts down their computer, so for any app that manages important data, it's necessary to save the data in something more persistent than DOM or JavaScript memory.

Places that cache remote data

Apart from surviving the page refresh, and the device reboot, another reason to store data locally, is to cache, so that a copy of remote data is available faster, more reliably, and more efficiently. Http is built to allow transparent caching at any point. Your ISP is probably caching popular web pages to avoid retrieving them all the time. When your browser sees a "304 Not Modified" response to a conditional GET request, this may come from the cache at the ISP level, or at any other point along the way.

However, when syncing user data over the web, you would usually be using a protocol that is end-to-end encrypted at the transport layer, like https, spdy or wss, so between the point where the data leaves the server or server farm, and the point where the browser receives it, the data is unrecognizable and no caching is possible.

Your browser also caches the pages and documents it retrieves, so it's even possible that an XHR request looks like it got a 304 reponse code, but really the request didn't even take place. The browser just fulfilled the XHR request with a document from cache.

Then there are ServiceWorkers and their predecessor, AppCache, which allow for yet another layer of client-side caching.

In your application code, you probably also cache data that comes from XHR requests in memory as JavaScript objects, which you then may also save to IndexedDB or another in-browser data store.

And finally, the data you see on the screen is stored in the DOM, so when your app needs to re-access data it retrieved and displayed earlier, it could also just query the DOM to see what is there.

So even with transport layer encryption, if all these caching layers are active, then one piece of data may be stored (1) on-disk server-side, (2) in database server memory or memcache server-side, (3) possibly in process memory server-side, (4) at a reverse proxy server-side, (5) in the browser's http cache, (6) in appcache, (7) maybe in a ServiceWorker cache, (8) in xhr.response, (9) in memory client-side, (10) in the DOM. For bigger amounts of data it can obviously be inefficient to store the same piece of data so many times, but for smaller pieces, caching can lead to a big speedup. It's definitely advisable to use one or two layers of client-side caching - for instance IndexedDB, combined with DOM.

Persisting client-side data

Whether you're caching remote data or storing data only locally, you'll soon end up using one of the browser's persistent stores. Do not ever use localStorage, because it unnecessarily blocks the browser's execution thread. If you're looking for an easy way to store a tiny bit of data, then use localForage instead.

There's also a FileSystem API, but it was never adopted by any browser other than Chrome, so that's not (yet) practical. That brings us to IndexedDB.

When working directly with IndexedDB, the performance may not be what you had hoped for, especially when storing thousands of small items. The performance of IndexedDB relies heavily on batching writes and reads into transactions that get committed at once. IndexedDB transactions get committed automatically when the last callback from an operation has finished.

When you don't care about performance, because maybe you only store one item every 10 seconds, then this works fine. It's also useless in this case; transactions only matter when you update multiple pieces of data together. Typically, you would edit an item from a collection, and then also update the index of this collection, or for instance a counter of how many items the collection contains. In old-fashioned computer science, data in a database is assumed to be consistent with itself, so transactions are necessary.

This way of designing databases was largely abandoned in the NoSQL revolution, about 5 years ago, but IndexedDB still uses it. To minimize the impact on the brain of the web developer, transactions are committed magically, and largely invisibly to the developer. As soon as you start storing more than a thousand items in IndexedDB, and reading and writing them often, this goes horribly wrong.

Fixing IndexedDB by adding a commit cache

In my case, I found this out when I decided to store all 20,000 messages from my email inbox in an unhosted web app. The data representation I had chosen stores each message by messageID in a tree structure, and additionally maintains a linked-list index for messages from each sender, and a chronological tree index, meaning there are about 80,000 items in my IndexedDB database (probably about 100Mb).

Storing a thousand items one-by-one in IndexedDB takes about 60 seconds, depending on your browser and system. It is clear that this makes IndexedDB unusable, unless you batch your writes together. In version 0.10 of remotestorage.js we implemented a "commit cache", which works as follows:

If no other write request is running, we pass the new write request straight to IndexedDB.
If another write request is still running, then we don't make the request to IndexedDB yet. Instead, we queue it up.
Whenever a write request finishes, we check if any other requests are queued, and if so, we batch them up into one new request.
There is always at most one IndexedDB write request running at a time.
Read requests are checked against the commit cache first, and the new value from a pending write is served if present.
Read requests that don't match any of the pending writes, are passed straight to IndexedDB, without any queueing or batching.

This speeded up the process of storing lots of small items into IndexedDB by about an order of magnitude (your mileage may vary). You can see the actual implementation of our commit cache on github.

27. Persisting data in browser storage

Places that cache remote data

Persisting client-side data

Fixing IndexedDB by adding a commit cache

Other problems when working with IndexedDB