Storing data durably

Digital storage is temperamental. Drives fail, data becomes corrupt, and technologies go in and out of vogue in very short time periods. How are we to ensure that the data we make today is still accessible and usable in the future?

The obvious answer is that works of long-term significance, which need to be archived permanently for future generations, should still be entrusted to paper rather than to a digital medium. But there are obvious disadvantages to this approach: paper is expensive, for one. Paper also limits expression by not allowing interactivity, one of the main attractions of digital publishing in the first place.

So let’s say you’re producing an important interactive work that people will want to reference and access, in its exact original format, in 200 years’ time. How should you produce and store your work? Solving the problem will involve a layered approach: every choice involved in storing the data needs to be considered carefully with long-term accessibility in mind.

Factors to consider

Implementation diversity

Anything which is only available in one implementation, from one manufacturer, or for a limited number of platforms is suspect. For instance, Microsoft Word documents would be an extremely poor choice for long-term data storage. PDFs, on the other hand, would be a far better choice. (See below for specific recommendations on data format choice, however.)

Durability

Physical media should have a long lifespan; any data formats chosen should be able to detect and recover from corruption. By building in checks at every layer, we leave open many options for repair and recovery should the data become corrupt at any time in the future.
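
One simple check at the data layer, for example, is a checksum manifest stored alongside the files themselves. The sketch below is a minimal illustration in Python, assuming a hypothetical archive/ directory; it records a SHA-256 digest for every file, so that corruption can at least be detected later by recomputing and comparing the digests, whatever medium the files end up on.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Record a digest for every file in the (hypothetical) archive directory.
# Future readers can detect corruption by recomputing and comparing.
archive_dir = Path("archive")
with open("MANIFEST.sha256", "w") as manifest:
    for file in sorted(archive_dir.rglob("*")):
        if file.is_file():
            manifest.write(f"{sha256_of(file)}  {file}\n")
```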

Portability

Any media chosen for distribution of the work should work on as many platforms as possible. Things which work on multiple platforms today are likely to still work on multiple platforms in the future.

Track record

Popular computing is only about thirty years old, and yet we’re considering how to store works safely for hundreds of years. With timescales so mismatched, how durable individual technologies have proved so far matters less than it might seem, but a technology’s track record is still an important indicator of how likely it is to survive in the long term.

Layers

Physical media format

Twenty years ago, almost any digital work would have been distributed on a 3½″ floppy diskette. No computer made today has a floppy drive, and even if it did, data written to floppies back then may well have become unreadable over the intervening 20 years. And despite their ubiquity in the early heyday of personal computing, floppy disks came in three sizes (8″, 5¼″, 3½″), each of which came in numerous, often incompatible capacities.

So just because a format is ubiquitous does not make it a good choice for long-term storage.

The current best technology available for long-term data storage seems to be optical media. Estimates vary wildly, but Blu-ray discs are expected to last 100–150 years. (By contrast, DVDs, which used a different burning technology, are generally estimated to last a maximum of 25 years.) There is also a newer technology called M-Disc, which consists of a standard-format DVD or Blu-ray disc that can be read by ordinary DVD and Blu-ray drives, though it needs a special burner to write. Because it is made of more durable materials, it is claimed to last 1,000 years, though naturally such claims should be taken with a pinch of salt.

The track record of obsolescence of optical media is also good: so far, the drives for each new optical media format have been able to read all the previous formats just fine. Blu-ray drives can read DVDs and CDs, just as DVD drives could read CDs. Whether this will continue forever is uncertain. (As, arguably, is the future of optical media itself, though as it remains a popular choice for backups, I doubt it will ever be as difficult to read as old floppy disks are today.)

The ideal medium for long-term, reasonably affordable data storage without human intervention seems to be acid-free archival-quality paper (or, perhaps more ideally, vellum) stored in an extremely dry location such as a salt mine. There are known good ways of encoding digital data onto paper in extremely resilient forms, such as QR codes. They aren’t practical on their own because of their low data density, but they show how error-correction codes can be used effectively on paper to ensure that even if part of the data is made unreadable, the information is still recoverable. Printing two-dimensional black-and-white codes onto long rolls of paper would allow scanners to recover the data later.

This solution is extremely slow and expensive, and not many people are likely to be able to build the right equipment to read the data, so it’s better suited as a last-resort recovery option: when the last optical disc containing your original work becomes unreadable, the work can still be recovered by scanning the paper roll that carries it. From there you can produce more discs, or whatever physical medium is popular and durable in the future.
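
To make the idea concrete, here is a rough sketch of how data could be turned into printable codes. It assumes the third-party Python qrcode library and a hypothetical work.tar archive of the work; the chunk size and file names are illustrative only.

```python
import qrcode
from qrcode.constants import ERROR_CORRECT_H

CHUNK_SIZE = 1000  # bytes per code; high error correction limits capacity

with open("work.tar", "rb") as f:  # hypothetical archive of the work
    data = f.read()

for i in range(0, len(data), CHUNK_SIZE):
    # ERROR_CORRECT_H lets roughly 30% of each code be damaged and still recovered.
    qr = qrcode.QRCode(error_correction=ERROR_CORRECT_H)
    qr.add_data(data[i:i + CHUNK_SIZE], optimize=0)  # store the raw bytes as one binary segment
    qr.make(fit=True)
    qr.make_image().save(f"code_{i // CHUNK_SIZE:05d}.png")  # one image per printed code
```

Each resulting image would be printed onto the roll; recovery is simply the reverse process of scanning and decoding the codes in order.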

Filesystem

The filesystem chosen must be able to detect data corruption and repair it where possible. How widely it is supported today matters relatively little, but it should be almost certain that future computers will have at least basic read support for it.

If optical media is being used, a natural choice is ISO 9660, the compact-disc filesystem, which is an open standard with multiple implementations and is supported by practically every operating system today. It has severe technical limitations, but even this may not be such a bad thing: at some point the work might have to be transferred to another filesystem with similar limitations, and a work that already fits within them will survive that move intact.
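
Mastering an ISO 9660 image from a directory is straightforward. The sketch below shells out to the common genisoimage tool (mkisofs and xorriso accept similar options); the paths and volume label are illustrative assumptions.

```python
import subprocess

# Build an ISO 9660 image of the work, with Rock Ridge and Joliet
# extensions so that long file names survive on Unix and Windows alike.
subprocess.run(
    [
        "genisoimage",
        "-o", "work.iso",      # output image
        "-r",                  # Rock Ridge extensions (POSIX names and metadata)
        "-J",                  # Joliet extensions (Windows-friendly names)
        "-V", "ARCHIVE_2015",  # volume label
        "work/",               # directory containing the finished work
    ],
    check=True,
)
```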

Unfortunately, at this point in history, the filesystems that make the strongest promises about durability are also the ones with the least implementation diversity. ZFS and btrfs represent the state of the art, but each effectively has only one implementation, and each runs on only a limited set of platforms. Meanwhile, the most widely implemented filesystems, such as ISO 9660 and FAT, offer no protection against data corruption at all.

To provide data corruption detection regardless of the filesystem, or simply for extra safety if a portable, durable filesystem does one day become available, the work could be distributed as a Git repository. Git is very good at detecting storage errors, though recovering from repository corruption is something of a black art. (Using Git doesn’t mean you have to publish the entire history of your work’s development; just make a new repository with a single commit, and Git will keep content hashes of everything.)
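
As a minimal sketch of that approach, assuming Git is installed and the finished work lives in a hypothetical work/ directory:

```python
import subprocess

def git(*args: str) -> None:
    """Run a git command inside the work directory."""
    subprocess.run(["git", *args], cwd="work", check=True)

git("init")
git("add", "-A")
git("commit", "-m", "Archival snapshot of the finished work")

# Every file's content is now covered by Git's internal hashes;
# re-running this check later will report any corrupted objects.
git("fsck", "--full", "--strict")
```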

Data formats

The data format must be open, exist in multiple compatible implementations, and be as certain as possible to remain readable in the future.

HTML is a good choice here. Thanks to a dogged dedication to backwards compatibility, even the very first web pages ever made still render correctly in modern web browsers. The sheer amount of information on today’s web, and the orders of magnitude more captured for history by the Wayback Machine, ensures that, for better or worse, HTML will probably always be supported as a data format. Even GeoCities pages made in the 1990s, using HTML features that were never standardized and, at the time, were usually available in only one browser, still generally work today much as their authors intended 20 years ago.

HTML even has a way to make interactive systems work portably on all platforms: JavaScript, for all its flaws, is also likely to be supported forever in connection with HTML.

There are cautionary tales here, however. Many old documents were written in one dialect or another of SGML; HTML itself was originally an SGML-based technology. But in the late 1990s XML was developed and largely replaced SGML, mainly because it was simpler to work with programmatically. SGML documents can generally be converted to XML, but not always, because SGML had some features with no equivalent in XML.

There was no way for the authors of old SGML documents to know that XML was going to come along, and so no way for them to know that they should avoid those SGML features if they wanted their data to last. Likewise, there is no way for us to know which HTML and JavaScript features will still be supported in the future.

There is, unfortunately, no way around this except general conservatism in what you choose to use in your long-term work. Using every possible feature of the web platform is unwise, since some of them may simply fall by the wayside of history. But many web features have now been around long enough, and (more crucially) are used across a wide enough variety of web pages, that they are likely to stick around.

Since it may be necessary to re-encode data into new file formats in the future, any media included in the work must be encoded losslessly. Photos should be stored as PNGs rather than JPEGs, for instance, and sound as FLAC or ALAC rather than MP3, AAC, or Ogg Vorbis. The reason is that transcoding from one lossy format (say, one that’s popular today) to another lossy format (one popular in 100 years) always discards more quality. After five or ten such transcodings the sound might be unlistenable and the pictures blurry and blotchy. Lossless formats are also generally easier to re-implement from scratch if no working implementation is available in the future.
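
For instance, master copies might be prepared along these lines. This is a sketch assuming the Pillow imaging library and the flac command-line encoder; the file names are hypothetical.

```python
import subprocess
from PIL import Image

# Images: store the master as PNG, which compresses without discarding data.
Image.open("figure-01.tiff").save("figure-01.png", optimize=True)

# Audio: encode the uncompressed master to FLAC at maximum compression.
subprocess.run(["flac", "--best", "-o", "narration.flac", "narration.wav"], check=True)
```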

Unfortunately video tends to be very large indeed when stored losslessly, so if you want to include video there might be no choice but to encode it lossily. When choosing formats, don’t worry much about patents — they expire in 20 years anyway — and focus more on implementation diversity. Most popular video formats today have multiple implementations.
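
As one possible approach, here is a sketch assuming ffmpeg and illustrative file names: encode the video with H.264, for which several independent decoders already exist, at a high-quality setting, and keep the audio track lossless.

```python
import subprocess

subprocess.run(
    [
        "ffmpeg",
        "-i", "master.mov",   # the original, highest-quality master
        "-c:v", "libx264",    # H.264: widely implemented, many independent decoders
        "-crf", "18",         # near-transparent quality
        "-c:a", "flac",       # keep the audio lossless even if the video cannot be
        "archive.mkv",
    ],
    check=True,
)
```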

Proposed solution

The solution proposed below is the best available for works published today (in 2015). In a few years, the best technologies for long-term storage may well have changed again. If I had to produce a long-term archivable digital document today, however, I would make sure that it was: