Archiving for Data-Immortality
If you’ve been browsing the web for a number of years, you’ve probably noticed that there are lots of dead pages and that lots of data has vanished. It is surprising how transient data is on the Internet. Some people create the impression that once data is posted online it will be there for eternity.
Eric Schmidt or others may say the Internet holds a “complete record” of everything, or “the Internet never forgets,” or “our past follows us like never before,” but from my viewpoint the Internet is positively forgetful. It suffers from severe memory loss, so I must make a very deliberate effort to back up online data.
Yes, there are occasions where embarrassing things are remembered:
“Four years ago, Stacy Snyder, then a 25-year-old teacher in training at Conestoga Valley High School in Lancaster, Pa., posted a photo on her MySpace page that showed her at a party wearing a pirate hat…”
From my viewpoint there are many more occasions where data is irretrievably lost. There is the Wayback Machine, but too often it has not archived the page I’m searching for, so it irritatingly says: “Hrm. Wayback Machine doesn’t have that page archived.”
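For anyone who wants to check for a snapshot without visiting the site by hand, here is a minimal sketch in Python. It assumes the Wayback Machine's public availability endpoint at archive.org/wayback/available and the JSON shape it has returned in my experience (an "archived_snapshots" object with an optional "closest" entry). The function names are my own, purely for illustration, and the endpoint or its response could change.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url: str) -> str:
    """Build the lookup URL for a page we hope has been archived."""
    return f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': page_url})}"


def closest_snapshot(response: dict) -> Optional[str]:
    """Return the snapshot URL from a decoded reply, or None if there is none."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_snapshot(page_url: str, timeout: float = 10.0) -> Optional[str]:
    """Ask the endpoint over the network; None means no archived copy was reported."""
    with urlopen(availability_query(page_url), timeout=timeout) as reply:
        return closest_snapshot(json.load(reply))
```

Calling find_snapshot with a dead URL either returns the address of an archived copy or None, which is the programmatic equivalent of the "doesn't have that page archived" message.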
There are various reasons for data-mortality. People can delete potentially controversial Tweets. A site may die due to lack of funding or lack of popularity, as happened with Posterous, Stickam, and Bo.lt.
Content can also be unavailable due to expiration of copyright, which I recently encountered regarding a Guardian article by Ben Goldacre.
Webmasters, when moving sites to new servers, can simply lose pages, or sites may change their method of creating URLs but fail to redirect the old URLs. I’ve noticed URLs for Yahoo News tend to have a very short shelf life, perhaps no longer than one year, after which the page will likely be dead and in many cases lost forever.
A good example of a dead URL is http://www.hhs.gov/reference/newfuture.shtml, which was titled “2020: A New Vision – A Future for Regenerative Medicine.” I am unsure why HHS removed the page without archiving it or creating a new location for the information; maybe they feared they couldn’t achieve the 2020 target. Thankfully I managed to retrieve that HHS page via a Google cache before it was lost. I then reformatted the page into a self-contained HTML document, converting the images to data URIs. An archived version of the aforementioned HHS page is available to download here, here, or here. Please feel free to upload that archived HHS page to your own server and then archive it via archive.is.
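If you want to make a saved page self-contained in the same way, the image-to-data-URI step can be sketched in Python as below. This is only an illustration, not the exact process I used: the function names are hypothetical, it assumes the images were saved next to the HTML file, and it uses a simple regular expression rather than a real HTML parser, so unusual markup may slip past it.

```python
import base64
import mimetypes
import re
from pathlib import Path


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Matches the quoted src attribute of an <img> tag (a simplification).
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=)(["\'])(.*?)\2', re.IGNORECASE | re.DOTALL)


def inline_images(html: str, base_dir: Path) -> str:
    """Replace local image references with data URIs; leave everything else alone."""

    def replace(match):
        prefix, quote, src = match.groups()
        # Skip images that are already inlined or that live on a remote server.
        if src.startswith(("data:", "http:", "https:", "//")):
            return match.group(0)
        path = base_dir / src
        if not path.is_file():
            return match.group(0)
        mime_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        return f"{prefix}{quote}{to_data_uri(path.read_bytes(), mime_type)}{quote}"

    return IMG_SRC.sub(replace, html)
```

Running inline_images over the saved HTML, with base_dir pointing at the folder that holds the images, gives a single file that displays correctly wherever it is uploaded. Base64 makes each image roughly a third larger, which is a fair price for a page that cannot lose its pictures.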
It is not certain that archiving old web pages via archive.is will ensure data immortality, but if more people use archive.is the site will become more popular, which makes its long-term reliability more likely. I can’t archive every page, so it’s also good to have more people creating archives. Then if I find a dead page, or a page where the content has been removed due to copyright expiration, I can search to see if anyone made an archive before it died.
In August 2013 the owner of archive.is was asked how long archived pages would be available. The reply was “forever,” which is good news for data immortality:
“Actually, I think, in 3-5-10 years all the content of the archive (it is only ~20Tb) could fit in a mobile phone memory, so anyone will be able to have a synchronized copy of the full archive. So my “forever” is not a joke.”