Nothing lives forever, and researchers have confirmed that web pages are no exception. They pop into existence at one moment in time and have a habit of disappearing with an abrupt “404 not found” at an unknown point in the future.
The rate at which this happens has a name: “digital decay”, or “link rot”. And thanks to an analysis by the Pew Research Center, When Online Content Disappears, we can even put some numbers on the phenomenon.
Looking at a random sample of web pages that existed in 2013, the researchers found that by 2023, 38% had disappeared. If it doesn’t sound surprising that nearly four in ten web pages from 2013 had vanished a decade later, consider that the researchers ran the same analysis on pages created in 2023 itself and found that 8% had already disappeared by the end of that year.
But what matters is not simply how many web pages have disappeared, but where they disappeared from. On that score, 23% of news pages and 21% of pages on US government sites contained at least one broken link.
The most interesting barometer of all for link rot is Wikipedia, a site which depends heavily on referenced links to external information sources.
Despite the importance of those references, the researchers found at least one broken link on 54% of a sample of 50,000 English-language Wikipedia entries. Of the one million references on those pages, 11% were no longer accessible.
Disappearing tweets
And it’s not just links. Looking at that other cultural reference point, “tweets” on the X (formerly Twitter) platform, a similar pattern was evident. From a representative sample of 5 million tweets posted between 8 March and 27 April 2023, the team found that by 15 June, 18% had disappeared. And that figure could get a lot higher if the company ever stops redirecting URLs from its historic twitter.com domain name.
Some languages were more affected by disappearing tweets than others: 20% of English-language tweets had gone, while for those in Arabic and Turkish the figures were an extraordinary 42% and 49%, respectively.
Pew is not the first to look into the issue. In 2021, a Harvard Law School analysis of 2,283,445 links inside New York Times articles found that of the 72% that were deep links (i.e., pointing to a specific article rather than a homepage), 25% were inaccessible.
As a website that’s been in existence since 1996, The New York Times is a good measure of long-term link rot. Not surprisingly, the further back in time you went, the more rot was evident, with 72% of links dating to 1998 and 42% from 2008 no longer accessible.
This study also looked at content drift, that is, the extent to which a page remains accessible but has changed over time, sometimes dramatically, from its original form. Here, 13% of a sample of 4,500 pages published by The New York Times had drifted significantly since they were first published.
Where is IT going wrong?
Does any of this matter? One could argue that web pages disappearing or changing is simply inevitable, and that few people notice or care.
While the Pew researchers offer no judgement, the authors of the Harvard Law School study point out the problems link rot leaves in its wake:
“The fragility of the web poses an issue for any area of work or interest that is reliant on written records. […] More fundamentally, it leaves articles from decades past as shells of their former selves, cut off from their original sourcing and context.”
According to Mark Stockley, an experienced content management system (CMS) and web administrator who now works as a cybersecurity evangelist for security company Malwarebytes, while some link loss was inevitable, the scale of the issue suggested deeper administrative failures.
“People seem to be more ambivalent about losing pages than they used to be. When I first started working on the web, losing a page, or at least a URL, was anathema. If you didn’t need a page any more you at least replaced it with a redirect to a suitable alternative, to ensure there were no dead ends,” said Stockley.
“What’s baffling is when CMSs don’t pick up the slack. While some CMSs will catch mistakes and backfill URL changes with redirects automatically, there are others that, inexplicably, don’t. It’s an obvious and easy way to prevent a particular kind of link rot, and it’s baffling that it exists in 2024,” he said.
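As a rough illustration of the redirect pattern Stockley describes, here is a minimal sketch in Python using the Flask framework; the framework choice, the routes and the MOVED mapping are all illustrative placeholders, not drawn from any particular CMS:

```python
# Minimal sketch of "replace a retired page with a redirect" in Flask.
# The paths in MOVED are hypothetical; a real CMS would persist this
# mapping and update it automatically whenever a URL changes.
from flask import Flask, redirect

app = Flask(__name__)

# Retired URL -> suitable alternative, so no request ends in a dead end.
MOVED = {
    "/old-pricing": "/pricing",
    "/2013/launch-post": "/blog/launch-post",
}

@app.route("/<path:old_path>")
def forward(old_path):
    target = MOVED.get("/" + old_path)
    if target:
        # A 301 tells browsers and search engines the move is permanent,
        # so bookmarks and inbound links keep working.
        return redirect(target, code=301)
    return "Not found", 404
```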
Where the CMS doesn’t include a link-checking facility, admins can also deploy standalone link-checking tools that crawl a site to find broken links.
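To give a flavour of what such a tool does, here is a minimal single-page checker sketch using only the Python standard library; the start URL is a placeholder, and a production checker would add parallelism, retries, robots.txt handling and a GET fallback for servers that reject HEAD requests:

```python
# Rough sketch of a single-page link checker (standard library only).
from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def check_page(page_url):
    """Fetch one page and probe every HTTP(S) link it contains."""
    html = urlopen(Request(page_url)).read().decode("utf-8", errors="replace")
    parser = LinkExtractor()
    parser.feed(html)
    for href in parser.links:
        url = urljoin(page_url, href)        # resolve relative links
        if urlparse(url).scheme not in ("http", "https"):
            continue                         # skip mailto:, javascript:, etc.
        try:
            # HEAD avoids downloading the body; some servers reject it (405),
            # in which case a real checker would retry with GET.
            urlopen(Request(url, method="HEAD"), timeout=10)
        except HTTPError as err:
            print(f"BROKEN ({err.code}): {url}")        # e.g. 404 Not Found
        except URLError as err:
            print(f"UNREACHABLE: {url} ({err.reason})")

check_page("https://example.com/")           # placeholder start page
```

Run against a page’s URL, it prints one line per dead or unreachable link, which is exactly the signal an admin needs to backfill a redirect or update the reference.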
For CMS admins, spotting and correcting broken links should be a defined process, not an afterthought.
Anyone who wants more detail on the methodology behind When Online Content Disappears can follow this link (PDF).