A Quarter of the Information on the Internet Has Disappeared. What Can Be Done?

Scientific publications, government reports, journalistic investigations are disappearing — not even archival copies remain.

In 1938, "World Brain", a collection of essays and lectures by science fiction writer H. G. Wells, appeared in bookstores. It was a kind of manifesto about the future of human knowledge.

Wells dreamed of a constantly updated "world encyclopedia" — a single repository where scientists, politicians, and ordinary people would find reliable answers to any question. Such a system was supposed to help humanity cope with the chaos of information and avoid wars. "I am talking about the process of the intellectual unification of the world, which, it seems to me, is as inevitable as anything can be inevitable in human affairs," Wells addressed the audience at the World Congress on Universal Documentation in 1937. "The world is like a phoenix: it dies in fire and, dying, is born anew. This synthesis of knowledge is a necessary beginning of a new world."

The internet emerged by the end of the 20th century (its technical foundations date to the 1960s, but the general public gained access only in the 1990s). It seemed that everything ever written, said, filmed, or published would remain online forever. It soon became clear that this was an illusion.

According to the Pew Research Center, about one in four web pages that existed at some point between 2013 and 2023 is no longer accessible. Among pages that existed in 2013, 38% have disappeared. Roughly one in five news pages and government pages contains at least one broken link, and more than half of Wikipedia articles cite at least one source that leads nowhere. The situation is no better for legal documents: about half of the URLs in U.S. Supreme Court decisions no longer lead to the original materials, and in the Harvard Law Review and other specialized journals the figure is over 70%.

Scientific publications, government reports, and journalistic investigations are disappearing, and not even archival copies remain. The internet does not remember everything, as the popular meme claims; it gradually forgets.

This phenomenon has a precise name: link rot. You read an old article, click a hyperlink, and land in a void with the message "404 Not Found." The causes vary: the site has moved, the domain was not renewed, or the company has closed altogether. In every case the result is the same: an empty page, as if nothing had ever been there.
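For readers who want to test their own links, the check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: treating "no answer", 404, and 410 as rot is a simplification chosen here, the helper names are hypothetical, and the lookup relies on the Internet Archive's public Wayback availability endpoint, whose response format may change.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional

# Public availability endpoint of the Wayback Machine.
WAYBACK_API = "https://archive.org/wayback/available"


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if no server answered."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a 4xx/5xx code
    except (urllib.error.URLError, TimeoutError, ValueError):
        return None  # DNS failure, lapsed domain, timeout, malformed URL


def is_rotten(status: Optional[int]) -> bool:
    """Assumed definition of link rot: no answer, 404 Not Found, or 410 Gone."""
    return status is None or status in (404, 410)


def wayback_query_url(url: str) -> str:
    """Build the availability query for a given page address."""
    return WAYBACK_API + "?" + urllib.parse.urlencode({"url": url})


def closest_snapshot(payload: dict) -> Optional[str]:
    """Extract the archived copy's address from an availability response."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_archived_copy(url: str, timeout: float = 10.0) -> Optional[str]:
    """Ask the Wayback Machine whether it holds a copy of url (needs network)."""
    with urllib.request.urlopen(wayback_query_url(url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

A link counts as fully lost in the article's sense only when `is_rotten(fetch_status(url))` is true and `find_archived_copy(url)` returns nothing.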

The problem runs deeper than it seems. Knowledge is growing, but the systems meant to preserve it are failing. Of 7 million scientific articles examined, more than a quarter have either not been reliably archived or are already unavailable. "All our epistemology of science and research relies on a chain of citations," explains Martin Eve, an expert in literature, technology, and publishing. "If you cannot verify what someone else said at some other point, you simply rely on blind faith in artifacts that you can no longer witness yourself."

Perhaps the main reason the digital transition has not solved the problem of knowledge preservation, one of the things Wells dreamed of, is that responsibility has become diffuse.

In the era of print, preserving scientific materials was primarily the task of libraries. The internet destroyed this system without creating a new one in its place, even though it once seemed that technology would protect the world from repeating the fate of the Library of Alexandria. Open does not mean eternal. It remains unclear who should be responsible for preserving electronic archives: publishers, universities, libraries, or the authors themselves. When responsibility is distributed among everyone, in practice no one bears it.

The Dictatorship of Algorithms

Another marker of the era: the internet is increasingly filled with content created not for readers but for search algorithms. As recently as 2023, NewsGuard analysts described how major brands inadvertently fund websites that publish hundreds of AI-generated texts daily. This occurs through an advertising system: ads on popular pages are placed automatically, and the quality or origin of the content does not matter.

The Reuters Institute for the Study of Journalism at Oxford warns that such content is turning into digital noise. Websites are created not to inform the audience but for the cheapest possible optimization for search results. As a result, the materials of professional journalists and researchers risk dissolving in a stream of empty texts.

In the spring of 2024, Google announced a major update to its search algorithm aimed at pages created "primarily for search engines, not for people." The stated goal was to reduce the share of low-quality and unoriginal content in search results by 40%. The very fact of such an intervention shows how widespread the phenomenon has become.

Moreover, such materials are inherently short-lived. As long as a site generates advertising revenue, it exists, but as soon as the traffic flow dries up, the domain closes and disappears into oblivion. Thus, the internet is gradually filled with yet another type of digital ruins.

The Politics of 404

A separate threat is not technical but political. Since the beginning of Donald Trump's second term, more than 8,000 pages and about 3,000 datasets have been removed from U.S. federal websites or significantly altered. Data on climate, health, racial statistics, gender identity, and HIV disappeared within days of presidential directives.

In February 2025, a federal court ordered regulators to restore some of the removed content, and some materials returned. But the precedent itself turned out to be more important: in the digital age, the transition from a change in political course to the erasure of unwanted information can be very swift. Unlike printed books, which cannot be removed from all libraries at once, a web page (especially a government one) can disappear in seconds.

Serious data problems are also observed in Russia. After February 2022, about 1,000 datasets were hidden across 48 federal agencies, notes the project "To Be Precise" ("Если быть точным"). The peak came in 2022-2023, when sensitive data on the economy, crime, and mortality disappeared. In 2025, demographic statistics were hit hard: data on marriages, divorces, births, and population numbers are no longer publicly available.

A Trillion Pages

In October 2025, the Internet Archive, the largest archive of human memory ever created, passed the mark of 1 trillion saved pages, about 125 for every person alive today. This appears to be an order of magnitude more than the web collections of the largest libraries in the world.

This archive was founded by Brewster Kahle in 1996 in San Francisco with a simple and almost utopian mission: "Universal access to all knowledge." Over three decades, the organization has become one of the most visited non-profit websites in the world, with an annual budget of more than $20 million. It has also managed to accumulate over 200 petabytes of data. For context: renting the same 200 petabytes on commercial servers like Amazon S3 at standard rates would cost about twice the archive's annual budget.
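The comparison with commercial storage can be checked with back-of-the-envelope arithmetic. The sketch below assumes a flat rate of $0.021 per gigabyte per month, which is an illustrative figure for large-volume standard object storage and not a quoted price; request and data transfer fees are ignored, and the $20 million budget is the lower bound given above.

```python
PETABYTE_IN_GB = 1_000_000        # decimal units: 1 PB = 10^6 GB
ASSUMED_USD_PER_GB_MONTH = 0.021  # assumption for illustration, not a quoted price
ASSUMED_ANNUAL_BUDGET_USD = 20_000_000  # lower bound of the budget cited above


def yearly_storage_cost_usd(
    petabytes: float, usd_per_gb_month: float = ASSUMED_USD_PER_GB_MONTH
) -> float:
    """Cost of simply holding the data for twelve months, storage fees only."""
    return petabytes * PETABYTE_IN_GB * usd_per_gb_month * 12


if __name__ == "__main__":
    cost = yearly_storage_cost_usd(200)
    ratio = cost / ASSUMED_ANNUAL_BUDGET_USD
    print(f"{cost / 1e6:.1f} million USD per year, {ratio:.1f}x the assumed budget")
```

Under these assumptions the bill comes to roughly $50 million a year, which is two to two and a half times a budget in the range of $20 to 25 million.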

The project survives thanks to ingenuity bordering on the ascetic. The archive's servers are built to its own designs: engineers developed a storage system called PetaBox, made of high-density, energy-efficient racks with thousands of hard drives, which holds the collections behind the Wayback Machine, the archive's record of web pages. Some of the racks are located in an old church in San Francisco. They preserve the memory of humanity (about 150 TB of important data every day!) and also heat the building.

The Internet Archive operates under the principle of LOCKSS, or Lots of Copies Keep Stuff Safe, meaning reliability through redundancy. The database exists in several physical copies, distributed across different points on the planet — from Africa to Canada, so that a local disaster cannot destroy the collection entirely. In 2025, the archive established a headquarters in Amsterdam.
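Why redundancy helps can be shown with a simple probability sketch. It assumes that each copy is lost independently with the same probability, which is exactly what geographic distribution tries to approximate; the numbers are invented for illustration and do not describe the archive's real risk.

```python
def probability_all_copies_lost(p_single_loss: float, copies: int) -> float:
    """Chance that every copy is lost, assuming independent, equal risks."""
    if not 0.0 <= p_single_loss <= 1.0:
        raise ValueError("p_single_loss must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return p_single_loss ** copies


if __name__ == "__main__":
    # Hypothetical 10% chance of losing any one site over some period.
    for n in (1, 2, 3, 4):
        print(n, "copies:", probability_all_copies_lost(0.1, n))
```

With a hypothetical 10% risk per site, three independent copies bring the chance of total loss down to about one in a thousand. Copies that share a building, a power grid, or an administrator password are not independent, which is why the copies are spread across continents.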

The viability of this architecture was confirmed by the hacker attacks of 2024. On October 9, the group SN_BlackMeta (many consider them vandal hackers, while they call themselves political activists) overloaded the archive with a flood of false requests and hacked the user database through a vulnerability in one of the libraries. The archived data and its backups were not harmed, and ultimately the Internet Archive withstood the attack.

In July 2025, U.S. Senator Alex Padilla of California designated the archive a federal depository library, the first time in history that this title was given to a digital organization. "This allows us to get closer to the source from which materials come," explained Brewster Kahle.

However, this step is largely symbolic: the new status does not provide additional funding and does not protect against copyright lawsuits. The archive continues to balance between its mission and legal reality, as it has since its inception, and it does not always manage to keep that balance.

Who Profits from This

In 2020, the four largest American publishers — Hachette, HarperCollins, Penguin Random House, and Wiley — filed a lawsuit against the Internet Archive for the practice of "controlled digital lending." The archive purchased physical books, scanned them, and provided free temporary access to the digital version on the principle of "one purchased physical copy = one file in one hand" — almost like in a regular library. Publishers called this piracy.
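The "one purchased copy, one loan" rule described above is easy to state as code. The class below is a hypothetical illustration of that constraint, not the Internet Archive's actual lending software; the class and method names are invented for this sketch.

```python
class ControlledLendingShelf:
    """Loans of one title, capped at the number of physical copies owned."""

    def __init__(self, owned_copies: int) -> None:
        if owned_copies < 0:
            raise ValueError("owned_copies must not be negative")
        self.owned_copies = owned_copies
        self.borrowers = set()  # readers who currently hold a digital loan

    def borrow(self, reader: str) -> bool:
        """Lend a digital copy only if a purchased copy is still free."""
        if reader in self.borrowers:
            return True  # this reader already holds a loan
        if len(self.borrowers) >= self.owned_copies:
            return False  # every purchased copy is already on loan
        self.borrowers.add(reader)
        return True

    def give_back(self, reader: str) -> None:
        """Return a loan, freeing the copy for the next reader."""
        self.borrowers.discard(reader)


if __name__ == "__main__":
    shelf = ControlledLendingShelf(owned_copies=1)
    print(shelf.borrow("first reader"))   # the single copy goes out
    print(shelf.borrow("second reader"))  # refused until the copy returns
    shelf.give_back("first reader")
    print(shelf.borrow("second reader"))
```

The point of the sketch is the cap: the number of simultaneous digital loans never exceeds the number of physical copies bought, which is what the archive argued made the practice equivalent to ordinary library lending.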

In March 2023, the court sided with the publishers, and in September 2024, the Second Circuit Court of Appeals upheld the decision: digitizing a physical copy of a book and lending it out violates the rights of publishers and authors. As the court ruled, the archive's use "is not transformative, does not add new expression, meaning, or message to the original works," but simply copies and distributes content without the permission of rights holders. The archive did not appeal further, and over 500,000 books had to be removed.

Critics of the decision warn about the precedent. "Libraries are already burdened by licensing fees for e-books," said Dave Hansen, executive director of Authors Alliance. "This decision may benefit only the largest publishers and the most famous authors, but it promises more harm for the rest. It could even stifle academic research and education as a whole."

At the same time, the conflict with the music industry has intensified. In 2023, Universal Music Group, Sony Music Entertainment, and Concord filed a $621 million lawsuit against the Internet Archive for digitizing old records as part of the Great 78 Project. More than 600 musicians defended the archive, and a petition to withdraw the lawsuit gathered 125,000 signatures. In September 2025, the case was settled under confidential terms.

Artificial Intelligence Rules

Simultaneously, a new scandal is unfolding. Publishers have begun blocking the Internet Archive's crawlers, but not because they oppose archiving as such. The problem lies elsewhere: open archives have become a convenient source of data for AI companies.

Indeed, while news sites are closing direct access to AI bots and agents and filing lawsuits against model developers, AI companies are increasingly turning to public infrastructure, such as archives and indexes, because these remain open for machine access by definition. As a result, the pressure initially directed against AI giants is shifting to institutions of digital memory.

The logic of the media industry here is quite consistent: access to content should be paid for. Gannett (owner of USA Today and hundreds of regional newspapers) reported that it blocks tens of millions of requests from AI bots every month, a significant portion of which is related to OpenAI. At the same time, the conglomerate has signed licensing deals with Perplexity.

In this scheme, the archive appears not as an enemy but as an inconvenient intermediary: it does not charge for access and cannot control who uses it. The Hachette v. Internet Archive case revealed a broader contradiction: the more rights holders restrict AI companies' access to their content, the greater the burden on public archives and libraries. Ironically, the bastions of open digital knowledge have found themselves held hostage by their own openness.

In Anticipation of a Global Backup

International organizations have long been discussing digital memory, but these discussions have not yet formed into a coherent regulatory system.

At the global level, the UNESCO Charter on the Preservation of the Digital Heritage recognizes that copyright restricts copying so tightly that even transferring files to library systems may infringe the rights of their owners. It separately notes that without cooperation between publishers, archives, and libraries, long-term preservation of digital heritage is practically impossible. In practice, progress has been minimal.

However, some countries have begun to build digital memory infrastructure. The UK has required major internet resources to hand over content to the British Library and other libraries. Germany and Australia are developing their state web archives. In the U.S., the OPEN Government Data Act, passed in 2019, requires agencies to publish data in machine-readable formats, although nothing prevents them from deleting it.

Evidently, our heritage is not exabytes of priceless data, but the ability to take timely screenshots.