The internet is dead. And we killed it.
Though Wikipedia describes the dead internet theory as a “conspiracy theory” [2], I think many people are starting to believe in it, or are at least passingly aware of it. The dead internet theory postulates that the internet is now populated almost entirely by bots and generated content, with essentially no humans left [3]. It sounds implausible and fringe, but we can see a sharp increase in interest in the theory over time [4]:

Correspondingly, with the rise of generative AI (ChatGPT [5], for example, which launched towards the end of 2022), more and more people are starting to take the theory seriously. This is especially interesting, or maybe frightening, for me as a data scientist, because so much of our data comes from scraping freely available information from the web. What spurred me to write about this topic is the recent announcement from the maintainer of the wordfreq Python library [6] that they intend to sunset the project [7]. In their own words:
“
I don't think anyone has reliable information about post-2021 language usage by humans.
The open Web (via OSCAR) was one of wordfreq's data sources. Now the Web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies.
Sure, there was spam in the wordfreq data sources, but it was manageable and often identifiable. Large language models generate text that masquerades as real language with intention behind it, even though there is none, and their output crops up everywhere.
As one example, Philip Shapira reports that ChatGPT (OpenAI's popular brand of generative language model circa 2024) is obsessed with the word "delve" in a way that people never have been, and caused its overall frequency to increase by an order of magnitude.
”
The article mentioned by Robyn Speer [8] is equal parts hilarious and sad, and its main graphic is very telling:

Basically, we’re starting to see a marked increase in ChatGPT-generated text polluting the genuinely usable data we rely on for training NLP models, or for doing NLP in general.
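For context, this is how word frequencies are typically queried from wordfreq. This is just a minimal sketch for spot-checking words like “delve”; the exact numbers you get depend on the data snapshot bundled with your installed version of the library.

```python
# Spot-check word frequencies with wordfreq (pip install wordfreq).
# Values depend on the data snapshot in your installed version.
from wordfreq import word_frequency, zipf_frequency

for word in ["delve", "the", "internet"]:
    # word_frequency returns the word's share of all word occurrences;
    # zipf_frequency is the same information on a log scale
    # (roughly 1 = very rare, 7 = extremely common).
    freq = word_frequency(word, "en")
    zipf = zipf_frequency(word, "en")
    print(f"{word!r}: frequency={freq:.2e}, Zipf={zipf:.2f}")
```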
Well, I don’t see generative AI output truly declining without some sort of legislative intervention, so we need to start building tools that detect and flag generated content. I don’t claim to be smart enough to build a truly accurate ChatGPT detector [9]. But I am a Bayesian, and I do believe it’s interesting and useful to try to train a Bayesian generative AI detector. In this series, I’ll take you through the steps I took to train this detector.
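To preview the general flavour of a “Bayesian” detector, here is a minimal sketch using a naive Bayes text classifier from scikit-learn. This is only an illustration of the idea, not the detector built in this series, and the tiny human_texts / ai_texts lists are purely hypothetical placeholders for real training data.

```python
# Minimal sketch: bag-of-words features + multinomial naive Bayes,
# i.e. P(class | words) via Bayes' rule with a conditional
# independence assumption over word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical placeholder data; a real detector needs far more text.
human_texts = [
    "tbh the new patch broke my build again",
    "ok so here's what actually happened at the meetup",
]
ai_texts = [
    "In this article, we will delve into the multifaceted landscape of productivity.",
    "Certainly! Here is a comprehensive overview of the topic.",
]

X = human_texts + ai_texts
y = ["human"] * len(human_texts) + ["ai"] * len(ai_texts)

detector = make_pipeline(CountVectorizer(), MultinomialNB())
detector.fit(X, y)

print(detector.predict(["Let's delve into the transformative potential of synergy."]))
print(detector.predict_proba(["my cat knocked the router off the shelf lol"]))
```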