
Could AI Consume Itself?

Steve Sammartino examines the implications of increasing amounts of content being AI-generated and what the creation of an AI-generated echo chamber means for investing.
By Steve Sammartino · 25 Jul 2023 · 5 min read

Since generative AI entered the mainstream, discussions about training data sets have become commonplace. Notably, corporations seem to care a great deal more than consumers ever have about their data being used for commercial applications without permission. And while those battles are likely to be drawn out in courts of law around the globe, there's another data reality we need to think about: where the data that AI is trained on actually comes from.

Generative AI Ingredients

There are three key ingredients that have made the generative AI revolution we’ve just entered possible:

  1. the AI models, most notably Large Language Models (LLMs), the neural networks themselves;
  2. the chips and processing units that can handle such enormous compute across many billions of parameters; and
  3. the data sets the LLMs are trained on.

Of course, without all three there can be none of it, but the last ingredient – the data sets – is vital, given that it is what the models learn from and use to predict what we want, ultimately providing the 'generative' content.

And up until November last year, when the first genuinely useful chatbot – ChatGPT – was released to the general public, almost everything on the internet was essentially 'us': human-created content. Articles, posts, podcasts, chat forums, videos, images – all created, posted, described and tagged by people.

These 30 or so years of deep and wide content populating the internet have been the perfect training ground for generative AI. All the breadth, nuance and insight of creative humans is what enabled the output to be so 'human-like' – even if it comes off a little dry at times. Yet in less than a year, the internet has started to morph and change shape.

The AI Internet

Just by reading the web in recent months, we can already see the shift. Google is rapidly trying to adapt to a 'direct answers' business model to replace its ten blue links. Social media is increasingly besieged with AI-generated content – or content telling you how to create AI-generated content – while technology firms and digital media outlets are cutting staff in a move to automate content creation well beyond earnings reports.

Both the demand for and supply of AI-generated content are skyrocketing, with the most common job posts in the content realm requiring applicants to 'work with AI' to accelerate output to 100 times what a human alone could produce.

In mere months, the digital landscape has transformed. Sites once filled with human insight and opinion are now flooded with AI-generated text, audio, images and video. Some AIs are even starting to quote and cite each other, creating echo chambers of misinformation. The internet is going through a hyper-scaled AI industrialisation. In a meaningful way, the internet is becoming less human.

While much of this is anecdotal, research is starting to emerge that demonstrates these changes.

Experiments in AI

Mechanical Turk is an Amazon-owned platform where individuals take on micro-jobs, earning small amounts for each task completed. Many social science researchers rely on it to recruit participants for their experiments. In a recent study, researchers examined crowdsourced workers on the platform and discovered a growing tendency to use large language models for text-based tasks: an estimated 33 to 46 per cent of workers were already using LLMs to complete their tasks. This is a concern because researchers recruit these workers precisely to study human behaviour and society; if language models are answering instead, the authenticity of the findings is compromised.

Another study, a collaboration between researchers at Oxford, Cambridge, Toronto and Imperial College London, found that the type of data in the models is all-important. They concluded that if you train an AI system on what they call 'synthetic data' – that is, data generated by another AI system – the models degrade rapidly, and ultimately collapse and fail to function. It may well be that data is a little like food: that which is generated naturally by humans, or 'organically', is different from the manufactured type.
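To get a feel for the mechanism, here is a minimal toy sketch (my illustration, not the study's method) of one source of this kind of 'model collapse': each generation of a model is fitted only to samples from the previous generation's output, and small sampling errors compound until the variety in the original data fades away.

```python
# Toy illustration of recursive training on synthetic data (model collapse).
# A "model" here is just a fitted Gaussian. Each generation is trained only
# on samples drawn from the previous generation's model. Sampling error
# compounds, and the estimated spread of the data tends to decay.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=20)  # generation 0: "human" data

for generation in range(1, 61):
    mu, sigma = data.mean(), data.std()      # fit this generation's model
    data = rng.normal(mu, sigma, size=20)    # next generation sees only synthetic output
    if generation % 10 == 0:
        print(f"generation {generation:2d}: fitted std = {sigma:.3f}")

# The fitted spread typically drifts toward zero: later generations can no
# longer reproduce the variety present in the original human data.
```

Real language models and data sets are vastly more complex, but this compounding of approximation errors across generations is one of the failure modes the researchers describe.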

This is where things get interesting, even a little strange. Given that all LLMs are trained on huge bodies of human text, it seems logical that we'll need to keep updating that corpus with fresh human content. And already, that requirement is being compromised by the AI era of the web.

This research essentially says that if enough of the internet becomes output from generative AI models, then the models will stop working – AI could well eat itself. But we don't know yet, because most models are not trained on live data; they rely on pre-generative-AI training sets scraped from the internet one to two years ago.

Dead Internet Theory

The Dead Internet Theory is a quasi-conspiracy theory that has been around for a few years. The general idea is that the internet has largely been taken over by bots – with Statista estimating they account for almost 50 per cent of web traffic. Given that generating attention and making money has become so algorithmically driven – a contest for SEO, likes, followers and fans – one way to win the game is to release bots that generate content to populate your feed or push your desired political message. Theorists posit that the internet will eventually be a battle of bots against bots, with humans mere bit players.

Investing Insights

This, and the potential for generative AI to consume itself, has dramatic implications for investment advice. Our game is won through careful study, nuance and insight. The counter-intuitive nature of successful investing means that high-probability outputs (which is what most AI produces) don't create any economic advantage. Likewise, reports rehashing the same data (AI-generated earnings summaries aside) provide little value to end users. What we need is the peculiar viewpoint a single human with unique experience can provide: industry insight, a combination of gut, intuition and data.

If – and it is still a big 'if' – AI becomes a circular-reference tool with degrading data, any advice it provides would just become a loud echo chamber worth avoiding. For now, at least, the human voice with a viewpoint that breaks from the crowd and nuanced insight still matters.
