Tech companies are turning to ‘synthetic data’ to train AI models – but there’s a hidden cost

Last week, billionaire and owner of X, Elon Musk, claimed the pool of human-generated data used to train artificial intelligence (AI) models such as ChatGPT is exhausted.

Musk cited no evidence to support this claim. But other high-profile figures in the tech industry have made similar claims in recent months. And earlier research estimated that human-generated data would run out within two to eight years.

This is largely because humans cannot create new data such as text, videos, and images fast enough to meet the rapid and enormous demands of AI models. When authentic data runs out, it will pose a major problem for both AI developers and users.

This will force tech companies to rely more on AI-generated data, known as “synthetic data.” And this, in turn, could make the AI systems currently used by hundreds of millions of people less accurate and less reliable – and therefore less useful.

But this is not an inevitable outcome. In fact, if used and managed carefully, synthetic data could improve AI models.

Tech companies like OpenAI are using more synthetic data to train AI models. T. Schneider/Shutterstock

Problems with real data

Technology companies depend on data – real or synthetic – to create, train and refine generative AI models such as ChatGPT. The quality of this data is crucial. Bad data leads to bad results, in the same way that using low-quality ingredients in cooking can produce low-quality meals.

Real data refers to text, video, and images created by humans. Companies collect it through methods such as surveys, experiments, observations, or crawling websites and social media.

Real-world data is generally considered valuable because it includes real events and captures a wide range of scenarios and contexts. However, it’s not perfect.

For example, it may contain spelling errors and inconsistent or irrelevant content. It can also be heavily biased, which can, for example, lead generative AI models to create pictures that only show men or white people doing certain jobs.

This type of data also requires a lot of time and effort to prepare. First, people collect data sets, before labeling them to make them meaningful for an AI model. They will then review and clean this data to resolve any inconsistencies, before computers filter, organize and validate it.

This process can take up to 80% of the total time invested in the development of an AI system.

But as noted above, real data is also in increasingly limited supply because humans can’t produce it fast enough to meet the growing demand from AI.

The rise of synthetic data

Synthetic data is created artificially or generated by algorithms, such as text generated by ChatGPT or an image generated by DALL-E.

In theory, synthetic data offers a cost-effective and faster solution for training AI models.

It also addresses privacy concerns and ethical questions, especially with sensitive personal information such as health data.

Importantly, unlike real data, synthetic data is not in short supply. In fact, it’s unlimited.

From here its only synthetic data.

“The cumulative sum of human knowledge has been exhausted in AI training. This happened basically last year.”

- Elon pic.twitter.com/rdPzCbvdLv

- Rohan Paul (@rohanpaul_ai) January 9, 2025

The challenges of synthetic data

For these reasons, technology companies are increasingly turning to synthetic data to train their AI systems. The research firm Gartner estimates that by 2030, synthetic data will become the main form of data used in AI.

But while synthetic data offers promising solutions, it is not without challenges.

One of the main concerns is that AI models can “collapse” when they rely too much on synthetic data. This means they start generating so many “hallucinations” – responses containing false information – and decline so much in quality and performance that they become unusable.

For example, AI models already struggle to spell some words correctly. If this error-riddled data is used to train other models, then they too are bound to reproduce the errors.
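The feedback loop behind this concern can be illustrated with a toy simulation. This is a minimal sketch, not a model of any real AI system: the "model" here is just a bell curve fitted to a small sample, and each generation is trained only on data sampled from the previous generation's fit. The function name, sample size and number of generations are all assumptions chosen for illustration.

```python
import random
from statistics import fmean, pstdev


def simulate_generations(generations=50, sample_size=20, seed=0):
    """Repeatedly refit a normal distribution to its own samples.

    Returns the fitted spread (standard deviation) at each generation,
    starting with the spread of the original "real" data.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # stand-in for the real data distribution (assumed)
    history = [sigma]
    for _ in range(generations):
        # "Synthetic data": samples drawn from the current model only.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # "Retraining": the next model is fitted to that synthetic data.
        mu, sigma = fmean(samples), pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    spreads = simulate_generations()
    print(f"spread at generation 0: {spreads[0]:.3f}")
    print(f"spread at generation {len(spreads) - 1}: {spreads[-1]:.3f}")
```

In runs of this kind the fitted spread tends to drift away from the original value and, over enough generations, to shrink, because each refit can only see what the previous generation happened to produce. The exact numbers depend on the seed and sample size, so treat any single run as an illustration rather than a measurement.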

Synthetic data also carries a risk of being too simplistic. It may lack the nuanced detail and diversity found in real data sets, which could make the results of AI models trained on it too simplistic and less useful.

Creating robust systems to keep AI accurate and reliable

To resolve these issues, it is essential that international bodies and organizations such as the International Organization for Standardization or the United Nations’ International Telecommunication Union introduce robust systems for tracking and validating AI training data, and ensure that these systems can be implemented globally.

AI systems can be equipped to track metadata, allowing users or systems to trace the origin and quality of any synthetic data they have been trained on. This would complement a globally standard monitoring and validation system.
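As a rough illustration of what such metadata tracking might look like in practice, the sketch below attaches provenance fields to each training example. The field names (`origin`, `generator`, `quality_score`) and the two helper functions are hypothetical, invented for this example; they do not come from any existing standard or library.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainingRecord:
    """One training example plus hypothetical provenance metadata."""
    text: str
    origin: str                            # "human" or "synthetic" (assumed labels)
    generator: Optional[str] = None        # model that produced it, if synthetic
    quality_score: Optional[float] = None  # assumed 0.0-1.0 score from a checker


def summarize_provenance(records):
    """Return the share of the dataset that comes from each origin."""
    counts = Counter(r.origin for r in records)
    total = sum(counts.values())
    return {origin: n / total for origin, n in counts.items()}


def filter_by_quality(records, min_score):
    """Keep human records, and synthetic records that meet the threshold."""
    return [
        r for r in records
        if r.origin == "human"
        or (r.quality_score is not None and r.quality_score >= min_score)
    ]


if __name__ == "__main__":
    data = [
        TrainingRecord("A sentence written by a person.", "human"),
        TrainingRecord("A generated sentence.", "synthetic", "model-x", 0.9),
        TrainingRecord("A poor generated sentence.", "synthetic", "model-x", 0.2),
    ]
    print(summarize_provenance(data))
    print(len(filter_by_quality(data, min_score=0.5)))
```

Keeping the origin alongside each example means the question "how much of this model's training data was synthetic, and where did it come from?" can be answered later, which is the kind of traceability described above.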

Humans also need to monitor synthetic data throughout the process of training an AI model to ensure it is of high quality. This oversight should include setting goals, validating data quality, ensuring ethical standards are met, and monitoring AI model performance.

Ironically, AI algorithms can also play a role in data auditing and verification, checking the accuracy of AI-generated outputs from other models. For example, these algorithms can compare synthetic data to real data to identify errors or discrepancies and so help ensure the data is consistent and accurate. Used this way, synthetic data could lead to better AI models.
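A very simple version of such a comparison is sketched below. It is an assumption-laden example rather than a description of how any company audits its data: it takes one numeric feature (say, sentence length) measured on real and synthetic examples, and flags any summary statistic where the synthetic data differs from the real data by more than a chosen tolerance. The function name and the 20% default tolerance are illustrative choices.

```python
from statistics import mean, pstdev


def compare_feature(real_values, synthetic_values, tolerance=0.2):
    """Flag summary statistics where synthetic data drifts from real data.

    Returns a dict mapping each statistic's name to a tuple of
    (real value, synthetic value, flagged).
    """
    report = {}
    for name, stat in (("mean", mean), ("stdev", pstdev)):
        real_stat = stat(real_values)
        synth_stat = stat(synthetic_values)
        # Relative difference, with a fallback scale to avoid dividing by zero.
        scale = abs(real_stat) if real_stat != 0 else 1.0
        flagged = abs(real_stat - synth_stat) / scale > tolerance
        report[name] = (real_stat, synth_stat, flagged)
    return report


if __name__ == "__main__":
    real_lengths = [10, 20, 30, 40, 50]       # varied, like real text
    synthetic_lengths = [29, 30, 30, 30, 31]  # same average, far less variety
    for name, (real, synth, flagged) in compare_feature(
        real_lengths, synthetic_lengths
    ).items():
        print(name, round(real, 2), round(synth, 2), "FLAGGED" if flagged else "ok")
```

In this made-up example the averages match, so only the spread is flagged: the synthetic data has the right typical value but lacks the variety of the real data. A real audit would look at many more features than a single number.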

The future of AI depends on high-quality data. Synthetic data will play an increasingly important role in overcoming data shortages.

However, its use must be carefully managed to maintain transparency, reduce errors and preserve privacy, so that synthetic data serves as a reliable complement to real data and keeps AI systems accurate and reliable.