New Technology, Same Old Data Problem
I bet that you haven’t gone a single workday this year without either hearing about or using Generative AI. And I bet that most of what you heard wasn’t about hesitations around adoption, but about the promise of the future and all the innovative change this novel technology will bring. It’s hard not to be a believer, given the use cases in areas like protein structure prediction, drug discovery, and more. The hype is certainly backed by several studies, including one from Forrester showing that 70% of enterprise-level companies are already using GenAI and another 20% are exploring its use. So, where’s the problem?
Data is the problem! Although we don’t know what the GenAI future holds, we do know that large language models (LLMs) are heavily dependent on the quality of their training data. Thus, if training data is biased, incomplete, or erroneous, the outputs can become unreliable or even harmful. It’s the classic “garbage in, garbage out” problem.
Most LLMs are still trained primarily on openly available text data, such as web pages. While abundant, this data often lacks the quality control and consistency that training requires, which leads models to learn undesirable behaviors and produce poor outputs. Moreover, high-quality internet content is often drowned out by revenue-driven SEO posts, attention-grabbing headlines, and other shallow material. The result can be AI that sounds convincing but lacks depth and precision.
How do we resolve this? Training GenAI models on internal data offers a solution, because it tailors the model to the specific needs and context of an organization. Internal data is often more relevant, accurate, and representative of the organization's domain, resulting in models that produce more reliable and contextually appropriate outputs. It also gives organizations a competitive edge: such models offer insights and capabilities that competitors without access to the same data cannot easily replicate. Data is power, after all.
Now that we're using our own data, do the AI data problems vanish? Not so fast. What happens if the data is not accessible, or not clean and tagged? Internal data often requires significant preprocessing, such as cleaning, tagging, anonymizing, and structuring, before it is suitable for training. In fact, research firm IDC found that lack of access to data is one of the main challenges to AI adoption.
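To make that preprocessing step concrete, here is a minimal sketch of what cleaning, anonymizing, and tagging a single internal record might look like before fine-tuning. The field names, regex patterns, and tags are illustrative assumptions, not a production pipeline; a real deployment would rely on vetted PII-detection tooling and organization-specific rules.

```python
import re

# Illustrative patterns only (assumptions for this sketch); a real anonymization
# pass would use vetted PII-detection tooling and organization-specific rules.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b")

def clean_record(raw: dict) -> dict | None:
    """Clean, anonymize, and tag one internal record before it is used for training."""
    text = (raw.get("body") or "").strip()
    if not text:                          # drop empty or unusable records
        return None
    text = " ".join(text.split())         # normalize whitespace
    text = EMAIL_RE.sub("[EMAIL]", text)  # mask direct identifiers
    text = PHONE_RE.sub("[PHONE]", text)
    return {
        "text": text,
        "source": raw.get("system", "unknown"),            # provenance tag
        "department": raw.get("department", "unlabeled"),  # topical tag
    }

# Usage: filter a batch of raw exports down to training-ready examples.
raw_records = [{"body": "Call me at 555-123-4567 or jane@corp.com", "system": "crm"}]
training_ready = [r for r in (clean_record(x) for x in raw_records) if r is not None]
print(training_ready)
```

Even a toy pass like this makes the accessibility problem visible: every record needs provenance, labels, and identifier masking before it can safely feed a model.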
The same tools, techniques, and investments that companies already use to ensure data quality and storage for advanced analytics projects also work for GenAI applications, but many organizations have not invested appropriately in data management. GenAI makes the job harder still: data management teams can no longer predict which data must be cleansed to deliver accurate insights, because they don't control the data and certainly don't control which questions will be asked. Simply put, not every organization can afford the time, cost, and risk of using its actual data.
One interesting solution to this problem is to use synthetic data, which mimics real-world patterns without exposing sensitive personal information. Forrester defines synthetic data as “Generated data of any type (e.g., structured, transactional, image, audio) that duplicates, mimics, or extrapolates from the real world but maintains no direct link to it, particularly for scenarios where real-world data is unavailable, unusable, or strictly regulated.”
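As an illustration of the "no direct link" idea for structured data, here is a minimal sketch of one common approach: fit simple per-column statistics on real records, then sample entirely new rows from those fitted distributions so that no generated row maps back to a real one. The column names, values, and distributions are hypothetical, and production synthetic-data tools use far richer generative models.

```python
import random
import statistics

# Hypothetical records; in practice these would come from a governed internal dataset.
real_rows = [
    {"age": 34, "tenure_years": 5, "region": "EMEA"},
    {"age": 41, "tenure_years": 12, "region": "AMER"},
    {"age": 29, "tenure_years": 2, "region": "APAC"},
    {"age": 52, "tenure_years": 20, "region": "AMER"},
]

def fit_column_stats(rows):
    """Summarize each column: mean/stdev for numeric, observed values for categorical."""
    stats = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        if all(isinstance(v, (int, float)) for v in values):
            stats[col] = ("numeric", statistics.mean(values), statistics.pstdev(values))
        else:
            stats[col] = ("categorical", values, None)
    return stats

def sample_synthetic_rows(stats, n):
    """Draw brand-new rows from the fitted statistics; none correspond to a real record."""
    rows = []
    for _ in range(n):
        row = {}
        for col, (kind, a, b) in stats.items():
            if kind == "numeric":
                row[col] = round(random.gauss(a, b), 1)   # sample from fitted distribution
            else:
                row[col] = random.choice(a)               # sample observed categories
        rows.append(row)
    return rows

synthetic = sample_synthetic_rows(fit_column_stats(real_rows), n=3)
print(synthetic)
```

The trade-off shows even in this toy version: sampling each column independently discards correlations between columns, which is exactly the fidelity gap that dedicated synthetic-data generators aim to close.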
There are several benefits to adopting synthetic data. First, synthetically generated data that is completely disconnected from the original dataset isn't traceable back to its source, which matters in highly regulated industries. From a data accuracy perspective, developers would no longer need to rely on scraping bots that abuse authorized access to collect public data in bulk. This not only supports proper training but also enables new business models, such as credible content licensing (i.e., AI developers licensing content from journalism sites) or AI factories that generate data more efficiently.

At the very minimum, we can fill the kind of gaps that arise simply from LLMs behaving like “internet simulators.” For example, if your model is hallucinating because you don’t have enough training examples of people expressing happiness, or is biased because it has unrepresentative data for a certain demographic, then generate some better examples, as in the sketch below.
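Here is a hedged sketch of that gap-filling idea for a labeled dataset: find the underrepresented label and pad it with generated examples. The labels, templates, and target counts are stand-ins I've invented for illustration; in practice the new examples would come from a generative model and pass human review before entering a training set.

```python
import random
from collections import Counter

# Hypothetical labeled dataset that is short on "happy" examples.
examples = [
    {"text": "This rollout was a disaster.", "label": "negative"},
    {"text": "The report is fine, I guess.", "label": "neutral"},
    {"text": "I am thrilled with the new tooling!", "label": "happy"},
    {"text": "Support never got back to me.", "label": "negative"},
]

# Stand-in generator; a real pipeline would prompt a generative model and route
# outputs through human review before they touch the training set.
HAPPY_TEMPLATES = [
    "I'm really pleased with how {thing} turned out.",
    "{thing} exceeded my expectations, great work!",
]
THINGS = ["the migration", "the quarterly review", "the new dashboard"]

def balance_label(data, label, target_count):
    """Top up an underrepresented label with generated examples until it hits target_count."""
    counts = Counter(d["label"] for d in data)
    needed = max(0, target_count - counts[label])
    generated = [
        {"text": random.choice(HAPPY_TEMPLATES).format(thing=random.choice(THINGS)),
         "label": label, "synthetic": True}   # flag generated rows for auditability
        for _ in range(needed)
    ]
    return data + generated

balanced = balance_label(examples, label="happy", target_count=3)
print(Counter(d["label"] for d in balanced))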
Of course, synthetic data is not as simple as it sounds. Companies investing time and money in AI need to recognize the importance of using high-quality data, and it takes time, money, and effort to get it right. If AI builders don’t know how to manage data properly, how can they promise customers transparency and consent? Anything less could be a breach of trust. AI data management is a complicated problem with many roadblocks. But that’s why we, the humans, need to stay in the loop.
Nichole Bozyk | Senior Consultant, Deloitte Consulting LLP