AI Startups Have Tons of Cash, but Not Enough Data. That’s a Problem.

“We’ve seen lots of pitches from companies who may well be pursuing a brilliant application of AI, but they don’t have access to data that will give them the ability to build a powerful application, let alone proprietary data that will help them have competitive moats in their business,” said Brad Svrluga, co-founder and general partner of venture-capital firm Primary Venture Partners.

Today, having the right data is more critical than ever for success. Now that building the actual models has become somewhat commoditized, the real value is in the data, said Paul Tyma, CTO-in-residence of Bullpen Capital.

Venture funding in generative-AI startups has grown from $4.8 billion in 2022 to $12.7 billion in the first five months of 2023, according to PitchBook. Now many of these companies are looking to build more niche AI models in areas such as finance or healthcare—but getting access to training data sets there hasn’t been easy.

Some AI startups are targeting partnerships with big, data-rich enterprises. For example, Marna Ricker, global vice chair for tax at EY, said the firm has generative-AI startups approaching it every day thanks to its vast troves of transactional data. But EY’s global managing partner of client service, Andy Baldwin, said he has concerns about what would happen to EY’s data if it is used to train an outside model.

“Who owns that data? And when we train the model, what’s our access rights to that model? And how else are other people going to be able to use that model?” said Baldwin. “The data is part of our IP that we bring.”

Startups can get around the IP issue by training a different model for each of their clients on only that client’s data. That’s a strategy that startup TermSheet is using to build its Ethan product, a generative-AI model to answer industry questions for real-estate developers, brokers and investors. But even getting clients to agree to that takes some education and convincing, said CEO and co-founder Roger Smith.

Convincing enterprises that you have a strong cybersecurity posture and can actually protect that data can also be a challenge, said Andy Wilson, co-founder and CEO of legal tech company Logikcull.

Primary Venture Partners’ Svrluga said larger tech incumbents may hold an advantage over startups in generative-AI applications, in part because they already have the trust of large customers who are comfortable with them handling data.

Tracy Daniels, chief data officer at financial-services company Truist, said she is currently exploring generative-AI use cases only with larger tech vendors rather than startups. She said she can trust larger vendors to keep the data safe.

That means even startups that get a head start with publicly available data face challenges fleshing out their models with enterprise data sets. Veesual, an AI startup that can generate images of what people look like trying on clothes, initially leveraged public images from the internet for training, but struggled to get large retailers to agree to hand over their data to enhance the model.

In some cases, the large retailers asked for huge payouts or equity in the company in return for the profit Veesual would make from that data, and the deals didn’t come to fruition, said CEO and co-founder Maxime Patte.

PatentPal, a generative-AI startup that helps law firms draft patent applications, is trained on publicly available patent filings, said CEO and founder Jack Xu. There is an opportunity to make the tool more accurate by continuing to train it on actual customer feedback that has been encrypted or anonymized, he said. But it is complex because that feedback has to be separated from highly sensitive and confidential data, including trade secrets.

“For early stage startups, there’s an issue of brand recognition, an issue of social proof,” he said.

But at the same time, the pressure is on. Adam Struck, founder and managing partner of Struck Capital, said that some startups are racing to compete with each other to secure more data within certain niches, and to do it faster.

“If you believe there’s a proprietary data set, you’re going to want to get that before them and then negotiate exclusivity,” he said. “In that sense it almost becomes an arms race.”