
Data is the New “D” in DCF: Navigating the Emerging Markets for AI Training Data

  • Writer: Yiwang Lim
  • May 18
  • 3 min read

Updated: May 19


The AI stack has long been summarised as power + compute + data. While investors have poured capital into gigawatt-scale data centres and advanced nodes at TSMC, the market is only just pricing in the real scarcity factor: high-quality, rights-cleared data. A recent FT op-ed underscores this pivot, arguing that “every company should think about its strategy to capture emerging opportunities” in data markets.


Deal flow signals a new asset class

  • Reddit × Google – c.$60 m p.a. for 20 years of user-generated threads.

  • Financial Times × OpenAI – multi-year content-licensing partnership announced April 2024.

  • Stack Overflow × OpenAI – API-based deal giving OpenAI structured Q&A data.


These contracts have two common traits: (i) long-dated revenue visibility (quasi-SaaS) for the data owner; (ii) optionality for the AI buyer—datasets become more valuable as model size (and therefore marginal willingness to pay) scales. In DCF terms, data owners are effectively monetising an unrecognised intangible, shifting it from “off-balance-sheet asset” to annuity-style cash flows.
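To put a rough number on that annuity framing, here is a minimal present-value sketch in Python. The cash flow is the ~$60 m p.a. figure from the Reddit × Google bullet above; the 9 % discount rate and the flat 20-year profile are illustrative assumptions, not deal terms.

```python
# Illustrative PV of a 20-year data-licensing annuity.
# Cash flow taken from the Reddit x Google bullet above (~$60m p.a.);
# the 9% discount rate is an assumed WACC, purely for illustration.

def annuity_pv(cash_flow: float, rate: float, years: int) -> float:
    """Present value of a level annual cash flow received for `years` years."""
    return sum(cash_flow / (1 + rate) ** t for t in range(1, years + 1))

licence_fee = 60e6   # ~$60m per year (reported deal size)
wacc = 0.09          # assumed discount rate
term = 20            # contract length in years

pv = annuity_pv(licence_fee, wacc, term)
print(f"PV of licensing annuity: ${pv / 1e6:.0f}m")  # ~ $548m
```

On those assumptions, an archive carried at effectively zero capitalises at roughly $550 m of enterprise value.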


Supply crunch meets exponential demand

Epoch AI and MIT estimate that the stock of public, high-quality text could be exhausted between 2026 and 2032 if current scaling trends persist. OpenAI reportedly consumed ~13 trn tokens and ~$63 m of compute just to pre-train GPT-4. When the cost of incremental compute falls faster than the supply of fresh training data grows, the bottleneck flips from compute to data, and that is exactly what we are seeing.


From an investor’s perspective, this resembles a classic inelastic supply curve: demand accelerates, supply is capped by copyright and privacy constraints, so price (licensing fees) clears the market.
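A toy clearing-price model makes the point (all numbers below are hypothetical): fix the supply of rights-cleared tokens, let the demand curve shift outward with each model generation, and the clearing price absorbs the entire shift.

```python
# Toy market-clearing sketch (hypothetical numbers): linear demand D(p) = a - b*p,
# fixed (inelastic) supply S of rights-cleared tokens. Clearing price solves D(p) = S.

def clearing_price(a: float, b: float, supply: float) -> float:
    return (a - supply) / b

supply = 10.0   # fixed stock of licensable tokens (arbitrary units)
b = 2.0         # price sensitivity of demand (arbitrary units)

# Demand intercept grows with each model generation; supply stays capped.
for generation, a in enumerate([14.0, 20.0, 30.0], start=1):
    print(f"Gen {generation}: clearing price = {clearing_price(a, b, supply):.1f}")
# Output: 2.0, 5.0, 10.0 (demand growth flows straight into licensing fees when supply is capped)
```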


Synthetic data: from niche to mainstream

Grand View Research pegs the synthetic-data TAM at £170 m in 2023, compounding at a 35.3 % CAGR to 2030. Nvidia’s Omniverse and Tesla’s digital twin of global roads illustrate why: unlimited, label-perfect datasets de-risk safety-critical AI without real-world sampling errors. My take: synthetic data will not replace proprietary human data, but it will extend it—think of it as leverage. Owners of “ground-truth” datasets still control the seed that generates the synthetic derivative.
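Compounding the figures above forward gives a sense of scale; this is a straight-line sketch assuming the quoted CAGR holds through 2030.

```python
# Compound the 2023 base at the quoted CAGR to 2030 (figures from the Grand View
# Research estimate cited above; straight-line CAGR assumption, no cyclicality).
base_2023 = 170e6        # £170m TAM in 2023
cagr = 0.353             # quoted compound annual growth rate
years = 2030 - 2023      # 7 years of compounding

tam_2030 = base_2023 * (1 + cagr) ** years
print(f"Implied 2030 TAM: £{tam_2030 / 1e9:.2f}bn")  # ~ £1.41bn
```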


Regulatory arbitrage emerging

  • EU AI Act – sweeping risk-tiered regime; general-purpose AI (GPAI) compliance hits Aug 2025.

  • UK Data (Use and Access) Bill – Lords’ amendment would force model developers to disclose copyrighted inputs; ping-pong continues.


If the amendment survives, UK-based AI firms could face disclosure obligations—and potential royalty back-payments—that their US rivals may avoid. Cross-border investors should stress-test EBITDA margins for “data-royalty drag” under a worst-case UK/EU scenario.
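A minimal stress-test of that royalty drag might look like the following; the revenue base, margin and royalty rates are hypothetical placeholders rather than estimates for any particular firm.

```python
# Stress-test EBITDA margin for a hypothetical data-royalty charge levied on revenue.
# All inputs are illustrative assumptions, not company data.
revenue = 500e6          # £500m revenue base (hypothetical)
ebitda_margin = 0.30     # pre-royalty EBITDA margin (hypothetical)

for royalty_rate in (0.00, 0.02, 0.05, 0.10):   # royalty as % of revenue
    ebitda = revenue * ebitda_margin - revenue * royalty_rate
    print(f"Royalty {royalty_rate:>4.0%} -> EBITDA margin {ebitda / revenue:.0%}")
# 0% -> 30%, 2% -> 28%, 5% -> 25%, 10% -> 20%: a full re-basing of margins in the worst case.
```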


Where the alpha is

Theme | Listed Picks | Rationale
Proprietary “narrow-domain” content | RELX, Pearson | Academic & legal archives ideal for chain-of-thought training; high switching costs.
Vertical data marketplaces | Snowflake, Databricks (private) | Enabling data-clean-room licensing and usage tracking—critical for provenance compliance.
Simulation & synthetic infrastructure | Nvidia, Unity | Hardware + software toolchain for photorealistic digital twins.

A different lens: mid-cap UK publishers with rich but under-monetised archives (e.g., Haynes manuals, trade-press portfolios) are ripe for carve-outs. A roll-up could arbitrage valuation gaps between “old media” (<8 × EBITDA) and “AI data platforms” (>20 ×).
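The arbitrage arithmetic, sketched with a hypothetical EBITDA base and the multiple ranges quoted above:

```python
# Roll-up arbitrage sketch: buy archive-rich publishers at an "old media" multiple,
# re-rate the data assets at an "AI data platform" multiple. Illustrative numbers only.
target_ebitda = 25e6     # £25m combined EBITDA of acquired archives (hypothetical)
entry_multiple = 8       # <8x EBITDA for old-media assets (from the text above)
exit_multiple = 20       # >20x EBITDA for AI data platforms (from the text above)

entry_ev = target_ebitda * entry_multiple
exit_ev = target_ebitda * exit_multiple
print(f"Entry EV £{entry_ev / 1e6:.0f}m -> re-rated EV £{exit_ev / 1e6:.0f}m "
      f"({exit_ev / entry_ev:.1f}x uplift before any EBITDA growth)")
# Entry EV £200m -> re-rated EV £500m (2.5x uplift before any EBITDA growth)
```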


MY VIEW

Data is following the trajectory of spectrum auctions in the early 2000s: once-ignored assets becoming price-setting. The difference? Data can be replicated (near-zero marginal cost) but not perfectly substituted. That asymmetry favours owners with provenance-clean, well-labelled datasets—precisely the type most incumbents have been sitting on for decades.


For analysts: start modelling a “Data-as-a-Service” (DaaS) line item. For operators: audit your data lake, fix metadata, and get ahead on consent architecture. For regulators: focus on authentication standards; ex-post copyright fights are value-destructive for all parties.
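One way to bolt that DaaS line item onto an existing revenue build, as a sketch with a hypothetical base fee and growth rate:

```python
# Minimal DaaS line-item sketch for a revenue build: a new licensing line layered
# onto an existing forecast. Base-year fee and growth rate are hypothetical.
daas_base = 10e6         # year-1 licensing revenue (hypothetical)
growth = 0.25            # assumed annual growth in licensing fees

daas_line = [daas_base * (1 + growth) ** t for t in range(5)]
print([f"£{x / 1e6:.1f}m" for x in daas_line])
# ['£10.0m', '£12.5m', '£15.6m', '£19.5m', '£24.4m']; feeds straight into the DCF.
```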


Bottom line: The market is already paying up for GPUs and megawatt hours. The next rerating will accrue to firms that can supply—or synthetically amplify—scarce, lawful data. Ignore this and your valuation model is missing the key driver of AI ROIC over the next cycle.

 
 
 
