DataHQ · AI thesis

The missing layer for enterprise AI is the Data Product, and the right way to build Data Products is above the engine, not inside it.

Everyone is layering AI on top of warehouses, lakehouses, and ERPs, and everyone is hitting the same ceiling. The model isn't the bottleneck; the context is. And the context you need (a versioned, governed, semantically precise view of how your business actually works) does not live in the engine. It has to be built on top of it, as a first-class product.

DataHQ · April 2026 · 12 minute read
The mechanic, in plain terms

Every event becomes a data point. Every data point posts to a driver. The DataHQ ledger is the dictionary.

Most BI products start at the report. We start at the dictionary, the contract that says what each event is, what each driver is, and how each one maps to a row on the General Ledger. Once that dictionary exists, three things become trivial that are otherwise impossible: closing the books in days, answering an agent's question without hallucination, and simulating the business by dragging a driver.

Source systems
POS · catalog · playout
ERP · WMS · TMS
CRM · ad-tech · GAM
Sub billing · payments
HRMS · roster
Banks · treasury
01 · Discover

Process logs & events → data points

Every observable action (a sale, a PO, a stock move, a payroll run, a campaign serve, a settlement received) is typed by the actor that produced it.

→ Typed data points
02 · The dictionary

DataHQ FinOps Ledger, the analytics-integrable Chart of Accounts

Every data point posts to a driver. Every driver decomposes to the smallest controllable unit: revenue, cost, financing, asset. Driver-based by construction.

→ Queryable in ms · auditable to the row
03 · Hand off

Your General Ledger CoA
Zoho · SAP · Oracle · Tally

The DataHQ CoA maps 1:1 onto your General Ledger. No schema fight, no duplicate data entry, no manual reconciliation at month-end.

→ Trial balance · audit by design
What it enables
Real-time outlet-wise P&L
MIS for every role
Budget variance & alerts
Driver-based simulation
Power BI · Tableau connector
Agents with grounded answers
1

The dictionary, not the report.

Most BI products start at the report. We start at the contract. Every event has a class. Every class has a posting rule. Every posting maps to a driver, and every driver reconciles to a line on your General Ledger. Reports are derivable; the dictionary is the asset. Anyone with the dictionary can produce every report. No-one without it can.
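The chain described above (event class, posting rule, driver, General Ledger line) can be sketched in a few lines. This is a minimal illustration, not DataHQ's implementation: the event classes, driver names, account codes, and the `post` and `trial_balance` helpers are all hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class PostingRule:
    driver: str      # smallest controllable unit the event moves
    gl_account: str  # General Ledger line the driver reconciles to
    sign: int        # +1 increases the driver, -1 decreases it


# Hypothetical dictionary: every event class has exactly one posting rule.
DICTIONARY = {
    "pos.sale": PostingRule("revenue.dine_in", "4000-Sales", +1),
    "pos.refund": PostingRule("revenue.dine_in", "4000-Sales", -1),
    "wms.goods_receipt": PostingRule("asset.inventory", "1200-Inventory", +1),
}


@dataclass(frozen=True)
class Posting:
    event_id: str  # kept on the posting so every number walks back to its event
    driver: str
    gl_account: str
    amount: Decimal


def post(event: dict) -> Posting:
    """Turn one typed event into one ledger posting, or fail loudly."""
    rule = DICTIONARY.get(event["class"])
    if rule is None:
        raise KeyError(f"no posting rule for event class {event['class']!r}")
    amount = rule.sign * Decimal(event["amount"])
    return Posting(event["id"], rule.driver, rule.gl_account, amount)


def trial_balance(postings) -> dict:
    """A report derived from the postings: totals per General Ledger account."""
    totals = {}
    for p in postings:
        totals[p.gl_account] = totals.get(p.gl_account, Decimal("0")) + p.amount
    return totals


events = [
    {"id": "e1", "class": "pos.sale", "amount": "120.00"},
    {"id": "e2", "class": "pos.refund", "amount": "20.00"},
    {"id": "e3", "class": "wms.goods_receipt", "amount": "45.50"},
]
postings = [post(e) for e in events]
balance = trial_balance(postings)
```

The trial balance here is a pure function of the postings, which is the sense in which reports are derivable once the dictionary exists. An event class with no rule raises instead of posting silently.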

2

The context engine, because of structure.

Because every data point is typed, posted, and linked to a driver, an LLM attached to the DataHQ ledger gets context, not retrieval. The agent doesn't search through documents or guess at a join; it reads from a ledger that already knows what each row is worth, what produced it, and how it moves. That is what makes the answer auditable.

3

The simulation engine, because of drivers.

Drag any driver (fuel price, channel mix, content slate, FX assumption, working-capital terms) and the entire ledger recomputes. Same governed rows. Same close-grade audit trail. The same structure that produces the report is what runs the scenario. No second copy. No spreadsheet ferry. No reconciliation when the meeting ends.
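A toy version of "drag a driver and the ledger recomputes" might look like the following. The driver names, the figures, and the `pnl` function are illustrative assumptions. The point is that the baseline and the scenario run through the same function, so there is no second model to reconcile.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Drivers:
    # Hypothetical drivers for a single outlet.
    covers_per_day: float
    avg_ticket: float
    food_cost_pct: float
    fuel_price: float           # per litre
    deliveries_per_day: float
    litres_per_delivery: float


def pnl(d: Drivers, days: int = 30) -> dict:
    """Recompute every P&L line from the drivers alone."""
    revenue = d.covers_per_day * d.avg_ticket * days
    food_cost = revenue * d.food_cost_pct
    fuel_cost = d.fuel_price * d.litres_per_delivery * d.deliveries_per_day * days
    return {
        "revenue": revenue,
        "food_cost": food_cost,
        "fuel_cost": fuel_cost,
        "margin": revenue - food_cost - fuel_cost,
    }


base = Drivers(
    covers_per_day=200, avg_ticket=25.0, food_cost_pct=0.30,
    fuel_price=1.5, deliveries_per_day=40, litres_per_delivery=2.0,
)
# "Dragging" a driver is just a copy of the baseline with one field changed.
scenario = replace(base, fuel_price=2.0)

base_pnl = pnl(base)
scenario_pnl = pnl(scenario)
delta = {line: scenario_pnl[line] - base_pnl[line] for line in base_pnl}
```

Because `Drivers` is frozen, the scenario never mutates the baseline, and `delta` attributes the whole change in margin to the one driver that moved.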

You can swap engines, you can swap models, you can swap vendors. What you cannot swap, cheaply, is the definition of revenue, the definition of a site, the definition of a unit, the definition of on-time. That lives above the engine, or it lives nowhere.
The DataHQ thesis · the data product as the durable layer
The architecture

Above the engine, not inside it.

Data products are not tables in Snowflake, not views in BigQuery, not dashboards in Power BI. They are versioned, governed, consumable contracts that sit on top of whichever engine you happen to run today and will keep working when you swap it tomorrow.

Consumption
AI agents · BI · humans · apps
Query by concept, not by column. "Margin for Dubai dine-in last week," not a 40-line CTE.
Consumers
The missing layer
Data Product Layer · semantics, lineage, contracts, governance
One definition of revenue, one definition of a site, one definition of SSSG. Versioned, owned, tested, addressable.
Semantic model · Lineage + audit · Contracts + SLAs · Row-level governance · Process twin · Addressable by LLM
DataHQ lives here
Engine · warehouse, lakehouse, OLAP
Snowflake, Databricks, BigQuery, Postgres, ClickHouse. Swappable. Boring. Commodity over time.
Compute & storage
Sources · POS, ERP, WMS, TMS, MES, HRMS, banks
The raw truth of the business, events as they happen. Messy, evolving, vendor-specific.
Systems of record
Why "inside the engine" fails

Two ways to build a data product. Only one compounds.

Build the data product inside the engine, and you tie your company's most valuable asset, the definition of its own business, to whichever vendor you signed with three years ago. Build it above the engine, and you own the asset while the vendor becomes a commodity.

The common mistake

Data product inside the engine

Dashboards in the warehouse. Semantics in the BI tool. Contracts in a Confluence page. AI plugged into a SQL generator that hits the raw tables.

  • Every engine migration is a rewrite of the business, not a technology swap
  • Definitions drift: three teams, three versions of "revenue," and no one is wrong
  • Lineage stops at the warehouse boundary; source events are opaque
  • AI sees tables, not concepts, so it hallucinates joins and filters
  • Row-level governance lives in the BI tool and does not extend to agents
  • You pay the vendor to keep owning your company's own model
The DataHQ way

Data product above the engine

A first-class layer that owns the semantic model, the lineage, the contracts, and the governance. The engine becomes swappable. The product is the asset.

  • Swap engines with zero rewrite: the contract is the contract
  • One definition each of revenue, site, SKU, LFL, and OEE, enforced everywhere
  • Lineage runs all the way back to the POS ticket, the GRN, the shift punch
  • AI queries concepts, not columns; answers are cited to source rows
  • Row-level governance flows through agents automatically
  • You own the layer that matters; the engine is a cost line, not a strategy
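To make "query by concept, not by column" and "the contract is the contract" concrete, here is a minimal sketch of one metric contract rendered for two engines. The `MetricContract` class, the table and column names, and `compile_query` are hypothetical. Only identifier quoting differs between the two targets, and the `?` placeholder is used purely for illustration, since placeholder style depends on the database driver. The "last week" time filter from the earlier example is omitted for brevity.

```python
from dataclasses import dataclass

# Identifier quoting per engine; only two engines shown for brevity.
QUOTE = {"postgres": '"{}"', "bigquery": "`{}`"}


@dataclass(frozen=True)
class MetricContract:
    name: str
    version: str
    table: str
    expression: str   # written against logical measure names
    measures: dict    # logical measure -> physical column
    dimensions: dict  # concept a consumer may filter on -> physical column


# Hypothetical contract; all physical names are made up for this sketch.
MARGIN = MetricContract(
    name="margin",
    version="1.0.0",
    table="fact_sales",
    expression="SUM({revenue}) - SUM({cost})",
    measures={"revenue": "net_revenue", "cost": "cogs"},
    dimensions={"site": "site_name", "channel": "sales_channel"},
)


def compile_query(metric: MetricContract, filters: dict, engine: str):
    """Render a concept-level request as engine-specific SQL plus bind values."""
    unknown = set(filters) - set(metric.dimensions)
    if unknown:
        # A consumer (human or agent) cannot reach past the contract.
        raise ValueError(f"not part of the {metric.name} contract: {sorted(unknown)}")
    q = QUOTE[engine].format
    select = metric.expression.format(**{k: q(v) for k, v in metric.measures.items()})
    sql = f"SELECT {select} AS {q(metric.name)} FROM {q(metric.table)}"
    names = sorted(filters)
    if names:
        sql += " WHERE " + " AND ".join(f"{q(metric.dimensions[n])} = ?" for n in names)
    return sql, [filters[n] for n in names]


request = {"site": "Dubai", "channel": "dine-in"}
pg_sql, pg_params = compile_query(MARGIN, request, "postgres")
bq_sql, bq_params = compile_query(MARGIN, request, "bigquery")
```

The consumer only ever names `margin`, `site`, and `channel`. Swapping the engine changes the rendered string, not the request, and an unknown concept is rejected before any SQL is produced.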
Evidence from the research community

LLMs cannot plan alone. That isn't opinion; it's a benchmark result.

Work presented at ICML 2024 on the LLM-Modulo framework, together with PlanBench results, makes the case formally. Frontier models, on their own, solve at most about 63% of standard Blocksworld instances and under 5% once the predicates are renamed. The fix is not a bigger model. The fix is a framework where the LLM generates, and an external, formal, domain-aware layer of critics verifies, reformats, and back-prompts until a valid solution is reached. That external layer is exactly what a data product looks like.

Source · Kambhampati et al., LLM-Modulo, ICML 2024 · PlanBench, as of 8/27/2024

PlanBench, frontier-model results

Percent of instances solved · lower is worse
600 instances per cell (500 for Gemini Pro on Mystery Blocksworld)

| Domain | Shot | Claude 3.5 Sonnet | Claude 3 Opus | GPT-4o | GPT-4 | GPT-4 Turbo | LLaMA-3.1 405B | LLaMA-3 70B | Gemini Pro |
|---|---|---|---|---|---|---|---|---|---|
| Blocksworld | One-shot | 346/600 (57.6%) | 289/600 (48.1%) | 170/600 (28.3%) | 206/600 (34.3%) | 138/600 (23.0%) | 284/600 (47.3%) | 76/600 (12.6%) | 68/600 (11.3%) |
| Blocksworld | Zero-shot | 329/600 (54.8%) | 356/600 (59.3%) | 213/600 (35.5%) | 210/600 (34.6%) | 241/600 (40.1%) | 376/600 (62.6%) | 205/600 (34.2%) | 3/600 (0.5%) |
| Mystery Blocksworld | One-shot | 19/600 (3.1%) | 8/600 (1.3%) | 5/600 (0.8%) | 26/600 (4.3%) | 5/600 (0.8%) | 21/600 (3.5%) | 15/600 (2.5%) | 2/500 (0.4%) |
| Mystery Blocksworld | Zero-shot | 0/600 (0.0%) | 0/600 (0.0%) | 0/600 (0.0%) | 1/600 (0.2%) | 1/600 (0.2%) | 5/600 (0.8%) | 0/600 (0.0%) | 0/500 (0.0%) |
Read this as ceiling, not floor. On standard Blocksworld, the best result in the table is 62.6% (LLaMA-3.1 405B, zero-shot), and the best one-shot result is 57.6% (Claude 3.5 Sonnet). Rename the predicates so surface-level pattern matching fails (Mystery Blocksworld) and performance collapses to between 0% and 4.3%. No model in this table can plan reliably on its own.
The framework

LLM-Modulo · a principled framework for planning where LLMs play multiple constructive roles.

The LLM generates. The data product verifies. A bank of critics (binary, constructive, style, model-based) reads the plan from a shared blackboard, reaches agreement or sends the plan back for another pass. Successful solutions seed synthetic data that compiles the verified behavior back into the base model via SFT or RL post-training.

(1) Input
Problem specification
Complete, partial, or abstract. Refined in an interaction loop with the end user.
(1a) Human-in-the-loop
Domain expert · model-based critic construction
The expert teaches the system the rules of the domain. These become formal critics, not prompts.
(3) Shared state
Plan Blackboard
Concrete plan, hierarchical plan. The object every role reads and writes.
The engine
(2) LLM response
Large Language Model
Generator. Hypothesis machine. Good at style. Alone, unreliable at planning.
(5) Backprompt on disagreement
Meta controller
Prompt selection, diversification. Decides when to ask the LLM again, and how.
(4) Critic agreement
Valid solution
Emitted only when the critic bank agrees. No agreement, no solution.
(6) + (7) Compile back
Synthetic data → finetune
Valid plan data, style prompts, and interaction data feed SFT or RL post-training on the base model.
The missing layer
Critic bank
Reformatter + Critics
Binary, constructive, style, model-based. Each gives yes/no or structured feedback. The LLM does not get to grade itself.
Critic B1 · Binary
Critic B2 … Bn · Binary
Critic Ci · Constructive
Critic Si · Style
Critic (model-based) · Formal
The loop: Spec → LLM → Blackboard → Reformatter → Critics → Agreement? → Valid solution, else Meta controller → Backprompt → LLM. Successful runs become synthetic data.
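The generate-test loop can be sketched without any real model. In this toy version the "LLM" is a stub that replays a fixed list of candidate plans, the critics check simple ordering constraints, and the meta controller just appends critic feedback to the prompt. Every name here (`llm_modulo`, `precedence_critic`, `stub_llm`, the warehouse steps) is hypothetical. The published framework also includes reformatters, style critics, and the fine-tuning path, none of which are modelled.

```python
from typing import Callable, List, Optional

Plan = List[str]
# A critic returns None when it accepts the plan, or feedback text when it rejects it.
Critic = Callable[[Plan], Optional[str]]


def precedence_critic(before: str, after: str) -> Critic:
    """Build a formal critic: `before` must appear earlier in the plan than `after`."""
    def check(plan: Plan) -> Optional[str]:
        if before not in plan or after not in plan:
            return f"plan must contain both {before!r} and {after!r}"
        if plan.index(before) > plan.index(after):
            return f"{before!r} must come before {after!r}"
        return None
    return check


def llm_modulo(generate, critics, spec: str, max_rounds: int = 5):
    """Generate, test against every critic, back-prompt on disagreement."""
    blackboard = []  # shared record of every candidate plan and its feedback
    prompt = spec
    for round_no in range(max_rounds):
        plan = generate(prompt, round_no)
        feedback = [msg for critic in critics if (msg := critic(plan)) is not None]
        blackboard.append((plan, feedback))
        if not feedback:
            return plan, blackboard  # emitted only when all critics agree
        # Meta controller (simplified): fold the critic feedback into the next prompt.
        prompt = spec + "\nFix these issues:\n" + "\n".join(feedback)
    return None, blackboard  # no agreement, no solution


# Stand-in for a model call: replays fixed candidates, ignores the prompt.
CANDIDATES = [
    ["ship", "pick", "pack"],
    ["pick", "ship", "pack"],
    ["pick", "pack", "ship"],
]


def stub_llm(prompt: str, round_no: int) -> Plan:
    return CANDIDATES[min(round_no, len(CANDIDATES) - 1)]


critics = [precedence_critic("pick", "pack"), precedence_critic("pack", "ship")]
plan, blackboard = llm_modulo(stub_llm, critics, "Order the warehouse steps.")
```

The generator never grades itself: a plan leaves the loop only when the critic list returns no feedback, and the blackboard keeps the rejected attempts alongside the reasons they were rejected.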
Summary, in the author's words

Six things the research community now accepts.

01

LLMs trained just on web corpora have severe limits on planning and reasoning tasks.

The PlanBench table above is the evidence. Scaling has not closed the gap.

Planning ceiling
02

They can still be good arbiters of style, though.

Tone, format, coherence, fluency. LLMs do that part well. Soundness is a different job.

Style ≠ soundness
03

In a Generate-Test cycle, LLMs become robust generators.

LLM-Modulo is a form of test-time scaling: the LLM proposes, a bank of critics disposes. The loop, not the model, does the reasoning.

Test-time scaling
04

The improved behavior can be "compiled" back into the base LLM.

SFT or RL post-training on synthetic data from successful Generate-Test runs incrementally compiles verifier signal into the weights.

Compile verifier → model
05

The resulting Large Reasoning Models (LRMs) still have no correctness guarantees; they're just better generators.

Post-training tightens the distribution. It does not create a proof. If correctness matters, the verifier has to stay in the loop at inference time.

No guarantees
06

The anthropomorphization of intermediate tokens as "reasoning traces" is questionable.

Token streams that look like thinking are still token streams. A trace is not a proof, and a proof is what the enterprise actually needs when money or safety is on the line.

Traces ≠ proofs
Three more papers, three more reinforcements · 2025–2026

The case keeps building. Same conclusion: external verifier, not bigger model.

Three papers from the Kambhampati group at ASU (April 2025 → May 2026) close the door on three remaining counter-arguments: that reasoning traces are explanations, that Chain-of-Thought is internal reasoning, and that Large Reasoning Models eliminate the need for an external verifier. Each finding sharpens the data-product thesis.

07
User trust · false confidence

Reasoning traces are persuasive but not informative. They engender false trust regardless of correctness.

Between-subject user study: participants shown an LLM's chain-of-thought or a post-hoc explanation accepted the AI's answer at the same rate whether it was correct or wrong. Only a contrastive dual explanation, arguments for and against the AI's answer, genuinely improved users' ability to spot incorrect outputs.

Palod V., Biswas U., Kambhampati S. · Arizona State, May 2026 · "Evaluating the False Trust Engendered by LLM Explanations" · arXiv:2605.10930
DataHQ implication →

Asking an agent to "show its reasoning" is theatre when the reasoning is internal tokens. The verifier is the explanation. Every Pilot answer must come with the chart, the SQL query, and the row that proves it, and ideally the counter-case as well. Ungrounded "explanations" actively make decision-makers more confident in wrong answers.

08
Chain-of-Thought is not reasoning

Models trained on corrupted reasoning traces match, and often outperform, models trained on correct traces.

Transformers trained from scratch on formally-verifiable traces still produce invalid traces while arriving at correct answers. More striking: models trained on intermediate steps that bear no relation to the problem perform on par with correct-trace training, and generalise better on out-of-distribution tasks. The semantic content of the trace is largely irrelevant to task performance.

Valmeekam K., Palod V., Stechly K., Gundawar A., Kambhampati S. · TMLR April 2026 · "Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens"
DataHQ implication →

The output of a "reasoning model" is not a proof of reasoning. You cannot use the trace as audit evidence. The only durable verifier is external: a governed semantic model, a domain critic, a query that returns the same answer whether the LLM produced it or not. Which is, again, the data product above the engine.
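One way to picture "a query that returns the same answer whether the LLM produced it or not" is a verifier that ignores the model's trace entirely, re-runs the governed definition, and returns the supporting rows as evidence. The sketch below uses an in-memory SQLite table as a stand-in engine. The schema, the figures, and `verify_claim` are assumptions made for illustration.

```python
import sqlite3

# Stand-in engine with three hypothetical source rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, site TEXT, revenue INTEGER, cost INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(1, "Dubai", 100, 60), (2, "Dubai", 50, 30), (3, "Riyadh", 80, 50)],
)

# The governed definition of margin lives outside the model.
GOVERNED_MARGIN = "SELECT SUM(revenue - cost) FROM sales WHERE site = ?"
EVIDENCE_ROWS = "SELECT id FROM sales WHERE site = ? ORDER BY id"


def verify_claim(conn: sqlite3.Connection, site: str, claimed_margin: int) -> dict:
    """Check a model's claimed number against the governed query; the trace plays no part."""
    governed = conn.execute(GOVERNED_MARGIN, (site,)).fetchone()[0]
    evidence = [row[0] for row in conn.execute(EVIDENCE_ROWS, (site,))]
    return {
        "accepted": governed == claimed_margin,
        "governed_value": governed,
        "evidence_row_ids": evidence,  # the rows that prove (or disprove) the answer
    }


good = verify_claim(conn, "Dubai", 60)  # matches the governed definition
bad = verify_claim(conn, "Dubai", 90)   # the model forgot the site filter
```

A persuasive explanation attached to the wrong figure changes nothing here: the claim of 90 is rejected because the governed query says 60, and the evidence rows show which records that 60 comes from.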

09
The LRM tax

o1 and DeepSeek R1 improve accuracy on planning, at large, unpredictable inference-time cost, and compound systems match them at the same price.

First comprehensive PlanBench-style evaluation of Large Reasoning Models. o1-preview and o1-mini do beat frontier LLMs on accuracy, but the accuracy gain comes with token bills and latency that are an order of magnitude higher and an order of magnitude more variable. At the same effective price point, compound systems, LLM + external verifier in the loop, perform comparably. The verifier remains the cheaper and more reliable lever.

Valmeekam K., Stechly K., Gundawar A., Kambhampati S. · TMLR April 2025 · "A Systematic Evaluation of the Planning and Scheduling Abilities of the Reasoning Model o1"
DataHQ implication →

You don't have to pay the LRM tax to get the answer right. The compound system wins on cost, latency, and reliability. The data product is what makes the compound system possible: it's where the critic bank, the rules, and the semantic contracts live. DataHQ Pilot is the operator-facing instantiation of exactly this architecture.

Why this reinforces the DataHQ thesis

The critics, the meta-controller, the blackboard, the synthetic-data loop: these are the data product.

Every piece of the LLM-Modulo framework that is not the LLM lives in a durable, governed, versioned layer above the engine. Domain rules become model-based critics. The semantic model becomes the contract the plan is checked against. The blackboard is a first-class, addressable object, not a prompt. The three 2025–2026 papers above close the remaining counter-arguments: traces are not explanations, CoT is not reasoning, and LRMs do not eliminate the need for an external verifier. DataHQ is the enterprise instantiation of that layer: Fabric as the blackboard and contracts, Spyke as the critics and reports, and Pilot as the generator-tester loop, with write-back into the ledger once a plan clears the critic bank.

The argument in four parts

Why now, why this layer, why DataHQ.

Four things have changed at once, and they make the data product layer unavoidable for any enterprise that wants AI to work on its own business, not on the internet's.

01

Models are cheap, context is not.

Foundation models are a commodity on a one-year curve. The marginal IQ is already there. What the model does not have, and cannot get from fine-tuning, is your operational context: your SKU hierarchy, your shift patterns, your covenant tests, your definition of "on-time." That lives in a layer above the engine, or it lives nowhere useful.

Implication · invest in context, not in more tokens
02

The warehouse is the wrong chassis.

Warehouses were built to answer questions fast, not to own the meaning of a business. Semantic layers inside a warehouse are read-only opinions, not contracts. They cannot be versioned, cannot be negotiated between producers and consumers, cannot carry SLAs, and cannot survive the next vendor migration. A data product layer can do all four.

Implication · separate the product from the compute
03

Agents need a surface, not a SQL prompt.

The current generation of "text to SQL" tools is a set of pattern matchers on top of raw tables. They work in demos and fail in production because schema alone cannot express SSSG, four-wall EBITDA, or "cash position net of covenant buffer." Agents need an addressable, typed, governed surface to reason over. That is the data product.

Implication · give agents products, not prompts
04

Planning and reporting converge on the same object.

Historically, planning tools and reporting tools were separate stacks, because planning wrote to spreadsheets and reporting read from warehouses. That split is finally collapsing. A data product can be read from, written to, branched, compared, and published. The same object powers the P&L, the forecast, the agent, and the board pack.

Implication · one ledger, for all of Fin-Ops
What a real data product looks like

Six properties, non-negotiable.

A dashboard is not a product. A table is not a product. A semantic view in the BI tool is not a product. These six properties separate a data product from everything adjacent to one.

01

Addressable

Every concept, metric, and dimension has a stable identifier that humans and agents can call by name.

02

Versioned

Changes to definitions are diffable, reviewable, and revertible, like code, because they are code.

03

Governed

Row-level access, PII controls, and approval flows are enforced at the layer, not in each consumer.

04

Observable

SLAs on freshness, completeness, and accuracy, with alerts that fire before a consumer notices.

05

Lineaged

Every number walks back to the source row (the POS ticket, the GRN, the shift punch), end to end.

06

Writable

Plans, forecasts, adjustments, and decisions write back to the same object, fully audited.
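Three of the six properties (addressable, versioned, governed) are easy to show in miniature. The sketch below is an assumption-laden toy, not a product API: `DataProduct`, its fields, the identifier scheme, and the role names are all hypothetical. Observability, lineage, and write-back are left out to keep it short.

```python
import difflib
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    product_id: str                                  # addressable: one stable name
    versions: list = field(default_factory=list)     # versioned: definitions, oldest first
    row_policy: dict = field(default_factory=dict)   # governed: role -> allowed sites

    def publish(self, definition: str) -> int:
        """Append a new definition and return its 1-based version number."""
        self.versions.append(definition)
        return len(self.versions)

    def diff(self, old: int, new: int) -> list:
        """Definitions are text, so a change is reviewable like a code change."""
        return list(difflib.unified_diff(
            self.versions[old - 1].splitlines(),
            self.versions[new - 1].splitlines(),
            f"v{old}", f"v{new}", lineterm="",
        ))

    def revert(self, to: int) -> int:
        """Reverting republishes an old definition; history is never rewritten."""
        return self.publish(self.versions[to - 1])

    def read(self, rows: list, role: str) -> list:
        """Row-level policy is applied here, for every consumer, human or agent."""
        allowed = self.row_policy.get(role, set())
        return [r for r in rows if r["site"] in allowed]


revenue = DataProduct(
    product_id="finops/revenue",
    row_policy={"dubai_manager": {"Dubai"}, "cfo_agent": {"Dubai", "Riyadh"}},
)
revenue.publish("revenue = gross_sales")
revenue.publish("revenue = gross_sales - refunds")
change = revenue.diff(1, 2)
reverted_version = revenue.revert(to=1)

rows = [{"site": "Dubai", "amount": 100}, {"site": "Riyadh", "amount": 80}]
manager_view = revenue.read(rows, "dubai_manager")
stranger_view = revenue.read(rows, "unknown_role")
```

Because the policy sits in `read` rather than in any one dashboard, an agent and a manager with the same role see the same rows, and a role with no policy sees nothing.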

We are building the layer we wish every enterprise had ten years ago.

DataHQ is the data product layer for physical businesses: Fabric, Spyke, and Pilot on one living model, above whichever engine you run. Plug your AI, your BI, your agents into it, and everything downstream gets smarter, safer, and cheaper.