The LRM tax
o1 and DeepSeek R1 improve planning accuracy, but at a large and unpredictable inference-time cost; compound systems match them at the same price.
The first comprehensive PlanBench-style evaluation of Large Reasoning Models (LRMs). o1-preview and o1-mini do beat frontier LLMs on accuracy, but the gain comes with token bills and latency that are an order of magnitude higher and an order of magnitude more variable. At the same effective price point, compound systems (an LLM with an external verifier in the loop) perform comparably. The verifier remains the cheaper and more reliable lever.
Valmeekam K., Stechly K., Gundawar A., Kambhampati S. · TMLR April 2025 · "A Systematic Evaluation of the Planning and Scheduling Abilities of the Reasoning Model o1"
DataHQ implication →
You don't have to pay the LRM tax to get the answer right. The compound system wins on cost, latency, and reliability. The data product is what makes the compound system possible: it is where the critic bank, the rules, and the semantic contracts live. DataHQ Pilot is the operator-facing instantiation of exactly this architecture.
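To make "LLM + external verifier in the loop" concrete, here is a minimal sketch of a propose-and-check loop. It is illustrative only and is not the paper's or DataHQ's implementation: `solve_with_verifier`, `stub_propose`, and `must_pick_up_first` are hypothetical names, and the proposer is a deterministic stand-in for a real model call.

```python
from typing import Callable, List, Optional, Tuple

Plan = List[str]
# A critic returns None when the plan passes, otherwise a short error message.
Critic = Callable[[Plan], Optional[str]]


def solve_with_verifier(
    propose: Callable[[str, List[str]], Plan],
    critics: List[Critic],
    task: str,
    max_rounds: int = 5,
) -> Tuple[Optional[Plan], int]:
    """Ask the proposer for a plan, check it, and feed errors back until it passes."""
    feedback: List[str] = []
    for round_no in range(1, max_rounds + 1):
        plan = propose(task, feedback)
        # Collect the message from every critic that rejects the plan.
        errors = [msg for critic in critics if (msg := critic(plan)) is not None]
        if not errors:
            return plan, round_no  # verified plan and the number of rounds used
        feedback = errors
    return None, max_rounds  # budget exhausted without a verified plan


def stub_propose(task: str, feedback: List[str]) -> Plan:
    # Hypothetical stand-in for an LLM call: it repairs its plan once it sees feedback.
    if feedback:
        return ["pick up A", "stack A on B"]
    return ["stack A on B"]


def must_pick_up_first(plan: Plan) -> Optional[str]:
    # Hypothetical rule from a "critic bank": a plan has to start with a pick-up step.
    if plan and plan[0].startswith("pick up"):
        return None
    return "plan must start by picking up a block"


plan, rounds = solve_with_verifier(stub_propose, [must_pick_up_first], "stack A on B")
```

In this sketch the critics are plain functions, so adding a rule means appending to the list, and `max_rounds` bounds the number of model calls per task up front.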