Robotics has spent three years chasing scale. But could the next era be won by whoever can mathematically prove which data is meaningfully useful and which is not?
The reliability layer every robotics hardware player needs is one of measured, calculated self-improvement. It's what will enable the over-the-air software updates that commoditize hardware, and it's what could let a startup win in an industry many believe will be dominated by incumbents.
As you mentioned on our run, the future isn't here yet, but it's coming. My hypothesis is that your five research papers form one loop. Read in sequence, they describe a single continuous-improvement system for deployed robot fleets.
Sutton's Bitter Lesson says the biggest AI wins come from scaling compute and data, not from clever architectures. Here we try to explore whether that's truly the case.
The Bitter Lesson assumes data is fungible, and for text it arguably is. An extra billion tokens of Reddit is roughly as useful as an extra billion tokens of news. The challenge is that robot data may not be. A robotic demonstration is visual depth, joint torque, tactile resistance, and spatial geometry collapsed into a trajectory with real physical consequences. Some demonstrations teach skill, while others arguably teach nothing. And, perhaps most problematically, some actively degrade the policy. Right now, nobody in the field can tell you which is which with confidence, and scaling the dataset does not obviously resolve the ambiguity.
From the summary of a recent workshop hosted for the leading physical AI labs:
> Data collection has completely outpaced our scientific understanding of what that data is doing.
And then, in April 2026, Physical Intelligence released π0.7: a generalist policy that, trained on enough data across enough embodiments, showed emergent compositional generalization. Folding laundry on a bimanual UR5e it had never seen laundry folded on. Making coffee. Assembling boxes. The kind of transfer the field spent years saying required curated data, task-specific fine-tuning, or new architectures. π0.7 got there on scale.
So the honest version of this section is a question. If scale alone is producing generalization we did not expect, does causal data quality still matter? Or does it matter more, because the same scaling laws that produce π0.7's successes also absorb and amplify whatever silent failures live inside the training set?
The reliability story is what this thesis is about. The question the rest of this essay tries to answer is whether that story is still load-bearing in a world where the generalization story keeps getting better faster than anyone expected.
Three assumptions are shared across the major players:

- Assumption one: scale wins. More trajectories equal more capability.
- Assumption two: reliability is a post-hoc safety layer, bolted on after training.
- Assumption three: hardware is the bottleneck.
Most of the training data the robotics industry is collecting is not teaching robots anything. Some of it is actively degrading them. Scaling could make this worse.
It is the thesis underneath CUPID, underneath Sentinel, underneath FAKTUAL. The old world is not wrong that scale helps. It is wrong that scale is enough. Somewhere inside every 76,000-trajectory dataset, there are a few hundred demonstrations doing the real teaching and a few hundred quietly making the policy worse. Nobody can tell you which is which.
Until now.
The argument that reliability is a business problem, not an academic one, is easiest to see where a customer has already paid for the old world and discovered it does not work.
For example, Monarch Tractor raised hundreds of millions on the promise of a production-ready autonomous electric tractor. Forbes put the MKV on a cover. Dealers wrote real checks on the strength of the demo.
Despite the accolades, serious problems emerged quickly. Idaho dealer Burke's Tractor invested over $773,000 in 10 MKV tractors, only to discover the machines couldn't deliver on Monarch's promises. The autonomous features were severely limited and the tractors couldn't function indoors: critical flaws Monarch's own sales team eventually admitted in writing.
This is not a Monarch story. It is the shape of every autonomous-systems bet that shipped the promise before it shipped the mechanism. The tractor generalized badly to the barn, and there was no system to answer the question the dealer was actually asking: when the autonomy does not work, what happens next?
Scale alone cannot answer that. A larger teleoperation dataset does not tell you why the policy failed indoors, or which new data would fix it. Only a layer that attributes failure to specific training demonstrations, verifies new data before it ships, and monitors the fleet in deployment can turn an unpaid bill into a solvable one.
Companies that raised on autonomy and could not deliver are the market for this layer, not the exception to it.
A deployed policy is failing on reflective surfaces. CUPID uses influence functions to estimate the causal impact of each individual training demonstration on the deployed behavior. Click Run CUPID analysis below to surface the demonstrations causally responsible, and remove them.
The demonstration is illustrative. CUPID: Curating Data Your Robot Loves, RSS RoboEval 2025.
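The mechanics behind this kind of analysis can be sketched in a few lines. The toy below uses TracIn-style self-influence (the squared gradient norm of each example's loss) as a crude stand-in for CUPID's influence-function estimator; the linear "policy", the dataset, and the corruption are all hypothetical, chosen only to make the idea concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 200 demonstrations for a linear "policy", 20 of them corrupted.
# An illustrative stand-in, not CUPID's actual setup.
N, D = 200, 8
X = rng.normal(size=(N, D))                  # demo observations
theta_true = rng.normal(size=D)
y = X @ theta_true                           # demo actions (clean)
bad = rng.choice(N, size=20, replace=False)
y[bad] += rng.normal(scale=5.0, size=20)     # corrupted demonstrations

# Fit the policy on everything, corrupted demos included.
theta = np.linalg.lstsq(X, y, rcond=None)[0]

# TracIn-style self-influence for squared error: ||grad loss_i||^2, where
# grad_i = 2 * (x_i . theta - y_i) * x_i. Demonstrations the policy has to
# strain to fit score high; they become the curation suspects.
resid = X @ theta - y
self_influence = 4.0 * resid**2 * (X**2).sum(axis=1)

suspects = np.argsort(self_influence)[-20:]  # top-20 most suspicious demos
```

On toy data like this, ranking by self-influence surfaces most of the injected corruption. CUPID's estimator goes further by conditioning on the deployed policy's observed failures, not on training loss alone.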
Every deployed robot fleet runs a runtime monitor that flags anomalies before they cascade. Every flagged failure is causally traced to specific training examples within hours. Every manufacturer ships weekly policy updates, and the entire global fleet improves in lockstep. Hardware continues to commoditize. The reliability software layer is where the margin lives, and where the platform lock-in accrues.
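The loop described above, monitor, attribute, curate, redeploy, can be sketched as a minimal interface. Everything here is hypothetical scaffolding, not any vendor's API; the monitor and the attribution tracer are passed in as plain callables:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class FleetLoop:
    """Hypothetical closed loop: flag a rollout, blame demos, curate the set."""
    monitor: Callable[[dict], bool]          # flags an anomalous rollout
    attribute: Callable[[dict, list], list]  # failure -> suspect demo indices
    dataset: list = field(default_factory=list)

    def step(self, rollout: dict) -> list:
        """Process one rollout; return the demo indices removed, if any."""
        if not self.monitor(rollout):
            return []
        suspects = self.attribute(rollout, self.dataset)
        drop = set(suspects)
        self.dataset = [d for i, d in enumerate(self.dataset) if i not in drop]
        return suspects

# Example: flag a rollout and drop the attributed demonstrations.
loop = FleetLoop(
    monitor=lambda r: r["anomaly_score"] > 0.9,
    attribute=lambda r, ds: [2, 3],          # pretend the tracer blames 2 and 3
    dataset=list(range(10)),
)
removed = loop.step({"anomaly_score": 0.97})
```

The weekly-update cadence then falls out of running `step` over the fleet's rollouts and retraining on the curated `dataset`.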
The reliability layer for physical AI is a venture-scale opportunity. It is a B2B software wedge that sits between commoditizing robotics hardware and the fleet operator. The research that defines it already exists, published between 2023 and 2026. The company hasn't been built yet.
In physical AI, the companies Floodgate has backed include Applied Intuition, Cruise, Boom Supersonic, Hadrian, Substrate, and Valgo.
Category design is how Floodgate works with founders. The idea is that a new company wins by defining a new category so sharply that the existing market reorients around it, and the investor's job is to do that work alongside the founder from the earliest days. The playbook is here.
If the thesis above is close to something you are already building, I want to hear where I have it wrong and what I am missing.
Question one
Is this a thesis about humanoids, or about robots in general?
The reliability stack acts on deployed-policy telemetry, which is embodiment-agnostic. Sentinel, CUPID, and FAKTUAL care about trajectories and their causal effect on policy behavior, not about whether the arm is attached to a torso, a wheeled base, or a fixed cell. The real variables are fleet size, failure cost, and whether the deployer is already a software company.
In the near term, that points away from humanoids. Humanoid fleets in 2026 are still measured in the low hundreds, and the two most visible builders, Tesla and Figure, are the ones least likely to outsource this layer. The early wedge is in mobile manipulators and industrial arms being deployed at fleet scale by operators who are buying robots, not building them. Warehouse manipulation, pick and pack, material handling, light assembly.
Question two
If this plays out like autonomy, is the customer Boston Dynamics and Agility, not Tesla and Figure?
Our hypothesis: yes, and the Applied Intuition analogy is the right one. Applied Intuition did not win by selling to Waymo or Tesla. It won by selling simulation and validation infrastructure to the legacy OEMs (Ford, GM, Stellantis, Toyota), who had world-class mechanical engineering but no software culture to build a modern validation stack themselves. The winning shape was a software company sitting between commoditizing hardware and the operator.
Physical AI rhymes. Tesla and Figure are software-first. They will build the reliability layer in-house, the same way Waymo built its own simulator. The customers for an independent reliability layer are the hardware-native incumbents and the operators who deploy mixed fleets. Boston Dynamics, a Hyundai subsidiary with a hardware heritage. Agility Robotics, whose Digit deployments at GXO and Amazon make fleet telemetry a first-order problem. ABB, FANUC, Kuka, the industrial arm makers now shipping learned-policy SKUs. Apptronik, partnered with Mercedes. And underneath all of them, the 3PLs and contract manufacturers who operate mixed-vendor fleets and cannot build a data-curation team in Palo Alto.