
● VERIFIED INTELLIGENCE · JUNE 19, 2026 · AISB INTEROP SERIES

# Why Your Building AI Shouldn't Grade Its Own Homework

A frontier autonomous-coding architecture just rediscovered a principle commercial real estate has relied on for decades: independent verification. For owners deploying AI across a portfolio, the parallel is worth sitting with.

This month, the AI engineering community has been studying an architecture called "Missions," presented by Luke Alvoeiro of Factory. Its headline idea has a name that should make every building owner pause: "self-deception testing."

The problem it describes is simple. When an autonomous coding agent writes a piece of software and writes the tests that are supposed to prove the software works, "all tests pass" tells you very little. The tests only encode the agent's own understanding of the task — which may be wrong. Implementation and validation drift together, in lockstep, and the green checkmark becomes theater.

Factory's reported fix is to separate the two in time and in ownership: a validation contract — the definition of "correct" — is written during planning, before any code exists, and it is checked by a role that did not build the thing. (Factory reports multi-day autonomous runs at high test coverage on this design; those figures are the vendor's own, not independently verified — but the architectural principle is what matters here.)
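The separation described above can be sketched in a few lines. This is a minimal illustration, not Factory's implementation: the `ValidationContract` class, the `validate` function, and the fingerprinting step are all hypothetical names and design choices made for this example. The acceptance criteria are written and fingerprinted at planning time, and the validating role refuses to run if the contract has changed since.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationContract:
    """Acceptance criteria fixed at planning time (hypothetical structure)."""
    task: str
    criteria: tuple  # ((args...), expected) pairs, written before any code exists

    def fingerprint(self) -> str:
        # Hash the contract so later edits to the criteria are detectable.
        payload = json.dumps({"task": self.task, "criteria": self.criteria}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def validate(contract, locked_fingerprint, implementation):
    """Run by a role that did not write `implementation`. Returns the failures."""
    if contract.fingerprint() != locked_fingerprint:
        raise ValueError("contract changed after planning; refusing to validate")
    failures = []
    for args, expected in contract.criteria:
        actual = implementation(*args)
        if actual != expected:
            failures.append((args, expected, actual))
    return failures


# Planning time: the contract is written and its fingerprint recorded.
contract = ValidationContract(
    task="add two integers",
    criteria=(((2, 3), 5), ((-1, 1), 0)),
)
locked = contract.fingerprint()


# Build time: the implementation is written later, by a different role.
def builder_add(a, b):
    return a + b


print(validate(contract, locked, builder_add))  # → []
```

The builder never authors or edits the criteria, so a passing result says something about the task rather than about the builder's own reading of it.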

If that principle sounds familiar to anyone who has run an energy retrofit, it should. Commercial real estate solved this problem a long time ago, and the solution has a name: measurement and verification.

## The same failure, in a building

Picture a PropTech vendor that installs an AI optimization layer on your HVAC plant and then reports the savings — using its own baseline, its own model, and its own adjustments. The number it produces is not a measurement. It is the vendor's assumptions, run forward and printed on a dashboard. The system is grading its own homework.

This is not a hypothetical failure mode. It is one of the most persistent sources of disappointment in building performance: savings that are real in the model and absent on the utility bill. The discipline that exists to prevent it is IPMVP — the International Performance Measurement and Verification Protocol, with ASHRAE Guideline 14 as its measurement companion.

The heart of IPMVP closely mirrors Factory's validation contract:

- **The M&V plan is written before the measure is installed.** The baseline period, the measurement boundary, the chosen option (A, B, C, or D), and the adjustment methodology are all fixed in advance, before anyone has an incentive to make the numbers look good.
- **Correctness is defined independently of the implementation.** Savings are baseline energy minus reporting-period energy, adjusted for weather and occupancy on rules agreed up front, not on a model the installer is free to tune after the fact.
- **The strongest programs put verification in independent hands:** a party that did not install the measure. The builder is not the sole judge of the build.
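The savings rule in the second point can be made concrete with a small worked example. The sketch below is illustrative only, not an IPMVP-compliant calculation: it assumes a single-variable linear baseline (monthly energy against heating degree days), it ignores occupancy, and the function names and all numbers are invented for the example. The point it shows is the order of operations. The baseline model is fitted once on pre-installation data and frozen, and savings are the adjusted baseline minus measured reporting-period energy.

```python
def fit_baseline(degree_days, energy_kwh):
    """Ordinary least squares fit of energy = intercept + slope * degree_days."""
    n = len(degree_days)
    mean_x = sum(degree_days) / n
    mean_y = sum(energy_kwh) / n
    sxx = sum((x - mean_x) ** 2 for x in degree_days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(degree_days, energy_kwh))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def avoided_energy(model, reporting_degree_days, reporting_energy_kwh):
    """Adjusted baseline (frozen model at reporting-period weather) minus measured use."""
    intercept, slope = model
    adjusted_baseline = sum(intercept + slope * dd for dd in reporting_degree_days)
    return adjusted_baseline - sum(reporting_energy_kwh)


# Baseline period, measured BEFORE the retrofit (made-up monthly figures).
baseline_dd = [100, 200, 300, 400]
baseline_kwh = [12000, 14000, 16000, 18000]
model = fit_baseline(baseline_dd, baseline_kwh)  # frozen from here on

# Reporting period, measured AFTER the retrofit (made-up monthly figures).
reporting_dd = [150, 350]
reporting_kwh = [11500, 15000]

print(avoided_energy(model, reporting_dd, reporting_kwh))  # → 3500.0
```

Because `model` is fixed before the reporting period begins, the only inputs that can move the result are metered energy and recorded weather, both of which a third party can check.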

Two communities — one writing autonomous software, one tuning chiller plants — arrived at the same three rules without talking to each other. That convergence is the signal.

## What to demand when you deploy AI in your portfolio

As AI moves from pilots into live operation across building stock — fault detection and diagnostics, supervisory HVAC control, digital twins, autonomous tenant services — the "self-deception" risk moves with it. A few questions separate a system you can trust from one that is grading its own homework:

- **Was the success criterion written before the system went in?** If the definition of "working" was authored after deployment, it was authored to match what the system already does.
- **Who verifies the result, and did they build it?** If the answer is "the vendor's own platform reports it," you have a self-reported claim, not a measurement.
- **Is the baseline frozen and the adjustment methodology agreed in advance?** A baseline the vendor can revise is a baseline that can always be made to show savings.
- **Can the claim be reproduced from independent data (your BMS trends, your utility meters) by someone other than the vendor?** If not, the figure rests on the vendor's word alone.
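For teams that track vendor claims in a structured way, the four questions above can be turned into a simple screening check. This is a hedged sketch rather than an established tool: the `SavingsClaim` fields, the `audit` function, and the 10% reproduction tolerance are all assumptions chosen for illustration, and an owner would set them to match the contract actually signed.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SavingsClaim:
    """One vendor savings claim plus the facts needed to screen it (hypothetical)."""
    criterion_written: date
    system_installed: date
    installer: str
    verifier: str
    baseline_hash_at_signing: str   # fingerprint of baseline + adjustment rules
    baseline_hash_now: str
    vendor_reported_kwh: float
    independent_recomputed_kwh: float  # recomputed from the owner's own meters


def audit(claim, tolerance=0.10):
    """Return the questions the claim fails; an empty list means none failed."""
    problems = []
    if claim.criterion_written >= claim.system_installed:
        problems.append("success criterion was not written before installation")
    if claim.verifier == claim.installer:
        problems.append("verifier is the party that built the system")
    if claim.baseline_hash_now != claim.baseline_hash_at_signing:
        problems.append("baseline or adjustment rules changed after signing")
    gap = abs(claim.vendor_reported_kwh - claim.independent_recomputed_kwh)
    if gap > tolerance * abs(claim.independent_recomputed_kwh):
        problems.append("vendor figure not reproduced from independent data")
    return problems


claim = SavingsClaim(
    criterion_written=date(2026, 1, 10),
    system_installed=date(2026, 3, 1),
    installer="VendorCo",
    verifier="VendorCo",            # the vendor's own platform reports the result
    baseline_hash_at_signing="abc123",
    baseline_hash_now="abc123",
    vendor_reported_kwh=5000.0,
    independent_recomputed_kwh=3500.0,
)

for problem in audit(claim):
    print(problem)
```

In this made-up case the criterion predates installation and the baseline is unchanged, but the claim is still flagged twice: the vendor verified its own work, and the reported figure is well outside the tolerance of the independent recomputation.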

None of this is anti-AI. It is the opposite: it is the discipline that lets you deploy AI at scale and trust what it tells you. The systems that run longest without a human watching are not the ones with the fewest checks. They are the ones whose builders were honest enough to define "correct" before they started, and humble enough to let someone else confirm it.

A building, like a piece of software, should not be left to vouch for its own success.


AI Smart Buildings tracks the governance, measurement, and trust questions that decide whether building AI delivers. This analysis references the publicly presented "Missions" architecture (Luke Alvoeiro, Factory); vendor-reported performance figures are noted as such. It is general information on measurement and verification practice, not engineering, legal, or investment advice.

Research compiled by the AISB agent fleet from primary sources; every claim verified against the public record. Cost figures are labeled industry estimates. Full source list available on request — hello@ai-smart-buildings.com.
