We Benchmarked LLMs on the Ugliest Problem in Building Data

AISB · Agent-Native Interoperability Series

● VERIFIED INTELLIGENCE · JUNE 13, 2026 · AISB INTEROP SERIES

Somewhere in a hospital, a CO2 sensor reports under the name quality_zone_1. Three floors up, an identical sensor installed by a different subcontractor reports as AIR_QUAL_ZN_01. Higher still, a third crew named theirs AQ-1. Same device, same function, one building, three names.

Every "AI for buildings" pitch you have ever sat through rests on a quiet assumption: that someone has already translated tens of thousands of names like these into something a machine can reason about. That translation step — raw BMS point names into a standard vocabulary like Brick Schema — is called semantic mapping. It is unglamorous, it sits in front of every analytics, FDD, and AI-HVAC deployment, and nearly every vendor claims to automate it. Almost nobody publishes a number.

So we built a benchmark and published our own number first.

What we built

The AISB Semantic Mapping Benchmark v0.1 is 265 BMS point names across 12 sets. It covers seven vendor naming styles (JCI Metasys, Siemens Apogee, Honeywell, ALC WebCTRL, Tridium Niagara, Trane, and the loose "generic integrator" naming you actually inherit in the field), five vintage bands from 1998 to 2024+, and four building types: office, retail, hospital, and data center. The task: map each raw point name to its correct Brick Schema 1.3 class.

One methodology disclosure up front, because it matters more than the headline: this is a closed-vocabulary test. The mapper is given the candidate classes (roughly 40, spanning the sensor, setpoint, command, and status branches) and chooses among them. That makes the task meaningfully easier than open-world mapping, where a tool must also decide which of Brick's hundreds of classes are even in play. Read every number below with that handicap in mind.

The scorer reports two accuracy metrics. Exact match is what it sounds like. Partial credit also awards 0.5 when the prediction is a direct parent or child of the truth in the Brick hierarchy: predicting Air_Temperature_Sensor when the answer is Supply_Air_Temperature_Sensor is a near miss, not a coin flip. Coverage, the share of points that received any prediction, is reported separately. About 8% of points carry an ambiguity flag where the name alone genuinely cannot decide the class.
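The scoring rules above can be sketched in a few lines. This is a minimal illustration, not the published scorer: the PARENT table is a hand-written slice of the hierarchy, the function names are ours, and we assume an exact match counts as 1.0 toward the partial-credit score and that an abstention (None) scores zero.

```python
# Hand-written, illustrative slice of the class hierarchy: child -> direct parent.
# The real scorer would load the full Brick 1.3 hierarchy instead.
PARENT = {
    "Supply_Air_Temperature_Sensor": "Air_Temperature_Sensor",
    "Return_Air_Temperature_Sensor": "Air_Temperature_Sensor",
    "Air_Temperature_Sensor": "Temperature_Sensor",
}


def point_score(pred, truth):
    """1.0 for an exact match, 0.5 for a direct parent or child, else 0.0."""
    if pred is None:
        return 0.0  # abstention: no credit (assumption)
    if pred == truth:
        return 1.0
    if PARENT.get(pred) == truth or PARENT.get(truth) == pred:
        return 0.5
    return 0.0


def score(predictions, truths):
    """Return exact-match, partial-credit, and coverage rates over all points."""
    n = len(truths)
    exact = sum(1 for p, t in zip(predictions, truths) if p == t)
    partial = sum(point_score(p, t) for p, t in zip(predictions, truths))
    answered = sum(1 for p in predictions if p is not None)
    return {"exact": exact / n, "partial": partial / n, "coverage": answered / n}
```

Note that a sibling class (Return_Air_Temperature_Sensor predicted for Supply_Air_Temperature_Sensor) earns nothing under this rule: only a direct parent or child counts as a near miss.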

We ran two model tiers with raw prompting — no retrieval, no fine-tuning, no rules engine. Just the point name, minimal context (equipment hint, building type), and the candidate list.
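For readers who want a concrete picture of "raw prompting" in a closed-vocabulary setting, here is a rough sketch. The prompt wording, the function names, and the rule that an off-list reply counts as an abstention are all assumptions for illustration; the benchmark's actual prompt is not reproduced here.

```python
def build_prompt(point_name, equipment_hint, building_type, candidates):
    """Assemble a closed-vocabulary prompt: point name, minimal context, candidate list."""
    lines = [
        "Map the BMS point name to exactly one Brick class from the list.",
        f"Point name: {point_name}",
        f"Equipment hint: {equipment_hint}",
        f"Building type: {building_type}",
        "Candidate classes:",
    ]
    lines += [f"- {c}" for c in candidates]
    lines.append("Answer with one class name from the list and nothing else.")
    return "\n".join(lines)


def parse_answer(reply, candidates):
    """Return the class the reply names, or None if it is not a listed candidate."""
    answer = reply.strip()
    return answer if answer in candidates else None


# Hypothetical candidate slice; the benchmark supplies roughly 40 classes.
candidates = ["CO2_Sensor", "Supply_Air_Temperature_Sensor", "Zone_Air_Temperature_Setpoint"]
prompt = build_prompt("AIR_QUAL_ZN_01", "AHU", "hospital", candidates)
```

Treating an off-list reply as None rather than guessing the nearest class keeps coverage honest: a tool that answers outside the vocabulary has abstained, not mapped.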

The numbers

Frontier tier: 80.8% exact, 81.5% partial credit. That is 214 of 265 points mapped to the exactly correct Brick class.

Fast tier: 48.3% exact, 50.9% partial credit. 128 of 265.

Both runs had 100% coverage: every point got a prediction, so every error is a real error, not an abstention. And notice how close partial sits to exact in both tiers. When these models are wrong, they are mostly wrong about the category, not off by one level in the hierarchy. The 0.7-point gap on the frontier tier means hierarchy-aware leniency barely helps. Wrong is wrong.
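The gap between the two metrics can be turned back into an approximate count of near misses. The sketch below back-derives those counts from the rounded percentages above, so treat the results as estimates; it assumes each near miss contributes exactly 0.5 and that every point was answered, as reported.

```python
def implied_near_misses(n_points, exact_pct, partial_pct, near_credit=0.5):
    """Estimate (exact hits, total errors, near misses) from published percentages."""
    exact = round(n_points * exact_pct / 100)
    partial_total = n_points * partial_pct / 100
    # Whatever the partial score adds beyond exact hits came from 0.5-credit near misses.
    near = round((partial_total - exact) / near_credit)
    errors = n_points - exact
    return exact, errors, near


print(implied_near_misses(265, 80.8, 81.5))  # (214, 51, 4)
print(implied_near_misses(265, 48.3, 50.9))  # (128, 137, 14)
```

Read that way, roughly 4 of the frontier tier's 51 errors and roughly 14 of the fast tier's 137 errors landed on a direct parent or child of the right class. The rest missed the branch entirely.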

A frontier model, prompted carefully and given the answer choices, gets four out of five building points right and one out of five wrong. A fast-tier model — the kind a vendor quietly uses to keep inference costs down on a 50,000-point portfolio — gets roughly every other point wrong. Hold both of those at once.

Where it breaks is more useful than the average

The per-set table is where this benchmark earns its keep.

S11 is the chaos set — the hospital described in the opening paragraph, with three subcontractor naming dialects for the same nine point types across different floor groups. Frontier scored 88.9% here (24 of 27). The fast tier scored 18.5% (5 of 27). That is the widest tier gap in the entire benchmark, and it lands exactly where real buildings are ugliest. Fast models handled tidy single-vendor conventions tolerably and fell apart when the same building speaks three languages at once. If your mapping vendor runs a small model for cost reasons, this is the row to ask them about.

S09 is the frontier tier's worst set: 43.8%. It is ALC WebCTRL electrical metering in a 2019–2023 office, with panel-level shorthand like PNL-L01A-KVAR and MTR-GAS-01-VOL. The failure mode is legible: electrical points encode power, energy, reactive power, voltage, current, and power factor in short suffixes that differ by one or two characters (KW vs KWH vs KVAR). A model that shrugs off a missing vowel in SAT cannot afford to shrug here, because a single character is the class boundary.
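To make the "single character is the class boundary" point concrete, here is a small sketch of strict suffix handling. The lookup table is hypothetical and deliberately tiny, the quantity labels are plain-language descriptions rather than Brick class names, and the point names other than PNL-L01A-KVAR and MTR-GAS-01-VOL are invented variants.

```python
# Hypothetical suffix table covering only the three suffixes named above.
# Values are plain-language quantities, not Brick class names.
SUFFIX_QUANTITY = {
    "KW": "active power",
    "KWH": "energy",
    "KVAR": "reactive power",
}


def metering_quantity(point_name):
    """Match the final token exactly; return None when the suffix is unknown."""
    suffix = point_name.upper().replace("_", "-").split("-")[-1]
    return SUFFIX_QUANTITY.get(suffix)


for name in ["PNL-L01A-KW", "PNL-L01A-KWH", "PNL-L01A-KVAR", "MTR-GAS-01-VOL"]:
    print(name, "->", metering_quantity(name))
```

The design choice is the exact match on the whole final token. A prefix or fuzzy match would fold KW and KWH together, which is the same error the models make, and an unknown suffix such as VOL returns None instead of a guess.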

S10 is the second-worst: 50.0% — a 2024+ generic-integrator retail site. Newest conventions, loosest discipline, and the least representation in the published integration guides these models learned from.

That last point shows up as a gradient. Frontier accuracy falls monotonically by vintage, from 92.0% on 1998–2005 conventions down to 75.6% on 2024+. The single easiest set in the corpus was S01, 1998-era JCI Metasys, at 96.7%. The likely explanation: the oldest naming conventions have had decades of BACnet integration guides, commissioning documents, and forum threads to seep into training data, while the newest naming is the least documented. The models are best at the buildings your most senior controls tech is best at, and weakest right where the industry is heading.

What the literature says, and what we are not claiming

Peer-reviewed RAG-augmented mapping pipelines — BMS-RAG and similar systems published via ScienceDirect in 2025/2026 — report 85–100% F1 on their real-world datasets. We want to be unambiguous: our numbers are not comparable to theirs and we are not claiming to match or beat anything. Different corpora, different conditions; reading 80.8% against their figures would be exactly the apples-to-oranges move this benchmark exists to end.

What the comparison does legitimately tell you is architectural. Those papers report large gains over their own raw-prompting baselines, and our raw-prompting result is consistent with the baseline side of that delta. Raw model fluency gets you to roughly 80% on a closed-vocabulary task; retrieval, curated vendor context, and feedback loops are where the remaining gap closes. The model is the floor, not the product.

The paragraph every vendor deck leaves out

Even 81% is not deployable for fault detection without human review. A mismapped point does not fail loudly — it propagates. Tag a return-air sensor as supply-air and your FDD rule attaches to the wrong physical reality: phantom faults that erode operator trust, or real faults that never fire. One wrong label quietly poisons every rule downstream of it.

The realistic field posture — and we frame this as an industry estimate, not a measured result — remains 75–85% automated mapping with 15–25% human review. The useful question for any tool is not "what is your accuracy" but "do you know which points you are unsure about, and do you route those to a human." Our ambiguity-flagged subset exists precisely so tools can be scored on that.
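One way to picture "route the unsure points to a human" is a simple triage step. This is a sketch under stated assumptions: the confidence field, the 0.8 threshold, and the Mapping record are hypothetical, and nothing here describes how any particular tool estimates its own uncertainty.

```python
from dataclasses import dataclass


@dataclass
class Mapping:
    point_name: str
    predicted_class: str
    confidence: float        # hypothetical self-reported score in [0, 1]
    ambiguous: bool = False  # True when the name alone cannot decide the class


def triage(mappings, threshold=0.8):
    """Split mappings into auto-accepted and human-review queues."""
    auto, review = [], []
    for m in mappings:
        if m.ambiguous or m.confidence < threshold:
            review.append(m)
        else:
            auto.append(m)
    return auto, review


batch = [
    Mapping("AIR_QUAL_ZN_01", "CO2_Sensor", 0.95),
    Mapping("AQ-1", "CO2_Sensor", 0.55),
    Mapping("PNL-L01A-KVAR", "Reactive_Power_Sensor", 0.90, ambiguous=True),
]
auto, review = triage(batch)
print(len(auto), "auto-accepted;", len(review), "sent to review")
```

The threshold is the knob that sets the automation-to-review split. A tool can only be scored on this if it exposes which points it sent to review.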

What a synthetic corpus can and cannot tell you

Honesty section, mirroring the methodology doc.

The corpus is synthetic with one real-world-anchored set. Ground truth is true by construction — we wrote the point names and the answer key, so a model agreeing with us is agreeing with our reading of Brick 1.3 and commissioning convention, not with field reality. The corpus was also generated with assistance from the same model family as the frontier tier under test, which could inflate frontier scores through sheer stylistic familiarity. We note it because nobody else will.

Real exports are uglier than anything here: truncated names, duplicated IDs, embedded floor codes, BACnet vendor extensions. Treat these numbers as a ceiling on field performance, not an estimate of it. The corpus is also English-only — APAC legacy systems with Japanese, Mandarin, and Malay tokens are a planned v0.2 extension, and they will be harder.

The anchor is S12: 14 points constructed from the Green Button ESPI schema (NAESB REQ.21), a published open standard rather than our invention. Frontier scored 85.7% there and fast 71.4%, about 23 points above the fast tier's 48.3% overall. That is itself a small argument for standardized naming: even weak models read standards better than they read improvisation.

If you operate buildings and can share anonymized point lists — names and verified classes, no client identifiers — send them to hello@ai-smart-buildings.com. The corpus is CC-BY-4.0, and contributors will be credited in v0.2.

Why we published the number

The corpus and scorer are open. That means any vendor claiming "we map points automatically" can now run 265 points through their pipeline and publish exact-match, partial-credit, and coverage — per set, on the same key, comparable to ours.

This is an invitation, not a gauntlet. We published a raw-prompting baseline with its weak sets showing precisely so there is no posture to defend: 43.8% on electrical metering is in our own table. A vendor whose pipeline beats it should say so, with the same scorecard. But the number is now public, and "proprietary AI mapping engine" without a number attached reads differently than it did last week.

Get the benchmark: corpus, ground truth, and scorer are available now by email at hello@ai-smart-buildings.com while the public repository is being set up. Subscribe to The Intelligent Building Brief for v0.2 — non-English point names, control-program points, and contributed real-world sets.

AISB Semantic Mapping Benchmark v0.1 is published under CC-BY-4.0. Brick Schema and Project Haystack are independent open standards maintained by their respective communities. Full methodology, including corpus construction, scoring definitions, and limitations: available with the corpus.

Research compiled by the AISB agent fleet from primary sources; every claim verified against the public record. Cost figures are labeled industry estimates. Full source list available on request — hello@ai-smart-buildings.com.
