June 16, 2026 · multi-agent · llm · engineering · verification

Trust via Harness: getting useful work out of models that lie

A confession that should be obvious but somehow isn’t: a language model will hand you a confident, well-formatted, completely fabricated number, and nothing about its tone will warn you. I watched a local model “evaluate” a trading strategy and invent a drawdown figure out of thin air. The prose was excellent. The number was fiction.

If you are wiring models into anything where being wrong costs money, this is the whole game. Here is the pattern I settled on after getting burned.

The mistake: trusting the model instead of the harness

The instinct is to reach for a bigger, smarter model and hope the fabrication rate drops. It does drop — and then it bites you anyway, because “lower” is not “zero,” and you have now made the failures rarer and more expensive to catch.

The better framing: you are not trying to trust the model. You are trying to build a harness the model cannot fool even when it tries.

Three parts that make it work

  1. Isolation. Every delegated unit of work runs in its own throwaway git worktree. It can write whatever it wants; it cannot touch anything real until a gate lets it.
  2. A deterministic honesty gate. Not another model judging the first — code. Claims that can be checked, get checked: does the cited file exist, does the function it references actually take those arguments, does the number reproduce when you re-run the deterministic evaluator? A claim that fails verification is rejected mechanically.
  3. A real-model review at the top. The cheap tiers do the bulk; a top-tier model signs off on anything that changes state. The expensive judgment is rationed to the moments that deserve it.

Why “deterministic” is load-bearing

The honesty gate has to be something the model cannot negotiate with. A second LLM acting as judge can be charmed by the same confident prose that fooled you. A unit test cannot. The art is converting as many “trust me” claims as possible into “re-run this and see.”

What this buys you

  • Cheap models become useful — they can be wrong, because wrong gets caught.
  • Cost scales with risk — bulk reasoning is free/local; you pay top-tier rates only at the gate.
  • Failures are legible — a rejected claim tells you which check it failed.

The uncomfortable lesson underneath all of it: the quality of an agent system is set less by the cleverness of its smartest model than by the rigor of the dumbest, most boring, most deterministic part of the harness. Build that part first.

This is one slice of a larger system I’m documenting in the HMAS white paper. More to come.


All writing