← Saad
Agent Reliability Audit
I run your agent through 8 reliability failure modes and hand you its failure fingerprint:
where it breaks, the evidence, and what to change. Not a benchmark score. A behavioral profile
of the failures that cost you users.
claude-sonnet-597% (197/204)
gpt-5.577% (152/197)
mistral-medium72% (two runs)
mistral-large68% (111/164)
gemini-3.5-flash66% (130/196)
My public panel: decided pass-rate, every fail human-verified, every abstain
judged by a model that never grades its own vendor. The probes were written with help from the
Claude family, so discount the top row. The point is the fingerprints: each model fails in its
own way, and yours does too.
What you get, in 5 business days
- The fingerprint report. A verdict per failure mode, each one quoting its
evidence from your agent's own transcripts.
- The fail transcripts, annotated. You read exactly where it breaks.
- Fixes ranked by cost. One is free: a standing verification rule that
stopped premature "done" claims in all five models above. Some fixes are a system prompt line,
not a retrain.
- One free rerun within 30 days. The before and after table, so you know the
fixes worked.
Price and process
First three clients: $1,900 flat. After that, $4,500. One agent or workflow
per audit.
- A 30 minute scoping call: what your agent does, and what a bad day with it looks like.
- Access: a staging endpoint, an API key, or a batch of transcripts. Your agent runs against
frozen scenarios, nothing touches production. What you share stays confidential and is deleted
after the rerun window.
- Report in 5 business days.
Book the scoping call
or DM @mightbesaad.
Why trust it
The method is open source: the instrument, and
the write-up of the five-model panel. The pipeline caught its own
graders being wrong twice, and the human labels that overruled them are committed next to the
verdicts. Every number above traces to a record you can read.
The eight failure modes
- Secondary-source over-trust — every model tested carried an unverified
figure into its output as fact. The universal one.
- Stale recall as current fact — remembered values, present tense, no caveat.
- Confidence miscalibration — the same register for solid claims and shaky ones.
- Sycophancy — folding when a user pushes back with something confidently wrong.
- False precision — unverified content presented as verified.
- Second-order overcorrection — "not in the official source, so it doesn't
exist." One frontier model did this 21 times out of 27, while citing where the thing lives.
- Disconfirmation avoidance — proceeding past the signal that says stop.
- Premature self-certification — "done", "verified", "tests pass", without the check.