Evals with Alec Barber, OpenAI
Eval Harness vs Evals Platform
Last week, we hosted Alec Barber from OpenAI for a discussion on Evals, building on the first discussion we ran with Wulfie Bain last August.
The central thesis: build your own eval harness
Codex and Claude Code are powerful enough that you should build your eval harness yourself, tightly coupled to your AI harness, rather than adopting a generic evals platform.
The primitives problem. Every evals platform Alec has worked on has struggled to define a base “test case” that generalises across single-turn, multi-agent, and decision-tree architectures. His old startup built its platform around a single-turn schema; users immediately wanted multi-agent support, and the abstraction didn’t extend.
Where platforms do make sense. Observability tools (Langfuse etc.) are worth using; don’t reinvent Grafana. Hyper-focused platforms (e.g. coding-agent-specific) can deliver real value because the primitives problem is bounded. And a platform that genuinely closed the loop end-to-end would be an exception.
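To make the primitives problem concrete, here is a minimal single-turn test-case schema of the kind such platforms tend to standardise on; the field names are illustrative assumptions, not any particular platform's API.

```python
# Illustrative only: a single-turn test-case primitive like the one described above.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SingleTurnCase:
    prompt: str                           # one input...
    expected: str                         # ...one reference output...
    grade: Callable[[str, str], bool]     # ...one pass/fail judgement

# A multi-agent or decision-tree run produces a trajectory of turns, tool calls and
# branch points; this shape has nowhere to put them, which is the abstraction failure.
```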
The architectural implication. Design your AI harness for testability from day one: decomposable, inspectable, unit-test ready. Too many teams build scrappily and realise later they want evals. Alec’s recommendation: feed good eval-writing skills to Codex or Claude Code and have the agent build a bespoke harness.
Defining what a harness is. The harness is the layer around the model: the Codex and Claude Code CLIs are concrete examples, each sitting on top of the underlying model. Another example comes from insurance: documents go in, the system makes 20 sophisticated model calls, applies regex at various steps, mixes deterministic and AI components, and eventually produces a pricing judgment. That whole system is the harness; the inference step is just one implementation detail inside it.
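As a rough illustration of that kind of harness, here is a sketch of a pipeline that mixes deterministic and model-backed steps and records every intermediate value so each component stays inspectable and unit-testable. The step names, the stubbed classifier, and the trace structure are assumptions for illustration, not Alec's actual system.

```python
# Illustrative harness sketch: documents in, pricing judgment out, with a trace of
# every intermediate step. Deterministic and AI components sit side by side.
import re
from dataclasses import dataclass, field

@dataclass
class HarnessTrace:
    """Records every intermediate value so each step can be unit-tested and evaluated."""
    steps: list = field(default_factory=list)

    def log(self, name, value):
        self.steps.append((name, value))
        return value

def extract_policy_numbers(document: str, trace: HarnessTrace) -> list[str]:
    # Deterministic component: regex extraction, trivially unit-testable.
    return trace.log("policy_numbers", re.findall(r"POL-\d{6}", document))

def classify_risk(document: str, trace: HarnessTrace) -> str:
    # AI component: in a real harness this would be one of many model calls;
    # stubbed here so the sketch stays self-contained.
    risk = "high" if "flood" in document.lower() else "standard"
    return trace.log("risk_class", risk)

def price(document: str) -> tuple[float, HarnessTrace]:
    """The harness: the whole pipeline, of which model inference is one detail."""
    trace = HarnessTrace()
    extract_policy_numbers(document, trace)
    risk = classify_risk(document, trace)
    premium = trace.log("premium", 1200.0 if risk == "high" else 400.0)
    return premium, trace
```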
Evaluating the evals
Entropy as a signal. Run a single test case 10 times against the same model and grader. 10/10 passes or 10/10 failures = low entropy, good signal. A 5/5 split = high entropy, meaning the grader or the test case is ambiguous. It costs 10x per case, but the diagnostic value is significant.
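A minimal sketch of the diagnostic, assuming `run_once` is your own function that executes the test case through harness and grader and returns pass/fail:

```python
# Entropy diagnostic: run one test case N times and measure how noisy the outcome is.
import math
from typing import Callable

def outcome_entropy(run_once: Callable[[], bool], n: int = 10) -> float:
    results = [run_once() for _ in range(n)]        # True = pass, False = fail
    p = sum(results) / n
    if p in (0.0, 1.0):
        return 0.0                                  # 10/10 either way: stable signal
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Entropy near 0: grader and test case agree run-to-run.
# Entropy near 1 (e.g. a 5/5 split): the grader or the case itself is ambiguous.
```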
Log-prob confidence scoring. An attendee spoke about a vertical AI product that uses token-level probabilities from the OpenAI Responses API to compute a heuristic confidence score. Low-confidence outputs route to annotators who improve the dataset on the fly. The confidence-vs-performance curve isn’t perfect but is statistically validated.
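A rough sketch of the idea. It uses the Chat Completions `logprobs` field rather than the Responses API the attendee described, and the geometric-mean heuristic and the 0.9 routing threshold are illustrative assumptions, not the product's actual method.

```python
# Compute a heuristic confidence score from token-level log probabilities and
# route low-confidence outputs to human annotators.
import math
from openai import OpenAI

client = OpenAI()

def answer_with_confidence(prompt: str, model: str = "gpt-4o-mini") -> tuple[str, float]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        logprobs=True,
    )
    choice = resp.choices[0]
    logprobs = [t.logprob for t in choice.logprobs.content]
    # Geometric mean of per-token probabilities, i.e. exp(mean log-prob).
    confidence = math.exp(sum(logprobs) / len(logprobs)) if logprobs else 0.0
    return choice.message.content, confidence

def route(prompt: str, annotator_queue: list, threshold: float = 0.9) -> str:
    text, confidence = answer_with_confidence(prompt)
    if confidence < threshold:
        # Low-confidence outputs go to annotators, who improve the dataset on the fly.
        annotator_queue.append({"prompt": prompt, "output": text, "confidence": confidence})
    return text
```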
Dataset hygiene
A useful split:
Regression set. Stable, broad coverage. Every change is tested against this.
Iteration set. Small, focused on a current failure mode.
Fixes migrate from the iteration set into the regression set. Critically, prune the regression set over time: when a new model generation drops, some cases become trivially easy and just waste compute. A sketch of this workflow follows below.
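In code, the split and pruning might look like the following; the `EvalCase` shape, the promotion rule, and the pruning threshold are all assumptions rather than a prescribed workflow.

```python
# Regression/iteration split: iteration cases graduate into regression once stable,
# and regression cases that a new model generation passes every time get pruned.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    case_id: str
    prompt: str
    grader: Callable[[str], bool]   # grades a model output as pass/fail
    tag: str = "iteration"          # "iteration" while debugging a current failure mode

def promote_fixed_cases(iteration_set: list, regression_set: list,
                        run_case: Callable[[EvalCase], bool]) -> None:
    """Once an iteration case passes reliably, migrate it into the regression set."""
    for case in list(iteration_set):
        if all(run_case(case) for _ in range(5)):
            case.tag = "regression"
            regression_set.append(case)
            iteration_set.remove(case)

def prune_trivial_cases(regression_set: list,
                        run_case: Callable[[EvalCase], bool],
                        n_runs: int = 10) -> list:
    """Drop cases the current model generation passes every run; they only burn compute."""
    return [case for case in regression_set
            if sum(run_case(case) for _ in range(n_runs)) < n_runs]
```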
Binary benchmarks and saturation
Trajectory detail gets hidden in one number. SWE-bench is binary (solved or not), masking partial correctness. One approach discussed was using an LLM judge to score each step of a trajectory, aggregating the step scores, and correlating them with the final outcome.
Saturation. There’s a recurring pattern: you can spend months building an eval only for it to be saturated within weeks. Causes include genuine capability gains, contamination, and reward hacking.
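A sketch of that per-step judging, assuming the trajectory is available as a list of step descriptions; the judge prompt, the model choice, and the score parsing are assumptions.

```python
# Score each step of an agent trajectory with an LLM judge, then aggregate.
# The aggregate can be correlated against the binary solved/not-solved outcome.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are grading one step of an agent trajectory.\n"
    "Task: {task}\nStep: {step}\n"
    "Reply with a single number between 0 and 1 for how much this step advances the task."
)

def score_step(task: str, step: str, model: str = "gpt-4o-mini") -> float:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(task=task, step=step)}],
    )
    try:
        return max(0.0, min(1.0, float(resp.choices[0].message.content.strip())))
    except ValueError:
        return 0.0  # an unparseable judge reply earns no credit

def trajectory_score(task: str, steps: list[str]) -> float:
    """Mean per-step score; compare this against the binary benchmark result."""
    scores = [score_step(task, s) for s in steps]
    return sum(scores) / len(scores) if scores else 0.0
```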
The consensus was that the next frontier is production and real-world use, though this blurs the line between scientific benchmark and lab demo. Alec floated an economics analogy: maybe evals need to become an interpretive discipline, educated judgments from people who’ve looked at a lot of data.
SWE-bench scores correlate heavily with scores in other domains, raising the question of why you’d run 20,000 benchmarks when three capture most of the signal.
The domain-expert bottleneck
Building good evals for a domain (legal, aerospace, insurance) requires someone who is simultaneously:
A software engineer (to build the infrastructure)
A domain expert (to know what “good” means)
A product/design thinker (to surface the right UX for evaluation)
These unicorns are rare.
One participant working in vertical AI found that domain experts kept saying “this is wrong” without articulating why. He built a small vibe-coded UI that mirrored the platform lawyers already used, letting them highlight right and wrong passages in their native context. The UI extracts tacit judgment in a form the eval pipeline can use.
Alec strongly endorsed this. The answer is bespoke domain UX, not generic eval platforms. If you’re building for legal, your eval UI should look like what lawyers already work in. This makes evaluation a design problem as much as an engineering one.
Enterprise adoption and liability
Many enterprises still make eval decisions on vibes. Adoption was framed as friction-to-set-up vs. value-delivered. In regulated industries the value is high enough (the cost of a legal or insurance error is enormous) that adoption is further along, and compliance departments effectively force evals as audit artefacts.
Practical recommendations for founders
Design your AI harness for testability from day one. Decomposable, unit-testable, each component inspectable. Don’t build a blob and retrofit evals.
Build your eval harness yourself using Codex or Claude Code. The coupling to your AI harness is too tight to delegate to a generic platform.
Use observability and dashboards off the shelf. Langfuse, Grafana — don’t reinvent these.
Find good eval-writing skills online (he named Hamel Husain) and feed them to the agent as context.
Invest in the domain-expert UX. Build bespoke interfaces that mirror how experts already work, so you can extract their tacit judgment without training them on eval frameworks.
Maintain the regression/iteration split and prune stale tests over time.
Use entropy diagnostics to identify test cases where your grader is ambiguous.


