I have spent more than three years on one problem: ML systems that hold up when it actually matters, not just on a benchmark. Every LLM pipeline leans on an automated judge that grades its output and a monitor meant to catch its failures, and most teams simply assume both work. My research measures when you can trust them and builds the ones that do. That is the whole thread, from evaluating the judge to hardening the monitor.
The thesis
Think of an AI pipeline as a factory floor. Models do the work, and standing over them is a quality inspector deciding what ships. Everyone studies the workers. I went and audited the inspector. One project asks whether it can be fooled. The other asks whether hiring it even helped.
I trained scoring models on real human preference data, then put them under systematic adversarial pressure. Two models that looked identical on a clean benchmark came apart by more than 75 points the moment something tried to deceive them. The judge can be fooled, and how badly depends on the architecture.
I built a three-agent monitor (Planner, Critic, Fixer) that catches and corrects bad LLM output at runtime, without retraining. It delivered a statistically significant gain. But the finding I care about is the failure condition: once a model already scored above roughly 65 percent on its own, monitoring made it worse.
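A minimal sketch of how a Planner, Critic, Fixer loop could be wired around an existing model at runtime. Everything here is an assumption made for illustration, not the project's actual code: the function names, the `Verdict` type, the idea that the planner produces a checklist, and the `max_rounds` cap are all hypothetical, and the toy stand-ins at the bottom replace real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    """Hypothetical critic result: pass/fail plus feedback for the fixer."""
    ok: bool
    feedback: str = ""


def run_monitored(
    task: str,
    generate: Callable[[str], str],
    plan: Callable[[str], List[str]],
    critique: Callable[[str, List[str], str], Verdict],
    fix: Callable[[str, str, str], str],
    max_rounds: int = 2,
) -> str:
    """Generate once, then let the critic and fixer iterate up to max_rounds times."""
    checklist = plan(task)        # Planner: decide what the output must satisfy
    output = generate(task)       # Base model, untouched (no retraining)
    for _ in range(max_rounds):
        verdict = critique(task, checklist, output)   # Critic: accept or reject
        if verdict.ok:
            break
        output = fix(task, output, verdict.feedback)  # Fixer: repair using feedback
    return output


# Toy stand-ins so the sketch runs without any model (assumptions, not real agents).
def toy_generate(task: str) -> str:
    return "def add(a, b): return a - b"  # deliberately wrong


def toy_plan(task: str) -> List[str]:
    return ["a + b"]  # the output must contain this expression


def toy_critique(task: str, checklist: List[str], output: str) -> Verdict:
    missing = [item for item in checklist if item not in output]
    return Verdict(ok=not missing, feedback="; ".join(missing))


def toy_fix(task: str, output: str, feedback: str) -> str:
    return output.replace("a - b", "a + b")


result = run_monitored("write add(a, b)", toy_generate, toy_plan, toy_critique, toy_fix)
print(result)
```

The base model is only ever called through `generate`, which is what keeps a layer like this deployable without retraining.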
The finding I keep coming back to
Add a critic and fixer to a weak coding agent and it jumps 32 points. Add the exact same layer to a strong one and it quietly loses ground. Every model I tested landed on the same downward line, with the crossover at a baseline of around 65 percent. Knowing which side of that line your model sits on, before you ship, is the actual job.
Each point is a model. The x-axis is how good the model is on its own; the y-axis is what the monitoring pipeline added or removed.
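One way to estimate a crossover like this is to fit a straight line to (baseline, gain) pairs and solve for the baseline where the predicted gain is zero. The sketch below assumes ordinary least squares, which may not be the fit behind the plot, and the four data points are synthetic placeholders constructed to lie exactly on a line. They are not the measured results.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def crossover(slope: float, intercept: float) -> float:
    """Baseline at which the fitted gain is zero (requires a non-zero slope)."""
    return -intercept / slope


# Synthetic placeholder points, NOT real measurements:
# baseline score of each model (percent) and the change monitoring produced (points).
baselines = [20.0, 40.0, 60.0, 80.0]
gains = [18.0, 10.0, 2.0, -6.0]

slope, intercept = fit_line(baselines, gains)
threshold = crossover(slope, intercept)
print(f"slope={slope:.2f}, crossover at baseline of about {threshold:.1f} percent")
```

A model whose own score sits left of the crossover is predicted to gain from monitoring; one to the right is predicted to lose.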
Why it matters
How I work
Clean accuracy is table stakes. I build expecting an adversary, then measure exactly how far the system bends before it gives.
Local inference, MLflow tracking, pinned dependencies, confidence intervals. Every result can be rerun, not just believed.
The scoring pipeline runs as a public app you can poke right now. A working demo beats a paragraph claiming it works.
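As a rough illustration of the confidence intervals mentioned above, here is a percentile bootstrap over per-task differences between a monitored run and a baseline run. This is a sketch under assumptions: the function name, the resample count, and the fixed seed are hypothetical choices, the per-task differences are synthetic placeholders, and the real experiments may use a different interval method.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    values: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the result can be rerun exactly
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample tasks with replacement and record the resampled mean.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Synthetic placeholder data, NOT real results. One entry per task:
# +1 = monitoring fixed a failure, -1 = monitoring broke a pass, 0 = no change.
diffs = [1.0] * 20 + [-1.0] * 10 + [0.0] * 70

observed = sum(diffs) / len(diffs)
low, high = bootstrap_ci(diffs)
print(f"mean change {observed:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

Seeding the generator is what makes an interval like this rerunnable rather than something to take on trust.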
Toolkit
I graduate in June 2026 and I am looking for teams that care whether their AI actually holds up in production. That is the only thing I have ever worked on. Let us talk.