LLM evaluation and trustworthiness

Making LLM systems you can actually trust.

I have spent more than three years on one problem: ML systems that hold up when it actually matters, not just on a benchmark. Every LLM pipeline leans on an automated judge that grades its output and a monitor meant to catch its failures, and most teams simply assume both work. My research measures when you can trust them and builds ones you can. That is the whole thread, from evaluating the judge to hardening the monitor.

First-author research, targeting ICSE 2027
Production ML across 26 deployments
MS Computer Science, DePaul, 2026
Headline metrics: human preference annotations analyzed, false preference rate under attack, accuracy lift for a weak agent, model architectures stress-tested.

The thesis

Two projects, one question I could not let go of

Think of an AI pipeline as a factory floor. Models do the work, and standing over them is a quality inspector deciding what ships. Everyone studies the workers. I went and audited the inspector. One project asks whether it can be fooled. The other asks whether hiring it even helped.

The finding I keep coming back to

More monitoring is not always better

Add a critic and fixer to a weak coding agent and it jumps 32 accuracy points. Add the exact same layer to a strong one and it quietly loses ground. Every model I tested landed on the same downward line, with the crossover at a baseline of around 65 percent. Knowing which side of that line your model sits on, before you ship, is the actual job.

Monitoring benefit versus baseline capability

Each point is a model. X axis is how good it is on its own, Y axis is what the monitoring pipeline added or removed.
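
The crossover idea can be sketched in a few lines. This is a minimal illustration, not the project's code: it fits a straight line to (baseline, benefit) points and solves for the baseline where the fitted benefit reaches zero. The points below are placeholder values chosen only to mirror the shape described above, not measured results, and the function names are hypothetical.

```python
# Placeholder points, NOT measured results.
# Each pair is (baseline score in percent, change in score after adding monitoring).
points = [(25.0, 32.0), (40.0, 20.0), (55.0, 8.0), (70.0, -4.0), (85.0, -16.0)]


def fit_line(pairs):
    """Ordinary least squares fit of benefit = slope * baseline + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def crossover(slope, intercept):
    """Baseline at which the fitted monitoring benefit is zero."""
    if slope == 0:
        raise ValueError("flat fit: benefit never crosses zero")
    return -intercept / slope


def monitoring_helps(baseline, slope, intercept):
    """True when the fitted line predicts a positive benefit at this baseline."""
    return slope * baseline + intercept > 0


slope, intercept = fit_line(points)
threshold = crossover(slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, crossover={threshold:.1f}%")
```

With real measurements in place of the placeholders, `monitoring_helps` gives a quick first check of which side of the line a new model sits on before the monitoring layer is added.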

Why it matters

You cannot pick an evaluator from a clean benchmark

The leaderboard hides the gap. Logistic Regression and XGBoost scored almost the same on clean data. Under attack one collapsed to a 78.4 percent false preference rate and the other barely flinched. Nothing on the benchmark told you which to trust.
Length is not quality. Across more than 170,000 human judgments, the most verbose answers had the lowest preference rate, at 48.9 percent. Shorter, warmer ones won. Any model leaning on length gets played.
Knowing where it breaks is half the value. A reliability system that helps weak models and hurts strong ones is only safe if you know the crossover. That failure condition is a finding, not a footnote.
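
To make the first two points concrete, here is a toy sketch of how a false preference rate can be measured with and without an attack. Everything in it is an assumption for illustration: the two scorers are hypothetical stand-ins (not the Logistic Regression or XGBoost models), the padding attack is a simple filler-appending trick, the pairs are invented, and the metric is defined here as the fraction of pairs where the evaluator ranks the rejected answer above the human-preferred one.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]  # (human-preferred response, rejected response)


def length_scorer(response: str) -> float:
    """Hypothetical length-biased evaluator: more words, higher score."""
    return float(len(response.split()))


def keyword_scorer(response: str) -> float:
    """Hypothetical content-based evaluator: counts distinct helpful cue words."""
    cues = {"because", "example", "steps", "sorry", "thanks"}
    words = {w.strip(".,!?").lower() for w in response.split()}
    return float(len(words & cues))


def pad_attack(response: str, repeats: int = 5) -> str:
    """Assumed attack: append verbose filler that adds length but no content."""
    filler = "Furthermore, it is worth noting that"
    return response + " " + " ".join([filler] * repeats)


def false_preference_rate(
    scorer: Callable[[str], float],
    pairs: List[Pair],
    attack: Optional[Callable[[str], str]] = None,
) -> float:
    """Fraction of pairs where the scorer ranks the rejected answer strictly higher."""
    if not pairs:
        raise ValueError("need at least one pair")
    wrong = 0
    for chosen, rejected in pairs:
        if attack is not None:
            rejected = attack(rejected)  # only the rejected answer is perturbed
        if scorer(rejected) > scorer(chosen):
            wrong += 1
    return wrong / len(pairs)


# Invented toy pairs, not drawn from any dataset.
pairs = [
    ("Thanks for asking. Here are the steps, with an example.", "No."),
    ("Sorry about that. It fails because the path is wrong.", "It just fails."),
    ("Use a list, for example [1, 2].", "Use something."),
]

for name, scorer in [("length", length_scorer), ("keyword", keyword_scorer)]:
    clean = false_preference_rate(scorer, pairs)
    attacked = false_preference_rate(scorer, pairs, attack=pad_attack)
    print(f"{name}: clean={clean:.2f} attacked={attacked:.2f}")
```

Both toy scorers look identical on the clean pairs, and only the attacked run separates them, which is the pattern the leaderboard cannot show.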

How I work

Reliability is the whole point, not a nice-to-have

Assume someone will break it

Clean accuracy is table stakes. I build expecting an adversary, then measure exactly how far the system bends before it gives.

Prove it, do not claim it

Local inference, MLflow tracking, pinned dependencies, confidence intervals. Every result can be rerun, not just believed.
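
As one example of what a rerunnable confidence interval can look like, here is a minimal percentile-bootstrap sketch using only the standard library. It is an assumption about approach, not the project's actual code: the function name, the resample count, and the sample outcomes are all hypothetical. The fixed seed is what makes the interval reproducible from run to run.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    outcomes: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile bootstrap interval for the mean of 0/1 outcomes.

    Returns (point estimate, lower bound, upper bound).
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)  # seeded so the interval can be reproduced
    n = len(outcomes)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    point = sum(outcomes) / n
    return point, means[lo_idx], means[hi_idx]


# Hypothetical outcomes: 70 passes and 30 failures on 100 tasks.
outcomes = [1] * 70 + [0] * 30
point, lower, upper = bootstrap_ci(outcomes)
print(f"accuracy={point:.2f}, 95% CI=[{lower:.2f}, {upper:.2f}]")
```

Logging the seed and resample count alongside the interval in the experiment tracker is what lets someone else rerun the number instead of taking it on trust.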

Ship it live

The scoring pipeline runs as a public app you can poke right now. A working demo beats a paragraph claiming it works.

Toolkit

What I build with

Python, scikit-learn, XGBoost, PyTorch, transformers, sentence-transformers, LangGraph, Ollama, MLflow, Streamlit, Docker, Hugging Face, HumanEval, HH-RLHF, LLM-Bar

Hiring for ML or AI Engineering?

I graduate June 2026 and I am looking for teams that care whether their AI actually holds up in production. That is the only thing I have ever worked on. Let us talk.

Email me
See the full background