Kalyan Venkatesh - LLM Evaluation and Trustworthiness

The thesis

Two projects, one question I could not let go of

Think of an AI pipeline as a factory floor. Models do the work, and standing over them is a quality inspector deciding what ships. Everyone studies the workers. I went and audited the inspector. One project asks whether it can be fooled. The other asks whether hiring it even helped.

Can the judge be fooled · Live demo

Inference-Lens

I trained scoring models on real human preference data, then put them under systematic adversarial pressure. Two models that looked identical on a clean benchmark came apart by more than 75 points the moment something tried to deceive them. The judge can be fooled, and how badly depends on the architecture.

Explore and try it live →

First author research · ICSE 2027

Agentic LLMOps

A three-agent monitor (Planner, Critic, Fixer) that catches and corrects bad LLM output at runtime, without retraining. It delivered a statistically significant gain, but the finding I care about is the failure condition: once a model's baseline score rose above roughly 65 percent, monitoring made it worse.

The finding I keep coming back to

More monitoring is not always better

Add a critic and fixer to a weak coding agent and it jumps 32 points. Add the exact same layer to a strong one and it quietly loses ground. Every model I tested landed on the same downward line, with the crossover around 65 percent baseline. Knowing which side of that line your model sits on, before you ship, is the actual job.

Monitoring benefit versus baseline capability

Each point is a model. X axis is how good it is on its own, Y axis is what the monitoring pipeline added or removed.

Why it matters

You cannot pick an evaluator from a clean benchmark

The leaderboard hides the gap. Logistic Regression and XGBoost scored almost the same on clean data. Under attack one collapsed to a 78.4 percent false preference rate and the other barely flinched. Nothing on the benchmark told you which to trust.

Length is not quality. Across more than 170,000 human judgments, the most verbose answers had the lowest preference rate, at 48.9 percent. Shorter, warmer ones won. Any model leaning on length gets played.

Knowing where it breaks is half the value. A reliability system that helps weak models and hurts strong ones is only safe if you know the crossover. That failure condition is a finding, not a footnote.
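To make the false preference rate mentioned above concrete, here is a minimal sketch under an assumed definition: the share of pairs in which a scorer rates the adversarially altered response above the one humans preferred. The definition, the function name, and the toy scores are all assumptions for illustration, not the project's exact metric or data.

```python
def false_preference_rate(preferred_scores, attacked_scores):
    """Share of pairs where the scorer rates the attacked response
    strictly above the human-preferred one (assumed definition)."""
    if len(preferred_scores) != len(attacked_scores):
        raise ValueError("score lists must be the same length")
    if not preferred_scores:
        raise ValueError("need at least one pair")
    fooled = sum(
        1
        for good, bad in zip(preferred_scores, attacked_scores)
        if bad > good
    )
    return fooled / len(preferred_scores)


# Toy scores, made up for illustration only.
preferred = [0.9, 0.8, 0.7, 0.6]
attacked = [0.95, 0.2, 0.75, 0.1]

rate = false_preference_rate(preferred, attacked)
print(f"false preference rate: {rate:.1%}")
```

Two scorers can tie on clean accuracy and still differ sharply on this number, which is why it has to be measured separately.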

How I work

Reliability is the whole point, not a nice-to-have

Assume someone will break it

Clean accuracy is table stakes. I build expecting an adversary, then measure exactly how far the system bends before it gives.

Prove it, do not claim it

Local inference, MLflow tracking, pinned dependencies, confidence intervals. Every result can be rerun, not just believed.

Ship it live

The scoring pipeline runs as a public app you can poke right now. A working demo beats a paragraph claiming it works.

Toolkit

What I build with

Python · scikit-learn · XGBoost · PyTorch · transformers · sentence-transformers · LangGraph · Ollama · MLflow · Streamlit · Docker · Hugging Face · HumanEval · HH-RLHF · LLM-Bar

Hiring for ML or AI Engineering?

I graduate June 2026 and I am looking for teams that care whether their AI actually holds up in production. That is the only thing I have ever worked on. Let us talk.

Email me · See the full background