The throughline
That is the honest center of how I work. At sensen.ai my models produced court-admissible enforcement tickets sent to real people. A model confusing a B for an 8 in bad lighting was not a wrong number on a dashboard. It was someone's problem. We had cases where that went wrong, and that is what I was building against. It taught me to stop trusting clean metrics and start asking the harder question. Does this still work six months after I shipped it, and can I prove it?
My two research projects are two angles on that same instinct. Agentic LLMOps asks whether a runtime monitor inside an agent loop actually helps, and finds the exact point where it starts to hurt. Inference-Lens asks whether the automated judge we all rely on can be fooled, and shows how badly. The thread across all of it is simple. I want ML pipelines that work the way they are supposed to, with measurable and trustworthy output at every layer.
Where I have done it
First author on the multi-agent inference reliability work, alongside two faculty co-authors, targeting ICSE 2027. I own it end to end, from the problem to the architecture to the statistics. Ten phases, five model families, a statistically significant +4.85 point gain, and the Inverse Capability Hypothesis that says where monitoring stops helping.
Built the model validation framework that became the standard validation workflow across 26 global deployments. It maps every vehicle event to track accuracy continuously, with reports a non-technical council member could read. Rebuilt the ETL behind weekly invoicing from 3-plus hours down to under 2, and the ticket generation pipeline from roughly 800 ms to 480 ms per instance.
Built Python and Pandas pipelines for cleaning and monthly cost consolidation, cutting dashboard turnaround from 3 hours to under 1, and tuned SQL across cost databases from roughly 20 seconds to 8. Early career, but the instinct was already there. Take messy and manual, make it clean, fast, and repeatable.
Toolkit
Research
A three-agent runtime monitor for code generation. A real +4.85 point gain, and the failure condition that matters. Above roughly 65 percent baseline, monitoring hurts.
Can the judge be fooled
An adversarial stress-test of automated LLM evaluators. Two models tied on clean accuracy diverged by over 75 points under attack.