About

Kalyan Venkatesh

I have spent 3 plus years on one problem: ML systems that hold up when it actually matters, not just on a benchmark. I am finishing my MS in Computer Science at DePaul, graduating June 2026, and I am looking for ML Engineer and AI Engineer roles.

Email me · LinkedIn

The throughline

Every system I built had a human on the other end

That is the honest center of how I work. At sensen.ai my models produced court-admissible enforcement tickets sent to real people. A model confusing a B for an 8 in bad lighting was not a wrong number on a dashboard; it was someone's problem. We had cases where that went wrong, and that is what I was building against. It taught me to stop trusting clean metrics and start asking the harder question: does this still work six months after I shipped it, and can I prove it?

My two research projects are two angles on that same instinct. Agentic LLMOps asks whether a runtime monitor inside an agent loop actually helps, and finds the exact point where it starts to hurt. Inference-Lens asks whether the automated judge we all rely on can be fooled, and shows how badly. The thread across all of it is simple. I want ML pipelines that work the way they are supposed to, with measurable and trustworthy output at every layer.

Where I have done it

Experience

Sep 2025 to present · DePaul University

Graduate Research Engineer

First author on the multi-agent inference reliability work, alongside two faculty co-authors, targeting ICSE 2027. I own it end to end, from the problem to the architecture to the statistics. Ten phases, five model families, a statistically significant +4.85 point gain, and the Inverse Capability Hypothesis that says where monitoring stops helping.

2022 to 2024 · sensen.ai

Data Scientist

Built the model validation framework that became the standard validation workflow across 26 global deployments, mapping every vehicle event to track accuracy continuously, with reports a non-technical council member could read. Rebuilt the ETL behind weekly invoicing, taking it from 3 plus hours to under 2, and the ticket generation pipeline, taking it from roughly 800 ms to 480 ms per instance.

2021 to 2022 · AECOM and Siri

Data Engineer

Built Python and Pandas pipelines for cleaning and monthly cost consolidation, cutting dashboard turnaround from 3 hours to under 1, and tuned SQL across cost databases from roughly 20 seconds to 8. Early career, but the instinct was already there. Take messy and manual, make it clean, fast, and repeatable.

Toolkit

What I work with

Core ML

Python · scikit-learn · XGBoost · PyTorch · NumPy · pandas

LLM systems

transformers · sentence-transformers · LangGraph · Ollama · DeBERTa

Ship and track

MLflow · Streamlit · Docker · Hugging Face Spaces · SQL

Evaluation

HumanEval · HH-RLHF · LLM-Bar · adversarial testing

Methods

clustering · feature engineering · experiment design · statistical CIs

Education

MS CS, DePaul, 2026 · BTech, VNIT Nagpur

Research

Selected work

First author · ICSE 2027

Agentic LLMOps

A three-agent runtime monitor for code generation. A real +4.85 point gain, and the failure condition that matters: above a baseline of roughly 65 percent, monitoring hurts.

Inference-Lens

An adversarial stress-test of automated LLM evaluators. Two models tied on clean accuracy diverged by over 75 points under attack.

Get in touch

Let us talk

adavivenkatesh@gmail.com

in/kalyan-venk

GitHub

kalyan-venk