First-author research · targeting ICSE 2027

Agentic LLMOps

What happens when the model itself starts failing at runtime and there is no human in the loop to catch it? That became my research problem. I drove it as first author alongside two faculty co-authors, owning everything from the question to the architecture to the statistics. The answer was a three-agent monitor that catches and corrects bad LLM output without retraining anything.

LangGraph · Ollama · MLflow · HumanEval · HumanEval+ · MBPP · Python 3.12


+4.85 pass@1 gain, p = 0.010

10 research phases over two quarters

5 model families tested

~65% baseline where monitoring flips to harmful

The architecture

Three agents and a do-no-harm gate

A Planner writes the code. A Pre-Critic scores it for hallucination risk between 0 and 1. If that score exceeds the threshold τ, the Fixer rewrites it, and a Post-Critic scores the result. Then a reversion gate checks whether the rewrite actually improved things and throws it away if it did not. Everything runs locally through Ollama. No API calls, no fine-tuning, fully reproducible.

Planner (writes the code) → Pre-Critic (scores risk 0 to 1) → Fixer (rewrites if over τ) → Reversion gate (keep only if better) → Post-Critic (final score)

A baseline of Planner straight to Post-Critic runs alongside every condition, so every result is a clean before and after. The operating threshold τ settles at 0.70.
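
The control flow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the real system runs through LangGraph and Ollama, which this sketch does not call. The names `monitor`, `Result`, and the stub agents are hypothetical, and the gate criterion (keep the rewrite only if its risk score is strictly lower) is an assumption about what "improved" means.

```python
from dataclasses import dataclass
from typing import Callable

TAU = 0.70  # operating threshold reported in the text


@dataclass
class Result:
    code: str
    risk: float     # critic score of the code that was kept, in [0, 1]
    fixed: bool     # True if the Fixer ran
    reverted: bool  # True if the gate discarded the rewrite


def monitor(
    problem: str,
    planner: Callable[[str], str],
    critic: Callable[[str, str], float],
    fixer: Callable[[str, str], str],
    tau: float = TAU,
) -> Result:
    """One pass through the monitored pipeline (illustrative only)."""
    draft = planner(problem)
    pre_risk = critic(problem, draft)
    if pre_risk <= tau:
        # Risk did not cross the threshold, so the Fixer never fires.
        return Result(draft, pre_risk, fixed=False, reverted=False)

    rewrite = fixer(problem, draft)  # the Fixer also sees the problem statement
    post_risk = critic(problem, rewrite)
    # ASSUMPTION: "improved" means a strictly lower risk score.
    if post_risk < pre_risk:
        return Result(rewrite, post_risk, fixed=True, reverted=False)
    return Result(draft, pre_risk, fixed=True, reverted=True)


# Hypothetical stub agents standing in for the local models.
def stub_planner(problem: str) -> str:
    return "def add(a, b): return a - b"


def stub_critic(problem: str, code: str) -> float:
    return 0.9 if "a - b" in code else 0.1


def stub_fixer(problem: str, code: str) -> str:
    return code.replace("a - b", "a + b")


result = monitor("add two numbers", stub_planner, stub_critic, stub_fixer)
no_op = monitor("add two numbers", stub_planner, stub_critic, lambda p, c: c)
```

With these stubs, `result` keeps the rewrite, while `no_op` (a fixer that changes nothing) is reverted by the gate and returns the original draft.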

The result, and the better finding

The gain was real. What did not work is the interesting part.

Across ten phases and five model families on HumanEval, the monitor delivered a statistically significant +4.85 point pass@1 gain. But the number I care about is the failure condition. Above a baseline pass@1 of roughly 65 percent, monitoring made models worse. I call it the Inverse Capability Hypothesis. Knowing where a reliability system turns into a liability is as valuable as the system itself.

Monitoring delta by model

Change in pass@1 from adding the monitoring pipeline, single-trial endpoints.

Model	Base pass@1	Monitored pass@1	Δ (pp)
StarCoder2 7B	~0%	~0%	floor
Code Llama 7B	20%	52%	+32
Llama 3.1 8B	72%	68%	-4
DeepSeek Coder 6.7B	77%	76%	-1
Qwen2.5 Coder 7B	90%	87%	-3

StarCoder2 is the floor case that proves the rule from below. At roughly zero baseline the fixer has nothing to work with, so the benefit collapses back to nothing. The gain is largest in the middle-low band, where the model is wrong often but not hopeless.
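
As a quick check of the arithmetic, the deltas can be recomputed from the table and compared against the roughly 65 percent crossover. This is only a sketch over the published single-trial numbers. StarCoder2 is left out because its entries are approximate, and the names `RESULTS` and `CROSSOVER` are illustrative.

```python
# (model, baseline pass@1 %, monitored pass@1 %) taken from the table above.
RESULTS = [
    ("Code Llama 7B", 20, 52),
    ("Llama 3.1 8B", 72, 68),
    ("DeepSeek Coder 6.7B", 77, 76),
    ("Qwen2.5 Coder 7B", 90, 87),
]
CROSSOVER = 65  # approximate baseline (%) above which monitoring hurt

# Delta in percentage points: monitored minus baseline.
deltas = {model: monitored - base for model, base, monitored in RESULTS}
helped = [model for model, _, _ in RESULTS if deltas[model] > 0]
hurt = [model for model, _, _ in RESULTS if deltas[model] < 0]
above_crossover = [model for model, base, _ in RESULTS if base > CROSSOVER]

for model, base, _ in RESULTS:
    print(f"{model:22s} base={base:3d}%  delta={deltas[model]:+d} pp")
```

In this table every model that loses accuracy sits above the crossover, and the only model that gains sits well below it.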

Making it defensible

A single run has no error bars, so reviewers would not buy it

A deterministic run has zero variance, and a number with no confidence interval will not survive review at a venue like ICSE. So I reran the two endpoints as three genuinely independent trials each with a non-deterministic planner. Both intervals exclude zero. The gain and the regression are both statistically defensible, not a pair of lucky runs.

Model	τ	Trials	Mean Δ (pp)	95% CI	Excludes zero
Code Llama 7B	0.60	3	+30.3	+24.1 to +36.6	yes
Code Llama 7B	0.70	3	+31.0	+26.7 to +35.3	yes
Qwen2.5 Coder 7B	0.60	3	-4.0	-6.5 to -1.5	yes
Qwen2.5 Coder 7B	0.70	3	-2.0	tight, near -2.0	yes
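
For readers who want to see how a three-trial interval of this kind can be computed, here is a minimal sketch. It assumes a two-sided Student t interval with two degrees of freedom; the text does not say which interval method was used. The per-trial values are hypothetical, chosen only so the arithmetic is easy to follow, and are not the study's data.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for df = 2 (that is, n = 3 trials).
T_CRIT_DF2 = 4.302652729911275


def mean_ci_3(trials: list[float]) -> tuple[float, float, float]:
    """Return (mean, lower, upper) of a 95% t-interval for exactly three trials."""
    if len(trials) != 3:
        raise ValueError("the hard-coded critical value is only valid for n = 3")
    mean = statistics.fmean(trials)
    sem = statistics.stdev(trials) / math.sqrt(len(trials))  # standard error
    half_width = T_CRIT_DF2 * sem
    return mean, mean - half_width, mean + half_width


# HYPOTHETICAL per-trial deltas in percentage points, for illustration only.
example_trials = [28.0, 30.0, 33.0]
mean, lower, upper = mean_ci_3(example_trials)
excludes_zero = lower > 0 or upper < 0
print(f"mean={mean:+.1f} pp, 95% CI {lower:+.1f} to {upper:+.1f}, "
      f"excludes zero: {excludes_zero}")
```

With only three trials the critical value is large (about 4.3), so an interval that still excludes zero reflects a delta that is large relative to the trial-to-trial spread. If all three trials are identical the standard error is zero and the interval collapses to a single point.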

What actually does the work

I turned each piece off to see which one carried the result

Between the net-negative early phases and the net-positive turn, three things changed at once: an 8B fixer, the problem statement handed to the fixer, and the reversion gate. So I turned each one off in isolation on Code Llama to see which one actually mattered.

Configuration	pass@1	Contribution of the missing piece
Baseline, no monitoring	20%	reference
Full pipeline, all on	52%	control
Without problem context	48%	context worth +4pp
With 3B fixer instead of 8B	51%	model upgrade worth +1pp
Without the reversion gate	53%	gate costs 1pp here

The gate surprised me. On a weak model it slightly lowers raw pass@1, because almost any fix helps when you start at 20 percent. Its real value shows up on strong models, where it reverts bad rewrites and keeps the monitor from doing harm. Handing the fixer the problem statement was the single biggest lever on the weak model.
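
The contribution column is just the full-pipeline score minus the score with that piece removed. A short sketch over the numbers in the table (variable names are illustrative):

```python
FULL_PIPELINE = 52  # pass@1 (%) on Code Llama 7B with every component on

# pass@1 (%) with one component removed or downgraded, from the table above.
ABLATED = {
    "problem context": 48,
    "8B fixer (3B used instead)": 51,
    "reversion gate": 53,
}

# Positive: the component helps raw pass@1. Negative: it costs raw pass@1.
contribution = {name: FULL_PIPELINE - score for name, score in ABLATED.items()}
biggest_lever = max(contribution, key=contribution.get)

for name, pp in contribution.items():
    print(f"{name:28s} {pp:+d} pp")
```

The negative sign on the reversion gate is the one-point cost on a weak model described above.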

Core findings

What ten phases taught me

Monitoring benefit is inversely related to baseline capability. The central result, now backed by 95 percent intervals that exclude zero at both ends. A model that already gets nine of ten right has little to gain and real risk of harm. A model that gets two of ten right has everything to gain.

The reversion gate is a safety device, not an accelerator. It reverts 73.4 percent of attempted fixes. On weak models that costs a point. On strong models it is what keeps the whole thing from going negative. Do no harm.

The fixer, not the critic, is the bottleneck. Scaling the critic to 8B did nothing early on. A bigger judge gives better scores but cannot rescue a fixer that introduces fresh bugs.

Context beat raw model size. The ablation says handing the fixer the problem statement mattered more than upgrading its parameter count.

It generalizes beyond one benchmark. On HumanEval+ and MBPP, both harder than the original HumanEval, the weak model still gained around 30 points and the strong models still lost a little. The law is not a quirk of a single dataset.

The full arc

Ten phases, from first signal to publishable

Phase 1

Baseline and a calibration trap

3B models throughout. The critic caught risk reliably but the small fixer was net negative. Set the threshold too low and the fixer fired on 74 percent of problems, dropping accuracy by 18 points. Calibration clearly mattered.

Phase 2

A bigger critic was not the answer

I scaled the critic to 8B and kept the 3B fixer. Still net negative across all 18 conditions. That pinned the bottleneck on the fixer, not the critic.

Phase 3

The turning point

An 8B fixer, the problem statement as context, and a reversion gate together flipped the direction. Net positive at +4.85 points, 95 percent CI of +3.36 to +6.34, p of 0.010. The gate reverted 73.4 percent of fixes.

Phase 4 and 4b

Cross-model comparison

I extended it to Code Llama 7B and ran a full sweep. Code Llama gained 32 points, the largest in the study, and the inverse relationship between capability and benefit came into view.

Phase 5

The pattern holds

Added DeepSeek Coder 6.7B at -1 and Qwen2.5 Coder 7B at -3. The inverse law held across architectures.

Phase 6

Threshold and retry sweep

Swept τ across 0.60, 0.70, 0.75. At 0.70 the system stopped over-triggering with no loss for weak models, and 0.75 was identical, marking the plateau. For Llama 3.1 8B the fixer trigger rate fell from 46 to 10 percent and latency dropped 39 percent. A single retry lifted Code Llama's pass@2 from 20 to 38 percent.
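
To make the mechanics of the sweep concrete: the fixer trigger rate at a given τ is the fraction of problems whose pre-critic risk exceeds τ. The sketch below uses hypothetical risk scores, not measured ones, arranged so that raising τ from 0.60 to 0.70 cuts the trigger rate and 0.75 changes nothing further, mirroring the plateau described above.

```python
def trigger_rate(risk_scores: list[float], tau: float) -> float:
    """Fraction of problems whose pre-critic risk score exceeds tau."""
    return sum(score > tau for score in risk_scores) / len(risk_scores)


# HYPOTHETICAL pre-critic risk scores for ten problems (not study data).
scores = [0.15, 0.30, 0.45, 0.62, 0.65, 0.68, 0.66, 0.80, 0.55, 0.20]

sweep = {tau: trigger_rate(scores, tau) for tau in (0.60, 0.70, 0.75)}
for tau, rate in sweep.items():
    print(f"tau={tau:.2f}  fixer fires on {rate:.0%} of problems")
```

A cluster of scores just above 0.60 is what makes the low threshold over-trigger in this toy example. Once τ clears that cluster, only the genuinely high-risk outputs reach the fixer.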

Phase 7

Confidence intervals on the endpoints

My advisor pointed out that single-trial numbers would not survive review. So I reran the two endpoints as three independent trials each. Code Llama landed at +30 to +31 points and Qwen at -2 to -4, both excluding zero.

Phase 8

Component ablation

One change at a time on Code Llama. Problem context was the biggest single lever at +4 points, the fixer model upgrade added +1, and the gate slightly lowered raw accuracy on a weak model while staying the safeguard for strong ones.

Phase 9

Extending intervals, and an honest catch

I ran the interval study across the remaining models, then caught my own flaw. The planner caches were built deterministically, so the three trials were identical and carried no real variance. Worth flagging rather than reporting fake error bars.
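
A guard against this kind of flaw can be very small. The sketch below is a hypothetical check, not code from the project: it flags a set of repeated trials as degenerate when every trial produced the same score, which is what cached deterministic planner output would yield.

```python
def trials_are_degenerate(trials: list[float]) -> bool:
    """True when repeated trials carry no variance, e.g. replayed cached runs."""
    return len(set(trials)) <= 1


# Illustrative values only.
cached_runs = [52.0, 52.0, 52.0]    # identical trials: no usable error bars
sampled_runs = [50.0, 52.0, 53.0]   # trials that actually vary

for name, runs in (("cached", cached_runs), ("sampled", sampled_runs)):
    status = "DEGENERATE" if trials_are_degenerate(runs) else "ok"
    print(f"{name:8s} {runs} -> {status}")
```

Running a check like this before computing intervals avoids reporting a zero-width interval as if it were evidence.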

Phase 10

Cross-benchmark generalization, and it held

Regenerated the caches with a non-deterministic planner for genuine intervals, then ran the law on two harder benchmarks. On HumanEval+ Code Llama gained about 32 points and Qwen lost about 4. On MBPP Code Llama gained 30 and Qwen lost a few. The inverse capability law held on both, so the central finding is not a quirk of HumanEval. This is the evidence that anchors the submission.

Status

Where it stands

All ten experiment phases are done. I am writing it up as first author with Prof. Vahid Alizadeh and Prof. Noriko Tomuro as faculty co-authors at DePaul, targeting ICSE 2027. The Phase 10 cross-benchmark runs were built to test whether the claim holds beyond HumanEval, and it does: the law held on both HumanEval+ and MBPP.
