What happens when the model itself starts failing at runtime and there is no human in the loop to catch it? That became my research problem. I drove it as first author alongside two faculty co-authors, owning everything from the question to the architecture to the statistics. The answer was a three-agent monitor that catches and corrects bad LLM output without retraining anything.
The architecture
A Planner writes the code. A Critic scores it for hallucination risk between 0 and 1. If that score crosses a threshold the Fixer rewrites it, and a Post-Critic scores the result. Then a reversion gate checks whether the rewrite actually improved things and throws it away if it did not. Everything runs locally through Ollama. No API calls, no fine-tuning, fully reproducible.
A baseline of Planner straight to Post-Critic runs alongside every condition, so every result is a clean before and after. The operating threshold τ settles at 0.70.
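The control flow above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the function names, the `MonitorResult` type, and the callable signatures are hypothetical, "crosses a threshold" is assumed to mean `score >= tau`, and the gate is assumed to keep a rewrite only when the Post-Critic score is strictly lower than the original Critic score. The agents are passed in as plain callables, so the sketch does not depend on any particular Ollama client.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class MonitorResult:
    code: str                    # code the pipeline finally returns
    risk: float                  # Critic score for the Planner's draft
    post_risk: Optional[float]   # Post-Critic score for the rewrite, if any
    fixed: bool                  # True if the Fixer ran
    reverted: bool               # True if the gate discarded the rewrite

def monitor(
    problem: str,
    planner: Callable[[str], str],
    critic: Callable[[str, str], float],
    fixer: Callable[[str, str], str],
    post_critic: Callable[[str, str], float],
    tau: float = 0.70,
) -> MonitorResult:
    draft = planner(problem)
    risk = critic(problem, draft)
    if risk < tau:
        # Below threshold: the draft passes through untouched.
        return MonitorResult(draft, risk, None, fixed=False, reverted=False)
    # The Fixer sees the problem statement as well as the draft.
    rewrite = fixer(problem, draft)
    post_risk = post_critic(problem, rewrite)
    if post_risk >= risk:
        # Reversion gate (assumed criterion): the rewrite did not lower risk.
        return MonitorResult(draft, risk, post_risk, fixed=True, reverted=True)
    return MonitorResult(rewrite, risk, post_risk, fixed=True, reverted=False)

if __name__ == "__main__":
    # Stub agents standing in for local models.
    result = monitor(
        "add two numbers",
        planner=lambda p: "def add(a, b): return a - b",
        critic=lambda p, c: 0.9,
        fixer=lambda p, c: "def add(a, b): return a + b",
        post_critic=lambda p, c: 0.2,
    )
    print(result.fixed, result.reverted)
```

Keeping the gate as a pure comparison of two scores makes it cheap to log how often rewrites are discarded.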
The result, and the better finding
Across ten phases and five model families on HumanEval, the monitor delivered a statistically significant +4.85 point pass@1 gain. But the number I care about is the failure condition. Once a model's baseline pass@1 rose above roughly 65 percent, monitoring made it worse. I call it the Inverse Capability Hypothesis. Knowing where a reliability system turns into a liability is as valuable as the system itself.
Change in pass@1 from adding the monitoring pipeline, single-trial endpoints.
| Model | Base pass@1 | Monitored pass@1 | Δ (pp) |
|---|---|---|---|
| StarCoder2 7B | ~0% | ~0% | floor |
| Code Llama 7B | 20% | 52% | +32 |
| Llama 3.1 8B | 72% | 68% | -4 |
| DeepSeek Coder 6.7B | 77% | 76% | -1 |
| Qwen2.5 Coder 7B | 90% | 87% | -3 |
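As a quick arithmetic check, the deltas in the table can be recomputed and compared against the rough 65 percent crossover. The sketch below copies the table values; the `CROSSOVER` constant and the function names are illustrative assumptions, and StarCoder2 is left out because its floor values are only approximate.

```python
# (model, baseline pass@1 %, monitored pass@1 %), copied from the table
ROWS = [
    ("Code Llama 7B", 20, 52),
    ("Llama 3.1 8B", 72, 68),
    ("DeepSeek Coder 6.7B", 77, 76),
    ("Qwen2.5 Coder 7B", 90, 87),
]

CROSSOVER = 65  # approximate baseline above which monitoring hurt

def delta_pp(base: int, monitored: int) -> int:
    """Change in pass@1, in percentage points."""
    return monitored - base

def fits_hypothesis(base: int, monitored: int, crossover: int = CROSSOVER) -> bool:
    """Gain expected below the crossover, loss expected above it."""
    d = delta_pp(base, monitored)
    return d > 0 if base < crossover else d < 0

if __name__ == "__main__":
    for model, base, monitored in ROWS:
        print(model, delta_pp(base, monitored), fits_hypothesis(base, monitored))
```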
Making it defensible
A deterministic run has zero variance, and a number with no confidence interval will not survive review at a venue like ICSE. So I reran the two endpoints as three genuinely independent trials each with a non-deterministic planner. Both intervals exclude zero. The gain and the regression are both statistically defensible, not a pair of lucky runs.
| Model | τ | Trials | Mean Δ (pp) | 95% CI | Excludes zero |
|---|---|---|---|---|---|
| Code Llama 7B | 0.60 | 3 | +30.3 | +24.1 to +36.6 | yes |
| Code Llama 7B | 0.70 | 3 | +31.0 | +26.7 to +35.3 | yes |
| Qwen2.5 Coder 7B | 0.60 | 3 | -4.0 | -6.5 to -1.5 | yes |
| Qwen2.5 Coder 7B | 0.70 | 3 | -2.0 | tight, near -2.0 | yes |
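For readers who want to see how an interval like these can be built from three trials, here is a minimal sketch using a two-sided Student t interval. The write-up does not state which interval method was used, so treat the method as an assumption. The reported half-width for Code Llama at τ = 0.70 (about 4.3 pp) is at least consistent with it, since the critical value for two degrees of freedom is 4.303. The trial values in the demo are hypothetical, not the study's raw data.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776}

def mean_ci95(deltas: list[float]) -> tuple[float, float, float]:
    """Return (mean, lower, upper) for a 95% t interval on the mean."""
    n = len(deltas)
    mean = statistics.mean(deltas)
    sem = statistics.stdev(deltas) / math.sqrt(n)  # standard error of the mean
    half_width = T_CRIT_95[n - 1] * sem
    return mean, mean - half_width, mean + half_width

def excludes_zero(lower: float, upper: float) -> bool:
    """True when the whole interval sits on one side of zero."""
    return lower > 0 or upper < 0

if __name__ == "__main__":
    hypothetical_trials = [29.0, 32.0, 32.0]  # made-up deltas in pp
    mean, lo, hi = mean_ci95(hypothetical_trials)
    print(round(mean, 1), round(lo, 1), round(hi, 1), excludes_zero(lo, hi))
```

With only three trials the critical value is large, so an interval excludes zero only when the trials agree closely relative to the size of the effect.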
What actually does the work
Between the net-negative early phases and the net-positive turn, three things changed at once: an 8B fixer, the problem statement handed to the fixer, and the reversion gate. So I turned each one off in isolation on Code Llama to see which one actually mattered.
| Configuration | pass@1 | Contribution of the missing piece |
|---|---|---|
| Baseline, no monitoring | 20% | reference |
| Full pipeline, all on | 52% | control |
| Without problem context | 48% | context worth +4pp |
| With 3B fixer instead of 8B | 51% | model upgrade worth +1pp |
| Without the reversion gate | 53% | gate costs 1pp here |
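The contribution column is a leave-one-out difference: full-pipeline pass@1 minus pass@1 with one piece removed. A small sketch of that arithmetic, using the table values (the dictionary keys and function name are illustrative):

```python
FULL_PIPELINE = 52  # pass@1 (%) with every component on, from the table

# pass@1 (%) with exactly one component removed or downgraded
ABLATED = {
    "problem context": 48,
    "8B fixer (swapped for 3B)": 51,
    "reversion gate": 53,
}

def contribution(full: int, ablated: int) -> int:
    """Percentage points lost when the component is taken away."""
    return full - ablated

if __name__ == "__main__":
    for component, score in ABLATED.items():
        print(f"{component}: {contribution(FULL_PIPELINE, score):+d} pp")
```

These leave-one-out contributions sum to 4 points, far short of the 32 point gain over baseline. That is expected, because each ablation still leaves the rest of the pipeline running, so the differences measure marginal effects rather than an additive breakdown.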
Core findings
The full arc
3B models throughout. The critic caught risk reliably but the small fixer was net negative. Set the threshold too low and the fixer fired on 74 percent of problems, dropping accuracy by 18 points. Calibration clearly mattered.
I scaled the critic to 8B and kept the 3B fixer. Still net negative across all 18 conditions. That pinned the bottleneck on the fixer, not the critic.
An 8B fixer, the problem statement as context, and a reversion gate together flipped the direction. Net positive at +4.85 points (95 percent CI +3.36 to +6.34, p = 0.010). The gate reverted 73.4 percent of fixes.
I extended it to Code Llama 7B and ran a full sweep. Code Llama gained 32 points, the largest in the study, and the inverse relationship between capability and benefit came into view.
Added DeepSeek Coder 6.7B at -1 and Qwen2.5 Coder 7B at -3. The inverse law held across architectures.
Swept τ across 0.60, 0.70, 0.75. At 0.70 the system stopped over-triggering with no loss for weak models, and 0.75 was identical, marking the plateau. For Llama 3.1 8B the fixer trigger rate fell from 46 to 10 percent and latency dropped 39 percent. A single retry lifted Code Llama's pass@2 from 20 to 38 percent.
My advisor pointed out that single-trial numbers would not survive review. So I reran the two endpoints as three independent trials each. Code Llama landed at +30 to +31 points and Qwen at -2 to -4, both excluding zero.
One change at a time on Code Llama. Problem context was the biggest single lever at +4 points, the fixer model upgrade added +1, and the gate slightly lowered raw accuracy on a weak model while staying the safeguard for strong ones.
I ran the interval study across the remaining models, then caught my own flaw. The planner caches were built deterministically, so the three trials were identical and carried no real variance. Worth flagging rather than reporting fake error bars.
Regenerated the caches with a non-deterministic planner for genuine intervals, then tested the law on two harder benchmarks. On HumanEval+ Code Llama gained about 32 points and Qwen lost about 4. On MBPP Code Llama gained 30 points and Qwen lost a few. The inverse capability law held on both, so the central finding is not a quirk of HumanEval. This is the evidence that anchors the submission.
Status
All ten experiment phases are done. I am writing it up as first author with Prof. Vahid Alizadeh and Prof. Noriko Tomuro as faculty co-authors at DePaul, targeting ICSE 2027. The Phase 10 cross-benchmark runs were built to test whether the claim holds beyond HumanEval, and it does: the law held on both HumanEval+ and MBPP.