End-to-end LLM output quality scoring with evaluator reliability stress-testing under adversarial conditions.
Every LLM pipeline has a judge somewhere: a model that grades other models' outputs. But what happens when that judge itself gets fooled? Inference-Lens builds scoring models, then attacks them with adversarially constructed inputs to measure how much their judgment degrades.
Anthropic's HH-RLHF dataset contains 170K+ real human comparisons: two AI responses side by side, one labeled "chosen," one "rejected." These pairs teach our models what "good" looks like.
Before training, we clustered all responses with K-Means, DBSCAN, and hierarchical clustering, uncovering 4 distinct quality archetypes that naturally emerge from response style alone.
LLM-Bar (EMNLP 2023) provides adversarial rewrites designed to fool automated judges. We measured how much each scoring model degrades when exposed to these adversarial inputs.
This extends prior work on Multi-Agent Inference Reliability. If the judge is miscalibrated, the damage amplifies through the entire pipeline.
We trained three scoring models on the same feature set: an interpretable baseline, a tree ensemble, and a fine-tuned transformer. We then compared accuracy, robustness, and how each behaves under adversarial pressure.
LLM-Bar is a benchmark from EMNLP 2023 built specifically to fool automated evaluators. It contains 419 adversarial response pairs across 4 attack categories. We ran every trained model against it zero-shot.
| Model | Category | Accuracy | False Preference Rate | Degradation | Risk |
|---|---|---|---|---|---|
Before any supervised training, we clustered 182K responses on four style features: length, readability, vocabulary richness, and ROUGE-L. Four distinct archetypes emerged from the data.
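As a rough illustration of what such style features can look like, here is a minimal sketch in plain Python. It is not the project's extraction code: the whitespace tokenizer, the type-token ratio as a stand-in for "vocabulary richness," and the choice of `reference` text for ROUGE-L are all assumptions, and readability is left out because it needs a syllable model.

```python
def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; any tokenizer would do)."""
    return text.lower().split()


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    cand, ref = tokenize(candidate), tokenize(reference)
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def style_features(response, reference):
    """Surface-level style features for one response (hypothetical helper)."""
    tokens = tokenize(response)
    n = len(tokens)
    return {
        "token_length": n,
        "type_token_ratio": len(set(tokens)) / n if n else 0.0,
        "rouge_l": rouge_l_f1(response, reference),
    }


features = style_features("the cat sat on the mat", "the cat lay on the mat")
```

Feature vectors like these can then be standardized and passed to any clustering algorithm.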
Four takeaways that matter if you're building LLM evaluation pipelines.
All three models topped out at ~0.518 AUC-ROC, barely above the 0.5 chance level. None of the surface features we extracted (length, readability, vocabulary richness, ROUGE-L) is sufficient to reliably predict human preference. This suggests the signal is in the semantics, not the surface form.
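To see why an AUC-ROC near 0.5 means "no better than guessing," recall that AUC equals the probability that a randomly picked positive example scores higher than a randomly picked negative one. The sketch below computes it directly from that definition. It is illustrative only (quadratic time, hypothetical function name) and is not the evaluation code used in this project.

```python
def auc_roc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


perfect = auc_roc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])   # every positive outranks every negative
uninformative = auc_roc([0.5, 0.5], [1, 0])             # scores carry no ranking information
inverted = auc_roc([0.1, 0.9], [1, 0])                  # ranking is exactly backwards
```

A scorer that assigns the same value to everything lands at exactly 0.5, so a result of ~0.518 means the models rank chosen above rejected responses only marginally more often than a coin flip would.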
On the Neighbor perturbation category, LogReg hit a 78.4% false preference rate, meaning it picked the adversarially crafted (worse) response nearly 4 out of 5 times. XGBoost degraded far less, holding near 50% (chance level) across most categories.
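The metrics named here can be made concrete with a small sketch. It assumes each adversarial pair yields two scores from the model under test, one for the genuinely better response and one for the adversarial response, and it assumes "degradation" means in-distribution accuracy minus adversarial accuracy. Both the function name and that definition of degradation are assumptions for illustration, not the project's confirmed formulas.

```python
def adversarial_metrics(score_pairs, baseline_accuracy):
    """Summarize a scorer's behavior on adversarial pairs.

    score_pairs: list of (score_good, score_adversarial) tuples.
    baseline_accuracy: accuracy on non-adversarial data (assumed reference point).
    """
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    total = len(score_pairs)
    # Correct: the genuinely better response received the higher score.
    correct = sum(1 for good, adv in score_pairs if good > adv)
    # Fooled: the adversarial response received the higher score.
    fooled = sum(1 for good, adv in score_pairs if adv > good)
    accuracy = correct / total
    return {
        "accuracy": accuracy,
        "false_preference_rate": fooled / total,
        "degradation": baseline_accuracy - accuracy,
    }


# Made-up scores for five pairs: three fooled, one correct, one tie.
example = adversarial_metrics(
    [(0.2, 0.8), (0.3, 0.9), (0.1, 0.7), (0.6, 0.4), (0.5, 0.5)],
    baseline_accuracy=0.5,
)
```

Counting ties separately keeps accuracy and false preference rate from being forced to sum to one; if the scorer never ties, the two are complements.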
The Verbose Inconsistent archetype had the lowest chosen rate (48.9%) despite the highest average length (48 tokens). In this data, annotators did not reward length: short, conversational responses were chosen more often than long ones.
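A per-archetype comparison like this boils down to a group-by over cluster labels. The sketch below shows one way to compute chosen rate and mean token length per cluster. The record layout, the function name, and the toy data (including the "short_conversational" label) are hypothetical.

```python
from collections import defaultdict


def archetype_summary(records):
    """Aggregate (archetype, token_length, was_chosen) records per archetype."""
    totals = defaultdict(lambda: [0, 0, 0])  # [count, chosen count, token sum]
    for archetype, token_length, was_chosen in records:
        entry = totals[archetype]
        entry[0] += 1
        entry[1] += int(was_chosen)
        entry[2] += token_length
    return {
        archetype: {"chosen_rate": chosen / count, "mean_tokens": tokens / count}
        for archetype, (count, chosen, tokens) in totals.items()
    }


# Toy records, not project data.
summary = archetype_summary([
    ("verbose_inconsistent", 50, False),
    ("verbose_inconsistent", 46, True),
    ("short_conversational", 10, True),
    ("short_conversational", 12, True),
])
```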
The GPTInst and GPTOut categories produced more challenging adversarial pairs than Manual rewrites. LogReg accuracy dropped to 23.9% on GPTInst. In our runs, LLM-generated adversarial content was more effective at fooling automated evaluators than human-written adversarial content.
This project extends prior published research on LLM inference reliability, applying the same adversarial pressure lens to automated evaluators instead of model pipelines.