Can the judge be fooled?

Inference-Lens

LLM-as-judge is now the default evaluation paradigm. Almost nobody is asking how easily that judge can be deceived, so I built a system to find out. I trained scoring models on real human preference data, then put them under systematic adversarial pressure to see where they break and by how much.

HH-RLHF, LLM-Bar, scikit-learn, XGBoost, DeBERTa-v3, Streamlit

Live demo

Score two responses and watch a judge get fooled

Paste two responses, or load the built-in sample. Two trained models pick the one they think is better. The case I want you to see is when they confidently choose the worse, padded answer. That is not a bug. That is the whole point.

The setup

Honest human judgment, then adversarial pressure

I started from Anthropic's HH-RLHF dataset, which has more than 170,000 human preference annotations. A rater sees two AI responses to the same conversation and picks the one they prefer. No rubric, no scoring guide, only judgment. That is the honest signal I trained on.

Then I went looking for trouble. LLM-Bar gives me 419 pairs built specifically to fool automated judges. My plan was simple. Train scoring models on the clean preference data, then see how each one behaves the moment something is engineered to trip it.
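To make "surface features" concrete, here is a minimal sketch of how one preference pair could be turned into a feature vector for a scoring model. It is an illustration under stated assumptions, not the project's actual pipeline: the function names and the three features are hypothetical, and only the `chosen` / `rejected` naming follows the HH-RLHF fields.

```python
import re

def surface_features(text: str) -> dict[str, float]:
    """Cheap, content-blind features: length plus two rough readability proxies."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    return {
        "n_words": float(n_words),
        "avg_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "words_per_sentence": n_words / len(sentences) if sentences else 0.0,
    }

def pair_features(chosen: str, rejected: str) -> list[float]:
    """Feature differences (chosen minus rejected), in alphabetical key order."""
    a, b = surface_features(chosen), surface_features(rejected)
    return [a[k] - b[k] for k in sorted(a)]

# Hypothetical pair: both responses have six words, so the length feature is zero.
example = pair_features(
    "Use a list. It keeps order.",
    "You could perhaps consider a list.",
)
```

A model trained on vectors like `example` only ever sees how the two responses differ in shape, never what they say, which is why it can be probed with answers engineered to look right.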

Three things I walked away with

Clean performance near 0.50 is the point, not the failure. Logistic Regression, XGBoost, and a DeBERTa-v3-small fine-tune all landed around 0.50 AUC-ROC on clean evaluation, which is chance level. Surface features like readability and length simply do not predict what a human prefers.
Identical on paper, opposite under attack. Under the Neighbor attack, Logistic Regression picked the worse manipulated answer 78.4 percent of the time, almost four times in five. XGBoost, trained on the same data, degraded by only about 2.4 percent overall. No clean benchmark would ever show you that gap.
Length is the opposite of quality. Three clustering algorithms over the full dataset agreed on four archetypes. The most verbose cluster had the lowest preference rate at 48.9 percent. The shorter, conversational one had the highest at 51.6 percent.
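For readers who want to see why 0.50 AUC-ROC means "no signal", here is a small, dependency-free sketch of the metric. It uses the pairwise definition (the chance that a randomly chosen positive example outscores a randomly chosen negative one); the function name and the toy labels and scores are assumptions for illustration, not values from the project.

```python
def auc_roc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: a scorer whose output is unrelated to the label sits at 0.5.
uninformative = auc_roc([1, 0, 1, 0], [0.3, 0.3, 0.7, 0.7])
# Toy data: a scorer that ranks every positive above every negative reaches 1.0.
perfect = auc_roc([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2])
```

Anything near 0.5 says the score carries no ranking information about the label, which is the reading the first point above relies on.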

Attack surface

Four ways to fool a judge

LLM-Bar groups its adversarial pairs into four families. Each one goes after a different weakness in how automated evaluators decide what looks good.

Neighbor

A worse answer made by small local edits to the better one, so the two look almost identical on the surface. This is the attack that broke Logistic Regression hardest.

GPT instruction

A weaker answer generated by prompting a strong model with a subtly altered version of the instruction. Fluent and confident, but not actually better.

GPT output

A polished variation crafted to carry the markers of quality (length, structure, vocabulary) with none of the substance.

Manual

Handwritten deceptive answers. The hardest and most creative cases, built by people specifically to exploit the blind spots automated generation misses.

The headline

Two models, identical on paper, opposite under fire

All three models scored near 0.50 AUC-ROC on clean evaluation. The real separation only shows up once you attack them, and that is the argument of the whole project.

Model               | Clean AUC-ROC | Worst case under attack           | Verdict
Logistic Regression | ~0.50         | 78.4% false preference (Neighbor) | fragile
XGBoost             | ~0.50         | ~2.4% overall degradation         | robust
DeBERTa-v3-small    | 0.500         | GPU only, see notebooks           | excluded live

Same training data. Same clean numbers. More than 75 points apart under pressure. If I had only looked at the benchmark, I would have called these two interchangeable. They are not. That is exactly why you cannot choose an evaluator without attacking it first.
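One way to compute a per-family "false preference" rate like the one quoted for Neighbor is sketched below. This is an assumption about how such a number could be derived, not the project's code: the record format, the function name, and the toy outcomes are all hypothetical.

```python
from collections import defaultdict

def false_preference_rate(records):
    """records: iterable of (attack_family, judge_picked_worse).

    Returns, per family, the share of pairs where the judge chose the worse answer.
    """
    picked_worse = defaultdict(int)
    total = defaultdict(int)
    for family, wrong in records:
        total[family] += 1
        picked_worse[family] += int(wrong)
    return {family: picked_worse[family] / total[family] for family in total}

# Toy outcomes, not project results: the judge is fooled on 3 of 4 Neighbor pairs.
toy_records = [
    ("Neighbor", True), ("Neighbor", True), ("Neighbor", True), ("Neighbor", False),
    ("Manual", True), ("Manual", False),
]
rates = false_preference_rate(toy_records)
```

Reporting the rate per family rather than one pooled number is what lets a single catastrophic weakness stand out instead of being averaged away.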

What the data said

Length is the opposite of quality

Before I touched a single label, I wanted to know what the response space naturally looks like. Three clustering algorithms over the full dataset all landed on the same four archetypes, which told me the structure was real. Then I checked which one humans actually preferred, and it surprised me.
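The check described above, cluster first and look at labels second, comes down to a grouped average once every response has a cluster label. A minimal sketch, assuming each response is represented as a hypothetical `(cluster_label, was_chosen)` pair; the clustering step itself is not shown and the toy rows are made up.

```python
from collections import Counter

def preference_rate_by_cluster(rows):
    """rows: iterable of (cluster_label, was_chosen).

    Returns the share of responses in each cluster that the human rater chose.
    """
    chosen, total = Counter(), Counter()
    for cluster, was_chosen in rows:
        total[cluster] += 1
        chosen[cluster] += int(was_chosen)
    return {cluster: chosen[cluster] / total[cluster] for cluster in total}

# Toy rows for illustration only, not the dataset.
toy_rows = [
    ("verbose", False), ("verbose", True), ("verbose", False), ("verbose", False),
    ("conversational", True), ("conversational", False),
]
toy_rates = preference_rate_by_cluster(toy_rows)
```

Because every pair contributes one chosen and one rejected response, the rate across all clusters together is 50 percent by construction, so the interesting signal is how far each cluster sits from that midpoint.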

Human preference rate by archetype

Share of responses in each cluster that a human chose over its alternative.

Archetype               | Preference rate | Notes
Conversational Engaging | 51.6%           | shorter, warmer, direct
Medium-Length Mixed     | 50.3%           | middle of the pack
Ultra-Short Minimal     | 50.1%           | terse, sometimes too terse
Verbose Inconsistent    | 48.9%           | the longest responses, the lowest pick rate

The longest responses were the least preferred. I went in assuming elaborate answers would win. The humans disagreed. A scorer that rewards length is not measuring quality, it is measuring word count, and that is exactly the lever an adversary pulls.
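The padding lever is easy to demonstrate with a deliberately naive judge. The sketch below is a toy, not one of the trained models: `length_biased_judge` and both sample responses are made up for illustration, and the only rule is "more words wins".

```python
def length_biased_judge(response_a: str, response_b: str) -> str:
    """Toy judge that prefers whichever response has more words (ties go to A)."""
    return "A" if len(response_a.split()) >= len(response_b.split()) else "B"

# Response A is short and actionable; response B is padded and says nothing.
concise = "Restart the service, then clear the cache."
padded = (
    "That is a great question. There are many things to consider here, "
    "and it is important to think carefully. In general it depends."
)
verdict = length_biased_judge(concise, padded)
```

Here `verdict` is `"B"`: the padded answer wins without containing any usable advice, which is the failure mode an adversary targets in any scorer that leans on length.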

Takeaways

What a team should walk away with

The full pipeline, all three model architectures, the clustering notebooks, and the LLM-Bar stress test are on GitHub.