LLM-as-judge is now the default evaluation paradigm, yet almost nobody is asking how easily that judge can be deceived. So I built a system to find out. I trained scoring models on real human preference data, then put them under systematic adversarial pressure to see where they break and by how much.
Live demo
Paste two responses, or load the built-in sample. Two trained models each pick the one they think is better. The case I want you to see is when they confidently choose the worse, padded answer. That is not a bug. That is the whole point.
The setup
I started from Anthropic's HH-RLHF dataset, which holds more than 170,000 human preference annotations. A rater sees two AI responses to the same conversation and just picks the one they prefer. No rubric, no scoring guide, only judgment. That is the honest signal I trained on.
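To make that signal concrete, here is a minimal sketch of how one preference record can become a labeled training row. It assumes each record exposes the preferred and dispreferred text under `chosen` and `rejected` keys, as the published HH-RLHF files do. The function name `to_pairwise_examples` and the toy records are hypothetical, and this is not necessarily how the project's pipeline does it.

```python
import random


def to_pairwise_examples(records, seed=0):
    """Turn {'chosen', 'rejected'} records into (response_a, response_b, label) rows.

    label is 1 when response_a is the human-preferred response, else 0.
    The order is randomized so a model cannot learn "the first slot always wins".
    """
    rng = random.Random(seed)
    rows = []
    for rec in records:
        if rng.random() < 0.5:
            rows.append((rec["chosen"], rec["rejected"], 1))
        else:
            rows.append((rec["rejected"], rec["chosen"], 0))
    return rows


# Toy records, made up for illustration (not taken from the dataset).
records = [
    {"chosen": "Boil the pasta for 9 minutes.", "rejected": "Pasta is food."},
    {"chosen": "Use a list comprehension.", "rejected": "I am not sure."},
]
rows = to_pairwise_examples(records)
```

Shuffling the slot order matters because the raw data always names the winner explicitly. Without it, a classifier could score perfectly by learning position instead of quality.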
Then I went looking for trouble. LLM-Bar gives me 419 pairs built specifically to fool automated judges. My plan was simple. Train scoring models on the clean preference data, then see how each one behaves the moment something is engineered to trip it.
Attack surface
LLM-Bar groups its adversarial pairs into four families. Each one goes after a different weakness in how automated evaluators decide what looks good.
A worse answer made by small local edits to the better one, so the two look almost identical on the surface. This is the Neighbor family, the attack that broke Logistic Regression hardest.
A weaker answer generated by prompting a strong model with the instruction. Fluent and confident, but not actually better.
A polished variation crafted to carry the markers of quality (length, structure, vocabulary) with none of the substance.
Handwritten deceptive answers. The hardest and most creative cases, built by people specifically to exploit the blind spots automated generation misses.
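One way to measure how a judge fares against each family is a per-family false preference rate. The sketch below assumes that "false preference" means the judge scores the worse answer strictly above the better one. The helper names, the deliberately naive length-based judge, and the toy pairs are all hypothetical.

```python
from collections import defaultdict


def false_preference_rate(pairs, score):
    """Fraction of pairs, per family, where the judge prefers the worse answer.

    pairs: iterable of (family, better_text, worse_text)
    score: callable mapping a response string to a float (higher = preferred)
    """
    wrong = defaultdict(int)
    total = defaultdict(int)
    for family, better, worse in pairs:
        total[family] += 1
        if score(worse) > score(better):
            wrong[family] += 1
    return {family: wrong[family] / total[family] for family in total}


def length_score(text):
    """A deliberately naive judge: more words looks better."""
    return len(text.split())


# Toy pairs, invented for illustration only.
pairs = [
    ("Neighbor", "Paris is the capital of France.",
     "Paris is, broadly speaking, one of the major cities of France."),
    ("Neighbor", "2 + 2 = 4, because adding two pairs gives four.",
     "2 + 2 = 5."),
    ("Manual", "Water boils at 100 C at sea level.",
     "Water boils at 90 C."),
]
rates = false_preference_rate(pairs, length_score)
```

Swapping `length_score` for a trained model's scoring function gives the same breakdown for a real judge.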
The headline
All three models scored near 0.50 AUC-ROC on clean evaluation, which is chance level for a binary choice. The real separation only shows up once you attack them, and that is the argument of the whole project.
| Model | Clean AUC-ROC | Worst case under attack | Verdict |
|---|---|---|---|
| Logistic Regression | ~0.50 | 78.4% false preference (Neighbor) | fragile |
| XGBoost | ~0.50 | ~2.4% overall degradation | robust |
| DeBERTa-v3-small | 0.500 | GPU only, see notebooks | excluded from live demo |
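For readers who want the clean-evaluation number spelled out, AUC-ROC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition in plain Python. The function name `auc_roc` is my own, and a library routine would normally be used on real data.

```python
def auc_roc(labels, scores):
    """AUC-ROC via pairwise comparison of positive and negative scores.

    labels: sequence of 0/1 ground-truth labels
    scores: sequence of model scores, same length as labels
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


perfect = auc_roc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
uninformative = auc_roc([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5])
inverted = auc_roc([1, 1, 0, 0], [0.1, 0.2, 0.8, 0.9])
```

A judge that gives every response the same score lands at exactly 0.5, which is why a clean AUC near 0.50 says little on its own.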
What the data said
Before I touched a single label, I wanted to know what the response space naturally looks like. Three clustering algorithms over the full dataset all landed on the same four archetypes, which told me the structure was real. Then I checked which one humans actually preferred, and it surprised me.
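Checking that different clustering algorithms "landed on the same archetypes" needs a comparison that ignores how each algorithm happens to number its clusters. The Rand index is one such measure: the fraction of item pairs that both labelings treat the same way. This is an illustrative sketch, not necessarily the check used in the notebooks, and `rand_index` is a name I chose.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree.

    A pair counts as agreement when both clusterings put the two items
    together, or both put them apart. Cluster ids themselves are ignored.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same items")
    if len(labels_a) < 2:
        raise ValueError("need at least two items to form a pair")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
        total += 1
    return agree / total


# Same grouping with renamed cluster ids: full agreement.
renamed = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A grouping that splits every original cluster: low agreement.
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The plain Rand index is not corrected for chance, so on real data a chance-adjusted variant would be the safer choice.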
Share of responses in each cluster that a human chose over its alternative.
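That share can be computed with a simple tally. The sketch below assumes each response has already been assigned a cluster id and a flag saying whether the human rater chose it. The cluster names "A" and "B" and the rows are placeholders, not results from the project.

```python
from collections import Counter


def chosen_share_by_cluster(rows):
    """Share of responses in each cluster that the human rater chose.

    rows: iterable of (cluster_id, was_chosen) where was_chosen is truthy
          when the response won its pairwise comparison.
    """
    chosen = Counter()
    total = Counter()
    for cluster, was_chosen in rows:
        total[cluster] += 1
        if was_chosen:
            chosen[cluster] += 1
    return {cluster: chosen[cluster] / total[cluster] for cluster in total}


# Placeholder rows for illustration only.
rows = [
    ("A", True), ("A", True), ("A", False), ("A", False),
    ("B", True), ("B", False), ("B", False), ("B", False),
]
shares = chosen_share_by_cluster(rows)
```

Because every comparison has exactly one winner, the overall share is 50% by construction, so a cluster is interesting only to the extent its share moves away from that.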
Takeaways
The full pipeline, all three model architectures, the clustering notebooks, and the LLM-Bar stress test are on GitHub.