Inference-Lens - LLM Evaluator Reliability Research

Overview

What does this project actually do?

Every LLM pipeline has a judge somewhere - a model that grades other models' outputs. But what happens when that judge itself gets fooled? Inference-Lens builds scoring models, then attacks them with adversarially constructed inputs to measure exactly how much their judgment degrades.

01

HH-RLHF Data

170K+ human preference pairs from Anthropic

→

02

Feature Eng.

Readability, ROUGE-L, lexical diversity, embeddings

→

03

3 Models

LogReg · XGBoost · DeBERTa-v3

→

04

LLM-Bar Attack

419 adversarial pairs across 4 perturbation types

→

05

Degradation Report

Per-model, per-category breakdown

Human Preference Data

Anthropic's HH-RLHF dataset contains 170K+ real human comparisons: two AI responses side by side, one labeled "chosen," one "rejected." These pairs teach our models what "good" looks like.

Unsupervised Archetypes

Before training, we clustered all responses with K-Means, DBSCAN, and hierarchical clustering, uncovering 4 distinct quality archetypes that naturally emerge from response style alone.

Adversarial Stress-Test

LLM-Bar (EMNLP 2023) provides adversarial rewrites designed to fool automated judges. We measured how much each scoring model degrades when exposed to these adversarial inputs.

Prior Research

This extends prior work on Multi-Agent Inference Reliability. If the judge is miscalibrated, the damage amplifies through the entire pipeline.

Supervised Learning

Three model families. One question.

We trained three scoring models on the same feature set: an interpretable baseline, a tree ensemble, and a fine-tuned transformer. We then compared accuracy, robustness, and how each behaves under adversarial pressure.

Baseline

Logistic Regression

AUC-ROC 0.511

F1 Macro 0.508

Accuracy 50.8%

L2-regularized · 5-fold CV · interpretable coefficients

Ensemble

XGBoost

AUC-ROC 0.518

F1 Macro 0.513

Accuracy 51.3%

Gradient-boosted · hist tree method · trained on Colab

Transformer

DeBERTa-v3-small

AUC-ROC 0.500

F1 Macro 0.333

Accuracy 49.9%

Fine-tuned transformer · 23K train steps · Colab T4 GPU

Model Performance Comparison

AUC-ROC and F1 Macro across all three model families. Note: all models hover near chance (0.5). We report that as a finding in its own right, not a failure: human preference is deeply subjective, and the surface features used here carry very little of it.
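One way to read the DeBERTa-v3-small row: an F1 Macro of 0.333 at roughly 50% accuracy is what a classifier produces when it predicts the same class for every pair on balanced labels. The sketch below works through that arithmetic in plain Python. The balanced toy labels and the always-predict-1 behaviour are assumptions for illustration; the source does not state that the model collapsed this way.

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never correct scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Assumed toy data: perfectly balanced labels, constant prediction.
y_true = [0, 1] * 500
y_pred = [1] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                            # 0.5
print(round(macro_f1(y_true, y_pred), 3))  # 0.333
```

The predicted class gets F1 = 2/3 and the ignored class gets 0, so the macro average is 1/3. Accuracy alone would hide this; the gap between accuracy and F1 Macro is what makes it visible.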

Adversarial Evaluation

Then we attacked the models.

LLM-Bar is a benchmark from EMNLP 2023 built specifically to fool automated evaluators. It contains 419 adversarial response pairs across 4 attack categories, including Neighbor, GPTInst, GPTOut, and Manual. We ran every trained model against it zero-shot, with no additional training on LLM-Bar data.

Accuracy by Attack Category

How often each model picks the truly better response under attack.

False Preference Rate

How often models chose the adversarially crafted (worse) response.

Accuracy Degradation Table

Positive = accuracy dropped vs. standard benchmark. Negative = model improved on this category.

Model	Category	Accuracy	False Preference Rate	Degradation	Risk
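The table columns can be derived from per-category counts. The sketch below is a minimal version under two assumptions that the source does not spell out: every pair receives a forced choice (so False Preference Rate is 1 minus accuracy), and Degradation is baseline accuracy minus attacked accuracy, matching the sign convention stated above. The function name and the example numbers are hypothetical, not project results.

```python
def attack_metrics(n_correct, n_total, baseline_accuracy):
    """Per-category metrics for one model on one attack category."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    accuracy = n_correct / n_total
    return {
        "accuracy": accuracy,
        # Assumes a forced binary choice: every miss is a false preference.
        "false_preference_rate": 1.0 - accuracy,
        # Positive = accuracy dropped relative to the baseline.
        "degradation": baseline_accuracy - accuracy,
    }


# Hypothetical numbers for illustration only.
row = attack_metrics(n_correct=30, n_total=100, baseline_accuracy=0.50)
print({k: round(v, 3) for k, v in row.items()})
```

The Risk column is not reproduced here because the source does not define how it is assigned.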

Unsupervised Discovery

The 4 archetypes of LLM responses.

Before any supervised training, we clustered 182K responses by style: length, readability, vocabulary richness, ROUGE-L. Four distinct personality types emerged naturally from the data.
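To make the style features concrete, here is a minimal stdlib-only sketch of two of them: type-token ratio (vocabulary richness) and ROUGE-L F1 via longest common subsequence. The tokenizer, the function names, and the choice to compute ROUGE-L between the two responses of a pair are all assumptions; the source does not say which reference text ROUGE-L was measured against or which library was used.

```python
import re


def tokenize(text):
    """Lowercase word tokenizer (an assumption, not the project's)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def type_token_ratio(tokens):
    """Unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence, O(len(a) * len(b))."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over token lists."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy pair, not taken from HH-RLHF.
a = tokenize("The cat sat on the mat")
b = tokenize("The cat lay on a mat")
print(len(a), round(type_token_ratio(a), 3), round(rouge_l_f1(a, b), 3))
```

Each response then becomes a short numeric vector (length, readability, type-token ratio, ROUGE-L) that a clustering algorithm can group.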

43,289 responses · 23.8%

Ultra-Short Minimal

Responses that say exactly what's needed and nothing more. Average 17 tokens. High type-token ratio.

Avg length 17 tokens

Flesch score 85.4

Chosen rate 50.1%

70,397 responses · 38.7%

Medium-Length Mixed

The largest cluster - a balanced, all-purpose style. 39 tokens on average. The statistical center of gravity.

Avg length 39 tokens

Flesch score 70.0

Chosen rate 50.3%

18,192 responses · 10.0%

Conversational Engaging

The warmest archetype. Highest chosen rate in the dataset at 51.6%. Humans lean toward this style.

Avg length 42 tokens

Flesch score 72.7

Chosen rate 51.6%

50,198 responses · 27.6%

Verbose Inconsistent

The longest responses but the lowest chosen rate. Verbose does not mean better. Humans reject this style most often.

Avg length 48 tokens

Flesch score 74.9

Chosen rate 48.9%

Archetype Size Distribution

How many responses belong to each cluster, and what percentage were chosen by humans.
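The shares and chosen rates on the archetype cards can be checked with a few lines of arithmetic. The counts and rates below are copied from the cards above; the size-weighted chosen rate and the best-to-worst gap are derived here for illustration and are not figures reported by the project.

```python
# (response count, chosen rate in %) per archetype, from the cards above.
archetypes = {
    "Ultra-Short Minimal": (43289, 50.1),
    "Medium-Length Mixed": (70397, 50.3),
    "Conversational Engaging": (18192, 51.6),
    "Verbose Inconsistent": (50198, 48.9),
}

total = sum(count for count, _ in archetypes.values())
shares = {
    name: round(100 * count / total, 1)
    for name, (count, _) in archetypes.items()
}

# Size-weighted chosen rate across all clusters.
weighted_rate = sum(c * r for c, r in archetypes.values()) / total

# Gap between the most and least preferred archetype, in points.
rates = [rate for _, rate in archetypes.values()]
gap = round(max(rates) - min(rates), 1)

print(total)                    # 182076
print(shares)
print(round(weighted_rate, 1))  # 50.0
print(gap)                      # 2.7
```

The cluster sizes sum to the 182K responses quoted above, and the whole spread between archetypes is 2.7 percentage points, which is worth keeping in mind when reading the preference differences.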

Key Findings

What Inference-Lens actually found.

Four takeaways that matter if you're building LLM evaluation pipelines.

1

Human preference is near-random at the feature level

All three models topped out at ~0.518 AUC-ROC. No feature set we extracted - length, readability, vocabulary richness, ROUGE-L - is sufficient to reliably predict human preference. The signal is in the semantics, not the surface form.

2

Logistic Regression collapsed worst under adversarial attack

On the Neighbor perturbation category, LogReg hit a 78.4% false preference rate, meaning it picked the adversarially crafted (worse) response nearly 4 out of 5 times. XGBoost degraded far less, holding near 50% across most categories.

3

Verbose responses are rejected more, despite being longer

The Verbose Inconsistent archetype had the lowest chosen rate (48.9%) despite the highest average token length (48 tokens). In this data, length does not buy preference: the Conversational Engaging archetype (42 tokens on average) leads at 51.6%, a gap of 2.7 percentage points over the longest responses.

4

GPT-generated adversarial inputs are harder than human-crafted ones

The GPTInst and GPTOut categories produced more challenging adversarial pairs than Manual rewrites. LogReg accuracy dropped to 23.9% on GPTInst. In our runs, LLM-generated adversarial content was more effective at fooling automated evaluators than human-written adversarial content.

About

Built by Kalyan Venkatesh

This project extends prior published research on LLM inference reliability, applying the same adversarial pressure lens to automated evaluators instead of model pipelines.
