May 2026

Inference-Lens

End-to-end LLM output quality scoring with evaluator reliability stress-testing under adversarial conditions.

"Can we trust the systems we use to evaluate LLMs?"

170K+ human preference pairs
419 adversarial stress-test inputs
3 scoring model families
4 response archetypes discovered
78.4% peak false-preference rate

Overview

What does this project actually do?

Every LLM pipeline has a judge somewhere - a model that grades other models' outputs. But what happens when that judge itself gets fooled? Inference-Lens builds scoring models, then attacks them with adversarially constructed inputs to measure exactly how much their judgment degrades.

01. HH-RLHF Data: 170K+ human preference pairs from Anthropic
02. Feature Engineering: readability, ROUGE-L, lexical diversity, embeddings
03. 3 Models: LogReg · XGBoost · DeBERTa-v3
04. LLM-Bar Attack: 419 adversarial pairs across 4 perturbation types
05. Degradation Report: per-model, per-category breakdown

Human Preference Data

Anthropic's HH-RLHF dataset contains 170K+ real human comparisons: two AI responses side by side, one labeled "chosen," one "rejected." These pairs teach our models what "good" looks like.
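
As a rough illustration of what one pair looks like: in the public release of HH-RLHF, each record carries a `chosen` and a `rejected` transcript, with turns marked by `Human:` and `Assistant:`. The sketch below pulls the final assistant reply out of each side. The sample record and the helper name `final_reply` are invented for illustration and are not taken from the project's code.

```python
# Illustrative only: this record is invented, not drawn from the dataset.
ASSISTANT_TAG = "\n\nAssistant:"

def final_reply(transcript: str) -> str:
    """Return the text of the last assistant turn in a transcript."""
    # rpartition splits on the last occurrence of the tag.
    _, tag, reply = transcript.rpartition(ASSISTANT_TAG)
    if not tag:
        raise ValueError("transcript has no assistant turn")
    return reply.strip()

record = {
    "chosen": "\n\nHuman: How do I boil an egg?\n\nAssistant: Simmer it for about ten minutes.",
    "rejected": "\n\nHuman: How do I boil an egg?\n\nAssistant: Eggs are a kind of food.",
}

chosen_reply = final_reply(record["chosen"])
rejected_reply = final_reply(record["rejected"])
print(chosen_reply)
print(rejected_reply)
```

Comparing only the final replies is a design choice for this sketch; the shared conversation history is identical on both sides of a pair, so it carries no preference signal on its own.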

Unsupervised Archetypes

Before training, we clustered all responses with K-Means, DBSCAN, and hierarchical clustering, uncovering 4 distinct quality archetypes that naturally emerge from response style alone.
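
To make the clustering step concrete, here is a minimal K-Means sketch in plain Python. It covers only one of the three algorithms named above, seeds the centroids with the first k points for determinism, and runs on invented (token count, Flesch score) rows; the project's actual feature matrix, scaling, and library choice are not shown on this page, so treat all of that as assumption.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    # Assumption: the first k points seed the centroids, for determinism.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels

# Hypothetical (token_count, flesch_score) rows, invented for illustration.
responses = [(15, 86), (40, 70), (18, 84), (38, 71), (16, 85), (41, 69)]
centroids, labels = kmeans(responses, k=2)
print(labels)
print(centroids)
```

With real features on very different scales, standardizing each column before clustering would usually be needed so that no single feature dominates the distance.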

Adversarial Stress-Test

LLM-Bar (EMNLP 2023) provides adversarial rewrites designed to fool automated judges. We measured how much each scoring model degrades when exposed to these adversarial inputs.

Prior Research

This project extends prior work on Multi-Agent Inference Reliability. If the judge is miscalibrated, its errors carry through and compound across the entire pipeline.


Supervised Learning

Three model families. One question.

We trained three scoring models on the same feature set: an interpretable baseline, a tree ensemble, and a fine-tuned transformer. We then compared accuracy, robustness, and how each behaves under adversarial pressure.

Baseline: Logistic Regression
AUC-ROC 0.511 · F1 Macro 0.508 · Accuracy 50.8%
L2-regularized · 5-fold CV · interpretable coefficients

Ensemble: XGBoost
AUC-ROC 0.518 · F1 Macro 0.513 · Accuracy 51.3%
Gradient-boosted · hist tree method · trained on Colab

Transformer: DeBERTa-v3-small
AUC-ROC 0.500 · F1 Macro 0.333 · Accuracy 49.9%
Fine-tuned transformer · 23K train steps · Colab T4 GPU

Model Performance Comparison
AUC-ROC and F1 Macro across all three model families. Note that all three models hover near chance (0.5). We report this as a finding rather than a failure: human preference is deeply subjective, and none of these models could separate chosen from rejected responses.
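
One detail in the DeBERTa numbers is worth working through. A macro F1 of 0.333 alongside roughly 50% accuracy is exactly what a binary classifier scores when it predicts the same class for every input on balanced data. The sketch below checks that arithmetic. It assumes balanced toy labels and is not a claim about what the DeBERTa run actually did.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class; 0.0 when there are no true positives, false positives, or false negatives."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Balanced toy labels (an assumption) and a model that always answers 1.
y_true = [0, 1] * 50
y_pred = [1] * 100

score = macro_f1(y_true, y_pred)
print(round(score, 3))  # 0.333
```

The predicted class gets an F1 of 2/3 and the never-predicted class gets 0, so the macro average is 1/3.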

Adversarial Evaluation

Then we attacked the models.

LLM-Bar is a benchmark from EMNLP 2023 built specifically to fool automated evaluators. It contains 419 adversarial response pairs across 4 attack categories. We ran every trained model against it zero-shot.

Accuracy by Attack Category
How often each model picks the truly better response under attack.
False Preference Rate
How often models chose the adversarially crafted (worse) response.

Accuracy Degradation Table
Positive = accuracy dropped vs. standard benchmark. Negative = model improved on this category.
Model Category Accuracy False Preference Rate Degradation Risk
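
The three quantities in this section can be sketched as follows. The sketch assumes each scoring model returns one scalar per response and that the judge prefers whichever response scores higher. It follows the sign convention stated above, where positive degradation means accuracy dropped. The function name, the tuple layout, and the toy scores are all invented for illustration and are not the project's results.

```python
from collections import defaultdict

def category_report(results, baseline_accuracy):
    """results: iterable of (category, score_better, score_worse) tuples."""
    counts = defaultdict(lambda: {"n": 0, "correct": 0, "fooled": 0})
    for category, s_better, s_worse in results:
        row = counts[category]
        row["n"] += 1
        row["correct"] += s_better > s_worse  # judge picked the better response
        row["fooled"] += s_worse > s_better   # judge picked the adversarial one
    report = {}
    for category, row in counts.items():
        accuracy = row["correct"] / row["n"]
        report[category] = {
            "accuracy": accuracy,
            "false_preference_rate": row["fooled"] / row["n"],
            # Positive means accuracy dropped relative to the baseline.
            "degradation": baseline_accuracy - accuracy,
        }
    return report

# Invented scores for illustration; not the project's results.
toy_results = [
    ("Neighbor", 0.40, 0.70),
    ("Neighbor", 0.35, 0.60),
    ("Neighbor", 0.80, 0.20),
    ("Neighbor", 0.30, 0.90),
    ("Manual", 0.75, 0.25),
    ("Manual", 0.45, 0.55),
]
report = category_report(toy_results, baseline_accuracy=0.5)
for category, metrics in report.items():
    print(category, metrics)
```

Tied scores count as neither correct nor fooled here, so accuracy and false preference rate only sum to 1 when there are no ties.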

Unsupervised Discovery

The 4 archetypes of LLM responses.

Before any supervised training, we clustered 182K responses by style: length, readability, vocabulary richness, ROUGE-L. Four distinct personality types emerged naturally from the data.
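
Two of the style features named above are simple enough to sketch directly. The code below computes a type-token ratio (one common measure of vocabulary richness) and a ROUGE-L F1 score based on the longest common subsequence. Whitespace tokenization, the function names, and the choice of reference text are assumptions made for this sketch; the page does not say which text each response was compared against.

```python
def type_token_ratio(tokens):
    """Unique tokens divided by total tokens; 0.0 for empty input."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)

# Whitespace tokenization is a simplification chosen for this sketch.
response = "the cat sat on the mat".split()
reference = "the cat lay on the mat".split()

features = {
    "length": len(response),
    "type_token_ratio": type_token_ratio(response),
    "rouge_l": rouge_l_f1(response, reference),
}
print(features)
```

Readability (the Flesch score) is left out because it needs a syllable counter, which is heuristic for English and would distract from the point here.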

43,289 responses · 23.8%
Ultra-Short Minimal
Responses that say exactly what's needed and nothing more. Average 17 tokens. High type-token ratio.
Avg length: 17 tokens · Flesch score: 85.4 · Chosen rate: 50.1%
70,397 responses · 38.7%
Medium-Length Mixed
The largest cluster - a balanced, all-purpose style. 39 tokens on average. The statistical center of gravity.
Avg length: 39 tokens · Flesch score: 70.0 · Chosen rate: 50.3%
18,192 responses · 10.0%
Conversational Engaging
The warmest archetype. Highest chosen rate in the dataset at 51.6%. Humans lean toward this style.
Avg length: 42 tokens · Flesch score: 72.7 · Chosen rate: 51.6%
50,198 responses · 27.6%
Verbose Inconsistent
The longest responses but the lowest chosen rate. Verbose does not mean better. Humans reject this style most often.
Avg length: 48 tokens · Flesch score: 74.9 · Chosen rate: 48.9%

Archetype Size Distribution
How many responses belong to each cluster, and what percentage were chosen by humans.

Key Findings

What Inference-Lens actually found.

Four takeaways that matter if you're building LLM evaluation pipelines.

1. Human preference is near-random at the feature level

The best of the three models reached only 0.518 AUC-ROC. None of the surface features we extracted (length, readability, vocabulary richness, ROUGE-L) reliably predicts human preference. The signal appears to lie in the semantics, not the surface form.
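
For readers less familiar with the metric: AUC-ROC can be read as the probability that a randomly drawn positive example gets a higher score than a randomly drawn negative one. Assuming "chosen" is the positive class, 0.518 means the model ranks the chosen response above a rejected one only about 52% of the time. The sketch below computes AUC by direct pairwise comparison on invented scores. It is quadratic in the number of examples and is meant for illustration, not for a dataset of this size.

```python
def auc_roc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Invented labels and scores for illustration.
labels = [1, 0, 1, 0, 1, 0]
separating = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3]  # every positive outranks every negative
uninformative = [0.5] * 6                     # same score for everything

print(auc_roc(labels, separating))     # 1.0
print(auc_roc(labels, uninformative))  # 0.5
```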

2. Logistic Regression collapsed worst under adversarial attack

On the Neighbor perturbation category, LogReg hit a 78.4% false preference rate, meaning it picked the adversarially crafted (worse) response nearly 4 out of 5 times. XGBoost degraded far less, holding near 50% across most categories.

3. Verbose responses are rejected more, despite being longer

The Verbose Inconsistent archetype had the lowest chosen rate (48.9%) despite the highest average length (48 tokens), while Conversational Engaging responses (42 tokens on average) had the highest (51.6%). The gap is under three percentage points, but it runs against the assumption that longer means better: in this data, extra length did not earn preference.
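
Because the chosen-rate gaps are small, a quick sanity check on the reported figures is useful. The sketch below takes the cluster sizes and chosen rates from the archetype cards, computes the size-weighted overall chosen rate, and then asks how many standard errors each archetype sits from 50%. It treats responses as independent draws, which is a simplification (chosen and rejected responses come in pairs), so read the z-scores as a rough guide only.

```python
import math

# (archetype, responses, chosen rate) as reported in the archetype cards.
archetypes = [
    ("Ultra-Short Minimal", 43_289, 0.501),
    ("Medium-Length Mixed", 70_397, 0.503),
    ("Conversational Engaging", 18_192, 0.516),
    ("Verbose Inconsistent", 50_198, 0.489),
]

total = sum(n for _, n, _ in archetypes)
overall = sum(n * rate for _, n, rate in archetypes) / total
print(f"total responses: {total}")
print(f"overall chosen rate: {overall:.3f}")

z_scores = {}
for name, n, rate in archetypes:
    # Standard error of a proportion under the null hypothesis p = 0.5.
    standard_error = math.sqrt(0.25 / n)
    z_scores[name] = (rate - 0.5) / standard_error
    print(f"{name}: z = {z_scores[name]:+.1f}")
```

Under these assumptions the weighted overall rate comes out at about 50%, as it should when every pair contributes one chosen and one rejected response. Conversational Engaging and Verbose Inconsistent each sit more than four standard errors from 50%, while the other two archetypes are within two.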

4. GPT-generated adversarial inputs are harder than human-crafted ones

The GPTInst and GPTOut categories proved harder for our models than the Manual rewrites; LogReg accuracy dropped to 23.9% on GPTInst. In these runs, LLM-generated adversarial content fooled automated evaluators more effectively than human-written adversarial content.


About

Built by Kalyan Venkatesh

This project extends prior published research on LLM inference reliability, applying the same adversarial pressure lens to automated evaluators instead of model pipelines.
