Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon
Abstract
Large language models (LLMs) often appear to excel on public benchmarks, but these high scores may mask an overreliance on dataset-specific surface cues rather than true language understanding. We introduce the Chameleon Benchmark Overfit Detector (C-BOD), a meta-evaluation framework that systematically distorts benchmark prompts via a parametric transformation and detects overfitting in LLMs. By rephrasing inputs while preserving their semantic content and labels, C-BOD exposes whether a model's performance is driven by memorized patterns. Evaluated on the MMLU benchmark with 26 leading LLMs, our method reveals an average performance degradation of 2.15% under modest perturbations, with 20 of the 26 models exhibiting statistically significant differences. Notably, models with higher baseline accuracy show larger performance drops under perturbation, and larger LLMs tend to be more sensitive to rephrasings, indicating that both groups may over-rely on fixed prompt patterns. In contrast, the Llama family and models with lower baseline accuracy show insignificant degradation, suggesting reduced dependency on superficial cues. Moreover, C-BOD's dataset- and model-agnostic design allows easy integration into training pipelines to promote more robust language understanding. Our findings challenge the community to look beyond leaderboard scores and to prioritize resilience and generalization in LLM evaluation.
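To make the evaluation loop concrete, below is a minimal Python sketch of the comparison C-BOD performs: score a model on original and rephrased versions of each benchmark item, then test whether the accuracy gap is statistically significant. The `paraphrase` and `ask_model` functions and the benchmark item format are hypothetical placeholders, and McNemar's test is used here as one reasonable choice for paired binary outcomes, not as a claim about the paper's exact statistical procedure.

```python
# Minimal sketch of a C-BOD-style overfit check (illustrative, not the authors' code).
from statsmodels.stats.contingency_tables import mcnemar

def evaluate_cbod(model, benchmark, paraphrase, ask_model):
    """Compare a model's accuracy on original vs. rephrased prompts."""
    orig_correct = pert_correct = 0
    b = c = 0  # b: correct only on original prompt, c: correct only on rephrased prompt
    for item in benchmark:  # each item assumed to be {"prompt": ..., "label": ...}
        rephrased = paraphrase(item["prompt"])        # meaning- and label-preserving rewrite
        o = ask_model(model, item["prompt"]) == item["label"]
        p = ask_model(model, rephrased) == item["label"]
        orig_correct += o
        pert_correct += p
        b += o and not p
        c += p and not o
    n = len(benchmark)
    # McNemar's test on the discordant pairs (b, c) asks whether the
    # original-vs-rephrased accuracy gap is statistically significant.
    test = mcnemar([[0, b], [c, 0]], exact=False, correction=True)
    return {
        "orig_acc": orig_correct / n,
        "pert_acc": pert_correct / n,
        "p_value": test.pvalue,
    }
```

Under this sketch, a model would be flagged as likely overfitting to the benchmark's surface form when `orig_acc - pert_acc` is positive and `p_value` falls below the chosen significance level.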
Community
This is a great idea. UP!!
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy (2025)
- Breaking Focus: Contextual Distraction Curse in Large Language Models (2025)
- The Order Effect: Investigating Prompt Sensitivity in Closed-Source LLMs (2025)
- Preference Leakage: A Contamination Problem in LLM-as-a-judge (2025)
- QUENCH: Measuring the gap between Indic and Non-Indic Contextual General Reasoning in LLMs (2024)
- From Superficial Patterns to Semantic Understanding: Fine-Tuning Language Models on Contrast Sets (2025)
- DateLogicQA: Benchmarking Temporal Biases in Large Language Models (2024)