PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models
Abstract
Existing benchmarks for frontier models often test specialized, "PhD-level" knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models; however, correct solutions are easy to verify, and models' mistakes are easy to spot. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models that are on par with it on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with "I give up" before providing an answer that it knows is wrong. R1 can also be remarkably "uncertain" in its output, and in rare cases it does not "finish thinking," which suggests the need for an inference-time technique to "wrap up" before the context window limit is reached. We also quantify the effectiveness of reasoning longer with R1 and Gemini Thinking to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.
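As a rough illustration of the failure-mode analysis the abstract describes, the sketch below flags reasoning traces that concede with "I give up" or that never finish thinking before running out of context. This is a minimal sketch, not the paper's pipeline: the transcript field names, the extra marker phrase, and the token-budget heuristic are illustrative assumptions.

```python
import re

# Hypothetical failure-mode tagger for reasoning transcripts.
# Assumes each trace is a dict with "reasoning" (the chain of thought)
# and "answer" (the final answer text); these field names are assumptions.
GIVE_UP_MARKERS = [r"\bI give up\b", r"\bI'll just guess\b"]  # second marker is an assumption

def classify_trace(trace: dict, max_context_tokens: int = 32_768) -> dict:
    reasoning = trace.get("reasoning", "")
    answer = trace.get("answer", "")

    # Explicit concession before answering ("I give up").
    gives_up = any(re.search(p, reasoning, re.IGNORECASE) for p in GIVE_UP_MARKERS)

    # Crude proxy for "did not finish thinking": no final answer and a trace
    # near the context limit. A real analysis would use the model's tokenizer.
    approx_tokens = len(reasoning.split())
    did_not_finish = answer.strip() == "" and approx_tokens >= 0.95 * max_context_tokens

    return {"gives_up": gives_up, "did_not_finish": did_not_finish}

# Example on a made-up transcript:
print(classify_trace({"reasoning": "...I give up. The answer is probably LEMON.",
                      "answer": "LEMON"}))
```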
Community
This paper presents a new benchmark for reasoning models that reveals capability gaps and failure modes that are not evident in existing benchmarks. For example, we find that o1 and o3-mini-high are significantly better at verbal reasoning than other reasoning models.
Space to explore model results: https://huggingface.co/spaces/nuprl/puzzle-reasoning-challenge
thanks for an interesting paper
additional kudos for naming :D
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- HARP: A challenging human-annotated math reasoning benchmark (2024)
- Token-Hungry, Yet Precise: DeepSeek R1 Highlights the Need for Multi-Step Reasoning Over Speed in MATH (2025)
- LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion (2025)
- UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models (2025)
- PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models (2025)
- Hint Marginalization for Improved Reasoning in Large Language Models (2024)
- ProcessBench: Identifying Process Errors in Mathematical Reasoning (2024)
Thank you for the recommendations!