How to Get Your LLM to Generate Challenging Problems for Evaluation
Abstract
CHASE is a framework using Large Language Models to generate high-quality, challenging synthetic problems for evaluation in diverse domains without human involvement.
The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems. In this work, we introduce CHASE, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: (1) document-based question answering, (2) repository-level code completion, and (3) math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy, thereby demonstrating the effectiveness of our framework at generating challenging problems. We publicly release our benchmarks and code.
Community
Presenting ✨ 𝐂𝐇𝐀𝐒𝐄: 𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐧𝐠 𝐜𝐡𝐚𝐥𝐥𝐞𝐧𝐠𝐢𝐧𝐠 𝐬𝐲𝐧𝐭𝐡𝐞𝐭𝐢𝐜 𝐝𝐚𝐭𝐚 𝐟𝐨𝐫 𝐞𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧 ✨
Why synthetic data for evaluation?
- Having humans create “hard” problems is expensive (and may soon hit a limit!)
- Annotating long-context data is impractical for humans
- Other benefits: synthetic data is scalable and renewable, and it mitigates contamination concerns
𝐂𝐇𝐀𝐒𝐄 automatically generates challenging evaluation problems across 3 domains:
- 𝐂𝐇𝐀𝐒𝐄-𝐐𝐀: Long-context question answering
- 𝐂𝐇𝐀𝐒𝐄-𝐂𝐨𝐝𝐞: Repo-level code generation
- 𝐂𝐇𝐀𝐒𝐄-𝐌𝐚𝐭𝐡: Math reasoning
𝐂𝐇𝐀𝐒𝐄 uses 2 simple ideas:
- Bottom-up creation of complex context by “hiding” components of the reasoning process
- Decomposing the generation pipeline into simpler, “soft-verifiable” sub-tasks
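As a rough illustration of how these two ideas fit together, here is a toy sketch in Python. It is not the CHASE implementation: `propose_step`, `verify_step`, and `build_problem` are hypothetical names, and the two "LLM calls" are replaced by deterministic stand-ins so that the example runs offline. The problem is built one reasoning hop at a time, each hop is checked on its own before it is accepted, and the intermediate values stay hidden, so the solver only sees the shuffled relations.

```python
import operator
import random
from dataclasses import dataclass

# Toy operations used to chain one quantity to the next.
OPS = {"plus": operator.add, "times": operator.mul}


@dataclass
class Step:
    text: str   # sentence added to the problem context
    value: int  # hidden intermediate result, never shown to the solver


def propose_step(prev_name, prev_value, name, rng):
    """Hypothetical stand-in for a generator LLM call: add one reasoning hop."""
    op_word = rng.choice(sorted(OPS))
    operand = rng.randint(2, 9)
    value = OPS[op_word](prev_value, operand)
    return Step(f"{name} is {prev_name} {op_word} {operand}.", value)


def verify_step(step, prev_value):
    """Hypothetical stand-in for a verifier: re-derive the hop from its text alone."""
    words = step.text.rstrip(".").split()
    op_word, operand = words[-2], int(words[-1])
    return OPS[op_word](prev_value, operand) == step.value


def build_problem(depth, seed=0, max_retries=3):
    """Build a problem bottom-up; the answer is known by construction."""
    rng = random.Random(seed)
    value = rng.randint(1, 9)
    context = [f"x0 is {value}."]
    for i in range(1, depth + 1):
        for _ in range(max_retries):
            step = propose_step(f"x{i - 1}", value, f"x{i}", rng)
            if verify_step(step, value):
                break  # accept only hops that pass their own check
        else:
            raise RuntimeError(f"could not verify step {i}")
        context.append(step.text)
        value = step.value
    rng.shuffle(context)  # scatter the hops so the chain must be reconstructed
    return " ".join(context), f"What is x{depth}?", value


if __name__ == "__main__":
    context, question, answer = build_problem(depth=4, seed=1)
    print(context)
    print(question)
    print("hidden answer:", answer)
```

In a real pipeline the stand-ins would be separate model calls, and the check would be "soft" (another model's judgment) rather than exact arithmetic. The point of the sketch is only the structure: small sub-tasks that can each be verified, composed into a problem that is harder than any single step.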
Results:
- SOTA LLMs achieve 40-60% accuracy
- 𝐂𝐇𝐀𝐒𝐄 distinguishes well between models (unlike standard benchmarks such as GSM8K, where models perform similarly)
- While today's LLMs support context sizes of 128k-1M tokens, 𝐂𝐇𝐀𝐒𝐄 shows they struggle to reason even at a context size of ~50k tokens
𝐍𝐨𝐭𝐞: Our work is a preliminary exploration of automatically generating high-quality, challenging benchmarks for LLMs. We discuss concrete limitations and the large scope for future work in the paper.
Links:
Data: tinyurl.com/chase-data
Code: https://github.com/McGill-NLP/CHASE