💪 Performance
| Rank | Base LLM | Agent System | Slow Think | Open Source | Avg. Score | Avg. Eff. | Avg. Steps | MC | CA | OA |
|---|---|---|---|---|---|---|---|---|---|---|
| 🏆 | Ph.D. Experts | Human | - | - | 83.50 | 75.48 | 6.52 | 81.05 | 81.75 | 85.51 |
| 1 | Gemini 2.5 Pro | Multi-Agent | Yes | No | 38.78 | 45.39 | 8.58 | 45.68 | 35.55 | 36.38 |
| 2 | OpenAI o4-mini-high | Multi-Agent | Yes | No | 37.37 | 43.41 | 8.30 | 42.48 | 34.90 | 34.59 |
| 3 | Claude Sonnet 4 | Multi-Agent | Yes | No | 36.73 | 39.13 | 8.92 | 42.76 | 34.76 | 34.85 |
| 4 | Gemini 2.5 Pro | Single-Agent | Yes | No | 36.10 | 41.20 | 9.81 | 40.36 | 34.81 | 35.39 |
| 5 | GPT-4.1 | Multi-Agent | No | No | 34.57 | 34.35 | 7.94 | 38.04 | 35.77 | 32.91 |
| 6 | Qwen3-235B-Thinking | Multi-Agent | Yes | Yes | 34.31 | 33.79 | 10.48 | 36.41 | 35.59 | 30.87 |
| 7 | Claude Sonnet 4 | Single-Agent | Yes | No | 34.18 | 37.34 | 9.55 | 38.92 | 34.07 | 33.05 |
| 8 | OpenAI o4-mini-high | Single-Agent | Yes | No | 33.93 | 35.12 | 10.63 | 41.46 | 35.86 | 32.88 |
| 9 | GLM-4.5 | Multi-Agent | Yes | Yes | 30.74 | 27.01 | 10.90 | 31.74 | 33.37 | 29.39 |
| 10 | GPT-4.1 | Single-Agent | No | No | 30.61 | 35.26 | 8.27 | 34.83 | 33.94 | 30.52 |
| 11 | Qwen3-235B-Thinking | Single-Agent | Yes | Yes | 29.97 | 31.21 | 10.20 | 33.41 | 34.41 | 28.90 |
| 12 | Claude 3.5 Sonnet | Multi-Agent | No | No | 29.34 | 23.67 | 8.16 | 30.99 | 33.72 | 28.88 |
| 13 | Qwen3-235B-Instruct | Multi-Agent | No | Yes | 28.83 | 38.13 | 7.82 | 31.99 | 31.86 | 28.67 |
| 14 | GLM-4.5 | Single-Agent | Yes | Yes | 27.17 | 25.13 | 11.82 | 29.30 | 30.91 | 27.29 |
| 15 | Kimi-K2-Instruct | Multi-Agent | No | Yes | 26.28 | 29.45 | 11.94 | 29.66 | 30.10 | 25.58 |
| 16 | Claude 3.5 Sonnet | Single-Agent | No | No | 24.62 | 21.98 | 7.89 | 28.90 | 28.35 | 24.52 |
| 17 | Qwen3-235B-Instruct | Single-Agent | No | Yes | 23.47 | 36.85 | 7.24 | 26.83 | 27.54 | 23.05 |
| 18 | Kimi-K2-Instruct | Single-Agent | No | Yes | 22.32 | 23.57 | 12.86 | 26.27 | 25.91 | 22.29 |
Table 1: Performance of LLM-based agents on PaperArena-Hub.
MC = Multiple Choice, CA = Concise Answer, OA = Open Answer (average scores across Easy, Medium, and Hard difficulties). Avg. Eff. = average efficiency; Avg. Steps = average number of steps per task.
Key Findings: Gemini 2.5 Pro (Multi-Agent) achieves the highest overall performance among LLM-based agents at 38.78%, yet a substantial gap remains relative to the Ph.D. expert baseline of 83.50%. Agent accuracy degrades consistently as task difficulty increases and on open-ended answer formats. In general, closed-source and "slow-thinking" LLMs outperform their counterparts. Although multi-agent systems improve both accuracy and efficiency, most agents plan inefficiently, invoking far more tools than necessary; this points to fundamental limitations in their multi-step reasoning and tool-execution control.
📖 Abstract
Existing benchmarks inadequately test LLM agents on realistic scientific tasks. We introduce PaperArena, a new evaluation benchmark where agents must address complex research questions by reasoning across multiple papers and using tools. Our platform provides agents with tools for parsing, retrieval, and computation to synthesize well-grounded answers. Experiments reveal that even advanced agents achieve only 38.78% average accuracy and exhibit inefficient tool usage. PaperArena offers a crucial platform for developing more capable agents for scientific discovery.
✨ Motivation
Figure 1: Motivation of the PaperArena benchmark.
⚙️ Key Features
Figure 2: Overview of the PaperArena Features.
📊 Multi-Step Reasoning
PaperArena evaluates agents on multi-step reasoning that simulates scientific workflows. Tasks require the sequential use of tools, such as PDF parsers, context retrieval, and web search, to trace claims and validate results.
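As a rough illustration of such a chain, the sketch below strings hypothetical tool calls together in sequence; the tool names and the `call_tool` helper are placeholders, not part of the PaperArena tool environment.

```python
# Minimal sketch of a sequential tool chain for a PaperArena-style task.
# All tool names and the call_tool() helper are hypothetical placeholders.

def call_tool(name: str, **kwargs):
    """Dispatch a tool call; a real agent would route this to its tool environment."""
    raise NotImplementedError(f"no backend wired up for tool {name!r} in this sketch")

def verify_claim(paper_pdf: str, claim: str) -> str:
    # Step 1: parse the PDF into readable sections.
    sections = call_tool("pdf_parse", path=paper_pdf)
    # Step 2: retrieve the passages most relevant to the claim.
    evidence = call_tool("context_retrieval", corpus=sections, query=claim)
    # Step 3: cross-check the claim against external sources.
    web_hits = call_tool("web_search", query=claim)
    # Step 4: synthesize a grounded answer from the collected evidence.
    return call_tool("answer_synthesis", claim=claim, evidence=[evidence, web_hits])
```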
🖼️ Multi-Modal Understanding
PaperArena assesses multimodal understanding by requiring agents to reason over text, figures, tables, and equations. Tasks involve aligning visual data with textual claims or comparing chart trends to tabular metrics.
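One way an agent might cross-check a chart against a tabular value is sketched below, assuming an OpenAI-compatible vision model; the model name, file path, and prompt wording are illustrative assumptions only.

```python
# Minimal sketch of a figure-vs-table consistency check with a vision-capable chat model.
# The model name and the figure file are assumptions for illustration.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("figure3_accuracy_curve.png", "rb") as f:  # hypothetical figure file
    figure_b64 = base64.b64encode(f.read()).decode()

table_row = "Method X: 72.4 accuracy at step 10k (from Table 2)"
response = client.chat.completions.create(
    model="gpt-4.1",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Does the curve in this figure agree with: {table_row}?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{figure_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```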
📚 Cross-Document Integration
PaperArena evaluates cross-document integration by requiring agents to synthesize information from multiple papers. Tasks include verifying a claim against a cited work or retrieving implementation details from another document.
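To make this concrete, a cross-document item might pair one question with evidence pointers into two papers; the field names below are an assumed representation for illustration, not the benchmark's released schema.

```python
# Hypothetical representation of a cross-document QA item (not the official schema).
cross_doc_item = {
    "question": (
        "Does the reported baseline accuracy in Paper A match the number "
        "published in the cited work Paper B?"
    ),
    "answer_type": "concise_answer",   # multiple_choice / concise_answer / open_answer
    "difficulty": "medium",            # easy / medium / hard
    "evidence": [
        {"paper_id": "paperA", "locator": "Table 2"},
        {"paper_id": "paperB", "locator": "Section 4.1"},
    ],
    "expected_tools": ["pdf_parse", "context_retrieval"],  # assumed tool names
}
```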
🗄️ Database Interfacing
PaperArena tests an agent's ability to interface with a structured paper database. Agents must formulate precise queries for metadata, interpret the returned results, and incorporate the retrieved evidence into their reasoning steps.
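For example, a metadata lookup could be phrased as a structured query; the `papers` table and its columns below are an assumed schema for illustration, not the benchmark's actual database layout.

```python
import sqlite3

# Assumed schema for illustration: papers(paper_id, title, venue, year, citation_count).
conn = sqlite3.connect("paper_metadata.db")  # hypothetical database file
query = """
    SELECT title, year, citation_count
    FROM papers
    WHERE venue = ?
      AND year >= ?
    ORDER BY citation_count DESC
    LIMIT 5
"""
# The agent would interpret these rows and fold them back into its reasoning chain.
for title, year, citations in conn.execute(query, ("NeurIPS", 2023)):
    print(f"{year}  {title}  ({citations} citations)")
conn.close()
```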
🌟 Benchmark Construction
Figure 3: Pipeline of the PaperArena Construction and PaperArena-Hub Implementation.
The figure above illustrates the pipeline for constructing the PaperArena benchmark and implementing its evaluation platform, PaperArena-Hub.
- On the left, the PaperArena construction process details a tool-centric pipeline for generating question-answer (QA) pairs. It begins with a Tool Library and predefined Tool Chain Experience guiding a Large Language Model (LLM) to create initial QA pairs from scientific papers. These pairs then undergo a Semi-Automated QA Verification stage—including question obfuscation, answer standardization, and manual review—to ensure high quality and difficulty.
- On the right, the PaperArena-Hub platform serves as a modular framework for agent evaluation. It features an Agent Platform that supports both Single-Agent and Multi-Agent systems. When given a query, the agent's LLM interacts with the Tool Environment to generate an answer. The final answer is then assessed in the Evaluation module, where a Judge (typically a powerful LLM) scores its performance against a human Ph.D. expert baseline, as sketched below.
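A minimal sketch of that judge step is shown below, assuming an OpenAI-compatible chat endpoint; the prompt wording, model name, and 0-100 scale are illustrative, not the exact protocol of PaperArena-Hub.

```python
# Minimal sketch of an LLM-as-judge scoring step.
# The client, model name, and 0-100 scale are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_answer(question: str, reference: str, prediction: str) -> float:
    prompt = (
        "You are grading an agent's answer to a research question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Agent answer: {prediction}\n"
        "Reply with a single integer score from 0 to 100."
    )
    response = client.chat.completions.create(
        model="gpt-4.1",  # any capable judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())
```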
🔖 BibTeX
@misc{wang2025paperarena,
  title={PaperArena: An Evaluation Benchmark for Tool-Augmented Agentic Reasoning on Scientific Literature},
  author={Daoyu Wang and Mingyue Cheng and Qi Liu and Shuo Yu and Zirui Liu and Ze Guo},
  year={2025},
  eprint={2510.10909},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2510.10909}
}