Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation

Institution Name
Conference name and year

*Indicates Equal Contribution

Abstract

Recent advances in large language models (LLMs) have demonstrated impressive reasoning capabilities that resemble human thinking. However, whether LLMs possess genuine fluid intelligence (i.e., the ability to reason abstractly and generalize rules in novel situations) remains an open question. Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability. To address these limitations, we propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework. DRE-Bench consists of 36 abstract reasoning tasks organized across four cognitive levels, with each task featuring multiple dynamic variants that test the same underlying latent rule. This design enables fine-grained, interpretable, and reliable assessments of fluid intelligence. We evaluate a range of state-of-the-art LLMs, including both general LLMs (GPT-4o, Claude 3.7) and reasoning LLMs (o1, DeepSeek-R1, QwQ, Skywork-OR1). Experimental results reveal that although most LLMs achieve competent and robust performance in low-level cognition, they struggle with high-level cognition and exhibit limited generalization as task complexity grows. Our findings highlight the gap between current LLMs and true human-like fluid intelligence and offer a new path for systematically tracking reasoning progress in LLMs.

Introduction

The contributions of this paper are summarized as follows. 1) We propose an abstract reasoning benchmark organized by a cognitive hierarchy, providing a more structured and comprehensive framework for analyzing the true fluid intelligence of LLMs. 2) We develop a verifiable and scalable data engine that dynamically generates abstract reasoning data of varying complexity by pairing each task with a generator and a solver (a minimal sketch of this pairing is given below). 3) We perform comprehensive evaluations on a variety of popular LLMs, showing that existing LLMs still struggle with reasoning problems at higher cognitive levels. These models may not have truly internalized the underlying reasoning rules, which highlights that they remain far from achieving true fluid intelligence.
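To make the generator-solver pairing concrete, the following is a minimal, hypothetical Python sketch of one DRE-Bench-style task. The toy latent rule (sequence reversal) and all names (Instance, generate, solve, complexity) are illustrative assumptions rather than the paper's actual tasks or implementation; the point is only that the generator samples dynamic variants of controllable complexity while the solver supplies a verifiable ground-truth answer.

import random
from dataclasses import dataclass

# Hypothetical sketch of one task in a generator/solver data engine.
# The example rule (reverse the sequence) is an illustrative assumption.

@dataclass
class Instance:
    prompt: str   # natural-language problem shown to the LLM
    answer: str   # ground-truth answer produced by the solver

def solve(sequence):
    """Latent rule for this toy task: reverse the sequence."""
    return list(reversed(sequence))

def generate(complexity, seed=None):
    """Sample a dynamic variant; `complexity` controls the sequence length."""
    rng = random.Random(seed)
    sequence = [rng.randint(0, 9) for _ in range(complexity)]
    answer = solve(sequence)
    prompt = (
        f"Apply the hidden rule to the sequence {sequence} "
        "and output the resulting sequence."
    )
    return Instance(prompt=prompt, answer=" ".join(map(str, answer)))

if __name__ == "__main__":
    inst = generate(complexity=5, seed=0)
    print(inst.prompt)
    print("Expected:", inst.answer)

Because the solver is deterministic, every generated instance is automatically verifiable, which is what makes this kind of engine scalable to arbitrarily many dynamic variants of the same latent rule.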

Experiment

In this section, we evaluate state-of-the-art large language models and investigate the following research questions: i) How do current LLMs perform on abstract reasoning across different cognitive levels? ii) How does LLM performance change as the complexity of the dynamically generated data increases? iii) Judging from the performance of different LLMs across the various cognitive dimensions, what level of intelligence have these models reached? iv) Are inference-time scaling, visual information, and the number of in-context samples truly effective for abstract reasoning tasks?
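Questions i) and ii) imply an evaluation protocol that sweeps each task over increasing complexity and aggregates accuracy per cognitive level. The following Python sketch shows one plausible form of such a loop; evaluate, query_llm, and the (level, generator) task registry are hypothetical placeholders assumed for illustration, not the paper's actual evaluation code.

def evaluate(tasks, query_llm, complexities=(3, 5, 7), n_samples=20):
    """Hypothetical harness: tasks is a list of (cognitive_level, generate_fn)
    pairs; returns accuracy keyed by (level, complexity)."""
    accuracy = {}
    for level, generate in tasks:
        for complexity in complexities:
            correct = 0
            for seed in range(n_samples):
                # Each seed yields a fresh dynamic variant of the same latent rule.
                inst = generate(complexity=complexity, seed=seed)
                prediction = query_llm(inst.prompt).strip()
                correct += prediction == inst.answer
            accuracy[(level, complexity)] = correct / n_samples
    return accuracy

# Example usage with the toy generator sketched above and a stub model:
# results = evaluate([("Level-1", generate)], query_llm=lambda p: "", n_samples=5)

Reporting accuracy per (level, complexity) cell is what allows generalization to be read off directly: a model that has internalized the latent rule should degrade gracefully, if at all, as complexity grows within the same level.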