Claiming DeepSeek-R1 and Claude Thinking fundamentally cannot reason: has Apple's controversial paper flopped?

Wallstreetcn
2025.06.09 05:41

A paper from an Apple team questions the reasoning capabilities of current AI reasoning models (such as DeepSeek-R1 and Claude 3.7 Sonnet), arguing that these models are essentially good at memorizing patterns rather than performing true reasoning. The research shows that although these models have acquired complex self-reflection mechanisms through reinforcement learning, their performance collapses on highly complex problems. Using a controlled puzzle environment, Apple's study finds that standard LLMs are more efficient on simple problems, while both model types fail on complex ones.

Currently, the "reasoning" ability of AI appears to have been validated by large reasoning models such as DeepSeek-R1, OpenAI o1/o3, and Claude 3.7 Sonnet, which display a strikingly human-like thought process.

However, a recent paper from the Apple team has questioned the reasoning capabilities of LLMs and presented its own viewpoint — models like DeepSeek-R1 and o3-mini do not actually perform reasoning; they are just very good at memorizing patterns.

A related tweet has already surpassed 10 million views on X.

Next, let's see how Apple arrived at this conclusion:

Apple explored the reasoning mechanisms of frontier large reasoning models (LRMs) from the perspective of problem complexity, using not standard benchmarks (such as math problems) but a controlled puzzle environment. By adjusting puzzle elements while retaining the core logic, they systematically varied complexity and examined both final solutions and internal reasoning traces (Figure 1, top).

These puzzles: (1) allow for fine-grained control of complexity; (2) avoid common contamination found in existing benchmarks; (3) require only clearly provided rules, emphasizing algorithmic reasoning; (4) support simulator-based rigorous evaluation, enabling precise solution checks and detailed fault analysis.

Empirical research revealed several key findings regarding current reasoning models (LRMs):

First, although these models have learned complex self-reflection mechanisms through reinforcement learning, they have failed to develop generalized problem-solving abilities applicable to planning tasks, with their performance collapsing to zero beyond a certain complexity threshold.

Second, Apple compared LRMs and standard LLMs under equivalent inference compute, revealing three distinct regimes (Figure 1, bottom). For simpler, less compositional problems, standard LLMs are more efficient and accurate. As problem complexity increases moderately, reasoning models gain the advantage. However, when problems reach high complexity and greater compositional depth, the performance of both model types collapses completely (Figure 1, bottom left). Notably, as they approach this collapse point, LRMs begin to reduce their reasoning effort (measured in inference-time thinking tokens) as problem complexity increases, even though they are operating well below their generation length limits (Figure 1, bottom). This suggests a fundamental inference-time scaling limitation in the reasoning capabilities of LRMs relative to problem complexity.

Finally, Apple's analysis of intermediate reasoning trajectories, or "thoughts," revealed complexity-dependent patterns: on simpler problems, reasoning models often identify the correct solution early but then inefficiently continue exploring incorrect alternatives, the "overthinking" phenomenon. At medium complexity, the correct solution emerges only after extensive exploration of erroneous paths. Beyond a certain complexity threshold, the models fail to find any correct solution (Figure 1, bottom right). This indicates that LRMs have limited self-correction capabilities which, while valuable, also expose fundamental inefficiencies and clear scaling limitations.

These findings highlight the strengths and limitations of existing LRMs and raise questions about the properties of reasoning in these systems, which have significant implications for their design and deployment.

In summary, the contributions of this work include the following:

Questioning the current evaluation paradigm for LRMs built on established mathematical benchmarks, and designing a controlled experimental platform based on algorithmic puzzle environments that allows systematic manipulation of problem complexity.

Experiments show that state-of-the-art LRMs (e.g., o3-mini, DeepSeek-R1, Claude-3.7-Sonnet-Thinking) still fail to develop generalizable problem-solving capabilities. In different environments, when complexity exceeds a certain level, accuracy ultimately drops to zero.

Apple found that the reasoning capabilities of LRMs have a scaling limit with respect to problem complexity, evidenced by the counterintuitive decline in thinking tokens once a certain complexity point is reached.

Apple questions the current evaluation paradigm based on final accuracy and, using deterministic puzzle simulators, extends the assessment to the intermediate solutions within thinking trajectories. The analysis shows that as problem complexity increases, correct solutions systematically appear later in the thinking process, while incorrect solutions do not, providing quantitative insights into the self-correction mechanisms of reasoning models.

Apple discovered some surprising limitations in LRMs regarding precise calculations, including their inability to benefit from explicit algorithms and their inconsistent reasoning across different puzzle types.

Paper Title: The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Among the authors of this paper, one co-author is Parshin Shojaee, currently a third-year PhD student at Virginia Tech and a research intern at Apple. Another co-author, Iman Mirzadeh, is a machine learning research engineer at Apple. In addition, Samy Bengio, brother of Yoshua Bengio and currently a senior director of AI and machine learning research at Apple, also contributed to this work.

Mathematics and Puzzle Environment

It is currently unclear whether the recent performance gains of reinforcement-learning-based thinking models stem from greater exposure to established mathematical benchmark data, from the substantially larger inference compute allocated to thinking tokens, or from genuine reasoning capabilities developed through reinforcement learning training.

Recent research has explored this question by comparing the upper-bound capability (pass@k) of RL-based thinking models with that of their non-thinking standard LLM counterparts on established mathematical benchmarks, showing that under the same inference token budget, non-thinking LLMs can eventually match the performance of thinking models on benchmarks such as MATH500 and AIME24.
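
For context, pass@k is commonly computed with the unbiased estimator from the code-generation literature (Chen et al., 2021); whether Apple uses exactly this estimator is not stated here, so the sketch below only illustrates the metric itself:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: probability that at least one of k
    samples drawn from n attempts is correct, given c of the n were correct.
    Note: whether Apple computes pass@k exactly this way is an assumption."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 16 samples per problem, 5 of them correct
print(pass_at_k(n=16, c=5, k=1))  # 0.3125 (= 5/16)
print(pass_at_k(n=16, c=5, k=4))  # higher: any of the 4 drawn samples may be correct
```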

Apple also ran comparative analyses of frontier LRMs, such as Claude-3.7-Sonnet (with thinking vs. without thinking) and DeepSeek (R1 vs. V3). As Figure 2 shows, on the MATH500 dataset, given the same inference token budget, the pass@k performance of thinking models is comparable to that of their non-thinking counterparts. However, Apple observed that this gap widens on the AIME24 benchmark and widens further on AIME25. This continuously widening gap is difficult to interpret.

This can be attributed to: (1) increasing complexity that demands more sophisticated reasoning, thereby revealing the genuine advantage of thinking models on harder problems; or (2) less data contamination in newer benchmarks (especially AIME25). Interestingly, human performance on AIME25 is actually higher than on AIME24, suggesting that AIME25 may be the less complex of the two. Yet models perform worse on AIME25 than on AIME24, which may indicate data contamination in the training of frontier LRMs.

Given these puzzling observations, and because mathematical benchmarks do not allow controlled manipulation of problem complexity, Apple turned to puzzle environments that enable more precise and systematic experimentation.

Puzzle Environment

Apple evaluated LRM reasoning on four controllable puzzles spanning compositional depth, planning complexity, and distributional settings. The puzzles are shown in Figure 3.

The Tower of Hanoi puzzle consists of three pegs and n disks of different sizes, stacked in order of size (largest at the bottom) on the first peg. The goal is to move all disks from the first peg to the third peg. Valid moves: move only one disk at a time, take only the top disk from a peg, and never place a larger disk on top of a smaller one. The difficulty is controlled by the number of initial disks, since the minimum number of moves for n disks is 2^n − 1. In this study, however, Apple does not score the optimality of the final solution; it only measures the correctness of each move and whether the target state is reached.
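
As a concrete illustration of the 2^n − 1 bound, a minimal recursive move generator looks like this; the (disk, from_peg, to_peg) output format is chosen for illustration and is not necessarily the format used in Apple's prompts:

```python
def hanoi_moves(n: int, src: int = 0, aux: int = 1, dst: int = 2):
    """Yield an optimal move sequence for n disks as (disk, from_peg, to_peg).
    The move format here is illustrative, not the format from the paper's prompts."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, dst, aux)  # park the n-1 smaller disks on the spare peg
    yield (n, src, dst)                           # move the largest disk to the target peg
    yield from hanoi_moves(n - 1, aux, src, dst)  # re-stack the smaller disks on top of it

print(len(list(hanoi_moves(5))))  # 31 == 2**5 - 1, the minimum move count
```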

Checker Jumping is a one-dimensional puzzle that arranges red pieces, blue pieces, and an empty space in a line. The goal is to swap the positions of all red and blue pieces, effectively mirroring the initial configuration. Valid moves include sliding a piece into an adjacent empty space or jumping over exactly one opposing colored piece into the empty space. During the puzzle process, no piece can move backward. The complexity of this task can be controlled by the number of pieces: if the number of pieces is 2n, the minimum number of moves required is (n + 1)^2 − 1.
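
A brute-force check of the (n + 1)^2 − 1 formula is straightforward under the rules exactly as described above (red moves only right, blue moves only left, via a slide into the adjacent empty square or a jump over exactly one opposing piece); this is an independent sketch, not code from the paper:

```python
from collections import deque

def min_moves(n: int) -> int:
    """Breadth-first search over board states for n red and n blue checkers.
    Rules follow the description above: red moves right, blue moves left,
    by sliding into the empty square or jumping exactly one opposing piece."""
    start, goal = "R" * n + "_" + "B" * n, "B" * n + "_" + "R" * n
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        state, dist = queue.popleft()
        if state == goal:
            return dist
        e = state.index("_")
        movers = []
        if e >= 1 and state[e - 1] == "R":                                   # red slides right
            movers.append(e - 1)
        if e >= 2 and state[e - 2] == "R" and state[e - 1] == "B":           # red jumps a blue
            movers.append(e - 2)
        if e + 1 < len(state) and state[e + 1] == "B":                       # blue slides left
            movers.append(e + 1)
        if e + 2 < len(state) and state[e + 2] == "B" and state[e + 1] == "R":  # blue jumps a red
            movers.append(e + 2)
        for i in movers:
            s = list(state)
            s[i], s[e] = s[e], s[i]
            nxt = "".join(s)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return -1  # unreachable for valid n

for n in (1, 2, 3):
    print(n, min_moves(n), (n + 1) ** 2 - 1)  # BFS result matches the closed form
```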

River Crossing is a constraint-satisfaction planning problem involving n participants and their corresponding n agents, who must cross a river by boat. The goal is to transport all 2n individuals from the left bank to the right bank. The boat can carry at most k people and cannot travel empty. A configuration is invalid if a participant is in the presence of another agent without their own agent present, since each agent must protect their client from competing agents. The complexity of this task is controlled by the number of participant/agent pairs: for n = 2 or n = 3 pairs a boat capacity of k = 2 is used; for larger numbers of pairs, k = 3 is used.
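
A minimal sketch of the safety constraint as described above, under the interpretation that a client may only share a bank (or the boat) with other agents if their own agent is also present; labelling pairs by index is a hypothetical convention for illustration:

```python
def bank_is_safe(clients: set, agents: set) -> bool:
    """A bank (or the boat) is unsafe if some client is present alongside a
    competing agent while their own agent is absent. Pairs are labelled by
    index, an illustrative convention rather than the paper's encoding."""
    return all(c in agents or not (agents - {c}) for c in clients)

print(bank_is_safe({0, 1}, {1}))     # False: client 0 faces agent 1 without agent 0
print(bank_is_safe({0, 1}, {0, 1}))  # True: both agents present
print(bank_is_safe({0, 1}, set()))   # True: no agents present at all
```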

Blocks World is a stacking puzzle that requires rearranging blocks from an initial configuration to a specified target configuration. The goal is to find the minimum number of moves required to complete this transformation. Valid moves are limited to the topmost block of any stack, which can be placed on an empty stack or on top of another block. The complexity of this task can be controlled by the number of existing blocks.
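
The single move primitive described above can be sketched as follows; the list-of-stacks state representation is an assumption for illustration, and the search for a minimum-length plan is not shown:

```python
def apply_move(stacks: list, src: int, dst: int) -> list:
    """Move the topmost block of stack `src` onto stack `dst`.
    Only top blocks may ever move, so this is the sole primitive needed.
    The list-of-stacks representation is an illustrative assumption."""
    assert stacks[src], "cannot move from an empty stack"
    new = [list(s) for s in stacks]   # copy so the caller's state is untouched
    new[dst].append(new[src].pop())
    return new

state = [["A", "B", "C"], ["D"], []]
print(apply_move(state, 0, 2))  # [['A', 'B'], ['D'], ['C']]
```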

Experiments and Results

The experiments in this paper were conducted on reasoning models and their corresponding non-reasoning models, such as Claude 3.7 Sonnet (thinking/non-thinking) and DeepSeek-R1/V3.

How does complexity affect model reasoning?

To study the impact of problem complexity on reasoning behavior, this paper conducted comparative experiments between reasoning and non-reasoning model pairs in a controlled puzzle environment, such as Claude-3.7-Sonnet (thinking/non-thinking) and DeepSeek (R1/V3).

Figure 4 shows the accuracy of the two types of models across all puzzle environments as the problem complexity changes.

Additionally, Figure 5 presents the performance upper bounds (pass@k) of these model pairs under the same inference-token compute budget (averaged across all puzzles).

The above results indicate that, with respect to complexity, these models exhibit three regimes:

In the first regime, where problem complexity is relatively low, non-reasoning models achieve performance comparable to, or even better than, reasoning models.

In the second regime, at moderate complexity, the advantage of reasoning models that can generate long chains of thought begins to emerge, and the performance gap between reasoning and non-reasoning models widens.

The most interesting is the third regime, at high problem complexity, where the performance of both model types collapses to zero.

These results suggest that while reasoning models delay this collapse, they ultimately encounter the same fundamental limitations as non-reasoning models.

Next, this paper examines how different reasoning models behave as problem complexity changes. The tested models include o3-mini (medium/high configurations), DeepSeek-R1, DeepSeek-R1-Qwen-32B, and Claude-3.7-Sonnet (thinking).

Figure 6 shows that all reasoning models exhibit a similar pattern when faced with changes in complexity: as the problem complexity increases, the model accuracy gradually declines until it completely collapses (accuracy drops to zero) after exceeding a specific complexity threshold for the model.

This paper also finds that reasoning models initially increase their use of thinking tokens in proportion to problem complexity. However, as they approach the critical threshold (which closely aligns with their accuracy collapse point), the models counterintuitively reduce their reasoning effort even though the problems keep getting harder. This phenomenon is most pronounced in the o3-mini variants and less pronounced in Claude-3.7-Sonnet (thinking). Notably, even though these models are well below their generation length limits and have ample inference compute available, they fail to make use of the additional compute during the thinking phase as problem complexity increases. This behavior points to a fundamental limitation in the reasoning capabilities of current reasoning models relative to problem complexity.

What happens internally in the reasoning models' thinking?

To gain a deeper understanding of the reasoning process of the models, this paper conducts a fine-grained analysis of the reasoning trajectories. The focus is on Claude-3.7-Sonnet-Thinking.

The analysis based on reasoning trajectories further validates the three complexity patterns mentioned earlier, as shown in Figure 7a.

For simple problems (low complexity): reasoning models typically find the correct solution early in the thinking process (green distribution), but then continue to explore incorrect solutions (red distribution). Notably, compared to the correct solutions (green), the distribution of incorrect solutions (red) tends to be more concentrated towards the end of the thinking process. This phenomenon, referred to in the literature as overthinking, leads to wasted computation.

When problems become slightly more complex, this trend reverses: the model first explores incorrect solutions and arrives at the correct one only later. At this point, incorrect solutions (red) are distributed noticeably earlier in the thinking process than correct solutions (green).

Finally, for problems of even higher complexity, a collapse occurs, meaning the model fails to generate any correct solutions in its reasoning.

Confusing Behavior of Reasoning Models

As shown in Figures 8a and 8b, in the Tower of Hanoi environment, even when this paper provides an algorithm in the prompts—so that the model only needs to execute the prescribed steps—the model's performance does not improve, and the observed collapse still occurs around the same point.

Additionally, in Figures 8c and 8d, this paper observes that the Claude 3.7 Sonnet thinking model behaves very differently across puzzles. In the Tower of Hanoi, its first error in a proposed solution often occurs relatively late, whereas in the River Crossing puzzle it can only produce a valid solution up to the 4th move. Notably, the model achieves nearly perfect accuracy on the Tower of Hanoi instance that requires 31 moves (N = 5), yet fails to solve the River Crossing puzzle that requires only 11 moves (N = 3). This may indicate that River Crossing examples with N > 2 are scarce on the web, suggesting that LRMs had limited exposure to, or memorization of, such instances during training.

Controversy Surrounding the Research

Regarding Apple's research, some have asked: if this is indeed the case, how does one explain the performance of o3-preview on the ARC benchmark?

Some believe Apple's research is misleading because it only tested DeepSeek R1 and Claude 3.7; even if those particular models fail, it is unfair to claim that "ALL reasoning models fail."

Additionally, some (user @scaling01) replicated the Tower of Hanoi puzzle and the exact prompts used in Apple's paper, leading to some interesting findings:

An optimal solution requires at least 2^N − 1 moves, and the required output format costs roughly 10 tokens per move plus some constant overhead.

In addition, the output limit is 128k tokens for Sonnet 3.7, 64k for DeepSeek R1, and 100k for o3-mini, and this budget includes the reasoning tokens used before the final answer!

All models will have an accuracy of 0 when the number of disks exceeds 13, simply because they cannot output that many!

The maximum scale that even fits in the output window, leaving no room for reasoning: DeepSeek, 12 disks; Sonnet 3.7 and o3-mini, 13 disks. If you look closely at the models' outputs, you will find that when the problem scale is too large, they do not even attempt to reason.
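
The arithmetic behind this claim can be sketched in a few lines, taking the roughly 10 tokens per move and the 64k/100k/128k output limits quoted above at face value (these are figures from the thread, not independently verified):

```python
# Back-of-envelope: tokens needed just to list an optimal Tower of Hanoi solution.
# TOKENS_PER_MOVE and the output limits are the figures quoted in the thread above.
TOKENS_PER_MOVE = 10
OUTPUT_LIMITS = {"DeepSeek-R1": 64_000, "o3-mini": 100_000, "Sonnet-3.7": 128_000}

for n in range(10, 16):
    moves = 2 ** n - 1
    needed = moves * TOKENS_PER_MOVE
    fits = [name for name, limit in OUTPUT_LIMITS.items() if needed <= limit]
    print(f"{n} disks: {moves:,} moves, ~{needed:,} tokens, fits: {fits or 'none'}")
# At 13 disks (~82k tokens) the 64k budget is already exceeded, and at 14 disks
# none of the limits suffice, before counting a single "thinking" token.
```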

Because the number of moves is so large, the model explains the solving algorithm instead of listing all 32,767 moves one by one.

Thus, it can be found that:

At least for Sonnet, once the problem scale exceeds 7 disks, it will not attempt reasoning. It will state the problem itself and the solving algorithm, then output the solution without even considering each step.

Interestingly, these models select the correct token for each move only with some probability X. Even at 99.99%, because the number of moves grows exponentially with problem scale, the model will eventually make a mistake.
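
As a tiny illustration of this compounding-error argument (the per-move accuracy p here is hypothetical):

```python
# If each move is emitted correctly with independent probability p, an optimal
# Tower of Hanoi solution with 2**n - 1 moves succeeds end to end with p**(2**n - 1).
for p in (0.999, 0.9999):  # hypothetical per-move accuracies
    for n in (10, 13, 15):
        moves = 2 ** n - 1
        print(f"p={p}, n={n}: P(all {moves:,} moves correct) = {p ** moves:.4f}")
# Even p = 0.9999 gives only ~0.44 at 13 disks and ~0.04 at 15 disks.
```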

Moreover, the Apple paper's treatment of game complexity is also confusing. The Tower of Hanoi requires exponentially more moves than the other puzzles, whose move counts grow only quadratically or linearly, but that alone does not make the Tower of Hanoi harder to reason about.

This user bluntly called this work "nonsense," stating that the model is not actually limited by reasoning ability, but rather by the limitations of output tokens.

In simple terms, this user's point is that all models have an accuracy of 0 when the number of disks exceeds 13, simply because they cannot output that many.

OpenAI employees also joined the discussion, stating, "This deep dive into Apple's research is great."

Some even remarked that if this analysis is correct, Apple's research would be meaningless.

Article author: Machine Heart, source: Machine Heart, original title: "Questioning DeepSeek-R1, Claude Thinking fundamentally does not reason! Did Apple's controversial paper fail?"
