Apple Study Asks Whether AI Can Think For Itself: Experts Say Its Limits Are Human-Made

Benzinga
2025.06.12 10:17

A new study by Apple Inc. questions the reasoning capabilities of AI models, suggesting they often mimic intelligent behavior rather than genuinely reason. Testing popular models such as GPT-4 variants and Claude on logic puzzles revealed significant failures in complex problem-solving. Some experts argue these limitations stem from design constraints rather than inherent flaws, and others criticize the study's methodology. The study follows Apple's Worldwide Developers Conference, after which the company's shares fell amid concerns about its AI progress.

A new study from Apple Inc. (NASDAQ:AAPL) is stirring debate about whether AI models can genuinely reason or simply mimic intelligent behavior. By testing systems like GPT-4 variants and Claude on classic logic puzzles, the research suggests that these tools may stumble when real problem-solving is required.

What Happened: Apple released a study challenging the notion that large language models (LLMs) can logically reason through complex tasks. Ars Technica explains that by testing popular models like OpenAI's o1 and o3, Claude 3.7 Sonnet, and DeepSeek-R1 on classic logic puzzles such as Tower of Hanoi and river crossing tasks, the research team discovered that these systems often fail when they encounter unfamiliar challenges that demand systematic thinking.

Even when given the established solution algorithms, the models struggled, highlighting a key gap between appearing intelligent and actually reasoning logically.

"It is truly embarrassing that LLMs cannot reliably solve Hanoi," said AI researcher Gary Marcus. Study co-lead Iman Mirzadeh added that the models' behavior shows "their process is not logical and intelligent."

The study also found that while some models performed better on moderately difficult tasks by implementing step-by-step reasoning, they failed completely as complexity increased, often reducing their reasoning effort instead of expanding it.

This odd drop-off in effort, despite ample computing resources, shows what the researchers call a "counterintuitive scaling limit." Inconsistencies were also seen across a variety of puzzles, suggesting the failures are task-specific rather than purely technical.

Why It Matters: Some experts are countering Apple's conclusions, arguing that the apparent reasoning failures in AI models may originate from built-in constraints rather than inherent flaws.

Pierre Ferragu, an analyst at New Street Research, said that the paper is riddled with "ontological nonsense."

Economist Kevin A. Bryan suggested that these systems are trained to use shortcuts under tight computational budgets. He and others note that internal benchmarks show models perform better when allowed more tokens, but production systems restrict this on purpose to avoid inefficiency, meaning the Apple findings might be discovering limits by design, not nature.

Others, like software engineer Sean Goedecke and AI researcher Simon Willison, question whether logic puzzles are even fair tests for language models. Goedecke described DeepSeek-R1's failure on the Tower of Hanoi as a conscious decision to avoid impractical output, not a lack of ability.

Willison added that the test may simply run into token limits, hinting that the paper is more sensational than conclusive. Even Apple's researchers admit the puzzles represent a narrow slice of reasoning challenges and caution against generalizing their results too widely.
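To put the token-limit argument in concrete terms, the minimal Python sketch below implements the textbook recursive Tower of Hanoi solution and counts its moves. The shortest solution for n disks takes 2^n - 1 moves, so a fully written-out answer grows exponentially with the number of disks. The function names and the `tokens_per_move` figure are illustrative assumptions, not values taken from Apple's paper or from any model's tokenizer.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Yield (from_peg, to_peg) moves of the classic recursive solution."""
    if n == 0:
        return
    # Move the top n-1 disks out of the way, onto the spare peg.
    yield from hanoi_moves(n - 1, source, spare, target)
    # Move the largest remaining disk to the target peg.
    yield (source, target)
    # Move the n-1 disks from the spare peg onto the target peg.
    yield from hanoi_moves(n - 1, spare, target, source)


def move_count(n):
    """Minimum number of moves for n disks: 2**n - 1."""
    return 2 ** n - 1


def estimated_tokens(n, tokens_per_move=10):
    """Rough output size if every move is written out.

    tokens_per_move is a hypothetical figure chosen for illustration only.
    """
    return move_count(n) * tokens_per_move


if __name__ == "__main__":
    for disks in (3, 10, 15, 20):
        print(disks, move_count(disks), estimated_tokens(disks))
```

Under these assumptions, each added disk roughly doubles the length of a complete answer, which is the kind of growth critics such as Willison and Goedecke point to when they say the test may be hitting output limits rather than measuring reasoning ability.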

The study comes on the heels of the Worldwide Developers Conference (WWDC), where Apple made a range of new product announcements. Experts noted the absence of any new AI features, expressed disappointment and downgraded the company's stock. Shares fell after the event, with many raising questions about Apple's AI future.

Price Action: Apple stock is currently trading at $198.76, down 0.01% pre-market.

Benzinga Edge Rankings show that Momentum stands at 29.72, Value at 9.02, Growth at 32.90, and Quality leads the group with a score of 76.94.


Photo courtesy: jamesteohart / Shutterstock.com