New Gemini 2.5 Pro Takes Every Top Spot: Google Overtakes o3 Within a Month and Surpasses Claude 4 in Coding

Google has launched a new version of Gemini 2.5 Pro that quickly took the lead across benchmarks, beating o3 and Claude 4. The new model sets new SOTA results in mathematics, programming, and reasoning tests, with an Elo gain of 24 points on LMArena and 35 points on WebDevArena. Gemini 2.5 Pro keeps its original price, offers strong cost performance, and introduces new features such as a "thinking budget." It is expected to become a stable version suitable for enterprise-level applications in a few weeks.

Early this morning, Google launched a brand-new Gemini 2.5 Pro!

In just one month, Gemini 2.5 Pro (06-05) has overshadowed the Gemini 2.5 Pro (05-06) released at the I/O conference.

Indeed, the only thing that can defeat Google is Google itself.

This time, Gemini 2.5 Pro (06-05) remains the top performer.

In mathematics, programming, and reasoning benchmarks, the new model sets new SOTA results across the board, completely crushing o3, Claude 4, and DeepSeek-R1.

Compared with the previous version, Gemini 2.5 Pro's overall Elo has risen by 24 points, and its WebDevArena Elo has risen by a remarkable 35 points.

It is worth mentioning that the updated version keeps the original per-token pricing and offers excellent cost performance: its output price is only one-fourth that of o3, to say nothing of Claude 4.

Moreover, Gemini 2.5 Pro (06-05) introduces a "thinking budget" of up to 32k tokens and improves features such as function calling.

Gemini 2.5 Evolves Again in Math and Coding, Topping Every Chart

The naming is worth pondering: the new Gemini 2.5 Pro (06-05) and the old Gemini 2.5 Pro (05-06) differ only in the version date at the end of the name, and the two dates mirror each other.

It is clear that Google deliberately chose this timing to release the new model.

According to the official blog, this is an upgraded preview version of Gemini 2.5 Pro, which is Google's most intelligent model to date.

The upgrade builds on the version showcased at the I/O conference in May. In a few weeks, this model is expected to become a generally available, stable version suitable for enterprise applications.

The latest 2.5 Pro has jumped 24 points in the Elo score on the LMArena leaderboard, reaching 1470, firmly holding the top spot.

Even more impressive, it ranks first in all categories.

It achieved a leap of 35 points in Elo rating on WebDevArena, reaching a score of 1443.
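To put these Elo gains in perspective, here is a minimal sketch based on the standard Elo expected-score formula. It assumes the leaderboard ratings behave like classic Elo ratings on a 400-point scale, which the official blog does not spell out, and the function name is purely illustrative.

```python
def expected_score(rating_gap: float) -> float:
    """Expected win probability of the higher-rated model under the
    classic Elo logistic curve (400-point scale, assumed here)."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400.0))


# Rating gaps taken from the article: new 2.5 Pro vs. the previous version.
gaps = {"LMArena (+24)": 24, "WebDevArena (+35)": 35}

for label, gap in gaps.items():
    print(f"{label}: expected head-to-head win rate {expected_score(gap):.1%}")
```

Under that assumption, a 24-point gap corresponds to a head-to-head win rate a little above 53%, and a 35-point gap to about 55%. In other words, it is a steady edge over the previous version rather than a blowout.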

It excels in programming, ranking high in challenging programming benchmarks such as Aider Polyglot.

At the same time, it delivers top performance on highly challenging benchmarks such as GPQA and Humanity's Last Exam (HLE), which assess a model's math, science, knowledge, and reasoning abilities.

Google has also made improvements based on feedback on the previous 2.5 Pro version, enhancing its style and structure. It can now provide more creative and better-formatted responses.

Developers can start building with the updated 2.5 Pro through the Gemini API in Google AI Studio and Vertex AI. A new "thinking budget" feature has been added to help developers better control cost and latency.

The updated model has also officially launched in the Gemini app.
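As a rough illustration of how a thinking budget might be passed to the API, the sketch below builds a JSON request body without sending it. The field names (`generationConfig`, `thinkingConfig`, `thinkingBudget`) follow the Gemini REST API as we understand it and should be checked against the current documentation. The `build_request` helper, the 32,768-token cap (our reading of "32k"), and the sample prompt are assumptions made for illustration only.

```python
import json

# Assumption: the article's "32k" is read as 32,768 tokens.
MAX_THINKING_BUDGET = 32_768


def build_request(prompt: str, thinking_budget: int) -> dict:
    """Build a generateContent-style request body with a capped thinking budget.

    Hypothetical helper: it only caps the upper bound and does not validate
    any minimum the API may require.
    """
    budget = min(thinking_budget, MAX_THINKING_BUDGET)
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {"thinkingConfig": {"thinkingBudget": budget}},
    }


# A request asking for more than the cap gets clamped down to it.
payload = build_request("Explain Elo ratings in two sentences.", 50_000)
print(json.dumps(payload, indent=2))
```

A smaller budget should lower latency and cost at the expense of reasoning depth, which is the trade-off the feature is meant to expose.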

User Testing

How does Gemini 2.5 Pro (06-05) perform in real tasks?

A picture of wood being chopped has long hinted that Gemini is the king of the beasts.

Users could not wait to start a wave of testing.

The claim that its coding ability crushes o3 and Claude 4 is not just talk: Gemini 2.5 Pro has now passed the hexagon physics simulation test outright.

What's even more stunning is that it can create 3D DNA models using Three.js, with very realistic results.

Data scientist Diego asked Gemini 2.5 Pro (06-05) to write Python code that visualizes how a traffic light works on a single-lane road, with vehicles entering at random speeds.

The result of running the code:

Overall, the animation is quite polished, with no major issues.

For comparison, here is the result of the code generated by GPT-4.5.

Not only are the visuals rough, but the cars also fail to obey the laws of physics.

Diego previously ran the same test on Claude 3.7 Sonnet and Grok 3. Here is how those two models performed.

Judge for yourself which model is stronger.

Claude 3.7 Sonnet

Grok 3
