A 6,000-word review: Google AI's fierce comeback - from Nano Banana, Genie 3, Veo 3 to Gemini 2.5

Wallstreetcn
2025.09.04 02:00

Google's progress in the AI field has been significant, especially with the launch of Gemini 2.5 Pro, which has brought it back to the center of the industry. Over the past year, Google has transformed from follower to leader, launching powerful products such as Nano Banana, Veo 3, and Genie 3 and showing how it converts technological strength into products. This article analyzes why Google has suddenly risen in the AI race and examines its technological accumulation and productization process.

A year ago, Google was still seen as a "follower" in the AI race. When ChatGPT swept through Silicon Valley, it appeared sluggish.

But just a few months later, the situation changed dramatically.

Gemini 2.5 Pro topped various rankings; the "banana" model Nano Banana made generating and editing images a breeze; the video model Veo 3 showcased an understanding of the physical world; and Genie 3 could even generate a virtual world from a single sentence.

With a series of "game-changing" products, Google has reestablished its position at the center of the table.

This inevitably raises the question:

Why has Google suddenly become so formidable recently?

This is not a sudden explosion, but "the elephant turning, monetizing its technology": Google is transforming decades of accumulated AI technology reserves into product strength with unprecedented determination and efficiency.

To put it more bluntly: Google hasn't suddenly become stronger; rather, the giant that pioneered the era of the Transformer model architecture is making a swift return.

Next, this article will delve into Google's progress in the AI field and analyze why Google has recently become "suddenly so formidable" in the AI race.

The full text will revolve around the following four core sections:

【1】Gemini 2.5 Pro: dominating the rankings, winning gold, returning to the throne

【2】Left hand "banana" image generation, right hand Veo 3 directing

【3】World model Genie 3

【4】The elephant turns, technology monetizes

Gemini 2.5 Pro: dominating the rankings, winning gold, returning to the throne

First, let's look at the foundational large language model. For most people, the perceived starting point of Google's sudden surge in strength was the launch of the Gemini 2.5 Pro series.

In the winter of 2022, OpenAI's experimental chatbot ChatGPT set off a storm, reaching a million users within days. Despite frequently making factual errors and botching simple calculations, its potential "shocked" all of Silicon Valley, putting pressure on Google for the first time and making it feel that its backyard was on fire.

In the year that followed, Google looked like a somewhat clumsy "follower." From the hastily launched Bard to the first attempts with Gemini 1.0, its ongoing efforts were met with constant skepticism.

For instance, the narrative at that time was as follows:

From "How will Google respond?" to "Can Google still keep up?"

Until a critical moment arrived — the official launch of Gemini 2.5 Pro. Although the previously launched Gemini 2.0 was already powerful enough, it had not yet reversed user perception.

At this point, Google could truly say it had "found its former position, the one that defined the technological giant of the internet era."

1) Dominating the Rankings

Six months ago, in March 2025, Gemini 2.5 Pro, which had appeared under the codename "nebula," debuted on the third-party model evaluation platform LMSys Chatbot Arena and shot straight to the top, its Elo rating surpassing all competitors, including GPT-4o and Claude 3 Opus.
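As a refresher on what an Elo lead means (standard rating math, not a detail from the article): Chatbot Arena aggregates pairwise human votes into an Elo-scale score, under which the expected probability that model $A$ beats model $B$ is

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}},$$

so a 100-point lead corresponds to roughly a 64% expected win rate in head-to-head comparisons.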

It achieved a true domination of the rankings.

This performance has been widely interpreted by various media as Google having caught up with or even surpassed its competitors in terms of overall model strength.

According to the LMSys team, this is "the first time in history that a model has simultaneously dominated the three major rankings of text, vision, and web development," earning a well-deserved "triple crown."

It is worth noting that the LMSys team's "web development" arena simulates real development tasks: not just isolated coding, but building interactive web applications, covering the front-end UI, functional interaction, dependency management, and complete application structure.

In terms of programming, although the claim of "comprehensive crushing" is debatable in practice, multiple benchmarks and developer feedback indicate that Gemini 2.5 Pro's code generation, understanding, and debugging are on par with Anthropic's top-tier Claude 3.7 Sonnet, and even stronger on certain specific tasks (such as LeetCode-style problems).

Moreover, with every subsequent release, large or small, the Gemini series has seen across-the-board upgrades.

At this point, all doubts have dissipated.

Google has returned to the first tier in the most critical core capabilities of large language models.

2) Winning Gold

In addition to benchmark rankings, the AI community cares a great deal about how a foundational large model performs in areas of broad social interest.

Simply put, how does Gemini perform in more challenging professional fields?

Here, it is worth mentioning the International Mathematical Olympiad (IMO), a competition often highlighted by AI manufacturers for its "shock value."

A Gemini model specially trained with "Deep Think" capabilities achieved gold-medal level at the 2025 International Mathematical Olympiad (IMO): Gemini 2.5 Deep Think scored 35 out of a perfect 42, winning gold by solving 5 of 6 problems and directly surpassing competitors such as Grok 4 and OpenAI's o3. Around the same time, shortly before GPT-5's official release, OpenAI also used an experimental internal reasoning model to win IMO gold, and Gemini's score matched that model's.
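For context, the scoring arithmetic behind those numbers (standard IMO convention, not a detail from the article): each of the six problems is worth 7 points, so

$$5 \times 7 = 35 \quad \text{points out of} \quad 6 \times 7 = 42,$$

and 35 was exactly the reported gold-medal cutoff in 2025.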

This achievement showcases the potential of Google AI in complex tasks that require deep logical reasoning.

Thirty days ago, this IMO gold-medal model was launched in the Gemini chatbot, and its real-world performance is considered to exceed the level it showed at the IMO itself.

Belgian mathematician Michel van Garrel even demonstrated online how to use its deep-thinking capabilities to help prove conjectures.

Overall, the high scores in foundational model evaluations intuitively showcase the model's strong capabilities to developers and the tech community. The success in competitions like the IMO represents significant progress for AI, especially Google’s AI, in the field of cutting-edge reasoning.

Therefore, the release of the Gemini 2.5 Pro series can be seen as a clear turning point for Google in this AI competition.

The current Gemini is not only one of the best consumer AI products but also a project operating at the technological frontier.

Google has begun to announce to the community and the market: they are no longer the followers, and their foundational models have officially taken the lead in the industry.

Left hand "banana" raw image, right hand Veo 3 director

If Google was "catching up" in pure-text large models, then in the multimodal field its deep technological accumulation has given it an "almost absolute lead."

Not only was the Gemini model designed from the beginning as a native multimodal model capable of seamlessly understanding and processing text, code, images, audio, and video; beyond Gemini, Google also possesses a series of powerful dedicated multimodal models.

Let's first look at the progress in the image field.

1) The story of a banana

In visual reasoning, Google has not paused its research since Gemini 1.5 Pro, and by Gemini 2.5 Pro its visual reasoning capabilities had reached an excellent level.

This is clearly reflected in Google's image models.

Just six months ago, in March 2025, shortly after the highly praised open-source Gemma 3 model, Google quickly rolled out a Gemini 2.0 model that let users "edit images by talking": Gemini 2.0 Flash Experimental, which became a sensation across the internet. This was mainly because people discovered it could understand natural-language instructions and offered strong control over modifications.

This feature was very popular at the time, with numerous reviewers inside and outside the company exploring its potential. Precisely because netizens across different fields pooled their creative uses, Gemini 2.0 became one of the coolest toys of the moment.

Interestingly, just before this feature was released, Google's dedicated image generation model Imagen 4 had not made the huge industry waves that were expected, and many thought the "edit by talking" update was merely a clever but small-scale product optimization.

However, Google did not give up on its strong push in the image field; instead, it accelerated its efforts.

Gemini 2.5 Flash Image (Nano Banana)

This anticipated "charge" did not keep people waiting too long.

Just a week ago, a mysterious image model codenamed "Nano Banana" appeared in the global AI large model arena, and its performance in various generation and editing tasks quickly sparked heated discussions and speculations in various communities.

At that time, a mainstream viewpoint was:

Could this possibly be another Google model?

The reason for such speculation was that its performance almost "overwhelmed" the vast majority of similar products on the market. The community generally believed that only Google, with such a deep accumulation in the multimodal field, could produce such a "monster-level" work.

Ultimately, the mystery was revealed: "Nano Banana" is indeed Gemini 2.5 Flash Image.

It demonstrated an accurate understanding of "object replacement": no longer merely able to "draw," it could understand the relationships within an image and make edits while maintaining logical consistency, a significant leap in image generation and editing quality.
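To make that concrete, here is a minimal sketch of what such a natural-language edit looks like through Google's google-genai Python SDK. The model ID, file names, and prompt are illustrative assumptions, not authoritative; consult the official documentation for current names and response handling.

```python
# Minimal sketch: natural-language image editing with Gemini 2.5 Flash Image
# ("Nano Banana") via the google-genai SDK. Model ID and file names are
# illustrative assumptions.
from io import BytesIO

from google import genai
from PIL import Image

client = genai.Client()  # reads the API key from the environment

source = Image.open("living_room.jpg")  # hypothetical input photo

# One request carries both the edit instruction and the source image.
response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",  # assumed model ID
    contents=[
        "Replace the sofa with a blue armchair; keep lighting and shadows consistent.",
        source,
    ],
)

# The edited image comes back as inline bytes among the response parts.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        Image.open(BytesIO(part.inline_data.data)).save("edited.png")
```

The design point is that the edit is expressed entirely in natural language; the model, not the user, resolves which pixels to change and which to preserve.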

A very popular example: using the Nano Banana model, 13 input images were fused into a single complete, stylistically consistent image.

In addition to image editing, it also demonstrated excellent location-reasoning abilities.

In summary, the emergence of Gemini 2.5 Flash Image means that while other manufacturers are still pondering how to generate a good-looking image, Google has already begun letting AI understand and reconstruct the real visual world. That is no exaggeration: the capabilities of Google's video generation model Veo 3 confirm it.

2) Veo 3

In addition to text and image, Google has also made significant strides in the video modality.

With Veo 3's dynamic AI video generation, Google has placed the last and most important piece of its multimodal puzzle.

Before Veo 3's launch, the video generation models on the market (including OpenAI's Sora, Google's earlier Veo versions, Runway's Gen-3, and Luma's Dream Machine) were impressive but generally limited by three bottlenecks: short duration, poor logical consistency, and weak controllability.

What they generated resembled high-quality "dynamic images" rather than true "film narratives."

However, in May 2025, Google officially released Veo 3 at the I/O conference, changing the game.

Its biggest technological innovation is the realization of high-fidelity video and audio synchronized generation, including dialogue, sound effects, and ambient sounds, marking the transition of AI video generation out of the "silent film era."
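For developers, this synchronized audio-video generation is exposed through the same google-genai SDK; below is a minimal sketch of the documented long-running-operation pattern. The model ID, prompt, and polling interval are assumptions; check the official docs for current values.

```python
# Minimal sketch: text-to-video with audio via Veo 3 and the google-genai SDK.
# Video generation is a long-running operation, so the client polls until done.
import time

from google import genai

client = genai.Client()

operation = client.models.generate_videos(
    model="veo-3.0-generate-preview",  # assumed model ID
    prompt="A stand-up comedian on stage tells a short joke; the crowd laughs.",
)

# Poll the long-running operation until the video (with soundtrack) is ready.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("veo3_clip.mp4")  # dialogue, sound effects, and ambience included
```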

A particularly realistic talk show segment created with Veo 3 went viral online and left a strong impression on us.

To this day, despite several months having passed since its release, Veo 3 remains unmatched in the industry in terms of long video generation, logical coherence, and audio-visual synchronization.

The Hollywood Reporter even stated:

The emergence of Veo 3 marks the evolution of AI video generation technology from an expensive "toy" to a tool that can be integrated into professional production processes.

Now, advertising companies are using it to quickly generate visual samples of creative scripts, while independent filmmakers are using it to create fantastical visual effects that traditional filming cannot achieve.

A year of catching up has allowed Google to keep pace with top AI foundational model companies like OpenAI in the multimodal direction, and even surpass them.

Just a week ago, the well-known venture capital firm a16z released its latest report on the top 100 generative AI consumer applications. In this ranking, Gemini's user activity has risen to second place, behind only ChatGPT, on both web and mobile.

Google's "lead" in the multimodal field is not only reflected in a single model's metrics but also in its comprehensive ability to rapidly productize cutting-edge technology and create disruptive user experiences Looking back over the past six months, users have repeatedly experienced "Aha Moments" through Google's AI foundational models, which itself is the best amplifier for dissemination.

World Model Genie 3

If Gemini represents Google's deep investment in language and multimodal understanding, then Genie 3 showcases its "investment in the future" in generative AI and simulating reality. This is a pure, forward-looking investment.

This is also the expectation that all those concerned with AI and technology have for major companies:

This is what tech giants should be doing, and Google should do even more.

The "General Purpose World Model" Genie 3 launched by Google DeepMind is precisely the product of this expectation.

It can generate an explorable and manipulable 3D virtual world from a text prompt, supporting 720p resolution, 24 FPS real-time rendering, and maintaining consistency and interactive experience for several minutes.

Media at home and abroad have even called it the most advanced world simulator ever.

Users can move and interact in this dynamically generated world in real-time, experiencing a virtual environment that lasts for several minutes while maintaining consistency.
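A back-of-the-envelope calculation (ours, not Google's) shows why that consistency claim is hard: at 24 frames per second, a three-minute interactive session requires

$$24 \ \text{FPS} \times 180 \ \text{s} = 4320 \ \text{consecutive frames},$$

each of which must remain consistent with everything the user has already seen, even as their actions reshape the scene in real time.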

The revolutionary aspect of this technology is that it opens up infinite possibilities for training more general AI agents.

Traditional AI training requires a large number of pre-built environments, while Genie 3 can "create" endless, diverse training grounds out of thin air.

This capability will fundamentally change the processes of game development and film production. More importantly, it lays the groundwork for achieving general AI that can understand and adapt to complex physical worlds.

For example, the general world model will also play a significant role in the autonomous driving training within the automotive industry.

From the paper "Genie: Generative Interactive Environments" that accompanied Genie 1's birth in early 2024 to today's Genie 3, outside observers have been amazed at Google's ability to compete in the AI race on multiple fronts at once.

Many people might exclaim:

How does Google still have the energy to work on world models? And doing it so well?

It can be said that in the field of world models, Google has planted another "flag" on the path to AGI ahead of others. As DeepMind CEO Demis Hassabis describes it:

This simulated environment will allow the Agent to "learn in a virtual mental world, accelerating the path to AGI."

At this point, it can be said that a "peak AI Google" is on the way.

The Elephant Turns, Technology Monetizes

Behind Google's AI push are, of course, also adjustments to its organizational structure and changes in its talent strategy.

For the past 10 years, Google has actually had two top technology teams:

【1】Google Brain, initiated by Jeff Dean, Stanford professor Andrew Ng, and Greg Corrado;

【2】DeepMind, a UK AI startup acquired by Google in 2014.

The two teams never achieved the internal "harmony" at Google that outsiders might imagine.

At the end of 2022, the release of ChatGPT by OpenAI made it impossible for Google to continue ignoring this discord.

Just a few months later, in April 2023, Google announced the merger of the Google Brain team with DeepMind to form a new Google DeepMind department, with DeepMind co-founder Demis Hassabis as CEO. Meanwhile, Google Brain chief Jeff Dean was promoted to Google Chief Scientist, focusing on long-term AI research.

This merger was seen at the time as Google's response to the challenge from OpenAI, aiming to concentrate strengths, avoid internal competition, and accelerate the productization of AI research results.

1) Google Labs

It is not news for executives with technical backgrounds to be promoted into management; what is more noteworthy is Google Labs' status within the company. The department is now seen as more than an internal laboratory: it is regarded as the "AI innovation gene pool" driving Google's future.

The history of Google Labs can be traced back to 2002. It was once a symbol of engineer culture and the "20% time" work system, giving birth to classic products like Google Maps and iGoogle. However, after years of silence, it was reactivated at the 2023 Google I/O conference and quickly incubated various "quirky" AI projects.

Today's Google Labs is no longer just a creative incubator, but a complete methodology for building innovation natively inside a big company:

【1】It provides a fertile ground for any team within Google that has innovative ideas to quickly validate them, encouraging them to create seemingly "outlandish" AI projects.

【2】It connects the shortest path from a prototype concept to a product available for public experience, ensuring that innovation does not remain in the demonstration stage.

【3】This is a "free experimentation field" for Google employees.

As we covered in an earlier review of 8 very interesting products, the platform has nurtured a series of "small but beautiful" yet high-potential products, such as NotebookLM and Whisk.

These successful projects prove that when innovators are given enough freedom and resources, their imagination can create immense value. And Google is willing to provide such a platform.

So, why mention Google Labs?

Because Google has once again prioritized "innovation."

In April 2025, Sissie Hsiao, the executive who had led Bard and then the Gemini app, stepped down and was succeeded by Josh Woodward, vice president of Google Labs.

Woodward's background is closely tied to the spirit of Google Labs: he is one of the behind-the-scenes drivers of NotebookLM, a project that has been making waves across tech communities and media platforms since its inception.

By placing such a "product geek" and innovator in charge of Gemini, Google's intention is very clear:

Google can no longer be satisfied with merely showcasing its model's technical capabilities; it urgently needs to transform these capabilities into user-perceptible, market-winning super applications.

In summary, even at the senior level, Google keeps adjusting its internal "horse race," placing people who can better execute the innovation strategy in key positions.

2) Technology is no longer just for scientific research

Previously, DeepMind excelled in academic research, publishing many groundbreaking papers (such as AlphaGo, Transformer, etc.), and the Brain team also contributed a large number of open-source results (such as TensorFlow).

However, Google now places greater emphasis on commercial competitiveness. Reports indicate that Google DeepMind has begun to impose stricter reviews on research publications to avoid leaking valuable innovations or exposing weaknesses to competitors.

The Transformer architecture, known as the "foundation of ChatGPT," is a case in point: by 2023, all eight of its famous authors (the so-called "Transformer Eight") had left Google to start their own companies.

In the past, this might have been seen as Google's regret of "making wedding clothes for others."

But now the perspective has changed: on one hand, this proves that Google, as the "Huangpu Military Academy of the AI world," has trained core talent for the entire industry, and its technological influence has long since transcended company boundaries; on the other hand, it has prompted Google to reflect deeply and to value "not losing key talent" more than ever.

In the talent competition with Meta, Google has begun to change its attitude, doing everything possible to "retain talent."

For example, reports indicate that Google DeepMind offers core researchers compensation packages of up to $20 million per year and has shortened the equity vesting period to 3 years.

3) AI-First Company

In terms of organizational structure, Google has elevated AI to an unprecedented strategic height.

CEO Sundar Pichai has repeatedly emphasized that Google is an "AI-first" company, and now sees AI as the core of the company's future. Google has established various AI working groups internally, redirecting resources from search, advertising, and cloud departments towards AI.

The best engineers and the largest TPU computing clusters are prioritized for core AI projects like Gemini. All core product lines, from search, advertising, and cloud to Android, YouTube, and hardware (Pixel), must answer one question:

What is your AI strategy?

Then, the old departmental walls began to be broken down.

Engineers from the Google Search department now sit with the DeepMind team to jointly develop the Search Generative Experience (SGE), and Google Cloud integrates its AI capabilities, from AutoML onward, into the unified Vertex AI platform, providing end-to-end AI solutions for enterprise customers. This deep cross-departmental collaboration greatly improves efficiency and avoids the siloed, go-it-alone pattern of the past.

As a Bloomberg article put it, Google DeepMind is transforming from a "research laboratory" into an "AI product factory."

This transformation has served Google well in responding to external competition and integrating internal forces. After all, Google could not have shipped so many AI models and product updates in such a short period without strong coordination and execution.

In summary, after integrating every force it could, Google's AI organizational culture has also changed, which is exactly what we noted at the beginning:

Google has begun to monetize all its technological accumulation.

What we see now is a new Google: one that has shed its old aura, has clear goals, and executes formidably.

It is foreseeable that in the next six months to a year, we will welcome a "more high-profile, faster, and stronger" Google.

Zhang Lu, founding partner of Fusion Fund, mentioned a detail:

On the surface, OpenAI seems to have taken the lead, but many people overlook that Google has the deepest foundation among the large companies: it has both vertical research depth and horizontal technological breadth.

Therefore, when Google truly transforms this depth and breadth into product potential, its return will no longer be surprising.

From the skepticism that "Google has not produced an innovative product in five years," to claims that "in the AI era, Google will be left behind by OpenAI and become a legacy company," to today's simultaneous advances in foundational models, multimodality, world models, and application products.

In less than a year, Google has re-proven to the world one thing: Google is still that Google, and it is injecting its long-accumulated strength into its products without reservation.

This time, it not only returned to the table but also brought back that long-lost "composure" of letting technology speak.

Author of this article: Jing Shan, Source: Crossing, Original title: "A 6000-word review: Google AI's fierce comeback — from Nano Banana, Genie 3, Veo 3 to Gemini 2.5"

Risk Warning and Disclaimer

The market has risks, and investment requires caution. This article does not constitute personal investment advice and does not take into account the specific investment goals, financial conditions, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article are suitable for their specific circumstances. Investment based on this is at one's own risk.