
Science fiction! Google releases Gemini Robotics-ER 1.5: Robots have real thinking power

Google has released its latest robotic reasoning model, Gemini Robotics-ER 1.5, the first model in the Gemini Robotics series open to all developers. It is a vision-language model designed to enhance robots' ability to perceive and interact with the real world. Gemini Robotics-ER 1.5 can reason about the physical world, perform spatial reasoning, and plan actions from natural language commands, improving robots' autonomy and task execution.
Google has just released its most advanced robotic embodied reasoning model, Gemini Robotics-ER 1.5. It is the first model in the Gemini Robotics series that is broadly available to all developers, and it serves as an advanced reasoning brain for robots.
Gemini Robotics-ER 1.5 (short for Gemini Robotics-Embodied Reasoning) is a vision-language model (VLM) that brings Gemini's agentic capabilities into robotics. It is a thinking model that can reason about the physical world, natively invoke tools, and plan the logical steps needed to accomplish a task.
While Gemini Robotics-ER 1.5 is similar to other Gemini models, it is specifically built to enhance robotic perception and interaction capabilities in the real world. It provides advanced reasoning capabilities to solve physical problems by interpreting complex visual data, performing spatial reasoning, and planning actions based on natural language commands.
In terms of operation, Gemini Robotics-ER 1.5 is designed to work with existing robot controllers and behaviors. It can invoke the robot's APIs in sequence, orchestrating those behaviors so the robot can complete long-horizon tasks.
With Gemini Robotics-ER 1.5, the following robotic applications can be built:
Enabling people to assign complex tasks using natural language, making robots easier to use.
Improving the autonomy of robots by enabling them to reason, adapt, and respond to changes in open environments.
Gemini Robotics-ER 1.5 provides a unified model for a wide range of robotic tasks:
- Locating and identifying objects: accurately point to items and draw bounding boxes around them in the environment.
- Understanding object relationships: reason about spatial layout and scene context to make informed decisions.
- Planning grasps and trajectories: generate grasp points and trajectories for manipulating objects.
- Interpreting dynamic scenes: analyze video frames to track objects and understand actions over time.
- Orchestrating long-horizon tasks: break down natural language commands into a sequence of logical subtasks and function calls to existing robot behaviors.
- Human-robot interaction: understand instructions given in natural language, whether typed or spoken.
The preview version of Gemini Robotics-ER 1.5 is now available. You can get started in the following ways:
- Launch Google AI Studio to experiment with the model.
- Read the developer documentation for a complete quick start and API reference:
https://ai.google.dev/gemini-api/docs/robotics-overview?utm_source=gemini-robotics-er-1.5&utm_medium=blog&utm_campaign=launch&hl=zh-cn
- Open the official Colab notebook to see practical examples:
https://github.com/google-gemini/cookbook/blob/main/quickstarts/gemini-robotics-er.ipynb?utm_source=gemini-robotics-er-1.5&utm_medium=blog&utm_campaign=launch
- Read the complete technical report:
https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf
This model is designed for tasks that are extremely challenging for robots.
Imagine telling a robot: "Please sort these items into the correct bins for kitchen waste, recyclables, and regular trash."
To complete this task, the robot needs to:
- Look up local waste sorting guidelines online.
- Understand the various items in front of it.
- Plan a sorting method based on local rules.
- Execute all the steps to complete the disposal.
Everyday tasks like this often require contextual information and multiple steps to accomplish.
Gemini Robotics-ER 1.5 is the first thinking model optimized for this kind of embodied reasoning, and it achieves state-of-the-art results on both academic and internal benchmarks.

What new capabilities does Gemini Robotics-ER 1.5 have?
Gemini Robotics-ER 1.5 has been purposefully fine-tuned for robotic applications and introduces several new features:
Fast and powerful spatial reasoning: Delivers state-of-the-art spatial understanding at the low latency of the Gemini Flash models. The model excels at generating semantically precise 2D points grounded in reasoning about item size, weight, and affordances, supporting instructions like "point out all the objects you can pick up" for precise, rapid interaction.
Orchestrating advanced agent behaviors: Reliably executes long-horizon tasks (e.g., "reorganize my desk according to this photo") using advanced spatial and temporal reasoning, planning, and success-detection capabilities. It can also natively call Google Search and any third-party custom functions (e.g., "sort the waste according to local regulations").
Flexible thinking budget: Developers can now directly control the trade-off between the model's latency and accuracy. For complex tasks like planning a multi-step assembly, you can let the model "think longer"; for tasks that need quick responses, such as detecting or pointing at objects, you can request faster answers.
Improved safety filters: The model's semantic safety has been strengthened, so it is better at recognizing and refusing to generate plans that violate physical constraints (e.g., exceeding the robot's payload capacity), letting developers build with greater confidence.
Intelligent Brain
You can think of Gemini Robotics-ER 1.5 as the robot's advanced brain. It can understand complex natural language instructions, reason over long-horizon tasks, and orchestrate sophisticated behaviors.
When it receives a complex request like "clean up the table," Gemini Robotics-ER 1.5 can break it down into a plan and call the right tools to execute it, whether that is the robot's hardware API, a specialized grasping model, or a vision-language-action (VLA) model for motion control.
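To make this concrete, here is a minimal sketch of the function-calling pattern using the google-genai Python SDK. The model ID ("gemini-robotics-er-1.5-preview") and the two robot behaviors (pick_object, place_object) are assumptions standing in for whatever your controller actually exposes; check the developer documentation for the current identifiers.

```python
# Minimal sketch: let the model orchestrate hypothetical robot behaviors.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Declare the robot behaviors the model is allowed to call.
robot_tools = types.Tool(function_declarations=[
    types.FunctionDeclaration(
        name="pick_object",  # hypothetical behavior exposed by your controller
        description="Pick up the named object with the gripper.",
        parameters=types.Schema(
            type=types.Type.OBJECT,
            properties={"object_name": types.Schema(type=types.Type.STRING)},
            required=["object_name"],
        ),
    ),
    types.FunctionDeclaration(
        name="place_object",  # hypothetical behavior exposed by your controller
        description="Place the currently held object at the named location.",
        parameters=types.Schema(
            type=types.Type.OBJECT,
            properties={"location": types.Schema(type=types.Type.STRING)},
            required=["location"],
        ),
    ),
])

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed preview model ID
    contents="Clean up the table: put all the pens into the pen holder.",
    config=types.GenerateContentConfig(tools=[robot_tools]),
)

# The model replies with function calls; your own code dispatches each one
# to the robot controller, executes it, and reports the result back.
for call in response.function_calls or []:
    print(call.name, dict(call.args))
```

In a real control loop you would also pass an image of the scene with each request and feed execution results back to the model so it can plan the next step.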
Advanced Spatial Understanding Ability
To interact with the physical world, robots must be able to perceive and understand their environment. Gemini Robotics-ER 1.5 has been fine-tuned to generate high-quality spatial results, providing precise 2D coordinate points for objects.
In terms of pointing accuracy, Gemini Robotics-ER 1.5 is currently the most precise vision-language model.

For example, in the 2D Coordinate Point Generation task, given an image of a kitchen scene, the model can provide the location of each item.

Prompt:
Identify the following items in the image: dish soap, dish rack, faucet, rice cooker, unicorn. The coordinate point format is [y, x], with values normalized to 0-1000. Only include items that actually exist in the image.
Notably, the prompt asks the model to label only items that are actually present in the image. This helps prevent hallucinations (e.g., producing coordinates for a unicorn that isn't there) and keeps the model grounded in visual reality.
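As a rough illustration, the query above could be sent through the Gemini API with the google-genai Python SDK as sketched below. The preview model ID ("gemini-robotics-er-1.5-preview") and the local kitchen.jpg test image are assumptions, and the JSON output shape follows the prompt rather than a documented schema, so verify the details against the developer docs.

```python
# Minimal sketch: 2D point generation for items in a kitchen image.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("kitchen.jpg", "rb") as f:  # hypothetical test image
    image_bytes = f.read()

prompt = (
    "Identify the following items in the image: dish soap, dish rack, "
    "faucet, rice cooker, unicorn. Return a JSON list with a "
    '"point" ([y, x], normalized to 0-1000) and a "label" for each item. '
    "Only include items that actually exist in the image."
)

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed preview model ID
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        prompt,
    ],
)

# Expect a JSON list of labeled points; the non-existent "unicorn"
# should simply not appear in the output.
print(response.text)
```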
Temporal Reasoning Ability
True spatiotemporal reasoning involves not only locating objects but also understanding the relationships between objects and actions as they unfold over time.
Gemini Robotics-ER 1.5 understands causal relationships in the physical world by processing videos.
For example, in a demonstration video, the robotic arm first places a green marker into a wooden tray, then places blue and red markers into a pen holder. When the model is asked to describe the task steps in order, it gives a completely correct answer.
Prompt:
Provide a detailed description of each step to complete the task. Break it down by timestamp and output in JSON format, including "start_timestamp", "end_timestamp", and "description" keys.
Response: an ordered, timestamped JSON breakdown of each step (not reproduced here).
The model can even provide a more detailed breakdown by the second for specific time periods (e.g., from 15 seconds to 22 seconds), with results that are very precise in timing.
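A sketch of this kind of video query, under the same assumptions about the SDK and model ID (the video filename is hypothetical), might look like this:

```python
# Minimal sketch: timestamped task breakdown from a video.
import time

from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Upload the demo video and wait until the Files API finishes processing it.
video = client.files.upload(file="marker_sorting_demo.mp4")
while video.state.name == "PROCESSING":
    time.sleep(2)
    video = client.files.get(name=video.name)

prompt = (
    "Provide a detailed description of each step to complete the task. "
    "Break it down by timestamp and output JSON with 'start_timestamp', "
    "'end_timestamp', and 'description' keys."
)

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed preview model ID
    contents=[video, prompt],
)

# Expect an ordered, timestamped JSON list of steps (tray first, then pen holder).
print(response.text)
```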
Orchestrating long-horizon tasks based on manipulation reasoning
When thinking is enabled, the model can reason through complex pointing and bounding-box queries. Here is a coffee-making example that shows how the model understands both the "how" and the "where" of completing a task.
- Question: Where should I place the cup to brew coffee?
Answer: The model marks a bounding box under the coffee machine.

- Question: Where should the coffee capsule be placed?
Answer: The model marks a bounding box at the capsule compartment on top of the coffee machine.

- Question: Now I need to close the coffee machine. Please draw a trajectory made up of 8 points indicating how the lid handle should move to close it.
Answer: The model generates an accurate path from the open position to the closed position.

- Question: I've finished my coffee. Where should I place the cup to wash it?
Answer: The model marks a point in the sink.
By combining planning and spatial positioning, the model can generate a "spatial anchoring" plan that links textual instructions with specific locations and actions in the physical world.
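If you wanted to reproduce the trajectory query from the coffee example programmatically, a sketch along the following lines could work (same assumed SDK and model ID; the scene image is hypothetical, and the [y, x] point format comes from the prompt rather than a documented schema):

```python
# Minimal sketch: ask for an 8-point lid-closing trajectory as JSON.
import json

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("coffee_machine.jpg", "rb") as f:  # hypothetical scene image
    image_bytes = f.read()

prompt = (
    "Now, I need to close the coffee machine. Please draw a trajectory made "
    "up of 8 points indicating how the lid handle should move to close it. "
    "Return a JSON list of [y, x] points normalized to 0-1000, ordered from "
    "the open position to the closed position."
)

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed preview model ID
    contents=[types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"), prompt],
    # Asking for JSON output makes the reply easier to parse downstream.
    config=types.GenerateContentConfig(response_mime_type="application/json"),
)

trajectory = json.loads(response.text)  # a list of 8 [y, x] points
print(trajectory)
```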
Flexible Thinking Budget
A chart in Google's announcement (not reproduced here) shows how adjusting the thinking budget of Gemini Robotics-ER 1.5 affects latency and performance.
The model's performance improves as the thinking-token budget increases. For simple spatial understanding tasks such as object detection, a very small budget already achieves high performance, while more complex reasoning tasks benefit from a larger budget.
This lets developers strike a balance between tasks that need low-latency responses and demanding tasks that need high-accuracy results. Developers can set the thinking budget through the thinking_config option in the request, or disable thinking entirely.
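For reference, here is a minimal sketch of adjusting the budget through thinking_config with the google-genai SDK (same assumed model ID; in practice you would also pass the scene image with each request, and whether a budget of 0 fully disables thinking for this model is something to confirm in the docs):

```python
# Minimal sketch: trade latency for accuracy via the thinking budget.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Fast path: minimal thinking budget for a quick pointing/detection query.
fast = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed preview model ID
    contents="Point to the red mug. Return [y, x] normalized to 0-1000.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=0),
    ),
)

# Deliberate path: larger budget for multi-step planning.
planned = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",
    contents="Plan the steps needed to reorganize my desk to match the reference photo.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=4096),
    ),
)

print(fast.text)
print(planned.text)
```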
AI Cambrian, original title: "Science Fiction! Google Releases Gemini Robotics-ER 1.5: Robots Have Real Thinking Power"