Google DeepMind has unveiled Gemini Robotics, a new AI model that combines language processing, visual understanding, and physical action control for robots, allowing them to carry out tasks from spoken instructions, as shown in a series of demonstration videos.
In those demonstrations, the model controlled robots performing diverse tasks such as folding paper, handling objects, and moving letters around a tabletop, and the company highlighted its ability to generalize across different robotic hardware.
A variant, Gemini Robotics-ER (for embodied reasoning), which focuses solely on visual and spatial understanding, was also introduced; it is intended to let other researchers train their own robot control models.
Bringing this kind of AI capability into robotics promises more versatile and useful machines, and the researchers emphasize how general-concept understanding makes robots more broadly capable. Even so, major challenges remain in the wider field of AI-powered robotics.
In sci-fi tales, artificial intelligence often powers all sorts of clever, capable, and occasionally homicidal robots. A revealing limitation of today’s best AI is that, for now, it remains squarely trapped inside the chat window.
Google DeepMind signaled a plan to change that today—presumably minus the homicidal part—by announcing a new version of its AI model Gemini that fuses language, vision, and physical action together to power a range of more capable, adaptive, and potentially useful robots.
In a series of demonstration videos, the company showed several robots equipped with the new model, called Gemini Robotics, manipulating items in response to spoken commands: Robot arms fold paper, hand over vegetables, gently put a pair of glasses into a case, and complete other tasks. The robots rely on the new model to connect items that are visible with possible actions in order to do what they’re told. The model is trained in a way that allows behavior to be generalized across very different hardware.
Google DeepMind also announced a version of its model called Gemini Robotics-ER (for embodied reasoning), which has just visual and spatial understanding. The idea is for other robot researchers to use this model to train their own models for controlling robots’ actions.
In a video demonstration, Google DeepMind’s researchers used the model to control a humanoid robot called Apollo, from the startup Apptronik. The robot converses with a human and moves letters around a tabletop when instructed to.
“We've been able to bring the world-understanding—the general-concept understanding—of Gemini 2.0 to robotics,” said Kanishka Rao, a robotics researcher at Google DeepMind who led the work, at a briefing ahead of today’s announcement.
Google DeepMind says the new model is able to control different robots successfully in hundreds of specific scenarios not previously included in their training. “Once the robot model has general-concept understanding, it becomes much more general and useful,” Rao said.
The breakthroughs that gave rise to powerful chatbots, including OpenAI’s ChatGPT and Google’s Gemini, have in recent years raised hope of a similar revolution in robotics, but big hurdles remain.