Alphabet-C's RT-2 model is trained on text and images from the web and directly instructs a robot's actions. For example, it lets a robot recognize what garbage is, and even work out how to throw it away, without being explicitly trained on that task. Alphabet-C claims that on new tasks that never appeared in training, RT-2's success rate is almost double that of the previous generation, and that RT-2 can respond to user instructions using basic reasoning.
Alphabet-C is building advanced artificial intelligence (AI) models into robots, giving them an AI brain.
On Friday, July 28, Alphabet-C announced Robotics Transformer 2 (RT-2), an AI model designed for robotics. It is a new "vision-language-action" (VLA) model that can help robots learn to understand tasks such as throwing away garbage.
According to Alphabet-C, RT-2 is based on the Transformer architecture and is trained on text and images from the internet so that it can directly instruct robots to perform actions. Just as language models pick up human ideas and concepts from their text training data, RT-2 can draw on online data to give robots relevant knowledge and guide their behavior.
Alphabet-C offers an example: to get earlier robot systems to throw away garbage, engineers had to explicitly train the robot to recognize what garbage is, and then to pick it up and dispose of it. RT-2, by contrast, can pass knowledge drawn from the internet to the robot, so the robot understands what garbage is without explicit training. Even though it has never been trained to throw away garbage, it knows how to do it.
Alphabet-C states that RT-2 can translate information into actions, and that with its help robots are expected to adapt more quickly to new situations and environments.
After running more than 6,000 tests on the RT-2 model, Alphabet-C's team found that on tasks already present in the training data, that is, tasks the model had "seen" before, RT-2 performed about as well as its predecessor, RT-1. In novel situations it had not encountered before, however, RT-2's success rate nearly doubled, reaching 62% compared with RT-1's 32%.
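To make the "nearly doubled" claim concrete, the short sketch below works through the arithmetic using only the two success rates reported above (62% for RT-2, 32% for RT-1 on novel tasks). The variable names are illustrative and do not come from Alphabet-C.

```python
# Success rates on novel (unseen) tasks, as reported in the article.
rt1_success = 0.32
rt2_success = 0.62

# Relative improvement: how many times higher RT-2's success rate is.
ratio = rt2_success / rt1_success

# Absolute improvement, expressed in percentage points.
absolute_gain_points = (rt2_success - rt1_success) * 100

print(f"RT-2 / RT-1 ratio: {ratio:.2f}x")
print(f"Absolute gain: {absolute_gain_points:.0f} percentage points")
```

The ratio comes out to roughly 1.94, which is why the result is described as "nearly" rather than fully doubled; in absolute terms the gain is 30 percentage points.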
In other words, with RT-2, robots can learn in a more human-like way and apply the concepts they have learned to completely new situations.
Alphabet-C claims that RT-2 can generalize beyond the robot data it has been exposed to and that it has semantic and visual understanding capabilities. These include interpreting new commands and responding to user instructions through basic reasoning, such as reasoning about object categories and higher-level descriptions.
Alphabet-C's research also shows that, with chain-of-thought reasoning, RT-2 can perform multi-stage semantic reasoning, such as deciding which object could serve as an improvised hammer or which type of beverage best suits a tired person.
According to media reports on Friday, Alphabet-C has no immediate plans to mass-produce or sell robots running RT-2, but such robots may eventually be used in warehouses or as home assistants.