Today, AgiBot launches Genie Operator-1 (GO-1), an innovative generalist embodied foundation model. GO-1 introduces the novel Vision-Language-Latent-Action (ViLLA) framework, combining a Vision-Language Model (VLM) and a Mixture of Experts (MoE). The VLM utilizes internet-scale heterogeneous data to establish a solid foundation for scene and object understanding.
The MoE consists of two key components: the Latent Planner, which learns from cross-embodiment and human operation data to develop general action understanding, and the Action Expert, which uses over a million real robot demonstrations to achieve high-frequency and dexterous manipulation.
These components work in synergy to provide GO-1’s unique capabilities.
Paper: https://agibot-world.com/blog/agibot_go1.pdf
YouTube Link: https://youtu.be/9dvygD4G93c
At the end of 2024, AgiBot launched the AgiBot World dataset, a large-scale, high-quality real-world robotics dataset comprising over 1 million trajectories across 217 tasks in five application domains. Building on AgiBot World, AgiBot today introduces Genie Operator-1 (GO-1), a generalist embodied foundation model.
GO-1: An Evolution from VLA to ViLLA
To maximize the value of the high-quality AgiBot World dataset as well as web-scale heterogeneous videos while improving the policy’s generalization capability, AgiBot proposes a hierarchical Vision-Language-Latent-Action (ViLLA) framework. Compared to the Vision-Language-Action (VLA) model, where actions are directly conditioned on vision and language inputs, the ViLLA model predicts latent action tokens, bridging the gap between image-text inputs and robot actions generated by the action expert.
The ViLLA framework consists of a VLM and an MoE. The VLM leverages massive internet-scale multimodal data to obtain general scene understanding and language comprehension. The Latent Planner in the MoE harnesses data from various embodiments and human actions to build general action understanding, while the Action Expert, trained with over a million real-world robot demonstrations, refines action execution. During inference, the three components cooperate as follows: the VLM interprets the visual observations and the language instruction, the Latent Planner predicts latent action tokens from this representation, and the Action Expert decodes these tokens into executable low-level robot actions.
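A minimal sketch of how such an inference pass could be wired together is shown below; the module interfaces (vlm, latent_planner, action_expert) and tensor shapes are hypothetical placeholders, not AgiBot's actual API.

```python
# Minimal sketch of a ViLLA-style inference pass (hypothetical interfaces;
# module names, methods, and tensor shapes are illustrative, not AgiBot's API).

def villa_step(vlm, latent_planner, action_expert, images, instruction):
    """One control step: observations + instruction -> a chunk of robot actions."""
    # 1) The VLM encodes multi-view camera images and the language instruction
    #    into a shared multimodal representation.
    vl_tokens = vlm.encode(images, instruction)                 # (B, T_vl, D)

    # 2) The Latent Planner predicts discrete latent action tokens describing
    #    "what should happen next" at a coarse, embodiment-agnostic level.
    latent_tokens = latent_planner.predict(vl_tokens)           # (B, T_latent)

    # 3) The Action Expert decodes the latent tokens (with the VLM context) into
    #    a chunk of continuous, high-frequency low-level actions for execution.
    actions = action_expert.decode(vl_tokens, latent_tokens)    # (B, horizon, action_dim)
    return actions
```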
The two key components of the MoE, the Latent Planner and the Action Expert, are introduced below.
Expert 1: Latent Planner
Although the AgiBot World dataset is the largest real-world robot dataset globally, the volume of action-labeled robot data remains limited relative to internet-scale datasets. To address this, AgiBot employs latent actions to model the inverse dynamics between consecutive frames, enabling the transfer of real-world dynamics from heterogeneous data sources into universal manipulation knowledge. The main design choices of the Latent Planner are as follows (a brief sketch follows the list):
○ The encoder employs a spatial-temporal transformer with causal temporal masks.
○ The decoder uses a spatial transformer, taking the initial frame and the discretized latent action tokens as input.
○ Latent Action Tokens are quantized using VQ-VAE.
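A minimal sketch of such a VQ-VAE latent action model is shown below, with simple convolutional stand-ins for the spatial-temporal transformer encoder and spatial transformer decoder; all module and variable names are illustrative assumptions, not AgiBot's implementation.

```python
# Hypothetical sketch of a VQ-VAE-based latent action model: an inverse-dynamics
# encoder maps a pair of frames to discrete latent action tokens, and a decoder
# reconstructs the future frame from the initial frame plus those tokens.
# Simple convs stand in for the transformers described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionVQVAE(nn.Module):
    def __init__(self, codebook_size=512, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(            # stand-in for the spatial-temporal transformer
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, code_dim, 4, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, code_dim)   # VQ codebook
        self.decoder = nn.Sequential(            # stand-in for the spatial transformer decoder
            nn.Conv2d(3 + code_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def quantize(self, z):
        # Nearest-codebook-entry lookup with straight-through gradients.
        flat = z.permute(0, 2, 3, 1).reshape(-1, z.shape[1])
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        q = self.codebook(idx).view(z.shape[0], z.shape[2], z.shape[3], -1).permute(0, 3, 1, 2)
        return z + (q - z).detach(), idx

    def forward(self, frame_t, frame_tk):
        # Inverse dynamics: infer the latent "action" that explains o_t -> o_{t+k}.
        z = self.encoder(torch.cat([frame_t, frame_tk], dim=1))
        q, idx = self.quantize(z)
        q_up = F.interpolate(q, size=frame_t.shape[-2:], mode="nearest")
        recon = self.decoder(torch.cat([frame_t, q_up], dim=1))
        return recon, idx

# Training signal: reconstruct the future frame, plus the usual VQ commitment loss;
# the discrete indices then serve as latent action labels for the Latent Planner.
```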
Expert 2: Action Expert
To achieve high-frequency and dexterous manipulation, AgiBot integrates an action expert that utilizes a diffusion objective to model the continuous distribution of low-level actions.
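A minimal sketch of a diffusion-objective action head is shown below, assuming a standard DDPM-style noise-prediction loss over chunks of continuous actions; the network architecture and conditioning interface are hypothetical.

```python
# Hypothetical sketch of a diffusion-objective action expert: a denoising network
# is trained to predict the noise added to ground-truth action chunks, conditioned
# on upstream (VLM + latent planner) features. Names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffusionActionExpert(nn.Module):
    def __init__(self, action_dim=14, horizon=16, cond_dim=512, steps=100):
        super().__init__()
        self.steps = steps
        # Linear noise schedule; betas define how much noise is added per step.
        betas = torch.linspace(1e-4, 0.02, steps)
        self.register_buffer("alphas_cumprod", torch.cumprod(1.0 - betas, dim=0))
        self.denoiser = nn.Sequential(           # simple MLP stand-in for the expert network
            nn.Linear(horizon * action_dim + cond_dim + 1, 1024), nn.ReLU(),
            nn.Linear(1024, horizon * action_dim),
        )

    def loss(self, actions, cond):
        # actions: (B, horizon, action_dim) ground-truth chunk; cond: (B, cond_dim).
        x0 = actions.flatten(1)
        t = torch.randint(0, self.steps, (x0.shape[0],), device=x0.device)
        a_bar = self.alphas_cumprod[t].unsqueeze(1)
        noise = torch.randn_like(x0)
        xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # forward diffusion
        t_emb = (t.float() / self.steps).unsqueeze(1)
        pred = self.denoiser(torch.cat([xt, cond, t_emb], dim=1))
        return F.mse_loss(pred, noise)                        # epsilon-prediction objective
```

At inference, the learned denoiser would be run over several reverse-diffusion steps to sample a continuous action chunk, which is what allows the expert to model the full distribution of dexterous, high-frequency motions rather than a single regression target.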
Experimental Results
AgiBot evaluated GO-1, built on the Vision-Language-Latent-Action (ViLLA) framework, across five tasks of varying complexity. GO-1 significantly outperforms current state-of-the-art models, raising the average success rate by 32 percentage points (46% → 78%), with tasks such as “Pour Water” and “Restock Beverage” showing particularly large improvements. AgiBot also validated the contribution of the Latent Planner within the ViLLA framework, which improves the success rate by 12 percentage points (66% → 78%).
GO-1: Comprehensive Innovation of Embodied Intelligence
AgiBot GO-1 leverages human and diverse types of robot data, enabling robots to acquire revolutionary learning capabilities. It can generalize across various environments and objects, quickly adapt to new tasks, and learn new skills. At the same time, it can be deployed across various robotic embodiments, enabling efficient implementation and continuous evolution in real-world environments.
In short, GO-1’s key characteristics are learning from human videos and diverse robot data, generalizing across environments and objects, adapting quickly to new tasks, and deploying across different robotic embodiments. Its launch marks a rapid advancement of embodied intelligence towards generalization, openness, and enhanced capability.
AgiBot GO-1 will accelerate the widespread adoption of embodied intelligence, transforming robots from task-specific tools into autonomous agents with general intelligence. It will play a greater role across various domains, including manufacturing, service, and household applications, paving the way for a more versatile and intelligent future.
AgiBot official website: https://www.agibot.com/
LinkedIn: https://www.linkedin.com/feed/update/urn:li:activity:7304747190139150338
Facebook: https://fb.watch/yefx6B0bsC/
Media Contact
Company Name: AgiBot
Contact Person: William Peng
Country: China
Website: https://www.agibot.com/