[Sep 2022] Started at Stanford as a PhD student, generously supported by the Stanford SoE Fellowship.
[Feb 2022] Invited Talks @ Google, FAIR, Sea AI Lab, and ByteDance AI Lab on our LLMs as Zero-Shot Planners project.
[Dec 2021] Invited Talk @ Intel AI Lab on "Generalization across Objects and Morphologies in Robot Learning".
Research
The goal of my research is to endow robots with broad generalization capabilities for open-world manipulation tasks, especially in household environments.
Towards this goal, I am interested in 1) developing structured representations that leverage foundation models and Internet-scale data, and 2) developing algorithms that produce broadly generalizable behaviors.
Large vision models and vision-language models can generate keypoint-based constraints, which can be optimized to achieve multi-stage, in-the-wild, bimanual, and reactive behaviors, without task-specific training or environment models.
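To make the idea concrete, here is a minimal sketch of optimizing an end-effector position against keypoint-based constraints; the keypoints, constraint functions, and numeric values are hypothetical stand-ins, not the actual system.

```python
# Minimal sketch: optimize an end-effector position so that keypoint-based
# constraints (hypothetical cost functions here) are satisfied.
import numpy as np
from scipy.optimize import minimize

# Hypothetical keypoints detected in the scene (3D positions).
keypoints = {"cup_handle": np.array([0.4, 0.1, 0.2]),
             "teapot_spout": np.array([0.6, -0.1, 0.3])}

def constraint_reach(ee_pos):
    # e.g., "the end-effector should reach the cup handle"
    return np.linalg.norm(ee_pos - keypoints["cup_handle"])

def constraint_above(ee_pos):
    # e.g., "stay at least 5 cm above the teapot spout"
    return max(0.0, keypoints["teapot_spout"][2] + 0.05 - ee_pos[2])

def total_cost(ee_pos):
    return constraint_reach(ee_pos) + constraint_above(ee_pos)

result = minimize(total_cost, x0=np.zeros(3), method="Nelder-Mead")
print("optimized end-effector position:", result.x)
```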
Large language models and vision-language models can be used to directly label affordances and constraints in the 3D perceptual space. Combined with motion planning, this enables robots to perform diverse everyday manipulation tasks in a zero-shot manner.
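A toy sketch of the idea, assuming affordances and constraints are composed into a voxel value map whose best voxel is handed to a motion planner; the grid size and labeled regions are illustrative only.

```python
# Toy sketch: compose model-labeled affordances and avoidance regions into a
# 3D value map, then pick the best voxel as a motion-planning target.
import numpy as np

grid = np.zeros((50, 50, 50))          # workspace discretized into voxels

# Hypothetical labels produced from language/vision model outputs:
grid[30:35, 20:25, 10:15] += 1.0       # high affordance near the target object
grid[10:20, 10:20, :] -= 5.0           # penalty around a region to avoid

target_voxel = np.unravel_index(np.argmax(grid), grid.shape)
print("voxel to hand to the motion planner:", target_voxel)
```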
PaLM-E: An Embodied Multimodal Language Model
Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence
International Conference on Machine Learning (ICML), 2023.
Project Page / Paper / Google AI Blog / Summary
Language models can digest real-world sensor modalities (e.g., images) to be embodied in the physical world. The largest model, with 562B parameters, is a generalist agent across language, vision, and task planning.
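A toy sketch of the multimodal idea, with random arrays standing in for the text embeddings and the vision encoder output; the dimensions and splice point are placeholders.

```python
# Toy sketch: continuous sensor embeddings are interleaved with text token
# embeddings to form one input sequence for the language model.
import numpy as np

d = 8                                        # toy embedding width
text_tokens = np.random.randn(5, d)          # embedded text prompt (placeholder)
image_embeddings = np.random.randn(3, d)     # output of a vision encoder (placeholder)

# Interleave: [text ... image ... rest of text] becomes a single sequence.
multimodal_sequence = np.concatenate(
    [text_tokens[:2], image_embeddings, text_tokens[2:]], axis=0)
print(multimodal_sequence.shape)             # (8, d) sequence fed to the LM
```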
Large language models can be grounded in embodied environments by using continuous probabilities to guide their token decoding, where the guidance is provided by a set of grounded models, such as affordance, safety, and preference functions.
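A minimal sketch of the decoding rule, with toy probability tables standing in for the language model and the grounded models.

```python
# Minimal sketch: at each decoding step, weight the language model's token
# probabilities by scores from grounded models (affordance, safety, ...).
# All probability tables below are toy placeholders.
lm_probs   = {"pick up the sponge": 0.5, "pick up the knife": 0.4, "dance": 0.1}
affordance = {"pick up the sponge": 0.9, "pick up the knife": 0.8, "dance": 0.1}
safety     = {"pick up the sponge": 1.0, "pick up the knife": 0.2, "dance": 1.0}

scores = {tok: lm_probs[tok] * affordance[tok] * safety[tok] for tok in lm_probs}
total = sum(scores.values())
grounded = {tok: s / total for tok, s in scores.items()}

print(max(grounded, key=grounded.get))   # -> "pick up the sponge"
```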
Using hierarchical code generation, language models can write robot policy code that exhibits spatial-geometric reasoning given abstract natural language instructions.
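A rough sketch of hierarchical code generation, assuming a placeholder `llm(prompt)` interface: when generated policy code calls a helper that is not yet defined, the model is queried again to define that helper.

```python
# Sketch: if generated policy code calls an undefined helper, recursively ask
# the language model to generate that helper too. `llm(prompt)` is a
# hypothetical wrapper around a code-generation model.
import ast

def undefined_calls(code, known):
    """Names called in `code` that are neither defined in it nor in `known`."""
    tree = ast.parse(code)
    called = {n.func.id for n in ast.walk(tree)
              if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)}
    defined = {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    return called - defined - known

def generate_policy(instruction, llm, known=frozenset({"print", "move_to", "grasp"})):
    """Generate policy code; recursively define any missing helpers.
    `known` holds names assumed to exist (robot APIs, builtins) -- hypothetical here."""
    code = llm(f"# Write robot code for: {instruction}")
    for name in undefined_calls(code, known):
        helper = generate_policy(f"define the Python function `{name}`", llm,
                                 known | {name})
        code = helper + "\n\n" + code
    return code
```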
Provided with textual embodied feedback, language models can articulate a grounded "thought process", solving challenging long-horizon robotic tasks, even under disturbances.
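A minimal sketch of the closed-loop idea; `llm`, `execute`, and `describe_scene` are hypothetical interfaces standing in for the planner, the robot, and a scene describer.

```python
# Sketch of a closed-loop "inner monologue": textual feedback (success
# detection, scene description) is appended to the prompt before replanning.
def run_task(instruction, llm, execute, describe_scene, max_steps=10):
    prompt = f"Task: {instruction}\n"
    for _ in range(max_steps):
        action = llm(prompt + "Robot action:")
        if action.strip() == "done":
            break
        success = execute(action)                  # act, observe success
        prompt += f"Robot action: {action}\n"
        prompt += f"Success: {success}\n"
        prompt += f"Scene: {describe_scene()}\n"   # grounded textual feedback
    return prompt  # the full grounded "thought process" transcript
```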
Large language models (e.g., GPT-3, Codex) contain rich actionable knowledge that can be used to plan actions for embodied agents, even without additional training.
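One way to ground such free-form plans is to map each generated step to the closest admissible action; this sketch uses stdlib string similarity in place of sentence embeddings so it stays self-contained, and the action set is hypothetical.

```python
# Sketch: map a free-form step produced by a language model to the closest
# admissible action in the environment. A real system would use sentence
# embeddings; difflib similarity stands in here.
import difflib

admissible_actions = ["walk to kitchen", "open fridge", "grab milk", "close fridge"]

def translate(free_form_step):
    return max(admissible_actions,
               key=lambda a: difflib.SequenceMatcher(None, free_form_step, a).ratio())

print(translate("go to the kitchen"))   # -> "walk to kitchen"
```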
With an appropriate object representation, a multi-task RL policy can control an anthropomorphic hand to manipulate 100+ diverse objects and achieve state-of-the-art performance on unseen ones.
Expressing robots as collections of modular components that share a control policy can lead to zero-shot generalization across diverse unseen robot morphologies.
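A toy sketch of the shared-policy idea: the same parameters control every limb, so a robot with more limbs simply applies the policy more times (the message passing between connected limbs used in practice is omitted here).

```python
# Sketch: one shared (toy linear) policy applied to every limb; morphology only
# changes how many times it is applied, so unseen morphologies need no new
# parameters. Weights and observation sizes are placeholders.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 1))             # shared weights across all limbs

def act(limb_observations):
    # limb_observations: (num_limbs, 4); the same W produces one torque per limb
    return limb_observations @ W

print(act(rng.normal(size=(3, 4))).shape)   # 3-limb robot -> (3, 1)
print(act(rng.normal(size=(6, 4))).shape)   # 6-limb robot -> (6, 1)
```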