Research

Encoded in the large, highly evolved sensory and motor portions of the human brain is a billion years of experience about the nature of the world and how to survive in it. The deliberate process we call reasoning is, I believe, the thinnest veneer of human thought, effective only because it is supported by this much older and much more powerful, though usually unconscious, sensorimotor knowledge. We are all prodigious Olympians in perceptual and motor areas, so good that we make the difficult look easy. Abstract thought, though, is a new trick, perhaps less than 100 thousand years old. We have not yet mastered it. It is not all that intrinsically difficult; it just seems so when we do it. -- Hans Moravec

Projects

[Teaser figure: curricular data visualization]

Cross-Episodic Curriculum for Transformer Agents

Lucy Xiaoyang Shi*, Yunfan Jiang*, Jake Grigsby, Linxi 'Jim' Fan†, Yuke Zhu†

Neural Information Processing Systems (NeurIPS), 2023

paper / project / code / twitter

CEC enhances Transformer agents' learning efficiency and generalization by structuring cross-episodic experiences in-context.
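The idea, at a glance: arrange whole episodes into a curriculum (e.g., from weaker to stronger behavior) and concatenate them into a single context window, so the Transformer can pick up the improvement trend in-context. Below is a rough sketch of that data-assembly step, not the paper's actual code; "difficulty" stands in for whatever curriculum criterion orders the episodes.

    def build_cross_episodic_context(episodes, difficulty, max_tokens):
        """Concatenate whole episodes, ordered by a curriculum criterion,
        into one token sequence for a causal Transformer to consume."""
        ordered = sorted(episodes, key=difficulty)               # curriculum ordering
        context = []
        for episode in ordered:
            tokens = [tok for step in episode for tok in step]   # flatten (obs, action) steps
            if len(context) + len(tokens) > max_tokens:
                break
            context.extend(tokens)
        return context  # trained with the usual action-prediction loss at every step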

Idea and Thinking

Minecraft is a good place to do RL research:
  • Code can serve as the action space: code drives Minecraft directly and world state can be retrieved through an API, so there are no motion or control difficulties. This lets research focus on lifelong embodied agent learning rather than on 3D perception or sensorimotor control (see the sketch after this list).
  • It does not impose a predefined end goal or a fixed storyline, but instead provides an open-ended playground with endless possibilities.
  • Offline pre-training is feasible because of the huge amount of gameplay video on YouTube.
  • A curriculum is natural because tasks form a clear hierarchy (by resources required and episode length).
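A minimal sketch of the code-as-action loop described above. MinecraftAPI, propose_task, and write_program are hypothetical placeholders rather than any specific library; the point is that the agent reads symbolic state through an API and acts by executing generated code, which is what removes the perception and control burden.

    class MinecraftAPI:
        """Hypothetical scripting interface to the game (illustrative only)."""

        def get_state(self) -> dict:
            """Symbolic world state, e.g. {"biome": "desert", "inventory": {"wood": 3}}."""
            raise NotImplementedError

        def execute(self, program: str) -> dict:
            """Run a high-level program as the action; return feedback such as {"success": True}."""
            raise NotImplementedError

    def agent_step(env, propose_task, write_program, skills):
        """One loop iteration: read state, pick a task, act via code, store the mastered skill."""
        state = env.get_state()                       # no 3D perception or motor control needed
        task = propose_task(state, skills)            # e.g. "harvest sand" if the biome is desert
        program = write_program(task, state, skills)  # code as action (e.g. LLM-generated)
        feedback = env.execute(program)
        if feedback.get("success"):
            skills[task] = program                    # commit the mastered skill for reuse
        return feedback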
Questions in my head:
  • What kind of data, if scaled up, would enable robots to learn effectively?
  • What does GPT mean for the robot learning community?
  • How can we decide whether a self-proposed goal is appropriate for an agent exploring and learning on its own, without external commands? Does the answer change when the agent sets subgoals to complete a long-horizon commanded goal?

Good Writing and Ideas From Others

An effective lifelong learning agent should have similar capabilities as human players: (1) propose suitable tasks based on its current skill level and world state, e.g., learn to harvest sand and cactus before iron if it finds itself in a desert rather than a forest; (2) refine skills based on environmental feedback and commit mastered skills to memory for future reuse in similar situations (e.g. fighting zombies is similar to fighting spiders); (3) continually explore the world and seek out new tasks in a self-driven manner. -- VOYAGER

Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embodiment. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide real-world grounding by means of pretrained skills, which are used to constrain the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task. -- SayCan
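The grounding mechanism can be summarized as scoring every available skill by how relevant the language model thinks it is to the instruction and by how likely it is to succeed from the current state, then executing the best combination. A simplified sketch of that selection rule follows; llm_score and affordance_value are assumed callables standing in for the LLM's preference and the learned value function, not SayCan's actual API.

    def select_skill(instruction, state, skills, llm_score, affordance_value):
        """Pick the next skill by combining task grounding (language relevance)
        with world grounding (feasibility from the current state)."""
        return max(
            skills,
            key=lambda skill: llm_score(instruction, skill) * affordance_value(state, skill),
        )

Repeating this selection after each executed skill yields a step-by-step plan grounded in what the robot can actually do in its current situation.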

Current (CV) datasets and models represent only a limited definition of visual perception. First, today’s influential Internet datasets capture brief, isolated moments in time from a third-person “spectator” view. However, in both robotics and augmented reality, the input is a long, fluid video stream from the first-person or “ego-centric” point of view—where we see the world through the eyes of an agent actively engaged with its environment. Second, whereas Internet photos are intentionally captured by a human photographer, images from an always-on wearable egocentric camera lack this active curation. Finally, first-person perception requires a persistent 3D understanding of the camera wearer’s physical surroundings, and must interpret objects and actions in a human context—attentive to human-object interactions and high-level social behaviors. -- Ego4D

