Title: Towards multi-modal AI systems with 'open-world' cognition
Date: Thursday, April 27th, 2023
Time: 11:00 AM – 1:00 PM ET
Location: Zoom Meeting (https://gatech.zoom.us/j/96011436910)
Harsh Agrawal
Ph.D. Student in Computer Science,
School of Interactive Computing,
College of Computing,
Georgia Institute of Technology
Committee
Dr. Dhruv Batra (Advisor, School of Interactive Computing, Georgia Institute of Technology)
Dr. Devi Parikh (School of Interactive Computing, Georgia Institute of Technology)
Dr. James Hays (School of Interactive Computing, Georgia Institute of Technology)
Dr. Alexander Schwing (Department of Electrical and Computer Engineering, UIUC)
Dr. Peter Anderson (Google)
Abstract
A long-term goal in AI research is to build intelligent systems with 'open-world' cognition. When deployed in the wild, AI systems should generalize to novel concepts and instructions. Such an agent would need to perceive both familiar and unfamiliar concepts present in the environment, combine the capabilities of models trained on different modalities, and incrementally acquire new skills to continuously adapt to the evolving world. In this thesis, we look at how combining complementary multi-modal knowledge with suitable forms of reasoning enables novel concept learning. In Part 1, we show that agents can infer unfamiliar concepts in the presence of other familiar concepts by combining multi-modal knowledge with deductive reasoning. Furthermore, agents can use newly inferred concepts to update their vocabulary of known concepts and infer additional novel concepts incrementally. In Part 2, we show how task-dependent augmentations can improve robustness in unseen environments.
In Part 3, we develop realistic tasks that require understanding novel concepts. First, we present a benchmark to evaluate an AI system's ability to describe novel objects present in an image. Second, we show how embodied agents can combine perception with common-sense knowledge to perform household chores like tidying up the house, without any explicit human instruction, even in the presence of unseen objects in unseen environments. Finally, we show that multi-modal knowledge stored in large pre-trained models can be used to teach agents new skills, allowing them to perform increasingly difficult novel tasks in a zero-shot manner.