Title: Robust and Efficient Vision-Language Learning for Equity, Safety, and Well-Being
Date: Thursday, May 02, 2024
Time: 11:15 AM to 1:00 PM Eastern Time (US)
Location: Coda C1315
Virtual Meeting: Zoom
Gaurav Verma
https://gaurav22verma.github.io/
CS Ph.D. Student
School of Computational Science and Engineering
College of Computing
Georgia Institute of Technology
Committee:
Dr. Srijan Kumar - Advisor, Georgia Tech, Computational Science & Engineering
Dr. Munmun De Choudhury - Georgia Tech, School of Interactive Computing
Dr. Duen Horng (Polo) Chau - Georgia Tech, Computational Science & Engineering
Dr. Ani Nenkova - Adobe Research
Abstract:
The long-term goal of developing Artificial Intelligence (AI) systems is to enable broadly useful human-AI interactions for individuals, groups, and societies. The future of such AI systems is inherently multimodal, and the current shift in the landscape of AI research and development is a great illustration: powerful systems that reason over and generate vision, language, audio, and other forms of unstructured data are emerging rapidly. The robustness and efficiency of multimodal AI are imperative for enabling its widespread adoption. Furthermore, since AI tools are socio-technical systems, it is also critical that societal dimensions like equity, safety, and well-being are prioritized in their applications. To this end, the objective of this thesis proposal is to evaluate and efficiently strengthen the robustness of vision-language learning to steer AI development toward enhancing language equity, online safety, and individual and public well-being.
How far are we from achieving 'three nines' reliability in multimodal AI systems? How do we get there? It is important that the underlying vision-language models that power AI applications are robust to both unintentional and intentional variations in input data. This requires systematic and efficient approaches to evaluate their robustness and overcome observed vulnerabilities. This thesis proposal presents work that develops systematic methods to evaluate the robustness of vision-language learning models to plausible changes in the input data, specifically cross-modal dilutions and insertions. Furthermore, we propose modeling text visualness as an efficient approach to performing text-to-image mapping in long-form content generation tasks.
Artificial Intelligence for Equity, Safety, and Well-Being. This thesis proposal aims to deliver three-pronged advances along important societal dimensions: (i) highlighting the language inequities that could be propagated by the use of language-only models and proposing vision-language learning as an approach to enable more equitable outcomes across English and non-English languages, (ii) developing AI approaches to enhance the safety of online spaces in a community-centric manner, specifically by characterizing and detecting malicious speech and actors, and (iii) using language models to discover insights that can inform policies for improving individual and public well-being.