Title: Information Extraction on Scientific Literature under Limited Supervision
Date: Thursday, April 20, 2023
Time: 1:00 PM – 3:00 PM EDT
Location: Zoom Link
Fan Bai
Ph.D. Student in Computer Science
School of Interactive Computing
College of Computing
Georgia Institute of Technology
Committee:
Dr. Alan Ritter (Advisor) – School of Interactive Computing, Georgia Institute of Technology
Dr. Wei Xu – School of Interactive Computing, Georgia Institute of Technology
Dr. Zsolt Kira – School of Interactive Computing, Georgia Institute of Technology
Dr. Gabriel Stanovsky – School of Computer Science and Engineering, Hebrew University of Jerusalem
Dr. Hoifung Poon – Microsoft Health Futures
Abstract:
The exponential growth of scientific literature presents both challenges and opportunities for researchers across disciplines. Effectively extracting pertinent information from this vast corpus is crucial for advancing knowledge, fostering collaboration, and driving innovation. Manual extraction, however, is laborious and time-consuming, underscoring the need for automated solutions. Information extraction (IE), a subfield of natural language processing (NLP) concerned with automatically deriving structured information from unstructured text, addresses this challenge with modern deep learning techniques. Despite their success, these methods typically require substantial amounts of human-annotated data, which may not be readily available, particularly in specialized scientific domains. This motivates the development of adaptable and robust techniques that can operate under limited supervision.
In this thesis, I focus on information extraction from scientific literature under limited (human) supervision. My prior work has explored three key dimensions of this problem: 1) harnessing readily available sources, such as knowledge bases, to train information extraction systems without direct human supervision; 2) examining the trade-off between the labor cost of human annotation and the computational expense of domain-specific pre-training, with the aim of achieving the best performance within a fixed budget; and 3) leveraging the emergent capabilities of large pre-trained language models for few-shot information extraction, then distilling these capabilities into more compact student models for greater efficiency. In the proposed work, I will explore zero-shot information extraction from semi-structured tabular data in scientific literature. To this end, I will work with a range of instruction-tuned language models and develop effective and efficient prompting strategies for table extraction.