Title: User-Centered Programmatic Data Labeling
Date: Tuesday, May 2, 2023
Time: 14:30 – 16:30 EST
Location: Teams Link
Renzhi Wu
Ph.D. Student in Computer Science
School of Computer Science
College of Computing
Georgia Institute of Technology
Committee:
Dr. Xu Chu (Advisor) – School of Computer Science, Georgia Institute of Technology
Dr. Joy Arulraj – School of Computer Science, Georgia Institute of Technology
Dr. Kexin Rong – School of Computer Science, Georgia Institute of Technology
Dr. Shamkant Navathe – School of Computer Science, Georgia Institute of Technology
Dr. Chao Zhang – School of Computational Science and Engineering, Georgia Institute of Technology
Abstract:
The lack of labeled training data is a major obstacle to the practical application of machine learning (ML) techniques. As a result, ML practitioners have increasingly turned to programmatic supervision methods, in which a larger volume of programmatically generated, but often noisier, labeled examples is used in lieu of hand-labeled examples. In this paradigm, supervision sources are expressed as labeling functions (LFs), and a label model aggregates the output of multiple LFs to produce training labels. However, existing methods provide little support for writing LFs, which can be difficult for ordinary users, especially on non-textual data. In addition, existing label models require hyperparameter tuning and dataset-specific training and can yield non-deterministic results, further complicating the process for non-expert users.
This thesis aims to improve the usability of programmatic data labeling through a three-part research approach. First, I examine a specific task (entity matching) as a case study and develop an integrated development environment (IDE) that helps users write, manage, and aggregate LFs. Building on this, I also explore ways to tailor programmatic data labeling to the specific task for better performance. Second, to obviate user involvement in the label model, I present a hyper label model that requires neither hyperparameters nor dataset-specific training and produces deterministic results with superior accuracy and efficiency. The proposed method also offers the first analytical optimal solution to the problem. Third, I extend the labeling function interface with a visual interface that allows users to create LFs for video data intuitively without any coding. Specifically, I propose a visual query language for retrieving video clips across datasets, enabling non-expert users to easily develop LFs using mouse drag-and-drop.
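
For readers new to the paradigm described in the abstract, the following is a minimal sketch of labeling functions and label aggregation. It is an illustration only: the spam/ham task, the three labeling functions, and all names are hypothetical, and the aggregation shown is a plain majority vote, a common baseline, not the hyper label model proposed in the thesis.

```python
from collections import Counter

# Hypothetical label values for an illustrative spam-detection task.
ABSTAIN = -1
HAM = 0
SPAM = 1


def lf_contains_link(text):
    """Vote SPAM if the text contains a URL-like token, else abstain."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text):
    """Vote SPAM if the text mentions a prize, else abstain."""
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text):
    """Vote HAM for very short messages, else abstain."""
    return HAM if len(text.split()) <= 4 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]


def majority_vote(votes):
    """Aggregate LF votes: most frequent non-abstain label, ABSTAIN on ties."""
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return ABSTAIN
    ranked = counted.most_common()
    best_count = ranked[0][1]
    winners = [label for label, count in ranked if count == best_count]
    return winners[0] if len(winners) == 1 else ABSTAIN


def label_examples(texts):
    """Apply every LF to every text and aggregate the votes per text."""
    return [majority_vote([lf(t) for lf in LABELING_FUNCTIONS]) for t in texts]
```

Each labeling function may abstain, so an example receives a training label only when at least one function votes and the votes do not tie. A learned label model replaces `majority_vote` with a step that weighs the functions by estimated accuracy.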