Title: User-Centered Programmatic Data Labeling

 

Date: Tuesday, May 2, 2023

Time: 14:30 – 16:30 EST

Location: Teams Link

 

 

Renzhi Wu

Ph.D. Student in Computer Science

School of Computer Science

College of Computing

Georgia Institute of Technology

 

Committee

Dr. Xu Chu (Advisor) – School of Computer Science, Georgia Institute of Technology

Dr. Joy Arulraj – School of Computer Science, Georgia Institute of Technology

Dr. Kexin Rong – School of Computer Science, Georgia Institute of Technology

Dr. Shamkant Navathe – School of Computer Science, Georgia Institute of Technology

Dr. Chao Zhang – School of Computational Science and Engineering, Georgia Institute of Technology

 

Abstract:

The lack of labeled training data is a major challenge impeding the practical application of machine learning (ML) techniques. ML practitioners have therefore increasingly turned to programmatic supervision methods, in which a larger volume of programmatically generated, but often noisier, labeled examples is used in lieu of hand-labeled examples. In this paradigm, supervision sources are expressed as labeling functions (LFs), and a label model aggregates the output of multiple LFs to produce training labels. However, existing methods provide little support for writing LFs, which can be difficult for ordinary users, especially on non-text data. In addition, existing label models require hyperparameter tuning and separate training on each dataset, and they can yield non-deterministic results, further complicating the process for non-expert users.
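
To make the paradigm concrete, the following is a minimal illustrative sketch of labeling functions and a label model. It is not the method proposed in this thesis: the spam/ham task, the three labeling functions, and the helper names are hypothetical, and the label model shown is a plain majority vote, a common baseline that is used here only as an assumption for illustration.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical label space for an illustrative spam-detection task.
ABSTAIN, HAM, SPAM = -1, 0, 1


# Each labeling function (LF) encodes one noisy heuristic and may abstain.
def lf_contains_link(text: str) -> int:
    return SPAM if "http://" in text or "https://" in text else ABSTAIN


def lf_mentions_prize(text: str) -> int:
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text: str) -> int:
    return HAM if len(text.split()) <= 4 else ABSTAIN


LFS: List[Callable[[str], int]] = [
    lf_contains_link,
    lf_mentions_prize,
    lf_short_message,
]


def apply_lfs(texts: List[str], lfs: List[Callable[[str], int]]) -> List[List[int]]:
    """Build the label matrix: one row per example, one column per LF."""
    return [[lf(text) for lf in lfs] for text in texts]


def majority_vote(votes: List[int]) -> int:
    """Baseline label model: most frequent non-abstain vote; abstain on ties."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    best_count = ranked[0][1]
    winners = [label for label, count in ranked if count == best_count]
    return winners[0] if len(winners) == 1 else ABSTAIN


texts = [
    "Claim your prize at http://example.com now",
    "see you soon",
    "The quarterly report is attached for your review today",
]
label_matrix = apply_lfs(texts, LFS)
training_labels = [majority_vote(row) for row in label_matrix]
```

In this sketch an example on which every LF abstains, or on which the votes tie, receives no training label. More sophisticated label models instead weigh LFs by estimated accuracy, which is where the hyperparameters and per-dataset training mentioned above typically enter.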

 

This thesis aims to improve the usability of programmatic data labeling through a three-part research approach. First, I use a specific task, entity matching, as a case study to develop an integrated development environment (IDE) that helps users write, manage, and aggregate LFs. Building on this, I also explore ways to tailor programmatic data labeling to the specific task for better performance. Second, to obviate user involvement in the label model, I present a hyper label model that requires neither hyperparameters nor dataset-specific training, while producing deterministic results with superior accuracy and efficiency. The proposed method also offers the first analytical optimal solution to the problem. Third, I extend the labeling function interface with a visual interface that allows users to create LFs for video data intuitively without any coding. Specifically, I propose a visual query language for retrieving video clips across datasets, enabling non-expert users to develop LFs easily by dragging and dropping with the mouse.