Title: Cleaning and Learning over Dirty Tabular Data

 

Date: Wednesday, May 3, 2023

Time: 14:00 – 16:00 EST

Location: Teams Link

 

Peng Li

Ph.D. Student in Computer Science

School of Computer Science

College of Computing

Georgia Institute of Technology

 

Committee:

Dr. Xu Chu (Advisor) – School of Computer Science, Georgia Institute of Technology

Dr. Joy Arulraj – School of Computer Science, Georgia Institute of Technology

Dr. Kexin Rong – School of Computer Science, Georgia Institute of Technology

Dr. Shamkant Navathe – School of Computer Science, Georgia Institute of Technology

 

Abstract:

Machine learning (ML) applications are only as good as the data they are trained on. In practice, training data is seldom free of errors, owing to noisy inputs from manual data curation or errors from automatic data collection programs. For this reason, data cleaning is widely regarded as an essential step in an ML workflow and an effective way of improving model quality. However, data cleaning remains a time-consuming task that depends heavily on human experts. Moreover, most existing work treats data cleaning as a standalone task that is independent of its downstream applications. This separation is problematic: cleaning performed in isolation does not necessarily improve ML performance, and it can incur unnecessary cleaning costs.

 

In this proposed thesis, I aim to design theories, algorithms, and systems for cleaning and learning over dirty tabular data that (1) maximize downstream ML model performance and (2) minimize human cleaning costs. To achieve this goal, my research roadmap consists of four stages. First, I propose to establish the feasibility of this research by building the CleanML benchmark, which empirically investigates the impact of data cleaning on downstream ML model performance. Second, I propose to design theories and algorithms for human-involved data cleaning, where the goal is both to maximize ML model performance and to minimize human cleaning costs. Third, I propose to design algorithms and systems that automate data cleaning for ML on structural tabular data, where the goal is to automatically select the data cleaning algorithm that maximizes ML model performance without any human effort. Finally, I propose to design algorithms and systems that automatically transform non-structural tables into structural tables, which will extend the applicability of this work to all tabular data.