Title: Cleaning and Learning over Dirty Tabular Data
Date: Wednesday, May 3, 2023
Time: 14:00 – 16:00 EST
Location: Teams Link
Peng Li
Ph.D. Student in Computer Science
School of Computer Science
College of Computing
Georgia Institute of Technology
Committee:
Dr. Xu Chu (Advisor) – School of Computer Science, Georgia Institute of Technology
Dr. Joy Arulraj – School of Computer Science, Georgia Institute of Technology
Dr. Kexin Rong – School of Computer Science, Georgia Institute of Technology
Dr. Shamkant Navathe – School of Computer Science, Georgia Institute of Technology
Abstract:
The quality of a machine learning (ML) application is only as good as the quality of the data it is trained on. In practice, training data is seldom free of errors, owing to noisy inputs from manual data curation and faults in automatic data collection programs. For this reason, data cleaning is widely regarded as an essential step in an ML workflow and an effective way of improving model quality. However, data cleaning remains a time-consuming task that depends heavily on human experts. Moreover, most existing work treats data cleaning as a standalone task that is independent of its downstream applications. This separation of data cleaning from ML applications is problematic: cleaning performed in isolation does not necessarily improve ML performance, and it can incur unnecessary cleaning costs.
In this proposed thesis, I aim to design theories, algorithms, and systems for cleaning and learning over dirty tabular data that (1) maximize downstream ML model performance and (2) minimize human cleaning costs. To achieve this goal, my research roadmap consists of four stages. First, I propose to establish the feasibility of this research by building CleanML, a benchmark that empirically investigates the impact of data cleaning on downstream ML model performance. Second, I propose to design theories and algorithms for human-involved data cleaning, where the goal is to both maximize ML model performance and minimize human cleaning costs. Third, I propose to design algorithms and systems that automate data cleaning for ML over structural tabular data, where the goal is to automatically select the data cleaning algorithm that maximizes ML model performance without any human effort. Finally, I propose to design algorithms and systems that automatically transform non-structural tables into structural tables, which will extend the applicability of this work to all tabular data.