Data Cleaning for ML
§1. Overview
A recent survey on the state of data science and machine learning reveals that dirty data is the most common barrier faced by people who work with data. With the growing popularity of data science, it has become increasingly evident that data curation, unification, preparation, and cleaning are key enablers in unleashing the value of data, as noted by the New York Times. Not surprisingly, developing effective and efficient data management solutions that address these challenges is an extremely timely and important topic, rife with deep theoretical and engineering problems.
§2. Data Cleaning
The term data cleaning refers to the broad set of tasks and activities for detecting and repairing errors in data. Common data cleaning activities include rule-based data cleaning, data transformation and wrangling, outlier detection, and missing value imputation. For a comprehensive survey of this topic, please refer to our book listed below.
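Two of the activities above, outlier detection and missing value imputation, can be illustrated in a few lines. The following is a minimal sketch on toy data (the column `ages` and all thresholds are illustrative, not from any of our systems), using the classic IQR rule for outliers and mean imputation for missing values:

```python
import statistics

# Toy numeric column with missing values (None) and one gross outlier.
ages = [23, 25, None, 27, 24, 26, None, 150]
observed = [v for v in ages if v is not None]

# Outlier detection with the IQR rule: flag values outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
outliers = [v for v in observed if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

# Missing value imputation: fill each None with the mean of the inliers.
inliers = [v for v in observed if v not in outliers]
mean_age = statistics.mean(inliers)
cleaned = [v if v is not None else mean_age for v in ages]
```

Note that imputing from inliers rather than all observed values matters here: the outlier 150 would otherwise drag the imputed mean far above every plausible age.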
§3. Data Cleaning for ML
Innocuous-looking training data is often plagued with various types of errors (e.g., missing values, outliers). While it is widely recognized that dirty training data hurts model performance, there has been no systematic study of this impact, let alone principled approaches to best handle different error types. We built CleanML, the first public benchmark that systematically investigates the impact of data cleaning on downstream ML models. CleanML includes 13 real-world datasets exhibiting 5 error types, 7 ML algorithms, and dozens of cleaning algorithms. We trained tens of thousands of models to control for randomness in ML experiments, which led to many interesting findings.
Based on observations from CleanML, we are developing a new theoretical framework to characterize when cleaning data errors can help ML. The framework, termed certain prediction (CP), draws inspiration from the database notion of certain query answering. It states that a test example can be certainly predicted (CP'ed) if all possible classifiers trained over all possible worlds induced by the data errors yield the same prediction. We study two fundamental CP queries: (Q1) a checking query that determines whether a test example can be CP'ed; and (Q2) a counting query that computes the number of classifiers supporting a particular prediction. Since general solutions to CP queries are, not surprisingly, intractable without assumptions about the classifier, we present a case study for nearest neighbor classifiers, for which we develop efficient solutions to both CP queries. We also propose CPClean, a new cleaning algorithm based on the CP framework that significantly outperforms existing cleaning techniques in classification accuracy while requiring only mild manual cleaning effort.
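The checking query (Q1) can be made concrete for 1-NN by brute force on a toy example. In the sketch below (the training set, candidate repairs, and helper names are all hypothetical illustrations, not our paper's algorithms, which avoid enumerating worlds), a missing feature value induces one possible world per candidate repair, and a test point is CP'ed only when 1-NN agrees across all worlds; the returned list of per-world predictions also gives the counts behind the counting query (Q2):

```python
train = [
    (1.0, "A"),
    (4.0, "B"),
    (None, "A"),  # dirty cell: the true feature value is unknown
]
candidate_repairs = [2.0, 5.0]  # plausible values for the missing cell

def predict_1nn(data, x):
    # Return the label of the nearest training point to x.
    return min(data, key=lambda p: abs(p[0] - x))[1]

def certainly_predicted(x):
    # One possible world per candidate repair of the missing cell.
    predictions = []
    for repair in candidate_repairs:
        world = [(v if v is not None else repair, y) for v, y in train]
        predictions.append(predict_1nn(world, x))
    # CP'ed iff every possible world yields the same prediction.
    return len(set(predictions)) == 1, predictions

print(certainly_predicted(0.0))  # nearest is (1.0, "A") in every world
print(certainly_predicted(4.6))  # prediction depends on the repair
```

With multiple dirty cells the number of worlds grows exponentially, which is precisely why the efficient NN-specific solutions matter.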
§4. Publications
- Nearest Neighbor Classifiers over Incomplete Information: From Certain Answers to Certain Predictions
  Bojan Karlaš*, Peng Li*, Renzhi Wu, Nezihe Merve Gürel, Xu Chu, Wentao Wu, Ce Zhang (* denotes equal contribution)
  VLDB 2021
- CleanML: A Benchmark for Evaluating the Impact of Data Cleaning on ML Classification Tasks
  Peng Li, Xi Rao, Jennifer Blase, Yue Zhang, Xu Chu, Ce Zhang
  ICDE 2021
- Data Cleaning (Book)
  Ihab F. Ilyas, Xu Chu
  ACM Book Series 2019 [Amazon Link]
- Trends in Cleaning Relational Data: Consistency and Deduplication (Book)
  Ihab F. Ilyas, Xu Chu
  Foundations and Trends® in Databases, Volume 5, Issue 4, 2015 [PDF]