ML for Data Integration
Data integration refers to the process of unifying data from multiple data sources. Entity resolution (ER, also known as duplicate detection, data matching, etc.) refers to the problem of identifying tuples in one or more relations that refer to the same real-world entity, and is one of the most important tasks in data integration. For example, an e-commerce website would want to identify duplicate products (such as the same product offered by different suppliers) so that they can all be listed on the same product page. ER has been extensively studied in many research communities, including databases, statistics, NLP, and data mining.
§2. Challenges in ER
- Performance. Precision (the percentage of predicted matches that are true matches) and recall (the percentage of true matches that are predicted as matches) are often used to measure the performance of an ER algorithm. Achieving high precision and recall is often difficult because true matches can look dissimilar while false matches can look similar. Recent techniques often use ML, or even deep learning, to train a supervised binary classifier. However, they often require human supervision in the form of feature engineering (which similarity functions to use) and labeled examples (labeled matches and non-matches).
- Scalability. Consider a dataset of 1 million restaurant records, where each record stores the name and the city of a restaurant, and assume there are 1,000 unique cities with 1,000 restaurants each. Comparing every pair of records would require about 10^6 × 10^6 / 2 = 5 × 10^11 tuple pair comparisons. Assuming each comparison takes 1 μs, that would take about 5.8 days. Blocking is a standard technique to tackle this. For example, since domain knowledge suggests that restaurants from different cities are unlikely to be matches, the records can be partitioned into 1,000 blocks, where each block contains the 1,000 restaurants of one city. The number of comparisons required after blocking is about 1,000 × (10^3 × 10^3 / 2) = 5 × 10^8, which would take only about 8.3 minutes. However, devising appropriate blocking rules is also a pain point in practice that requires expensive human supervision.
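The two challenges above can be made concrete with a minimal Python sketch. It is only an illustration, not the Panda implementation: the function names, the toy restaurant records, and the choice to count unordered record pairs are all assumptions made here. It computes precision and recall over sets of predicted and true match pairs, blocks records by city, and estimates comparison costs at 1 μs per comparison for the dataset sizes used in the example.

```python
from collections import defaultdict
from itertools import combinations


def precision_recall(predicted, truth):
    """Precision and recall of predicted match pairs against true match pairs."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def unordered_pairs(n):
    """Number of distinct record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def block_by_key(records, key):
    """Group records into blocks that share the same value of `key`."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record)
    return blocks


def candidate_pairs(blocks):
    """Yield only the pairs whose two records fall in the same block."""
    for block in blocks.values():
        yield from combinations(block, 2)


# Toy records invented for illustration (hypothetical data).
records = [
    {"id": 1, "name": "Joe's Diner", "city": "Atlanta"},
    {"id": 2, "name": "Joes Diner", "city": "Atlanta"},
    {"id": 3, "name": "Joe's Diner", "city": "Boston"},
    {"id": 4, "name": "Blue Door Cafe", "city": "Boston"},
]
blocks = block_by_key(records, "city")
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]
print("candidate pairs after blocking:", pairs)

# Treat every candidate pair as a predicted match; assume only (1, 2) is a true match.
print("precision, recall:", precision_recall(pairs, [(1, 2)]))

# Cost estimate for 1,000 cities x 1,000 restaurants at 1 microsecond per comparison.
SECONDS_PER_COMPARISON = 1e-6
all_pairs = unordered_pairs(1_000_000)
blocked_pairs = 1_000 * unordered_pairs(1_000)
print(f"no blocking: {all_pairs:,} pairs, "
      f"{all_pairs * SECONDS_PER_COMPARISON / 86_400:.2f} days")
print(f"blocking:    {blocked_pairs:,} pairs, "
      f"{blocked_pairs * SECONDS_PER_COMPARISON / 60:.2f} minutes")
```

Note that a blocking key trades recall for speed: in the toy data, records 1 and 3 share a name but are never compared because they fall in different city blocks, which is exactly the kind of decision that makes blocking rules hard to design by hand.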
§3. Panda: A Weakly Supervised Entity Matching System
To tackle these two challenges, we are developing theories, systems, and algorithms to (1) progressively reduce the amount of supervision needed for performing ER; and (2) make it dramatically easier to provide various supervision signals. See the publications below for a list of our recent work in that direction.
We are currently building Panda, a weakly supervised and scalable ER system, which will provide a unifying interface (along with theories and algorithms) for human supervision and will feature an automated blocking component (using learning-to-hash). Please see the video below for a demonstration of Panda.
- Auto-FuzzyJoin: Auto-Program Fuzzy Similarity Joins Without Labeled Examples
Peng Li, Xiang Cheng, Xu Chu, Yeye He, Surajit Chaudhuri
SIGMOD 2021 [PDF]
- ZeroER: Entity Resolution using Zero Labeled Examples
Renzhi Wu, Sanya Chaba, Saurabh Sawlani, Xu Chu, Saravanan Thirumuruganathan
SIGMOD 2020 [PDF]
- Data Cleaning (Book)
Ihab F. Ilyas, Xu Chu
ACM Book Series 2019 [Amazon Link]
- Distributed Data Deduplication
Xu Chu, Ihab F. Ilyas, Paraschos Koutris
VLDB 2016 [PDF]
- Trends in Cleaning Relational Data: Consistency and Deduplication (Book)
Ihab F. Ilyas, Xu Chu
In Foundations and Trends® in Databases, Volume 5, Issue 4, 2015 [PDF]