AHMED K. ELMAGARMID
PURDUE UNIVERSITY, WEST LAFAYETTE, IN
Senior member, IEEE
PANAGIOTIS G. IPEIROTIS
LEONARD N. STERN SCHOOL OF BUSINESS, NEW YORK, NY
Member, IEEE computer security
VASSILIOS S. VERYKIOS
UNIVERSITY OF THESSALY, VOLOS, GREECE
Member ,IEEE computer security.
INTRODUCTION TO THE PROBLEM
atabases play an important role in today’s IT
Many businesses and organizations depend on the
quality of data(or the lack thereof) stored in the
Any discrepancies in the data can have
significant cost implications to a system that
relies on information to function.
Data are not carefully controlled for quality nor
defined in a consistent way across different data
sources, thus data quality is compromised due to
many factors .//examples
instead of Microsoft
Integrity errors. Ex:
Multiple conventions for information.Ex: 44 W.4
street and 44 west fourth street.
While integrating data from different sources
into a warehouse , organizations become aware
potential systematic differences and these
problems and conflicts fall under a umbrella term
called as “DATA HETEROGENEITY”.
Two types of heterogeneity can be distinguished:
Structural heterogeneity and lexical
Data cleaning refers to the process of resolving
identification problems in the data.
Different record structure
City, State, and Zip code 
Identical record structure, but data is different
44 W. 4
44 West Fourth Street 
Step before the duplicate record detection.
Improves the quality of the data
Makes data more comparable and more usable.
Data preparation stage includes three steps.
STEPS IN DATA PREPARATION
Locates, identifies and isolates individual data
Makes it easier to correct, standardize and match
Comparison of individual components rather
than complex strings
For example, the appropriate parsing of the name
and address components into consistent packets
is a very important step.
Simple conversions of data type
Decoding field values
Range checking: involves examining data in a
field to ensure that it falls within the expected
range ,usually a numeric or date range
Dependency checking: is slightly more complex
kind of data transformation where we check the
values in a particular field to the values in
another field to ensure minimal level of
consistency in data
Represent certain fields in a standard format
US Postal Service Address Verification
Date and time formatting
Names (first, last, middle, prefix, suffix)
LAST STEP IN DATA PREPARATION
Store data in tables having comparable fields.
Identify fields suitable for comparison
Data may still contain inconsistencies due to
misspellings and different conventions to
FIELD MATCHING TECHNIQUES
Most common sources of mismatches in database
entries is due to typographical errors
The field matching metrics that have been
designed to overcome this problem are :
based similarity metrics
Token based similarity metrics
Phonetic similarity metrics
Numeric similarity metrics
CHARACTER BASED SIMILARITY
Works best on typographical errors
Shortest sequence of edit commands that can
transform a string
Three types of edit operations .
f (cost =1) this
version of edit distance is referred to as the
Insert, delete, replace operations.
Example. S1=“tin” s2= “tan”
We need to replace “I” to “A” to convert string s1 to
The edit distance here is 1. because we needed only
one operation to convert s1 to s2.
CHARACTER BASED SIMILARITY
Affine gap distance
Strings that have been truncated
John R. Smith versus Jonathan Richard Smith
Substring matching which ignores the prefix and
Compares first and last name
Divides string into a series of substrings of length
E.g.: NELSON and NELSEN are phonetically similar
but spelled differently. The q
grams for these words
are NE LS ON and NE LS EN .
TOKEN BASED SIMILARITY
Works best when word (tokens) are transposed
Weights words based on frequency to determine similarity
The words in the database have a weight associated with it,
which is calculated using a cosine similarity metric.
Example: in a database of company names the words
“AT&T” and “IBM” are less frequent than the word “
Similarity of John Smith and
Smith is close to 1.
But the similarity of
computer is zero since it doesn’t take care of misspelled
Grams with weighting
Extends WHIRL to handle spelling errors
Comparison based on how words sound
Considers only numbers
Convert numbers to text data
Simple range queries
Authors provided no insight in this area
SUMMARY OF METRICS
DUPLICATE RECORD DETECTION
The methods described till now have been
describing about similarity checking in single
The real life situations consist of multiple fields
which have to be checked for duplicate records.
Probabilistic approaches and supervised
machine learning techniques
Approaches that rely on domain knowledge
or Generic distance metrics
Models derived from Bayes theorem
Use prior knowledge to make decision about current
A tuple pair is assigned to one of the two classes M or
U. M class represents(match) same entity, and the U
match) different entity.
This can be determined by calculating the probability
based decision tree
Relies on the existence of trained data.
The trained data is in the form of record pairs.
These record pairs are labeled matching or not.
SVM approach out performs all the simpler
The post processing step is to create a graph for
all the records linking the matching records.
Records are considered identical using the
transitivity relation applied on the connected
DISTANCE BASED TECHNIQUES
This method can be used when there is absence of
training data or human effort to create matching
Treat a record as a one long field
Use a distance metric
Best matches are ranked using a weighting algorithm
Alternatively, use a single field
Must be highly discriminating
RULE BASED TECHNIQUES
Relies on business rules to derive key
Must determine functional dependencies
Requires subject matter expert to build matching
This figure depicts the equation theory that
dictates the logic of domain equivalence.
It specifies an inference about the similarity of
Classify data as matched or unmatched without a
The comparison vector generally depicts which
category it belongs to. If it does not then it has to
be done manually.
One way to avoid manual labeling is to use the
Group together similar comparison vectors.
Each cluster contains vectors with similar
By knowing the real class of only few vectors we
can infer the class of all the vectors.
Reduce the number of record comparisons
Improve the efficiency of record comparison
Elementary nested loop
Compare every record in one table to another table
Requires A*B comparisons (Cartesian product) which
is very expensive
Cost required for a single comparison
Must consider number of fields/record
Clustering and Canopies
Basic: Compute a hash value for each
Only compare records in the same bucket
Subdivide files into subsets (blocks)
, NYSIIS, or
Increases in speed may increase number of
Compromise is multiple runs using different
Create composite key, sort data, merge
Duplicate records will be close in sorted
Highly dependent upon the comparison key
: Duplicate records are kept in a
cluster and only the representative of a cluster is
kept for future comparisons.
This reduces the total number of record
comparisons without compromising the accuracy.
: The records are grouped into
overlapping clusters called as “canopies” and
then the records are compared which lead to
better qualitative results.
Freely Extensible Biomedical Record Linkage
SQL Server 2005 implements “
Oracle 11g implements these techniques in its
Lack of a standardized, large
Training data is needed to produce matching
Databases emphasize simple, fast, and efficient
Machine learning and statistics rely on sophisticated
techniques and probabilistic models
More synergy is needed among various communities
Detection systems need to be adaptive over time