DUPLICATE RECORD DETECTION


AHMED K. ELMAGARMID
PURDUE UNIVERSITY, WEST LAFAYETTE, IN
Senior Member, IEEE

PANAGIOTIS G. IPEIROTIS
LEONARD N. STERN SCHOOL OF BUSINESS, NEW YORK, NY
Member, IEEE Computer Society

VASSILIOS S. VERYKIOS
UNIVERSITY OF THESSALY, VOLOS, GREECE
Member, IEEE Computer Society




PRESENTED BY



SHILPA MURTHY



INTRODUCTION TO THE PROBLEM


Databases play an important role in today's IT-based economy.

Many businesses and organizations depend on the quality of the data (or suffer from the lack thereof) stored in their databases.

Any discrepancy in the data can have significant cost implications for a system that relies on information to function.



DATA QUALITY


Data are not carefully controlled for quality, nor defined in a consistent way across different data sources, so data quality is compromised by many factors. Examples:

Data entry errors, e.g., "Microsft" instead of "Microsoft".

Integrity errors, e.g., EmployeeAge = 567.

Multiple conventions for the same information, e.g., "44 W. 4th Street" and "44 West Fourth Street".


DATA HETEROGENEITY


While integrating data from different sources into a warehouse, organizations become aware of potential systematic differences; these problems and conflicts fall under the umbrella term "data heterogeneity".

Two types of heterogeneity can be distinguished: structural heterogeneity and lexical heterogeneity.


DATA QUALITY


Data cleaning refers to the process of resolving
identification problems in the data.


Structural heterogeneity

Different record structure

Addr versus City, State, and Zip code [1]

Lexical heterogeneity

Identical record structure, but data is different

44 W. 4th St. versus 44 West Fourth Street [1]



TERMINOLOGY

Duplicate record detection
Record linkage
Record matching
Data deduplication
Merge/purge
Instance identification
Database hardening
Name matching
Coreference resolution
Identity uncertainty


DATA PREPARATION




The step that precedes duplicate record detection.

Improves the quality of the data.

Makes data more comparable and more usable.

The data preparation stage includes three steps.



STEPS IN DATA PREPARATION


Parsing


Data transformation


Standardization



PARSING


Locates, identifies and isolates individual data
elements


Makes it easier to correct, standardize and match
data


Comparison of individual components rather
than complex strings


For example, the appropriate parsing of the name
and address components into consistent packets
is a very important step.
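A minimal parsing sketch; the regular expression and the component layout are illustrative assumptions, not the paper's method:

    import re

    # Illustrative parser: isolate number, street name, and street type
    # from a free-form US address. Real parsers handle far more variation.
    ADDRESS_RE = re.compile(
        r"^\s*(\d+)\s+(.*?)\s+(St|Street|Ave|Avenue|Rd|Road)\.?\s*$",
        re.IGNORECASE)

    def parse_address(raw):
        """Return (number, street_name, street_type), or None if unparseable."""
        m = ADDRESS_RE.match(raw)
        return None if m is None else (m.group(1), m.group(2), m.group(3))

    print(parse_address("44 W. 4th St."))          # ('44', 'W. 4th', 'St')
    print(parse_address("44 West Fourth Street"))  # ('44', 'West Fourth', 'Street')

Once both addresses are split into components, the matcher can compare "4th" against "Fourth" directly rather than the whole complex strings.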



DATA TRANSFORMATION


Simple conversions of data type


Field renaming


Decoding field values


Range checking: examining data in a field to ensure that it falls within the expected range, usually a numeric or date range.

Dependency checking: a slightly more complex kind of data transformation that checks the values in one field against the values in another field, to ensure a minimal level of consistency in the data.
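A short sketch of both checks; the field names and the allowed ranges are assumptions for illustration:

    from datetime import date

    def range_check(value, low, high):
        """Range checking: flag values outside the expected range."""
        return low <= value <= high

    def dependency_check(record):
        """Dependency checking: compare one field against another. We assume
        (hypothetically) the record carries both age and birth date fields."""
        years = (date.today() - record["birth_date"]).days // 365
        return abs(years - record["age"]) <= 1

    employee = {"age": 567, "birth_date": date(1980, 5, 17)}
    print(range_check(employee["age"], 0, 120))  # False: catches EmployeeAge = 567
    print(dependency_check(employee))            # False: age contradicts birth date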



DATA STANDARDIZATION


Represent certain fields in a standard format


Addresses


US Postal Service Address Verification tool


Date and time formatting


Names (first, last, middle, prefix, suffix)


Titles
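A small standardization sketch for dates; the accepted input formats are assumed for illustration:

    from datetime import datetime

    # Normalize several date conventions to a single ISO 8601 form.
    FORMATS = ["%m/%d/%Y", "%d %B %Y", "%B %d, %Y", "%Y-%m-%d"]

    def standardize_date(raw):
        """Return the date as YYYY-MM-DD, or None if no known format matches."""
        for fmt in FORMATS:
            try:
                return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
            except ValueError:
                continue
        return None

    for raw in ["10/16/2013", "16 October 2013", "October 16, 2013"]:
        print(standardize_date(raw))  # "2013-10-16" in every case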



LAST STEP IN DATA PREPARATION


Store data in tables having comparable fields.


Identify fields suitable for comparison


Not foolproof


Data may still contain inconsistencies due to
misspellings and different conventions to
represent data



FIELD MATCHING TECHNIQUES

The most common source of mismatches in database entries is typographical errors.

The field matching metrics that have been designed to overcome this problem are:

Character-based similarity metrics

Token-based similarity metrics

Phonetic similarity metrics

Numeric similarity metrics


CHARACTER BASED SIMILARITY


Works best on typographical errors.

Edit distance

The shortest sequence of edit operations that transforms a string s into a string t.

Three types of edit operations: insert, delete, and replace. If each operation has cost 1, this version of edit distance is referred to as the "Levenshtein" distance.

Example: s1 = "tin", s2 = "tan". We need to replace "i" with "a" to convert s1 to s2, so the edit distance is 1: only one operation was needed.



CHARACTER BASED SIMILARITY


Affine gap distance

Handles strings that have been truncated or shortened.

John R. Smith versus Jonathan Richard Smith

Smith-Waterman distance

Substring matching that ignores mismatching prefixes and suffixes.

Example: "Prof. John R. Smith" and "John R. Smith, Prof."

Jaro distance

Primarily used to compare first and last names.

Q-grams

Divides a string into a series of overlapping substrings of length q.

E.g., NELSON and NELSEN are phonetically similar but spelled differently; their 2-grams are NE, EL, LS, SO, ON and NE, EL, LS, SE, EN.
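A small q-gram sketch; the padding characters and the Jaccard overlap are common conventions assumed here, not prescribed by the slide:

    def qgrams(s, q=2):
        """All overlapping length-q substrings, padded with '#' and '$'."""
        padded = "#" * (q - 1) + s + "$" * (q - 1)
        return {padded[i:i + q] for i in range(len(padded) - q + 1)}

    def qgram_similarity(s, t, q=2):
        """Jaccard overlap of the two q-gram sets, in [0, 1]."""
        a, b = qgrams(s, q), qgrams(t, q)
        return len(a & b) / len(a | b)

    print(sorted(qgrams("NELSON")))  # ['#N', 'EL', 'LS', 'N$', 'NE', 'ON', 'SO']
    print(qgram_similarity("NELSON", "NELSEN"))  # about 0.56: 5 of 9 bigrams shared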



TOKEN BASED SIMILARITY


Works best when words (tokens) are transposed.

Atomic strings

Fields match based on their atomic strings (delimiter-separated substrings); similarity is the number of matching atomic strings divided by their average number.

WHIRL

Weights words based on frequency to determine similarity.

Each word in the database has an associated weight, and similarity is calculated using the cosine similarity metric.

Example: in a database of company names, the words "AT&T" and "IBM" are less frequent than the word "inc."

The similarity of "John Smith" and "Mr. John Smith" is therefore close to 1.

But the similarity of "comptr department" and "deprtment of computer" is zero, since WHIRL does not handle misspelled words.

Q-grams with weighting

Extends WHIRL to handle spelling errors.
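A minimal TF-IDF cosine sketch in the spirit of WHIRL; the tiny corpus and all token weights are fabricated for illustration:

    import math
    from collections import Counter

    # Frequent tokens such as "mr." and "inc." get low inverse document
    # frequency, so they contribute little to the similarity.
    corpus = [["mr.", "john", "smith"], ["mr.", "bob", "jones"],
              ["mr.", "tim", "roth"], ["ms.", "ann", "lee"],
              ["at&t", "inc."], ["ibm", "inc."]]

    def idf(token):
        df = sum(token in doc for doc in corpus)
        return math.log(len(corpus) / df) if df else 0.0

    def tfidf(tokens):
        return {t: tf * idf(t) for t, tf in Counter(tokens).items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = math.sqrt(sum(w * w for w in u.values())) * \
               math.sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0

    a, b = tfidf(["john", "smith"]), tfidf(["mr.", "john", "smith"])
    print(round(cosine(a, b), 2))  # 0.96: the low-weight "mr." barely matters

A misspelled token such as "comptr" shares no vector entry with "computer", which is exactly the zero-similarity failure noted above; the q-gram weighting variant addresses it.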



PHONETIC SIMILARITY


Comparison based on how words sound
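Soundex, listed in the metric summary below, is the classic example. A compact sketch that omits some edge cases of the full algorithm (e.g., the special handling of H and W):

    # Soundex code: first letter plus up to three digits.
    CODES = {c: d for d, letters in
             {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
              "4": "L", "5": "MN", "6": "R"}.items() for c in letters}

    def soundex(word):
        word = word.upper()
        digits = [CODES.get(c, "") for c in word]  # vowels map to ""
        out, prev = word[0], digits[0]
        for d in digits[1:]:
            if d and d != prev:  # skip vowels and adjacent repeats
                out += d
            prev = d
        return (out + "000")[:4]

    print(soundex("Nelson"), soundex("Nelsen"))  # N425 N425: they sound alike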





NUMERIC SIMILARITY


Considers only numbers


Convert numbers to text data


Simple range queries


The authors provided no further insight into this area



SUMMARY OF METRICS

Comparison metrics:

Edit distance (Levenshtein)
Affine gap
Smith-Waterman
Jaro distance
Atomic strings
WHIRL
Q-grams
Soundex
NYSIIS
Oxford Name Compression
Metaphone
Double Metaphone
Numeric


DUPLICATE RECORD DETECTION


The methods described so far address similarity checking on single fields.

Real-life situations involve multiple fields that have to be checked for duplicate records.



CATEGORIZING METHODS


Probabilistic approaches and supervised
machine learning techniques


Approaches that rely on domain knowledge or generic distance metrics

PROBABILISTIC MATCHING MODELS


Models derived from Bayes' theorem.

Use prior knowledge to make decisions about the current data set.

A tuple pair is assigned to one of two classes, M or U: the M class represents a match (same entity), and the U class represents a non-match (different entities).

The assignment can be determined by calculating the probability distribution.

Rule-based decision tree: if-then-else traversal.
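A minimal sketch of this decision rule in the Fellegi-Sunter style, assuming conditionally independent field comparisons; the per-field probabilities and the zero threshold are illustrative assumptions:

    import math

    # Probability that a field agrees given a true match (m) or a true
    # non-match (u). These numbers are made up for illustration.
    FIELDS = {"last_name": (0.95, 0.01), "zip": (0.90, 0.05), "city": (0.85, 0.10)}

    def match_weight(agreements):
        """Sum of per-field log likelihood ratios for a comparison vector."""
        w = 0.0
        for field, agrees in agreements.items():
            m, u = FIELDS[field]
            w += math.log(m / u) if agrees else math.log((1 - m) / (1 - u))
        return w

    pair = {"last_name": True, "zip": True, "city": False}
    print("M" if match_weight(pair) > 0 else "U")  # M: agreement outweighs the miss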



SUPERVISED LEARNING

Relies on the existence of training data.

The training data is in the form of record pairs.

These record pairs are labeled as matching or not.

The SVM approach outperforms all the simpler approaches.

A post-processing step creates a graph linking all the matching records.

Records are considered identical by applying the transitivity relation to the connected components.
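A sketch of that post-processing step, using union-find to compute connected components; the matched pairs below are illustrative:

    from collections import defaultdict

    def connected_components(n_records, matched_pairs):
        parent = list(range(n_records))

        def find(x):  # union-find with path compression
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for a, b in matched_pairs:
            parent[find(a)] = find(b)  # link the two components

        groups = defaultdict(list)
        for r in range(n_records):
            groups[find(r)].append(r)
        return list(groups.values())

    print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))
    # [[0, 1, 2], [3, 4]]: by transitivity, records 0-2 are one entity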


ACTIVE LEARNING


DISTANCE BASED TECHNIQUES

This method can be used when there is no training data, and no human effort is available to create matching models.

Treat a record as one long field:

Use a distance metric

Best matches are ranked using a weighting algorithm

Alternatively, use a single field:

Must be highly discriminating
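A minimal sketch of the record-as-one-long-field idea, using Python's difflib as a stand-in for a generic string metric; the field layout is an assumption:

    from difflib import SequenceMatcher

    def record_similarity(rec_a, rec_b):
        """Concatenate all fields and compare the two long strings."""
        return SequenceMatcher(None, " ".join(rec_a), " ".join(rec_b)).ratio()

    r1 = ("John Smith", "44 W. 4th St.")
    r2 = ("John Smith", "44 West Fourth Street")
    print(round(record_similarity(r1, r2), 2))  # well above 0: long shared prefix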



RULE BASED TECHNIQUES


Relies on business rules to derive keys

Must determine functional dependencies

Requires a subject matter expert to build matching rules














RULE BASED TECHNIQUES


The slide's figure (omitted here) depicts an equational theory that dictates the logic of domain equivalence.


It specifies an inference about the similarity of
the records.



UNSUPERVISED LEARNING


Classify data as matched or unmatched without a training set.

The comparison vector generally indicates which category a pair belongs to; when it does not, the labeling has to be done manually.

One way to avoid manual labeling is to use clustering algorithms:

Group together similar comparison vectors.

Each cluster contains vectors with similar characteristics.

By knowing the real class of only a few vectors, we can infer the class of all the vectors in the cluster.
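A toy sketch of the idea; the comparison vectors, the crude clustering rule, and the cluster labels are all illustrative:

    from collections import defaultdict

    # Binary comparison vectors: 1 means the corresponding fields agreed.
    vectors = [(1, 1, 1), (1, 1, 0), (0, 0, 0), (0, 1, 0)]

    def cluster_key(v, threshold=2):
        # crude clustering: vectors agreeing on >= threshold fields group together
        return sum(v) >= threshold

    clusters = defaultdict(list)
    for v in vectors:
        clusters[cluster_key(v)].append(v)

    # knowing the real class of one vector per cluster labels all its members
    labels = {True: "M", False: "U"}
    for key, members in clusters.items():
        print(labels[key], members)  # M for the mostly-agreeing cluster, U otherwise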




TECHNIQUES TO IMPROVE EFFICIENCY


Reduce the number of record comparisons


Improve the efficiency of record comparison



COMPARATIVE METRICS


Elementary nested loop

Compares every record in one table to every record in another table

Requires A*B comparisons (a Cartesian product), which is very expensive

Cost required for a single comparison

Must consider the number of fields per record



REDUCE RECORD COMPARISONS


Blocking


Sorted Neighborhood


Clustering and Canopies


Set Joins



BLOCKING


Basic: compute a hash value for each record

Only compare records in the same bucket

Subdivide files into subsets (blocks), e.g., keyed on Soundex, NYSIIS, or Metaphone codes

Drawback

Increases in speed may increase the number of false mismatches

A compromise is multiple runs using different blocking fields
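A small blocking sketch; the three-letter prefix key is an illustrative choice, and a Soundex or NYSIIS code would slot in the same way:

    from collections import defaultdict
    from itertools import combinations

    records = [("smith", "john"), ("smith", "jon"),
               ("jones", "ann"), ("smyth", "john")]

    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[0][:3]].append(rec)  # blocking key: prefix of last name

    candidate_pairs = [pair for bucket in blocks.values()
                       for pair in combinations(bucket, 2)]
    print(candidate_pairs)
    # only ('smith','john') vs ('smith','jon') is compared; "smyth" lands in
    # another block, i.e., the false mismatch the drawback above warns about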



SORTED NEIGHBORHOOD


Create a composite key, sort the data, merge

Assumption

Duplicate records will be close to each other in the sorted order

Highly dependent upon the comparison key
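A minimal sorted-neighborhood sketch; the composite key and the window size are assumptions:

    records = [("smith", "john", "10001"), ("smyth", "john", "10001"),
               ("jones", "ann", "60601"), ("smith", "jon", "10001")]

    WINDOW = 2  # each record is compared with the next WINDOW - 1 records

    def key(rec):
        last, first, zipcode = rec
        return zipcode + last[:3] + first[:1]  # composite key

    srt = sorted(records, key=key)
    pairs = [(srt[i], srt[j]) for i in range(len(srt))
             for j in range(i + 1, min(i + WINDOW, len(srt)))]
    for p in pairs:
        print(p)  # the two "smith, jo..." variants sort adjacently, so they meet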



CLUSTERING AND CANOPIES


Clustering: duplicate records are kept in a cluster, and only the representative of each cluster is kept for future comparisons.

This reduces the total number of record comparisons without compromising accuracy.

Canopies: the records are grouped into overlapping clusters called "canopies", and then the records within each canopy are compared, which leads to better qualitative results.
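A compact canopy sketch; the token-overlap distance and both thresholds are illustrative assumptions:

    def cheap_distance(a, b):
        """Cheap metric: 1 minus the Jaccard overlap of the word sets."""
        ta, tb = set(a.split()), set(b.split())
        return 1.0 - len(ta & tb) / len(ta | tb)

    def canopies(items, t_loose=0.8, t_tight=0.3):
        remaining, result = list(items), []
        while remaining:
            center = remaining.pop(0)
            canopy = [center]
            for other in remaining[:]:
                d = cheap_distance(center, other)
                if d < t_loose:
                    canopy.append(other)     # canopies may overlap
                if d < t_tight:
                    remaining.remove(other)  # close enough: never a new center
            result.append(canopy)
        return result

    print(canopies(["john smith", "john r smith", "ann jones"]))
    # [['john smith', 'john r smith'], ['john r smith'], ['ann jones']]

Note that "john r smith" appears in two canopies: the overlap is what lets the expensive comparison still see borderline pairs.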


SOFTWARE TOOLS


Open architecture

Freely Extensible Biomedical Record Linkage (FEBRL) - Python

TAILOR - MLC++, DBGen

WHIRL - C++

Flamingo Project - C

BigMatch - C



DATABASE TOOLS


Commercial RDBMSs

SQL Server 2005 implements "fuzzy matches"

Oracle 11g implements these techniques in its utl_match package:

Levenshtein distance

Soundex

Jaro-Winkler


CONCLUSIONS


Lack of a standardized, large-scale benchmarking data set

Training data is needed to produce matching models

Research divergence:

Databases emphasize simple, fast, and efficient techniques

Machine learning and statistics rely on sophisticated techniques and probabilistic models

More synergy is needed among the various communities

Detection systems need to be adaptive over time