Unsupervised Outlier Detection
Machine Learning for Text Mining
Definition: What is an outlier?
An outlier is an observation that deviates so
much from the other observations as to arouse
suspicion that it was generated by a different
Object not located in the clusters of a dataset.
Usually called “noise”.
“One person’s noise is another person’s
Outlier detection is useful as a standalone
Detection of credit card fraud.
Detection of attacks on Internet servers.
Find the best/worst basketball/hockey/football
Outlier detection can be used to remove
noise for clustering applications.
Some of the best clustering algorithms (EM,
Means) require an initial model
(informally “seeds”) to work. If the initial
points are outliers
the final model is
Means with Bad Initial Model
Algorithms for Mining Distance
Based Outliers in Large Datasets
Edwin M. Knorr, Raymond T. Ng
What is a Distance
An object O in a dataset T is a DB(p,D)
outlier if at least a fraction p of the objects
in T lies greater than distance D from O.
The distance is not defined here. Could be
Outliers in Statistics
Properties of DB Outliers
Similar lemmas exist for Poisson distributions and regression models.
Efficient algorithms for the detection of DB
outliers exist with complexities: O(k N
based: uses k
d or R trees to index all
objects based on distance
efficient search of
Other algorithms presented that are
exponential in the number of attributes k
not applicable for real text collection (k
Clean and simple to implement
Equivalent with other formal definitions
Depends on too many parameters (D and p). What are good
values for real
The decisions is (almost) binary: a data point is or is not an
outlier. In real life, it is not so simple
Approach was evaluated only on toy or synthetic data with few
attributes (< 50). Does it work on big real
LOF: Identifying Density
Based Local Outliers
Markus Breunig, Hans
Peter Kriegel, Raymond T. Ng, Jörg Sander
outliers can handle only “nice”
distributions. Many examples in real
data (e.g. mix of distributions) can not be
outliers give a binary classification of
objects: is or is not an outlier
Example of Local Outliers
Define a Local Outlier Factor (LOF) that
indicates the degree of outlier
ness of an
object using only the object’s
distance = smallest radius that includes at least k objects
Example of reach
Informally: LOF(p) is high when p’s density is low and the density of it’s neighbors is high.
The LOF of objects “deep” inside a cluster
is bounded as follows: 1/(1 +
) <= LOF
<= (1 +
), with a small
Hence LOF for objects in a cluster is
Applies to outlier objects that are in the vicinity of a
Illustration of Theorem 1
Applies to outlier objects that are in the vicinity of
Illustration of Theorem 2
LOF >= (0.5 d1
+ 0.5 d2
) / (0.5 i1
+ 0.5 i2
How to choose the best MinPts?
LOF Values when MinPts Varies
MinPts > 10 to remove statistical fluctuations.
MinPts < minimum number of objects in a cluster (?) to avoid
including outliers in the cluster densities.
How to choose the best MinPts?
Compute the LOF values for a range of
Pick the maximum LOF for each object
from this range.
Evaluation on Synthetic Data Set
Addresses better real
Formal proofs that LOF behaves well for outlier and
Gives a degree of outlier
ness not a binary decision.
Evaluated on toy (from our pov) collections.
MinPts is a sensitive parameter.
Unsupervised Outlier Detection and Semi
Adam Vinueza and Gregory Grudic
Cost Function for Supervised Training
by constraining the classification of nearby
objects to to not change too much (W
encodes nearness of x
s optimized to
maximize distances between points in different classes
Outlier = all objects classified with a low
The approach based on density outliers (LOF)
seems to be the best for real
But it was not tested on real
(thousands of documents, tens of thousands of
attributes). Plus, some factors are ad hoc (e.g.
MinPts > 10).
If supervised information is available, we can do a
lot better (duh).