Unsupervised Outlier Detection


Seminar on Machine Learning for Text Mining


UPC, 5/11/2004

Mihai Surdeanu

Definition: What is an outlier?


Hawkins outlier:


An outlier is an observation that deviates so
much from the other observations as to arouse
suspicion that it was generated by a different
mechanism.


Clustering outlier:


An object not located in any cluster of the dataset.
Usually called “noise”.


Applications (1)


“One person’s noise is another person’s
signal.”


Outlier detection is useful as a standalone
application:


Detection of credit card fraud.


Detection of attacks on Internet servers.


Find the best/worst basketball/hockey/football
players.

Applications (2)


Outlier detection can be used to remove
noise for clustering applications.


Some of the best clustering algorithms (EM,
K-Means) require an initial model
(informally, “seeds”) to work. If the initial
points are outliers, the final model is
junk.

Example: K-Means with a Bad Initial Model

Paper 1:

Algorithms for Mining Distance-Based Outliers in Large Datasets


Edwin M. Knorr, Raymond T. Ng


What is a Distance-Based Outlier?


An object O in a dataset T is a DB(p,D)-outlier
if at least a fraction p of the objects
in T lie at distance greater than D from O.


The distance metric is not fixed by the definition. It could be
Euclidean, 1 − cosine similarity, etc.
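The definition above translates directly into a naive quadratic-time check. This is an illustrative sketch, not the paper's algorithm; the function name and the choice of Euclidean distance are my own.

```python
import math

def db_outliers(points, p, D):
    """Naive O(N^2) DB(p, D)-outlier detection: an object O is an
    outlier if at least a fraction p of the other objects lie at
    distance greater than D from O. Euclidean distance is assumed."""
    n = len(points)
    outliers = []
    for i, o in enumerate(points):
        far = sum(1 for j, q in enumerate(points)
                  if j != i and math.dist(o, q) > D)
        if far >= p * (n - 1):
            outliers.append(i)
    return outliers
```

A tight cluster near the origin plus one distant point, with p = 0.9 and D = 1, flags only the distant point.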

Outliers in Statistics

Normal Distributions

Properties of DB Outliers

Similar lemmas exist for Poisson distributions and regression models.

Efficient Algorithms


Efficient algorithms for the detection of DB-outliers
exist with complexity O(k N^2):


Index-based: uses k-d trees or R-trees to index all
objects based on distance, allowing efficient search of
neighborhood objects.


Other algorithms presented are
exponential in the number of attributes k,
so they are not applicable to real text collections (k
> 10,000).

Conclusions


Advantages


Clean and simple to implement


Equivalent to other formal definitions
for well-behaved distributions


Disadvantages


Depends on too many parameters (D and p). What are good
values for real-world collections?


The decision is (almost) binary: a data point is or is not an
outlier. In real life, it is not so simple.


The approach was evaluated only on toy or synthetic data with few
attributes (< 50). Does it work on big real-world collections?

Paper 2:

LOF: Identifying Density-Based Local Outliers

Markus Breunig, Hans-Peter Kriegel, Raymond T. Ng, Jörg Sander


Motivation


DB-outliers can handle only “nice”
distributions. Many examples in real-world
data (e.g. a mix of distributions) cannot be
handled.


DB-outliers give a binary classification of
objects: is or is not an outlier.

Example of Local Outliers

Goal


Define a Local Outlier Factor (LOF) that
indicates the degree of outlier-ness of an
object using only the object’s
neighborhood.

Definitions (1)

Informally: k-distance = smallest radius that includes at least k objects

Definitions (2)

Definitions (3)

Example of reach-dist

Definitions (4)

Definitions (5)

Informally: LOF(p) is high when p’s density is low and the density of its neighbors is high.

Lemma 1


The LOF of objects “deep” inside a cluster
is bounded as follows: 1/(1 +

) <= LOF
<= (1 +

), with a small

.


Hence LOF for objects in a cluster is
practically 1!
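The definitions on the previous slides (k-distance, reach-dist, local reachability density, LOF) can be sketched in plain Python. This is an illustrative toy implementation, not the paper's code; real implementations use spatial indexes instead of a full distance matrix.

```python
import math

def lof(points, k):
    """Toy LOF computation (k = MinPts). Returns one score per point:
    roughly 1 inside clusters (Lemma 1), larger for outliers."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]

    kdist, nbrs = [0.0] * n, [[] for _ in range(n)]
    for i in range(n):
        # k-distance(i): distance to the k-th nearest neighbor;
        # N_k(i): all other objects within that radius (ties included)
        others = sorted(range(n), key=lambda j: d[i][j])[1:]  # drop i itself
        kdist[i] = d[i][others[k - 1]]
        nbrs[i] = [j for j in others if d[i][j] <= kdist[i]]

    # reach-dist_k(i, j) = max(k-distance(j), d(i, j))
    def reach(i, j):
        return max(kdist[j], d[i][j])

    # lrd_k(i): inverse of the mean reach-dist from i to its neighbors
    lrd = [len(nbrs[i]) / sum(reach(i, j) for j in nbrs[i]) for i in range(n)]

    # LOF_k(i): mean ratio of the neighbors' lrd to i's own lrd
    return [sum(lrd[j] for j in nbrs[i]) / (len(nbrs[i]) * lrd[i])
            for i in range(n)]
```

On a small square cluster plus one distant point, the cluster scores land near 1 while the distant point scores well above it, matching Lemma 1 and the outlier theorems qualitatively.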

Theorem 1

Applies to outlier objects that are in the vicinity of a
single cluster.

Illustration of Theorem 1

Theorem 2

Applies to outlier objects that are in the vicinity of
multiple clusters.

Illustration of Theorem 2

LOF >= (0.5 d1_min + 0.5 d2_min) / (0.5 i1_max + 0.5 i2_max)

How to choose the best MinPts?


LOF Values when MinPts Varies

MinPts > 10 to remove statistical fluctuations.

MinPts < minimum number of objects in a cluster (?) to avoid
including outliers in the cluster densities.

How to choose the best MinPts?


Solution


Compute the LOF values for a range of
MinPts values.


Pick the maximum LOF for each object
from this range.

Evaluation on Synthetic Data Set

Conclusions


Advantages


Handles real-world data better.


Formal proofs that LOF behaves well for outlier and
non-outlier objects.


Gives a degree of outlier-ness, not a binary decision.


Disadvantages


Evaluated only on toy (from our point of view) collections.


MinPts is a sensitive parameter.


Paper 3:

Unsupervised Outlier Detection and Semi-Supervised Learning

Adam Vinueza and Gregory Grudic


Cost Function for Supervised Training

Q(F) maintains local consistency
by constraining the classification of nearby
objects not to change too much (W_ij encodes the nearness of x_i
and x_j).



is optimized to
maximize distances between points in different classes.

Calinski?

Outlier Detection


Outliers = all objects classified with low
confidence
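The rule above reduces to thresholding the classifier's top class probability. A minimal sketch, assuming each object comes with a probability distribution over classes; the function name and the threshold value are illustrative, not from the paper.

```python
def low_confidence_outliers(probs, tau=0.6):
    """Flag objects whose highest class probability falls below tau.
    probs: one probability distribution over classes per object.
    tau is an illustrative confidence threshold."""
    return [i for i, p in enumerate(probs) if max(p) < tau]
```

An object split 50/50 between two classes is flagged; confidently classified objects are not.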

Global Conclusions


The approach based on density outliers (LOF)
seems to be the best for real-world data.


But it was not tested on real-world collections
(thousands of documents, tens of thousands of
attributes). Plus, some factors are ad hoc (e.g.
MinPts > 10).


If supervised information is available, we can do a
lot better (duh).