Programming in Data

levelsordΔιαχείριση Δεδομένων

20 Νοε 2013 (πριν από 3 χρόνια και 9 μήνες)

76 εμφανίσεις

Mathematical
Programming in Data
Mining

Author: O. L. Mangasarian

Advisor: Dr. Hsu

Graduate: Yan
-
Cheng Lin

Abstract


Describe mathematical programming
to feature selection, clustering and
robust representation


Outline


Motivation


Objective


Problems


Feature Selection


Clustering


Robust Representation


Conclusion

Motivation


Mathematical programming has been
applied to a great variety of
theoretical


Problems can be formulated and
effectively solved as mathematical
programs



Objective


Describe three mathematical
-
programming
-
based developments
relevant to data mining

Problems


Feature Selection


Clustering


Robust Representation

Problem
-

Feature Selection


Discriminating between two finite
point sets in n
-
dimensional feature
space and utilizes as few of the
feature as possible


Formulated as mathematical
program with a parametric objective
function and linear constraints



Problem
-

Clustering


Assigning m points in the n
-
dimensional real space R
n

to k
clusters


Formulated as determining k centers
in R
n
, the sum of distances of each
point to the nearest center is
minimized

Problem
-

Robust Representation


Modeling a system of relations in a
manner that preserves the validity of
the representation when the data on
which the model is based changes


Use a sufficiently small error


is
purposely tolerated

Feature Selection


Use the simplest model to describe
the essence of a phenomenon


Binary classification problem:


discriminating between two given point
sets A and B in the n
-
dimensional real
space R
n

by using as few of the n
-
dimensions of the space as possible

Binary classification

W

P


the following are some defined:






A




B

Feature Selection

Successive Linearization Algorithm


w vector is result








Experimentation


32
-
feature
Wisconsin
Prognostic Breast
Cancer(WPBC)


N=32, m = 28, k =
118, r

= 0.05, 4
features,
increasing tenfold
cross
-
validation
correctness by
35.4%


Clustering


Determining k cluster centers, the
sum of the 1
-
norm distances of each
point in a given database to nearest
cluster center is minimized


Minimizing product of two linear
functions on a set defined by linear
inequalities

K
-
Median Algorithm


Need to solve

Experimentation


used as a KDD tool to mine WPBC to
discover medical knowledge


key observation is curves are well
separated





Experimentation

Robust Representation


model remains valid under a class of
data perturbation


Use

-
tolerance zone wherein errors
are disregarded


Better generalization results than
conventional zero
-
tolerance


Robust Representation


A is a m*n matrix, a is a m*1 vector


x is a vector be

learned



find minimize of Ax
-

a


Robust Representation

=

x

A

a


-
tolerate

=

x

A

a

Conclusion


Mathematical programming codes
are reliable and robust codes


Problems solved demonstrate
mathematical programming as
versatile and effective tool for
solving important problems in data
mining and knowledge discovery in
databases

Opinion


Mathematical describe
can explain
about complex problems and
convince others, but

you must be
understand it first