# Programming in Data

Διαχείριση Δεδομένων

20 Νοε 2013 (πριν από 4 χρόνια και 7 μήνες)

98 εμφανίσεις

Mathematical
Programming in Data
Mining

Author: O. L. Mangasarian

-
Cheng Lin

Abstract

Describe mathematical programming
to feature selection, clustering and
robust representation

Outline

Motivation

Objective

Problems

Feature Selection

Clustering

Robust Representation

Conclusion

Motivation

Mathematical programming has been
applied to a great variety of
theoretical

Problems can be formulated and
effectively solved as mathematical
programs

Objective

Describe three mathematical
-
programming
-
based developments
relevant to data mining

Problems

Feature Selection

Clustering

Robust Representation

Problem
-

Feature Selection

Discriminating between two finite
point sets in n
-
dimensional feature
space and utilizes as few of the
feature as possible

Formulated as mathematical
program with a parametric objective
function and linear constraints

Problem
-

Clustering

Assigning m points in the n
-
dimensional real space R
n

to k
clusters

Formulated as determining k centers
in R
n
, the sum of distances of each
point to the nearest center is
minimized

Problem
-

Robust Representation

Modeling a system of relations in a
manner that preserves the validity of
the representation when the data on
which the model is based changes

Use a sufficiently small error

is
purposely tolerated

Feature Selection

Use the simplest model to describe
the essence of a phenomenon

Binary classification problem:

discriminating between two given point
sets A and B in the n
-
dimensional real
space R
n

by using as few of the n
-
dimensions of the space as possible

Binary classification

W

P

the following are some defined:

A

B

Feature Selection

Successive Linearization Algorithm

w vector is result

Experimentation

32
-
feature
Wisconsin
Prognostic Breast
Cancer(WPBC)

N=32, m = 28, k =
118, r

= 0.05, 4
features,
increasing tenfold
cross
-
validation
correctness by
35.4%

Clustering

Determining k cluster centers, the
sum of the 1
-
norm distances of each
point in a given database to nearest
cluster center is minimized

Minimizing product of two linear
functions on a set defined by linear
inequalities

K
-
Median Algorithm

Need to solve

Experimentation

used as a KDD tool to mine WPBC to
discover medical knowledge

key observation is curves are well
separated

Experimentation

Robust Representation

model remains valid under a class of
data perturbation

Use

-
tolerance zone wherein errors
are disregarded

Better generalization results than
conventional zero
-
tolerance

Robust Representation

A is a m*n matrix, a is a m*1 vector

x is a vector be

learned

find minimize of Ax
-

a

Robust Representation

=

x

A

a

-
tolerate

=

x

A

a

Conclusion

Mathematical programming codes
are reliable and robust codes

Problems solved demonstrate
mathematical programming as
versatile and effective tool for
solving important problems in data
mining and knowledge discovery in
databases

Opinion

Mathematical describe
can explain
convince others, but

you must be
understand it first