Exercises Data Mining Lecture 1

20 Nov 2013


1. Show that the similarity measure sim does not satisfy the triangle inequality.

2. Use Matlab (or Maple, or any other tool) to plot, or draw by hand:

$V = \{\mathbf{x} = (x, y) \in \mathbb{R}^2 \mid \|\mathbf{x}\|_1 \ge 1 \text{ and } \|\mathbf{x}\|_5 \le 1\}$
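A membership test is enough to explore this set before plotting it. The following is a minimal Python sketch (Python is used here only for illustration; the helper names `p_norm` and `in_V` are hypothetical) that marks the grid points belonging to V.

```python
def p_norm(v, p):
    """Generalized p-norm of a vector v (a sequence of numbers), for finite p > 0."""
    return sum(abs(c) ** p for c in v) ** (1.0 / p)


def in_V(x, y):
    """True if (x, y) satisfies ||x||_1 >= 1 and ||x||_5 <= 1."""
    return p_norm((x, y), 1) >= 1.0 and p_norm((x, y), 5) <= 1.0


# Sample a coarse grid over [-1.2, 1.2]^2 and mark the points inside V with '#'.
steps = [i / 10.0 for i in range(-12, 13)]
for y in reversed(steps):
    print("".join("#" if in_V(x, y) else "." for x in steps))
```

The same predicate can be evaluated on a finer grid and passed to any plotting tool to shade the region.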

3.

a. Use Matlab (etc.) to plot:

$V = \{\mathbf{x} = (x, y) \in \mathbb{R}^2 \mid \|\mathbf{x}\|_{1/2} = 1\}$

b. Argue from the plot why, in the generalized norm $\|\cdot\|_d$, we normally choose $d \ge 1$.
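The points to plot for part a can be generated directly from the definition, since $\|(x, y)\|_p = 1$ is equivalent to $|x|^p + |y|^p = 1$. The following is a minimal Python sketch; the function name `unit_curve` and the sample count are illustrative assumptions.

```python
def unit_curve(p, n=50):
    """Points (x, y) with |x|^p + |y|^p = 1, i.e. ||(x, y)||_p = 1, for p > 0."""
    quadrant = []
    for k in range(n + 1):
        x = k / n
        y = (1.0 - x ** p) ** (1.0 / p)  # solve |x|^p + |y|^p = 1 for y >= 0
        quadrant.append((x, y))
    # Mirror the first-quadrant arc into the other three quadrants.
    return [(sx * x, sy * y) for sx in (1, -1) for sy in (1, -1) for x, y in quadrant]


curve_half = unit_curve(0.5)  # the set V of part a
curve_one = unit_curve(1.0)   # for comparison: the unit "circle" of the 1-norm
```

Drawing both point lists in one figure makes the comparison needed for part b easier.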

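4.

Let:

$g = \begin{pmatrix} 4 & 1 \\ 1 & 4 \end{pmatrix}$

a. Use Matlab (etc.) to plot $V = \{\mathbf{x} = (x, y) \in \mathbb{R}^2 \mid \|\mathbf{x}\|_g = 4\}$, where $\|\cdot\|_g$ is the Riemannian norm with metric $g$.

b. Determine the eigenvalues $\lambda_i$ and eigenvectors $\mathbf{v}_i$ of $g$, and draw the vectors $\lambda_1 \mathbf{v}_1$ and $\lambda_2 \mathbf{v}_2$ in the same plot as $V$.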
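The curve in part a can be sampled by rescaling unit direction vectors. The following minimal Python sketch assumes that the Riemannian norm is $\|\mathbf{x}\|_g = \sqrt{\mathbf{x}^T g\, \mathbf{x}}$ (the sheet does not spell out the definition); the names `g_norm` and `level_curve` are hypothetical.

```python
import math

g = [[4.0, 1.0],
     [1.0, 4.0]]


def g_norm(v, g):
    """Norm induced by the metric g, assumed here to be sqrt(v^T g v)."""
    n = len(v)
    return math.sqrt(sum(v[i] * g[i][j] * v[j] for i in range(n) for j in range(n)))


def level_curve(g, radius, n=72):
    """Points x with ||x||_g = radius, one per direction angle."""
    points = []
    for k in range(n):
        theta = 2.0 * math.pi * k / n
        u = (math.cos(theta), math.sin(theta))
        scale = radius / g_norm(u, g)  # norms are homogeneous: ||s u|| = s ||u||
        points.append((scale * u[0], scale * u[1]))
    return points


V = level_curve(g, 4.0)
```

Connecting the points of `V` in order gives a closed curve that can be drawn together with the scaled eigenvectors of part b.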

5.

Consider the dataset DAMlex3.mat. Load this file in Matlab with:

load 'DAMlex3.mat' -ascii

This set contains ten points $\mathbf{x}_i \in \mathbb{R}^4$, $i = 1, \dots, 10$. Let the matrix $d$ hold the distances between these points, where $d(i,j)$ is the distance between the points $\mathbf{x}_i$ and $\mathbf{x}_j$ according to a specific norm or metric.

a. Determine $d$ in case the norm is:

i. Euclidean,
ii. the max-norm, i.e. the generalized $p$-norm with $p = \infty$: $\|\mathbf{x}\|_\infty = \max_k |x_k|$,
iii. the generalized $p$-norm with $p = 4$.
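All three cases of part a share one structure: take the chosen norm of every pairwise difference. The following is a minimal Python sketch; the function names are hypothetical, and the three sample points are stand-in data, not the contents of DAMlex3.mat.

```python
import math


def p_distance(a, b, p):
    """Distance ||a - b||_p; p may be math.inf for the max-norm."""
    diffs = [abs(ai - bi) for ai, bi in zip(a, b)]
    if p == math.inf:
        return max(diffs)
    return sum(t ** p for t in diffs) ** (1.0 / p)


def distance_matrix(points, p):
    """Matrix d with d[i][j] = ||x_i - x_j||_p."""
    return [[p_distance(a, b, p) for b in points] for a in points]


# Hypothetical stand-in data: three points in IR^4 (not the DAMlex3.mat values).
points = [(0.0, 0.0, 0.0, 0.0), (1.0, 2.0, 2.0, 0.0), (1.0, 1.0, 1.0, 1.0)]
d_euclid = distance_matrix(points, 2)
d_max = distance_matrix(points, math.inf)
d_four = distance_matrix(points, 4)
```

Each result is symmetric with a zero diagonal, which is a quick sanity check for the real dataset as well.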

b. Consider the matrix $g$ (footnote 1):
















2

1

1

1
-
1

2

2
-

2
-
1

2
-

7

1
1
-

2
-

1

6
5
1
g


i. Show that $g$ is a valid metric.

ii. Compute the Riemannian distance matrix $d$ under the metric $g$.
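One common reading of "valid metric" (an assumption here, since the sheet does not define it) is that $g$ is symmetric and positive definite, so that $\sqrt{\mathbf{v}^T g\, \mathbf{v}}$ is a norm. The following minimal Python sketch checks this with a Cholesky factorization and computes the induced distance; the function names and the 2x2 example matrices are hypothetical and are not the matrix of this exercise.

```python
import math


def is_symmetric_positive_definite(g, tol=1e-12):
    """Check symmetry, then attempt a Cholesky factorization g = L L^T."""
    n = len(g)
    if any(abs(g[i][j] - g[j][i]) > tol for i in range(n) for j in range(n)):
        return False
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = g[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= tol:  # a non-positive pivot means g is not positive definite
                    return False
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True


def g_distance(a, b, g):
    """Distance sqrt((a - b)^T g (a - b)) under the metric g."""
    v = [ai - bi for ai, bi in zip(a, b)]
    n = len(v)
    return math.sqrt(sum(v[i] * g[i][j] * v[j] for i in range(n) for j in range(n)))


# Hypothetical 2x2 example matrices (not the 4x4 matrix of this exercise).
g_ok = [[2.0, 1.0], [1.0, 2.0]]
g_bad = [[1.0, 2.0], [2.0, 1.0]]  # symmetric, but has eigenvalue -1
```

Applying `g_distance` to every pair of points yields the distance matrix of part ii once $g$ has passed the check.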

6.

Consider the dataset DAMlex1.mat. Load this file in Matlab with:

load 'DAMlex1.mat' -ascii

a. Determine the two principal axes $\mathbf{a}_1$ and $\mathbf{a}_2$ by applying the Matlab source pca.m, available on the web. What fraction of the variance in the data is explained by these two components?

b. The algorithm in pca.m projects the dataset onto the plane spanned by the two principal axes $\mathbf{a}_1$ and $\mathbf{a}_2$. Plot the dataset in this plane.
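The file pca.m is not reproduced here, so the following is only a minimal Python sketch of what such a routine typically computes, not that file's implementation. It assumes that the principal axes are the leading eigenvectors of the sample covariance matrix and finds them by power iteration with deflation, a technique chosen here for brevity; all function names are hypothetical.

```python
import math


def covariance(data):
    """Sample covariance matrix and centered copy of data (list of equal-length rows)."""
    n, dim = len(data), len(data[0])
    mean = [sum(row[k] for row in data) / n for k in range(dim)]
    centered = [[row[k] - mean[k] for k in range(dim)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    return cov, centered


def top_eigenpair(m, iterations=500):
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix.

    Power iteration; assumes the start vector is not orthogonal to the answer.
    """
    dim = len(m)
    v = [1.0 / (k + 1) for k in range(dim)]
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            return 0.0, v
        v = [c / norm for c in w]
    lam = sum(v[i] * m[i][j] * v[j] for i in range(dim) for j in range(dim))
    return lam, v


def two_principal_axes(data):
    """Return (a1, a2, explained variance fraction, 2D projection of the data)."""
    cov, centered = covariance(data)
    dim = len(cov)
    lam1, a1 = top_eigenpair(cov)
    # Deflation: remove the first component so the next run finds the second.
    deflated = [[cov[i][j] - lam1 * a1[i] * a1[j] for j in range(dim)]
                for i in range(dim)]
    lam2, a2 = top_eigenpair(deflated)
    total = sum(cov[i][i] for i in range(dim))  # trace = total variance
    projection = [(sum(r[k] * a1[k] for k in range(dim)),
                   sum(r[k] * a2[k] for k in range(dim))) for r in centered]
    return a1, a2, (lam1 + lam2) / total, projection
```

The returned projection is the list of 2D points to plot in part b; the signs of the axes are arbitrary, so the plot may appear mirrored relative to other implementations.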

7.

Consider the dataset DAMlex2.mat. It contains the distances $d$ between ten unknown points $\mathbf{x}_i$, where $d(i,j)$ is the Euclidean distance between the points $\mathbf{x}_i$ and $\mathbf{x}_j$. The objective is to compute or estimate the points $X = \{\mathbf{x}_1^T, \mathbf{x}_2^T, \dots, \mathbf{x}_{10}^T\}$ from this distance matrix $d$.

a. Implement the algorithm described in Hand et al., equations 3.14 and 3.15 in section 3.7, pp. 84-86, for computing the matrix $B = XX^T$.
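The cited equations are not reproduced on this sheet. The following minimal Python sketch assumes that they correspond to the standard double-centering step of classical multidimensional scaling, with the origin placed at the centroid of the points; the function name `gram_from_distances` is hypothetical.

```python
def gram_from_distances(d):
    """Double-center the squared distances to obtain B = X X^T.

    Assumes d holds Euclidean distances and that the points are centered
    at their centroid.
    """
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(sq[i]) / n for i in range(n)]
    col_mean = [sum(sq[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - col_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]


# Hypothetical check: three collinear points at -1, 0 and 1 (centroid at 0).
d_example = [[0.0, 1.0, 2.0],
             [1.0, 0.0, 1.0],
             [2.0, 1.0, 0.0]]
B_example = gram_from_distances(d_example)
```

For this small example the result can be compared by hand with the products of the coordinates -1, 0 and 1.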

b. Show that the diagonal components of $B$ are the squared Euclidean lengths of the ten sought points.

Footnote 1: A $D \times D$ matrix with $D > 3$ is often called a tensor.

c. Suppose that the points lie in a plane, i.e. $\mathbf{x}_i \in \mathbb{R}^2$. Argue that one can choose one arbitrary point to define the first coordinate axis (e.g. the $x$-axis). Call your selected index $i^*$, i.e. your selected point is $\mathbf{x}_{i^*}$.


d. Argue that $B_{ij}$ is the inner product between the points $\mathbf{x}_i$ and $\mathbf{x}_j$. Argue that using $B_{jj}$ and $B_{ij}$ (with $i = i^*$) we can determine the length of $\mathbf{x}_j$ and its angle with the $x$-axis. Argue that in general this allows for two solutions of $\mathbf{x}_j$.

e. Now select another point, $\mathbf{x}_{j^*}$, and use it to define the second axis, the $y$-axis. Show that, if the underlying set $X$ were truly 2D, using $i^*$ and $j^*$ we can compute all the coordinates $\mathbf{x}_i$ from $B(i,i)$, $B(i,i^*)$, and $B(i,j^*)$.

f. Determine in this way the ten 2D coordinates $X$ for the distance matrix $d$.

g. What will happen if the underlying set $X$ is not truly 2D?
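The construction in parts c to f can be sketched in a few lines once $B$ is available. The following minimal Python sketch is one possible reading of those steps, not a reference solution: it assumes the points are truly 2D, that $\mathbf{x}_{i^*}$ is not at the origin, and that $\mathbf{x}_{j^*}$ does not lie on the line through the origin and $\mathbf{x}_{i^*}$. The function name and the example matrix are hypothetical.

```python
import math


def coordinates_from_gram(B, i_star, j_star):
    """Recover 2D coordinates from B = X X^T, using two reference points.

    x_{i*} fixes the direction of the x-axis; x_{j*} fixes which side is positive y.
    """
    length_i = math.sqrt(B[i_star][i_star])
    # First coordinate of every point: projection onto the unit vector along x_{i*}.
    xs = [B[k][i_star] / length_i for k in range(len(B))]
    # Place x_{j*} in the upper half-plane to resolve the mirror ambiguity.
    a = xs[j_star]
    b = math.sqrt(B[j_star][j_star] - a * a)
    # B[k][j*] = x_k * a + y_k * b, so solve for y_k.
    ys = [(B[k][j_star] - xs[k] * a) / b for k in range(len(B))]
    return list(zip(xs, ys))


# Hypothetical example: B built from the points (1, 0), (0, 2), (-1, -2).
B_example = [[1.0, 0.0, -1.0],
             [0.0, 4.0, -4.0],
             [-1.0, -4.0, 5.0]]
coords = coordinates_from_gram(B_example, 0, 1)
```

A useful check on real data is to recompute the pairwise distances from the recovered coordinates and compare them with $d$; large discrepancies are relevant to part g.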