# Clustering Uncertain Data

Nov 25, 2013

CS 290 Project

Nick Larusso

Brian Ruttenberg

Motivation

Many data acquisition tools provide uncertain data

e.g., sensor networks, image analysis

Records are no longer points in multidimensional
space, but regions based on the certainty of the data

New methods are required to manage and learn from
this data

Probabilistic Analysis of Ganglion
Cell Morphology

Bioimages are inherently uncertain

We would like to answer questions like “how large
is the cell soma?”, “how many dendrites are there,
and how often do they branch?”

It is important to provide a level of confidence in
each measurement to avoid error propagation

Project Goal

Approximately 200 images of ganglion cells under various conditions

healthy cells, detached retina (7d, 28d, 56d)

Probabilistic measurements of soma size, dendritic
field size, and dendritic field density for each cell

Want to cluster these cells to determine the effect of
retinal detachment on cell morphology

UK-means Algorithm

The k-means algorithm minimizes the sum of squared errors (SSE)

UK-means minimizes the expected sum of squared errors

Computed by taking the expected value of each dimension

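$$\mathrm{SSE} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert c_j - x_i \rVert^2$$

$$E\left[\sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert c_j - x_i \rVert^2\right] = \sum_{j=1}^{K} \sum_{x_i \in C_j} \int \lVert c_j - x_i \rVert^2 f(x_i)\, dx_i$$

UK-Means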

This idea may be sufficient for Gaussian-like distributions?

Does not account for variance in data
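
A minimal sketch of the expected-value reduction, assuming each uncertain object is a discrete distribution over 1-D values (the `values`/`probs` representation and all function names are illustrative, not taken from the slides). It checks the identity E[(c - X)^2] = (c - E[X])^2 + Var(X): the variance term does not depend on the centroid c, which is why expected values are enough for UK-means, and also why two objects with the same mean but very different spread look identical to it.

```python
def expected_value(values, probs):
    """Mean of a discrete distribution."""
    return sum(v * p for v, p in zip(values, probs))


def variance(values, probs):
    """Variance of a discrete distribution."""
    mu = expected_value(values, probs)
    return sum(p * (v - mu) ** 2 for v, p in zip(values, probs))


def expected_sq_dist(c, values, probs):
    """E[(c - X)^2] computed directly over all states of X."""
    return sum(p * (c - v) ** 2 for v, p in zip(values, probs))


# Two hypothetical uncertain objects: same mean, different spread.
narrow = ([9.0, 10.0, 11.0], [0.25, 0.5, 0.25])
wide = ([0.0, 10.0, 20.0], [0.25, 0.5, 0.25])

c = 7.0  # an arbitrary candidate centroid
for values, probs in (narrow, wide):
    direct = expected_sq_dist(c, values, probs)
    split = (c - expected_value(values, probs)) ** 2 + variance(values, probs)
    assert abs(direct - split) < 1e-9

# After the reduction, both objects collapse to the same single point.
means = [expected_value(*obj) for obj in (narrow, wide)]
```

Here `means` holds the same value for both objects even though their variances differ by a factor of 100, which is the limitation noted above.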

All Possible Worlds (APW)
Probabilistic Clustering

Instead of representing with a single value, consider
all possible values for a distribution weighted by
their respective probabilities

Provides a much better description of the data than
expected value

APW Example

Choose one state from
object A

For each possible state
in object B calculate
distance

Continue for each state
in A
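
The enumeration in this example can be sketched as follows, assuming each object is a list of discrete `(value, probability)` states, the two objects are independent, and distance is the absolute difference in one dimension. The data and function name are hypothetical.

```python
from itertools import product


def pairwise_state_distances(obj_a, obj_b):
    """Enumerate every (state of A, state of B) pair.

    Each object is a list of (value, probability) states. Returns a list of
    (value_a, value_b, distance, joint_probability) tuples, assuming the two
    objects are independent.
    """
    rows = []
    # product() holds one state of A fixed while it walks all states of B,
    # then moves on to the next state of A.
    for (va, pa), (vb, pb) in product(obj_a, obj_b):
        rows.append((va, vb, abs(va - vb), pa * pb))
    return rows


# Hypothetical objects with two and three states.
obj_a = [(1.0, 0.5), (3.0, 0.5)]
obj_b = [(2.0, 0.2), (4.0, 0.3), (6.0, 0.5)]

rows = pairwise_state_distances(obj_a, obj_b)
expected_distance = sum(d * p for _, _, d, p in rows)
```

The result is a full distribution over distances (six weighted pairs here) rather than the single number an expected-value comparison would give.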

APW Clustering

Compute probability of a possible world

$$P(\text{world}_j) = \prod_{i} P(\text{object}_i = x(i))$$

Where x(i) is the value chosen for object i

Cluster (certain) objects using k-means

Combine clustering results for each possible world

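
A small sketch of the world-probability step, assuming objects are independent and each is a list of discrete `(value, probability)` states. The example objects and the function name are hypothetical.

```python
from itertools import product
from math import prod


def enumerate_worlds(objects):
    """Yield (world, probability) for every possible world.

    `objects` is a list of uncertain objects, each a list of
    (value, probability) states. A world picks one state per object, and its
    probability is the product of the chosen states' probabilities
    (independence between objects is assumed).
    """
    for choice in product(*objects):
        world = tuple(value for value, _ in choice)
        yield world, prod(p for _, p in choice)


# Three hypothetical 1-D objects with 2, 2, and 3 states: 12 worlds in total.
objects = [
    [(1.0, 0.7), (2.0, 0.3)],
    [(5.0, 0.4), (6.0, 0.6)],
    [(1.5, 0.2), (5.5, 0.5), (9.0, 0.3)],
]

worlds = list(enumerate_worlds(objects))
most_likely_world, most_likely_p = max(worlds, key=lambda wp: wp[1])
```

Each world is a set of ordinary points, so it can be handed to any standard k-means routine. The world probabilities sum to 1 and can serve as weights when the per-world clusterings are combined.
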
APW Computational Costs

N = # of uncertain objects

D = # of dimensions of each object

Assume each dimension is described by a
distribution over a constant number of values, C

(D * C)^N possible worlds

Our data:

D ~ 3, C ~ 15, N ~ 200 => we need a very fast computer!
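
To make the scale concrete, the count can be evaluated in log space for the quoted values. The sketch below evaluates the count as stated, (D * C)^N. For comparison it also evaluates a stricter count, C^(D * N), which applies under the assumption that every dimension of every object independently takes one of C values.

```python
from math import log10

D, C, N = 3, 15, 200  # approximate values quoted for the ganglion cell data

# Count as stated: (D * C)^N.
log10_stated = N * log10(D * C)

# Count if every dimension of every object independently takes one of C
# values: C^(D * N). This is an assumption about how worlds are formed.
log10_independent = D * N * log10(C)

print(f"(D*C)^N ~ 10^{log10_stated:.0f}")
print(f"C^(D*N) ~ 10^{log10_independent:.0f}")
```

Either way the number of worlds exceeds 10^300, so exhaustive enumeration is out of reach.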

Gibbs Sampling

Computationally infeasible to calculate all possible
worlds, so sample from this space instead

Intuition: We really only care about the possible worlds that carry high probabilities, so we can weight our sampling toward these worlds

APW Clustering Using Gibbs
Sampling

Randomly pick values for each dimension of each object

Iterate through each dimension of every object

For a given object and a given dimension

Pick a sample value weighted by the probability distribution

Calculate probability of world

Cluster objects via k-means

The objects are then binned according to how often a
particular clustering result shows up
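
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each dimension is a discrete list of `(value, probability)` pairs, dimensions are independent, k-means is a plain Lloyd's iteration, and the "binning" is taken to mean counting how often each pair of objects shares a cluster. Because worlds are drawn in proportion to their probability, each sampled world is counted once rather than weighted by its probability again.

```python
import random
from itertools import combinations


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means on a list of equal-length tuples; returns labels."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def sample_coclustering(objects, k, sweeps, seed=0):
    """Estimate how often each pair of objects lands in the same cluster.

    `objects[i][d]` is a list of (value, probability) pairs for dimension d
    of object i.
    """
    rng = random.Random(seed)
    n = len(objects)
    # Random initial value for every dimension of every object.
    world = [[rng.choice(dim)[0] for dim in obj] for obj in objects]
    together = {pair: 0 for pair in combinations(range(n), 2)}
    for _ in range(sweeps):
        # Resample each dimension of each object from its own distribution.
        for i, obj in enumerate(objects):
            for d, dim in enumerate(obj):
                values, probs = zip(*dim)
                world[i][d] = rng.choices(values, weights=probs)[0]
        # Cluster the now-certain objects and tally pairwise co-membership.
        labels = kmeans([tuple(p) for p in world], k, rng)
        for a, b in together:
            together[(a, b)] += labels[a] == labels[b]
    return {pair: count / sweeps for pair, count in together.items()}


# Four hypothetical 2-D objects: two near the origin, two near (10, 10).
objects = [
    [[(0.0, 0.5), (0.5, 0.5)], [(0.0, 0.5), (0.5, 0.5)]],
    [[(1.0, 0.5), (1.5, 0.5)], [(1.0, 0.5), (1.5, 0.5)]],
    [[(10.0, 0.5), (10.5, 0.5)], [(10.0, 0.5), (10.5, 0.5)]],
    [[(11.0, 0.5), (11.5, 0.5)], [(11.0, 0.5), (11.5, 0.5)]],
]
freq = sample_coclustering(objects, k=2, sweeps=50)
```

With these well-separated objects the pairwise frequencies are all 0 or 1. Objects whose distributions overlap would produce intermediate frequencies.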

Preliminary Results

Interpretation of results is the biggest challenge

Preliminary results were run on 7-day ganglion cells

30 cells each from detached and normal retinas

Both the UK-means and APW approaches were run

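7D Normal: UK-means

| Cell | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cluster | 0 | 1 | 2 | 1 | 0 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 2 | 0 | 0 | 1 | 0 | 2 | 2 | 1 | 0 | 0 | 2 | 2 | 0 | 0 | 2 | 0 |
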
7D Normal: APW

| | Cell 1 | Cell 2 | Cell 3 | Cell 4 | Cell 5 | Cell 6 | Cell 7 | Cell 8 | Cell 9 | Cell 10 | Cell 11 | Cell 12 | Cell 13 | Cell 14 | Cell 15 | Cell 16 | Cell 17 | Cell 18 | Cell 19 | Cell 20 | Cell 21 | Cell 22 | Cell 23 | Cell 24 | Cell 25 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cell 1 | | 24.3% | 9.4% | 24.8% | 52.2% | 24.8% | 24.4% | 24.4% | 1.6% | 24.3% | 25.9% | 32.7% | 32.9% | 25.5% | 26.5% | 80.1% | 65.7% | 25.1% | 76.0% | 40.1% | 26.5% | 43.5% | 74.3% | 76.4% | 12.9% |
| Cell 2 | 24.3% | | 0.0% | 99.1% | 67.3% | 99.0% | 99.6% | 99.8% | 0.0% | 100.0% | 97.9% | 90.2% | 0.1% | 98.2% | 0.4% | 13.3% | 3.6% | 99.0% | 22.0% | 0.1% | 0.3% | 73.0% | 3.7% | 5.4% | 0.1% |
| Cell 3 | 9.4% | 0.0% | | 0.0% | 0.9% | 0.0% | 0.0% | 0.0% | 84.1% | 0.0% | 0.0% | 0.1% | 60.9% | 0.0% | 66.3% | 9.7% | 26.6% | 0.0% | 8.5% | 53.7% | 65.7% | 2.4% | 17.7% | 15.5% | 77.0% |
| Cell 4 | 24.8% | 99.1% | 0.0% | | 68.2% | 99.6% | 99.5% | 99.3% | 0.0% | 99.1% | 97.9% | 91.0% | 0.1% | 99.1% | 0.4% | 13.8% | 3.7% | 98.3% | 22.5% | 0.1% | 0.4% | 73.3% | 3.8% | 5.6% | 0.1% |
| Cell 5 | 52.2% | 67.3% | 0.9% | 68.2% | | 68.3% | 67.7% | 67.5% | 0.0% | 67.3% | 68.6% | 74.5% | 6.2% | 69.1% | 5.9% | 45.5% | 27.1% | 67.9% | 52.3% | 9.3% | 6.6% | 67.8% | 32.8% | 35.6% | 2.5% |
| Cell 6 | 24.8% | 99.0% | 0.0% | 99.6% | 68.3% | | 99.4% | 99.2% | 0.0% | 99.0% | 97.9% | 91.1% | 0.1% | 99.2% | 0.4% | 13.9% | 3.7% | 98.2% | 22.5% | 0.1% | 0.4% | 73.3% | 3.8% | 5.6% | 0.1% |
| Cell 7 | 24.4% | 99.6% | 0.0% | 99.5% | 67.7% | 99.4% | | 99.8% | 0.0% | 99.6% | 98.1% | 90.6% | 0.1% | 98.6% | 0.4% | 13.4% | 3.7% | 98.7% | 22.1% | 0.1% | 0.3% | 73.2% | 3.7% | 5.4% | 0.1% |
| Cell 8 | 24.4% | 99.8% | 0.0% | 99.3% | 67.5% | 99.2% | 99.8% | | 0.0% | 99.8% | 98.1% | 90.4% | 0.1% | 98.4% | 0.4% | 13.4% | 3.6% | 98.8% | 22.0% | 0.1% | 0.3% | 73.1% | 3.7% | 5.4% | 0.1% |
| Cell 9 | 1.6% | 0.0% | 84.1% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | | 0.0% | 0.0% | 0.0% | 49.7% | 0.0% | 59.5% | 0.5% | 13.2% | 0.0% | 0.8% | 41.2% | 60.8% | 0.3% | 4.0% | 2.8% | 80.4% |
| Cell 10 | 24.3% | 100.0% | 0.0% | 99.1% | 67.3% | 99.0% | 99.6% | 99.8% | 0.0% | | 97.9% | 90.2% | 0.1% | 98.2% | 0.4% | 13.3% | 3.6% | 98.9% | 22.0% | 0.1% | 0.3% | 73.0% | 3.7% | 5.4% | 0.1% |
| Cell 11 | 25.9% | 97.9% | 0.0% | 97.9% | 68.6% | 97.9% | 98.1% | 98.1% | 0.0% | 97.9% | | 90.1% | 0.4% | 97.4% | 0.6% | 15.1% | 4.7% | 97.1% | 23.7% | 0.5% | 0.6% | 72.9% | 5.0% | 6.9% | 0.2% |
| Cell 12 | 32.7% | 90.2% | 0.1% | 91.0% | 74.5% | 91.1% | 90.6% | 90.4% | 0.0% | 90.2% | 90.1% | | 1.2% | 91.7% | 1.5% | 22.7% | 9.5% | 90.0% | 31.0% | 1.8% | 1.7% | 72.9% | 11.2% | 13.5% | 0.6% |
| Cell 13 | 32.9% | 0.1% | 60.9% | 0.1% | 6.2% | 0.1% | 0.1% | 0.1% | 49.7% | 0.1% | 0.4% | 1.2% | | 0.1% | 67.6% | 38.4% | 56.3% | 0.2% | 32.8% | 72.0% | 64.7% | 9.4% | 50.5% | 47.7% | 59.5% |
| Cell 14 | 25.5% | 98.2% | 0.0% | 99.1% | 69.1% | 99.2% | 98.6% | 98.4% | 0.0% | 98.2% | 97.4% | 91.7% | 0.1% | | 0.4% | 14.7% | 3.9% | 97.5% | 23.3% | 0.1% | 0.4% | 73.6% | 4.1% | 6.0% | 0.1% |
| Cell 15 | 26.5% | 0.4% | 66.3% | 0.4% | 5.9% | 0.4% | 0.4% | 0.4% | 59.5% | 0.4% | 0.6% | 1.5% | 67.6% | 0.4% | | 30.4% | 46.7% | 0.5% | 26.3% | 64.4% | 63.6% | 7.9% | 40.7% | 38.3% | 63.9% |
| Cell 16 | 80.1% | 13.3% | 9.7% | 13.8% | 45.5% | 13.9% | 13.4% | 13.4% | 0.5% | 13.3% | 15.1% | 22.7% | 38.4% | 14.7% | 30.4% | | 76.0% | 14.3% | 84.3% | 46.7% | 30.2% | 38.2% | 86.1% | 88.6% | 13.8% |
| Cell 17 | 65.7% | 3.6% | 26.6% | 3.7% | 27.1% | 3.7% | 3.7% | 3.6% | 13.2% | 3.6% | 4.7% | 9.5% | 56.3% | 3.9% | 46.7% | 76.0% | | 4.2% | 68.1% | 62.7% | 45.2% | 24.6% | 83.2% | 82.2% | 29.1% |
| Cell 18 | 25.1% | 99.0% | 0.0% | 98.3% | 67.9% | 98.2% | 98.7% | 98.8% | 0.0% | 98.9% | 97.1% | 90.0% | 0.2% | 97.5% | 0.5% | 14.3% | 4.2% | | 22.8% | 0.3% | 0.5% | 72.9% | 4.4% | 6.2% | 0.2% |
| Cell 19 | 76.0% | 22.0% | 8.5% | 22.5% | 52.3% | 22.5% | 22.1% | 22.0% | 0.8% | 22.0% | 23.7% | 31.0% | 32.8% | 23.3% | 26.3% | 84.3% | 68.1% | 22.8% | | 40.4% | 26.5% | 43.3% | 77.4% | 79.8% | 12.1% |
| Cell 20 | 40.1% | 0.1% | 53.7% | 0.1% | 9.3% | 0.1% | 0.1% | 0.1% | 41.2% | 0.1% | 0.5% | 1.8% | 72.0% | 0.1% | 64.4% | 46.7% | 62.7% | 0.3% | 40.4% | | 61.5% | 11.9% | 58.8% | 56.0% | 52.9% |
| Cell 21 | 26.5% | 0.3% | 65.7% | 0.4% | 6.6% | 0.4% | 0.3% | 0.3% | 60.8% | 0.3% | 0.6% | 1.7% | 64.7% | 0.4% | 63.6% | 30.2% | 45.2% | 0.5% | 26.5% | 61.5% | | 8.1% | 39.8% | 37.5% | 63.4% |
| Cell 22 | 43.5% | 73.0% | 2.4% | 73.3% | 67.8% | 73.3% | 73.2% | 73.1% | 0.3% | 73.0% | 72.9% | 72.9% | 9.4% | 73.6% | 7.9% | 38.2% | 24.6% | 72.9% | 43.3% | 11.9% | 8.1% | | 28.2% | 30.4% | 3.7% |
| Cell 23 | 74.3% | 3.7% | 17.7% | 3.8% | 32.8% | 3.8% | 3.7% | 3.7% | 4.0% | 3.7% | 5.0% | 11.2% | 50.5% | 4.1% | 40.7% | 86.1% | 83.2% | 4.4% | 77.4% | 58.8% | 39.8% | 28.2% | | 91.5% | 21.3% |
| Cell 24 | 76.4% | 5.4% | 15.5% | 5.6% | 35.6% | 5.6% | 5.4% | 5.4% | 2.8% | 5.4% | 6.9% | 13.5% | 47.7% | 6.0% | 38.3% | 88.6% | 82.2% | 6.2% | 79.8% | 56.0% | 37.5% | 30.4% | 91.5% | | 19.3% |
| Cell 25 | 12.9% | 0.1% | 77.0% | 0.1% | 2.5% | 0.1% | 0.1% | 0.1% | 80.4% | 0.1% | 0.2% | 0.6% | 59.5% | 0.1% | 63.9% | 13.8% | 29.1% | 0.2% | 12.1% | 52.9% | 63.4% | 3.7% | 21.3% | 19.3% | |
| Cell 26 | 23.8% | 0.0% | 70.1% | 0.0% | 3.2% | 0.0% | 0.0% | 0.0% | 61.6% | 0.0% | 0.1% | 0.5% | 72.7% | 0.0% | 70.3% | 27.4% | 46.5% | 0.1% | 23.3% | 68.3% | 67.5% | 6.4% | 38.9% | 36.2% | 67.1% |
| Cell 27 | 66.7% | 17.3% | 18.1% | 17.6% | 40.6% | 17.6% | 17.4% | 17.3% | 8.1% | 17.3% | 18.6% | 23.8% | 43.0% | 18.0% | 35.6% | 74.6% | 68.6% | 17.9% | 69.4% | 49.5% | 34.9% | 35.4% | 75.0% | 75.6% | 20.9% |
| Cell 28 | 80.3% | 10.3% | 10.9% | 10.7% | 42.3% | 10.7% | 10.3% | 10.3% | 0.7% | 10.3% | 12.0% | 19.5% | 41.1% | 11.5% | 32.6% | 93.2% | 78.9% | 11.2% | 84.3% | 49.5% | 32.2% | 35.7% | 89.0% | 91.2% | 15.0% |
| Cell 29 | 27.7% | 0.0% | 66.1% | 0.0% | 4.7% | 0.0% | 0.0% | 0.0% | 56.6% | 0.0% | 0.2% | 0.8% | 72.7% | 0.0% | 68.8% | 32.0% | 50.4% | 0.1% | 27.4% | 69.2% | 65.9% | 7.7% | 43.7% | 41.0% | 63.8% |
| Cell 30 | 67.8% | 19.9% | 16.1% | 20.3% | 44.5% | 20.3% | 20.0% | 20.0% | 6.9% | 19.9% | 21.3% | 27.0% | 39.7% | 20.8% | 32.8% | 75.3% | 66.5% | 20.6% | 70.7% | 46.0% | 32.2% | 37.9% | 73.4% | 74.6% | 18.9% |

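7D Detached: UK-means

| Cell | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cluster | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 |
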
7D Detached: APW

| | Cell 5 | Cell 6 | Cell 7 | Cell 8 | Cell 9 | Cell 10 | Cell 11 | Cell 12 | Cell 13 | Cell 14 | Cell 15 | Cell 16 | Cell 17 | Cell 18 | Cell 19 | Cell 20 | Cell 21 | Cell 22 | Cell 23 | Cell 24 | Cell 25 | Cell 26 | Cell 27 | Cell 28 | Cell 29 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cell 1 | 70.2% | 76.1% | 71.7% | 1.4% | 71.5% | 73.5% | 52.2% | 74.2% | 72.9% | 66.5% | 0.1% | 50.9% | 45.4% | 67.0% | 68.6% | 47.3% | 79.7% | 44.1% | 77.2% | 46.8% | 79.4% | 42.5% | 45.7% | 42.4% | 48.4% |
| Cell 2 | 17.1% | 26.2% | 24.4% | 0.1% | 44.8% | 38.4% | 11.8% | 29.9% | 21.8% | 15.2% | 0.0% | 90.9% | 96.6% | 14.7% | 66.2% | 94.5% | 32.0% | 96.6% | 29.1% | 95.6% | 34.3% | 96.2% | 96.4% | 96.2% | 94.0% |
| Cell 3 | 70.6% | 74.8% | 71.1% | 2.1% | 69.6% | 71.6% | 53.2% | 72.7% | 72.4% | 67.1% | 0.3% | 49.6% | 44.8% | 67.6% | 65.1% | 46.5% | 77.0% | 43.6% | 75.1% | 46.1% | 76.4% | 42.1% | 45.1% | 42.0% | 47.5% |
| Cell 4 | 47.8% | 55.4% | 51.9% | 0.8% | 61.1% | 59.4% | 34.9% | 55.9% | 51.5% | 44.7% | 0.1% | 69.2% | 67.7% | 44.8% | 69.1% | 68.0% | 60.1% | 66.5% | 57.1% | 68.5% | 60.8% | 64.9% | 67.9% | 64.9% | 69.1% |
| Cell 5 | | 89.9% | 90.4% | 6.6% | 70.7% | 77.0% | 78.1% | 85.7% | 93.2% | 95.4% | 1.6% | 22.2% | 15.9% | 96.7% | 48.6% | 18.2% | 84.2% | 15.1% | 86.9% | 17.1% | 81.9% | 14.0% | 16.1% | 14.0% | 18.8% |
| Cell 6 | 89.9% | | 86.8% | 3.3% | 74.6% | 79.7% | 69.4% | 85.9% | 89.6% | 86.3% | 0.4% | 31.8% | 25.4% | 87.0% | 57.5% | 27.7% | 88.6% | 24.3% | 88.4% | 26.8% | 86.6% | 22.8% | 25.7% | 22.7% | 28.6% |
| Cell 7 | 90.4% | 86.8% | | 5.5% | 71.0% | 76.3% | 71.5% | 83.1% | 88.1% | 87.8% | 1.6% | 29.4% | 23.4% | 88.9% | 53.5% | 25.6% | 83.6% | 22.6% | 84.7% | 24.6% | 81.8% | 21.4% | 23.7% | 21.4% | 26.3% |
| Cell 8 | 6.6% | 3.3% | 5.5% | | 2.2% | 2.7% | 26.0% | 3.8% | 5.4% | 10.1% | 84.7% | 0.1% | 0.0% | 9.3% | 0.4% | 0.1% | 1.9% | 0.0% | 3.1% | 0.0% | 1.8% | 0.0% | 0.0% | 0.0% | 0.0% |
| Cell 9 | 70.7% | 74.6% | 71.0% | 2.2% | | 71.4% | 53.4% | 72.7% | 72.4% | 67.2% | 0.3% | 49.5% | 44.7% | 67.7% | 65.0% | 46.5% | 76.8% | 43.5% | 75.0% | 46.0% | 76.0% | 42.0% | 45.0% | 41.9% | 47.4% |
| Cell 10 | 77.0% | 79.7% | 76.3% | 2.7% | 71.4% | | 58.7% | 77.1% | 78.1% | 73.7% | 0.6% | 43.7% | 38.1% | 74.3% | 62.8% | 40.0% | 80.7% | 37.0% | 79.5% | 39.4% | 79.6% | 35.5% | 38.4% | 35.4% | 41.0% |
| Cell 11 | 78.1% | 69.4% | 71.5% | 26.0% | 53.4% | 58.7% | | 66.4% | 73.8% | 80.2% | 17.4% | 14.7% | 10.7% | 80.7% | 34.4% | 12.2% | 63.8% | 10.4% | 66.7% | 11.2% | 61.8% | 10.0% | 10.8% | 10.0% | 12.3% |
| Cell 12 | 85.7% | 85.9% | 83.1% | 3.8% | 72.7% | 77.1% | 66.4% | | 85.3% | 82.5% | 0.9% | 35.2% | 29.1% | 83.3% | 58.5% | 31.2% | 84.8% | 28.1% | 84.5% | 30.4% | 83.3% | 26.7% | 29.4% | 26.6% | 32.1% |
| Cell 13 | 93.2% | 89.6% | 88.1% | 5.4% | 72.4% | 78.1% | 73.8% | 85.3% | | 90.6% | 1.5% | 27.0% | 20.7% | 91.6% | 52.9% | 23.0% | 86.1% | 19.8% | 87.3% | 21.9% | 84.0% | 18.6% | 21.0% | 18.6% | 23.6% |
| Cell 14 | 95.4% | 86.3% | 87.8% | 10.1% | 67.2% | 73.7% | 80.2% | 82.5% | 90.6% | | 3.9% | 19.6% | 13.8% | 96.9% | 45.2% | 16.0% | 80.4% | 13.3% | 83.3% | 14.7% | 78.1% | 12.6% | 14.0% | 12.6% | 16.3% |
| Cell 15 | 1.6% | 0.4% | 1.6% | 84.7% | 0.3% | 0.6% | 17.4% | 0.9% | 1.5% | 3.9% | | 0.0% | 0.0% | 2.8% | 0.0% | 0.0% | 0.1% | 0.0% | 0.5% | 0.0% | 0.1% | 0.0% | 0.0% | 0.0% | 0.0% |
| Cell 16 | 22.2% | 31.8% | 29.4% | 0.1% | 49.5% | 43.7% | 14.7% | 35.2% | 27.0% | 19.6% | 0.0% | | 93.4% | 19.3% | 70.9% | 92.4% | 37.8% | 92.1% | 34.8% | 94.2% | 40.1% | 90.4% | 93.7% | 90.4% | 93.7% |
| Cell 17 | 15.9% | 25.4% | 23.4% | 0.0% | 44.7% | 38.1% | 10.7% | 29.1% | 20.7% | 13.8% | 0.0% | 93.4% | | 13.3% | 67.0% | 97.2% | 31.3% | 98.7% | 28.3% | 98.6% | 33.7% | 97.0% | 99.7% | 96.9% | 96.8% |
| Cell 18 | 96.7% | 87.0% | 88.9% | 9.3% | 67.7% | 74.3% | 80.7% | 83.3% | 91.6% | 96.9% | 2.8% | 19.3% | 13.3% | | 45.5% | 15.5% | 81.0% | 12.8% | 84.1% | 14.2% | 78.6% | 12.1% | 13.4% | 12.0% | 15.9% |
| Cell 19 | 48.6% | 57.5% | 53.5% | 0.4% | 65.0% | 62.8% | 34.4% | 58.5% | 52.9% | 45.2% | 0.0% | 70.9% | 67.0% | 45.5% | | 68.2% | 63.5% | 65.7% | 60.2% | 68.3% | 64.8% | 63.9% | 67.3% | 63.9% | 69.6% |
| Cell 20 | 18.2% | 27.7% | 25.6% | 0.1% | 46.5% | 40.0% | 12.2% | 31.2% | 23.0% | 16.0% | 0.0% | 92.4% | 97.2% | 15.5% | 68.2% | | 33.6% | 96.2% | 30.6% | 97.0% | 35.9% | 94.7% | 97.3% | 94.6% | 95.6% |
| Cell 21 | 84.2% | 88.6% | 83.6% | 1.9% | 76.8% | 80.7% | 63.8% | 84.8% | 86.1% | 80.4% | 0.1% | 37.8% | 31.3% | 81.0% | 63.5% | 33.6% | | 30.1% | 88.5% | 32.7% | 89.2% | 28.4% | 31.7% | 28.4% | 34.5% |
| Cell 22 | 15.1% | 24.3% | 22.6% | 0.0% | 43.5% | 37.0% | 10.4% | 28.1% | 19.8% | 13.3% | 0.0% | 92.1% | 98.7% | 12.8% | 65.7% | 96.2% | 30.1% | | 27.2% | 97.3% | 32.4% | 98.2% | 98.4% | 98.2% | 95.5% |
| Cell 23 | 86.9% | 88.4% | 84.7% | 3.1% | 75.0% | 79.5% | 66.7% | 84.5% | 87.3% | 83.3% | 0.5% | 34.8% | 28.3% | 84.1% | 60.2% | 30.6% | 88.5% | 27.2% | | 29.7% | 86.9% | 25.7% | 28.6% | 25.7% | 31.5% |
| Cell 24 | 17.1% | 26.8% | 24.6% | 0.0% | 46.0% | 39.4% | 11.2% | 30.4% | 21.9% | 14.7% | 0.0% | 94.2% | 98.6% | 14.2% | 68.3% | 97.0% | 32.7% | 97.3% | 29.7% | | 35.1% | 95.6% | 98.9% | 95.5% | 97.8% |
| Cell 25 | 81.9% | 86.6% | 81.8% | 1.8% | 76.0% | 79.6% | 61.8% | 83.3% | 84.0% | 78.1% | 0.1% | 40.1% | 33.7% | 78.6% | 64.8% | 35.9% | 89.2% | 32.4% | 86.9% | 35.1% | | 30.8% | 34.0% | 30.7% | 36.9% |
| Cell 26 | 14.0% | 22.8% | 21.4% | 0.0% | 42.0% | 35.5% | 10.0% | 26.7% | 18.6% | 12.6% | 0.0% | 90.4% | 97.0% | 12.1% | 63.9% | 94.7% | 28.4% | 98.2% | 25.7% | 95.6% | 30.8% | | 96.6% | 99.9% | 93.8% |
| Cell 27 | 16.1% | 25.7% | 23.7% | 0.0% | 45.0% | 38.4% | 10.8% | 29.4% | 21.0% | 14.0% | 0.0% | 93.7% | 99.7% | 13.4% | 67.3% | 97.3% | 31.7% | 98.4% | 28.6% | 98.9% | 34.0% | 96.6% | | 96.6% | 97.1% |
| Cell 28 | 14.0% | 22.7% | 21.4% | 0.0% | 41.9% | 35.4% | 10.0% | 26.6% | 18.6% | 12.6% | 0.0% | 90.4% | 96.9% | 12.0% | 63.9% | 94.6% | 28.4% | 98.2% | 25.7% | 95.5% | 30.7% | 99.9% | 96.6% | | 93.7% |
| Cell 29 | 18.8% | 28.6% | 26.3% | 0.0% | 47.4% | 41.0% | 12.3% | 32.1% | 23.6% | 16.3% | 0.0% | 93.7% | 96.8% | 15.9% | 69.6% | 95.6% | 34.5% | 95.5% | 31.5% | 97.8% | 36.9% | 93.8% | 97.1% | 93.7% | |
| Cell 30 | 86.0% | 90.2% | 85.1% | 1.9% | 76.9% | 81.5% | 65.4% | 86.0% | 87.8% | 82.2% | 0.1% | 36.0% | 29.5% | 82.8% | 62.1% | 31.8% | 91.9% | 28.2% | 89.7% | 30.9% | 90.3% | 26.6% | 29.8% | 26.5% | 32.7% |

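7D Normal and Detached: UK-means

Normal

| Cell | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cluster | 0 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 2 | 1 | 2 | 0 | 0 | 1 | 0 | 2 | 2 | 1 | 0 | 0 | 2 | 2 | 0 | 0 | 2 | 0 |

Detached

| Cell | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cluster | 0 | 1 | 0 | 0 | 2 | 0 | 2 | 3 | 0 | 0 | 2 | 0 | 0 | 2 | 4 | 0 | 1 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
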
7D Normal and Detached: APW

Method Validation

Biologists manually cluster the normal cells based
on previous studies [COOMBS06] & [SUN02]

Identify ganglion cell subtypes

Compare these results with the two clustering
algorithms

Future Work

Use Earth Mover's Distance (EMD) as a distance
metric between two distributions
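
As an illustration of what EMD would add, here is a minimal 1-D sketch. It assumes both distributions are histograms over the same evenly spaced bins, in which case EMD reduces to the area between the two cumulative distributions. The function name and example histograms are hypothetical.

```python
from itertools import accumulate


def emd_1d(p, q, bin_width=1.0):
    """Earth Mover's Distance between two histograms on the same 1-D grid.

    `p` and `q` are probability masses over identical, evenly spaced bins.
    In one dimension the EMD is the area between the two cumulative
    distributions.
    """
    if len(p) != len(q):
        raise ValueError("histograms must share the same bins")
    cdf_p = accumulate(p)
    cdf_q = accumulate(q)
    return bin_width * sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


# Hypothetical measurement distributions over bins 0..4.
cell_a = [0.0, 0.0, 1.0, 0.0, 0.0]  # certain value in bin 2
cell_b = [0.0, 0.0, 0.0, 1.0, 0.0]  # certain value, shifted one bin right
cell_c = [0.5, 0.0, 0.0, 0.0, 0.5]  # same mean as cell_a, very uncertain

d_ab = emd_1d(cell_a, cell_b)
d_ac = emd_1d(cell_a, cell_c)
```

`cell_a` and `cell_c` have the same expected value, so an expected-value comparison treats them as identical, while EMD places them farther apart than `cell_a` and `cell_b`.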

Questions