Clustering Uncertain Data

Nov 25, 2013

CS 290 Project

Nick Larusso

Brian Ruttenberg

Motivation


Many data acquisition tools provide uncertain data


e.g., sensor networks, image analysis, etc.


Records are no longer points in multidimensional
space, but regions based on the certainty of the data


New methods are required to manage and learn from
this data

Probabilistic Analysis of Ganglion
Cell Morphology


Bioimages are inherently uncertain


We would like to answer questions like “how large
is the cell soma?”, “how many dendrites are there,
and how often do they branch?”


It is important to provide a level of confidence in
each measurement to avoid error propagation

Project Goal


Approximately 200 images of ganglion cells under
various conditions


healthy cells, detached retina (7d, 28d, 56d)



Probabilistic measurements of soma size, dendritic
field size, and dendritic field density for each cell


Want to cluster these cells to determine the effect of
retinal detachment on cell morphology

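UK-means Algorithm


K-means algorithm minimizes the sum of squared errors
(SSE)

\[ \mathrm{SSE} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert c_j - x_i \rVert^2 \]


UK-means minimizes the expected sum of squared
errors

\[ E\left[ \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert c_j - x_i \rVert^2 \right] = \sum_{j=1}^{K} \sum_{x_i \in C_j} \int \lVert c_j - x_i \rVert^2 f(x_i)\, dx_i \]


Compute by finding the expected value of each
dimension

UK-Means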


This idea may be sufficient for Gaussian-like
distributions, but what about arbitrary
distributions?


Does not account for variance in data
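
The limitation above can be illustrated with a small sketch. It relies on the standard identity E[(x - c)^2] = (E[x] - c)^2 + Var(x): the variance term does not depend on the centroid c, so an expected-distance objective is driven entirely by each object's mean. The two example distributions and all function names are hypothetical, chosen only for illustration.

```python
def expected_value(values, probs):
    """Mean of a discrete distribution given as parallel lists."""
    return sum(v * p for v, p in zip(values, probs))


def variance(values, probs):
    """Variance of the same discrete distribution."""
    mu = expected_value(values, probs)
    return sum(p * (v - mu) ** 2 for v, p in zip(values, probs))


def expected_sq_dist(values, probs, c):
    """E[(x - c)^2] for a 1-D uncertain object x and a centroid c."""
    return sum(p * (v - c) ** 2 for v, p in zip(values, probs))


# Two hypothetical objects with the same mean but very different spread.
narrow = ([9.0, 10.0, 11.0], [0.25, 0.5, 0.25])
wide = ([0.0, 10.0, 20.0], [0.25, 0.5, 0.25])

for name, (vals, probs) in (("narrow", narrow), ("wide", wide)):
    mu = expected_value(vals, probs)
    var = variance(vals, probs)
    # For any centroid c, the expected squared distance equals (mu - c)^2 + var,
    # so both objects rank candidate centroids identically.
    for c in (7.0, 12.0):
        print(name, c, expected_sq_dist(vals, probs, c), (mu - c) ** 2 + var)
```

Because both objects reduce to the same expected value, an approach based only on expected values treats them as the same point even though one is far less certain than the other.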

All Possible Worlds (APW)
Probabilistic Clustering


Instead of representing with a single value, consider
all possible values for a distribution weighted by
their respective probabilities


Provides a much better description of the data than
expected value

APW Example


Choose one state from
object A


For each possible state
in object B calculate
distance


Continue for each state
in A
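
The pairing of states described above can be sketched as follows. This is only an illustration under stated assumptions: the two objects are one-dimensional, their states are independent, and the names `object_a`, `object_b`, and `pairwise_state_distances` are hypothetical.

```python
from itertools import product

# Hypothetical 1-D uncertain objects: lists of (value, probability) states.
object_a = [(1.0, 0.5), (2.0, 0.5)]
object_b = [(4.0, 0.2), (5.0, 0.8)]


def pairwise_state_distances(a, b):
    """Return (distance, joint probability) for every pair of states.

    Assumes the two objects are independent, so the joint probability of a
    pair of states is the product of the individual probabilities.
    """
    return [(abs(va - vb), pa * pb) for (va, pa), (vb, pb) in product(a, b)]


pairs = pairwise_state_distances(object_a, object_b)

# Each state of A is paired with each state of B, as on the slide.
for distance, probability in pairs:
    print(f"distance={distance:.1f} probability={probability:.2f}")

# One possible summary: the probability-weighted mean distance.
expected_distance = sum(d * p for d, p in pairs)
print(expected_distance)
```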

APW Clustering


Compute probability of a possible world

\[ P(\text{world}_j) = \prod_{i} P(\text{object}_i = x(i)) \]

Where x(i) is the value chosen for object i


Cluster (certain) objects using k-means


Combine clustering results for each possible world
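
As a rough illustration of the world-probability step, the sketch below enumerates every possible world for three small objects and multiplies the probabilities of the chosen states. It assumes the objects are independent; the example data and the name `possible_worlds` are hypothetical. Clustering each world and combining the results are left out here.

```python
from itertools import product
from math import prod

# Hypothetical uncertain objects: each is a list of (value, probability) states.
objects = [
    [(1.0, 0.7), (3.0, 0.3)],
    [(2.0, 0.5), (8.0, 0.5)],
    [(9.0, 0.9), (4.0, 0.1)],
]


def possible_worlds(objs):
    """Yield (values, probability) for every choice of one state per object."""
    for states in product(*objs):
        values = tuple(v for v, _ in states)
        # Probability of the world: product over objects of P(object_i = x(i)).
        probability = prod(p for _, p in states)
        yield values, probability


worlds = list(possible_worlds(objects))
for values, probability in worlds:
    print(values, round(probability, 4))
```

The probabilities of all worlds sum to 1, which is a quick sanity check on the enumeration.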
APW Computational Costs


N = # of uncertain objects


D = # of dimensions of each object


Assume each dimension is described by a
distribution over a constant number of values, C


(D * C)^N possible worlds


Our data:


D ~ 3, C ~ 15, N ~ 200 => we need a very fast computer!
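
To get a feel for the size of this number, the sketch below evaluates the count with Python's arbitrary-precision integers and reports how many decimal digits it has. The function names are hypothetical. The second function is an assumption added for comparison only: if each of the D dimensions independently takes one of C values, an object has C^D states and the count becomes C^(D*N), which is larger still.

```python
def possible_world_count(d, c, n):
    """World count as stated on the slide: (D * C)^N."""
    return (d * c) ** n


def independent_dimension_count(d, c, n):
    """Assumed alternative: C^D states per object, so C^(D * N) worlds."""
    return c ** (d * n)


D, C, N = 3, 15, 200

slide_digits = len(str(possible_world_count(D, C, N)))
independent_digits = len(str(independent_dimension_count(D, C, N)))

# Either way the count has hundreds of decimal digits, far beyond enumeration.
print(slide_digits, independent_digits)
```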

Gibbs Sampling


Computationally infeasible to calculate all possible
worlds, so sample from this space instead


Intuition: We really only care about the possible
worlds that carry high probabilities, so we can
weight our sampling toward these worlds

APW Clustering Using Gibbs
Sampling


Randomly pick values for each dimension of each object


Iterate through each dimension of every object


For a given object and a given dimension


Pick a sample value weighted by the probability distribution


Calculate probability of world


Cluster objects via k-means


The objects are then binned according to how often a
particular clustering result shows up
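
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation, and it makes several assumptions: each dimension is an independent discrete distribution given as `(values, probs)`; every dimension is resampled before each clustering pass; the explicit world-probability calculation is skipped because worlds are already drawn in proportion to their probability; and "binning" is read as counting how often each pair of objects lands in the same cluster. All names and data are hypothetical.

```python
import random
from itertools import combinations


def sample_world(objects, rng):
    """Draw one value per dimension of each object, weighted by its distribution."""
    return [
        tuple(rng.choices(values, weights=probs, k=1)[0] for values, probs in obj)
        for obj in objects
    ]


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd-style k-means on certain points; returns one label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def co_clustering_frequency(objects, k, n_samples=200, seed=0):
    """Fraction of sampled worlds in which each pair of objects shares a cluster."""
    rng = random.Random(seed)
    n = len(objects)
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_samples):
        labels = kmeans(sample_world(objects, rng), k, rng)
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1
    return [[c / n_samples for c in row] for row in counts]


# Hypothetical data: three objects with one dimension each.
objects = [
    [([0.0, 1.0], [0.5, 0.5])],
    [([0.0, 1.0], [0.5, 0.5])],
    [([100.0, 101.0], [0.5, 0.5])],
]
freq = co_clustering_frequency(objects, k=2)
for row in freq:
    print(row)
```

The result is a symmetric matrix of pairwise co-clustering frequencies with the diagonal left at zero.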

Preliminary Results


Interpretation of results is the biggest challenge


Preliminary results run on 7 Day Ganglion cells


30 cells each from detached and normal retinas


UK-means and APW approaches were run


7D Normal: UK-means

Cell     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Cluster  0  1  2  1  0  1  1  1  2  1  1  1  2  1  2  0  0  1  0  2  2  1  0  0  2  2  0  0  2  0

7D Normal: APW

         Cell 1  Cell 2  Cell 3  Cell 4  Cell 5  Cell 6  Cell 7  Cell 8  Cell 9 Cell 10 Cell 11 Cell 12 Cell 13 Cell 14 Cell 15 Cell 16 Cell 17 Cell 18 Cell 19 Cell 20 Cell 21 Cell 22 Cell 23 Cell 24 Cell 25
Cell 1        -   24.3%    9.4%   24.8%   52.2%   24.8%   24.4%   24.4%    1.6%   24.3%   25.9%   32.7%   32.9%   25.5%   26.5%   80.1%   65.7%   25.1%   76.0%   40.1%   26.5%   43.5%   74.3%   76.4%   12.9%
Cell 2    24.3%       -    0.0%   99.1%   67.3%   99.0%   99.6%   99.8%    0.0%  100.0%   97.9%   90.2%    0.1%   98.2%    0.4%   13.3%    3.6%   99.0%   22.0%    0.1%    0.3%   73.0%    3.7%    5.4%    0.1%
Cell 3     9.4%    0.0%       -    0.0%    0.9%    0.0%    0.0%    0.0%   84.1%    0.0%    0.0%    0.1%   60.9%    0.0%   66.3%    9.7%   26.6%    0.0%    8.5%   53.7%   65.7%    2.4%   17.7%   15.5%   77.0%
Cell 4    24.8%   99.1%    0.0%       -   68.2%   99.6%   99.5%   99.3%    0.0%   99.1%   97.9%   91.0%    0.1%   99.1%    0.4%   13.8%    3.7%   98.3%   22.5%    0.1%    0.4%   73.3%    3.8%    5.6%    0.1%
Cell 5    52.2%   67.3%    0.9%   68.2%       -   68.3%   67.7%   67.5%    0.0%   67.3%   68.6%   74.5%    6.2%   69.1%    5.9%   45.5%   27.1%   67.9%   52.3%    9.3%    6.6%   67.8%   32.8%   35.6%    2.5%
Cell 6    24.8%   99.0%    0.0%   99.6%   68.3%       -   99.4%   99.2%    0.0%   99.0%   97.9%   91.1%    0.1%   99.2%    0.4%   13.9%    3.7%   98.2%   22.5%    0.1%    0.4%   73.3%    3.8%    5.6%    0.1%
Cell 7    24.4%   99.6%    0.0%   99.5%   67.7%   99.4%       -   99.8%    0.0%   99.6%   98.1%   90.6%    0.1%   98.6%    0.4%   13.4%    3.7%   98.7%   22.1%    0.1%    0.3%   73.2%    3.7%    5.4%    0.1%
Cell 8    24.4%   99.8%    0.0%   99.3%   67.5%   99.2%   99.8%       -    0.0%   99.8%   98.1%   90.4%    0.1%   98.4%    0.4%   13.4%    3.6%   98.8%   22.0%    0.1%    0.3%   73.1%    3.7%    5.4%    0.1%
Cell 9     1.6%    0.0%   84.1%    0.0%    0.0%    0.0%    0.0%    0.0%       -    0.0%    0.0%    0.0%   49.7%    0.0%   59.5%    0.5%   13.2%    0.0%    0.8%   41.2%   60.8%    0.3%    4.0%    2.8%   80.4%
Cell 10   24.3%  100.0%    0.0%   99.1%   67.3%   99.0%   99.6%   99.8%    0.0%       -   97.9%   90.2%    0.1%   98.2%    0.4%   13.3%    3.6%   98.9%   22.0%    0.1%    0.3%   73.0%    3.7%    5.4%    0.1%
Cell 11   25.9%   97.9%    0.0%   97.9%   68.6%   97.9%   98.1%   98.1%    0.0%   97.9%       -   90.1%    0.4%   97.4%    0.6%   15.1%    4.7%   97.1%   23.7%    0.5%    0.6%   72.9%    5.0%    6.9%    0.2%
Cell 12   32.7%   90.2%    0.1%   91.0%   74.5%   91.1%   90.6%   90.4%    0.0%   90.2%   90.1%       -    1.2%   91.7%    1.5%   22.7%    9.5%   90.0%   31.0%    1.8%    1.7%   72.9%   11.2%   13.5%    0.6%
Cell 13   32.9%    0.1%   60.9%    0.1%    6.2%    0.1%    0.1%    0.1%   49.7%    0.1%    0.4%    1.2%       -    0.1%   67.6%   38.4%   56.3%    0.2%   32.8%   72.0%   64.7%    9.4%   50.5%   47.7%   59.5%
Cell 14   25.5%   98.2%    0.0%   99.1%   69.1%   99.2%   98.6%   98.4%    0.0%   98.2%   97.4%   91.7%    0.1%       -    0.4%   14.7%    3.9%   97.5%   23.3%    0.1%    0.4%   73.6%    4.1%    6.0%    0.1%
Cell 15   26.5%    0.4%   66.3%    0.4%    5.9%    0.4%    0.4%    0.4%   59.5%    0.4%    0.6%    1.5%   67.6%    0.4%       -   30.4%   46.7%    0.5%   26.3%   64.4%   63.6%    7.9%   40.7%   38.3%   63.9%
Cell 16   80.1%   13.3%    9.7%   13.8%   45.5%   13.9%   13.4%   13.4%    0.5%   13.3%   15.1%   22.7%   38.4%   14.7%   30.4%       -   76.0%   14.3%   84.3%   46.7%   30.2%   38.2%   86.1%   88.6%   13.8%
Cell 17   65.7%    3.6%   26.6%    3.7%   27.1%    3.7%    3.7%    3.6%   13.2%    3.6%    4.7%    9.5%   56.3%    3.9%   46.7%   76.0%       -    4.2%   68.1%   62.7%   45.2%   24.6%   83.2%   82.2%   29.1%
Cell 18   25.1%   99.0%    0.0%   98.3%   67.9%   98.2%   98.7%   98.8%    0.0%   98.9%   97.1%   90.0%    0.2%   97.5%    0.5%   14.3%    4.2%       -   22.8%    0.3%    0.5%   72.9%    4.4%    6.2%    0.2%
Cell 19   76.0%   22.0%    8.5%   22.5%   52.3%   22.5%   22.1%   22.0%    0.8%   22.0%   23.7%   31.0%   32.8%   23.3%   26.3%   84.3%   68.1%   22.8%       -   40.4%   26.5%   43.3%   77.4%   79.8%   12.1%
Cell 20   40.1%    0.1%   53.7%    0.1%    9.3%    0.1%    0.1%    0.1%   41.2%    0.1%    0.5%    1.8%   72.0%    0.1%   64.4%   46.7%   62.7%    0.3%   40.4%       -   61.5%   11.9%   58.8%   56.0%   52.9%
Cell 21   26.5%    0.3%   65.7%    0.4%    6.6%    0.4%    0.3%    0.3%   60.8%    0.3%    0.6%    1.7%   64.7%    0.4%   63.6%   30.2%   45.2%    0.5%   26.5%   61.5%       -    8.1%   39.8%   37.5%   63.4%
Cell 22   43.5%   73.0%    2.4%   73.3%   67.8%   73.3%   73.2%   73.1%    0.3%   73.0%   72.9%   72.9%    9.4%   73.6%    7.9%   38.2%   24.6%   72.9%   43.3%   11.9%    8.1%       -   28.2%   30.4%    3.7%
Cell 23   74.3%    3.7%   17.7%    3.8%   32.8%    3.8%    3.7%    3.7%    4.0%    3.7%    5.0%   11.2%   50.5%    4.1%   40.7%   86.1%   83.2%    4.4%   77.4%   58.8%   39.8%   28.2%       -   91.5%   21.3%
Cell 24   76.4%    5.4%   15.5%    5.6%   35.6%    5.6%    5.4%    5.4%    2.8%    5.4%    6.9%   13.5%   47.7%    6.0%   38.3%   88.6%   82.2%    6.2%   79.8%   56.0%   37.5%   30.4%   91.5%       -   19.3%
Cell 25   12.9%    0.1%   77.0%    0.1%    2.5%    0.1%    0.1%    0.1%   80.4%    0.1%    0.2%    0.6%   59.5%    0.1%   63.9%   13.8%   29.1%    0.2%   12.1%   52.9%   63.4%    3.7%   21.3%   19.3%       -
Cell 26   23.8%    0.0%   70.1%    0.0%    3.2%    0.0%    0.0%    0.0%   61.6%    0.0%    0.1%    0.5%   72.7%    0.0%   70.3%   27.4%   46.5%    0.1%   23.3%   68.3%   67.5%    6.4%   38.9%   36.2%   67.1%
Cell 27   66.7%   17.3%   18.1%   17.6%   40.6%   17.6%   17.4%   17.3%    8.1%   17.3%   18.6%   23.8%   43.0%   18.0%   35.6%   74.6%   68.6%   17.9%   69.4%   49.5%   34.9%   35.4%   75.0%   75.6%   20.9%
Cell 28   80.3%   10.3%   10.9%   10.7%   42.3%   10.7%   10.3%   10.3%    0.7%   10.3%   12.0%   19.5%   41.1%   11.5%   32.6%   93.2%   78.9%   11.2%   84.3%   49.5%   32.2%   35.7%   89.0%   91.2%   15.0%
Cell 29   27.7%    0.0%   66.1%    0.0%    4.7%    0.0%    0.0%    0.0%   56.6%    0.0%    0.2%    0.8%   72.7%    0.0%   68.8%   32.0%   50.4%    0.1%   27.4%   69.2%   65.9%    7.7%   43.7%   41.0%   63.8%
Cell 30   67.8%   19.9%   16.1%   20.3%   44.5%   20.3%   20.0%   20.0%    6.9%   19.9%   21.3%   27.0%   39.7%   20.8%   32.8%   75.3%   66.5%   20.6%   70.7%   46.0%   32.2%   37.9%   73.4%   74.6%   18.9%

7D Detached: UK-means

Cell     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Cluster  0  1  0  0  0  0  0  2  0  0  0  0  0  0  2  1  1  0  0  1  0  1  0  1  0  1  1  1  1  0

7D Detached: APW

         Cell 5  Cell 6  Cell 7  Cell 8  Cell 9 Cell 10 Cell 11 Cell 12 Cell 13 Cell 14 Cell 15 Cell 16 Cell 17 Cell 18 Cell 19 Cell 20 Cell 21 Cell 22 Cell 23 Cell 24 Cell 25 Cell 26 Cell 27 Cell 28 Cell 29
Cell 1    70.2%   76.1%   71.7%    1.4%   71.5%   73.5%   52.2%   74.2%   72.9%   66.5%    0.1%   50.9%   45.4%   67.0%   68.6%   47.3%   79.7%   44.1%   77.2%   46.8%   79.4%   42.5%   45.7%   42.4%   48.4%
Cell 2    17.1%   26.2%   24.4%    0.1%   44.8%   38.4%   11.8%   29.9%   21.8%   15.2%    0.0%   90.9%   96.6%   14.7%   66.2%   94.5%   32.0%   96.6%   29.1%   95.6%   34.3%   96.2%   96.4%   96.2%   94.0%
Cell 3    70.6%   74.8%   71.1%    2.1%   69.6%   71.6%   53.2%   72.7%   72.4%   67.1%    0.3%   49.6%   44.8%   67.6%   65.1%   46.5%   77.0%   43.6%   75.1%   46.1%   76.4%   42.1%   45.1%   42.0%   47.5%
Cell 4    47.8%   55.4%   51.9%    0.8%   61.1%   59.4%   34.9%   55.9%   51.5%   44.7%    0.1%   69.2%   67.7%   44.8%   69.1%   68.0%   60.1%   66.5%   57.1%   68.5%   60.8%   64.9%   67.9%   64.9%   69.1%
Cell 5        -   89.9%   90.4%    6.6%   70.7%   77.0%   78.1%   85.7%   93.2%   95.4%    1.6%   22.2%   15.9%   96.7%   48.6%   18.2%   84.2%   15.1%   86.9%   17.1%   81.9%   14.0%   16.1%   14.0%   18.8%
Cell 6    89.9%       -   86.8%    3.3%   74.6%   79.7%   69.4%   85.9%   89.6%   86.3%    0.4%   31.8%   25.4%   87.0%   57.5%   27.7%   88.6%   24.3%   88.4%   26.8%   86.6%   22.8%   25.7%   22.7%   28.6%
Cell 7    90.4%   86.8%       -    5.5%   71.0%   76.3%   71.5%   83.1%   88.1%   87.8%    1.6%   29.4%   23.4%   88.9%   53.5%   25.6%   83.6%   22.6%   84.7%   24.6%   81.8%   21.4%   23.7%   21.4%   26.3%
Cell 8     6.6%    3.3%    5.5%       -    2.2%    2.7%   26.0%    3.8%    5.4%   10.1%   84.7%    0.1%    0.0%    9.3%    0.4%    0.1%    1.9%    0.0%    3.1%    0.0%    1.8%    0.0%    0.0%    0.0%    0.0%
Cell 9    70.7%   74.6%   71.0%    2.2%       -   71.4%   53.4%   72.7%   72.4%   67.2%    0.3%   49.5%   44.7%   67.7%   65.0%   46.5%   76.8%   43.5%   75.0%   46.0%   76.0%   42.0%   45.0%   41.9%   47.4%
Cell 10   77.0%   79.7%   76.3%    2.7%   71.4%       -   58.7%   77.1%   78.1%   73.7%    0.6%   43.7%   38.1%   74.3%   62.8%   40.0%   80.7%   37.0%   79.5%   39.4%   79.6%   35.5%   38.4%   35.4%   41.0%
Cell 11   78.1%   69.4%   71.5%   26.0%   53.4%   58.7%       -   66.4%   73.8%   80.2%   17.4%   14.7%   10.7%   80.7%   34.4%   12.2%   63.8%   10.4%   66.7%   11.2%   61.8%   10.0%   10.8%   10.0%   12.3%
Cell 12   85.7%   85.9%   83.1%    3.8%   72.7%   77.1%   66.4%       -   85.3%   82.5%    0.9%   35.2%   29.1%   83.3%   58.5%   31.2%   84.8%   28.1%   84.5%   30.4%   83.3%   26.7%   29.4%   26.6%   32.1%
Cell 13   93.2%   89.6%   88.1%    5.4%   72.4%   78.1%   73.8%   85.3%       -   90.6%    1.5%   27.0%   20.7%   91.6%   52.9%   23.0%   86.1%   19.8%   87.3%   21.9%   84.0%   18.6%   21.0%   18.6%   23.6%
Cell 14   95.4%   86.3%   87.8%   10.1%   67.2%   73.7%   80.2%   82.5%   90.6%       -    3.9%   19.6%   13.8%   96.9%   45.2%   16.0%   80.4%   13.3%   83.3%   14.7%   78.1%   12.6%   14.0%   12.6%   16.3%
Cell 15    1.6%    0.4%    1.6%   84.7%    0.3%    0.6%   17.4%    0.9%    1.5%    3.9%       -    0.0%    0.0%    2.8%    0.0%    0.0%    0.1%    0.0%    0.5%    0.0%    0.1%    0.0%    0.0%    0.0%    0.0%
Cell 16   22.2%   31.8%   29.4%    0.1%   49.5%   43.7%   14.7%   35.2%   27.0%   19.6%    0.0%       -   93.4%   19.3%   70.9%   92.4%   37.8%   92.1%   34.8%   94.2%   40.1%   90.4%   93.7%   90.4%   93.7%
Cell 17   15.9%   25.4%   23.4%    0.0%   44.7%   38.1%   10.7%   29.1%   20.7%   13.8%    0.0%   93.4%       -   13.3%   67.0%   97.2%   31.3%   98.7%   28.3%   98.6%   33.7%   97.0%   99.7%   96.9%   96.8%
Cell 18   96.7%   87.0%   88.9%    9.3%   67.7%   74.3%   80.7%   83.3%   91.6%   96.9%    2.8%   19.3%   13.3%       -   45.5%   15.5%   81.0%   12.8%   84.1%   14.2%   78.6%   12.1%   13.4%   12.0%   15.9%
Cell 19   48.6%   57.5%   53.5%    0.4%   65.0%   62.8%   34.4%   58.5%   52.9%   45.2%    0.0%   70.9%   67.0%   45.5%       -   68.2%   63.5%   65.7%   60.2%   68.3%   64.8%   63.9%   67.3%   63.9%   69.6%
Cell 20   18.2%   27.7%   25.6%    0.1%   46.5%   40.0%   12.2%   31.2%   23.0%   16.0%    0.0%   92.4%   97.2%   15.5%   68.2%       -   33.6%   96.2%   30.6%   97.0%   35.9%   94.7%   97.3%   94.6%   95.6%
Cell 21   84.2%   88.6%   83.6%    1.9%   76.8%   80.7%   63.8%   84.8%   86.1%   80.4%    0.1%   37.8%   31.3%   81.0%   63.5%   33.6%       -   30.1%   88.5%   32.7%   89.2%   28.4%   31.7%   28.4%   34.5%
Cell 22   15.1%   24.3%   22.6%    0.0%   43.5%   37.0%   10.4%   28.1%   19.8%   13.3%    0.0%   92.1%   98.7%   12.8%   65.7%   96.2%   30.1%       -   27.2%   97.3%   32.4%   98.2%   98.4%   98.2%   95.5%
Cell 23   86.9%   88.4%   84.7%    3.1%   75.0%   79.5%   66.7%   84.5%   87.3%   83.3%    0.5%   34.8%   28.3%   84.1%   60.2%   30.6%   88.5%   27.2%       -   29.7%   86.9%   25.7%   28.6%   25.7%   31.5%
Cell 24   17.1%   26.8%   24.6%    0.0%   46.0%   39.4%   11.2%   30.4%   21.9%   14.7%    0.0%   94.2%   98.6%   14.2%   68.3%   97.0%   32.7%   97.3%   29.7%       -   35.1%   95.6%   98.9%   95.5%   97.8%
Cell 25   81.9%   86.6%   81.8%    1.8%   76.0%   79.6%   61.8%   83.3%   84.0%   78.1%    0.1%   40.1%   33.7%   78.6%   64.8%   35.9%   89.2%   32.4%   86.9%   35.1%       -   30.8%   34.0%   30.7%   36.9%
Cell 26   14.0%   22.8%   21.4%    0.0%   42.0%   35.5%   10.0%   26.7%   18.6%   12.6%    0.0%   90.4%   97.0%   12.1%   63.9%   94.7%   28.4%   98.2%   25.7%   95.6%   30.8%       -   96.6%   99.9%   93.8%
Cell 27   16.1%   25.7%   23.7%    0.0%   45.0%   38.4%   10.8%   29.4%   21.0%   14.0%    0.0%   93.7%   99.7%   13.4%   67.3%   97.3%   31.7%   98.4%   28.6%   98.9%   34.0%   96.6%       -   96.6%   97.1%
Cell 28   14.0%   22.7%   21.4%    0.0%   41.9%   35.4%   10.0%   26.6%   18.6%   12.6%    0.0%   90.4%   96.9%   12.0%   63.9%   94.6%   28.4%   98.2%   25.7%   95.5%   30.7%   99.9%   96.6%       -   93.7%
Cell 29   18.8%   28.6%   26.3%    0.0%   47.4%   41.0%   12.3%   32.1%   23.6%   16.3%    0.0%   93.7%   96.8%   15.9%   69.6%   95.6%   34.5%   95.5%   31.5%   97.8%   36.9%   93.8%   97.1%   93.7%       -
Cell 30   86.0%   90.2%   85.1%    1.9%   76.9%   81.5%   65.4%   86.0%   87.8%   82.2%    0.1%   36.0%   29.5%   82.8%   62.1%   31.8%   91.9%   28.2%   89.7%   30.9%   90.3%   26.6%   29.8%   26.5%   32.7%

7D Normal and Detached: UK-means

Normal
Cell     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Cluster  0  1  2  1  1  1  1  1  3  1  1  1  2  1  2  0  0  1  0  2  2  1  0  0  2  2  0  0  2  0

Detached
Cell     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Cluster  0  1  0  0  2  0  2  3  0  0  2  0  0  2  4  0  1  2  0  1  0  1  0  0  0  1  1  1  0  0

7D Normal and Detached: APW


Spreadsheet…

Method Validation


Biologists manually cluster the normal cells based
on previous studies [COOMBS06] & [SUN02]


Identify ganglion cell subtypes


Compare these results with the two clustering
algorithms

Future Work


Use Earth Mover's Distance (EMD) as a distance
metric between two distributions
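
For a sense of what such a metric would compute, here is a minimal one-dimensional sketch. It assumes both distributions are discrete, defined on the same sorted grid of values, and each sums to 1; in that setting EMD equals the area between the two cumulative distribution functions. The name `emd_1d` and the example inputs are hypothetical, and the project's intended formulation may differ.

```python
from itertools import accumulate


def emd_1d(grid, p, q):
    """Earth Mover's Distance between two discrete distributions on a shared sorted grid."""
    # Absolute gap between the two cumulative distributions at each grid point.
    cdf_gap = [abs(a - b) for a, b in zip(accumulate(p), accumulate(q))]
    # Distance between consecutive grid points.
    widths = [hi - lo for lo, hi in zip(grid, grid[1:])]
    # The last CDF gap is zero when both distributions sum to 1, so zip drops it.
    return sum(g * w for g, w in zip(cdf_gap, widths))


grid = [0.0, 1.0, 2.0]
# Moving all mass from 0 to 2 costs 2; identical distributions cost 0.
print(emd_1d(grid, [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))
print(emd_1d(grid, [0.5, 0.5, 0.0], [0.5, 0.5, 0.0]))
```

Unlike a comparison of expected values, this distance reflects how the probability mass is spread, not only where its mean lies.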


Questions