Clustering Uncertain Data

CS290 Project


Nick Larusso

Brian Ruttenberg

Background


Probabilistic data is becoming more and more common through various analysis and acquisition techniques. This is problematic because many of the existing methods for classifying and clustering data are meant to work with n-dimensional points. Instead of certain points, we are now faced with uncertain regions in n-dimensional space, which proves to be a more difficult problem. New methods have slowly been introduced for learning from uncertain data; however, it is far from a solved problem.


Our project specifically deals with bioimages containing ganglion cells. These are neurons in the retina that compose the optic nerve. We have about 80 images of these cells, 40 from a healthy retina and 40 from a retina that has been detached for seven days. We have applied some probabilistic analyses to these images to pick out characteristics of each cell's morphology. We have data on the soma size, dendritic field size, and dendritic field diameter. For the sake of this project, each of these cells is then an uncertain region in three dimensions (one for each feature). We would like to cluster these cells to find characteristics of how ganglion cell morphology changes in response to a retinal detachment [3,4].


We have implemented and tested three different clustering methods. Each method will be described, and then we provide results from their clustering output. In our analysis, we are most concerned with identifying the differences between the algorithms, as we do not have a ground truth to compare against.



Clustering Algorithms


All of the following clustering algorithms are based on k-means, in which the goal is to minimize the sum of squared errors (SSE). K-means was chosen because the paper this project was based on [2] used k-means, so for consistency the other clustering methods were also based on k-means. Below is a basic outline of each of the algorithms we have compared.


Uncertain K-means (UK-means)



The basic idea behind the uncertain k-means algorithm is to minimize the expected sum of squared errors. In this algorithm, the cluster centroids are still represented by certain points, so the distance calculation for data point x(i) to centroid c(j) is:


E[d(c_j, x_i)] = \sum_{k=1}^{Features} \sum_{x_{i,k}} (c_{j,k} - x_{i,k})^2 \, p(x_{i,k})

where the inner sum runs over the possible discrete values of feature k of object i.


The rest of the algorithm works just as k-means.
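To make this concrete, below is a minimal sketch in Python of the expected-distance assignment and centroid update, assuming each uncertain object is stored as a list of (values, probabilities) pairs, one per feature; the function names and data layout are ours, not from [1,2].

import numpy as np

def expected_sq_distance(centroid, obj):
    """Expected squared distance from a certain centroid to an uncertain object.
    obj[k] = (values, probs): the discrete distribution of feature k."""
    total = 0.0
    for k, (values, probs) in enumerate(obj):
        # E[(c_k - X_k)^2] = sum over the support of feature k
        total += np.sum((centroid[k] - np.asarray(values)) ** 2 * np.asarray(probs))
    return total

def assign_clusters(objects, centroids):
    """Assign every uncertain object to its closest centroid in expectation,
    exactly as in the assignment step of ordinary k-means."""
    return [min(range(len(centroids)),
                key=lambda j: expected_sq_distance(centroids[j], obj))
            for obj in objects]

def update_centroids(objects, labels, n_clusters):
    """Recompute each centroid as the mean of its members' expected feature values."""
    expected = np.array([[np.dot(v, p) for v, p in obj] for obj in objects])
    labels = np.asarray(labels)
    return np.array([expected[labels == j].mean(axis=0) for j in range(n_clusters)])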


Although this method is incredibly simple to implement and multiple pruning techniques were introduced [1] to reduce computation, it has some downfalls as well. Because we are taking the expected distance of two discrete probability distributions, it is clear that this method should work well if our distributions are Gaussian-like (uni-modal with relatively low variance). However, it is unclear if this method will be sufficient for arbitrary distributions, such as the ones we are dealing with. Instead, we would like a method that takes the variance of the individual distributions into account.



All Possible Worlds (APW) Using Gibbs Sampling


The All Possible Worlds (APW) method is based on taking a single sample from each distribution and considering this as one possible world. Given these samples, we can then cluster the data points as certain objects. Each of these worlds is not equally likely, since its likelihood depends on the samples chosen from each distribution; the calculation for the probability is:


P(world_j) = \prod_i P(object_i = x_i)



For the true computation, we would iterate through each possible sample value from each distribution of each object so that we covered every possible outcome (world). This method grows exponentially in the number of objects we want to cluster, and quickly becomes computationally infeasible. Let N be the number of uncertain objects, D the number of dimensions of each object, and C the number of possible discrete points available for each distribution; then this approach is O((DC)^N).


To deal with the computational load, we implement Gibbs sampling. The intuition behind Gibbs is that we really only care about the possible worlds with high probabilities, so we weight our sampling towards these worlds and discard the rest.
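A much simplified sketch of the sampled-worlds idea is shown below; this is our own illustration, not the project code. Because the objects' distributions are independent, drawing each feature value from its own distribution samples worlds in proportion to the product above, and each sampled world can then be clustered as certain points (the scikit-learn KMeans call is a stand-in for the k-means step).

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def sample_world(objects):
    """Draw one possible world: one certain point per uncertain object,
    sampling each feature value from that object's discrete distribution."""
    return np.array([[rng.choice(values, p=probs) for values, probs in obj]
                     for obj in objects])

def sampled_worlds(objects, n_clusters, iterations=1000):
    """Cluster many sampled worlds and collect the resulting labelings."""
    labelings = []
    for _ in range(iterations):
        world = sample_world(objects)
        labels = KMeans(n_clusters=n_clusters, n_init=1).fit_predict(world)
        labelings.append(labels)
    return labelings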


Earth Mover's Distance k-means (EMD)


This clustering method contains two main changes. First, instead of clustering certain points in three-dimensional space as we have done in the previous two methods, here the distributions are maintained and we are actually clustering uncertain regions in 3-d space. In this case, the basic Euclidean distance will not work. Instead, we use the Earth Mover's Distance (EMD) to calculate the distance between probability distributions in each dimension and thus compute the distance between two uncertain regions. The second change is that the cluster centroids are no longer certain points; instead, they also become uncertain regions in the feature space.


Illustration 1: Example of (1) a Gaussian distribution, and (2) an arbitrary bimodal distribution and their expected values (in red)


The EMD computation is based on a solution to the transportation problem [7]. This is a simplified optimization problem because an additional constraint is put in place requiring that the total supply equals the total demand. The full set of constraints can be found in the explanation in [6]. Thus there exist more efficient solutions to the problem than the more general simplex algorithm. The code for computing the EMD was taken from [6], which uses the u-v method for optimization [7].
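As a rough stand-in for the EMD code from [6], the one-dimensional Earth Mover's Distance can also be computed with SciPy's wasserstein_distance; the sketch below (our own, with hypothetical names) sums the per-dimension EMDs to obtain a distance between two uncertain regions.

from scipy.stats import wasserstein_distance

def emd_between_objects(obj_a, obj_b):
    """Distance between two uncertain regions: the sum over feature
    dimensions of the 1-d Earth Mover's Distance between their
    discrete distributions. obj[k] = (values, probs) for feature k."""
    return sum(wasserstein_distance(va, vb, u_weights=pa, v_weights=pb)
               for (va, pa), (vb, pb) in zip(obj_a, obj_b))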


Results

Testing Data Set


For our experiments on clustering methodology, we chose to concentrate on one specific subset of the ganglion cells. We took 40 cells from each of the 7-day detached and control images, extracted the three probabilistic features using our own tools (soma size, dendritic field size and density), and ran the three different clustering algorithms on the data under different conditions. All features were normalized to the range [0, 1] so that feature distances would have the same weight for all features.
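The normalization itself is just a per-feature min-max rescaling; a minimal sketch, assuming each feature's support values are pooled over all cells before rescaling:

import numpy as np

def minmax_normalize(values):
    """Rescale a feature's support values to [0, 1] so that no single
    feature dominates the distance computations."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo)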

Each method was run on 40 normal cells (LE), 40 detached cells (RE), and then run again with all 80 cells together (LE_RE). From [3,4], it was determined that there are approximately 12 mono-stratified ganglion cell types. However, since our sample size of cells is small and we have no manual ground truth available, the number of clusters present in our choice of 80 cells is unknown. Therefore we ran each method from 2 to 10 clusters to account for uncertainty in the number of clusters.

Additionally, each method was also run with varying numbers of iterations. The APW approach is naturally an iterative method, so all APW tests were run for 100,000 iterations when clustering 40 cells, and 200,000 iterations when clustering 80 cells. Given starting centroids for clusters, UK-means and EMD will deterministically cluster the data in exactly the same manner. However, the optimal starting centroids are unknown, so the starting centroids are randomly chosen over the range [0, 1] and the methods are iterated similarly to the APW approach. The number of iterations can be reduced significantly, though, because the sensitivity to randomness in the UK-means and EMD methods is much lower than in the APW approach. UK-means and EMD were run for 5,000 iterations when clustering 40 cells, and for 10,000 iterations when clustering 80 cells.

Output Data Analysis


All three methods, per iteration, output an optimal clustering based on a variant of the k-means algorithm. However, since the starting centroids are random (and, in the APW approach, the object values undergoing clustering as well), we have to iterate these methods to produce a probability distribution of clusterings. The output of a given test is a list of the clusters and their members that appeared over all iterations, and the associated frequencies of those clusters.

Deriving any useful knowledge from this data is difficult, however. It is not possible to compute an “expected” cluster, since there is no relationship between different clusters and the same cluster can have different members in two different iterations. Nor is it useful to look at the most frequent cluster. Let S(N, K) be the number of different possible clustering combinations for N objects and K clusters. Then S(N, K) is given by the formula [8]:

S(N, K) = \frac{1}{K!} \sum_{k=1}^{K} (-1)^{K-k} \binom{K}{k} k^N
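To give a sense of scale, the formula can be evaluated directly; even the smallest case we consider (40 cells in 2 clusters) already allows hundreds of billions of distinct clusterings. A short sketch:

from math import comb, factorial

def stirling_second_kind(n, k):
    """S(n, k): the number of ways to partition n objects into k
    non-empty clusters, per the formula from [8]."""
    return sum((-1) ** (k - j) * comb(k, j) * j ** n
               for j in range(1, k + 1)) // factorial(k)

print(stirling_second_kind(40, 2))   # 549755813887, i.e. 2^39 - 1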


While randomizing the starting centroids does increase the possible clustering combinations our methods can generate, most likely they are unable to reach every possible clustering. Even so, the number of different combinations of clusters is still significant. Therefore the most frequent clustering may have occurred only once or twice more often than the next most frequent clustering.

To account for this, the output of each method is converted to a matrix of individual cell-to-cell clustering. From the distribution of clusterings, we can calculate the frequency with which a given cell has been clustered with every other cell in the data set. Using this matrix, we can determine, over the thousands of iterations, what the likely clusters of cells are given their individual clustering frequencies. Of course, given the number of possible clusterings, it is extremely likely that by chance two cells will end up clustered together even though their features are dissimilar. However, if two cells are dissimilar but clustered together randomly, the frequency of this happening should be no more than 1/K. If S(N-1, K) is the number of combinations where two cells are randomly clustered together, then the frequency of S(N-1, K) occurring in S(N, K) is:

\frac{S(N-1, K)}{S(N, K)} = \frac{1}{K}


As we shall see in the results sections, the clustering frequencies between similar cells are much greater than 1/K and are rarely affected when K increases, while random associations tend to decrease as K increases.

The cell to cell clustering frequency matrix can also be interpreted as a distance matrix, and the
results were input into the PHYLIP neighbor joining program in order to better visualize which cells
were clustering together.
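A minimal sketch of this conversion (our own code, not the project's): given the labelings collected over all iterations, count how often each pair of cells shares a cluster, and turn the resulting frequency matrix into a distance matrix for neighbor joining; using 1 minus the frequency is one reasonable choice, assumed here.

import numpy as np

def co_clustering_matrices(labelings, n_cells):
    """Cell-to-cell clustering frequencies over all iterations, plus a
    distance matrix (1 - frequency) suitable for neighbor joining."""
    counts = np.zeros((n_cells, n_cells))
    for labels in labelings:
        labels = np.asarray(labels)
        counts += (labels[:, None] == labels[None, :])
    freq = counts / len(labelings)
    return freq, 1.0 - freq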


UK-means


The UK-means algorithm could be characterized as the least robust of all the methods. Its insensitivity to variance within a distribution can be viewed as a major flaw, especially given that the distributions of the features in the cells are extremely variable. However, for low-variance data it performed very well. One important metric for all of these clustering methods is their robustness to increasing the number of clusters allowed. Lower values of k increase the probability that two cells will be randomly clustered together. Therefore it is important to determine how well UK-means keeps clusters together as k increases. Drastic drop-offs in cell-to-cell clustering as k increases could indicate that a group of cells with high clustering associations at lower k was actually just random clustering.

One such identified cluster involves four cells from the LE data set. Cells 4, 7, 8 and 18 have similar feature values, and all methods identified these four cells as being very similar. As can be seen from the plot below, all four cells have probability distributions and expected values that are very close together.



Illustration 2: Distributions and expected values of the three features for four specific cells from the
LE data set. The expected values
are shown as vertical lines.


The cell-to-cell frequency of clustering of these four cells for UK-means was extremely high (nearly 100% of the time these cells were clustered together), and UK-means maintained this cluster over increasing values of k. The figures below show the neighbor joining plots of this cluster for k=2 and k=10.








Illustration 3: Neighbor joining plot of k=2 (top) and k=10 (bottom). As can be seen, these four cells
stayed together even as the number of clusters significantly
increased.


The cluster was insensitive to increasing k, and even as the average cell to cell clustering
frequency for the LE data decreased, the clustering for these four cells stayed almost constant.




Illustration 4: The average frequencies of clustering of cells 4, 7, 8 and 18 with UK-means and the average frequency of clustering for the entire LE data set. As k increases, cells 4, 7, 8 and 18 still cluster together almost 100% of the time, while the average of the entire LE data set significantly decreases (as expected with more clusters)


Since UK-means uses the expected value of a feature to compute distance, it is particularly insensitive to features with high variances. An example can be seen with RE cells 10, 12 and 24. From the plot below, it can be seen that cell 24 has an extremely high variance in the soma size, while cells 10 and 12 have above-average variances in the dendritic field size and density. However, in all three cases their expected values are not very distant.



Illustration 5: Distributions and expected values of the three features for three specific cells from the RE data set. The expected values are shown as vertical lines.


The cluster plot from k=2 to k=10 also shows that UK-means considers this cluster very strong, since it maintains nearly 100% clustering over varying values of k. However, as we will see later, these results differ greatly from those of the APW approach.



Illustration 6: The average frequencies of clustering of cells 10, 12 and 24 using UK-means and the average frequency of clustering for the entire RE data set. As k increases, cells 10, 12 and 24 still cluster together almost 100% of the time, even though all three cells have large variances


All Possible Worlds with Gibbs Sampling


The APW approach is, at least theoretically, the most robust of all three methods. While it is not possible to explore every possible world, Gibbs sampling gives us the most probable clustering of the cells. For strong clusters, such as cells 4, 7, 8 and 18 from the LE data set, APW easily maintains their high association through the different values of k.








Illustration 7: Neighbor joining plot of k=2 (top) and k=10 (bottom) for the APW method. As can be seen, these four cells stayed together even as the number of clusters significantly increased.


Illustration 8: The average frequencies of clustering of cells 4, 7, 8 and 18 using APW and the average frequency of clustering for the entire LE data set. As k increases, cells 4, 7, 8 and 18 still cluster together almost 100% of the time, while the average of the entire LE data set significantly decreases (as expected with more clusters)



On cells with high variance, the APW approach differs from UK-means significantly. As was shown in the UK-means results, cells 10, 12 and 24 from the RE data set have high variances. The APW approach takes this variance into account, which is why we iterate the Gibbs sampling portion a significant number of times. As can be seen from the plot of the average clustering percentages below, these cells' average clustering percentage drops as we increase k. For example, when k=10, cell 24 is clustered with cells 10 and 12 62% and 54% of the time using the APW method. UK-means, on the other hand, clusters cell 24 with cells 10 and 12 93% and 88% of the time, significantly more than the APW method. If we remove cell 24 from the “cluster”, cells 10 and 12 have extremely high clustering percentages.



Illustration 9: The average frequencies of clustering of cells 10, 12 and 24 using APW and the average frequency of clustering for the entire RE data set. As k increases, APW starts to disassociate cell 24 from cells 10 and 12, resulting in a lower average frequency for those three cells. If cell 24 is removed, the clustering frequencies remain high for cells 10 and 12 (black line).


This is a clear example of the APW approach reflecting the variance of the data. Cells 10 and 12 do share some feature similarity with cell 24, but certainly not a significant amount. A large amount of the clustering percentage between cell 24 and cells 10 and 12 appears to be due to randomness.


Earth Mover’s Distance


The Earth Mover's Distance method clusters based on k-means, using EMD as the distance measure between objects and clusters. On the identified strong cluster, LE cells 4, 7, 8 and 18, the EMD method did identify them as alike, but their association with cell 8 was not nearly as strong as with the other methods.





Illustration 10: Neighbor joining plot of k=2 (top) and k=10 (bottom) for the EMD method. As can be seen, these four cells stayed together with a low k, but as k increased to 10, cell 8 was pushed to an outlier.



Illustration 11: The average frequencies of clustering of cells 4, 7, 8 and 18 using EMD and the average frequency of clustering for the entire LE data set. As k increases, cells 4, 7 and 18 still cluster together almost 100% of the time but their clustering with cell 8 steadily decreases. The average of the entire LE data set significantly decreases (as expected with more clusters)


As can be seen from the plot, the cluster averages drop to nearly 80% due to cell 8's low frequency of clustering with the other three cells. With no manual ground truth it is difficult to say which method is correct. Reviewing the distribution plots for these cells in the UK-means section, it is easy to see that cell 8 (black) is the outlier of this cluster, so if there is any cell that would decrease its frequency of clustering with the other cells, it would be cell 8. However, the distance between these four cells' distributions is still not large, so it is entirely possible that the UK-means and APW methods are correct in assigning these cells as a strong cluster.


For highly variant data, such as cells 10, 12 and 24 from the RE data set, the EMD method agreed more with the APW method than with the UK-means method. The plot below shows the average clustering frequency for cells 10, 12 and 24 with and without cell 24.



Illustration 13: The average frequencies of clustering of cells 10, 12 and 24 using EMD and the average frequency of clustering for the entire RE data set. As k increases, EMD starts to disassociate cell 24 from cells 10 and 12, resulting in a lower average frequency for those three cells. If cell 24 is removed, the clustering frequencies remain high for cells 10 and 12 (black line).


As can be seen, cell 24 is clearly the outlier. With 10 clusters, cell 24 only clusters with cells 10 and 12 approximately 77% of the time. For the same number of clusters, cells 10 and 12 cluster with each other nearly 100% of the time, indicating that there is definite uncertainty regarding whether cell 24 is part of the cell 10-12 cluster or not.


Running Times


The running times of the three different methods cannot really be compared exactly. Each method is iterated, so comparing the APW method with 100,000 iterations to UK-means or EMD with 5,000 iterations is not a fair benchmark. Even so, it is still noteworthy to see how the individual runtimes scaled with increasing values of k and for two different values of N.


Illustration 14: Run times for the LE data set (N=40)


Illustration 15: Run times for the RE data set (N=40)



Illustration 16: Run times for the LE/RE data set (N=80)


Analysis of Cell Clusters


Analyzing how many true clusters are present in the cell data is difficult without manual ground truth. In addition, because the data is organized as cell-to-cell frequencies, it is difficult to judge at what frequency two cells should be considered clustered. However, we can look at the lower bound to give us some indication of the underlying clusters in each data set. By looking at the highest number of clusters (k=10) and setting a reasonably high threshold (90% clustering), we can estimate that the cells clustered at this high threshold at least make up the clusters at lower thresholds. It is possible that some of the clusters would merge if the thresholds were lowered, but they will at least contain the cells that were present with k=10 and the 90% cutoff.



Data Set | Method   | Clusters above 90% threshold at k=10 | Size of Clusters
LE       | UK-means | 9  | 3, 4, 3, 2, 3, 3, 2, 4, 2
LE       | APW      | 2  | 3, 3
LE       | EMD      | 9  | 2, 3, 3, 3, 2, 3, 3, 4, 2
RE       | UK-means | 9  | 3, 2, 2, 3, 3, 4, 4, 2, 2
RE       | APW      | 3  | 3, 4, 2
RE       | EMD      | 12 | 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2
LE/RE    | UK-means | 13 | 6, 7, 5, 6, 5, 3, 2, 3, 3, 5, 2, 2, 2
LE/RE    | APW      | 3  | 4, 2, 4
LE/RE    | EMD      | 19 | 2, 2, 3, 5, 2, 5, 2, 5, 5, 3, 2, 3, 3, 3, 2, 2, 2, 2, 2
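The cluster counts and sizes above were read off the frequency matrices; one simple way to extract such groups (a sketch under our own assumptions, since the exact grouping rule is not spelled out above) is to take connected components of the graph whose edges are cell pairs clustered together at least 90% of the time:

import numpy as np

def clusters_above_threshold(freq, threshold=0.9):
    """Groups of cells connected by pairwise co-clustering frequencies at or
    above the threshold (connected components of the thresholded graph)."""
    n = freq.shape[0]
    adjacency = freq >= threshold
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:                      # simple graph traversal
            cell = stack.pop()
            if cell in component:
                continue
            component.add(cell)
            stack.extend(j for j in range(n)
                         if adjacency[cell, j] and j not in component)
        seen |= component
        if len(component) >= 2:           # report only multi-cell clusters
            groups.append(sorted(component))
    return groups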


Conclusion


All three of these methods have their advantages and disadvantages. UK-means certainly has the advantage of being a quick, easy approximation to these feature PDFs, and we get reasonably good results given low-variance data. However, considering that the distributions for these morphological features usually have high variances, it is evident that the APW method and the EMD method are better, given that they are variance sensitive.


Unfortunately, in-depth analysis of the cell morphology in relation to normal and detached retinas is beyond the scope of this project. However, these tests have given us a strong indication of the number of underlying clusters in the data, as well as how each method performs under various conditions. Future research into the cell morphology can build on the methods and tools we have developed in this project.




References

[1] W. K. Ngai, B. Kao, C. K. Chui, R. Cheng, M. Chau, and K. Y. Yip, "Efficient Clustering of Uncertain Data," Proceedings of the Sixth International Conference on Data Mining, pp. 436-445, 2006.

[2] M. Chau, R. Cheng, B. Kao, and J. Ng, "Uncertain Data Mining: An Example in Clustering Location Data," Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2005.

[3] W. Sun, N. Li, and S. He, "Large-scale morphological survey of mouse retinal ganglion cells," The Journal of Comparative Neurology, vol. 451, pp. 115-126, 2002.

[4] J. Coombs, D. van der List, G. Y. Wang, and L. M. Chalupa, "Morphological properties of mouse retinal ganglion cells," Neuroscience, 2006.

[5] S. K. Fisher, G. P. Lewis, K. A. Linberg, and M. R. Verardo, "Cellular remodeling in mammalian retina: results from studies of experimental retinal detachment," Prog Retin Eye Res, vol. 24, pp. 395-431, 2005.

[6] Y. Rubner, C. Tomasi, and L. J. Guibas, "A metric for distributions with applications to image databases," Proceedings of the Sixth International Conference on Computer Vision, p. 59, 1998.

[7] http://www.utdallas.edu/~scniu/OPRE-6201/documents/TP1-Transportation.html

[8] A. Jain and R. Dubes, "Algorithms for Clustering Data," Prentice Hall, 1988.