Nov 8, 2013

Faithful Sampling for Spectral Clustering to Analyze High Throughput Flow Cytometry Data

Parisa Shooshtari

School of Computing Science, Simon Fraser University, Burnaby
Brinkman’s Lab, Terry Fox Laboratory, BC Cancer Agency, Vancouver

Outline:

Flow Cytometry (FCM) Data
Clustering of FCM Data
Spectral Clustering
Faithful Sampling for Spectral Clustering
Results
Summary

Basics of Flow Cytometry Technique


[Figure: flow cytometry schematic. Labels: Sample; axes: Wavelength and Intensity; markers: MHC-II and CD-11c; measured intensities: Int-1 and Int-2.]

Cell Population Identification in Flow Cytometry (FCM)

[Figure: gating plots over Parameter 1, Parameter 2, Parameter 3 and Parameter 4, with a gated population labelled X%. Adapted from the Science Creative Quarterly (2).]

Importance of FCM Data Clustering


Manual gating is:
  Subjective
  Error-prone
  Time-consuming
Manual gating also ignores the multivariate nature of the data.

Analyzing large FCM data sets (with up to 19 dimensions and 1,000,000 points) is impractical without the aid of automated techniques.

Which Clustering Algorithm Is Suitable?


Model-based algorithms such as flowClust, flowMerge and FLAME are not suitable for clusters with non-elliptical shapes.

[Figure: flowMerge result compared with a good clustering; axis label: GFP.]

Our Motivation for Using Spectral Clustering

Spectral clustering does not require any a priori assumption about cluster size, shape or distribution.

It is not sensitive to outliers, noise or the shape of the clusters.

Spectral Clustering in One Slide

Represent the data set by a similarity graph.

Construct the graph:
  Vertices: data points p_1, p_2, …, p_n
  Weights of edges: similarity values S_{i,j}
  [Equation: definition of S_{i,j}]

Clustering: find a cut through the graph:
  Define a cut objective function
  Solve it
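
The steps on this slide can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the slide's own similarity formula is not preserved, so a Gaussian kernel is assumed here, and the cut is found by taking the sign of the second-smallest eigenvector of the unnormalized graph Laplacian (a common relaxation of a ratio-cut objective), computed by power iteration. The talk does not say which objective or solver was used, and the names `gaussian_similarity`, `spectral_bipartition` and `sigma` are hypothetical.

```python
import math
import random


def gaussian_similarity(p, q, sigma):
    # Assumed kernel: exp(-||p - q||^2 / (2 * sigma^2)). Not taken from the slides.
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def spectral_bipartition(points, sigma=1.0, iters=500, seed=0):
    """Split points (at least two) into two groups, labelled 0 and 1."""
    n = len(points)
    # Similarity graph: vertices are the points, edge weights are similarities.
    W = [[0.0 if i == j else gaussian_similarity(points[i], points[j], sigma)
          for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in W]
    # Laplacian L = D - W. All its eigenvalues are at most 2 * max(deg),
    # so c*I - L is positive semidefinite for c = 2 * max(deg).
    c = 2.0 * max(deg) + 1e-12
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        # Remove the constant component (the trivial eigenvector of L).
        mean = sum(v) / n
        v = [x - mean for x in v]
        # Multiply by (c*I - L) = (c*I - D) + W, then renormalize.
        w = [(c - deg[i]) * v[i] + sum(W[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mean = sum(v) / n
    # The sign of each centred entry decides the side of the cut.
    return [1 if x - mean > 0 else 0 for x in v]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = spectral_bipartition(points, sigma=1.0)
```

Every pairwise similarity is stored in `W`, which is what makes the memory cost grow quadratically with the number of points.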

The Bottleneck of Spectral Clustering


Serious empirical barriers arise when applying this algorithm to large datasets:

Time complexity: O(n^3) -> 2 years for 300,000 data points (cells)

Required memory: O(n^2) -> 5 terabytes for 300,000 data points (cells)
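
As a rough arithmetic sketch of why the input size matters so much: with cubic time and quadratic memory, running the algorithm on a smaller set of representative points shrinks both costs by a large factor. The figure of 3,000 representatives below is a hypothetical value chosen only for illustration, not a number from the talk, and `reduction_factors` is a made-up helper name.

```python
def reduction_factors(n_points, n_samples):
    """Factors by which O(n^3) time and O(n^2) memory shrink
    when n_points is replaced by n_samples."""
    ratio = n_points / n_samples
    return ratio ** 3, ratio ** 2


# 300,000 cells as on the slide; 3,000 representatives is an assumption.
time_factor, memory_factor = reduction_factors(300_000, 3_000)
print(f"time shrinks by about {time_factor:,.0f}x, "
      f"memory by about {memory_factor:,.0f}x")
```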



Faithful Sampling: Our Solution for Applying Spectral Clustering to Large Data

Uniform sampling: low-density populations close to dense ones may not remain distinguishable.

Faithful sampling: tends to choose more samples from non-dense parts of the data.
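
One way to obtain this behaviour can be sketched as follows. This is an illustrative sketch, not necessarily the exact procedure used in the talk: a point is picked at random among those not yet assigned, it becomes a representative, and every still-unassigned point within a radius `h` joins its community. The radius `h` and the function name `faithful_sample` are assumptions introduced for this example.

```python
import math
import random


def faithful_sample(points, h, seed=0):
    """Return (representatives, communities) as lists of point indices.

    Each community holds the points that were within distance h of its
    representative and not already claimed by an earlier representative.
    """
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)  # visit points in random order
    assigned = [False] * len(points)
    representatives, communities = [], []
    for i in order:
        if assigned[i]:
            continue
        # Claim every unassigned point in the neighbourhood (including i itself).
        members = [j for j in range(len(points))
                   if not assigned[j] and math.dist(points[i], points[j]) <= h]
        for j in members:
            assigned[j] = True
        representatives.append(i)
        communities.append(members)
    return representatives, communities


# A dense blob of 500 points plus 10 isolated points far apart.
dense = [(0.001 * (k % 25), 0.001 * (k // 25)) for k in range(500)]
sparse = [(10.0 + 5.0 * k, 10.0) for k in range(10)]
reps, comms = faithful_sample(dense + sparse, h=1.0)
```

In this toy data the dense blob fits inside one neighbourhood, so it collapses to a single representative, while each isolated point keeps its own. A uniform sample of the same size would be expected to draw almost all of its points from the blob.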

How Does Our Faithful Sampling Preserve Information?

1. Space-uniform sampling: it preserves low-density parts of the data by selecting more samples from them than uniform sampling does.

2. Keeping the list of points in the neighbourhood of each sample: this list is used to define similarities between communities.
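
The second point can be illustrated with a small sketch. The slides do not state how the similarity between two communities is computed, so the definition below (the sum of pairwise point similarities between members, with an assumed Gaussian kernel) is one plausible choice rather than the authors' confirmed formula. The names `point_similarity`, `community_similarity` and `sigma` are hypothetical.

```python
import math


def point_similarity(p, q, sigma):
    # Assumed Gaussian kernel on Euclidean distance.
    return math.exp(-math.dist(p, q) ** 2 / (2.0 * sigma ** 2))


def community_similarity(points, comm_a, comm_b, sigma=1.0):
    """Assumed definition: sum of similarities over all cross-community pairs.

    comm_a and comm_b are lists of indices into points.
    """
    return sum(point_similarity(points[i], points[j], sigma)
               for i in comm_a for j in comm_b)


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
near = community_similarity(points, [0, 1], [2])  # two nearby communities
far = community_similarity(points, [0, 1], [3])   # a distant community
```

Under this assumed definition a community with many members contributes more weight to the graph than a community with few, so the density information dropped by the sampling step is carried into the edge weights.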

Clustering Result

Low-density populations surrounded by dense ones

Clustering Result

Populations with non-elliptical shapes

Subpopulations of a major population

[Figure: clustering results compared across SamSPECTRAL, flowMerge and FLAME.]

Summary


Spectral clustering can now be applied to large data sets by means of our proposed faithful (information-preserving) sampling.

This sampling method can be combined with other graph-based clustering algorithms that use different objective functions, in order to reduce the size of the data.

We have shown that SamSPECTRAL has an advantage over model-based clustering in identifying:
  Cell populations with non-elliptical shapes
  Low-density populations surrounded by dense ones
  Subpopulations of a major population

Acknowledgement


Committee:
  Dr. Arvind Gupta
  Dr. Ryan Brinkman
  Dr. Tobias Kollmann

Co-authors on SamSPECTRAL:
  Habil Zare

Data Providers:
  Connie Eaves
  Peter Lansdorp
  Keith Humphries

Thanks for Your Attention!