# Faithful Sampling for Spectral Clustering to Analyze High Throughput Flow Cytometry Data

8 Nov 2013

Parisa Shooshtari

School of Computing Science, Simon Fraser University, Burnaby

Brinkman’s Lab, Terry Fox Laboratory, BC Cancer Agency, Vancouver

Outline:

Flow Cytometry (FCM) Data

Clustering of FCM data

Spectral Clustering

Faithful Sampling for Spectral Clustering

Result

Summary

Basics of Flow Cytometry Technique

[Figure: flow cytometry schematic. Labels: Sample; axes Wave Length and Intensity; markers MHC-II and CD-11c; Int-1 and Int-2.]

Cell Population Identification in Flow Cytometry (FCM)

[Figure: gating plots over Parameter 1, Parameter 2, Parameter 3 and Parameter 4, with a gated population labelled X%. Adapted from the Science Creative Quarterly (2).]

Importance of FCM Data Clustering

Manual gating is:

Subjective

Error-prone

Time-consuming

It ignores the multivariate nature of the data.

Analyzing large FCM data sets (with up to 19 dimensions and 1,000,000 points) is impractical without the aid of automated techniques.

Which Clustering Algorithm Is Suitable?

Model-based algorithms like flowClust, flowMerge and FLAME are not suitable for clusters with non-elliptical shapes.

[Figure: two panels, "flowMerge" and "A Good Clustering"; axis label: GFP.]

Our Motivation for Using Spectral Clustering

Spectral clustering does not require any a priori assumption about cluster size, shape or distribution.

It is not sensitive to outliers, noise or the shape of the clusters.


Spectral Clustering in One Slide

Represent data sets by a similarity graph

Construct the graph:

Vertices: data points $p_1, p_2, \ldots, p_n$

Weights of edges: similarity values $S_{i,j}$ [defining equation not recovered from the slide]

Clustering: find a cut through the graph

Define a cut objective function

Solve it
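
The steps on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the method used in the talk: the Gaussian kernel, the `sigma` parameter, the normalized-cut relaxation and the function names `gaussian_similarity` and `spectral_bipartition` are all assumptions made for the example, since the slide does not show the similarity formula or the cut objective.

```python
import numpy as np


def gaussian_similarity(points, sigma=1.0):
    """Dense similarity matrix S[i, j] = exp(-||p_i - p_j||^2 / (2 sigma^2)).

    The Gaussian kernel is an assumption made for this sketch.
    """
    sq_norms = np.sum(points ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    S = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(S, 0.0)  # no self-loops in the graph
    return S


def spectral_bipartition(S):
    """Two-way cut from the relaxed normalized-cut objective.

    Splits the vertices on the sign of the eigenvector belonging to the
    second-smallest eigenvalue of the symmetric normalized Laplacian.
    """
    degrees = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    L_sym = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    fiedler = d_inv_sqrt * eigvecs[:, 1]
    return (fiedler > 0).astype(int)
```

Both the dense matrix `S` and the full eigendecomposition grow quickly with the number of points, which is the bottleneck discussed on the next slide.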

The Bottleneck of Spectral Clustering

Serious empirical barriers arise when applying this algorithm to large data sets:

Time complexity: $O(n^3)$ → 2 years for 300,000 data points (cells)

Required memory: $O(n^2)$ → 5 terabytes for 300,000 data points (cells)
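
To see why reducing the number of points pays off so strongly, the scaling can be worked through directly. The sketch below only compares growth rates; it assumes identical constant factors before and after sampling, and the sample size of 3,000 is a hypothetical value chosen for illustration. The absolute figures on the slide depend on hardware and implementation details that are not stated here.

```python
def spectral_cost_reduction(n_full, n_sampled):
    """Return (memory_factor, time_factor) gained by clustering n_sampled
    representatives instead of n_full points, assuming O(n^2) memory and
    O(n^3) time with identical constant factors."""
    ratio = n_full / n_sampled
    return ratio ** 2, ratio ** 3


n_full = 300_000   # data set size quoted on the slide
n_sampled = 3_000  # hypothetical number of samples, for illustration only

mem_factor, time_factor = spectral_cost_reduction(n_full, n_sampled)
print(f"similarity matrix entries at full size: {n_full ** 2:.1e}")
print(f"memory shrinks by a factor of {mem_factor:.0e}, time by {time_factor:.0e}")
```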


Faithful Sampling: Our Solution for Applying Spectral Clustering to Large Data

Uniform sampling: low-density populations close to dense ones may not remain distinguishable.

Faithful sampling: tends to choose more samples from low-density parts of the data than uniform sampling does.

How Does Our Faithful Sampling Preserve Information?

1. Space-uniform sampling: it preserves low-density parts of the data by selecting more samples from them than uniform sampling would.

2. Keeping the list of points in the neighbourhood of each sample: these lists are later used to define similarities between communities.
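
The two ideas above can be illustrated with a small sketch. The slides do not spell out the procedure, so the fixed-radius scheme below is an assumption: it repeatedly picks a random point that no earlier sample has claimed and assigns every unclaimed point within `radius` to that sample's neighbourhood. The function name `faithful_sampling` and the parameters `radius` and `seed` are hypothetical.

```python
import numpy as np


def faithful_sampling(points, radius, seed=0):
    """Pick representatives until every point lies within `radius` of one.

    Returns (sample_indices, neighbourhoods), where neighbourhoods[k] holds the
    indices of the points assigned to sample k (including the sample itself).
    This fixed-radius scheme is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    unassigned = np.ones(len(points), dtype=bool)
    sample_indices, neighbourhoods = [], []
    while unassigned.any():
        # Choose a random point that no earlier sample has claimed.
        candidate = rng.choice(np.flatnonzero(unassigned))
        dists = np.linalg.norm(points - points[candidate], axis=1)
        # Claim every still-unassigned point inside the ball around it.
        members = np.flatnonzero(unassigned & (dists <= radius))
        unassigned[members] = False
        sample_indices.append(int(candidate))
        neighbourhoods.append(members)
    return np.array(sample_indices), neighbourhoods
```

Under this scheme the samples end up more than `radius` apart, so a dense region cannot contribute many more samples per unit of space than a sparse one, and sparse populations receive a larger share of the samples than under uniform sampling. The neighbourhood lists retain the density information that the samples alone would lose.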

Clustering Result

Low-density populations surrounded by dense ones

Populations with non-elliptical shapes

Subpopulations of a major population

[Figure: clustering results compared across SamSPECTRAL, flowMerge and FLAME.]

Summary

Spectral clustering can now be applied to large data sets through our proposed faithful (information-preserving) sampling.

This sampling method can be combined with other graph-based clustering algorithms that use different objective functions, to reduce the size of the data.

We have shown that SamSPECTRAL outperforms model-based clusterings in identifying:

Cell populations with non-elliptical shapes

Low-density populations surrounded by dense ones

Sub-populations of a major population

Acknowledgement

Committee:

Dr. Arvind Gupta

Dr. Ryan Brinkman

Dr. Tobias Kollman

Co-authors on SamSPECTRAL:

Habil Zare

Data Providers:

Connie Eaves

Peter Landsdrop

Keith Humphries

Thanks for