R Analytics in the Cloud - Meetup


4 Oct 2013


R Analytics in the Cloud



Introduction

Radek Maciaszek

DataMine Lab (www.dataminelab.com) - data mining, business intelligence and data warehouse consultancy.

MSc in Bioinformatics at Birkbeck, University of London.

Project at UCL Institute of Healthy Ageing under the supervision of Dr Eugene Schuster.

Primer in Bioinformatics

Bioinformatics - applying computer science to biology (DNA, proteins, drug discovery, etc.)

Ageing strategy: solve it in a simple organism and apply the findings to more complex organisms (e.g. humans).

Goal: find genes responsible for ageing.

[Image: Caenorhabditis elegans]

Central dogma of molecular biology

Genes are encoded by the DNA.

Microarray (100 x 100)

Database of 50 curated experiments.

10k genes compared to each other.
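To get a feel for the scale implied by comparing 10k genes with each other, here is a quick back-of-the-envelope count in R. It assumes exactly 10,000 probes, which is a round figure read off the slide rather than a number confirmed by the talk, and the variable names are illustrative.

```r
# Assumption: "10k genes" is taken as exactly 10,000 probes.
n.genes <- 10000

# Every probe against every probe, including itself and both orderings.
ordered.pairs <- n.genes^2

# Distinct unordered pairs, excluding self-comparisons.
unordered.pairs <- choose(n.genes, 2)

# Roughly 10^8 correlation tests in the first case and about 5 * 10^7
# in the second, which is why the workload is compute-bound.
```

Either way the number of correlation tests is in the tens of millions, while the input data itself stays small.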

Why R?


Very popular in bioinformatics

Functional, scripting programming language

Swiss-army knife for statisticians

Designed by statisticians for statisticians

Lots of ready-to-use packages (CRAN)

R limitations & Hadoop

Data needs to fit in memory

Single-threaded

Hadoop integration:

  Hadoop Streaming

  Rhipe: http://ml.stat.purdue.edu/rhipe/

  Segue: http://code.google.com/p/segue/

Segue


Works with Amazon Elastic MapReduce.

Creates a cluster for you.

Designed for Big Computations (rather than Big Data).

Implements a cloud version of the lapply() function.

Segue workflow (emrlapply)

[Diagram labels: Amazon AWS, List (local), List (remote)]

R very quick example

> m <- list(a = 1:10, b = exp(-3:3))
> lapply(m, mean)
$a
[1] 5.5

$b
[1] 4.535125

lapply(X, FUN) returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
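The same behaviour can be sketched with a user-defined function instead of mean. The list `samples` and the function `Spread` below are illustrative names, not taken from the talk.

```r
# Illustrative input: a named list with elements of different lengths.
samples <- list(a = 1:10, b = c(2, 4, 6))

# Illustrative function: range width of a numeric vector.
Spread <- function(x) max(x) - min(x)

# lapply() applies Spread to each element and keeps the names and length.
result <- lapply(samples, Spread)

# unlist() flattens the one-value-per-element list into a named vector.
flat <- unlist(result)
```

Because the result always has one entry per input element, the work for each element is independent, which is what makes the pattern easy to distribute.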


Segue - large scale example

> AnalysePearsonCorelation <- function(probe) {
    # Row of the matrix for the probe being analysed
    A.vector <- experiments.matrix[probe,]
    p.values <- c()
    # Test its correlation against every probe in the matrix
    for(probe.name in rownames(experiments.matrix)) {
      B.vector <- experiments.matrix[probe.name,]
      p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
    }
    return (p.values)
  }

> pearson.cor <- lapply(probes, AnalysePearsonCorelation)

Moving to the cloud in 3 lines of code!

[Slide label: RNA Probes]
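The slide code relies on `experiments.matrix` and `probes` already existing in the session. A minimal self-contained sketch follows, assuming a small simulated matrix of 20 probes by 50 experiments. The sizes, the random data and the helper name `CorrelationPValues` are illustrative assumptions, not the talk's data. It uses `vapply()` rather than growing a vector inside a loop, which is a stylistic swap and not the author's code.

```r
set.seed(42)

# Assumed toy data: 20 probes (rows) measured in 50 experiments (columns).
n.probes <- 20
n.experiments <- 50
experiments.matrix <- matrix(
  rnorm(n.probes * n.experiments),
  nrow = n.probes,
  dimnames = list(paste0("probe", seq_len(n.probes)), NULL)
)
probes <- rownames(experiments.matrix)

# Hypothetical helper: p-values of Pearson correlation tests between one
# probe and every probe in the matrix (including itself).
CorrelationPValues <- function(probe) {
  A.vector <- experiments.matrix[probe, ]
  vapply(rownames(experiments.matrix), function(probe.name) {
    cor.test(A.vector, experiments.matrix[probe.name, ])$p.value
  }, numeric(1))
}

# One vector of p-values per probe, computed locally with lapply().
pearson.cor <- lapply(probes, CorrelationPValues)
names(pearson.cor) <- probes
```

Each element of `pearson.cor` is a named numeric vector with one p-value per probe, so the full result has the shape of a probes-by-probes table.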

Segue - large scale example

> AnalysePearsonCorelation <- function(probe) {
    # Row of the matrix for the probe being analysed
    A.vector <- experiments.matrix[probe,]
    p.values <- c()
    # Test its correlation against every probe in the matrix
    for(probe.name in rownames(experiments.matrix)) {
      B.vector <- experiments.matrix[probe.name,]
      p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
    }
    return (p.values)
  }

> # pearson.cor <- lapply(probes, AnalysePearsonCorelation)
> myCluster <- createCluster(numInstances=5, masterBidPrice="0.68",
    slaveBidPrice="0.68", masterInstanceType="c1.xlarge",
    slaveInstanceType="c1.xlarge", copy.image=TRUE)
> pearson.cor <- emrlapply(myCluster, probes, AnalysePearsonCorelation)
> stopCluster(myCluster)

[Slide label: RNA Probes]
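Segue needs an AWS account and a running Elastic MapReduce cluster, so the calls on this slide cannot be tried offline. As a local stand-in with the same create, apply, stop shape, the sketch below uses base R's `parallel` package (`makeCluster`, `clusterExport`, `parLapply`, `stopCluster`). This is a swapped-in technique for illustration and not Segue itself. The simulated matrix and the choice of two workers are assumptions.

```r
library(parallel)

set.seed(1)

# Assumed toy data: 10 probes measured in 30 experiments.
experiments.matrix <- matrix(
  rnorm(10 * 30),
  nrow = 10,
  dimnames = list(paste0("probe", 1:10), NULL)
)
probes <- rownames(experiments.matrix)

# Same function as on the slide.
AnalysePearsonCorelation <- function(probe) {
  A.vector <- experiments.matrix[probe, ]
  p.values <- c()
  for (probe.name in rownames(experiments.matrix)) {
    B.vector <- experiments.matrix[probe.name, ]
    p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
  }
  return(p.values)
}

# 1. Create a cluster (here: two local worker processes, not EMR nodes).
local.cluster <- makeCluster(2)
# Workers start with empty sessions, so the data must be sent to them.
clusterExport(local.cluster, "experiments.matrix", envir = environment())

# 2. Apply the function to every probe across the workers.
parallel.cor <- parLapply(local.cluster, probes, AnalysePearsonCorelation)

# 3. Shut the workers down.
stopCluster(local.cluster)

# Serial reference result for comparison.
serial.cor <- lapply(probes, AnalysePearsonCorelation)
```

The parallel and serial results should agree, since each probe is processed independently of the others.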

Discovering genes

Topomaps of clustered genes

This work was based on a similar approach to: A Gene Expression Map for Caenorhabditis elegans, Stuart K. Kim, et al., Science 293, 2087 (2001)
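The slides do not show how the genes were clustered, so the following is only a generic illustration of grouping genes by correlation, using hierarchical clustering from base R (`cor`, `as.dist`, `hclust`, `cutree`). It is not the topomap method of the cited paper and not necessarily what the project did. The simulated data, the two built-in groups and all variable names are assumptions.

```r
set.seed(7)

# Assumed toy data: two underlying expression patterns over 40 experiments,
# each shared by five genes with a little added noise.
base.a <- rnorm(40)
base.b <- rnorm(40)
expression <- rbind(
  t(replicate(5, base.a + rnorm(40, sd = 0.2))),
  t(replicate(5, base.b + rnorm(40, sd = 0.2)))
)
rownames(expression) <- paste0("gene", 1:10)

# Gene-by-gene Pearson correlation (cor() works on columns, hence t()).
gene.cor <- cor(t(expression))

# Turn correlation into a dissimilarity: highly correlated genes are close.
gene.dist <- as.dist(1 - gene.cor)

# Average-linkage hierarchical clustering, cut into two groups.
tree <- hclust(gene.dist, method = "average")
groups <- cutree(tree, k = 2)
```

In this toy setting the two groups recovered by `cutree()` correspond to the two simulated patterns; real expression data would need a choice of group count and linkage that the slides do not specify.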

Conclusions


R is great for statistics.

It’s easy to scale up R using Segue.

We are all going to live very long.

Thanks!


Questions?



References:

http://code.google.com/r/radek-segue/

http://www.dataminelab.com