Class discovery (clustering) using R Bioconductor

Microarray data analysis



1 Introduction


Using real-life data from a published study, this exercise provides examples of how to employ different clustering algorithms from various R Bioconductor packages to uncover groups of genes sharing patterns of expression. The multitude of clustering algorithms available for such tasks makes it impossible to show them all, so we will focus on three of the most popular methods: K-means clustering, partitioning around medoids (PAM) and self-organising maps (SOM). For visualisation and evaluation of the results we will use hierarchical clustering and multidimensional scaling (MDS).



1.1 Example data


The data set we will be using was published by Barenco and co-workers in Genome Biology (Barenco et al., 2006) and was obtained from an experiment aimed at identifying transcriptional targets of the p53 gene. The experiment involved measuring genome-wide expression using Affymetrix HGU-133A arrays at various time points after subjecting cells to 5 Gy of ionizing radiation. In total, 7 time points were investigated and three independent experiment runs were carried out. Full sample information is given below:


sample name    time    replicate
cARP1-0hrs      0       1
cARP1-10hrs    10       1
cARP1-12hrs    12       1
cARP1-2hrs      2       1
cARP1-4hrs      4       1
cARP1-6hrs      6       1
cARP1-8hrs      8       1
cARP2-0hrs      0       2
cARP2-10hrs    10       2
cARP2-12hrs    12       2
cARP2-2hrs      2       2
cARP2-4hrs      4       2
cARP2-6hrs      6       2
cARP2-8hrs      8       2
cARP3-0hrs      0       3
cARP3-10hrs    10       3
cARP3-12hrs    12       3
cARP3-2hrs      2       3
cARP3-4hrs      4       3
cARP3-6hrs      6       3
cARP3-8hrs      8       3


You can find the raw CEL files on ArrayExpress at the EBI. Create a phenodata.csv file with the experimental design.
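One possible layout for phenodata.csv is sketched below. The exact format is an assumption: a comma-separated file whose first column holds the CEL file names, which read.AnnotatedDataFrame() can then use as row names. The remaining rows follow the sample table above.

sample,time,replicate
cARP1-0hrs.CEL,0,1
cARP1-10hrs.CEL,10,1
cARP1-12hrs.CEL,12,1
cARP1-2hrs.CEL,2,1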



References




Barenco M, Tomescu D, Brewer D, Callard R, Stark J, Hubank M. Ranked prediction of p53 targets using hidden variable dynamic modeling. Genome Biol. 2006;7(3):R25. Epub 2006 Mar 31.


2 Exercise - loading and preparing data for clustering



2.1 Packages


In addition to the affy and gcrma packages used for importing and normalizing the data, we will also need some of the functionality provided by the stats, hopach, cluster, class, RColorBrewer and MASS packages. Start the exercise by loading them (and, if necessary, first installing them) into memory.
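A minimal way of loading them (assuming they have already been installed, for example through Bioconductor's installation routine) is:

# load the packages needed for this exercise
library(affy)
library(gcrma)
library(hopach)
library(cluster)
library(class)
library(RColorBrewer)
library(MASS)
# the stats package is part of base R and is loaded automatically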



2.2 Importing and normalizing the data


Use the ReadAffy() command, as you have done in previous exercises, to load the data into a variable called ir_data. The experimental design can be read using read.AnnotatedDataFrame(). During the import you should set up two covariates indicating the time point and replicate series for each sample. Because time and replicate are numbers, they are not turned into factors automatically by read.AnnotatedDataFrame(); this can be solved using as.factor().

To make the sample labels more readable in plots we will also remove the ".CEL" part from the names.
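One possible way of carrying out these steps is sketched below. It assumes that the CEL files and phenodata.csv are in the current working directory, that phenodata.csv is comma-separated, and that its row names are the CEL file names; adjust the arguments to match your own file.

# read the experimental design; row names are taken from the first column
phenodata <- read.AnnotatedDataFrame("phenodata.csv", sep=",", header=TRUE)

# time and replicate are read as numbers, so turn them into factors
phenodata$time <- as.factor(phenodata$time)
phenodata$replicate <- as.factor(phenodata$replicate)

# import the CEL files listed in the phenodata and attach the covariates
ir_data <- ReadAffy(filenames=rownames(pData(phenodata)), phenoData=phenodata)

# strip the ".CEL" suffix from the sample names for more readable plot labels
sampleNames(ir_data) <- sub("\\.CEL$", "", sampleNames(ir_data), ignore.case=TRUE)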

Let's make some plots to check the distributions of signal intensities for the samples in the different replicate series (figure 1).
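A possible sketch for such plots; the colour assignment per replicate series is an assumption, any mapping of replicate to colour will do:

# density plots of the raw log intensities, one colour per replicate series
color_replicate <- as.numeric(phenodata$replicate) + 1
hist(ir_data, col=color_replicate, lty=1)
legend("topright", legend=levels(phenodata$replicate), col=2:4, lty=1)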




Figure 1: Distribution of raw signal intensities for each sample, with coloring according to replicate series.




How do you interpret the plots?



Normalize the data using the rma() function and store the results in the variable ir_rma.


Let's check if the normalization achieved the desired result by plotting the values in a boxplot (figure 2).
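A minimal sketch of these two steps (reusing color_replicate from the density plot above):

# normalize with RMA; the result is an ExpressionSet of log2 expression values
ir_rma <- rma(ir_data)

# boxplot of the normalized values, coloured by replicate series
boxplot(exprs(ir_rma), col=color_replicate, las=2, cex.axis=0.7)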





Figure 2: Distribution of signal intensities for each sample upon normalization, with coloring according to replicate series.


Now that the samples look comparable we can move on to the filtering steps.



2.3 Filtering out low-quality data


Just like we did in the preprocessing exercise, we will take advantage of the detection calls calculated by the MAS5 algorithm to exclude unreliable data. This time we will require that, for each time point, all three replicate samples are scored as present. Since it may well be the case that genes are only expressed at one specific time point, we will adjust the filtering accordingly. Although the following lines of code may look daunting at first, they are in fact relatively simple, mostly consisting of loops that traverse the data table by samples and genes. Study the code to see if you can understand how it operates, or simply copy and paste the whole chunk; either way you should end up with a subset of reliable data stored in the variable filtered_genes.


# calculate PA calls
pa_calls <- mas5calls(ir_data)
is_present <- exprs(pa_calls) == "P"

# define function for counting P
count_true <- function (data_table)
{
    number_rows <- dim(data_table)[1]
    number_columns <- dim(data_table)[2]
    result_list <- numeric(number_rows)
    for (row in 1:number_rows)
    {
        count <- 0
        for (column in 1:number_columns)
            if (data_table[row,column] == "TRUE")
                count <- count+1
        result_list[row] <- count
    }
    return(result_list)
}


# go through table by replicate series
time_points <- as.numeric(phenodata$time)
number_time_points <- length(levels(phenodata$time))
number_replicates <- length(levels(phenodata$replicate))
number_genes <- dim(is_present)[1]
result_table <- seq(1,number_genes,1)
for (time_point_count in 1:number_time_points)
{
    for (replicate_count in 1:number_replicates)
    {
        index <- time_points==time_point_count
        extracted_data <- is_present[,index]
        present_count <- count_true(extracted_data)
        present_in_all <- present_count == number_replicates
    }
    result_table <- cbind(result_table, present_in_all)
}




# define function for counting occurrences of the value 1
count_ones <- function (data_table)
{
    number_rows <- dim(data_table)[1]
    number_columns <- dim(data_table)[2]
    result_list <- numeric(number_rows)
    for (row in 1:number_rows)
    {
        count <- 0
        for (column in 1:number_columns)
            if (data_table[row,column] == 1)
                count <- count+1
        result_list[row] <- count
    }
    return(result_list)
}

# identify genes present (in all replicates) in at least one time point
present_count <- count_ones(result_table[,2:(number_time_points+1)])
present_in_any <- present_count >= 1
filtered_genes <- ir_rma[present_in_any,]


Next we need to remove the control probes from the data. For this purpose we can simply reuse the code that was provided for the exercise on how to identify differentially expressed genes:


number_genes_2 <- dim(exprs(filtered_genes))[1]
is_control <- logical(number_genes_2)
for (row in 1:number_genes_2)
{
    if (charmatch("AFFX", rownames(exprs(filtered_genes))[row],
                  nomatch=0) == 0) {
        is_control[row] <- FALSE
    } else {
        is_control[row] <- TRUE
    }
}
filtered_genes_2 <- filtered_genes[!is_control]


The final filtering step involves getting rid of the uninteresting data, and like in the aforementioned exercise we will employ a cut-off based on the distribution of the coefficient of variation (CV) values. Use apply() with MARGIN=1 (rows) on the expression data of the filtered genes to calculate vectors with the standard deviation and the mean. Divide to get the coefficient of variation. Determine the 1 percent cut-off value using the quantile() function, and filter the gene set.
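A possible sketch of this filtering step is given below. The exact cut-off is an assumption: the quantile probability of 0.99 used here keeps the 1 percent most variable genes, and the result is stored in filtered_genes_3, the variable name used later in this exercise; adjust both if your reading of the instructions differs.

# per-gene standard deviation and mean across all arrays
expression_values <- exprs(filtered_genes_2)
sd_values <- apply(expression_values, MARGIN=1, sd)
mean_values <- apply(expression_values, MARGIN=1, mean)

# coefficient of variation for each gene
CV_values <- sd_values / mean_values

# 1 percent cut-off (assumed here to mean keeping the top 1 percent)
cutoff <- quantile(CV_values, probs=0.99)
filtered_genes_3 <- filtered_genes_2[CV_values > cutoff, ]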




How many genes remain after all the filtering steps?


Visualise the distribution (on linear and logarithmic scale) and the cut-off on a histogram (figure 3) and on a scatterplot of CV_values against mean_values (figure 4).
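A sketch of the two plots, assuming CV_values, mean_values and cutoff from the step above:

# histogram of CV-values in linear and logarithmic scale, cut-off overlaid
par(mfrow=c(1,2))
hist(CV_values, breaks=50, main="CV values", xlab="CV")
abline(v=cutoff, col="blue")
hist(log2(CV_values), breaks=50, main="log2 CV values", xlab="log2 CV")
abline(v=log2(cutoff), col="blue")
par(mfrow=c(1,1))

# CV-values against average expression, genes above the cut-off in blue
plot(mean_values, CV_values, pch=20,
     col=ifelse(CV_values > cutoff, "blue", "black"),
     xlab="mean expression (log2 scale)", ylab="CV")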




Figure 3: Histogram of CV-values in linear and logarithmic scale with the filtering cut-off overlaid.




Figure 4: CV-values as a function of the average expression (in logarithmic scale) across arrays. Blue colour indicates CV-values above the filtering cut-off.



2.4 Evaluating experiment outcome


Before proceeding to explore the gene expression patterns with the various algorithms, it is a good idea to first check that the experiment performed as expected, i.e. that samples from different time points are more dissimilar to each other than the three replicates at each time point. The most common way to do this is to perform a hierarchical clustering of the samples. Although it will cluster both genes and samples, the heatmap() function can conveniently be used for this purpose. To set up a pleasing colour scheme for the map we can use the colorRampPalette() function together with a palette from the RColorBrewer package. We will further add colours to highlight the three replicates of each time point using the ColSideColors parameter; the resulting image should look like figure 5.
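A possible sketch is given below. The particular Brewer palette and the colour-per-time-point assignment are assumptions; the time colours chosen here (palette entries 2 to 8) match the legend used in the MDS plot further down.

# colour scale for the heatmap and one colour per time point for the columns
heat_colors <- colorRampPalette(brewer.pal(9, "GnBu"))(100)
color_time <- as.numeric(phenodata$time) + 1

heatmap(exprs(filtered_genes_3), col=heat_colors,
        ColSideColors=palette()[color_time], cexCol=0.7)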






Figure 5: Two-dimensional clustering of samples (columns) and genes (rows).




What conclusions can you draw from the heatmap?


The heatmap() function calculates distances between gene or sample profiles using Euclidean distance. This is far from always the most appropriate distance measure, especially for time course data. A hierarchical clustering based on Pearson correlation can be achieved by first calculating a distance matrix (to find "similarity" use cor() with method="pearson" on the expression data, then turn it into a distance by subtracting it from 1 and applying as.dist()) and then calling the hclust() function with the "average" method (figure 6):


dist_samples_pea <- as.dist(1 - cor(exprs(filtered_genes_3),
                                    method="pearson"))
hc_samples_pea <- hclust(dist_samples_pea, method="average")
plot(hc_samples_pea, hang=-1, ann=T, cex=0.75, main="Pearson")





Figure 6: Hierarchical clustering of samples based on Pearson correlation.


One disadvantage of hierarchical clustering is that it is not easy to compare any given sample to any other sample. Multidimensional scaling, on the other hand, allows precisely that:


mds_pea <- cmdscale(dist_samples_pea, eig=TRUE)
plot(mds_pea$points, col=color_time, xlab="Component 1",
     ylab="Component 2", pch=19, main="MDS plot (pearson)")
legend(legend=levels(phenodata$time), col=2:8,
       x=-0.28, y=0, lty=1, cex=1)






Figure 7: Results from multidimensional scaling on samples based on Pearson correlation.




Do the results from MDS and hierarchical clustering agree?


Now that we have seen that the experiment (mostly) performed according to our expectations, it is almost time to move on to the clustering of gene expression profiles. Almost, because we first need to process the data a bit more before we can go on.



2.5 Averaging, centering and ordering the data


At present, the data contains three replicate measurements for each time point. When clustering with regard to the time course behaviour we are not interested in treating the different replicates of the same time point as individual samples; therefore, the data needs to be averaged over the three replicates. Furthermore, the first time point, i.e. at the moment of irradiation, is not really interesting per se and was only measured to define a reference point. By making ratios of the averaged values for each time point to the zero time point we will also effectively center the data and scale each gene so that they become more comparable. Finally, the columns of the data table are not ordered chronologically, which would make plotting of expression profiles difficult. It is thus desirable to also order the columns according to time. The following code will take care of these issues:


# average over replicates
# go through table by replicate series
number_genes_3 <- dim(exprs(filtered_genes_3))[1]
averaged_data <- matrix(nrow=number_genes_3, ncol=number_time_points)
for (gene_count in 1:number_genes_3)
{
    row_values <- numeric(number_time_points)
    for (time_point_count in 1:number_time_points)
    {
        index <- time_points==time_point_count
        row_data <- exprs(filtered_genes_3[gene_count,index==TRUE])
        sum <- 0
        for (replicate_count in 1:number_replicates)
        {
            sum <- sum+row_data[replicate_count]
        }
        row_values[time_point_count] <- sum/number_replicates
    }
    averaged_data[gene_count,] <- row_values
}
rownames(averaged_data) <- rownames(exprs(filtered_genes_3))
colnames(averaged_data) <- levels(phenodata$time)

# make ratios to the zero time point and order columns according to
# time course
centered_data <- averaged_data[,2:7]
centered_data <- centered_data/averaged_data[,1]
sort_order <- order(as.numeric(colnames(centered_data)))
ordered_data <- centered_data[,sort_order]



3 Exercise - clustering of gene expression profiles



3.1 Calculating distance matrices


As was done for the sample clustering, we first need to calculate distances between all possible pairs of gene expression profiles. Besides allowing us to visualise and evaluate the clustering results in dendrograms and MDS plots, this is also a necessary step for the PAM algorithm. Since the data has been centered, and since it is generally desirable to focus attention on similarities in the shape of the expression profiles rather than their amplitude, the most appropriate choice of distance measure is again the Pearson correlation.
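A sketch of the corresponding distance calculation for the genes; note that cor() works on columns, so the ordered data matrix (genes in rows) has to be transposed first:

# Pearson-correlation-based distance between gene expression profiles
dist_genes_pea <- as.dist(1 - cor(t(ordered_data), method="pearson"))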


All three clustering algorithms that we are going to try require the number of expected clusters as input. A good starting point for estimating this number is to plot the genes in a dendrogram. Performing hierarchical clustering and plotting the resulting dendrogram is done in exactly the same way as we did for the samples in figure 6.




Perform hierarchical clustering and plot the results (which should look like figure 8).
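A minimal sketch, assuming the dist_genes_pea matrix from above (gene labels are suppressed because there are too many to read):

# hierarchical clustering of genes and the corresponding dendrogram
hc_genes_pea <- hclust(dist_genes_pea, method="average")
plot(hc_genes_pea, labels=FALSE, hang=-1, main="Genes, Pearson")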




Figure 8: Results from hierarchical clustering of genes based on Pearson correlation.






How many clusters do you see?


In the following we will assume that there are about 7 clusters in the data. If you arrived at another number you are welcome to use that instead.



3.2 Clustering by k-means


In addition to the number of expected clusters, the k-means algorithm needs the maximum number of iterations to be specified. This is done to avoid the algorithm looping "forever". Usually a solution is found within the first 5 or 10 iterations, so a value of 100 is more than enough. Since k-means starts by randomly assigning genes to the different clusters, a particular run of the algorithm may end up in a solution that is only locally optimal. By repeating the run for 100 different random starting configurations we can minimise this risk. Calculate a k-means result for 7 clusters (using the kmeans() function).
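A minimal sketch of such a call; the argument values follow the text above:

# k-means on the ordered expression ratios: 7 clusters, at most 100 iterations,
# 100 random starting configurations to reduce the risk of a poor local optimum
kmeans_genes <- kmeans(ordered_data, centers=7, iter.max=100, nstart=100)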


To have a look at the expression profiles of the genes in the different clusters (cluster 1 is shown in figure 9) you can use matplot() with lines (type="l"). If the k-means result was stored in kmeans_genes, you can use kmeans_genes$cluster to get the cluster assignments as a vector. You need to transpose (function t()) the matrix to plot expression against time points.
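A possible sketch for plotting the first cluster:

# expression profiles of all genes assigned to cluster 1, one line per gene
cluster_1 <- ordered_data[kmeans_genes$cluster == 1, ]
matplot(t(cluster_1), type="l", lty=1,
        xlab="time point", ylab="expression ratio", main="k-means cluster 1")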



Figure 9: Expression profiles for all genes in the first of the seven clusters identified using the k-means algorithm.



In order to evaluate how well the identified clusters represent the patterns in the data, as well as to show the "tightness" within and the "overlap" between clusters, the MDS plot is very handy. Calculate mds_pea, and plot mds_pea$points, using kmeans_genes$cluster to assign colours. This will overlay the clustering results on an MDS plot based on the Pearson correlation matrix calculated for the genes.
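A minimal sketch, assuming dist_genes_pea from section 3.1 (note that this reuses the name mds_pea and therefore overwrites the sample-level MDS object created earlier):

# MDS on the gene distance matrix, points coloured by k-means cluster
mds_pea <- cmdscale(dist_genes_pea, eig=TRUE)
plot(mds_pea$points, col=kmeans_genes$cluster, pch=19,
     xlab="Component 1", ylab="Component 2", main="MDS plot (pearson)")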



Figure 10: MDS-plot for genes based on Pearson correlation with results from k-means clustering highlighted with colours.




How do you interpret the results?




Repeat the clustering for other numbers of clusters and compare the results.



3.3 Clustering by PAM



The PAM algorithm is related to k-means, but there are a couple of significant differences, the most crucial being that k-means evaluates profile similarities using Euclidean distance whereas PAM can use any distance measure. Basing the clustering on Pearson correlation rather than Euclidean distance should put more focus on profile shape than on amplitude. In addition to the distance matrix (calculated earlier), the only parameter that needs to be provided is the number of clusters (k), making it easy to achieve good results.
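A minimal sketch of the call; the variable name matches the hint given below:

# PAM clustering on the Pearson distance matrix with k = 7 clusters
pam_genes_pea <- pam(dist_genes_pea, k=7, diss=TRUE)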






Plot expression profiles and an MDS-plot for the resulting clusters.

Hint: the cluster assignments are contained in the variable pam_genes_pea$clustering.





Draw matplots for all clusters on one page.
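One way to do this, assuming 7 PAM clusters as above:

# one panel per cluster, all on a single page
par(mfrow=c(3,3))
for (cluster_count in 1:7)
{
    cluster_members <- ordered_data[pam_genes_pea$clustering == cluster_count, , drop=FALSE]
    matplot(t(cluster_members), type="l", lty=1, xlab="time point",
            ylab="expression ratio", main=paste("PAM cluster", cluster_count))
}
par(mfrow=c(1,1))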



How do the results compare with the ones obtained using the k-means algorithm?



Change number of clusters and see if you can achieve better results.



Figure 13: Expression profiles for all genes in the first of the seven clusters identified using the PAM algorithm.



Figure 14: MDS-plot for genes based on Pearson correlation with results from PAM clustering highlighted with colours.