Principal Component Analysis


Citation: Neuhauser, C. Principal Component Analysis. Created: May 27, 2012. Revisions:

Copyright: © 2012 Neuhauser. This is an open-access article distributed under the terms of the Creative Commons Attribution Non-Commercial Share Alike License, which permits unrestricted use, distribution, and reproduction in any medium, and allows others to translate, make remixes, and produce new stories based on this work, provided the original author and source are credited and the new work will carry the same license.




Learning Objectives

After completion of this module, the student will be able to

• describe principal component analysis (PCA) in geometric terms
• interpret visual representations of PCA: scree plot and biplot
• apply PCA to a small data set
• research the application of PCA in different knowledge domains
• design a research project using microarray data and analysis tools on REMBRANDT

Concepts

• Big data
• Dimensionality reduction
• Principal component analysis (PCA)

Knowledge and Skills

• Excel skills: Conditional formatting, linear regression, scatter plot, functions
• Scree plot, biplot
• Coordinate representation of points

Prerequisites

• Familiarity with Excel
  o Copy, paste, graphing, sorting
• Scatterplot
• Correlation
• Linear regression

Supporting Articles and Data Sets

Kho, A.T., Q. Zhao, Z. Cai, A.J. Butte, J.Y.H. Kim, S.L. Pomeroy, D.H. Rowitch, and I.S. Kohane. 2004. Conserved mechanisms across development and tumorigenesis revealed by a mouse development perspective of human cancers. Genes & Development 18: 629-640. doi: 10.1101/gad.1182504.

Accessed on the web on May 27, 2012: http://www.genesdev.org/cgi/doi/10.1101/gad.1182504



Big Data, Heat Maps, Scatter Plots, and Data Clouds

In this module, we will look at a study by Kho et al. (2004). The abstract of the paper begins with the sentence: “Identification of common mechanisms underlying organ development and primary tumor formation should yield new insights into tumor biology and facilitate the generation of relevant cancer models” (Kho et al. 2004). Their study focuses on a childhood cancer, medulloblastoma, which is a cancer of the central nervous system. As part of the study, the research group analyzed microarray expression data of mouse cerebella during postnatal days 1-60 to identify the genes that were expressed early versus late during development.


This module uses the data in Supplemental Table 1 (MS Excel), which is available at
http://genesdev.cshlp.org/content/18/6/629/suppl/DC1

The complete data set is in the accompanying worksheet. The raw data are in the first sheet, labeled “Raw Data.” Except for an identifier added for each row in Column A, the sheet contains the data from Supplemental Table 1 (Kho et al. 2004; see above for link). The sheet is protected to avoid accidental changes to the raw data. However, you can still copy data from that sheet into a new sheet.

The study resulted in a large amount of data. We will start with a preliminary exploration to get a better sense of the data. Table 1 shows the data of the first thirteen of 2552 genes from one of the experiments:

Table 1: Expression data of the first thirteen genes from one of the experiments (Kho et al. 2004)




Normalizing the Data and Generating a Heat Map

In the second sheet (called HeatMap), we copied the data set from Columns V to AD of the first sheet. The data come from wild-type mouse cerebella during the first 60 days postnatal, and were profiled using Affymetrix Mu11K arrays at nine time points: P1, P3, P5, P7, P10, P15, P21, P30, and P60, indicating the number of days postnatal.


The first step is to normalize the data. We follow Kho et al. (2004): “each of the 2552 genes was individually normalized to mean zero and variance one across P1-P60.” We illustrate this on the first gene (see Table 2). The expression data for the nine time points are listed in Cells A2:I2. To standardize the data, we need to calculate the mean and the standard deviation of the nine data points. To calculate the mean, we use the Excel function AVERAGE(number1, [number2],…). We enter in Cell J2 the expression

=AVERAGE(A2:I2)

To calculate the standard deviation, we use the Excel function STDEV.S(number1, [number2],…). We enter in Cell K2 the expression

=STDEV.S(A2:I2)

To standardize the values, we use the Excel function STANDARDIZE(x, mean, standard_dev). This function has three arguments: the value we want to standardize, the mean, and the standard deviation. The function returns the normalized value of x, that is,

$$ \frac{x - \text{mean}}{\text{standard\_dev}} $$

To standardize the value in Cell A2, we enter in Cell A6

=STANDARDIZE(A2,$J$2,$K$2)

We obtain the value -0.9877, which is the result of the expression (-66.0 - 509.3)/582.48.

Note that we used absolute references for the mean and standard deviation (indicated by the $ sign before the column letter and row number, respectively), but a relative reference for the value we want to standardize. This allows us to drag Cell A6 across so that the remaining cells B6 to I6 are filled with the standardized data.
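
The same standardization can be cross-checked outside Excel. The following is a minimal Python sketch, assuming the sample standard deviation (as computed by STDEV.S). The function name standardize_row and the nine-value example row are hypothetical illustrations and are not taken from the data set; only the numbers -66.0, 509.3, and 582.48 come from the text above.

```python
import statistics

def standardize_row(values):
    """Return the values shifted to mean 0 and scaled to sample variance 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation, like Excel's STDEV.S
    return [(v - mean) / sd for v in values]

# Arithmetic quoted in the text: first value, mean, and standard deviation of the first gene.
z_first = (-66.0 - 509.3) / 582.48
print(round(z_first, 4))

# Hypothetical expression profile with nine time points (illustration only).
row = [12.0, 15.0, 9.0, 20.0, 31.0, 28.0, 40.0, 44.0, 52.0]
z = standardize_row(row)
print(round(statistics.mean(z), 10), round(statistics.stdev(z), 10))
```

After standardization, the row has mean 0 and sample standard deviation 1, which is the property the STANDARDIZE step is meant to produce for each gene.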





Table 2: The expression profile of the first gene before (Row 2) and after normalization (Row 6)

To plot the standardized expression profile over time, we generate a table as below:

We then graph the data as a scatterplot with straight lines and markers (Figure 1):

Figure 1: Scatterplot of normalized expression profile for the first gene in the data set.


We now return to the full data set in the second sheet and standardize the remaining gene expression profiles. (Note: Adjust the rows and columns in the formulas according to where the data are.)

To standardize the data in Columns B through L in the worksheet under the HeatMap tab, calculate the mean and the standard deviation of each gene, and enter the results in Columns M and N, respectively. Use the STANDARDIZE function to normalize the data, and enter the results in Columns P through X.


To visualize the up- versus down-regulated genes, use the Conditional Formatting option that is available in the Styles group of the Home ribbon. Choose a Color Scale that formats cells with high values red and cells with low values blue. Now, sort the data from largest to smallest by the first time point of the normalized data (Column P) using Custom Sort. Make sure you sort all columns so that you can keep track of the genes using the numeric ID in Column B. The coloration indicates which genes tend to be expressed early versus late during development.

Scatter Plots and Correlations

Since the data consist of expression profiles that were taken on days that are close together, we expect that the expression profiles from time point to time point are correlated. We use scatter plots to visualize correlations and calculate the correlation among all pairs of time points. Figure 2 shows an example of a scatter plot where each data point represents the expression of a single gene at time points 5 days (horizontal axis) and 7 days (vertical axis). We see that the data are positively correlated (Figure 2).

Figure 2: Scatterplot of the normalized data from time points 5 and 7.

We can calculate the correlation using the Excel function

=CORREL(array1, array2)

where array1 (similarly, array2) is the range of data for a given time point. For instance, we find that the correlation between time points 5 and 7 of the normalized data is 0.66.
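
For readers who want to see what CORREL computes, here is a minimal Python sketch of the Pearson correlation coefficient. The function name pearson_correlation and the two short lists day5 and day7 are hypothetical; they are not values from the data set and will not reproduce the 0.66 quoted above.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical normalized expression values of five genes at two time points.
day5 = [1.2, 0.4, -0.3, -1.1, 0.8]
day7 = [0.9, 0.7, -0.6, -0.8, 0.2]
print(round(pearson_correlation(day5, day7), 3))
```

A value near +1 corresponds to points that rise together along a line, and a value near -1 to points that fall along a line.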



Exercise 1:

Calculate the correlation between pairs of time points for the normalized data. Find the pairs with the largest positive and largest negative correlation and plot each of these pairs of data as a scatter plot. What property of the scatter plot tells you whether the data are positively versus negatively correlated?


Data Clouds

The data were collected on nine days during Days 1-60 postnatal. We can think of each time point as a single dimension. To visualize the data as a cloud in the temporal space, we would need a 9-dimensional space, one dimension for each time point. Of course, we cannot draw such a cloud. We are restricted to at most three dimensions when plotting clouds. But even a three-dimensional cloud is not easy to interpret, as illustrated in Figure 3.

Figure 3: Data cloud in three spatial dimensions

In the following, we will learn a tool, called Principal Component Analysis, that reduces the dimensionality of the data by rotating the coordinate axes in such a way as to maximize the signal and minimize redundancy in the representation. This will allow us to represent high-dimensional data with fewer dimensions while keeping the most important features of the data. Before we explain this further, we will need to review how to represent points in space.



Representing Points in Space

When we represent a point in 2-dimensional space, we give its coordinates relative to a coordinate system. For instance, the red point in the graph below (Figure 4) has coordinates (x, y) in the blue coordinate system and coordinates (u, v) in the black coordinate system. Since in either coordinate system the axes are orthogonal, we can use the Pythagorean Theorem and find that

$$ x^2 + y^2 = u^2 + v^2 $$

Figure 4: Representing points in rectangular coordinate systems


Exercise 2:

In Figure 4, assume that the coordinates of the red point in the x-y coordinate system are (1,2). Assume that the u-axis goes through the point (2,1) in the x-y coordinate system. What are the coordinates of the red point in the u-v coordinate system?





Choosing a Coordinate System

Look at the left and right panels in Figure 5. In the left panel, the data points are normally distributed with mean 0 and variance 1 and they are uncorrelated. In the right panel, the data points are also normally distributed with mean 0 and variance 1, but this time they are correlated.

Figure 5: The data in the left panel are uncorrelated; the data in the right panel are correlated.

Correlation introduces redundancy in the data in the sense that knowing the value of one coordinate allows us to make predictions about the other coordinate, and the higher the correlation, the better our prediction will be.






Figure 6: The rotated data cloud

In Figure 6, we rotated the data cloud from the right panel of Figure 5 so that the maximum variability is aligned with the horizontal axis. Rotating the data cloud is equivalent to rotating the axes. That is, we could have said that we rotated the axes so that the first axis goes through the cloud where the variability is maximal. To express the data points in the new coordinate system, we proceed as in


Exercise 2. The variables describing the data points in the new coordinate system are then linear combinations of the variables describing the old coordinate system.
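
To make the phrase "linear combinations" concrete, the following Python sketch rotates the coordinate axes by an angle and reports the variance along each new axis. It is only an illustration under stated assumptions: the function names rotate_axes and variance, the angle of 45 degrees, and the five points are hypothetical and do not come from the module's data.

```python
import math

def rotate_axes(points, theta):
    """Coordinates (u, v) of each (x, y) after rotating the axes counterclockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    # Each new coordinate is a linear combination of the old coordinates x and y.
    return [(c * x + s * y, -s * x + c * y) for x, y in points]

def variance(values):
    """Sample variance (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

# Hypothetical, strongly correlated points scattered around the line y = x.
points = [(-2.0, -1.8), (-1.0, -1.2), (0.0, 0.1), (1.0, 0.9), (2.0, 2.0)]
rotated = rotate_axes(points, math.pi / 4)
var_u = variance([u for u, _ in rotated])
var_v = variance([v for _, v in rotated])
print(f"variance along u: {var_u:.3f}, along v: {var_v:.3f}")
```

For these points, nearly all of the variation ends up along the first rotated axis, while the total variance is unchanged by the rotation.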

Because the data are highly correlated, almost all the information about the data is contained in the first coordinate when the data are expressed in this rotated coordinate system. The second coordinate adds much less information. As a consequence, if we wanted to reduce the dimensionality of the data, we could use the first coordinate and neglect the second coordinate. Neglecting dimensions in this way, where we rotate the coordinate axes so that the first coordinate maximizes the signal, as measured by the variation, and neglect the information contained in the second coordinate, is an example of dimensionality reduction.

Reducing dimensionality becomes important when data are high-dimensional since we cannot visualize data clouds in more than three spatial dimensions. To maximize the signal and reduce redundancy, we should therefore rotate the coordinate axes so that the first axis maximizes the signal, as measured by the amount of variation. The second axis should have the second highest amount of variation, and so on. In addition, we choose the coordinate axes so that they are orthogonal with respect to each other. The selection of these axes, called principal components, is at the heart of Principal Component Analysis.

Finding the First Principal Component

We begin with a paper and pencil approach to finding the first principal component of a small number of data points. This will give a description of the first principal component in geometric terms.

Figure 7: Representation of a point in two rectangular coordinate systems



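In Figure 7, the point (in red) has two representations: (x, y) in the blue coordinate system and (u, v) in the black coordinate system. We can think of u as the projection of the point on the u-axis. Given the coordinates x and y, we want to calculate the value of the coordinate u. This requires some trigonometry and the Pythagorean Theorem. Let $r = \sqrt{x^2 + y^2}$ denote the distance from the origin to the point, $\alpha$ the angle between the x-axis and the line from the origin to the point, and $\theta$ the angle between the x-axis and the u-axis. Using trigonometry on the red triangle, we find that

$$ u = r \cos(\alpha - \theta) \tag{1} $$

The cosine of the difference of two angles can be computed using a standard formula from trigonometry, and we find that

$$ \cos(\alpha - \theta) = \cos\alpha \cos\theta + \sin\alpha \sin\theta \tag{2} $$

Combining Equations (1) and (2), we obtain

$$ u = r \cos\alpha \cos\theta + r \sin\alpha \sin\theta $$

Note that $x = r \cos\alpha$ and $y = r \sin\alpha$. Hence,

$$ u = x \cos\theta + y \sin\theta \tag{3} $$

We will find a more useful expression for the sine and cosine of the angle $\theta$ in Equation (3).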


Figure 8: Expressing the angle $\theta$ in terms of the point (a, b)

In Figure 8, we marked off the point (a, b) on the u-axis. We choose a and b in the blue coordinate system so that

$$ a^2 + b^2 = 1 \tag{4} $$




This means that the length of the orange line segment is 1 (in either coordinate system) according to the Pythagorean Theorem. It follows that in the blue coordinate system, the angle $\theta$ between the x-axis and the u-axis satisfies

$$ \cos\theta = a \quad \text{and} \quad \sin\theta = b $$

This now allows us to find an expression for the length of the projection of the point (x, y) on the u-axis; namely, using Equation (3), we find:

$$ u = ax + by $$

Hence, the distance from the origin to the projection of the point (x, y) on the u-axis is |ax + by|.

If the u-axis is given by its slope m in the (blue) x-y coordinate system, we can find (a, b) as follows. First note that the point (1, m) lies on the coordinate axis u. The distance from the origin to that point is $\sqrt{1 + m^2}$. If we divide the coordinates of the point (1, m) by this length, we obtain a point whose coordinates are (a, b) and satisfy (4). We obtain

$$ a = \frac{1}{\sqrt{1 + m^2}}, \qquad b = \frac{m}{\sqrt{1 + m^2}} $$
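
The projection formula can be tried out numerically. The Python sketch below computes (a, b) from a slope m and then the distance |ax + by| for one point. The function names unit_direction and projection_distance, the slope 0.75, and the point (3, 4) are hypothetical choices made for illustration.

```python
import math

def unit_direction(m):
    """Point (a, b) on the line with slope m at distance 1 from the origin."""
    length = math.sqrt(1.0 + m * m)  # distance from the origin to (1, m)
    return 1.0 / length, m / length

def projection_distance(x, y, m):
    """Distance |ax + by| from the origin to the projection of (x, y) on the line."""
    a, b = unit_direction(m)
    return abs(a * x + b * y)

a, b = unit_direction(0.75)  # hypothetical slope
print(a, b, a * a + b * b)   # a and b satisfy a^2 + b^2 = 1 up to rounding
print(projection_distance(3.0, 4.0, 0.75))
```

A line with slope m cannot be vertical, so this parametrization covers every direction except the y-axis itself.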



To find the first principal component for a cloud of data points in a two-dimensional space, we start with a line through the origin and calculate, for each data point, the distance from the origin to the projection of the point onto this line. We then change the slope to maximize the sum of the distances from the origin to the projected points on the line. We call this line the optimal line. It is the first principal component.


PCA Simulation

In the spreadsheet under tab “PCA Simulation,” we picked ten genes to illustrate how to find a coordinate system that maximizes the signal and minimizes redundancy. The standardized gene expression values for the nine time points are listed in the array B2:J11. You can take any pair of expression data and copy them into the array in P3:Q12. The spreadsheet is set up to normalize each column to have mean 0 and standard deviation 1. This standardized set of data points is in the array U2:V12. The data points are plotted in the scatter diagram.

We will now find the line that maximizes the sum of the distances from the projection of each point to the origin. In Figure 9, we labeled the distance we wish to consider as “Distance from projection of point to origin.” Figure 9 has four points. Each point has a projection to the u-coordinate axis. The sum of the distances from the projection of each point to the origin is the quantity we wish to maximize.




Figure 9: Projecting a point on a line


Let’s return to the spreadsheet. We will find the slope of the line, denoted by m, that maximizes the signal by trial and error. Enter a value in Cell Y9. The spreadsheet will calculate the sum of the distances in Cell Z9. Record both the slope and the sum in a row in the array Y17:AA26. The pair of values will be graphed in the figure to the right. Repeat with a different value for m and find the value of m that maximizes the sum of the distances from the projections of each of the points on the line to the origin.
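
The trial-and-error search can also be automated. The following Python sketch tries a grid of slopes and keeps the one with the largest sum of distances, mirroring the criterion used in the spreadsheet. The function names, the grid of slopes from -3 to 3 in steps of 0.01, and the four points are hypothetical. Note, as an aside, that standard PCA software maximizes the sum of squared distances (the variance), so its answer can differ slightly from the slope found with the sum of distances.

```python
import math

def sum_of_projection_distances(points, m):
    """Sum over all points of |ax + by| for the line through the origin with slope m."""
    length = math.sqrt(1.0 + m * m)
    a, b = 1.0 / length, m / length
    return sum(abs(a * x + b * y) for x, y in points)

def best_slope(points, slopes):
    """Slope from the candidate list that maximizes the sum of distances."""
    return max(slopes, key=lambda m: sum_of_projection_distances(points, m))

# Hypothetical standardized data points (illustration only).
points = [(-1.2, -0.9), (-0.4, -0.6), (0.5, 0.3), (1.1, 1.2)]
# Candidate slopes from -3.00 to 3.00 in steps of 0.01 (a vertical line is not covered).
slopes = [k / 100 for k in range(-300, 301)]
m_best = best_slope(points, slopes)
print(m_best, round(sum_of_projection_distances(points, m_best), 4))
```

Plotting the sum against the slope, as the spreadsheet does, shows a single peak for data like these, which is why a coarse manual search followed by a finer one works well.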

Interpreting Visual Representations of PCA: Scree Plot and Biplot

The microarray data are a cloud in a 9-dimensional space, one dimension for each time point. PCA rotates the axes to reduce redundancy in the data and thus maximize the signal, as measured by the variance. Specifically, in the new coordinate system, the data are uncorrelated. The axes are ordered so that the first axis explains the largest amount of variation, the second axis the second largest amount of variation, and so on. The percentage of variation that is explained by each axis is summarized in a scree plot (Figure 10). The shape of the graph explains the name: A “scree” is “a steep mass of loose rock on


the slope of a mountain” (Random House Webster’s College Dictionary, 1991).

Figure 10: Scree plot: The bar chart shows the variance explained by each of the first seven principal components. Note that the components are ordered from largest to smallest variance explained. The line graph shows the cumulative variance explained by the first n components.

The first two principal components already explain about 72% of the variation; the first three principal components explain 80% of the variation.
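
The numbers behind a scree plot are simple to compute once the variance along each principal component is known. The Python sketch below turns a list of component variances into the percentages shown by the bars and the cumulative percentages shown by the line. The function name explained_variance and the nine variances are hypothetical; they are not the values behind Figure 10.

```python
def explained_variance(variances):
    """Percentage of total variance per component and the running cumulative percentage."""
    total = sum(variances)
    percent = [100.0 * v / total for v in variances]
    cumulative = []
    running = 0.0
    for p in percent:
        running += p
        cumulative.append(running)
    return percent, cumulative

# Hypothetical component variances, sorted from largest to smallest (not the module's data).
variances = [4.5, 2.0, 0.9, 0.6, 0.4, 0.3, 0.15, 0.1, 0.05]
percent, cumulative = explained_variance(variances)
for k, (p, c) in enumerate(zip(percent, cumulative), start=1):
    print(f"PC{k}: {p:5.1f}%  cumulative {c:5.1f}%")
```

The cumulative column is what one reads off when deciding how many components to keep.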


Matlab has a function that calculates the coordinates of the principal components. If the data are in the array X, then the function is given by the following expression

[COEFF,SCORE] = princomp(X)

The array COEFF contains the coordinates of the principal components, called loadings (Table 3). The array SCORE contains the coordinates of the data in the principal component coordinate system (not shown). In more recent Matlab releases, the function pca serves the same purpose.
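
To give a feel for what such a function computes, here is a minimal Python sketch that finds only the first principal component. It is not Matlab's implementation: it assumes centered data, builds the sample covariance matrix, and uses power iteration, a simple technique swapped in here for illustration. All function names and the small data array X are hypothetical. The sign of the resulting loadings is arbitrary, as it is in any PCA.

```python
import math

def center_columns(rows):
    """Subtract each column's mean so every column has mean 0."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]

def covariance_matrix(centered):
    """Sample covariance matrix (p x p) of centered data."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_loadings(cov, iterations=1000):
    """Approximate the first principal component (unit vector) by power iteration."""
    p = len(cov)
    v = [1.0 / (j + 1) for j in range(p)]  # arbitrary nonzero starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def scores(centered, loadings):
    """Coordinate of each data point along the given component."""
    return [sum(x * c for x, c in zip(r, loadings)) for r in centered]

# Hypothetical data: four observations (rows) of three variables (columns).
X = [[2.0, 1.9, 0.5], [1.0, 1.2, 0.4], [-1.0, -0.8, 0.6], [-2.0, -2.3, 0.5]]
Xc = center_columns(X)
pc1 = first_loadings(covariance_matrix(Xc))
print([round(c, 3) for c in pc1])
print([round(s, 3) for s in scores(Xc, pc1)])
```

In this sketch, pc1 plays the role of the first column of COEFF and the list returned by scores the role of the first column of SCORE.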



Table 3: The coordinates of the principal components

We see from the table that the first five coordinates of the first principal component, corresponding to the first five time points, are negative, whereas the last four coordinates of the first principal component, corresponding to the last four time points, are positive. This suggests a straightforward biological meaning of the first principal component: data points whose first coordinate in the principal component coordinate system is negative correspond to genes that are expressed early in development, whereas data points whose first coordinate in the principal component coordinate system is positive correspond to genes that are expressed late in development.

We plot the data in a two-dimensional plot where the horizontal axis is the first principal component and the vertical axis is the second principal component. We include in this graph the original axes projected onto this two-dimensional space (Figure 11). This graph is called a biplot. The original axes are labeled with the day of postnatal development, 1-60. We see that the axes with early days (1, 3, 5, 7, and 10) point toward the left, whereas the axes with later days (15, 21, 30, and 60) point toward the right.




Figure 11: Biplot

We can select a few genes from the data set (Figure 12). We label the data points with the ID number, and collect the normalized expression data in a table (Table 4).

Figure 12: Scatterplot in the principal component coordinate system with selected genes.



Table 4: Normalized expression data of the genes selected in Figure 12. Each column PNd.b gives the Mouse Cereb Signal on postnatal day d.

My ID       PN1.b      PN3.b      PN5.b      PN7.b     PN10.b     PN15.b     PN21.b     PN30.b     PN60.b
707       0.22355   0.611095   0.531138   1.117621   0.671159   0.467248   -0.73288   -0.86716   -2.02177
1333     0.092538    0.79672   1.127559   0.981094   0.267147   0.500343   -1.05047   -1.27964    -1.4353
1082     0.147645   1.047918   1.036735    0.57211   0.871524   -0.04908   -1.37026   -0.74602   -1.51057
2192     0.430371   1.170309   1.233016   0.686986   0.339688   -0.39446   -0.82714   -1.40934   -1.22942
779      1.463528   0.663629   1.094743   0.143908    0.15888   -0.21999   -0.90073   -1.66243   -0.74154
2289     -1.19157   0.208326   -1.43718   -0.33021   -0.84805   0.404156   1.274331   0.882056   1.038146
599      -0.90043   -1.16141   -1.10639   -0.89283   0.487834   0.597347    0.66823   0.913845   1.393797
1143     -1.17137   -1.42696   -0.66094   -0.20431   0.165347   0.038678   0.914546   0.673355   1.671659
2147     -0.55156   -1.32826   -0.73429   -0.53381   -0.54718   0.457644    0.82625   0.495358   1.915844
128      -0.50911   -0.88517   -0.58053   -0.45867   -0.51553   0.009286   -0.47928   1.895853   1.523168


If we plot the expression data as a function of time (Figure 13), we confirm that the first five genes are expressed early in development and the last five genes late in development.

Figure 13: Gene expression profiles
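
The early-versus-late pattern can also be checked numerically from the values in Table 4. The Python sketch below compares, for each gene, the mean normalized expression over the first five time points (days 1 to 10) with the mean over the last four (days 15 to 60). The rule "early if the first mean is larger" and the function name early_minus_late are assumptions made for this illustration; they are not part of the original analysis.

```python
# Normalized expression values from Table 4
# (time points P1, P3, P5, P7, P10, P15, P21, P30, P60).
table4 = {
    707:  [0.22355, 0.611095, 0.531138, 1.117621, 0.671159, 0.467248, -0.73288, -0.86716, -2.02177],
    1333: [0.092538, 0.79672, 1.127559, 0.981094, 0.267147, 0.500343, -1.05047, -1.27964, -1.4353],
    1082: [0.147645, 1.047918, 1.036735, 0.57211, 0.871524, -0.04908, -1.37026, -0.74602, -1.51057],
    2192: [0.430371, 1.170309, 1.233016, 0.686986, 0.339688, -0.39446, -0.82714, -1.40934, -1.22942],
    779:  [1.463528, 0.663629, 1.094743, 0.143908, 0.15888, -0.21999, -0.90073, -1.66243, -0.74154],
    2289: [-1.19157, 0.208326, -1.43718, -0.33021, -0.84805, 0.404156, 1.274331, 0.882056, 1.038146],
    599:  [-0.90043, -1.16141, -1.10639, -0.89283, 0.487834, 0.597347, 0.66823, 0.913845, 1.393797],
    1143: [-1.17137, -1.42696, -0.66094, -0.20431, 0.165347, 0.038678, 0.914546, 0.673355, 1.671659],
    2147: [-0.55156, -1.32826, -0.73429, -0.53381, -0.54718, 0.457644, 0.82625, 0.495358, 1.915844],
    128:  [-0.50911, -0.88517, -0.58053, -0.45867, -0.51553, 0.009286, -0.47928, 1.895853, 1.523168],
}

def early_minus_late(profile, n_early=5):
    """Mean of the first n_early time points minus the mean of the remaining ones."""
    early, late = profile[:n_early], profile[n_early:]
    return sum(early) / len(early) - sum(late) / len(late)

# Assumed rule for illustration: a positive difference means "early", otherwise "late".
labels = {gene: ("early" if early_minus_late(p) > 0 else "late")
          for gene, p in table4.items()}
for gene, label in labels.items():
    print(gene, label)
```

With this rule, the first five genes in the table come out as early and the last five as late, in line with the profiles described above.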




In Class Experiments

1. Go to http://www.cs.mcgill.ca/~sqrt/dimr/dimreduction.html and run the applets. Can you assign meaning to the principal components?

2. To research applications of PCA in different knowledge domains, Google the following search terms and click on “Images.” Look for images that convey interpretable information.

A. Principal component analysis ecology
B. Principal component analysis genomics
C. Principal component analysis sociology

3. REMBRANDT is a REpository for Molecular BRAin Neoplasia DaTa. On their website, https://caintegrator.nci.nih.gov/rembrandt/, they state

    REpository for Molecular BRAin Neoplasia DaTa (REMBRANDT) is a robust bioinformatics knowledgebase framework that leverages data warehousing technology to host and integrate clinical and functional genomics data from clinical trials involving patients suffering from Gliomas. The knowledge framework will provide researchers with the ability to perform ad hoc querying and reporting across multiple data domains, such as Gene Expression, Chromosomal aberrations and Clinical data.

    Scientists will be able to answer basic questions related to a patient or patient population and view the integrated data sets in a variety of contexts. Tools that link data to other annotations such as cellular pathways, gene ontology terms and genomic information will be embedded.

Download the User Guide and follow instructions to register for an account. Once you have an account, you can do data analyses of microarray data from patients suffering from gliomas. One of the tools available is PCA.

You may find the following website useful for identifying genes that are involved in cancer: The KEGG Pathway Database, http://www.genome.jp/kegg/pathway.html, has a page dedicated to gliomas: http://www.genome.jp/kegg/pathway/hsa/hsa05214.html. You can find the genes that are important in the development of gliomas in the diagram (Figure 14). Another site where you can find information on glioma pathways is https://www.qiagen.com/geneglobe/pathwayview.aspx?pathwayID=347

(Note that gliomas are distinct from medulloblastomas. Medulloblastomas were first classified as gliomas. Now, they are considered a distinct group of primitive neuroectodermal tumors.)




Figure 14: Pathway map for human gliomas from http://www.genome.jp/kegg/pathway/hsa/hsa05214.html