# Introduction to Biostatistical Analysis Using R

Nov 8, 2013

Statistics course for first-year PhD students

Lecturer: Lorenzo Marini, PhD

Department of Environmental Agronomy and Crop Production

E-mail: lorenzo.marini@unipd.it

Tel.: +39 0498272807

http://www.biodiversity-lorenzomarini.eu/

Session 5

Lecture: Multivariate analysis of ecological data

Practical: Assessment exercises

Type of analyses

Univariate analysis

Variables:

- One response variable ($Y$) (e.g. $Y$ = hormone concentration)
- One or more explanatory variables ($X_i$) (e.g. N, pH, Temp)

Multivariate analysis

Variables:

- More than one response variable ($Y_i$) (e.g. $Y_i$ = abundance of 5 species in 6 plots, or DNA sequences in different individuals)
- One or more explanatory variables ($X_i$) (e.g. N, pH, Temp)

MULTIVARIATE ANALYSES

Response matrix: objects 1 to n (rows) by responses 1 to n (columns)

Explanatory matrix: objects 1 to n (rows) by predictors 1 to n (columns)

No explanatory matrix:

1. CLASSIFICATION (Cluster Analysis)

2. Unconstrained ORDINATIONS (PCA, CA…)

With an explanatory matrix:

3. Constrained ORDINATIONS (RDA, CCA…)

Distance-dissimilarity

The most natural dissimilarity measure is the Euclidean distance (distance in species space, where each species is an axis).

Dissimilarity matrix: objects 1 to n (rows) by objects 1 to n (columns), with one value for each possible pair of objects.

Euclidean distance:

$$\left[\sum_i (x_{ij} - x_{ik})^2\right]^{0.5}$$

Example for three objects:

- object 1 - object 2 = 2
- object 2 - object 3 = 6
- object 1 - object 3 = 5
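The Euclidean distance between every pair of objects can be sketched in a few lines of Python. This is only an illustration: the abundance values and the helper names (`euclidean`, `distance_matrix`) are hypothetical and do not come from the course material.

```python
import math

# Hypothetical abundances: each object (plot) is described by two species,
# so each species is one axis of the species space.
abundance = {
    "object 1": [1.0, 4.0],
    "object 2": [4.0, 0.0],
    "object 3": [1.0, 0.0],
}


def euclidean(a, b):
    """Square root of the summed squared differences over all species."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(table):
    """Return one dissimilarity value for each possible pair of objects."""
    names = list(table)
    return {
        (p, q): euclidean(table[p], table[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }


distances = distance_matrix(abundance)
for (p, q), d in distances.items():
    print(f"{p} - {q} = {d:.2f}")
```

With three objects there are three pairs, so the dissimilarity matrix holds three distinct values.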

Dissimilarity

There are many different dissimilarity indices, e.g.:

- Jaccard index
- Manhattan
- Bray-Curtis
- Morisita
- 1 - Correlation…

The choice of the method is a source of subjectivity.
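To see why the choice of index matters, the sketch below computes three of the listed indices for the same pair of plots. The abundances are hypothetical, and the Jaccard index is assumed here in its presence/absence form (one minus the share of species common to both plots); abundance-based variants also exist.

```python
# Hypothetical abundances of four species in two plots.
plot_a = [10, 0, 5, 1]
plot_b = [2, 4, 5, 0]


def manhattan(a, b):
    """Sum of absolute abundance differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def bray_curtis(a, b):
    """Summed absolute differences divided by the summed abundances."""
    return manhattan(a, b) / sum(x + y for x, y in zip(a, b))


def jaccard(a, b):
    """Presence/absence dissimilarity: 1 - shared species / all species present."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    return 1 - len(present_a & present_b) / len(present_a | present_b)


print("Manhattan  :", manhattan(plot_a, plot_b))
print("Bray-Curtis:", round(bray_curtis(plot_a, plot_b), 3))
print("Jaccard    :", jaccard(plot_a, plot_b))
```

The three indices put the same pair of plots on different scales and weight abundant species differently, which is where the subjectivity enters.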

CLASSIFICATION: Hierarchical clustering

Aim: Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common traits, often proximity according to some defined distance measure.

Hierarchical clustering builds (agglomerative), or breaks up (divisive), a hierarchy of clusters.

Agglomerative algorithms begin at the leaves of the tree (each object is its own cluster), whereas divisive algorithms begin at the root (all objects in one cluster). (In the figure, the arrows indicate an agglomerative clustering.)

Two steps:

1. Distance matrix

2. Clustering method

High subjectivity in both step 1 and step 2.

Figure axis: increasing dissimilarity.
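The two steps (distance matrix, then clustering method) can be sketched as a small agglomerative procedure in Python. The data are hypothetical, and single linkage (distance between the closest members of two clusters) is only one possible choice for step 2, assumed here for brevity.

```python
import math

# Hypothetical data: five objects described by two species abundances.
objects = {
    "A": [1.0, 1.0],
    "B": [1.5, 1.0],
    "C": [5.0, 5.0],
    "D": [5.0, 5.7],
    "E": [9.0, 1.0],
}


def euclidean(a, b):
    """Step 1: dissimilarity between two objects."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(c1, c2):
    """Step 2: distance between clusters = distance of their closest members."""
    return min(euclidean(objects[p], objects[q]) for p in c1 for q in c2)


def agglomerate(names, linkage):
    """Start from single objects and merge the two closest clusters until one remains."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest dissimilarity.
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merged = clusters[i] + clusters[j]
        merges.append((merged, d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


merges = agglomerate(list(objects), single_linkage)
for members, height in merges:
    print(f"merge {members} at dissimilarity {height:.2f}")
```

Each merge happens at a higher dissimilarity than the previous one; these merge heights are what a dendrogram draws along its dissimilarity axis.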

CLASSIFICATION: Hierarchical clustering

Main biostatistical applications:

- Classification of plant and animal communities into types
- Phylogeny (old methods)
- Bioinformatics (e.g. in sequence analysis, clustering is used to group homologous sequences into gene families)

Geographical applications:

- Imaging (image segmentation)

ORDINATION

Definition

In multivariate analysis, ordination is a method complementary to data clustering, and used mainly in exploratory data analysis (rather than in hypothesis testing). Ordination orders objects that are characterized by values on multiple variables (i.e., multivariate objects) so that similar objects are near each other and dissimilar objects are farther from each other. These relationships between the objects, on each of several axes (one for each variable), are then characterized numerically and/or graphically.

UNCONSTRAINED ORDINATION

Principal Components Analysis (PCA)

PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. PCA assumes that species have linear species response curves.

Maximization of the variation explained by the axes.

Unconstrained ordinations order objects according to traits, with NO explanatory variables.

UNCONSTRAINED ORDINATION

Correspondence Analysis (CA)

Correspondence Analysis (as well as its derivatives) represents species AND samples as occurring in a postulated environmental space, or ordination space. CA assumes that species have unimodal species response curves.

Figure: linear approximation of a unimodal response (figure labels: Short, Long).

PCA

PCA is theoretically the optimal linear scheme, in terms of least mean square error, for compressing a set of high dimensional vectors into a set of lower dimensional vectors and then reconstructing the original set.

Original variables: $x_1, x_2, x_3, \ldots, x_n$

Components:

$$
\begin{aligned}
y_1 &= a_1 x_1 + a_2 x_2 + \ldots + a_n x_n, & \lambda_1 &= \text{variance}_1 \\
y_2 &= b_1 x_1 + b_2 x_2 + \ldots + b_n x_n, & \lambda_2 &= \text{variance}_2 \\
y_3 &= c_1 x_1 + c_2 x_2 + \ldots + c_n x_n, & \lambda_3 &= \text{variance}_3
\end{aligned}
$$

Total variance $= \sum \lambda_n = 1$
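For two variables, the components and their variances can be worked out by hand, which makes the link between coefficients, eigenvalues and variance concrete. The sketch below is a minimal Python illustration on hypothetical data: it uses the closed-form eigenvalues of a 2 x 2 covariance matrix, and the unit-length scaling of the coefficients is an assumption of the sketch.

```python
import math

# Hypothetical measurements of two correlated variables on six objects.
x1 = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
x2 = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]


def mean(v):
    return sum(v) / len(v)


def covariance(u, v):
    """Sample covariance (variance when u and v are the same variable)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


# Covariance matrix [[s11, s12], [s12, s22]] of the original variables.
s11, s22, s12 = covariance(x1, x1), covariance(x2, x2), covariance(x1, x2)

# Closed-form eigenvalues of a symmetric 2 x 2 matrix (lambda1 >= lambda2).
half_trace = (s11 + s22) / 2
spread = math.sqrt(((s11 - s22) / 2) ** 2 + s12 ** 2)
lambda1, lambda2 = half_trace + spread, half_trace - spread

# Unit-length coefficients (a1, a2) of the first component y1 = a1*x1 + a2*x2.
# This form assumes s12 != 0, which holds for the data above.
norm = math.hypot(s12, lambda1 - s11)
a1, a2 = s12 / norm, (lambda1 - s11) / norm

# Scores of the objects on the first component (variables centred first).
m1, m2 = mean(x1), mean(x2)
y1 = [a1 * (p - m1) + a2 * (q - m2) for p, q in zip(x1, x2)]

print(f"lambda1 = {lambda1:.3f}, variance of y1 = {covariance(y1, y1):.3f}")
print(f"lambda1 + lambda2 = {lambda1 + lambda2:.3f}, total variance = {s11 + s22:.3f}")
```

The variance of the scores on the first component equals the first eigenvalue, and the eigenvalues add up to the total variance of the original variables.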

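UNCONSTRAINED ORDINATION: PCA

| | Inertia | Rank |
|---|---|---|
| Total | 14.78 | |
| Unconstrained | 14.78 | 23 |

Inertia is total variance.

Eigenvalues for unconstrained axes:

| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
|---|---|---|---|---|---|---|---|
| 5.20 | 2.94 | 1.27 | 1.22 | 0.97 | 0.64 | 0.53 | 0.47 |

.. up to PC23 (remaining eigenvalues not shown)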

UNCONSTRAINED ORDINATION: PCA

Example with 23 variables (e.g. 23 species in 50 sites)

E.g. PC1 explained variation $= \lambda_1 / \sum \lambda_i$

Figure: explained variation ($R^2$, axis from 0 to 40) for PC1 to PC8.

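As a worked example of this ratio, the short Python sketch below divides each eigenvalue listed in the PCA output above by the total inertia (14.78). Only the eight eigenvalues shown on the slide are used; the variable names are illustrative.

```python
# Values taken from the PCA output shown in the slides.
total_inertia = 14.78
eigenvalues = [5.20, 2.94, 1.27, 1.22, 0.97, 0.64, 0.53, 0.47]  # PC1 to PC8

# Explained variation of each axis: lambda_i divided by the sum of all lambdas.
explained = [value / total_inertia for value in eigenvalues]

# Running total, to see how much the first few axes capture together.
cumulative = []
running = 0.0
for share in explained:
    running += share
    cumulative.append(running)

for axis, (share, cum) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{axis}: {100 * share:5.1f}%  (cumulative {100 * cum:5.1f}%)")
```

PC1 accounts for about 35% of the total variance, and the first two axes together for about 55%.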
UNCONSTRAINED ORDINATION: PCA

Main biostatistical applications:

1. Reduction of a set of intercorrelated predictors to a smaller set of independent variables in multiple regression

For example, two situations in regression where principal components may be useful are (1) if the number of explanatory variables is large relative to the number of observations, a test may be ineffective or even impossible (e.g. biometry), and (2) if the explanatory variables are highly correlated, the estimates of regression coefficients may be unstable. In such cases, the explanatory variables can be reduced to a smaller number of principal components that will yield a better test or more stable estimates of the regression coefficients.

In the analysis of community data, we can run a multiple regression of the PCA axes on some explanatory variables to explain the change in species composition.

UNCONSTRAINED VS. CONSTRAINED

If you have both the environmental data and the species composition, you can either calculate the unconstrained ordination first and then regress the ordination axes on the measured environmental variables, or calculate the constrained ordination directly.

The approaches are complementary and both should be used!

By calculating the unconstrained ordination first, you are sure not to miss the main part of the variability in species composition, but you could miss the part of the variability that is related to the measured environmental variables.

By calculating the constrained ordination, you are sure not to miss the main part of the variability explained by the environmental variables, but you could miss the main part of the variability that is not related to the measured environmental variables.

CONSTRAINED ORDINATION: RDA

| Analysis | Response | Explanatory |
|---|---|---|
| RDA | Response matrix: objects 1 to n (rows) by responses 1 to n (columns) | Explanatory matrix: objects 1 to n (rows) by predictors 1 to n (columns) |
| Multiple regression | Response vector: objects 1 to n (rows), one response | Explanatory matrix: objects 1 to n (rows) by predictors 1 to n (columns) |

Think of RDA as if you were working on a linear multiple regression model.

CONSTRAINED ORDINATION: RDA

The unconstrained ordination axes correspond to the
directions of the greatest variability within the data set.

The constrained ordination axes correspond to the
directions of the greatest variability of the data set that
can be explained by the environmental variables

There are as many constrained axes as there are
independent explanatory variables

CONSTRAINED ORDINATION: RDA

How to choose our explanatory variables?

Can we test them?

We can use a pseudo-F test using Monte Carlo permutation:

$$P = \frac{1 + \text{number of permutations where } F_{\text{permuted}} \ge F_{\text{real data}}}{1 + \text{total number of permutations}}$$
MONTE CARLO PERMUTATION

Real data: response matrix (objects 1 to 7 by responses 1 to n) and predictor matrix (objects 1 to 7 by predictors 1 to n), giving RDA real with F value = 10.

First permutation: the response matrix stays fixed (objects 1 to 7), while the rows of the predictor matrix are shuffled (object order 3, 2, 4, 5, 1, 7, 6), giving RDA 1 with F1 value = 1.4.

Repeat n times.

Get n F values.

Compute the pseudo-F.
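The shuffle-and-recompute loop can be sketched in Python. To stay short, the sketch is deliberately simplified: it uses one response and one predictor, so the F value comes from a simple linear regression rather than from an RDA. The data, the function names and the P-value form (1 + permutations with F at least as large as the real one, divided by 1 + total permutations) are assumptions of the sketch.

```python
import random


def f_statistic(x, y):
    """F value of a simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_total = sum((b - my) ** 2 for b in y)
    ss_model = sxy ** 2 / sxx        # variation explained by x
    ss_resid = ss_total - ss_model   # unexplained variation
    return ss_model / (ss_resid / (n - 2))


def permutation_p_value(x, y, n_perm=999, seed=1):
    """Shuffle the predictor, keep the response fixed, and compare F values."""
    rng = random.Random(seed)
    f_real = f_statistic(x, y)
    shuffled = list(x)
    n_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if f_statistic(shuffled, y) >= f_real:
            n_extreme += 1
    return f_real, (1 + n_extreme) / (1 + n_perm)


# Hypothetical data for seven objects: one predictor and one response.
predictor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
response = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8]

f_real, p_value = permutation_p_value(predictor, response)
print(f"F = {f_real:.1f}, permutation P = {p_value:.3f}")
```

Because shuffling breaks any real link between predictor and response, the permuted F values show what to expect by chance; a real F value that is rarely matched gives a small P.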

CONSTRAINED ORDINATION: CCA

We can apply the same approach using Canonical Correspondence Analysis (CCA).

The only difference is the underlying unimodal response.

How to prepare the report

The report is composed of two parts: abstract + R script

Abstract layout:

- A4
- Font 11
- Margins (2.5 cm)
- Times New Roman
- Word count: no more than 1000 words
- Lines numbered
- Double line spacing
- 1 figure with caption and/or 1 table with caption

Four sections:

1. Title: give a title to your study

2. Introduction: just set the aims of the study

3. Material and Methods: explain the sampling & statistical analysis performed

4. Results and Discussion: present the results with 1 figure and/or 1 table and discuss briefly.

How to prepare the report

R script

Write down the script used to perform the analysis on a separate page.

Include everything you used.

You can find all you need in the single practical we have done.

If you are in trouble, look at these books to see how to run your analysis (http://cran.r-project.org/):

- “Practical Regression and Anova using R” by Julian Faraway
- “Statistics Using R with Biological Examples” by Kim Seefeld and Ernst Linder
- “An Introduction to R: Software for Statistical Modelling & Computing” by Petra Kuhnert and Bill Venables

Topic: multiple regression or ANOVA