Introduction to Biostatistical Analysis Using R
Statistics course for first-year PhD students
Lecturer: Lorenzo Marini, PhD
Department of Environmental Agronomy and Crop Production,
University of Padova, Viale dell’Università 16, 35020 Legnaro, Padova.
E-mail: lorenzo.marini@unipd.it
Tel.: +39 0498272807
http://www.biodiversity-lorenzomarini.eu/
Session 5
Lecture: Multivariate analysis of ecological data
Practical: Assessment exercises
Type of analyses
Univariate analysis: one response (Y)
Variables:
One response variable (Y) (e.g. Y = hormone concentration)
One or more explanatory variables (X_i) (e.g. N, pH, Temp)
Multivariate analysis: several responses (Y_1, Y_2, Y_3, Y_4, Y_5)
Variables:
More than one response variable (Y_i) (e.g. Y_i = abundance of 5 species in 6 plots, or DNA sequences in different individuals)
One or more explanatory variables (X_i) (e.g. N, pH, Temp)
MULTIVARIATE ANALYSES
Response matrix: objects (object 1 … object n) in rows, responses (Response 1 … Response n) in columns
Explanatory matrix: the same objects in rows, predictors (Predictor 1 … Predictor n) in columns
No explanatory matrix:
1. CLASSIFICATION (Cluster Analysis)
2. Unconstrained ORDINATIONS (PCA, CA…)
Explanatory matrix available (Yes):
3. Constrained ORDINATIONS (RDA, CCA…)
Distance - dissimilarity
The most natural dissimilarity measure is the Euclidean distance (distance in species space: each species is an axis).
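Dissimilarity
[Figure: a table of objects (object 1 … object n) by species (Sp 1 …) is converted into a square object-by-object dissimilarity matrix, with one value for each possible pair of objects]
Euclidean distance between objects j and k, summing over species i:
$$d_{jk} = \left[ \sum_i (x_{ij} - x_{ik})^2 \right]^{0.5}$$
Example from the figure:
d(object 1, object 2) = 2
d(object 2, object 3) = 6
d(object 1, object 3) = 5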
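The Euclidean distance above can be checked on a small table. This is a minimal sketch in base R: the abundance values and the names `x`, `euclid` and `d_mat` are invented for illustration and are not the data behind the figure.

```r
# Hypothetical abundance table: 3 objects (rows) x 2 species (columns).
x <- rbind(object1 = c(sp1 = 0, sp2 = 0),
           object2 = c(sp1 = 3, sp2 = 4),
           object3 = c(sp1 = 6, sp2 = 8))

# Euclidean distance written out from the formula:
# square root of the sum over species of the squared differences.
euclid <- function(a, b) sqrt(sum((a - b)^2))
d12_manual <- euclid(x["object1", ], x["object2", ])

# The same quantity for every pair of objects, using base R's dist().
d <- dist(x, method = "euclidean")
d_mat <- as.matrix(d)
print(d)
```

`dist()` returns only the lower triangle because the matrix is symmetric: there is one value for each possible pair of objects.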
Dissimilarity
There are many different dissimilarity indices, e.g.:
Jaccard index
Manhattan
Bray-Curtis
Morisita
1 - correlation…
The choice of the method is a source of subjectivity.
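To see why the choice of index matters, several of the indices listed above can be computed for the same pair of plots. This is a sketch with made-up abundances; the formulas are the usual textbook definitions (Jaccard is computed here on presence/absence), written out by hand so that no add-on package is assumed.

```r
# Hypothetical abundances of 4 species in two plots.
plot_a <- c(10, 0, 5, 1)
plot_b <- c(2, 3, 5, 0)

# Euclidean and Manhattan distances.
euclidean <- sqrt(sum((plot_a - plot_b)^2))
manhattan <- sum(abs(plot_a - plot_b))

# Bray-Curtis: summed absolute differences over summed abundances.
bray_curtis <- sum(abs(plot_a - plot_b)) / sum(plot_a + plot_b)

# Jaccard dissimilarity on presence/absence:
# 1 - (species shared by both plots) / (species present in at least one).
pres_a <- plot_a > 0
pres_b <- plot_b > 0
jaccard <- 1 - sum(pres_a & pres_b) / sum(pres_a | pres_b)

# Correlation turned into a dissimilarity.
one_minus_cor <- 1 - cor(plot_a, plot_b)

print(c(euclidean = euclidean, manhattan = manhattan,
        bray_curtis = bray_curtis, jaccard = jaccard,
        one_minus_cor = one_minus_cor))
```

The same two plots get quite different dissimilarity values depending on the index, which is the subjectivity the slide refers to.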
CLASSIFICATION: Hierarchical clustering
Aim: Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common traits, often proximity according to some defined distance measure.
Hierarchical clustering builds (agglomerative) or breaks up (divisive) a hierarchy of clusters. Agglomerative algorithms begin at the leaves of the tree (each object is its own cluster), whereas divisive algorithms begin at the root (all objects in one cluster). (In the figure, the arrows indicate an agglomerative clustering.)
Two steps:
1. Distance matrix
2. Clustering method
High subjectivity in both step 1 and step 2.
[Figure: dendrogram; axis: increasing dissimilarity]
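The two steps (distance matrix, then clustering method) map directly onto two base R calls. The data below are invented so that two groups are obvious. The choice of Euclidean distance and of average versus single linkage is exactly the kind of subjective decision mentioned above, not a recommendation.

```r
# Hypothetical data: 6 plots x 2 species, forming two obvious groups.
x <- rbind(plot1 = c(1, 2), plot2 = c(2, 1), plot3 = c(1, 1),
           plot4 = c(9, 8), plot5 = c(8, 9), plot6 = c(9, 9))
colnames(x) <- c("sp1", "sp2")

# Step 1: distance matrix (one subjective choice).
d <- dist(x, method = "euclidean")

# Step 2: agglomerative clustering method (a second subjective choice).
hc_avg <- hclust(d, method = "average")
hc_single <- hclust(d, method = "single")

# Cut the tree into two groups and inspect the membership.
groups <- cutree(hc_avg, k = 2)
print(groups)

# plot(hc_avg) draws the dendrogram; the height axis is the dissimilarity
# at which clusters are merged.
```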
CLASSIFICATION: Hierarchical clustering
Main biostatistical applications:
Classification of plant and animal communities into types
Phylogeny (old methods)
Bioinformatics (e.g. in sequence analysis, clustering is used to group homologous sequences into gene families)
Geographical applications:
Imaging (image segmentation)
ORDINATION
Definition
In multivariate analysis, ordination is a method complementary to data clustering, and used mainly in exploratory data analysis (rather than in hypothesis testing). Ordination orders objects that are characterized by values on multiple variables (i.e., multivariate objects) so that similar objects are near each other and dissimilar objects are farther from each other. These relationships between the objects, on each of several axes (one for each variable), are then characterized numerically and/or graphically.
UNCONSTRAINED ORDINATION
Principal Components Analysis (PCA)
PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
PCA assumes that species have linear species response curves.
Unconstrained ordinations maximize the variation explained by the axes.
They order objects according to their traits only: NO explanatory variables are used.
UNCONSTRAINED ORDINATION
Correspondence Analysis (CA)
Correspondence Analysis (as well as its derivatives) represents species AND samples as occurring in a postulated environmental space, or ordination space.
CA assumes that species have unimodal species response curves.
[Figure: linear approximation of a unimodal response over a short and a long gradient]
PCA
PCA is theoretically the optimal linear scheme, in terms of least mean square error, for compressing a set of high-dimensional vectors into a set of lower-dimensional vectors and then reconstructing the original set.
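Original variables: $x_1, x_2, x_3, \dots, x_n$
Components:
$y_1 = a_1 x_1 + a_2 x_2 + \dots + a_n x_n$, with $\lambda_1 = \text{variance}_1$
$y_2 = b_1 x_1 + b_2 x_2 + \dots + b_n x_n$, with $\lambda_2 = \text{variance}_2$
$y_3 = c_1 x_1 + c_2 x_2 + \dots + c_n x_n$, with $\lambda_3 = \text{variance}_3$
Total variance $= \sum_i \lambda_i$ (equal to 1 when the eigenvalues are expressed as proportions of the total variance)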
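The link between components, coefficients and eigenvalues can be verified numerically. The sketch below uses simulated data (the variables `x1` to `x3` and their relationships are assumptions made for the example) and base R's `prcomp()`.

```r
set.seed(42)
n <- 50

# Hypothetical data: x2 is correlated with x1, x3 is independent noise.
x1 <- rnorm(n)
X <- cbind(x1 = x1,
           x2 = 0.8 * x1 + rnorm(n, sd = 0.5),
           x3 = rnorm(n))

# PCA on the centred (not standardized) variables.
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Eigenvalues: the variance carried by each component.
lambda <- pca$sdev^2

# Coefficients a_1 ... a_n of the first component, and the component itself
# as a linear combination of the centred original variables.
a <- pca$rotation[, 1]
y1 <- scale(X, center = TRUE, scale = FALSE) %*% a

# The eigenvalues add up to the total variance of the original variables.
total_var <- sum(apply(X, 2, var))
print(rbind(lambda = lambda, proportion = lambda / total_var))
```

Dividing each eigenvalue by `total_var` gives proportions that sum to 1.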
UNCONSTRAINED ORDINATION: PCA
              Inertia  Rank
Total           14.78
Unconstrained   14.78    23
Inertia is total variance
Eigenvalues for unconstrained axes:
 PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   …  PC23
5.20  2.94  1.27  1.22  0.97  0.64  0.53  0.47   …    ..
UNCONSTRAINED ORDINATION: PCA
Example with 23 variables (e.g. 23 species in 50 sites)
E.g. PC1 explained variation = $\lambda_1 / \sum_i \lambda_i$
[Figure: bar chart of explained variation (R^2) for PC1 to PC8; vertical axis from 0 to 40]
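As a worked example of the ratio above, the explained variation can be recomputed from the eigenvalues and total inertia reported in the PCA output (5.20, 2.94, … and 14.78). Only the first eight eigenvalues are listed in the slides, so the sketch is limited to those. The object names are illustrative.

```r
# Eigenvalues of the first eight axes and total inertia, as reported
# in the PCA output of the 23-variable example.
eig <- c(PC1 = 5.20, PC2 = 2.94, PC3 = 1.27, PC4 = 1.22,
         PC5 = 0.97, PC6 = 0.64, PC7 = 0.53, PC8 = 0.47)
total_inertia <- 14.78

# Explained variation of each axis, in percent of the total inertia.
explained <- 100 * eig / total_inertia
print(round(explained, 1))

# Cumulative explained variation along the axes.
print(round(cumsum(explained), 1))
```

With these values PC1 accounts for about 35% of the total variance and PC2 for about 20%.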
UNCONSTRAINED ORDINATION: PCA
Main biostatistical applications:
1. Reduction of a set of intercorrelated predictors to a smaller set of independent variables in multiple regression
For example, two situations in regression where principal components may be useful are (1) if the number of explanatory variables is large relative to the number of observations, a test may be ineffective or even impossible (e.g. biometry), and (2) if the explanatory variables are highly correlated, the estimates of regression coefficients may be unstable. In such cases, the explanatory variables can be reduced to a smaller number of principal components that will yield a better test or more stable estimates of the regression coefficients.
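A minimal sketch of this use of PCA, on simulated data: three strongly correlated predictors are replaced by their first two principal components before fitting the regression. The variable names, the simulated relationships and the decision to keep two components are all assumptions made for the example.

```r
set.seed(1)
n <- 40

# Hypothetical predictors that are almost copies of each other.
base <- rnorm(n)
X <- cbind(x1 = base,
           x2 = base + rnorm(n, sd = 0.1),
           x3 = base + rnorm(n, sd = 0.1))
y <- 2 * base + rnorm(n)

# Regression on the raw, highly correlated predictors:
# the individual coefficients tend to be unstable.
fit_raw <- lm(y ~ X)

# PCA of the standardized predictors; the component scores are uncorrelated.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
pc_scores <- pca$x[, 1:2]
cor_pc <- cor(pca$x)

# Regression on the first two components instead of the three predictors.
fit_pc <- lm(y ~ pc_scores)

print(round(cor(X), 3))
print(summary(fit_raw)$coefficients)
print(summary(fit_pc)$coefficients)
```

Comparing the standard errors in the two coefficient tables shows the gain in stability. The price is that the components are less directly interpretable than the original variables.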
2. Indirect gradient analysis
In the analysis of community data, we can run a multiple regression of the PCA axes on some explanatory variables to explain the change in species composition.
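Indirect gradient analysis can be sketched in two lines once the ordination is available: extract the site scores on an axis and regress them on the environmental variables. The community below is simulated along a hypothetical pH gradient. A single predictor is used to keep the example short.

```r
set.seed(2)
n <- 30

# Hypothetical environmental gradient and species responding linearly to it.
pH <- runif(n, 4, 8)
species <- cbind(sp1 = 2 * pH + rnorm(n),
                 sp2 = -1.5 * pH + rnorm(n),
                 sp3 = pH + rnorm(n),
                 sp4 = rnorm(n))

# Step 1: unconstrained ordination of the species matrix only.
pca <- prcomp(species, center = TRUE, scale. = FALSE)
pc1 <- pca$x[, 1]

# Step 2: regression of the first ordination axis on the environment.
fit <- lm(pc1 ~ pH)
r2 <- summary(fit)$r.squared
print(r2)
```

Here the main axis of compositional change is, by construction, the pH gradient, so the regression explains most of the variation in the PC1 scores. With real data the main axis may be unrelated to the measured variables.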
UNCONSTRAINED VS. CONSTRAINED
If you have both the environmental data and the species composition, you can either calculate the unconstrained ordination first and then regress the ordination axes on the measured environmental variables, or calculate the constrained ordination directly.
The approaches are complementary and both should be used!
By calculating the unconstrained ordination first, you surely do not miss the main part of the variability in species composition, but you could miss the part of the variability that is related to the measured environmental variables.
By calculating the constrained ordination, you surely do not miss the main part of the variability explained by the environmental variables, but you could miss the main part of the variability that is not related to the measured environmental variables.
CONSTRAINED ORDINATION: RDA
RDA: response matrix (objects 1 … n in rows, Response 1 … Response n in columns) + explanatory matrix (objects 1 … n in rows, Predictor 1 … Predictor n in columns)
Multiple regression: response vector (objects 1 … n in rows, Response 1 only) + explanatory matrix (objects 1 … n in rows, Predictor 1 … Predictor n in columns)
Think of RDA as a linear multiple regression model with several response variables.
CONSTRAINED ORDINATION: RDA
The unconstrained ordination axes correspond to the
directions of the greatest variability within the data set.
The constrained ordination axes correspond to the
directions of the greatest variability of the data set that
can be explained by the environmental variables
There are as many constrained axes as there are
independent explanatory variables
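One way to make these statements concrete is to build an RDA by hand: regress every response on the explanatory variables, then run a PCA on the fitted values (constrained axes) and on the residuals (unconstrained axes). This is only an illustrative sketch in base R on simulated data. It is not the routine used to produce the course output, and all object names are hypothetical.

```r
set.seed(1)
n <- 30

# Hypothetical explanatory matrix (2 variables) and response matrix (4 species).
env <- cbind(N = rnorm(n), pH = rnorm(n))
Y <- cbind(sp1 = 2 * env[, "N"] + rnorm(n),
           sp2 = -env[, "pH"] + rnorm(n),
           sp3 = env[, "N"] + env[, "pH"] + rnorm(n),
           sp4 = rnorm(n))

# Centre the responses and fit one linear regression per species at once.
Yc <- scale(Y, center = TRUE, scale = FALSE)
fit <- lm(Yc ~ env)

# Constrained axes: PCA of the fitted values (variation explained by env).
eig_constrained <- prcomp(fitted(fit))$sdev^2

# Unconstrained axes: PCA of the residuals (variation not explained by env).
eig_unconstrained <- prcomp(residuals(fit))$sdev^2

# Total variance splits into a constrained and an unconstrained part.
total_var <- sum(apply(Yc, 2, var))
n_constrained_axes <- sum(eig_constrained > 1e-10 * total_var)

print(n_constrained_axes)
print(sum(eig_constrained) / total_var)  # proportion explained by env
```

With two independent explanatory variables only two constrained eigenvalues are non-zero, however many species are in the response matrix.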
How to choose our explanatory variables?
Can we test them?
We can use a pseudo-F test based on Monte Carlo permutations. Its significance is
$$P = \frac{1 + \text{no. of permutations where } F_{\text{permuted}} \ge F_{\text{real data}}}{1 + \text{total no. of permutations}}$$
MONTE CARLO PERMUTATION
1. Run the RDA on the real data (response matrix and explanatory matrix with objects 1 to 7 in the same order): RDA real, F value = 10.
2. Keep the response matrix fixed and shuffle the objects of the explanatory matrix (e.g. object order 3, 2, 4, 5, 1, 7, 6), then rerun the RDA: first permutation, RDA 1, F1 value = 1.4.
3. Repeat n times to get n F values.
4. Compute the significance of the pseudo-F from the n permuted F values.
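The permutation scheme can be sketched in base R as follows. The data are simulated, and the helper `pseudo_F()` is a hypothetical function written for this example. It uses a common definition of the pseudo-F (explained variation per explanatory variable over residual variation per residual degree of freedom), which is assumed here rather than taken from the slides.

```r
set.seed(123)
n <- 30

# Hypothetical explanatory matrix and response matrix with a strong signal.
env <- cbind(N = rnorm(n), pH = rnorm(n))
Y <- cbind(sp1 = 3 * env[, "N"] + rnorm(n, sd = 0.5),
           sp2 = -2 * env[, "pH"] + rnorm(n, sd = 0.5),
           sp3 = rnorm(n))

# Pseudo-F: (explained sum of squares / q) / (residual sum of squares / (n - q - 1)),
# where q is the number of explanatory variables (assumed linearly independent).
pseudo_F <- function(Y, X) {
  Yc <- scale(Y, center = TRUE, scale = FALSE)
  fit <- lm(Yc ~ X)
  q <- ncol(X)
  ss_fit <- sum(fitted(fit)^2)
  ss_res <- sum(residuals(fit)^2)
  (ss_fit / q) / (ss_res / (nrow(Y) - q - 1))
}

# F value of the real data.
F_real <- pseudo_F(Y, env)

# Shuffle the rows of the explanatory matrix, keep the responses fixed,
# and recompute F for each permutation.
n_perm <- 199
F_perm <- replicate(n_perm, pseudo_F(Y, env[sample(n), , drop = FALSE]))

# Significance: proportion of F values at least as large as the real one,
# counting the real data as one of the arrangements.
p_value <- (1 + sum(F_perm >= F_real)) / (1 + n_perm)
print(c(F_real = F_real, p_value = p_value))
```

The smallest P value that can be obtained is 1 / (1 + number of permutations), so the number of permutations sets the resolution of the test.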
CONSTRAINED ORDINATION: CCA
We can apply the same approach using Canonical Correspondence Analysis (CCA).
The only difference is the underlying unimodal species response.
How to prepare the report
Abstract layout:
A4
Font size 11
Margins (2.5 cm)
Times New Roman
Word count: no more than 1000 words
Lines numbered
Double line spacing
1 figure with caption and/or 1 table with caption
Four sections:
1. Title: give a title to your study
2. Introduction: just set the aims of the study
3. Material and Methods: explain the sampling & statistical analysis performed
4. Results and Discussion: present the results with 1 figure and/or 1 table and
discuss briefly.
The report is composed of two parts: abstract + R script
How to prepare the report: R script
Write down the script used to perform the analysis on a separate
page.
Include everything you used.
You can find all you need in the single practical we have done.
If you are in trouble, look at these books to see how to run your analysis (http://cran.r-project.org/):
"Practical Regression and Anova using R" by Julian Faraway
"Statistics Using R with Biological Examples" by Kim Seefeld and Ernst Linder
"An Introduction to R: Software for Statistical Modelling & Computing" by Petra Kuhnert and Bill Venables
Topic: multiple regression or ANOVA