Supplementary Data - Genome Biology and Evolution


MATERIALS & METHODS


Gene family assignment

We acquired the predicted protein sequences from ten Stramenopiles:

Species                         Version  Source  URL                                                      Reference
Ectocarpus siliculosus          V1       BOGAS   http://bioinformatics.psb.ugent.be/genomes/              (Cock et al. 2010)
Aureococcus anophagefferens     V1       JGI     http://genome.jgi-psf.org/                               (Gobler et al. 2011)
Phaeodactylum tricornutum       V2       JGI     http://genome.jgi-psf.org/                               (Bowler et al. 2008)
Thalassiosira pseudonana        V3       JGI     http://genome.jgi-psf.org/                               (Armbrust et al. 2004)
Hyaloperonospora arabidopsidis  V8.3     VBI     http://vmd.vbi.vt.edu/                                   (Baxter et al. 2010)
Saprolegnia parasitica          V1       BROAD   http://www.broadinstitute.org/scientific-community/data  NA
Pythium ultimum                 V4       BROAD   http://www.broadinstitute.org/scientific-community/data  (Lévesque et al. 2010)
Phytophthora infestans          V1       BROAD   http://www.broadinstitute.org/scientific-community/data  (Haas et al. 2009)
Phytophthora ramorum            V1       BROAD   http://www.broadinstitute.org/scientific-community/data  (Tyler et al. 2006)
Phytophthora sojae              V1       BROAD   http://www.broadinstitute.org/scientific-community/data  (Tyler et al. 2006)

To define protein families, we created a sparse network of nodes (proteins) connected by edges (sequence similarity) by conducting a blastp (Altschul et al. 1990) all-vs.-all sequence similarity search (e-value cutoff 1e-3, soft filtering enabled). We subsequently removed edges that were formed between proteins due to short segments of similarity, thereby eliminating spurious connections within the network. We removed edges between proteins if the matching area was ≤ 50% or the 'actual-matching' area was ≤ 20% of either the query or the subject. The matching area is defined as the area from the start position of the first segment to the end position of the last segment, and the 'actual-matching' area as the sum of the area covered by each individual segment. Subsequently, we partitioned the resulting network into protein (gene) families using the Markov clustering algorithm (MCL) (Van Dongen 2000; Enright et al. 2002) with an inflation value of 3.0.
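The coverage filter described above can be illustrated with a minimal sketch. The function names, the representation of aligned segments as (start, end) pairs with 1-based inclusive coordinates, and the choice to sum segment lengths without merging overlapping segments are assumptions made for this example; it is not the original implementation.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end), 1-based and inclusive (assumed convention)


def matching_area(segments: List[Segment]) -> int:
    """Span from the start of the first segment to the end of the last one."""
    return max(end for _, end in segments) - min(start for start, _ in segments) + 1


def actual_matching_area(segments: List[Segment]) -> int:
    """Sum of the lengths of the individual segments (overlaps are not merged)."""
    return sum(end - start + 1 for start, end in segments)


def keep_edge(query_segments: List[Segment], subject_segments: List[Segment],
              query_length: int, subject_length: int,
              min_span: float = 0.5, min_covered: float = 0.2) -> bool:
    """Keep an edge only if both proteins exceed both coverage thresholds."""
    for segments, length in ((query_segments, query_length),
                             (subject_segments, subject_length)):
        if matching_area(segments) / length <= min_span:
            return False  # matching area is at most 50% of this protein
        if actual_matching_area(segments) / length <= min_covered:
            return False  # 'actual-matching' area is at most 20% of this protein
    return True
```

For example, two segments covering positions 1-40 and 101-160 of a 200-residue protein give a matching area of 160 residues (80%) and an 'actual-matching' area of 100 residues (50%), so that side of the edge passes both thresholds.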

Detection of transposable elements

The presence of transposable elements or their signatures within the predicted proteomes of the analyzed species was assessed using two complementary methods: (i) by screening for the presence of 86 signature domains and the MULE transposase domain within predicted proteins (Zdobnov et al. 2005) and (ii) by screening for sequences that show similarity to position-specific scoring matrices for several families of transposable elements. For (i) we predicted the domain repertoire for all proteins using hmmer3 (Eddy 1998) and a local Pfam database (v24) (Finn et al. 2010), applying a domain-model-specific cutoff (gathering cutoff). For (ii) we used TransposonPSI (Haas BJ, http://transposonpsi.sourceforge.net/) to scan the predicted proteomes for the presence of different families of transposable elements. Subsequently, we removed all families containing predicted proteins that have one or more signature domains specific for transposable elements or that exhibit similarity to transposable elements.

Phylogenetic analysis

We constructed a phylogenetic tree of the analyzed Stramenopiles using 189 families whose members occur as a single-copy gene in each of the ten species. Each set of ten single-copy genes was first aligned using mafft (Katoh et al. 2002) (v6.713b, L-INS-I algorithm) and the aligned sequences were subsequently concatenated. We removed columns with more than 80% gaps. Furthermore, we removed adjacent divergent positions both up- and downstream of the gap position until a column with a median of pairwise BLOSUM62 scores ≥ 0 was found. The resulting alignment was used to infer a maximum likelihood phylogenetic tree using RAxML (Stamatakis 2006) (v7.0.4) with a gamma model of heterogeneity and estimated alpha parameter (-PROTGAMMA) as well as a WAG amino acid substitution matrix. The robustness of the tree topology was assessed using 1,000 bootstrap replicates.
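The first trimming step described above (removal of columns with more than 80% gaps) can be sketched as follows. The function name and the representation of the alignment as a list of equal-length strings with '-' as the gap character are assumptions for this example; the subsequent BLOSUM62-based trimming of flanking positions is not covered here.

```python
from typing import List


def drop_gappy_columns(alignment: List[str], max_gap_fraction: float = 0.8,
                       gap_char: str = "-") -> List[str]:
    """Remove alignment columns in which more than max_gap_fraction of rows are gaps."""
    n_rows = len(alignment)
    n_cols = len(alignment[0])
    kept_columns = [
        col for col in range(n_cols)
        # A column is kept when its gap fraction does not exceed the threshold.
        if sum(row[col] == gap_char for row in alignment) / n_rows <= max_gap_fraction
    ]
    return ["".join(row[col] for col in kept_columns) for row in alignment]
```

With five aligned sequences, a column containing four gaps (exactly 80%) is retained under this reading, whereas a column consisting only of gaps is removed.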

To address whether a few long alignments dominate the concatenated alignment, we removed all families whose alignments (after removing gaps as outlined above) exceeded a length cutoff that was empirically defined by the length distribution of the 189 single-copy families (Additional file 15A). This length cutoff was set to the 3rd quartile + 0.5 * interquartile range of the length distribution and yielded 168 families that were subsequently concatenated. The phylogenetic tree was inferred as described above and robustness was assessed using 1,000 bootstrap replicates (Additional file 15B). The tree topology as well as the bootstrap support for the individual branches is identical to that of the tree based on the full set of 189 single-copy markers.
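The length cutoff (3rd quartile + 0.5 * interquartile range) can be computed as in the following minimal sketch. The function names and the dictionary of family lengths are hypothetical, and the quartile definition (the "inclusive" method of Python's statistics module) is an assumption, since the text does not state which quartile estimator was used.

```python
import statistics
from typing import Dict


def length_cutoff(lengths, factor: float = 0.5) -> float:
    """Return Q3 + factor * IQR of the alignment length distribution."""
    q1, _median, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    return q3 + factor * (q3 - q1)


def filter_by_length(family_lengths: Dict[str, int], factor: float = 0.5) -> Dict[str, int]:
    """Keep only families whose alignment length does not exceed the cutoff."""
    cutoff = length_cutoff(list(family_lengths.values()), factor)
    return {family: n for family, n in family_lengths.items() if n <= cutoff}
```

For alignment lengths of 100, 200, 300, 400 and 2000 columns, the quartiles are 200 and 400, the cutoff is 500, and only the 2000-column family is excluded.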

To estimate the relative divergence times of the analyzed Stramenopiles, we inferred the phylogeny including the ciliate Paramecium tetraurelia (v1.41, ParameciumDB (Arnaiz et al. 2007; Arnaiz & Sperling 2011)) (Aury et al. 2006) as an explicit outgroup. We identified 35 single-copy families in Stramenopiles and P. tetraurelia and utilized these as a concatenated marker that was prepared and analyzed as described above. The relative divergence times were estimated with BEAST (Drummond & Rambaut 2007) under a strict clock assumption and a gamma model of site heterogeneity (invariant sites + 4 gamma categories; WAG substitution matrix). We used a defined tree topology and starting branch lengths derived from the preceding maximum likelihood analysis. Furthermore, we set the age of the last common ancestor of Stramenopiles arbitrarily to 100. We ran ten independent chains, each containing 4,000,000 generations, of which we sampled every 400 generations. The resulting posterior distributions for parameter estimates were manually assessed using Tracer (v1.5). Subsequently, maximum credibility trees were calculated with TreeAnnotator (v1.6.1) after removing 10% burn-in. The estimated branch lengths were averaged over the ten chains. The probability of observing fewer than, exactly or more than the observed number of evolutionary events, given the expectation value at each individual branch, was assessed using a Poisson distribution. The expected values were estimated using the global relative frequency of duplications/losses.
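The Poisson assessment described above can be illustrated with a short sketch. The function names are hypothetical, and the expected value per branch is taken as a given input here, because the text does not spell out how the global relative frequency was scaled to individual branches.

```python
import math
from typing import Dict


def poisson_pmf(k: int, expected: float) -> float:
    """Probability of observing exactly k events under a Poisson distribution."""
    return math.exp(-expected) * expected ** k / math.factorial(k)


def poisson_probabilities(observed: int, expected: float) -> Dict[str, float]:
    """Probabilities of observing fewer than, exactly, or more than `observed` events."""
    equal = poisson_pmf(observed, expected)
    fewer = sum(poisson_pmf(k, expected) for k in range(observed))
    more = max(0.0, 1.0 - fewer - equal)  # guard against tiny negative rounding errors
    return {"fewer": fewer, "equal": equal, "more": more}
```

A branch with many more events than expected yields a small "more" probability, and a branch with far fewer events than expected yields a small "fewer" probability.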

Reconstruction of gene family evolution

We reconstructed the evolutionary history of protein families (excluding singletons) to monitor macro-evolutionary events like duplications and losses along the species tree. The sequences of the gene families were aligned using different alignment algorithms, similar to a strategy outlined by Muller and colleagues (Muller, Creevey, Thompson, Arendt & Bork 2010b). We used mafft (v6.713b; L-INS-I, E-INS-I and default parameters) (Katoh et al. 2002) and muscle (v3.7; with default parameters) (Edgar 2004) to align the protein sequences. Moreover, we corrected all alignments with rascal (Thompson et al. 2003) and subsequently assessed the alignment quality with norMD (v1.3) (Thompson et al. 2001). For each individual family, the highest-scoring alignment out of the refined and original alignments was chosen. We constructed phylogenetic trees using RAxML (Stamatakis 2006) (PROTGAMMA, WAG) for families with > 3 members, excluding families with > 500 members, and the robustness of the trees was assessed using 100 bootstrap replicates.

We reconciled the protein trees with the species tree of Stramenopiles using NOTUNG (Chen et al. 2000; Durand et al. 2006) (modified v2.6, personal communication). The trees were reconciled using a cost of 1.5 for a duplication event and 1 for a loss event. Subsequently, the tree was rooted so that the number of duplication and loss events is minimized. Furthermore, NOTUNG allows the rearrangement of the gene tree topology on weak branches to account for errors in the gene tree that would lead to bias in the derived number of evolutionary events. Weakly supported branches (bootstrap <80%) were rearranged to minimize the number of evolutionary events, while at the same time strongly supported topologies remained intact.

Furthermore, we created orthologous groups by dividing families based on duplications occurring at the last common ancestor of Stramenopiles. Consequently, each orthologous group represents a single gene either in the last common ancestor of Stramenopiles or at the point of gain. Subsequently, we used maximum parsimony to project the derived evolutionary events of all orthologous groups, including species-specific groups, on the species phylogeny of Stramenopiles.
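To make the reconciliation costs given above concrete (1.5 per duplication, 1 per loss), the following sketch scores candidate rootings and selects the cheapest one. It does not run or reproduce NOTUNG; the Rooting class, its fields and the assumption that candidate rootings are compared by their weighted event cost are illustrative only.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Rooting:
    """Hypothetical summary of one candidate rooting of a gene tree."""
    label: str
    duplications: int
    losses: int


def event_cost(duplications: int, losses: int,
               duplication_cost: float = 1.5, loss_cost: float = 1.0) -> float:
    """Weighted cost of the inferred duplication and loss events."""
    return duplications * duplication_cost + losses * loss_cost


def best_rooting(candidates: Iterable[Rooting]) -> Rooting:
    """Return the candidate rooting with the lowest weighted event cost."""
    return min(candidates, key=lambda r: event_cost(r.duplications, r.losses))
```

Under these weights, a rooting that implies one duplication and two losses (cost 3.5) is preferred over one that implies two duplications and one loss (cost 4.0).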

To assess the contribution of potential low-quality alignments to the evolutionary events, we subdivided the set of alignments (optimal score per family) into high-quality and low-quality subsets. Low-quality alignments are defined by a norMD score of <0.75 (Additional file 16A), i.e. families with a norMD score of <0.6 (127) and those that exceed this cutoff by less than 25% (477). Alignments above the norMD score cutoff of 0.6 were proposed to be of high quality and hence more reliable (Thompson et al. 2001; Muller, Creevey, Thompson, Arendt & Bork 2010b). Subsequently, we projected the evolutionary events based on the derived OGs of the high- and low-quality families onto the species phylogeny (Additional file 16B).

To further elucidate the origin of each individual orthologous group, we searched for homologs utilizing best hits identified by a blast search (e-value cutoff 1e-3; low complexity filtering enabled; query & coverage filtering as described above) against a local version of the eggNOG database (v2) (Muller, Szklarczyk, Julien, Letunic, Roth, Kuhn, et al. 2010a). If a homolog of a group could be identified in eggNOG, we assumed an origin before the last common ancestor of Stramenopiles and introduced, if necessary, subsequent losses.

Functional annotation of OGs

We projected functional annotation to each individual OG using five independent methods.

(i) COG functional classification (Tatusov et al. 1997) was assigned by identification of homologs in the eggNOG database utilizing best hits identified by blast search (e-value cutoff 1e-3; low complexity filter enabled) (Muller, Szklarczyk, Julien, Letunic, Roth, Kuhn, et al. 2010a). If a protein was consistently assigned to a functional class derived from homologs in eggNOG and the OG contained > 30% proteins with the identical classification, the functional annotation was projected to the whole OG.

(ii) The presence of potentially secreted proteins within an OG was predicted using SignalP (Bendtsen et al. 2004) (v3.0) in combination with TMHMM (v2.0) (Krogh et al. 2001). A secretion signal within the first 70 amino acids of a protein was accepted if both the neural network and the HMM consistently predicted the presence of the motif. Signal peptides were rejected if TMHMM predicted more than a single transmembrane region within the protein or a single region that overlapped with the SignalP prediction by more than 10 amino acids and was positioned within the first 35 amino acids. OGs that contain >30% proteins with a predicted secretion signal were annotated as secreted OGs.

(iii) Host-cell translocation motifs were predicted using hmmer3 (Eddy 1998) and manually created HMM profiles of the RXLR and the LXLFLAK motif (R.H.Y. Jiang, personal communication). In addition to the RXLR/LXLFLAK motif itself, we also required the presence of a predicted secretion signal within the first 30 amino acids, the gap between the RXLR/LXLFLAK motif and the secretion signal to be ≤ 50 amino acids, and the RXLR/LXLFLAK motif to start within the first 100 amino acids. OGs that contained > 30% proteins with a predicted RXLR or LXLFLAK motif were annotated.

(iv) Gene expression data of P. infestans during infection of the host were acquired from the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) (Haas et al. 2009). Differentially expressed genes were identified as described elsewhere (Haas et al. 2009; Seidl et al. 2011). OGs containing significantly differentially expressed P. infestans genes were annotated as potentially differentially expressed during host-pathogen interaction.

(v) Chloroplast-associated proteins were identified by transferring the gene ontology (GO) annotation using Blast2GO (default parameters) (Conesa et al. 2005). All proteins that received GO annotations that could be traced back to GO:0015979 (photosynthesis), GO:0009536 (plastid) or GO:0009507 (chloroplast) were annotated as chloroplast associated. OGs containing these proteins were annotated as potentially chloroplast associated. Significantly enriched GO terms of individual proteins present in OGs predicted to be lost at the LCA of Oomycetes were defined by BiNGO (version 2.44) (Maere et al. 2005). Significantly enriched GO terms were summarized by removing redundancies using REVIGO (default settings) (Supek et al. 2011).

(vi) All OGs that could not be annotated with one of the five methods described above were classified as 'unknown'.
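Several of the methods listed above project an annotation onto a whole OG when more than 30% of its proteins carry it. A minimal sketch of that rule is given below. The function name, the use of None for proteins without an annotation, and the decision to return every label that passes the threshold (the text does not say how ties between labels were handled) are assumptions for this example.

```python
from collections import Counter
from typing import List, Optional


def project_annotation(member_labels: List[Optional[str]],
                       min_fraction: float = 0.30) -> List[str]:
    """Return all labels carried by more than min_fraction of the OG members.

    member_labels holds one entry per protein in the OG; None marks a protein
    without annotation. An empty result corresponds to an 'unknown' OG.
    """
    if not member_labels:
        return []
    counts = Counter(label for label in member_labels if label is not None)
    n_members = len(member_labels)
    return sorted(label for label, n in counts.items() if n / n_members > min_fraction)
```

An OG of ten proteins in which four carry a predicted secretion signal would be annotated as secreted under this reading, whereas three out of ten (exactly 30%) would not be sufficient.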

The significant over-/under-representation of evolutionary events for individual functional classes at each branch of the phylogeny was assessed by applying a Fisher's exact test (significance level of <0.05). Multiple testing was addressed using false discovery rates calculated by the qvalue package (Storey & Tibshirani 2003), and a q-value significance level of 0.05 was applied.
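For readers unfamiliar with the test, a Fisher's exact test on a 2x2 contingency table can be sketched in plain Python as follows. The layout of the table (events of one functional class versus all other events, on one branch versus all other branches) and the two-sided definition used here are assumptions; the text does not specify them, and the q-value correction of the qvalue package is not reproduced.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    The p-value is the total hypergeometric probability of all tables with the
    same margins that are at most as probable as the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def table_probability(x: int) -> float:
        # Probability of a table whose top-left cell equals x, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    observed = table_probability(a)
    tolerance = 1 + 1e-9  # absorbs floating-point noise when comparing probabilities
    p_value = sum(p for p in map(table_probability, range(lowest, highest + 1))
                  if p <= observed * tolerance)
    return min(1.0, p_value)
```

As a usage example, the table [[8, 2], [1, 5]] gives a p-value of about 0.035, which would pass the significance level of 0.05 before correction for multiple testing.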


OGs defined as glycoside hydrolase or peptidase were annotated based on the presence of one or more signature domains acquired from the Pfam database, which contains in total 72 different Pfam domains for glycoside hydrolases and 171 for peptidases (Finn et al. 2010). We annotated OGs that contained >30% proteins that have one of these predicted signature domains as defined by hmmer3 (gathering cutoff) (Eddy 1998). Additional annotation was transferred based on identified homologs in the eggNOG database (Muller, Szklarczyk, Julien, Letunic, Roth, Kuhn, et al. 2010a) (see above).

Distribution of best blast hits

To elucidate the phylogenetic affinity of the OGs to different groups of organisms, we searched for the best blast hit (e-value cutoff 1e-3; low complexity filter enabled; query & coverage filtering as described above) of each protein that comprises the individual OG. These searches were conducted against the eggNOG database as well as individual proteomes of several algal species (the effective length of the database was fixed for the blast search): the red alga Cyanidioschyzon merolae (http://merolae.biol.s.u-tokyo.ac.jp/) and the green algae Volvox carteri (v2; http://genome.jgi-psf.org/), Ostreococcus tauri (v2; http://genome.jgi-psf.org/) and Micromonas pusilla (v3; http://genome.jgi-psf.org/). An OG was considered to have affinity to a certain group or subgroup of species if >50% of its constituent proteins consistently had their best blast hit within this group.
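The affinity rule stated above can be sketched as a simple tally of best-hit groups per OG. The function name, the group labels and the use of None for proteins without any hit are hypothetical; the mapping from a best blast hit to its taxonomic group is assumed to have been done beforehand.

```python
from collections import Counter
from typing import List, Optional


def og_affinity(best_hit_groups: List[Optional[str]],
                min_fraction: float = 0.5) -> Optional[str]:
    """Return the group that holds the best hit of more than half of the OG's proteins.

    best_hit_groups holds one entry per protein in the OG; None marks a protein
    without a best hit. Returns None if no group exceeds the threshold.
    """
    counts = Counter(group for group in best_hit_groups if group is not None)
    if not counts:
        return None
    group, n_hits = counts.most_common(1)[0]
    # Proteins without a hit still count towards the size of the OG.
    return group if n_hits / len(best_hit_groups) > min_fraction else None
```

Because the threshold is a strict majority, at most one group can qualify per OG, so no tie-breaking is required.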


REFERENCES

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.

Armbrust EV et al. 2004. The genome of the diatom Thalassiosira pseudonana: ecology, evolution, and metabolism. Science. 306:79-86.

Arnaiz O, Cain S, Cohen J, Sperling L. 2007. ParameciumDB: a community resource that integrates the Paramecium tetraurelia genome sequence with genetic data. Nucleic Acids Res. 35:D439-44.

Arnaiz O, Sperling L. 2011. ParameciumDB in 2011: new tools and new data for functional and comparative genomics of the model ciliate Paramecium tetraurelia. Nucleic Acids Res. 39:D632-6.

Aury J-M et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature. 444:171-178.

Baxter L et al. 2010. Signatures of adaptation to obligate biotrophy in the Hyaloperonospora arabidopsidis genome. Science. 330:1549-1551.

Bendtsen JD, Nielsen H, von Heijne G, Brunak S. 2004. Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol. 340:783-795.

Bowler C et al. 2008. The Phaeodactylum genome reveals the evolutionary history of diatom genomes. Nature. 456:239-244.

Chen K, Durand D, Farach-Colton M. 2000. NOTUNG: a program for dating gene duplications and optimizing gene family trees. J. Comput. Biol. 7:429-447.

Cock JM et al. 2010. The Ectocarpus genome and the independent evolution of multicellularity in brown algae. Nature. 465:617-621.

Conesa A et al. 2005. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 21:3674-3676.

Drummond AJ, Rambaut A. 2007. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol. Biol. 7:214.

Durand D, Halldórsson BV, Vernot B. 2006. A hybrid micro-macroevolutionary approach to gene tree reconstruction. J. Comput. Biol. 13:320-335.

Eddy SR. 1998. Profile hidden Markov models. Bioinformatics. 14:755-763.

Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792-1797.

Enright AJ, Van Dongen S, Ouzounis CA. 2002. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30:1575-1584.


Finn RD et al. 2010. The Pfam protein families database. Nucleic Acids Res. 38:D211-22.

Gobler CJ et al. 2011. Niche of harmful alga Aureococcus anophagefferens revealed through ecogenomics. Proc. Natl. Acad. Sci. U.S.A. 108:4352-4357.

Haas BJ et al. 2009. Genome sequence and analysis of the Irish potato famine pathogen Phytophthora infestans. Nature. 461:393-398.

Katoh K, Misawa K, Kuma K-I, Miyata T. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30:3059-3066.

Krogh A, Larsson B, von Heijne G, Sonnhammer EL. 2001. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 305:567-580.

Lévesque CA et al. 2010. Genome sequence of the necrotrophic plant pathogen Pythium ultimum reveals original pathogenicity mechanisms and effector repertoire. Genome Biol. 11:R73.

Maere S, Heymans K, Kuiper M. 2005. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics. 21:3448-3449.

Muller J, Szklarczyk D, Julien P, Letunic I, Roth A, Kuhn M, et al. 2010a. eggNOG v2.0: extending the evolutionary genealogy of genes with enhanced non-supervised orthologous groups, species and functional annotations. Nucleic Acids Res. 38:D190-5.

Muller J, Creevey CJ, Thompson JD, Arendt D, Bork P. 2010b. AQUA: automated quality improvement for multiple sequence alignments. Bioinformatics. 26:263-265.

Seidl MF, Van den Ackerveken G, Govers F, Snel B. 2011. A domain-centric analysis of oomycete plant pathogen genomes reveals unique protein organization. Plant Physiol. 155:628-644.

Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 22:2688-2690.

Storey JD, Tibshirani R. 2003. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U.S.A. 100:9440-9445.

Supek F, Bošnjak M, Škunca N, Šmuc T. 2011. REVIGO summarizes and visualizes long lists of gene ontology terms. PLoS ONE. 6:e21800.

Tatusov RL, Koonin EV, Lipman DJ. 1997. A genomic perspective on protein families. Science. 278:631-637.

Thompson JD, Plewniak F, Ripp R, Thierry JC, Poch O. 2001. Towards a reliable objective function for multiple sequence alignments. J. Mol. Biol. 314:937-951.

Thompson JD, Thierry JC, Poch O. 2003. RASCAL: rapid scanning and correction of multiple sequence alignments. Bioinformatics. 19:1155-1161.

Tyler BM et al. 2006. Phytophthora genome sequences uncover evolutionary origins and mechanisms of pathogenesis. Science. 313:1261-1266.

Van Dongen S. 2000. A cluster algorithm for graphs. Report INS-R0010, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam.

Zdobnov EM, Campillos M, Harrington ED, Torrents D, Bork P. 2005. Protein coding potential of retroviruses and other transposable elements in vertebrate genomes. Nucleic Acids Res. 33:946-954.