Analysis and comparison of very large ... - Chuan-Yih, Yu

4 Oct 2013

Analysis and comparison of very large metagenomes with fast clustering and functional annotation

Weizhong Li, BMC Bioinformatics 2009

Presented by Chuan-Yih Yu

Outline


Rapid Analysis of Multiple Metagenomes with a Clustering and Annotation Pipeline (RAMMCAP)


Goal


Methodology


Metagenome comparison


Conclusion


Discussion


Goal


Reduce computation time
  Global Ocean Sampling (GOS): 1 million CPU hours ≈ 114 years

Discover novel gene or protein families
  Metagenomic Profiling of Nine Biomes (BIOME): ~90% of sequences unknown
  GOS: doubled the number of protein families

Compare metagenome data
  Clustering-based
  Protein family-based

RAMMCAP

RNA

RAMMCAP

Meta_RNA & tRNAscan-SE

High sensitivity, low specificity (except 16S)

"Identification of ribosomal RNA genes in metagenomic fragments", Huang, Y., Gilna, P. & Li, W. Z., Bioinformatics

"tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence", Lowe, T.M. and Eddy, S.R., Nucleic Acids Res

CLUSTERING

CD-HIT

RAMMCAP

CD-HIT

Greedy incremental clustering algorithm
Avoids full pairwise alignment
  Short-word filter (word length 2~5)
  Index table

"Clustering of highly homologous sequences to reduce the size of large protein database", Weizhong Li, et al., Bioinformatics (2001)

"Tolerating some redundancy significantly speeds up clustering of large protein databases", Weizhong Li, et al., Bioinformatics (2002)

"Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li, et al., Bioinformatics (2006)
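The greedy incremental idea and the short-word filter can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not CD-HIT's actual implementation: the function names are hypothetical, the identity check is a simple ungapped position-by-position comparison standing in for a real alignment, and no index table is built. The filter uses the counting argument that each mismatch can destroy at most k words of length k, so two sequences at identity p over length L must share at least L - k + 1 - k(1 - p)L words.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def min_shared_kmers(length, k, identity):
    """Lower bound on shared k-mers for two sequences at the given identity.

    Each mismatch destroys at most k of the (length - k + 1) words.
    The small epsilon guards against floating-point rounding.
    """
    mismatches = math.floor((1.0 - identity) * length + 1e-9)
    return length - k + 1 - k * mismatches


def ungapped_identity(rep, seq):
    """Toy stand-in for an alignment: compare positions from the start."""
    n = min(len(rep), len(seq))
    matches = sum(1 for x, y in zip(rep, seq) if x == y)
    return matches / n


def greedy_cluster(seqs, identity=0.9, k=3):
    """Longest sequence first; each sequence joins the first cluster it matches."""
    clusters = []  # each entry: (representative, its k-mer counts, members)
    for seq in sorted(seqs, key=len, reverse=True):
        counts = kmer_counts(seq, k)
        need = min_shared_kmers(len(seq), k, identity)
        for rep, rep_counts, members in clusters:
            shared = sum((rep_counts & counts).values())
            if shared < need:
                continue  # filter: too few common words, skip the comparison
            if ungapped_identity(rep, seq) >= identity:
                members.append(seq)
                break  # greedy: stop at the first matching representative
        else:
            clusters.append((seq, counts, [seq]))  # seq founds a new cluster
    return [members for _, _, members in clusters]
```

The `break` on the first match is what makes the procedure greedy: a sequence never checks whether a later representative would have been a closer match.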


Limitations of CD-HIT

Evenly distributed mismatches

Greedy issue
  A sequence is grouped into the first cluster it meets

CD-HIT Performance

ORF CLUSTERING

RAMMCAP

Why Cluster ORFs
  Function studies
  Novel gene discovery

ORF Prediction
  ORF_finder
  MetaGene

ORF Prediction Performance

MetaSim
  Average read lengths of 100, 200, 400 and 800 bp, 1 million reads

True ORF (sensitivity)
  Overlaps an NCBI-annotated ORF by 30 AA

Predicted ORF (specificity)
  50% overlap with a true ORF


ORF Clustering

Run 1 clustering
  90~95% identity

Run 2 clustering
  60% identity over 80% of length (454)
  30% identity over 80% of length (Sanger)

Merge run 1 & 2 results
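The merge step can be pictured as composing two mappings. The sketch below assumes, for illustration only, that run 1 yields a map from each ORF to its run 1 representative and that run 2 clusters those representatives. The function name `merge_runs` and the dictionary layout are hypothetical and are not RAMMCAP's file format.

```python
def merge_runs(run1, run2):
    """Combine two clustering rounds into final clusters.

    run1: ORF id -> representative chosen in run 1 (high identity)
    run2: run 1 representative -> representative chosen in run 2 (low identity)
    Returns: final representative -> sorted list of all member ORF ids.
    """
    merged = {}
    for member, rep1 in run1.items():
        # A run 1 representative missing from run2 stays its own representative.
        rep2 = run2.get(rep1, rep1)
        merged.setdefault(rep2, []).append(member)
    return {rep: sorted(members) for rep, members in merged.items()}
```

For example, if run 1 groups `b` under `a` and run 2 groups `c` under `a`, the merged cluster for `a` contains `a`, `b` and `c`.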




Clustering Evaluation

Test sets
  GOS-ORF (30%), BIOME (95%), BIOME-ORF (60%)

BIOME Microbiomes & Viromes

Microbial sequences are more conserved than viral sequences.

Clustering Quality

Need conservative threshold
Use only Pfam sequences >30 AA
Discard the shorter sequence when Pfam sequences overlap
Placement into different clusters
  Sequences in the same Pfam family may be placed into different clusters.

Clustering Validation

Generate clusters whose sequences come from the same Pfam family
Minimize the number of clusters
Good clusters: >95% of members from the same Pfam family
  >97% of sequences are in good clusters
  ~30 times more than in bad clusters
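The "good cluster" criterion above can be checked with a short script. This is a minimal sketch assuming each cluster is given as a list of Pfam labels, one per member sequence. The function names and the input layout are illustrative assumptions, not part of the paper's tooling.

```python
from collections import Counter


def is_good_cluster(pfam_labels, threshold=0.95):
    """True if more than `threshold` of the members share one Pfam family."""
    if not pfam_labels:
        return False
    top_count = Counter(pfam_labels).most_common(1)[0][1]
    return top_count / len(pfam_labels) > threshold


def fraction_in_good_clusters(clusters, threshold=0.95):
    """Fraction of all sequences that sit inside good clusters."""
    total = sum(len(c) for c in clusters)
    good = sum(len(c) for c in clusters if is_good_cluster(c, threshold))
    return good / total if total else 0.0
```

Note the strict inequality: a cluster with exactly 95% of its members in one family does not count as good under this reading of ">95%".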

[Figure axes: Number of sequences, Number of clusters, Cluster Size]

RAMMCAP

Protein Family Annotation

Pfam (24.0, Oct. 2009, 11912 families)
  textual descriptions, other resources and literature references

TIGRFAMs (9.0, Nov. 2009, 3808 models)
  GO, Pfam and InterPro models

COG (2003, 4873 clusters of orthologous groups)
  3 lineages and ancient conserved domain

RPS-BLAST (Reverse PSI-BLAST)
  E values ≤ 0.001
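Applying the E-value cutoff to search results is a simple filter. The sketch below assumes the hits are available as 12-column tab-separated text in the layout of BLAST+ `-outfmt 6`, where the E-value is the 11th column. The slides do not say which output format the pipeline used, so the layout, the function name `filter_hits` and the example identifiers are assumptions.

```python
import csv
import io

# Column index of the E-value in the assumed 12-column tabular layout.
EVALUE_COLUMN = 10


def filter_hits(tabular_text, max_evalue=1e-3):
    """Keep (query, subject, evalue) for hits with E-value <= max_evalue."""
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    kept = []
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        evalue = float(row[EVALUE_COLUMN])
        if evalue <= max_evalue:
            kept.append((row[0], row[1], evalue))
    return kept
```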

Novel Protein Families Discovery

Seemingly spurious ORFs in large clusters without a homology match may represent novel protein families.
In GOS, only 1.3% of clusters with cluster size ≥ 10 map to 93% of true ORFs
In BIOME, only 1.0% of clusters with cluster size ≥ 5 map to 28% of true ORFs

METAGENOME COMPARISON

Statistical Comparison of Metagenomes

Occurrence profile coefficient

Why a z score? (Rodriguez-Brito's method requires 10^5 simulated samples)

Low occurrence cutoff
  H_A = 4 (0.95), z = 1.96
  H_A = 7 (0.99), z = 2.58

1. z > cutoff
2. P_A ≥ f × P_B
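The two criteria can be tried out with a short script. The slides do not give the z formula, so this sketch assumes the standard pooled two-proportion statistic, z = (P_A - P_B) / sqrt(P(1 - P)(1/N_A + 1/N_B)), where P_A = n_A/N_A, P_B = n_B/N_B and P = (n_A + n_B)/(N_A + N_B). The function names and the default f = 2.0 are hypothetical; the slides only require f > 1.

```python
import math


def z_score(n_a, total_a, n_b, total_b):
    """Pooled two-proportion z statistic (assumed form, see lead-in)."""
    p_a = n_a / total_a
    p_b = n_b / total_b
    pooled = (n_a + n_b) / (total_a + total_b)
    variance = pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b)
    if variance == 0.0:
        return 0.0  # no variation at all, so no evidence of a difference
    return (p_a - p_b) / math.sqrt(variance)


def is_overrepresented(n_a, total_a, n_b, total_b, z_cutoff=1.96, f=2.0):
    """Apply both criteria: z > cutoff and P_A >= f * P_B."""
    p_a = n_a / total_a
    p_b = n_b / total_b
    z = z_score(n_a, total_a, n_b, total_b)
    return z > z_cutoff and p_a >= f * p_b
```

A cluster holding 50 of 1000 sequences from sample A and 10 of 1000 from sample B passes both tests, while 12 versus 10 fails: the difference is neither statistically significant nor large enough relative to f.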


Comparison between Rodriguez-Brito's method and the z-test method.

Clustering-based Comparison

[Figure: GOS ORF clusters; axes: r_AB, No. of clusters]

Clustering-based Comparison

BIOME samples are more diverse than GOS

[Figure: BIOME clusters]
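One way to read r_AB is as a correlation between the occurrence profiles of samples A and B, where a profile lists how many sequences each cluster contains from that sample. The slides do not define the coefficient, so the sketch below assumes a plain Pearson correlation. The function names and the input layout (each cluster as a list of sample labels, one per member sequence) are illustrative only.

```python
import math


def occurrence_profile(clusters, sample):
    """Number of sequences from `sample` in each cluster."""
    return [members.count(sample) for members in clusters]


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is flat)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

Under this reading, two samples that populate the same clusters in similar proportions score close to 1, while samples that favor different clusters score lower.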

Protein Family-based Comparison

Merge Pfam, TIGRFAMs and COG into superfamilies
  Pfam clans, TIGRFAMs role categories, and COG functional classes
Compare with a specific superfamily

Protein Family-based Comparison

(a) GOS on COG Class F, (b) GOS on COG Class T, (c) BIOME on COG Class F, (d) BIOME on COG Class T

Conclusion


RAMMCAP improves performance
  CD-HIT
  z test
Novel protein family discovery
  ORF clustering
Metagenome comparison
  Cluster-based
  Protein family-based

Discussion


How much improvement is gained when RNA prediction is applied to raw reads first?
How should the significance factor f be determined?
  P_A ≥ f × P_B (f > 1)