COFECO: Composite Function Annotation Enriched by Protein Complex Data




Supplementary Material


1. Collection and preprocessing of protein complex datasets


Table 1. Statistics of protein complex resources and organisms combined by COFECO

Complex resources        Complexes   Web sites
Reactome                 9,173       http://www.reactome.org
MIPS CORUM               1,507       http://mips.gsf.de/genre/proj/corum
MIPS MPact               217         http://mips.gsf.de/genre/proj/mpact
PINdb                    76          http://pin.mskcc.org
Gene Ontology            541         http://www.geneontology.org
Ho dataset (2002)        642         http://www.ncbi.nlm.nih.gov/pubmed/11805837
Gavin dataset (2006)     2,166       http://www.ncbi.nlm.nih.gov/pubmed/16429126
Krogan dataset (2006)    4,332       http://www.ncbi.nlm.nih.gov/pubmed/16554755


In COFECO, the protein complex datasets are downloaded from the above websites (Table 1). They are then preprocessed and stored in an Oracle DBMS. They fall into two categories: human-curated protein complexes and systematically analyzed protein complexes. The former includes Reactome (7), CORUM (8), MPact (6), GO (1), and PINdb (3). These are high-quality, manually curated protein complex databases that contain many protein complexes from eukaryotes such as mammals, yeast, fly, and C. elegans. The latter contains the yeast protein complex datasets published by Ho et al. (2), Gavin et al. (4), and Krogan et al. (5), obtained through high-throughput TAP/MS. Even though the biological functions of some systematically analyzed protein complexes are not completely identified, these datasets are much more plentiful and novel than the human-curated datasets. Therefore, systematically analyzed protein complexes are suitable for annotating a list of genes, because protein complexes can give significant insight into functional units.


(1) Preprocessing of MPact dataset

MPact contains yeast complex data from individual and systematic experiments. Among them, the Gavin dataset (2002) and the Krogan dataset (2004) were excluded because they are covered by the datasets acquired by Gavin (2006) and Krogan (2006). Three systematic datasets are listed in the complexcat.scheme file (ftp://ftpmips.gsf.de/yeast/catalogues/complexcat) as follows (Table 2):


550 Complexes by Systematic Analysis

550.1 Gavin AC, et al.

550.2 Ho Y, et al.

550.3 Krogan NJ, et al.


Table 2. Statistics of protein complexes of MPact

Method       Dataset         Complexes   Literature
Individual   -               217         numerous publications
Systematic   Ho (2002)       232         (9)
             Gavin (2002)    2,166       (10)
             Krogan (2004)   91          (11)


(2) Preprocessing of GO dataset

Complex-specific GO terms in the cellular component category were first selected if they matched the wildcard keyword searches ‘*complex’, ‘*some’, and ‘*ase’. These GO terms were then filtered to remove annotations with low-confidence evidence codes such as IEA (Inferred from Electronic Annotation), NAS (Non-traceable Author Statement), ND (No biological Data available), and NR (Not Recorded).
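
The two filtering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the COFECO implementation: the function names, the tuple layout of the input records, and the protein labels P1 to P4 are hypothetical, and the evidence codes are treated as an exclusion list.

```python
from fnmatch import fnmatchcase

# Wildcard patterns quoted in the text and the evidence codes treated as low confidence.
NAME_PATTERNS = ("*complex", "*some", "*ase")
LOW_CONFIDENCE = {"IEA", "NAS", "ND", "NR"}

def is_complex_term(name, namespace):
    """True for cellular-component terms whose name matches one of the patterns."""
    if namespace != "cellular_component":
        return False
    return any(fnmatchcase(name.lower(), pattern) for pattern in NAME_PATTERNS)

def complex_members(terms, annotations):
    """Map each complex-like GO id to the proteins annotated with reliable evidence.

    terms: iterable of (go_id, name, namespace)          (assumed layout)
    annotations: iterable of (protein, go_id, evidence)  (assumed layout)
    """
    selected = {go_id for go_id, name, ns in terms if is_complex_term(name, ns)}
    members = {}
    for protein, go_id, evidence in annotations:
        if go_id in selected and evidence not in LOW_CONFIDENCE:
            members.setdefault(go_id, set()).add(protein)
    return members

# Toy records, invented for illustration only.
terms = [
    ("GO:0016514", "SWI/SNF complex", "cellular_component"),
    ("GO:0005840", "ribosome", "cellular_component"),
    ("GO:0005634", "nucleus", "cellular_component"),
    ("GO:0016301", "kinase activity", "molecular_function"),
]
annotations = [
    ("P1", "GO:0016514", "IDA"),
    ("P2", "GO:0016514", "IEA"),  # dropped: electronic annotation
    ("P3", "GO:0005840", "TAS"),
    ("P4", "GO:0005634", "IDA"),  # dropped: term name matches no pattern
]
result = complex_members(terms, annotations)
```

In this toy run only P1 (SWI/SNF complex) and P3 (ribosome, matched by ‘*some’) survive both filters.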


Table 3 shows the statistics of protein complexes and their proteins depending on model organisms.



Table 3. Statistics of protein complexes and proteins of COFECO.

NCBI TAX  Organisms                               # Complexes # Proteins
4932      Saccharomyces cerevisiae                8,049       5,246
9606      Homo sapiens                            2,858       4,123
10090     Mus musculus                            2,118       5,670
10116     Rattus norvegicus                       1,850       2,957
9031      Gallus gallus                           1,287       1,820
7227      Drosophila melanogaster                 865         1,583
3702      Arabidopsis thaliana                    619         1,606
6239      Caenorhabditis elegans                  600         806
4896      Schizosaccharomyces pombe               526         899
44689     Dictyostelium discoideum                451         718
11676     Human immunodeficiency virus 1          149         364
9913      Bos taurus                              95          240
1773      Mycobacterium tuberculosis              60          82
562       Escherichia coli                        37          62
11320     Influenza A virus                       30          17
7955      Danio rerio                             30          74
8355      Xenopus laevis                          28          66
9986      Oryctolagus cuniculus                   25          40
2287      Sulfolobus solfataricus                 22          47
9601      Pongo abelii                            19          23
2190      Methanocaldococcus jannaschii           18          15
9823      Sus scrofa                              18          32
46245     Drosophila pseudoobscura pseudoobscura  17          25
9615      Canis lupus familiaris                  16          26
8364      Xenopus tropicalis                      11          21
9598      Pan troglodytes                         11          11
9541      Macaca fascicularis                     11          16
7159      Aedes aegypti                           9           9
10141     Cavia porcellus                         8           11
9544      Macaca mulatta                          8           8
9685      Felis catus                             7           8
10029     Cricetulus griseus                      6           6
9940      Ovis aries                              6           8
1491      Clostridium botulinum                   6           9
9534      Chlorocebus aethiops                    6           4
6238      Caenorhabditis briggsae                 5           6
7165      Anopheles gambiae                       5           5
9600      Pongo pygmaeus                          5           6
7244      Drosophila virilis                      5           5
10036     Mesocricetus auratus                    4           3
39947     Oryza sativa Japonica Group             4           5
9925      Capra hircus                            4           3
7245      Drosophila yakuba                       4           6
1148      Synechocystis sp. PCC 6803              3           13
31033     Takifugu rubripes                       3           2
9595      Gorilla gorilla gorilla                 3           3
9580      Hylobates lar                           3           3
57486     Mus musculus molossinus                 3           3
13616     Monodelphis domestica                   2           2
7238      Drosophila sechellia                    2           2
7243      Drosophila teissieri                    2           2
8127      Oreochromis mossambicus                 2           1
9796      Equus caballus                          2           2
9490      Saguinus oedipus                        2           2
9543      Macaca fuscata fuscata                  2           2
9483      Callithrix jacchus                      2           3
7668      Strongylocentrotus purpuratus           2           2
7240      Drosophila simulans                     2           3
3055      Chlamydomonas reinhardtii               2           2
7226      Drosophila mauritiana                   2           3
Etc.      Sixty-five organisms contain only one protein complex


2. Collection of input gene (or protein) identifiers and their annotations


In COFECO, gene (or protein) identifiers and their annotations are downloaded from the websites listed in Table 4. They are then preprocessed and stored in an Oracle DBMS.


Table 4. Available web sites of other resources

Resources               Web sites
KEGG pathways           http://www.genome.jp/kegg/download/ftp.html
GO annotations          http://www.geneontology.org/GO.current.annotations.shtml
InParanoid (12)         http://inparanoid.sbc.su.se/download
UniProt Knowledgebase   http://www.uniprot.org/downloads
iProClass               http://pir.georgetown.edu/pirwww/download/
Gene symbol ID          ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
RefSeq ID               ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz
UniGene ID              ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2unigene
Affymetrix probe ID     http://www.affymetrix.com/support/technical/annotationfilesmain.affx
Agilent probe ID        http://www.chem.agilent.com/cag/bsp/gene_lists.asp


3. A modified apriori algorithm: greedy algorithm


As the number of annotation resources increases and the minimal number of concurrent genes (k, referred to as MG below) decreases, the computational cost of enumerating all possible compositions of annotation terms may grow drastically. In addition, protein redundancy among complexes may also lead to heavy computation in COFECO analysis. To solve this problem, we developed a greedy algorithm that selects the top-ranked K terms using a quota rule. This greedy algorithm is not guaranteed to find the global optimum, but it is very efficient from a computational point of view.


The main idea of the two-way greedy algorithm consists of a quota rule and a cutoff rule, illustrated in Figure 1. In the first step (itemset=1), the number of selected annotation resources is used to define the maximum number (MN) of annotation terms taken from each resource. MN supports unbiased recombination among annotation terms from heterogeneous resources. First, the annotation terms satisfying the minimal number of concurrent genes (MG) are selected. Then the enrichment P-values of the selected terms are calculated using a statistical test, and the top MN ranked annotation terms of each annotation resource are selected. From the second step (itemset=2) onward, the annotation terms selected in the first step are used to build composite terms, as in the original apriori algorithm. The composite annotation terms satisfying MG are selected, their enrichment P-values are calculated using a statistical test, and the top K composite terms are selected. The process continues iteratively until the longest composite terms are found. The number of top-ranked annotations (K) and the number of annotation resources (R) are set by the user.

The sum of the MNs is approximately K:

K ≈ ∑ MN_i (i = 1, 2, ..., R)

The sum ∑ MN_i can be greater than K when annotation terms have identical enrichment P-values (ties).
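
The quota rule and cutoff rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the COFECO code: the text only says "statistical test", so a one-sided hypergeometric test is assumed here; MN is assumed to be an even share of K across the R resources; tied P-values are kept, which is why the number of retained terms can exceed the quota. All function names and the toy data are hypothetical.

```python
from math import comb

def hypergeom_pvalue(hits, list_size, annotated, population):
    """P(X >= hits) under a hypergeometric model (an assumed choice of test)."""
    upper = min(list_size, annotated)
    total = comb(population, list_size)
    tail = sum(comb(annotated, x) * comb(population - annotated, list_size - x)
               for x in range(hits, upper + 1))
    return tail / total

def top_with_ties(scored, limit):
    """Keep the `limit` best (p_value, itemset) pairs plus any tied with the last."""
    scored = sorted(scored, key=lambda pair: pair[0])
    if len(scored) <= limit:
        return scored
    cutoff = scored[limit - 1][0]
    return [pair for pair in scored if pair[0] <= cutoff]

def greedy_composites(gene_list, resources, population, K=6, MG=2):
    """resources: {resource name: {term: set of annotated background genes}}."""
    genes = set(gene_list)
    MN = max(1, K // len(resources))  # quota rule: even share of K (assumption)
    background, level = {}, []
    # Step 1 (itemset=1): per-resource filtering by MG, scoring, top MN.
    for terms in resources.values():
        scored = []
        for term, annotated in terms.items():
            hits = genes & annotated
            if len(hits) >= MG:
                p = hypergeom_pvalue(len(hits), len(genes), len(annotated), population)
                scored.append((p, frozenset([term])))
                background[term] = set(annotated)
        level.extend(top_with_ties(scored, MN))
    results = list(level)
    singles = sorted(background)
    # Steps 2+: extend itemsets apriori-style, keep only the top K (cutoff rule).
    while level:
        candidates = {itemset | {term}
                      for _, itemset in level for term in singles
                      if term not in itemset}
        scored = []
        for itemset in candidates:
            annotated = set.intersection(*(background[t] for t in itemset))
            hits = genes & annotated
            if len(hits) >= MG:
                p = hypergeom_pvalue(len(hits), len(genes), len(annotated), population)
                scored.append((p, itemset))
        level = top_with_ties(scored, K)
        results.extend(level)
    return results

# Toy data, invented for illustration: 5 input genes in a background of 100.
others = {f"b{i}" for i in range(1, 21)}
resources = {
    "GO": {"nucleus": {"g1", "g2", "g3", "g4"} | others,
           "transcription": {"g1", "g2", "g3", "b1", "b2"}},
    "Complex": {"complexA": {"g1", "g2", "g3"}},
}
results = greedy_composites(["g1", "g2", "g3", "g4", "g5"], resources,
                            population=100, K=4, MG=2)
longest = max((itemset for _, itemset in results), key=len)
```

In this toy run the loop stops once the three-term composite (nucleus, transcription, complexA) has been built, because no further extension is possible. Pruning each level to the top K is what makes the search greedy: a composite whose sub-terms were cut at an earlier level is never examined, so the global optimum is not guaranteed.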



Figure 1. The workflow of the greedy algorithm


4. Result and comparison of COFECO with related tools

An appropriate example has already been introduced in the COFECO manuscript. Here we present an additional example of analysis by COFECO. This result illustrates the functionalities of COFECO and its advantages over other software. Because the full contents of the examples are very large, only the key annotations are illustrated. To compare COFECO to other related tools, we examined the examples with GENECODIS (13).



Example analysis of protein expression in liver cancer.


1) The advantage of protein complexes in composite annotations

47 co-expressed proteins in human liver cancer, identified using 2D-gel/MS, were analyzed. This dataset has not yet been published. The top-ranked annotations provide composite annotations including various protein complexes. Many MIPS mammalian complexes (CORUM) were enriched by those proteins. Both the COFECO and GENECODIS results provided the protein complex ‘GO:0016514: SWI/SNF complex’, which has sixty-three proteins. However, COFECO was able to provide new and more specific information. For example, the first composite annotation in Table 5 revealed that the protein complex ‘p300-CBP-p270-SWI/SNF complex’ had seven proteins and exactly matched seven proteins in the analyzed list. In addition, the first composite annotation revealed that these proteins were co-annotated with ‘nucleus’, ‘transcription’, ‘regulation of transcription, DNA-dependent’, and ‘p300-CBP-p270-SWI/SNF complex’. The importance of this observation is that the protein complexes of COFECO specify the annotated protein groups well and provide novel biological meaning by combining annotation terms.


Table 5. COFECO output table.



Table 6. GENECODIS output table.



2) Detailed annotations of protein complexes

Table 7 shows detailed annotations of the ‘p300-CBP-p270-SWI/SNF complex’. Users can see the seven co-complexed proteins, the detection method, the published literature, the curator comment, and a link to the original resource. Direct GO and KEGG annotations of the seven co-complexed proteins are indicated by symbolic colors in Table 7.





Table 7. Protein complex table of COFECO.




3) Summarization and exploration by GO and KEGG viewers

Annotated GO terms of the ‘ARID1A’ protein are shown in Table 7. The GO terms selected in Table 7 can be displayed in the GO hierarchical viewer (Figure 2). Input GO terms painted in symbolic colors can be explored efficiently.





Figure 2. GO viewer of COFECO


The annotated KEGG pathway ‘mTOR signaling pathway’ of the ‘ARID1A’ protein is shown in Table 7. Co-complexed proteins painted in symbolic colors are displayed in the KEGG pathway viewer (Figure 3) if they are annotated in this pathway.



Figure 3. KEGG viewer of COFECO


4) Comparative analysis with orthologs

Yeast orthologs of the 47 co-expressed proteins in human liver cancer were annotated simultaneously by COFECO. The result of the within-comparison at the annotation level shows the different and common annotations between the 47 co-expressed proteins and their orthologs.


Table 8. Within-comparison in the annotation level





5) Comparative analysis with another gene list

47 co-expressed proteins in human liver cancer and 58 liver-specific genes in human liver tissue (14) were annotated by COFECO. 20090411020252_annotation.ser is the COFECO output for the 58 genes and 20090411020239_annotation.ser is the output for the 47 proteins. The result of the cross-comparison at the annotation level, showing the different and common annotations between them, is given in Table 9.
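
Conceptually, both the within-comparison and the cross-comparison reduce to set operations on two collections of annotation terms. The sketch below illustrates that idea only; it does not read the .ser output files, the function name is hypothetical, and the second term list is invented for illustration.

```python
def compare_annotations(first, second):
    """Split two annotation collections into common and list-specific terms."""
    a, b = set(first), set(second)
    return {"common": a & b, "only_first": a - b, "only_second": b - a}

# Invented annotation terms, for illustration only.
proteins_47 = {"nucleus", "transcription", "p300-CBP-p270-SWI/SNF complex"}
genes_58 = {"nucleus", "lipid metabolism"}
report = compare_annotations(proteins_47, genes_58)
```

The same function covers the within-comparison by passing the annotations of a gene list and those of its orthologs.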











Table 9. Cross-comparison in the annotation level








Feature comparison of COFECO vs. GENECODIS vs. ProfCom

Feature                   COFECO                           GENECODIS                        ProfCom
Composite algorithm       Composite annotation by          Composite annotation by          Composite annotation by
                          traditional apriori algorithm,   traditional apriori algorithm    Boolean operation
                          resource combination methods,
                          and term-filtering
Solution for              Two-way greedy algorithm         No                               Top rank-selected
computational complexity                                                                    greedy algorithm
Annotations supported     Gene Ontology,                   Gene Ontology,                   Gene Ontology,
                          KEGG pathways,                   Interpro Motifs,                 Interpro Motifs,
                          Protein complexes                KEGG pathways                    FunCat
Identifiers supported     27                               12                               9
Organisms supported       21                               11                               6
Annotation comparison     Yes (Within & cross)             No                               No
Orthology supported       Yes                              No                               No
Viewer                    KEGG & GO                        No                               No


5. References

1. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25, 25-29.

2. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415(6868), 180-183.

3. Luc PV, Tempst P. (2004) PINdb: a database of nuclear protein complexes from human and yeast. Bioinformatics, 20(9), 1413-1415.

4. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, et al. (2006) Proteome survey reveals modularity of the yeast cell machinery. Nature, 440(7084), 631-636.

5. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP, Punna T, Peregrin-Alvarez JM, et al. (2006) Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature, 440(7084), 637-643.

6. Guldener U, Munsterkotter M, Oesterheld M, Pagel P, Ruepp A, Mewes HW, Stumpflen V. (2006) MPact: the MIPS protein interaction resource on yeast. Nucleic Acids Res., 34, D436-D441.

7. Vastrik I, D'Eustachio P, Schmidt E, Gopinath G, Croft D, de Bono B, Gillespie M, Jassal B, Lewis S, Matthews L, Wu G, Birney E, Stein L. (2007) Reactome: a knowledge base of biologic pathways and processes. Genome Biol., 8(3), R39.

8. Ruepp A, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C, Stransky M, Waegele B, Schmidt T, Doudieu ON, Stumpflen V, Mewes HW. (2008) CORUM: the comprehensive resource of mammalian protein complexes. Nucleic Acids Res., 36, D646-D650.

9. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, Yang L, et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415(6868), 180-183.

10. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, Superti-Furga G. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415(6868), 141-147.

11. Krogan NJ, Peng WT, Cagney G, Robinson MD, Haw R, Zhong G, Guo X, Zhang X, Canadien V, Richards DP, Beattie BK, Lalev A, Zhang W, Davierwala AP, Mnaimneh S, Starostine A, Tikuisis AP, Grigull J, Datta N, Bray JE, Hughes TR, Emili A, Greenblatt JF. (2004) High-definition macromolecular composition of yeast RNA-processing complexes. Mol. Cell, 13(2), 225-239.

12. Berglund AC, Sjolund E, Ostlund G, Sonnhammer EL. (2008) InParanoid 6: eukaryotic ortholog clusters with inparalogs. Nucleic Acids Res., 36, D263-D266.

13. Carmona-Saez P, Chagoyen M, Tirado F, Carazo JM, Pascual-Montano A. (2007) GENECODIS: a web-based tool for finding significant concurrent annotations in gene lists. Genome Biol., 8, R3.

14. Su AI, Cooke MP, Ching KA, Hakak Y, Walker JR, Wiltshire T, Orth AP, Vega RG, Sapinoso LM, Moqrich A, Patapoutian A, Hampton GM, Schultz PG, Hogenesch JB. (2002) Large-scale analysis of the human and mouse transcriptomes. Proc. Natl Acad. Sci., 99(7), 4465-4470.