Lecture 5. Topics in Functional
Enrichment Analysis

The Chinese University of Hong Kong

CSCI5050 Bioinformatics and Computational Biology

Lecture outline

1. Formal knowledge representation: ontology


Biological ontology


2. Practical use of ontology: functional enrichment analysis

Last update: 11-Oct-2013

CSCI5050 Bioinformatics and Computational Biology | Kevin Yip-cse-cuhk | Fall 2013

BIOLOGICAL ONTOLOGY

Part 1

Knowledge representation


How to systematically organize knowledge about
genomic elements?


Define the concepts associated with standard terms


E.g., A gene is a genomic region that can transcribe RNA


Define the properties of each concept


E.g., A gene has an official name, and zero or more aliases


Define the relationships between different concepts


E.g., A protein-coding gene is a gene whose RNA can be
translated into proteins


E.g., An exon is part of a gene


We want to define an “ontology”:


“The philosophical study of the nature of being, existence,
or reality as such, as well as the basic categories of being
and their relations” (Wikipedia)


Resource Description Framework (RDF)


Everything represented in triples:


Subject


Predicate


Object


All non-literal values have a type,
which is represented by a
Uniform Resource Identifier (URI)


Subclass, domain, range (for
defining predicates), etc.
supported by RDF Schema (RDFS)


File formats: Notation 3 (N3),
RDF/XML, ...


Database management systems:
Jena/Sesame, Oracle, ...


Query languages: RDQL, RQL,
SeRQL, SPARQL, ...
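
The triple model can be illustrated without any RDF library. The following is a minimal sketch in plain Python, assuming made-up `ex:` identifiers and a hypothetical `match` helper; it is not how Jena, Sesame or any query language listed above is implemented, only a picture of "everything is a (subject, predicate, object) triple" and of pattern-style querying.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical triples: the "ex:" names are illustrative only and do not
# come from any real vocabulary.
TRIPLES: List[Triple] = [
    ("ex:ProteinCodingGene", "rdfs:subClassOf", "ex:Gene"),
    ("ex:Exon", "ex:partOf", "ex:Gene"),
    ("ex:TP53", "rdf:type", "ex:ProteinCodingGene"),
    ("ex:TP53", "ex:officialName", '"TP53"'),  # a literal value
]


def match(triples: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


# Everything that points at ex:Gene, whatever the predicate.
about_gene = match(TRIPLES, o="ex:Gene")
```

A real query language expresses the same idea declaratively, with variables in place of the `None` wildcards.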


Image credit: W3C RDF Primer

Web Ontology Language


Web Ontology Language (OWL)


Defines transitivity, symmetry, cardinality, etc., for
logical inference


Three levels:


OWL Lite, OWL DL (description logic), OWL Full


Increasing expressiveness (e.g., OWL Lite only supports
cardinality values of 0 or 1, while OWL Full has no such
limitation)


Decreasing computability and efficiency (e.g., OWL DL is
computationally complete and all of its logical expressions are
decidable, while OWL Full offers no such guarantee)


Both RDF and OWL are components of the
semantic web


Web with formal, machine-understandable information


Image source: http://www.w3.org/2001/12/semweb-fin/swlevels.png

Logical inference examples


Each gene has a unique ID


Since a protein coding gene is a (subclass of) gene, each protein coding gene has a
unique ID


Since p53 is (an instance of) a gene, it has a unique ID


“Interacts with” is a symmetric relationship


If protein A physically interacts with protein B, protein B also physically interacts with
protein A


“Upstream of” is a transitive relationship


If gene A is upstream of gene B, and gene B is upstream of gene C, then gene A is
upstream of gene C


“Regulates in a cell type” is a sub-relationship of “regulates”


If protein A regulates the expression of gene B in a certain cell type, then protein A
regulates gene B


Why OWL instead of WOL: “Why not be inconsistent in at least one aspect
of a language which is all about consistency?”


Guus Schreiber
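
The symmetric, transitive and sub-relationship examples can be mimicked with a few lines of set manipulation. This is only a sketch under simple assumptions (facts stored as Python tuples, hypothetical helper names, a made-up cell type label); an OWL reasoner derives such facts from declared property characteristics rather than from hand-written loops.

```python
def transitive_closure(pairs):
    """Keep adding (a, d) whenever (a, b) and (b, d) are both present."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def symmetric_closure(pairs):
    """Add (b, a) for every (a, b)."""
    return set(pairs) | {(b, a) for a, b in pairs}


# "Upstream of" is transitive.
upstream_of = transitive_closure({("geneA", "geneB"), ("geneB", "geneC")})

# "Interacts with" is symmetric.
interacts_with = symmetric_closure({("proteinA", "proteinB")})

# "Regulates in a cell type" is a sub-relationship of "regulates":
# dropping the cell type (hypothetical label) yields a "regulates" fact.
regulates_in_cell_type = {("proteinA", "geneB", "hypothetical_cell_type")}
regulates = {(regulator, target)
             for regulator, target, _cell_type in regulates_in_cell_type}
```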


Ontology editor: Protégé


Image source: http://protege.stanford.edu/overview/po-screenshots.html

Biological ontology


Biological terms, their
properties and their
relationships


The Open Biological
and Biomedical
Ontologies (OBO)


Many ontologies


“The nice thing about
standards is that you
have so many to
choose from”
(Andrew Tanenbaum,
Computer Networks,
2nd ed., p. 254)


Challenges:


Accuracy


Completeness


Simplicity


Community use


Image source: http://obofoundry.org/

Gene Ontology (GO)


An ontology for describing gene products, developed by the Gene
Ontology Consortium


The most frequently used biological ontology


3 sub-ontologies:


Molecular function


Biological process


Cellular component


2 parts:


The ontologies


Directed acyclic graphs (DAG)


Some terms have multiple parents


Edges indicate three types of relationships: is-a, part-of, regulates


Organism-specific instances (each gene can have 0, 1 or
more annotated terms)
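
To make the two parts concrete, here is a minimal sketch of a term DAG plus gene annotations in plain Python. All term names, genes and edges are hypothetical toy data, and following every edge upward is a simplification: whether an annotation should be carried along a given edge type is a design decision that this sketch does not address.

```python
# Hypothetical toy DAG: child term -> list of (relationship, parent term).
PARENTS = {
    "root": [],
    "binding": [("is_a", "root")],
    "catalysis": [("is_a", "root")],
    "toxin binding": [("is_a", "binding")],
    # A term with multiple parents, which a tree could not represent.
    "toy term with two parents": [("is_a", "binding"), ("part_of", "catalysis")],
}

# Hypothetical instances: each gene has 0, 1 or more annotated terms.
ANNOTATIONS = {
    "gene1": {"toxin binding"},
    "gene2": {"toy term with two parents"},
    "gene3": set(),
}


def ancestors(term):
    """All terms reachable from `term` by following edges upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in PARENTS[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagated_terms(gene):
    """Direct annotations of a gene plus every ancestor of those terms."""
    terms = set(ANNOTATIONS[gene])
    for term in ANNOTATIONS[gene]:
        terms |= ancestors(term)
    return terms
```

With this toy data, `propagated_terms("gene1")` contains "toxin binding", "binding" and "root", so a gene annotated with a specific term is also counted under its more general ancestors.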


Tree view


Image source: http://amigo.geneontology.org/

Directed acyclic graph view


Image source: http://elbo.gs.washington.edu/courses/GS_559_11_wi/slides/11A-GeneAnnotation.pdf

GO in RDF/XML

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE go:go PUBLIC "-//Gene Ontology//Custom XML/RDF Version 2.0//EN" "http://www.geneontology.org/dtd/go.dtd">
<go:go xmlns:go="http://www.geneontology.org/dtds/go.dtd#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:RDF>
    <go:term rdf:about="http://www.geneontology.org/go#GO:0046812">
      <go:accession>GO:0046812</go:accession>
      <go:name>host cell surface binding</go:name>
      <go:definition>Interacting selectively and non-covalently with the surface of a host cell.</go:definition>
      <go:is_a rdf:resource="http://www.geneontology.org/go#GO:0005488" />
    </go:term>
    <go:term rdf:about="http://www.geneontology.org/go#GO:0015643">
      <go:accession>GO:0015643</go:accession>
      <go:name>toxin binding</go:name>
      <go:definition>Interacting selectively and non-covalently with a toxin, a poisonous substance that causes damage to biological systems. Toxins are differentiated from simple chemical poisons and vegetable alkaloids by their high molecular weight and antigenicity (they elicit an antibody response).</go:definition>
      <go:is_a rdf:resource="http://www.geneontology.org/go#GO:0005488" />
      <go:dbxref rdf:parseType="Resource">
        <go:database_symbol>InterPro</go:database_symbol>
        <go:reference>IPR000290</go:reference>
      </go:dbxref>
...


Evidence codes


Specify the type of evidence that supports each annotation of a particular
instance with a term (http://www.geneontology.org/GO.evidence.shtml)


Very important, but usually neglected


Experimental Evidence Codes


EXP: Inferred from Experiment


IDA: Inferred from Direct Assay


IPI: Inferred from Physical Interaction


IMP: Inferred from Mutant Phenotype


IGI: Inferred from Genetic Interaction


IEP: Inferred from Expression Pattern

Computational Analysis Evidence Codes


ISS: Inferred from Sequence or Structural Similarity


ISO: Inferred from Sequence Orthology


ISA: Inferred from Sequence Alignment


ISM: Inferred from Sequence Model


IGC: Inferred from Genomic Context


IBA: Inferred from Biological aspect of Ancestor


IBD: Inferred from Biological aspect of Descendant


IKR: Inferred from Key Residues


IRD: Inferred from Rapid Divergence


RCA: Inferred from Reviewed Computational Analysis

Author Statement Evidence Codes


TAS: Traceable Author Statement


NAS: Non-traceable Author Statement


Curator Statement Evidence Codes


IC: Inferred by Curator


ND: No biological Data available

Automatically-assigned Evidence Codes


IEA: Inferred from Electronic Annotation

Obsolete Evidence Codes


NR: Not Recorded

Logical inference using a biological ontology


Example: classification of protein
phosphatases


Given the sequence of a
phosphatase, find the most
specific class that could
instantiate it


Phosphatases:


Removal of phosphate group (vs.
kinases)


Two main families:


Serine/threonine phosphatases


Tyrosine phosphatases


Associated with serious human diseases:
cancers, neurodegenerative conditions,
diabetes, etc.


Well-defined subfamilies


Image credit: Wolstencroft et al., Bioinformatics 22(14):e530-e538, (2006)

Logical inference using a biological ontology


Define necessary and sufficient conditions for each phosphatase subfamily


Test for class membership using description logic, return the most specific
class


Some class definitions:


Class ReceptorTyrosinePhosphatase Complete
  (Protein and
    (hasDomain some TyrosinePhosphataseCatalyticDomain) and
    (hasDomain some TransmembraneDomain))


Class R5Phosphatase Complete
  (Protein and
    (hasDomain two TyrosinePhosphataseCatalyticDomain) and
    (hasDomain some TransmembraneDomain) and
    (hasDomain some FibronectinDomain) and
    (hasDomain some CarbonicAnhydraseDomain) and
    (hasDomain only
      (TyrosinePhosphataseCatalyticDomain and
       TransmembraneDomain and
       FibronectinDomain and
       CarbonicAnhydraseDomain)))


Example: LAR (2 tyrosine phosphatase catalytic domains, 1 transmembrane
domain, 9 fibronectin domains, 3 immunoglobulin domains)
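
The membership test can be imitated in ordinary code to see why LAR is assigned to the more general class. This is a hand-written sketch, not a description logic reasoner and not the method of the cited paper: "two" is read as "exactly two", the "only" restriction is read as "no domain types outside the listed ones", and the `ImmunoglobulinDomain` name and all function names are assumptions made for illustration.

```python
from collections import Counter

CATALYTIC = "TyrosinePhosphataseCatalyticDomain"
TRANSMEMBRANE = "TransmembraneDomain"
FIBRONECTIN = "FibronectinDomain"
CARBONIC_ANHYDRASE = "CarbonicAnhydraseDomain"
IMMUNOGLOBULIN = "ImmunoglobulinDomain"  # hypothetical name for illustration


def is_receptor_tyrosine_phosphatase(domains: Counter) -> bool:
    """At least one catalytic domain and at least one transmembrane domain."""
    return domains[CATALYTIC] >= 1 and domains[TRANSMEMBRANE] >= 1


def is_r5_phosphatase(domains: Counter) -> bool:
    """Exactly two catalytic domains, the three other listed types, nothing else."""
    allowed = {CATALYTIC, TRANSMEMBRANE, FIBRONECTIN, CARBONIC_ANHYDRASE}
    present = {name for name, count in domains.items() if count > 0}
    return (domains[CATALYTIC] == 2
            and domains[TRANSMEMBRANE] >= 1
            and domains[FIBRONECTIN] >= 1
            and domains[CARBONIC_ANHYDRASE] >= 1
            and present <= allowed)


# Ordered from most specific to least specific.
CLASSES = [
    ("R5Phosphatase", is_r5_phosphatase),
    ("ReceptorTyrosinePhosphatase", is_receptor_tyrosine_phosphatase),
]


def most_specific_class(domains: Counter) -> str:
    """Return the first (most specific) class whose conditions hold."""
    for name, condition in CLASSES:
        if condition(domains):
            return name
    return "Protein"


# Domain composition of LAR as given in the example above.
LAR = Counter({CATALYTIC: 2, TRANSMEMBRANE: 1, FIBRONECTIN: 9, IMMUNOGLOBULIN: 3})
```

Under these readings `most_specific_class(LAR)` gives `ReceptorTyrosinePhosphatase`: LAR has no carbonic anhydrase domain and carries immunoglobulin domains, so it fails the R5 conditions.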


The class structure


Image credit: Wolstencroft et al., Bioinformatics 22(14):e530-e538, (2006)

Some results


Main limitation: few things in biology can be defined so
rigorously without exceptions


Image credit: Wolstencroft et al., Bioinformatics 22(14):e530-e538, (2006)

Pathways


Gene ontology provides a simple relationship
between different objects: they are both
annotated with a common term


Pathways describe detailed mechanistic
relationships between the objects


E.g., a metabolic pathway records how
metabolites are converted to other metabolites
through the actions of enzymes


Wikipedia: “A biological pathway is a series of
actions among molecules in a cell that leads to
a certain product or a change in a cell”


KEGG


Kyoto Encyclopedia of Genes and Genomes Pathway Database


One of the most commonly used pathway databases


Provides non-species-specific reference pathways, as
well as species-specific versions


Different types of pathways:


Metabolic pathways


Genetic information processing


Environmental information processing


Cellular processes


Organismal systems


Human diseases


Drug development


The global map for metabolic pathways



Glycolysis/gluconeogenesis


The reference pathway:


The human version (in
green):


FUNCTIONAL ENRICHMENT
ANALYSIS

Part 2

Gene set analysis


We often obtain a
certain set of genes, e.g.,


Genes with co-expression (or
simultaneous differential
expression)


Genes whose promoters
are bound by a common
transcription factor


Genes that harbor
mutations in patients with a
certain disease


We want to study whether these
genes are related


Image credit: Eisen et al., PNAS 95(25):14863-14868, (1998)

Functional enrichment


If each gene is annotated with some standard terms
(e.g., from GO), we can look for enriched terms


Enrichment: the term annotates more genes in the set than expected by chance,
to a statistically significant degree


Contingency table for a term (e.g., binding):

                                    Genes in target set   Genes not in target set   Total
Genes annotated with the term       m_1                   m_2                       m_1 + m_2
Genes not annotated with the term   n_1 - m_1             n_2 - m_2                 n_1 + n_2 - m_1 - m_2
Total                               n_1                   n_2                       n_1 + n_2


Null hypothesis $H_0$: the term and gene set are independent


Compute one-sided p-value: $\Pr(m \ge m_1 \mid H_0)$, where $m$ is the number of
target-set genes annotated with the term under $H_0$; define cutoff
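
One common way to write this one-sided p-value out in full is the upper tail of a hypergeometric distribution (the one-sided Fisher's exact test). The expression below assumes that null model: with $n_1$ genes in the target set, $n_2$ genes outside it, and $m_1 + m_2$ annotated genes in total, the term is treated as assigned at random among all $n_1 + n_2$ genes.

```latex
\Pr(m \ge m_1 \mid H_0)
  = \sum_{k=m_1}^{\min(n_1,\, m_1+m_2)}
    \frac{\binom{m_1+m_2}{k}\,\binom{n_1+n_2-m_1-m_2}{\,n_1-k\,}}
         {\binom{n_1+n_2}{n_1}}
```

Each summand is the probability that exactly $k$ of the $n_1$ target genes carry the term; summing from $m_1$ upward gives the chance of seeing at least as many annotated genes as observed.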

Interpreting analysis results


Generating hypotheses based on enrichment
results:


If many genes in the target set are annotated with a
certain term, the term may be related to the
phenotypic observation


If many genes in the target set are annotated with a
certain term, the other genes in the set may also deserve
this annotation (“guilt by association”)


If the genes are annotated with two terms, the terms
may be related


If no statistically significant terms can be found, the
phenotypic observation may be non-genetic, largely
affected by other (e.g., environmental) factors, or
affected by many loci


Problems


Choosing a proper background


Whole genome vs. all genes studied in the experiment


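Suppose your microarray contains 100 genes, 20 of which are annotated with a certain GO term,
but overall only 10% of all human genes are annotated with that term. Against a whole-genome
background, any gene set drawn from this array would tend to look about 2-fold enriched for the
term, even if the genes were picked at random from the array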


Different evidence codes


Filtering


Hierarchical relationships between different terms


Problems:


Redundancy


Different levels of detail/number of annotated genes, affecting p-values


Different curators used terms at different levels: deeper if more is known


Some proposed solutions:


GO slims


Fixed level (e.g., level 3)


More complex analysis involving term distances, information content, etc.


Multiple hypothesis testing


Graph-based methods


If a gene is annotated with a term,
and another gene is annotated with
a sub-term (child node in the DAG),
they do not share terms but are
clearly related


Some ideas for tackling the problem
based on the graph:


Use the shortest path in the graph as
distance. Consider distance instead of
common terms


Add edge weights based on number of
instances annotated with the two terms


Consider the lowest common ancestor
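
Two of these ideas, distance and lowest common ancestor, can be sketched together on a toy DAG. The terms and edges below are hypothetical, edges are unweighted, and the distance is measured through the closest common ancestor, which is one simple choice and need not equal the shortest path in every DAG.

```python
from collections import deque

# Hypothetical toy DAG: term -> list of parent terms.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "protein binding": ["binding"],
    "toxin binding": ["binding"],
    "catalytic activity": ["root"],
}


def upward_distances(term):
    """Map the term and each of its ancestors to the minimum number of upward steps."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in PARENTS[current]:
            if parent not in dist:
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist


def lowest_common_ancestor(a, b):
    """Return (ancestor, distance): the common ancestor with the fewest total steps."""
    up_a, up_b = upward_distances(a), upward_distances(b)
    common = set(up_a) & set(up_b)
    # Ties are broken alphabetically so the result is deterministic.
    best = min(common, key=lambda t: (up_a[t] + up_b[t], t))
    return best, up_a[best] + up_b[best]
```

A term and its child come out at distance 1, so two genes annotated with a term and its sub-term are treated as close even though they share no term.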


Image source: http://elbo.gs.washington.edu/courses/GS_559_11_wi/slides/11A-GeneAnnotation.pdf

Multiple hypothesis testing


Suppose you have a set of genes from a certain
experiment (e.g., those with a 2-fold increase of
expression in cancer samples)


You perform enrichment analysis using GO terms,
KEGG biological pathways, OMIM disease annotations,
etc., and find a term with p-value 0.001


Recap of the meaning: Say you have $n_1$ genes in your set
and $m_1$ of them are annotated with this term. If the term is
randomly assigned to $m_1+m_2$ genes among all $n_1+n_2$ genes,
the probability of having $m_1$ or more genes annotated with
the term in a random set of $n_1$ genes is 0.001


Is it statistically significant?


Is it biologically significant?
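
One way to approach the first question is to ask how often a p-value this small would turn up by chance when many terms are tested. The calculation below is a sketch that assumes $T$ independent tests and uses $T = 1000$ purely as a hypothetical number of terms; real annotation terms overlap, so independence is an idealization.

```latex
\mathbb{E}[\text{number of terms with } p \le \alpha \mid \text{all } H_0 \text{ true}] = T\alpha
  = 1000 \times 0.001 = 1

\Pr(\text{at least one term with } p \le \alpha \mid \text{all } H_0 \text{ true})
  = 1 - (1-\alpha)^{T} = 1 - 0.999^{1000} \approx 0.63
```

Under these assumptions a single term at p = 0.001 is roughly what chance alone would produce, which is why the raw p-value has to be corrected before it is interpreted.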


Tools for enrichment analysis


AmiGO


Default software on the Gene Ontology web site


BiNGO (A Biological Network Gene Ontology tool)


Flanders Interuniversity Institute for Biotechnology


Plugin for the Cytoscape network viewer (next lecture)


DAVID (Database for Annotation, Visualization and
Integrated Discovery)


National Cancer Institute


One of the most cited tools for functional enrichment analysis


GSEA (Gene Set Enrichment Analysis)


Broad Institute


... (see http://www.geneontology.org/GO.tools_by_type.term_enrichment.shtml)


DAVID


Functionality:


ID conversion


Enrichment analysis


Disease


Functional categories


Gene ontology


Protein domains


Pathways


...


Correction for multiple hypothesis testing



Sample results


Functional annotation chart:



RT: related terms


Count: number of genes on your list annotated with the term


LT: number of genes on your list


PH: number of genes in the whole genome annotated with the term


PT: number of genes in the whole genome


%: Count / LT


P-Value: uncorrected p-value based on a procedure similar to Fisher’s exact test (more
conservative)


Fold Enrichment: (Count / LT) / (PH / PT)


Bonferroni: p-value corrected for multiple hypothesis testing based on the Bonferroni
procedure


Benjamini: p-value corrected for multiple hypothesis testing based on the
Benjamini-Hochberg procedure


FDR: False discovery rate (highly related to the Benjamini-adjusted p-value)


Fisher Exact: uncorrected p-value based on Fisher’s exact test
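
The fold enrichment and the two corrections can be written down in a few lines. This is a textbook-style sketch, assuming the standard Bonferroni and Benjamini-Hochberg adjustments; the function names are made up here, and DAVID's own implementation may differ in detail.

```python
def fold_enrichment(count, lt, ph, pt):
    """(Count / LT) / (PH / PT), using the column names of the chart."""
    return (count / lt) / (ph / pt)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

Bonferroni controls the chance of any false positive and is the stricter of the two; Benjamini-Hochberg controls the expected fraction of false positives among the reported terms.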


Thresholding issue


Sometimes the way to define the gene set is quite arbitrary


Differential expression


Should we use 1.5-fold or 2-fold?


TF binding


How far away from a gene should we still consider a site as a
promoter of a gene?


An illustration:

Gene   Expression fold change in cancer   Annotated with term “binding”
1      2.7                                Yes
2      2.5                                No
3      1.9                                Yes
4      1.2                                No
5      1.1                                No
6      0.9                                No
7      0.7                                Yes
8      0.6                                No
9      0.4                                No
10     0.2                                No


2-fold over-expressed:

                 Over-expressed
“binding”        Yes    No
Yes              1      2
No               1      6

One-sided p-value: 0.5333


1.5-fold over-expressed:

                 Over-expressed
“binding”        Yes    No
Yes              2      1
No               1      6

One-sided p-value: 0.1833
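
The two one-sided p-values in the illustration can be reproduced with a short script. It is a sketch assuming a hypergeometric null (the one-sided Fisher's exact test) and a "fold change at or above the threshold" rule for calling over-expression; the function name is hypothetical, and the data are the ten genes of the illustration.

```python
from math import comb

# (expression fold change, annotated with "binding") for genes 1..10,
# copied from the illustration.
GENES = [(2.7, True), (2.5, False), (1.9, True), (1.2, False), (1.1, False),
         (0.9, False), (0.7, True), (0.6, False), (0.4, False), (0.2, False)]


def one_sided_p(threshold):
    """P(at least the observed number of annotated genes in the over-expressed set)."""
    total = len(GENES)
    annotated = sum(1 for _fold, has_term in GENES if has_term)
    selected = [has_term for fold, has_term in GENES if fold >= threshold]
    n1 = len(selected)   # size of the over-expressed set
    m1 = sum(selected)   # annotated genes within that set
    tail = sum(comb(annotated, k) * comb(total - annotated, n1 - k)
               for k in range(m1, min(n1, annotated) + 1))
    return tail / comb(total, n1)


p_two_fold = one_sided_p(2.0)            # 24/45
p_one_and_a_half_fold = one_sided_p(1.5)  # 22/120
```

Moving the threshold from 2-fold to 1.5-fold adds a single gene to the set, yet the p-value drops from about 0.53 to about 0.18, which is the sensitivity the illustration is pointing at.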

GSEA


GSEA tries to avoid using arbitrary thresholds to call gene sets


Ideas:


For each term, find the set of genes annotated with it


Check the ranks of these genes based on a phenotypic measure for
calling gene sets (e.g., expression fold-change). Compute a statistic
(enrichment score) based on these ranks: increase score when a gene
with the term is encountered, decrease score otherwise


If the genes annotated with the term are randomly distributed across the whole list
of genes, this process is essentially a random walk. The point with maximum
distance from zero can be used for evaluating the deviation from this random case.


Evaluate statistical significance of the score by permuting phenotypic
measure values
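
The running-sum idea can be sketched as follows. This is a simplified, unweighted variant and not the published GSEA statistic: the step sizes (+1/hits, -1/misses, so that the walk ends at zero) and the function names are assumptions, and permuting the phenotypic values is modelled by shuffling which ranks carry the term.

```python
import random


def enrichment_score(ranked_hits):
    """Maximum of the running sum over a ranked list of booleans.

    ranked_hits[i] is True when the gene at rank i is annotated with the term.
    Assumes at least one hit and at least one miss.
    """
    n_hit = sum(ranked_hits)
    n_miss = len(ranked_hits) - n_hit
    running, best = 0.0, 0.0
    for hit in ranked_hits:
        running += 1.0 / n_hit if hit else -1.0 / n_miss
        best = max(best, running)
    return best


def permutation_p_value(ranked_hits, n_permutations=1000, seed=0):
    """Fraction of shuffled lists whose score is at least the observed score."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_hits)
    labels = list(ranked_hits)
    at_least = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)
        # Small tolerance so that ties are not lost to floating-point rounding.
        if enrichment_score(labels) >= observed - 1e-12:
            at_least += 1
    return at_least / n_permutations


# Hypothetical ranked list: hits at ranks 1, 3 and 7 out of 10 genes.
example = [True, False, True, False, False, False, True, False, False, False]
score = enrichment_score(example)
p_value = permutation_p_value(example)
```

No threshold on the phenotypic measure is needed: the whole ranking enters the score, and the permutations supply the reference distribution.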


GSEA (cont’d)


Example:


GSEA (cont’d)


Example (maximum values highlighted in blue):


GSEA (cont’d)


Example:


Since in 3 of the 10
random sets the
maximum score is ≥ 1,
the observed
maximum score for
the real data is not
significant (p=0.3)


It is possible to
compute the p-value
without performing
the simulation


GSEA (cont’d)


Real examples:


Image credit: Subramanian et al., PNAS 102(43):15545-15550, (2005)

Case study: Identifiers of biological objects


Identifiers are a big problem in bioinformatics


Different databases use different identifiers


p53: 11998 (HGNC), 7157 (Entrez), ENSG00000141510 (Ensembl), P04637
(UniProtKB), ...


Different conventions for gene, protein, etc.


p53 gene (wild type), p53 (mutant), P53 (protein)


Multiple genes with the same name


One gene with many names (aliases)


Multiple versions of the same gene


Different organisms use very different naming methods


Eyeless, BRCA1, MAP kinase kinase kinase, starry night


Image credit: American Scientist 89(6):1, (2001); The Museum of Modern Art

Case study: Identifiers of biological objects


Standardization efforts:


Official names


Saccharomyces Genome Database (SGD) for yeast


Standard open reading frame (ORF) names:
Y<chromosome: A..P><relative to centromere: L|R><ID: XXX><strand: W|C>, e.g., YKL074C


HUGO Gene Nomenclature Committee (HGNC) for human


...


ID converters


DAVID


ID lookup tables


Unique ID for everything


Life Science Identifier (LSID)
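
The yeast ORF naming scheme above is regular enough to parse mechanically. The sketch below assumes the strict pattern exactly as written (three digits, no suffix), so names that extend the scheme are rejected; the function name and the returned field names are made up for illustration.

```python
import re

# Y, chromosome letter A..P, arm L|R, three-digit ID, strand W|C.
ORF_PATTERN = re.compile(r"^Y([A-P])([LR])(\d{3})([WC])$")


def parse_orf_name(name):
    """Split a systematic ORF name into its parts, or return None if it does not fit."""
    match = ORF_PATTERN.match(name)
    if match is None:
        return None
    chromosome_letter, arm, index, strand = match.groups()
    return {
        "chromosome": ord(chromosome_letter) - ord("A") + 1,  # A -> 1, ..., P -> 16
        "arm": arm,           # L or R, relative to the centromere
        "index": int(index),  # the XXX part of the name
        "strand": strand,     # W (Watson) or C (Crick)
    }


parsed = parse_orf_name("YKL074C")
```

For `YKL074C` this yields chromosome 11, arm L, index 74, strand C. Names from other organisms, such as BRCA1, do not fit the pattern and return `None`.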


Summary


Gene ontology


Molecular function


Biological process


Cellular component


Biological pathways


Functional enrichment analysis for gene sets


DAVID


GSEA
