Dec 13, 2013 (3 years and 7 months ago)

Annotation, Databases, GO, Pathways, and all those things

Information on microarray data

Microarray data consist of different types of information:

Gene annotations
Sample annotations
Gene expression levels
Covariates
Experimental design
etc.

Annotation: Relating probesets to genes

Use of microarray clone annotation

Often, the result of a microarray data analysis is a list of genes.

The list has to be summarized with respect to its biological meaning.

For this, information about the genes and the related proteins has to be gathered.

If the list is small (say, 1 to 30 genes), this is easily done by reading database information and/or the available literature.

Sometimes, lists are longer (hundreds or even thousands of genes). Then the information has to be parsed and extracted automatically.
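For longer lists, the lookup can be automated with a table join. The following is a minimal sketch in base R, assuming a small hand-made annotation table: the column names and the unannotated probe ID "999999_at" are hypothetical, while the other IDs and symbols are borrowed from the examples later in these slides.

```r
# Toy annotation table; in practice this would come from an annotation
# package or database. Column names are illustrative assumptions.
annot <- data.frame(
  probe_id = c("201473_at", "201476_s_at", "1861_at", "209364_at"),
  symbol   = c("JUNB", "RRM1", "BAD", "BAD"),
  stringsAsFactors = FALSE
)

# Result of an analysis: a list of probe set IDs.
# "999999_at" is a made-up ID without annotation.
hits <- c("1861_at", "201473_at", "999999_at")

# Left join: keep every hit, fill in NA where no annotation is available
annotated <- merge(data.frame(probe_id = hits, stringsAsFactors = FALSE),
                   annot, by = "probe_id", all.x = TRUE)
print(annotated)
```

Keeping unannotated hits as NA rows (all.x = TRUE) makes gaps in the annotation visible instead of silently dropping genes from the list.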

From clone information to genes and proteins (1)

Microarrays are produced using information on expressed sequences such as EST clones, cDNAs, partial cDNAs, etc.

At the other end, functional information is generated (and available) for proteins.

Hence, there is a need to map a clone sequence ID to a protein ID.

This is a non-trivial task.

From clone information to genes and proteins: a non-trivial task

First, there are usually hundreds of ESTs (and several cDNA sequences) that map to the same gene.

The UniGene database tries to resolve this multiplicity problem by sequence clustering.

An alternative approach is taken by LocusLink, a fairly stable repository of genomic loci, each of which is supposed to correspond to a single gene.

Since the emphasis is on well-characterised loci, LocusLink is not complete.


From clone information to genes and proteins: multiple ways to go

There are other projects, such as RefSeq (NCBI) or the TIGR Gene Indices.

Depending on the cross-references available for a given microarray, one or the other may be advantageous.

An example: The human genome

With the completion of the human genome sequence, you’d think that such ambiguities could be resolved. In fact, that is not the case.

Part of the problem is that it is hard to predict gene structure (intron/exon) without knowing the entire mRNA sequence, which is the case for about two-thirds of all genes.

Then, there are errors in the assembly (putting together the sequence snippets). A typical symptom is that a gene appears to map to multiple loci on the same chromosome, with very high sequence similarity.

But there are also sequences that are nearly identical yet genuinely duplicated. Such duplications happened not long ago in evolution by means of transposable elements.

The human genome: Some figures

Currently, it’s estimated that the human genome contains about 25,000 to 30,000 genes that code for 50,000 to 100,000 different transcripts (and thus, proteins).

UniGene (human section) contains 105,680 clusters, but 45,999 of them are of size 2 or less.

RefSeq DNA contains 28,097 human sequences.

ENSEMBL contains 21,787 predicted genes and 31,609 predicted transcripts.

Fully computational methods like Genscan produce more than 65,000 predictions.

LocusLink contains 15,248 genes with known function, and a further 6,038 genes without function annotation.

Function annotation

Probably the most important thing you want to know is what the genes or their products are concerned with, i.e. their function.

Function annotation is difficult:

Different people use different words for the same function, or may mean different things by the same word.

The context in which a gene was found (e.g. “TGF-induced gene”) may not be particularly associated with its function.

Inference of function from sequence alone is error-prone and sometimes unreliable.

The best function annotation systems (GO, SwissProt) use human beings who read the literature before assigning a function to a gene.

The Gene Ontology

To overcome some of these problems, an annotation system has been created: the Gene Ontology (http://www.geneontology.org).

It represents a unified, consistent system, i.e. terms occur only once, and there is a dictionary of allowed words.

Furthermore, terms are related to each other: the hierarchy goes from very general terms to very detailed ones.
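To make the idea of a term hierarchy concrete, here is a minimal sketch in base R that walks from a detailed term up to the most general one. The term names and parent links are a simplified toy excerpt chosen for illustration; they are not the actual GO graph, and the helper ancestors() is a hypothetical name.

```r
# Toy excerpt of a term hierarchy: each term lists its direct parents.
# Names and links are simplified for illustration, not the real GO graph.
parents <- list(
  "biological_process"    = character(0),
  "cellular process"      = "biological_process",
  "developmental process" = "biological_process",
  "cell death"            = "cellular process",
  "programmed cell death" = c("cell death", "developmental process"),
  "apoptosis"             = "programmed cell death"
)

# Collect all ancestors of a term by walking up the parent links.
# A term may have several parents, so visited terms are tracked.
ancestors <- function(term, parents) {
  found <- character(0)
  todo <- parents[[term]]
  while (length(todo) > 0) {
    current <- todo[1]
    todo <- todo[-1]
    if (!(current %in% found)) {
      found <- c(found, current)
      todo <- c(todo, parents[[current]])
    }
  }
  found
}

print(ancestors("apoptosis", parents))
```

Because a term can have more than one parent, the structure is a directed acyclic graph rather than a simple tree, which is why the walk has to remember which terms it has already visited.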

Cross-references with GO

The GO database exists independently of other annotation databases.

Cross-references (GOA) make it possible to relate other annotations to those contained in GO.

Bioconductor and annotations

Annotation information is managed in Bioconductor through metadata packages.

These packages contain one-to-one and one-to-many mappings for frequently used chips, especially Affymetrix.

Information available includes gene names, gene symbols, database accession numbers, Gene Ontology function descriptions, enzyme classification (EC) numbers, relations to PubMed abstracts, and others.

There are several packages implementing functionality to deal with annotation information: annotate, AnnBuilder, ontoTools, GOstats, and many more.
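The difference between one-to-one and one-to-many mappings can be sketched with plain R data structures. This is only an illustration, not how the metadata packages store their data; the empty pathway entry for 201473_at is made up, while the other values are borrowed from examples elsewhere in these slides.

```r
# One-to-one: each probe set maps to exactly one gene symbol
symbol <- c("201473_at" = "JUNB", "1861_at" = "BAD")

# One-to-many: a probe set may map to several pathway IDs, or to none.
# The empty entry for 201473_at is an assumption made for illustration.
path <- list(
  "1861_at"   = c("01510", "04012", "04210"),
  "201473_at" = character(0)
)

print(symbol[["1861_at"]])   # a single value
print(path[["1861_at"]])     # a vector of values
print(lengths(path))         # number of targets per key
```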

Static vs. Dynamic Annotation

Static Annotation:

Bioconductor packages containing annotation information that are installed locally on a computer
well-defined structure
reproducible analyses
no need for a network connection

Dynamic Annotation:

stored in a remote database
more frequent updates
possibly different results when repeating analyses
more information
one needs to know about the structure of the database, the API of the web service, etc.


Available Metadata

EntrezGene is a catalog of genetic loci that connects curated sequence information to official nomenclature. It replaced LocusLink.

UniGene defines sequence clusters. UniGene focuses on protein-coding genes of the nuclear genome (excluding rRNA and mitochondrial sequences).

RefSeq is a non-redundant set of transcripts and proteins of known genes for many species, including human, mouse and rat.

Enzyme Commission (EC) numbers are assigned to different enzymes and linked to genes through EntrezGene.


Gene Ontology (GO) is a structured vocabulary of terms describing gene products according to molecular function, biological process, or cellular component.

PubMed is a service of the U.S. National Library of Medicine. PubMed provides a rich resource of data and tools for papers in journals related to medicine and health. While large, the data source is not comprehensive, and not all papers have been abstracted.


OMIM: Online Mendelian Inheritance in Man is a catalog of human genes and genetic disorders.

NetAffx: Affymetrix’ NetAffx Analysis Center provides annotation resources for Affymetrix GeneChip technology.

KEGG: Kyoto Encyclopedia of Genes and Genomes; a collection of data resources including a rich collection of pathway data.

IntAct: Protein interaction data, mainly derived from experiments.

Pfam: a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families.


Chromosomal Location: Genes are identified with chromosomes, and where appropriate with strand.

Data Archives: The NCBI coordinates the Gene Expression Omnibus (GEO); TIGR provides the Resourcerer database, and the EBI runs ArrayExpress.


Annotation Packages

An early design decision was to provide metadata on a per-chip-type basis (e.g. hgu133a, hgu95av2).

Each annotation package contains objects that provide mappings between identifiers (genes, probes, …) and different types of annotation data.

One can list the content of a package:

> library("hgu133a")
> ls("package:hgu133a")
 [1] "hgu133a"             "hgu133aACCNUM"
 [3] "hgu133aCHR"          "hgu133aCHRLENGTHS"
 [5] "hgu133aCHRLOC"       "hgu133aENTREZID"
 [7] "hgu133aENZYME"       "hgu133aENZYME2PROBE"
 [9] "hgu133aGENENAME"     "hgu133aGO"
[11] "hgu133aGO2ALLPROBES" "hgu133aGO2PROBE"
[13] "hgu133aLOCUSID"      "hgu133aMAP"
[15] "hgu133aMAPCOUNTS"    "hgu133aOMIM"
[17] "hgu133aORGANISM"     "hgu133aPATH"
[19] "hgu133aPATH2PROBE"   "hgu133aPFAM"
[21] "hgu133aPMID"         "hgu133aPMID2PROBE"
[23] "hgu133aPROSITE"      "hgu133aQC"
[25] "hgu133aREFSEQ"       "hgu133aSUMFUNC_DEPRECATED"
[27] "hgu133aSYMBOL"       "hgu133aUNIGENE"

A little bit of history... (the pre-SQL era)

before: hgu95av2

now: hgu95av2.db

Objects in annotation packages used to be environments, i.e. hash tables for mapping:

mapping only from one identifier to another, hard to reverse
quite inflexible

Now things are stored in an SQLite database.

The user interface still supports many of the old environment-specific interactions:

You can access the data directly using any of the standard subsetting or extraction tools for environments: get, mget, $ and [[.

> get("201473_at", hgu133aSYMBOL)
[1] "JUNB"
> mget(c("201473_at","201476_s_at"), hgu133aSYMBOL)
$`201473_at`
[1] "JUNB"

$`201476_s_at`
[1] "RRM1"

> hgu133aSYMBOL$"201473_at"
[1] "JUNB"
> hgu133aSYMBOL[["201473_at"]]
[1] "JUNB"
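The environment-style access can be tried without installing an annotation package. The sketch below builds a two-entry toy environment (sym_env is a hypothetical stand-in for hgu133aSYMBOL) and also shows why such a hash table is hard to reverse.

```r
# A toy environment standing in for hgu133aSYMBOL (only two entries)
sym_env <- new.env(hash = TRUE)
assign("201473_at", "JUNB", envir = sym_env)
assign("201476_s_at", "RRM1", envir = sym_env)

# The four access styles named above
print(get("201473_at", envir = sym_env))
print(mget(c("201473_at", "201476_s_at"), envir = sym_env))
print(sym_env$"201473_at")
print(sym_env[["201473_at"]])

# Forward lookup is a hash access, but the reverse direction
# (symbol -> probe) needs a scan over all entries:
all_syms <- unlist(as.list(sym_env))
print(names(all_syms)[all_syms == "RRM1"])
```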



Working with Metadata

Suppose we are interested in the gene BAD.

> gsyms <- unlist(as.list(hgu133aSYMBOL))
> whBAD <- grep("^BAD$", gsyms)
> gsyms[whBAD]
  1861_at 209364_at
    "BAD"     "BAD"
> hgu133aGENENAME$"1861_at"
[1] "BCL2-antagonist of cell death"
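The anchors ^ and $ in the pattern "^BAD$" matter: without them, grep would also return every symbol that merely contains the string BAD. A small sketch with a toy vector, in which the symbols BADX and ZBAD and their probe IDs are made up for illustration:

```r
# Toy symbol vector; "BADX", "ZBAD" and their probe IDs are hypothetical
gsyms <- c("1861_at" = "BAD", "209364_at" = "BAD",
           "x1_at" = "BADX", "x2_at" = "ZBAD")

print(grep("BAD", gsyms))      # unanchored: matches all four entries
print(grep("^BAD$", gsyms))    # anchored: matches only the exact symbol
print(which(gsyms == "BAD"))   # exact comparison without a regular expression
```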

Find the pathways that BAD is associated with.

> BADpath <- hgu133aPATH$"1861_at"
> kegg <- mget(BADpath, KEGGPATHID2NAME)
> unlist(kegg)
                        01510
"Neurodegenerative Disorders"
                        04012
     "ErbB signaling pathway"
                        04210
                  "Apoptosis"
                        04370
[...]
          "Colorectal cancer"
                        05212
          "Pancreatic cancer"
                        05213
         "Endometrial cancer"
                        05215
[...]

We can get the GeneChip probes and the unique EntrezGene loci in each of these pathways. First, we obtain the Affymetrix IDs:

> allProbes <- mget(BADpath, hgu133aPATH2PROBE)
> length(allProbes)
[1] 15
> allProbes[[1]][1:10]
 [1] "206679_at"   "209462_at"   "203381_s_at" "203382_s_at"
 [5] "212874_at"   "212883_at"   "212884_x_at" "200602_at"
 [9] "211277_x_at" "214953_s_at"
> sapply(allProbes, length)
01510 04012 04210 04370 04510 04910 05030 05210 05212 05213
   85   169   162   137   413   243    39   167   156   111
05215 05218 05220 05221 05223
  194   137   160   117   110

And then we can map these to their Entrez Gene values.

> getEG = function(x) unique(unlist(mget(x, hgu133aENTREZID)))
> allEG = sapply(allProbes, getEG)
> sapply(allEG, length)
01510 04012 04210 04370 04510 04910 05030 05210 05212 05213
   37    84    81    67   187   130    18    82    72    51
05215 05218 05220 05221 05223
   85    68    74    53    53
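The gene counts are smaller than the probe counts because several probe sets can map to the same gene. The following base R sketch reproduces the two-step mapping on toy data; the pathway names A and B, the probe IDs p1 to p5, the gene IDs g1 to g3 and the helper get_genes() are all hypothetical. It also illustrates one design point: lapply always returns a list, whereas sapply may simplify its result to a matrix when every pathway happens to yield the same number of genes.

```r
# Toy stand-ins for hgu133aPATH2PROBE and hgu133aENTREZID.
# All pathway, probe and gene IDs here are made up for illustration.
path2probe <- list(A = c("p1", "p2", "p3"), B = c("p3", "p4", "p5"))
probe2gene <- c(p1 = "g1", p2 = "g1", p3 = "g2", p4 = "g3", p5 = "g3")

# Map a set of probes to the unique genes they represent
get_genes <- function(probes) unique(unname(probe2gene[probes]))

# lapply keeps one list element per pathway, whatever the lengths are
all_genes <- lapply(path2probe, get_genes)

print(lengths(path2probe))  # probes per pathway
print(lengths(all_genes))   # unique genes per pathway
```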


.db Packages

Data in the new .db annotation packages is stored in SQLite databases:

much more efficient and flexible
old environment-style access provided by objects of class Bimap (package AnnotationDbi)

[Figure: a Bimap drawn as a bipartite graph linking left objects to right objects; each object has a name and attributes (attr1 = value1, attr2 = value2).]



DBI

collection of classes and methods for database interaction

they abstract the particular implementations of common standard operations on different types of databases

resultSet: operations are performed on the database, the user controls how much information is returned

dbSendQuery: create result set

dbGetQuery: get all results

dbGetQuery(connection, sql query)
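The difference between dbGetQuery and dbSendQuery can be sketched on a throwaway in-memory database. This example assumes that the DBI and RSQLite packages are installed (it does nothing otherwise), and the gene_info table built here is a toy with a single column, not the schema of a real annotation package.

```r
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  # In-memory SQLite database with a toy gene_info table (assumed layout)
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  DBI::dbWriteTable(con, "gene_info",
                    data.frame(symbol = c("A2M", "NAT1", "NAT2", "SERPINA3"),
                               stringsAsFactors = FALSE))

  # dbGetQuery: run the query and return all rows at once
  all_rows <- DBI::dbGetQuery(con, "select symbol from gene_info")

  # dbSendQuery: create a result set, then fetch only as many rows as wanted
  res <- DBI::dbSendQuery(con, "select symbol from gene_info")
  first_two <- DBI::dbFetch(res, n = 2)
  DBI::dbClearResult(res)   # release the result set

  DBI::dbDisconnect(con)
  print(all_rows)
  print(first_two)
}
```

Fetching from a result set in chunks is mainly useful when a query returns more rows than one wants to hold in memory at once.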

Notice that there are a few more entries here. They give you access to a connection to the database.

> library("hgu133a.db")
> ls("package:hgu133a.db")
 [1] "hgu133aACCNUM"       "hgu133aALIAS2PROBE"
 [3] "hgu133aCHR"          "hgu133aCHRLENGTHS"
 [5] "hgu133aCHRLOC"       "hgu133aENTREZID"
 [7] "hgu133aENZYME"       "hgu133aENZYME2PROBE"
 [9] "hgu133aGENENAME"     "hgu133aGO"
[11] "hgu133aGO2ALLPROBES" "hgu133aGO2PROBE"
[13] "hgu133aMAP"          "hgu133aMAPCOUNTS"
[15] "hgu133aOMIM"         "hgu133aORGANISM"
[17] "hgu133aPATH"         "hgu133aPATH2PROBE"
[19] "hgu133aPFAM"         "hgu133aPMID"
[21] "hgu133aPMID2PROBE"   "hgu133aPROSITE"
[23] "hgu133aREFSEQ"       "hgu133aSYMBOL"
[25] "hgu133aUNIGENE"      "hgu133a_dbInfo"
[27] "hgu133a_dbconn"      "hgu133a_dbfile"
[29] "hgu133a_dbschema"

> con <- hgu133a_dbconn()
> q1 <- "select symbol from gene_info"
> head(dbGetQuery(con, q1))
    symbol
1      A2M
2     NAT1
3     NAT2
4 SERPINA3

Extract information from a database table as a data.frame:

> toTable(hgu133aSYMBOL)[1:3,]
   probe_id symbol
1 217757_at    A2M
2 214440_at   NAT1
3 206797_at   NAT2

Reverse mapping:

> revmap(hgu133aSYMBOL)$BAD
[1] "1861_at"   "209364_at"
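What a reverse mapping does can be shown with base R alone. The sketch below is not how revmap is implemented; it only illustrates the idea on a toy named vector (probe2symbol and symbol2probe are hypothetical names, the values are borrowed from the slide examples).

```r
# Toy probe -> symbol mapping (values taken from the slide examples)
probe2symbol <- c("1861_at" = "BAD", "209364_at" = "BAD",
                  "201473_at" = "JUNB", "201476_s_at" = "RRM1")

# Reverse it: group the probe IDs (the names) by their symbol (the values)
symbol2probe <- split(names(probe2symbol), probe2symbol)

print(symbol2probe$BAD)
```

The reversed map is one-to-many even though the forward map is one-to-one, because several probe sets can target the same gene.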

Lkeys, Rkeys: get left and right keys of a Bimap object

> head(Lkeys(hgu133aSYMBOL))
[1] "1007_s_at" "1053_at"   "117_at"    "121_at"    "1255_g_at" "1294_at"
> head(Rkeys(hgu133aSYMBOL))
[1] "A2M"      "NAT1"     "NAT2"     "SERPINA3" "AADAC"    "AAMP"

nhit: number of hits for every left key in a Bimap object

> table(nhit(revmap(hgu133aSYMBOL)))

   1    2    3    4    5    6    7    8    9   10   11   12   13   18   19
8101 2814 1273  475  205   77   19   15    5    3    4    1    2    1    1
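The table above counts how many gene symbols are represented by 1, 2, 3, ... probe sets. The same kind of summary can be sketched in base R on a toy symbol-to-probe list; symbol2probe and hits are hypothetical names, and the base functions used here are only analogues of Lkeys, Rkeys and nhit, not their implementation.

```r
# Toy symbol -> probes map (values taken from the slide examples)
symbol2probe <- list(BAD  = c("1861_at", "209364_at"),
                     JUNB = "201473_at",
                     RRM1 = "201476_s_at")

print(names(symbol2probe))            # left keys of the reversed map
print(unique(unlist(symbol2probe)))   # right keys
hits <- lengths(symbol2probe)         # number of probes per symbol
print(table(hits))                    # how many symbols have 1, 2, ... probes
```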