
1 Oct 2013


Politecnico di Milano
School of Information Engineering
Master Degree in Information Engineering
Course “Bioinformatics and Computational Biology for Medicine”

Exercises about DAVID (Database for Annotation, Visualization and Integrated Discovery)

Arif Canakoglu, Davide Chicco
canakoglu@elet.polimi.it




Exercise 0

Database for Annotation, Visualization and Integrated Discovery



What is DAVID?


DAVID (the Database for Annotation, Visualization and Integrated Discovery) is a free online bioinformatics resource developed by the Laboratory of Immunopathogenesis and Bioinformatics (LIB) of SAIC-Frederick (Science Applications International Corporation, in Frederick, Maryland, USA). All tools in the DAVID Bioinformatics Resources aim to provide functional interpretation of large lists of genes derived from genomic studies, e.g. microarray and proteomics studies. DAVID can be found at http://david.abcc.ncifcrf.gov



The DAVID Bioinformatics Resources consists of the DAVID Knowledgebase and five integrated, web-based functional annotation tool suites: the DAVID Gene Functional Classification Tool, the DAVID Functional Annotation Tool, the DAVID Gene ID Conversion Tool, the DAVID Gene Name Viewer and the DAVID NIAID Pathogen Genome Browser. The expanded DAVID Knowledgebase now integrates almost all major and well-known public bioinformatics resources centralized by the DAVID Gene Concept, a single-linkage method to agglomerate tens of millions of diverse gene/protein identifiers and annotation terms from a variety of public bioinformatics databases. For any uploaded gene list, the DAVID Resources now provides not only the typical gene-term enrichment analysis, but also new tools and functions that allow users to condense large gene lists into gene functional groups, convert between gene/protein identifiers, visualize many-genes-to-many-terms relationships, cluster redundant and heterogeneous terms into groups, search for interesting and related genes or terms, dynamically view genes from their lists on bio-pathways and more.

(from Wikipedia in English)


Which tools are provided?



The Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.7 is an update to the sixth version of our original web-accessible programs. DAVID now provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes. For any given gene list, DAVID tools are able to:




- Identify enriched biological themes, particularly GO terms
- Discover enriched functional-related gene groups
- Cluster redundant annotation terms
- Visualize genes on BioCarta & KEGG pathway maps
- Display related many-genes-to-many-terms on 2-D view
- Search for other functionally related genes not in the list
- List interacting proteins
- Explore gene names in batch
- Link gene-disease associations
- Highlight protein functional domains and motifs
- Redirect to related literatures
- Convert gene identifiers from one type to another
- and more...

(from DAVID website homepage)


[Question 0.0] What is the DAVID resource for? Give an example of a bioinformatics problem or question that could be solved using DAVID.

[Question 0.1] Many DAVID tools are for enrichment analysis. What is meant by “enrichment analysis”?




Exercise 1


DAVID data format




Since DAVID is a collection of web services and tools focused on genes and gene lists, the first issue we have to study is how gene data are represented.

Check out the DAVID website homepage:

On the right, you can find a panel showing “Shortcut to DAVID Tools”.


Fig. 1 - DAVID homepage on http://david.abcc.ncifcrf.gov

Click on the first “Functional annotation”.

You should get to the webpage shown in Fig. 2.

The system asks you to insert a gene list.

Click on “Upload”, and select a gene list provided by the system, like “Demolist 1”.

Click on “Show gene list” and then on “Download file”.



Fig. 2 - Functional annotation tool

You should see what is shown in Fig. 3.

As you can see, the gene list is provided in just three columns: one (on the left) for the gene ID (in Affymetrix format, in this case); another (in the center) for the name of the gene; and a third column (on the right) for the species name (in this case: Homo sapiens).
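The three-column layout described above can be read with a few lines of Python. This is a minimal sketch that assumes the downloaded file is tab-delimited with one header row; the column names and the sample rows are placeholders invented for illustration, not actual DAVID output.

```python
import csv
import io

# Hypothetical sample mimicking the three-column layout (ID, gene name, species).
# The IDs and names are placeholders, not real DAVID output.
SAMPLE = (
    "ID\tGene Name\tSpecies\n"
    "probe_0001\texample gene one\tHomo sapiens\n"
    "probe_0002\texample gene two\tHomo sapiens\n"
)

def read_gene_list(handle):
    """Return a list of (gene_id, gene_name, species) tuples from a tab-delimited file."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row (assumed to be present)
    return [tuple(field.strip() for field in row[:3]) for row in reader if len(row) >= 3]

genes = read_gene_list(io.StringIO(SAMPLE))
species = {s for _, _, s in genes}
print(len(genes), "genes;", "species:", ", ".join(sorted(species)))
```

To try it on a real download, replace `io.StringIO(SAMPLE)` with an opened file handle and check the delimiter and header of the file first, since both are assumptions here.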


[Question 1.0] Are there any other gene list data formats accepted by DAVID? If yes, list them.

[Question 1.1] In this list, is there any gene format that you have already used in the past, for other experiments or exercises? Which formats and when? (e.g. Yes, format XXX for the exercises on YYY tool)




Exercise 2


Functional annotation tool




This tool suite, introduced in the first version of DAVID, mainly provides typical batch annotation and gene-GO term enrichment analysis to highlight the most relevant GO terms associated with a given gene list. It shows information about annotations related to the genes present in the gene list.
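As a rough illustration of what a gene-term enrichment analysis computes, the sketch below evaluates a one-sided Fisher's exact test (the upper tail of the hypergeometric distribution) for a single term. It is not DAVID's implementation: DAVID reports a modified variant of this test, and the function name, parameter names and toy numbers are assumptions made for this example.

```python
from math import comb

def enrichment_p_value(list_hits, list_total, pop_hits, pop_total):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    Probability of seeing at least `list_hits` annotated genes in a list of
    `list_total` genes drawn without replacement from a background of
    `pop_total` genes, of which `pop_hits` carry the annotation.
    """
    upper = min(list_total, pop_hits)
    total = comb(pop_total, list_total)
    tail = sum(
        comb(pop_hits, x) * comb(pop_total - pop_hits, list_total - x)
        for x in range(list_hits, upper + 1)
    )
    return tail / total

# Toy numbers chosen for illustration only: 4 of the 4 listed genes carry a
# term that annotates 5 of the 10 genes in the background.
p = enrichment_p_value(4, 4, 5, 10)
print(f"p = {p:.4f}")
```

A small p-value suggests that the term appears in the list more often than expected by chance, which is the sense in which the term is "enriched".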


If you click back on “List”, a Gene List Manager bar should appear, as in Fig. 4.

Fig. 3 - Datalist1 gene list



If you selected demolist1, click on “use”; the “Annotation Summary Results” table should then appear, as shown in Fig. 4.

The buttons and information you get are explained in Fig. 5.





Fig. 4 - Gene list manager of the Functional annotation tool



In this summary table, you can see that the retrieved annotations are divided into categories (Disease, Functional_Categories, etc.) based on the origin of the annotation.

Fig. 5 explains all the available commands.


At the end of this page, you can find three commands for the operations you can carry out:

- Functional annotation clustering
- Functional annotation chart
- Functional annotation table


If you click on the first, “Functional annotation clustering”, you should get to a webpage similar to Fig. 6.






Fig. 5 - Summary results as explained on the website



Check out the options. Click on the cross button to expand them (Fig. 6b).

These option parameters allow you to refine your search:


Similarity Term Overlap (any value >= 0; default = 4): the minimum number of annotation terms that two genes must share in order to qualify for kappa calculation. This parameter maintains the statistical power needed to make the kappa value meaningful. The higher the value, the more meaningful the result.

Similarity Threshold (any value between 0 and 1; default = 0.35): the minimum kappa value to be considered biologically significant. The higher the setting, the more genes are put into the unclustered group, which leads to a higher-quality functional classification with fewer groups and fewer gene members. According to the DAVID developers' genome-wide distribution study, a kappa value of 0.3 starts giving meaningful biology. Anything below 0.3 has a great chance of being noise.

Initial Group Members (any value >= 2; default = 4): the minimum number of genes in a seeding group, which affects the minimum size of each functional group in the final result. In general, a lower value attempts to include more genes in functional groups and, in particular, generates many small groups.

Final Group Members (any value >= 2; default = 4): the minimum number of genes in one final group after the “cleanup” procedure. In general, a lower value attempts to include more genes in functional groups and, in particular, generates many small groups. It works together with the previous parameters to control the minimum size of functional groups. If you are interested in functional groups containing only 2 or 3 genes, you need to set it to a very low value. Otherwise, the small groups will not be displayed and will be put into the unclustered group.

Fig. 6 - Functional clustering

Fig. 6b - Options

Multi-linkage Threshold (any value between 0% and 100%; default = 50%): controls how seeding groups merge with each other, i.e. two groups that share more than this percentage of gene members become one group. In general, a higher percentage gives sharper separation, i.e. it generates more final functional groups with more tightly associated genes in each group. In addition, changing this parameter does not add extra genes to the unclustered group.
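The kappa value mentioned in the parameter descriptions measures how far the annotation profiles of two genes agree beyond chance. The sketch below computes Cohen's kappa on binary term profiles and applies the two default cut-offs quoted above. It is a simplified illustration, not DAVID's code; the gene and term names are placeholders.

```python
def kappa(terms_a, terms_b, all_terms):
    """Cohen's kappa between two genes' binary annotation profiles.

    `terms_a` and `terms_b` must be subsets of `all_terms`.
    """
    n = len(all_terms)
    both = len(terms_a & terms_b)
    only_a = len(terms_a - terms_b)
    only_b = len(terms_b - terms_a)
    neither = n - both - only_a - only_b
    observed = (both + neither) / n
    # Chance agreement: both annotated by chance plus both unannotated by chance.
    expected = ((both + only_a) * (both + only_b)
                + (neither + only_b) * (neither + only_a)) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both profiles are identical and constant
    return (observed - expected) / (1.0 - expected)

# Placeholder data: ten terms, two genes sharing four of them.
all_terms = {f"term_{i}" for i in range(10)}
gene_a = {"term_0", "term_1", "term_2", "term_3", "term_4"}
gene_b = {"term_0", "term_1", "term_2", "term_3", "term_5"}

MIN_OVERLAP = 4          # Similarity Term Overlap default quoted above
KAPPA_THRESHOLD = 0.35   # Similarity Threshold default quoted above

overlap = len(gene_a & gene_b)
k = kappa(gene_a, gene_b, all_terms)
similar = overlap >= MIN_OVERLAP and k >= KAPPA_THRESHOLD
print(f"overlap={overlap}, kappa={k:.2f}, similar={similar}")
```

Raising `KAPPA_THRESHOLD` in this sketch makes fewer gene pairs count as similar, which matches the description of more genes ending up unclustered at higher settings.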


If you click on the second, “Functional annotation chart”, you should get to a webpage similar to Fig. 7.

The third option, “Functional annotation table”, will give you what is shown in Fig. 8.

Fig. 7 - Functional annotation chart



Answer these questions by navigating through the DAVID system and repeating the procedure described above.

[Question 2.0.0] Consider the dataset “datalist2” provided by the DAVID website. How many genes does it contain? Which species do they belong to?

[Question 2.0.1] Consider the gene tumor protein p53. Which are its related genes?


[Question 2.1.0] Use the “Functional annotation clustering” tool for datalist2, with default options. Save the data on your PC. How many clusters did you get?

[Question 2.1.1] Analyze the annotations of the 1st cluster. For every annotation, retrieve and report the description of its term.

[Question 2.1.2] Analyze the statistical values of every annotation in the cluster: Enrichment Score, Count, P_Value, Benjamini. The Enrichment Score is based on the EASE score, which is derived from Fisher's exact test. How is this score defined? Search for it on the web and describe it briefly.


[Question 2.2] Use the “Functional annotation clustering” tool for datalist2, with the following options: Similarity Term Overlap = 5; Similarity Threshold = 0.5. Leave the other option parameters at their defaults. Save the results on your PC. Compare these results with the results saved from [Question 2.1.0]. Which differences do you notice?




Exercise 3


Gene Functional Classification tool




What does this tool do?




- Classify large gene list into functional related gene groups
- Rank the importance of the discovered gene groups
- Summarize the major biology of the discovered gene groups
- Search other functionally related genes from genome, but not in your list
- Visualize genes and their functional annotations in a group by a single 2-D view
- Explore global view of gene groups in a Fuzzy Heat Map visualization

Fig. 8 - Functional annotation table


The advantage of the tool: A novel gene-centric annotation approach

- Your genes are highly organized so that they are more readable and understandable.
- Your genes are ranked so that you can quickly focus on the most likely important ones.
- Your genes are displayed with their annotation in one single view so that you can cross compare them.
- Your genes can be extended so that you have chance to know other functionally related genes, but not in your list.

(from DAVID website)


This tool groups genes into gene groups, dividing them on the basis of functional relationships.

If you upload datalist1 and run the tool, you get a page like Fig. 9.

If you click on the small black and green image, you move to the 2D Gene-Term associations view (Fig. 10).
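The 2D view can be thought of as a gene-by-term matrix in which each cell marks whether an association between that gene and that term is reported. The following sketch renders such a matrix as text. The gene and term names are placeholders, and the "#" and "." symbols are assumed stand-ins for the two cell colours.

```python
# Hypothetical gene-to-term annotations; names are placeholders for illustration.
annotations = {
    "gene_A": {"term_1", "term_2"},
    "gene_B": {"term_1", "term_2", "term_3"},
    "gene_C": {"term_3"},
}

def two_d_view(annotations):
    """Render a gene-by-term matrix: '#' marks a reported association, '.' none."""
    terms = sorted(set().union(*annotations.values()))
    lines = []
    for gene in sorted(annotations):
        cells = "".join("#" if t in annotations[gene] else "." for t in terms)
        lines.append(f"{gene} {cells}")
    return terms, lines

terms, lines = two_d_view(annotations)
print("columns:", ", ".join(terms))
print("\n".join(lines))
```

Genes whose rows look alike share most of their terms, which is why they tend to fall into the same functional group.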







Fig. 9 - Classification of genes


[Question 3] Rerun the Gene Functional Classification tool on the same dataset, datalist1, with a different value of the Similarity Threshold parameter. Then look at the 2D view (green/black image) of a gene group. Did something change from the previous 2D image of the same gene list? What? Why? Find the correlation between changes in the Similarity Threshold parameter and changes in the 2D heat map.



Exercise 4


Gene ID Conversion Tool




A very simple tool that allows you to convert gene IDs from one format to another.
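Conceptually, ID conversion is a lookup in a mapping table, and some IDs may have no entry or more than one target. The sketch below shows that idea with a hypothetical in-memory table. All identifiers are placeholders rather than real Affymetrix or Ensembl accessions, and the code does not call DAVID.

```python
# Hypothetical lookup table from source IDs to target IDs; all identifiers
# are placeholders, not real Affymetrix or Ensembl accessions.
id_map = {
    "source_id_1": ["target_id_1"],
    "source_id_2": ["target_id_2a", "target_id_2b"],  # one-to-many mapping
}

def convert_ids(gene_ids, id_map):
    """Split IDs into converted ones (with their targets) and unconverted ones."""
    converted = {g: id_map[g] for g in gene_ids if g in id_map}
    unconverted = [g for g in gene_ids if g not in id_map]
    return converted, unconverted

converted, unconverted = convert_ids(
    ["source_id_1", "source_id_2", "source_id_3"], id_map
)
print(len(converted), "converted;", "unconverted:", unconverted)
```

Keeping the unconverted IDs in a separate list makes it easy to inspect which entries had no counterpart in the target format.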


Fig. 10 - 2D view

Fig. 11 - Gene ID conversion tool

[Question 4] Convert the gene IDs of datalist1 to the Ensembl gene format. Look at the results: were all your gene IDs converted? Were there any unsuccessful conversions? Why?



Tool manuals: