

Oct 1, 2013



Politecnico di Milano

School of Information Engineering

Master Degree

in Information Engineering

Course “Bioinformatics and Computational Biology for Medicine”

Exercises about DAVID (the Database for Annotation, Visualization and Integrated Discovery)

Davide Chicco


Exercise 0

Database for Annotation, Visualization and Integrated Discovery


What is DAVID?

DAVID (the Database for Annotation, Visualization and Integrated Discovery) is a free online bioinformatics resource developed by the Laboratory of Immunopathogenesis and Bioinformatics (LIB) of SAIC-Frederick (Science Applications International Corporation, in Frederick, Maryland, USA). All tools in the DAVID Bioinformatics Resources aim to provide functional interpretation of large lists of genes derived from genomic studies, e.g. microarray and proteomics studies. DAVID can be found at http://david.abcc.ncifcrf.gov

The DAVID Bioinformatics Resources consists of the DAVID Knowledgebase and five integrated, web-based functional annotation tool suites: the DAVID Gene Functional Classification Tool, the DAVID Functional Annotation Tool, the DAVID Gene ID Conversion Tool, the DAVID Gene Name Viewer and the NIAID Pathogen Genome Browser. The expanded DAVID Knowledgebase now integrates almost all major and well-known public bioinformatics resources centralized by the DAVID Gene Concept, a single-linkage method to agglomerate tens of millions of diverse gene/protein identifiers and annotation terms from a variety of public bioinformatics databases. For any uploaded gene list, the DAVID Resources now provides not only the typical gene-term enrichment analysis, but also new tools and functions that allow users to condense large gene lists into gene functional groups, convert between gene/protein identifiers, visualize many-genes-to-many-terms relationships, cluster redundant and heterogeneous terms into groups, search for interesting and related genes or terms, dynamically view genes from their lists on bio-pathways and more.

(from Wikipedia in English)

Which tools are provided?

The Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.7 is an update to the sixth version of our original web-accessible programs. DAVID now provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes. For any given gene list, DAVID tools are able to:

Identify enriched biological themes, particularly GO terms

Discover enriched functional-related gene groups

Cluster redundant annotation terms

Visualize genes on BioCarta & KEGG pathway maps

Display related many-genes-to-many-terms on 2-D view

Search for other functionally related genes not in the list

List interacting proteins

Explore gene names in batch

Link gene-disease associations

Highlight protein functional domains and motifs

Redirect to related literature

Convert gene identifiers from one type to another

And more...

(from DAVID website homepage)

[Question 0.0] What is the DAVID resource for? Give an example of a bioinformatics problem or question that could be solved using DAVID.

[Question 0.1] Many DAVID tools are for enrichment analysis. What is meant by “enrichment analysis”?

Exercise 1

DAVID data format

Since DAVID is a set of web services and tools focused on genes and gene lists, the first issue that we have to study is how gene data are represented.

Check out the DAVID website homepage: http://david.abcc.ncifcrf.gov

On the right, you can find a panel showing “Shortcut to DAVID Tools”.

Fig. 1: DAVID homepage on http://david.abcc.ncifcrf.gov


Click on the first “Functional annotation”.

You should get to the webpage shown in Fig. 2.

The system asks you to insert a gene list.

Click on “Upload”, and select a gene list provided by the system, like “Demolist 1”.

Click on “Show gene list” and then on “Download file”.

Fig. 2: Functional annotation tool


You should see what is shown in Fig. 3.

As you can see, the gene list is provided in just three columns: one (on the left) for the gene ID (in Affymetrix format, in this case); another (in the center) for the name of the gene; and a third column (on the right) for the species name (in this case: Homo sapiens).
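The three-column layout described above can also be read programmatically. The following is a minimal sketch, assuming the downloaded file is tab-separated plain text with one gene per line; the exact layout of the DAVID download is not stated in this handout, so the delimiter and the optional header row are assumptions. The `GeneRecord` type, the `parse_gene_list` helper and the sample rows are hypothetical placeholders, not real DAVID output.

```python
import csv
import io
from typing import NamedTuple


class GeneRecord(NamedTuple):
    gene_id: str   # identifier, e.g. an Affymetrix probe-set ID
    name: str      # gene name
    species: str   # species name


def parse_gene_list(text: str, has_header: bool = False) -> list[GeneRecord]:
    """Parse tab-separated text with the columns: gene ID, gene name, species."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    rows = [row for row in reader if row]  # skip empty lines
    if has_header:
        rows = rows[1:]  # drop the (assumed) header row
    # Only the first three columns are used; extra columns are ignored.
    return [GeneRecord(*(field.strip() for field in row[:3])) for row in rows]


# Made-up sample rows (placeholders, not taken from a DAVID demo list).
sample = (
    "PROBE_0001_at\thypothetical gene A\tHomo sapiens\n"
    "PROBE_0002_at\thypothetical gene B\tHomo sapiens\n"
)

records = parse_gene_list(sample)
for record in records:
    print(record.gene_id, "|", record.name, "|", record.species)
```

If the file you downloaded starts with a header row, pass `has_header=True`.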

[Question 1.0] Are there any other gene list data formats accepted by DAVID? If yes, list them.

[Question 1.1] In this list, is there any gene format that you have already used in the past, for other experiments or exercises? Which formats and when? (e.g. Yes, format XXX for the exercises on YYY tool)

Exercise 2

Functional annotation tool

This tool suite, introduced in the first version of DAVID, mainly provides typical batch annotation and gene-term enrichment analysis to highlight the most relevant GO terms associated with a given gene list. It shows information about annotations related to the genes present in the gene list.
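To make gene-term enrichment concrete: a term is called enriched when it is attached to more genes of the list than would be expected by chance given the background. The following is a minimal sketch, assuming a one-sided Fisher's exact test on a 2x2 table of counts, and assuming that the EASE score is the same p-value recomputed after removing one annotated gene from the list. The function names are hypothetical and the counts are purely illustrative.

```python
from math import comb


def fisher_one_sided(list_hits: int, list_misses: int,
                     bg_hits: int, bg_misses: int) -> float:
    """Probability of seeing at least `list_hits` annotated genes in the list,
    under the hypergeometric null with all margins of the 2x2 table fixed."""
    list_size = list_hits + list_misses
    total_hits = list_hits + bg_hits
    total = list_size + bg_hits + bg_misses
    upper = min(list_size, total_hits)
    favourable = sum(
        comb(total_hits, k) * comb(total - total_hits, list_size - k)
        for k in range(list_hits, upper + 1)
    )
    return favourable / comb(total, list_size)


def ease_score(list_hits: int, list_misses: int,
               bg_hits: int, bg_misses: int) -> float:
    """Assumed definition: the Fisher p-value with one list hit removed,
    which makes the result more conservative."""
    if list_hits < 1:
        return 1.0
    return fisher_one_sided(list_hits - 1, list_misses, bg_hits, bg_misses)


# Illustrative counts: 3 of 300 list genes carry the term,
# against 40 of 30000 background genes.
p_fisher = fisher_one_sided(3, 297, 40, 29960)
p_ease = ease_score(3, 297, 40, 29960)
print(f"Fisher p-value: {p_fisher:.4f}")
print(f"EASE score:     {p_ease:.4f}")
```

With these counts the EASE score comes out larger than the plain Fisher p-value, so a term needs stronger evidence to be reported as enriched.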


If you click back on “List”, a Gene List Manager bar should appear, as in Fig. 4.

Fig. 3: Datalist1 gene list


If you inserted demolist1, then you have to click on “use”, and the “Annotation Summary Results” table, as shown in Fig. 4, should appear.

The buttons and information you get are explained in Fig. 5.

Fig. 4: Gene list manager of the functional annotation tool


In this summary table, you can see that the retrieved annotations are divided into categories (Disease, Functional_Categories, etc.), based on annotation origin.

In Fig. 5, you can see the explanation of all the possible commands.

At the end of this page, you can find three commands for the operations you can carry out:

Functional annotation clustering

Functional annotation chart

Functional annotation table

If you click on the first, “Functional annotation clustering”, you should get to a webpage similar to Fig. 6.

Fig. 5: Summary results as explained on the website


Check out the options. Click on the cross button to expand them (Fig. 6b).

These option parameters allow you to better define your search:

Similarity Term Overlap (any value >= 0; default = 4): the minimum number of annotation terms overlapped between two genes in order to be qualified for kappa calculation. This parameter is to maintain the necessary statistical power to make the kappa value more meaningful. The higher the value, the more meaningful the result is.

Similarity Threshold (any value between 0 and 1; default = 0.35): the minimum kappa value to be considered biologically significant. The higher the setting, the more genes will be put into the unclustered group, which leads to a higher-quality functional classification result with fewer groups and fewer gene members. A kappa value of 0.3 starts giving meaningful biology based on our genome-wide distribution study. Anything below 0.3 has a great chance of being noise.

Initial Group Members (any value >= 2; default = 4): the minimum gene number in a seeding group, which affects the minimum size of each functional group in the final result. In general, a lower value attempts to include more genes in functional groups, and in particular generates a lot of small-size groups.

Final Group Members (any value >= 2; default = 4): the minimum gene number in one final group after the “cleanup” procedure. In general, a lower value attempts to include more genes in functional groups, and in particular generates a lot of small-size groups. It co-functions with the previous parameters to control the minimum size of functional groups. If you are interested in functional groups containing only 2 or 3 genes, you need to set it to a very low value. Otherwise, the small groups will not be displayed and will be put into the unclustered group.

Fig. 6: Functional clustering

Fig. 6b

Multi-linkage Threshold (any value between 0% and 100%; default = 50%): it controls how seeding groups merge with each other, i.e. two groups sharing the same gene members over the percentage will become one group. A higher percentage, in general, gives sharper separation, i.e. it generates more final functional groups with more tightly associated genes in each group. In addition, changing the parameter does not contribute extra genes to the unclustered group.
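The first two parameters above both refer to a kappa value computed between two genes. The following is a minimal sketch, assuming that kappa is Cohen's kappa computed on binary annotation profiles (each term is either attached to a gene or not) and that the two thresholds are applied as simple cut-offs. The `kappa` and `are_related` helpers, the gene names and the term names are hypothetical; the default values are the ones quoted in the text.

```python
def kappa(terms_a: set[str], terms_b: set[str], all_terms: set[str]) -> float:
    """Cohen's kappa between two genes, each described by its set of terms.
    Both sets are expected to be subsets of `all_terms`."""
    n = len(all_terms)
    both = len(terms_a & terms_b)
    only_a = len(terms_a - terms_b)
    only_b = len(terms_b - terms_a)
    neither = n - both - only_a - only_b
    observed = (both + neither) / n              # observed agreement
    p_a = (both + only_a) / n                    # share of terms on gene A
    p_b = (both + only_b) / n                    # share of terms on gene B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement by chance
    if expected == 1.0:
        return 1.0  # degenerate case: identical constant profiles
    return (observed - expected) / (1 - expected)


def are_related(terms_a: set[str], terms_b: set[str], all_terms: set[str],
                term_overlap: int = 4, threshold: float = 0.35) -> bool:
    """Assumed use of the two parameters: enough shared terms AND high kappa."""
    if len(terms_a & terms_b) < term_overlap:
        return False
    return kappa(terms_a, terms_b, all_terms) >= threshold


# Made-up annotation profiles over ten made-up terms.
all_terms = {f"T{i}" for i in range(1, 11)}
gene_x = {"T1", "T2", "T3", "T4", "T5"}
gene_y = {"T1", "T2", "T3", "T4", "T6"}
gene_z = {"T6", "T7", "T8", "T9", "T10"}

print("kappa(x, y) =", round(kappa(gene_x, gene_y, all_terms), 3))
print("kappa(x, z) =", round(kappa(gene_x, gene_z, all_terms), 3))
print("x and y related:", are_related(gene_x, gene_y, all_terms))
```

Raising `threshold` in this sketch makes fewer gene pairs count as related, which is consistent with the text's remark that a higher Similarity Threshold sends more genes to the unclustered group.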

If you click on the second, “Functional annotation chart”, you should get to a webpage similar to Fig. 7.

The third option, “Functional annotation table”, will give you what is shown in Fig. 8.

Fig. 7: Functional annotation chart


Answer these questions by navigating through the DAVID system, and by repeating the previously described steps.

[Question 2.0.0] Consider the dataset “datalist2” provided by the DAVID website. How many genes does it contain? Which species do they belong to?

[Question 2.0.1] Consider the gene tumor protein p53. Which are its related genes?

[Question 2.1.0] Use the “Functional annotation clustering” tool for datalist2, with default options. Save the data on your PC. How many clusters did you get?

[Question 2.1.1] Analyze the annotations of the first cluster. For every annotation, retrieve and report the description of its term.

[Question 2.1.2] Analyze the statistical values of every annotation in the cluster: Enrichment Score, Count, P_Value, Benjamini. The Enrichment Score is based on the EASE score, which is derived from Fisher's exact test. How is this score defined? Search for it on the web and describe it briefly.
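As background for the Benjamini column named in Question 2.1.2: when many terms are tested at once, raw p-values are usually adjusted for multiple testing. The following is a minimal sketch, assuming that the column holds p-values adjusted with the Benjamini-Hochberg procedure; the `benjamini_hochberg` helper is hypothetical and the input p-values are made up.

```python
def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that the adjusted values
    # never decrease as the raw p-values grow.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up raw p-values for four annotation terms.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
for p, q in zip(raw, adjusted):
    print(f"raw p = {p:.2f}  adjusted = {q:.4f}")
```

An adjusted value is never smaller than the raw p-value it comes from, so fewer terms pass a fixed significance cut-off after the correction.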

[Question 2.2] Use the “Functional annotation clustering” tool for datalist2, with the following options: Similarity Term Overlap=5; Similarity Threshold=0.5. Leave the other option parameters as default. Save the results on your PC. Compare these results with the results saved from [Question 2.1.0]. Which differences do you notice?

Exercise 3

Gene Functional Classification tool

What does this tool do?

Classify large gene lists into functionally related gene groups

Rank the importance of the discovered gene groups

Summarize the major biology of the discovered gene groups

Search other functionally related genes from the genome, but not in your list

Visualize genes and their functional annotations in a group by a single 2-D view

Explore a global view of gene groups in a Fuzzy Heat Map visualization

Fig. 8: Functional annotation table


The advantage of the tool: a novel gene-centric annotation approach

Your genes are highly organized so that they are more readable and understandable.

Your genes are ranked so that you can quickly focus on the most likely important ones.

Your genes are displayed with their annotation in one single view so that you can cross-compare them.

Your genes can be extended so that you have a chance to know other functionally related genes that are not in your list.

(from DAVID website)

This tool groups genes into gene groups, dividing them on the basis of functional relationships.
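The effect of the linkage threshold described among the option parameters of Exercise 2 can be illustrated with a toy merging loop. This is only a sketch under stated assumptions, not DAVID's actual algorithm: it assumes that two seeding groups merge when the share of common genes, measured against the smaller group, is strictly above the threshold, and that merging repeats until no pair qualifies. The `merge_groups` helper and the gene names are hypothetical.

```python
def merge_groups(groups: list[set[str]],
                 linkage_threshold: float = 0.5) -> list[set[str]]:
    """Repeatedly merge any two groups whose overlap exceeds the threshold."""
    merged = [set(group) for group in groups]  # work on copies
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                shared = len(merged[i] & merged[j])
                smaller = min(len(merged[i]), len(merged[j]))
                # Assumption: overlap is measured against the smaller group.
                if smaller and shared / smaller > linkage_threshold:
                    merged[i] |= merged[j]
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan after every merge
    return merged


# Made-up seeding groups: the first two share three of their four genes.
seeds = [
    {"g1", "g2", "g3", "g4"},
    {"g2", "g3", "g4", "g5"},
    {"g7", "g8", "g9", "g10"},
]

loose = merge_groups(seeds, linkage_threshold=0.5)
strict = merge_groups(seeds, linkage_threshold=0.8)
print("threshold 50%:", len(loose), "groups")
print("threshold 80%:", len(strict), "groups")
```

In this toy case the higher threshold keeps the two overlapping seeds apart, which matches the statement that a higher percentage generates more final groups.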

If you upload datalist1 and execute the tool, you get a page like Fig. 9.

If you click on the little black and green image, you move to the 2D Gene-Term associations view (Fig. 10).

Fig. 9: Classification of genes


[Question 3] Rerun the Gene Functional Classification tool on the same dataset, datalist1, with a different value for the Similarity Threshold option parameter. Then, look at the 2D view green/black image of a gene group. Did something change from the previous 2D image of the same gene list? What? Why? Find the correlation between changes in the Similarity Threshold parameter and changes in the 2D heat map.

Exercise 4

Gene ID Conversion Tool

A very easy tool that allows you to convert a gene ID from one format to another.

Fig. 10: 2D view
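Conceptually, an ID conversion is a lookup in a cross-reference table, and some identifiers may have no counterpart in the target format. The following is a minimal sketch of that idea, not of DAVID's implementation; the `convert_ids` helper, the mapping table and all identifiers are hypothetical placeholders.

```python
def convert_ids(gene_ids: list[str],
                mapping: dict[str, list[str]]) -> tuple[dict[str, list[str]], list[str]]:
    """Split the input IDs into converted ones and ones without any mapping."""
    converted: dict[str, list[str]] = {}
    unmapped: list[str] = []
    for gene_id in gene_ids:
        targets = mapping.get(gene_id)
        if targets:
            converted[gene_id] = targets
        else:
            unmapped.append(gene_id)  # missing, or mapped to nothing
    return converted, unmapped


# Hypothetical cross-reference table: one source ID may map to several
# target IDs, and some source IDs may be absent altogether.
cross_reference = {
    "SRC_ID_1": ["TARGET_ID_A"],
    "SRC_ID_2": ["TARGET_ID_B", "TARGET_ID_C"],
}

converted, unmapped = convert_ids(["SRC_ID_1", "SRC_ID_2", "SRC_ID_3"], cross_reference)
print("converted:", converted)
print("not converted:", unmapped)
```

Keeping the unconverted IDs in a separate list makes it easy to check afterwards which identifiers failed and to investigate why.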


Gene ID conversion tool


[Question 4] Convert the gene IDs of datalist1 to Ensembl gene format. Look at the results: did all your gene IDs get converted? Did you have any unsuccessful conversions? Why?

Tool manuals: