Applied Bioinformatics
Bing Zhang

Department of Biomedical Informatics
Vanderbilt University
bing.zhang@vanderbilt.edu

Course overview


What is bioinformatics


Data driven science:
the creation and
advancement of databases, algorithms, and
computational and statistical methods to
solve theoretical and practical problems
arising from the management and analysis of
biological data.


Major research areas:
sequence alignment,
gene finding, genome assembly, protein
structure prediction, gene expression and
regulation, protein interaction, drug design,
genome-wide association studies,
computational evolutionary biology etc.


Applied bioinformatics module


Not a comprehensive guide to all facets of
bioinformatics


To equip you with the computational
understanding and expertise needed to solve
bioinformatics problems that you will likely
encounter in your research.
Applied Bioinformatics, Spring 2011
http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html

http://www.bioinformatics.ca/links_directory/

Course content and grades
Date  Subject                                                    Instructor  Homework (HW)
2/14  Finding information about genes                            Zhang
2/16  Navigating sequenced genomes                               Zhang
2/18  Pairwise sequence alignment and database search            Zhao
2/21  Multiple sequence alignment                                Zhao
2/23  Inferring phylogenetic relationships from sequence data    Zhao        HW I distribution (20 pts Zhao + 10 pts Zhang)
2/25  Protein sequence annotation                                Tabb
2/28  Protein structure prediction and visualization             Tabb        HW I due
3/2   Protein identification by mass spectrometry                Tabb        HW II distribution (20 pts Tabb)
3/4   Gene prediction and annotation                             Bush
3/7   Finding regulatory and conserved elements in DNA sequence  Bush        HW II due
3/9   Assessing the impact of genetic variation                  Bush        HW III distribution (20 pts Bush)
3/11  Supervised analysis of gene expression data                Zhang
3/14  Unsupervised analysis of gene expression data              Zhang       HW III due
3/16  Functional interpretation of gene lists                    Zhang
3/18  Biological pathways                                        Zhang
3/21  Biological networks                                        Zhang       HW IV distribution (30 pts Zhang)
3/25                                                                         Homework IV due by 5pm

HW assignments will be graded by each instructor for their respective
sections. Final Grade = sum of the HW scores (100 pts in total).
A: 85-100; B: 70-84; C: 55-69; D: 40-54; F: 0-39

Course materials and assignments


Lecture slides available at
https://medschool.mc.vanderbilt.edu/iidea/admin/admin_login.php
before each lecture

Homework assignments available at
https://medschool.mc.vanderbilt.edu/iidea/admin/admin_login.php
on the distribution date (2/23, 3/2, 3/9, 3/21)

Homework assignments are due at 5pm on the due date (2/28, 3/7, 3/14, 3/25).
There will be a 10% per day deduction for late reports.

Email your reports in pdf, doc, or docx format to the corresponding instructor(s):
HW I: bing.zhang@vanderbilt.edu; zhongming.zhao@vanderbilt.edu
HW II: david.l.tabb@vanderbilt.edu
HW III: william.s.bush@vanderbilt.edu
HW IV: bing.zhang@vanderbilt.edu

Textbook (optional): Dear, Paul H. (2007) Methods Express: Bioinformatics.
Scion, ISBN 978-1904842163.
Finding information about genes
Bing Zhang

Department of Biomedical Informatics
Vanderbilt University
bing.zhang@vanderbilt.edu

When do we need gene information?


Case 1


From Prof. Randy Blakely (Pharmacology):
“We have hit an
uncharacterized gene in our hunt for SERT interacting proteins=******
that appears to be highly depleted when extracts are made from SERT
KO mice. Can you help us come up with some ideas as to what this
gene might be.”


Case 2


From Prof. Kevin Schey (Biochemistry): “I’ve attached a spreadsheet of
our proteomics results comparing 5 Vehicle and 5 Aldosterone treated
patients. We’ve included only those proteins whose summed spectral
counts are >30 in one treatment group. Would it be possible to get the
GO annotations for these? The Uniprot name is listed in column A and
the gene name is listed in column R. If this is a time consuming task
(and I imagine that it is), can you tell me how to do it?”
Resources


Entrez Gene
http://www.ncbi.nlm.nih.gov/gene
NCBI/NIH
All completely sequenced genomes
One gene per page

Ensembl BioMart
http://www.ensembl.org/biomart/martview
EMBL-EBI and Sanger Institute
Vertebrates and other selected eukaryotic species
Batch information retrieval

Gene Cards
http://www.genecards.org
Weizmann Institute of Science, Israel
Comprehensive information on human genes

WikiGenes
http://www.wikigenes.org
MIT
Collaborative annotation in a wiki system

GLAD4U
http://bioinfo.vanderbilt.edu/glad4u
Vanderbilt
Genes related to a specific topic
Learning objectives


To gain a basic understanding of the Entrez Gene system

To be able to retrieve information for individual genes using Entrez Gene

To gain a basic understanding of the Ensembl BioMart system

To be able to retrieve information for a list of genes using Ensembl BioMart
Entrez Gene: overview

Data source
Automated analyses and curation by NCBI staff
Data stored in flat files
Updated continuously

Unique gene identifier
Entrez Gene uses unique integers (GeneID) as stable identifiers for genes, e.g. the GeneID for human tumor protein p53 (TP53) is 7157
The GeneID assigned to each record is species specific, e.g. the GeneID for the mouse ortholog of TP53 (Trp53) is 22059

Statistics as of February 2011
7.2 million records distributed among 7039 taxa
45,227 records for human

Query system
Entrez
Entrez Gene: Entrez

An integrated search and retrieval system that provides access to
many discrete databases at the NCBI website.

All databases indexed by Entrez, including Entrez Gene, can be searched
via a single query string

Supports Boolean operators
AND, OR, NOT

Supports search term tags to limit search to particular fields
Title, organism, etc.

Sample query
transporter[title] AND ("Homo sapiens"[organism] OR "Mus musculus"[organism])
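
The sample query can also be assembled programmatically. The sketch below only composes the query string from tagged terms and Boolean operators; the helper names (tagged, any_of, all_of) are made up for this illustration and are not part of any NCBI tool.

```python
def tagged(term, field):
    """Attach an Entrez search field tag, quoting multi-word terms."""
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{field}]"


def any_of(*clauses):
    """Join clauses with OR and wrap them in parentheses."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(*clauses):
    """Join clauses with AND."""
    return " AND ".join(clauses)


# Rebuild the sample query from the slide.
query = all_of(
    tagged("transporter", "title"),
    any_of(
        tagged("Homo sapiens", "organism"),
        tagged("Mus musculus", "organism"),
    ),
)
print(query)
```

The resulting string can be pasted into the Entrez Gene search box as is.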
Entrez Gene: search result
[Screenshot of the search results page, with callouts: Display Setting, Summary record, Advanced search, Filtering, Related data, Help]
Entrez Gene: Gene record (I)

Each Gene record integrates multiple types of information

Gene type: tRNA, rRNA, snRNA, scRNA, snoRNA, miscRNA, protein-coding, pseudo, other, and unknown

Nomenclature, summary descriptions, accessions of gene specific and
gene product-specific sequences, chromosomal location, reports of
pathways and protein interactions, associated markers and phenotypes

Links to other databases at NCBI including literature citations,
sequences, variations, and homologs

Links to databases outside of NCBI
Entrez Gene: Gene record (II)
[Screenshot of the Gene record page at http://www.ncbi.nlm.nih.gov/gene/7157, with callouts: Help, Expand, Export, New search]
Entrez Gene: advanced ways of accessing

FTP download
ftp://ftp.ncbi.nlm.nih.gov/gene/README

E-Utilities (Entrez Programming Utilities)

Server-side programs that provide a stable interface into the Entrez query and
database system

Uses a fixed URL syntax that translates a standard set of input parameters into
the values necessary for various NCBI software components to search for and
retrieve the requested data, including nucleotide and protein sequences, gene
records, three-dimensional molecular structures, and the biomedical literature.

Works with any computer language that can send a URL to the E-utilities server
and interpret the XML response, e.g. Perl, Python, Java, and C++.

Combining E-utilities components to form customized data pipelines within these
applications is a powerful approach to data manipulation.
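
As a rough illustration of the "send a URL, interpret the XML" pattern, the sketch below builds an ESearch URL for the Gene database and parses an ESearch-style response. It makes no network call: SAMPLE_XML is a hand-written stand-in for a server reply, not real output, and the function names are invented for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL from the standard db/term/retmax parameters."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def parse_esearch(xml_text):
    """Return (total hit count, list of IDs) from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    ids = [element.text for element in root.findall("./IdList/Id")]
    return count, ids


# Hand-written sample in the ESearch response layout (assumption: illustrative
# only; a real reply contains more elements and different values).
SAMPLE_XML = (
    "<eSearchResult><Count>2</Count><RetMax>2</RetMax><RetStart>0</RetStart>"
    "<IdList><Id>7157</Id><Id>22059</Id></IdList></eSearchResult>"
)

url = esearch_url("gene", "kinase[title] AND mouse[Organism] AND 1[Chromosome]")
count, ids = parse_esearch(SAMPLE_XML)
print(url)
print(count, ids)
```

To run the query for real, the URL could be fetched with urllib.request.urlopen and the response body passed to parse_esearch; NCBI's usage guidelines on request rates apply.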
Entrez Gene: documentation and publications
http://www.ncbi.nlm.nih.gov/books/NBK3841/
Maglott et al. NAR, 39:D52-D57, 2011
Entrez Gene: exercise

Questions

How many records can we get for a simple search of “kinase” in Entrez Gene?

Use Boolean operators and search term tags to search for mouse genes located
on chromosome 1 and with kinase in title. With the default display setting, what is
the first hit?

Click on the first hit and identify how many publications in PubMed are associated
with this gene.

Identify which proteins interact with the protein product of this gene.

Answers

244,301 records

Query term: kinase[title] AND mouse[Organism] AND 1[Chromosome]

Epha4

Bibliography section: 220 citations in PubMed

Interactions section: 3 proteins, Epha4, Ngef, and Vav2
Ensembl

Genome databases for vertebrates and other selected eukaryotic species

Automated annotation system at EBI

Data stored in a relational database

Updated periodically with versions

Unique gene identifier

Ensembl uses unique strings (Ensembl gene ID) as stable identifiers for genes, e.g. the Ensembl
gene stable ID for human tumor protein p53 (TP53) is ENSG00000141510

The Ensembl gene ID assigned to each record is species specific, e.g. the Ensembl gene stable ID for the
mouse ortholog of TP53 (Trp53) is ENSMUSG00000059552

Clear gene, transcript, and protein relationship, e.g. ENSG00000141510 => 17 transcripts
(e.g. ENST00000445888) => 13 proteins (e.g. ENSP00000391478)

Statistics as of February 2011 (version 61)

55 species

53,630 genes for human

Other species available in the recently expanded system EnsemblGenomes
http://www.ensemblgenomes.org
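
The stable IDs quoted above follow a readable pattern: the ENS prefix, an optional species code (absent for human, MUS for mouse), a letter for the feature type, and an 11-digit number. The sketch below splits an ID along those lines. It is a simplification that assumes 3- or 4-letter species codes and handles only the gene (G), transcript (T), and protein (P) types shown on the slide; the names are invented for this example.

```python
import re
from typing import NamedTuple

FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "protein"}

# ENS + optional species code + feature letter + 11 digits (simplified).
STABLE_ID = re.compile(r"^ENS([A-Z]{3,4})?([GTP])(\d{11})$")


class StableId(NamedTuple):
    species_prefix: str  # "" for human, e.g. "MUS" for mouse
    feature: str         # "gene", "transcript", or "protein"
    number: int


def parse_stable_id(stable_id):
    """Split an Ensembl stable ID into species prefix, feature type, number."""
    match = STABLE_ID.match(stable_id)
    if match is None:
        raise ValueError(f"not a recognised Ensembl stable ID: {stable_id!r}")
    prefix, letter, digits = match.groups()
    return StableId(prefix or "", FEATURE_TYPES[letter], int(digits))


for example in ("ENSG00000141510", "ENSMUSG00000059552",
                "ENST00000445888", "ENSP00000391478"):
    print(example, parse_stable_id(example))
```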



BioMart: a batch information retrieval system

BioMart is a query-oriented data management system.

Batch information retrieval for complex queries

Particularly suited for providing 'data mining' like searches of
complex descriptive data such as those related to genes and
proteins

Open source and can be customized

Originally developed for the Ensembl genome databases

Adopted by many other projects including UniProt, InterPro,
Reactome, Pancreatic Expression Database, and many others (see
a complete list and get access to the tools from
http://www.biomart.org/)
BioMart: basic concepts

Dataset

Filter

Attribute

From Prof. Kevin Schey (Biochemistry): “I’ve attached a
spreadsheet of our proteomics results comparing 5 Vehicle and 5
Aldosterone treated patients. We’ve included only those proteins
whose summed spectral counts are >30 in one treatment group.
Would it be possible to get the GO annotations for these? The
Uniprot name is listed in column A and the gene name is listed in
column R. If this is a time consuming task (and I imagine that it is),
can you tell me how to do it?”

From all human genes (dataset), select those with the listed Uniprot IDs (filter),
and retrieve GO annotations (attribute).
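
The three concepts map onto a simple select-and-project operation. The toy sketch below mimics that idea in memory: the dataset is a list of records, the filter keeps records whose ID is in a given list, and the attributes are the fields returned. Every record value is a made-up placeholder, not real annotation, and the function is not part of BioMart.

```python
# Toy dataset: all IDs, names, and GO terms are placeholders for illustration.
DATASET = [
    {"protein_id": "PROT_A", "gene_name": "GENE_A", "go_terms": ["GO:term1", "GO:term2"]},
    {"protein_id": "PROT_B", "gene_name": "GENE_B", "go_terms": ["GO:term3"]},
    {"protein_id": "PROT_C", "gene_name": "GENE_C", "go_terms": []},
]


def run_query(dataset, filter_field, allowed_values, attributes):
    """Keep records whose filter_field is in allowed_values; return only attributes."""
    allowed = set(allowed_values)
    return [
        {name: record[name] for name in attributes}
        for record in dataset
        if record[filter_field] in allowed
    ]


# Dataset: all records; filter: a list of protein IDs; attributes: ID and GO terms.
rows = run_query(DATASET, "protein_id", ["PROT_A", "PROT_C"], ["protein_id", "go_terms"])
for row in rows:
    print(row)
```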



Ensembl BioMart analysis

Choose dataset

Choose database: Ensembl Genes 61

Choose dataset: Homo sapiens genes (GRCh37)

Set filters

Gene: a list of genes/proteins identified by various database IDs (e.g. IPI IDs)

Gene Ontology: filter for proteins with specific GO terms (e.g. cell cycle)

Protein domains: filter for proteins with specific protein domains (e.g. SH2 domain)

Region: filter for genes in a specific chromosome region (e.g. chr1 1:1000000 or 11q13)

Others

Select output attributes

Gene annotation information in the Ensembl database, e.g. gene description, chromosome
name, gene start, gene end, strand, band, gene name, etc.

External data: Gene Ontology, IDs in other databases

Expression: anatomical system, development stage, cell type, pathology

Protein domains: SMART, PFAM, InterPro, etc.
Ensembl BioMart: query interface
[Screenshot of the BioMart query interface, with callouts: Choose dataset, Set filters, Select output attributes, Count, Results, Perl API, Help]
Ensembl BioMart: sample output
[Screenshot of sample BioMart output, with callout: Export all results to a file]
Ensembl BioMart: documentation and publications
http://www.ensembl.org/info/website/tutorials/index.html
Smedley et al. BMC Genomics, 10:22, 2009
Ensembl BioMart analysis: exercise 1

Question

I have two Ensembl gene IDs, ENSG00000162367 and ENSG00000187048. How do I get
their gene names from HGNC, IDs from EntrezGene, and any probes that contain these gene
sequences from the Affymetrix microarray platform HC G110?

Choose data set

Database: Ensembl Genes 61

Dataset: Homo sapiens genes (GRCh37.p2)

Set filters

Under GENE: check ID list limit box

Select Header: Ensembl Gene IDs, Enter the gene IDs into the box.

Select output attributes

Select Features (default)

Under EXTERNAL: External References, select 'HGNC Symbol' and 'EntrezGene ID'

Under EXTERNAL: Microarray, select 'Affy HC G110'

Click on Count and then Results

Export all results to File, TSV
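
The same query can, in principle, be expressed as a BioMart XML document and sent to the martservice endpoint instead of clicking through the web interface. The sketch below only builds the XML and the request URL; it does not contact the server. The dataset, filter, and attribute internal names used here (hsapiens_gene_ensembl, ensembl_gene_id, hgnc_symbol, entrezgene, affy_hc_g110) are assumptions that vary between Ensembl releases and should be checked against the release in use.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

MARTSERVICE = "http://www.ensembl.org/biomart/martservice"


def build_query(dataset, filters, attributes):
    """Build a BioMart XML query: one dataset, ID-list filters, output attributes."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, values in filters.items():
        # Multiple filter values are passed as a comma-separated list.
        ET.SubElement(ds, "Filter", {"name": name, "value": ",".join(values)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Internal names below are assumptions; verify them for your Ensembl release.
xml_query = build_query(
    "hsapiens_gene_ensembl",
    {"ensembl_gene_id": ["ENSG00000162367", "ENSG00000187048"]},
    ["ensembl_gene_id", "hgnc_symbol", "entrezgene", "affy_hc_g110"],
)
request_url = f"{MARTSERVICE}?{urlencode({'query': xml_query})}"
print(xml_query)
```

Fetching request_url would be expected to return the tab-separated table that the Export step produces in the browser.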


Ensembl BioMart analysis: exercise 2

Question

How can I get the 2 kb upstream sequences for all mouse genes on chromosome 1?

Choose data set

Database: Ensembl Genes 61

Dataset: Mus musculus genes (NCBIM37)

Set filters

Under REGION: check Chromosome, select 1

Select output attributes

Select Sequences

Under SEQUENCES: select Flank (Gene)

Under Upstream flank: check and enter 2000 into the box

Under Header Information, Gene Information, check Description

Click on Count (1916/36817) and then Results

Export all results to File, FASTA format
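
Once the FASTA file is exported, a quick sanity check is to read it back and confirm the sequence lengths. The sketch below is a minimal FASTA reader applied to a toy string; the headers and sequences in SAMPLE are invented, and the exact header layout of a real BioMart export depends on the attributes selected.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping header (without '>') to sequence."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Toy input: invented headers and sequences, far shorter than real 2 kb flanks.
SAMPLE = """>gene_1|toy description
ACGTAC
GTAC
>gene_2|another toy description
TTGACA
"""

records = read_fasta(SAMPLE)
lengths = {header: len(seq) for header, seq in records.items()}
print(lengths)
```

For the export in this exercise, the lengths would be expected to be at most 2000; reading the file with open(path).read() and passing the text to read_fasta is enough for a file of this size.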

Summary

Entrez Gene
http://www.ncbi.nlm.nih.gov/gene
NCBI/NIH
All completely sequenced genomes
Data stored in flat files
Updated continuously
Unique gene identifier: Entrez Gene ID
Query system: Entrez
Output: one gene at a time

Ensembl BioMart
http://www.ensembl.org/biomart/martview
EMBL-EBI and Sanger Institute
Mainly vertebrates
Data stored in a relational database
Updated periodically with versions
Unique gene identifier: Ensembl Gene ID
Query system: BioMart
Output: multiple genes at the same time