Bioinformatics Tutorial Society for Developmental Biology 2008


Introduction to NCBI

(and Other Online Bioinformatics Resources)


Society for Developmental Biology 2008

Li-San Wang

Penn Center for Bioinformatics

University of Pennsylvania

lswang@mail.med.upenn.edu

http://people.pcbi.upenn.edu/~lswang/


Outline


Introduction to the NCBI databases and
web services


Introduction to some concepts in
bioinformatics


Hands-on experience


Other online resources:


UCSC Genome Browser and NIAID DAVID

http://www.ncbi.nlm.nih.gov/

http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html
(Nov 2004)

Entities

Genome, Chromosome

Gene, Exon, Intron

Protein, Domain, SNP …

Annotations

Phenotype

Publication

Gene Expression


Relations

Homology

Taxonomy

Ontology

OMIM

etc.

Some Common Tasks


Find information about a gene/genome,
etc.


Find homologs


Find genes related to a phenotype


Find sequences similar to an input
sequence (BLAST)

NCBI Entrez (Google “Entrez”)

http://www.ncbi.nlm.nih.gov/sites/gquery

Accession Numbers



Example: TP53


http://www.ncbi.nlm.nih.gov/Sequin/acc.html

NM_000546.4


NP_000537.3 tumor protein p53 isoform a


NM_000546.4 gi: 187830767

NP_000537.3 gi: 120407068
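
The accession.version pattern in the TP53 example can be illustrated with a short Python sketch. The function name `parse_accession` and the small prefix table are assumptions made for this illustration; the table covers only the two prefixes shown above and is not a complete list of RefSeq prefixes.

```python
import re

# Only the two RefSeq prefixes that appear in the TP53 example above;
# this table is an illustrative assumption, not a complete list.
PREFIX_KIND = {"NM": "mRNA", "NP": "protein"}

# Two capital letters, an underscore, digits, a dot, and a version number.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)\.(\d+)$")

def parse_accession(text):
    """Split a string such as 'NM_000546.4' into (prefix, number, version, kind)."""
    match = ACCESSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an accession.version string: {text!r}")
    prefix, number, version = match.groups()
    return prefix, number, int(version), PREFIX_KIND.get(prefix, "unknown")

print(parse_accession("NM_000546.4"))
print(parse_accession("NP_000537.3"))
```

The version suffix changes when the sequence record is updated, so keeping it as a separate integer makes it easy to compare two versions of the same accession.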


File Format

FASTA

GFF

XML
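
Of the formats listed, FASTA is the simplest: a header line starting with ">" followed by one or more sequence lines. A minimal reader might look like the sketch below; the function name `read_fasta` and the toy records are assumptions for illustration, and the sketch does no validation of the sequence alphabet.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)

# Toy records made up for this example.
example = """>seq1 toy record
ACGTAC
GTTA
>seq2 another toy record
MSLLN
""".splitlines()

for name, seq in read_fasta(example):
    print(name, len(seq))
```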

NCBI Entrez Gene

(previously LocusLink)

http://www.ncbi.nlm.nih.gov/sites/entrez

Add Limits in Your Query

Exercise (NCBI Minicourse)


Retrieve human entries related to "prion
protein" in Entrez Gene.


Name the map location of this gene on the
human genome.


What is the function of this protein?


What are the alternate gene symbols?


Name the phenotypes associated with the
mutations in this gene.


How many alternatively spliced products have
been annotated for the gene?

Entrez Gene and dbSNP


Retrieve human prion protein by Entrez Gene
(PRNP)


Identify the variations annotated on this gene
by clicking on the SNP:geneView.


How many of them are nonsynonymous changes?


Are there known SNPs in the coding region of a
gene associated with any phenotype?


NCBI Map Viewer

Exercise


Find human GDNF on Map Viewer


Download the gene sequence and 5kb upstream by
using the "dl" link.


Add the Component and Contig maps for this region.
Name the contig and GenBank accession numbers for
the sequence covering this region. Are the sequences
finished?


Add the Ab initio (model) and Transcript (RNA)
maps. How many alternatively spliced transcripts have
been annotated for the gene?


Display the current data as "Data As Table View".


Add the phenotype map. Name the disease with which
the GDNF gene is associated. Obtain more information
about the disease by linking to the corresponding
OMIM record.

NCBI Genome and Genome Project

http://www.ncbi.nlm.nih.gov/Genomes/

Relations Between Sequence Data


Gene


UniGene


HomoloGene


Taxonomy

UniGene

http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?UGID=2723799&TAXID=9606&SEARCH=tp53%20AND%20human%20[organism]

HomoloGene

Taxonomy

Exercise


Locate chimpanzee using TaxBrowser. What is
its lineage? How many subspecies are there?


How many genome projects are under the
class Mammalia?


Find the common tree of the following species:


Human/Chimp/Dog/Horse/Mouse/Rat/Chicken/Zebrafish


Which is closer to human: mouse or dog?


Which species diverged earliest from the human
lineage?

CDD

Example Query


Gene: Prion Protein (PRNP) (or your
preferred gene)


How many proteins does the gene encode?


What proteins in other organisms are
homologous to this protein?


What are the domains in the protein? Find a
sequence alignment to its homologs


View the conserved regions on the 3D
structure (download NCBI Cn3D)

GEO

http://www.ncbi.nlm.nih.gov/geo/

OMIM (Online Mendelian Inheritance in Man)

Examples


What human genes are related to hypertension? Which of those
genes are on chromosome 17?



List the OMIM entries that describe genes on chromosome 10.



List the OMIM entries that contain information about allelic
variants.



Retrieve the OMIM record for the cystic fibrosis transmembrane
conductance regulator (CFTR), and link to related protein
sequence records via Entrez.



Find the OMIM record for the p53 tumor protein, and link out to
related information in Entrez Gene and the p53 Mutation Database.

http://www.ncbi.nlm.nih.gov/Omim/omimhelp.html#SampleQuestions

Complex Queries

cancer[titl] AND 11[chrom] AND autosomal dominant[clin]



The Boolean operators AND, OR, and NOT must be
written in upper case. Use parentheses for precedence.


Search field tags are enclosed in square brackets
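
The query rules above can be sketched in Python as plain string building. The helper names `tagged` and `combine` are hypothetical and exist only for this illustration; the field tags are the ones from the example query, and the sketch only assembles the query text without contacting any NCBI service.

```python
def tagged(term, field):
    """Attach a search field tag in square brackets, e.g. cancer[titl]."""
    return f"{term}[{field}]"

def combine(operator, *parts):
    """Join parts with an upper-case Boolean operator, wrapped in parentheses."""
    operator = operator.upper()
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unknown Boolean operator: {operator}")
    return "(" + f" {operator} ".join(parts) + ")"

query = combine(
    "AND",
    tagged("cancer", "titl"),
    tagged("11", "chrom"),
    tagged("autosomal dominant", "clin"),
)
print(query)
```

Wrapping each combined group in parentheses keeps the precedence explicit when groups are nested, for example an OR group inside an AND group.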

Save your search history

Quick Review


GenBank


Entrez Gene, HomoloGene, UniGene


Protein structures and CDD


Taxonomy


GEO


OMIM


Complex queries

PubMed

MeSH

Example (NCBI PubMed tutorial exercise 4)



Use the MeSH Database to build a strategy that will find
citations to articles about schizophrenia resulting
from prenatal exposure to influenza. Schizophrenia
and influenza should be the major topics of the articles.

Basic Local Alignment Search Tool (BLAST)


Usage: Find sequences in a database that
are similar to the input sequence


Applications:


Infer the function of newly sequenced genes


Predict new members of gene families


Explore evolutionary relationships


Predict the location and function of protein-coding
and transcription-regulation regions in genomic DNA

How BLAST works


Sequence databases are preprocessed for
faster access by BLAST


Given an input sequence S:


List all k-mers (e.g. k=11 for DNA) of S


Find sequences in DB having similar k-mers


Extend the matched words to form High-scoring
Segment Pairs (HSPs)


Evaluate the significance of each HSP
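
The first three steps above can be sketched in Python as a toy seed-and-extend search. This is a simplified illustration under stated assumptions, not the actual BLAST implementation: it uses exact word matches only, extends without gaps while the letters agree, and does not compute scores or statistical significance. All function names and the toy sequences are made up for the example.

```python
from collections import defaultdict

def kmers(seq, k):
    """Return every (position, k-mer) pair of seq."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]

def build_index(database, k):
    """Preprocess the database: map each k-mer to (sequence name, position)."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos, word in kmers(seq, k):
            index[word].append((name, pos))
    return index

def extend_hit(query, subject, q_pos, s_pos, k):
    """Extend an exact word match in both directions while letters agree."""
    left = 0
    while (q_pos - left > 0 and s_pos - left > 0
           and query[q_pos - left - 1] == subject[s_pos - left - 1]):
        left += 1
    right = k
    while (q_pos + right < len(query) and s_pos + right < len(subject)
           and query[q_pos + right] == subject[s_pos + right]):
        right += 1
    # Start in the query, start in the subject, and length of the match.
    return q_pos - left, s_pos - left, left + right

def search(query, database, k):
    """Return sorted (name, query start, subject start, length) matches."""
    index = build_index(database, k)
    hits = set()
    for q_pos, word in kmers(query, k):
        for name, s_pos in index.get(word, []):
            q_start, s_start, length = extend_hit(
                query, database[name], q_pos, s_pos, k)
            hits.add((name, q_start, s_start, length))
    return sorted(hits)

# Toy database and query invented for this example.
toy_db = {"subject1": "TTTACGTACGGA", "subject2": "CCCCCCCC"}
print(search("GGACGTACGTT", toy_db, k=4))
```

Building the index once and reusing it for every query is what the preprocessing step buys: each query word becomes a dictionary lookup instead of a scan over the whole database.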

http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome

>gi|187960039|ref|NM_001127233.1| Mus musculus transformation related protein 53 (Trp53), transcript variant 2, mRNA

TTTCCCCTCCCACGTGCTCACCCTGGCTAAAGTTCTGTAGCTTCAGTTCATTGGGACCATCCTGGCTGTAGG
TAGCGACTACAGTTAGGGGGCACCTAGCATTCAGGCCCTCATCCTCCTCCTTCCCAGCAGGGTGTCACGCT
TCTCCGAAGACTGGATGACTGCCATGGAGGAGTCACAGTCGGATATCAGCCTCGAGCTCCCTCTGAGCCAG
GAGACATTTTCAGGCTTATGGAAACTACTTCCTCCAGAAGATATCCTGCCATCACCTCACTGCATGGACGAT
CTGTTGCTGCCCCAGGATGTTGAGGAGTTTTTTGAAGGCCCAAGTGAAGCCCTCCGAGTGTCAGGAGCTCC
TGCAGCACAGGACCCTGTCACCGAGACCCCTGGGCCAGTGGCCCCTGCCCCAGCCACTCCATGGCCCCTG
TCATCTTTTGTCCCTTCTCAAAAAACTTACCAGGGCAACTATGGCTTCCACCTGGGCTTCCTGCAGTCTGGG
ACAGCCAAGTCTGTTATGTGCACGTACTCTCCTCCCCTCAATAAGCTATTCTGCCAGCTGGCGAAGACGTGC
CCTGTGCAGTTGTGGGTCAGCGCCACACCTCCAGCTGGGAGCCGTGTCCGCGCCATGGCCATCTACAAGA
AGTCACAGCACATGACGGAGGTCGTGAGACGCTGCCCCCACCATGAGCGCTGCTCCGATGGTGATGGCCT
GGCTCCTCCCCAGCATCTTATCCGGGTGGAAGGAAATTTGTATCCCGAGTATCTGGAAGACAGGCAGACTTT
TCGCCACAGCGTGGTGGTACCTTATGAGCCACCCGAGGCCGGCTCTGAGTATACCACCATCCACTACAAGT
ACATGTGTAATAGCTCCTGCATGGGGGGCATGAACCGCCGACCTATCCTTACCATCATCACACTGGAAGACT
CCAGTGGGAACCTTCTGGGACGGGACAGCTTTGAGGTTCGTGTTTGTGCCTGCCCTGGGAGAGACCGCCGT
ACAGAAGAAGAAAATTTCCGCAAAAAGGAAGTCCTTTGCCCTGAACTGCCCCCAGGGAGCGCAAAGAGAGC
GCTGCCCACCTGCACAAGCGCCTCTCCCCCGCAAAAGAAAAAACCACTTGATGGAGAGTATTTCACCCTCA
AGATCCGCGGGCGTAAACGCTTCGAGATGTTCCGGGAGCTGAATGAGGCCTTAGAGTTAAAGGATGCCCAT
GCTACAGAGGAGTCTGGAGACAGCAGGGCTCACTCCAGCCTCCAGCCTAGAGCCTTCCAAGCCTTGATCAA
GGAGGAAAGCCCAAACTGCTAGCTCCCATCACTTCATCCCTCCCCTTTTCTGTCTTCCTATAGCTACCTGAA
GACCAAGAAGGGCCAGTCTACTTCCCGCCATAAAAAAACAATGGTCAAGAAAGTGGGGCCTGACTCAGACT
GACTGCCTCTGCATCCCGTCCCCATCACCAGCCTCCCCCTCTCCTTGCTGTCTTATGACTTCAGGGCTGAG
ACACAATCCTCCCGGTCCCTTCTGCTGCCTTTTTTACCTTGTAGCTAGGGCTCAGCCCCCTCTCTGAGTAGT
GGTTCCTGGCCCAAGTTGGGGAATAGGTTGATAGTTGTCAGGTCTCTGCTGGCCCAGCGAAATTCTATCCAG
CCAGTTGTTGGACCCTGGCACCTACAATGAAATCTCACCCTACCCCACACCCTGTAAGATTCTATCTTGGGC
CCTCATAGGGTCCATATCCTCCAGGGCCTACTTTCCTTCCATTCTGCAAAGCCTGTCTGCATTTATCCACCC
CCCACCCTGTCTCCCTCTTTTTTTTTTTTTTACCCCTTTTTATATATCAATTTCCTATTTTACAATAAAATTTTGT
TATCACTTAAAAAAAAAA

Blast Types

Program     Input seq   Seq being matched   Description
blastn      DNA         DNA                 Nucleotide-Nucleotide Blast
blastp      Protein     Protein             Protein-Protein Blast
tblastn     Protein     DNA                 Protein-Nucleotide 6-frame translation
blastx      DNA         Protein             Nucleotide 6-frame translation-Protein
tblastx     DNA         DNA                 Nucleotide 6-frame translation-nucleotide 6-frame translation
megablast   DNA         DNA                 Large numbers of similar hits (>=95% identity)
psi-blast   Protein     Protein             Position-Specific Iterative BLAST
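
The 6-frame translation used by tblastn, blastx, and tblastx means translating a DNA sequence in three reading frames on each strand. The Python sketch below illustrates the idea with the standard genetic code; the function names and the frame labels (+1 to +3, -1 to -3) are choices made for this example, and codons containing letters other than A, C, G, T are rendered as X.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate codon by codon; '*' marks a stop, 'X' an unknown codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def six_frame_translation(dna):
    """Return a dict mapping frame labels to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```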

Databases


Protein


nr / refseq / swissprot / pat / pdb / month / env_nr


Nucleotide


nr / refseq_rna / refseq_genomic / est /
est_human / est_others / gss / htgs / pat / pdb
/ month / dbsts / chromosome / wgs / env_nt


http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#nucleotide_databases

Other Options


Number of hits to display


Weights for matching


Nucleotide: matching


Protein: scoring matrix


Weights for gap


(open) + (k-1) * (extend)


Organism


Mask low-complexity regions
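
The gap weight formula in the options above, (open) + (k-1) * (extend) for a gap of length k, can be written as a small Python function. The parameter values in the loop are hypothetical weights chosen only for illustration, not defaults of any BLAST program.

```python
def gap_penalty(length, gap_open, gap_extend):
    """Cost of a gap spanning `length` positions: open + (length - 1) * extend."""
    if length < 1:
        raise ValueError("a gap must span at least one position")
    return gap_open + (length - 1) * gap_extend

# Hypothetical weights, for illustration only.
for k in (1, 2, 5):
    print(k, gap_penalty(k, gap_open=5, gap_extend=2))
```

With an opening weight larger than the extension weight, one long gap costs less than several short gaps of the same total length.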


Blast two sequences (bl2seq)

>gi|4759254|ref|NP_004611.1| TNF receptor-associated factor 6 [Homo sapiens]

MSLLNCENSCGSSQSESDCCVAMASSCSAVTKDDSVGGTASTGNLSSSFMEEIQGYDVEFDPPLESKYECPICLMALREAVQT
PCGHRFCKACIIKSIRDAGHKCPVDNEILLENQLFPDNFAKREILSLMVKCPNEGCLHKMELRHLEDHQAHCEFALMDCPQCQ
RPFQKFHINIHILKDCPRRQVSCDNCAASMAFEDKEIHDQNCPLANVICEYCNTILIREQMPNHYDLDCPTAPIPCTFSTFGC
HEKMQRNHLARHLQENTQSHMRMLAQAVHSLSVIPDSGYISEVRNFQETIHQLEGRLVRQDHQIRELTAKMETQSMYVSELKR
TIRTLEDKVAEIEAQQCNGIYIWKIGNFGMHLKCQEEEKPVVIHSPGFYTGKPGYKLCMRLHLQLPTAQRCANYISLFVHTMQ
GEYDSHLPWPFQGTIRLTILDQSEAPVRQNHEEIMDAKPELLAFQRPTIPRNPKGFGYVTFMHLEALRQRTFIKDDTLLVRCE
VSTRFDMGSLRREGFQPRSTDAGV

>gi|22027612|ref|NP_066961.2| TNF receptor-associated factor 2 [Homo sapiens]

MAAASVTPPGSLELLQPGFSKTLLGTKLEAKYLCSACRNVLRRPFQAQCGHRYCSFCLASILSSGPQNCAACVHE
GIYEEGISILESSSAFPDNAARREVESLPAVCPSDGCTWKGTLKEYESCHEGRCPLMLTECPACKGLVRLGEKERHLEHECPE
RSLSCRHCRAPCCGADVKAHHEVCPKFPLTCDGCGKKKIPREKFQDHVKTCGKCRVPCRFHAIGCLETVEGEKQQEHEVQWLR
EHLAMLLSSVLEAKPLLGDQSHAGSELLQRCESLEKKTATFENIVCVLNREVERVAMTAEACSRQHRLDQDKIEALSSKVQQL
ERSIGLKDLAMADLEQKVLEMEASTYDGVFIWKISDFARKRQEAVAGRIPAIFSPAFYTSRYGYKMCLRIYLNGDGTGRGTHL
SLFFVVMKGPNDALLRWPFNQKVTLMLLDQNNREHVIDAFRPDVTSSSFQRPVNDMNIASGCPLFCPVSKMEAKNSYVRDDAI
FIKAIVDLTGL

http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi

gi: 4759254

gi:22027612

Example (NCBI Minicourse #6)


Problem: A laboratory has generated an EST library from a
hemochromatosis patient and wants to identify the gene(s)
causing the phenotype.


We will follow these steps to solve the problem:


Compare ESTs from a hemochromatosis patient to the human
genome (using BLAST).


Identify the gene(s) that align with the ESTs and download their
sequences (using Map Viewer).


Identify whether the ESTs contain any known nucleotide
variations (single nucleotide polymorphisms) (using dbSNP).


Determine whether a mutant form of the gene is known to
cause a phenotype (using OMIM).

Sequences


TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCA
GTGACCACTCTACGGTGTCGGGCCTTGAACTACTACCCCC
AGAACATCACCATGAAGTGGCTGAAGGATAAGCAGCCAAT
GGATGCCAAGGAGTTCGAACCTAAAGACGTATTGCCCAATG
GGGATGGGACCTACCAGGGCTGGATAACCTTGGCTGTACC
CCCTGGGGAAGAGCAGAGATATACGTACCAGGTGGAGCAC
CCAGGCCTGGATCAGCCCCTCATTGTGATCTGGG

http://www.ncbi.nlm.nih.gov/Class/minicourses/diseasegene.html

http://people.pcbi.upenn.edu/~lswang/seq1.txt

UCSC Genome Browser

http://genome.ucsc.edu/


Google “Genome Browser”


Example


Locate MLL (myeloid/lymphoid or mixed-lineage leukemia)
on the human genome


Find relevant information


Conservation across the gene



Retrieve the sequences of human MLL
and divide into exon/intron regions


Retrieve the 5’ and 3’ flanking region
sequences

BLAT


BLAST-like alignment tool


Quickly finds genomic regions highly
similar to the input query sequence


Example

>hg18_knownGene_uc002gil.1_1 range=chr17:7531420-7531642 5'pad=0 3'pad=0 strand=- repeatMasking=none

ACTTGTCATGGCGACTGTCCAGCTTTGTGCCAGGAGCCTCGCAGGG
GTTGATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGC
TAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGC
AGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGC
GTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGG

http://genome.ucsc.edu/cgi-bin/hgBlat?command=start&org=Human&db=hg18&hgsid=110368820

Other Tasks for Genome Browser


Download the database


Retrieve genomic sequences/annotations


Upload your own annotation (customized track)
using .bed format and visualize on the browser


Many tasks are easier using the Galaxy
web service from Penn State U (Google
“Galaxy Trac” or go to http://galaxy.psu.edu/)
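
For the custom-track task above, a .bed file is a tab-separated text file with the chromosome, start, and end in its first three columns. The Python sketch below converts a browser-style region into one BED line, assuming the region string is 1-based and inclusive while BED is 0-based and half-open. The function name, track name, and feature name are hypothetical; the region is the one from the BLAT example.

```python
def region_to_bed(region, name, strand="+", score=0):
    """Convert a 1-based inclusive 'chrom:start-end' region to one BED line."""
    chrom, span = region.split(":")
    start, end = (int(part.replace(",", "")) for part in span.split("-"))
    if start < 1 or end < start:
        raise ValueError(f"bad region: {region}")
    # BED is 0-based and half-open, so only the start shifts by one.
    return "\t".join([chrom, str(start - 1), str(end), name, str(score), strand])

# Track name and feature name are made up for this example.
track_lines = [
    'track name="myAnnotation" description="Example custom track"',
    region_to_bed("chr17:7531420-7531642", "example_region", strand="-"),
]
print("\n".join(track_lines))
```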



DAVID (NIAID)


Google “NIH DAVID”

Gene Ontology


Biological Process (BP)


Molecular Function (MF)


Cellular Component (CC)


Terms are (almost) hierarchically related


Evidence Codes


To browse the hierarchy of gene ontology
and learn definitions: Amigo


Google “gene ontology amigo”

http://amigo.geneontology.org/cgi-bin/amigo/go.cgi

Convert IDs

Annotation analysis:

Fisher’s Exact Test


Question 1: Is the gene associated with metabolism?


Question 2: Is the gene up-regulated in the microarray
experiment?


            Q2: Yes   Q2: No   Total
Q1: Yes        10       40       50
Q1: No         90      360      450
Total         100      400      500

P=0.491

            Q2: Yes   Q2: No   Total
Q1: Yes        40       10       50
Q1: No         90      360      450
Total         130      370      500

P<2E-16
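
The two tables above can be tested with a short Python sketch of a one-sided Fisher's exact test, computed from the hypergeometric distribution with fixed row and column totals. The slides do not say which variant of the test produced the P values shown, so this one-sided version is an assumption and its result for the first table need not equal 0.491 exactly. The function name is made up for the example.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided P(X >= a) for the 2x2 table [[a, b], [c, d]].

    a = Q1 yes and Q2 yes, b = Q1 yes and Q2 no,
    c = Q1 no and Q2 yes,  d = Q1 no and Q2 no.
    """
    row1 = a + b          # genes with Q1 = yes
    col1 = a + c          # genes with Q2 = yes
    total = a + b + c + d
    denominator = comb(total, col1)
    numerator = 0
    # Sum the probability of every table at least as extreme as observed.
    for x in range(a, min(row1, col1) + 1):
        numerator += comb(row1, x) * comb(total - row1, col1 - x)
    return numerator / denominator

print("first table: ", fisher_exact_greater(10, 40, 90, 360))
print("second table:", fisher_exact_greater(40, 10, 90, 360))
```

In the first table 10 of 50 metabolism genes are up-regulated, the same 20% as in the whole set, so the P value is large. In the second table 40 of 50 are up-regulated against 26% overall, and the P value is very small.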

Multiple tests


Using a p-value threshold of 0.05, we expect
50 out of 1000 tests to be significant just
by chance


If 100 of the 1000 tests are significant, 50% of
them are false positives!


Solution: more stringent p-value cutoff


Bonferroni correction: use 0.05/1000=5E-5 as
new cutoff; usually too stringent


False Discovery Rate: 50% for 0.05 cutoff in the
example above; find a new cutoff so FDR is lower
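
The arithmetic above can be checked with a short Python sketch. The slide only says to find a cutoff that lowers the FDR; the Benjamini-Hochberg procedure used here is one common way to do that and is an assumption of this example, as are the function names and the toy p-values.

```python
def expected_false_positives(alpha, n_tests):
    """Number of tests expected to pass the cutoff by chance alone."""
    return alpha * n_tests

def bonferroni_cutoff(alpha, n_tests):
    """Per-test cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def benjamini_hochberg(p_values, fdr):
    """Return the indices of tests declared significant at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    last_rank = 0
    # Find the largest rank whose p-value is below rank / m * fdr.
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr:
            last_rank = rank
    return sorted(order[:last_rank])

by_chance = expected_false_positives(0.05, 1000)
print("expected by chance:", by_chance)
print("estimated FDR with 100 significant tests:", by_chance / 100)
print("Bonferroni cutoff:", bonferroni_cutoff(0.05, 1000))

# Toy p-values invented for this example.
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print("significant at FDR 0.05:", benjamini_hochberg(toy_p, 0.05))
```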

Exercise


Set of probeset IDs with age-correlated
gene expression levels in human frontal
cortex


http://people.pcbi.upenn.edu/~lswang/p.txt


Affymetrix HG-U95AV2 platform

1. Convert the IDs to Entrez Gene IDs

2. Find the biological significance of the set

   1. Gene ontology

   2. Pathway

Summary


Introduction to the NCBI website


Demonstration of some common tasks


PubMed


BLAST


UCSC Genome Browser


NIAID DAVID

Additional Information


NCBI minicourse

http://www.ncbi.nlm.nih.gov/Class/minicourses/