Bioinformatics, Part 1 - Dr. Christopher King



Bioinformatics

Adapted from a paper (http://www.lifescied.org/cgi/content/full/4/3/207; http://www.nslc.wustl.edu/elgin/genomics/Bio3055/manual.pdf) by April Bednarski and Himadri Pakrasi that was funded by a grant from the Howard Hughes Medical Institute of Washington University.

Glossary

Genome - The entire amount of genetic information for an organism. The human genome is the set of 46 chromosomes.

Homologous - With regard to amino acids, homologous amino acids have similar chemical properties and sizes. For example, glutamate can be considered homologous to aspartate because both residues have similar sizes and both contain a carboxylic acid side chain.

Sequence alignment - A sequence alignment is a way of arranging the sequences present in DNA, RNA, or proteins so as to identify regions that are similar.

Multiple sequence alignment - A sequence alignment of three or more biological sequences.

Conserved - The amino acid residues at a position in a multiple sequence alignment are identical throughout the alignment.

Conservative residue change - The amino acid residues at a position in a multiple sequence alignment are homologous.
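The difference between a conserved position and a conservative residue change can be sketched in a few lines of Python. This is only an illustration: the similarity groups below are a simplified assumption made for this example (real programs use substitution matrices such as BLOSUM), and the three aligned sequences are made up.

```python
# Hypothetical grouping of amino acids by side-chain chemistry. This grouping
# is an assumption for illustration, not a standard used by any program.
SIMILARITY_GROUPS = [
    set("DE"),     # acidic
    set("KRH"),    # basic
    set("ST"),     # hydroxyl-containing
    set("NQ"),     # amide-containing
    set("AVLIM"),  # aliphatic / hydrophobic
    set("FWY"),    # aromatic
]


def classify_column(residues):
    """Label one alignment column as gap, conserved, conservative, or variable."""
    unique = set(residues)
    if "-" in unique:
        return "gap"
    if len(unique) == 1:
        return "conserved"
    if any(unique <= group for group in SIMILARITY_GROUPS):
        return "conservative"
    return "variable"


def classify_alignment(aligned_sequences):
    """Classify every column of a list of equal-length aligned sequences."""
    return [classify_column(column) for column in zip(*aligned_sequences)]


# Three made-up aligned sequences, one-letter code.
alignment = ["MDKS", "MEKT", "MDRA"]
print(classify_alignment(alignment))
```

In this sketch the first column (M, M, M) is conserved, the second (D, E, D) is a conservative change because aspartate and glutamate fall in the same group, and the last (S, T, A) is variable.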

ClustalW - A program for making multiple sequence alignments. www.ebi.ac.uk/clustalw/index.html

EC number - Enzyme Commission number - Assigned by the IUBMB (International Union of Biochemistry and Molecular Biology); classifies enzymes according to the reaction catalyzed. An EC number is composed of four numbers separated by dots. For example, alcohol dehydrogenase has the EC number 1.1.1.1.
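As a small illustration of the four-part format, the sketch below splits an EC number into its four numeric fields. The function name is hypothetical, and the sketch assumes a fully specified EC number; it does not handle partial entries that use a placeholder in the last position.

```python
def parse_ec_number(text):
    """Split an EC number such as '1.1.1.1' into a tuple of four integers."""
    parts = text.strip().split(".")
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        raise ValueError(f"not a four-part EC number: {text!r}")
    return tuple(int(part) for part in parts)


# Alcohol dehydrogenase, the example given in the glossary entry.
print(parse_ec_number("1.1.1.1"))
```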

BLOSUM - BLOcks of Amino Acid SUbstitution Matrix - A type of substitution matrix that is used by programs like BLAST to give sequences a score based on similarity to another sequence. The scoring matrix gives a score to conservative substitutions of amino acids. A conservative substitution is a substitution of an amino acid similar in size and chemical properties to the amino acid in the query sequence.
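To show how a substitution matrix turns an alignment into a score, here is a minimal sketch. The scores in it are invented for this example and are NOT real BLOSUM values; only the idea (identical residues score highest, conservative substitutions score moderately, dissimilar residues score negatively) is taken from the entry above.

```python
# Toy scores invented for this example; they are NOT real BLOSUM values.
TOY_SCORES = {
    ("D", "D"): 5,
    ("E", "E"): 5,
    ("W", "W"): 5,
    ("D", "E"): 2,   # conservative substitution: both are acidic
    ("D", "W"): -3,  # dissimilar residues
    ("E", "W"): -3,
}


def pair_score(a, b):
    """Look up the score for one residue pair; the matrix is symmetric."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    return TOY_SCORES[(b, a)]  # raises KeyError for residues not in the toy matrix


def score_ungapped(query, subject):
    """Sum the pair scores of two equal-length sequences aligned without gaps."""
    if len(query) != len(subject):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(q, s) for q, s in zip(query, subject))


print(score_ungapped("DEW", "DEW"))  # all identical residues
print(score_ungapped("DEW", "EDW"))  # two conservative substitutions
print(score_ungapped("DEW", "WEW"))  # one dissimilar substitution
```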

BLAST - Basic Local Alignment Search Tool - can be accessed from the NCBI website, blast.ncbi.nlm.nih.gov/Blast.cgi. A program that compares a given input sequence to all the sequences in a specified database. This program aligns the most similar segments between sequences. BLAST aligns sequences using a scoring matrix similar to BLOSUM (see entry). This scoring method gives penalties for gaps and gives the highest score for identical residues. Substitutions are scored based on how conservative the changes are. The output is a list of sequences, with the highest scoring sequence at the top. The scoring output is given as an E-value. The lower the E-value, the higher scoring the sequence is. E-values in the range of 10^-100 to 10^-50 indicate very similar (or even identical) sequences. Sequences with E-values of 10^-10 and higher need to be examined based on other methods to determine homology. An E-value of 10^-10 for a sequence can be interpreted as, “a 1 in 10^10 chance that the sequence was pulled from the database by chance alone (has no homology to the query sequence).”
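The rule of thumb for reading E-values can be written as a small function. The two thresholds come from the entry above; the label for E-values between 10^-50 and 10^-10 is an assumption added here, since the entry does not say how to treat that range.

```python
def interpret_e_value(e_value):
    """Apply the glossary's rule of thumb to a BLAST E-value."""
    if e_value <= 1e-50:
        return "very similar or identical"
    if e_value < 1e-10:
        # Assumption: the glossary does not describe this intermediate range.
        return "intermediate"
    return "examine homology by other methods"


for value in (1e-80, 1e-30, 1e-10, 0.5):
    print(value, interpret_e_value(value))
```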

ExPASy - Expert Protein Analysis System - us.expasy.org/ A server maintained by the Swiss Institute of Bioinformatics. Home of SWISS-PROT, the most extensive and annotated protein database. The Swiss-Pdb Viewer protein-viewing program is also available at this site for free download.

FASTA - Fast Alignment Search Tool - All (since it works on both nucleotide and amino acid sequences). Associated with this software is a way of formatting a nucleic acid or protein sequence. It is important because many bioinformatics programs require that the sequence be in FASTA format. The FASTA format has a title line for each sequence that begins with a “>” followed by any needed text to name the sequence. The end of the title line is signified by a paragraph mark (hit the return key). Bioinformatics programs will know that the title line isn’t part of the sequence if you have it formatted correctly. The sequence itself does NOT have any returns, spaces, or formatting of any kind. The sequence is given in one-letter code. An example of a protein in correct FASTA format is shown below:

>K-Ras protein Homo sapiens
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVPMVLVGNKCDLPSRTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRKHKEKMSKDGKKKKKKSKTKCVIM
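To make the title-line rule concrete, here is a minimal sketch of how a program might read FASTA text. The function name is hypothetical and this is not the parser of any particular tool. As an added convenience it joins sequence text that has been wrapped over several lines, which is an assumption beyond the strict one-line rule described above.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) pairs from FASTA-formatted text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            chunks.append(line)  # sequence text belonging to the current title
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


# The start of the K-Ras example, wrapped over two lines on purpose.
example = """>K-Ras protein Homo sapiens
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDI
LDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVP
"""

for record_title, sequence in parse_fasta(example):
    print(record_title, len(sequence))
```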

GenBank - A database of nucleotide sequences from over 260,000 organisms. http://www.ncbi.nlm.nih.gov/genbank/ This is the main database for nucleotide sequences. It is a historical database, meaning it is redundant. When new or updated information is entered into GenBank, it is given a new entry, but the older sequence information is also kept in the database. GenBank belongs to an international collaboration of sequence databases, which also includes EMBL (European Molecular Biology Laboratory) and DDBJ (DNA Data Bank of Japan). In contrast, the RefSeq database (see entry) is non-redundant and contains only the most current sequence information for genetic loci.

Gene - An NCBI database of genetic loci. It may be accessed through the NCBI homepage by selecting “Gene” from the Search drop-down menu. This database used to be called LocusLink. Entries provide links to RefSeqs, articles in PubMed, and other descriptive information about genetic loci. The database also provides information on official nomenclature, aliases, sequence accession numbers, phenotypes, EC numbers, OMIM numbers, UniGene clusters, map information, and relevant web sites.

KEGG - Kyoto Encyclopedia of Genes and Genomes - http://www.genome.ad.jp/kegg/ This website is used for accessing metabolic pathways. At this website, you can search a process, gene, protein, or metabolite and obtain diagrams of all the metabolic pathways associated with your query. You will see a link to the KEGG entry at the end of the Gene entry for a gene.


NCBI - National Center for Biotechnology Information - www.ncbi.nlm.nih.gov This center was formed in 1988 as a division of the NLM (National Library of Medicine) at the NIH (National Institutes of Health). As part of the NIH, NCBI is funded by the US government. The main goal of the center is to provide resources for biomedical researchers as well as the general public. The center is continually developing new materials and updating databases. The entire human genome is freely available on this website and is updated daily as new and better data become available. NCBI also maintains an extensive education site, which offers online tutorials of its databases and programs: www.ncbi.nlm.nih.gov/About/outreach/courses.html

OMIM - Online Mendelian Inheritance in Man - www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM A continuously updated catalog of human genes and genetic disorders, with links to associated literature references, sequence records, maps, and related databases.

PubMed - http://www.ncbi.nlm.nih.gov/pubmed/ When writing a paper on a particular science/medical topic, you should always check PubMed. It is a retrieval system containing citations, abstracts, and indexing terms for journal articles in the biomedical sciences. PubMed contains the complete contents of the MEDLINE and PREMEDLINE databases. It also contains some articles and journals considered out of scope for MEDLINE, based on either content or on a period of time when the journal was not indexed, and therefore is a superset of MEDLINE.

RefSeq - NCBI database of Reference Sequences. Curated, non-redundant set including genomic DNA contigs, mRNAs, proteins, and entire chromosomes. Accession numbers have the format of two letters, an underscore bar, and six digits. Example: NT_123456. Code: NT, NC, NG = genomic; NM = mRNA; NP = protein (for more of the two-letter codes, see the NCBI site map).
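The accession format described in this entry can be checked with a short regular expression. This sketch follows only the format and the five prefixes given above; accessions seen on the NCBI site may carry more digits or a version suffix, which this sketch deliberately does not handle, and the function name is hypothetical.

```python
import re

# Two capital letters, an underscore, and six digits, as described in the entry.
REFSEQ_PATTERN = re.compile(r"([A-Z]{2})_(\d{6})")

# Prefix codes listed in the glossary entry.
MOLECULE_TYPES = {
    "NT": "genomic",
    "NC": "genomic",
    "NG": "genomic",
    "NM": "mRNA",
    "NP": "protein",
}


def classify_refseq(accession):
    """Return the molecule type for a RefSeq-style accession, or None if malformed."""
    match = REFSEQ_PATTERN.fullmatch(accession)
    if match is None:
        return None
    return MOLECULE_TYPES.get(match.group(1), "unknown prefix")


print(classify_refseq("NT_123456"))  # the example from the entry
```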

Sequence Manipulation Suite - bioinformatics.org/sms/ A website that contains a collection of web-based programs for analyzing and formatting DNA and protein sequences.



Bioinformatics is a field of study that merges math, biology, and computer science. Researchers in this field have developed a wide range of tools to help biomedical researchers work with genomic, biochemical, and medical information. Some types of bioinformatics tools include database storage and search programs as well as software programs for analyzing genomic and proteomic data.

We will be working through a tutorial on web-based bioinformatics programs. The tutorial is based on the enzymes phospholipase C-gamma (believed to be the major enzyme of fertilization), and cyclooxygenase-2 (COX-2), which also has the name prostaglandin synthase-2 (PTGS2). In this tutorial, the bioinformatics tools from the NCBI (National Center for Biotechnology Information) website will be introduced. NCBI is a division of the National Institutes of Health (NIH). These tools include Gene, GenBank, RefSeq, and PubMed. Gene is a database of genes in which each entry contains a brief summary, the common gene symbol, information about the gene function, and links to websites, articles, and sequence information for that gene. GenBank is a historical database of gene sequences, which means it contains every sequence that was published, even if the same sequence was published more than once. Therefore, GenBank is considered a redundant database. RefSeq is a database of sequences that is edited by NCBI and is NON-redundant, meaning that it contains what NCBI determines is the most reliable sequence data for each gene.

Finally, we will be learning to use ClustalW, which is a multiple sequence alignment program. It allows you to enter a series of gene or protein sequences that you believe are similar and may be evolutionarily related. These sequences are usually obtained by performing a BLAST search. ClustalW then aligns the sequences, so that the fewest gaps are introduced and the largest number of similar residues is aligned with each other. ClustalW uses a scoring matrix similar to BLOSUM-62, which will be presented in a lecture.
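The two quantities mentioned above, gaps introduced and residues aligned with each other, can be counted for any pair of already aligned sequences. The sketch below only counts gaps and identical positions; it is not the ClustalW algorithm, the function name is hypothetical, and the two short sequences are made up, with “-” assumed as the gap character.

```python
def alignment_summary(seq_a, seq_b):
    """Count gap columns and identical columns in two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    gaps = sum(1 for a, b in zip(seq_a, seq_b) if a == "-" or b == "-")
    identical = sum(1 for a, b in zip(seq_a, seq_b) if a == b and a != "-")
    return {"gaps": gaps, "identical": identical}


# Two made-up aligned sequences; the first has one gap.
print(alignment_summary("MTEY-KL", "MTEYAKL"))
```

An alignment program compares many candidate arrangements and prefers the one with few gaps and many matching or similar residues.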

Introduction to Phospholipase C-gamma and COX-2 (PTGS2)

Phospholipase C-gamma is believed to be a major enzyme of fertilization. The pathway of fertilization in Xenopus laevis is thought to be the following:

1) Sperm binds to the egg.

2) This binding somehow activates the 1b form of phospholipase D (PLD1b).

3) PLD1b breaks the lipid phosphatidylcholine down into phosphatidic acid (PA) and choline.

4) PA stimulates a tyrosine kinase called Src. Tyrosine kinases are enzymes that transfer a phosphate from ATP to other proteins. This “phosphorylation” can turn another protein on or off.

5) The activated Src phosphorylates the gamma form of Phospholipase C (PLC-γ).

6) PLC-γ breaks the lipid PIP2 down to IP3 and DAG. IP3 diffuses from the cell membrane to release calcium stored in the endoplasmic reticulum.

7) The calcium floods into the cytoplasm to cause the events of fertilization.

The calcium travels across the zygote from the sperm binding site, causing a wave of cortical granule exocytosis, a wave of elevation of the fertilization envelope, a wave of surface contraction (that we visualized), and initiation of other developmental events leading to first cleavage (or cytokinesis).

COX-2 (PTGS2) is also called prostaglandin H2 synthase-2 or cyclooxygenase-2. COX-2 has been thoroughly studied because of its role in prostaglandin synthesis. Prostaglandins have a wide range of roles in our body from aiding in digestion to propagating pain and inflammation. Aspirin is a general inhibitor of prostaglandin synthesis and, therefore, helps reduce pain. However, aspirin also inhibits the synthesis of prostaglandins that aid in digestion. Therefore, aspirin is a poor choice for pain and inflammation management for those with ulcers or other digestion problems. Recent advances in targeting specific prostaglandin-synthesizing enzymes have led to the development of Celebrex, which is marketed as an arthritis therapy. Celebrex is a potent and specific inhibitor of COX-2. Celebrex is considered specific because it doesn’t inhibit COX-1, which is involved in synthesizing prostaglandins that aid in digestion. This is a remarkable accomplishment given the great similarity between COX-1 and COX-2. This achievement has paved the way for developing new therapies that bind more specifically to their target and therefore have fewer side effects.

Understanding the enzyme structures of COX-1 and COX-2 helped researchers develop a drug that would only bind and inhibit COX-2. Many of the types of information and tools used by researchers for these types of studies are freely available on the web. In this tutorial, and throughout this lab course, you will be introduced to the databases and freely available software programs that are commonly used by professionals in research and medicine to study genes, proteins, protein structure and function, and genetic disease.

Gene Database:

Follow these directions to access the entries for PTGS1 and PTGS2 in the “Gene” database at the NCBI website:

1) Go to the NCBI homepage: http://www.ncbi.nlm.nih.gov

2) Just after the word “Search,” select “Gene” from the database drop-down menu. Enter “PTGS” in the “for” textbox, and click the Search button.

3) Find the results for the “Homo sapiens” entries, one called “PTGS1” and one called “PTGS2.” (In Firefox, try Ctrl-F, and enter Homo sapiens.)

4) Select each entry by clicking on its name, then read the paragraph under the Summary section for each entry.
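The same Gene search can also be expressed as a URL for NCBI's programmatic E-utilities service, which is not part of this tutorial's web-form steps. The sketch below only builds the URL and sends no request. The function name is hypothetical, and restricting the search with the [Organism] field tag is an assumption about how one might narrow the results to human entries.

```python
from urllib.parse import urlencode

# Base address of the E-utilities "esearch" service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_search_url(term, organism=None):
    """Build (but do not send) a search URL for the Gene database."""
    query = term
    if organism:
        # Assumption: narrow the search with the Organism field tag.
        query = f"{term} AND {organism}[Organism]"
    return ESEARCH_URL + "?" + urlencode({"db": "gene", "term": query})


print(build_gene_search_url("PTGS"))
print(build_gene_search_url("PTGS", organism="Homo sapiens"))
```

Pasting either printed URL into a browser would be one way to try the search; for the worksheet questions, the web pages described in the steps above remain the intended route.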

Answer the following questions.


1. PTGS1 and PTGS2 are isozymes: Isozymes catalyze the same reaction, but are coded by separate genes. Based on the summary, what types of reactions do PTGS enzymes catalyze?


2. Which gene forms multiple transcript variants?


3. Which isozyme would you want to inhibit to stop inflammation?


4. According to the Pathways section, what KEGG pathways are listed for these enzymes (other than “Metabolic pathways”)?





The next two questions are not discussed in the summaries - just read the questions and think about the answers.

5. The drug Celebrex selectively inhibits PTGS2 while aspirin and other NSAIDs inhibit both PTGS1 and PTGS2 in the same way. Why do you think researchers wanted to discover a selective inhibitor of PTGS2?


6. Describe how studying 3-D structures of PTGS1 and PTGS2 could help researchers design a drug that binds to PTGS1, but not to PTGS2.





7. Now start over and search for the gene for “Phospholipase C-gamma” in Homo sapiens. Find the PLCG1 and PLCG2 entries (case matters). On what chromosomes are these found?


8. Now, go to the PLCG1 entry. From the summary, what do IP3 and PIP2 stand for (spell out the complete chemical names)?


9. What is the official symbol of phospholipase C, gamma 1?



HUGO is the acronym for the Human Genome Organization. The HUGO Gene Nomenclature Committee’s acronym is HGNC. Click on the HGNC:9065 link next to “Primary Source.” This brings up the “Symbol Report” page. Find the section, “OMIM ID”, and click on the link associated with the entry 172420. OMIM stands for the Online Mendelian Inheritance in Man database. The OMIM database was started at Johns Hopkins University and is now maintained by NCBI. The OMIM database contains entries for both diseases with known genetic links and entries for the genes that have been linked to a disease. Each OMIM entry is a summary of the research that has been performed on the disease or gene and contains links to the research articles that it summarizes. You will be able to read about the clinical and biochemical research that has been performed related to the mutation you are studying. Is any information available related to mutations or mutants for PLC gamma? YES NO

Each link in the OMIM entry will open an abstract from the PubMed database. PubMed is a literature database, and is also maintained by NCBI. PubMed is a searchable database of medical and life science journal articles. Most of the abstracts for these articles can be accessed through PubMed, but in order to access the entire article, you need to go to each individual journal website and have a subscription to the journal. The Troy University library has subscriptions to electronic versions of many of these journals that you can access through the E-journal link on the library home page. Most journals have their articles available online as .pdf files for articles published from 1995 to the present. However, the older articles must still be accessed through the paper versions stored in libraries.

Go back to the “Symbol Report” page. In the section, “Accession Numbers”, click on the GenBank link. An example of a GenBank entry is shown below.



For PLC-gamma 1, fill in the following info:

Number of base pairs:

Gene sequence was obtained from “Molecule Type”:

Date of latest modification:

Accession number (Very important number):

Both the AMINO ACID (beginning with “/translation”) and then the GENE sequences (in ATGC) are listed. Amino acids have both a 3-letter and a 1-letter abbreviation; databases use the 1-letter abbreviations.


Go back to the original page on PLCG1 (the page with “Primary Source” and the “HGNC:9065” link that you followed). In your browser use Ctrl-F to find “SH3” in the Bibliography section of that page. Which journal published this entry?


Then, search for “RET9” in the Interactions section. Which journal published an article listed in PubMed about this entry?



You have explored human forms of the enzyme and its gene. Next, in the Entrez Gene database, search for a reference to the presence of the PLC-gamma enzyme in Xenopus laevis. You have to go back to the original page and use “Gene” for the database and “Phospholipase C-gamma Xenopus laevis” for the search string.

How many references for Xenopus PLC-gamma did you find?


What is the preferred name (the name before the “Other Aliases” line) of the enzyme in each reference (how do they differ)?



Table 1. 1- and 3-Letter Abbreviations of Amino Acids.

Amino Acid       3-Letter   1-Letter
Alanine          Ala        A
Arginine         Arg        R
Asparagine       Asn        N
Aspartic acid    Asp        D
Cysteine         Cys        C
Glutamic acid    Glu        E
Glutamine        Gln        Q
Glycine          Gly        G
Histidine        His        H
Isoleucine       Ile        I
Leucine          Leu        L
Lysine           Lys        K
Methionine       Met        M
Phenylalanine    Phe        F
Proline          Pro        P
Serine           Ser        S
Threonine        Thr        T
Tryptophan       Trp        W
Tyrosine         Tyr        Y
Valine           Val        V
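Table 1 can be turned into a lookup that converts a sequence written in 3-letter abbreviations into the 1-letter code that databases use. This is a minimal sketch; the function name and the hyphen-separated input format are assumptions made for the example.

```python
# Lookup built directly from Table 1.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Glu": "E", "Gln": "Q", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(three_letter_sequence, separator="-"):
    """Convert e.g. 'Met-Thr-Glu' to 'MTE'; raises KeyError for unknown codes."""
    codes = three_letter_sequence.split(separator)
    return "".join(THREE_TO_ONE[code.strip().capitalize()] for code in codes)


# The first five residues of the K-Ras example in the FASTA glossary entry.
print(to_one_letter("Met-Thr-Glu-Tyr-Lys"))
```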




For the first reference that you find, under “Related Sequences,” note that there are three listed:

         Nucleotide    Protein
mRNA     AB287408.1    BAF64273.1
mRNA     AF090111.1    AAD03594.1
mRNA     BC070837.1    AAH70837.1

The second column is a sequence of nucleotide bases; the third is the amino acid list for the base sequence.

Go back and select the second Xenopus PLC gamma reference. Under General gene information, you see “Pathways.” KEGG stands for the Kyoto Encyclopedia of Genes and Genomes. It is a database of metabolic pathways that is maintained by a research institute in Japan. It contains all the known metabolic and signaling pathways. Each protein in the pathway and each small molecule metabolite (e.g., ATP) has its own entry in the database that can be accessed by clicking on the protein or metabolite in the pathway figure. By using this website, you can make predictions about what would happen to downstream events in the pathway if the protein you are studying is either less active or more active. There are several links to click on to show how PLC-gamma1b is involved in metabolism. Click on the link related to inositol metabolism.

In the first link/path, the red arrow below shows where PLC gamma 1b is located - it has an enzyme number of 3.1.4.11.

PIP2, the reactant, is to the right (1-phosphatidyl-1D-myo-inositol 4,5-bisphosphate). What is the full name of the product IP3 according to this metabolic pathway?






Click on the next-to-last KEGG link, about a signaling system; what is the name of this pathway?



Essentially, you now have two names for equivalent pathways involving PLC. Note that they show PLC in red lettering and in a green box. Locate PIP2 (top center; a substrate for PLCγ) and write how they abbreviate it here in this second path:


Write down how they prefer to abbreviate IP3 (look for IP3 with some numbers in parentheses):