Bioinformatics - Places Traveled


1 Oct 2013


Bioinformatics


Cell and Molecular Biology Lab

(Created in part by: April Bednarski. Advised by: Professor Himadri Pakrasi. Funded by a grant from: Howard Hughes Medical Institute to Washington University)



First, briefly read through the glossary; you may need it during this exercise. I expect you to look through the glossary in more detail when you study for the final exam (final exam questions will come from the list).

We will be working through a tutorial on web-based bioinformatics programs. The tutorial is based on the enzymes phospholipase C-gamma (believed to be the major enzyme of fertilization) and cyclooxygenase-2 (COX-2), which is also named prostaglandin synthase-2 (PTGS2). You can read more about these proteins on the next page. This tutorial introduces the bioinformatics tools from the NCBI (National Center for Biotechnology Information) website. NCBI is a division of the National Library of Medicine at the National Institutes of Health (NIH).

These tools include Gene, GenBank, RefSeq, and PubMed. Gene is a database of genes in which each entry contains a brief summary, the common gene symbol, information about the gene function, and links to websites, articles, and sequence information for that gene. GenBank is a historical database of gene sequences, which means it contains every sequence that was published, even if the same sequence was published more than once. Therefore, GenBank is considered a redundant database. RefSeq is a database of sequences that is edited by NCBI and is NON-redundant, meaning that it contains what NCBI determines is the strongest sequence data for each gene.

Finally, we will be learning to use ClustalW, which is a multiple sequence alignment program. It allows you to enter a series of gene or protein sequences that you believe are similar and may be evolutionarily related. These sequences are usually obtained by performing a BLAST search. ClustalW then aligns the sequences so that the fewest gaps are introduced and the largest number of similar residues are aligned with each other. ClustalW uses a scoring matrix similar to BLOSUM-62, which is explained in your text and will be presented in lecture.


Introduction to Phospholipase C-gamma and COX-2 (PTGS2)


Phospholipase C-gamma is believed to be the major enzyme of fertilization. We obtained a partial clone of the gene when we performed RT-PCR. Take a look at the paper that Dr. Stith has put on our web site. We will go through the paper more thoroughly at some point. For now, the pathway of fertilization in Xenopus laevis may be the following:

1. Sperm binds to the egg.

2. This binding somehow activates one form (called 1b) of phospholipase D (PLD1b).

3. PLD1b breaks down a lipid (phosphatidylcholine) to phosphatidic acid (PA), also producing choline.

4. PA stimulates a tyrosine kinase called Src. Tyrosine kinases are enzymes that take a phosphate from ATP, then put that phosphate on other proteins. This “phosphorylation” turns a protein on (in this case, Src) or can turn another protein off.

5. Once turned on, Src phosphorylates the gamma form of phospholipase C (PLC-γ).

6. PLC-γ breaks down a lipid called PIP2 to make IP3 and DAG. IP3 diffuses from the membrane to release calcium from stores in the endoplasmic reticulum.

7. The calcium floods into the cytoplasm to cause the events of fertilization: the calcium travels across the zygote from the sperm binding site, causing a wave of cortical granule exocytosis, a wave of elevation of the fertilization envelope, a wave of surface contraction (that we visualized), and initiation of other developmental events leading to first cleavage (or cytokinesis).

See our fertilization lecture for a review of this.


PTGS2 is also called prostaglandin H2 synthase-2 and cyclooxygenase-2 (COX-2). COX-2 has been thoroughly studied because of its role in prostaglandin synthesis. Prostaglandins have a wide range of roles in our body, from aiding in digestion to propagating pain and inflammation. Aspirin is a general inhibitor of prostaglandin synthesis and therefore helps reduce pain. However, aspirin also inhibits the synthesis of prostaglandins that aid in digestion. Therefore, aspirin is a poor choice for pain and inflammation management for those with ulcers or other digestion problems.

Recent advances in targeting specific prostaglandin-synthesizing enzymes have led to the development of Celebrex, which is marketed as an arthritis therapy. Celebrex is a potent and specific inhibitor of COX-2. Celebrex is considered specific because it doesn’t inhibit COX-1, which is involved in synthesizing prostaglandins that aid in digestion. This is a remarkable accomplishment given the great similarity between COX-1 and COX-2. This achievement has paved the way for developing new therapies that bind more specifically to their target and therefore have fewer side effects.

Understanding the enzyme structures of COX-1 and COX-2 helped researchers develop a drug that would only bind and inhibit COX-2. Many of the types of information and tools used by researchers for these types of studies are freely available on the web. In this tutorial, and throughout this lab course, you will be introduced to the databases and freely available software programs that are commonly used by professionals in research and medicine to study genes, proteins, protein structure and function, and genetic disease.



Gene Database:

Follow these directions to access the entries for PTGS1 and PTGS2 in the “Gene” database at the NCBI Website:

A. First, go to the NCBI homepage by going to: http://www.ncbi.nlm.nih.gov

B. Just after the word “Search,” select “Gene” from the database pulldown menu. Type “PTGS” in the search box, then click “Go.”

C. Scan the results for the “Homo sapiens” entries. There should be one called “PTGS1” and one called “PTGS2.” We do not want the references to the enzyme found in the yeast Schizosaccharomyces pombe 972h.

D. Select each entry by clicking on its name, then read the paragraph under the “Summary” section for each entry.




After reading all 387 PubMed references (this is a joke) and the “Summary” section for both of these genes, answer the questions below.

1. PTGS1 and PTGS2 are isozymes. Isozymes catalyze the same reaction, but are coded by separate genes. What types of reactions do PTGS enzymes catalyze? Also, what pathway are these enzymes a part of?







2. How is the expression of PTGS1 and PTGS2 different?






3. Which isozyme would you want to inhibit to stop inflammation?







The next two questions are not discussed in the summaries; just read the questions and think about the answers.

4. The drug Celebrex selectively inhibits PTGS2 while aspirin and other NSAIDs inhibit both PTGS1 and PTGS2 in the same way. Why do you think researchers wanted to discover a selective inhibitor of PTGS2?




5. Describe how studying 3-D structures of PTGS1 and PTGS2 could help researchers design a drug that binds to PTGS1, but not to PTGS2.





E. Now type in “Phospholipase C-gamma” and search for this gene. Click on the first reference. Read about the two forms found in humans (types 1 and 2): click on the first line in the reference list (see the figure below showing the web page of this first PLC gamma reference from humans) and read the summary.

What do IP3 and PIP2 stand for (spell out the complete chemical names)?




On what chromosome are the type 1 and type 2 forms found?


What is the official symbol of phospholipase C, gamma 1?


Click on the red HGNC:9065 link next to “Primary Source.”


On the next page (symbol report), click on the link associated with the line: 17240 OMIM. OMIM stands for the Online Mendelian Inheritance in Man database. The OMIM database was started at Johns Hopkins University and is now maintained by NCBI. The OMIM database contains entries for both diseases with known genetic links and entries for the genes that have been linked to a disease. Each OMIM entry is a summary of the research that has been performed on the disease or gene and contains links to the research articles that it summarizes. You will be able to read about the clinical and biochemical research that has been performed related to the mutation you are studying. Is there any mutation information for PLC gamma? YES NO

Each link in the OMIM entry will open an abstract from the PubMed database. PubMed is a literature database, and is also maintained by NCBI. PubMed is a searchable database of medical and life science journal articles. Most of the abstracts for these articles can be accessed through PubMed, but in order to access the entire article, you need to go to each individual journal website and have a subscription to the journal. The WashU library system has subscriptions to electronic versions of many of these journals that you can access through the E-journal link on the WashU library home page. Most journals have their articles available online as .pdf files for articles published from 1995 to the present. However, the older articles must still be accessed through the paper versions stored in libraries.

What is the biological impact of a mutation in PLC gamma?


Next, click on the link in the upper right: GenBank. See an example of an entry below.






For PLC-gamma, fill in the following info:


Number of base pairs:


Gene sequence was obtained from

“Molecular Type”





Date of latest modification:


What is its accession number? This is a very important number:


Note that the AMINO ACID sequence (see “translation”) and then the GENE sequence (in ATGC) are given next.

Note that amino acids have a name, a 3-letter abbreviation, and a 1-letter abbreviation; databases use the one-letter abbreviation. From our text:



Go back to the original page on PLC-gamma 1 (you should be able to do this by hitting the back arrow on Internet Explorer two times). Go to Edit in the Internet Explorer bar at the top of the screen, then click on Find (on this page), or simply hit Ctrl-F. Then search for Src. In the Bibliography section, find the first paper that links Src to PLC-gamma; this might help our research in Xenopus fertilization since we believe that Src turns on PLC-gamma. Print off the first page of the paper (you need to use Adobe Reader) and place it in your lab notebook as evidence that you have completed this section successfully.



Repeat the search and find a few more papers in the Bibliography (you do not have to print them off). Then, continue the search through Interactions: what proteins may interact with PLC-gamma? Click on PubMed to obtain the paper that you find (and then print off the first page of the 1994 J. Biol. Chem. paper for your lab notebook).


You have explored human forms of the enzyme and its gene. Next, search for a reference to the presence of the PLC-gamma enzyme in Xenopus laevis. You have to go back to the original page that had “Gene” for the database and “Phospholipase C-gamma” (listing two human genes for PLC gamma 1 and 2 first, then the genes from other organisms next). These sequences were found by RT-PCR.


How many references for Xenopus PLC-gamma did you find?


What is the exact name of the enzyme in each reference (how do they differ)?



For the first reference that you find, under “Related Sequences,” note that there are two listed:

        Nucleotide   Protein
mRNA    AF090111     AAD03594
mRNA    BC070837     AAH70837

In each row, the first entry is a sequence of nucleotide bases and the second is the amino acid sequence encoded by that base sequence.


Go to the second Xenopus PLC gamma reference. Under General gene information, you see “Pathways.” KEGG stands for the Kyoto Encyclopedia of Genes and Genomes. It is a database of metabolic pathways that is maintained by a research institute in Japan. It catalogs known metabolic and signaling pathways. Each protein in the pathway and each small molecule metabolite (e.g., ATP) has its own entry in the database that can be accessed by clicking on the protein or metabolite in the pathway figure. By using this website, you can make predictions about what would happen to downstream events in the pathway if the protein you are studying is either less active or more active.

There are two links to click on to show how PLC-gamma1b is involved in metabolism. Click on the first link.


In the first link/path, the red arrow below shows where PLC gamma 1b is located; it has an EC number of 3.1.4.11. PIP2 is to the right (1-phosphatidyl-1D-myo-inositol 4,5-bisphosphate).

What is the full name of IP3 according to this metabolic pathway?





Click on the next KEGG link; what is the name of this pathway?

Essentially, you now have two names for equivalent pathways involving PLC. Note that they show PLC in red lettering and in a green box.

Locate PIP2 (top center; the substrate for PLC) and write how they abbreviate it here in this second path:


Write down how they prefer to abbreviate IP3 (look for IP3 in parentheses):




-----------------------------------------------------------------------

Part 1


Getting sequence information and viewing database entries

NCBI


Gene

1. Go back to the “Gene” entry for Homo sapiens PTGS2. Near the end of the file is the RefSeq accession number for the mRNA sequence of Homo sapiens prostaglandin-endoperoxide synthase 2. Write it here: __________________.


2. Open the RefSeq entry by clicking on the number (first link in this section), then choose “FASTA” from the DISPLAY pull-down menu. Copy the nucleotide sequence (including the title line designated by the “>” symbol) and paste it into a Word document.


3. Save the file as PTGS2rna.doc on your desktop.

Note: Please review the entry for “FASTA” in the Glossary (at the end of this protocol). Understanding this definition will be very important for working with the bioinformatics programs.


5. Next, on the same web page, find the amino acid sequence by searching the web page for /translation=". Follow the steps given above to save the amino acid sequence in FASTA format as a Word document called PTGS2prot.doc on your desktop.

Swiss-Prot Entry

6. Go to the Expasy website (http://us.expasy.org/) and search for the Swiss-Prot entry for Phospholipase C gamma 1. (You should use the gene name PLCG1 to search and be sure to select the HUMAN protein from the search results.) Hint: P19174 is the accession number.


7. Write at least three alternate names for this protein.



8. Where in the cell is this protein located?


9. What is its cofactor (needed for the enzyme to function)?


10. Under 2D gel databases, has someone performed a 2-D gel (see our lecture notes) and found where this enzyme appears? Click on “get region on 2-D gel” to find out.


11. What is the PLC gamma1 amino acid length and molecular weight?





Sequence Manipulation

12. Go to the Sequence Manipulation Suite (http://bioinformatics.org/sms/).

13. Click on “Translate” under the “DNA Analysis” heading in the menu.

14. Clear the data entry box by hitting “Clear”.

15. Copy the mRNA sequence from your Word file (PTGS2rna.doc) and paste it into the data entry box on the Sequence Manipulation website.

16. Select “Reading Frame 3” and “direct” from the pull-down menus, then click “Submit”.

17. When the Output window opens with your results, copy and paste the sequence into a Word document and save it as “translate.doc” on your desktop.

18. Compare this sequence in the “translate.doc” file with the sequence in “PTGS2prot.doc”. What are the first residues that are the same in the sequences? Do the sequences look like they are the same? (Hint: protein sequences should start with a methionine.)
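The reading-frame choice in steps 16 to 18 can be sketched in a few lines of Python. This is only an illustration that assumes the standard genetic code; the function name translate and the toy sequence are made up for this example and are not taken from the Sequence Manipulation Suite or from PTGS2.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, frame=1):
    """Translate a DNA or mRNA sequence on the direct strand in frame 1, 2, or 3."""
    # Remove whitespace, use upper case, and treat RNA "U" as DNA "T".
    seq = "".join(seq.split()).upper().replace("U", "T")
    protein = []
    for pos in range(frame - 1, len(seq) - 2, 3):
        # Codons containing unexpected letters are reported as "X".
        protein.append(CODON_TABLE.get(seq[pos:pos + 3], "X"))
    return "".join(protein)


# Hypothetical toy sequence: two leading bases, then ATG GCC TGA.
toy = "ccATGGCCTGA"
for frame in (1, 2, 3):
    print(frame, translate(toy, frame))
```

Only one of the three frames of the toy sequence begins with M (methionine) and ends at a stop codon (shown as *), which is the same clue the hint in step 18 asks you to look for.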





Multiple Sequence Alignment with ClustalW

19. The following are 6 FASTA formatted sequences of PTGS2 from different organisms.


>dog [Canis familiaris]

MLARALVLCAALAVVRAANPCCSHPCQNQGICMSTGFDQYKCDCTRTGFYGENCS

TPEFLTRIKLYLKPT

PNTVHYILTHFKGVWNIVNNIPFLRNTIMKYVLTSRSHLIESPPTYNVNYGYKSW

EAFSNLSYYTRALPP

VPDDCPTPMGVKGKKELPDSKEIVEKFLLRRKFIPDPQGTNMMFAFFAQHFTHQF

FKTDHKRGPAFTKGL

GHGVDLNHVYGETLDRQHKLRLFKDGKMKYQVIDGEVYPPTVKDTQVEMIYPPHV

PEHLQFAVGQEVFGL

VPGLMMYATIWLREHNRVCDVLKQEHPEWDDERLFQTSR
LILIGETIKIVIEDYV

QHLSGYHFKLKFDPE

LLFNQQFQYQNRIAAEFNTLYHWHPLLPDTLQIDDQEYNFQQFIYNNSILLEHGL

TQFVESFSRQIAGRV

AGGRNVPAAVQQVAKASIDQSRQMKYQSLNEYRKRFRLKPYTSFEELTGEKEMAA

GLEALYGDIDAMELY

PALLVEKPRPDAIFGETMVEMGAPFSLKGLMGNPICSPDYWKPSTFGGEVGFKII

NTASIQSLICNNVKG

CPFTAFS
VQDGQLTKTVTINASSSHSGLDDINPTVLLKERSTEL


>cow [Bos taurus]

MLARALLLCAAVALSGAANPCCSHPCQNRGVCMSVGFDQYKCDCTRTGFYGENCT

TPEFLTRIKLLLKPT



PNTVHYILTHFKGVWNIVNKISFLRNMIMRYVLTSRSHLIESPPTYNVHYSYKSW

EAFSNLSYYTRALPP

VPDDCPTPMGVKGRKELPDSKEVVKKVLLRRKFIPDPQGTNLMFAFFAQHFTHQF

FKTDFERGPAFTKGK

NHGVDLSHIYGESLERQHKLRLFKDGKMKYQMINGEMYPPTVKDTQVEMIYPPHV

PEHLKFAVGQEVFGL

VPGLMMYATIWLREHNRVCDVLKQEHPEWGDEQLFQTSRLILIGETIKIVIEDYV

QHLSGYHFKLKFDPE

LLFNQQFQYQNRIAAEFNTLYHWHPLLPDVFQIDGQEYNYQQFIYNNSVLLEHGL

TQFVESFTRQRAGRV

AGGRNLPVAVEKVSKASIDQSRE
MKYQSFNEYRKRFLVKPYESFEELTGEKEMAA

ELEALYGDIDAMEFY

PALLVEKPRPDAIFGETMVEAGAPFSLKGLMGNPICSPEYWKPSTFGGEVGFKII

NTASIQSLICSNVKG

CPFTSFSVQDTHLTKTVTINASSSHSGLDDINPTVLLKERSTEL


>mouse [Mus musculus]

MLFRAVLLCAALGLSQAANPCCSNPCQNRGECMSTGFDQYKCDCTRTGFYGENCT

TPEFLTRIKLL
LKPT

PNTVHYILTHFKGVWNIVNNIPFLRSLIMKYVLTSRSYLIDSPPTYNVHYGYKSW

EAFSNLSYYTRALPP

VADDCPTPMGVKGNKELPDSKEVLEKVLLRREFIPDPQGSNMMFAFFAQHFTHQF

FKTDHKRGPGFTRGL

GHGVDLNHIYGETLDRQHKLRLFKDGKLKYQVIGGEVYPPTVKDTQVEMIYPPHI

PENLQFAVGQEVFGL

VPGLMMYATIWLREHNRVCDILKQEHPEWGDEQLF
QTSRLILIGETIKIVIEDYV

QHLSGYHFKLKFDPE

LLFNQQFQYQNRIASEFNTLYHWHPLLPDTFNIEDQEYSFKQFLYNNSILLEHGL

TQFVESFTRQIAGRV

AGGRNVPIAVQAVAKASIDQSREMKYQSLNEYRKRFSLKPYTSFEELTGEKEMAA

ELKALYSDIDVMELY

PALLVEKPRPDAIFGETMVELGAPFSLKGLMGNPICSPQYWKPSTFGGEVGFKII

NTASIQSLICNNVKG

CPF
TSFNVQDPQPTKTATINASASHSRLDDINPTVLIKRRSTEL


>Rabbit

MLARALLLCAAVALSHAANPCCSNPCQNRGVCMTMGFDQYKCDCTRTGFYGENCS

TPEFLTRIKLLLKPT

PDTVHYILTHFKGVWNIVNSIPFLRNSIMKYVLTSRSHMIDSPPTYNVHYNYKSW

EAFSNLSYYTRALPP

VADDCPTPMGVKGKKELPDSKDVVEKLLLRRKFIPDPQGTNMMFAFFAQHFTHQF

FKTDL
KRGPAFTKGL

GHGVDLNHIYGETLDRQHKLRLFKDGKMKYQVIDGEVYPPTVKDTQVEMIYPPHI

PAHLQFAVGQEVFGL

VPGLMMYATIWLREHNRVCDVLKQEHPEWDDEQLFQTSRLILIGETIKIVIEDYV

QHLSGYHFKLKFDPE

LLFNQQFQYQNRIAAEFNTLYHWHPLLPDTFQIDDQQYNYQQFLYNNSILLEHGL

TQFVESFTRQIAGRV

AGGRNVPPAVQKVAKASIDQSRQMKYQSL
NEYRKRFLLKPYESFEELTGEKEMAA

ELEALYGDIDAVELY

PALLVERPRPDAIFGESMVEMGAPFSLKGLMGNPICSPNYWKPSTFGGEVGFKIV



NTASIQSLICNNVKG

CPFTSFNVPDPQLTKTVTINASASHSRLEDINPTVLLKGRSTEL



>pig [Sus scrofa]

MLARALLLCAAVSLCTAAKPCCSNPCQNRGICMSVGFDHYKCDCTRTGFYGENCT

TPEFLTRIKLFLKPT

PNTV
HYILTHFKGVWNIVNNIPFLRNAIMKYVLISRSHLIDSPPTYNMHYGYKSW

EAFSNLSYYTRALPP

VPDDCPTPMGVKGRKELPDSKEVVEKLLLRRKFIPDPQGTNMMFAFFAQHFTHQF

FKTDQKRGPAFTKGQ

GHGVDLSHVYGESLERQHKLRLFKDGKMKYQIIDGEMYPPTAKDTQVEMIYPPHT

PEHLRFAVGHEVFGL

VPGLMMYATIWLREHNRVCDVLKQEHPEWDDERLFQTSRLILIG
ETIKIVIEDYV

QHLSGYHFKLKFDPE

LLFNQQFQYQNRIAAEFNTLYHWHPLLPDAFQIDGHEYNYQQFLYNNSILLEHGI

TQFVESFSRQIAGRV

AGGRNLPAAVQKVSKASIDQSREMRYQSFNEYRKRFLLKPYRSFEELTGEKEMAA

ELEALYGDIDAMELY

PALLVEKPRPDAIFGETMVEAGAPFSLKGLMGNPICSPEYWKPSTFGGEVGFKII

NTASIQSLICNNVKG

CPFTSFSVQDPQ
LAKTVTINASSSHSGLDDINPTVLLKERSTEL


>coral [Gersemia fruticosa]

MVAKFVVFLGLQLILCSVVCEAVNPCCSFPCESGAVCVEDGDKYTCDCTRTGHYG

VNCEKPNWSTWFKAL

IAPSEETKHFILTHFKWFWWIVNNVPFIRNTVMKAAYFSRTDFVPVPHAYTSYHD

YATMEAHYNRSYFAR

TLPPVPKNCPTPFGVAGKKELPPAEEVANKFLKRGKFKTDHTSTSWLFMF
FAQHF

THEFFKTIYHSPAFT

WGNHGVDVSHIYGQDMERQNKLRSFEDGKLKSQTINGEEWPPYLKDVDNVTMQYP

PNTPEDQKFALGHPF

YSMLPGLFMYASIWLREHNRVCTILRKEHPHWVDERLYQTGKLIITGELIKIVIE

DYVNHLANYNLKLTY

NPELVFDHGYDYDNRIHVEFNHMYHWHPFSPDEYNISGSTYSIQDFMYHPEIVVK

HGMSSFVDSMSKGLC

GQMSHHNHGAYTLDVAVE
VIKHQRELRMQSFNNYRKHFALEPYKSFEELTGDPKM

SAELQEVYGDVNAVD

LYVGFFLEKGLTTSPFGITMIAFGAPYSLRGLLSNPVSSPTYWKPSTFGGDVGFD

MVKTASLEKLFCQNI

AGECPLVTFTVPDDIARETRKVLEARDEL



20. Go to the ClustalW website (http://www.ebi.ac.uk/clustalw/index.html) and enter (by using “copy” and “paste”) all of the FASTA formatted sequences above into the data entry box. Select “input” for the Output order. Choose Full for alignment. For output format, select aln w/numbers so we can find particular residues (amino acids) in the alignment. Press “Run”, located in the lower left corner.

21. View the output: the SCORES table.

SeqA  Name   Len(aa)  SeqB  Name   Len(aa)  Score
===================================================
1     dog    604      2     cow    604      90
1     dog    604      3     mouse  604      89




Note that different specific combinations are examined; DOG TO COW, for example. You would expect a higher SCORE (right column; similarity of the sequences) between two mammals than between a mouse and the coral.

What is the similarity score for the same gene found in mouse and coral? ________
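To get a feel for what a pairwise score measures, here is a rough Python sketch that computes percent identity between two already-aligned sequences. It is an assumption that this matches how the ClustalW SCORES column is derived; ClustalW computes its scores from its own pairwise alignments, so treat the function percent_identity as an illustration only. The two strings are the first 20 residues of the dog and cow sequences listed above.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of compared columns that hold the same residue in both sequences.

    Columns where either sequence has a gap ("-") are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


dog_start = "MLARALVLCAALAVVRAANP"  # first 20 residues of the dog sequence
cow_start = "MLARALLLCAAVALSGAANP"  # first 20 residues of the cow sequence
print(percent_identity(dog_start, cow_start))
```

This short N-terminal stretch is more variable than the protein as a whole, so its identity comes out lower than the whole-sequence dog-to-cow score of 90 in the table.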


Go to the bottom of the page and view the Cladogram. Go to this source to learn more about cladograms: http://en.wikipedia.org/wiki/Cladogram. Summarize your findings about the evolution of this enzyme in your lab notebook.


Now for the most important part of this ClustalW analysis: an amino acid by amino acid comparison of the same protein from different species. Go about halfway down the web page and find ALIGNMENT. A button labeled 'Show Colors' will be displayed in the Alignment section of the results page. If you press this button, the alignment will be shown in color according to the table below; remember our earlier discussion of types of amino acids. (This option only works when you have chosen ALN or GCG as the output format.)

Residues    Color     Property
AVFPMILW    RED       Small (small + hydrophobic (incl. aromatic -Y))
DE          BLUE      Acidic
RHK         MAGENTA   Basic
STYHCNGQ    GREEN     Hydroxyl + Amine + Basic - Q
Others      GRAY



CONSENSUS SYMBOLS:

An alignment will display by default the following symbols denoting the degree of conservation observed in each column:

"*" means that the residues or nucleotides in that column are identical in all sequences in the alignment.

":" means that conserved substitutions have been observed, according to the COLOR table above.

"." means that semi-conserved substitutions are observed.
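The rules for "*" and ":" can be sketched in Python as follows. This is a simplified illustration and not ClustalW's actual procedure: it assumes the color groups from the table in this handout define the conserved substitutions, and it does not attempt the "." (semi-conserved) symbol. The names GROUPS, consensus_symbol, and consensus_line are hypothetical.

```python
# Color groups copied from the table in this handout. Treating them as the
# "conserved substitution" groups is a simplifying assumption.
GROUPS = [set("AVFPMILW"), set("DE"), set("RHK"), set("STYHCNGQ")]


def consensus_symbol(column):
    """Return '*', ':' or ' ' for one alignment column (a sequence of residues)."""
    residues = set(column)
    if "-" in residues:
        return " "  # a gap in any sequence means the column is not conserved
    if len(residues) == 1:
        return "*"  # identical residue in every sequence
    if any(residues <= group for group in GROUPS):
        return ":"  # every residue falls inside a single color group
    return " "


def consensus_line(aligned):
    """Build the consensus line for a list of equal-length aligned sequences."""
    return "".join(consensus_symbol(col) for col in zip(*aligned))


# Hypothetical toy alignment of three short sequences.
toy_alignment = ["MLARD", "MLVKE", "MIAR-"]
print(repr(consensus_line(toy_alignment)))
```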


Copy the alignment of amino acids in various species and paste it into a Word document. To make this file readable, do the following things:

a. Go to “Page Set-up” under “File” and change the page orientation to landscape.

b. Select all text and change to “Courier” font, size 10. Courier is the best font for alignments because all the letters are the same width. This is one of the major secrets of working with FASTA sequences.

c. Save this file to the desktop as “ClustalW.doc” and print it (send the file to yourself by email or place it on a floppy or flash drive). Place a copy in your lab notebook.

22. Review the alignment. What symbols are used for positions in the alignment that contain identical, highly homologous, homologous, and non-homologous residues?



Which residues (amino acids) are conserved (same sequence from species to species, sequence maintained over millions of years of evolutionary time)?




Why would you expect them to be conserved?



Glossary for Bio3055

BLAST - Basic Local Alignment Search Tool - A program that compares a sequence (input) to all the sequences in a database (that you choose). This program aligns the most similar segments between sequences. BLAST aligns sequences using a scoring matrix similar to BLOSUM (see entry). This scoring method gives penalties for gaps and gives the highest score for identical residues. Substitutions are scored based on how conservative the changes are. The output shows a list of sequences, with the highest scoring sequence at the top. The scoring output is given as an E-value. The lower the E-value, the higher scoring the sequence is. Sequences with E-values in the range of 10^-100 to 10^-50 are very similar (or even identical) to the query. Sequences with E-values of 10^-10 and higher need to be examined based on other methods to determine homology. An E-value of 10^-10 for a sequence can be interpreted as, “a 1 in 10^10 chance that the sequence was pulled from the database by chance alone (has no homology to the query sequence).”

This program can be accessed from the NCBI homepage or:

http://www.ncbi.nlm.nih.gov/BLAST

Reference: Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.


BLOSUM - Blocks Substitution Matrix - A type of substitution matrix that is used by programs like BLAST to give sequences a score based on similarity to another sequence. The scoring matrix gives a score to conservative substitutions of amino acids. A conservative substitution is a substitution of an amino acid similar in size and chemical properties to the amino acid in the query sequence. Discussed in the Berg text, p. 175-178.
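As a rough illustration of how a substitution matrix turns an alignment into a score, the Python sketch below sums pair scores over an ungapped alignment. The SCORES dictionary holds only a small excerpt of BLOSUM62 values; check them against the full published matrix before relying on them. The function names and the toy peptides are hypothetical, and real programs also apply gap penalties, which are left out here.

```python
# Small excerpt of BLOSUM62 (identical pairs, conservative pairs, and one
# non-conservative pair). Verify against the full matrix before real use.
SCORES = {
    ("D", "D"): 6, ("E", "E"): 5, ("D", "E"): 2,
    ("L", "L"): 4, ("I", "I"): 4, ("I", "L"): 2,
    ("K", "K"): 5, ("R", "R"): 5, ("K", "R"): 2,
    ("D", "L"): -4,
}


def pair_score(a, b):
    """Look up the score for a residue pair; the matrix is symmetric."""
    key = (a, b) if (a, b) in SCORES else (b, a)
    return SCORES[key]


def ungapped_score(seq_a, seq_b):
    """Sum the pair scores over two equal-length sequences (no gaps allowed)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq_a, seq_b))


print(ungapped_score("DLK", "DLK"))  # identical residues
print(ungapped_score("DLK", "ELR"))  # conservative substitutions
print(ungapped_score("DLK", "LLK"))  # one non-conservative substitution
```

Identical residues score highest, conservative substitutions (D to E, K to R) still score positively, and the non-conservative D to L substitution is penalized.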




Bioinformatics - Bioinformatics is a field of study that merges math, biology, and computer science. Researchers in this field have developed a wide range of tools to help biomedical researchers work with genomic, biochemical, and medical information. Some types of bioinformatics tools include database storage and search programs as well as software programs for analyzing genomic and proteomic data.


ClustalW - A program for making multiple sequence alignments.

http://www.ebi.ac.uk/clustalw/index.html

W. R. Pearson (1990) “Rapid and Sensitive Sequence Comparison with FASTP and FASTA” Methods in Enzymology 183:63-98.


Conserved - When talking about a position in a multiple sequence alignment, “conserved” means the amino acid residues at that position are identical throughout the alignment.


Conservative residue change - When talking about a position in a multiple sequence alignment, a “conservative change” is when there is a change to a homologous amino acid residue.


cpk coloring mode - This coloring mode colors based on atom identity:

red = oxygen

blue = nitrogen

orange = phosphorus

yellow = sulfur

gray = carbon


DeepView/Swiss-Pdb Viewer - A program for viewing 3-D structures. It loads “.pdb” files, which contain the 3-D coordinates for molecular structures. Swiss-Pdb Viewer is easy and free to download on any computer (Mac or PC) and can be used no matter what browser you are using. It is fairly easy to learn to use at the basic level; however, it also has very advanced capabilities that can be useful in research. It is also a nice program to use with PovRay, which allows you to make graphic files from pdb information. This is important when making figures for a presentation, report, or journal article. If you would like to download Swiss-Pdb Viewer for your own computer, the program is available for free and is easy to download from the website, “us.expasy.org/spdbv”. A help manual is also available here if you have further questions that aren’t addressed in this course.

http://us.expasy.org/

To run this program with Mac OSX, you must first change the monitor settings.

a. Open “System Preferences” on your computer.

b. Double click on the “Displays” icon.

c. On the right-hand side of the panel, choose “thousands” of colors from the list (changing it to “thousands” from “millions”).

d. Then close System Preferences and then open Swiss-Pdb Viewer.

Names of some other structure viewing programs:

RasMol (www.openrasmol.org)

Kinemage (www.kinemage.biochem.duke.edu) -- the one we use!!

Protein Explorer (www.proteinexplorer.org)


EC number - Enzyme Commission number - Given by the IUBMB (International Union of Biochemistry and Molecular Biology), it classifies enzymes according to the reaction catalyzed. An EC Number is composed of four numbers separated by dots. For example, alcohol dehydrogenase has the EC Number 1.1.1.1
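A small Python sketch of reading an EC number follows. The top-level class names in EC_CLASSES are the classic six IUBMB classes; they are not listed in this handout and are included here as background, and the function name parse_ec is hypothetical.

```python
# First-digit classes of the EC system (the classic six top-level classes).
EC_CLASSES = {
    1: "oxidoreductase",
    2: "transferase",
    3: "hydrolase",
    4: "lyase",
    5: "isomerase",
    6: "ligase",
}


def parse_ec(ec):
    """Split an EC number such as '1.1.1.1' into four integers and name its class."""
    parts = ec.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError("expected four numbers separated by dots: " + repr(ec))
    numbers = tuple(int(p) for p in parts)
    return numbers, EC_CLASSES.get(numbers[0], "unknown")


print(parse_ec("1.1.1.1"))   # alcohol dehydrogenase, the example in this entry
print(parse_ec("3.1.4.11"))  # the number shown for PLC in the KEGG pathway
```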


ExPASy - Expert Protein Analysis System - A server maintained by the Swiss Institute of Bioinformatics. Home of SWISS-PROT, the most extensive and annotated protein database. The Swiss-Pdb Viewer protein-viewing program is also available at this site for free download.

http://us.expasy.org/


FASTA - A way of formatting a nucleic acid or protein sequence. It is important because many bioinformatics programs require that the sequence be in FASTA format. The FASTA format has a title line for each sequence that begins with a “>” followed by any needed text to name the sequence. The end of the title line is signified by a paragraph mark (hit the return key). Bioinformatics programs will know that the title line isn’t part of the sequence if you have it formatted correctly. The sequence itself does NOT have any returns, spaces, or formatting of any kind. The sequence is given in one-letter code. An example of a protein in correct FASTA format is shown below:

>K-Ras protein Homo sapiens

MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDI

LDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVP

MVLVGNKCDLPSRTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRK

HKEKMSKDGKKKKKKSKTKCVIM
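To make the format concrete, here is a minimal Python sketch of how a program might read FASTA text: every line starting with “>” opens a new record, and the lines that follow are joined into one sequence. The function name parse_fasta is hypothetical, and tolerating line breaks and spaces inside a sequence is a choice made for this sketch; individual tools differ in how strict they are.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) pairs from FASTA-formatted text."""
    records = []
    title = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                records.append((title, "".join(chunks)))
            title = line[1:].strip()
            chunks = []
        elif title is not None:
            # Sequence line: drop any internal spaces and keep the letters.
            chunks.append(line.replace(" ", ""))
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


# The first two lines of the K-Ras example above.
example = """>K-Ras protein Homo sapiens
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDI
LDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVP
"""
for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```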


GenBank - A database of nucleotide sequences from >130,000 organisms. This is the main database for nucleotide sequences. It is a historical database, meaning it is redundant. When new or updated information is entered into GenBank, it is given a new entry, but the older sequence information is also kept in the database. GenBank belongs to an international collaboration of sequence databases, which also includes EMBL (European Molecular Biology Laboratory) and DDBJ (DNA Data Bank of Japan). In contrast, the RefSeq database (see entry) is non-redundant and contains only the most current sequence information for genetic loci. The GenBank database can be searched at the NCBI homepage:

http://www.ncbi.nlm.nih.gov/


Gene - An NCBI database of genetic loci. This database used to be called LocusLink. Entries provide links to RefSeqs, articles in PubMed, and other descriptive information about genetic loci. The database also provides information on official nomenclature, aliases, sequence accession numbers, phenotypes, EC numbers, OMIM numbers, UniGene clusters, map information, and relevant web sites. Access through the NCBI homepage by selecting “Gene” from the Search pulldown menu.

Genome - The entire amount of genetic information for an organism. The human genome is the set of 46 chromosomes.

Homologous - When referring to amino acids, a homologous amino acid is similar to the reference amino acid in chemical properties and size. For example, glutamate can be considered homologous to aspartate because both residues are roughly similar in size and both residues contain a carboxylic acid moiety, which gives them similar chemical properties.

KEGG - Kyoto Encyclopedia of Genes and Genomes - This website is used for accessing metabolic pathways. At this website, you can search a process, gene, protein, or metabolite and obtain diagrams of all the metabolic pathways associated with your query. You will see a link to the KEGG entry at the end of the Gene entry for a gene.

http://www.genome.ad.jp/kegg/

NCBI - National Center for Biotechnology Information - This center was formed in 1988 as a division of the NLM (National Library of Medicine) at the NIH (National Institutes of Health). As part of the NIH, NCBI is funded by the US government. The main goal of the center is to provide resources for biomedical researchers as well as the general public. The center is continually developing new materials and updating databases. The entire human genome is freely available on this website and is updated daily as new and better data become available. The NCBI homepage:

http://www.ncbi.nlm.nih.gov

NCBI also maintains an extensive education site, which offers online tutorials of its databases and programs:

http://www.ncbi.nlm.nih.gov/About/outreach/courses.html

OMIM - Online Mendelian Inheritance in Man - A continuously updated catalog of human genes and genetic disorders, with links to associated literature references, sequence records, maps, and related databases. Access through the NCBI homepage or:

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

Protein Data Bank (PDB) - A database that contains every published 3-D structure of biological macromolecules. It contains mostly proteins, but also DNA and RNA structures. Also see RCSB.

http://www.rcsb.org/pdb/

A pdb file is a file containing the three-dimensional coordinates (x,y,z) for each of the atoms in the protein. This type of file is made using the data obtained from either an X-ray crystallography experiment or an NMR experiment. Once you have a pdb file of a protein, you can open the file in various structure viewing programs to view the protein structure.
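As a rough sketch of what "coordinates for each of the atoms" looks like in practice, the Python below reads x, y, z values from the fixed-width ATOM records of the classic PDB text format (x in columns 31-38, y in 39-46, z in 47-54). The helper make_atom_line and the sample atom are hypothetical and exist only to produce a correctly spaced line for the example; they are not taken from any real structure.

```python
def atom_coordinates(pdb_text):
    """Return (atom_name, x, y, z) for every ATOM or HETATM record in the text."""
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            name = line[12:16].strip()   # columns 13-16: atom name
            x = float(line[30:38])       # columns 31-38
            y = float(line[38:46])       # columns 39-46
            z = float(line[46:54])       # columns 47-54
            coords.append((name, x, y, z))
    return coords


def make_atom_line(serial, name, residue, chain, res_seq, x, y, z):
    """Build a hypothetical ATOM record with the fixed column layout."""
    return (
        f"ATOM  {serial:>5} {name:<4} {residue:>3} {chain}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}"
    )


# Made-up coordinates for a single nitrogen atom of a methionine residue.
sample = make_atom_line(1, "N", "MET", "A", 1, 27.340, 24.430, 2.614)
print(sample)
print(atom_coordinates(sample))
```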

Proteome - The entire set of expressed proteins for an organism. This term is commonly used to discuss the set of proteins that are expressed in a certain cell type or tissue under specific conditions.


PSIPRED - A server for predicting secondary structure from protein sequences. The predictions are made based on a database of known secondary structures for protein sequences. These predictions are estimated to be correct ~80% of the time. This server can also be used to predict transmembrane segments.

http://bioinf.cs.ucl.ac.uk/psipred/

McGuffin LJ, Bryson K, Jones DT. (2000) The PSIPRED protein structure prediction server. Bioinformatics. 16, 404-405.

Jones DT. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292: 195-202.



PubMed - When writing a paper on a particular science/medical topic, you should always check PubMed. It is a retrieval system containing citations, abstracts, and indexing terms for journal articles in the biomedical sciences. PubMed contains the complete contents of the MEDLINE and PREMEDLINE databases. It also contains some articles and journals considered out of scope for MEDLINE, based on either content or on a period of time when the journal was not indexed, and therefore is a superset of MEDLINE.

http://www.ncbi.nlm.nih.gov/

RCSB - Research Collaboratory for Structural Bioinformatics - A non-profit consortium that works to provide free public resources and publications to assist others and further the fields of bioinformatics and biology dedicated to the study of 3-D biological macromolecules. Members include Rutgers, San Diego Supercomputer Center, University of Wisconsin, and CARB-NIST (at NIH).

RefSeq - NCBI database of Reference Sequences. Curated, non-redundant set including genomic DNA contigs, mRNAs, proteins, and entire chromosomes. Accession numbers have the format of two letters, an underscore bar, and six digits. Example: NT_123456. Code: NT, NC, NG = genomic; NM = mRNA; NP = protein (See NCBI site map for more of the two letter codes).
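The accession format described in this entry can be checked with a short Python sketch. The pattern follows the description above (two letters, an underscore, digits); allowing more than six digits and an optional version suffix such as ".2" is an assumption added here, because current records often carry both. The function name classify_accession and the sample accessions (apart from AF090111, which appears earlier in this handout) are hypothetical.

```python
import re

# Two-letter prefixes listed in this glossary entry.
PREFIXES = {
    "NT": "genomic",
    "NC": "genomic",
    "NG": "genomic",
    "NM": "mRNA",
    "NP": "protein",
}

# Two capital letters, an underscore, six or more digits, optional ".version".
ACCESSION = re.compile(r"^([A-Z]{2})_(\d{6,})(?:\.(\d+))?$")


def classify_accession(accession):
    """Return the molecule type for a RefSeq-style accession, or None if it does not match."""
    match = ACCESSION.match(accession.strip())
    if match is None:
        return None
    return PREFIXES.get(match.group(1), "other")


for acc in ("NT_123456", "NM_123456.2", "NP_123456", "AF090111"):
    print(acc, classify_accession(acc))
```

A GenBank accession such as AF090111 has no underscore, so it is not recognized as a RefSeq accession, which is one quick way to tell the two databases apart.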

Sequence Manipulation Suite - A website that contains a collection of web-based programs for analyzing and formatting DNA and protein sequences.

http://bioinformatics.org/sms/