Bioinformatics 101 - Recreating the McDonald-Kreitman analysis

© Charlie Baer 2006


In this assignment you will:


1. Retrieve the original Drosophila Adh sequence data from the McDonald-Kreitman paper from GenBank.


2. Edit out the introns to retain the coding sequence.


3. Align the coding sequence using the ClustalW alignment algorithm.


4. Determine all polymorphic sites within each species.


5. Determine all substitutions (fixed differences) between species.


6. Translate the nucleotide sequence into an amino acid sequence and determine which mutations result in amino acid replacements.


7. Construct a 2x2 table with rows labeled "Replacement, Synonymous" and columns labeled "Fixed, Polymorphic".


8. Compare your results to McD & K.
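Once the 2x2 table is filled in, the natural question is whether the ratio of replacement to synonymous changes differs between the Fixed and Polymorphic columns. One common way to ask that of a small 2x2 table is a two-sided Fisher's exact test; the sketch below implements it with the Python standard library only. This is an illustration, not necessarily the test used in the paper; the function name is made up for this example and the counts are placeholders, not the McD & K results.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: Replacement, Synonymous. Columns: Fixed, Polymorphic.
    Returns the summed probability of all tables (with the same margins)
    that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small tolerance so that ties with the observed table are included
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_obs * (1 + 1e-9))


# Placeholder counts only; substitute the counts from your own table.
replacement_fixed, replacement_poly = 5, 3
synonymous_fixed, synonymous_poly = 10, 30
p_value = fisher_exact_two_sided(replacement_fixed, replacement_poly,
                                 synonymous_fixed, synonymous_poly)
print(f"two-sided P = {p_value:.4f}")
```

A small P suggests that the replacement:synonymous ratio is not the same for fixed and polymorphic sites.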


INSTRUCTIONS


1. Open BioEdit, the sequence editor we will use. BioEdit is freeware for the PC written by Tom Hall at NC State; a Mac-based equivalent is Se-Al, written and maintained by Andrew Rambaut at Oxford.


2. Go to the "World Wide Web" menu and then go to NCBI (National Center for Biotechnology Information).


3. Find the McDonald-Kreitman Adh data. The D. simulans / D. yakuba sequence accession numbers are in the paper; the D. melanogaster sequences can be found by using the search string "melanogaster and Kreitman and Adh".


4. Click open one of the records to see what a GenBank record looks like (many of you may have seen such files already). Note the information provided, especially the information about exons and introns.
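In a GenBank record the exon coordinates normally appear in the CDS feature as a location string of the form join(start..end,start..end,...). As a rough sketch, those intervals can be pulled out with a regular expression. The location string below is invented for illustration and is not taken from any Adh record, the function names are hypothetical, and the sketch ignores complement() and partial-location markers.

```python
import re


def parse_join_location(location):
    """Return [(start, end), ...] from a GenBank location string such as
    'join(10..20,35..50)' or '10..20'. Coordinates are 1-based, inclusive."""
    return [(int(start), int(end))
            for start, end in re.findall(r"(\d+)\.\.(\d+)", location)]


def intron_intervals(exons):
    """Introns are the gaps between consecutive exons."""
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


# Invented example location, for illustration only
exons = parse_join_location("join(10..20,35..50,60..80)")
print("exons:  ", exons)
print("introns:", intron_intervals(exons))
```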


5. Import the sequences from GenBank into BioEdit. Go into the "Display" menu and display the data as "GenBank". Go to the "Show" menu and show 200 records. Then go to the "Send to" menu and send to file. Then save the file in the appropriate location (e.g., the desktop) as <filename.gbk>. You will probably have to save two files, one of the simulans/yakuba data and one of the melanogaster data. The two files can then be merged into one large file. I did this by opening the .gbk file in Word and copying the data from the Dmel file into the Dsim/yak file. Then save the composite file with a new name.
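If you would rather not merge the files by hand, the same thing can be done with a few lines of Python, since GenBank flat files are plain text and each record ends with a "//" line. This is only a sketch; the function name and the file names in the usage note are hypothetical.

```python
from pathlib import Path


def merge_genbank_files(inputs, output):
    """Concatenate GenBank flat files into one file.

    Returns the number of LOCUS lines in the merged file as a sanity check.
    """
    with open(output, "w") as out:
        for name in inputs:
            text = Path(name).read_text()
            # Make sure one file's last line does not run into the next file
            out.write(text if text.endswith("\n") else text + "\n")
    merged = Path(output).read_text().splitlines()
    return sum(line.startswith("LOCUS") for line in merged)
```

For example, merge_genbank_files(["dsim_yak.gbk", "dmel.gbk"], "adh_all.gbk") would write the composite file and return the number of records it contains, which you can check against the number of sequences you expected.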


6. The next job is to remove the introns and leave only the CDS. There are various ways to do this. One way is to make a file with only the CDS from each sequence by sequentially copying and pasting. This method is tedious but relatively safe. Another straightforward way is to open the sequence file in BioEdit and the GenBank annotation file in a text editor (e.g., Word). Note that BioEdit provides a running record of the position that the cursor is on. The intron positions can be marked in lowercase letters by going to "Sequence" -> "Manipulations" -> "Lowercase". Then the introns can be deleted by selecting and deleting the selected regions.



Another way to save a sub-sequence is to go to the "Sequence" menu and click "Extract Positions". You will then get a window in which you can choose a string of sequence to copy. The chosen region will be copied at the bottom of the sequence window and can then be pasted into a new file.
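The intron-removal step can also be sketched in code, which is a useful cross-check on the hand-edited result. The example below assumes exon coordinates are given as 1-based, inclusive (start, end) pairs, as in a GenBank record; the toy sequence and coordinates are invented, and the function names are illustrative.

```python
def extract_cds(sequence, exons):
    """Join the exon segments of `sequence`, dropping everything else.

    `exons` holds (start, end) pairs, 1-based and inclusive.
    """
    return "".join(sequence[start - 1:end] for start, end in exons)


def mark_introns_lowercase(sequence, exons):
    """Uppercase the exons and lowercase everything else, similar to
    marking introns in lowercase before deleting them."""
    marked = list(sequence.lower())
    for start, end in exons:
        marked[start - 1:end] = sequence[start - 1:end].upper()
    return "".join(marked)


# Toy example: two exons separated by a 6-base "intron"
genomic = "ATGGCCGTAAGTTTTTAA"
toy_exons = [(1, 6), (13, 18)]
cds = extract_cds(genomic, toy_exons)
print(mark_introns_lowercase(genomic, toy_exons))
print(cds)
# A complete coding sequence should be a whole number of codons
print("length divisible by 3:", len(cds) % 3 == 0)
```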


NOTE: BioEdit (and all sequence editors) can save a lot of time, but they can sometimes be difficult. A relatively foolproof method of sequence editing is to cut and paste strings of sequence in a text editor and save the file as FASTA. FASTA files can be read by any sequence editor. FASTA format is as follows:


>Sequence1

ATAATATTCGCGTTC

>Sequence2

AGGTGCGGCCCTTTA

>SequenceN

ATTTTGCGCGCGCTTT


The sequence ID is designated with a ">" followed by the sequence name; the next line is the actual sequence. For BioEdit to read a FASTA file, save it with the suffix .fas.
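Because FASTA is so simple, it is easy to read and write without special software. The following is a minimal sketch with hypothetical function names. It accepts sequences that wrap over several lines, which many FASTA files do, and it ignores blank lines.

```python
def read_fasta(path):
    """Return {name: sequence} from a FASTA file."""
    records, name = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                name = line[1:]
                records[name] = ""
            elif name is not None:
                records[name] += line
    return records


def write_fasta(records, path, width=60):
    """Write {name: sequence} as FASTA, wrapping sequence lines at `width`."""
    with open(path, "w") as handle:
        for name, seq in records.items():
            handle.write(f">{name}\n")
            for i in range(0, len(seq), width):
                handle.write(seq[i:i + width] + "\n")
```

Writing the edited coding sequences with write_fasta(records, "adh_cds.fas") (the file name is just an example) gives a file with the .fas suffix that BioEdit expects.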




7. Go to the "Accessory Application" window and click on ClustalW multiple alignment. ClustalW is an alignment algorithm (probably the most widely used). Click "ok" at the prompt and use the default settings. The alignment may take a few minutes. The output will be aligned nucleotide sequences. Carefully look at the alignment and see if there are any obvious mistakes. Algorithms are not perfect and adjustments must frequently be made by eye. Save the alignment as a new file, e.g., <filename.align.bio>
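With an alignment in hand, finding polymorphic sites and fixed differences comes down to scanning the alignment column by column. The sketch below assumes the following working definitions, which you should check against the paper: a site is polymorphic if it varies within at least one species, and it is a fixed difference if it is invariant within every species but differs between species. Columns containing gaps are skipped. The function name and the toy alignment are invented for illustration.

```python
def classify_sites(alignment):
    """Classify alignment columns.

    `alignment` maps species name -> list of aligned sequences, all of the
    same length. Returns (polymorphic, fixed) as lists of 0-based columns.
    """
    length = len(next(iter(alignment.values()))[0])
    polymorphic, fixed = [], []
    for col in range(length):
        per_species = {sp: {seq[col] for seq in seqs}
                       for sp, seqs in alignment.items()}
        all_bases = set().union(*per_species.values())
        if "-" in all_bases:
            continue  # skip columns with alignment gaps
        if any(len(bases) > 1 for bases in per_species.values()):
            polymorphic.append(col)  # varies within at least one species
        elif len(all_bases) > 1:
            fixed.append(col)  # invariant within species, differs between
    return polymorphic, fixed


# Toy alignment, invented for illustration
toy = {
    "species_A": ["ATGC", "ATGA"],
    "species_B": ["ATTC", "ATTC"],
}
poly_cols, fixed_cols = classify_sites(toy)
print("polymorphic columns:", poly_cols)
print("fixed columns:", fixed_cols)
```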


8. Select all the sequences and translate them into amino acid sequences. To do this, under the "Sequence" menu click "toggle translate". This will allow you to identify the mutations that result in amino acid substitutions. You can then use a standard codon table to determine what the actual nucleotide mutations were. Various codon tables are available under the "Options" menu.
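The translation step can be reproduced in a few lines, which makes it easy to check whether a given nucleotide difference is a replacement or a synonymous change. The sketch below builds the standard genetic code from the usual TCAG ordering and compares the amino acids encoded by two codons. The function names are illustrative, and the comparison assumes the two codons come from the same position in the alignment.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds):
    """Translate a coding sequence; unknown codons (e.g. with gaps) give 'X'."""
    cds = cds.upper()
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X")
                   for i in range(0, len(cds) - 2, 3))


def classify_change(codon_a, codon_b):
    """Return 'synonymous' if both codons encode the same amino acid,
    otherwise 'replacement'."""
    aa_a = CODON_TABLE[codon_a.upper()]
    aa_b = CODON_TABLE[codon_b.upper()]
    return "synonymous" if aa_a == aa_b else "replacement"


print(translate("ATGGCCTTTTAA"))
print(classify_change("CTT", "CTC"))
print(classify_change("AAA", "GAA"))
```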



9. In the process of this exercise feel free to explore and play with BioEdit, especially if you have little experience with such software.