Bioinformatics 101 - Recreating the McDonald-Kreitman analysis


1 Oct 2013

© Charlie Baer 2006

In this assignment you will:

1. Retrieve the original Adh sequence data from the McDonald-Kreitman paper from GenBank.

2. Edit out the introns to retain the coding sequence.

3. Align the coding sequence using the ClustalW alignment algorithm.

4. Determine all polymorphic sites within each species.

5. Determine all substitutions (fixed differences) between species.

6. Translate the nucleotide sequence into amino acid sequence and determine which differences
result in amino acid replacements.

7. Construct a 2x2 table with rows labeled "Replacement, Synonymous" and columns labeled
"Fixed, Polymorphic".

8. Compare your results to those of McDonald and Kreitman.
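The 2x2 table described in the list above can also be tallied programmatically once each variable site has been classified by hand. The following is a minimal sketch only; the function names `mk_table` and `format_table` and the toy data in the comment are hypothetical and are not taken from the paper.

```python
from collections import Counter

ROWS = ("Replacement", "Synonymous")
COLS = ("Fixed", "Polymorphic")

def mk_table(sites):
    """Tally (row, column) labels into a nested dict of counts.

    `sites` is an iterable of (effect, kind) pairs, where effect is one of
    ROWS and kind is one of COLS. Any other label raises ValueError.
    """
    counts = Counter()
    for effect, kind in sites:
        if effect not in ROWS or kind not in COLS:
            raise ValueError(f"unexpected label: {(effect, kind)!r}")
        counts[(effect, kind)] += 1
    return {r: {c: counts[(r, c)] for c in COLS} for r in ROWS}

def format_table(table):
    """Return the table as aligned plain text, one row per line."""
    lines = [f"{'':<12}{COLS[0]:>8}{COLS[1]:>13}"]
    for r in ROWS:
        lines.append(f"{r:<12}{table[r][COLS[0]]:>8}{table[r][COLS[1]]:>13}")
    return "\n".join(lines)

# Toy usage with invented labels (NOT the real Adh data):
# toy = [("Synonymous", "Polymorphic"), ("Replacement", "Fixed")]
# print(format_table(mk_table(toy)))
```

Keeping the classification of each site separate from the tally makes it easier to recheck individual sites when your counts differ from the published ones.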


1. Open BioEdit, the sequence editor we will use. BioEdit is freeware for the PC written by Tom
Hall at NC State; a Mac-based equivalent is Se-Al, written and maintained by Andrew Rambaut
at Oxford.

2. Go to the "World Wide Web" menu and then go to NCBI (National Center for Biotechnology
Information).

3. Find the McDonald-Kreitman data. The D. simulans / D. yakuba sequence accession
numbers are in the paper; the D. melanogaster sequences can be found by using the search string
"melanogaster and Kreitman and Adh".

4. Click open one of the records to see what a GenBank record looks like (many of you may have
seen such files already). Note the information provided, especially the information about exons
and introns.

5. Import the sequences from GenBank into BioEdit. Go into the "Display" menu and display
the data as "GenBank". Go to the "Show" menu and show 200 records. Then go to the "Send to"
menu and send to file. Then save the file in the appropriate location (e.g., the desktop) as
<filename.gbk>. You will probably have to save two files, one of the D. simulans / D. yakuba
data and one of the D. melanogaster data. The two files can then be merged into one large file.
I did this by opening the .gbk file in Word and copying the data from the Dmel file into the
Dsim/yak file. Then save the composite file with a new name.
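If you would rather not merge the two GenBank files by hand in a word processor, the same result can be sketched in a few lines of Python. This is only an illustration: the function names and the file names in the comment are hypothetical. It relies on the fact that each record in a GenBank flat file ends with a line containing only "//".

```python
from pathlib import Path

def merge_genbank_texts(texts):
    """Concatenate GenBank flat-file texts, keeping each on its own lines."""
    parts = []
    for text in texts:
        if text and not text.endswith("\n"):
            text += "\n"  # make sure the next file starts on a fresh line
        parts.append(text)
    return "".join(parts)

def count_records(text):
    """Count records by counting the '//' terminator lines."""
    return sum(1 for line in text.splitlines() if line.strip() == "//")

def merge_genbank_files(in_paths, out_path):
    """Read the input files, write the merged file, return the record count."""
    merged = merge_genbank_texts([Path(p).read_text() for p in in_paths])
    Path(out_path).write_text(merged)
    return count_records(merged)

# Hypothetical usage (file names are placeholders, not from the assignment):
# n = merge_genbank_files(["dsim_yak.gbk", "dmel.gbk"], "combined.gbk")
# print(n, "records written")
```

Checking the record count against the number of sequences you expected is a quick way to confirm that nothing was lost in the merge.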

6. The next job is to remove the introns and leave only the CDS. There are various ways to do
this. One way is to make a file with only the CDS from each sequence by sequentially copying
and pasting. This method is tedious but relatively safe. A more straightforward way is to open
the sequence file in BioEdit and the GenBank annotation file in a text editor (e.g., Word).
BioEdit provides a running record of the position that the cursor is on. The intron positions
can be marked in lowercase letters by going to "Sequence" > "Manipulations" > "Lowercase".
Then the introns can be deleted by selecting and deleting the selected regions.

Another way to save a subsequence is to go to the "Sequence" menu and click "Extract
Positions". You will then get a window in which you can choose a string of sequence to copy.
The chosen region will be copied at the bottom of the sequence window and can then be pasted
into a new file.
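The intron removal described above can also be sketched in code, which is a useful cross-check on manual editing. The sketch below assumes the CDS location is written in the simple GenBank form `join(a..b,c..d)` with 1-based, inclusive coordinates on the forward strand; it does not handle `complement(...)` or partial-end markers such as `<` and `>`. The function names and the toy sequence are hypothetical.

```python
import re

def parse_join(location):
    """Parse 'join(1..3,7..9)' or '1..3' into [(1, 3), (7, 9)]."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]

def extract_cds(sequence, exons):
    """Concatenate the exon segments (1-based, inclusive coordinates)."""
    return "".join(sequence[start - 1:end] for start, end in exons)

def mark_introns_lowercase(sequence, exons):
    """Return the sequence with exons in uppercase and all else lowercase."""
    out = list(sequence.lower())
    for start, end in exons:
        out[start - 1:end] = sequence[start - 1:end].upper()
    return "".join(out)

# Toy usage with an invented 12-base sequence:
# exons = parse_join("join(1..3,7..9)")
# mark_introns_lowercase("AAACCCGGGTTT", exons)
# extract_cds("AAACCCGGGTTT", exons)
```

A quick sanity check on the result is that the extracted CDS length should be a multiple of three.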

NOTE: BioEdit (and all sequence editors) can save a lot of time, but they can sometimes be
difficult to use. A relatively foolproof method of sequence editing is to cut and paste strings
of sequence in a text editor and save the file as FASTA. FASTA files can be read by any sequence
editor. In FASTA format, the sequence ID is designated with a ">" followed by the sequence name;
the next line is the actual sequence.

For BioEdit to read a FASTA file, save it with the suffix .fas.
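As an illustration of the FASTA layout, here is a minimal sketch that writes and reads FASTA text in plain Python. The function names and the record names in the comment are hypothetical, and the 60-character line width is just a common convention, not a requirement of the format.

```python
def write_fasta(records, line_width=60):
    """Format (name, sequence) pairs as FASTA text.

    Each record is a '>' header line followed by the sequence, wrapped
    to `line_width` characters per line.
    """
    lines = []
    for name, seq in records:
        lines.append(">" + name)
        for i in range(0, len(seq), line_width):
            lines.append(seq[i:i + line_width])
    return "".join(line + "\n" for line in lines)

def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)  # sequence may span several lines
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

# Hypothetical usage:
# text = write_fasta([("seq1", "ATGAAA"), ("seq2", "ATGAAG")])
# open("toy.fas", "w").write(text)
```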

7. Go to the "Accessory Application" window and click on ClustalW multiple alignment.
ClustalW is an alignment algorithm (probably the most widely used). Click "ok" at the prompt
and use the default settings. The alignment may take a few minutes. The output will be aligned
nucleotide sequences. Carefully look at the alignment and see if there are any obvious mistakes.
Algorithms are not perfect and adjustments must frequently be made by eye. Save the alignment
as a new file, e.g., <>
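Once the alignment is saved, the polymorphic sites within species and the fixed differences between species (goals 4 and 5 of the assignment) can be cross-checked with a short script. The sketch below uses a deliberately simple rule that is an assumption on my part: a column is "polymorphic" if any species shows more than one base there, and "fixed" if every species is uniform but the species differ from one another. Columns containing gaps are skipped. The function name and the toy sequences are hypothetical.

```python
def classify_sites(groups):
    """Classify alignment columns as polymorphic or fixed.

    `groups` maps a species name to a list of aligned sequences, all of
    the same length. Returns 1-based column positions for each class.
    """
    all_seqs = [s for seqs in groups.values() for s in seqs]
    if not all_seqs:
        raise ValueError("no sequences given")
    length = len(all_seqs[0])
    if any(len(s) != length for s in all_seqs):
        raise ValueError("aligned sequences must have equal length")

    result = {"polymorphic": [], "fixed": []}
    for i in range(length):
        # One set of observed bases per species at this column.
        column_sets = [{s[i].upper() for s in seqs} for seqs in groups.values()]
        if any("-" in bases for bases in column_sets):
            continue  # skip columns with alignment gaps
        if any(len(bases) > 1 for bases in column_sets):
            result["polymorphic"].append(i + 1)
        elif len(set.union(*column_sets)) > 1:
            result["fixed"].append(i + 1)
    return result

# Toy usage with invented sequences (NOT the real Adh alignment):
# classify_sites({"sp1": ["ATGAAA", "ATGAAG"], "sp2": ["ATGCAA", "ATGCAA"]})
```

Positions are reported 1-based so that they match the running cursor position shown by the sequence editor.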

8. Select all the sequences and translate them into amino acid sequences. To do this, under the
"Sequence" menu click "toggle translate". This will allow you to identify the mutations that
result in amino acid substitutions. You can then use a standard codon table to determine what the
actual nucleotide mutations were. Various codon tables are available under the "Options" menu.
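The translation step can be sketched in code as a way of checking which codon differences are synonymous and which are replacements. The sketch below builds the standard genetic code from the conventional TCAG-ordered amino acid string and compares two codons; the function names are hypothetical. Unrecognized codons (for example, ones containing gaps) are translated as "X", which is a choice made here for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(cds):
    """Translate a coding sequence codon by codon ('*' marks a stop)."""
    cds = cds.upper()
    codons = (cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def classify_change(codon_a, codon_b):
    """Return None if the codons are identical, else 'synonymous' or
    'replacement' depending on whether the amino acid changes."""
    codon_a, codon_b = codon_a.upper(), codon_b.upper()
    if codon_a == codon_b:
        return None
    same = CODON_TABLE.get(codon_a, "X") == CODON_TABLE.get(codon_b, "X")
    return "synonymous" if same else "replacement"

# Toy usage:
# translate("ATGAAATAA")
# classify_change("AAA", "AAG")
# classify_change("AAA", "GAA")
```

Comparing whole codons like this is only unambiguous when the two codons differ at a single position; codons that differ at two or three positions need to be examined by hand.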

9. In the process of this exercise feel free to explore and play with BioEdit, especially if you
have little experience with such software.