Red Line Walkthrough

bigskyman, 24 Oct 2013





A. Identifying repetitive DNA

Example Sequence: Arabidopsis thaliana (mouse-ear cress) Synthetic Contig, 16.4 kb

Tool(s): RepeatMasker

Concept(s): Non-coding DNA, sequence repeats, mobile genetic elements (transposons)

I. Create Project






1. Log in to DNA Subway (dnasubway.iplantcollaborative.org).

2. Click ‘Annotate a genomic sequence’ (Red Square).

3. Select sample sequence: Arabidopsis thaliana (mouse-ear cress) Synthetic Contig.

4. Provide your project with a title, then click ‘Continue.’



II. Identify and Mask Repeats







1. Click ‘RepeatMasker.’
   - Wait until the flashing icon displays ‘V’ (view).

2. Click ‘RepeatMasker’ again to view the results.



Questions:


Q.1: How many hits were detected in your sample?

__________

Q.2: RepeatMasker reports the length of the repetitive sequences (Length) as well as the class (Attributes):

a. What is the average length of sequences identified as “simple repeats”?

__________

b. What is the average length of sequences identified as “low complexity”?

__________

Q.3: What is the total percentage of repetitive DNA in your sequence?
(Sum of the lengths of all repetitive sequences / sequence length (16.47 kb) × 100)

__________
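
The arithmetic behind Q.3 can be sketched in a few lines of Python. This is only an illustration: the hit coordinates below are made up, the function name is hypothetical, and the sequence length assumes 16.47 kb means 16,470 bp. Like the formula in Q.3, it simply sums the hit lengths and does not correct for overlapping hits.

```python
def percent_repetitive(hits, sequence_length):
    """Return the percentage of the sequence covered by repeat hits.

    hits: list of (start, end) pairs, 1-based and inclusive, as one might
    copy them by hand from the RepeatMasker results table.
    """
    total = sum(end - start + 1 for start, end in hits)
    return 100.0 * total / sequence_length


# Hypothetical hits for illustration only, not real RepeatMasker output.
example_hits = [(101, 150), (2001, 2100), (9001, 9350)]

# Assumed sequence length: 16.47 kb taken as 16,470 bp.
print(round(percent_repetitive(example_hits, 16470), 2))
```

Replace the example coordinates with the values from your own results table to check your answer.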




Additional Investigation: In the results table under ‘Attributes’ each repeat sequence is labeled “RepeatMasker#-XXX.” The ‘#’ is the ordinal number of the hit, the XXX is the class of DNA element (e.g. “Simple_repeat” or “Low_complexity”). There are other types of repetitive elements such as transposons and pseudogenes (e.g. Helitron and COPIA). Use online resources to learn more (http://gydb.org/index.php/Main_Page).

Simple repeats: 1-5 bp repeats (e.g. repetitive dinucleotides ‘AT’ etc.)

Low Complexity DNA: Poly-purine/poly-pyrimidine stretches, or regions of extremely high AT or GC content.

Processed Pseudogenes, SINES, Retrotranscripts: Non-functional RNAs present within genomic sequence.

Transposons (DNA, Retroviral, LINES): Genetic elements which have the ability to be amplified and redistributed within a genome.
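
To make the “simple repeats” definition above concrete, here is a rough Python sketch that finds tandem repeats of 1-5 bp units with a regular expression. It is not how RepeatMasker works; the function name and the `min_copies` threshold are assumptions chosen for illustration.

```python
import re


def find_simple_repeats(seq, max_unit=5, min_copies=4):
    """Yield (start, end, unit, copies) for tandem repeats of short units.

    Coordinates are 0-based and end-exclusive. The lazy quantifier makes
    the regex try the shortest repeat unit first.
    """
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = (match.end() - match.start()) // len(unit)
        yield match.start(), match.end(), unit, copies


# A dinucleotide 'AT' repeat embedded in a short made-up sequence.
for hit in find_simple_repeats("GGCATATATATATGCC"):
    print(hit)
```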




B. Making Gene Predictions

Example Sequence: Arabidopsis thaliana (mouse-ear cress) Synthetic Contig, 16.4 kb from part A

Tool(s): Augustus, FGenesH, Snap, tRNA Scan

Concept(s): Genomic DNA, Gene Structure, Canonical sequences

III. Predict Genes






1. Click ‘Augustus’ and wait until a green ‘V’ icon appears.

2. Click ‘Augustus’ again to view a table of results. Use the results determined by ‘Augustus’ to answer question 4.

3. Repeat steps 1-2 (one at a time) with ‘FGenesH’ and ‘Snap.’ Also run ‘tRNA Scan’ to answer question 6.




Questions:





Q.4: Look at the ‘Type’ column in the gene prediction report. Find the first mention of the term ‘gene’ and copy down the gene’s ‘start’ (i.e. the starting basepair). Note the number of times you see the term ‘exon’ (i.e. number of exons predicted).













Q.5: Based on the chart in question 4, did all the gene predictors yield genes starting at the same location? Did all the gene predictions have the same number of exons?

_____________________________________________________________________________________________
_____________________________________________________________________________________________

Q.6: Looking at the number of results returned by tRNA Scan, why are they so different from the results returned by other predictors? Are there places in the genome where tRNAs are more or less densely concentrated?

_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________

Additional Investigation: Look for the background link at the bottom of the DNA Subway home page and review the section entitled ‘Gene Finding.’

Gene Predictor     Start   Exons
Augustus gene 1    746     1
Augustus gene 2
Augustus gene 3
FGenesH gene 1
FGenesH gene 2
FGenesH gene 3
Snap gene 1
Snap gene 2
Snap gene 3

Gene Predictor: A program that makes use of multiple sensors to model entire gene(s).

Sensor: An algorithm that works to predict specific features within a sequence (e.g. a canonical splice site or an exon).

Ab Initio Prediction: Gene prediction based solely on a genomic DNA sequence.

Hidden Markov Model: An algorithm that represents (in this case) sequence data and signals (nucleotide patterns) as states. The probabilities of transitions between states (even if some states are unknown) can be used to model a gene and its components. (See: doi:10.1038/nbt1004-1315)

CDS: The protein-coding exons of a gene sequence.






C. Viewing Gene Predictions in a Browser


Example Sequence: Arabidopsis thaliana (mouse-ear cress) Synthetic Contig, 16.4 kb from part B

Tool(s): Local Browser (GBrowse)

Concept(s): Gene orientation/structure, transposons, chromosome organization

IV. View Gene Predictions






1. Click ‘Local Browser’ and allow browser to load.

2. Under ‘Scroll/Zoom’ select ‘Show 25kbp.’ When the browser reloads, it should show ‘Show 16.47 kbp.’ Answer question 6.

3. Under ‘Reports & Analysis,’ ‘Download Decorated FASTA File’ should be selected. Click ‘Configure.’

4. On the line ‘Augustus Predicted Genes’ click the radio button to select ‘BKG’ as ‘red.’ On the line ‘FGenesH Predicted Genes’ click ‘Underline.’

5. Click ‘Go’ to view the results and answer question 7.





Questions:


Q.6: What observations can you make about the locations of transposons and repetitive DNA in relation to predicted genes?

_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________

Q.7: Red highlighted sequence demarcates Augustus predictions, underlined sequences are predictions made by FGenesH. What do the gene predictions have in common, and is there any pattern to how the FGenesH predictions begin and end?

_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________


Additional Investigation: Play with the Decorated FASTA file to see sequence differences between different gene predictors. Copy and paste sequence into the NCBI ‘BLAST’ server to get information on predicted genes (http://blast.ncbi.nlm.nih.gov/).


Gene Browser: A GUI (Graphical User Interface) for viewing biological information.

GBrowse (DNA Subway’s Browser) is designed to view genomes. It displays a graphical representation of a section of a genome, and shows the positions of genes and other functional elements. It can be configured to show both qualitative data such as the splicing structure of a gene, and quantitative data such as microarray expression levels. - http://gmod.org/wiki/GBrowse_FAQ

Track: The individual regions of the display where information imported into the browser is shown. For each type (or source) of information, there is usually an associated track.







D. Adding Experimental (Biological) Evidence

Example Sequence: Arabidopsis thaliana (mouse-ear cress) Synthetic Contig, 16.4 kb from part C

Tool(s): BLASTN, BLASTX, Upload Data

Concept(s): RNA, cDNAs, ESTs, Biological Databases

V. Search databases for Biological Evidence

1. Click ‘BLASTN.’
   - Wait until the flashing icon displays ‘V’ (view).

2. Click ‘BLASTN’ again to view the results.

3. Click ‘BLASTX.’
   - Wait until the flashing icon displays ‘V’ (view).

4. Click ‘BLASTX’ again to view the results.


Questions:


Q.8: Both BLASTN and BLASTX return the ‘Length’ of your resulting matches. Do you notice differences in the average lengths of BLASTN and BLASTX matches? Explain.

_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________

Q.9: Under ‘Type’ both BLASTN and BLASTX return ‘match’ and ‘match_part.’ ‘Match’ describes the overall length of a single match (globally), but individual significant matches may be fragmented, i.e. ‘match_part.’ Do BLASTN and BLASTX return ‘match’ and ‘match_part’ results in different frequencies? Explain.

_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________



Additional Investigation: Under Attributes in the BLASTN and BLASTX results there is a section called ‘description.’ Use an internet search engine and/or other resources to learn about the functional features of significant hits.

BLAST: Basic Local Alignment Search Tool (BLAST) is an algorithm that searches databases of biological sequence information (e.g. DNA, RNA, or protein sequence) and returns matches. The BLASTN program is specific to nucleotide data, and the BLASTX algorithm works with sequence data translated into amino acid sequences.
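
The translation step that BLASTX relies on can be illustrated with a short Python sketch. This is not BLAST code: the function name is hypothetical, it uses the standard genetic code, and it handles only the forward frames (BLASTX also considers the reverse strand). It does show why three nucleotides correspond to one amino acid, which is useful background for the length comparison asked about in the questions above.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate a DNA string in one forward frame (0, 1 or 2).

    Stop codons are written as '*'; codons with unknown bases become 'X'.
    """
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# Nine nucleotides give three translated positions.
print(translate("ATGGCCTAA"))
```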


UniGene: A database of transcript data, “each UniGene entry is a set of transcript sequences that appear to come from the same transcription locus (gene or expressed pseudogene), together with information on protein similarities, gene expression, cDNA clone reagents, and genomic location.” - http://www.ncbi.nlm.nih.gov/unigene

cDNA: DNA produced by reverse transcribing mRNA using reverse transcriptase. cDNAs are used to investigate mRNA within a biological sample.

ESTs: “Small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing either one or both ends of an expressed gene. The idea is to sequence bits of DNA that represent genes expressed in certain cells, tissues, or organs from different organisms.” - http://www.ncbi.nlm.nih.gov/About/primer/est.html




Apollo Annotation Tips for Protein Coding Genes


This example assumes you have run the following routines:

RepeatMasker, Augustus, FGenesH, Snap, BlastN, BlastX, User BlastN (with A. thaliana EST data)


Prepare your workspace


1. Choose one strand to work on.
   (View > Show forward strand or Show reverse strand: check/uncheck your selection)
   - Apollo displays data on both strands; most users will want to work on one strand at a time.

2. Display all models and evidence by expanding tiers.
   (Tiers > Expand all tiers)
   - Apollo may represent multiple data in a single evidence track; expanding the tiers to see all data will make it easier to manipulate the data during your annotation process.

[Figures: Apollo 2 strand view; Apollo 1 strand view; Tiers collapsed; Tiers expanded]

3. Hide unnecessary data.
   (Tiers > Show types panel > Show (uncheck your selection, e.g. BLASTX, BLASTX_USER, BLASTN, BLASTN_USER))
   - Protein (BLASTX/BLASTX_USER) and EST (BLASTN_USER) data can usually be worked on in later steps; hide them in the tiers menu. You will add these data back to your analysis when you are ready to consider them.


After preparing your workspace, there are four steps to creating a basic manually curated annotation within Apollo:

A. Create a Gene Model

B. Determine transcript length

C. Determine splice sites and variants

D. Determine start/stop sites







A. Create a Gene Model

Possible decision components:

Biological: UniGene Model (BLASTN)
Why? UniGene models are derived from cDNA and ESTs (transcriptome evidence) produced by experiment. (http://www.ncbi.nlm.nih.gov/UniGene/help.cgi?item=build2)

Hypothetical: Gene model of choice
Why? A gene model generated by any of the prediction algorithms is based on known biological constraints, and is an a priori hypothesis based only on the genomic sequence.

1. Select a gene model as a scaffold.
   - Use transcriptome evidence (UniGene-BLASTN) to select the best possible gene model for a scaffold. If no gene model exists or significantly reflects the UniGene model, use the UniGene model itself as a scaffold (see examples 1, 2).

2. Drag the gene model of choice into the workspace and label the new scaffold. Name the model using the ‘Annotation info editor.’
   (Right Click/Click on the model > Annotation info editor)

Ex.1: Models and evidence: Top: Augustus; Middle: FGenesH; Bottom: (BLASTN)

Ex.1: The Augustus and UniGene models are very close. FGenesH (which does not predict UTRs) could be used as a scaffold if you were not concerned about modeling the UTR. SNAP (not shown) did not predict a gene at this locus. In this case, Augustus is probably the best choice for a scaffold.

Ex.2: Models and evidence: Top: Augustus; 2nd Track: SNAP; 3rd Track: FGenesH; Bottom: (BLASTN)

Ex.2: The Augustus and UniGene models are very close; however, SNAP predicts 2 genes at this locus. Barring additional evidence, Augustus may be the best gene model to start a scaffold.








Ex.3: Named Augustus model in workspace


B. Determine transcript length

Possible decision components:

Biological: UniGene Model (BLASTN)
Why? Full length cDNAs (which are components of the UniGene model) give experimentally determined boundaries for the transcript.

Hypothetical: Gene model from part A

1. Drag the BLASTN model into the workspace, and then name it using the ‘Annotation info editor.’
   (Right Click/Click on the model > Annotation info editor)

Ex.4: cDNA evidence and Augustus based model in the workspace

Ex.4: The cDNA supports a transcript that is shorter than the Augustus based model at both the 5’ and 3’ ends of the transcript.

2. Use the ‘Exon detail editor’ to adjust the lengths of the model transcript.
   (Right Click/Click on the model > Exon detail editor)











C. Determine splice sites and variants

Possible decision components:

Biological: EST data (BLASTN_USER)
Why? Like full length cDNAs, ESTs give valuable information on transcript diversity. ESTs are generated by high throughput methods, and although the data may be fragmentary, it may capture biologically relevant information about splice variants.

UniProt Protein data (BLASTX/BLASTX_USER)
Why? Proteins do not contain UTRs, but do contain the initiating amino acid (methionine). Their lengths may give clues to the actual length of the translated protein.

Hypothetical: Gene model from part B

1. Use the tiers menu to show all available data.
   (Tiers > Show types panel > Show (check your selection, e.g. BLASTX, BLASTX_USER, BLASTN, BLASTN_USER))
   - Depending on the database you upload (in the example case, Arabidopsis ESTs) you will have to consider how to interpret the possible splice variants. BLASTX_USER returns hits from UniProt and may contain hits from genomes other than the one you are annotating. Gene duplications may also give hits which annotate to other loci.





Ex.5: AT5G13220 (JAZ10) in Apollo (left) and Phytozome (right)

Ex.5: At a locus where there is alternative splicing, gene models may disagree, and the biological evidence may also seem in conflict. According to Phytozome (right) there is the primary annotated transcript (highlighted green) and three alternative transcripts displayed below it. In Apollo, the UniGene model (BLASTN, bottom track) seems to suggest the first alternative transcript (AT5G13220.2) but other EST evidence (BLASTN_USER) suggests other transcripts. Depending on the amount of evidence for alternative transcripts at a locus, you may have to create several models.






2. Based on available evidence, drag any additional hypothetical gene models (and/or transcriptome models BLASTN, BLASTX, BLASTN_USER, BLASTX_USER) into the workspace. Rename each model using the ‘Annotation info editor.’
   (Right Click/Click on the model > Annotation info editor)

Ex.6: Gene models with ‘non-canonical’ splice sites highlighted


3. Use the ‘Exon detail editor’ to “fix” non-canonical splice sites (highlighted with yellow arrows).
   (Right Click/Click on the model > Exon detail editor)

3a. (Optional) In some cases you may want to change exonic structure in your model. You can do this by either splitting an exon (select the exon of choice, Right Click/Click on the model > Split exon) or merging two exons (select the exons of choice, Right Click/Click on the models > Merge exons). Lengths of introns and exons can always be changed in the exon editor.
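
What makes a splice site “canonical” can be shown with a small Python sketch: most introns begin with GT and end with AG. The function below is only an illustration under simplifying assumptions (forward strand only, 1-based inclusive exon coordinates, a hypothetical function name and a made-up sequence); it is not what Apollo runs internally.

```python
def noncanonical_introns(sequence, exons):
    """Return introns that do not follow the GT...AG pattern.

    exons: sorted list of (start, end) pairs, 1-based inclusive, on the
    forward strand. Each result is (intron_start, intron_end, donor, acceptor).
    """
    flagged = []
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        # 1-based inclusive intron left_end+1 .. right_start-1 as a Python slice.
        intron = sequence[left_end:right_start - 1].upper()
        donor, acceptor = intron[:2], intron[-2:]
        if (donor, acceptor) != ("GT", "AG"):
            flagged.append((left_end + 1, right_start - 1, donor, acceptor))
    return flagged


# Made-up sequence: the first intron is GT...AG, the second starts with GC.
sequence = "ATGG" + "GTAAAG" + "CCTA" + "GCTTAG" + "GTAA"
exons = [(1, 4), (11, 14), (21, 24)]
print(noncanonical_introns(sequence, exons))
```

An intron flagged here would be one of the sites to adjust in the exon editor.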




Ex.7: BLASTX Model with non-canonical splice sites fixed.


4. Calculate the longest ORF (open reading frame) in your model.
   (Right Click/Click on the model > Calculate longest ORF)

Ex.8: BLASTX Model with longest ORF displayed
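
For readers curious what a longest-ORF calculation involves, here is a minimal Python sketch. It is an assumption-laden illustration rather than Apollo's implementation: it scans only the three forward frames of an already spliced transcript sequence, and it requires an ATG start and one of the three standard stop codons.

```python
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG...stop ORF, or None.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    Only the three forward reading frames are scanned.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i  # first start codon of a candidate ORF
            elif codon in STOPS and start is not None:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None  # look for the next ORF in this frame
    return best


transcript = "CCATGGCCTAAGG"  # made-up example
orf = longest_orf(transcript)
print(orf, transcript[orf[0]:orf[1]])
```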





D. Determine start/stop sites


Possible decision components:

Biological: UniProt Protein data (BLASTX/BLASTX_USER)
Why? Proteins do not contain UTRs, but do contain the initiating amino acid (methionine). Lengths of the protein hits give clues to the actual length of the translated protein at that locus and its reading frame.

Hypothetical: Gene model from part C

1. Use the protein data to establish probable start and stop sites. Drag the start and stop icons into your model from those displayed above the workspace.

Ex.9: Start and stop codons are highlighted in green and red, respectively, in the uppermost part of the screen.

2. Ensure that your model is complete after all changes by calculating the longest ORF (open reading frame) in your final model(s).
   (Right Click/Click on the model > Calculate longest ORF)

Once you have finished your model, upload the results back to DNA Subway.
(File > Upload to DNA Subway)




Apollo Visual Glossary






[Figure labels]
- Tiers of gene evidence (predicted models and biological data)
- Model building workspace
- Additional info (active when model is selected)
- Zoom, Pan, Horizontal Scroll
- Position
- Exon
- Intron (bent line)
- Alignment gap (straight line)
- UTR
- Non-canonical splice
- Possible Start (green) and Stop (red) sites
- Designated start site
- Designated stop site




Some Useful Apollo Menus (right-click/click)

Exon Detail Editor: adjusts exon boundaries (right-click/click)

Tiers Menu: color coding of tiers, and (show/hide)

Sequence menu: extract sequence at selected locus

Tiers menu (right-click/click)

Annotation Info Editor: name gene models, and add comments (right-click/click)




Summary of mouse functions

Mouse buttons perform many functions in Apollo. The table below summarizes the functions performed by the three mouse buttons, with and without holding down the shift key.

If you are on a Mac and have a single-button mouse, you can simulate a right mouse click by holding down the control or alt key while clicking the mouse, and you can simulate a middle mouse click by holding down the apple key while clicking the mouse. If you are trying to copy text from Apollo (for example, sequence residues from a Sequence window) to paste into another application, use ctrl-c to copy the text and apple-v to paste it (or, if you are trying to paste into a Web browser, you can use the 'Paste' command from the browser's Edit menu).

Please note that if you are running an old version of Mac OS X (10.2.2 or earlier), you may find strange mouse-button behavior. With a three-button mouse, you may find that the behavior of the right and middle buttons is switched. On a Windows laptop you may find that the middle mouse button will pop up a little scrollbar. To simulate middle mouse you might have to use the Alt key with the left mouse button.

Mouse key           Action
Left                Select feature (or deselect if you're not over any features)
Shift-Left          Add feature to current selection (or remove feature if it's already selected)
Left drag           If you drag a feature into the annotation tier, it will be added as a new transcript (if editing is enabled)
Shift-Left drag     If you shift-drag a feature onto an annotation, it will be added as a new exon (if editing is enabled)
Middle click        Center display on clicked location
Middle drag         Rubberband multiple features
Shift-Middle drag   Rubberband multiple features, adding to current selection (or remove if they are already selected)
Right               Popup menu
Shift-Right drag    Tier drag: move currently selected tier










DNA Subway Annotation “Cheat Sheet”





1. Establish a project or open an existing project.
2. Run RepeatMasker.
3. Run Gene Predictors (e.g. Augustus, FGenesH, etc.).
4. View results in Local Browser; compare and contrast predictions.
5. Run BLAST searches.
6. View results in Local Browser; compare and contrast predictions and BLAST results.
7. Add additional biological evidence in form of cDNA, ESTs, genes, proteins (optional):
   a. To download ESTs for a sample sequence, open the “Annotation” directory at http://gfx.dnalc.org/files/evidence/, right-click the appropriate file and save it to your computer. (Do not open the file, but download or save it to your computer.)
   b. After saving the file to your computer go back to your project in DNA Subway.
   c. Click ‘Upload Data,’ browse to the file, and upload it to DNA Subway.
   d. Click ‘User BLASTN (or User BLASTX for protein)’ to search the uploaded data.
8. Synthesize predictions and BLAST search results into gene models using Apollo:
   a. General navigation tools (scrolling, zooming) can be accessed on the main Apollo screen.
   b. General tools to handle files and data can be accessed in the tab menu at the top of the Apollo screen.
   c. Editing tools can be accessed by right-clicking (command-click on Macs) an item on the workspace.
   d. A utility to record the changes and change the name of a model is included in the editing tools as “Annotation info editor.”
   e. A tool to lengthen or shorten exons is included in the editing tools as “Exon detail editor.”
   f. Apollo indicates items that require specific attention by using triangles; right- or left-pointing green or red triangles point at potentially missing start or stop codons. Yellow triangles indicate non-canonical splice sites.
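
Before uploading evidence in step 7, it can help to confirm that the saved file really is FASTA text. The following is a minimal, hypothetical parser for that kind of sanity check; it is not part of DNA Subway, and the record names in the example are invented.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {record_name: sequence}.

    The record name is the first word after '>'. Lines that appear before
    the first header are ignored.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


# Made-up two-record example.
example = ">est1 some description\nACGT\nAC\n>est2\nGG\n"
for record_name, seq in read_fasta(example).items():
    print(record_name, len(seq))
```

To check a downloaded file, read its contents with `open(path).read()` and pass the text to `read_fasta`; an empty result suggests the file is not FASTA.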









Advanced Genome Annotation

Annotating a DNA Contig

Experiment 1: Predict Genes in an Arabidopsis Contig

I. Create a Project

1. Enter DNA Subway at http://www.dnasubway.org.
2. Click the red square to annotate a genomic sequence.
3. Select sample sequence Arabidopsis thaliana (mouse-ear cress) Chr5, 100.00 kb.
4. Provide a title (required), a project description (optional) and click ‘Continue.’


II. Mask Repeats

1. Click ‘RepeatMasker.’
2. Once the bullet has finished blinking, click ‘RepeatMasker’ again to view a listing of repetitive DNA sequences ‘RepeatMasker’ has identified and masked.

   You may wish to note how many and which types of repetitive DNA ‘RepeatMasker’ identified. Under the “Attributes” menu you may find unfamiliar terms for defined repeats such as the transposons Copia and Harbinger. You can use a search engine to get additional information.

3. Close the table to return to DNA Subway.
4. Click ‘Local Browser’ to view the results in a graphical interface.
5. Maximize the browser window.
6. Change Show 10 kbp to Show 100 kbp in the Scroll/Zoom utility.
7. Close the Local Browser screen to return to DNA Subway.


III. Predict Genes

1. Click ‘Augustus.’
2. Once ‘Augustus’ has finished, click ‘FGenesH.’ Then, click ‘SNAP.’ Finally, click ‘tRNA Scan’ (Note: the Augustus, FGenesH and SNAP algorithms predict protein-coding genes; tRNA Scan identifies tRNA genes).

   These gene prediction programs all use different, though sometimes similar, methods to generate predictions. Did you notice any difference in runtime?

3. View the results for each predictor by clicking the predictor button again to generate a table. You should also examine the results in the Local Browser.

   Do the different programs predict the same genes or can you identify differences among the predictions?

4. Close the table and browser screens to return to DNA Subway.




IV. Search Databases for Transcriptome Evidence

1. Click the ‘BLASTN’ button to search a database of known genes and transcripts (e.g. cDNAs and ESTs) for a match to the contig sequence.
2. When BLASTN is complete, click ‘BLASTX’ to search for matches to the contig sequence in a database of experimentally verified protein sequences.
3. Go to http://gfx.dnalc.org/files/evidence/Annotation and download the file “at chr5 est evidence.fasta.”
4. Click ‘Upload Data,’ and upload the above file under “Add DNA data in FASTA format.”
5. Click ‘User BLASTN.’
6. View BLAST matches in the table view by clicking the respective BLAST buttons again. You may also choose to view the results in the Local Browser.
7. Close the table and browser screens to return to DNA Subway.



Experiment 2: Synthesize Gene Predictions and Transcriptome Evidence into Gene Models

Technique 1: Edit Exons

I. Build a Gene Model

1. Click ‘Apollo.’

   Your screen should look something like this, with multiple evidence types (gene predictions, repeats, transcriptome evidence, etc.) displayed by color coded icons.






2. Click the Tiers menu and select Expand Tiers to view all available evidence. Apollo initially collapses the different evidence types onto a single line each, regardless of how many pieces of evidence are available for each position.
3. Under the View menu, uncheck “Show reverse strand.”
4. Zoom, pan and scroll to nucleotide position 29,500-33,500 until you can comfortably view details for a gene on the forward strand in this location.
5. You should now be able to distinguish gene features such as exons and introns. Compare the predictions with each other and with the BLAST evidence: what similarities and differences can you identify?

   The Augustus gene prediction has the same structure as the other predictions and the BLASTN evidence; however, it is longer than the other predictions and therefore agrees more strongly with the BLASTN evidence than the other predictions do.

6. Double-click the Augustus prediction and move it onto the workspace: this is the foundation for a model for the gene in this location.
7. Right-click on your gene model to name it using the “Annotation Info Editor.”
8. Scroll down the page and identify the BLASTN evidence. Double-click and move the longest piece of BLASTN evidence onto the workspace.
9. Name the BLAST-based model by right-clicking on the model and using the “Annotation Info Editor.”

   Comparing the Augustus prediction and the BLASTN evidence, you will find that they share the same exon-intron structure, but differ in the overall lengths: the gene model starts and ends further downstream than the BLASTN evidence.




10. Use Exon Detail Editor to adjust the lengths of the flanking exons of the model:
   a. Double-click the gene model;
   b. Right-click the gene model;
   c. Select Exon detail editor in the pop-up window to open the Exon Editor;
   d. The Exon Editor displays the sequences of the gene model and the BLASTN evidence side-by-side; a red frame highlights the gene model;
   e. Grab and hold the edge at the beginning of the model’s first exon and move it to the left to position it flush with the start of the BLASTN-match;
   f. Click the end of the gene model depicted in the schematic view at the bottom of the Editor window to edit that part of the sequence;
   g. Grab and hold the edge at the end of the last exon and move it 89 nucleotides to the left and up to position it flush with the end of the BLASTN-match;
   h. Close the Exon Editor.

11. To conclude your annotation for this gene’s structure:
   a. Right-click the BLASTN evidence on your workspace;
   b. Select Delete selection;
   c. Delete any other evidence or prediction from the workspace until only your final gene model remains;
   d. Click menu tab File and select Upload to DNA Subway.
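
The end adjustments made in the Exon Editor amount to moving the outer boundaries of the first and last exon so that they match the evidence. A minimal Python sketch of that coordinate arithmetic follows; the function name and all coordinates are hypothetical, chosen only so that the last exon ends up 89 nucleotides shorter, as in the walkthrough.

```python
def fit_model_to_evidence(model_exons, evidence_start, evidence_end):
    """Return exons whose outer ends are moved to match the evidence.

    model_exons: sorted list of (start, end) pairs, 1-based inclusive.
    Only the start of the first exon and the end of the last exon change.
    """
    exons = [list(exon) for exon in model_exons]
    exons[0][0] = evidence_start
    exons[-1][1] = evidence_end
    for start, end in exons:
        if start > end:
            raise ValueError("evidence boundary falls outside a flanking exon")
    return [tuple(exon) for exon in exons]


# Hypothetical coordinates for illustration only.
model = [(29600, 29900), (30100, 30400), (32900, 33389)]
print(fit_model_to_evidence(model, 29550, 33300))
```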


II. Browse Your Gene Model

1. Minimize or close Apollo.
2. Bring up the DNA Subway window.
3. Click ‘Local Browser’ to browse your gene model.





Technique 2: Fix Start Codons

1. Navigate to nucleotide position 14,000.

   Identify the differences among the predictions and the BLAST evidence. Specifically, what start and end points for the gene do the different prediction and evidence items indicate?

2. Move the Augustus gene prediction and the BLASTN evidence for this gene onto the workspace; adjust the 5’- and 3’ ends of the model. Name your respective models using the “Annotation Info Editor.”

   Examine the model’s beginning: Does it have a start codon? Zoom in on the first third of the first exon (position 14060 through 14200) to answer this question.

3. To define a start codon for your model:
   a. Zoom into the first exon;
   b. Evaluate whether the biological evidence (BLASTX) provides evidence for a start codon; if the biological evidence does not provide a position for a start codon, choose the first ATG/methionine instead;
   c. Move your cursor to the upper edge of your screen;
   d. Grab and hold the first green rectangle located within the first exon;
   e. Move the green rectangle all the way down onto your model to insert it as a new start codon.

4. To finalize your annotation:
   a. Zoom out and verify your model;
   b. Delete from the workspace any evidence or predictions other than your final model for this gene;
   c. Upload your result to DNA Subway.
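
The fallback rule used when defining a start codon (choose the first ATG in the first exon) can be sketched as follows. This is an illustrative helper with a hypothetical name and a made-up sequence, assuming a forward-strand gene and 1-based inclusive coordinates; it ignores reading frame, which you would still need to check in Apollo.

```python
def first_atg(genome, exon_start, exon_end):
    """Return the 1-based genomic position of the first ATG in an exon.

    exon_start and exon_end are 1-based and inclusive. Returns None when
    the exon contains no ATG.
    """
    exon = genome[exon_start - 1:exon_end].upper()
    offset = exon.find("ATG")
    return None if offset == -1 else exon_start + offset


# Made-up 12-base "genome" with an exon spanning positions 4 to 12.
print(first_atg("CCCCCGGATGCC", 4, 12))
```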


Technique 3: Delete Exons

1. Navigate to nucleotide position 47,500.

   Identify the differences among the predictions and the BLAST evidence. What is the number of exons for the different predictions and evidence items?

2. Move the Augustus gene prediction and the BLASTN evidence onto the workspace. Name your respective models using the “Annotation Info Editor.”

3. Compare the Augustus-derived gene model and the BLASTN evidence. You will find that the model’s leading exon is not supported by BLAST evidence. To remove it:
   a. Click the first exon in the gene model;
   b. Right-click the model;
   c. Click Delete selection.

4. Adjust the 5’- and 3’ ends of the model by using Exon Detail Editor to match it to the BLASTN evidence.

5. To finalize your annotation:
   a. Zoom out and verify your model;
   b. Delete from the workspace any evidence or predictions other than your final model for this gene;
   c. Upload your result to DNA Subway.




Technique 4: Split Exons

1. Navigate to nucleotide position 18,500-21,000.

Identify the differences among the predictions and the BLAST evidence.

Specifically, what is the number of exons for the different predictions and evidence items?

2. Move the Augustus gene prediction and the BLASTN evidence for this gene onto the workspace; adjust the 5’- and 3’-ends of the model.

3. Compare the gene model and the BLASTN evidence. You will find that the gene model shows one long leading exon where the BLASTN evidence has two. To split this exon:

a. Zoom into the first exon in the gene model;

b. Click the first exon in the gene model;

c. Right-click in the first exon approximately at the position where you wish to split it;

d. Select Split exon to split the first exon into two fragments;

e. Double-click the gene model;

f. Right-click the gene model;

g. Select Exon detail editor in the pop-up window to open the Exon Editor; the Exon Editor displays the sequences of the gene model and the BLASTN evidence side-by-side; a red frame highlights the gene model;

h. Maximize the Exon Editor window;

i. Find the gap in the highlighted sequence at the spot at which the background color in the former first exon changes; this is the position where the exon has been split;

j. Grab the 3’-edge of the first exon fragment and move it to the left and up to position it flush with the end of the first BLASTN exon;

k. Grab the 5’-edge of the downstream fragment and move it to the right and down to position it flush with the beginning of the second BLASTN exon;

l. Close the Exon Editor.


46


4. You will find that by splitting the first exon into two you generated a non-canonical splice site. To adjust the splice site:

a. Double-click the gene model;

b. Right-click the gene model;

c. Select Exon detail editor in the pop-up window to open the Exon Editor;

d. Adjust the beginning of the gene model’s second (new) exon to start following (in 3’-direction) the nearest AG;

e. Close the Exon Editor.

5. To finalize your annotation:

a. Zoom out and verify your model;

b. Delete from the workspace any evidence or predictions other than your final model for this gene;

c. Upload your result to DNA Subway.
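
The splice-site adjustment in this technique relies on the canonical pattern in which an intron begins with GT and ends with AG. A minimal Python sketch of that check follows. The function name, the coordinate convention (0-based, end-exclusive) and the example sequence are assumptions made for illustration, not part of DNA Subway.

```python
def is_canonical_intron(genomic, intron_start, intron_end):
    """Check whether genomic[intron_start:intron_end] begins with GT and ends with AG.

    Coordinates are 0-based and end-exclusive (an assumption of this sketch).
    """
    intron = genomic[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Hypothetical gene fragment: exon (6 nt) + intron (12 nt) + exon (6 nt).
genomic = "ATGGCC" + "GTAAGTTTTCAG" + "GATTAA"

print(is_canonical_intron(genomic, 6, 18))  # prints True
print(is_canonical_intron(genomic, 6, 16))  # prints False: intron would end in TC
```

In the second call the downstream exon starts two nucleotides too early, which is the kind of boundary that step 4d corrects by moving the exon start to just after the nearest AG.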


Techniques 5 & 6: Merge Exons and Build Alternative Transcripts

1. Navigate to nucleotide position 89,500-92,500.

2. Move the Augustus gene prediction and the longest BLASTN transcript evidence that resembles the model (5 exons, Exon #4 about 60 nucleotides) onto the workspace; adjust the 5’- and 3’-ends of the model.

3. Record your edits and name the model.

4. Delete the BLASTN evidence from the workspace.




5. Compare the gene model with the various biological evidence items. You will find that some BLASTN evidence shows Exon #4 to be about 110 nucleotides long as opposed to 58 nucleotides in the first model.

6. To build an alternative transcript for this gene:

a. Double-click the first model;

b. Right-click the first model;

c. Select Duplicate transcript to generate the foundation for an alternative transcript.

7. Move the BLASTN evidence that contains five exons with an Exon #4 of about 110 nucleotides in length onto the workspace.

8. Extend the 3’-end of Exon #4 in the alternative model to the 3’-edge of the BLASTN evidence using Exon Detail Editor.

9. To update the open reading frame/coding sequence:

a. Double-click the new model;

b. Right-click the model;

c. Select Calculate longest ORF.

10. Delete the BLASTN evidence from the workspace.

11. Record your changes and name the alternative gene model.

12. Compare the biological evidence with the two gene models. You will find that some BLASTN evidence shows a large fourth exon that encompasses Exon #4 and Exon #5 in the current two models.

13. To build a third alternative transcript:

a. Right-click the first model and select “duplicate”;

b. Shift-click the fourth and fifth exons in the third model;

c. Right-click one of the exons;

d. Select Merge exons.

14. Update the third model’s open reading frame/coding sequence.

15. Record your changes and name the new alternative gene model.

16. Compare the biological evidence with the two gene models. You will find that some BLASTX PROTEIN evidence shows a large second exon that encompasses Exon #2 and Exon #3 in the previous two models. However, the problem with using this information to build a fourth alternative transcript is that no biological evidence is available that would allow you to determine what other exons would be part of this fourth transcript; therefore, you should not build a fourth alternative model without further evidence.

17. To finalize your annotation:

a. Zoom out and verify your model;

b. Delete from the workspace any evidence or predictions other than your final model for this gene;

c. Upload your result to DNA Subway.
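
Merging two exons, as in step 13 above, amounts to replacing them with a single exon that spans from the start of the first to the end of the second, so the intron between them becomes part of the transcript. The Python sketch below shows this on a list of coordinates. The function name and all coordinates are hypothetical and serve only to illustrate the idea.

```python
def merge_exons(exons, i, j):
    """Merge exons i..j (0-based, inclusive) of a position-sorted list of (start, end) tuples."""
    if not 0 <= i < j < len(exons):
        raise ValueError("need two different, valid exon indices with i < j")
    merged = (exons[i][0], exons[j][1])
    return exons[:i] + [merged] + exons[j + 1:]


# Hypothetical five-exon model; merge Exon #4 and Exon #5 (indices 3 and 4).
model = [(100, 200), (300, 380), (450, 520), (600, 658), (700, 900)]
print(merge_exons(model, 3, 4))
# prints [(100, 200), (300, 380), (450, 520), (600, 900)]
```

Because the merged exon now includes the former intron, the open reading frame usually changes, which is why step 14 recalculates it.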










Experiment 3: Identify Gene Homologs in Yellow Line

I. Search Genomes for Homologs to Annotated Genes

1. Enter DNA Subway and open a project.

2. Click Transfers, then click Continue to open the Local Browser.

3. Click a gene model, prediction, evidence item or repeat.

4. Select Detail View and Transfers.

5. Select a sequence to transfer and click the Genome Prospecting button to transfer the selected sequence to the Yellow Line.

II. Example

1. Transfer the different gene models built to the Yellow Line.

2. Transfer the genes, ORFs and proteins and compare the number of matches for each.



Experiment 4: Compare Your Results to Existing Records

Note: This function may only be available for projects based on DNA Subway sample sequences.

I. Compare Your Annotations to Those in the Phytozome Genome Hub

1. Enter DNA Subway and open a project.

2. Click Export.

3. Select the Phytozome Genome Hub and click Continue.

4. Find the chromosome and location in the field labeled Landmark or Region.

5. To compare your results to those presented by Phytozome, zoom into the region that contains your annotation and compare it with the gene model present in Phytozome’s Transcript track.

II. Examine Gene Functions Listed in Phytozome

1. Roll over a gene model in Phytozome’s Transcript track to check whether Phytozome lists a potential function for the gene.

III. Examine Gene Homologs Listed in Phytozome

1. To identify in what plants homologs to the gene have been found:

a. Scroll down to the Tracks section of the Phytozome window;

b. Check the boxes next to BLASTX and BLATX Plant Peptides;

c. Click Update Image;

d. Roll over entries in the Peptide BLASTX and BLATX tracks to identify what plant species encode homologous genes and/or proteins;

e. Close the Phytozome window.









Biological and Gene Annotation Concepts

RepeatMasker

A genome is an organism’s entire complement of DNA.

DNA is a directional molecule composed of two anti-parallel strands.

The genetic code is read in a 5’ to 3’ direction, referring to the 5’ and 3’ carbons of deoxyribose.

Eukaryotic genomes contain large amounts of repetitive DNA, including simple repeats and transposons.

Transposons can be located in intergenic regions (between genes) or in introns (within genes).

Genes and transposons are directional, and can be encoded on either DNA strand.

Repeats are non-directional, and, in effect, do occur on both strands.

Transposons can mutate like any other DNA sequence.


Gene Predictors

Protein-coding information in DNA and RNA begins with a start codon, is followed by codons, and ends with a stop codon.

Codons in mRNA (5’-AUG-3’, etc.) have sequence equivalents in DNA (5’-ATG-3’, etc.).

The DNA strand that is equivalent to mRNA is called the “coding strand.” The complementary strand is called the “template strand,” because it serves as the template for synthesizing mRNA.

Non-spliced genes, which are characteristic of prokaryotes, are also found in eukaryotes.

Even in a spliced gene, the protein-coding information may be organized as an Open Reading Frame (ORF).

Most eukaryotic genes are spliced, whereby intervening segments (introns) are removed and the remaining segments (exons) are spliced together.

Splice sites (exon-intron boundaries) have sequence patterns that are recognized by the splicing apparatus (spliceosome).

Gene prediction programs use consensus sequences around splice sites to predict exon-intron boundaries.

Over 90% of eukaryotic introns have “canonical splice sites,” whereby introns begin with GT (mRNA: GU) and end in AG (mRNA: AG).

The protein-coding sequence of a eukaryotic mRNA (or gene) is flanked by 5’- and 3’-untranslated regions (UTRs); introns can be located in UTRs.

In most eukaryotic genes, transcripts are alternatively spliced, yielding different mRNAs and proteins.

UTRs hold information for the half-lives of mRNAs and for regulatory purposes.

Gene > mRNA > CDS.

CDS = nucleotides that encode amino acid sequence.

In mRNA: CDS = ORF.
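
The statements above about start codons, stop codons and ORFs can be made concrete with a short Python sketch that scans a coding-strand sequence for the longest stretch running from an ATG to the first in-frame stop codon. It is a simplified illustration under stated assumptions: it inspects one strand only, ignores splicing, and the function names and example sequence are hypothetical. It is not the algorithm behind the editor’s “Calculate longest ORF” command.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return the longest ATG...stop ORF (stop codon included) on the given strand, or ''."""
    seq = seq.upper()
    best = ""
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from the start codon to the first in-frame stop.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                orf = seq[start:pos + 3]
                if len(orf) > len(best):
                    best = orf
                break
    return best


def to_mrna(coding_strand):
    """The mRNA has the coding strand's sequence with U in place of T."""
    return coding_strand.upper().replace("T", "U")


cds = longest_orf("GGATGAAACCCTAAGG")
print(cds)           # prints ATGAAACCCTAA
print(to_mrna(cds))  # prints AUGAAACCCUAA
```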


BLAST Searches

Basic Local Alignment Search Tool (BLAST) searches databases for matches to a query DNA or protein sequence.

Gene or protein homologs share sequence similarities due to descent from a common ancestor.

Biological evidence is needed to edit and confirm gene models predicted by computer algorithms.

Biological evidence is most often derived from mRNA transcripts (ESTs, cDNAs, RNAseq). Protein sequence data are available, too, but much less common.

Many ESTs and cDNAs are disrupted by “introns” when they are aligned against genomic DNA.

ESTs & cDNAs may be incomplete.

The BLAST algorithm does not resolve intron/exon boundaries.

The BLAST algorithm is not restricted to detecting sequences that fully match a query (“global” matches) but, instead, matches query subsequences as well (“local” matches).

The BLAST algorithm matches sequences to the fullest extent possible and, often, realigns the same sequence twice.
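
To illustrate what a “local” match is, the sketch below computes the best local alignment score between two sequences with the Smith-Waterman dynamic-programming method. Note the assumption: BLAST itself is a faster heuristic and does not run this exact procedure. The scoring values (match 2, mismatch -1, gap -2) and the function name are arbitrary choices for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A score never drops below 0, so an alignment can start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The two sequences differ at their ends but share the subsequence ACGT.
print(local_alignment_score("TTTACGTAAA", "GGACGTCC"))  # prints 8
```

The shared subsequence is found even though the sequences do not match end to end, which is what distinguishes a local match from a global one.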




Web Resources for Genome Annotation

A. Major Plant Genome Hubs:

DOE JGI’s Phytozome: http://www.phyotozme.net
University of Iowa: http://www.plantgdb.org/
CSHL: http://www.gramene.org/
ENSEMBL: http://plants.ensembl.org/index.html
NCBI: http://www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html
NCBI: http://www.ncbi.nlm.nih.gov/mapview/

B. Some Plant Genome Portals:

Arabidopsis, TAIR: http://www.arabidopsis.org/
Corn: http://www.maizesequence.org/index.html
Grape: http://www.cns.fr/externe/GenomeBrowser/Vitis/
Poplar: http://genome.jgi-psf.org/poplar/poplar.home.html
Rice: http://rice.plantbiology.msu.edu/
Tomato: http://solgenomics.net/about/tomato_sequencing.pl

C. Browsers:

Ensembl: http://www.ensembl.org
GBrowse: http://gmod.org/wiki/GBrowse
JBrowse: http://jbrowse.org/
UCSC Browser: http://genome.ucsc.edu
xGDB: http://brendelgroup.org/bioinformatics2go/bioinformatics2go.php

D. Annotation Tools:

Apollo: http://apollo.berkeleybop.org/current/index.html
Artemis: http://www.sanger.ac.uk/resources/software/artemis/
yrGATE: http://brendelgroup.org/bioinformatics2go/bioinformatics2go.php

E. Other Resources:

Course download site: http://gfx.dnalc.org/files/evidence
DynamicGene: http://www.sanger.ac.uk/resources/software/artemis/
GeneBoy: http://www.dnai.org/geneboy/
BioServers: http://www.bioservers.org/bioserver/
mRNA/gDNA: http://www.ncbi.nlm.nih.gov/spidey/
mRNA/gDNA: http://pbil.univ-lyon1.fr/sim4.php
Splice site predictor: http://www.fruitfly.org/seq_tools/splice.html
Promoter predictor: http://www.fruitfly.org/seq_tools/promoter.html




Yellow Line Walkthrough

A. Examining Transposons

Example Sequences: mPing Mite Element, Ping Transposase Gene, Ping Transposase Protein

Tool(s): Yellow Line TARGeT

Concept(s): Mobile genetic elements (transposons), non-autonomous transposons

I. Create Project

1. Log-in to DNA Subway (dnasubway.iplantcollaborative.org).

2. Click ‘Prospect Genomes using TARGeT’ (Yellow Square).

3. Select sample: mPing Mite Element (Oryza sativa/Rice).

4. Provide your project with a title, then click ‘Continue.’


II. Search the O.sativa genome using TARGeT

1. Click ‘Oryza sativa japonica’ in the ‘Select Genomes’ stop.

2. Click ‘Run’ to search the genome.

III. Identify the number of mPing elements in the O.sativa genome

1. Click ‘Alignment Viewer’ to see the results returned.

Key to results naming in Alignment Viewer: Genome name, Hit #, Project #

*Double-clicking the hit name opens the sequence and location in a new browser tab.

2. Record the number of hits in the table below.


IV. Identify the number of Ping transposons (using DNA sequence and protein)

Repeat the steps above (Sections I-III) using Ping transposase gene and Ping Transposase protein to collect the following data and answer the following questions.

                             mPing mite element   Ping Transposon (DNA)   Ping Transposon (Protein)
Number of hits in O.sativa   __________           __________              __________
Hit number 1 locus           Chr: 6               __________              __________
Hit number 2 locus           __________           __________              __________
Hit number 3 locus           __________           __________              __________
Hit number 4 locus           __________           __________              __________
Hit number 5 locus           __________           __________              __________



TARGeT: TARGeT (Tree Analysis of Related Genes and Transposons) uses either a DNA or amino acid ‘seed’ query to: (i) automatically identify and retrieve gene family homologs from a genomic database, (ii) characterize gene structure and (iii) perform phylogenetic analysis. Due to its high speed, TARGeT is also able to characterize very large gene families, including transposable elements (TEs) (from the abstract of the TARGeT paper, doi: 10.1093/nar/gkp295).

Transposons (DNA, Retroviral, LINES): Genetic elements which have the ability to be amplified and redistributed within a genome.

Non-autonomous transposons: Transposons which lack an active transposase gene, thus requiring help from another transposon to move.

Autonomous transposons: Transposons which have a functional transposase and can move within the genome.





Advanced Yellow Line Example

Prospecting example: Finding and analyzing DNA transposons (Ping - DNA transposon in rice)

Background Reading: http://www.nature.com/nature/journal/v421/n6919/full/nature01214.html

Example:

1. Open DNA Subway and start a new project in the Yellow Line, selecting the mPing Mite Element from the sample sequences.

2. Enter a project title and click ‘Continue.’

3. In the ‘Search Genomes’ stop select Oryza sativa japonica and click ‘Run.’

a. Click ‘Alignment Viewer’ to view the results of your search. This will open up two screens, one displaying a tree and another displaying sequence alignments. How many matches did the search yield? What is the relationship between the match and the query?

b. Close all viewers and return to DNA Subway.

4. Create a new project, this time querying rice with the Ping transposase Gene [ORF] as query.

a. How many matches did this search yield? (Again, use the alignment screen to count.)

b. To view details about a match, double-click its ID (left-most column in Alignment Viewer; enable pop-ups in your browser). This screen also has a link to open Phytozome at the location of the match.

c. Using the tree, determine the relationships among the hits. As the query sequence originates in the rice genome you can identify the match that’s identical to the query sequence.

d. Close all viewers and return to DNA Subway.

5. Create a new project, this time querying rice with the Ping Transposase Protein.

a. How many matches did this search yield? Explain the differences in the number of results for the three queries.

b. In the alignment screen, find the row for the query (ID=Ping), click its ID field once (left-most column), then bring the tree screen to the foreground and find Ping among the matches displayed.

c. All matches constitute sequences that are contained in the genome of the rice plant that was sequenced to determine the sequence of the entire rice genome. What do the lengths of tree branches indicate?

d. Transposable elements that diverged from a common ancestor more recently will differ from each other less than they would differ from those that diverged in the more distant past. How many groups of transposons contain matches that seem to have diverged from each other more recently? What would you be looking for in order to answer this question?

6. Repeat the different kinds of searches and analyses in other genomes. To date only rice, maize, and Arabidopsis have been exhaustively studied for TEs. Prospecting other genomes will reveal new information about these organisms.






Blue Line Walkthrough

A. Examining DNA Sequence

Example Sequences: rbcL sample 1

Tool(s): Sequence Viewer

Concept(s): DNA Barcoding, Sanger DNA Sequencing

I. Create Project

1. Log-in to DNA Subway (dnasubway.iplantcollaborative.org).

2. Click ‘Determine Sequence Relationships.’ (Blue Square)

3. Select project type ‘Barcoding: rbcL.’

4. Select sample sequence ‘rbcL sample 1.’

5. Provide your project with a title, then click ‘Continue.’

Alternatively, if you have sequenced your DNA using your Genewiz account, select ‘Import trace files from DNALC.’ Then select sequences to import.


II. View Sequence

6. Click ‘Sequence Viewer’ to show a list of your sequences.

7. Click on a sequence name to show the sequence’s trace file.

Questions:

Q.1: What do you notice about the electropherogram peaks and quality scores at nucleotide positions labeled “N”?

_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________

Q.2: Where do the ‘N’s in the sequence tend to be distributed, and why?

_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________

Additional Investigation: Learn more about Sanger sequencing at: http://www.dnalc.org/view/15479-Sanger-method-of-DNA-sequencing-3D-animation-with-narration.html


DNA Barcoding: The process of species identification by examination of DNA sequence.

rbcL: A gene coding the large subunit of the enzyme RuBisCo, and one of the important loci for species identification of plants.

Sanger DNA Sequencing: A method of DNA sequencing that uses fluorescently labeled dideoxynucleotide terminators to generate the sequence of a DNA sample.

Quality (Phred) Score: Nucleotide calls read from sequencing output files are assigned a quality score of 10, 20, 30, 40, or 50. A score of 50 means that the base is called with 99.999% accuracy. A score of less than 20 falls below the cut-off for high-quality sequence.
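
The relationship between a Phred score and base-call accuracy follows the standard definition Q = -10 × log10(P), where P is the probability that the call is wrong. The short Python sketch below applies it to the scores listed above; the function name is a hypothetical choice for this example.

```python
def phred_to_accuracy(q):
    """Convert a Phred quality score to the probability that the base call is correct."""
    error_probability = 10 ** (-q / 10)
    return 1 - error_probability


for q in (10, 20, 30, 40, 50):
    print(q, f"{phred_to_accuracy(q):.3%}")
```

A score of 20 corresponds to one expected error in 100 calls (99% accuracy), which is why it is a common threshold for high-quality sequence.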




B. Assembling and Editing DNA Sequence

Example Sequences: rbcL sample 1 from Part A

Tool(s): Sequence Trimmer, Pair Builder, Consensus Builder

Concept(s): Sanger DNA Sequencing, bidirectional reads

I. Trim 5’/3’ ends

1. Click ‘Sequence Trimmer.’

2. Click ‘Sequence Trimmer’ again to examine the changes made in the sequence.

II. Pair Builder

1. Click ‘Pair Builder.’

2. Select the check boxes next to the sequences that represent bidirectional reads of the same sequence set. Alternatively, select the ‘Auto Pair’ function and verify the pairs generated.

3. As necessary, reverse complement sequences that were sequenced in the reverse orientation by clicking the ‘F’ next to the sequence name. The ‘F’ will become an ‘R’ to indicate the sequence has been reverse complemented.

4. Save the created pairs.
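
What reverse complementing does to a read can be shown in a few lines of Python. This is a minimal sketch rather than DNA Subway’s own code; the function name is hypothetical, and N (an uncalled base) is assumed to stay N.

```python
# Translation table: each base maps to its complement; N stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGCCN"))  # prints NGGCAT
```

Applying the function twice returns the original read, which is a quick way to check that nothing was lost in the conversion.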

III. Consensus Builder

1. Click ‘Consensus Builder.’

2. Click ‘Consensus Builder’ again to examine the created consensus files. Any differences between two reads will be highlighted in yellow in the consensus builder.

3. Make needed edits, and save your changes.


Questions:

Q.3: Sequence identified by DNA Subway as low quality is marked by a symbol. What problems might it cause to generate consensus sequence from low-quality DNA sequence?

_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________

Bidirectional sequence: DNA sequence generated by sequencing a DNA strand in the forward and reverse orientation.

Consensus sequence: A sequence that sums the consensus of two or more DNA sequences.
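
One simple way to build a consensus from a pair of reads is sketched below in Python. It assumes that the two reads are already aligned, in the same orientation and of equal length, and it resolves a disagreement by keeping the base with the higher quality score, writing N on a tie. This rule, the function name and the example data are assumptions for illustration and are not taken from DNA Subway’s Consensus Builder.

```python
def consensus(read_a, qual_a, read_b, qual_b):
    """Build a consensus from two aligned, equal-length reads and their quality scores."""
    if not (len(read_a) == len(qual_a) == len(read_b) == len(qual_b)):
        raise ValueError("reads and quality lists must all have the same length")
    out = []
    for a, qa, b, qb in zip(read_a, qual_a, read_b, qual_b):
        if a == b:
            out.append(a)    # reads agree
        elif qa > qb:
            out.append(a)    # trust the higher-quality call
        elif qb > qa:
            out.append(b)
        else:
            out.append("N")  # equal quality: leave the base undetermined
    return "".join(out)


print(consensus("ACGTA", [40, 40, 10, 40, 20],
                "ACCTT", [40, 40, 30, 40, 20]))  # prints ACCTN
```

The example shows why low-quality reads are a problem: where both calls are unreliable, the consensus cannot be resolved and an N remains.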




C. Matching sequence to databases

Example Sequences: rbcL sample 1 from Part B

Tool(s): BLAST, Upload Data, Reference Data

Concept(s): BLAST Searches, GenBank, BOLD Database

I. Check for matches in GenBank

1. Click ‘BLASTN.’

2. Click the ‘BLAST’ link for the sequence of interest.

3. Examine the BLAST matches for candidate identification. Clicking the species name given in the BLAST hit will also give additional information/photos of the listed species.

4. If desired, select the check box next to any hit, and select ‘Add BLAST hits to project’ to add selected sequences to your project.

II. Upload Data (optional)

1. If desired, click ‘Upload Data’ to import additional data into your project. You will need to repeat steps in the ‘Assemble Sequences’ stop on DNA Subway.

III. Reference Data (optional)

1. Click ‘Reference Data.’

2. Select one or more groups of sequences from selected reference samples of rbcL sequence.

3. Select ‘Add ref data’ to add the data to your project.

Questions:

Q.4: BLAST will return the closest matches present in GenBank. Will you be able to identify an unknown species using BLAST alone? Why or why not?

_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________

Additional Investigation: See the laboratory: “Using Barcoding to Identify and Classify Living Things.” (http://www.urbanbarcodeproject.org/files/Barcoding_Protocol.pdf)

BLAST: Basic Local Alignment Search Tool (BLAST) is an algorithm that searches databases of biological sequence information (e.g. DNA, RNA, or protein sequence) and returns matches. The BLASTN program is specific to nucleotide data.

GenBank: The largest database of publicly available nucleotide sequences. As of 2011 the database contains well over 100 billion nucleotides of generated sequence data.

BOLD: Barcode of Life Online Database (BOLD) is an online repository for sequence data generated by DNA barcoding projects worldwide.




D. Building Phylogenetic Trees

Example Sequences: rbcL sample 1 from Part C

Tool(s): Select Data, MUSCLE, PHYLIP NJ, PHYLIP ML

Concept(s): Sequence alignment, phylogenetics

I. Select Data for Alignment

1. Click ‘Select Data.’

2. Select any and all sequences you wish to add to your tree.

3. Click ‘Save.’

II. Generate Multiple Sequence Alignment

1. Click ‘MUSCLE.’

2. Click ‘MUSCLE’ again to open the sequence alignment window.

3. At the start and end of the alignment for all sequences, left-click the position number above the alignment to open a menu, and trim both ends of the alignment.

4. Click ‘Submit Trimmed Alignment.’

III. Construct Phylogenetic Tree

1. Click either ‘PHYLIP NJ’ or ‘PHYLIP ML’ to run the tree construction algorithm.

2. Click the button for the algorithm you chose above again to launch a viewer for the multiple alignment and tree.


Questions:

Q.5: What relationship do you see between sequences that have more mutations (align less well with the majority of sequences) in the alignment and the length of a sequence’s branch on the tree?

_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________

Q.6: Do you see differences in the phylogenetic trees generated by the Neighbor-joining vs. Maximum likelihood methods?

_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________

Multiple Alignment: A (usually) computer-generated alignment of sequences. Under the assumption that all sequences within the alignment are similar (e.g. of a common genetic origin, from a common locus, in the same strand orientation), gaps are introduced where misalignments (e.g. insertions or deletions/missing data) appear.

Phylogenetic tree: A diagram which represents inferred evolutionary relationships between organisms. As applied here, sequences are displayed with branch lengths that are proportional to the differences between the sequences.
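
The “differences between the sequences” that drive branch lengths can be illustrated with the simplest distance measure, the proportion of aligned sites at which two sequences differ (often called the p-distance). The Python sketch below is a simplified illustration: the function name is hypothetical, columns containing a gap are skipped, and the tree-building programs named in this section may use more elaborate substitution models.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences, skipping gap columns."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


# Five comparable sites (one gap column is skipped), one of which differs.
print(p_distance("ACGT-A", "ACCTTA"))  # prints 0.2
```

A sequence with more mutations has larger distances to all the others, which shows up as a longer branch on the tree.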


PHYLIP NJ and PHYLIP ML: Tree-building algorithms based on the “Neighbor Joining” and “Maximum Likelihood” methods, respectively. See: http://www.icp.be/~opperd/private/neighbor.html and http://www.icp.ucl.ac.be/~opperd/private/max_likeli.html