Dabbling in bioinformatics:

Oct 1, 2013

CS 251: Introduction to Bioinformatics

Laboratory 9: Hunting for genes in a DNA sequence


Today, we examine web-based tools for finding genes hidden in eukaryotic genomes. We will locate and identify genes in stretches of DNA sequence from a fungus, Aspergillus nidulans, a common bread mold fungus and the organism studied by Dr. James.


This laboratory will rely on methods and websites that are presented in Chapter 5, pp. 158-172 in Bioinformatics for Dummies.




Objective: Obtain the following DNA sequence, and ORF it! (In other words, find all of the Open Reading Frames, or ORFs, and determine what proteins they encode.)



Go to the GenBank entry tool at
http://www.ncbi.nlm.nih.gov/entrez/



Along the left column, point your browser to “Genomic Biology”. Then, along the right column of the new page, under “Genome Resources”, point your browser to “Aspergillus”. A new page, “Aspergillus Genome Resources”, will appear. Along the right column, open the link to “A. nidulans Database at the Broad Institute”. This will direct you to the A. nidulans website, which is constructed and maintained by the Broad Institute at Massachusetts Institute of Technology (MIT).



a. The A. nidulans genome sequence is broken up into manageable-sized chunks called contigs. A contig is one contiguous stretch of DNA assembled from a number of smaller, overlapping sequences. Today you will identify and study all of the genes encoded in a small, 15,000 bp sub-region of Contig #26. To obtain this chunk of DNA sequence, do the following:



b. Go to the Aspergillus nidulans database: http://www.broad.mit.edu/annotation/fungi/aspergillus/



c. Point your browser to “Browse Regions”.





d. In the box labeled “Contig number”, enter 1.26
In the box labeled “Start”, enter 357000
In the box labeled “Stop”, enter 372000
Click on the hotlink labeled “DNA Sequence”
Copy/paste this 15 kb sequence here, and convert it to 10 pt Courier font:
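
Before moving on, it can help to confirm that the pasted text really is the expected 15 kb region. The following is a minimal Python sketch, not part of the original lab. The function names are hypothetical, and it assumes the browser may return either 15,000 or 15,001 bases, depending on whether the Start and Stop coordinates are treated as inclusive.

```python
import re


def clean_sequence(raw: str) -> str:
    """Strip FASTA header lines, digits, and whitespace from pasted text.

    Returns only the letters, upper-cased, so that "just the sequence"
    remains.
    """
    lines = [ln for ln in raw.splitlines() if not ln.lstrip().startswith(">")]
    return re.sub(r"[^A-Za-z]", "", "".join(lines)).upper()


def check_region(seq: str, start: int = 357000, stop: int = 372000) -> bool:
    """Return True if the sequence length matches the requested region.

    Assumption: both stop - start and stop - start + 1 are accepted,
    because the coordinate convention of the website is not stated.
    """
    expected = stop - start
    return len(seq) in (expected, expected + 1)


def unexpected_letters(seq: str) -> set:
    """Return any letters other than A, C, G, T, or N found in the sequence."""
    return set(seq) - set("ACGTN")
```

For example, calling `check_region(clean_sequence(pasted_text))` on the text copied from the “DNA Sequence” page should return True if the whole region was copied.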





For the following exercises, follow along in pp. 158-163 in BFD.



e. Use ORF Finder to locate all of the potential Open Reading Frames (ORFs) in this 15 kb stretch of DNA. ORF Finder will predict ORFs, i.e., long stretches of DNA that could potentially contain a protein-coding portion of a gene. ORF Finder is a graphical analysis tool that finds all open reading frames of a selectable minimum size, usually >100 nucleotides.




(1) To access ORF Finder, go to NCBI: http://www.ncbi.nlm.nih.gov

(2) Under HOTSPOTS in the right column, choose ORF Finder: http://www.ncbi.nlm.nih.gov/gorf/gorf.html

(3) Copy/paste the 15 kb sequence into the ORF Finder box (just the sequence!), and click on the OrfFind button.





Six parallel horizontal bars will appear on the screen. Each will contain a number of blue boxes. Each blue box represents one potential ORF. You will see that a blizzard of potential ORFs can be found in this region. Only a small number of them represent genuine protein-coding regions. Your assignment is to find out which ones belong to real genes!
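
The kind of scan that produces such a map can be approximated in a few lines of Python. The sketch below is an illustration only, not ORF Finder's actual algorithm. It assumes that an ORF starts at an ATG codon and ends at the next in-frame stop codon (TAA, TAG, or TGA), and that the minimum size is 100 nucleotides. The names find_orfs and min_len are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=100):
    """List ATG-to-stop ORFs of at least min_len nucleotides.

    Each result is (strand, frame, start, end). Frames are numbered 1-3.
    start and end are 0-based, end-exclusive offsets on the strand that
    was scanned, so minus-strand offsets refer to the reverse complement.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            # Walk this reading frame one codon at a time.
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG since the last stop codon
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, frame + 1, start, i + 3))
                    start = None
    return orfs
```

Sorting the result by `end - start` gives the ORFs from longest to shortest, which is a convenient order for the BLAST searches later in this lab.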



f. Before you test the ORFs to see which ones are real, answer the following questions:

Q1: What do the six bars represent? Explain why there are six bars, and explain how the three parallel bars at the top differ from the three parallel bars at the bottom.

Q2: Could a single gene be contained in multiple, adjacent or overlapping ORFs? In other words, is the protein-coding region of a gene necessarily contained in a single ORF, or could the protein-coding region be broken up into more than one ORF? Why or why not?




g. Sometimes you can work out which potential ORFs are real by comparing your output from ORF Finder with the output of another gene-finding tool, called GeneMark.

Leave your ORF Finder window open, showing the ORF map of the 15 kb region we are studying.

Open a new window, and point your browser to http://opal.biology.gatech.edu/GeneMark/.

Then, choose the link corresponding to “Gene Prediction in Eukaryotes” associated with the rat icon. If you follow the directions on pp. 162-163, and choose C. elegans as the species most closely related to A. nidulans, then click on the Start GeneMark.hmm button, you will receive a PDF output that shows the position of all probable genes in the 15 kb sequence.

Unfortunately this tool, like ORF Finder, also predicts many more functional ORFs than really exist. Careful examination of the output may help to narrow the field. However, another way to make sense of your ORFs is to return to ORF Finder and use the associated blastp feature to BLAST a subset of the ORFs.



h. Return to the ORF Finder window, and proceed as follows:

Find the FOUR (4) bona fide genes in this 15,000 bp region. Find all of the ORFs (exons) corresponding to these four genes, as follows:

(1) Based on the assumption that the longer the ORF, the more likely it is to represent a bona fide gene, use Blastp to BLAST the largest eight (8) ORFs, for starters. For each of these 8 BLAST searches, do the following:





(a) First, click on the blue ORF that you intend to BLAST. Use an organized approach: Begin at the left, and work to the right. When you click on the desired ORF, the screen will refresh and the highlighted ORF will become purple. Also, the DNA sequence of the ORF, and a corresponding translation, will be displayed. For each ORF that you search, paste the sequence + translation into a MSWord file, and label it for identification purposes.

(b) Second, BLAST the ORF. For each Blastp search, ask for a graphical output and specify 10 descriptions + 10 alignments. Obtain the output, and then paste the output below the sequence + translation from (a) above. Use 10 point Courier font throughout.

During this effort, you will need to use your judgement to assess the quality of the Blastp hits that are produced, and decide if the hits are significant or if they are meaningless. In any event, for the time being save the output of these searches.





Clues for making good judgements include the following:

1. e-value: is the e-value < 10^-15?

2. Does the ORF contain a putative conserved domain? If so, what is it? List it or copy in a description of the conserved functional domain (a conserved domain is a protein region that is the same or very similar in many proteins, because it provides a function that is common to many proteins).

If the answer to these two questions is YES, then you have probably hit a bona fide gene.
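
The two-question test above can be written down as a small Python check. This is only a sketch of the rule of thumb stated in this handout. The function name and its arguments are hypothetical, and the values are ones you would read off the Blastp output by hand.

```python
from typing import Sequence


def probably_real_gene(best_evalue: float,
                       conserved_domains: Sequence[str],
                       cutoff: float = 1e-15) -> bool:
    """Apply the two clues from the handout to one ORF.

    best_evalue: the e-value of the best Blastp hit for the ORF.
    conserved_domains: names of any putative conserved domains reported
        for the ORF (an empty list if none were reported).
    cutoff: the e-value threshold; 1e-15 is the value given in the lab.
    """
    has_strong_hit = best_evalue < cutoff
    has_domain = len(conserved_domains) > 0
    return has_strong_hit and has_domain
```

A call such as `probably_real_gene(1e-40, ["protein kinase domain"])` returns True, while an ORF with a weak hit or with no conserved domain returns False. The domain name here is a made-up example; judgement is still needed for borderline cases.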




Each time you begin work with a new ORF, start a new page in your MSWord file.




i. After you have identified each of the four different genes, go back and BLAST the appropriate smaller ORFs that are adjacent to each identified gene on either side, to learn if the gene is contained in more than one ORF.





j. Completing the assignment:

To complete this assignment to identify the four real genes, you will probably need to BLAST 17-18 total ORFs from this 15,000 nt sequence.





Please submit the following to complete this assignment:





(1) Sequence + translation of each ORF that belongs to a real gene.





(2) Blastp outputs for each real-gene ORF that include 10 descriptions + one keystone alignment to an orthologous gene whose function is well-described and well-understood. In other words, don’t necessarily choose an alignment because it has the lowest e-value; an alignment to a “hypothetical protein” is uninformative. If your 8th-best alignment is the first one to list a protein with a real name (e.g., cyclic AMP-dependent protein kinase), and this alignment’s e-value is similar to each of the 7 better matches, then use this identification for your Aspergillus nidulans ORF(s).





(3) A schematic diagram depicting the order of the four genes and the distances separating each one.





(4) In addition, the schematic diagram must show the relative position and the reading frame of each ORF belonging to a gene. If multiple ORFs (exons) belong to the same gene, this must be clearly described and diagrammed.
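
The rule in item (2) for choosing a keystone alignment can be sketched in Python as follows. This is an illustration under stated assumptions, not a required method. The hit list is typed in by hand from the Blastp descriptions. The only uninformative label it screens for is “hypothetical protein”. “Similar” e-values are taken to mean within a chosen number of orders of magnitude of the best hit (max_log10_gap, a hypothetical parameter with an arbitrary default of 10).

```python
import math

# Labels treated as uninformative; more can be added as needed.
UNINFORMATIVE = ("hypothetical protein",)


def log10_evalue(evalue, floor=1e-300):
    """log10 of an e-value, with a floor so that 0.0 does not fail."""
    return math.log10(max(evalue, floor))


def pick_keystone(hits, max_log10_gap=10.0):
    """Pick the first named hit whose e-value is close to the best one.

    hits: list of (description, evalue) tuples, best hit first.
    Returns the chosen (description, evalue), or None if every hit that
    is close enough to the best e-value is uninformative.
    """
    if not hits:
        return None
    best = log10_evalue(hits[0][1])
    for description, evalue in hits:
        if log10_evalue(evalue) - best > max_log10_gap:
            break  # remaining hits are much weaker than the best one
        if not any(term in description.lower() for term in UNINFORMATIVE):
            return description, evalue
    return None
```

With a made-up list such as `[("hypothetical protein", 1e-80), ("cyclic AMP-dependent protein kinase", 1e-75)]`, the function returns the kinase entry. The choice should still be checked by eye against the alignment itself.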