Dabbling in bioinformatics:


Oct 1, 2013


CS 251
Introduction to Bioinformatics

Laboratory 9:

Hunting for genes in a DNA sequence

Today, we examine web-based tools for finding genes hidden in eukaryotic genomes. We will locate and identify genes in stretches of DNA sequence from a fungus, Aspergillus nidulans, a common bread mold and the organism studied by Dr. James.

This laboratory will rely on methods and websites that are presented in Chapter 5, pp. 158-172 in Bioinformatics for Dummies.


Obtain the following DNA sequence, and ORF it! (In other words, find all of the Open Reading Frames, or ORFs, and determine what proteins they encode.)

Go to the GenBank entry tool at

Along the left column, point your browser to “Genomic Biology”. Then, along the right column of the new page, under “Genome Resources”, point your browser to
A new page, “Aspergillus Genome Resources”, will appear. Along the right column, open the link to “
Database at the Broad Institute”. This will direct you to the A. nidulans website, which is constructed and maintained by the Broad Institute at the Massachusetts Institute of Technology (MIT).


The A. nidulans genome sequence is broken up into manageable-sized chunks called contigs. A contig is one stretch of DNA assembled from a number of smaller, overlapping sequences. Today you will identify and study all of the genes encoded in a small, 15,000 bp subregion of Contig #26. To obtain this chunk of DNA sequence, do the following:



Go to Aspergillus nidulans

Point your browser to “Browse Regions”


In the box labeled “Contig number”, enter

In the box labeled “Start”, enter

In the box labeled “Stop”, enter

Click on the hotlink labeled “DNA Sequence”

Copy/paste this 15 kb sequence here, and convert it to 10 pt Courier font.

For the following exercises, follow along in pp. 158-163 in BFD.


Use ORF Finder to locate all of the potential Open Reading Frames (ORFs) in this 15 kb stretch of DNA. ORF Finder will predict ORFs, i.e., long stretches of DNA that could potentially contain a protein-coding portion of a gene. ORF Finder is a graphical analysis tool that finds all open reading frames of a selectable minimum size, usually 100 nucleotides.
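To make the idea of scanning for ORFs above a minimum size concrete, here is a minimal Python sketch. It is not NCBI's implementation. It assumes that an ORF runs from an ATG to the first in-frame stop codon, and the names `Orf`, `find_orfs`, and `reverse_complement` are made up for illustration.

```python
from typing import List, NamedTuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


class Orf(NamedTuple):
    frame: int   # +1..+3 on the given strand, -1..-3 on its reverse complement
    start: int   # 0-based offset of the ATG on the strand that was scanned
    end: int     # offset just past the stop codon, on the same strand
    length: int  # length in nucleotides, stop codon included


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_length: int = 100) -> List[Orf]:
    """Scan all six reading frames for ATG...stop stretches (assumed ORF definition)."""
    seq = seq.upper()
    orfs = []
    for sign, strand in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            start = None  # position of the first ATG since the last stop codon
            for i in range(offset, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if length >= min_length:
                        orfs.append(Orf(sign * (offset + 1), start, i + 3, length))
                    start = None
    return orfs
```

Calling `find_orfs(sequence, min_length=100)` on the pasted sequence returns a list of candidate ORFs. Like the web tool, it will report many candidates that are not real genes.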

(1) To access ORF Finder, go to NCBI:

(2) Under

in the right column, choose ORF Finder.


(3) Copy/paste the 15 kb sequence into the ORF Finder box (just the sequence!), and click on the OrfFind button.
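If the copied text carries position numbers, spaces, or a FASTA header, it helps to strip these before pasting "just the sequence". The following is a small sketch of one way to do that in Python. The function name `clean_sequence` is hypothetical, and the sketch assumes the sequence uses only the letters A, C, G, T, and N.

```python
import re


def clean_sequence(raw: str) -> str:
    """Strip FASTA header lines, digits, and whitespace from pasted sequence text."""
    # Drop any header lines, which start with ">" in FASTA format.
    lines = [ln for ln in raw.splitlines() if not ln.lstrip().startswith(">")]
    # Remove position numbers and all whitespace, then normalize to upper case.
    seq = re.sub(r"[\s\d]", "", "".join(lines)).upper()
    # Assumption: only A, C, G, T, and N are valid here.
    unexpected = set(seq) - set("ACGTN")
    if unexpected:
        raise ValueError(f"unexpected characters in sequence: {sorted(unexpected)}")
    return seq
```

Checking `len(clean_sequence(text))` against the expected region size is a quick way to confirm that the whole region was copied.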

Six parallel horizontal bars will appear on the screen. Each will contain a number of blue boxes. Each blue box represents one potential ORF. You will see that a lot of potential ORFs can be found in this region. Only a small number of them represent genuine protein-coding regions. Your assignment is to find out which ones belong to real genes!


Before you test the ORFs to see which ones are real, answer the following questions:

What do the six bars represent? Explain why there are six bars, and explain how the three parallel bars at the top differ from the three parallel bars at the bottom.

Could a single gene be contained in multiple, adjacent or overlapping ORFs? In other words, is the protein-coding region of a gene necessarily contained in a single ORF, or could the protein-coding region be broken up into more than one ORF? Why or why not?


Sometimes it can help to determine which potential ORFs are real by comparing your output from ORF Finder with another gene-finding tool, called GeneMark. Leave your ORF Finder window open, showing the ORF map of the 15 kb region we are studying.
Open a new window, and point your browser to

Then, choose the link corresponding to “Gene Prediction in Eukaryotes” associated with the rat icon. If you follow the directions on pp. 162-163 and choose C. elegans as the species most closely related to A. nidulans, then click on the Start GeneMark.hmm button, you will receive a PDF output that shows the position of all probable genes in the 15 kb sequence.

Unfortunately this tool, like ORF Finder, also predicts many more functional ORFs than really exist. Careful examination of the output may help to narrow the field. However, the best way to make sense of your ORFs is to return to ORF Finder and use the associated blastp feature to BLAST a subset of the ORFs.


Return to the ORF Finder window, and proceed as follows:

Find the FOUR (4) bona fide genes in this 15,000-bp region. Find all of the ORFs (exons) corresponding to these four genes, as follows:

(1) Based on the assumption that the longer the ORF, the more likely it is to represent a real gene, use blastp to BLAST the largest eight (8) ORFs, for starters. For each of these 8 BLAST searches, do the following:

(a) First, click on the blue ORF that you intend to BLAST. Use an organized approach: begin at the left, and work to the right. When you click on the desired ORF, the screen will refresh and the highlighted ORF will become purple. Also, the DNA sequence of the ORF, and a corresponding translation, will be displayed. For each ORF that you search, paste the sequence + translation into a MSWord file, and label it for identification purposes.

(b) Second, BLAST the ORF. For each Blastp search, ask for a graphical output and specify 10 descriptions + 10 alignments. Obtain the output, and then paste the output below the sequence + translation from (a) above. Use 10 point Courier font.


During this effort, you will need to use your judgement to assess the quality of the Blastp hits that are produced, and decide if the hits are significant or if they are meaningless. In any event, for the time being save the output of these searches.


Clues for making good judgements include the following:

1. e-value: is the e-value <10

2. Does the ORF contain a putative conserved domain? If so, what is it? List it or copy in a description of the conserved functional domain. (A conserved domain is a protein region that is the same or very similar in many proteins, because it provides a function that is common to many proteins.)

If the answer to these two questions is YES, then you have probably hit a bona fide gene.

Each time you begin work with a new ORF, start a new page in your MSWord file.

i. After you have identified each of the four different genes, go back and BLAST the appropriate smaller ORFs that are adjacent to each identified gene on either side, to learn if the gene is contained on more than one ORF.

Completing the assignment

To complete this assignment and identify the four real genes, you will probably need to BLAST 18 total ORFs from this 15,000 nt sequence.

Please submit the following to complete this assignment:

(1) Sequence + translation of each ORF that belongs to a real gene.

(2) Blastp outputs for each real-gene ORF that include 10 descriptions + one keystone alignment to an orthologous gene whose function is well-described and well-understood. In other words, don’t necessarily choose an alignment because it has the lowest e-value; an alignment to a “hypothetical protein” is uninformative. If your 8th alignment is the first one to list a protein with a real name (e.g., cyclic AMP protein kinase), and this alignment’s e-value is similar to each of the 7 better matches, then use this identification for your Aspergillus nidulans gene.

(3) A schematic diagram depicting the order of the four genes and the distances separating each one.

In addition, the schematic diagram must show the relative position and the reading frame of each ORF belonging to a gene. If multiple ORFs (exons) belong to the same gene, this must be clearly described and diagrammed.
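When preparing the schematic, the reading frame of each ORF and the distances between genes can be worked out from the ORF coordinates. The Python sketch below shows one way to do the arithmetic. It assumes 1-based, inclusive coordinates on the forward strand, and it assumes that frames on the reverse strand are counted from the 3' end of the region. ORF Finder's own frame labels may follow a different convention, so check them against the tool's output. The function names are hypothetical.

```python
from typing import List, Tuple


def reading_frame(start: int, end: int, strand: str, seq_len: int) -> int:
    """Frame (+1..+3 or -1..-3) for an ORF given in 1-based inclusive coordinates."""
    if strand == "+":
        return (start - 1) % 3 + 1
    # Assumed convention: reverse-strand frames are counted from the 3' end.
    return -((seq_len - end) % 3 + 1)


def gaps_between(genes: List[Tuple[str, int, int]]) -> List[Tuple[str, str, int]]:
    """Distance in bp between consecutive genes given as (name, start, end)."""
    ordered = sorted(genes, key=lambda gene: gene[1])
    return [
        (left[0], right[0], right[1] - left[2] - 1)
        for left, right in zip(ordered, ordered[1:])
    ]
```

For example, with made-up coordinates, `gaps_between([("A", 100, 900), ("B", 5000, 6000)])` reports 4099 bp between gene A and gene B.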