Student Edition - Lycoming College




Decoding the rockfish genome: An introduction to modern genomics and marine biology


Target Audience: AP high school biology students or lower division undergraduate college students.



Vince Buonaccorsi

Juniata College




OUTLINE

Introduction

Lab 1: How do eukaryotic genes work?

Lab 2: What can comparison of DNA sequences tell us about evolution?

Lab 3: How can I use bioinformatics to annotate genes?

Lab 4: What does a typical rockfish gene look like?






Introduction

This investigation focuses on how genes are found in a newly sequenced genome, a process called structural gene annotation. Concepts explored include the cell, the molecular basis of heredity, and evolution. Students will use online analytical tools to explore the fine structure of genes in the flag rockfish Sebastes rubrivinctus, the connection between gene structure and cellular functions, and the connection between function and evolutionary conservation of gene sequences. Genomics and bioinformatics are dynamic fields well suited to capturing the imagination of students in inquiry-driven classroom efforts. Genomic studies provide a comprehensive catalog of the basic genetic information in a system that underpins the structures and functions responsible for an organism’s survival, evolution, and interactions with other organisms of the same or different species. New genomes are being sequenced at an increasing rate, leaving vast quantities of orphaned data that can be explored in authentic research experiences.


In this investigation, students will be given raw segments of DNA (i.e. genomic scaffolds) from the Sebastes rubrivinctus genome project. In order to inform their gene annotations, students will search for evidence available online in the form of similar proteins and RNAs from closely related organisms, as well as humans. This “extrinsic” evidence will be combined with “intrinsic” signals in the DNA itself (signals that direct the cellular apparatus through transcription and translation) to devise gene models from raw DNA. Students will be answering the basic question: what does a Sebastes rubrivinctus gene look like? The answer may depend on the extrinsic evidence the student is able to find. Students will perform structural gene annotation by hand, examine alignments of gene sections from many different vertebrates using the UCSC genome browser, use a gene annotation pipeline to perform structural gene annotation, assign a putative function to the gene, and describe how the gene is important to the organism’s development, reproduction, and/or survival.


This example highlights some elements of the 9-12 National Science Education Content Standards A and C, Science as Inquiry and Life Science. Students will obtain the means necessary to perform and understand scientific inquiry. The primary life science standards covered include the cell, the molecular basis of heredity, and biological evolution. This lab manual assumes that each student has some familiarity with cell and molecular biology and access to a computer and the World Wide Web. The background information covered is not meant to be an exhaustive treatment of these topics, but rather reviews the specific information necessary to perform and understand the exercises.


Lab 1. How do eukaryotic genes work?

Goal: To give a basic understanding of genes and their functions.

Review of Some Basic Molecular Biology:

The “Central Dogma of Biology” describes information flow in biological systems. It states that DNA makes RNA, which makes proteins. Transcription is the process of copying DNA into RNA. Translation is the process of decoding RNA into proteins. We will focus our analysis on predicting protein-coding genes from raw genome sequence in a marine fish that has recently been sequenced. Only messenger RNA (mRNA) genes code for proteins. Ribosomal genes are transcribed into ribosomal RNAs (rRNA), transfer RNA genes produce tRNA molecules, and many other RNA genes do not code for proteins. Most DNA is found in the nucleus, where it is transcribed; the resulting mRNA is exported from the nucleus to the cytoplasm for translation.

The flag rockfish Sebastes rubrivinctus is a member of a diverse marine fish assemblage with an estimated 102 species native to the west coast of North America. Rockfishes of the genus Sebastes support important commercial and recreational fisheries on the west coast of North America and are the dominant assemblage on most cold temperate reefs. These live-bearers have a low intrinsic rate of population increase and highly sporadic recruitment, releasing large numbers of pelagic larvae into a variable coastal environment. Their slow growth rates render them vulnerable to overfishing. Uncertainty in the success of any particular year class has favored an evolutionary strategy whereby some species of the Sebastes genus have extremely long lifespans and do not show signs of aging (negligible senescence), while others demonstrate typical aging patterns and have short lifespans. S. rubrivinctus has a maximum lifespan of only 18 years, but it is closely related to S. nigrocinctus, which has a maximum age of at least 116 years. Researchers expect to gain insight into the genetic mechanism for negligible senescence by sequencing and comparing the genomes of these two species.


Transcription and Eukaryotic Gene Structure:

Most cell functions involve chemical reactions. Food molecules taken into cells react to provide the chemical constituents needed to synthesize other molecules. Both breakdown and synthesis are made possible by a large set of protein catalysts, called enzymes. The breakdown of some of the food molecules enables the cell to store energy in specific chemicals that are used to carry out the many functions of the cell. Cells store and use information to guide their functions. The genetic information stored in DNA is used to direct the synthesis of thousands of proteins that each cell requires. Cell functions are regulated. Regulation occurs both through changes in the activity of the functions performed by proteins and through the selective expression of individual genes. This regulation allows cells to respond to their environment and to control and coordinate cell growth and division. (National Science Education Standards pg. 184)

In all organisms, the instructions for specifying the characteristics of organisms are carried in DNA, a large polymer formed from subunits of four kinds (A, G, C, and T). The chemical and structural properties of DNA explain how the genetic information that underlies heredity is both encoded in genes (as a string of molecular “letters”) and replicated (by a templating mechanism). Each DNA molecule in a cell forms a single chromosome. (National Science Education Standards pg. 185)

The structure of a gene dictates how cellular proteins will interact with it to transcribe a messenger RNA and translate that mRNA into a protein. Some of the important details of DNA and a gene are listed below:



Each cell contains the same genome sequence, but different genes are transcribed into mRNAs by different cells and by the same cells at different times.



DNA is double stranded, and each strand has a 5’ and a 3’ end, pronounced “five prime” and “three prime.” It is read from 5’ to 3’, and the two strands run in opposite directions: the 5’ end of one strand pairs with the 3’ end of its partner.

5' ATGGCGT TGCCATA CCCGCAT CCCTGAT 3'
3' TACCGCA ACGGTAT GGGCGTA GGGACTA 5'



Eukaryotic genes have several different parts that orchestrate the cellular processes of transcription and translation (see Figure below):

o Promoter
o Five Prime UnTranslated Region (5’ UTR)
o Coding sequence(s)
o Intron(s)
o Three Prime UnTranslated Region (3’ UTR)



The Promoter is a region of DNA, up to a few hundred bp long, immediately upstream of the transcription start point. Transcription factors (proteins) bind to it and recruit RNA polymerases for transcription.



Exons are the sections of transcribed DNA that are exported from the nucleus after introns have been removed and the ends of the transcript stabilized to form the mature mRNA. A gene may have one exon or several exons separated by intervening sequences called introns. The beginning of the first exon and the end of the last exon are untranslated.



The Five Prime Untranslated Region (5’ UTR) is the part of the mature mRNA immediately upstream of the coding sequence. It may contain introns. Before translation can start, the ribosome binds to the modified 5’ end of the 5’ UTR after export to the cytoplasm.




Coding sequences (CDS) of DNA are the sections of exons that are ultimately translated by ribosomes into amino acid sequences once the mature mRNA is exported to the cytoplasm. The protein coding region of DNA begins with the start codon ATG and ends with one of the stop codons TAG, TGA, or TAA. Note that the process of transcription copies Ts (thymines) as Us (uracils).



Introns are non-coding regions between exons. Introns usually begin with GT and end in AG (or the reverse complements on the opposite strand) in what is known as the “GT/AG rule”. They are excised from pre-mRNAs in the nucleus by proteins that recognize these sequences. Their excision is part of the pre-mRNA processing step that also includes adding a 5’ cap (a modified guanine) to the mRNA and a 3’ poly-A tail (a long stretch of As) for message stability.



The Three Prime Untranslated Region (3’ UTR) is the region of DNA after the stop codon in a gene. Once it is transcribed, a poly-A tail is added that helps to stabilize the mRNA. These regions sometimes have binding sites for microRNAs that, when bound, mark the mRNA for breakdown.



Intergenic DNA is the DNA between genes. There are still recognizable DNA elements in intergenic DNA:

o Enhancers and silencers are sequence elements in the DNA that are more distant from the transcription start point than promoters yet still regulate transcription by determining what kinds of molecules will be able to bind to the DNA. “Regulation” refers to turning transcription “up or down”.

o Short tandem repeats (a.k.a. microsatellites) are stretches of repeated nucleotide motifs from one to 30 bp. For example, the sequence ACACACACACACAC contains the motif AC repeated seven times.

o Dispersed repeats are repeats that do not occur in tandem. These are often ancient viral DNA elements that have incorporated themselves into other organisms’ genomes, have made copies of themselves over the course of evolutionary time (10s of millions of years), and have lost their ability to become full viruses and infect their host. Dispersed repeats can also be found within introns of genes and must be identified to keep gene finders from characterizing them as exons belonging to native genes.

o Tandem repetitive DNA can serve important purposes such as protecting the ends of chromosomes (telomeres) and serving as binding sites during cell division near the constriction point of a chromosome (centromere).

o Most genome sequences are also erroneously contaminated by bacterial and human genes. This illustrates a major point of scientific research: researchers must be constantly “on guard” for a myriad of problems that obscure the truth.



One gene can encode different proteins in different cell types by including or excluding different exons in a process known as alternative splicing. Different cell types are formed during development of the organism in a process called differentiation. After differentiation, cell types have different proteins present that direct the cell to use the DNA in a way that suits the function of the cell. Exons are mixed to form alternative mRNA products.
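Three of the details listed above (antiparallel strands, the GT/AG rule, and short tandem repeats) can be checked with a few lines of code. The following is a minimal Python sketch; the function names are illustrative only and do not come from any annotation tool used in this manual.

```python
# Illustrative helpers; names are hypothetical, not from any library.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the partner strand, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def follows_gt_ag(intron: str) -> bool:
    """True if an intron (written 5' to 3') begins with GT and ends with AG."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def longest_tandem_run(seq: str, motif: str) -> int:
    """Largest number of back-to-back copies of `motif` found in `seq`."""
    if not motif:
        raise ValueError("motif must not be empty")
    seq, motif = seq.upper(), motif.upper()
    best = 0
    for start in range(len(seq)):
        copies, pos = 0, start
        # Count consecutive copies of the motif beginning at this position.
        while seq.startswith(motif, pos):
            copies += 1
            pos += len(motif)
        best = max(best, copies)
    return best


top_strand = "ATGGCGTTGCCATACCCGCATCCCTGAT"  # the 5'-to-3' strand shown above
print(reverse_complement(top_strand))
print(longest_tandem_run("ACACACACACACAC", "AC"))
```

Reading the output of `reverse_complement` backwards gives the lower strand exactly as it is printed in the 3'-to-5' direction above.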



How DNA becomes a protein

[Figure: A DNA gene (promoter, 5’ UTR, CDS in Exon 1, intron, CDS in Exon 2, 3’ UTR; ATG… …TGA) is transcribed into pre-mRNA (AUG… …UGA). RNA processing removes the introns and adds the 5’ cap and poly-A tail to give the mature mRNA (5’ cap, AUG… …UGA, poly-A tail), which is translated into protein.]


Translation:

mRNA codes for amino acids (the building blocks of proteins) in stretches of three nucleotides called codons. Each codon specifies either an amino acid or a stop signal that ends translation. tRNAs carry specific amino acids and interact with ribosomes to deliver the specified amino acid to the growing chain of amino acids (called a polypeptide). The translation of nucleotide sequences into amino acid sequences follows a standard genetic code (below).


Use the genetic code to translate the following DNA: ATG TTG CGA TGA
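For readers who want to check a hand translation, the same lookup can be sketched in Python. The codon table below is the standard genetic code written with one-letter amino acid abbreviations; the function names and the example sequence are illustrative assumptions, not part of the exercise.

```python
from itertools import product

# Standard genetic code, with codons ordered by base T, C, A, G at each position.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Copy the coding strand into mRNA: every T becomes a U."""
    return dna.upper().replace("T", "U")


def translate(coding_dna: str) -> str:
    """Translate codon by codon from the first base, stopping at a stop codon."""
    coding_dna = coding_dna.upper()
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        amino_acid = CODON_TABLE[coding_dna[i:i + 3]]
        if amino_acid == "*":
            break  # a stop codon does not add an amino acid
        protein.append(amino_acid)
    return "".join(protein)


example = "ATGGCCAAGTAA"  # made-up coding sequence
print(transcribe(example))
print(translate(example))
```

The table is written in DNA letters so it can be applied directly to genome sequence; looking up an mRNA codon only requires swapping each U back to T.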


Gene Annotation:

Long stretches of sequence that can be translated into amino acids without encountering a stop codon (known as open reading frames), start codons, intron/exon boundaries, and other gene-specific features form clues as to where genes are located in eukaryotic genomes. Finding genes in prokaryotic genomes is much easier because there are no introns and there is little intergenic DNA. For eukaryotes, annotating tens of thousands of genes by hand is an extremely time-consuming process, in particular because DNA has to be examined forwards and backwards, and from different starting points. Initial gene prediction is now accomplished by computers, but these predictions still need to be hand-checked for quality by biologists for genes of highest interest.
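To illustrate why DNA has to be examined forwards and backwards and from different starting points, here is a minimal Python sketch of an open reading frame search over all six reading frames. It is a teaching simplification under stated assumptions: it ignores introns entirely, so it is not how eukaryotic gene predictors work, and the function names and coordinate conventions are choices made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 3):
    """Return (strand, frame, start, end) for every ATG...stop stretch.

    Coordinates are 0-based, end-exclusive, and measured on the strand
    that was read ("+" is the given sequence, "-" its reverse complement).
    """
    orfs = []
    forward = seq.upper()
    for strand, s in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):  # three starting points per strand
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the frame
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGGCCAAGTAACC"))
```

In a real genome, `min_codons` would be set much higher, since short open reading frames occur often by chance.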





Day 1 Worksheet

Gene Annotation Lab Name: ________________________

The following is the nucleotide sequence of the human β-globin gene. Gene regions are indicated as follows: Untranscribed, Transcribed but not translated, Transcribed and Translated, Conserved Sequence. Nucleotide positions in the DNA are shown in the right hand margin.


ATATCTTAGA GGGAGGGCTG AGGGTTTGAA GTCCAACTCC TAAGCCAGTG  1-50
CCAGAAGAGC CAAGGACAGG TACGGCTGTC ATCACTTAGA CCTCACCCTG  51-100
TGGAGCCACA CCCTAGGGTT GGCCAATCTA CTCCCAGGAG CAGGGAGGGC  101-150
AGGAGCCAGG GCTGGGCATA AAAGTCAGGG CAGAGCCATC TATTGCTTAC  151-200
ATTTGCTTCT GACACAACTG TGTTCACTAG CAACCTCAAA CAGACACCAT  201-250
GGTGCACCTG ACTCCTGAGG AGAAGTCTGC CGTTACTGCC CTGTGGGGCA  251-300
AGGTGAACGT GGATGAAGTT GGTGGTGAGG CCCTGGGCAG GTTGGTATCA  301-350
AGGTTACAAG ACAGGTTTAA GGAGACCAAT AGAAACTGGG CATGTGGAGA  351-400
CAGAGAAGAC TCTTGGGTTT CTGATAGGCA CTGACTCTCT CTGCCTATTG  401-450
GTCTATTTTC CCACCCTTAG GCTGCTGGTG GTCTACCCTT GGACCCAGAG  451-500
GTTCTTTGAG TCCTTTGGGG ATCTGTCCAC TCCTGATGCT GTTATGGGCA  501-550
ACCCTAAGGT GAAGGCTCAT GGCAAGAAAG TGCTCGGTGC CTTTAGTGAT  551-600
GGCCTGGCTC ACCTGGACAA CCTCAAGGGC ACCTTTGCCA CACTGAGTGA  601-650
GCTGCACTGT GACAAGCTGC ACGTGGATCC TGAGAACTTC AGGGTGAGTC  651-700
TATGGGACCC TTGATGTTTT CTTTCCCCTT CTTTTCTATG GTTAAGTTCA  701-750
TGTCATAGGA AGGGGAGAAG TAACAGGGTA CAGTTTAGAA TGGGAAACAG  751-800
ACGAATGATT GCATCAGTGT GGAAGTCTCA GGATCGTTTT AGTTTCTTTT  801-850
ATTTGCTGTT CATAACAATT GTTTTCTTTT GTTTATTCTT GCTTTCTTTT  851-900
TTTTTCTTCT CCGCAATTTT TACTATTATA CTTAATGCCT TAACATTGTG  901-950
TATAACAAAA GGAAATATCT CTGAGATACA TTAAGTAACT TAAAAAAAAA  951-1000
CTTACACAGT CTGCCTAGTA CATTACTATT TGGAATATAT GTGTGCTTAT  1001-1050
TTGCATATTC ATAATCTCCC TACTTTATTT TCTTTTATTT TTAATTGATA  1051-1100
CATAATCATT ATACATATTT ATGGGTTAAA GTGTAATGTT TTAATATGTG  1101-1150
TACACATATT GACCAAATCA GGGTAATTTT GCATTTGTAA TTTTAAAAAA  1151-1200
TGCTTTCTTC TTTTAATATA CTTTTTGTTT ATCTTATTTC TAATACTTTC  1201-1250
CCTAATCTCT TTCTTTCAGG GCAATAATGA TACAATGTAT CATGCCTCTT  1251-1300
TGCACCATTC TAAAGAATAA CAGTGATAAT TTCTGGGTTA AGGCAATAGC  1301-1350
AATATTTCTG CATATAAATA TTTCTGCATA TAAATTGTAA CTGATGTAAG  1351-1400
AGGTTTCATA TTGCTAATAG CAGCTACAAT CCAGCTACCA TTCTGCTTTT  1401-1450
ATTTTATGGT TGGGATAAGG CTGGATTATT CTGAGTCCAA GCTAGGCCCT  1451-1500
TTTGCTAATC ATGTTCATAC CTCTTATCTT CCTCCCACAG CTCCTGGGCA  1501-1550
ACGTGCTGGT CTGTGTGCTG GCCCATCACT TTGGCAAAGA ATTCATCCCA  1551-1600
CCAGTGCAGG CTGCCTATCA GAAAGTGGTG GCTGGTGTGG CTAATGCCCT  1601-1650
GGCCCACAAG TATCACTAAG CTCGCTTTCT TGCTGTCCAA TTTCTATTAA  1651-1700
AGGTTCCTTT GTTCCCTAAG TCCAACTACT AAACTGGGGG ATATTATGAA  1701-1750
GGGCCTTGAG CATCTGGATT CTGCCTAATA AAAAACATTT ATTTTCATTG  1751-1800
CAATGATGTA TTTAAATTAT TTCTGAATAT TTTACTAAAA AGGGAATGTG  1801-1850
GGAGGTCAGT GCATTTAAAA CATAAAGAAA TGATGAGCTG TTCAAACCTT  1851-1900
GGGAAAATAC ACTATATCTT AAACTCCATG AAAGAA                 1901-1936


1. Without looking, redraw the sketch of a eukaryotic gene, pre-mRNA, and mature mRNA in the space below. Then correct your own work.

Gene




Pre-mRNA




Mature mRNA




2. Circle and label the following in the sequence of β-globin (opposite page) and in the sketch of the gene above.

A. Transcription start point

B. Transcription end point

C. Translation start point

D. Translation end point


3. Explain what evidence (i.e. signals in the DNA sequence) you used to support your answers to (2).






4. Based on the sequence information provided above, draw a sketch of the human β-globin gene below, labeling the promoter, 5’ UTR, coding sequences, introns, and 3’ UTR. Make sizes roughly proportional to the sequence itself.








5. What would be the first three and the last three amino acids produced from the mRNA transcribed from this
gene? Note: a stop codon does not produce an amino acid.






6. Is a mutation more likely to disrupt the function of the protein produced if it occurs in an intron or coding sequence? Explain.


Lab 2. What can comparison of DNA sequences tell us about evolution?

Goal: To give an understanding of how DNA sheds light on evolution.

Species evolve over time. Evolution is the consequence of the interactions of 1) the potential for a species to increase its numbers, 2) the genetic variability of offspring due to mutation and recombination of genes, 3) a finite supply of the resources required for life, and 4) the ensuing selection by the environment of those offspring better able to survive and leave offspring. The great diversity of organisms is the result of more than 3.5 billion years of evolution that has filled every available niche with life forms. Natural selection and its evolutionary consequences provide a scientific explanation for the fossil record of ancient life forms as well as for the striking molecular similarities observed among the diverse species of living organisms.

The millions of different species of plants, animals, and microorganisms that live on earth today are related by descent from common ancestors. Biological classifications are based on how organisms are related. Organisms are classified into a hierarchy of groups and subgroups based on similarities which reflect their evolutionary relationships.

National Science Education Standards, p. 185


Whole genome sequences from many different species have been aligned by researchers. Now that you’ve gained some experience in understanding the fine structure of a gene, let’s see what the same gene, beta-globin, looks like when compared among many different species. Do you think it is possible for mutations at a single gene to reveal phylogenetic relationships among vertebrates?


1. Go to the UCSC genome browser web page at genome.ucsc.edu, and select Genomes.

2. Select the clade “Mammal”, genome “Human”, clear all text from “position or search term,” and enter the abbreviation for human beta globin, “HBB.”

3. You will get a list of search results; choose HBB: Homo sapiens Hemoglobin Beta. You will now see a close-up view of the beta-globin gene in humans. Some sections on your screen may look different than below, but the top should be similar. Find the browser navigation tools, the exact location of the gene on chromosome 11, the sketch of the gene, and the miscellaneous information tracks below the gene sketch.

4. The beta-globin gene is in the reverse orientation in this view of the chromosome (arrows on the sketch point to the left). Before looking deeper, scroll down and reverse your orientation of the gene so that it matches the orientation of the gene from the hand-annotation exercise you performed earlier (Lab 1, Question 4).

5. Scroll up to the picture and see if you can reconcile the genome browser’s gene sketch with your hand sketch of this gene. How are UTRs, coding sequences, and introns depicted in the browser? Roughly redraw the sketch below, and label the major pieces.


6. Scroll down to the Comparative Genomics Track controls and adjust the settings so that “conservation” is set to “full.” This will give your browser the most expanded view of the alignment among many different species, which allows you to see how similar (i.e. conserved) each nucleotide is among many different vertebrates. Then hit “refresh” and scroll back up to view the changes.

7. Under “Multiz alignment of 46 (your number might differ) species” you will see vertical bars corresponding to each nucleotide location in the genome. The taller the bars, the more conserved the DNA sequence is at that location in a pairwise comparison of the human versus the species identified on the left of the screen.

8. To complete the lab, continue from this point to answer the questions on the “Evolution Worksheet” below.
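The idea behind the conservation bars in step 7 can be illustrated with a simple percent-identity calculation. The Python sketch below is a much simpler stand-in and is not the method the genome browser uses, which relies on its own statistical scoring. The species labels and sequences are invented for illustration and are not real alignment data.

```python
def pairwise_identity(reference: str, other: str) -> float:
    """Percent of aligned, ungapped positions where two sequences match."""
    if len(reference) != len(other):
        raise ValueError("aligned sequences must have the same length")
    compared = [(a, b) for a, b in zip(reference, other) if a != "-" and b != "-"]
    if not compared:
        return 0.0
    return 100.0 * sum(a == b for a, b in compared) / len(compared)


def column_identity(alignment: dict, reference_name: str) -> list:
    """For each column, percent of the other sequences that match the reference."""
    reference = alignment[reference_name]
    others = [seq for name, seq in alignment.items() if name != reference_name]
    return [
        100.0 * sum(seq[i] == reference[i] for seq in others) / len(others)
        for i in range(len(reference))
    ]


# Toy alignment with made-up sequences (not real species data).
toy_alignment = {
    "human": "CAGGTTGG",
    "species_A": "CAGGTTGG",
    "species_B": "CAGGTAGG",
    "species_C": "CTGGTCAG",
}

for name, seq in toy_alignment.items():
    if name != "human":
        print(name, pairwise_identity(toy_alignment["human"], seq))
print([round(x) for x in column_identity(toy_alignment, "human")])
```

A column where every species matches the reference scores 100, which corresponds loosely to a tall bar in the browser view.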





Evolution Worksheet


Name: ________________________


1. Which species on your screen looks most similar to the human sequence? Does that make phylogenetic sense? Explain.

2. Does it look like some regions of the gene are more conserved than others?

a. What evidence supports your answer?

b. Which regions of the gene appear more conserved?

c. Why might that be the case?

3. Zoom in all the way to the “base level” from chromosome 11:5,247,971-5,248,171. This view shows the location of the first exon/intron boundary.

a. Does the intron follow the GT/AG “rule”? Explain.

b. Based on the species you see in the browser viewer, what percent similar are the sequences in each of the first four nucleotides of the intron?

c. What could explain the variation in conservation you see in (b)?

4. Write down the amino acid sequence for the six amino acids before the intron begins.

a. Are more phylogenetically similar species, like mammals, more similar to each other than they are to fishes?

b. At what % of sites are all species in the group identical? What about mammals? Humans vs. fish?

c. What could explain this?

d. Do you think this pattern is something unique to beta globin, or is it a general property of these species’ genomes? Support your idea by going to the HBD (Homo sapiens hemoglobin delta) gene and repeating the analysis.


Lab 3: How can I use bioinformatics to annotate genes?

Goal: To give an understanding of how raw data and research help us predict genes in a genome.

Bioinformatics:

Bioinformatics is a branch of biological science that uses computers to analyze biological data. Bioinformatics normally deals with very large datasets (sometimes in excess of 500 GB) that would be nearly impossible to generate and manage without the use of supercomputers. Recent advances in sequencing technology are allowing more genomes to be sequenced each year. However, it takes ten times as long to find genes and describe their functions (i.e. annotate a genome) as it does to sequence a genome. As a result, there is a widening gap between sequenced genomes and annotated, searchable genomes in publicly available databases.


The Maker Annotation Pipeline

Website does not work anymore

Maker is a genome annotation pipeline that seeks to close this gap. A pipeline is a series of programs working together, much as you follow the different steps in a protocol to dissect a frog and label its organs. Maker was optimized for use on non-model organisms. Much has been learned about biology in the last fifty years from model species like the fruit fly (Drosophila spp.), nematode worm (C. elegans), baker’s yeast (Saccharomyces cerevisiae), and bacteria (Escherichia coli) that have small genomes, are easy to raise, have short life cycles, show a readily observed connection between genotype and phenotype, and can be easily manipulated genetically. However, because sequencing costs have dramatically decreased in the last decade, genomes from species with interesting phenotypes are now being sequenced at increasing rates. The Maker pipeline combines two kinds of information to make structural gene annotations from raw DNA: i) intrinsic signals in the organism’s DNA, which are found by ab initio (from scratch) gene predictors, and ii) extrinsic evidence, which is evidence supplied to Maker based on similarity of genomic regions to other organisms’ mRNA (also known as expressed sequence tags, or ESTs) and protein sequences.




















The first step of the multi-step Maker gene annotation pipeline involves finding repetitive DNA and labeling (i.e. “masking”) the genomic DNA using the program RepeatMasker. Repeat masking is important in order to prevent inserted viral exons from being counted as fish exons. A repeat library was developed specifically from the Sebastes rubrivinctus genome. The ab initio gene predictor SNAP detects the intrinsic signals in the DNA and uses them to model genes. Because “what a gene looks like” differs significantly among genomes, gene finders must be “trained” to know what the signals look like in each new species sequenced. The program BLAST (Basic Local Alignment Search Tool) is used to find similarity of genome sequence to public mRNA and protein sequences. There are different kinds of BLAST searches depending on what kinds of sequences (DNA, RNA, protein) are being compared. The program Exonerate helps to polish up gene annotations, since “local” alignments of sequences end wherever similarity between sequences begins to decrease. The final annotations predicted by Maker are those that are supported by both kinds of information, intrinsic signals and extrinsic evidence.
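Two ideas from this description, masking repeats and keeping only predictions that have outside support, can be sketched in a few lines of Python. This is a conceptual illustration under simplifying assumptions and not how Maker or RepeatMasker is implemented; the function names, coordinates, and intervals are all hypothetical.

```python
def mask_repeats(sequence: str, repeat_intervals) -> str:
    """Lowercase the bases inside each (start, end) repeat interval.

    Intervals are 0-based and end-exclusive. Writing masked bases in
    lowercase is one common convention for marking repeats.
    """
    bases = list(sequence.upper())
    for start, end in repeat_intervals:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower()
    return "".join(bases)


def overlaps(a, b) -> bool:
    """True if two (start, end) intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def supported_predictions(predicted_exons, evidence_alignments):
    """Keep predicted exons (intrinsic) that overlap some alignment (extrinsic)."""
    return [
        exon
        for exon in predicted_exons
        if any(overlaps(exon, hit) for hit in evidence_alignments)
    ]


print(mask_repeats("ACGTACGTAC", [(2, 5)]))

# Hypothetical coordinates on a scaffold.
predicted = [(10, 50), (100, 150), (300, 340)]
evidence = [(40, 120)]
print(supported_predictions(predicted, evidence))
```

In this toy example the third predicted exon has no overlapping evidence, so it would be dropped under the "supported by both" rule.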

[Figure: the MAKER pipeline. The assembled genome is processed by RepeatMasker to give the masked genome. Intrinsic signals come from the gene predictors (rockfish-trained SNAP and Augustus). Extrinsic evidence comes from proteins and ESTs, with Exonerate used to improve alignment quality. MAKER annotations are those supported by both intrinsic and extrinsic evidence.]


The object of this exercise is to use Maker, a cutting-edge research tool, to predict genes from a small section of the Sebastes rubrivinctus genome using current methodologies.


1) Retrieve Scaffold folder from public drive.

2) Go to http://blast.ncbi.nlm.nih.gov/

3) Click on the link to BLASTx.

4) Upload the scaffold.

5) Change the database to UniProtKB/Swiss-prot (swissprot).

6) Click the BLAST button. BLAST will take several minutes to run. It will search the Swissprot database for matches to your query based on local sequence similarity.

7) Select all of the sequences, then click “Get selected sequences.”

8) Select view as fasta and select the first 10 results.

9) Copy the results to a file and save as ProteinEvidence.fasta.

Thought Question: What protein appears most in the search results? Do a quick internet search: what does this protein seem to do?

10) Go to http://blast.ncbi.nlm.nih.gov/

11) Click on the link to tBLASTx.

12) Upload the scaffold.

13) Change the database to expressed sequence tags (EST). Enter “Sebastes” in the Organism line.

14) Click the BLAST button. This may take several minutes.

15) Select 10 of the sequences, 2 from each “column” of alignments in the picture at the top. Clicking on any of the lines under the long bar will take you straight to that entry; check the box. Then click “Get Selected Sequences”.

16) Select view as fasta and select the first 15 results.

17) Copy the results to a file and save as ESTEvidence.fasta.

18) Go to http://derringer.genetics.utah.edu/cgi-bin/MWAS/maker.cgi

19) Click new guest account. Remember to write your guest number down.



20) Click on the manage files link next to new job.

21) Upload the scaffold file, the protein evidence, and the EST evidence as FASTA files.

22) Upload the HMM file (on public drive) as a SNAP HMM file.

23) Go back to the new jobs tab.

24) Upload the Sequence file in the “Choose a genome fasta file” menu.

25) Upload the EST file as ESTs from a related organism.

26) Upload the Protein file.

27) Upload the SNAP file.

28) Set “Consider single exon EST evidence when generating annotations” to yes.

29) Click “Add Job to Queue”.

30) Once Maker has finished, click the icon in the view results tab.



32) Click the “View in Apollo” button.

33) Select “open with Java Web Start Launcher”.

34) Select run to launch Apollo.

35) Right click the Maker annotated gene and select “Sequence”.

36) Select Peptide Sequence and highlight the sequence.

37) Go to http://blast.ncbi.nlm.nih.gov/

38) Select protein BLAST.

39) Paste in the sequence, change the database to Uniprot/Swissprot, and click BLAST.

40) Determine the identity of the gene based on the most similar, significant BLAST hit. The E-value of the “query” sequence you entered against a database “subject” is the number of hits with an equal or better score that would be expected by chance in a database of that size, so the smaller the value, the less likely it is that the hit was random. Values less than 10 x 10^-10 are considered reliable indicators of significant similarity.

41) Complete the Worksheet below to finish the lab.
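The selection rule in step 40 (keep only hits under the E-value cutoff, then take the strongest one) can be written out as a short Python sketch. The hit names and E-values below are invented for illustration, and the function is a hypothetical helper, not part of BLAST.

```python
# Cutoff taken from step 40 of this lab: 10 x 10^-10.
E_VALUE_CUTOFF = 10e-10


def best_significant_hit(hits, cutoff=E_VALUE_CUTOFF):
    """Return the (description, e_value) hit with the lowest E-value below
    the cutoff, or None if no hit is significant."""
    significant = [hit for hit in hits if hit[1] < cutoff]
    if not significant:
        return None
    return min(significant, key=lambda hit: hit[1])


# Made-up example hits: (subject description, E-value).
example_hits = [
    ("subject_A", 3e-42),
    ("subject_B", 1e-12),
    ("subject_C", 0.5),
]

print(best_significant_hit(example_hits))
```

A hit like `subject_C`, with an E-value of 0.5, would be discarded because a match that good is expected by chance about once in every two searches of that size.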


Want more?


Have the students annotate the genes with and without the ab initio gene finder, the EST evidence, or the protein evidence to see which most strongly affects the resulting protein sequence. Measure percent overlap of different gene annotations to see how similar they are.
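One possible way to measure percent overlap is sketched below in Python: count the bases covered by both annotations and divide by the bases covered by either. This definition is an assumption made for the example, since the manual does not specify one, and the exon coordinates are hypothetical.

```python
def covered_bases(exons) -> set:
    """Set of base positions covered by a list of (start, end) exons.

    Coordinates are 0-based and end-exclusive.
    """
    bases = set()
    for start, end in exons:
        bases.update(range(start, end))
    return bases


def percent_overlap(annotation_a, annotation_b) -> float:
    """Shared bases as a percentage of all bases covered by either annotation."""
    a, b = covered_bases(annotation_a), covered_bases(annotation_b)
    union = a | b
    if not union:
        return 0.0
    return 100.0 * len(a & b) / len(union)


# Hypothetical exon coordinates from two annotation runs of the same scaffold.
with_est_evidence = [(100, 200), (500, 650)]
without_est_evidence = [(100, 200), (520, 650)]
print(round(percent_overlap(with_est_evidence, without_est_evidence), 1))
```

Building sets of positions is simple and adequate for classroom-sized scaffolds; very long scaffolds would call for interval arithmetic instead.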


Gene Annotation Statistics Worksheet

Name:_____________________________

1) Paste a screen shot of your Apollo result in the space below:

2) How many genes were annotated by MAKER in the scaffold?

3) Of the predicted genes, if any, which appear complete, that is, beginning with ATG and ending with a stop codon?

4) How many exons are in each gene supported by MAKER?

5) How long are the introns on average, and how many are there per gene?

6) Do all introns follow the GT/AG rule?

7) For which genes were UTRs predicted? Name the genes and UTR types.

8) Sometimes assembly of genomes from many small pieces results in chimeric sequences (recall from mythology that a chimera is a monster composed of pieces of different animals). Do your BLAST results suggest that the gene MAKER predicted is such a monster?

9) Describe the protein from the strongest protein BLAST hit. Can you assign a putative function to the gene based on this information?

10) Research how the gene is important to the organism’s development, reproduction, and/or survival.
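The exon and intron counts asked for in this worksheet can be derived from exon coordinates. The Python sketch below assumes a gene is given as a list of (start, end) exon intervals on one strand; the coordinates are hypothetical and the helper names are invented for illustration.

```python
from statistics import mean


def introns_from_exons(exons):
    """Introns are the gaps between consecutive exons, sorted by position."""
    ordered = sorted(exons)
    return [
        (previous_end, next_start)
        for (_, previous_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def gene_stats(exons) -> dict:
    """Exon count, intron count, and mean intron length for one gene."""
    introns = introns_from_exons(exons)
    lengths = [end - start for start, end in introns]
    return {
        "exon_count": len(exons),
        "intron_count": len(introns),
        "mean_intron_length": mean(lengths) if lengths else 0.0,
    }


# Hypothetical three-exon gene (0-based, end-exclusive coordinates).
example_gene = [(100, 200), (500, 650), (900, 1000)]
print(gene_stats(example_gene))
```

A single-exon gene has no introns, so its mean intron length is reported as 0.0 here.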





Lab 4. What does a typical rockfish gene look like?

Goal: Use skills learned in previous labs and classes to complete a gene annotation using an annotation pipeline.

Exercise: Each student should take a scaffold and repeat the annotation process above. There are 50 random scaffolds provided in Appendix D (ask your instructor for the file location). Two students should independently annotate the same scaffold. Each individual should complete the worksheet with their scaffold. Compare your results to those of another student who annotated the same scaffold.

Presentation: Each pair of students with the same scaffold will give a 15-minute presentation on their annotations. Discuss your worksheet results, annotations, the features of those annotations, what the predicted genes are, and why they are important. Be sure to support your points with citations and evidence and to include tables and figures where appropriate. The scaffolds may contain single-exon genes, multi-exon genes, one gene per scaffold, more than one gene per scaffold, or simply no genes. As a group, discuss what the general features of all of the scaffolds say about eukaryotic gene structure. Can you classify different kinds of rockfish genes based on the results you’ve discovered?





Epilogue

As of the writing of this manual, scaffolds containing age-related genes have not yet been identified in S. rubrivinctus or its sister species, which lives 10X longer. Inquire with the author to see if these are now available: buonaccorsi@juniata.edu. There are hundreds of candidate aging genes that are found in both humans and rockfishes. Students could annotate the same gene from the two species to see if there are differences that represent a “smoking gun” that might explain negligible senescence in the tiger rockfish!