Genome annotation with Maker:
Gene finder training
Finding and characterizing genes in eukaryotic genomes is still a difficult and unsolved problem.
Genomes differ enough from one another that accurate gene finding necessitates training the gene finders
for the genome of interest, unless gene models from close relatives are available
for the gene finders you are using.
Gene Finder Training
An important step in genome annotation is optimizing the gene finders that will be used to annotate the genome.
Maker employs annotation from SNAP and Augustus (Stanke & Waack 2003; Stanke et al. 2006)
as part of its gene annotation pipeline. Gene finders are trained using a set
(several hundred) of high quality gene annotations.
We will generate a high quality gene set from the longest scaffolds available for the genome,
and download protein hits as a multifasta file. From these we will produce a first set
of gene annotations (in Maker) that combines the blastx hits and general knowledge of
splice sites, start and stop codons.
SNAP’s fathom algorithm will be used to filter out incomplete genes, and a hidden Markov model (HMM)
file will be generated that summarizes what a typical gene looks like from your pilot data.
The HMM file is used by SNAP in the Maker pipeline to predict genes.
Augustus can be trained on the web using the long scaffolds and homologous proteins.
Train Augustus by using web training
Generate gene training files for a novel genome
Run Maker in parallel on a high performance cluster for a whole genome
V&C core competencies addressed
Ability to use quantitative reasoning: Applying statistical methods to diverse data,
Mathematical modeling, Managing and analyzing large data sets
Use modeling and simulation to understand complex biological systems: Computational
modeling of dynamic systems, Applying informatics tools, Managing and analyzing large data
sets, Incorporating stochasticity into biological models
SEEK sequencing requirements
Computer/program requirements for data analysis
Linux OS, SNAP
or Juniata’s HHMI Cluster
This protocol assumes you are using:
A multifasta file named genome.fa containing the longest scaffolds in your genome
The custom repeat library created in the RepeatScout module (genome.secondfilter.lib)
A protein file named proteins.fa
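The genome.fa input above holds only the longest scaffolds. One way to pull them out of a full assembly is sketched below in Python. This is a minimal sketch, not part of the Maker toolchain: the function names (read_fasta, longest_scaffolds, write_fasta) and the choice of how many scaffolds to keep are assumptions made for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_scaffolds(records, n):
    """Return the n longest (header, sequence) records, longest first."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)[:n]


def write_fasta(records, handle, width=60):
    """Write records as multifasta, wrapping sequence lines at `width` bases."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Small in-memory demonstration (hypothetical scaffold names and sequences).
demo = [">scaf1", "ACGT", ">scaf2", "ACGTACGT", "AC", ">scaf3", "A"]
top_two = longest_scaffolds(read_fasta(demo), 2)
```

To use it on a real assembly, open the assembly file, pass the file handle to read_fasta, and write the selected records to a file named genome.fa with write_fasta.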
Running any of the listed commands without any additional
input (or running them with the help option)
will display the help message including a description of the program, the usage, and all options.
1) Generate the Maker control files
maker -CTL
This will generate three control files (maker_opts.ctl, maker_bopts.ctl, and maker_exe.ctl).
2) The maker_opts.ctl file contains all of the performance options for Maker.
For this run, edit the following lines (for example with nano maker_opts.ctl):
Line 2: genome=genome.fa
Line 27: rmlib=genome.secondfilter.lib
Line 41: protein2genome=1
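The line numbers above may not match your copy of maker_opts.ctl, so it can be safer to set options by name. The Python sketch below does that, assuming each option sits on its own line as key=value followed by an optional # comment. The function name set_ctl_options is hypothetical and not part of Maker.

```python
import re


def set_ctl_options(text, updates):
    """Return control-file text with key=value set for every key in `updates`.

    Trailing '#' comments on each option line are preserved. A KeyError is
    raised if any requested option is not present in the file.
    """
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        match = re.match(r"^(\w+)=([^#]*)(#.*)?$", line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            comment = match.group(3) or ""
            new_line = f"{key}={remaining.pop(key)}"
            if comment:
                new_line += " " + comment
            out.append(new_line)
        else:
            out.append(line)
    if remaining:
        raise KeyError(f"options not found: {sorted(remaining)}")
    return "\n".join(out) + "\n"


# Demonstration on a shortened, illustrative control file (not the full file).
sample = (
    "#-----Genome\n"
    "genome= #genome sequence\n"
    "rmlib= #repeat library\n"
    "protein2genome=0 #infer predictions from protein homology\n"
)
edited = set_ctl_options(
    sample,
    {"genome": "genome.fa", "rmlib": "genome.secondfilter.lib", "protein2genome": 1},
)
```

Raising an error for a missing option is a deliberate choice here: a silently skipped setting would otherwise only show up later as an unexpected Maker run.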
3) Run protein2genome on the genome.
If you are running on multiple processors on your personal computer, specify the number of
processors (N) and run Maker using:
n N maker
If running on a cluster, edit a
Qsub script to distribute the job to a worker node, using the
command line above, and then run Qsub as described in Module 3.
If only running on a single processor:
Filter out incomplete genes, generate an HMM file
1) After Maker has finished, the gff files need to be
compiled into a single file.
2) Make a new directory for training SNAP and put the gff in there.
mkdir SNAPTraining
mv genome.all.gff SNAPTraining
3) Convert the gff to zff (the file format that SNAP uses). Maker's maker2zff script
will create two new files: genome.ann and genome.dna
4) Fathom and forge will be used to train SNAP. Fathom has several subroutines that help filter
and analyze gene annotations. First, examine the annotations to get an impression of what they contain.
fathom -validate genome.ann genome.dna
fathom -gene-stats genome.ann genome.dna
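To see what kind of information the annotation file carries, the Python sketch below counts sequences, genes, and exons in ZFF text. It is not a reimplementation of fathom. It assumes the short ZFF layout, in which a >sequence header is followed by lines of the form label begin end gene-name. The function name summarize_zff and the example records are hypothetical.

```python
from collections import Counter


def summarize_zff(lines):
    """Count sequences, genes, and exons in ZFF annotation lines.

    Assumes the short ZFF layout: '>name' headers followed by feature lines
    with at least four whitespace-separated fields (label, begin, end, gene).
    """
    sequences = 0
    exons_per_gene = Counter()
    current_seq = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith(">"):
            current_seq = fields[0][1:]
            sequences += 1
        elif len(fields) >= 4:
            # Key on (sequence, gene) so equal gene names on different
            # scaffolds are not merged.
            exons_per_gene[(current_seq, fields[3])] += 1
    genes = len(exons_per_gene)
    exons = sum(exons_per_gene.values())
    return {
        "sequences": sequences,
        "genes": genes,
        "exons": exons,
        "mean_exons_per_gene": exons / genes if genes else 0.0,
    }


# Illustrative records only; real input would be the lines of genome.ann.
example = [
    ">scaffold1",
    "Einit 201 325 gene1",
    "Eterm 2175 2319 gene1",
    ">scaffold2",
    "Esngl 900 100 gene2",
]
summary = summarize_zff(example)
```

A gene set with very few genes, or with almost all single-exon genes, is a hint that the pilot annotations may be too sparse to train on.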
5) Next, split the annotations into five categories: unique genes, warnings, alternatively spliced
genes, overlapping genes, and errors.
fathom -categorize 1000 genome.ann genome.dna
6) Next, export the genes.
fathom -export 1000 uni.ann uni.dna
This will generate four new files (export.aa, export.ann, export.dna, and export.tx).
7) Make a new directory for the new parameters.
8) Generate the new parameters with forge. The command below will run forge on the files in the directory above
(i.e. the parent directory of parameters), but put the output in the working directory.
forge ../export.ann ../export.dna
9) Generate the new HMM (the file that SNAP uses to generate gene predictions) by moving
into the directory above parameters and entering the command that follows.
hmm-assembler.pl genome parameters > genome.hmm
1) Go to
2) In the designated spaces fill in your email and the species name.
3) Upload export.dna as the genome file.
4) Upload export.aa as the protein file.
5) Type in the verification string and click “Start Training”
6) After the Augustus web training finishes, download the directory of results.
If using the HHMI cluster: rename the directory by the species name and copy into
Rename the directory by the species name and copy into
After reviewing and practicing HMM generation (see the reference below), students should be able to do it
by hand on small datasets from an alignment. They could then score a short DNA sequence
for presence of a motif by hand.
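The by-hand exercise can be checked with a few lines of Python. The sketch below builds a position weight matrix from a small gapless alignment and scores a short DNA sequence with it. A position weight matrix is a simpler stand-in for a full HMM: it has no transition probabilities and no insert or delete states. The function names, the pseudocount of 1, the uniform background of 0.25, and the example sites are all assumptions chosen for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(aligned_sites, pseudocount=1.0, background=0.25):
    """Return one dict of log2-odds scores per alignment column.

    Assumes a gapless alignment of equal-length sites containing only
    A, C, G, and T.
    """
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    pwm = []
    for col in range(length):
        counts = {base: pseudocount for base in BASES}
        for site in aligned_sites:
            counts[site[col].upper()] += 1
        total = sum(counts.values())
        pwm.append(
            {base: math.log2((counts[base] / total) / background) for base in BASES}
        )
    return pwm


def best_match(pwm, sequence):
    """Slide the matrix along the sequence and return (best_score, start).

    Returns None when the sequence is shorter than the matrix.
    """
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width].upper()
        score = sum(column[base] for column, base in zip(pwm, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


# Hypothetical aligned motif instances and a short test sequence.
sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
pwm = build_pwm(sites)
score, start = best_match(pwm, "GGCTATAATGC")
```

A positive score means the window looks more like the motif than like background sequence. Students can compare the per-column values with their hand calculations.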
Students should be able to repeat the steps starting with long (> 10x average gene size; see module 1 for
average gene size given genome size) scaffolds from any genome.
Time line of module
1 hour of lab
Discussion topics for class
Relevant background topics include HMM generation. Chapter 16 from
Introduction to Programming Tools for Life Scientists by Tore Samuelsson is a
good reference. The Korf 2004 paper is required reading on this topic.
Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Sánchez Alvarado A, Yandell M. 2008. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes.
Holt C, Yandell M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects.
Korf I. 2004. Gene finding in novel genomes.
Stanke M, Waack S. 2003. Gene prediction with a hidden Markov model and a new intron submodel.
Stanke M, Schöffmann O, Morgenstern B, Waack S. 2006. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources.