Module 8. Genome annotation with Maker: Gene finder training


Dec 2, 2013 (3 years and 8 months ago)


Background

Finding and characterizing genes in eukaryotic genomes is still a difficult and unsolved problem. Genomes differ enough from one another that accurate gene finding requires training the gene-finding algorithms on the genome of interest, unless gene models from close relatives are already available for the gene finders you are using.


Gene Finder Training

An important step in genome annotation is optimizing the gene finders that will be used to annotate the genome. Maker employs annotation from SNAP (Korf 2004) and Augustus (Stanke & Waack 2003; Stanke et al. 2006) as part of its gene annotation pipeline. Gene finders are trained using an initial set (several hundred) of high quality gene annotations.


We will generate a high quality gene set by doing the following:



1) Isolate the longest scaffolds available for the species of interest (done already).

2) Blastx the long scaffolds to NCBI and download the protein hits as a multifasta file (done already).

3) Generate a draft training set of gene annotations by running the protein2genome program (in Maker), which will align these proteins to the scaffolds and construct annotations based on the blastx hits and general knowledge of eukaryotic splice sites, start and stop codons, etc.

4) SNAP's fathom program will be used to filter out incomplete genes.

5) SNAP's forge and hmm-assembler.pl programs will then generate a hidden Markov model (HMM) file that summarizes what a typical gene looks like in your pilot data.

6) The HMM file is used by SNAP in the Maker pipeline to predict genes.

7) Augustus can be trained on the web, using the long scaffolds and homologous proteins, through its web training site.

Goals




Generate gene training files for a novel genome

Run Maker in parallel on a high performance cluster for a whole genome

V&C core competencies addressed


Ability to use quantitative reasoning: Applying statistical methods to diverse data, Mathematical modeling, Managing and analyzing large data sets

Use modeling and simulation to understand complex biological systems: Computational modeling of dynamic systems, Applying informatics tools, Managing and analyzing large data sets, Incorporating stochasticity into biological models

GCAT-SEEK sequencing requirements

None

Computer/program requirements for data analysis

Linux OS, SNAP, Maker, GCAT-SEEK Virtual Machine or Juniata’s HHMI Cluster

Protocols

Generating a Draft Training Set

This tutorial assumes you are using:

A multifasta file named genome.fa containing the longest scaffolds in your genome

The custom repeat library created in the RepeatScout Module

A protein file named proteins.fa

Running any of the listed commands without any additional input (or running them with the -h option) will display the help message, including a description of the program, the usage, and all options.


1) Generate the Maker control files


$ maker -CTL


This will generate three control files (maker_opts.ctl, maker_bopts.ctl, and maker_exe.ctl)


2) The maker_opts.ctl file contains all of the performance options for maker.

For this run, edit the following lines (nano -c maker_opts.ctl)

_______________________________

Line 2: genome=genome.fa

Line 22: protein=proteins.fa

Line 27: rmlib=genome.secondfilter.lib

Line 41: protein2genome=1



_______________________________
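
The line numbers above may not match every Maker version, so it can be safer to set each option by its key. The sketch below assumes every option sits on its own line in the form key=value #comment. set_ctl_opt is a hypothetical helper (not part of Maker), and demo_opts.ctl is a stand-in for the real maker_opts.ctl.

```shell
# set_ctl_opt is a hypothetical helper, not part of Maker.
# Assumes control-file lines look like: key=value #comment
# Values containing '|' or '&' would need escaping for sed.
set_ctl_opt() {
    # usage: set_ctl_opt KEY VALUE FILE
    sed "s|^$1=[^#]*|$1=$2 |" "$3" > "$3.tmp" && mv "$3.tmp" "$3"
}

# Stand-in control file for demonstration only; the real one comes from 'maker -CTL'.
cat > demo_opts.ctl <<'EOF'
genome= #genome sequence
protein= #protein sequence file in fasta format
rmlib= #organism specific repeat library
protein2genome=0 #infer predictions from protein homology
EOF

set_ctl_opt genome genome.fa demo_opts.ctl
set_ctl_opt protein proteins.fa demo_opts.ctl
set_ctl_opt rmlib genome.secondfilter.lib demo_opts.ctl
set_ctl_opt protein2genome 1 demo_opts.ctl

cat demo_opts.ctl
```

To apply this to a real run, call set_ctl_opt on maker_opts.ctl instead of the demo file, then check the result with nano or grep before starting Maker.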


3) Run protein2genome on the genome.


If you are running on multiple processors on your personal computer, specify the number of processors (N) and run Maker using:

$ mpirun -n N maker


If running on a cluster, edit a Qsub script to distribute the job to a worker node, using the command line above, and then run Qsub as described in Module 3.


If only running on a single processor:

$ maker
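
For the cluster case, a submission script might look like the sketch below. It assumes a PBS/Torque scheduler and a single node with 8 cores. The job name, the resource line, and the file name run_maker.qsub are illustrative only, and the script described in Module 3 should take precedence wherever they differ.

```shell
# Assumption: a PBS/Torque scheduler; directives and resource syntax vary by site.
N=8   # hypothetical number of processors to request

# Write the submission script. $N is filled in now; \$PBS_O_WORKDIR is left
# for the scheduler to set when the job runs.
cat > run_maker.qsub <<EOF
#!/bin/sh
#PBS -N maker_p2g
#PBS -l nodes=1:ppn=$N
#PBS -j oe
# Start in the directory the job was submitted from (where the .ctl files live).
cd "\$PBS_O_WORKDIR"
mpirun -n $N maker
EOF

cat run_maker.qsub
# Submit with: qsub run_maker.qsub
```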


Training SNAP: Filter out incomplete genes, generate an HMM file


1) Once Maker has finished, the annotations need to be compiled into a single file.


$ cd genome.maker.output

$ gff3_merge -d genome_master_datastore_index.log -g


2) Make a new directory for training SNAP and put the gff in there.


$ mkdir SNAPTraining

$ mv genome.all.gff SNAPTraining

$ cd SNAPTraining


3) Convert the gff to zff (the file format that SNAP recognizes for training).


$ maker2zff -n genome.all.gff


This will create two new files: genome.ann and genome.dna


4) Fathom and Forge will be used to train SNAP. Fathom has several subroutines that help filter and analyze gene annotations. First, examine the annotations to get an impression of what annotations there are.


$ fathom -validate genome.ann genome.dna

$ fathom -gene-stats genome.ann genome.dna


5) Next, split the annotations into five categories: unique genes, warnings, alternatively spliced genes, overlapping genes, and errors.




$ fathom -categorize 1000 genome.ann genome.dna
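
Since training works from a set of several hundred genes, it can help to count how many genes landed in each category before continuing. The sketch below assumes that ZFF header lines start with ">" and that each feature line ends with its gene (group) name. It also assumes fathom's category files are named uni.ann, wrn.ann, alt.ann, olp.ann, and err.ann. count_zff_genes is a hypothetical helper, and demo_uni.ann is a stand-in file.

```shell
count_zff_genes() {
    # usage: count_zff_genes ANN_FILE
    # Counts distinct (sequence, gene name) pairs; the gene name is assumed
    # to be the last field of each feature line.
    awk '/^>/ { seq = $1; next }
         NF >= 4 { genes[seq SUBSEP $NF] = 1 }
         END { n = 0; for (g in genes) n++; print n }' "$1"
}

# Stand-in annotation file for demonstration only.
cat > demo_uni.ann <<'EOF'
>scaffold_1
Einit 201 325 gene-1
Exon 500 640 gene-1
Eterm 900 1100 gene-1
>scaffold_2
Esngl 50 400 gene-2
EOF
echo "demo_uni.ann: $(count_zff_genes demo_uni.ann) genes"

# With real fathom output, report every category file that exists.
for f in uni.ann wrn.ann alt.ann olp.ann err.ann; do
    if [ -f "$f" ]; then
        echo "$f: $(count_zff_genes "$f") genes"
    fi
done
```

If uni.ann holds far fewer than a few hundred genes, the resulting HMM may be poorly estimated, and adding more scaffolds or proteins to the draft run may be worth considering.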


6) Next, export the genes.


$ fathom -export 1000 uni.ann uni.dna


This will generate four new files (export.aa, export.ann, export.dna, and export.tx)


7) Make a new directory for the new parameters.


$ mkdir parameters

$ cd parameters


8) Generate the new parameters. The command below will run forge on the export files in the directory above parameters (i.e. the parent directory of parameters), but put the output in the working directory (parameters).


$ forge ../export.ann ../export.dna


9) Generate the new HMM (the file that SNAP uses to generate gene predictions) by moving into the directory above parameters and entering the command that follows.


$ cd ..

$ hmm-assembler.pl genome parameters > genome.hmm
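
Steps 3 to 9 above can also be collected into one function, sketched below with the same file names as this tutorial. run and train_snap are hypothetical helpers. With DRY_RUN=1 (the default here) the commands are only printed, so the sketch can be read and tested on a machine without Maker or SNAP.

```shell
# DRY_RUN=1 prints the commands only; set DRY_RUN=0 where Maker and SNAP are installed.
DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command, and execute it unless this is a dry run.
    echo "+ $*"
    if [ "$DRY_RUN" -eq 0 ]; then "$@"; fi
}

train_snap() {
    # Each step runs only if the previous one succeeded.
    run maker2zff -n genome.all.gff &&
    run fathom -categorize 1000 genome.ann genome.dna &&
    run fathom -export 1000 uni.ann uni.dna &&
    run mkdir -p parameters &&
    run cd parameters &&
    run forge ../export.ann ../export.dna &&
    run cd .. &&
    echo "+ hmm-assembler.pl genome parameters > genome.hmm" &&
    if [ "$DRY_RUN" -eq 0 ]; then
        # Redirection is handled here because run() cannot pass '>' through.
        hmm-assembler.pl genome parameters > genome.hmm
    fi
}

train_snap
```

Run it from inside the SNAPTraining directory. The only deliberate difference from the manual steps is mkdir -p, which does not fail if parameters already exists.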






Training Augustus

1) Go to http://bioinf.uni-greifswald.de/webaugustus/training/create

2) In the designated spaces, fill in your email and the species name.

3) Upload export.dna as the genome file.

4) Upload export.aa as the protein file.

5) Type in the verification string and click “Start Training”

6) After the Augustus web training finishes, download the directory of results.

If using HHMI Cluster: Rename the directory by the species name and copy it into the /share/apps/maker/exe/augustus/config/species directory.


If using Virtual Box: Rename the directory by the species name and copy it into the /User/lib/maker/exe/augustus/config/species directory.


Assessment

After reviewing and practicing HMM generation (see suggestion below), students should be able to do it by hand on small datasets from a nucleotide alignment. They could then score a short DNA sequence for presence of a motif by hand.
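
To check a hand calculation, the sketch below scores a sequence against a small alignment using a position weight matrix, which is a much simpler relative of the HMMs that SNAP builds. The alignment, the add-one pseudocount, and the uniform background frequency of 0.25 are all assumptions chosen for illustration. score_motif is a hypothetical helper.

```shell
# Toy alignment of a 6-base motif (illustrative data, not from a real genome).
cat > motif_alignment.txt <<'EOF'
TATAAT
TATGAT
TACAAT
TATAAT
EOF

score_motif() {
    # usage: score_motif ALIGNMENT_FILE SEQUENCE
    awk -v query="$2" '
    {   # count each base at each column of the alignment
        n++
        len = length($0)
        for (i = 1; i <= len; i++) count[i, substr($0, i, 1)]++
    }
    END {
        score = 0
        for (i = 1; i <= len; i++) {
            b = substr(query, i, 1)
            p = (count[i, b] + 1) / (n + 4)     # add-one pseudocount over 4 bases
            score += log(p / 0.25) / log(2)     # log2 odds vs. uniform background
        }
        printf "%.3f\n", score
    }' "$1"
}

score_motif motif_alignment.txt TATAAT   # sequence matching the consensus
score_motif motif_alignment.txt GCGCGC   # sequence with no matching bases
```

The consensus sequence should receive the higher score. Working the same sums by hand from the column counts is a useful exercise before moving on to full HMMs.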

Students should be able to repeat the steps starting with long (> 10x average gene size; see Module 1 for average gene size given genome size) scaffolds from any genome.


Time line of module

1 hour of lab

Discussion topics for class

Relevant background topics include HMM generation. Chapter 16 of Genomics and Bioinformatics: An Introduction to Programming Tools for Life Scientists by Tore Samuelsson is a good reference. The Korf 2004 paper is required reading on this topic.


References

Maker

Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Sánchez Alvarado A, Yandell M. 2008. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18:188-196.


Holt C, Yandell M. 2011. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics. 12:491. doi: 10.1186/1471-2105-12-491.

SNAP

Korf I. 2004. Gene finding in novel genomes. BMC Bioinformatics. 5:59.

Augustus

Stanke M, Waack S. 2003. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 19:ii215-ii225.

Stanke M, Schöffmann O, Morgenstern B, Waack S. 2006. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics. 7:62.