August, 2010
Applied Computational Genomics Course
Ottawa
Genome Canada Bioinformatics Innovation Centre








MAGPIE: Automated Genome Analysis and Annotation

Christoph W. Sensen, University of Calgary

Key Concepts

- The work flow in DNA sequence production
- The data flow in automated annotation systems
- The general design of automated annotation systems

What you will be able to do at the end of this section

- Develop an analysis strategy for DNA sequences
- Use MAGPIE annotated genomes (prokaryotic)
- Use MAGPIE annotated transcript datasets (eukaryotic)

1. Overview

DNA sequence production and the bioinformatics challenge

Before we can attach functional assignments to DNA sequence, we have to understand the
work flow in the DNA sequencing laboratory, because the way DNA sequence is produced
influences analysis strategies as much as the reason why the sequence is generated (e.g.
complete genomic sequencing, EST production, SNP analysis). In this lecture, we will mainly
focus on analysis strategies for genomic sequence and complete genomes, but many of the
methods that are used for the annotation of complete genomes can also be applied to EST
libraries. The analysis of genomic sequence can be an overwhelming task because of the
sheer quantity of data that needs to be generated. If one megabase of finished microbial
genomic sequence needs to be analyzed using 50 tools, we can expect that a single analysis
run will need to execute approximately 50,000 database searches and other analyses. Because
the content of the public databases is growing rapidly, many analyses need to be repeated at
least once a month, bringing the number of searches per megabase and year close to 500,000.
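
The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the gene density of roughly 1,000 genes per megabase (about one gene per kilobase, typical for prokaryotes) and the fraction of analyses that are repeated each month (0.8) are assumptions chosen for illustration, since the text only says that "many" analyses are repeated.

```python
def searches_per_run(megabases, genes_per_megabase, tools):
    """Tool invocations needed to analyze every gene once with every tool."""
    return round(megabases * genes_per_megabase * tools)


def searches_per_year(per_run, reruns_per_year, rerun_fraction):
    """Repeated searches per year when only a fraction of the analyses is rerun."""
    return round(per_run * reruns_per_year * rerun_fraction)


# Assumed values: 1,000 genes per megabase and 80% of analyses rerun monthly.
per_run = searches_per_run(megabases=1.0, genes_per_megabase=1000, tools=50)
per_year = searches_per_year(per_run, reruns_per_year=12, rerun_fraction=0.8)
print(per_run, per_year)
```

With these assumptions a single run comes to 50,000 searches and a year of monthly updates to 480,000, which is in line with the figures quoted above.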

The four different stages of a DNA sequencing project

Different sequencing strategies (full genome shotgun vs. directed sequencing) have been a
hotly debated issue in the past, but over time it has turned out that all full genome sequencing
projects generally have four different phases: primary sequencing, where randomly cloned
fragments of the genome or a particular large insert clone are sequenced; the linking phase,
where primer-walking and other strategies are used to close the genome/clone and generate a
single contig; the polishing phase, where the single contig is disambiguated; and the finished
sequence, which is the end product of the project. Figure 1 details these phases, using the
Sulfolobus solfataricus P2 genome project as an example.


Figure 1: Sulfolobus solfataricus P2 genome sequencing strategy. The four different sequencing phases (primary,
linking, polishing and finished) are shown.

The extent of each phase depends on the particular project, and today many projects stop short
of complete genomic sequence, submitting unpolished contigs to the databases instead. Each
of the four phases needs a completely different sequence analysis strategy. The
objectives shift from the identification of contaminations (E. coli and vector) during the primary
phase, over the first glimpse into the gene content of the contig during the linking and polishing
phases, to the complete annotation of the finished contig, including the identification of all gene
locations, the assignment of function(s) to each gene and the identification of control elements
(e.g. promoters and terminators).





The data flow in a sequencing project

The data flow in a sequencing project has to reflect the analysis requirements in the four
sequencing phases. Analysis tools (ORF identification tools, intron-exon recognition tools,
BLAST, FASTA, Smith-Waterman, motif finders, promoter and terminator identification tools and
other programs) will have to be employed during different phases and potentially with changing
parameters. For example, a BLAST search of the sequence in question against the E. coli
genome is only appropriate in the primary sequencing phase, when this search will identify
contaminations that occurred during the cloning of the insert. Figure 2 shows the data flow
through a sequencing project.


Figure 2: Data flow in MAGPIE. The left side of the figure shows the sequence production engine (sequencing and
genome assembly), the right side shows the components of the MAGPIE automated genome annotation system,
which drives the genome analysis and annotation. MAGPIE can use local and remote (email and Web requests) tools
and generates Web-browsable outputs. In our example, sequence assembly is done using the Staden package.

2. Sequence Analysis Using MAGPIE

Preprocessing

Before genomic sequence can be analyzed, it has to be processed. Pre-processing strategies
depend on the sequencing phase. Primary sequence is just “handed through” and searched
against the E. coli and vector databases, while finished sequence undergoes a complex
procedure that helps identify the coding regions. During this procedure, the potentially coding
regions (often called open reading frames or ORFs in prokaryotes) are identified, spliced and
merged if necessary (when introns are present). The resulting DNA sequences are translated
into protein sequence, which can be subjected to protein analysis tools. All original sequences
and the processed coding regions are stored in local databases, which can be searched once
function is attached to the contigs. While it is relatively easy to identify prokaryotic ORFs using
start and stop codons, the identification of eukaryotic coding regions can be much more
complicated. For eukaryotic sequence it is advisable to combine the output of as many
organism-specific intron/exon recognition tools as possible in a first attempt to establish the
location and organization of genes, but to revise the findings later based on similarity searches.
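
To make the prokaryotic case concrete, the following is a minimal sketch of ORF identification by start and stop codons, followed by translation with the standard genetic code. It is not MAGPIE's ORF finder; the function names and the min_codons cutoff are illustrative assumptions, and the sketch only accepts ATG as a start codon, ignoring alternative starts such as GTG and TTG as well as introns.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(cds):
    """Translate a coding sequence; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3))


def find_orfs(seq, min_codons=30):
    """Scan all six frames for ATG...stop ORFs.

    Returns (strand, start, end, protein) tuples. Coordinates are 0-based,
    end-exclusive (stop codon included) and refer to the scanned strand.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, translate(s[start:i])))
                    start = None
    return orfs


example = "CC" + "ATG" + "GCT" * 3 + "TAA" + "GG"
print(find_orfs(example, min_codons=2))
```

The resulting protein sequences are what would then be handed to the protein analysis tools; for eukaryotic sequence this simple scan is not sufficient, as noted above.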

Data collection

Once the sequence is processed and organized, an analysis plan will be attached to each
contig. Depending on the sequencing phase and the contig (DNA or protein, full genome or
ORF?), this plan may differ considerably. Good automated engines allow the users to define
analysis plans for their sequences. This allows the integration of new tools and searches
against updated or new types of databases.
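
One way to picture a user-defined analysis plan is as a small table keyed by sequencing phase and sequence type. The sketch below is an assumption made for illustration: the class, the tool labels and the database labels are hypothetical and do not reflect MAGPIE's actual configuration format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolStep:
    tool: str                         # hypothetical tool label
    database: Optional[str] = None    # hypothetical database label, if any
    remote: bool = False              # True for email or Web-based services
    rerun_days: Optional[int] = None  # repeat interval; None means run once


# Hypothetical plans keyed by (sequencing phase, sequence type).
ANALYSIS_PLANS = {
    ("primary", "dna"): [
        ToolStep("blastn", "ecoli_genome"),  # contamination screen
        ToolStep("blastn", "vector"),
    ],
    ("finished", "dna"): [
        ToolStep("orf_finder"),
        ToolStep("promoter_scan"),
    ],
    ("finished", "protein"): [
        ToolStep("blastp", "public_proteins", remote=True, rerun_days=30),
        ToolStep("motif_search", "motif_db", rerun_days=30),
    ],
}


def plan_for(phase, sequence_type):
    """Return the steps for a contig, or an empty plan if none is defined."""
    return ANALYSIS_PLANS.get((phase, sequence_type), [])


def steps_due(plan, days_since_last_run):
    """Select the steps whose repeat interval has elapsed."""
    return [s for s in plan
            if s.rerun_days is not None and days_since_last_run >= s.rerun_days]
```

Keeping the plan as data rather than code is what makes it easy to add a new tool or point an existing search at an updated database: only a table entry changes.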

A first daemon (or agent) executes the data collection for each contig based on the analysis
plan. Tools and databases can be either remote or local (including the GeneMatcher 2 and
TimeLogic machines), and they can be invoked via commands, email or Web requests. Whenever a
tool response is received, it is handed off to a post-processor (or parser), which extracts the
relevant information (e.g. scores, database hits, alignments, predictions for biophysical
parameters) and stores it in a local data system (either a flat-file structure or a relational
database).
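
The hand-off from tool response to parser to local storage can be sketched as follows. MAGPIE's own parsers and storage schema are not described in these notes, so everything here is an assumption: the tool response is taken to be 12-column tab-separated BLAST output (the default column order of BLAST+ tabular format 6), the table layout is made up, and the identifiers in the example are hypothetical.

```python
import sqlite3

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore")


def parse_tabular_hits(text):
    """Yield one dict per hit line, skipping blank lines and '#' comments."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
        hit = dict(zip(COLUMNS, fields))
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        for key in ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"):
            hit[key] = int(hit[key])
        yield hit


def store_hits(conn, tool, hits):
    """Store the extracted information in a (hypothetical) relational table."""
    conn.execute("CREATE TABLE IF NOT EXISTS hits "
                 "(tool TEXT, query TEXT, subject TEXT, evalue REAL, bitscore REAL)")
    conn.executemany(
        "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
        [(tool, h["qseqid"], h["sseqid"], h["evalue"], h["bitscore"]) for h in hits],
    )
    conn.commit()


# Hypothetical tool response for one ORF.
response = "orf_0001\thypothetical_hit_1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-110\t390\n"
conn = sqlite3.connect(":memory:")
store_hits(conn, "blastp", parse_tabular_hits(response))
rows = conn.execute("SELECT query, subject, evalue, bitscore FROM hits").fetchall()
```

Separating the parser from the storage step means a new tool only needs a new parser; the report generation can keep reading from the same local data system.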

Report generation

After the data collection and the post-processing are finished, Web-browsable reports are
generated. Typically, these reports have a hierarchical structure, which allows browsing of the
information on the level of the complete contig or at a higher magnification (for example the
level of a single gene). The feature and function summaries can be either in tabular format or in
graphical format. In most cases, the graphical displays are pre-computed and served as GIF or
PNG files. The report generation can also create indices of the assigned features and functions,
which can be used to query over the entire data set. The PowerPoint presentation will show the
hierarchy for the Sulfolobus genome in detail using the MAGPIE view as an example.

Reports: Genomic

MAGPIE provides a hierarchical view into genomic data, as shown in the following table, using
E. coli as an example. The E. coli project and approximately 150 more are available at
http://magpie.ucalgary.ca .



1 Project home page

The project home page lists the project statistics for
all of the contigs. It also provides links to the
sequence assembly data if this information is
available (not for E. coli).

2 Supercontig page

Clicking on "Whole Project View" on the project
home page connects to the MAGPIE supercontig
page. This page provides an overview of the genome
assembly and links to local databases in FASTA
format. For E. coli, it shows the order of the
~100 kb contigs.






3 Group home page

The group home page for a contig can be reached
from the project home page or by clicking on a contig
in the supercontig view or a group link in the project
home page. This page provides graphical and tabular
information about the contig and its open reading
frames (ORFs).

4 Status page

The status page summarizes the responses from the
tool servers for a group.

6 More Graphics

The MORE GRAPHICS link on the group home page
connects to additional graphical displays of contig-related
images. This includes mapping data, ORF
display, base composition and assembly information.







7 ORF home page

Clicking on an ORF connects to the ORF function
page. If the ORF is already annotated (as
represented by the Nova Scotia flag in the group
home page), the annotation database contents will be
displayed together with the summary evidence
graphics and the individual evidence graphics. The
individual evidence graphics are completely
hyperlinked.

8 Search Project

MAGPIE projects can be searched for sequence
similarities using BLAST and FASTA, for regular
expressions, e.g. "Polymerase", and for sequence
motifs. The links to the various search types are
provided on the project home page.

9 tRNA Report

The tRNA report provides an overview of the tRNA
genes that were identified in a MAGPIE project. This
report uses Todd Lowe's and Sean Eddy's
tRNAscan-SE to identify the genes.






10 Metabolic Pathway Overview

The Metabolic Pathway overview identifies all primary
metabolism pathways in Evgeni Selkov's EMP
database for which the project contains genes with an
E.C. number that is part of the pathway. We are using
the public domain version of EMP as downloaded
from EBI, Hinxton.

11 Metabolic Pathway Schema

Individual metabolic pathway schemata containing the
highlighted E.C. numbers for ORFs identified in the
MAGPIE project, a link to the particular ORF pages
and the phylogenetic classification of each ORF with
an E.C. number.


Table 1: MAGPIE pages

All automated annotation systems allow simple queries like text searches (e.g. find all
polymerases). Searches can include sequence similarities (using searches against the local
genome database), motif searches, and searches for biophysical parameters (sometimes
including fragment masses for Mass Spectrometry analyses). These searches allow users to
compare their genomic fragment of interest to the annotated genome.
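
The two simplest query types, a text search over descriptions and a motif search over protein sequences, can be sketched with regular expressions. This is an illustration only and not MAGPIE's search code: the record layout, the ORF identifiers and the short protein sequences are made up, and the motif converter handles only the basic PROSITE pattern syntax. The example pattern [AG]-x(4)-G-K-[ST] is the PROSITE ATP/GTP-binding site motif A (P-loop).

```python
import re

# Hypothetical annotated ORFs: (identifier, description, protein sequence).
ANNOTATIONS = [
    ("orf_0001", "DNA polymerase III alpha subunit", "MSEPRFVHLRVHSDYS"),
    ("orf_0002", "ABC transporter ATP-binding protein", "MSLAGPSGSGKSTLLR"),
    ("orf_0003", "RNA polymerase sigma factor", "MTDKLTSLRQ"),
]


def search_descriptions(records, pattern):
    """Return identifiers whose description matches a regular expression."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [orf_id for orf_id, description, _ in records if regex.search(description)]


def prosite_to_regex(pattern):
    """Convert a basic PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {AB} = not A or B
        element = element.replace("(", "{").replace(")", "}")   # x(4) = repeat 4 times
        element = element.replace("x", ".")                     # x = any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence ends
        parts.append(element)
    return "".join(parts)


def search_motif(records, prosite_pattern):
    """Return (identifier, 0-based start) for the first motif match per protein."""
    regex = re.compile(prosite_to_regex(prosite_pattern))
    hits = []
    for orf_id, _, protein in records:
        match = regex.search(protein)
        if match:
            hits.append((orf_id, match.start()))
    return hits


polymerases = search_descriptions(ANNOTATIONS, "polymerase")
p_loops = search_motif(ANNOTATIONS, "[AG]-x(4)-G-K-[ST]")
```

Sequence similarity searches against the local genome database would instead go through BLAST or FASTA, as described for the Search Project page.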

The development of better analysis and annotation tools is still a wide-open field which poses
exciting challenges for computer scientists and biologists alike.


Reports: Transcriptomes

In contrast to the genomic reports, MAGPIE treats transcriptomics data coming from next-generation
sequencing technologies as more of a searchable dataset than a browsable dataset, which is
reflected in the interface.






1 Project home page

The raw reads and assembled contigs are
presented as different groups. Assembly
data (right column) shows the provenance
of the contigs, as well as a rough idea of
transcript abundance, if the cDNA was not
normalized biochemically.

2 Group home page - searches

The group home page for a sequencing
run allows the user to search for raw reads
based on the following attributes:

- Description text of database hits
- Gene Ontology terms gleaned from hits
- IDs as generated by the sequencer,
  or tags as may be assigned by
  users in the ORF annotation
- Phylogenetic conservation (e.g.
  show seqs strongly similar to dog
  ESTs)
- BLAST similarity to a query DNA or
  protein sequence (auto-detected)





3 Group home page - graphs

MAGPIE provides histograms giving an
overview of sequence lengths (top), and if
the group represents assembled contigs, the number
of reads per contig (bottom). Note the
logarithmic scale ranges for the X axis.
Sequences can be searched by their length
or number of contributing reads, which can be
useful when looking for high quality or
highly expressed ESTs.

4 Transcript evidence page

This page is similar to the ORF home page
in the genomic view, but may also include
information such as assembly visualization
(see the links above the ontology terms form).

5 Transcript assembly viewer

Clicking on the “Display” button within the
transcript evidence page (assembled
groups) launches a Java viewer for the
contig’s assembly. Don’t worry about the
asmviewrc error, it should work anyway.

This app lets you assess the quality of the
assembly job, and view subcluster
information, if available.





Confirmation of the automated annotation

No completely automated genome analysis and annotation system is currently capable of
producing output that is accurate enough for direct submission to the public databases. After the
report generation is completed, experienced biologists have to go through the hierarchical
reports and verify the automated function calls. The reasons for this are manifold, not the least
one being that functional assignments in the public databases have an error rate of 10% to
15%, which makes it essential to study the evidence and correct obvious miscalls. In addition,
information about functional categories and gene ontologies can be recorded at this point.
Figure 3 shows a gene page with the fields which can be edited by the annotator.


Figure 3: MAGPIE annotation verification information


Key Computational Challenge





- How can we build tool-integration systems that reflect user needs and preferences?


Appendix

1. Resources

i) Original Papers

- Gaasterland T., Sensen C.W. (1996) Fully automated genome analysis that reflects
  user needs and preferences - A detailed introduction to the MAGPIE system
  architecture -. Biochimie 78:302-31
- Gordon P., Sensen C.W. (2000) Bluejay: A Browser for Linear Units in Java. In:
  Pollard A., Mewhort D.J.K., Weaver D.F. [eds.] High Performance Computing
  Systems and Applications. Kluwer Academic Publishers, pp. 183-194
- Gordon P., Gaasterland T., Sensen C.W. (2002) Genomic Data Representation
  Through Images: MAGPIE as an Example. In: Sensen C.W. [ed.] Essentials of Genomics
  and Bioinformatics. Wiley-VCH, pp. 345-363

ii) Software

- none

iii) Text books:

- Sensen, C.W. (2002) Essentials of Genomics and Bioinformatics. Wiley-VCH,
  Weinheim, ISBN 3-527-30541-6

iv) Web Sites:

- MAGPIE: http://magpie.ucalgary.ca

Assignment

1. Students will, in pairs, manually complete the annotation of a piece of the E. coli genome,
which was automatically annotated with the MAGPIE system. Students can start their
assignments from the MAGPIE sample project webpage:

http://magpie.ucalgary.ca/magpie/Ecoli_K12/private/

(username: demo, pwd: acgc).
Discuss with your partner the quantity and quality of matches you get from the different
tools run by MAGPIE. Which tools are more sensitive, non-redundant, etc.? How did
you synthesize a final description?

2. Using the same user name and password, navigate the assembled sequence group of
the public Poppy project:

http://magpie.ucalgary.ca/magpie/Poppy/private/

Try the various search forms. Annotate a sequence of interest to you, and tag it with a
name at the same time. Use the search form to find that sequence again, using your
personal tag.

Launch the assembly display and discuss with your partner how good the
assembly looks (# reads, overlap, base quality).