NGS Bioinformatics Workshop - IRMACS Centre


NGS Bioinformatics Workshop

1.1 Workshop Overview and Practical Informatics Considerations

March 7th, 2012

IRMACS, SFU

Facilitator: Richard Bruskiewich

Adjunct Professor, MBB

Today’s Agenda


Part 1


Welcome and Acknowledgments


Some administrative details…


Introductions:


Facilitator


Participants


10 minute break

Advance Acknowledgments


Jim Mattson: for championing the workshop idea


Felix Breden: for championing the idea of IRMACS bioinformatics support & endorsing this workshop


IRMACS team:


Pam Borghardt, IRMACS Managing Director: sponsorship


Brian Corrie, Technical Director: workshop infrastructure


WestGrid Team:


Ata Roudgar, Martin Siegert: workshop HPC infrastructure


Fiona Brinkman: for her kind permission to adapt a number of her MBB introductory bioinformatics course slides for portions of the workshop


Topic | Lecture (12:30 - 14:30, Wednesdays) | Demo/Lab (9:30 - 11:30, Thursdays)

Bioinformatics Overview (roughly equivalent to core MBB 441/741 topics)

Workshop Overview and Practical Informatics Considerations | March 7th | March 8th
Sequence Formats, Databases and Visualization Tools | March 14th | March 15th
Sequence Alignment and Searching | March 21st | March 22nd
Principles of Structural Genomics and Overview of Next Generation Sequencing Technologies | March 28th | March 29th
Sequence Assembly Algorithms | April 4th | April 5th

Specific Applications

Sequence Assembly of Transcriptomes | May 2nd | May 3rd
Sequence Assembly of Whole Genomes | May 9th | May 10th
Annotation of de novo Assembled Sequences | May 16th | May 17th
Identification and Analysis of Sequence Variation | May 23rd | May 24th
Comparative Genomic Analysis and Visualization | May 30th | May 31st
Meta-Analysis of Newly Annotated Sequence Data | June 6th | June 7th



Venue


The workshop lectures and demo/labs will generally take place here, in the IRMACS Centre, Room 10900 (top floor, Applied Sciences Building), with the exception of the March 14th and May 9th lectures, plus the May 10th lab/demo, for which there is a meeting conflict in IRMACS. These particular sessions will instead be convened in BioSci room B9242.


The lab/demo sessions on March 8th, 15th and 29th will end earlier, at 11 am, to accommodate the next scheduled event in IRMACS 10900.


Workshop Fee


Sign-up list to Barbara Sherman… will contact PI for billing(?)


INTRODUCTIONS


Facilitator Richard: A Brief Bio


Professional Experience


2009 - present, Adjunct Professor, MBB, SFU


2000 - 2011, Research Scientist, Computational and Systems Biology, Bioinformatics, International Rice Research Institute (IRRI; irri.org)


1999 - 2000, Postdoc, Human Analysis Team, Sanger Centre, Cambridge, UK


Academic Background


1999, PhD (Medical Genetics), UBC


1992, B.Sc. (Biochemistry, Molecular Biology & Genetics), UBC


1987, B.A. (Minor Computing), SFU


Personal


Originally from Edmonton; moved to GVRD in late teens and resided here for over 2 decades before travelling abroad to work


Wife is Filipina-Canadian (hence the job in the Philippines); 3 teenage kids (son in his late teens has just started in the SIAT program at SFU Surrey)


Returned last June to reside in Port Moody, at the foot of Burnaby Mountain



Participants “Around the table”


Your name, department, lab (PI)


(Optional) Your “Port of Origin”


What is your research focus?


How can bioinformatics (NGS) support that research?


What NGS data of your own do you have to analyse *now*?


Expectations for the workshop…

10 minute break…

Today’s Agenda


Part 2


What is Bioinformatics and why is it needed?


What is “Next Generation Sequencing”


Coping with the NGS bioinformatics challenge


The Workshop Road Map


Looking ahead…

WHAT IS BIOINFORMATICS?


Bioinformatics is…


The development of computational methods for studying the structure, function, and evolution of genes, proteins, and whole genomes;


The development of methods for the management and analysis of biological information arising from genomics and high-throughput biological experiments.

Why is there Bioinformatics?



Lots of new sequences being added

- Automated sequencers

- Genome projects

- Metagenomics

- RNA sequencing, microarray studies, proteomics,…


Patterns in datasets that can be analyzed using computers

Huge datasets


Need for informatics in biology: origins


Gramicidin S (Consden et al., 1947), partial insulin sequence (Sanger and Tuppy, 1951)


1961: tRNA fragments


Francis Crick, Sydney Brenner, and colleagues propose the existence of transfer RNA that uses a three-base code and mediates in the synthesis of proteins (Crick et al., 1961. General nature of the genetic code for proteins. Nature 192: 1227-1232. In Microbiology: A Centenary Perspective, edited by Wolfgang K. Joklik, ASM Press, 1999, p. 384)


First codon assignment UUU/Phe (Nirenberg and Matthaei, 1961)



The key to the whole field of nucleic acid-based identification of microorganisms…


…the introduction of molecular systematics using proteins and nucleic acids by the American Nobel laureate Linus Pauling.


Zuckerkandl, E., and L. Pauling. 1965. "Molecules as Documents of Evolutionary History." Journal of Theoretical Biology 8:357-366


Another landmark: Nucleic acid sequencing (Sanger and Coulson, 1975)


First genomes sequenced:


3.5 kb RNA bacteriophage MS2 (Fiers et al., 1976)


5.4 kb bacteriophage ΦX174 (Sanger et al., 1977)


1.83 Mb: first complete genome sequence of a free-living organism, Haemophilus influenzae KW20 (Fleischmann et al., 1995)


First multicellular organism to be sequenced: C. elegans (C. elegans Sequencing Consortium, 1998)


Early databases: Dayhoff, 1972; Erdmann, 1978


Early programs: restriction enzyme sites, promoters, etc… circa 1978.


1978 - 1993: Nucleic Acids Research published supplemental information

(from the National Center for Biotechnology Information)

GenBank and associated resources double faster than Moore’s Law! (< every 18 months)


http://en.wikipedia.org/wiki/Moore's_law

Today: So many genomes…

As of mid-August 2010, according to the GOLD Genomes OnLine database…


Eukaryotic genome projects in progress (genome and ESTs): 1548 (517 five years earlier)


Prokaryote genome projects in progress: 5006 (740 five years earlier)


Metagenome projects in progress: 133 (zero five years earlier)


TOTAL: 6687 projects (as of Sept 2011: >10,000)
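
To put the project counts above on the same footing as the "doubles every 18 months" comparison, one can back out an approximate doubling time from two counts taken five years apart. The sketch below is only a rough illustration: it assumes constant exponential growth over the interval, which the slide does not claim, and the function name is made up for this example.

```python
import math

# Project counts quoted on the slide (GOLD, mid-August 2010 vs. five years earlier).
COUNTS = {
    "eukaryotic": (517, 1548),
    "prokaryotic": (740, 5006),
}
YEARS_ELAPSED = 5.0  # "5 years ago" on the slide


def doubling_time_years(start, end, years):
    """Doubling time under an ASSUMED constant exponential growth rate."""
    return years * math.log(2) / math.log(end / start)


doubling = {
    name: doubling_time_years(start, end, YEARS_ELAPSED)
    for name, (start, end) in COUNTS.items()
}
for name, years in doubling.items():
    print(f"{name} projects: doubling roughly every {years:.1f} years")
```

Under that assumption the eukaryotic project count doubles about every 3.2 years and the prokaryotic count about every 1.8 years. These are counts of projects, not of sequence data volume, so they are not directly comparable to the GenBank growth figure.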






The Human Genome

The genome sequence is complete - almost!


Approximately 3.2 billion base pairs.


Work ongoing to locate all genes and regulatory regions and describe their functions… bioinformatics plays a critical role


Identifying single nucleotide polymorphisms (SNPs) and other changes between individuals

Bioinformatics helps with…


Sequence Similarity Searching/Comparison


What is similar to my sequence?


Searching gets harder as the databases get bigger - and quality changes


Tools: BLAST and FASTA = early time-saving heuristics (approximate methods)


Need better methods for SNP analysis!


Statistics + informed judgment of the biologist
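
To give a feel for what a "time-saving heuristic" means here, the toy sketch below ranks database sequences by the number of short words (k-mers) they share with a query, instead of computing a full alignment against each one. Shared short words are the general idea behind the seeding step of tools such as BLAST and FASTA, but this is not their actual algorithm; the function names, the word size and the three sequences are all made up for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_by_shared_words(query, database, k=4):
    """Rank database entries by how many distinct k-mers they share with the query."""
    query_words = kmers(query, k)
    scores = {
        name: len(query_words & kmers(seq, k))
        for name, seq in database.items()
    }
    # Highest number of shared words first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy database (sequences invented for this example).
database = {
    "seqA": "ATGGCGTACGTTAGC",
    "seqB": "TTTTTTTTTTTTTTT",
    "seqC": "GGCGTACGTAAGCTA",
}
ranking = rank_by_shared_words("GGCGTACGTT", database)
for name, shared in ranking:
    print(name, shared)
```

Only the best-ranked candidates would then need a slower, exact alignment, which is where the time saving comes from; the price is that a distant relative sharing no exact word can be missed.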


Bioinformatics helps with…

Structure-Function Relationships


Can we predict the function of protein molecules from their sequence?


sequence > structure > function


Prediction of some simple 3-D structures possible (α-helix, β-sheet, membrane spanning, etc.)




Bioinformatics helps with…

Phylogenetics


Can we define evolutionary relationships between organisms by comparing DNA sequences?

- Lots of methods and software: what is the best analysis approach?
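
As one very simple instance of "comparing DNA sequences", the sketch below computes the proportion of differing sites (the p-distance) between pairs of already-aligned sequences. It is only one of the many possible approaches the slide alludes to, and the taxon names and sequences are invented for illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical aligned sequences (made up for this example).
aligned = {
    "taxon1": "ACGTACGTAC",
    "taxon2": "ACGTACGTTC",
    "taxon3": "TCGAACGTTG",
}

# Pairwise distance for every pair of taxa.
distances = {
    (m, n): p_distance(aligned[m], aligned[n])
    for m, n in combinations(aligned, 2)
}
for pair, dist in distances.items():
    print(pair, dist)
```

A distance matrix like this is the usual input to distance-based tree-building methods; more realistic analyses correct for multiple substitutions at the same site, which the raw p-distance ignores.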


WHAT IS NEXT GENERATION SEQUENCING (NGS)?


Sanger (“dideoxy” or “chain termination”) Sequencing


Single-stranded DNA from the sample* is extended by polymerase from a primer, then randomly terminated by a dideoxynucleotide (ddNTP)


Variable-length DNA fragments are detected via radiolabelled or fluorescent ddNTPs

*Sample derived from amplified cDNA, genomic clones or whole genome shotgun
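
The logic of reading a sequence from randomly terminated fragments can be illustrated with a toy simulation: each fragment stops at a random position, its last base is the labelled ddNTP, and sorting fragments by length (as a gel or capillary does) spells out the strand. This is a deliberately idealised sketch: it ignores chemistry, signal quality and read-length limits, and the function name and example strand are invented.

```python
import random


def chain_termination_read(synthesized_strand, n_fragments=2000, seed=0):
    """Toy model: recover a strand from randomly terminated extension products."""
    rng = random.Random(seed)
    terminal_base_by_length = {}
    for _ in range(n_fragments):
        # Extension stops at a random position where a ddNTP was incorporated.
        length = rng.randint(1, len(synthesized_strand))
        # The detected label identifies the last (terminating) base.
        terminal_base_by_length[length] = synthesized_strand[length - 1]
    # Ordering the fragments by length reads off the bases in order.
    return "".join(
        terminal_base_by_length[length]
        for length in sorted(terminal_base_by_length)
    )


read = chain_termination_read("ATGCGTACGTTA")
print(read)
```

With many more fragments than positions, every fragment length is represented and the full strand is recovered; with too few fragments some positions would simply be missing from the read.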

Sanger Pros & Cons


Advantages


Relatively accurate


Relatively long reads (500 - 1500 bp)


Disadvantage


Relatively costly in terms of reagents and relatively low throughput

Next Generation Sequencing (NGS)

Sequence Assembly on HPC

Platforms: Roche 454; Life Tech. Ion Torrent; Illumina HiSeq; Life Tech. SOLiD; Oxford Nanopore GridION; Polonator; HeliScope; Pacific Biosciences SMRT Cell

(General) NGS Pros & Cons


Advantages


Very high throughput


Very cheap data production


Disadvantages


Relatively short reads


Relatively higher error rates


Bioinformatics of assembly is much more challenging

General NGS Workflow

1. Template preparation

2. Sequencing & imaging

3. Genome alignment/assembly

COPING WITH THE NGS BIOINFORMATICS CHALLENGE


Challenge


Assembling “next generation sequence” (NGS) data requires a great deal of computing power and gigabytes of memory


Software can often execute in parallel on all available central processing unit (CPU) cores.


Many functional annotation processes (e.g. database searching, gene expression statistical analyses) also demand a lot of computing power

“High Performance Computing” and “Cloud Computing”

(Diagram labels: computer nodes; network storage; your local workstation/laptop)

What is Cloud Computing?


Pooled resources: shared with many users (remotely accessed)


Virtualization: high utilization of hardware resources (no idling)


Elasticity: dynamic scaling without capital expenditure and time delay


Automation: build, deploy, configure, provision, and move without manual intervention


Metered billing: “pay-as-you-go”, only for what you use

Cloud Computing

Cloud Bioinformatics Module

(Diagram labels: Raw Data/Results/Snapshots; Task-Specialized Server; Input Job Message Queue; Output Job Message Queue; Job Status Notification; Customized Machine Image; Start-up (w/parameters))
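
The queue-based layout in the diagram (jobs go into an input queue, a task-specialized server processes them, results come back on an output queue) can be sketched locally with Python's standard library. This is only an analogy under stated assumptions: in-process queues and a thread stand in for the cloud message queues and server, and `task_server` and `gc_content` are hypothetical names, with GC content as a placeholder for a real analysis task.

```python
import queue
import threading


def task_server(input_jobs, output_jobs, task):
    """Pull jobs from the input queue, run the task, push results to the output queue."""
    while True:
        job = input_jobs.get()
        if job is None:  # sentinel value: no more work, shut the server down
            break
        job_id, payload = job
        output_jobs.put((job_id, task(payload)))


def gc_content(seq):
    """Stand-in task: fraction of G/C bases in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


input_jobs = queue.Queue()
output_jobs = queue.Queue()
worker = threading.Thread(target=task_server, args=(input_jobs, output_jobs, gc_content))
worker.start()

# Submit three hypothetical jobs, then the shutdown sentinel.
for job in [("job1", "GGCC"), ("job2", "ATAT"), ("job3", "ATGC")]:
    input_jobs.put(job)
input_jobs.put(None)
worker.join()

# Collect one result per submitted job from the output queue.
results = dict(output_jobs.get() for _ in range(3))
print(results)
```

Decoupling submission from processing this way is what lets more servers be started from the same machine image when the input queue grows, and shut down again when it empties.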

A More Complete Picture…

(Diagram labels: Raw Data + Results; Web Portal; Project Relational Database; Database Loader)

Case Study in Bioinformatics on the Cloud


Used Amazon Web Services (http://aws.amazon.com)


Assembled ~99 raw NGS transcriptome sequence datasets from 83 species on 16 Amazon EC2 instances, each with 8 CPU cores and 68 GB of RAM: ~200 hours of computer time, with the total run finished in less than one working day.


Each single machine of the required size would likely have cost at least ~$10,000 (and time) to purchase, and would incur significant operating cost overheads (machine room space, power supplies, networking, air conditioning, staff salaries, etc.)


The above run could be started up in a few minutes and cost ~$500 to complete. Once done, no machines are left idling and unused…
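
The cost comparison in this case study can be made explicit with a few lines of arithmetic. The figures are the approximate ones quoted above; the comparison ignores operating costs and treats the ~$10,000 purchase price as applying to each of the 16 machines, which is an assumption made for this sketch.

```python
# Approximate figures quoted in the case study.
N_INSTANCES = 16
COMPUTE_HOURS = 200            # ~200 hours of computer time for the run
CLOUD_RUN_COST = 500           # ~USD for the whole run
PURCHASE_PRICE_EACH = 10_000   # ~USD, stated lower bound per machine

# Effective price of one hour of cloud computer time.
cost_per_compute_hour = CLOUD_RUN_COST / COMPUTE_HOURS

# ASSUMPTION: buying equivalent hardware means one machine per instance.
purchase_cost = N_INSTANCES * PURCHASE_PRICE_EACH

# How many cloud runs of this size the purchase price alone would pay for.
runs_to_match_purchase = purchase_cost / CLOUD_RUN_COST

print(f"Cloud cost per computer-hour: ${cost_per_compute_hour:.2f}")
print(f"Purchase cost for {N_INSTANCES} machines: ${purchase_cost:,}")
print(f"Equivalent number of cloud runs: {runs_to_match_purchase:.0f}")
```

On these numbers, the purchase price alone would cover a few hundred runs of the same size, before counting any of the operating overheads listed above.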


Software for (NGS) Bioinformatics


Bundled with sequencing machines:


e.g. Newbler assembler with Roche 454


3rd party commercial:


DNA Star (www.dnastar.com)


Geneious (http://www.geneious.com/)


GeneWiz (http://www.genewiz.com)


And others…


Open Source:


Lots (selected examples to be covered in this workshop)


What do I need to run bioinformatics software locally?


Some common bioinformatics software is platform independent, hence will run equally well under Windows and UNIX (Linux, OS X)


Most other software targets Unix systems. If you are running Microsoft Windows and want to run such software locally, the easiest way to do this(?) is to install some version of Linux (suggest “Ubuntu”) as a dual boot or (less intrusively) as a guest operating system in a virtual machine, e.g. http://www.vmware.com/products/player/

But, what are *we* going to use here?

WestGrid @ SFU / IRMACS


WestGrid is a consortium member of “Compute Canada” (https://computecanada.org/)


“bugaboo” cluster: 4328 cores total (capability cluster, 40 Core Years)

- 1280 cores: 8 cores/node, 16 GB/node, x86_64, IB

- 3048 cores: 12 cores/node, 24 GB/node, x86_64, IB


Access to other WestGrid resources through LAN and WAN


More details from Brian Corrie tomorrow…

Galaxy Genomics Workbench

http://galaxy.psu.edu/

(also http://main.g2.bx.psu.edu/)

THE WORKSHOP ROADMAP


What is Bioinformatics?

Road Map

(Diagram labels: Annotation; Sequences (Formats); Visualization of Sequence & Annotation; Search & Alignments; NGS; Sequence Databases; Sequence Assembly)

Specific Applications


Sequence Assembly of Transcriptomes


Sequence Assembly of Whole Genomes


Annotation of de novo Assembled Sequences


Identification and Analysis of Sequence Variation


Comparative Genomic Analysis and Visualization


Meta-Analysis of Annotated Sequence Data

Survey: Workshop Expectations I


How to find significance in the huge amount of data that Next Gen sequencing, but also microarrays etc., generate.


A basic understanding of how to analyse next generation sequencing data.


Learn some hands-on computer experience


Learning to use software for analysing sequence data; what can be done and how to do it.


Genome assembly + meta-analysis


Survey: Workshop Expectations II


The basics of alignment and SNP calling with next-gen sequencing, and what kind of programs are out there to do these tasks and then analyze the large datasets (I've been trying to figure this out on my own through reading the literature and it's quite time consuming so any info provided through the workshop would be very helpful - thanks)


The main workflow for processing sequence data from the beginning to the more specific paths of analyses. Also the concepts, significance of the adjustable parameters behind the various algorithms used in the workflow.


Survey: Workshop Expectations III


I expect to learn the basic bioinformatics tools.


Learn different sequence alignment software/technologies (e.g. BWA, ABySS, etc.). Learn more about the complexities of NGS sequencing


Next generation sequencing, data analysis etc.


Parameters regulating assembly of contigs. How to take raw data to an assembly, control the main parameters for assembly, mass analyze data for annotation and SNPs


How to compare expression profiles using RNA transcriptomes.


Want to learn new things


Survey: Operating System Being Used


Microsoft Windows on Intel/AMD: 14 (86.7%)

- Most running Windows 7 (some XP & Vista)

- One uses Linux through WestGrid and the IRMACS cluster

- Some of you also thinking of running Linux


Apple OS X: 2 (13.3%)

- Snow Leopard Release

- Apple Lion, running Windows 7 using Parallels


Linux on Intel: 2 (13.3%)

Looking Ahead…


What will you need for this workshop?


Mainly, just a laptop running a web browser


(Optional) access to Linux/Unix locally (VM Player)


Reading list:


Will give review citations for future lectures


For next week, suggest that you surf to http://www.ncbi.nlm.nih.gov/