March 22 - Mouse Genome Informatics

tastelesscowcreek, Biotechnology

4 Oct 2013

last time


pbx1 assignment: find the location of the probes in another one of the probesets for zebrafish.


Read the limma documentation


Run limma on your data set


Be sure you have your Galaxy account set up

pbx1

chr2:19,708,833-19,758,832

UCSC Genome Browser on Zebrafish Jul. 2010 (Zv9/danRer7) Assembly

limma


From gene list to interpretation


limma will generate a list of probeset ids for differentially expressed genes


What next?


Convert the probeset ids to gene symbols


Look for enrichment of functional terms associated with the genes in your list

http://david.abcc.ncifcrf.gov/
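The two follow-up steps listed above can be sketched in Python. This is only an illustration under stated assumptions: the probeset ids and the annotation table are made up, and a plain hypergeometric test stands in for whatever statistic an enrichment tool such as DAVID actually reports.

```python
from math import comb

# Hypothetical annotation table: probeset id -> gene symbol.
# In practice this would be read from the array's annotation file.
PROBESET_TO_SYMBOL = {
    "probe_001_at": "pbx1a",
    "probe_002_at": "hoxb1a",
    "probe_003_at": "hoxb1a",  # several probesets can map to one gene
    "probe_004_at": "meis3",
}


def probesets_to_symbols(probeset_ids, annotation):
    """Return sorted, de-duplicated gene symbols for the given probeset ids.

    Probeset ids without an annotation entry are skipped.
    """
    return sorted({annotation[p] for p in probeset_ids if p in annotation})


def enrichment_p_value(hits, list_size, term_size, background_size):
    """Upper-tail hypergeometric probability P(X >= hits).

    hits            -- genes in the list annotated with the term
    list_size       -- genes in the list
    term_size       -- genes in the background annotated with the term
    background_size -- genes in the background
    """
    max_hits = min(list_size, term_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, list_size - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


if __name__ == "__main__":
    ids = ["probe_002_at", "probe_003_at", "probe_999_at"]
    print(probesets_to_symbols(ids, PROBESET_TO_SYMBOL))
    # 3 of 10 listed genes carry a term that 20 of 1000 background genes carry
    print(enrichment_p_value(hits=3, list_size=10, term_size=20, background_size=1000))
```

A small p-value suggests the term appears in the gene list more often than expected by chance; with many terms tested, a multiple-testing correction would also be needed.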

RNA-Seq


Use of next-generation sequencing (NGS) technology to measure RNA levels


RNA-Seq advantages:


Wider dynamic range compared to microarray technology


Not dependent on known genome annotations


Higher throughput compared to microarray technology


RNA-Seq challenges:


Specificity versus completeness of alignments, especially for short sequence reads


Manipulation and analysis of large files


Data storage costs



http://www.geospiza.com/finchtalk/uploaded_images/rna-seq-steps-786705.png

RNA-Seq Library Prep


Sequencing Technologies

http://www.geospiza.com/finchtalk/uploaded_images/plates-and-slides-718301.png

Sequence “Space”


Roche 454


Flow space


Measures the pyrophosphate released when a nucleotide is added to a growing DNA chain


Flow space describes sequence in terms of these base incorporations


http://www.youtube.com/watch?v=bFNjxKHP8Jc
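To make the idea of flow space concrete, the sketch below converts a base sequence into the list of incorporations that a cyclic series of nucleotide flows would produce. The flow order "TACG" and the integer counts are simplifying assumptions for illustration; real instrument output is a signal intensity per flow, not an exact integer.

```python
def to_flow_space(sequence, flow_order="TACG"):
    """Describe a sequence as (flowed base, bases incorporated) pairs.

    Nucleotides are offered one at a time in a repeating cycle. A flow
    incorporates as many bases as the current homopolymer run allows,
    or zero if the next template base does not match.
    """
    unknown = set(sequence) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")

    flows = []
    pos = 0          # next base of the sequence to be incorporated
    flow_index = 0   # which flow of the cycle we are on
    while pos < len(sequence):
        base = flow_order[flow_index % len(flow_order)]
        count = 0
        while pos < len(sequence) and sequence[pos] == base:
            count += 1
            pos += 1
        flows.append((base, count))
        flow_index += 1
    return flows


if __name__ == "__main__":
    print(to_flow_space("TTAG"))
```

The homopolymer "TT" is read in a single flow with a count of 2, which is why homopolymer length has to be inferred from signal strength in this chemistry.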


AB SOLiD



Color space


Sequencing by DNA ligation via synthetic DNA molecules that contain two nested known bases with a fluorescent dye


Each base sequenced twice


http://www.youtube.com/watch?v=nlvyF8bFDwM&feature=related
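Color space can be illustrated with the commonly described two-base encoding, in which each color reports on a pair of adjacent bases, so every base contributes to two colors. The sketch below assumes the usual 2-bit coding (A=0, C=1, G=2, T=3) with the color taken as the XOR of neighboring codes; the function names are made up for this example.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def to_color_space(sequence):
    """Encode a base sequence as one color (0-3) per adjacent base pair."""
    codes = [BASE_CODE[base] for base in sequence]
    return [left ^ right for left, right in zip(codes, codes[1:])]


def from_color_space(first_base, colors):
    """Decode colors back to bases, given the known first base."""
    code = BASE_CODE[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color  # the next base is determined by the previous base and the color
        bases.append(CODE_BASE[code])
    return "".join(bases)


if __name__ == "__main__":
    colors = to_color_space("ATGGC")
    print(colors)
    print(from_color_space("A", colors))
```

Because decoding chains from one base to the next, a single miscalled color changes every downstream base, which is one reason color-space reads are usually aligned in color space rather than decoded first.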


Illumina/Solexa



Base space


Single base extensions of fluorescent-labeled nucleotides with protected 3' OH groups


Sequencing via cycles of base addition/detection followed by deprotection of the 3' OH


http://www.youtube.com/watch?v=77r5p8IBwJk&feature=related


GenomeTV



Next Generation Sequencing (lecture)


http://www.youtube.com/watch?v=g0vGrNjpyA8&feature=related




http://finchtalk.geospiza.com/2008/03/color-space-flow-space-sequence-space_23.html

Further Reading


Metzker, ML. (2010) Sequencing technologies - the next generation. Nature Reviews Genetics 11:31-46.

Short Read Archive

http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?

Short Read Archive Handbook

http://www.ncbi.nlm.nih.gov/books/NBK47528/

Aspera Connect

http://www.asperasoft.com/en/products/client_software_2/aspera_connect_8

High performance file transfer for getting data from the Short Read Archive

SRA Toolkit

http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software


RNA-Seq Workflow


RNA-Seq: FASTQ file format


Alignment: SAM file format


Annotation: GTF, BED file format


Alignment Counts: RPKM


Statistical analysis
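RPKM (reads per kilobase of transcript per million mapped reads) normalizes a raw alignment count for both gene length and sequencing depth. A minimal sketch, assuming the standard definition RPKM = 10^9 x C / (N x L), where C is the number of reads mapped to the gene, N the total number of mapped reads, and L the gene length in bases; the numbers in the example are invented.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # 500 reads on a 2,000 bp gene in a library of 10 million mapped reads
    print(rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=10_000_000))
```

Two genes with the same raw count can have very different RPKM values if their lengths differ, which is the point of the normalization.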

FASTQ: Data Format


FASTQ


Text based


Encodes sequence calls and quality scores with ASCII characters


Stores minimal information about the sequence read


4 lines per sequence


Line 1: begins with "@", followed by a sequence identifier and an optional description


Line 2: the sequence


Line 3: begins with "+", optionally followed by the same sequence identifier and description


Line 4: encoding of quality scores for the sequence in line 2


References/Documentation


http://maq.sourceforge.net/fastq.shtml


Cock et al. (2009). Nucleic Acids Res 38:1767-1771.
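The four-line record layout described above can be read with a few lines of Python. This is a minimal sketch that assumes the simple form of FASTQ, where the sequence and quality strings each occupy exactly one line; the example record is invented.

```python
import io


def read_fastq(handle):
    """Yield (identifier line, sequence, quality) for each 4-line record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"line 1 must start with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"line 3 must start with '+': {plus!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


if __name__ == "__main__":
    example = "@read1 made-up example\nGATTACA\n+\nIIIIIII\n"
    for identifier, sequence, quality in read_fastq(io.StringIO(example)):
        print(identifier, sequence, quality)
```

The length check matters because "@" and "+" are also valid quality characters, so a parser cannot rely on those markers alone to find record boundaries.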


FASTQ Example

FASTQ example from: Cock et al. (2009). Nucleic Acids Res 38:1767-1771.

For analysis, it may be necessary to convert to the Sanger form of FASTQ. For example:


Illumina stores quality scores ranging from 0-62; Sanger quality scores range from 0-93.


Solexa quality scores have to be converted to PHRED quality scores.
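The conversions mentioned above can be sketched as follows. The ASCII offsets (33 for Sanger, 64 for Illumina 1.3+ and Solexa) and the Solexa-to-PHRED formula are taken from the FASTQ description in Cock et al.; the function names are made up for this example, and a real pipeline would normally use an existing converter rather than this sketch.

```python
import math

SANGER_OFFSET = 33
ILLUMINA_OFFSET = 64  # also used by the older Solexa encoding


def illumina13_to_sanger(quality):
    """Re-encode PHRED scores from ASCII offset 64 to ASCII offset 33."""
    return "".join(chr(ord(c) - ILLUMINA_OFFSET + SANGER_OFFSET) for c in quality)


def solexa_to_phred(q_solexa):
    """Convert one Solexa score to a PHRED score.

    Q_phred = 10 * log10(10 ** (Q_solexa / 10) + 1)
    """
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def solexa_to_sanger(quality):
    """Convert a Solexa quality string to a Sanger quality string."""
    return "".join(
        chr(round(solexa_to_phred(ord(c) - ILLUMINA_OFFSET)) + SANGER_OFFSET)
        for c in quality
    )


if __name__ == "__main__":
    print(illumina13_to_sanger("hhhh"))
    print(round(solexa_to_phred(0), 2))
```

For high scores the Solexa and PHRED scales are nearly identical, so the formula only changes low-quality values noticeably.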

Data deposited in GEO with accession id GSE20846

Example Data

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20846

http://www.ncbi.nlm.nih.gov/sra?term=SRP002119

SRP002119 (study/project)


SRX017794 (experiment)


SRS025246 (source)



SRR037945 (run)



SRR037946 (run)

SRA to FASTQ


NCBI's SRA Tools contains utilities to convert SRA format to FASTQ


fastq-dump


If the utilities and the sra formatted file are in the same directory, the command line is:

fastq-dump <name of sra formatted file>


NOTE: Downloading and working with next generation sequence data will
very quickly exceed the capacity of a typical desktop or laptop computer. You
will need appropriate infrastructure in place to work with these files…or
consider scalable Cloud storage and compute services!
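For scripted conversions, the fastq-dump call can be wrapped in Python. This is a sketch under assumptions: it expects fastq-dump to be on the PATH rather than in the current directory, uses no options beyond the file name, and the file name SRR037945.sra is a hypothetical local copy of one of the runs listed earlier.

```python
import shutil
import subprocess


def fastq_dump_command(sra_file):
    """Build the argument list for: fastq-dump <name of sra formatted file>."""
    return ["fastq-dump", sra_file]


def run_fastq_dump(sra_file):
    """Run fastq-dump on one file, failing early if the tool is missing."""
    if shutil.which("fastq-dump") is None:
        raise FileNotFoundError("fastq-dump not found on PATH; install the SRA Toolkit")
    # check=True raises CalledProcessError if the conversion fails
    subprocess.run(fastq_dump_command(sra_file), check=True)


if __name__ == "__main__":
    print(" ".join(fastq_dump_command("SRR037945.sra")))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.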

TopHat

Trapnell et al. (2009). Bioinformatics 25:1105-1111.

http://tophat.cbcb.umd.edu/

Figure from: Trapnell et al. (2010). Nature Biotechnology 28:511-515.

TopHat is better suited to aligning RNA-Seq data than other aligners (Maq, BWA) because it takes splicing into account during the alignment process.

Trapnell C et al. Bioinformatics 2009;25:1105-1111

TopHat is built on the Bowtie alignment algorithm.

SAM (Sequence Alignment/Map)


It may not be necessary to align reads from scratch; you can instead use existing alignments in SAM format


SAM is the output of aligners that map reads to a reference genome


Tab delimited, with a header section and an alignment section


Header lines begin with @ (the header section is optional)


Alignment section has 11 mandatory fields


BAM is the binary format of SAM

http://samtools.sourceforge.net/

http://samtools.sourceforge.net/SAM1.pdf
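The layout described above (optional "@" header lines, then tab-delimited alignment lines with 11 mandatory fields) can be parsed with a short Python sketch. The field names follow the current SAM specification (older versions of the specification used different names for the mate fields), and the example alignment line is invented.

```python
from typing import List, NamedTuple, Optional


class SamAlignment(NamedTuple):
    qname: str   # query (read) name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # ASCII-encoded base qualities
    optional: List[str]  # any optional TAG:TYPE:VALUE fields


def parse_sam_line(line: str) -> Optional[SamAlignment]:
    """Parse one SAM line; return None for header lines."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    return SamAlignment(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], fields[11:],
    )


if __name__ == "__main__":
    example = "read1\t0\tchr2\t19708900\t255\t7M\t*\t0\t0\tGATTACA\tIIIIIII\tNM:i:0"
    print(parse_sam_line(example))
```

For BAM files, or for anything beyond inspection of a few lines, an established library or samtools itself is the safer choice.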

Mandatory Alignment Fields

http://samtools.sourceforge.net/SAM1.pdf

Alignment Examples

Alignments in SAM format

Cufflinks

Trapnell et al. (2010). Nature Biotechnology 28:511-515.

http://cufflinks.cbcb.umd.edu/



Assembles transcripts,


Estimates their abundances, and


Tests for differential expression and regulation in RNA-Seq samples

Cufflinks Output


Gene expression


Transcript expression


Assembled transcripts

Annotations


Mapping reads to specific transcripts/genes

Data Visualization


UCSC Browser (accessible from Galaxy)


Trackster (native to Galaxy)


External visualization tools:


Genome Workbench


http://www.ncbi.nlm.nih.gov/projects/gbench/


Integrative Genomics Viewer (IGV)


http://www.broadinstitute.org/igv/

Statistical Analysis


Once the mapping and genome summarization are done, the data can be analyzed just like any other count data


Bullard, et al. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11:94.


Typical RNA-Seq Project Work Flow


Tissue Sample -> Total RNA -> mRNA -> cDNA -> Sequencing -> FASTQ file -> QC -> TopHat -> Cufflinks -> Gene/Transcript/Exon Expression -> Visualization -> Statistical Analysis

JAX Computational Sciences Service

Galaxy

http://main.g2.bx.psu.edu/

See Tutorial 1

Build and share data and analysis workflows

No programming experience required

Strong and growing development and user community

RNA-Seq Workflow


Convert data to FASTQ


Upload files to Galaxy


Quality Control


Throw out low quality sequence reads, etc.


Map reads to a reference genome


Many algorithms available


Trade off between speed and sensitivity


Data summarization


Associating alignments with genome annotations


Counts


Data Visualization


Statistical Analysis
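The quality control step in the workflow above ("throw out low quality sequence reads") could look like the following sketch. It assumes Sanger-encoded qualities (ASCII offset 33) and filters on mean quality; the threshold of 20 and the minimum length are arbitrary illustrative values, not recommendations from the slides.

```python
def mean_phred(quality, offset=33):
    """Mean PHRED score of an ASCII-encoded quality string."""
    if not quality:
        raise ValueError("empty quality string")
    return sum(ord(c) - offset for c in quality) / len(quality)


def filter_reads(reads, min_mean_quality=20, min_length=1):
    """Yield only the (identifier, sequence, quality) reads that pass QC."""
    for identifier, sequence, quality in reads:
        if not sequence or len(sequence) < min_length:
            continue  # drop empty or too-short reads
        if mean_phred(quality) >= min_mean_quality:
            yield identifier, sequence, quality


if __name__ == "__main__":
    reads = [
        ("good", "ACGT", "IIII"),  # 'I' encodes PHRED 40
        ("bad", "ACGT", "!!!!"),   # '!' encodes PHRED 0
    ]
    print([identifier for identifier, _seq, _qual in filter_reads(reads)])
```

Other common QC choices, such as trimming low-quality ends instead of discarding whole reads, fit the same pattern.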

Tools

History

Dialog/Parameter Selection

Uploading Data to Galaxy

Because of the size of most sequence files, it is necessary to use FTP to get files to Galaxy.

Select the appropriate reference genome at the time of data upload.

You can upload compressed files and they will be uncompressed upon loading into Galaxy.

Tutorial Web Site

http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/index.shtml

This site will be accessible after the meeting. Check back for updates and new tutorials.

next time


Analyze project data with DAVID


Convert probeset ids to genes


Look for enrichment of functional terms


Try the first part of Tutorial 5 in Galaxy