Workshop hands on session: Basic RNA-Seq analysis with Tuxedo suite and ChIP-

bewgrosseteteSoftware and s/w Development

Dec 13, 2013 (3 years and 6 months ago)

97 views

Wo
rkshop hands on session:
Basic RNA
-
S
eq analysis with Tuxedo suite

and ChIP
-
S
eq analysis

Note: This is intended as
a
step by step guide of doing basic RNA
-
seq analysis with Illumina
r
eads using To
p
hat/Cufflinks
/cummeRbund
. The data

set is a subset of real data due to

the time
and computation resource constraint
s

at the workshop
. Therefore the results here are only for
demonstration of workflow purpose.

We are only using the default
parameter setting.

Pleases
note that COPY &

PASTE the command
s

might not work depending what computer you are
using

(font change

in

linux). It is best to type the command
s

yourself.

Please do it after you are
done with the Linux tutorial and
/or

are
already
familiar with the basic Linux commands.

F
or details of

parameter setting with Tophat/C
ufflinks
/cummeRbund
, see:

Tophat manual:

http://tophat.cbcb.umd.edu/manual.html

Cufflinks manual:

http://cufflinks.cbcb.umd.edu/manual.html

Bowtie manual:

http://bowtie
-
bio.sourceforge.net/manual.shtml

Samtools manual:

http://sa
mtools.sourceforge.net/samtools.shtml

CummeRbund

manual:

http://compbio.mit.edu/cummeRbund/manual_2_0.html

1.

Download Yeast reference genome from UCSC genome browser download site

(see class
slides)
.

Use
wget

command on HPC:

$
wget
http://hgdownload.soe.ucsc.edu/goldenPath/sacCer3/bigZips/chromFa.tar.gz


This should

download

a zipped version of yeast

g
enome sequences by chromosome in

your current directory.

You can use
ls

to see it.

$

gunzip chromFa.tar.gz

$

tar
-
xvf chromFa.tar

Gunzip and tar

should uncompress

your downl
oaded file into 17 FASTA files for 17
chromosomes. Concatenate the
17 FASTA files into on
e file as your reference ge
nome
using
cat

command.

This way you do not have to index your 17 files individually.
*.fa
lists all files that end

with .fa and
> directs

the output to a file called ref
.fa.

You can also
use your own naming system. Just keep consistent with the names.

$

cat *.fa > ref
.fa

2.

Use bowtie
2

to i
ndex your reference genome ref
.fa.


f indicates that your file is in
FASTA format.

r
ef is the prefix of the
6
output index files.

$
module load bowtie
2

$
bowtie
2
-
build
-
f ref
.fa ref

You

should have six f
iles of indexes
now: ref.1
-
4.b
t
2

and r
ef.rev.1/2.b
t
2
.
Bowtie
2

indexes the genome with a Burrows
-
Wheeler index to keep its memory footprint small
.

If
you are interested in why you need to index y
our reference and what the .bt2

files mean,
refer

to

http://bowtie
-
bio.sourceforge.net/index.shtml

3.

Download
a sample

annotation file

in GTF format

I
prepared for the tutorial
. This is a
small
set of annotation so your Tophat run will not take too long.

$
wget http://h
pc.oit.uci.edu/biolinux/ref/ref
.gtf


4.

Download the

sample read files for condition 1 and 2
.
It will take a few minutes
depending on your internet speed.
You should be able to use
ls

and
less

to see them in
your current directory.

$
wget
http://hpc.oit.uci.edu/biolinux/reads/s1.fastq

$

wget
http://hpc.oit.uci.edu/biolinux/reads/s2.fastq

5.

Run Top
hat with the download re
ads: s1.fastq and s2.fastq respectively.
This should run
much faster than a normal Tophat run
, which c
ould take hours

to days
.

T forces
Tophat
to
only align the reads to the
annotated
transcriptome to save run
ning

time.

o specifies

the output folder
, c1
for condition1 and c2 for condition 2
.

G specifies

your annotation
file.

ref indicates the prefix of your 6 bowtie index files. You also need to load samtools
in order to run tophat.

$
module load samtools

$
module load tophat/2.0.6

$
tophat2
-
T
-
G
ref.gtf
-
o c1 ref s1
.fastq &

$

tophat2
-
T
-
G ref.gtf
-
o c2 ref s2.fastq &

You do not have to wait for
the two T
ophat runs to finish
in order
to move
on
to the next
step!

6.

Index and sort the
2
T
ophat mapping
s

using
samtools
command so the mapping can be
imported to I
ntegrative
G
enomics
V
iewer

(IGV)

for visualization
.

F
irst get the prepared
BAM

fil
es instead of waiting for your T
ophat run
s

to finish

using
wget
.
You also need to
sort and index your BAM files so IGV can quickly find the desired genomics loca
tion.

$
wget
http://hpc.oit.uci.edu/biolinux/tophat/s1.bam

$
wget
http://hpc.oit.uci.edu/biolinux/tophat/s2.bam

$
samtools sort
s1.bam s1.sorted

$
samtools sort s2.bam s2.sorted

$
samtools index s1.sorted.bam

$
samtools index s2.sorted.bam

7.

Visualize your indexed and sorted mapping in IGV.

$
module load igv

$

igv &

Click
File
-
>
load from file and choose s1.sorted.bam. Use
the
yeast genome and go to e.g.
chrIII:167,540
-
167,956

in the search box
and you should see your
mapped
reads! Load
s2.sorted.bam in the same way.

And both s1 and s2

have one data track in your window.
Change your coordinates to wherever you like.

8.

Run
cufflink
s

on the T
ophat mapping bam files

to assemble tra
nscripts and estimate
abundance for each sample.


o specifies cufflinks output folder as cf1 for condition1 and
cf2 for condition 2:

$
module load cufflinks


$
cufflinks
-
o cf1 s1.sorted.bam &


$
cufflinks
-
o cf2 s2.sorted.bam &

You should have two folders cf1 and cf2 for cufflinks output

when you are done
.

Use

$
ls
-
ltr
cf1



to see what is inside.


The
file
transcripts.gtf is your assembled transcripts. Genes.fpkm_tracking and
isoforms.fpkm_tracking lists

the FPKM values with its corresponding transcripts
/genes

ID and genomic location.

9.

Use cuffmerge to combine your assembled transcripts from all samples. The manifest.txt
contains the path to the two assembled transcripts.gtf files.

Use
less

command to take

a
look.

$
wget
http://hpc.oit.uci.edu/biolinux/manifest.txt

$

less manifest.txt

$
cuffmerge
-
s ref.fa manifest.txt

When the run is finished, y
ou should have a folder generate
d by cuffmerge
called
merge_asm when you are done.

There is a merged.gtf file in this folder which is your
merged GTF format transcripts.

10.

Use cuffdiff
on the two T
ophat mapping files
to do differe
ntial expression
analysi
s
between condition 1 and
2.


o specifies the output directory of cuffdiff.

L gives labels to
the 2 conditions as c1 and c2.

-
FDR sets the false discovery rate threshold to 0.05.
merged_asm/merged.gtf is the merged transcriptome annotation you generated in step 9.

$
cuffdiff
-
o cd
_1_2
-
L c1,c2

--
FDR=0.05 merged_asm/merged.gtf s1.sorted.bam
s2.sorted.bam

When cuffdiff is done, y
ou should have a folder called cd_1_2 with your cuffdiff results.

$
ls
-
ltr cd_1_2

will show you the results list.
There are around 22 files in the folder an
d you should be
mostly interested in those .diff file
s

such as gene_exp.diff.
These files are t
ext files and
can be directly imported to

excel.

$
less gene_exp.diff

For details of cuffdiff output, see
http://cufflinks.cbcb.umd.edu/manual.html#cuffdiff_output

---------------------------------------------------------------------------------------------------------------------

For

those of you with R basic knowledge or want more challenges, use the output results from
cuffdiff you generate earlier (
in the
cd_1_2 folder) and follow the protocol here. A line
-
by
-
line
instruc
t
ion is included to show you how to generate graphs of the
two samples.

For details, s
ee

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

You need to load
R

and start rstudio for interactive use
:

$ module load R

$
rstudio &

>source('http://bioconductor.org/biocLite.R')

>biocLite('cummeRbund')

>library(cummeRbund)

>
cuff
_data <
-

readCufflinks('cd_1_2
')

>csDensity(genes(cuff_data))

>c
sScatter(genes(cuff_data), 'c1’, 'c2
')

>csVolcano(genes(cuff_dat
a), 'c1', 'c2
')



For
those of you who are interested in ChIP
-
Seq and have extra time at the tutorial session, here
is some sample data published by UCI researchers (SPEBP
-
1 ChIP
-
Seq Data)


https://cbcl.ics.uci.edu/doku.php
/data

Both the raw reads and processed data are accessible at this web. The paper that describes the
dataset is here if you are interested:

http://www.pnas.org/content/106/33/13765.long

Download

the raw data from the first link using
wget

on HPC.

$
wget
https://cbcl.ics.uci.edu//public_data/SREBP1/raw_data/IgG_lane8_eland_multi.txt

$
wget https://cbcl.
ics.uci.edu//public_data/SREBP1/raw_data/srebp1_lane7_eland_multi.txt

Run
MACS

on HPC now.


g mm indicates the genome size is mouse.

t indicates the treatment
file is srebp1_lane7_eland_multi.txt.

c gives the control file is IgG_lane8_eland_multi.txt.
Use
the
perl command line to

remove the “ref_” in front of
the chromosome
names so they are
compatible with IGV genome names
.

$
module load macs

$
macs14
-
t srebp1_lane7_eland_multi.txt
-
c IgG_lane8_eland_multi.txt
-
g mm

$

perl
-
anle 's/ref_//g; print $_'

NA_peaks.bed > NA_peaks1.bed

Your results are in the current folder. Use
ls

and
less

command t
o check them out! The BED files
NA_peaks1.bed
can
be directly imported to IGV for visualization.

Choose the

mouse genome
mm8 build.

Check it out yourself!