3.2 Running MeRIP-PF

helmetpastoralSoftware and s/w Development

Dec 13, 2013 (3 years and 8 months ago)

161 views

MeRIP
-
PF Manual
















Released
2013
-
01
-
24

Yuli Li
, Shuhui Song

et al.

Beijing Institute of Genomics,

Chinese Academy of Sciences
, Beijing

Contact
:

li
yl@big.ac.cn
;

songs
hh@big.ac.cn


1

OVERVIEW

................................
................................
................................
................................
.

3

1.1

Background

................................
................................
................................
...................

3

1.2

Summary of MeRIP
-
PF

................................
................................
................................
..

3

1.3

Implementation

................................
................................
................................
.............

3

1.4

Availability

................................
................................
................................
.....................

3

2

INSTALLATION

................................
................................
................................
............................

4

2.1

Perl and R

................................
................................
................................
......................

4

2.2

BWA

................................
................................
................................
...............................

4

2.3

SAMtools

................................
................................
................................
.......................

4

2.4

BEDTools

................................
................................
................................
........................

4

3

USAGE

................................
................................
................................
................................
.......

5

3.1

Preparing reference files

................................
................................
...............................

5

3.2

Running MeRIP
-
PF

................................
................................
................................
.........

6

3.2.1

Setting up the Config file

................................
................................
...................

7

3.3

Output files format

................................
................................
................................
........

8

3.4

Produce wig plots

................................
................................
................................
........

10

3.4.1

Usage

................................
................................
................................
...............

11


1

OVERVIEW

1.1

Background

Next generation parallel sequencing technologies
m
ake

m
6
A
-
specific methylated RNA
immunoprecipitation followed by sequencing a popular strategy to study
transcriptome
-
wide
RNA modification
s
,
while creating challenges for analysis

, especially in peak
-
finding.

However,
t
here

have been

no any available too
l
s or
software
s

for MeRIP
-
Seq data analysis yet.

Here,
we present
a high
-
efficiency and easily
-
used a
nalysis

pipeline

called MeRIP
-
PF, which

is
a
publicly available open source and
specially
developed

for
MeRIP
-
Seq

peak
-
calling
with control
samples
. MeRIP
-
PF achieves m
6
A regions detection and annotation, and powerful graphical
display which are useful for further study.

1.2

Summary

of MeRIP
-
PF

We integrate four modules, including mapping, testing, annotating

and plotting into one program

to complete the whole a
nalysis.

The pipeline requires two Fastq
-
formatted data
,

genome
reference sequences of the corresponding species

and

several
annotated
BED

files with the gene
structure

information. And it will output 4 results files
, giving

the complete information of the

modification profile.


1.3

Implementation

The pipeline program was written in Perl, and run in a Linux machine cluster;

each
node has 8
cores with 2.00G Hz processor and 16G RAM.
MeRIP
-
PF

requires the installation of
Perl and
R
language program,

and
BWA,
SAMt
ools and
BEDT
ools.

1.4

Availability

T
he MeRIP
-
PF package including an example dataset is available at

http://software.big.ac.cn/MeRIP
-
PF.html


2

INSTALLATION

2.1

Perl and R

Ensure that Perl is installed
.

Instal
l R from http://www.r
-
project.org/

!!!Note
:

Perl and R need be installed at the root pathway.

2.2

BWA

D
ownload any version
of BWA

(e.g. bwa
-
0.6.2.tar.bz2)

from

http://sourceforge.net/projects/bio
-
bw
a/files
/

>
bzip2
-
d bwa
-
0.6.2.tar.bz2

>
tar
-
xf bwa
-
0.6.2.tar

>
cd bwa
-
0.6.2/

>
make

2.3

SAM
tools

D
ownload
installation

file from
http://sourceforge.net/projects/samtools/files/s
amtools/0.1.1
8/

>
bzip2
-
d samtools
-
0.1.18.tar.bz2

>
tar
-
xf samtools
-
0.1.18.tar

>
cd samtools
-
0.1.18/

>
make

2.4

BEDTools

D
ownload
installation

file from
http://code.google.com/p/bedtools/downlo
ads/l
ist

>
gunzip BEDTools.v2.16.2.tar.gz

>
tar
-
xf BEDTools.v2.16.2.tar

>
cd BEDTools
-
Version
-
2.16.2/

>
make clean

>
make all



3

USAGE

3.1

Preparing reference files

You need prepare some
BED files
. D
ownload from
http://genome.ucsc.edu/cgi
-
bin/hgTable
s
,

tak
ing

the species of human for example.



Downloadin
g

setting up:



*clade:

Mammal


*genome:

Human


*assembly:

Feb 2009(GRCh37/hg19)


*group:

Genes and Gene Prediction Tracks


*track:

RefSeq Gene


*table:

refGene


*region:

genome


*output format:

BED
-
browser extensible data


*output file:

hg19


*file type returned:

plain text



Click "get output";




Then, c
hoose "Whole Gene",

"Exons",

"Introns",

"5' UTR Exons",

"Coding Exons"

and "3' UTR
Exons" in turn, then
click "get BED",

download;



Name

them

"hg19_gene.bed",

"hg19_exon.bed",

"hg19_intron.bed",

"hg19_cds.bed",

"hg19_utr
-
5.bed",

"hg19_utr
-
3.bed",

respectively;



Put files of "hg19_cds.bed,

hg19_intron.bed,

hg19_utr
-
5.bed,

hg19_utr
-
3.bed" into one
directory,
and

name it Prot
ein
_
Coding/;



Put file of "hg19_exon.bed,

hg19_intron.bed" into another directory,
and
name it
Non
Protein
_
Coding/
;



Finally, c
hange output format from "BED
-
browser extensible data" into "all fields from
selected table",

output file named "hg19_a
ll_field"

and download it.

3.2

Running MeRIP
-
PF

perl
MeRIP
-
PF
.pl Sample.config

The
format of

Sample.config

file
:

**********INPUT FILES**********

Genome Sequence in Fasta:

XXX/mm9/bwa_index/chr19.fa

Fastq File of Sample Control:

XXX/DemoCell/demo
-
ctrl.fq

Fastq
File of Sample IP

(m
6
A):

XXX/DemoCell/demo
-
m
6
a.fq


**********UCSC REFERENCE FILES********

File of Whole Transcripts:

XXX/mm9/mm9_gene.bed

Directory of Protein
-
Coding
-
Genes Reference:

XXX/mm9/

Prot
ein
_
Coding

Directory of NonProtein
-
Coding
-
Genes Reference:

X
XX/mm9/

Non
Protein
_
Coding

File of Gene Function:

XXX/mm9/mm9_all_field


**********OUTPUT*********

Output Directory:

XXX/DemoCell/out


***********OPTIONS**********

Peak Size:

200

Length of Fastq Sequence:

36

Reads

Length

after

Clipping
: 36

Fisher's Exact Te
st Cutoff:

0.05

FDR Cutoff:

0.05

PBS Jobs Running Queue Name:

bioque

Tracking Queue Name:

bioque


**********ADDITIONAL OPTIONS**********

Directory of BWA Installation:

XXX/bwa
-
0.6.2

Directory of Samtools Installation:

XXX/samtools

Directory of BEDTools Ins
tallation:

XXX/BEDTools
-
Version
-
2.16.2/bin

File of 'submit_scripts_to_PBS.pl':

XXX/submit_scripts_to_PBS.pl

File of 'Fisher_Test':

XXX/Fisher_Test_Genome_Left.pl


!!!NOTE
:




We here just ana
lyze single
-
end sequencing data
. I
f you have paired
-
end data, you
may just
use either end or combine them into one file as single
-
end.



Users should

prepare

the config file at
first;

the way of setting up is showed below.


3.2.1

Setting
up

the Config file

!!!NOTE
:
Directories in ADDITIONAL OPTIONS and OUTPUT should be absolute
paths.

**********INPUT FILES**********

Genome Sequence
i
n Fasta
:
filepath/filename


#filepath/filename
:
the genome sequence of the species in the format of fasta

Fastq File
o
f Sample Control
:
filepath/filename

#filepath/filename
:
the fastq file of your con
trol sample

Fastq File
o
f Sample IP

(m
6
A)
:
filepath/filename

#filepath/filename#
:
the fastq file of your m
6
A sample


**********UCSC REFERENCE FILES********

File of Whole Transcripts
:
filepath/filename

#filepath/filena
me
:

filepath
/
hg19_gene.bed

Directory of

Protein
-
Coding
-
Genes Reference
:
filepath

#filepath
:
filepath

/
protein
/

Directory of NonProtein
-
Coding
-
Genes Reference
:
filepath

#filepath
:

filepath
/
non
-
protein
/

File of Gene Function
:
f
ilepath/filename

#filepath/filename
:
filepath

/
hg19_all_field


*******
****OPTIONS**********

Peak Size
:
integer

#integer
:
If the length of your library
fragment

is ~100bp, you can set this option 200.

Length
o
f Fastq Sequence
:
integer

#integer
:
the length of your
sequencing reads

Fisher's Exact Test Cutoff
:
float

#float
:
the
cutoff of p
-
value by fisher's exact test

FDR Cutoff
:
float

#float
:
the cutoff of q
-
value by Benjamini

Hochberg
method

PBS Jobs Running Queue Name
:
QueueName1

#QueueName
:
the queue name of PBS jobs running

Tracking Que
u
e Name
:
QueueName2

#QueueName
:
the que
ue name of t
racking
, which
could be same as QueueName1


**********ADDITIONAL OPTIONS**********

Directory of BWA Installation
:
filepath

#filepath
:
Path
/bwa
-
0.6.2

Directory of Samtools Installation
:

#filepath
:
Path
/samtools
-
0.1.18

Directory of BEDTools Inst
allation
:
filepath

#filepath
:
Path
/BEDTools
-
Version
-
2.16.2

File of 'submit_scripts_to_PBS.pl'
:
filepath/
submit_scripts_to_PBS.pl

#filepath/filename
:
Path
/submit_scripts_to_PBS.pl

**********OUTPUT*********

Output Directory
:
filepath

#filepath
:
make a new di
rectory for output files and temp files

3.3

O
utput files format

FILE
1
: Reads_Overview.txt

This file
supplies the basic status of the two sequencing

data
, including reads mapping status,
t
ranscriptome
-
wide distribution of m
6
A peaks
, r
eads
d
istribution among
d
if
ferent regions of
transcripts in Control Sample and
MeR
IP Sample
r
espectively
,
and
Control Sample
g
ene
e
xpression
r
egardless of
r
eads
m
apped to
j
unctions
.


FILE
2
:

Peak_All.xls

This will generate a tab
-
key
-
separated file containing the information of peak l
ocation

Column1=Chromosome:

chromosome on which the peak resides

Column2=PeakStart:

position from which the peak starts

Column3=PeakEnd:

position with which the peak ends

Column4=PeakSize:

range the peak spans

Column5
-
Column6
-
Column7:

genomic region with w
hich the peak overlaps

Column8:

transcript in which the peak locate


FILE3
:
Gene_List.xls

This file offers

the peak annotation information in term of genes

(transcripts)
.

Column1=

GeneID

Column2=

Peak_Cnt:

count of peaks located in this gene

Column3=

Peak_
Start:

positions with which every peak starts

Column4=

Peak_Size:

range every peak spans

Column5=

Peak_Region:

genomic region

(cds,

intergenic,

intron,

utr3,

utr5) with which every
peak resides

Column6=

Fraction:

the fraction that the peaks overlap with co
rresponding genomic regions

Column7=

CL_Rds_Cnt
:

reads count of
every

peak

located in this gene in Sample Control

Column8=

CL_
R
PM
:

R
PM

(reads per million)

of
every

peak

located in this gene i
n

Sample Control

Column9=

IP_Rds_Cnt
:

reads count of
every
peak

l
ocated in this gene
of

Sample m
6
A

Column10=

IP_
R
PM
:

R
PM
(reads per million)
of
every
peak located in this gene of

Sample m
6
A

Column11=

Enrichment
: the enrichment score of every peak
(the ratio of MeRIP sample reads to
non
-
IP sample reads within the area of
a peak, each normalized to the number of
uniquely mapped reads within the sample)


Column12=

S
t
rand
: the strand in which the gene locate
s

Column11=

Chr:

chromosome with which this gene resides

Column12=

GeneName

!!!NOTE
:

Column
s
3
-
11

are all semicolon
-
sepa
rated when Column2>1
.

FILE4
:
Plot_Fig
.pdf

This file
shows



Transcriptome
-
wide distribution of m
6
A peaks

(Figure 1A)
.

Pie chart
s

show the percentage of
non
-
IP reads (top) and
m
6
A peaks

(bottom)

within distinct regions of RNA; NP stands for
non
-
protein codin
g genes, while PR stands for protein coding genes.



D
istribution of m
6
A peaks along mRNA

(Figure 1B)
.

5’UTRs, CDSs and 3’UTRs of every
transcript are
separately

binned into regions spanning 1% of their total lengths;
Y
-
coordinates represent percentage of m
6
A peaks located in every bin.



C
orrelation
between

gene
expression level

and
m
6
A
peak enrichment

(Figure 1C)
.

Plotted is
the peak enrichment value relative to the abundance of the transcript within the input RNA.

NP_Exonic
CDS
NP_Intronic
3'UTR
5'UTR
PR_Intronic
Intergenic
A
Reads in Control Sample
NP_Exonic
CDS
NP_Intronic
3'UTR
5'UTR
PR_Intronic
Intergenic
m
6
A

P
e
a
k
s
5'UTR
CDS
3'UTR
P
e
r
c
e
n
t
a
g
e

o
f

m
6
A

P
e
a
k
s

(%)
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
B
0.01
0.1
1
10
100
1000
1
2
4
8
16
32
64
128
256
C
RPKM
Peak Enrichment

Figure 1 The output file of

Plot_Fig
.pdf”
.

3.4

Lite version of MeRIP
-
PF:

MeRIP
-
PF_Lite

I
n this
version, you can take BED files
as input, which means that you can choose alternative
mapping softwares besides B
WA, such as Bowtie 2,
TopH
at 2
, GSNAP and so on.

Usage:

P
erl MeRIP
-
PF_Lite.pl Config_Lite N1
N2

N1: the number of total raw reads in control sample.

N2: the number of total raw reads in MeRIP sample.

T
he format of Config_Lite is just as below:

**********INPUT FILES**********

Genome Sequence in Fasta:

XXX/mm9/bwa_index/chr19.fa

Bed

File of Sample C
ontrol:

XXX/DemoCell/demo
-
ctrl.
bed

Bed

File of Sample IP

(m
6
A):

XXX/DemoCell/demo
-
m
6
a.
bed


**********UCSC REFERENCE FILES********

File of Whole Transcripts:

XXX/mm9/mm9_gene.bed

Directory of Protein
-
Coding
-
Genes Reference:

XXX/mm9/

Prot
ein
_
Coding

Directory

of NonProtein
-
Coding
-
Genes Reference:

XXX/mm9/

Non
Protein
_
Coding

File of Gene Function:

XXX/mm9/mm9_all_field


**********OUTPUT*********

Output Directory:

XXX/DemoCell/out


***********OPTIONS**********

Peak Size:

200

Length of Fastq Sequence:

36

Fisher's
Exact Test Cutoff:

0.05

FDR Cutoff:

0.05

PBS Jobs Running Queue Name:

bioque

Tracking Queue Name:

bioque


**********ADDITIONAL OPTIONS**********

Directory of BWA Installation:

XXX/bwa
-
0.6.2

Directory of Samtools Installation:

XXX/samtools

Directory of BEDT
ools Installation:

XXX/BEDTools
-
Version
-
2.16.2/bin

File of 'submit_scripts_to_PBS.pl':

XXX/submit_scripts_to_PBS.pl

File of 'Fisher_Test':

XXX/Fisher_Test_Genome_Left.pl


3.5

Produce wig plots

W
e also provide a program

that can once produce wig plots of all th
e transcripts with m
6
A peaks,
which is helpful for further study.


F
igure 2
shows




A
n example of transcripts in wig plot.
Y
-
coordinates show the read coverage of every
position in transcripts. Different rectangles stand for different regions of transcripts
, and
blank ones are intronic regions.

Red triangle indicates the peak position.


Fig 2

Wig plots


3.5.1

Usage

p
erl bed
2
wig_
plot.pl

<
OPTIONS
>

OPTIONS
:

-
d1


the output dir
ectory



-
d2


the
MeRIP
-
PF
output
directory



-
pm1

the program
of
submit_to_PBS.pl


-
pm
2

the program

of

bed2wig_per_gene.pl


-
que


the queue


-
tool



BEDTools absolute pathway

(
XXX
/BEDTools
-
Version
-
2.16.2/bin/)

!!!
NOTE
:



O
ptions above are all
necessary

for wig plotting.



In wig plots, the absolute positions of transcripts are as x
-
coordinate
s, and reads coverage of
every base are as y
-
coordinates.