MeRIP
-
PF Manual
Released
2013
-
01
-
24
Yuli Li
, Shuhui Song
et al.
Beijing Institute of Genomics,
Chinese Academy of Sciences
, Beijing
Contact
:
li
yl@big.ac.cn
;
songs
hh@big.ac.cn
1
OVERVIEW
................................
................................
................................
................................
.
3
1.1
Background
................................
................................
................................
...................
3
1.2
Summary of MeRIP
-
PF
................................
................................
................................
..
3
1.3
Implementation
................................
................................
................................
.............
3
1.4
Availability
................................
................................
................................
.....................
3
2
INSTALLATION
................................
................................
................................
............................
4
2.1
Perl and R
................................
................................
................................
......................
4
2.2
BWA
................................
................................
................................
...............................
4
2.3
SAMtools
................................
................................
................................
.......................
4
2.4
BEDTools
................................
................................
................................
........................
4
3
USAGE
................................
................................
................................
................................
.......
5
3.1
Preparing reference files
................................
................................
...............................
5
3.2
Running MeRIP
-
PF
................................
................................
................................
.........
6
3.2.1
Setting up the Config file
................................
................................
...................
7
3.3
Output files format
................................
................................
................................
........
8
3.4
Produce wig plots
................................
................................
................................
........
10
3.4.1
Usage
................................
................................
................................
...............
11
1
OVERVIEW
1.1
Background
Next generation parallel sequencing technologies
m
ake
m
6
A
-
specific methylated RNA
immunoprecipitation followed by sequencing a popular strategy to study
transcriptome
-
wide
RNA modification
s
,
while creating challenges for analysis
, especially in peak
-
finding.
However,
t
here
have been
no any available too
l
s or
software
s
for MeRIP
-
Seq data analysis yet.
Here,
we present
a high
-
efficiency and easily
-
used a
nalysis
pipeline
called MeRIP
-
PF, which
is
a
publicly available open source and
specially
developed
for
MeRIP
-
Seq
peak
-
calling
with control
samples
. MeRIP
-
PF achieves m
6
A regions detection and annotation, and powerful graphical
display which are useful for further study.
1.2
Summary
of MeRIP
-
PF
We integrate four modules, including mapping, testing, annotating
and plotting into one program
to complete the whole a
nalysis.
The pipeline requires two Fastq
-
formatted data
,
genome
reference sequences of the corresponding species
and
several
annotated
BED
files with the gene
structure
information. And it will output 4 results files
, giving
the complete information of the
modification profile.
1.3
Implementation
The pipeline program was written in Perl, and run in a Linux machine cluster;
each
node has 8
cores with 2.00G Hz processor and 16G RAM.
MeRIP
-
PF
requires the installation of
Perl and
R
language program,
and
BWA,
SAMt
ools and
BEDT
ools.
1.4
Availability
T
he MeRIP
-
PF package including an example dataset is available at
http://software.big.ac.cn/MeRIP
-
PF.html
2
INSTALLATION
2.1
Perl and R
Ensure that Perl is installed
.
Instal
l R from http://www.r
-
project.org/
!!!Note
:
Perl and R need be installed at the root pathway.
2.2
BWA
D
ownload any version
of BWA
(e.g. bwa
-
0.6.2.tar.bz2)
from
http://sourceforge.net/projects/bio
-
bw
a/files
/
>
bzip2
-
d bwa
-
0.6.2.tar.bz2
>
tar
-
xf bwa
-
0.6.2.tar
>
cd bwa
-
0.6.2/
>
make
2.3
SAM
tools
D
ownload
installation
file from
http://sourceforge.net/projects/samtools/files/s
amtools/0.1.1
8/
>
bzip2
-
d samtools
-
0.1.18.tar.bz2
>
tar
-
xf samtools
-
0.1.18.tar
>
cd samtools
-
0.1.18/
>
make
2.4
BEDTools
D
ownload
installation
file from
http://code.google.com/p/bedtools/downlo
ads/l
ist
>
gunzip BEDTools.v2.16.2.tar.gz
>
tar
-
xf BEDTools.v2.16.2.tar
>
cd BEDTools
-
Version
-
2.16.2/
>
make clean
>
make all
3
USAGE
3.1
Preparing reference files
You need prepare some
BED files
. D
ownload from
http://genome.ucsc.edu/cgi
-
bin/hgTable
s
,
tak
ing
the species of human for example.
Downloadin
g
setting up:
*clade:
Mammal
*genome:
Human
*assembly:
Feb 2009(GRCh37/hg19)
*group:
Genes and Gene Prediction Tracks
*track:
RefSeq Gene
*table:
refGene
*region:
genome
*output format:
BED
-
browser extensible data
*output file:
hg19
*file type returned:
plain text
Click "get output";
Then, c
hoose "Whole Gene",
"Exons",
"Introns",
"5' UTR Exons",
"Coding Exons"
and "3' UTR
Exons" in turn, then
click "get BED",
download;
Name
them
"hg19_gene.bed",
"hg19_exon.bed",
"hg19_intron.bed",
"hg19_cds.bed",
"hg19_utr
-
5.bed",
"hg19_utr
-
3.bed",
respectively;
Put files of "hg19_cds.bed,
hg19_intron.bed,
hg19_utr
-
5.bed,
hg19_utr
-
3.bed" into one
directory,
and
name it Prot
ein
_
Coding/;
Put file of "hg19_exon.bed,
hg19_intron.bed" into another directory,
and
name it
Non
Protein
_
Coding/
;
Finally, c
hange output format from "BED
-
browser extensible data" into "all fields from
selected table",
output file named "hg19_a
ll_field"
and download it.
3.2
Running MeRIP
-
PF
perl
MeRIP
-
PF
.pl Sample.config
The
format of
Sample.config
file
:
**********INPUT FILES**********
Genome Sequence in Fasta:
XXX/mm9/bwa_index/chr19.fa
Fastq File of Sample Control:
XXX/DemoCell/demo
-
ctrl.fq
Fastq
File of Sample IP
(m
6
A):
XXX/DemoCell/demo
-
m
6
a.fq
**********UCSC REFERENCE FILES********
File of Whole Transcripts:
XXX/mm9/mm9_gene.bed
Directory of Protein
-
Coding
-
Genes Reference:
XXX/mm9/
Prot
ein
_
Coding
Directory of NonProtein
-
Coding
-
Genes Reference:
X
XX/mm9/
Non
Protein
_
Coding
File of Gene Function:
XXX/mm9/mm9_all_field
**********OUTPUT*********
Output Directory:
XXX/DemoCell/out
***********OPTIONS**********
Peak Size:
200
Length of Fastq Sequence:
36
Reads
Length
after
Clipping
: 36
Fisher's Exact Te
st Cutoff:
0.05
FDR Cutoff:
0.05
PBS Jobs Running Queue Name:
bioque
Tracking Queue Name:
bioque
**********ADDITIONAL OPTIONS**********
Directory of BWA Installation:
XXX/bwa
-
0.6.2
Directory of Samtools Installation:
XXX/samtools
Directory of BEDTools Ins
tallation:
XXX/BEDTools
-
Version
-
2.16.2/bin
File of 'submit_scripts_to_PBS.pl':
XXX/submit_scripts_to_PBS.pl
File of 'Fisher_Test':
XXX/Fisher_Test_Genome_Left.pl
!!!NOTE
:
We here just ana
lyze single
-
end sequencing data
. I
f you have paired
-
end data, you
may just
use either end or combine them into one file as single
-
end.
Users should
prepare
the config file at
first;
the way of setting up is showed below.
3.2.1
Setting
up
the Config file
!!!NOTE
:
Directories in ADDITIONAL OPTIONS and OUTPUT should be absolute
paths.
**********INPUT FILES**********
Genome Sequence
i
n Fasta
:
filepath/filename
#filepath/filename
:
the genome sequence of the species in the format of fasta
Fastq File
o
f Sample Control
:
filepath/filename
#filepath/filename
:
the fastq file of your con
trol sample
Fastq File
o
f Sample IP
(m
6
A)
:
filepath/filename
#filepath/filename#
:
the fastq file of your m
6
A sample
**********UCSC REFERENCE FILES********
File of Whole Transcripts
:
filepath/filename
#filepath/filena
me
:
filepath
/
hg19_gene.bed
Directory of
Protein
-
Coding
-
Genes Reference
:
filepath
#filepath
:
filepath
/
protein
/
Directory of NonProtein
-
Coding
-
Genes Reference
:
filepath
#filepath
:
filepath
/
non
-
protein
/
File of Gene Function
:
f
ilepath/filename
#filepath/filename
:
filepath
/
hg19_all_field
*******
****OPTIONS**********
Peak Size
:
integer
#integer
:
If the length of your library
fragment
is ~100bp, you can set this option 200.
Length
o
f Fastq Sequence
:
integer
#integer
:
the length of your
sequencing reads
Fisher's Exact Test Cutoff
:
float
#float
:
the
cutoff of p
-
value by fisher's exact test
FDR Cutoff
:
float
#float
:
the cutoff of q
-
value by Benjamini
–
Hochberg
method
PBS Jobs Running Queue Name
:
QueueName1
#QueueName
:
the queue name of PBS jobs running
Tracking Que
u
e Name
:
QueueName2
#QueueName
:
the que
ue name of t
racking
, which
could be same as QueueName1
**********ADDITIONAL OPTIONS**********
Directory of BWA Installation
:
filepath
#filepath
:
Path
/bwa
-
0.6.2
Directory of Samtools Installation
:
#filepath
:
Path
/samtools
-
0.1.18
Directory of BEDTools Inst
allation
:
filepath
#filepath
:
Path
/BEDTools
-
Version
-
2.16.2
File of 'submit_scripts_to_PBS.pl'
:
filepath/
submit_scripts_to_PBS.pl
#filepath/filename
:
Path
/submit_scripts_to_PBS.pl
**********OUTPUT*********
Output Directory
:
filepath
#filepath
:
make a new di
rectory for output files and temp files
3.3
O
utput files format
FILE
1
: Reads_Overview.txt
This file
supplies the basic status of the two sequencing
data
, including reads mapping status,
t
ranscriptome
-
wide distribution of m
6
A peaks
, r
eads
d
istribution among
d
if
ferent regions of
transcripts in Control Sample and
MeR
IP Sample
r
espectively
,
and
Control Sample
g
ene
e
xpression
r
egardless of
r
eads
m
apped to
j
unctions
.
FILE
2
:
Peak_All.xls
This will generate a tab
-
key
-
separated file containing the information of peak l
ocation
Column1=Chromosome:
chromosome on which the peak resides
Column2=PeakStart:
position from which the peak starts
Column3=PeakEnd:
position with which the peak ends
Column4=PeakSize:
range the peak spans
Column5
-
Column6
-
Column7:
genomic region with w
hich the peak overlaps
Column8:
transcript in which the peak locate
FILE3
:
Gene_List.xls
This file offers
the peak annotation information in term of genes
(transcripts)
.
Column1=
GeneID
Column2=
Peak_Cnt:
count of peaks located in this gene
Column3=
Peak_
Start:
positions with which every peak starts
Column4=
Peak_Size:
range every peak spans
Column5=
Peak_Region:
genomic region
(cds,
intergenic,
intron,
utr3,
utr5) with which every
peak resides
Column6=
Fraction:
the fraction that the peaks overlap with co
rresponding genomic regions
Column7=
CL_Rds_Cnt
:
reads count of
every
peak
located in this gene in Sample Control
Column8=
CL_
R
PM
:
R
PM
(reads per million)
of
every
peak
located in this gene i
n
Sample Control
Column9=
IP_Rds_Cnt
:
reads count of
every
peak
l
ocated in this gene
of
Sample m
6
A
Column10=
IP_
R
PM
:
R
PM
(reads per million)
of
every
peak located in this gene of
Sample m
6
A
Column11=
Enrichment
: the enrichment score of every peak
(the ratio of MeRIP sample reads to
non
-
IP sample reads within the area of
a peak, each normalized to the number of
uniquely mapped reads within the sample)
Column12=
S
t
rand
: the strand in which the gene locate
s
Column11=
Chr:
chromosome with which this gene resides
Column12=
GeneName
!!!NOTE
:
Column
s
3
-
11
are all semicolon
-
sepa
rated when Column2>1
.
FILE4
:
Plot_Fig
.pdf
This file
shows
Transcriptome
-
wide distribution of m
6
A peaks
(Figure 1A)
.
Pie chart
s
show the percentage of
non
-
IP reads (top) and
m
6
A peaks
(bottom)
within distinct regions of RNA; NP stands for
non
-
protein codin
g genes, while PR stands for protein coding genes.
D
istribution of m
6
A peaks along mRNA
(Figure 1B)
.
5’UTRs, CDSs and 3’UTRs of every
transcript are
separately
binned into regions spanning 1% of their total lengths;
Y
-
coordinates represent percentage of m
6
A peaks located in every bin.
C
orrelation
between
gene
expression level
and
m
6
A
peak enrichment
(Figure 1C)
.
Plotted is
the peak enrichment value relative to the abundance of the transcript within the input RNA.
NP_Exonic
CDS
NP_Intronic
3'UTR
5'UTR
PR_Intronic
Intergenic
A
Reads in Control Sample
NP_Exonic
CDS
NP_Intronic
3'UTR
5'UTR
PR_Intronic
Intergenic
m
6
A
P
e
a
k
s
5'UTR
CDS
3'UTR
P
e
r
c
e
n
t
a
g
e
o
f
m
6
A
P
e
a
k
s
(%)
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
B
0.01
0.1
1
10
100
1000
1
2
4
8
16
32
64
128
256
C
RPKM
Peak Enrichment
Figure 1 The output file of
“
Plot_Fig
.pdf”
.
3.4
Lite version of MeRIP
-
PF:
MeRIP
-
PF_Lite
I
n this
version, you can take BED files
as input, which means that you can choose alternative
mapping softwares besides B
WA, such as Bowtie 2,
TopH
at 2
, GSNAP and so on.
Usage:
P
erl MeRIP
-
PF_Lite.pl Config_Lite N1
N2
N1: the number of total raw reads in control sample.
N2: the number of total raw reads in MeRIP sample.
T
he format of Config_Lite is just as below:
**********INPUT FILES**********
Genome Sequence in Fasta:
XXX/mm9/bwa_index/chr19.fa
Bed
File of Sample C
ontrol:
XXX/DemoCell/demo
-
ctrl.
bed
Bed
File of Sample IP
(m
6
A):
XXX/DemoCell/demo
-
m
6
a.
bed
**********UCSC REFERENCE FILES********
File of Whole Transcripts:
XXX/mm9/mm9_gene.bed
Directory of Protein
-
Coding
-
Genes Reference:
XXX/mm9/
Prot
ein
_
Coding
Directory
of NonProtein
-
Coding
-
Genes Reference:
XXX/mm9/
Non
Protein
_
Coding
File of Gene Function:
XXX/mm9/mm9_all_field
**********OUTPUT*********
Output Directory:
XXX/DemoCell/out
***********OPTIONS**********
Peak Size:
200
Length of Fastq Sequence:
36
Fisher's
Exact Test Cutoff:
0.05
FDR Cutoff:
0.05
PBS Jobs Running Queue Name:
bioque
Tracking Queue Name:
bioque
**********ADDITIONAL OPTIONS**********
Directory of BWA Installation:
XXX/bwa
-
0.6.2
Directory of Samtools Installation:
XXX/samtools
Directory of BEDT
ools Installation:
XXX/BEDTools
-
Version
-
2.16.2/bin
File of 'submit_scripts_to_PBS.pl':
XXX/submit_scripts_to_PBS.pl
File of 'Fisher_Test':
XXX/Fisher_Test_Genome_Left.pl
3.5
Produce wig plots
W
e also provide a program
that can once produce wig plots of all th
e transcripts with m
6
A peaks,
which is helpful for further study.
F
igure 2
shows
A
n example of transcripts in wig plot.
Y
-
coordinates show the read coverage of every
position in transcripts. Different rectangles stand for different regions of transcripts
, and
blank ones are intronic regions.
Red triangle indicates the peak position.
Fig 2
Wig plots
3.5.1
Usage
p
erl bed
2
wig_
plot.pl
<
OPTIONS
>
OPTIONS
:
-
d1
the output dir
ectory
-
d2
the
MeRIP
-
PF
output
directory
-
pm1
the program
of
submit_to_PBS.pl
-
pm
2
the program
of
bed2wig_per_gene.pl
-
que
the queue
-
tool
BEDTools absolute pathway
(
XXX
/BEDTools
-
Version
-
2.16.2/bin/)
!!!
NOTE
:
O
ptions above are all
necessary
for wig plotting.
In wig plots, the absolute positions of transcripts are as x
-
coordinate
s, and reads coverage of
every base are as y
-
coordinates.
Enter the password to open this PDF file:
File name:
-
File size:
-
Title:
-
Author:
-
Subject:
-
Keywords:
-
Creation Date:
-
Modification Date:
-
Creator:
-
PDF Producer:
-
PDF Version:
-
Page Count:
-
Preparing document for printing…
0%
Σχόλια 0
Συνδεθείτε για να κοινοποιήσετε σχόλιο