Supplementary Data - Bioinformatics

stalliongrapevineΒιοτεχνολογία

1 Οκτ 2013 (πριν από 4 χρόνια και 3 μήνες)

89 εμφανίσεις

1


Supplement
ary

methods

1

Generate

profiles

For
the
Illumina
sequencing
-
by
-
synthesis
(
SBS
)

process
,

the
distribution
of
error rate is an abstract of the light
signals

of every DNA
fra
g-
ment

cluster

with
crosstalk, phasing, non
-
c
hastity

(
Whiteford, et al., 2009
)

and other
artifacts

such as bubbles and
mislabeled

dNTPs. As
most of these characters are
difficult to

get

clear
, i
t is
almost

impossible

to design a top
-
down profile by di
rectly modeling the
synthesis

reaction of DNA
molecule
s. Instead, pIRS
adopts

a
bottom
-
up profile derived from statistics of real re
-
sequencing data.


In

pIRS,
the
p
rofile
s

are generated
from

real re
-
sequencing data

with

known
reference
genomes, and
user can
update

them

when there is
reagent or machine update
s
.
In order to
get

the raw quality values
,
sequencing
data
without
EAMSS

filtering
(
mask

low quality values with
fixed quality value

B

)

is
preferred

here
, which can be generated by
run
ning

CASAVA

with "
--
no
-
eamss
"
parameter

to disable
the

EAMSS

filter
ing

function
.

However,
if only sequencing data

with
EAMSS

filtering

is available,
it

can also be used as training data for
profile generation,
which lose part of the
quality distribution informa
tion
.
The gap
-
tolerating alignment should be used in order to generate
the InDel profile
,

and a

SNP
/InDel

set is optional to
eliminate

variations

derived

from

genetic

polymorphism
.
In order to get

a
comprehe
n-
sive

GC%
-
depth

profile, it is better to
choose
reference

genomes with a wide
range of
GC content. Typically,
we

suggest
users to
choose
the
human

s
whole genome sequencing data.

In the pIRS package, i
ndependent t
ools

are provided

to
generate the Base
-
Calling

profile by
analy
z
ing

alignment results of SO
AP2
(
Li, et al., 2009
)

or
SAM
/BAM

files

(
Li and Durbin, 2010
)
,
and
generate
the
GC%
-
depth
profile

by
analyzing
the results of soap.coverage

(http://soap.genomics.org.cn/soapaligner.html)
with

different window size
s
.

As the
se

profiles reflect
a comprehensive set of characteristics
for

Illumina sequencing, they are also very useful
for the quality checking and control jobs (QC) in
sequencing labs.


To generate

the
Base
-
Calling
profile, the alignment results
are firstly filtered
by
keep
ing only those unique
full
-
length
mapped
read

pair
s
.
Then,
an

optional SNP site list is loaded and all the p
otential

SNP sites are skipped in later
statistic
s
.

As
the

detectable

substitution

errors

are desired

here
, strict filter
ing

can

eliminate
the influence of
mis
-
alignment and
polymorphism
.

Typically,
at least
150
M

PE reads, which
is
~
30
G
bases
with
read length 100

bp
,
is suggested

in this step.

The Base
-
Calling profile
contains

two parts:

one is t
he
overall distribution
(Dist)
matrix

which
is a
4
-
dimensional matrix
comprised of
count of quality values grouped by
the
reference bases, read cycles and called
bases
,
the other is

t
he
quality
-
transition

(
Qtrans
) matrix

which
is a
3
-
dimensional matrix
comprised of
count of quality values grouped by
the
read cycles and the previous quali
ty values
.

Since the distribution of called bases grouped by
the
reference bases is recorded in the
Dist
matrix, the distribution of substitution errors can be reproduced easily.


In
the
process to generate

GC%
-
depth profiles, soap.coverage is
firstly
run

with

-
onlyuniq”

p
arameter

on the alignment results to get the
coverage
depth of ever

base

in
the
reference
genome. Then,
tiling

windows
along

the whole genome are defined
, and

the
GC content ratio
and average base depth is
calculated

for each window
respectively
.

Note that w
indows with more than 3/4
ratio of
Ns are
omitted
.

Finally,

the
GC%
-
depth profile

which contains the average depth
information
for each GC content intervals
,

together with

other information

for
box
-
plot

drawing
,

is

g
enerat
ed
.

Typically,
at least
1
0
~20 X
sequencing coverage
is suggested

in this step.

Note that i
n some reference g
e-
nomes, there
may be

not enough windows for every GC content

intervals
, so
when a GC content
interval
come
s

with less than 100 wi
n-
dows, we use
the
LO
ESS

regression

(
Cleveland, 1979
)

algorithm

to

estimate the
corresponding

depth
value
and

smooth the
GC%
-
depth
curve, h
owever, users are still suggested to check whether the
automatically estimated
value is reasonable.

Note that the window size
should be determined carefully by considering both
the
distribution
of
genomic GC content and read length factors.



The InDel profile is generated by analyzing the same
alignment

data that is used for
generating
the Base
-
Calling
profile
.

T
he gap
-
tolerating alignment
is performed
and then the insertions and deletions
are identified
from the alignment gaps on either the reference or
reads

sequences
.
Strict filtering is also needed to eliminate false alignments an
d polymorphic InDel
variations.
The InDel profile is a 3
-
dimensional matrix
comprised of count of InDel cases

grouped by
the
read cycles
,
types
(insertion, deletion), and length of continuous
bases (1
-
3 bp).
To be practical,
a simple Indel profile
is used here
here
, however,
we are aware that the propert
ies

for InDel
errors

are

much
more
complex than that of substitution errors
.

B
esides the listed dimensions,
the
upstream base and quality,
and especially the specific
-
sequence
-
structures are also important affecting factors

for

generating InDel errors
.

2

Further analysis
for the Base
-
Calling profile

The distribution of quality values
generated

by a combination of
Qtrans

and Dist matrix is equal to that generated by only the Dist matrix,
i.e.
Qtrans

does not change the overall quality distributions in each cycle. We further generated 4 individual Qtrans matrix

that

disti
n-
guished by the 4 types of reference base, and found that all of them are very similar with
correlation coefficient

r>0.99 and
have
pass
ed

the
F
-
test of
linear correlation
, so we adopt a
universal
Qtrans matrix in pIRS.


We use
d

a simplified way to prove the above conclusion.

For read cycle 1 and 2, suppose
that
there are only 2 quality values A and B,
thus
all
the
possible quality com
binations are AA, AB, BA, BB. The quality distribution of cycle 1 and 2 are defined as below:

(
"
_
"
refers to all the quality values, here
it
includes A and B.)

2





The transition possibilities are defined as:



Through transition,
P(
_A) and P(_B) can be calculated as:



So, after transition,
P(
_A) and P(_B) are the same as the original definition. The
quality distribution for the
following cycles can be
proved in
similar

way.


Besides the combination way of Qtrans and Dist matrix used in this paper, we can also integrate the Qtrans and Dist matrix in
to a 5
-
dimensional matrix, with dimensions of
read cycle, reference base, called base, upstream quality and called quality in seque
ntial.

The
5
-
dimensional matrix is a comprehensive profile of Illumina data, which can also be used to simulate Illumina reads. However, i
t often needs
much more training data to
produce

the 5
-
dimensional matrix than the Qtrans and Dist matrix.

3

Simulate
h
eterozygous

genome sequence

To generate
the
heterozygous genome, w
e used
an
empirical
polymorphism
profile to
introduce
SNP and
small
InDel
variations to the re
f-
erence genome.
For SNPs,
the default

transition
/
transver
s
ion

ratio
is

2
.

F
or small
I
n
D
els
,
only
length
1
-
6 base
s

are used
,

which
have co
n-
tained

the major part of small In
D
els

with

the
empirical distribution

from panda sequencing
data

(
Li, et al.
)
.

Our tool
generates a haploid
genome that is heterozygous to the input reference genome.
The combination of two
haploid genomes
form
s

a diploid genome, and pIRS
can take them as input
together
to simulate heterozygous reads.

4

Simulate Illumina reads with

profiles

The r
eference sequences are parsed one
chromosome

at a time,

and so
is

the output.

In

diploid

mode, each reference file
is

sampled with
half depth

compared to that in haploid mode
.

For each
pair
-
end (
PE
)

reads,
the
insert size

value

is
determined

from

normal distribution

with
the

Box
-
Muller method

(
Box and Muller, 1958
)
, and a

start point is randomly chosen on the
chromosome
.
The PE read pair is
then
se
para
t-
ed to read 1 and read 2 randomly
,

in order
to simulate sampling on a random strand of
the
reference
chromosome
, and t
he second read of
the pair is
changed into its
reverse
and
complement
ary form
.


When
the GC
%
-
depth profile

is chosen,

the relevant ratio of GC
-
bias is obtained from the GC%
-
depth profile based on the GC content of
both pair
-
end reads
.
A random number is
generated

and
then
compared to that ratio to
determine

whether this PE read pair
should be sa
m-
3


pled
. Since
the
GC%
-
depth

profile

is influenced by its window size

seriously
, we suggest
using

window size
s

approximate
ly

equal to twice
of the read length.


For
both

reads in a pair
,
the

quality

value

of first base
are

randomly

chosen from
the
Dist matrix

with

corresponding

cycle number
s
,

and
all

the

quality values
in following cycles
are

determined
according

to the
Qtrans
matrix
. Then
the
called
quality

together with

reference
base and cycle number

are used
to
extract
the called base distribution from
the
Dist

matrix
,
from

which the corresponding
base is dete
r-
mined
.


The overall characteristic
s

for the pIRS simulated data
are

similar to

that of the training data used for profile
s

generation.
In
software d
e-
velopment, we may want the raw quality values without
EAMSS

filtering
, which contains the most complete information. However,
most
of the
currently
available Illumina
real data are after

EAMSS

filtering
,
and so we

may
also want to get simulated

data

of this kind

at some

cases.
There are two ways to achieve this go
al: (1)
run pIRS using the
Base
-
Calling profile generated from the
EAMSS

filtered

data
, which
is not prefer
r
ed
; (2) run pIRS using the
Base
-
Calling profile generated from the
raw
data

without
EAMSS

filtering
, and then mask the low
quality
values on
read tails with
fixed

B


by an
additional

masking function that is

same to
the
Illumina data
processing

pipeline.



When
the generati
on

of
bases and quality values

is finished,
pIRS
can optionally

add InDel errors
to
the
simulated PE reads
,

based on the
corresponding probability for each InDel type

&

length

on each read cycle
.



4


Supplement
ary

results and discussion
s

1

Mapping s
tatistics of the testing data

There are
3
testing
data sets
,

two

of them
from coliphage

with different reagent versions

and
the other
one from
human. The coliphage,
whose genome size is only 5.4 kb,
is

used as a reference material
for quality control
by Illumina. The human data can be easily obtained
from any human sequencing projects and here we
cho
se

one sample
HG00464

from
the 1000 Genomes Project
(
Consortium, 2010
)
.

We
used both
SOAP2

and
B
WA

to do the mapping,
and
the
ir

results are comparable,
so

only

the
BWA

result

is shown
here

(Table S1)
.

All
the
alignment

result
s

are
available

as gzipped SAM files at
ftp://ftp.genomics.org.cn/pub/pIRS/data/

.



Table
S
1
.

Statistics of the testing data used for quality evaluation

Source

Coliphage (V2)*

Coliphage (V3)*

Human

Genome accession

NC_001422

NC_001422

hg19, GRCh37. Feb.
2010

Genome
size (bp)

5,386

5,386

3,000,000,000

Accession of reads data

110315_I696_FC81MJBABXX_L1

110416_I327_FCB008GACXX_L1_PHIXCONTROLv2

ERR020234

Read length (PE, bp)

100

100

91

Insert Size

375.914 ± 30.365

406.863 ± 32.495

465.031 ± 12.506

Coverage of raw
bases (X)

397580.84

5997401.89

6.34

Mapped bases ratio

93.84%

96.71%

86.27%

Proper

data

ratio

86.32%

92.67%

73.45%

Coverage of proper
data

(X)

343197.42

5557893.83

4.66

Mismatch Ra
te
*

0.9902%

0.3206%

0.5262%

Note:
V2 and V3 refer to the version
s

of Illumina sequencing reagents.

Proper data refer
s

to reads that are uniquely mapped to the reference
with full length.
Mismatch Rate i
s calculated on the proper data, which is lower than that
from

all raw data.

2

S
tatistics
of
sequencing
substitution
errors

The Base
-
Calling profile
was generated by

only

the proper data that are uniquely mapped

to the genome with full length.
Since
diploid
human genome usually contains about 0.1% SNPs, we exclude
d

all potential SNP sites before evaluation analysis.

The
p
otential SNP
s

used
here is a combination of

SoapSNP

(
Li, et al., 2009
)

result and dbSNP

(
Sherry, et al., 2001
)

v132
, and the list of all SNP sites is also avail
a-
ble at out FTP site.

The substitution matrix
es

shown in
Table
S
2

were reproduced from the
Dist

matrix of the Base
-
Calling profile
.
For

coliphage

data
,
results using
v2 reagent shows significant higher substitution error
rate
on A
-
C and G
-
T cross talk
, comp
ared to that using

v3 reagent
in which

the error rate is much lower and cross talk is not so significant. However, since the
two

data sets are
derived
from di
f-
ferent
Illumina
machines at different time, reagent
may be

only one of the factors.

For
human

data
,
the polymorphism
SNP
s

show a great
influence on
the
substitution ratio
.

T
here is a much higher A
-
G
and C
-
T
transition rate
for with
-
SNP matrix
,

compar
ed

to the non
-
SNP
matrix. Thus it is important to eliminate SNP sites
before, in order
to
generate

a pure substitution matrix
containing

sequencing substitution

error
s

only.


Table
S
2
.

Statistics of 4x4 substitution matrix for testing data sets

reference
bases

Coliphage v2
,

sequenced bases

Coliphage v
3
,

sequenced bases

A

C

G

T

A

C

G

T

A

0.989364

0.005224

0.003020

0.002392

0.996809

0.001194

0.001486

0.000510

C

0.004811

0.988776

0.002446

0.003967

0.002287

0.995912

0.000703

0.001098

G

0.002495

0.001962

0.990469

0.005074

0.001055

0.000295

0.997824

0.000826

T

0.001877

0.002880

0.003643

0.991600

0.000714

0.000955

0.001671

0.996661


reference
bases

Human
with
-
SNP
,

sequenced bases

Human no
n
-
SNP
,

sequenced bases

A

C

G

T

A

C

G

T

A

0.994986

0.002171

0.001975

0.000868

0.995914

0.002010

0.001380

0.000697

C

0.002616

0.994115

0.001461

0.001808

0.002368

0.995480

0.001203

0.000950

G

0.001993

0.000974

0.995591

0.001441

0.001130

0.000709

0.996990

0.001171

T

0.001358

0.001781

0.002522

0.994339

0.001182

0.001204

0.002332

0.995282

Notes: The column
s

represent reference bases, while
the
rows represent sequenced bases. The normalization was made on each reference
bases for each species data.

5


3

Statistics of sequencing InDel errors

The InDel profile was also generated
using

the proper data that are uniquely mapped to the genome with full length. Since the
polymo
r-
phism
InDel data

of human

is not as
reliable

as its SNP data, here only the
Coliphage

data
was used
to generate the In
D
el profile.
The ove
r-
all insertion and deletion
ratio on read 1 and read 2 are shown in Table S3,
and the
ratio for each InDel type along the reads were shown in
Fig. S1.
It is obvious that t
he
1
-
bp insertion
s

and deletion
s

are dominant

in all
types of
InDel errors, and the ratio for all types of InDels

get
s

higher

towards the 3
-
end of reads.



Table S3.

Statistics of InDels in Illumina data



Ins
-
1bp

Ins
-
2bp

Ins
-
3bp

Insertion

Del
-
1bp

Del
-
2bp

Del
-
3bp

Deletion

Read_1

4.23E
-
06

4.26E
-
07

1.23E
-
07

4.78E
-
06

1.05E
-
05

5.22E
-
07

1.80E
-
07

1.12E
-
05

Read_2

7.35E
-
06

8.00E
-
07

2.16E
-
07

8.37E
-
06

1.39E
-
05

1.29E
-
06

3.40E
-
07

1.55E
-
05

Pair_reads

5.79E
-
06

6.13E
-
07

1.70E
-
07

6.57E
-
06

1.22E
-
05

9.07E
-
07

2.60E
-
07

1.33E
-
05

Note:
The
Coliphage v2

proper data is used here.
The
ratio of
Insertion and Deletion errors for both read

1 and read 2 were calculated, and
calibrated by the length (continuous bases).
“Insertion”
refers

to a summary of Ins
-
1bp, Ins
-
2bp, and Ins
-
3bp, and it is similar for “Del
e-
tion”.
The ratio value for Pair_reads is an average of that on Read 1 and Read 2.




Fig. S
1


Distribution of InDel ratio

for each read cycle
. Use the same data as Table S3

and the ratio of InDel is
calculated

with same style
.

Note that t
he first half read cycles refer to read 1, whist the last half cycles refer to read 2.

4

Relationship
between

quality value

and error rate

In
Illumina
data
,

substitution error
s are much more common compar
ing

to InDel errors, so without specification, all errors below
refer to

substitution error
s
.


T
he quality system of
Illumina

has been in changing since its first release. The GApipeline v1.3+ use
s

phred
-
scale qualit
y values

and the
v1.5+ mask
s

low quality
values

with
fixed
“B”

under default settings
. All our testing data
was

generated by v1.5+, however, we use
d

an
option “
--
no
-
eamss” in BclConverter to disable the
EAMSS filtering
function, and generated
data with
raw

quality values for
the
coliphage
and human data sets.

If the quality values are assigned accurately, then we can infer the error rate
for

bases fro
m the corresponding quality
values.
Although Illumina is trying to make
the
quality value
better
represent error rate, there is still
some

distance, shown in

Fig. S2
. The
curves in
a
,
b
, and
c

sub
-
figures are all similar, indicating that the relationship between quality and error rate in pIRS simulated data is
quite similar to that of real sequencing data. In contrast, there is a great difference for ART, where the error rate is well

consistent

to that of
calculated from quality values with the phred
-
scale formula.


The command lines we used for each simulator are:

pirs

simulate
-
M
1

-
i ref22.fa
-
m 500
-
l 100
-
x 10
-
v 25


o
pirs10
_with_Qtrans

pirs

simulate
-
M
0

-
i ref22.fa
-
m 500
-
l 100
-
x 10
-
v

25


o
pirs10
_without_Qtrans

art_illumina
-
sp
-
i ref22.fa
-
p
-
l 75
-
f 10
-
m 500
-
s 20
-
ir 0
-
dr 0
-
ir2 0
-
dr2 0
-
o ./sim/art10

6




Fig. S
2


Relationship
between quality values and error rate
.


Note:
The error rate on Y
-
axis is log scaled.
The
reads
data in all sub
-
figures are
10 X coverage,
derived from human and
the
22
autosomes
in hg18
were

chosen
as the reference genome.

The
sub
-
figure
a

is drawn using the

Base
-
calling profile

derived from
62

G
recent
human re
-
sequencing data
, which has
eliminated

SNPs
.
The
simulated
reads data in
b

and
c

were produced

by pIRS
using Base
-
Calling profile

with
and

without Qtrans

derived from real sequencing data in
a
, whist the simulated reads data in
d

was produced by ART using its
built
-
in

75
-
bp profile.


5

Distribution of error rate

on reads

The error rate on reads is

not uniform,
with an increasing trend
towards

the 3
-
end terminal of reads, and t
he overall error rate on read 2 is a
little higher
than that of read 1, shown in Fig. S
3
.
The curves in both
b

and
c

are almost the same as that of
a
, indicating that the error rate
pattern in pIRS simulated reads is very similar with that of real sequencing data.
There is a difference of curve style in
d

compared
with the
others, which is caused by the different
underlying profile

used by ART
.

7



Fig.
S3

Distribution of error rate

along reads.

Note:

Each sub
-
figure use the same data as in Fig. S1.
Both the quality

derived error

rate using phred
-
scale formula and the mismatch

d
e-
fined error rate by statistics
are

shown.

T
he reads data in
a
,
b
,
c

are 100 bp pair
-
end, whist that in
d

is

75 bp pair
-
end.

The first half cycles
represents read 1, whist the last half cycles represent read 2.


6

Distribution

of
Quality

transition on reads

The quality transition
along reads
can be
plotted

as heat maps. We compare
d

real
sequencing
data, pIRS and ART
simulated data
as below

in Fig. S
4
. Since Wgsim only output dummy quality values (
character


2

), we did not
consider

it here.

The heat map of pIRS
simulated

reads with Qtrans is a
lmost the same as that of real sequencing data,
in contrast
, it
differentiate

largely for pIRS
simulated

reads without
Qtrans

and ART simulated reads
,
in which there are

no or weak relatio
n
ships between the leading and following quality

values
.

8



Fig.
S4


Heat map for the f
irst
-
order quality transition

matrix
.

Note: Each sub
-
figure use the same data as in Fig. S1
.
The leading and following

quality
-
pairs for all the cycles are
taken into statistics
.

7

Relationship between GC content and Coverage bias

Illumina data
has coverage bias
problem
on genomic regions with different

GC content
, and

t
he genomic regions with high
er

or low
er

GC
content
than average level
tend to be less sampled, show
n

in

Fig. S
5
.
The
coverage bias pattern of pIRS simulated reads wi
th GC%
-
depth
profile is quite similar with that of real sequencing data, compared to that of ART and wgsim which ha
ve

little GC
-
induced coverage bias.

I
n many real data sets,
w
e found that
the single
-
base depth distribution deviates from Poisson distribution more than our simulated data sets,
indicating
that
there are other important factors that
cause

coverage bias. It has also been reported that certain sequence
-
specific characters
can lea
d to higher substitution rate
,

so that reads can hardly be mapped back to the reference under ordinary mismatch thresholds
, which
may also contribute to the observable coverage bias.


The command
line

for wgsim simulator

is
:

wgsim
-
d 500
-
s 20
-
N 191182185

-
1 75
-
2 75
-
R 0
-
X 0 ref22.fa wgsim10_1.fq wgsim10_2.fq

9



Fig.
S
5


Relationship between GC content and coverage depth (Window_Size = 100 bp)

Note:
The
a
,
b
,
c

sub
-
figures use the same data as in Fig. S1
, whist
d

uses 10 X wgsim simulated data on the same

reference genome
. The
pIRS without Qtrans sub
-
figure is not shown here, because
Qtrans
does not affect GC coverage bias

and so it should be the same with th
at
of
pIRS with Qtrans.
The mean depth and depth box
-
plot
curves
use the left Y
-
axis, whist the max depth
curve
use
s

the right Y
-
axis.
For all
plots
,
the
mean depth values are raw values before LOESS
regression
.

8

Apply simulated reads on SNP calling

To test the performance
of simulated reads
on SNP calling,
w
e simulated
30

X

pair
-
end reads

using pIRS

for

E. coli

TY2482 (without
plasmids) and
Arabidopsis thaliana
,
respectively,
both
with 0.1% SNPs and 0.01% short In
D
el

variations
. T
he read length

used here

is 100
bp
and the insert size is 5
00±
25

(mean
±
sd)
.
The a
lignment is
performed

with BWA
(
Li and Durbin, 2010
)

and SNP calling is
performed by

SAMTOOLS
(
Li, et al., 2009
)
,
both

with default parameters.

The SNP calling results for both E.coli and Arabi
dopsis are

generally
good,
shown
in Table S4
,
indicating that our simulation
method is
compatible
with

SNP
calling

methods
.

As the assembly process is often
much

more complicate, and the final result is determined by many unclear factors, so we did not test the performance of
pIRS simula
ted read
s

on
de novo

assembly.

10


Table
S4
.

Performance of
SNP
calling
under different
data coverage


Species

E. coli

TY2482

Arabidopsis thaliana

Simulated heterozygosis SNP

5212

119089

Called SNP at 10x

TP

4371

83.86%


98355

82.59%

FP

2

0.0
5
%

93

0.0
9
%

FN

841

16.14%

20734

17.41%

Called SNP at 20x

TP

5001

95.95%

113684

95.46%

FP

0

0.00%

5

0.00%

FN

211

4.05%

5405

4.54%

Called SNP at 30x

TP

5088

97.62%

115808

97.24%

FP

1

0.02%

0

0.00%

FN

124

2.38%

3281

2.76%

Note: Each
called
SNP is defined as positive and the missed SNP is false
-
negative. The
number and ratio of
true
-
positive
(TP), false
-
positive (FP), false
-
negative (FN) are shown
respectively

for each data coverage
.

9

Running time and memory
comparison

of simulator

We

compare
d

t
he running efficiency of
pIRS
,

ART
(Q Version
1.3.1
)

,
wgsim

(
0.3.0
)

in
SAMTOOLS

and MAQ
(
0.7.1
)

by

simulat
ing

3

X

PE reads
with

75bp read length on
the
3 human
chromosome
s (

chr1, chr2, and chr3 in
hg19,
GRCh37. Feb. 2010
)
, respectively
. The
mean
insert size
used here
is 500bp with a
standard deviation

of 20.

The comparison results

are shown in

Table S5
. Since pIRS only parses
one
chromosome

at a time,
it

takes

much
less

memory

than

other simulators
. Moreover, pIRS also runs a little faster than th
e other simul
a-
tors.


T
he command lines
for each simulator
are:

pirs

simulate
-
M
1

-
i hg19
-
123.fa
-
l 75
-
x 3
-
m 500
-
v 20
-
c 0
-
o pIRS3
_with_Qtrans

art_illumina
-
i hg19
-
123.fa
-
p
-
l 75
-
f 3
-
m 500
-
s 20
-
ir 0
-
dr 0
-
ir2 0
-
dr2 0
-
o art3

wgsim hg19
-
123.fa
wgsim_1.fq wgsim_2.fq

d 500

s 20

N 13809447
-
1 75
-
2 75

R 0

X 0 >wgsim.lst

maq simulate
-
d 500
-
N 13809447
-
1 75
-
2 75
-
R 0 maq
-
1.fq maq
-
2.fq hg19
-
123.fa h1
.mkv

>
maq1.lst

Table
S5
.

Comparison of memory usage and running
speed


Max
Virtual Memory

(M)

Speed (Mbp/s)

pIRS

(1.0)

240

3.435

ART (Q
1.3.1
)

1
,
882


4.078

wgsim

(
0.3.0
)

1
,
288


4.029

MAQ
(
0.7.1
)

1
,
701

7.18
9

Note:
the
time efficiency

for all simulators
is

evaluated by the speed of producing simulated bases, and
calibrated
as the

number of pr
o-
duced

bases

(Mbp)

per second.

10

Extension
potential
of pIRS to other sequencing platforms

Although pIRS is originally designed for Illumina platform,
it has
the
potential to be
extended to other platforms that generate bases with
substitution and InDel
errors
. If we apply the current pIRS software on other sequencing platforms, as the InDel profile integrated in pIRS
is
very
simple, it will generate relatively good InDel results for
non
-
sequence
-
dependent

platforms such as
Helicos
, PacBio, and Oxford
nan
opore, but not for
the
strong sequence
-
structure
-
specific platforms such as 454 and Ion Torrent. For these two platforms, more compl
i-
cated InDel profiles should be designed to fully capture the InDel characteristics,
especially

the relationship with polyme
rs.
Although some
platforms may be more similar in sequencing mechanisim,
each platform tends to have
some

special characteristics

of its own, so

it is very
difficult to develop a universal simulator that is suitable for all types of platforms
, such as ART
. Instead, the
individual
simulator
s

each

focus only on a specific sequencing platform, will be more likely to generate simulated data that is more like the real data.

In future, we
plan to develop individual simulators for
other

dominant sequencing platfo
rms, such as Ion Torrent and Oxford nanopore.


11


References

Box, G.E.P. and Muller, M.E. (1958) A note on the generation of random normal deviates,
The Annals of Mathematical Statistics
,
29
, 610
-
611.

Cleveland, W.S. (1979) Robust Locally Weighted Regression and Smoothing Scatterplots,
Journal of the American Statistical Association
,
74
, 829
-
836.

Consortium, T.G.P. (2010) A map of human genome variation from population
-
scale sequencing,
Nature
,
467
, 106
1
-
1073.

Li, H. and Durbin, R. (2010) Fast and accurate long
-
read alignment with Burrows

Wheeler transform,
Bioinformatics
,
26
, 589
-
595.

Li, H.
, et al.

(2009) The Sequence Alignment/Map format and SAMtools,
Bioinformatics
,
25
, 2078
-
2079.

Li, R.
, et al.

(200
9) SNP detection for massively parallel whole
-
genome resequencing,
Genome Res
,
19
, 1124
-
1132.

Li, R.
, et al.

(2009) SOAP2: an improved ultrafast tool for short read alignment,
Bioinformatics
,
25
, 1966
-
1967.

Li, R.
,
et al.

(2010)

The sequence and de novo as
sembly of the giant panda genome,
Nature
,
463
, 311
-
317.

Sherry, S.T.
, et al.

(2001) dbSNP: the NCBI database of genetic variation,
Nucleic Acids Research
,
29
, 308
-
311.

Whiteford, N.
, et al.

(2009) Swift: primary data analysis for the Illumina Solexa sequencing platform,
Bioinformatics
,
25
, 2194
-
2199.