Genome-Wide SNP Discovery from de novo Assemblies of Pepper (Capsicum annuum) Transcriptomes
Hamid Ashrafi¹, Jiqiang Yao², Kevin Stoffel¹, Sebastian R. Chin-Wo³, Theresa Hill¹, Alexander Kozik³ and Allen Van Deynze¹
¹ Department of Plant Sciences, Seed Biotechnology Center, University of California, Davis, CA 95616
² Interdisciplinary Center for Biotechnology Research (ICBR), University of Florida, Gainesville, FL 32610
³ Genome Center, University of California, Davis, CA 95616
Objectives
To obtain as many transcribed genes as possible by sampling peppers from different cultivars and tissues at multiple stages of growth and development.
To discover putative SNPs among the three sampled pepper cultivars by sequencing their transcriptomes on the Illumina Genome Analyzer.
To annotate the transcriptome sequence in order to gain insight into pepper biological processes.
To use the annotated genes for QTL analysis and candidate gene discovery.
Materials and Methods
Plant Materials and cDNA Library Preparation
Seeds of three pepper (C. annuum) lines, ‘CM334’, ‘Maor’ and ‘Early Jalapeño’, were planted. Three cDNA libraries (one from each pepper variety) were prepared from pooled RNA extracted from four tissues (root, young leaf, flower and fruit) using the Qiagen RNeasy Mini Kit (Qiagen, Valencia, CA, USA). Fruit tissues were collected at different developmental stages: developing fruit at 5, 10 and 20 days post-pollination, breaker fruit and ripe fruit. The libraries were constructed by shearing the cDNAs and selecting 300-350 bp fragments on gels. The libraries were normalized using a double-stranded nuclease protocol. The cDNA libraries were sequenced on an Illumina Genome Analyzer IIx (GAIIx) (Illumina Inc., San Diego, CA) for 80-120 cycles at the UC Davis Genome Center core facility.
De Novo Assembly of NGS Sequences
The NGS data (GAIIx) went through our standard preprocessing pipeline, developed at UC Davis (Kozik, 2010). The Velvet (Zerbino and Birney, 2008) and CLC (CLCBIO, 2010) software packages were used to assemble the sequences. CAP3 was used to merge the three assemblies into the final assembly.
[Fig. 1 flowchart. For each cultivar (CM334, Early Jalapeño, Maor), two sets of trimmed reads (min 40 nt to max 85 nt; min 25 nt to max 60 nt) were assembled with the Velvet assembler at k-mers 31, 35 and 41 and with the CLC assembler (one iteration of CLC assembly with all reads). All k-mer assemblies were assembled with CAP3 into one assembly per cultivar, and the three cultivar assemblies were combined with CAP3 into the pepper final assembly (reference sequence).]
Results
Assembly Statistics
Assembly                               No. of Contigs   Total nt       N50
CM334                                  83,113           84,792,180     1,488
Early Jalapeño                         82,614           84,973,865     1,488
Maor                                   76,375           79,383,673     1,526
Pepper assembly (CM334, EJ and Maor)   123,261          135,019,787    1,647
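The N50 values in the table summarize contig length: N50 is the length of the contig at which, going from longest to shortest, the running total first reaches half of all assembled bases. A minimal Python sketch of that calculation follows; the function name `n50` and the toy contig lengths are illustrative assumptions, not the poster's data or scripts.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in nt)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths, for illustration only.
toy_lengths = [300, 500, 900, 1500, 2000, 4000]
print(n50(toy_lengths))  # 4000 + 2000 = 6000 >= 4600, so N50 is 2000
```

Because the longest contigs are counted first, N50 is less sensitive to many short contigs than the mean contig length is.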
Annotation
A total of 63,202 contigs (51.3%), with an average length of 1,495 nucleotides, had at least one hit in the GenBank non-redundant database. Contigs with a hit covered 94.5 M bases (70%) of the total assembly. A total of 60,055 contigs (48.7%) had no hit in GenBank; these were on average 674 nucleotides long and covered 40.5 M bases (30%) of the total assembly. Based on all BLASTX results, Vitis vinifera, Arabidopsis thaliana and Oryza sativa were the top three species in the BLAST hits (Fig. 3). The mapping step of Blast2GO identified 37,918 contigs (30.7%) with Gene Ontology (GO) terms. Biological Processes (BP) at different GO levels were generated. Fig. 4 shows the BP at level 2. The number of annotated sequences for each BP is shown in Fig. 5. KEGG maps for 150 biological pathways were generated and the contigs within each pathway were determined. For instance, Fig. 6 depicts the KEGG map of the pyrimidine metabolism pathway.
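The contig counts, average lengths and base totals reported for annotated and unannotated contigs can be cross-checked against each other. The sketch below multiplies each count by its average length and compares the result with the total assembly size from the Assembly Statistics table; the variable names are illustrative assumptions, and all numbers are taken from the poster.

```python
total_contigs = 123_261
total_bases = 135_019_787

hit_contigs, hit_mean_len = 63_202, 1_495      # contigs with a GenBank hit
nohit_contigs, nohit_mean_len = 60_055, 674    # contigs without a hit

# Bases covered by each group, estimated as count x average length.
hit_bases = hit_contigs * hit_mean_len
nohit_bases = nohit_contigs * nohit_mean_len

for label, contigs, bases in [
    ("with hit", hit_contigs, hit_bases),
    ("no hit", nohit_contigs, nohit_bases),
]:
    print(
        f"{label}: {contigs / total_contigs:.1%} of contigs, "
        f"{bases / 1e6:.1f} M bases, {bases / total_bases:.0%} of assembly"
    )
```

The estimates come out at roughly 94.5 M and 40.5 M bases, matching the 70% and 30% shares given in the text.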
SNP Discovery
A total of 22,863 putative SNPs within 11,869 contigs were identified by our SNP discovery pipeline. The contigs with putative SNPs comprised 23,794 kb (17.6%) of the pepper transcriptome assembly. On average, one SNP was identified per 1,040 bp of exonic regions of the pepper genome.
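The reported SNP density can be reproduced from the figures above, assuming (an interpretation, not stated on the poster) that density was computed as the total length of SNP-containing contigs divided by the number of SNPs. A short Python sketch with illustrative variable names:

```python
snp_count = 22_863
snp_contig_bases = 23_794 * 1_000      # 23,794 kb of contigs carrying SNPs
total_assembly_bases = 135_019_787     # from the Assembly Statistics table

# Average spacing between SNPs within the SNP-containing contigs.
bp_per_snp = snp_contig_bases / snp_count

# Share of the whole assembly made up of SNP-containing contigs.
assembly_fraction = snp_contig_bases / total_assembly_bases

print(f"about one SNP per {bp_per_snp:.0f} bp")
print(f"{assembly_fraction:.1%} of the assembly")
```

Under that assumption the spacing comes out just under 1,041 bp, consistent with the roughly 1,040 bp reported.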
Conclusions
Combining the transcriptomes of three pepper cultivars increased the total assembled bases by 50%.
The present pepper transcriptome assembly represents ~4% of the pepper genome (3,500 Mb).
We demonstrated that, for plants whose genome sequences are not yet available, transcriptome assembly is an alternative approach for SNP calling.
Annotation of 51% of the contigs, covering 70% of the total assembled bases, indicates that the remaining ~49% of contigs are small contigs that account for the 30% of bases left unannotated.
References
Conesa, A., S. Götz, et al. (2005). "Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research." Bioinformatics 21(18): 3674-3676.
Kozik, A. (2010). Tool to process and manipulate Illumina sequences. http://code.google.com/p/atgc-illumina/downloads/list
Li, H. and R. Durbin (2009). "Fast and accurate short read alignment with Burrows-Wheeler transform." Bioinformatics 25(14): 1754-1760.
Li, H., B. Handsaker, et al. (2009). "The Sequence Alignment/Map format and SAMtools." Bioinformatics 25(16): 2078-2079.
Zerbino, D. and E. Birney (2008). "Velvet: algorithms for de novo short read assembly using de Bruijn graphs." Genome Res 18: 821-829.
Background and Significance
Molecular breeding of pepper (Capsicum spp.) has been hampered by the paucity of molecular markers. This is primarily due to the lack of a pepper genome sequence and the limited sequence resources available. In recent years, with more cost-effective sequencing technologies such as Illumina, sequencing of expressed genes (transcriptomes), gene discovery and allele mining are no longer insurmountable.
To exploit the speed and scale of data from new sequencing technologies and to enrich the sequence resources of pepper, we sequenced the transcriptomes (RNA-seq) of three pepper lines: Maor, Early Jalapeño (EJ) and Criollo de Morelos 334 (CM334). We selected a wide range of tissues to represent as many expressed genes as possible. The reference sequence was constructed from >200 million Illumina reads (80-120 nt) using a combination of the Velvet, CLC and CAP3 software packages. BWA (Li and Durbin, 2009), SAMtools (Li et al., 2009) and in-house Perl scripts were used to identify SNPs among the three pepper lines. The SNPs were filtered to be 100 bp away from any putative intron-exon junction and from adjacent SNPs. After filtering, >22,000 high-quality putative SNPs were identified and bioinformatically mapped to pepper genetic maps. The reference sequence was annotated with the Blast2GO software (Conesa et al., 2005).
Acknowledgments
The authors would like to thank Enza Zaden, Nunhems, Rijk Zwaan, Syngenta, Vilmorin and the UC Discovery program for financial support. We also thank the sequencing facility of the UC Davis Genome Center and the Bioinformatics core facility for providing the servers and computational power. The annotation would not have been possible without collaboration with Dr. R. Michelmore’s laboratory.
SNP Discovery Pipeline
BWA was used to map the reads of each of the three genotypes individually to the pepper final transcriptome assembly.
SAMtools was used to make the pileups for each cultivar and to find the differences between each cultivar and the reference sequence.
Indels were screened out of the pileup files.
Intron-exon junction positions were inferred in the reference sequence based on Arabidopsis gene models, using the intron finder on the Solanaceae Genome Network (SGN) website.
In-house Perl scripts were used to create an allele call table for all three genotypes; the SNPs were then filtered against adjacent SNPs and the identified intronic regions.
Sequences surrounding the SNPs (100 bases on each side) were extracted from the reference sequence to design assays.
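The last two pipeline steps (distance filtering and flank extraction) can be sketched in Python. This is not the in-house Perl code, which the poster does not show. It assumes that positions are 1-based coordinates on a single contig, that "100 bp" means a distance of at least 100, and that both members of a too-close SNP pair are dropped. The names `far_from`, `filter_snps` and `flanking_sequence` and the toy data are hypothetical.

```python
from bisect import bisect_left

MIN_DISTANCE = 100  # bp to the nearest junction or neighbouring SNP (from the text)
FLANK = 100         # bases extracted on each side of a SNP (from the text)


def far_from(position, sorted_sites, min_distance=MIN_DISTANCE):
    """True if no site in sorted_sites lies closer than min_distance to position."""
    i = bisect_left(sorted_sites, position - min_distance + 1)
    return i == len(sorted_sites) or sorted_sites[i] >= position + min_distance


def filter_snps(snp_positions, junction_positions):
    """Keep SNPs that are far from other SNPs and from inferred junctions."""
    snps = sorted(set(snp_positions))
    junctions = sorted(junction_positions)
    kept = []
    for idx, pos in enumerate(snps):
        prev_ok = idx == 0 or pos - snps[idx - 1] >= MIN_DISTANCE
        next_ok = idx == len(snps) - 1 or snps[idx + 1] - pos >= MIN_DISTANCE
        if prev_ok and next_ok and far_from(pos, junctions):
            kept.append(pos)
    return kept


def flanking_sequence(contig_seq, pos, flank=FLANK):
    """Return (left, base, right) around a 1-based SNP position.

    Returns None when the contig is too short to supply a full flank
    on either side.
    """
    i = pos - 1
    if i - flank < 0 or i + flank >= len(contig_seq):
        return None
    return contig_seq[i - flank:i], contig_seq[i], contig_seq[i + 1:i + 1 + flank]


# Toy contig and positions, for illustration only.
contig = "ACGT" * 250
kept = filter_snps([150, 250, 300, 420, 700], junction_positions=[360])
print(kept)  # 250 and 300 are too close to each other; 420 is near the junction
for pos in kept:
    left, base, right = flanking_sequence(contig, pos)
    print(pos, base, len(left), len(right))
```

Dropping SNPs that lack a full flank, or that sit near a junction, reflects the stated purpose of the flanks: assays are designed against the extracted sequence.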
Annotation of Reference Sequence
The Blast2GO program was used to annotate the reference sequence, obtain the statistics and generate KEGG maps (http://www.genome.jp/kegg/pathway.html).
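Fig. 1 De novo assembly of pepper transcriptomes
Fig. 2 Distribution of contig length in the pepper transcriptome assembly (N50 = 1,647; mean = 1,095; max = 19,089; min = 265)
Fig. 3 Top species in the BLAST hits
Fig. 4 Biological Processes (BP) at GO level 2
Fig. 5 Number of annotated sequences for each BP
Fig. 6 KEGG map of pyrimidine metabolism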