Genome-Wide SNP Discovery from de novo Assemblies of Pepper (Capsicum annuum) Transcriptomes

4 Oct 2013

Hamid Ashrafi¹, Jiqiang Yao², Kevin Stoffel¹, Sebastian R. Chin-Wo³, Theresa Hill¹, Alexander Kozik³ and Allen Van Deynze¹


¹ Department of Plant Sciences, Seed Biotechnology Center, University of California, Davis, CA 95616
² Interdisciplinary Center for Biotechnology Research (ICBR), University of Florida, Gainesville, FL 32610
³ Genome Center, University of California, Davis, CA 95616

Objectives

To obtain as many transcribed genes as possible by sampling peppers from different cultivars and tissues at multiple stages of growth and development.

To discover putative SNPs among three sampled pepper cultivars by sequencing their transcriptomes on the Illumina Genome Analyzer.

To annotate the transcriptome sequence in order to gain insight into pepper biological processes.

To use annotated genes for QTL analysis and candidate gene discovery.

Materials and Methods

Plant Materials and cDNA Library Preparation

Seeds of three pepper (C. annuum) lines, ‘CM334’, ‘Maor’ and ‘Early Jalapeño’, were planted.

Three cDNA libraries (one from each pepper variety) were prepared from pooled RNA extracted from four tissues (root, young leaf, flower and fruit) using the Qiagen RNeasy Mini Kit (Qiagen, Valencia, CA, USA). Fruit tissues were collected at different developmental stages: developing fruit at 5, 10 and 20 days post pollination, breaker fruit and ripe fruit.

The libraries were constructed by shearing cDNAs, and 300-350 bp fragments were selected on gels.

The libraries were normalized using a double-stranded nuclease protocol.

The cDNA libraries were sequenced on an Illumina Genome Analyzer IIx (GAIIx) (Illumina Inc., San Diego, CA) for 80-120 cycles at the UC Davis Genome Center core facility.


De Novo Assembly of NGS Sequences

The NGS data (GAIIx) went through our standard preprocessing pipeline, developed at UC Davis (Kozik, A, 2010).

Velvet (Zerbino and Birney, 2008) and CLC (CLCBIO, 2010) software packages were used to assemble the sequences.

CAP3 was used to merge the three cultivar assemblies into the final assembly.














Assembly workflow (diagram)

For each cultivar (CM334, Early Jalapeño, Maor):
  Trimmed reads: two sets (min 40 nt, max 85 nt; min 25 nt, max 60 nt)
  Velvet assembler: k-mers 31, 35 and 41; all k-mer assemblies assembled with CAP3
  CLC assembler: one iteration of CLC assembly with all reads
  Velvet and CLC assemblies: cultivar assembly made with CAP3

CM334 assembly + Early Jalapeño assembly + Maor assembly: pepper final assembly made with CAP3 (reference sequence)
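The Velvet branch of the workflow (one assembly per k-mer, then a CAP3 merge) can be sketched as a command plan. This is a minimal illustration, not the authors' pipeline: the directory layout, file names and the function `velvet_cap3_plan` are hypothetical, only basic `velveth`, `velvetg` and `cap3` invocations are used, and the CLC step is left out because its command line is not described on the poster. The function only builds the commands and does not run them.

```python
from pathlib import PurePosixPath

KMERS = (31, 35, 41)  # k-mer values shown in the workflow diagram


def velvet_cap3_plan(sample, reads_fastq, kmers=KMERS, workdir="assemblies"):
    """Return the commands (as argv lists) for one cultivar; nothing is executed."""
    base = PurePosixPath(workdir) / sample  # hypothetical layout
    commands = []
    contig_files = []
    for k in kmers:
        out_dir = base / f"velvet_k{k}"
        # velveth hashes the reads at this k-mer size; velvetg builds the contigs.
        commands.append(["velveth", str(out_dir), str(k), "-fastq", "-short", reads_fastq])
        commands.append(["velvetg", str(out_dir)])
        # Velvet writes its contigs to contigs.fa inside the output directory.
        contig_files.append(str(out_dir / "contigs.fa"))
    merged = str(base / "all_kmers.fa")
    # Pool the contigs from every k-mer run so CAP3 can merge overlapping ones.
    commands.append(["sh", "-c", "cat " + " ".join(contig_files) + " > " + merged])
    commands.append(["cap3", merged])
    return commands


for argv in velvet_cap3_plan("CM334", "cm334_trimmed.fastq"):
    print(" ".join(argv))
```

Each returned list could be passed to `subprocess.run` once the tools are installed; the same plan would be repeated for each trimmed read set and each cultivar.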

Results

Assembly Statistics

Assembly                               No. of Contigs      Total nt     N50
CM334                                          83,113    84,792,180   1,488
Early Jalapeño                                 82,614    84,973,865   1,488
Maor                                           76,375    79,383,673   1,526
Pepper assembly (CM334, EJ and Maor)          123,261   135,019,787   1,647
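For readers unfamiliar with the N50 column, the statistic can be computed as sketched below: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of all assembled bases. The function name `n50` and the toy lengths are illustrative and are not taken from the poster's data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running total to half the assembly is the N50.
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


toy_lengths = [100, 200, 300, 400, 500]  # toy data: 1,500 bases in total
print(n50(toy_lengths))  # 500 + 400 = 900 >= 750, so this prints 400
```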



















Annotation

A total of 63,202 contigs (51.3%), with an average length of 1,495 nucleotides, had at least one hit in the non-redundant database of GenBank. Contigs with a hit covered 94.5 M bases (70%) of the total assembly.

A total of 60,055 contigs (48.7%) did not have any hit in GenBank; they were on average 674 nucleotides long and covered 40.5 M bases (30%) of the total assembly.

Based on all BLASTX results, Vitis vinifera, Arabidopsis thaliana and Oryza sativa were the top three species in the BLAST hits (Fig. 3).

The mapping step of Blast2GO identified 37,918 contigs (30.7%) with Gene Ontology (GO) terms.

Biological Processes (BP) at different GO levels were generated. Fig. 4 shows the BP at level 2. The number of annotated sequences for each BP is shown in Fig. 5.

KEGG maps for 150 biological pathways were generated, and the contigs within each pathway were determined. For instance, Fig. 6 depicts the KEGG map of the Pyrimidine Metabolism pathway.
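The hit and no-hit figures above are counts, mean lengths and base totals for two groups of contigs. A minimal sketch of that summary follows, assuming a hypothetical mapping of contig name to length and a set of contig names with at least one BLAST hit; `hit_report` and the toy contigs are illustrative only.

```python
def hit_report(contig_lengths, contigs_with_hit):
    """Summarize contigs with and without a database hit."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    total_contigs = len(contig_lengths)
    total_bases = sum(contig_lengths.values())
    report = {}
    for label, wanted in (("hit", True), ("no_hit", False)):
        lengths = [n for name, n in contig_lengths.items()
                   if (name in contigs_with_hit) == wanted]
        bases = sum(lengths)
        report[label] = {
            "contigs": len(lengths),
            "pct_contigs": 100 * len(lengths) / total_contigs,
            "mean_length": bases / len(lengths) if lengths else 0.0,
            "bases": bases,
            "pct_bases": 100 * bases / total_bases,
        }
    return report


toy = {"c1": 1500, "c2": 1500, "c3": 500, "c4": 500}  # toy data
print(hit_report(toy, {"c1", "c2"}))

# Consistency check on the reported figures: count x mean length gives the bases covered.
print(63_202 * 1_495)  # about 94.5 million bases for contigs with a hit
print(60_055 * 674)    # about 40.5 million bases for contigs without a hit
```

The check shows why about half of the contigs can account for about 70% of the bases: contigs with a hit are on average more than twice as long as those without.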


SNP Discovery

A total of 22,863 putative SNPs within 11,869 contigs were identified by our SNP discovery pipeline.

The contigs with identified putative SNPs comprised 23,794 kb (17.6%) of the pepper transcriptome assembly.

On average, 1 SNP per 1,040 bp of exonic regions of the pepper genome was identified.
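The density and coverage figures can be reproduced with simple arithmetic. The sketch below assumes that the density is the total length of the SNP-containing contigs divided by the number of SNPs, which appears consistent with the reported numbers; the poster does not state the exact calculation. The helper names are illustrative.

```python
def bp_per_snp(total_bases, n_snps):
    """Average number of bases per SNP."""
    return total_bases / n_snps


def percent_of(part, whole):
    """Share of `whole` made up by `part`, in percent."""
    return 100 * part / whole


snps = 22_863
snp_contig_bases = 23_794 * 1_000  # 23,794 kb of SNP-containing contigs
assembly_bases = 135_019_787       # total nt of the pepper assembly

print(f"{bp_per_snp(snp_contig_bases, snps):.1f} bp per SNP")               # about 1,040 bp
print(f"{percent_of(snp_contig_bases, assembly_bases):.1f}% of assembly")   # 17.6%
```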


Conclusions

Assembly of the transcriptomes of three pepper cultivars increased the total assembled bases by 50%.

The present pepper transcriptome assembly represents ~4% of the pepper genome (3,500 Mb).

We demonstrated that, for plants whose genome sequences are not yet available, transcriptome assembly is an alternative approach to SNP calling.

Annotation of 51% of contigs, or 70% of total assembled bases, indicates that the remaining ~49% of contigs are small contigs covering the unannotated 30% of the assembled sequence.







References

Conesa, A., S. Götz, et al. (2005). "Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research." Bioinformatics 21(18): 3674-3676.

Kozik, A. (2010). Tool to process and manipulate Illumina sequences. http://code.google.com/p/atgc-illumina/downloads/list

Li, H. and R. Durbin (2009). "Fast and accurate short read alignment with Burrows-Wheeler transform." Bioinformatics 25(14): 1754-1760.

Li, H., B. Handsaker, et al. (2009). "The Sequence Alignment/Map format and SAMtools." Bioinformatics 25(16): 2078-2079.

Zerbino, D. and E. Birney (2008). "Velvet: algorithms for de novo short read assembly using de Bruijn graphs." Genome Res 18: 821-829.





Background and Significance

Molecular breeding of pepper (Capsicum spp.) has been hampered by the paucity of molecular markers.

This is primarily due to the lack of a pepper genome sequence and the limited sequence resources available.

In recent years, with more cost-effective sequencing technologies such as Illumina, sequencing of expressed genes (transcriptomes), gene discovery and allele mining are no longer insurmountable.

To exploit the speed and scale of data from new sequencing technologies, and in an effort to enrich the sequence resources of pepper, we sequenced the transcriptomes (RNA-seq) of three pepper lines: Maor, Early Jalapeño (EJ) and Criollo de Morelos 334 (CM334).

We selected a wide range of tissues to represent as many expressed genes as possible.

The reference sequence was constructed from >200 million Illumina reads (80-120 nt) using a combination of the Velvet, CLC and CAP3 software packages.

BWA (Li and Durbin, 2009), SAMtools (Li et al, 2009b) and in-house Perl scripts were used to identify SNPs among the three pepper lines.

The SNPs were filtered to be 100 bp apart from any putative intron-exon junctions as well as from adjacent SNPs. After filtering, >22,000 high-quality putative SNPs were identified and bioinformatically mapped to pepper genetic maps.

The reference sequence was annotated with Blast2GO software (Conesa et al, 2005).

Acknowledgments

The authors would like to thank Enza Zaden, Nunhems, Rijk Zwaan, Syngenta, Vilmorin and the UC Discovery program for financial support. We would also like to thank the sequencing facility of the UC Davis Genome Center and the Bioinformatics core facility for providing the servers and computational power. The annotation would not have been possible without collaboration with Dr. R. Michelmore’s laboratory.





SNP Discovery Pipeline

BWA was used to map all the reads of the three genotypes individually to the pepper final transcriptome assembly.

SAMtools was used to make pileups for each cultivar and to discover the differences between each cultivar and the reference sequence.

Indels were screened out of the pileup files.

Intron-exon junction positions were inferred in the reference sequence based on Arabidopsis gene models, using the intron finder of the Solanaceae Genome Network website (SGN).

In-house Perl scripts were used to create an allele call table of all three genotypes; the SNPs were then filtered against adjacent SNPs and the identified intronic regions.

Sequences surrounding the SNPs (100 bases on each side) were extracted from the reference sequence to design assays.
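The last two steps (distance filtering and flank extraction) can be sketched as follows. The authors used in-house Perl scripts that are not shown on the poster, so this Python version is only an illustration under stated assumptions: positions are 1-based coordinates on a single contig, a SNP is dropped if it lies closer than 100 bases to an inferred junction or to any other candidate SNP, and a SNP without a full 100-base flank on both sides yields no assay sequence. The function names are hypothetical.

```python
def filter_snps(snp_positions, junction_positions, min_distance=100):
    """Keep SNPs at least `min_distance` bases from junctions and from other SNPs."""
    snps = sorted(snp_positions)
    kept = []
    for i, pos in enumerate(snps):
        near_junction = any(abs(pos - j) < min_distance for j in junction_positions)
        near_prev = i > 0 and pos - snps[i - 1] < min_distance
        near_next = i + 1 < len(snps) and snps[i + 1] - pos < min_distance
        # Assumption: both members of a close SNP pair are discarded.
        if not (near_junction or near_prev or near_next):
            kept.append(pos)
    return kept


def flanking_sequence(contig_seq, pos, flank=100):
    """Return (left, snp_base, right) around a 1-based position, or None near contig ends."""
    idx = pos - 1  # convert to a 0-based string index
    if idx - flank < 0 or idx + flank >= len(contig_seq):
        return None
    return (contig_seq[idx - flank:idx],
            contig_seq[idx],
            contig_seq[idx + 1:idx + 1 + flank])


# Toy example: junctions at 200 and 1000, five candidate SNPs on one contig.
print(filter_snps([150, 400, 450, 650, 900], [200, 1000]))  # [650, 900]
```

Whether "100 bp apart" means a strict or an inclusive threshold is not stated in the source; the sketch treats a distance of exactly 100 as acceptable.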

Annotation of Reference Sequence

The Blast2GO program was used to annotate the reference sequence, obtain the statistics and generate KEGG maps (http://www.genome.jp/kegg/pathway.html).





Fig. 1 De novo assembly of pepper transcriptomes

Fig. 2 Distribution of contig length in pepper transcriptome assembly (N50 = 1,647; mean = 1,095; max = 19,089; min = 265)

Fig. 3

Fig. 4

Fig. 5

Fig. 6 KEGG map of Pyrimidine Metabolism