

Biodiversity Assessment of Insect from Environmental Samples Using qPCR and Next-Generation Parallelized Sequencing of DNA Barcodes

by


Saina Taidi







A Thesis

Presented to

The
University of Guelph






In partial fulfillment of requirements

for the degree of

Master of Science

in

Integrative Biology









Guelph, Ontario, Canada

© Saina Taidi, August, 2012


ABSTRACT







Biodiversity Assessment of Insect from Environmental Samples Using qPCR and Next-Generation Parallelized Sequencing of DNA Barcodes



Saina Taidi







University of Guelph, 2012

Advisor: Mehrdad Hajibabaei


This thesis employs three bioindicator species of mayfly (Insecta: Ephemeroptera) and three of caddisfly (Insecta: Trichoptera) as models to develop a reliable biodiversity and biomonitoring assessment approach using quantitative PCR (qPCR) and next-generation sequencing (NGS) technology. Quantitative PCR was employed to assess the efficiency of species-specific PCR primers in amplifying their target species versus other taxa from closely or distantly related taxonomic groups from benthic habitats. Results showed that qPCR can be used as a practical test for evaluating PCR primers for amplifying specific taxa in mixed environmental samples, although it might be influenced by amplification bias. Target-specific primers are an alternative to presumably universal primers. Each primer set can be tested and optimized using qPCR prior to use in next-generation sequencing. The qPCR results corroborated the 454 pyrosequencing data, which indicates that qPCR can be used in the experimental design of NGS-based biomonitoring and is a useful tool for selecting primers for NGS amplicon preparation.




ACKNOWLEDGMENTS


First and foremost I offer my sincerest gratitude to my supervisor, Dr. Mehrdad Hajibabaei, who has supported me throughout my thesis with his patience and knowledge. I attribute the achievement of my Master’s degree to his encouragement and assistance. Without his advice, this thesis would not have been written or completed. One simply could not wish for a better or more friendly supervisor.

I would like to especially thank Dr. Teresa Crease for all her support both as my committee member and as Graduate Coordinator. I will never forget her kind support; nobody could wish for a better professor.

Many thanks to my co-adviser, Dr. Paul Hebert, who supported me with his constructive opinions; it was a great honour for me to have the opportunity to have his guidance through my project. I also had an unforgettable time in Dr. Donald Baird’s lab. Many thanks also to Dr. Baird and his team, especially Kristie Heard, who supervised my very first experience in sample collection and species identification.

Heartfelt thanks to my dear friends and lab mates Claudia, Jennifer, Steve, Connor, Joel, Ian and Stephanie, who blessed my everyday work with brainstorming and good times together. Special thanks to Shannon, who was always there to support me in more ways than anyone could expect. Many thanks to Dr. Shady Shokarallah, who helped me throughout this journey not only with his scientific knowledge, but also with his great attitude and encouragement to keep on going. I definitely would not be here without his great support.

I am deeply lucky to have friends who helped me maintain the courage to write and to move forward. I would like to thank all of them, near or far, for their support and encouragement.

My special thanks to Ahmed Al-Wattar, Margaret Hundleby and Shawn Kehoe, who read over my thesis and provided great comments, explaining their concerns and paying careful attention.

Thanks to Xin Zhou and Terri Porter, who helped me in learning the strategies for classic taxonomic identification and bioinformatics analysis of my data.

Susan Mannhardt, with her extra busy schedule, was always there to answer all questions and support me in all aspects of the administrative process. I would also like to thank Mary-Ann Davis, Karen White, Lori Ferguson and all the IB department staff.

I would like to thank all of my colleagues and friends at the Biodiversity Institute of Ontario, especially Natalia Ivanova, for aid with laboratory protocols and workshops on sequence editing.

Finally, I would like to especially thank my mother and father for all the love and support they gave me in my life, and also my sisters and brother for their love and support.












Table of Contents


LIST OF TABLES
LIST OF FIGURES
LIST OF APPENDICES
INTRODUCTION
The challenges of biodiversity analysis
Biodiversity and biomonitoring
DNA information and biodiversity analysis
DNA barcoding: standardized molecular biodiversity analysis
Next-Generation sequencing for biomonitoring
Quantitative PCR
Why qPCR?
Objectives
MATERIAL AND METHODS
Target species selection and specimen collection
DNA extraction
Primer design and optimization
Sanger sequencing validation of amplicons
Quantitative PCR
1. Template selection and normalization
2. Experimental design
3. Reaction conditions for qPCR experiments
4. Data analysis
454 pyrosequencing
1. Experimental design
2. Multiplexing amplicons
3. Amplicon preparation
4. 454 Pyrosequencing amplicon library preparation
5. 454 data analysis framework
Automated sequence filtering
Manual sequence analysis
RESULTS
Quantitative PCR Results
Relative Amplified Copies (RAC) Analysis
Quantitative and qualitative analysis of pyrosequencing reads
DISCUSSION
Primer behaviour in multi-template PCR
Quantitative PCR as a tool for target identification
Optimal NGS analysis of target genes and taxa
Comparing qPCR and 454 results
Towards a standardized approach for metagenomics analysis of environmental DNA
REFERENCES
TABLES
Table 1: Species-specific oligonucleotide primers
Table 2: gDNA extracts concentration from target species
Table 3: 454 Pyrosequencing tagged primers
Table 4: CT Values of target species set A (Trichoptera)
Table 5: CT Values of target species set A (Ephemeroptera)
Table 6: CT Values of target species set B (Trichoptera)
Table 7: CT Values of target species set B (Ephemeroptera)
Table 8: Summary results from qPCR & 454 pyrosequencing analysis
Table 9: Slope and efficiency rates for primer set A and B amplicon-based material
Table 10: Read numbers for gDNA-based material, automated analysis
Table 11: Read numbers for amplicon-based material, automated analysis
Table 12: Read numbers for gDNA-based material, manual analysis
Table 13: Read numbers for amplicon-based material, manual analysis
FIGURES
Figure 1: Amplification plot sample
Figure 2: The workflow used in qPCR experiments
Figure 3: 454 pyrosequencing experimental workflow
Figure 4: Exemplar standard curves for qPCR experiments (gDNA based)
Figure 5: Exemplar standard curves for qPCR experiments (amplicon based)
Figure 6: Exemplar Relative Amplified Copies (RAC)
Figure 7: MID distribution for gDNA based material
Figure 8: MID distribution for amplicon based material
APPENDIX 1: Standard curves for target samples, gDNA based
APPENDIX 2: Standard curves for target samples, amplicon based
APPENDIX 3: 454 pyrosequencing analysis results







INTRODUCTION


Biodiversity is the diversity of genes, species and ecosystems, or the variety of all living organisms, and can be defined at many different levels, from allelic diversity and heterozygosity to the variation of population distribution in a region (Lovejoy, 1997). Today, the concept of biodiversity within conservation biology focuses not only on species diversity or endangered species but also on practical applications such as water quality analysis, conservation biology or measuring the health of biological resources.

Biodiversity and its impact on other fields of the biological sciences have long been a subject of fascination for scientists around the world. Modern biodiversity analysis started with the work of Linnaeus almost 250 years ago, and yet even today only a small fraction of the world’s species are known to humanity. The greatest diversity exists among insects, which account for more than one million of the planet's named animal species. From the canopy of the tropical rain forests to the ocean floor, it is estimated that millions of undescribed insect species and other organisms exist (Mora et al. 2011).

Altogether, the earth's oceans and continents support close to 50,000-55,000 species of vertebrate animals and 300,000-500,000 species of plants, with anywhere from 10 to 100 million species still to be identified (Mora et al. 2011). A new study used a statistical approach to estimate the total number of species to be 8.7 million (Mora et al. 2011). However, the authors recognize the limitations of current direct methods for estimating biodiversity.







The challenges of biodiversity analysis

Biodiversity is fundamentally concerned with measuring the number of species and how they combine to form communities and ecosystems. The most common way of studying this is to characterize the differences between species using traits such as body size, physiological tolerance and body shape, or even habitat preferences (Bonada et al. 2006). However, these characteristics are difficult to measure easily and accurately for biodiversity analysis. There are bottlenecks such as the difficulty of identifying species at different life stages (for example, larvae), and measuring biodiversity with this method becomes more difficult as species richness increases or as abundance is distributed more evenly among species (evenness).
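To make the terms richness and evenness concrete, the following is a minimal sketch and not a method used in this thesis. It assumes the Shannon index and Pielou's evenness as the measures, and the specimen counts are invented for illustration.

```python
import math

def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over species with non-zero counts."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)

def pielou_evenness(counts):
    """Pielou's evenness J' = H' / ln(S), where S is the species richness."""
    richness = sum(1 for c in counts if c > 0)
    if richness < 2:
        return 0.0  # evenness is not informative for fewer than two species
    return shannon_index(counts) / math.log(richness)

# Hypothetical specimen counts for two communities with the same richness (4 species).
even_community = [25, 25, 25, 25]
skewed_community = [97, 1, 1, 1]

for name, counts in [("even", even_community), ("skewed", skewed_community)]:
    print(name, round(shannon_index(counts), 3), round(pielou_evenness(counts), 3))
```

Both communities have the same richness, but the skewed one scores much lower on evenness, which is the distinction the paragraph above draws.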

Although biodiversity measurements are based on counting the abundance of species in a target environment, the ability of research scientists to conduct measurements on a large scale is an important factor in the efficacy of any method. In species-rich ecosystems such as those in the tropics, analyses become more complicated, and the complexity of these ecosystems makes biodiversity assessment much more difficult.


Biodiversity and biomonitoring

Biological monitoring or biomonitoring is the systematic use of biological responses to assess and monitor changes in the environment with the intention of using that information in environmental assessment programs (Bonada et al. 2006). The use of environmental bio-indicators has become one of the common methods for evaluating the health of a target environment. In general, bio-indicators are defined as taxa that respond to environmental changes or disturbances in a way that can be observed and measured (quantified). The sensitivity of an organism’s reactions to environmental changes and the capacity of scientists to measure them are important factors in selecting bio-indicators (Hajibabaei et al., 2011; Nash, 1989; Noss, 1990).

Biomonitoring of water quality can occur in freshwater or marine water. Freshwater biomonitoring can occur in lentic (lakes and ponds) or lotic (rivers and streams) inland waters. Organisms that live in the bottom substrates (sediments, debris, logs, macrophytes, filamentous algae, etc.) of freshwater habitats (lentic and lotic) for at least part of their life cycle are considered benthic. Benthic macroinvertebrates are animals that inhabit the bottom substrate for at least part of their life cycle and are retained by mesh sizes ≥ 200 to 500 µm (Rosenberg et al., 1993; Suess, 1982; Ward et al., 1986).

The processing of benthic macroinvertebrate specimens using classical taxonomic approaches is an important barrier to the development of biomonitoring processes, especially when applied to large-scale programs such as the biomonitoring of freshwater to indicate the quality of a target stream. Moreover, bottlenecks can also occur at the sample collection, sorting and preparation stages. The identification of larvae has always been a major bottleneck in biomonitoring studies involving benthic macroinvertebrates (Bonada et al. 2006).

The routine biomonitoring process relies on the identification of one specimen at a time, which requires experienced technicians and sufficient time and funds to complete the process. Another difficulty within taxonomy-based biomonitoring is the depth of identification. Although keys exist for the identification of species, they are not comprehensive and lack descriptions of all life stages of target species.




DNA information and biodiversity analysis

Without genetic diversity, a population loses the ability to evolve and adapt to environmental changes. Genetic diversity has an impact on intraspecific levels of biodiversity. Hence, the study of genetic variation is central to biodiversity analysis. In order to accurately identify species based on genetic information, one needs to focus on genetic information that varies between species and not among members of the same species. Traditionally, however, species have been characterized on the basis of morphological characteristics. Morphological inconsistency is one of the main issues that scientists face: diagnostic characteristics may not be apparent at all life stages of an organism’s development, and its appearance may be influenced by environmental factors. Today, many different genetic markers and techniques have been introduced to assess genetic variation as a complement to traditional approaches (Roesch et al. 2007; Gill et al. 2006; Limpiyakorn et al. 2006).

Molecular biology tools have provided useful information on the diversity of target organisms through the detection of variation at the molecular level (mainly DNA and proteins). The reliable identification of organisms is an essential ability that these techniques provide within evolutionary, ecological and environmental studies. There are many instances in which genetic tools can give better resolution in the identification of species when barriers in identification processes exist.

A number of different techniques are available for genetic identification. The choice of one technique over another depends on the material being studied or the nature of the questions to be addressed. DNA barcoding is one of the DNA-based techniques that have been used for studying biodiversity and molecular evolution.





DNA Barcoding: Standardized molecular biodiversity analysis

DNA barcoding (Floyd et al., 2002; Hebert et al., 2003) is a relatively new molecular approach that uses a short uniform sequence of DNA to identify species across taxonomic groups. A 650 base pair region near the 5’ end of the mitochondrial gene cytochrome c oxidase 1 (COI) has been suggested as a DNA barcode for animals. Subsequently (Hebert and Gregory, 2005; Smith et al., 2008), DNA barcoding has gained momentum in biodiversity studies as a standard species identification method (Frézal and Leblois, 2008; Hajibabaei et al., 2006). DNA barcoding can differentiate between morphologically cryptic species more efficiently than other methods; however, it does not eliminate the need for traditional taxonomy. Beyond its use as an identification technique, it has been suggested that DNA barcoding can be used to expand our understanding of phylogenetic and population-level differentiation, although DNA barcode sequences are often not appropriate for comprehensive phylogenetic analyses. Some studies have questioned the ability of COI barcodes to distinguish between species in certain cases, such as hybrids and recently diverged species (Munch et al., 2008). These critics propose that COI should be used in concert with nuclear genes to yield more robust conclusions. Additionally, alternative genes have been proposed as DNA barcodes for plants and fungi (Hollingsworth et al., 2009). In cases where the DNA in a specimen is degraded, it has been shown that even a partial fragment of the DNA barcode, a mini-barcode, can provide species-level resolution (Meusnier et al., 2008). These mini-barcodes can often provide DNA barcode information in situations where a full-length barcode cannot be retrieved. These cases include museum samples with potentially degraded DNA as well as environmental samples in which next-generation sequencing methods (which can currently produce sequence reads of less than 500 bases) are needed.
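The idea of matching an unknown barcode against a reference library can be sketched as follows. This is only an illustration under stated assumptions: the sequences are short invented strings rather than real COI barcodes, the species labels are placeholders, sequences are assumed to be already aligned, and the 2% distance cut-off is an assumed parameter, not a value taken from this thesis.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Compare only positions where both sequences have an unambiguous base.
    sites = [(a, b) for a, b in zip(seq_a, seq_b) if a in "ACGT" and b in "ACGT"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in sites) / len(sites)

def identify(query, reference_library, max_distance=0.02):
    """Return (best_species, distance); species is None if nothing is close enough."""
    best_name, best_distance = None, None
    for name, reference in reference_library.items():
        distance = p_distance(query, reference)
        if best_distance is None or distance < best_distance:
            best_name, best_distance = name, distance
    if best_distance is not None and best_distance <= max_distance:
        return best_name, best_distance
    return None, best_distance

# Hypothetical 20-base "barcodes"; real COI barcodes are about 650 bases long.
library = {
    "species_A": "ATGGCACTATTAGGAGATGA",
    "species_B": "ATGGCTCTTTTAGGTGACGA",
    "species_C": "ATGTCATTACTTGGGGATGA",
}

print(identify("ATGGCACTATTAGGAGATGA", library))
print(identify("TTTTTTTTTTTTTTTTTTTT", library))
```

A nearest-match rule with a fixed threshold is the simplest possible design; it returns no identification when the best match is too distant, which mirrors the situation of a species missing from the reference library.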





Next-generation sequencing for biomonitoring

Although DNA barcoding contributes to taxonomic research and biodiversity analysis by identifying unknown specimens, some important issues need to be considered concerning the possible application of barcoding to the analysis of bulk environmental samples. For example, is it possible to analyze and barcode all species in an environmental sample without separating them into individuals? If so, would it then be possible to quantify species abundance by analyzing bulk samples? Next-generation sequencing (NGS) platforms may aid in answering these questions. While Sanger sequencers work on single specimens, NGS devices such as the 454-FLX (Margulies et al., 2005) can read the sequence of thousands to millions of DNA fragments. One of these technologies, massively parallelized pyrosequencing, which is currently implemented in the Roche 454 device, has three characteristics that make it suitable for the analysis of biodiversity in a large number of DNA templates, such as DNA extracted from bulk environmental samples: 1) high throughput, 2) the ability to sequence in parallel, and 3) the ability to read a relatively long length of sequence (currently 250-400 bases). The third characteristic is especially important for accurate identification of biota in environmental samples, as the alternative technologies produce short sequence reads incapable of distinguishing taxa in complex environmental samples (Claesson et al. 2010). Therefore, through the use of a 454 pyrosequencer, it is possible to obtain sequence information from DNA barcodes and to use bioinformatics to compare this information to standard barcode libraries to assess biodiversity in an environmental sample. 454 pyrosequencing produces large amounts of data at low cost and provides a method for sequencing environmental DNA without a prior cloning step. To date, 454 pyrosequencing technology has mainly been used in environmental studies involving bacteria. While the use of DNA barcoding combined with next-generation sequencing offers great potential in broadening the application of DNA barcodes, such protocols have not been fully developed.

The goal of a new technology development project at the Biodiversity Institute of Ontario (BIO) is to optimize protocols for data generation and bioinformatics analyses of an environmental barcoding system for biomonitoring applications. The 454-FLX pyrosequencing facility has been generating data from sentinel groups, such as benthic macroinvertebrates including mayflies (Ephemeroptera), stoneflies (Plecoptera), and caddisflies (Trichoptera), collectively called “EPTs”. Because of their sensitivity to environmental changes, EPTs are key taxa for environmental biomonitoring studies for freshwater quality assessments (Bonada et al. 2006). If these taxa are to be used in environmental barcoding using a pyrosequencing approach, we need to understand and optimize the recovery of their DNA barcode sequences directly from environmental samples. Although groundbreaking work at BIO has proved this approach feasible (Hajibabaei et al. 2011), molecular tools for assessing multi-template polymerase chain reaction (PCR) prior to pyrosequencing analysis are not available. This thesis employs six bioindicator species of Ephemeroptera and Trichoptera (three species from each order) as a model to assess various factors in developing a reliable biodiversity and biomonitoring assessment approach using pyrosequencing. Quantitative PCR technology will be employed to assess the behavior and efficiency of PCR primers used in the multi-template PCR necessary to perform amplicon-based pyrosequencing.


Quantitative PCR

The polymerase chain reaction (PCR) can produce millions of copies of a particular DNA sequence in approximately 1.5-2 hours. This automated process avoids the use of cloning and bacteria to amplify DNA. Real-time polymerase chain reaction or quantitative polymerase chain reaction (qPCR) is similar to normal PCR, but the PCR amplicons are detected and quantified as they are generated. Hence, qPCR has been used for quantifying the PCR product of one or more specific sequences in a DNA sample.

Early efforts to harness the quantifying power of PCR faced limitations: data were generated by removing an aliquot of the reaction at specific cycles, by making a serial dilution of the PCR product or, in some cases, by including an internal control (Becker et al., 1996; Kennedy, 2011; Ozawa et al., 1990; Piatak et al., 1993; Roux, 2009). Although these methods are able to quantify the PCR product to some extent, they are time consuming and labour intensive, so their use has been limited.

Quantitative PCR has had a great impact on molecular biology and simplified quantification. The technique is based on monitoring the amount of fluorescence in each cycle, which is produced by a dye that binds to the PCR amplicon as it is generated. The amount of PCR product can be plotted as a function of cycle number. With this method there is no longer a need to sample a reaction at various cycles or to use labour-intensive techniques to predict the exponential phase. The technique recognizes the exponential region by plotting fluorescence on a logarithmic scale. The first cycle at which the fluorescence level is significantly higher than background levels reflects the initial template amount. The quantification cycles (Cqs) are determined by a fluorescence threshold (the “CT value” is the number of cycles required for each template to pass the threshold). Figure 1 provides an example of the differences in CT value and cycle number which may be detected in a qPCR experiment.
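The threshold idea described above can be sketched in a few lines. This is an illustrative simplification, not the algorithm of any particular qPCR instrument: the fluorescence curves are simulated with an assumed idealized model (perfect doubling, constant background, hard plateau), the threshold of 0.1 is arbitrary, and the crossing point is found by linear interpolation between cycles.

```python
def simulate_curve(start_signal, cycles=40, efficiency=1.0, background=0.01, plateau=10.0):
    """Idealized fluorescence per cycle: background plus exponentially growing product."""
    curve = []
    for n in range(1, cycles + 1):
        signal = background + start_signal * (1.0 + efficiency) ** n
        curve.append(min(signal, plateau))  # reagents run out, so the signal saturates
    return curve

def threshold_cycle(fluorescence, threshold):
    """Fractional cycle at which fluorescence first crosses the threshold (cycles start at 1)."""
    for i in range(1, len(fluorescence)):
        previous, current = fluorescence[i - 1], fluorescence[i]
        if previous < threshold <= current:
            # fluorescence[i - 1] belongs to cycle i; interpolate towards cycle i + 1.
            return i + (threshold - previous) / (current - previous)
    return None  # the threshold was never crossed

# Two hypothetical templates differing 100-fold in starting amount.
abundant = simulate_curve(start_signal=1e-6)
scarce = simulate_curve(start_signal=1e-8)

ct_abundant = threshold_cycle(abundant, threshold=0.1)
ct_scarce = threshold_cycle(scarce, threshold=0.1)
print(ct_abundant, ct_scarce, ct_scarce - ct_abundant)
```

Under perfect doubling, a 100-fold difference in starting template corresponds to a shift of roughly log2(100), or about 6.6 cycles, which is why a lower CT value indicates more starting template.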






Why qPCR?

Although PCR-based techniques have had a great influence on the field of molecular biology, the post-PCR methods used to analyze their results are limited. Gel electrophoresis is one of the most common techniques for visualizing PCR products. Although it is fast, easy and inexpensive, it cannot distinguish between different products with the same molecular weight.

Soon after the introduction of qPCR in 1996, it became an everyday tool in molecular labs; qPCR machines have simplified amplicon recognition by providing the ability to monitor amplification during each cycle. All available instruments designed for qPCR experiments measure the progress of PCR amplification by tracking the changes in the fluorescence level coming from the amplicons in each cycle within each PCR reaction. In addition, these measurements can be taken without opening the instrument, so the risk of contamination decreases significantly.

Quantitative PCR offers many advantages for quantitative analysis and detection of specific target genes and has been widely used in research and diagnostics. Its advantages include the ability to monitor the reaction continuously, a rapid running time, the potential for high-throughput analysis, high sensitivity (~3 pg or 1 genome equivalent of DNA) and a wide dynamic range, as it can detect from 10 to 10^10 copies of target DNA. Conversely, the technique has disadvantages such as a limited capacity for multiplexing, the requirement for high levels of optimization and the need for technical skills above those required for normal PCR.
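The dynamic range mentioned above is usually explored with a dilution series, from which a standard curve of CT against log10(copy number) is fitted. The sketch below shows the commonly used relationship between the slope of that curve and amplification efficiency, E = 10^(-1/slope) - 1. The dilution series is synthetic, generated under an assumed perfect-doubling model with an arbitrary intercept of 38 cycles; it is not data from this thesis.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def amplification_efficiency(slope):
    """Efficiency from a standard-curve slope; 1.0 means the product doubles every cycle."""
    return 10 ** (-1.0 / slope) - 1.0

# Synthetic ten-fold dilution series (hypothetical): 10^2 to 10^6 starting copies.
copies = [10 ** k for k in range(2, 7)]
# CT values generated for ideal doubling: each ten-fold dilution adds log2(10) cycles.
ct_values = [38.0 - math.log10(c) / math.log10(2) for c in copies]

slope, intercept = linear_fit([math.log10(c) for c in copies], ct_values)
efficiency = amplification_efficiency(slope)
print(round(slope, 3), round(intercept, 2), round(efficiency, 3))
```

A slope near -3.32 corresponds to an efficiency near 100%; shallower or steeper slopes in real data would indicate deviations from ideal doubling.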


In this study, I employ qPCR to evaluate primer-binding affinities in different primer sets used in multi-template PCR amplification of bulk environmental samples prior to pyrosequencing.




Objectives

The objectives of this study are to improve the present understanding of the patterns and processes obtained using molecular information from DNA barcodes in biodiversity assessment, using species from two orders of the class Insecta as models. More specifically, an attempt will be made to examine the use of barcoding as a tool for biodiversity assessment and biomonitoring of environmental samples. I predict that pyrosequencing will provide a more robust and comprehensive species-level biodiversity measure from bulk samples, at a much faster pace, than other approaches such as cloning and Sanger sequencing.

I predict that primers that bind to specific sites (100% matching) in the target species will lead to better amplification efficiency, as reflected in qPCR analysis. Moreover, the proportion of pyrosequencing reads obtained from a mixed-template PCR analysis will reflect the amplification efficiency of qPCR for each target-specific primer set.
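The second prediction can be illustrated with a simple model. The sketch below assumes idealized exponential amplification, N = N0 * (1 + E)^n, with no plateau and no sequencing bias, and assumes that read proportions follow the proportions of PCR product. The template names and efficiency values are hypothetical and chosen only to show the direction of the effect.

```python
def amplified_copies(start_copies, efficiency, cycles):
    """Copies after a given number of cycles under idealized exponential amplification."""
    return start_copies * (1.0 + efficiency) ** cycles

def expected_read_fractions(templates, cycles):
    """Share of the final product per template, used as a proxy for read share."""
    products = {
        name: amplified_copies(start, efficiency, cycles)
        for name, (start, efficiency) in templates.items()
    }
    total = sum(products.values())
    return {name: amount / total for name, amount in products.items()}

# Hypothetical mixed template: equal starting copies, different primer efficiencies.
templates = {
    "perfect_match": (1000, 0.95),
    "one_mismatch": (1000, 0.85),
    "several_mismatches": (1000, 0.70),
}

fractions = expected_read_fractions(templates, cycles=30)
for name, share in fractions.items():
    print(name, round(share, 3))
```

Even modest differences in per-cycle efficiency compound over 30 cycles, so a template that starts at equal abundance can end up strongly under-represented in the product. This is the amplification bias that qPCR is used to detect before sequencing.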















MATERIAL AND METHODS

Target species selection and specimen collection

Three local species from the insect order Trichoptera (Ceratopsyche bronta, C. sparna, and Chimarra obscura) and three local species from the insect order Ephemeroptera (Maccaffertium interpunctatum, M. modestum, and Caenis diminuta) were selected to test the effect of primer bias. In both cases, two species were selected from the same genus and one species was selected from another, distantly related genus in the same order. These insect orders were selected because of their importance in freshwater biomonitoring programs. Target species were chosen because of their abundance and availability, which allowed access to fresh material for downstream analyses.

Three sampling sites were selected for this study. The first two were near Fredericton, New Brunswick. These sites were the Marysville Bridge on the Nashwaak River (45°59'4.19"N, 66°35'29.40"W), and the Renous River (46°47'46.65"N, 66°11'58.52"W). The third site was on the Grand River in Ontario (43°50'0"N, 80°25'0"W) close to the Elora Conservation Area. Both adult and larval insect samples were obtained from all three sites during the spring and summer of 2009. A light trap technique was used to collect adults, and each individual was placed in a 1.5 ml tube containing 95% ethanol. A total of 140 Trichoptera individuals from the two New Brunswick sites were placed in separate empty tubes, frozen overnight, and pinned and identified using a taxonomic key on the next day.

To select target samples, a total of 279 individual insects from the 6 species, collected from the three sites and either pinned or stored in ethanol, were tentatively classified on the basis of morphological characteristics and sorted into three 96-well plates.




DNA extraction

A single leg from each individual was placed into a 10 MP lysing matrix tube (MP Biomedicals Inc., Solon, Ohio, USA) and homogenized using the MP FastPrep-24 Instrument (MP Biomedicals Inc.) set at “6” for 30 seconds. DNA was extracted from each homogenized tissue sample using a NucleoSpin tissue kit (MACHEREY-NAGEL Inc., Bethlehem, Pennsylvania, USA) following the manufacturer’s instructions. The DNA was eluted with 70 µl of molecular biology grade water pre-warmed to 70 °C.


Primer design and optimization

Routine DNA barcoding of target samples followed standard COI barcoding protocols (Hajibabaei et al., 2005). A full-length COI DNA barcode was amplified using the LCO1490/HCO2198 primers (Folmer et al., 1994). In order to evaluate primer binding bias, additional primers were designed with 100% match to the sequence of the target species. Previous studies focusing on the amount of DNA barcode sequence information needed for species differentiation and resolution have shown that a partial fragment of the standard COI barcoding region can be informative enough to discriminate species in most groups (Hajibabaei et al., 2006; Hollingsworth et al., 2009; Janzen et al., 2005). Following these studies and by taking advantage of available barcode sequences (for primer design), the species-specific primers were designed within the COI standard barcode region.

After aligning the available barcodes for the target species, two regions for designing primers were selected. Twelve primer sets were designed in total: six were designed near the 5’ end of the COI DNA barcode region (Set A) and the other six were designed at the 3’ end of the DNA barcode region (Set B). Primer Set A targeted a 143 bp amplicon of the COI barcode region and Primer Set B targeted a longer fragment of 305 bp at the opposite end of the COI barcode region. The routine primer design conventions, including high G+C content (more than 50%), minimal secondary structure, primer length and self-complementarity, were considered (Aird et al., 2011; Lakes, 2001). Primers were checked against routine primer design rules using tools available on the Integrated DNA Technologies, Inc. (Coralville, Iowa, USA) website and were produced by the same company. All primers were received lyophilized in tubes and diluted to 10 µM working solutions in molecular biology grade water. Table 1 provides details of primer codes and their nucleotide sequences.
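
As an illustration only, two of the routine checks mentioned above (G+C content and self-complementarity) could be sketched in Python as follows. The function names are hypothetical and the sketch does not reproduce the IDT web tools used in this study; it reports raw values and leaves the acceptance thresholds to the user.

```python
# Minimal sketch of two primer design checks; not the IDT implementation.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_fraction(primer: str) -> float:
    """Fraction of G and C bases in the primer (0.0 to 1.0)."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence containing only A, C, G, T."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def longest_self_complementary_run(primer: str) -> int:
    """Length of the longest stretch of the primer whose reverse
    complement also occurs in the primer (a crude proxy for
    hairpin or self-dimer potential)."""
    primer = primer.upper()
    rc = reverse_complement(primer)
    best = 0
    n = len(primer)
    for i in range(n):
        for j in range(i + 1, n + 1):
            if j - i > best and primer[i:j] in rc:
                best = j - i
    return best
```

A primer would then be flagged for closer inspection if, for example, its G+C fraction fell below 0.5 or it contained a long self-complementary run.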

The PCR mixture consisted of 17.5 µl molecular biology grade water, 2.5 µl 10X reaction buffer, 2 mM MgCl2 (from a 50 mM stock), 0.2 mM dNTP mix (from a 10 mM stock), 0.2 µM forward primer and 0.2 µM reverse primer (each from a 10 µM stock) and Invitrogen’s Platinum Taq polymerase (5 U/µl) in a total volume of 25 µl. The amplification regime was set to initial denaturing at 94°C for one min, followed by 4 cycles of denaturing at 94°C for 40 s, annealing at 45°C for 40 s and extension at 72°C for one min. For the next 35 cycles, the annealing temperature was increased to 50°C, followed by final extension at 72°C for 10 min. Amplicons were visualized on 1.5% agarose gel using 0.3 µl of ethidium bromide for 5 µl of each PCR product in TE 10X buffer.
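
The stock and final concentrations quoted above imply reagent volumes through the usual dilution relation C1·V1 = C2·V2. A minimal sketch, assuming both concentrations are given in the same unit (the function name is hypothetical):

```python
def stock_volume_ul(stock_conc: float, final_conc: float, total_volume_ul: float) -> float:
    """Volume of stock (in µl) needed to reach final_conc in total_volume_ul.

    Uses C1 * V1 = C2 * V2; stock_conc and final_conc must share a unit.
    """
    return final_conc * total_volume_ul / stock_conc


# Values from the 25 µl reaction described in the text.
mgcl2_ul = stock_volume_ul(50.0, 2.0, 25.0)    # 50 mM stock, 2 mM final
dntp_ul = stock_volume_ul(10.0, 0.2, 25.0)     # 10 mM stock, 0.2 mM final
primer_ul = stock_volume_ul(10.0, 0.2, 25.0)   # 10 µM stock, 0.2 µM final
```

Under these assumptions the reaction takes 1 µl of MgCl2 and 0.5 µl each of dNTP mix and primer, which agrees with the volumes listed for the 454 amplicon preparation later in this chapter.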

A consensus optimal condition (considering factors affecting PCR) was selected by running test PCRs for each primer set for each species and selecting the condition where all primer sets provided amplicons with relatively similar intensity on agarose gels. For example, an optimal annealing temperature of 50°C was selected after gradient PCR was done at varying annealing temperatures of 40°C, 43.5°C, 46°C, 50°C and 55°C.





Sanger sequencing validation of amplicons

Amplicons were verified to correspond to the targeted fragment of the COI barcode region by bidirectional Sanger sequencing using BigDye chemistry version 3.1 (Applied Biosystems). Excess primers and dNTPs were removed from the sequencing reaction using EdgeBio’s AutoDTR96 (Gaithersburg, MD, USA), after which the purified products were analyzed on an ABI 3730xl sequencer (Applied Biosystems, Foster City, CA, USA).


Quantitative PCR

Figure 2 provides an overview of the qPCR experimental workflow. Below I provide the
details of major steps in this workflow.


1. Template selection and normalization

Quantitative PCR experiments were performed using three dilutions of DNA extracts (10⁻¹, 10⁻² and 10⁻³) starting with the same concentration (250 ng/µl) in all tested specimens. Additionally, normalized and purified amplicons from each species (amplified barcode region using standard barcoding primers) were used as the DNA template for qPCR in six different normalized dilutions.


2. Experimental design

Measurements with the NanoDrop spectrophotometer showed the DNA concentration acquired from target species (Table 2). Quantitative PCR optimization was performed for dilutions of 10⁻¹, 10⁻² and 10⁻³ for normalized genomic DNA extracts (250 ng/µl), whereas for purified amplicon-based material (70 ng/µl), six dilutions (10⁻¹, 10⁻², 10⁻³, 10⁻⁴, 10⁻⁵, and 10⁻⁶) were used. The experiment was designed as a matrix so that the PCR product for each species was matched with its own primers and every other primer. The matrix layout also allowed the primer behavior among all target species to be studied.
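
The matrix layout described above can be sketched as a simple enumeration of every template, primer and dilution combination. This is an illustration only: the species codes follow the labels used in the Results where available, "T3spa" is an assumed label for C. sparna, and the record layout is hypothetical.

```python
from itertools import product

# Species codes; "T3spa" is an assumed label, the others appear in the Results.
SPECIES = ["T1obs", "T2bro", "T3spa", "E1dim", "E2mod", "E3int"]
PRIMER_SETS = ["A", "B"]
DILUTION_EXPONENTS = [-1, -2, -3, -4, -5, -6]  # amplicon-based templates


def build_matrix(species=SPECIES, primer_sets=PRIMER_SETS, exponents=DILUTION_EXPONENTS):
    """Return one record per template x primer target x primer set x dilution."""
    reactions = []
    for template, primer_target, primer_set, exp in product(
        species, species, primer_sets, exponents
    ):
        reactions.append(
            {
                "template": template,
                "primer": f"{primer_target}-{primer_set}",
                "dilution": 10.0 ** exp,
                # True when the primers were designed for this template species.
                "is_target": template == primer_target,
            }
        )
    return reactions
```

For six species, two primer sets and six dilutions this gives 432 distinct reactions before replication, 72 of which pair a template with its own primers.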

Three dilutions (1000, 250 and 50 pg/µl) were subsequently tested in qPCR (see below). To obtain a presumably equal number of the target DNA template and to avoid fluctuations in gene/mitochondrial copy number, normalized DNA extracts were used as a template to produce an amplicon from the standard barcode region (Figure 2) that was then used as template for subsequent qPCR analyses. The primer set LCO1490/HCO2198 (Folmer et al., 1994) was used to amplify the full-barcode amplicons. The same PCR condition for amplicon preparation used for Sanger sequencing was used for preparation of amplicon-based material. All amplicons were purified using the QIAquick 96 PCR Purification Kit (Qiagen Inc., Toronto, Ontario, Canada), subsequently quantified using the NanoDrop spectrophotometer ND-1000 (V3.3.0), and normalized on the basis of the least concentrated amplicon.


3. Reaction conditions for qPCR experiments

QuantiTect SYBR® Green PCR kit (Qiagen) and Eppendorf Mastercycler® ep realplex Thermal Cyclers were used for all qPCR experiments. Based on primer optimization results, the annealing temperature was set at 50°C for all subsequent qPCR experiments. Other PCR variables were optimized as well. For example, the concentration of MgCl2 was set to 7 mM final concentration instead of 2 mM. Likewise, primer concentration was set to 900 nM after testing 300 nM, 600 nM, 900 nM and 1200 nM. PCR reactions also included 2x QuantiTect SYBR Green PCR master mix (12.5 µl per reaction), 2 µl of DNA template (for both genomic DNA and full-barcode amplicons as template) and RNase-free water to a total volume of 25 µl for each reaction.

4. Data analysis

All qPCR experiments were performed in triplicate to determine the stability of the results and the average of the three replicates was used for the qPCR analysis (Rieu and Powers, 2009; Udvardi et al., 2008). Standard curves were generated from the machine default software and the logarithm of relative amplification and threshold cycle (CT) values were determined. The CT value is used commonly in reporting qPCR results and corresponds to the cycle number in which the fluorescent signal of the reaction passes the threshold line. The CT value is inversely related to the amount of starting template. Assuming that PCR is operating with 100% efficiency, the copy number of amplicons doubles every cycle.
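
The link between a standard curve and amplification efficiency can be sketched as follows. This is a minimal illustration, assuming the commonly used relation E = 10^(-1/slope) - 1 for a plot of CT against log10 of template dilution; the helper names are hypothetical and the calculation performed by the instrument software is not reproduced here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def amplification_efficiency(slope: float) -> float:
    """Efficiency from the slope of CT versus log10(template amount).

    A value of 1.0 corresponds to perfect doubling in every cycle.
    """
    return 10.0 ** (-1.0 / slope) - 1.0
```

With perfect doubling the slope is about -3.32 cycles per ten-fold dilution, so slopes that deviate from this value point to reduced (or apparently inflated) efficiency for that primer and template combination.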

The Eppendorf analysis software (Eppendorf mastercycler ep, realplex 2.0) was used to analyze the results; CT values were recorded with a default threshold setting of 100 and an automatic mode baseline setting for all target specimens. To ensure consistency of qPCR experiments in different target species and primer combinations, a standard curve was generated for each primer/species using the CT value with the threshold set at 100 in 6 different dilutions.

To describe the difference between the CT value of the target gene and the CT value of the corresponding gene (COI), the ∆CT value is calculated:

∆CT = CT (target species with specific primers) − CT (non-target species with the same primers)





I used 2^∆CT to calculate the copy number of generated amplicons in sample A relative to that in sample B. For example, if ∆CT between species A and B is 7 cycles (it takes 7 more cycles to see amplification of A), then there is:

2^7 = 128 times more B than A
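
The calculation above can be written as a short sketch. The function names are hypothetical, and the optional efficiency argument is an added assumption that reduces to the doubling case used in this thesis when left at its default of 1.0.

```python
def delta_ct(ct_a: float, ct_b: float) -> float:
    """Difference in threshold cycle between sample A and sample B."""
    return ct_a - ct_b


def relative_copies(d_ct: float, efficiency: float = 1.0) -> float:
    """Fold difference in amplified copies implied by a CT difference.

    With efficiency = 1.0 (perfect doubling) this is 2 ** d_ct.
    """
    return (1.0 + efficiency) ** d_ct
```

For instance, if A crosses the threshold at cycle 27 and B at cycle 20, `relative_copies(delta_ct(27, 20))` evaluates to 128, matching the worked example.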







454 pyrosequencing

1. Experimental design

Amplicon-based metagenomics analysis is one of the major applications of next-generation sequencing (NGS) technology in biodiversity science. The amount of data produced by NGS technology provides insights into the diversity of organisms in bulk samples in an unprecedented way. Specifically, for amplicon-based analysis of biodiversity, Roche 454 pyrosequencing technology has been the most practical choice since this technology produces longer reads as compared to other available NGS options, namely Illumina and SOLiD (Pandey et al., 2011).

Since 454 pyrosequencing and other NGS approaches are becoming the main tools for the analysis of mixed environmental samples, I used two experimental mixtures to test primer-binding properties in 454 experiments. The first mixture consisted of an equimolar pool of the DNA extracts from all six target species, while the second included an equimolar pool of purified full-length COI DNA barcode amplicons of the target species (following the same procedure as amplicon-based qPCR analysis described above). Full-length DNA barcode amplicons of each target were normalized to 70 ng/µl, and a 10⁻³ dilution (1 µl of PCR product in 999 µl of water) was used to prepare the equimolar pool (Figure 3).


2. Multiplexing amplicons

In order to combine sequencing reactions for multiple specimens in a single 454 sequencing lane and then separate and track individual 454 sequencing reads, Multiplex Identifier sequence tags/molecular barcodes (MIDs) (Binladen et al., 2007) were designed for each target species and were incorporated in each species-specific primer set (A and B). MIDs were also employed because the sequences of the primers themselves were not fully discriminatory, and in order to rule out mismatches, wrong assignments and sequencing errors. Each MID is a 10-base oligonucleotide (Table 3).
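
Sorting reads by MID with zero mismatches amounts to an exact prefix match on the 10-base tag. The sketch below is illustrative only: the function name and the tag sequences in the usage note are invented placeholders, not the MIDs listed in Table 3, and it assumes the MID sits at the very start of each read.

```python
def demultiplex(reads, mids):
    """Assign reads to MIDs by exact (zero-mismatch) prefix match.

    reads: iterable of read sequences (strings).
    mids:  dict mapping a sample name to its MID tag.
    Returns (bins, unassigned); matched reads are stored with the tag removed.
    """
    bins = {name: [] for name in mids}
    unassigned = []
    for read in reads:
        for name, tag in mids.items():
            if read.startswith(tag):
                bins[name].append(read[len(tag):])
                break
        else:
            # No tag matched exactly, so the read is set aside.
            unassigned.append(read)
    return bins, unassigned
```

For example, with the placeholder tags `{"T1": "AAAAACCCCC", "T2": "GGGGGTTTTT"}`, a read beginning with `AAAAACCCCC` is binned under T1, while a read carrying a single error in the tag is left unassigned.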

The 454 experiment was completed in two physically separated lanes in a 16-lane 454 picotiter plate. One lane was used for genomic DNA-based analysis (for primer sets A and B) and the other for PCR product-based material (for primer sets A and B).


3. Amplicon preparation

The first PCR was performed with target-specific primers. Each PCR reaction contained 2 µl pooled DNA templates (250 ng/µl each), 17.5 µl molecular biology grade water, 2.5 µl 10× reaction buffer, 1 µl 50× MgCl2 (50 mM), 0.5 µl dNTPs mix (10 mM), 0.5 µl forward primer (10 µM), 0.5 µl reverse primer (10 µM), and 0.5 µl Invitrogen's Platinum Taq polymerase (5 U/µl) in a total volume of 25 µl. The PCR started with heated lid at 95°C for 5 min, followed by 15 cycles of 94°C for 40 sec, 43.5°C for 1 min, and 72°C for 30 sec, a final extension step at 72°C for 5 min, and hold at 4°C. All target species amplicons were purified using Qiagen's MinElute PCR purification columns and eluted in 50 µl molecular biology grade water. The amplicons from the first PCR were used as template in the second PCR with similar conditions using 454 fusion-tailed primers in a 30-cycle amplification regime. The second PCR was used to attach fusion tails to the amplicons to allow them to bind to the beads in the 454 emulsion PCR (described below). For all PCRs the Eppendorf Mastercycler gradient S thermal cycler was used. PCR success was assessed by agarose gel electrophoresis (1.5%) and negative controls were included in all experiments.


4. 454 Pyrosequencing amplicon library preparation

In 1.5 ml tubes, 22.5 µl of the generated amplicons were mixed with 22.5 µl of molecular grade water. To this mix, 72 µl of AMPure beads were added and vortexed well. The mixture was stored at room temperature for 10 minutes in a Magnetic Particle Concentrator (MPC). Unused reagents and primer dimers were washed away with 70% ethanol and fragments were eluted with 10 µl of 1× Tris EDTA (TE) buffer.

Subsequently, the quantified libraries were amplified in micro-reactors through emulsion PCR (emPCR) followed by Streptavidin bead enrichment and emulsion breaking. The beads attached to amplified DNA fragments were denatured with 1N sodium hydroxide solution and annealed to a specific sequencing primer. All these steps and subsequent sequencing steps on the 454 instrument were performed according to the Roche 454 GS FLX amplicon sequencing manual protocol updated in October 2009 and revised by November 2010 (Roche, 2009).


5. 454 Data analysis framework

The FASTA files (FNA) and the quality score files (QUAL) were obtained from the 454 FLX Sequencer after signal processing. Both FNA and QUAL files were generated through Roche signal processing software using amplicon processing with default settings.

Data analysis was performed using two approaches:




A. Manual analysis: sequences were inspected by eye in sequence editing software such as BioEdit (Hall, 1999) and the quality-filtering step was omitted for manual filtering. This approach allowed the retrieval of a maximal number of reads for subsequent analysis (see Results for details). I used all the generated sequences to count the number of sequences generated by each primer set for each target species.

B. Automated analysis: In this approach the SeqTrim software (Falgueras et al., 2010) was used for filtering low quality sequences based on set criteria (see below).


Automated sequence filtering

After obtaining both FNA and QUAL files, all MIDs were sorted with zero mismatches. Using the quality filter software SeqTrim (Falgueras et al., 2010), the sequences were filtered as follows:

A quality filter with a 10 bp sliding window was applied to the sequences. If the Phred score (Ewing et al., 1998; Ewing and Green, 1998) was less than 20 for any window of 10 bp, the sequence was deleted. After quality filtering, all sequences were sorted based on their amplification primers and all sequences shorter than 80 bp were removed. The remaining sequences were clustered using the UClust program (Edgar, 2010) and all clusters with fewer than 3 reads were removed. Finally, all sequences were compared against the reference library using Megablast and the number of reads for each target species was determined. The above routine was performed using a Perl script (Wall et al., 2000) and filtering was completed using SeqTrim filtering software.
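
The filtering criteria listed above can be sketched in Python as follows. This is not the SeqTrim or Perl routine used in this study. It assumes that "Phred score less than 20 for any window" refers to the mean quality of each 10 bp window, and that clusters are supplied as a mapping from a cluster identifier to its member reads; all function names are hypothetical.

```python
def passes_quality(quals, window=10, min_q=20):
    """True if every window of `window` bases has a mean Phred score >= min_q.

    Reads shorter than one window are rejected.
    """
    if len(quals) < window:
        return False
    total = sum(quals[:window])
    if total < min_q * window:
        return False
    # Slide the window one base at a time, updating the running sum.
    for i in range(window, len(quals)):
        total += quals[i] - quals[i - window]
        if total < min_q * window:
            return False
    return True


def filter_reads(reads, min_length=80):
    """Keep (sequence, qualities) pairs that pass quality and length filters."""
    return [
        (seq, quals)
        for seq, quals in reads
        if passes_quality(quals) and len(seq) >= min_length
    ]


def drop_small_clusters(clusters, min_reads=3):
    """Remove clusters containing fewer than min_reads reads."""
    return {cid: members for cid, members in clusters.items() if len(members) >= min_reads}
```

Because a single low-quality window rejects the whole read, a filter of this kind is strict, which is consistent with the small proportion of reads retained by the automated analysis.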





Manual sequence analysis

In the manual sequence analysis method I omitted the automated filtering step to keep all sequences and used BioEdit and MEGA to sort sequences and eliminate low quality sequences. I sorted the sequences based on the multiplex identifiers (MIDs) with zero mismatches. After sorting each MID based on the forward and reverse primer sequences, all MIDs and primers were trimmed and the remaining sequences were filtered to a minimum length of 100 bp to be prepared for alignment. Sequences were then aligned using available reference sequences of the 6 target species. Finally, all sequences were clustered by constructing a neighbor-joining (NJ) tree from Kimura 2-parameter sequence divergence estimates in MEGA4 (Tamura et al., 2007). I used this tree to cluster my sequences so I could count the sequences belonging to each species more effectively.
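
For reference, the Kimura 2-parameter divergence between two aligned sequences can be sketched as below. This is an illustration of the standard formula d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions; it is not the MEGA4 implementation, the function name is hypothetical, and sites with gaps or ambiguous bases are simply skipped.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    pairs = [
        (x, y)
        for x, y in zip(seq_a.upper(), seq_b.upper())
        if x in "ACGT" and y in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    n = len(pairs)
    # Transitions are purine-purine or pyrimidine-pyrimidine changes.
    transitions = sum(
        1
        for x, y in pairs
        if x != y and ({x, y} <= PURINES or {x, y} <= PYRIMIDINES)
    )
    transversions = sum(1 for x, y in pairs if x != y) - transitions
    p = transitions / n
    q = transversions / n
    return -0.5 * math.log((1.0 - 2.0 * p - q) * math.sqrt(1.0 - 2.0 * q))
```

A matrix of such pairwise distances is the input from which a neighbor-joining tree is built.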
















RESULTS

Quantitative PCR Results

The 1000 pg/µl and 50 pg/µl template dilutions gave CT values that were either too low (≤ 10 cycles) or too high (≥ 38 cycles), respectively, using a fluorescence threshold of 100. The 250 pg/µl dilution gave CT values in the expected range (≥ 10 and ≤ 38).

The results from qPCR experiments using total genomic DNA as template did not show a general trend that can either support or refute the expected higher efficiency of species-specific primers in amplifying target species in any of the 6 species tested. I could not generate standard curves from the genomic DNA results: data points were missing in several qPCR runs across different combinations, so the results were not sufficiently consistent to allow generation of a standard curve.


In fact, there were cases of target species being less efficiently amplified as compared to non-targets (Figure 4, Appendix 1). Thus, primer match may not be the only factor at play in this experimental design and availability of target mitochondrial DNA might vary to the point that it may offset potential primer mismatch. Hence, normalized PCR products were used to test the primer binding bias.


Based on the results from experiments using genomic DNA templates, it was hypothesized that the fluctuations and non-linear results might be due to variation in the mitochondrial copy number or non-specific amplification.

Using the full-length DNA barcode amplicons as template for qPCR allowed me to generate consistent standard curves for different target species (Figure 5, Appendix 2). It is important to note that there are fluctuations in the slope of standard curves, which may indicate different efficiencies of primer binding at different concentrations of template DNA (Figure 5).

In the qPCR experiments using genomic DNA as template, amplification only occurred with 10⁻¹ and 10⁻² dilutions of the template DNA (250 ng/µl) with the exception of E1 (C. diminuta) primers (sets A and B) that produced detectable amplification with a template dilution of 10⁻³ as well (Tables 4 and 5). As previously mentioned, I noted a lack of consistency in standard curve calculations of genomic DNA-based experiments (see above), and with the small number of data points in the actual cross-species qPCR experiments, I decided not to pursue this line of experimentation further.

Unlike standard curves generated using genomic DNA as the template, standard curves using full-length amplicon templates were consistent across primer sets (Figure 5, Tables 6 and 7). Hence, I predict that amplicon-based qPCR analysis of cross-species primer tests should provide reliable results on the effect of primer specificity in qPCR efficiency.

Results of the amplicon-based qPCR in set A supported this hypothesis. With the exception of two primer sets, all target-specific primers amplified their target species more efficiently than non-target species (Tables 4 and 5). The first exceptional case was the primer set designed for Maccaffertium modestum (E2mod), which amplified Maccaffertium interpunctatum (E3int) in earlier cycles (i.e., more efficiently) than its own target species. The second exception involved the primer set designed for Caenis diminuta (E1dim), which amplified Ceratopsyche bronta (T2bro), Maccaffertium modestum (E2mod) and Maccaffertium interpunctatum (E3int) in earlier cycles than it amplified its own target. This observation is important because primer E1dim was designed for an Ephemeroptera species but in fact amplified a Trichoptera species more efficiently (Tables 4 and 5).




The number of species which could pass the threshold in all 6 different dilutions in qPCR was higher when using primer set B than set A (Tables 4 to 7). For example, using primer set A for Chimarra obscura (T1obs), no species other than the target passed the threshold in all dilutions. However, the set B primers designed for this species produced positive qPCR results for other species (Tables 6 to 8). In the majority of experiments using primer set B, target species amplified more efficiently (passed the threshold at lower cycle numbers), except for primers designed for C. diminuta (E1dim) and M. modestum (E2mod).



Relative Amplified Copies (RAC) Analysis

The Relative Amplified Copies (RAC) approach shows the rate of amplification of each species as compared to the target species of a specific primer set (Jolla, 2004). Based on qPCR results from experiments using full-length barcode amplicons as template, RAC plots were generated for each target species for both primer sets (A and B) and for all dilutions. The plot for C. obscura as the target species (Figure 6) shows the importance of dilution in relative amplification of non-target species. All non-target species were amplified less efficiently as compared to target species. However, substantial differences exist in the relative amplification of the non-target species, with RAC values ranging from 212 for C. bronta to 76 million for C. sparna at a template dilution of 0.1. Moreover, only three of the five non-target species amplified with the 10⁻² template dilution and none amplified at higher dilutions. Similar analyses were conducted for all other combinations of target species and primer sets A and B (Appendix 3).





Quantitative and qualitative analysis of pyrosequencing reads

Using both primer sets A and B, 454 pyrosequencing reads were obtained for amplicons directly generated from genomic DNA mixtures of the target species and from mixtures of full-length COI barcode amplicons. A total of 10,034 reads were generated from genomic DNA templates and 13,681 reads from full-length COI barcode amplicons.

The distribution of sequence read lengths has been used as a measure to evaluate pyrosequencing run quality. I sequenced two amplicons of 143 bp (set A) and 305 bp (set B). However, the addition of PCR primers, pyrosequencing fusion tail and MID tags increases the total size of each amplicon by about 50 bases. Hence, sequence reads should optimally be distributed around 193 bp (set A) and 355 bp (set B).

Automated sequence analysis conducted by SeqTrim software greatly reduced the number of sequence reads as compared to raw sequences obtained. Only 6.5% and 4.6% of the reads passed SeqTrim in genomic DNA-based and amplicon-based analyses, respectively. This rather small proportion of reads did not provide a stable trend for target species specificity of primers and compatibility of 454 analysis and qPCR results. For example, in the automated analysis of genomic DNA templates, both primer sets designed for C. obscura (T1) showed results only for their target species (108 sequences for set A and 35 sequences for set B) (Table 13). Conversely, primer set A for C. bronta (T2) did not produce any results for the target species and set B only produced 34 reads. However, primer set B for T2 produced 179 reads for the non-target Ephemeroptera species C. diminuta (E1), which is much more than the number of reads produced for its target species.

Fewer reads were obtained after SeqTrim filtering of genomic DNA templates compared to pyrosequenced COI amplicons (Tables 11 and 12; 10,034 reads from genomic DNA pooled templates and 13,681 reads from pooled full-length barcode amplicon templates). In 4 out of 6 cases, target species produced more reads than non-targets, but these results were only obtained by one of the two primer sets in each case (Table 13). On the other hand, there were 4 cases in which the target species did not show the highest amount of amplification. As an example, 4 reads were obtained for the target species using C. obscura (T1) primer set B while 8 reads were obtained for C. diminuta (E1) and 30 reads for C. bronta (T2) (Table 13).

The manual analysis of 454 sequences provided a higher number of sequence reads compared to SeqTrim (Table 13). In other words, many sequences that did not pass the SeqTrim filter were retrieved after manual inspection and editing of each pyrosequence read. Consequently, 21.5% of sequences obtained from genomic DNA templates and 31.4% of sequence reads from amplicon templates passed manual inspection and were used for subsequent comparisons.

In the manual analysis of the pyrosequences obtained from DNA-based pooled material, target species produced more reads in both primer sets with the exception of C. bronta (T2). In this case, E2 (M. modestum) and E1 (C. diminuta) produced more reads, for primer sets A and B, respectively (Table 13). In manual analysis of amplicon-based material, target species produced more reads in both primer sets with two exceptions. In one case, T1 (C. obscura) COI amplicons produced 1.6X more reads than the target species using T2 (C. bronta) primer set B (Table 13). The second exceptional case was manual analysis of DNA-based material using E2 (M. modestum) primer set B, for which T1 (C. obscura) COI amplicons produced 1.9X more reads than the target species.

An important factor in the utility of NGS is the ability to parallelize the analysis of many templates in one sequencing reaction. Aside from using this approach in analyzing mixed DNA templates such as environmental samples, sets of specific oligonucleotide tags (MIDs) can be used for mixing amplicons and then retrieving corresponding sequences bioinformatically. However, the efficiency of this MID approach needs to be evaluated to be able to use this approach in applications reliably. Here, we used 6 MIDs for our target species primer sets (A and B). Based on the analysis of raw 454 reads, it is clear that the MID approach can provide a rather uniform distribution of sequence reads for each MID (Figure 7; Table 1 in Appendix 3). However, we observed some fluctuations in the number of reads per MID in amplicon-based material (Figure 8).












DISCUSSION


Since the early days of NGS, most of its applications in biodiversity science have been focused on discovering unknown biodiversity from the bottom of the ocean (Sogin et al., 2006) to the human microbiome (Gilbert et al., 2008). These applications have mainly been focused on data generation and biological interpretations by using much higher sequencing capacity offered by NGS platforms. However, some recent studies have illuminated the importance of NGS data quality and the fact that low quality data may lead to misleading biological interpretations (Quince et al., 2009). NGS workflow and potential biases associated with it become even more critical in applications that involve targeting specific groups of organisms, especially in socio-economically important taxa such as pathogens, pests and bioindicator species. This study was conducted to specifically address the issue of amplification bias in NGS analysis of DNA barcodes (and similar marker gene amplicons) from two sets of closely related target species (freshwater bioindicator species in this case).

NGS technologies, in general, have made the genomic analysis of environmental samples such as benthos, soil, water or bulk samples of terrestrial or marine biota more feasible. For example, several recent studies have demonstrated the accuracy and reproducibility of the 454 pyrosequencing results (Hajibabaei et al., 2011; Schwartz et al., 2011; Shokralla et al., 2012). More specifically, short fragments of COI DNA barcodes were successful in providing data for identification of freshwater invertebrates for biomonitoring purposes (Hajibabaei et al., 2011).

The purpose of this study was to advance our understanding of genomics analysis of mixed environmental samples by developing a qPCR-based approach with customized primers to quantify species from mixed samples and to optimize and select primers for downstream next-generation sequencing analysis. This work will hopefully help us use NGS technologies in real-world biomonitoring applications. The majority of studies using qPCR have focused on gene expression analysis, and methods developed to analyze and interpret qPCR results are mainly geared towards gene expression (Livak & Schmittgen, 2001; Ohtsu et al., 2007; Selinger et al., 1998; Torres et al., 2008; Wang, 2003). However, in recent years qPCR has been used in the molecular diagnosis of infectious diseases or genetic defects (Francois et al., 2003; Menard et al., 2008). Because this study aimed at evaluating qPCR as a method to test efficiency and behavior of primers in multi-template amplifications and no reference gene or target was involved, I decided to use an alternate approach to analyze the data.


Primer behavior in multi-template PCR

Previous studies on multi-template PCR bias in template-to-product ratios on bacteria suggested that there are numerous uncertainties about the source of this problem (Polz and Cavanaugh, 1998; Thompson et al., 2002; Acinas et al., 2005). However, aside from a number of studies mainly conducted prior to the introduction of NGS, the issue of primer selection for multi-template PCR remains understudied. Perhaps an important factor that has contributed to this problem is the notion of universal primers and that selecting genomic targets with conserved primer binding sites is the only solution to achieve optimal amplification (Sogin et al., 2006). Consequently, the majority of studies that target environmental samples for NGS analysis use ribosomal markers such as 16S rDNA and 18S rDNA genes for targeting prokaryotes and eukaryotes, respectively. The proponents of these genes have suggested that the difficulty in designing primers for other genes such as COI DNA barcodes is a reason to abandon using these markers in NGS studies of environmental samples (Creer et al., 2010; Wang et al., 2007). However, recent work has shown that differential amplification (PCR bias) is problematic in NGS analysis even for ribosomal genes (Hajibabaei et al., 2011; Schwartz et al., 2011). It is widely accepted that quantitative analysis of NGS amplicon results should be interpreted with caution. In many cases, a number of specific taxonomic or functional genes are targets of NGS analysis (Hajibabaei et al., 2011). These cases demand better understanding of primer behavior in multi-template PCR.


Quantitative PCR as a tool for target identification

Different commercially available qPCR tests are increasingly used for measuring levels of gene expression and for target identification in molecular diagnostic tests of genetic diseases or infectious agents. These tests typically use differential amplification as a measure for a specific gene expression or gene target validation. Because qPCR instruments and reagents are relatively cheap and tests can be performed rather quickly and do not require large lab operations, qPCR is now a workhorse in many molecular biology labs. In this study, I demonstrated that qPCR has the potential to be used for validation of PCR primers before they are used in more expensive NGS analysis of multi-template DNA such as bulk environmental samples. However, my experimental design challenged the sensitivity of qPCR when genomic DNA was used as the template. Hence, results obtained from these comparisons could not provide conclusive evidence for primer specificity. Nevertheless, when more uniform amplicons were used as templates, qPCR behaved mainly as expected and target species showed more efficient amplification in the majority of cases.




This study is the first attempt to use qPCR for validating primers for NGS analysis of
multi-template PCR for taxonomic identifications. However, qPCR has recently been suggested
as a method for library quantification in NGS analysis of whole genomes (Buehler et al., 2010).
In this case, specific primers target adaptor sequences (common among all genomic fragments)
at different dilutions and qPCR analysis is conducted at different steps of library preparation to
provide a guide for selecting the optimal dilution for downstream NGS. However, in the current
study, primers that target different taxa were selected based on their target specificity in qPCR
analysis of amplicons, which offsets fluctuations in target copy number in genomic DNA.
Primers with the best target specificity can then be used in NGS experiments.


Optimal NGS analysis of target genes and taxa

Although NGS approaches have the capacity to generate a large volume of DNA
sequences, they often involve tedious workflows and require highly skilled bioinformaticians to
handle the data. Additionally, available software may not provide the optimal tools for data
filtering and analysis, as I observed in this study. Lack of efficient software dictated a rather
tedious manual approach in data editing, but this approach allowed recovery of many additional
sequences for downstream analysis.

Results from pyrosequencing analysis provided evidence for the utility of specific primer
sets for targeting genes and species of interest. In contrast to universal primers, combinations of
species-specific primers may provide a more reliable solution to avoid false negatives in NGS
analysis of bulk environmental samples. Fluctuations in gene copy number or differences in
biomass can influence the utility of any primer set in a mixture. However, even in our analysis of
genomic templates (which are potentially more prone to gene copy number fluctuation) we were
able to detect our target species using a combination of two target-specific primer sets (Table 8).
Moreover, the efficiency and slope of each primer set have been calculated and are shown in
Table 9.
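The efficiency of a primer set is conventionally derived from the slope of a qPCR standard curve (Ct plotted against log10 of template quantity) through the relation E = 10^(-1/slope) - 1, so that a slope near -3.32 corresponds to roughly 100% efficiency. The following Python sketch illustrates that calculation only; the function name is hypothetical, the dilution series and Ct values are placeholders rather than data from this study, and the exact procedure used to produce the tabulated values is not assumed here.

```python
def standard_curve(log10_quantities, ct_values):
    """Least-squares fit of Ct against log10(template quantity).

    Returns (slope, intercept, efficiency), where
    efficiency = 10 ** (-1 / slope) - 1, so 1.0 means the product
    doubles every cycle.
    """
    n = len(log10_quantities)
    if n != len(ct_values) or n < 2:
        raise ValueError("need at least two paired (quantity, Ct) points")
    mean_x = sum(log10_quantities) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_quantities)
    if sxx == 0:
        raise ValueError("template quantities must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_quantities, ct_values))
    slope = sxy / sxx
    if slope == 0:
        raise ValueError("flat standard curve: efficiency is undefined")
    intercept = mean_y - slope * mean_x
    efficiency = 10 ** (-1 / slope) - 1
    return slope, intercept, efficiency


# Placeholder ten-fold dilution series (NOT data from this study).
example_log_q = [5, 4, 3, 2, 1]
example_ct = [15.0, 18.32, 21.64, 24.96, 28.28]

slope, intercept, efficiency = standard_curve(example_log_q, example_ct)
print(f"slope={slope:.2f} intercept={intercept:.2f} efficiency={efficiency:.1%}")
```

Comparing primer sets on slope and efficiency rather than on raw Ct values makes the comparison less dependent on the starting template amount.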

Quantitative analysis of bulk samples using mitochondrial markers is challenging. The
mtDNA copy number per reaction can vary between species and tissue types in mixed
environmental samples. In my experiments, I controlled for the effect of variable gene copy
number per unit of biomass on amplification dynamics by performing another set of
experiments using normalized full-length DNA barcode amplicons as templates for my
species-specific PCRs (Figure 2). In real environmental samples containing individuals of a wide
range of sizes, this approach can help to interpret the NGS results and relate the number of
generated sequences to what is known about the biomass of each individual.


Automated SeqTrim analysis greatly reduced the number of sequences that were used in
comparative analysis of primers. Moreover, there was a trend towards target specificity using
some primer sets, suggesting that it is not reasonable to use only a few sequences in these
comparisons. However, substantially more sequences passed manual inspection and provided the
basis for our comparative analysis. We investigated two types of material as template for PCR,
genomic DNA and COI amplicons. Analysis of both genomic and amplicon templates shows a
rather strong trend towards more efficient sequencing (as reflected in the number of sequence
reads in each comparison; Tables 9 to 12) by target-specific primers. However, there are a few
exceptions in both genomic DNA and amplicon-based analyses. These exceptions may be due to
a higher number of available templates, especially for genomic DNA, as a consequence of
variation in mitochondrial DNA copies. However, in the only exceptional case using genomic
templates (C. bronta (T2)), two different non-target species produced more sequences than the
target species for the two primer sets A and B. If a single non-target species had outcompeted the
target species with both primer sets, a higher mitochondrial copy number in that non-target
species would have been the more likely explanation. On the other hand, the only two exceptions
in analysis of amplicon templates are linked to primer set B and in both cases, the non-target
species that outcompetes the target species is T1 (C. obscura).
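The read-count comparisons described above reduce to a simple question for each primer set: does the target species receive more reads than every non-target species in the mixture? A minimal Python sketch of that check is given below. The function names are hypothetical, and the read counts are placeholders that merely reuse the species codes of this thesis; they are not values from Tables 9 to 12.

```python
def target_share(read_counts, target):
    """Fraction of all reads in one comparison assigned to the target species."""
    total = sum(read_counts.values())
    return read_counts[target] / total if total else 0.0


def outcompeting_species(read_counts, target):
    """Non-target species with more reads than the target, most reads first."""
    target_reads = read_counts[target]
    winners = [species for species, reads in read_counts.items()
               if species != target and reads > target_reads]
    return sorted(winners, key=lambda species: read_counts[species], reverse=True)


# Placeholder read counts for one primer set (NOT data from this study).
example_counts = {"T1": 120, "T2": 80, "T3": 20}

print(target_share(example_counts, "T2"))
print(outcompeting_species(example_counts, "T2"))
```

Reporting both the share and the list of outcompeting species distinguishes a target that is merely diluted from one that is consistently beaten by the same non-target species.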


Comparing qPCR and 454 results

I had hypothesized that qPCR results obtained for each target species using its specific
primer sets would be reflected in the corresponding pyrosequencing reads (Table 8). In each case
I ascertained if the target species was more efficiently amplified in qPCR and pyrosequenced.
These comparisons show more consistency between qPCR and pyrosequencing using amplicon
templates. However, results obtained for each primer set are somewhat different. In primer set A,
we observed an almost perfect agreement between qPCR and 454 results, supporting the
hypothesis that target species are amplified and sequenced more efficiently using their specific
(i.e., 100% matching) primers. In this comparison, with the exception of qPCR using the T2
primers, all other target species were more efficiently amplified and pyrosequenced. On the other
hand, qPCR and pyrosequencing data showed the same pattern using primer set B with the
exception of the T3 primers. However, the target species amplified and pyrosequenced more
efficiently in only half the cases.

Perhaps the most realistic (applicable) comparison can be conducted between qPCR
analysis of amplicon templates and pyrosequencing of genomic templates. In other words,
because qPCR testing of primers can potentially be used as a guide to select optimal
target-specific primers for pyrosequencing analysis, these comparisons can provide important
insights concerning the utility of this approach in a wider context. The majority of target species
showed a consistent pattern between qPCR of amplicon templates and pyrosequencing of
genomic templates. However, two target species, T3 and E3, were exceptions: for both, the qPCR
and pyrosequencing results obtained with their primer sets did not agree.
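The agreement between the two methods can be summarized by recording, for each primer set, whether the target species was favoured in qPCR and whether it was favoured in pyrosequencing, and then counting the primer sets on which the two calls match. The Python sketch below shows one way to do this. It is an illustration under stated assumptions: the function name, the primer labels and the boolean calls are all hypothetical placeholders, not results from this study.

```python
def concordance(qpcr_calls, sequencing_calls):
    """Compare per-primer-set calls from two methods.

    Each argument maps a primer-set label to True if the target species was
    favoured by that method. Returns (fraction_in_agreement, discordant_labels),
    computed over the labels present in both mappings.
    """
    shared = sorted(set(qpcr_calls) & set(sequencing_calls))
    if not shared:
        raise ValueError("no primer sets in common")
    discordant = [label for label in shared
                  if qpcr_calls[label] != sequencing_calls[label]]
    agreement = (len(shared) - len(discordant)) / len(shared)
    return agreement, discordant


# Placeholder calls (NOT results from this study).
example_qpcr = {"primer_1": True, "primer_2": True, "primer_3": False, "primer_4": True}
example_454 = {"primer_1": True, "primer_2": False, "primer_3": False, "primer_4": True}

agreement, discordant = concordance(example_qpcr, example_454)
print(agreement, discordant)
```

Listing the discordant primer sets alongside the overall fraction keeps attention on the individual primers that would need redesign before an NGS run.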

Towards a standardized approach for metagenomic analysis of environmental DNA

Based on recent advances in genomics instrumentation and bioinformatics tools it is clear
that biological sciences and a wide range of socio-economic applications will rely on genomics
information captured from environmental DNA. For example, a special issue of Molecular
Ecology (April 2012) was devoted to Environmental DNA focusing on recent advancements and
applications of NGS in ecological research (Baird & Hajibabaei, 2012; Shokralla et al., 2012;
Taberlet et al., 2012). The excitement in using NGS tools for many different applications has led
to many primary publications and may potentially lead to better technologies and tools.
However, the user communities (i.e. ecologists) should work with genomics and bioinformatics
experts to overcome technical challenges that can limit the usability of NGS in larger-scale
studies. Most of the studies in ecological use of NGS are in fact one-off or proof of concept
(Callaway, 2012).

A recent Genome Web article highlights challenges in moving NGS tools to real-world
diagnostics and the fact that many industry leaders believe NGS is still far from being applicable
in standard diagnostics settings (Karow, 2012). These challenges mainly involve difficulties (and
black boxes) in data generation and workflows as well as data quality and lack of efficient
standard software to differentiate accurate sequence information from errors. In fact, this thesis
confirms the above-mentioned issues both at the level of molecular biology (PCR bias and
primer specificity) and bioinformatics analysis (automated versus manual sequence filtering).
This line of work will hopefully set the stage for the use of available tools such as qPCR and
specific primers for more efficient and standardized application of NGS in biodiversity analysis.

Based on my study, qPCR can be efficiently used for designing primers for target-specific
groups of organisms according to the environmental/ecological question. Commonly used
bioindicators for freshwater bioassessment, such as Trichoptera, Ephemeroptera and Plecoptera,
would be good potential targets for this type of study. With this approach, the designed primers
can be tested by qPCR across individual species of these groups, starting from both gDNA and
amplicon material.

This study showed that qPCR can be used as a proxy for testing the efficiency of PCR
primers for amplifying mixed environmental samples for biomonitoring applications.

The primers designed for this study performed relatively efficiently; however, for further
studies, slight changes to the design of the primers for in-groups are recommended.


Genomic DNA material illustrates how the primers behave with individual samples under
realistic conditions, while amplicon material makes it possible to examine the effect of sequence
variation alone. Amplicon-based material also eliminates the variation in mitochondrial copy
number between different species. Once we reach an optimal primer design that can amplify the
majority of the targets in a relatively uniform pattern, the results obtained from qPCR could be
used in developing optimized 454 pyrosequencing amplicon-based analysis of bulk
environmental samples.





REFERENCES

Acinas, S. G., Sarma-Rupavtarm, R., Klepac-Ceraj, V., & Polz, M. F. (2005). PCR-induced
sequence artifacts and bias: Insights from comparison of two 16S rRNA clone libraries
constructed from the same sample. Applied and Environmental Microbiology, 71(12),
8966-8969.

Aird, D., Ross, M. G., Chen, W.-S., Danielsson, M., Fennell, T., Russ, C., Jaffe, D. B., et al.
(2011). Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries.
Genome Biology, 12(2), R18.

Applied Biosystems. (2008). Guide to performing relative quantitation of gene expression using
real-time quantitative PCR. Applied Biosystems.

Baird, D. J., & Hajibabaei, M. (2012). Biomonitoring 2.0: a new paradigm in ecosystem
assessment made possible by next-generation DNA sequencing. Molecular Ecology, 21(8),
2039-2044.

Baird, D. J., Pascoe, T. J., Zhou, X., & Hajibabaei, M. (2011). Building freshwater
macroinvertebrate DNA-barcode libraries from reference collection material: formalin
preservation vs. specimen age. Journal of the North American Benthological Society, 30(1),
125-130.

Becker, A., Reith, A., Napiwotzki, J., & Kadenbach, B. (1996). A quantitative method of
determining initial amounts of DNA by polymerase chain reaction cycle titration using
digital imaging and a novel DNA stain. Analytical Biochemistry, 237(2), 204-207.

Binladen, J., Gilbert, M. T. P., Bollback, J. P., Panitz, F., Bendixen, C., Nielsen, R., &
Willerslev, E. (2007). The use of coded PCR primers enables high-throughput sequencing
of multiple homolog amplification products by 454 parallel sequencing. PLoS ONE, 2(2),
e197.

Bonada, N., Prat, N., Resh, V. H., & Statzner, B. (2006). Developments in aquatic insect
biomonitoring: a comparative analysis of recent approaches. Annual Review of
Entomology, 51, 495-523.

Buehler, B., Hogrefe, H. H., Scott, G., Ravi, H., Pabón-Peña, C., O’Brien, S., Formosa, R., et al.
(2010). Rapid quantification of DNA libraries for next-generation sequencing. Methods,
50(4), 15-18.

Callaway, E. (2012). A bloody boon for conservation. Nature News. Available:
http://www.nature.com/news/a-bloody-boon-for-conservation-1.10499





Claesson, M. J., Wang, Q., O’Sullivan, O., Greene-Diniz, R., Cole, J. R., Ross, R. P., & O’Toole,
P. W. (2010). Comparison of two next-generation sequencing technologies for resolving
highly complex microbiota composition using tandem variable 16S rRNA gene regions.
Nucleic Acids Research, 38(22), e200.

Creer, S., Fonseca, V. G., Porazinska, D. L., Giblin-Davis, R. M., Sung, W., Power, D. M.,
Packer, M., et al. (2010). Ultrasequencing of the meiofaunal biosphere: practice, pitfalls and
promises. Molecular Ecology, 19(s1), 4-20.

Edgar, R. C. (2010). Search and clustering orders of magnitude faster than BLAST.
Bioinformatics, 26(19), 2460-2461.

Ewing, B., & Green, P. (1998). Base-calling of automated sequencer traces using Phred. II.
Error probabilities. Genome Research, 8(3), 186-194.

Ewing, B., Hillier, L., Wendl, M. C., & Green, P. (1998). Base-calling of automated sequencer
traces using Phred. I. Accuracy assessment. Genome Research, 8(3), 175-185.