sequences - UW Staff Web Server


20 Feb 2013


Automation of sequence-based identification for the molecular
microbiology laboratory

Noah Hoffman MD, PhD


University of Washington Department
of Laboratory Medicine

2

What is a bacterial species?


Historically defined according
to phenotypic criteria, such as


Degradation or metabolism of
certain chemicals


Antibiotic resistance or production


Nutrient utilization


Staining reactions


Morphology


Automated platforms used
extensively in clinical labs
(such as the Vitek)



3

Molecular microbiology: from specimen to sequence

4

When is a molecular/sequencing approach
to identification indicated?


Unculturable


Slow growing


Dead


Phenotypic variability (common in the context of
chronic/complicated infections, e.g., cystic fibrosis)


Phenotypic classification fails


Improved turnaround time, accuracy, cost-effectiveness
in some contexts



5

Genes useful for molecular identification


Bacteria


16S rRNA


Mycobacteria


Hsp65


rpoB (RNA polymerase)


Fungi


Internal Transcribed Spacer (ITS) 1 and 2
regions


26S rRNA

6

Elements of microbial identification by
sequence analysis


Database searching (eg, GenBank)


Requires software for performing rapid
sequence comparisons (Blast)


Multiple sequence alignment


Comparison of related sequences base-by-base


Phylogenetic analysis


Describes evolutionary relatedness of
sequences using clustering algorithms

7

BLAST


Basic Local Alignment Search Tool


Provided by the National Center for
Biotechnology Information (NCBI)


Rapid search of Entrez databases


Retrieves sequences similar to query


Statistical tests describe the significance of
matches


Blast and other NCBI applications provide an
interface that can be used outside of a web
browser, e.g., by custom software
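
As a rough illustration of how custom software can reach NCBI without a browser, the sketch below builds a request URL for the Entrez E-utilities `efetch` service. It is not taken from the talk: the helper name `efetch_url` and the accession `X00000` are placeholders, and the interface the UW software actually used may have differed.

```python
from urllib.parse import urlencode

# Base address of the NCBI Entrez E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, db="nuccore", rettype="fasta", retmode="text"):
    """Build an efetch URL that would return one nucleotide record.

    `accession` is whatever identifier the caller wants to retrieve;
    the defaults ask for a FASTA-formatted plain-text response.
    """
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# "X00000" is a placeholder, not a real record of interest.
print(efetch_url("X00000"))
```

The URL can then be opened with any HTTP client; the request itself is left out here so the example runs without network access.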


8

16S sequence analysis - manual workflow


Blast search


Access each GenBank,
PubMed entry
individually


Trim, transfer each
similar sequence with
copy/paste


Add relevant sequences
from UW database


Make alignment, tree in
several steps

9

UW Molecular Micro Test Volume

Annual tests involving sequence analysis

10

Goals of MDX automation project


Create software to integrate the steps
involved in sequence identification:


Provide search interface to reference
sequence database (GenBank)


Filter database search results


Perform analyses on a set of reference
sequences


Summarize data

11

Software implementation


Simple text-based user interface


Written in Python programming language


Open source


Cross-platform


Installed on PCs in molecular section

12

Sequence identification is only as good as
the sequence database


Database search is part of the assay


A perfect reference database contains:


All clinically relevant organisms


Only sequences of high quality


Only sequences from well-characterized
strains


Entries that are fully compliant with current
nomenclature

13

Garbage in, garbage out

On two occasions I have been asked, "Pray, Mr. Babbage,
if you put into the machine wrong figures, will the right
answers come out?" [...] I am not able rightly to apprehend
the kind of confusion of ideas that could provoke such a
question.



- Charles Babbage, 1864

http://en.wikipedia.org/wiki/Garbage_in,_garbage_out

14

GenBank - you get what you pay for


Data included in public
databases tend to have


ragged sequence ends


faulty sequence entries, due
to early error-prone
sequencing techniques


no quality control (no primary
sequence data)


outdated nomenclature


missing entries of medically
important microbes


Also:


no clinical or phenotypic
information provided


not designed for clinical
context

15

16S sequences in GenBank, 1993-2006


[Figure: all bacterial 16S sequences compared with "useful" sequences*]

*Sequences > 450 bp without the title words
"uncultured", "sp.", "bacterium"

16

Blast result filtering strategy


Results of Entrez search are "filtered" to
find useful reference sequences


Example filtering criteria:

Length: > 85% of query length

Record title: does not contain "clone", "sp.", "unidentified", "uncultured", "bacterium", "organism"

Quality: < 4 ambiguities (N’s)
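
The example criteria on this slide can be sketched as a single predicate. This is an illustration rather than the lab's actual code: the function name, the argument layout, and the choice to match excluded words as whole, case-insensitive tokens (so that "Mycobacterium" is not rejected for containing "bacterium") are all assumptions.

```python
# Words from the slide that mark a record title as uninformative.
EXCLUDED_TITLE_WORDS = ("clone", "sp.", "unidentified", "uncultured",
                        "bacterium", "organism")


def is_useful_reference(title, sequence, query_length,
                        min_length_fraction=0.85, max_ambiguities=4):
    """Return True if a hit passes the length, title, and quality criteria."""
    # Length: longer than 85% of the query length.
    long_enough = len(sequence) > min_length_fraction * query_length
    # Record title: none of the excluded words appears as a whole word.
    words = title.lower().split()
    clean_title = not any(word in words for word in EXCLUDED_TITLE_WORDS)
    # Quality: fewer than 4 ambiguous bases (N's).
    few_ambiguities = sequence.upper().count("N") < max_ambiguities
    return long_enough and clean_title and few_ambiguities


# Made-up records for illustration only.
good = "ACGT" * 120
print(is_useful_reference("Staphylococcus aureus 16S ribosomal RNA gene", good, 500))
print(is_useful_reference("Uncultured bacterium clone X 16S ribosomal RNA gene", good, 500))
```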

17

Effect of filtering on interpretation of Blast
results


100 Blast searches - 16S sequences from clinical
isolates


retrieved 500 hits for each

Median number of acceptable sequences…

per search (500 hits): 161 (min 3, max 364)

in top 100 hits: 36

with peer-reviewed publication in top 100 hits: 15

18

Other filtering criteria


Sequence accompanied by peer-reviewed publication?


Limit number of representatives of each
species


Minimum percent identity to query
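
Two of these criteria, a minimum percent identity and a cap on representatives per species, can be sketched as follows. The record layout (dicts with `accession`, `species`, and `percent_identity` keys) and the function name are hypothetical; the default thresholds (> 94% similarity, at most 4 per species) are borrowed from the selection criteria listed for the mini-method comparison study in this deck.

```python
from collections import defaultdict


def select_hits(hits, min_identity=94.0, max_per_species=4):
    """Keep hits above an identity threshold, capping the count per species.

    `hits` is assumed to be in Blast rank order, so the best-ranked
    representatives of each species are the ones retained.
    """
    kept = []
    per_species = defaultdict(int)
    for hit in hits:
        if hit["percent_identity"] <= min_identity:
            continue  # not similar enough to the query
        if per_species[hit["species"]] >= max_per_species:
            continue  # this species is already well represented
        per_species[hit["species"]] += 1
        kept.append(hit)
    return kept


# Made-up hits for illustration only.
example_hits = [
    {"accession": "h1", "species": "Genus alpha", "percent_identity": 99.8},
    {"accession": "h2", "species": "Genus alpha", "percent_identity": 99.5},
    {"accession": "h3", "species": "Genus beta", "percent_identity": 97.0},
    {"accession": "h4", "species": "Genus alpha", "percent_identity": 99.0},
    {"accession": "h5", "species": "Genus gamma", "percent_identity": 93.0},
]
print([h["accession"] for h in select_hits(example_hits, max_per_species=2)])
```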

19

Sequence analysis


The unknown sequence is compared to a set
of reference sequences


A taxonomic name is assigned based on
percent identity, as well as a set of shared
characteristics as visualized in…


multiple sequence alignments


phylogenetic trees


These analyses are performed as part of the
automated process


A spreadsheet summarizes the Blast results
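
For readers unfamiliar with the percent identity figure used above, here is a minimal sketch of one way to compute it from two already-aligned sequences. Tools differ in how they treat gaps; skipping any column that contains a gap is an assumption of this example, not necessarily what Blast or the UW software reports.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of matching bases over the gap-free columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # ignore columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared


# Toy sequences: 7 of 8 positions match.
print(percent_identity("ACGTACGT", "ACGTACGA"))
```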




20

Multiple sequence alignments


Base-by-base comparison of related
sequences


Invariant positions removed to highlight
differences
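
Removing invariant positions from an alignment can be sketched in a few lines. The function name and the dict-of-strings input format are assumptions made for this example; the alignment is presumed to have been produced already (for instance by ClustalW).

```python
def variable_columns(aligned):
    """Drop alignment columns in which every sequence has the same character.

    `aligned` maps sequence names to equal-length aligned strings.
    Returns the reduced alignment and the 1-based positions that were kept.
    """
    if len({len(seq) for seq in aligned.values()}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    columns = list(zip(*aligned.values()))
    keep = [i for i, column in enumerate(columns) if len(set(column)) > 1]
    reduced = {name: "".join(seq[i] for i in keep)
               for name, seq in aligned.items()}
    return reduced, [i + 1 for i in keep]


# Toy alignment: only positions 3 and 5 differ among the sequences.
toy = {"query": "ACGTAC", "ref1": "ACGTTC", "ref2": "ACCTAC"}
reduced, positions = variable_columns(toy)
print(positions)
print(reduced)
```

Keeping the original positions alongside the reduced alignment lets a reviewer trace each highlighted difference back to the full-length sequence.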


21

Identification by identity vs phylogeny - why make trees?


More biologically meaningful description of
relatedness, higher resolution than % identity


Statistical tests can estimate significance of
groups


An appropriate set of reference sequences is
the critical factor.
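
To make the idea of building a tree by clustering concrete, the sketch below implements UPGMA, one of the simplest distance-based clustering methods, and returns the topology in Newick format. UPGMA is swapped in here purely for illustration; it is not the method named in the talk (ClustalW builds its trees by neighbor joining), and the function name and distance values are made up.

```python
from itertools import combinations


def upgma_newick(names, matrix):
    """Cluster taxa by UPGMA and return the tree topology as a Newick string.

    names  -- list of taxon labels
    matrix -- symmetric list of lists of pairwise distances, in the same order
    """
    # Active clusters: id -> (Newick label, number of taxa in the cluster)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    dist = {frozenset((i, j)): matrix[i][j]
            for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)

    while len(clusters) > 1:
        # Pick the closest pair of active clusters.
        i, j = min(combinations(sorted(clusters), 2),
                   key=lambda pair: dist[frozenset(pair)])
        label_i, size_i = clusters.pop(i)
        label_j, size_j = clusters.pop(j)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the two old distances.
        for k in clusters:
            d_ik = dist[frozenset((i, k))]
            d_jk = dist[frozenset((j, k))]
            dist[frozenset((next_id, k))] = (
                (size_i * d_ik + size_j * d_jk) / (size_i + size_j))
        clusters[next_id] = (f"({label_i},{label_j})", size_i + size_j)
        next_id += 1

    newick, _ = next(iter(clusters.values()))
    return newick + ";"


# Made-up distances: the query is closest to ref_A.
labels = ["query", "ref_A", "ref_B"]
distances = [[0.00, 0.01, 0.06],
             [0.01, 0.00, 0.05],
             [0.06, 0.05, 0.00]]
print(upgma_newick(labels, distances))
```

In the toy data the query groups with ref_A before either joins ref_B, which is the kind of grouping a reviewer would look for when assigning a name.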

[Figure; label: BLAST search rank]

22

Spreadsheet summary

23

Future directions


Refinement, validation of SOP for
sequence-based ID


Unify search of GenBank and UWMC
collection of clinical bacterial and fungal
sequences


Data mining of MDX laboratory
database


Automation of UW database
maintenance and QC

24

MDX database sequence entry form

25

Automation of UW database QC

26

Conclusions


NCBI database resources are readily
accessible via a custom interface


An automated approach can facilitate


identification and assembly of a suitable set of
reference sequences from a public database


comparison and analysis of sequence data


report generation


Taxonomic errors/inconsistencies in GenBank
remain a major challenge


Expertise remains an essential component of the
identification process


27

University of Washington
Department of Laboratory
Medicine


Brad Cookson

Ajit Limaye

Jen Rakeman

Dan Hoogestraat

Dhruba Sengupta

Tia Pedersen

Kresta Austin

Patrick Morkel

Glendon Pflugraf

Brittany Bennett

Jenny Prentice









Thank you

28

Software components


Interface to NCBI services (Blast, GenBank, PubMed)


Queries via NCBI URL-encoding protocol


Data retrieved in machine-readable XML format


Data aggregation, filtering, and processing


Relevant sequences are selected from Blast results
according to a set of rules


Sequence data reformatted for alignment software


Sequence analysis and report generation


Spreadsheet


Multiple sequence alignments (ClustalW)


Phylogenetic trees (ClustalW)
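
As a hedged sketch of the "machine-readable XML" step, the example below pulls the fields needed for filtering out of a Blast XML report using Python's standard library. The element names follow the classic NCBI Blast XML output; the sample document is fabricated and heavily trimmed for illustration, and the function name and returned record layout are assumptions.

```python
import xml.etree.ElementTree as ET

# Fabricated, trimmed-down Blast XML report (a real one has many more fields).
SAMPLE_XML = """
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_def>Example organism A 16S ribosomal RNA gene</Hit_def>
          <Hit_accession>EXAMPLE01</Hit_accession>
          <Hit_len>1500</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_identity>498</Hsp_identity>
              <Hsp_align-len>500</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def parse_blast_hits(xml_text):
    """Return one dict per hit, using the first HSP of each hit."""
    hits = []
    for hit in ET.fromstring(xml_text).iter("Hit"):
        hsp = hit.find("Hit_hsps/Hsp")
        fields = {child.tag: child.text for child in hsp}
        identical = int(fields["Hsp_identity"])
        aligned = int(fields["Hsp_align-len"])
        hits.append({
            "accession": hit.findtext("Hit_accession"),
            "title": hit.findtext("Hit_def"),
            "length": int(hit.findtext("Hit_len")),
            "percent_identity": 100.0 * identical / aligned,
        })
    return hits


for record in parse_blast_hits(SAMPLE_XML):
    print(record["accession"], record["length"], record["percent_identity"])
```

Records in this shape can be passed straight to rule-based selection and then reformatted (for example as FASTA) for the alignment software.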


29

Effect of filtering on interpretation of Blast
results

30

mini-method comparison study


16S rRNA sequences from 10 clinical isolates


fully automated search, sequence selection (of 500
top hits), PLUS alignment, phylogenetic analysis


compare to lab's result


Selection criteria:

Length: > 90% of query length

Record title: does not contain "clone", "sp.", "unidentified", "uncultured", "bacterium"

Quality: < 4 ambiguities

Similarity: > 94% similar to query

Identification: maximum of 4 representatives of each species

31

mini-method comparison results

* Marchandin, et al. Microbiology 2003, 149:1493. PMID 12777489



Discrepancies were caused by inaccurate or outdated
classification of GenBank sequences or by the inability
of 16S rDNA to resolve to species level.