nifH database documentation - University of California, Santa Cruz

ARB nifH Database

Last database update: December 2, 2011

Last documentation update: February 17, 2012

University of California, Santa Cruz, California

Maintained and Distributed by Zehr lab (http://www.es.ucsc.edu/~wwwzehr/research/database/)


Important additions to this update:

Upgraded to ARB 5.2:

This database has been upgraded to be compatible with ARB 5.2. This means users might have
difficulty merging old databases with the current database. Please contact us if you need to do
this, and we can guide you through updating old databases to ARB 5.2.

Integrated CD-HIT and CD-HIT-EST analysis into the pipeline:

We now send out the entire database, after updating with new sequences, to CD-HIT (Huang et al.
2010) to determine representative sequences (based on both amino acid and nucleic acid sequences).

Integrated Chimera-check analysis:

We have used UCHIME (Edgar et al. 2011) to evaluate potential chimeras in this database. This is a
necessary, but imperfect, approach. Sequences that not only meet the threshold criteria (outlined
below) to be defined as a chimera, but also have parent sequences from the same study, are marked
as Putative Chimeras and left out of most trees.

Utilized new masks for the creation of trees:

We have several new masks that, in some cases, mask out regions of the gene that are problematic
for HMMalign. These can be explored by searching for *mask* in the name field.


Basics about this database:

Nitrogenase gene sequences in the databases are accumulating rapidly. BLAST analysis is not always
the best approach for comparing sequences or identifying phylogenetic relationships. Due to the
large number of sequences in the databases, and their different formats, it is not simple to
download and align all extant nifH protein sequences and their corresponding DNA encoding
sequences. Such capabilities are necessary for environmental studies where the amino acid sequence
is needed for phylogenetic analysis, but the corresponding DNA sequence is also needed for probe
design. The problem of obtaining all extant sequences is compounded by misannotation and by
homologous proteins in the databases.

The ARB software environment is useful for visualizing and manipulating aligned and unaligned
sequences, and for maintaining metadata on sources, publications, etc. ARB also contains features
for probe design and the construction of phylogenetic trees. However, ARB is not well suited for
downloading and validating new data. Our group has developed a semi-automated process for
constructing the nifH database from public genomic data sources.

The procedure uses representative nifH protein sequences to BLAST against GenBank to identify
potential nifH and nifH-like genes. The output is screened for false positives, which are
eliminated from the database. Once identified, the nifH protein and the encoding DNA sequence are
retrieved. The sequences are imported into ARB using the nucleotide GI (GenBank identifier) as the
sequence name to prevent redundancy problems. After import, the DNA sequence is used to generate
the amino acid sequence (which should be identical to the GenBank record). The amino acid
sequences are then exported, aligned against a nifH PFAM using HMMER, and re-imported into ARB.
The aligned amino acid sequences are then used to align the DNA sequences using the "Backalign"
feature of ARB.
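The back-alignment step described above can be illustrated with a minimal sketch. This is not ARB's
"Backalign" code; it only shows the general idea of threading an unaligned coding DNA sequence onto
an aligned amino acid sequence, codon by codon. The function name back_align and the treatment of
both "-" and "." as gap characters are assumptions made for this illustration.

```python
def back_align(aligned_protein: str, dna: str, gap_chars: str = "-.") -> str:
    """Thread coding DNA onto an aligned protein: one codon per residue,
    three gap characters per protein gap. Illustrative only (hypothetical helper)."""
    # Split the DNA into complete codons, ignoring any trailing partial codon.
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    n_residues = sum(1 for ch in aligned_protein if ch not in gap_chars)
    if len(codons) < n_residues:
        raise ValueError("DNA is too short for the aligned protein")
    codon_iter = iter(codons)
    out = []
    for ch in aligned_protein:
        # A gap in the protein alignment becomes a three-column gap in the DNA.
        out.append("---" if ch in gap_chars else next(codon_iter))
    return "".join(out)


if __name__ == "__main__":
    print(back_align("M-AK.R", "ATGGCTAAACGT"))
```

A sequence whose translation contains an unresolved residue can still be threaded by this sketch,
whereas the database records such cases in SEQ_PROBLEMS as "DNA unaligned".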

Features of the nifH_ARB database

The database contains all nifH amino acid and DNA sequences obtainable from BLAST analysis. The
Cluster IV nif-like sequences are included, to allow identification in environmental surveys.

The amino acid sequences are aligned with a Hidden Markov Model, not by Clustal.

The DNA sequences are aligned according to the amino acid sequences so that DNA sequences can also
be used for phylogenetic analysis.

Start and stop positions of amino acid sequences are included in searchable fields, enabling rapid
selection of equal length sequences for phylogenetic analysis.

Virtually all of the GenBank metadata is imported along with the sequences, allowing rapid searches and
assembly of sequences for analysis.

A nomenclature for nifH clusters is provided. This should be used with caution, since many branches
of the phylogenetic tree (such as within the Proteobacteria) are poorly supported.
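The start and stop positions mentioned in the feature list above can be thought of as the first and
last alignment columns occupied by a residue. The sketch below shows one way such values could be
derived from an aligned sequence. The function name align_bounds, the 1-based numbering, and the gap
characters are assumptions; the documentation does not state how the curators compute the fields.

```python
def align_bounds(aligned_seq: str, gap_chars: str = "-."):
    """Return (start, end) as 1-based alignment columns of the first and last
    residues, or None if the sequence contains only gaps. Hypothetical helper."""
    occupied = [col for col, ch in enumerate(aligned_seq, start=1)
                if ch not in gap_chars]
    if not occupied:
        return None
    return occupied[0], occupied[-1]


if __name__ == "__main__":
    print(align_bounds("--MAK-R--"))
```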

Disclaimers and Notes for Use

Most of the nifH sequences have been obtained from PCR amplification. A variety of primers have
been used, so the database comprises sequences of a variety of lengths, some of which are very
short. The trees provided and the cluster naming are very dependent upon the length of sequence and
the region amplified and used in the phylogenetic analysis. Unfortunately, shorter sequences, and
the deletion of certain regions in the analysis, quickly degrade the robust clustering of the
trees. The analysis provided in the database starts with genomic length sequences, and uses the
sequence clustering based on the genome tree to evaluate trees built from shorter sequences.
Clustering deteriorated rapidly when the 3' end of the sequence was not used. Thus any use of the
analysis for trees generated from shorter length sequences should be interpreted very cautiously.

Furthermore, although the QuickAdd function in ARB is very useful for quickly screening the
phylogeny for new sequences, we are finding that clustering breaks down when adding thousands of
sequences to a backbone tree. We advocate building neighbor joining trees with your sequences, and
close relatives, for publication quality analyses.

Putative chimeric sequences were determined using the UCHIME algorithm (Edgar et al. 2011).
Nucleotide alignments for all 22,579 sequences (Dec 2010 update; including those obtained from
genomes) were clustered using CD-HIT (Huang et al. 2010) at a 98% sequence identity cut-off, and
the resulting 8579 representative sequences were analyzed for chimeras using the UCHIME algorithm
in de novo mode. As accurate abundance data were unavailable for many studies in this database, the
number of sequences in each CD-HIT cluster was used as a proxy. UCHIME was run using all the default
parameters, but the resulting chimeras were subject to additional criteria determined empirically
(please contact us for further details) to reduce the number of false positives. Putative chimeric
sequences were tagged as such in a field named “PutativeChimera” in the ARB database. Likely
chimeras were further defined if the two parent sequences were recovered from the same study, and
tagged in the “PutativeChimeraSameStudy” field.
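The same-study rule described above can be sketched as follows. This is only an illustration of the
tagging logic, not the curators' pipeline: the input structures (a list of query/parent triples and
a mapping from sequence ID to study), the function name tag_chimeras, and the "Y"/"N" field values
are all assumptions, and the additional empirical criteria mentioned above are not modelled.

```python
def tag_chimeras(calls, study_of):
    """calls: iterable of (query_id, parent_a_id, parent_b_id) for sequences
    already called chimeric. study_of: dict mapping sequence ID to a study label.
    Returns {query_id: {field_name: value}}. All names here are hypothetical."""
    tags = {}
    for query, parent_a, parent_b in calls:
        studies = {study_of.get(query), study_of.get(parent_a), study_of.get(parent_b)}
        # "Likely" chimera: query and both parents share one known study.
        same_study = len(studies) == 1 and None not in studies
        tags[query] = {
            "PutativeChimera": "Y",
            "PutativeChimeraSameStudy": "Y" if same_study else "N",
        }
    return tags


if __name__ == "__main__":
    study_of = {"q1": "S1", "a": "S1", "b": "S1", "q2": "S1", "c": "S2"}
    print(tag_chimeras([("q1", "a", "b"), ("q2", "a", "c")], study_of))
```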




Fields in ARB


The fields in the nifH database are primarily derived from GenBank records. A number of fields have
been created by the curators for convenience and to facilitate analysis of the data.

Fields in ARB that have been derived from GenBank fields are described in the Appendix, Table 1.
Some fields have been obtained from translation to EMBL format, and a few field names were changed
for the corresponding field in ARB.

Fields in ARB that have been created for data analysis and curation are:

StartAlign - the position in the current database for the first amino acid in the alignment

EndAlign - the position in the current database for the last amino acid in the alignment

Raymond_group - Major cluster designation (1-5) as defined by Raymond et al. Clusters 1-3 annotated.

Young_group - Major cluster designations (B, A, C) as used by Young.

SEQ_PROBLEMS - field where problems in sequences can be stored. Currently the only annotation is
"DNA unaligned" to designate sequences where translation "X" prevented backaligning the DNA
according to the amino acid sequence.

AMINO_2011 - Amino acid sequence clusters as in AMINO_2009, except using a new numerical
designation (1.1, 1.2, 3.1, etc).

AMINO_2010 - Current designation of clusters using the alphabetical clustering system of Zehr et
al. 2003. Numerous sequences had to be reclassified according to the new tree topology. Therefore,
some subgroups will differ, and there are fewer subclusters than in Zehr et al. Many of the
Proteobacterial groups are not robust, although there is good general agreement between the 2003
and 2010 clustering (as long as the shorter nifH sequences are not used to make the tree). The
cluster labeling of AMINO_2010 was based on a tree included in the database
(tree_genome_AA_Dec2010_MASK_GENOME).

AMINO_2003 - Amino acid sequence clusters as defined in Zehr et al. 2003.

DNA_2003 - DNA groups as defined in Zehr et al. 2003.

PutativeChimera - Sequences identified as possible chimeras using UCHIME.

PutativeChimeraSameStudy - Sequences identified as likely chimeras using UCHIME, because parent
sequences are from the same study as the chimeric sequence.

AddUpdt_MonthYear - Sequences pulled in on the update date designated in the field name.

CDHITnt_ClusterNo - Cluster ID from the most recent CD-HIT-EST analysis (98% nucleotide identity).

CDHITnt_NumSeq - Number of sequences associated with the cluster number from the most recent
CD-HIT-EST analysis (98% nucleotide identity).

CDHITnt_RepSeqFlag - This field will have a “Y” if the sequence is the designated representative of
a cluster from the most recent CD-HIT-EST analysis (98% nucleotide identity).

CDHITnt_RepSeqID - This field will have the sequence ID of the designated cluster representative
from the most recent CD-HIT-EST analysis (98% nucleotide identity).

CDHITaa_ClusterNo - Cluster ID from the most recent CD-HIT analysis (98% amino acid identity).

CDHITaa_NumSeq - Number of sequences associated with the cluster number from the most recent CD-HIT
analysis (98% amino acid identity).

CDHITaa_RepSeqFlag - This field will have a “Y” if the sequence is the designated representative of
a cluster from the most recent CD-HIT analysis (98% amino acid identity).

CDHITaa_RepSeqID - This field will have the sequence ID of the designated cluster representative
from the most recent CD-HIT analysis (98% amino acid identity).
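To make the four CD-HIT fields above concrete, the sketch below derives them from text in the
cluster-file (.clstr) layout that CD-HIT writes, where each cluster starts with a ">Cluster N" line
and the representative member is marked with "*". The function names, the sample IDs, and the use
of an empty string for non-representatives are assumptions for illustration; the curators' actual
import scripts are not described in this document.

```python
import re

# Matches the sequence ID between ">" and "..." on a member line, plus the
# trailing status ("*" for the representative, "at ...%" for other members).
ENTRY = re.compile(r">(?P<id>.+?)\.\.\.\s*(?P<rest>.*)$")


def parse_clstr(text):
    """Parse .clstr-style text into {cluster_no: {"members": [...], "rep": id}}."""
    clusters = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = {"members": [], "rep": None}
            continue
        match = ENTRY.search(line)
        if match is None or current is None:
            continue
        clusters[current]["members"].append(match["id"])
        if match["rest"].strip() == "*":
            clusters[current]["rep"] = match["id"]
    return clusters


def to_fields(clusters, prefix="CDHITnt"):
    """Build per-sequence field values named after the ARB fields listed above."""
    fields = {}
    for cluster_no, cluster in clusters.items():
        for seq_id in cluster["members"]:
            fields[seq_id] = {
                f"{prefix}_ClusterNo": cluster_no,
                f"{prefix}_NumSeq": len(cluster["members"]),
                # Assumption: non-representatives are left blank.
                f"{prefix}_RepSeqFlag": "Y" if seq_id == cluster["rep"] else "",
                f"{prefix}_RepSeqID": cluster["rep"],
            }
    return fields


if __name__ == "__main__":
    sample = (">Cluster 0\n0\t300nt, >111... *\n1\t298nt, >222... at +/98.66%\n"
              ">Cluster 1\n0\t310nt, >333... *\n")
    for seq_id, values in to_fields(parse_clstr(sample)).items():
        print(seq_id, values)
```

Passing prefix="CDHITaa" would produce the amino acid field names from a protein-level clustering.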

Fields relevant to old database users:

NR - sequences with "AA99" written to this field were selected as representative sequences using
the program "cd-hit" using the default parameters and selecting sequences representing 99% sequence
identity at the amino acid level.


Cluster - Original CD-HIT cluster number, based on Dec2010 database.

RepFlag - Original CD-HIT Representative Sequence Flag (“Y” or “N”), based on Dec2010 database.

RepSeq - Original CD-HIT Representative Sequence, based on Dec2010 database.

NumSeqsInCluster - Original CD-HIT number of sequences in cluster, based on Dec2010 database.


Using the Database

The nifH database is provided as a resource for the community. We have tried to curate and maintain
the database to facilitate the analysis of environmental sequences in particular. The alignments
are generated by HMMER from PFAMs in order to provide some objectivity in approach, such that
multiple users will obtain similar tree phylogenies, and sequences have been identified by cluster
names in order to make it easier to discuss and compare datasets. However, neither the alignments
nor the cluster naming is absolute. Much of the cluster naming appears to be robust, but some
branches and clusters are poorly resolved and typically not supported by reasonable bootstrap
values. The cluster naming in the 2010/2011 efforts has condensed some clusters to approach a more
robust cluster designation. There are multiple sequences that have problems, as imported from
GenBank (for example sequences that cannot be backaligned because of X's in the nucleotide
sequence).


Trees

We’ve added several additional trees to this new version of the database:

Trees created using new masks:

From Dec 2010 update:

tree_Genomes_AA_Dec2010_MASK_GENOME - Non-redundant genome sequences (as of Dec 2010); tree created
using the MASK_GENOME mask.

tree_AA_RepSeqsDec2010_MASK1 - Representative sequences from the original CD-HIT analysis; tree
created using the MASK1 mask.

tree_AA_QuickAddtoRepSeqsDec2010_MASK1 - Quick add tree of all sequences of suitable length (as of
Dec 2010), added using the MASK1 mask.

tree_AA_QuickAddtoRepSeqsDec2010_NoPutChimeras_MASK1 - Same as above, but with likely chimeras
removed.

From Dec 2011 update:

tree_Genomes_AA_Dec2011_MASK_GENOME - In this case, all genome sequences as of Dec 2011 are
included, meaning that there are redundancies (e.g. draft vs. complete genomes).

tree_AA_RepSeqsDec2011_MASK1 - Representative sequences from the most recent CD-HIT analysis; tree
created using the MASK1 mask.

tree_AA_RepSeqsDec2011_plusAllGenomeSeqs_noPutChimeras_MASK1 - Representative sequences from the
most recent CD-HIT analysis with all the genome sequences included and likely chimeras not
included; tree created using the MASK1 mask.

tree_AA_QuickAddtoRepSeqsDec2011_noPutChimeras_MASK1 - Quick add tree of all sequences of suitable
length (as of Dec 2011), with likely chimeras removed, added using the MASK1 mask.


From the old version of the database:

Several trees are provided as starting points. Tree names have information on the type of sequences
(AA or DNA), whether the tree was generated from genome or nonredundant representative sequences
(NR), and the start and stop positions used for the mask to generate the tree. Kimura correction
was used for amino acid sequence trees, and Jukes-Cantor for DNA trees.
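For readers unfamiliar with the two corrections named above, the sketch below uses their common
textbook forms: Jukes-Cantor, d = -(3/4) ln(1 - 4p/3), for nucleotides, and Kimura's empirical
protein correction, d = -ln(1 - p - 0.2 p^2), for amino acids, where p is the observed proportion of
differing sites. This is an assumption about which formulas are meant; the documentation does not
give ARB's exact implementation, and the function names are hypothetical.

```python
import math


def p_distance(seq_a: str, seq_b: str, gap_chars: str = "-.") -> float:
    """Proportion of differing sites over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b)
             if x not in gap_chars and y not in gap_chars]
    if not pairs:
        raise ValueError("no comparable columns")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance for DNA; only defined for p < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_protein(p: float) -> float:
    """Kimura's empirical corrected distance for proteins; math.log raises
    ValueError once 1 - p - 0.2*p*p is no longer positive."""
    return -math.log(1.0 - p - 0.2 * p * p)


if __name__ == "__main__":
    p = p_distance("ACGTACGT", "ACGTACGA")
    print(p, jukes_cantor(p))
```

Both corrections return a distance larger than p for any p > 0, reflecting multiple substitutions
at the same site.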

Genome tree: tree_genome2010_81_630. This is probably the most robust tree since it uses the
longest amino acid sequences obtained from genome sequencing efforts. The sequences were selected
by searching the records for "genome". 81-630 refers to the positions in the current amino acid
alignment.

tree_AA_NR_Dec3_2010_134_481: Includes 2982 sequences identified as representative sequences by
cd-hit (at 99% identity clustering). There were 8389 representative sequences identified by cd-hit,
but only those that were also long enough for the 134_481 mask were selected for this tree. The
original cd-hit representative sequences can be found by searching the NR field for "AA99".

tree_AA_NR_134_481_quickadd: Tree was built by the quickadd parsimony feature in ARB using the
149_478 mask, in order to add as many sequences as possible. This tree is not as reliable as the
other two trees, but generally shows the same clustering and allows positioning of more sequences.
There are 17627 sequences in this tree (out of a total of 22574 sequences in the database).


Tip for making your own trees

In order to use this database for making trees of additional sequences (e.g. newly derived
sequences from PCR, genomes, etc.), the sequences can be 1) manually aligned (very slow), 2) quick
aligned (fast, but check the alignment manually), or 3) aligned using the same procedure used to
make the database (requires some skill, and results in loss of some information on aligned
positions and features of the current database, since the sequences may be repositioned). In order
to use the quickalign feature (see ARB documentation):

Import sequences (in FASTA format). The other option is to create a new ARB database from just your
sequences, then use the merge function to bring the sequences from your database into this one.
Note that the GenBank fields will not have any information in them.

You might want to create a field of your own and write something to it (e.g. "myseqs") so that you
can easily search for your own sequences.

Mark your sequences and some of the sequences that are already in the database. If you know some
sequences that are relatively close (e.g. cyanobacteria if you are working on cyanobacteria), it is
probably better to mark those with your sequences and use one of them as the aligning sequence. The
sequences are generally so well conserved that the quickalign function works pretty well, as long
as you use sequences in the same major cluster (Clusters 1-4) to align. Once the sequences are
marked, open the ARB_EDIT window. Put the cursor somewhere in the sequence you want to use to
align, and unmark it (and any other sequences that are already aligned). Go to Edit, Integrated
Aligners; click the Fast Aligner radio button; click the Align Marked Species radio button; click
the Reference Species by name radio button (and then make sure the GI of the sequence you want to
use as the aligning sequence is in the box to the right; you should be able to do this by clicking
the Copy button on the right, if your sequence GI appeared in the Align what box above). Click the
Range Whole sequence radio button.

Click Go. Check the alignment visually.

Selecting sequences of the same length by hand is time consuming. Sequences that are of the same
length can now easily be selected using the Search and Query feature. Search StartAlign for "<xxx"
AND EndAlign for ">yyy", where xxx is the starting position of the amino acid alignment mask you
want (the tree mask can be 1 residue shorter than xxx, since "<" will return all sequences that
start one position to the left) and yyy is the end of the alignment mask (the tree mask can be 1
residue longer than yyy). Do the search, mark the listed species, and make the tree.
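The query above amounts to a simple filter on two numeric fields. The sketch below mirrors it
outside of ARB, assuming the StartAlign and EndAlign values have been exported into a list of
dictionaries; the record layout, the sequence names, and the function name select_for_mask are
hypothetical.

```python
def select_for_mask(records, xxx: int, yyy: int):
    """Return the names of records matching StartAlign < xxx AND EndAlign > yyy,
    using strict inequalities as in the query described above."""
    return [rec["name"] for rec in records
            if rec["StartAlign"] < xxx and rec["EndAlign"] > yyy]


if __name__ == "__main__":
    # Hypothetical exported records; "name" stands in for the GI used in ARB.
    records = [
        {"name": "gi_A", "StartAlign": 120, "EndAlign": 500},
        {"name": "gi_B", "StartAlign": 134, "EndAlign": 481},
        {"name": "gi_C", "StartAlign": 100, "EndAlign": 470},
    ]
    print(select_for_mask(records, 134, 481))
```

Because the comparisons are strict, a sequence that starts exactly at xxx or ends exactly at yyy is
excluded, which is why the tree mask can be one residue shorter or longer than the query values.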

We typically open up the amino acid sequence alignment prior to making the tree, to make sure the
correct sequences are selected, and put the cursor in one of the sequences. That sequence will then
appear as an option for making the mask by positions in the Neighbor Joining "Filter" tree window.



Recent contributors to the nif_ARB project are:


Rachel Foster

Philip Heller

Pia Moisander

Kendra Turk

H. James Tripp

Jonathan P. Zehr


The database can currently be acknowledged with Zehr et al. (2003). New publications are in
preparation on the 2011 developments of the database.


References

Edgar et al. (2011). "UCHIME improves sensitivity and speed of chimera detection." Bioinformatics
27(16): 2194-2200.

Huang et al. (2010). "CD-HIT Suite: a web server for clustering and comparing biological
sequences." Bioinformatics 26: 680.

Ludwig, W., O. Strunk, et al. (2004). "ARB: a software environment for sequence data." Nucleic
Acids Research 32(4): 1363-1371.

Raymond, J., J. L. Siefert, et al. (2004). "The natural history of nitrogen fixation." Molecular
Biology and Evolution 21(3): 541-554.

Young, J. P. W. (2005). "The phylogeny and evolution of nitrogenases." Genomes and genomics of
nitrogen-fixing organisms. R. Palacios and W. E. Newton. Netherlands, Springer. 3: 221-241.

Zehr, J. P., B. D. Jenkins, et al. (2003). "Nitrogenase gene diversity and microbial community
structure: a cross-system comparison." Environmental Microbiology 5(7): 539-554.





APPENDIX


Table 1. Fields for nifH metadata in the nifH ARB database and their source in GenBank records.

GenBank                        ARB
parsed from "LOCUS" line       nuc_len
parsed from "/coded_by"        nuc_acc_and_pos
parsed from "/coded_by"        nuc_version
parsed from "LOCUS" line       date
DEFINITION                     description
KEYWORDS                       key_words
SOURCE (One line)              full_taxon_name
ORGANISM (All lines)           tax
REFERENCE                      num_bib
medline-ID                     medline
AUTHORS                        author
TITLE                          title
JOURNAL                        submission
mol_type                       mol_type
taxon                          taxon
clone                          clone
collection_date                collection_date
collected_by                   collected_by
country                        country
GI                             name
GI                             GI
gene                           gene
host                           host
strain + isolate               strain
isolation_source               isolation_source
lat_lon                        lat_lon
note                           note
operon                         operon
PCR_primers                    PCR_primers
product                        product
protein_id                     protein_acc
translation                    amino_acid




Table 2. Unique nifH ARB database fields. (This does not include ARB-specific fields that are not
specific to the nif database).

Field name                     Description
NR                             Field for non-redundant information. AA99 in this field indicates
                               sequence was selected by cd-hit as representative of cluster based
                               on 99% identity.
AlignedAAs                     Total number of aligned amino acids in sequence
StartAlign                     First position of amino acid sequence in current alignment
EndAlign                       Last position of amino acid sequence in current alignment
SEQ_PROBLEMS                   Field for keeping sequence problem information (not curated)
DNA_2003                       Cluster designation based on DNA sequences from Zehr et al. 2003
AMINO_2003                     Cluster designation based on amino acid sequences from Zehr et al.
                               2003
LOCATION                       Field for writing sampling location information (not curated)
Raymond_group                  Cluster designation using Raymond et al. scheme
Young_group                    Cluster designation using Young scheme
AMINO_2011                     Cluster designations from current alignment (1.1, 1.2, etc)
AMINO_2010                     Cluster designations from current alignment (1A, 1B, etc).
Cluster                        Original CD-HIT analysis: Cluster No
RepFlag                        Original CD-HIT analysis: Representative sequence flag
RepSeq                         Original CD-HIT analysis: Representative sequence ID
NumSeqInCluster                Original CD-HIT analysis: number of sequences in cluster
PutativeChimera                Potential chimeric sequence based on UCHIME analysis
PutativeChimeraSameStudy       Likely chimeric sequence; parent sequences from the same study
AddUpdt_Jul2011                New sequences from the July 2011 update
CDHITnt_ClusterNo              Most current CD-HIT-EST analysis: Cluster No
CDHITnt_NumSeq                 Most current CD-HIT-EST analysis: number of sequences in cluster
CDHITnt_RepSeqFlag             Most current CD-HIT-EST analysis: Representative sequence flag
CDHITnt_RepSeqID               Most current CD-HIT-EST analysis: Representative sequence ID
CDHITaa_ClusterNo              Most current CD-HIT analysis: Cluster No
CDHITaa_NumSeq                 Most current CD-HIT analysis: number of sequences in cluster
CDHITaa_RepSeqFlag             Most current CD-HIT analysis: Representative sequence flag
CDHITaa_RepSeqID               Most current CD-HIT analysis: Representative sequence ID
AddUpdt_Dec2011                New sequences from the Dec 2011 update