ColteDNAPatternFinderx - University of South Dakota


















DNA Pattern Finder (DNA-PF)









Colte Haines








Introduction to bioinformatics

Spring 2009

Monday, April 27, 2009




Abstract



The DNA Pattern Finder is a web-based tool available at
http://usd.edu/~chaines/bio/Input_form.htm. The tool is used mainly by bioinformatics
researchers to help discover the functions of unknown genes by grouping genes that have
similar transcription factors. This gives researchers a little more insight into the role of
an unknown gene. The purpose of this paper is to describe the development, input, and output
of the DNA Pattern Finder.



Background

The DNA Pattern Finder is based on a similar web-based bioinformatics tool developed by
Sujuan Ye as part of her PhD dissertation. Dr. Ye’s tool is called DNA Pattern Counter and
was designed to find patterns in DNA or DNA upstream sequences. The DNA Pattern Counter
accepts one or more accession numbers along with a set of patterns and returns the number and
positions of genes that match each of the given patterns. Its search patterns can contain
"wildcard" characters, allowing the user to detect a variety of similar patterns. A user can
also search for pattern combinations, allowing variable spacing between them. The DNA Pattern
Counter can be found at http://www.usd.edu/~sye/pattern_counter.htm.


Tool Description

The DNA Pattern Finder (DNA-PF) tool is used to find matching transcription factors among a
given list of accessions. It can also be used to find patterns in sequences provided by the
user in FASTA format. The tool accepts one or more RefSeq accession numbers along with a set
of patterns and returns groups of genes with similar transcription factors. The DNA sequence
data is retrieved from NCBI GenBank, and the upstream sequence data is extracted by the
University of South Dakota BioTeam. Currently, human, mouse, and rat upstream sequences are
available. Search patterns can contain "wildcard" characters, allowing the user to detect a
variety of similar patterns. Each search accepts at most 100 accession numbers.


DNA-PF is a web-based tool coded primarily in Perl. Because DNA-PF is a pattern-matching
tool, Perl was chosen for its powerful regular expression capabilities and for its very
useful BIO package. The BIO package contains built-in functions for retrieving accession data
from the GenBank database and extracting the sequence data.



DNA-PF was proposed by Dr. Kathleen Eyster of the University of South Dakota. Dr. Eyster
wanted a program that could group genes together by the similarity of their transcription
factors. DNA-PF does this by searching through each specified gene's promoter region. It then
returns several groups, with the group of accessions that have the most matches to one
another at the top of the page. The rest of the matches are listed in descending order.



Possible impact on researchers

The goal of DNA-PF is to help facilitate the identification of unknown genes. The tool is
based on the assumption that if two genes have similar transcription factors, it is plausible
that they have similar function(s). Given an unknown gene, a researcher would enter the
unknown gene or group of unknown genes, along with a list of known genes and their
transcription factors, into DNA-PF. DNA-PF will return groups of genes with similar
transcription factors; this will hopefully give the researcher an idea of where to look for
the expression of the unknown gene or genes in future research.


Parameter Descriptions

The input for DNA-PF is very similar to that of the DNA Pattern Counter, but DNA-PF has a few
parameters that the DNA Pattern Counter does not require. The following is a description of
the parameters required by DNA-PF.





Patterns takes motifs as text input. Multiple patterns are separated by whitespace, tabs, or
semicolons. A pattern may be followed by text in parentheses, e.g. TTTCCA(#NFAT). This input
also accepts the ambiguity codes listed below.


M -- (AC)
R -- (AG)
W -- (AT)
S -- (CG)
Y -- (CT)
K -- (GT)
V -- (ACG)
H -- (ACT)
D -- (AGT)
B -- (CGT)
N -- (ATGC)



Pattern length can be varied. The minimum and maximum numbers of characters are placed
inside a pair of curly braces. For example, TATAAAN{20,30}YYANWYY indicates that there are 20
to 30 characters between TATAAA and YYANWYY.
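
The pattern syntax described above maps naturally onto regular expressions. The following is a minimal sketch in Python (not the tool's Perl source) of how a Patterns field might be split and each motif translated. The function names split_patterns and pattern_to_regex are hypothetical, and the sketch assumes that the text in parentheses contains no whitespace or semicolons.

```python
import re

# Ambiguity codes from the list above, written as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "[AC]", "R": "[AG]", "W": "[AT]", "S": "[CG]",
    "Y": "[CT]", "K": "[GT]", "V": "[ACG]", "H": "[ACT]",
    "D": "[AGT]", "B": "[CGT]", "N": "[ACGT]",
}

def split_patterns(text):
    """Split the field on whitespace or semicolons.

    Returns (motif, note) pairs; note is the text in parentheses, or None.
    """
    pairs = []
    for token in re.split(r"[\s;]+", text.strip()):
        if not token:
            continue
        m = re.fullmatch(r"([^()]+)(?:\((.*)\))?", token)
        if m is None:
            raise ValueError(f"cannot parse pattern: {token!r}")
        pairs.append((m.group(1), m.group(2)))
    return pairs

def pattern_to_regex(motif):
    """Translate a motif such as TATAAAN{20,30}YYANWYY into a regex string."""
    out = []
    for ch in motif.upper():
        if ch in IUPAC:
            out.append(IUPAC[ch])
        elif ch in "{},0123456789":
            out.append(ch)  # a {min,max} range passes through unchanged
        else:
            raise ValueError(f"unexpected character in motif: {ch!r}")
    return "".join(out)

print(split_patterns("TTTCCA(#NFAT); TATAAAN{20,30}YYANWYY"))
print(pattern_to_regex("TATAAAN{20,30}YYANWYY"))
# TATAAA[ACGT]{20,30}[CT][CT]A[ACGT][AT][CT][CT]
```

Because the curly-brace range is already valid regex syntax, only the letters need translating.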


ACCNs or Sequences also takes text input, with one accession number per line. This parameter
does not allow comments or blank lines in the input field. A user who wishes to provide his
or her own sequence must supply it in FASTA format and must select the “Sequence provided by
the user” radio button in the advanced options.
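
The input rules above can be illustrated with a short Python sketch. It is not taken from the tool itself: the function names are hypothetical, and treating a leading "#" as a comment marker is an assumption, since the paper does not say what a comment looks like. The limit of 100 accessions comes from the Tool Description.

```python
def parse_accessions(text, limit=100):
    """Return the accessions, one per line; reject blank lines and comments."""
    lines = text.splitlines()
    if len(lines) > limit:
        raise ValueError(f"at most {limit} accession numbers are allowed")
    for number, line in enumerate(lines, start=1):
        # Assumption: a comment is a line starting with '#'.
        if not line.strip() or line.lstrip().startswith("#"):
            raise ValueError(f"line {number}: blank lines and comments are not allowed")
    return [line.strip() for line in lines]

def parse_fasta(text):
    """Return {header: sequence} for user-provided FASTA text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}

print(parse_accessions("NM_019142\nXM_218313.2\nNM_133425.2"))
print(parse_fasta(">seq1\nACGT\nttaa\n"))  # {'seq1': 'ACGTTTAA'}
```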

The next group of parameters is initially hidden from the user; these parameters can be
accessed by clicking “Show Advance Search Options”.

Search patterns in allows the user to specify in what part of the sequence the program should
start searching, that is, where the researcher believes the promoter region starts. This is
also where a user chooses to load a sequence manually or selects the species they are working
with. The default is DNA sequence, where species does not matter.


Search criteria has several options that adjust the format of the user's search.

Percent Matched accepts a whole number between 0 and 100 and sets the percentage threshold
used to form groups. In other words, the percentage of matches that the accessions in a group
share with one another must be at or above the number specified in the text box. The default
value is 75.
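
The paper does not state the exact formula behind percent matched. As a sketch under that caveat, the Python example below assumes it is the number of patterns two genes share divided by the number of patterns matched by either gene. The pattern sets P1 to P5 are hypothetical, chosen only so that the results reproduce the fractions quoted in the Figure 3 caption (4/5, 3/4, and 3/5); other formulas could yield the same numbers.

```python
def percent_matched(patterns_a, patterns_b):
    """Shared patterns as a percentage of the patterns matched by either gene.

    Assumption: the denominator is the union of both pattern sets.
    """
    a, b = set(patterns_a), set(patterns_b)
    union = a | b
    if not union:
        return 0.0
    return 100.0 * len(a & b) / len(union)

# Hypothetical pattern sets, labeled with the accessions from Figure 3.
matched = {
    "NM_019142": {"P1", "P2", "P3", "P4"},
    "XM_218313.2": {"P1", "P2", "P3", "P4", "P5"},
    "NM_133425.2": {"P1", "P2", "P3"},
}

print(percent_matched(matched["NM_019142"], matched["XM_218313.2"]))    # 80.0
print(percent_matched(matched["NM_019142"], matched["NM_133425.2"]))    # 75.0
print(percent_matched(matched["XM_218313.2"], matched["NM_133425.2"]))  # 60.0
```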

Type of gene correlation sets the way a group is put together. In a weak correlation, at
least one accession in the group is within the percent matched for all other accessions,
while some other pairs of accessions in the group may fall under the specified percent
matched. An example of weak correlation is given in Figure 3. In a strong correlation, every
accession in the group is at or above the given percent matched with every other accession;
an example is given in Figure 4. The default value is strong correlation.
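
The two correlation types can be expressed as simple checks over pairwise percentages. The Python sketch below is one reading of the definitions above, not the tool's actual code; the function names are hypothetical, and the pairwise values are the ones quoted in the Figure 3 caption.

```python
from itertools import combinations

# Pairwise percent matched, taken from the Figure 3 caption.
PCT = {
    frozenset(("NM_019142", "XM_218313.2")): 80,
    frozenset(("NM_019142", "NM_133425.2")): 75,
    frozenset(("XM_218313.2", "NM_133425.2")): 60,
}

def is_strong(group, pct, threshold=75):
    """Every pair of accessions is at or above the threshold."""
    return all(pct[frozenset(pair)] >= threshold for pair in combinations(group, 2))

def is_weak(group, pct, threshold=75):
    """At least one accession is at or above the threshold with all the others."""
    return any(
        all(pct[frozenset((hub, other))] >= threshold for other in group if other != hub)
        for hub in group
    )

group = ["NM_019142", "XM_218313.2", "NM_133425.2"]
print(is_weak(group, PCT), is_strong(group, PCT))  # True False
```

Under this reading, every strongly correlated group also passes the weak check, so the weak setting can only produce larger or additional groups.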


The option for not showing matching transcription factors less than a given number is
intended for large search inputs. This option hides any group whose number of matching
transcription factors is at or below the specified number. This eliminates groups that have a
high percent matched but are probably not very similar to one another. The default value is
2.
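
As a small illustration of the cutoff, the Python sketch below follows the "at or below" wording, so a group needs more than the specified number of matching transcription factors to be shown. The group structure, gene labels, and pattern labels are all hypothetical.

```python
def visible_groups(groups, cutoff=2):
    """Keep only groups with more than `cutoff` matching transcription factors."""
    return [g for g in groups if len(g["shared"]) > cutoff]

# Hypothetical groups: the second shares only two patterns and is hidden.
groups = [
    {"accessions": ["GENE_A", "GENE_B"], "shared": {"P1", "P2", "P3", "P4"}},
    {"accessions": ["GENE_C", "GENE_D"], "shared": {"P1", "P2"}},
]

for g in visible_groups(groups):
    print(g["accessions"])  # ['GENE_A', 'GENE_B']
```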

Find reverse complement version for each pattern does exactly what it says. If the box is
checked, the program searches for the reverse complement of every given pattern as well as
the forward strand. The reverse complement does not appear as a new column; its matches are
simply tagged with the forward pattern.
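
For patterns that use ambiguity codes and curly-brace ranges, the reverse complement has to keep each range attached to its character. The following Python sketch shows one way this could be done; it is an illustration, not the tool's implementation, and the function name is hypothetical. The complement of each ambiguity code follows from complementing the bases it stands for (for example, M = AC becomes K = GT).

```python
import re

# Complement of each base and ambiguity code.
COMPLEMENT = str.maketrans("ACGTMRWSYKVHDBN", "TGCAKYWSRMBDHVN")

# One pattern character, optionally followed by a {min,max} range.
TOKEN = re.compile(r"([ACGTMRWSYKVHDBN])(\{\d+(?:,\d+)?\})?")

def reverse_complement(motif):
    """Reverse-complement a motif, keeping each range with its character."""
    motif = motif.upper()
    tokens = TOKEN.findall(motif)
    if "".join(base + rng for base, rng in tokens) != motif:
        raise ValueError(f"cannot parse motif: {motif!r}")
    return "".join(base.translate(COMPLEMENT) + rng for base, rng in reversed(tokens))

print(reverse_complement("TTTCCA"))                 # TGGAAA
print(reverse_complement("TATAAAN{20,30}YYANWYY"))  # RRWNTRRN{20,30}TTTATA
```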

Hide transcription factors formats the columns so that if a transcription factor is not
matched to any gene in a group, the column for that transcription factor is not shown for
that group. This keeps the output page from becoming overly wide when several patterns are
entered.
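
A minimal Python sketch of this column filter is shown below. The function name and the data layout (a mapping from each gene in a group to the set of patterns it matched) are assumptions made for illustration, as are the gene and pattern labels.

```python
def visible_columns(patterns, group_matches):
    """Return the patterns matched by at least one gene in the group, in input order."""
    return [p for p in patterns if any(p in hits for hits in group_matches.values())]

# Hypothetical group: no gene matches P3, so its column is dropped.
patterns = ["P1", "P2", "P3", "P4"]
group_matches = {
    "GENE_A": {"P1", "P2"},
    "GENE_B": {"P2", "P4"},
}

print(visible_columns(patterns, group_matches))  # ['P1', 'P2', 'P4']
```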


Output

As mentioned earlier, DNA-PF returns several groups, with the group of accessions with the
most patterns matched at the top of the page, but the output page has several other features
as well. First, there is a link that opens a div listing all of the accessions the researcher
entered along with all of the patterns. This information is displayed in a matrix format
showing which accession is matched to which pattern, so the researcher can see all matches
from the search input. Along with the matrix of all matches, the parameters the user entered
for the search are also displayed on this page. Finally, the page lists the groups of genes
that fall within the search parameters. The genes and motifs are sorted before they are
displayed, with the highest number of matches at the top and left, respectively. Each group
shows its accession numbers, each with a link to NCBI GenBank for that particular accession.
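
The sorting of genes and motifs described above can be sketched in Python as follows. This is an illustration rather than the tool's code: the function name and pattern sets are hypothetical, and breaking ties alphabetically is an assumption, since the paper does not say how ties are ordered.

```python
def sort_matrix(matches):
    """Order rows (genes) and columns (patterns) by descending match count.

    matches maps each accession to the set of patterns it matched.
    Ties are broken alphabetically (an assumption).
    """
    patterns = {p for hits in matches.values() for p in hits}
    rows = sorted(matches, key=lambda acc: (-len(matches[acc]), acc))
    cols = sorted(patterns, key=lambda p: (-sum(p in hits for hits in matches.values()), p))
    return rows, cols

# Hypothetical pattern sets for three accessions.
matches = {
    "NM_133425.2": {"P1", "P2", "P3"},
    "NM_019142": {"P1", "P2", "P3", "P4"},
    "XM_218313.2": {"P1", "P2", "P3", "P4", "P5"},
}

rows, cols = sort_matrix(matches)
for acc in rows:
    print(acc, " ".join("x" if p in matches[acc] else "." for p in cols))
```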


Similar Tools


A quick Google search for similar tools came up empty. Several tools, such as P-Match and
TFSEARCH, find transcription factors in genes. DNA-PF is different because it tries to match
genes to one another based on similar transcription factors, instead of the current approach
of taking one gene and returning the list of transcription factors for that single gene.
P-Match and TFSEARCH can be used before a DNA-PF search, but DNA-PF gives a wider range of
information on the ways specific genes may be related through the similarity of their
transcription factors.






Figures


Figure 1: Screenshot of example input for the DNA-PF tool, with patterns separated by white
space and tabs and each accession entered on its own line.











Figure 2: Screenshot of the results page for a search of the DNA database with reverse
complement search enabled, empty transcription factors hidden, strong correlation, and 75%
matched.



Figure 3: An example of weak correlation. Gene NM_019142 is within the 75% threshold with the
other two genes, XM_218313.2 (4/5 = 80%) and NM_133425.2 (3/4 = 75%), but the genes
XM_218313.2 and NM_133425.2 are not within it with each other (3/5 = 60%).






Figure 4: An example of strong correlation, where every gene in the group is within the
specified 75% matched.