# docx - Department of Computer Science - Western Michigan University


Abstract

To visualize differences between nucleotide sequences, it is often useful to represent them as a tree. Algorithms that generate such trees are referred to as phylogenetic tree generators; they build a tree by calculating the differences between sequences and then clustering them. Distance-based algorithms create generally accurate evolutionary paths showing where one genetic sequence might have branched from another. The Fitch-Margoliash method is commonly used to solve this problem.

Index Terms

phylogenetic, bioinformatics, Fitch-Margoliash

I. INTRODUCTION

Phylogenetic trees allow for the visualization of the similarities and dissimilarities between genetic sequences and plot the possible evolutionary paths along which the sequences might have originated. Tree-generation algorithms allow researchers to find information that would not be obvious from just viewing the raw distances or other data calculated about the sequences [1].

In some algorithms the evolutionary rate is taken into consideration, so the further a node is from the root, the longer it took for those changes to appear. Knowing the evolutionary rate then allows researchers to attach a time value to the tree.

A tree can be either unrooted, indicating there is no start or primary parent in the sequence, or rooted, indicating a possible beginning to the sequence mutations.

There are many ways to create a phylogenetic tree. One of the most common is to use distance information between each pair of sequences. The distances can be calculated in many ways depending on the input data, but they always represent the amount of dissimilarity between the sequences. It is often critical that the distance be linear in the amount of dissimilarity, and some algorithms require that criterion to produce accurate trees.

Two distance-based algorithms are usually discussed: the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) and the Fitch-Margoliash method. The former is not used here, since it assumes each child has equal weighting and can therefore produce inaccurate trees, while the Fitch-Margoliash algorithm uses weighting to help closely preserve the distances between nodes in the tree.

The Fitch-Margoliash algorithm is the focus of the following paragraphs and of the implementation.


II. RESEARCH AND DESIGN

A. Research

The research consisted simply of looking for examples and explanations of the algorithm and its proper implementation. I found a slide containing a brief example of the algorithm, which was enough to program the whole algorithm from [2].

The other research was to get basic information about phylogenetic trees; however, it turns out there is a lot of misinformation about how they are defined. Some papers, for instance, erroneously assume that they must be binary trees, when in fact a branch of length zero indicates that a node has multiple branches [1].

B. Design

Phylogenetic trees can be drawn linearly in two dimensions, or they can be drawn using polar coordinates, which creates a [...] drawing for aesthetic reasons. The core [...] ad hoc from an example in a PowerPoint presentation [2].

III. IMPLEMENTATION

The implementation used JavaScript for computing and the HTML5 Canvas for rendering. The 2D context API in Canvas allows a user to easily draw lines, arcs, and text, which are all that is required to draw a phylogenetic tree from the nodes output by the Fitch-Margoliash algorithm.
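As a rough illustration of how little of the Canvas API is needed, the following sketch draws a single branch and a connecting arc using polar coordinates. It is not the original implementation: the helper names `polarToCartesian`, `drawBranch`, and `drawConnector`, the parameter layout, and the 4-pixel label offset are all assumptions made for this example. Only the `ctx` methods (`beginPath`, `moveTo`, `lineTo`, `arc`, `stroke`, `fillText`) come from the standard 2D context.

```javascript
// Convert polar coordinates (radius, angle in radians) around a center
// point into the x/y pixel coordinates that Canvas expects.
function polarToCartesian(cx, cy, radius, angle) {
  return {
    x: cx + radius * Math.cos(angle),
    y: cy + radius * Math.sin(angle),
  };
}

// Draw one branch radially outward and label its outer end.
// `ctx` is a 2D context, e.g. canvas.getContext("2d") in a browser.
function drawBranch(ctx, cx, cy, innerRadius, outerRadius, angle, label) {
  const start = polarToCartesian(cx, cy, innerRadius, angle);
  const end = polarToCartesian(cx, cy, outerRadius, angle);
  ctx.beginPath();
  ctx.moveTo(start.x, start.y);
  ctx.lineTo(end.x, end.y);
  ctx.stroke();
  // The 4-pixel offset keeps the label off the line (arbitrary choice).
  if (label) ctx.fillText(label, end.x + 4, end.y);
  return end;
}

// Draw the arc that joins sibling branches sitting at the same radius.
function drawConnector(ctx, cx, cy, radius, startAngle, endAngle) {
  ctx.beginPath();
  ctx.arc(cx, cy, radius, startAngle, endAngle);
  ctx.stroke();
}

// Browser usage (hypothetical page with a <canvas> element):
//   const ctx = document.querySelector("canvas").getContext("2d");
//   drawBranch(ctx, 200, 200, 0, 60, 0, "A");
//   drawConnector(ctx, 200, 200, 60, 0, Math.PI / 2);
```

Passing the context in as a parameter keeps the drawing helpers independent of any particular page, so they can be exercised with a stand-in object outside a browser.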

Since the program can take in sequence data, the distance matrix often has to be computed first. I used a very simple Hamming distance algorithm that counts the number of mismatches and stores that count in the distance matrix for each pair of sequences. It works, but it is not a very good algorithm for judging the dissimilarity of sequences. If a single nucleotide is missing, every later position shifts and the distances are skewed, so in practice a better dissimilarity algorithm would be required.
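The mismatch counting described above can be sketched as follows. This is a minimal illustration rather than the original code: the function names, the symmetric matrix layout, and the three sample sequences are assumptions made for the example. The sample sequences are chosen to show the weakness just mentioned, where one dropped nucleotide makes otherwise identical sequences look completely different.

```javascript
// Count mismatching positions between two equal-length sequences.
function hammingDistance(a, b) {
  if (a.length !== b.length) {
    throw new Error("Hamming distance needs sequences of equal length");
  }
  let mismatches = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) mismatches++;
  }
  return mismatches;
}

// Build a symmetric distance matrix for a list of sequences.
function distanceMatrix(sequences) {
  const n = sequences.length;
  const matrix = Array.from({ length: n }, () => new Array(n).fill(0));
  for (let i = 0; i < n; i++) {
    for (let j = i + 1; j < n; j++) {
      const d = hammingDistance(sequences[i], sequences[j]);
      matrix[i][j] = d;
      matrix[j][i] = d;
    }
  }
  return matrix;
}

// Made-up sequences. `shifted` drops the leading A of `original` and is
// padded with a trailing A so the lengths still match.
const original = "ACGTACGT";
const substituted = "ACGTACGA"; // one substitution at the last position
const shifted = "CGTACGTA";     // one deletion, everything after it shifts

console.log(hammingDistance(original, substituted));
console.log(hammingDistance(original, shifted));
```

The single substitution yields a distance of 1, while the single deletion makes every position mismatch, which is why an alignment-aware dissimilarity measure would be preferable in practice.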

IV. TESTING AND SAMPLE DATA

There is one set of test data that I kept finding over and over again, which is listed in Fig. 1 [2]. Since the input to the Fitch-Margoliash method is just a distance matrix, it is either generated from the sequence data or given separately in a precomputed form. The example in Fig. 1 was given precomputed.
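To make the use of this matrix concrete, the sketch below works through what is commonly described as the first step of the Fitch-Margoliash method on the Fig. 1 data: pick the closest pair, treat all remaining sequences as one averaged group, and solve the three-point formula for the two branch lengths. This is an illustration under those assumptions, not the original implementation; the function names are hypothetical, and the full algorithm repeats this step on a reduced matrix, which is not shown.

```javascript
// Upper-triangle distances from Fig. 1, mirrored into a symmetric matrix.
const labels = ["A", "B", "C", "D", "E"];
const dist = [
  [0, 22, 39, 39, 41],
  [22, 0, 41, 41, 43],
  [39, 41, 0, 18, 20],
  [39, 41, 18, 0, 10],
  [41, 43, 20, 10, 0],
];

// Find the pair of sequences with the smallest distance.
function closestPair(d) {
  let best = { i: 0, j: 1, distance: d[0][1] };
  for (let i = 0; i < d.length; i++) {
    for (let j = i + 1; j < d.length; j++) {
      if (d[i][j] < best.distance) best = { i, j, distance: d[i][j] };
    }
  }
  return best;
}

// Average distance from sequence k to every sequence outside the pair.
// Needs at least three sequences, otherwise there is no "rest" group.
function averageToRest(d, k, pair) {
  let sum = 0;
  let count = 0;
  for (let m = 0; m < d.length; m++) {
    if (m === pair.i || m === pair.j) continue;
    sum += d[k][m];
    count++;
  }
  return sum / count;
}

// Three-point formula: branch lengths from the pair to their new parent.
function firstJoin(d) {
  const pair = closestPair(d);
  const iRest = averageToRest(d, pair.i, pair);
  const jRest = averageToRest(d, pair.j, pair);
  return {
    pair,
    branchI: (pair.distance + iRest - jRest) / 2,
    branchJ: (pair.distance + jRest - iRest) / 2,
  };
}

const join = firstJoin(dist);
console.log(labels[join.pair.i], join.branchI, labels[join.pair.j], join.branchJ);
```

For the Fig. 1 data the closest pair is D and E (distance 10), and the three-point formula assigns them branch lengths of about 4 and 6, which sum back to the original distance of 10.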

Phylogenetic Tree Generation (August 2012)

Brandon J. Andrews

V. RESULTS AND ANALYSIS

The Fitch-Margoliash method, while there might be minor errors, generally creates the expected phylogenetic tree. The length of each branch also correctly corresponds to the dissimilarity, allowing easy visualization of which sequences are similar and which are completely different.

To analyze how accurately the tree reflects the distance matrix, one only has to recompute the distance between each pair of nodes in the tree by summing the branch lengths along the path between them. An error value can then be computed from the differences to judge the accuracy of the tree.
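A check of this kind might look like the sketch below. The paper does not say which error measure was used, so the mean absolute difference here is an assumption, as are the function names, the edge-list representation, and the small four-leaf tree and input matrix, which are made up for illustration and are not results from the implementation.

```javascript
// Hypothetical four-leaf tree; X and Y are internal nodes.
// Each edge is [nodeA, nodeB, branchLength].
const edges = [
  ["A", "X", 1],
  ["B", "X", 2],
  ["X", "Y", 3],
  ["C", "Y", 4],
  ["D", "Y", 5],
];

// Made-up input distances (upper triangle only, as in Fig. 1).
const leaves = ["A", "B", "C", "D"];
const input = [
  [0, 3, 8, 9],
  [0, 0, 9, 10],
  [0, 0, 0, 11],
  [0, 0, 0, 0],
];

// Build an adjacency list so the tree can be walked in both directions.
function buildAdjacency(edgeList) {
  const adjacency = new Map();
  for (const [a, b, length] of edgeList) {
    if (!adjacency.has(a)) adjacency.set(a, []);
    if (!adjacency.has(b)) adjacency.set(b, []);
    adjacency.get(a).push({ node: b, length });
    adjacency.get(b).push({ node: a, length });
  }
  return adjacency;
}

// Sum the branch lengths along the unique path between two nodes.
function pathDistance(adjacency, from, to, previous = null) {
  if (from === to) return 0;
  for (const { node, length } of adjacency.get(from)) {
    if (node === previous) continue; // do not walk back along the same edge
    const rest = pathDistance(adjacency, node, to, from);
    if (rest !== null) return length + rest;
  }
  return null; // `to` is not reachable in this direction
}

// Mean absolute difference between input and tree path distances.
function meanAbsoluteError(adjacency, leafNames, matrix) {
  let total = 0;
  let pairs = 0;
  for (let i = 0; i < leafNames.length; i++) {
    for (let j = i + 1; j < leafNames.length; j++) {
      const treeDistance = pathDistance(adjacency, leafNames[i], leafNames[j]);
      total += Math.abs(matrix[i][j] - treeDistance);
      pairs++;
    }
  }
  return total / pairs;
}

const adjacency = buildAdjacency(edges);
console.log(meanAbsoluteError(adjacency, leaves, input));
```

In this made-up example every pair matches except C and D, where the input says 11 and the tree path gives 9, so the error is 2 spread over six pairs. A relative measure, such as dividing each difference by the input distance, would be an equally reasonable choice.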

VI. CONCLUSION

The Fitch-Margoliash distance-based method creates very accurate trees that preserve most of the distances in the input distance matrix. However, it is important to realize that the phylogenetic tree might not relate to any real evolutionary mutations without further analysis, especially on the scale of whole organisms [1].

REFERENCES

[1] K. Louhisuo. (2004, May 4). Constructing phylogenetic trees with UPGMA and Fitch-Margoliash. [Online]. Available: http://www.niksula.cs.hut.fi/~klouhisu/Bioinfo/phyltree.pdf

[2] J. Bacardit and N. Krasnogor, Phylogenetic Trees [PPT]. Available: http://www.cs.nott.ac.uk/~jqb/G53BIO/Slides/Phylogenetic%20Trees.ppt

Brandon J. Andrews lives in Kalamazoo, Michigan, and [...] undergraduate computer science degree at Western Michigan University (WMU), Kalamazoo, Michigan, in 2010. He works for the Office of Information Technology at WMU maintaining printer and database servers while also being a part-time teaching assistant.

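|   | A | B  | C  | D  | E  |
|---|---|----|----|----|----|
| A | 0 | 22 | 39 | 39 | 41 |
| B | 0 | 0  | 41 | 41 | 43 |
| C | 0 | 0  | 0  | 18 | 20 |
| D | 0 | 0  | 0  | 0  | 10 |
| E | 0 | 0  | 0  | 0  | 0  |

Fig. 1. This testing data was found in multiple presentations.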