Longest Common Subsequence Assignment - Computer Science

moredwarfΒιοτεχνολογία

1 Οκτ 2013 (πριν από 4 χρόνια και 3 μήνες)

130 εμφανίσεις

Assignment 3


Longest Common Subsequence


Dean Zeller
1

Due: Thursday, September 21
st

by 2:00
PM

CS33001

10 points

Fall, 2006



Objective:

The student will write a C++ program to implement a dynamic programming
approach to the longest common subsequence
problem.


Topics covered

This assignment is a review of your skills as a programmer using the
string

and
two
-
dimensional
array
data structures.


Background: Bioinformatics

From Wikipedia:

Deoxyribonucleic acid

(
DNA
) is a nucleic acid


usually in the form
of a double helix


that contains the genetic
instructions (or genocode) monitoring the biological development of all cellular forms of life. DNA is a long
polymer of nucleotides and encodes the sequence of the amino acid residues in proteins using the gen
etic code of
nucleotides. DNA is thought to date back to between approximately 3.5 to 4.6 billion years ago.

Computers execute
machine code
, a series of 0’s and 1’s. The machine code for living organisms is DNA, a
sequence of four nucleotides: adenine, c
ytosine, guanine, and thymine. Machine code and DNA are very
similar in theoretical structure. Thus, a technique that is useful in computer science can also be useful in
genetics. DNA can be represented digitally by long strings of A, C, G, and T. One
of the most simplistic tests
to determine genetic similarity is the longest
subsequence

of nucleotides common to both species. Determining
this similarity value is of great use to bioinformatic scientists to determine position on an evolutionary tree.
(E
volution trees will be discussed later in the semester.)


It is a common problem in computer science to determine the longest common sub
string

of two strings,
particularly useful in web search engines. In bioinformatics, the largest subsequence of two str
ings is far more
useful (and difficult) to determine. It is simple to write a brute
-
force program to solve the problem. DNA
strings can be millions of characters long, and thus a brute
-
force method could take years (or even centuries) to
give results for

real
-
world data. This assignment implements a dynamic programming method of efficiently
solving the longest common subsequence problem.


Program Requirements

Input

Prompt the user to enter two strings, representing the DNA sequence of two species.
Alter
nately, the strings can be read in from a file.

Error checking

Before executing the algorithm, check each character in both strings to ensure only
acceptable DNA letters are entered. Do not run the algorithm on input with erroneous
data. In the event o
f an error, indicate how many characters are illegal, and recreate
the string with the illegal letters highlighted (example: AGCTSACT


agctSact).

Output

The program should output the dynamic subsequence table and the length of the
longest subsequence. T
he actual subsequence string can be printed for extra credit.


Program Architecture

Functions

Use functions to modularize your code. The following is a suggestion on how to
organize your functions.




functions.h

header file for functions




functions.c
pp

implementation of the functions




main.cpp

main program implementation that calls the functions

Documentation

Each function must include a documentation block describing the input, output, and
method of solving the problem. Create documentation for th
e main program
describing the program overall. Use proper indentation and self
-
descriptive variable
names.





1

Written by Dean Zeller. Edited by Jasmine Boscom, Daryl Popig, and John Withers.

Grading

You will be graded on the following criteria:

Accuracy

Correctly implementing and using the LCS algorithm

Readability

Documentation, descr
iptive variable names, and indenting

Testing

Ensuring the code works for a variety of inputs

Creativity

Using new methods to solve the problem


Extra Credit

Extra points will be given for including the following features:

File input

Allow the user to enter

a filename with the two input strings

Subsequence


Print the longest common subsequence in addition to the length

Brute
-
force


In addition to the dynamic programming method described in this assignment, also
implement a brute
-
force solution to the proble
m. Create a short report that describes
the program duration for a variety of input lengths.

Experiment


Reconstruct the evolution tree on the species described in Input Set #5. Run your
program on each pair of species in Input Set #5 below. Use the si
milarity values to
reconstruct a possible evolution tree, in which species that are more similar are closer
on the tree.


Turning in your assignment

1.

Print all C++ and header files.

2.

Print a test of your program with input sets given below.


Algorithm Pseudo
code

Given below is the pseudocode to solve the problem. You are to implement the pseudocode in object
-
oriented
C++ functions. Note that this pseudocode has not yet been tested.


Step 1: initialize variables


string s = string1


string t = string2


Twi
dth = s.length + 1


Theight = t.length + 1


int T[Twidth][Theight]


for (i=0; i<Twidth; i++)



T[i,0] = 0


end
-
for


for (j=0; j<Theight; j++)



T[0,j] = 0


end
-
for


Step 2: Run LCS algorithm


for (j=1; j<Theight; j++)



for (i=1; i<Twidth; i++)




if (s[i
] == t[j]) // match in subsequence





T[i,j] = T[i
-
1,j
-
1] + 1




else // no match, use previous value





T[i,j] = max (T[i
-
1,j], T[i,j
-
1])




end
-
if



end
-
for


end
-
for


subsequenceLength = T[Twidth][Theight]


Step 3: Determine subsequen
ce string (extra credit)


To recreate the string, you must keep track of the times the algorithm found a matching letter in the
sequence. The exact method to keep track of the sequence matches is up to you.



Test Input Sets

Test your program on the foll
owing sets of input data.


Input Set #1

String1:

ATCCGACAAC

String2:

ATCGCATCTT

Output:

7 (ATCGCAC)


Input Set #2

String1:

AACGTTCOGMA

String2:

GGATACCASAT

Output:

errors in String1 (aacgttcOgMa)


error in String 2 (ggataccaSat)


Input Set #3

String1:

AAA
ATTTT

String2:

TAAATG

Output:

4 (AAAT)


Input Set #4

String1:

TAGTAGTAGTAGTAGTAG

String2:

CATCATCATCATCA

Output:

8 (ATATATAT)
or

8 (CACACACA)



Example of Method

Rule for T[i,j]:


if (string1[i] == string2[j])

T[i,j] = T[i
-
1,j
-
1] + 1


otherwise

T[i,j]
= maximum (T[i
-
1,j], T[i,j
-
1])


string1 = “ATCCGACAAC”

string2 = “ATCGCATCTT”



+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1


string1 = “AAAATTTT”


string2 = “TAAATG”



+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

+1

Input Set #5 (extra
credit)

Use your program to determine which of the following DNA
sequences are most similar. This input set is to simulate a real
genetic experiment. In a real experiment there could be many
species, each DNA sequence thousands of characters long.

Stri
ng1: GTCACTTCACGGGTACAGACTTAAACG

String2: ACCATTACGGCGATACCAGGATAC

String3: TTTTATTTAGGACAGACTAGACCAGGT

String4: CCGTAGATCGATACGATACCCACCTCAGG

String5: CCGTAGATCGATACGATACCCACCTCAGG