Here

throneharshΒιοτεχνολογία

2 Οκτ 2013 (πριν από 3 χρόνια και 10 μήνες)

77 εμφανίσεις

David Goldberg

CS 1950

Proposal Paper


The directed study task that I am doing is programming for the biology department.
Specifically, I am coding a program that parses RNA sequences and searches for patterns. RNA
sequences are

the building blocks for life
. Every cell of every living thing on this planet has RNA.
From the leaf of the tallest tree, to the smallest
bacterium
, all need RNA to function. Similarly
to lines of code, RNA is instructions on how to make the most funda
mental parts of living
things. RNA is read by ribosomes in cells,

which then
are used to
create a functional product.

This product is the proteins for which all
life is built by. N
eedless to say, RNA is an importa
n
t
topic in biology. Unlocking what
seq
uences correspond to what attributes is a very significant
quest.



For my research project, the code I will be programming searches
RNA containing 2

pattern
s

that the scientist has deemed important. This is done to help understand more about
RNA. For e
xample, understanding what causes some attributes to be expressed in the person,
while other times that attributes, despite appearing in the RNA, is skipped, due to something in
the code preceding it.


The
RNA
sequences that I will be dealing with includ
e the possible characters:


A

,


C
’,

G

,

T

,

Y

,

R

,

N

. In the RNA code itself
,

only

A

,

C

,

G

,

and ‘
T


exist. In the RNA sequence
they represent both the amino acids that make up what the RNA is trying to build as well as
procedural information, with commands such as ‘start transcribing here’ and ‘end transcribing
here’ . Y, R, N are used only used in the

patter
n

that is inputted by the user. A ‘Y’, stands for
either a ‘C’ or a ‘T’. An ‘R’ corresponds to either an ‘A’ or a ‘C’. And a ‘N’

corresponds to any of
them, so

A

,


C

,


G

, or

T


will be an acceptable character.


The current program reads in f
or a flat file containing all of the RNA sequence. The file
is broken up into several parts. There are 2 RNA sequences that correspond to one another.
One is the sequence in human RNA; the other is the sequence in mouse RNA. Both sequences
have their o
wn identification number, which is separated from the sequence by a tab. The 2
sequences are separated from the next pair of sequences by a new line character. The RNA
sequences are broken up into 3 different parts, each separated by an ‘X’. The 3 parts

are called
the up intron, the exon, and the down intron, in that order.


For this research project, I will be improving an already existing code. The code that I
will be improving is very minimalistic in its functionality. The existing code has no use
r
interface. The 2 patters being looked for are entered as command line arguments. It will not
accept ‘Y’, ‘R’, or ‘N’ characters. In order for the user to have the functionality that those
characters provide they need to manually construct their defini
tion
. For example,

a
n


N

, which
can be any of the 4
characters,

A

,


C

,


G

,

and ’
T

, the user would need to put this in the
pattern [A|C|G|T]. The file containing all of the RNA sequences contains both human RNA
patters, and the corresponding pattern in a mouse RNA sequence. However, the current
program only checks for the patterns in
the human RNA sequence. Also, it only checks the
down intron for the patterns and completely ignores both the up intron and the exon.

The 2
patterns are checked for in the last 75 characters of the down intron.

This distance is fixed.
Also, the pattern
s are searched for separately. Therefore, the matches found can overlap. This
overlap is undesirable. Lastly, the patterns where both matches are found are stored in a
results file. The human RNA sequences identification number is put into the file, fo
llowed by a
tab, followed by the 75 character string that was searched.


There are several desired changes to this program, which I will implement. My
improved program will provide a better user interface, over just the command line arguments.
My program

will add the functionality of the ‘Y’, ‘R’, and ‘N’ characters. My program will also
give the user the ability to control what part of the RNA sequence they are searching in, either
the up intron, the down intron, or the exon.
My program will also check

the mouse RNA for the
matching patter. In addition to that it will print to the results file if it matches just the human,
just the mouse, or both.
My program will also greatly increase the options available to the user
in how they define their search.

Instead of having a fixed value for

the section of the sequence
that it is checking, currently is 75, the user can specify this length. My program will also allow
the user to specify whether this section is taken from the beginning or the end of the part

of
the RNA that is being checked. (i.e. length of

50
, for the
beginning

of the

up intron
.) Also, the
patterns will be checked for matches together, so the distance between the patters will also be
able to be specified by the user. Finally, the readabilit
y of the results file will be improved. My
program will make the matches apparent. This will be done by separating the parts of the
sequence. There will be tabs in between the beginning of the sequence and the first match, the
first match and the middle

of the sequence, the middle of the sequence and the second match,
and the second match and the end of the sequence. My program will also shave off part of the
beginning and end of the sequence, farther away from the matches, to make it more readable.





There are several problems I have the potential to encounter while working on this
project. First off, the current program is written in Perl, as will my updated version of it. The
problem with this is I do not have any knowledge of Perl, or any othe
r scripted language at this
point.
Also this program relies on somewhat complex regular expressions. I do not currently
have any knowledge of regular expressions. More general problems will be the difficulty in
understanding of the material I am working

with, as well as trying to ascertain what the person
I am working for is looking to be done. The last biology course I took was in 9
th

grade, so my
knowledge of biology is rather limited. Therefore, it may be harder to figure out what exactly
they are t
alking about and that will make my job of figuring out what needs to be done harder.
Similarly to my lack of biology knowledge, the people I am working for have a limited
knowledge of computer science. Therefore, I will also need to make sure that what t
hey are
looking to be done can in fact be done.