Summary for August 25th - Sept 1st, 2011



Krishnakumar Sridharan

Project A


TSS Prediction using Machine-learning

1. Framework for getting more meaningful results from DNAFeatures:

A. While I was testing different negative datasets using DNAFeatures (see previous update), I was unable to get results for the nucleosome-prediction features for all datasets. I met with XK regarding this, and he pointed out a few errors in my approach and how to improve the nucleosome-prediction part.
B. To get meaningful results from the NuPoP code used for nucleosome-based features, larger sequences (-5kb, +5kb) need to be supplied at the "boundaries", to take into account the "boundary effects" described by the authors of the NuPoP package. I was not taking these into consideration in my 400bp-long sequences, and thus the nucleosome-prediction results I got for some datasets would not mean anything.

C. The new approach that I plan on using in the future to get nucleosome-prediction features is as follows:

- Calculate all features, except nucleosome-prediction features, from the DNAFeatures package using many single standard-length (400bp in examples) fasta sequences, and get feature values.

- For nucleosome-based features, extend the standard-length sequence by about 5kbp on both sides, with either "Filler" (random nucleotides and/or unknown nucleotides, "N") or "Actual" (upstream and downstream sequences), and then run NuPoP on them for the limited region [5001,5400].

- The above approach will take care of the "boundary effects", while giving the nucleosome-based feature values for only the selected 400bp (or any standard-length) sequence.
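As a small illustration of the padding step above, the sequence preparation could be sketched like this (a minimal Python sketch; the all-"N" filler choice and the names are assumptions, and NuPoP itself would still be run separately on the padded sequence):

```python
# Pad a standard-length (400bp) sequence with 5kb of "N" filler on each
# side so that nucleosome predictions for the original region are not
# distorted by boundary effects; the region of interest in the padded
# sequence is then [5001, 5400] (1-based).

FLANK = 5000  # 5kb of filler on each side

def pad_for_nupop(seq, flank=FLANK, filler="N"):
    """Return the padded sequence and the 1-based region of interest."""
    padded = filler * flank + seq + filler * flank
    region = (flank + 1, flank + len(seq))  # (5001, 5400) for 400bp input
    return padded, region

seq = "ACGT" * 100            # a hypothetical 400bp sequence
padded, region = pad_for_nupop(seq)
print(len(padded), region)    # 10400 (5001, 5400)
```

The "Actual" variant would simply substitute real upstream/downstream genomic sequence for the filler.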


2. Discussions on machine-learning: I also discussed with XK the machine-learning approach that he was following, and got some good directions from him. Some interesting pointers were:

A. The actual difference comes not from the machine-learning algorithm used but, instead, heavily from the datasets and the "features" used for prediction.

B. BioBayesNet is a good place to start these analyses, since it has been used before in the group (YZ and XK) and is user-friendly. WEKA is the next go-to tool, since it can run many different types of analyses.

C. The Support Vector Machines (SVM) approach that I tried to implement is a popular and well-used algorithm in machine-learning circles, but is considered computationally expensive (it frequently runs out of memory; see previous updates). Since SVM is often recommended and used in the literature, I will definitely come back to it once I have implemented a run in any of the other machine-learning algorithms.


3. Research discussion with Chris Eisley (Dr. Dorman's student):

I. After a conversation at my poster last Saturday, I met Chris Eisley (CE) from Dr. Karin Dorman's lab over lunch and discussed opportunities to collaborate in research.

II. We explained our research methodologies, approaches and progress to each other, so as to think of areas to work on together. CE works on the IMM-based models and code by Mike Sparks from our group; he is currently expanding some of the models in this work to predict coding vs. non-coding sequences.

III. CE works on estimating the effects of various hidden states in the Markov model that correspond to genomic features such as G/C content. He mentioned developing a probability-based model that performs a binary classification, predicting whether a given sequence is coding or non-coding.

IV. I explained to him how I use genomic features as predictors to predict whether or not a sequence has a Transcription Start Site and, if it does, where this TSS is. These discussions helped me think of a statistically sound methodology for doing the following things for my project:

A. Formatting, and making sure of the integrity of, the positive and negative training data, and the testing data too. An idea that I came up with for testing data for the eventual machine-learning-based algorithm is to take a genomic DNA/chromosome sequence and fragment it; these fragments can together form an unbiased test set.

B. Cross-validation of the eventual TSS-predicting algorithm: I could use either a random or a leave-one-out cross-validation approach, separating 4/5th of the genomic DNA fragments into training data and 1/5th as testing data.

C. CE suggested a machine-learning approach of his choice, Random Forest, and we discussed how that would compare to SVMs or other machine-learning methods.
V. In conclusion, based on the discussions we had, some of the possible opportunities to collaborate with CE would be as follows:

1) Adding to the genomic features:

- We discussed implementing a model loosely based on the one that CE is working on, which will tell if a sequence has TSSs, based on the 200 k-mer upstream sequence (promoter elements).

- This model will output a probability score that is higher in case the given upstream sequence is upstream of a TSS; it is a chain-based model that has been used for predicting coding sequences.

- The concerns we would have are that sequences with no TSS in them would have no defined "upstream sequence", and also choosing the right-sized k-mer.

- CE required a little more time to look at the finer details of implementing such an algorithm for a promoter sequence, and I agreed to provide him with some +ve and -ve sequences once he is ready (he will notify me by email). If we could add this feature, it might increase the prediction efficiency of my approach.

2) Estimating more hidden states: CE mentioned that the prediction efficiency of the Markov chain he is working on increases if there is an estimate for the G/C content. I suggested that the DNAFeatures we have might help him estimate more "hidden" states or genomic features, so that he may test whether that increases the prediction accuracy for him.

3) Machine-learning: Since CE has applied the Random Forest approach in some previous works, I could learn how to apply that method from him. Also, the machine-learning part of my work could benefit his side of things.


4. Future work in the coming weeks:

Based on these discussions and the analyses I have done previously, my plans for the coming weeks are as follows:

A. Convert data into the C4.5 machine-learning-compatible data format (used in BioBayesNet and WEKA) using Perl scripts of my own (Timeline = mid-next week)

B. In parallel, I will work on implementing the newer approach to calculate nucleosome-based features, using Perl scripts to format data and the NuPoP code within the DNAFeatures package to predict the actual feature values (Timeline = end of next week)

C. Once the C4.5 format is done, I will run these data through machine-learning tools (starting with BioBayesNet) and see what I get (Timeline = within the next 1.5 weeks).
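A converter along the lines of step A might look like this (a Python sketch of the general C4.5 .names/.data layout; the feature names, values and file stem are hypothetical, and the actual Perl scripts would read DNAFeatures output instead):

```python
def write_c45(stem, feature_names, rows, classes=("TSS", "noTSS")):
    """Write a C4.5-style .names/.data file pair: .names declares the
    class values and continuous attributes, .data holds one
    comma-separated example per line, ending with its class label."""
    with open(stem + ".names", "w") as f:
        f.write(", ".join(classes) + ".\n")       # class value line
        for name in feature_names:
            f.write(f"{name}: continuous.\n")     # numeric features
    with open(stem + ".data", "w") as f:
        for values, label in rows:
            f.write(", ".join(str(v) for v in values) + f", {label}\n")

# Hypothetical example: two features per 400bp sequence
write_c45("tss_example",
          ["gc_content", "nucleosome_score"],
          [([0.42, 1.3], "TSS"), ([0.55, 0.2], "noTSS")])
```

WEKA's own ARFF format differs slightly, but WEKA can also read C4.5 .names/.data pairs.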


Project B

Transcription Initiation and Promoter Architecture across Species

1. Objective of work this week: To predict the transcription-initiation and promoter-architecture data for one species end-to-end, on a proof-of-concept basis, to see the challenges associated with this task and to observe the hierarchy of script usage.


2. Tasks done: Ran protist (Plasmodium falciparum) data through GeneSeqer and EST2TSS, following through to the TSS-prediction step. Also looked at the different datasets available for a given species and formulated an approach to capture information from the various formats of available data.


3. Plasmodium falciparum: a malaria parasite which was sequenced very recently (data released May 2011); the sequence data is available at PlasmoDB (http://plasmodb.org/plasmo/).



4. Approach used:

A. P. falciparum data was extracted from PlasmoDB in the form of GFF, EST and genomic sequence files

B. A Perl script was written to parse the GFF file into coordinates corresponding to Transcription Start Sites
C.

Geneseqer was run with the EST s
equences and each of the 14 chromosomes in the
genomic sequence file (Running time: 30
-
40 minutes per chromosome). The
GeneSeqer output was given as input to EST2TSS based on which predictions are
given for possible Transcription Start Sites along with the
ir orientation

D. EST2TSS can give both the individual TSSs and can cluster TSS matches, within a user-specified window, into a prospective TSR (a window size of 40-100bp is usually good for bundling/clustering TSSs; the best predictions for example chromosome 14 were seen with a 40bp window size)

E. Both the GFF and EST files are admittedly "preliminary" according to their authors, so the data might not point to an exact TSS; but using the strict criteria for matches in EST2TSS, we can compress the various nearby TSSs into plausible TSRs.

F. This little exercise helps us shape an approach for two data types - GFF and EST/genomic sequence files

5. Plans for each format of data:

A. GFF - use Perl or R scripts to run the data and extract possible TSS positions

B. EST/Genomic sequence - use GeneSeqer + EST2TSS, and tweak input parameters in EST2TSS to give a better-supported annotation than the one given in the GFF file

C. CAGE - R scripts by TR

D. SAGE and RNA-Seq - interesting, but attempt only if the package-development process is done or close to done

Contingency plan for incomplete/absent GFF files for some species: For species with incomplete GFF data, optimize EST2TSS parameters on existing data and extrapolate other data from runs of EST2TSS with the optimized parameters. For absent GFF files, use EST2TSS with a set of more lenient/non-restrictive parameters to get some genome annotations.

6. Research discussion over today's meeting:

A. At our weekly meeting, we updated each other on the progress made from both our ends since our last meeting, and discussed the directions to follow after this.

B. TR has been working on the CAGE datasets for humans in the FANTOM dataset. He uses the BioMart package within R to handle these datasets and obtains an output that contains the gene name, gene start position, gene orientation and gene sequence. I showed him the outputs from EST2TSS.

C. We decided upon the first of many data-format-based "checkpoints" in our workflow. I will be consolidating the outputs that I get from EST2TSS and the GFF files into the following formats:

(1) .ClusterFormat - this file format has gene names as columns, and all start positions and the strands they occur on as rows. This is the input format for the TSS-clustering part of our workflow

(2) .mod.gsq - this modified GeneSeqer output format contains the columns gene name, gene start, gene orientation, gene description and gene sequence. The purpose is to connect EST2TSS outputs to the existing annotations, where EST2TSS is used as a quality-control tool to select only "strongly-supported" TSSs for the next step.

D. These formats will form a common basis for comparing different data types (EST, GFF, CAGE) from different species and will also act as a checkpoint before we proceed to our next step, which is clustering TSSs into a TSR.
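As an illustration of the first checkpoint format, a writer for the column-per-gene layout described in (1) could look like this (a Python sketch of the layout only; the exact .ClusterFormat conventions are still ours to fix, and the gene entries here are hypothetical):

```python
def write_clusterformat(path, genes):
    """Write a .ClusterFormat-style table: one column per gene, with
    that gene's 'position,strand' entries listed down the rows."""
    names = list(genes)
    depth = max(len(v) for v in genes.values())
    with open(path, "w") as f:
        f.write("\t".join(names) + "\n")
        for i in range(depth):
            row = [f"{genes[n][i][0]},{genes[n][i][1]}"
                   if i < len(genes[n]) else "" for n in names]
            f.write("\t".join(row) + "\n")

# Hypothetical TSS entries for two genes
write_clusterformat("example.ClusterFormat",
                    {"geneA": [(100, "+"), (140, "+")],
                     "geneB": [(2500, "-")]})
```

The same skeleton, with the extra gene-description and gene-sequence columns, would cover the .mod.gsq rows.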

E. TR will probably travel to Ames in the week of September 24th, and we aim to have a substantial amount of our single-kingdom analyses, some preliminary results, and the set of scripts that we plan to put together in one package, by September 24th.


7. Future work in the coming weeks:

(i) Use Perl scripts to format data into the two aforementioned formats (Timeline - next Wed./Thurs.)

(ii) Once the format is set, run these scripts on the next protist species, Toxoplasma gondii, and get species- and chromosome-specific results (Timeline - next Wed./Thurs.)

(iii) Provide these files as input to the x-means clustering algorithm and see what we get (Timeline - end of next week)
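Since x-means is essentially k-means with model selection over the number of clusters, the idea behind step (iii) can be illustrated with a toy 1-D sketch (not the actual x-means implementation we will use; the BIC form and toy TSS positions here are simplified assumptions):

```python
import math, random

def kmeans_1d(points, k, iters=50, seed=0):
    """Plain 1-D k-means: returns cluster centres and point groups."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda i: abs(p - centres[i]))].append(p)
        centres = [sum(g) / len(g) if g else centres[i]
                   for i, g in enumerate(groups)]
    return centres, groups

def bic(groups, centres):
    """A simplified BIC score for a clustering; lower is better."""
    n = sum(len(g) for g in groups)
    rss = sum((p - c) ** 2 for g, c in zip(groups, centres) for p in g)
    variance = max(rss / n, 1e-9)
    return n * math.log(variance) + len(centres) * math.log(n)

def choose_k(points, kmax=5):
    """x-means-style model selection: try k = 1..kmax, keep best BIC."""
    best_k, best_bic = 1, float("inf")
    for k in range(1, kmax + 1):
        centres, groups = kmeans_1d(points, k)
        score = bic(groups, centres)
        if score < best_bic:
            best_k, best_bic = k, score
    return best_k

tss = [100, 105, 110, 500, 505, 900]   # hypothetical TSS positions
print(choose_k(tss))
```

The real run would feed the .ClusterFormat positions in, letting the algorithm decide how many TSRs each gene's TSSs form.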