L519: Bioinformatics: Theory & Application (3CR) - Computational ...


I529: Bioinformatics in Molecular Biology and Genetics: Practical Applications (3 CR)

HW 2 (Due: Feb. 17 BEFORE Lab session)

http://darwin.informatics.indiana.edu/col/courses/I529-12

INTRODUCTION:

There are two sections to be completed. Section 1 is for programming using Perl, Python, or C/C++, and Section 2 consists of problems related to computational methods and algorithms. To submit your completed homework (Section 1), please use the drop box on Oncourse. Though you may turn in a handwritten Section 2 at the lab class, using MS Word (doc) or Acrobat (pdf) is strongly encouraged. These files can also be submitted through Oncourse.


QUESTION:

Don't hesitate to contact me (Haixu Tang: hating@indiana.edu).


INSTRUCTION:

1. Please start working on the homework as soon as possible. Those of you without much computational background may need much more time than others.

2. Include a README file for each programming assignment. It does not need to be lengthy, but it should contain concrete and sufficient information:

A. Function of the program

B. Input / Output

C. Sample usage

D. You should submit a single compressed file for Section 1.

3. Please ENJOY learning and practicing new things.


WARNINGS:
YOU ARE SUPPOSED TO WORK IN A GROUP FOR THE MINI CLASS PROJECT. HOWEVER, YOU MUST DO HOMEWORK SECTIONS 1 AND 2 ON YOUR OWN.



--------------------------------------------- Section 1 --------------------------------------------------------

For section 1, you are required to write Perl scripts to do the following tasks (30 points).

Note: Sequence files should be in FASTA format. Please refer to the following sites for further information on the FASTA format: (Reference 1, Reference 2).
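
As a rough illustration of the FASTA format mentioned in the note, here is a minimal reader sketch. It is written in Python, which the introduction lists as an accepted language. The function name `read_fasta` and the toy records are hypothetical, and the sketch assumes that every record starts with a `>` header line followed by one or more sequence lines.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (header, sequence) pairs.

    Assumption: each record starts with a '>' header line, followed by one
    or more sequence lines; blank lines are ignored.
    """
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    # Made-up example records, for illustration only.
    example = [">seq1 hypothetical example", "ACDEF", "GHIK", ">seq2", "LMNPQ"]
    for name, seq in read_fasta(example):
        print(name, seq)
```

To read from a file, one could pass an open file handle, since it iterates over lines in the same way as the list used here.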



(A profile model for protein motifs) Build a profile model from a set of given protein sequences of the same length.




GOAL

Build a profile model for a set of given protein sequences of the same length.

Use one training set retrieved from the BLOCKS database. Include the FASTA file in your homework package.

Implement a program BuildProfile that computes and outputs the profile matrix from the input sequence data. Note: the input protein sequences should be in FASTA format, and each of them has the same length.

Implement a program SearchProfile that takes as input a profile matrix (in the format you defined) and an input protein sequence, computes the log likelihood ratio score for each subsequence of the input sequence, and outputs the instance with the highest score.
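
The two programs described above could be sketched roughly as follows in Python. This is only one possible reading of the task, not a reference solution. The function names `build_profile` and `search_profile`, the pseudocount of 1, the uniform background distribution, and the toy sequences are all assumptions that the assignment does not specify. The sketch also assumes that the sequences contain only the 20 standard amino acid letters.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumes only the 20 standard residues


def build_profile(seqs, pseudocount=1.0):
    """Return one dict per column, mapping residue -> probability.

    Assumption: a pseudocount is added to every residue so that no
    probability is zero (the assignment does not prescribe this).
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    total = len(seqs) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(s[col] for s in seqs)
        profile.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return profile


def search_profile(profile, seq, background=None):
    """Return (score, start, window) for the best window, or None if seq is too short.

    The score is the log likelihood ratio sum_i log(p_i(x_i) / q(x_i)).
    Assumption: a uniform background q unless one is supplied.
    """
    if background is None:
        background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    width = len(profile)
    best = None
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(math.log(profile[i][x] / background[x])
                    for i, x in enumerate(window))
        if best is None or score > best[0]:
            best = (score, start, window)
    return best


if __name__ == "__main__":
    training = ["ACDA", "ACDA", "ACEA"]  # made-up toy data
    print(search_profile(build_profile(training), "WWACDAWW"))
```

A background estimated from real protein data would probably be a better choice than the uniform one used here; it can be passed in through the `background` parameter.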





Result

A package including the following files.

The source code of the programs BuildProfile and SearchProfile;

A sample file of input protein sequences for BuildProfile;

A sample file of the profile matrix output by BuildProfile, which can be used as the input of SearchProfile;

A sample input file with a protein sequence for SearchProfile;

The sample result file from SearchProfile;

A simple readme file with short descriptions of the files in the package.






----------------------------------- Mini Group Project #2 ----------------------------------------

Mini group project #2 follows on from HW Section 1 (30 points).


In the last mini group assignment, we built a probabilistic model for gene finding based on codon usage. This time we want to build a gene finding model based on a 1st-order Markov chain of codons. You can reuse parts of the code from the last group project in this assignment.




Procedure (hints)

Collect 400 E. coli gene sequences as the training set;

Collect another 100 E. coli gene sequences and 100 non-coding sequences as test sets (Note: the gene and non-coding sequences should be of similar length);

Build the 1st-order Markov chain of codons using the training set; the program (ECgnfinder_mc) should use the same input and output formats as ECgnfinder from the last assignment;

Evaluate the performance of ECgnfinder_mc on the test sets, and compare it with the performance of ECgnfinder; if the performance is different, try to explain why;

Build a web server for ECgnfinder_mc, and present it as well as the evaluation results in the lab of 2/17. You can modify the code from the previous group project.
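
The model-building step in the procedure above might be sketched as follows. This is a minimal Python sketch under stated assumptions, not the actual ECgnfinder_mc: the function names, the pseudocount of 1, and the uniform codon background used for the log-odds score are hypothetical choices. It assumes the sequences are in frame and contain only A, C, G, and T, and it ignores the probability of the first codon.

```python
import math
from collections import defaultdict
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]  # all 64 codons


def to_codons(seq):
    """Split a sequence into consecutive in-frame codons, dropping any tail."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def train_codon_chain(genes, pseudocount=1.0):
    """Estimate P(next codon | previous codon) from the training genes.

    Assumption: the same pseudocount is added to all 64 possible next codons.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for gene in genes:
        codons = to_codons(gene)
        for prev, nxt in zip(codons, codons[1:]):
            counts[prev][nxt] += 1
    trans = {}
    for prev in CODONS:
        total = sum(counts[prev].values()) + pseudocount * len(CODONS)
        trans[prev] = {n: (counts[prev][n] + pseudocount) / total for n in CODONS}
    return trans


def log_odds(seq, trans):
    """Log-odds of the codon chain against a uniform codon background (assumed)."""
    codons = to_codons(seq)
    background = 1.0 / len(CODONS)
    return sum(math.log(trans[prev][nxt] / background)
               for prev, nxt in zip(codons, codons[1:]))


if __name__ == "__main__":
    model = train_codon_chain(["ATGAAAATGAAAATGAAA"])  # made-up toy gene
    print(log_odds("ATGAAAATG", model), log_odds("CCCGGGCCC", model))
```

A sequence could then be called coding when its score exceeds some threshold. Both the threshold and the choice of background (for example, a chain trained on the non-coding set instead of the uniform model) are design decisions left open here.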




Result

The program ECgnfinder_mc, including the source code and a short readme file;

Two FASTA files for the collected 400 genes as the training set and 100 genes as the test set;

A report on the performance evaluation.








--------------------------------------------- Section 2 ----------------------------------------------------------

For section 2, you are NOT required to write scripts (40 points).

1. Given 10 DNA segments of the same length L = 10 (DNA source: H-NS, histone-like nucleoid-associated DNA-binding protein),

CGCCTGAATA
CGAGAAAGTT
CGCCGGAATT
GGCATGAATA
TAAAGGAATC
TAATTTAATT
CAATTAAATT
GACATGAATC
TGGCTAATTT
CAACTGAATT

answer the following questions:

a) Build a Position-Specific Scoring Matrix (PSSM), θ1;

b) Build a PSSM, θ2, incorporating prior probability;

c) Compute the relative entropy H for both models;

d) Given another sequence S0, CAAATTATTT, compare the two models θ1 and θ2.
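
Although this section does not require scripts, the quantities in questions a) to d) can be checked with a short program. The following Python sketch rests on several assumptions that the problem does not fix: a uniform background of 0.25 per base, a prior expressed as a pseudocount of 1 per base, base-2 logarithms, and relative entropy summed over all positions. The function names are hypothetical.

```python
import math

SEGMENTS = [
    "CGCCTGAATA", "CGAGAAAGTT", "CGCCGGAATT", "GGCATGAATA", "TAAAGGAATC",
    "TAATTTAATT", "CAATTAAATT", "GACATGAATC", "TGGCTAATTT", "CAACTGAATT",
]
BASES = "ACGT"


def column_probs(seqs, pseudocount=0.0):
    """Per-position base probabilities; pseudocount > 0 acts as a simple prior."""
    total = len(seqs) + pseudocount * len(BASES)
    return [{b: (col.count(b) + pseudocount) / total for b in BASES}
            for col in zip(*seqs)]


def relative_entropy(probs, background=0.25):
    """Sum over positions of sum_b p(b) * log2(p(b) / q); 0 * log 0 counts as 0."""
    return sum(p * math.log2(p / background)
               for column in probs for p in column.values() if p > 0)


def log_odds_score(seq, probs, background=0.25):
    """Log2 likelihood ratio of seq under the model; -inf if any base has p = 0."""
    score = 0.0
    for column, base in zip(probs, seq):
        p = column[base]
        if p == 0:
            return float("-inf")
        score += math.log2(p / background)
    return score


if __name__ == "__main__":
    theta1 = column_probs(SEGMENTS)                   # raw frequencies
    theta2 = column_probs(SEGMENTS, pseudocount=1.0)  # assumed prior
    for name, theta in (("theta1", theta1), ("theta2", theta2)):
        print(name, relative_entropy(theta), log_odds_score("CAAATTATTT", theta))
```

The PSSM entries themselves are the per-position values log2(p / 0.25). Without a prior, any base that never occurs at a position gets a score of minus infinity, which is one practical difference between the two models.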



2. Devise a hidden Markov model for the prediction of protein secondary structure using the Q3 representation.



3. Besides describing the frequencies of all 61 codons, codon bias may also refer to the relative frequencies of different codons encoding the same amino acid. It has been shown that different bacteria may have different codon biases, which may be related to the different identities (or abundances) of the corresponding tRNAs encoded (or expressed) in these organisms. Lateral gene transfers are frequent between bacterial organisms, i.e., certain genes are transferred from one bacterium to another, and the two are sometimes evolutionarily distant. Assume it is hypothesized that some genes were transferred from bacterium A to bacterium B, and that the codon biases of these two genomes are very different. Suppose the genome of bacterium A has been fully characterized and the genome of bacterium B was recently sequenced. Devise a probabilistic model to identify the genes in bacterium B that were laterally transferred from bacterium A.



4. In a casino, they use a fair die most of the time, but occasionally they switch to a loaded die. The switch between dice is a Markov process shown below:




Compute the most likely sequence of dice used in 6 consecutive experiments, if 6 consecutive '6's ("6, 6, 6, 6, 6, 6") were observed.
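
The most likely sequence of hidden states in a model like this is usually found with the Viterbi algorithm. The Python sketch below shows the technique only. Every number in it is an ASSUMPTION made for illustration: the start, transition, and emission probabilities are placeholders and must be replaced with the values from the diagram in the assignment.

```python
import math

STATES = ("F", "L")  # F = fair die, L = loaded die

# ASSUMED parameters for illustration only; use the values from the diagram.
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {r: 1 / 6 for r in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def viterbi(rolls, start=START, trans=TRANS, emit=EMIT):
    """Return the most probable state path for the observed rolls (log space)."""
    scores = {s: math.log(start[s]) + math.log(emit[s][rolls[0]]) for s in STATES}
    backpointers = []
    for roll in rolls[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state from which to enter state s.
            prev = max(STATES, key=lambda p: scores[p] + math.log(trans[p][s]))
            new_scores[s] = (scores[prev] + math.log(trans[prev][s])
                             + math.log(emit[s][roll]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi([6, 6, 6, 6, 6, 6]))
```

Working in log space avoids numerical underflow for long observation sequences; for six rolls, plain products of probabilities would work equally well by hand.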