Programming for Bioinformatics - BIOTEC - Biotechnology Center ...

abalonestrawΒιοτεχνολογία

2 Οκτ 2013 (πριν από 4 χρόνια και 1 μήνα)

94 εμφανίσεις

Michael Schroeder



BioTechnological Center

TU Dresden

ms@mpi
-
cbg.de

Biotec

Programming for Bioinformatics

The module…


will teach students
basic programming skills
relevant to
bioinformatics, which will enable them to
actively develop
bioinformatics tools
.


will take a
problem
-
driven approach
.


will present
bioinformatics problems
and show
how to solve
them
using existing online tools and how to
implement
such
tools
.


will
revisit some of the problems
and databases discussed in
applied bioinformatics.


will be very
practical and hands
-
on
approach to basic
computer science tools such as using command line
operating
systems
, programming in
Python
, and using
relational
databases
.

Objectives


Students will have an understanding of
different operating
systems


Students will be able to
automate simple repetitive
information retrieval tasks


Students will be able to write
simple programs in Python


Students will be able to
work with relational databases


Students will appreciate the
principles, limits, and
possibilities of programming


Students will be able to
formulate biological questions as
information processing problems


Students will understand
when and how programming can
help to automate bioinformatics problems

Module Structure


Introduction



Databases


Introduction to SQL


A Little Exercise


A Little Science



Introduction to Python


Data types and loops


Sequences and lists


Patterns and functions


Dictionaries


Advanced topics


More Python


Dynamic programming


Clustering



Revision Class


Books



You will need two books for the module: a
reference book on MySQL and a book on Python

Books: Python


We will follow a number of online
resources.

(see course web page)



Further, we look in Python in a Nutshell,
Alex Martelli, O’Reilly


Wesley Chun's Core Python
Programming



Python Cookbook

(O’Reilly)



The publisher O’Reilly has many
general programming books on linux,
python, etc.


They allow you to read all books for 2
weeks online for free. This is very nice
to decide what to buy and what not.


You can also buy electronic copies of
the book.


Books: MySQL


There are many, many
books on MySQL


The following two are just
sugestions, as there are
many other books covering
the same material



MySQL Cookbook
by Paul
DuBois, O'Reilly or



MySQL
by Paul DuBois,
Michael Widenius, O'Reilly

Structure of Labs


Databases


Lab 1,2: Simple SQL


Lab 3,4: SQL to answer interesting scientific questions


Python


Lab 5: Data types and loops, accessing a DB from Python


Lab 6: Sequences and lists


Lab 7: Patterns and Functions


Lab 8: Dictionaries


Lab 9: BioPython


Lab 10: Python & PyMOL



More Python:


Lab 11: Dynamic programming revisited


Lab 12: Clustering revisited


Lab 13: Revision

Assessment


Lab


Exercises
:


Each week during
the
lab
you get exercises which you have
to do during the lab and finish on your own during the week


These
exercises
need to be
handed in on paper at the next
lecture


Results are discussed during the labs and as part of the
assessment you will have to
present a solution once


Doing the
exercises
is
compulsory
, but there are
no marks


Project


You will demonstrate your programming skills by implementing
and presenting a software project


Exam


Pen and paper exam on material covered in lecture

Programming Project


Goal: Demonstrate ability to use SQL and
programming


Goal 2: Produce science movie for Long Night of
Science


You will work in a team and get a biological problem.


Part 1: Programming: You have to implement some
workflows, which integrate data from various sites and
use various tools programmatically. This includes an
animation of your target protein in PyMol.


Part 2: Make a movie. Tell the story about your protein
based on the data collected and analysis carried out.
Create a story board and turn all material and Pymol
animations into a movie.

Motivation: Databases


In the
last term
,


we accessed most
information online
via the web


we
interacted directly and manually
with databases and tools


we had to
manually submit queries
, interpret results. select interesting
results, cut&paste them, and submit queries again,…


Pro:


Reasonably easy to get hold of information


Con:


Not possible to ask many queries


Queries limited
by interface provided by web page


Difficult/impossible to integrate
information from different sites


In
this term
, we will look at the
databases underlying the online
front ends


How is the data internally stored?


How can we
-

and more important computer programs
-

directly interact
with the underlying data, so that we can ask more powerful queries, large
queries, and integrate different systems

What actually happens

You are limited by what web server
allows you to ask:

Example CATH:


PDB ID,


CATH code, or


General text

But you cannot ask:


In how many different PDB
structures is there a P
-
loop domain?


Is there a PDB entry with a P
-
loop
and a DNA
-
binding domain


How many different superfamilies
does the largest structure in PDB
have?


With direct access to the underlying
database you could answer all these
questions (and many more)

Motivation: SCOP as Relational Database


We worked with
SCOP
, the Structural Classification of Proteins


Family
: >30% sequence identity


Superfamily
: Similar structure and function (possibly lower 30% sequence
identity)


Picture from www.jenner.ac.uk/YBF/DanielleTalbot.ppt

30%

Family

Same
Superfamily,

But not family

Motivation: Databases


We wish to answer the following questions:


How many families
and
superfamilies
are there?


Do
all superfamilies
roughly have the
same number of families
?


How many families does the immunoglobulin superfamily have?


Which superfamily has the
most families and how many
?


How many
percent of superfamilies
have only
one family
?


Which
PDB
structure has the
largest number of distinct
superfamilies
?


How many percent of PDB structures have
only one type of
superfamily
, how many percent have at least two?


Which is the
most popular superfamily
?


Are all superfamilies
equally likely to co
-
occur
or do they have
preferences?


Which superfamily has the
most co
-
occurrence partners
?


Is the number of co
-
occurrence partners and the frequency of the
superfamily correlated?


What is a Database


SCOP contains relevant information, but we cannot answer the
above questions through the web
-
interface of SCOP


The problem is that we do not have access to the underlying
database



What is a database anyway?


A database provides…


Logical organization
of data


data models, schema design, dictionaries


Physical organization
of data


Fast retrieval, indexing, compact storage of data


Relational Database


Central Idea: Data as relations in a table


E.g. Employee

+
-------
+
------
+
---------
+
---------
+

| id | name | salary | role |

+
-------
+
------
+
---------
+
---------
+

| 46457 | pete | 50.000 | director|

| 46458 | jane | 60.000 | nurse |

| 46459 | asif | 70.000 | driver |

+
-------
+
------
+
---------
+
---------
+

Relational Database


Central Idea: Data as relations in a table


E.g. SCOP, Structural Classification of Proteins

+
-------
+
------
+
---------
+
---------
+
--------------------------------------
+

| id | type | sccs | sid | description |

+
-------
+
------
+
---------
+
---------
+
--------------------------------------
+

| 46457 | cf | a.1 |
-

| Globin
-
like |

| 46458 | sf | a.1.1 |
-

| Globin
-
like |

| 46459 | fa | a.1.1.1 |
-

| Truncated hemoglobin |

| 46460 | dm | a.1.1.1 |
-

| Truncated hemoglobin |

| 46461 | sp | a.1.1.1 |
-

| Ciliate (Paramecium caudatum) |

| 14982 | px | a.1.1.1 | d1dlwa_ | 1dlw A: |

| 46462 | sp | a.1.1.1 |
-

| Green alga (Chlamydomonas eugametos) |

| 14983 | px | a.1.1.1 | d1dlya_ | 1dly A: |

| 63437 | sp | a.1.1.1 |
-

| Mycobacterium tuberculosis |

| 62301 | px | a.1.1.1 | d1idra_ | 1idr A: |

+
-------
+
------
+
---------
+
---------
+
--------------------------------------
+

SCOP Tables

mysql> select * from cla limit 1;

+
---------
+
--------
+
---------
+
-------
+
-------
+
-------
+
-------
+
-------
+
-------
+
-------
+

| sid | pdb_id | sccs | cl | cf | sf | fa | dm | sp | px |

+
---------
+
--------
+
---------
+
-------
+
-------
+
-------
+
-------
+
-------
+
-------
+
-------
+

| d1dlwa_ | 1dlw | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 |

+
---------
+
--------
+
---------
+
-------
+
-------
+
-------
+
-------
+
-------
+
-------
+
-------
+


mysql> select * from des limit 1;

+
-------
+
------
+
------
+
------
+
--------------------
+

| id | type | sccs | sid | description |

+
-------
+
------
+
------
+
------
+
--------------------
+

| 46456 | cl | a |
-

| All alpha proteins |

+
-------
+
------
+
------
+
------
+
--------------------
+


mysql> select * from astral limit 1;

+
---------
+
---------
+
-----------------------------------------------------------
+

| sid | sccs | seq |

+
---------
+
---------
+
-----------------------------------------------------------
+

| d1dlwa_ | a.1.1.1 | slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg...|

+
---------
+
---------
+
-----------------------------------------------------------
+


mysql> select * from subchain limit 1;

+
----
+
-------
+
----------
+
-------
+
------
+

| id | px | chain_id | begin | end |

+
----
+
-------
+
----------
+
-------
+
------
+

| 1 | 14982 | A | | |

+
----
+
-------
+
----------
+
-------
+
------
+

SCOP Tables

mysql> select * from cla limit 1;

+
---------
+
--------
+
---------
+
-------
+
-------
+
-------
+
-------
+
-------
+
-------
+
-------
+

| sid | pdb_id | sccs | cl | cf | sf | fa | dm | sp | px |

+
---------
+
--------
+
---------
+
-------
+
-------
+
-------
+
-------
+
-------
+
-------
+
-------
+

| d1dlwa_ | 1dlw | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 |

+
---------
+
--------
+
---------
+
-------
+
-------
+
-------
+
-------
+
-------
+
-------
+
-------
+


mysql> select * from des limit 1;

+
-------
+
------
+
------
+
------
+
--------------------
+

| id | type | sccs | sid | description |

+
-------
+
------
+
------
+
------
+
--------------------
+

| 46456 | cl | a |
-

| All alpha proteins |

+
-------
+
------
+
------
+
------
+
--------------------
+


mysql> select * from astral limit 1;

+
---------
+
---------
+
-----------------------------------------------------------
+

| sid | sccs | seq |

+
---------
+
---------
+
-----------------------------------------------------------
+

| d1dlwa_ | a.1.1.1 | slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg...|

+
---------
+
---------
+
-----------------------------------------------------------
+


mysql> select * from subchain limit 1;

+
----
+
-------
+
----------
+
-------
+
------
+

| id | px | chain_id | begin | end |

+
----
+
-------
+
----------
+
-------
+
------
+

| 1 | 14982 | A | | |

+
----
+
-------
+
----------
+
-------
+
------
+

Querying Relational Databases


SQL = Structured Query
Language


Select

Which attributes?
from

Which tables?
where
Which conditions?



Select … from … where …


Distinct


Like


Union/intersect


Join


Count/average/sum/min/m
ax


Group by


Having



Show tables


Show databases


Use


Create database


Create table … as


Drop table


Load data


Insert into


Databases



Given SCOP as relational database, we can answer all
the questions raised above using the SQL constructs
of the previous slide!

Programming


We will use Python (Guido van Rossum, named after Monty
Python) as a convenient extension to the operating system


Easy to write quick programs


More than just a scripting language


Interpreted, interactive, indented


Supports string processing well


Widely used in bioinformatics


Object oriented, general purpose


Many nice libraries for database access, Graphics, Web, GUI, R…


Scientific orientation: Numerical Python (math), Scientific Python,
Biopython



Beware: Python is inefficient, but computationally expensive parts
can be included as C
-
libraries

Motivation: Families and Identity


We said that SCOP
families share >30% identity


What does that mean?


Any two structures in a family >30%?


At least one other member in family with >30%?


What is the
average sequence similarity within a
family
?
Within a superfamily
?



Given a sequence and that we know already which
superfamily it belongs to. Can we
find
the
superfamily’s
family best suited for the sequence


Two approaches: Blast vs. DIY


We can answer the above easily:


We use SCOP database and run database queries
from a Python script


For
a given
superfamily select
all corresponding
sequences
from the astral table


For all
pairs of selected sequences


Call
Blast
and
record
the sequence
identity


Or run your
own dynamic programming
algorithm and
record
the sequence
identity


For second problem:
Compare sequence to all family
sequences
and assign it to the family which shares
the highest (must be >30%) similarity with the
sequence

Motivation: Sequence vs. Structure


Can we
verify

the
plot
below?


Can we create
a similar plot for specific
superfamilies
? E.g. DNA
-
binding domains?

Picture from www.jenner.ac.uk/YBF/DanielleTalbot.ppt

30%

Family

Same
Superfamily,

But not family

Motivation: Sequence vs. Structure

Again:
select
the relevant
sequences
from the
astral table and besides
computing the
sequence identity
, we
compute structural
similarity
to the relevant structure using an
algorithm like Dali or CE

Then plot the two similarities against each other in
a scatter plot

Motivation:

Amino Acid Composition of Families



Can we characterise the
amino acid composition of
different families/superfamilies
?



Again:
select
the relevant
sequences
from astral
and
count the frequencies
of amino acids



Is the amino acid
composition at the interface
of a
domain
different
from the rest of the domain?

Motivation:

Let’s rebuild SCOP families


Given a
SCOP superfamily
and its sequences, how
can we
divide it into families
?



First, we need
dynamic programming
to determine
the
sequence similarity



Then we do the following:


For all pairs of sequences
, call the
sequence
similarity
algorithm and
record the similarity into a
distance matrix


Next,
run hierarchical clustering
to cluster the
sequences.

What’s needed…


…programming in Python

Python Programming Constructs


Variables, strings,


For/while Loops


If statements


File I/O


Regular expressions


Data structures: Lists, Hashes


Code Structure: Objects, classes, modules

Hello World in Python


Given a file
helloworld.py



Open a shell and type at the command prompt
helloworld.py


The shell then executes your programme


In the first line, it realises that the python interpreter
needs to be loaded and that what follows is a python
program


The line below prints a message

print "Hello World"

File: helloworld.py

Read a text file in python


The command
open
opens a text file and creates


“r”

as second argument after the filename indicates that file is
read (this is default, ie. can be left out)


“w”

as second argument indicates that file is written to


“a”

as second argument indicates that file is appended to


The for
-
loop reads all lines of the file one by one (requires python
>2.2)


The body of the loop prints them on the screen (note that
print

adds a new line automatically, avoid that with adding a ”
,
” )

data = open("seq.txt“, “r”)

for line in data:


print "Line:”, line,

acgt

gggt

File: seq.txt

File: fileIO.py

Line: acgt

Line: gggt

Output

Variables in Python



The
=

symbol is used to assign values to variables


The
+

symbol is also used to concatenate strings

lineNo = 1

for line in open(“seq.txt”):


print lineNo+”: ”+line,


lineNo = lineNo+1

acgt

gggt

File: seq.txt

File: fileIO.pl

1: acgt

2: gggt

Output

If
-
then
-
else and
strings in
Python

data = open("seq.txt")

line1 = data.readline().rstrip()

line2 = data.readline().rstrip()


len1=len(line1)

len2=len(line2)


if len1 < len2:


minLen = len1

else:


minLen = len2


line3 = ""


for i in range(minLen):


if line1[i] == line2[i]:


line3=line3+"*"


else:


line3=line3+" "


print "Sequence comparison"

print line1

print line2

print line3

acgt

gggt

File: seq.txt

File: seqcomp.py

Sequence comparison

acgt

gggt


**

Output

Programming
Example