COMP 5115 Programming Tools in Bioinformatics

Oct 2, 2013

Dr. Huseyin Seker

E-mail: hseker@dmu.ac.uk

Office: GH 6.57 The Gateway

http://www.cse.dmu.ac.uk/~hseker

Lecture: 14:00-16:00 Monday (GH4.74)

Lab: 16:00-17:00 Monday (GH4.74)


Assessment

Coursework (100%)


Coursework-1 (30%)

Coursework-2 (70%)

What is Bioinformatics?


Bioinformatics is a multidisciplinary subject that integrates developments in information and computer technology as applied to Biotechnology and Biological Sciences.

Bioinformatics uses computer software tools for database creation, data management, data warehousing, data mining and global communication networking.

Bioinformatics is the recording, annotation, storage, analysis, and searching/retrieval of nucleic acid sequence (genes and RNAs), protein sequence and structural information. This includes databases of the sequences and structural information as well as methods to access, search, analyse, visualise and retrieve the information.


Other definitions:
http://www.geocities.com/bioinformaticsweb/index.html



Important sub-disciplines within bioinformatics


The development of new algorithms and statistics with which to
assess relationships among members of large data sets


The analysis and interpretation of various types of data including
nucleotide and amino acid sequences, protein domains, and
protein structures


The development and implementation of tools that enable
efficient access and management of different types of
information


Activities in bioinformatics


Organization of biological data



The creation of databases of biological information


The maintenance of these databases


Analysis of biological data


Development of methods to predict the structure and/or
function of newly discovered proteins and structural RNA
sequences.


Clustering protein sequences into families of related
sequences and the development of protein models.


Aligning similar proteins and generating phylogenetic trees to
examine evolutionary relationships


What are Programming Tools in Bioinformatics?


Programming tools are software development tools that can be used to create bioinformatics tools.

Bioinformatics tools are software programs designed to extract meaningful information from the mass of molecular biology / biological databases and to carry out sequence or structural analysis.


Designing bioinformatics tools, software and programmes requires careful consideration because:

The end user (the biologist) may not be a frequent user of computer technology

These software tools must be made available over the internet, given the global distribution of the scientific research community

The software must handle massive amounts of information and be able to analyse both biological data and text


The Role of Programming Tools in Bioinformatics

Programming Tools in Bioinformatics need to deal with massive amounts of scattered and complex information (data/text) accurately, reliably, and effectively:


- The number of structures in the Protein Data Bank (www.pdb.org) reached 53 263 as of 23/09/08. It was 46 051 on 01/10/07, 39 051 on 02/10/2006 and 33 468 in October 2005.


- UniProtKB/TrEMBL (http://us.expasy.org/sprot/)

Release 39.2 of 23-Sep-2008: 6 534 543 entries

Release 37.2 of 11-Sep-2007: 4 864 587 entries

Release 33.7 of 19-Sep-2006: 3 189 332 entries

It was also 2 151 724 entries in September 2005.


- UniProtKB/Swiss-Prot

Release 56.2 of 23-Sep-08 of UniProtKB/Swiss-Prot contains 398 181 sequence entries, comprising 143 572 911 amino acids abstracted from 173 235 references.

Release 54.2 of 11-Sep-07 of UniProtKB/Swiss-Prot contains 283 454 sequence entries, comprising 104 030 551 amino acids abstracted from 159 599 references.

Release 50.7 of 19-Sep-06 of UniProtKB/Swiss-Prot contains 232 345 sequence entries, comprising 85 424 566 amino acids abstracted from 145 962 references.

Release 48.1 of 27-Sep-05 of UniProtKB/Swiss-Prot contained 195 058 sequence entries, comprising 70 674 903 amino acids abstracted from 134 132 references.
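To make the scale of this growth concrete, the snapshot counts quoted above can be turned into approximate year-on-year percentage increases. The following is a minimal Python sketch; the dictionary layout and the helper name `yearly_growth` are illustrative choices, and the snapshots are treated as exactly one year apart although the release dates differ by a few days.

```python
# Snapshot counts copied from the figures quoted above (autumn of each year).
# The data layout and the helper name are illustrative, not from any library.
counts = {
    "PDB structures": [
        (2005, 33468), (2006, 39051), (2007, 46051), (2008, 53263),
    ],
    "UniProtKB/TrEMBL entries": [
        (2005, 2151724), (2006, 3189332), (2007, 4864587), (2008, 6534543),
    ],
    "UniProtKB/Swiss-Prot entries": [
        (2005, 195058), (2006, 232345), (2007, 283454), (2008, 398181),
    ],
}


def yearly_growth(series):
    """Return (year, percent increase over the previous snapshot) pairs."""
    growth = []
    for (_, previous), (year, current) in zip(series, series[1:]):
        growth.append((year, 100.0 * (current - previous) / previous))
    return growth


for name, series in counts.items():
    print(name)
    for year, percent in yearly_growth(series):
        print(f"  {year}: +{percent:.1f}%")
```

Running the sketch prints one line per year for each database, which makes it easy to compare how quickly the curated (Swiss-Prot), automatically annotated (TrEMBL) and structural (PDB) collections grew over the same period.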

[Figure: Number of entries in the UniProtKB/TrEMBL protein database. Source: http://www.ebi.ac.uk/swissprot/sptr_stats/index.html]

[Figure: 3D structures growth. Source: http://www.rcsb.org/pdb/holdings.html]

Programming Tools in Bioinformatics (just a few!)


JAVA in Bioinformatics:


BioJava (www.biojava.org): Java tools for processing biological data, including objects for manipulating sequences, dynamic programming, file parsers, simple statistical routines, etc.

Other examples: Physiome Sciences' computer-based biological simulation technologies and Bioinformatics Solutions' PatternHunter

Perl in Bioinformatics:


BioPerl (www.bioperl.org): The BioPerl project is an international association of developers of Perl tools for bioinformatics and provides an online resource for modules, scripts and web links for developers of Perl-based software.


Python is an interpreted, interactive programming language created by Guido van Rossum and first released in 1991. Python is fully dynamically typed and uses automatic memory management; it is thus similar to Perl, Ruby, Scheme, Smalltalk, and Tcl. Python is developed as an open source project, managed by the non-profit Python Software Foundation. Python 2.4.2 was released on September 28, 2005.


Biopython (http://biopython.org/) and BioJava (www.biojava.org): Biopython and BioJava are open source projects with very similar goals to BioPerl


MATLAB Bioinformatics Toolbox: http://www.mathworks.com/products/bioinfo/



R language for Statistical Computing (http://www.r-project.org/): R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS



Bioconductor: Bioinformatics with R (http://www.bioconductor.org/)


The broad goals of Bioconductor are to


provide access to a wide range of powerful statistical and graphical
methods for the analysis of genomic data;


facilitate the integration of biological metadata in the analysis of
experimental data: e.g. literature data from PubMed, annotation
data from LocusLink;


allow the rapid development of extensible, scalable, and
interoperable software;


promote high-quality and reproducible research;


provide training in computational and statistical methods for the
analysis of genomic data.


Microarray Software Comparison

(http://ihome.cuhk.edu.hk/~b400559/arraysoft_rpackages.html)


A website organizing and commenting on links to R software for gene expression data analysis, including software not available from Bioconductor or CRAN.

Useful text books

Developing Bioinformatics Computer Skills, by Cynthia Gibas, Per Jambeck, Lorrie LeJeune

Bioinformatics: Managing Scientific Data, by Zoé Lacroix, Terence Critchlow

Bioinformatics Computing, by Bryan Bergeron

Beginning Perl for Bioinformatics, by James Tisdall

R for Bioinformatics, by Kim Seefeld, Ernst Linder

MATLAB Bioinformatics Toolbox: http://www.mathworks.com/products/bioinfo/


MATLAB


At DMU: Start/Programs/MATLAB

Get updated information and the latest user manual from the MathWorks web site: www.mathworks.com

The URL for the MATLAB user manual:

http://www.mathworks.com/access/helpdesk/help/pdf_doc/matlab/refbook.pdf

or

http://www.mathworks.com/access/helpdesk/help/techdoc/matlab.shtml

The URL for the Bioinformatics Toolbox user manual:

http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/


MATLAB


The name MATLAB stands for matrix laboratory

A high-performance language for technical computing


Used extensively in industry and universities


An interactive system whose basic data element is an
array that does not require dimensioning. This allows
you to solve many technical computing problems,
especially those with matrix and vector formulations


Features a family of add-on application-specific solutions called toolboxes

Toolboxes (e.g., bioinformatics) are comprehensive collections of MATLAB functions (M-files) that extend the MATLAB environment to solve particular classes of problems

Bioinformatics Toolbox


The Bioinformatics Toolbox extends MATLAB® to provide an integrated and
extendable software environment for genome and proteome analysis. Together,
MATLAB and the Bioinformatics Toolbox give scientists and engineers a set of
computational tools to solve problems and build applications in drug discovery,
genetic engineering, and biological research.


You can use the basic bioinformatic functions provided with this toolbox to
create more complex algorithms and applications.


Data formats and databases: Connect to Web accessible databases. Read and convert between multiple data formats.

Sequence analysis: Determine statistical characteristics of data. Manipulate and align sequences. Model patterns in biological sequences using Hidden Markov Model (HMM) profiles.

Phylogenetic analysis: Create and manipulate phylogenetic tree data.

Microarray data analysis: Read, normalize, and visualize microarray data.

Mass spectrometry data analysis: Analyze and enhance raw mass spectrometry data.

Statistical Learning: Classify and identify features in data sets with statistical learning tools.

Programming interface: Use other bioinformatic software (BioPerl and BioJava) within the MATLAB environment.


http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo
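As a rough illustration of what "statistical characteristics" of a sequence can mean in practice, the sketch below counts the four DNA bases and computes the GC fraction. It is written in plain Python rather than with the toolbox, so it is not the toolbox's API; the function names `base_composition` and `gc_content` and the example sequence are made up for this illustration.

```python
from collections import Counter


def base_composition(seq):
    """Count A, C, G and T in a DNA sequence (case-insensitive)."""
    counts = Counter(seq.upper())
    # Report all four bases, including any that never occur.
    return {base: counts.get(base, 0) for base in "ACGT"}


def gc_content(seq):
    """Return the fraction of A/C/G/T positions that are G or C."""
    composition = base_composition(seq)
    total = sum(composition.values())
    if total == 0:
        return 0.0  # empty sequence or no recognised bases
    return (composition["G"] + composition["C"]) / total


example = "ATGCGCGATTACA"  # made-up sequence, for illustration only
print(base_composition(example))
print(f"GC content: {gc_content(example):.3f}")
```

Characters other than A, C, G and T (for example the ambiguity code N) are simply ignored here; a real analysis would need to decide how to treat them.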


MATLAB Desktop


Let’s have a look at how MATLAB handles matrices

>> A = [1 3 2 1; 5 10 11 8; 9 6 7 1; 4 15 14 1]

MATLAB displays the matrix you just entered.

A =

     1     3     2     1
     5    10    11     8
     9     6     7     1
     4    15    14     1


Let’s apply the function sum to the matrix. sum(A) adds up each column:

>> sum(A)

ans =

    19    34    34    11

Summing the transpose A' adds up each row instead:

>> B = sum(A')

B =

     7    34    23    34
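For comparison, the same column and row totals can be reproduced in plain Python, one of the other languages covered in this module. This is only a sketch to show what the two MATLAB calls compute; the variable names `column_sums` and `row_sums` are illustrative.

```python
# The same 4x4 matrix as in the MATLAB session, stored as a list of rows.
A = [
    [1, 3, 2, 1],
    [5, 10, 11, 8],
    [9, 6, 7, 1],
    [4, 15, 14, 1],
]

# zip(*A) regroups the rows into columns, i.e. it walks the transpose of A.
column_sums = [sum(column) for column in zip(*A)]  # corresponds to sum(A)
row_sums = [sum(row) for row in A]                 # corresponds to sum(A')

print(column_sums)  # [19, 34, 34, 11]
print(row_sums)     # [7, 34, 23, 34]
```

MATLAB's sum works down the columns of a matrix by default, which is why transposing A first turns the column totals into row totals.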



Now, let’s run the function fuzdemos, which runs several fuzzy logic based applications

Editor/Debugger M-file

To see an existing M-file

go to File/Open, or

open it from the current directory

To create a new M-file

go to File/New/M-file, or

simply use Notepad and save the file as an M-file (e.g., flower.m)

How to find your way through MATLAB and Bioinformatics Toolbox


MATLAB


MATLAB Getting Started

http://www.mathworks.com/access/helpdesk/help/techdoc/learn_matlab/


Information can be found under various titles, e.g.:

Matrices and Arrays

Graphics

Programming

Creating Graphical User Interfaces

Desktop Tools and Development Environment


The MATLAB Function Reference is also very useful for looking up functions

The “help” function is also useful for finding out what a function does if you know its name



Bioinformatics Toolbox


http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/

Coursework 1

Report on Bioinformatics Programming Tools

Date set: Monday, 29th of September 2008

Deadline for submission: Friday, 31st of October 2008 (by 12:00 pm)

This coursework contributes 30% towards your final grade

Details can be found on the module website