COMP 5115 Programming Tools in Bioinformatics

Oct 2, 2013



Dr. Huseyin Seker

E-mail: hseker@dmu.ac.uk

Office: GH 6.57 The Gateway

http://www.cse.dmu.ac.uk/~hseker

Lecture: 14:00-16:00 Monday (GH4.74)

Lab: 16:00-17:00 Monday (GH4.74)

Assessment: Coursework (100%)

Availability for meetings: 13:00-14:00 Mondays/Wednesdays

What is Bioinformatics?


- Bioinformatics is a multidisciplinary subject that integrates developments in information and computer technology as applied to Biotechnology and Biological Sciences.

- Bioinformatics uses computer software tools for database creation, data management, data warehousing, data mining and global communication networking.

- Bioinformatics is the recording, annotation, storage, analysis, and searching/retrieval of nucleic acid sequence (genes and RNAs), protein sequence and structural information. This includes databases of the sequences and structural information as well as methods to access, search, analyse, visualise and retrieve the information.


Other definitions:
http://www.geocities.com/bioinformaticsweb/index.html



Important sub-disciplines within bioinformatics

- The development of new algorithms and statistics with which to assess relationships among members of large data sets

- The analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures

- The development and implementation of tools that enable efficient access and management of different types of information


Activities in bioinformatics


- Organization of biological data
  - The creation of databases of biological information
  - The maintenance of these databases

- Analysis of biological data
  - Development of methods to predict the structure and/or function of newly discovered proteins and structural RNA sequences
  - Clustering protein sequences into families of related sequences and the development of protein models
  - Aligning similar proteins and generating phylogenetic trees to examine evolutionary relationships


What are Programming Tools in Bioinformatics?

- Programming tools are software development tools that can be used to create bioinformatics tools.

- Bioinformatics tools are software programs designed to extract meaningful information from the mass of molecular biology / biological databases and to carry out sequence or structural analysis.

- Designing bioinformatics tools, software and programmes requires careful consideration because:
  - The end user (the biologist) may not be a frequent user of computer technology
  - These software tools must be made available over the internet, given the global distribution of the scientific research community
  - The software must handle massive amounts of information, including the ability to analyse both biological data and text

The Role of Programming Tools in Bioinformatics

[Figure: 3D Structures Growth. Source: http://www.rcsb.org/pdb/holdings.html]

Programming tools in bioinformatics need to deal with massive amounts of scattered and complex information (data/text) accurately, reliably, and effectively:

- The number of proteins in the Protein Data Bank (www.pdb.org) has reached 33 468 (all structures)

- UniProtKB/TrEMBL (http://us.expasy.org/sprot/) Release 31.1 of 27-Sep-2005 contains 2 151 724 entries

- Release 48.1 of 27-Sep-05 of UniProtKB/Swiss-Prot contains 195 058 sequence entries, comprising 70 674 903 amino acids abstracted from 134 132 references.
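
As a rough sense of scale, the Swiss-Prot figures above imply an average entry length of a few hundred amino acids. The following is only a minimal Python sketch of that arithmetic, using the two numbers quoted on this slide; the variable names are illustrative and not part of any database API.

```python
# Figures quoted above for UniProtKB/Swiss-Prot Release 48.1 (27-Sep-05).
sequence_entries = 195_058
amino_acids = 70_674_903

# Mean number of amino acids per sequence entry.
mean_length = amino_acids / sequence_entries

print(f"Mean sequence length: {mean_length:.1f} amino acids")
```

The result is a little over 360 amino acids per entry, which gives a feel for the size of a typical record a tool has to process.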


Programming Tools in Bioinformatics (just a few!)

- JAVA in Bioinformatics:
  - BioJava (www.biojava.org): Java tools for processing biological data, which include objects for manipulating sequences, dynamic programming, file parsers, simple statistical routines, etc.
  - Other examples: Physiome Sciences' computer-based biological simulation technologies and Bioinformatics Solutions' PatternHunter

- Perl in Bioinformatics:
  - BioPerl (www.bioperl.org): The BioPerl project is an international association of developers of Perl tools for bioinformatics and provides an online resource for modules, scripts and web links for developers of Perl-based software.


- Python is an interpreted, interactive programming language created by Guido van Rossum and first released in 1991. Python is fully dynamically typed and uses automatic memory management; it is thus similar to Perl, Ruby, Scheme, Smalltalk, and Tcl. Python is developed as an open source project, managed by the non-profit Python Software Foundation. Python 2.4.2 was released on September 28, 2005.

- Biopython (http://biopython.org/) and BioJava (www.biojava.org): Biopython and BioJava are open source projects with very similar goals to BioPerl.


- MATLAB Bioinformatics Toolbox: http://www.mathworks.com/products/bioinfo/

- R language for Statistical Computing (http://www.r-project.org/): R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.



Bioconductor: Bioinformatics with R (http://www.bioconductor.org/)

The broad goals of Bioconductor are to

- provide access to a wide range of powerful statistical and graphical methods for the analysis of genomic data;
- facilitate the integration of biological metadata in the analysis of experimental data, e.g. literature data from PubMed, annotation data from LocusLink;
- allow the rapid development of extensible, scalable, and interoperable software;
- promote high-quality and reproducible research;
- provide training in computational and statistical methods for the analysis of genomic data.

Microarray Software Comparison (http://ihome.cuhk.edu.hk/~b400559/arraysoft_rpackages.html)

A website organizing and commenting on links to R software for gene expression data analysis, including software not available from Bioconductor or CRAN.

Useful text books

- Developing Bioinformatics Computer Skills, by Cynthia Gibas, Per Jambeck, Lorrie LeJeune
- Bioinformatics: Managing Scientific Data, by Zoé Lacroix, Terence Critchlow
- Bioinformatics Computing, by Bryan Bergeron
- Beginning Perl for Bioinformatics, by James Tisdall
- R for Bioinformatics, by Kim Seefeld, Ernst Linder
- MATLAB Bioinformatics Toolbox: http://www.mathworks.com/products/bioinfo/


MATLAB

- Get the CD from the Student Support Centre (ground floor, The Gateway)

- On your PC at DMU: Start/Programs/MATLAB 7.04/MATLAB 7.04

- Get updated information and the latest user manual from the MathWorks web site: www.mathworks.com

- The URL for the MATLAB user manual:
  http://www.mathworks.com/access/helpdesk/help/pdf_doc/matlab/refbook.pdf
  or
  http://www.mathworks.com/access/helpdesk/help/techdoc/matlab.shtml

- The URL for the Bioinformatics Toolbox user manual:
  http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/


MATLAB

- The name MATLAB stands for matrix laboratory

- A high-performance language for technical computing

- Used extensively in industry and universities

- An interactive system whose basic data element is an array that does not require dimensioning. This allows you to solve many technical computing problems, especially those with matrix and vector formulations

- Features a family of add-on application-specific solutions called toolboxes

- Toolboxes (e.g., bioinformatics) are comprehensive collections of MATLAB functions (M-files) that extend the MATLAB environment to solve particular classes of problems

Bioinformatics Toolbox


The Bioinformatics Toolbox extends MATLAB® to provide an integrated and
extendable software environment for genome and proteome analysis. Together,
MATLAB and the Bioinformatics Toolbox give scientists and engineers a set of
computational tools to solve problems and build applications in drug discovery,
genetic engineering, and biological research.


You can use the basic bioinformatic functions provided with this toolbox to
create more complex algorithms and applications.


- Data formats and databases: Connect to Web-accessible databases. Read and convert between multiple data formats.

- Sequence analysis: Determine statistical characteristics of data. Manipulate and align sequences. Model patterns in biological sequences using Hidden Markov Model (HMM) profiles.

- Phylogenetic analysis: Create and manipulate phylogenetic tree data.

- Microarray data analysis: Read, normalize, and visualize microarray data.

- Mass spectrometry data analysis: Analyze and enhance raw mass spectrometry data.

- Statistical learning: Classify and identify features in data sets with statistical learning tools.

- Programming interface: Use other bioinformatic software (Bioperl and BioJava) within the MATLAB environment.


http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo
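
To make "statistical characteristics" of a sequence concrete, here is a minimal Python sketch that counts nucleotides and computes GC content. It is not the toolbox's implementation; the function names `base_composition` and `gc_content` and the example sequence are made up for illustration.

```python
from collections import Counter


def base_composition(sequence: str) -> dict:
    """Count A, C, G and T in a DNA sequence (case-insensitive)."""
    counts = Counter(sequence.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


def gc_content(sequence: str) -> float:
    """Fraction of the A/C/G/T bases that are G or C; 0.0 if there are none."""
    comp = base_composition(sequence)
    total = sum(comp.values())
    return (comp["G"] + comp["C"]) / total if total else 0.0


if __name__ == "__main__":
    dna = "ATGCGCGATTA"  # made-up example sequence
    print(base_composition(dna))
    print(f"GC content: {gc_content(dna):.2f}")
```

Characters other than A, C, G and T (for example the ambiguity code N) are ignored in this sketch; a real tool would need an explicit policy for them.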


MATLAB Desktop


Let's have a look at how MATLAB handles matrices

>> A = [1 3 2 1; 5 10 11 8; 9 6 7 1; 4 15 14 1]

MATLAB displays the matrix you just entered.

A =

     1     3     2     1
     5    10    11     8
     9     6     7     1
     4    15    14     1

Let's apply the function sum to the matrix. sum(A) returns the sum of each column.

>> sum(A)

ans =

    19    34    34    11

A' is the transpose of A, so sum(A') returns the sum of each row.

>> B = sum(A')

B =

     7    34    23    34
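
For comparison, the same column and row sums can be reproduced in Python, one of the languages listed earlier. This is only an illustrative sketch using plain lists; the names `column_sums` and `row_sums` are made up here.

```python
# The same 4x4 matrix as in the MATLAB session, stored as a list of rows.
A = [
    [1, 3, 2, 1],
    [5, 10, 11, 8],
    [9, 6, 7, 1],
    [4, 15, 14, 1],
]

# zip(*A) regroups the rows into columns, i.e. it transposes the matrix.
column_sums = [sum(column) for column in zip(*A)]  # like sum(A) in MATLAB
row_sums = [sum(row) for row in A]                 # like sum(A') in MATLAB

print(column_sums)  # [19, 34, 34, 11]
print(row_sums)     # [7, 34, 23, 34]
```

The explicit loops show what MATLAB does implicitly: sum works down each column, and transposing first turns row sums into column sums.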



Now, let's run the function fuzdemos, which performs several fuzzy logic based applications

Editor/Debugger M-file

- To see an existing M-file
  - go to File/open, or
  - browse the current directory

- To create a new M-file
  - go to File/new/M-file, or
  - simply use notepad and save the file as an M-file (e.g., flower.m)
