Basic Concepts of Bioinformatics - GeoCities

Oct 2, 2013

How Bioinformatics can change your life


Basic Concepts of
Bioinformatics


M. Alroy Mascrenghe

MBCS, MIEEE, MIT

mark_ai@yahoo.com


A lecture given for the BCS Wolverhampton Branch at the University of Wolverhampton


http://www.geocities.com/mark_ai/


TOC


Introduction


Basic concepts in Molecular biology


Bioinformatics techniques


Areas in bioinformatics


Applications


Related Computer Technology


Conference in Glasgow


Acknowledgements


Reference


Introduction……


2000


A Major event happened that was to
change the course of human history


It was a joint British and American
effort


nothing to do with IRAQ!


It was a race: who would finish first?

Race test: not whether they have taken drugs, but whether they can
produce them!

The human genome was sequenced


A Situ…somewhere in the
near future


A virus (not the ‘I love you’ virus) creates an epidemic

Geneticists and bioinformaticians roll up their sleeves

Genetic material of the virus is compared with the existing base of known genetic material of other viruses, as the characteristics of the other viruses are known

From the genetic material, computer programs will derive the proteins necessary for the survival of the virus

When a protein (sequence and structure) is known, medicines can be designed



What is Bioinformatics?


The marriage between computer
science and molecular biology


The algorithms and techniques of computer science are being used to solve the problems faced by molecular biologists

‘Information technology applied to the management and analysis of biological data’

Storage and analysis are two of the important functions; bioinformaticians build tools for each


[Diagram relating Bioinformatics to Biology, Chemistry, Statistics and Computer Science]


What is..


This is the age of Information Technology

However, storing information is nothing new

Information equal in volume to the Encyclopaedia Britannica is stored in each of our cells


‘Bioinformatics tries to determine
what info is biologically important’


Basics of Molecular Biology….


DNA & Genes


DNA is where the genetic information is
stored


Traits such as blonde hair and blue eyes are inherited through it

Gene - the basic unit of heredity

There are genes for characteristics, e.g. a gene for blond hair

Genes contain the information as a sequence of nucleotides

Genes are abstract concepts, like longitude and latitude, in the sense that you cannot see them separately

Genes are made up of nucleotides



Nucleotide (nt)


Each nt is made up of

Sugar

Phosphate group

Base

The base it contains is the only difference between one nt and another

There are 4 different bases

G(uanine), A(denine), T(hymine), C(ytosine)

The information is in the order of the nucleotides; the order is the information

Genes can be many thousands of nt long

The complete set of genetic instructions is called the genome




Chromosomes


DNA strings make chromosomes

Analogy

Letters - nt

Sentences - genes

Individual volumes of the Encyclopaedia Britannica - chromosomes

All volumes together - genome




Double Helix


The DNA is a double helix

Each strand has complementary information

Each particular base in one strand is bonded with another particular base in the other strand

G - C

A - T

For example

AATGC - one strand

TTACG - other strand
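The pairing rule above can be sketched in a few lines of Python. This is only an illustration; the names PAIRS and complement are chosen here and do not come from the lecture.

```python
# Base-pairing rules from the slide: G pairs with C, A pairs with T.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIRS[base] for base in strand)

print(complement("AATGC"))  # TTACG
```

Applying complement twice gives back the original strand, which is one way to see that each strand carries the same information.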


Proteins


Proteins are a very important biological feature

Amino acids make up the proteins

There are 20 different amino acids

The function of a protein is dependent on the order of the amino acids


Proteins…



The information required to make a protein's amino acid (aa) sequence is stored in DNA

DNA sequence determines amino acid sequence

Amino acid sequence determines protein structure

Protein structure determines protein function

A substance called RNA carries the information stored in the DNA, which in turn is used to make proteins

Storage - DNA

Information transfer - RNA

RNA is the messenger boy!


Central dogma



DNA -> (transcription, by RNA polymerase) -> RNA -> (translation, by ribosomes) -> Protein


Proteins…..


Since there are 20 amino acids to encode, one nt cannot correspond to one aa (only 4 possibilities), and neither can pairs of nt (only 16)

So protein information is carried in triplet codes: codons

The codons that do not correspond to an amino acid are stop codons: UAA, UAG, UGA (RNA has U instead of T)

Some codons are used as start codons, e.g. AUG, which also codes for methionine
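A minimal sketch of the codon idea in Python follows. The codon table is deliberately partial (a handful of the 64 codons, picked for illustration), the input sequence is made up, and transcribe simply rewrites the coding strand with U for T rather than modelling the real enzyme.

```python
# Partial codon table: only a few of the 64 codons, chosen for illustration.
CODONS = {
    "AUG": "Met",  # also used as the start codon
    "UUU": "Phe", "GGC": "Gly", "GCU": "Ala", "AAA": "Lys",
    "UAA": None, "UAG": None, "UGA": None,  # stop codons
}

def transcribe(coding_strand: str) -> str:
    """RNA has U where DNA has T (coding strand given as input)."""
    return coding_strand.replace("T", "U")

def translate(rna: str) -> list:
    """Read codons from the first AUG until a stop codon."""
    start = rna.find("AUG")
    if start == -1:
        return []
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino_acid = CODONS[rna[i:i + 3]]  # KeyError outside the partial table
        if amino_acid is None:
            break
        protein.append(amino_acid)
    return protein

# 4 bases give 4 singles, 16 pairs, 64 triplets; only triplets cover 20 aa.
print([4 ** n for n in (1, 2, 3)])                 # [4, 16, 64]
print(translate(transcribe("CCATGTTTGGCTAAGG")))   # ['Met', 'Phe', 'Gly']
```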




Protein Structure


Shows a wide variety, as opposed to DNA, whose structure is uniform

X-ray crystallography or Nuclear Magnetic Resonance (NMR) is used to figure out the structure

Structure is related to function, or rather, structure determines function

Although proteins are created as a linear chain of aa, they fold into a 3D structure

If you stretch them and let go, they will go back to this structure: this is the native structure of a protein

A protein functions well only in its native structure

Even after translation is over, a protein goes through some changes to its structure


Gene Expression


Gene expression: the process of transcribing DNA and translating RNA to make protein

Where do the genes begin in a chromosome?

How does RNA polymerase identify the beginning of a gene when making the RNA for a protein?

A single nt cannot be taken to mark the beginning of a gene, as each nt occurs frequently

But a particular combination of nucleotides can

Promoter sequences: the order of nt which marks the beginning of a gene
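To illustrate why a combination of nucleotides is a better marker than a single nt, here is a small Python sketch that counts occurrences of a motif. The sequence is made up, and TATAAT is used only as an example motif (it is a well-known consensus found in bacterial promoters); the lecture itself does not name a specific promoter sequence.

```python
def find_motif(sequence: str, motif: str) -> list:
    """Return every 0-based position where motif occurs (overlaps allowed)."""
    hits = []
    pos = sequence.find(motif)
    while pos != -1:
        hits.append(pos)
        pos = sequence.find(motif, pos + 1)
    return hits

dna = "GGCTATAATGCGTACGTTATAATCC"       # made-up sequence
print(len(find_motif(dna, "A")))        # a single nt occurs many times
print(find_motif(dna, "TATAAT"))        # [3, 17]
```

Real promoter recognition tolerates mismatches and depends on spacing between elements, so exact string search is only a first approximation.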


Bioinformatics
Techniques…..


Prediction and Pattern
Recognition


The two main areas of bioinformatics are

Pattern recognition: ‘a particular sequence or structure has been seen before’, and a particular characteristic can be associated with it

Prediction: from a sequence (what we know) we can predict the structure and function (what we don’t know)



Dot plots….


A simple way of evaluating similarity between two sequences

In a graph, one sequence is placed along one axis and the other sequence along the other axis

Where there are matches between the two sequences, the graph is marked
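A dot plot as described above can be drawn as text. The sketch below is one possible rendering in Python, using the two sequences from the alignment slides; the function name and the '*' and '.' symbols are choices made here.

```python
def dot_plot(seq_a: str, seq_b: str) -> list:
    """One text row per character of seq_b; '*' marks a match with seq_a."""
    rows = []
    for b in seq_b:
        rows.append("".join("*" if a == b else "." for a in seq_a))
    return rows

seq_a, seq_b = "TTACTATA", "TAGATA"
print("  " + seq_a)
for b, row in zip(seq_b, dot_plot(seq_a, seq_b)):
    print(b + " " + row)
```

Runs of '*' along a diagonal indicate stretches where the two sequences agree.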


Alignments


A pairing of the characters of two or more sequences to assess their similarity

E.g.

TTACTATA
TAGATA

There are many ways to align the above two sequences

1.
TTACTATA
TAGATA

2.
TTACTATA
 TAGATA

3.
TTACTATA
  TAGATA

So which one do we choose, and on what basis?

The solution is to provide a match score and a mismatch score
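Scoring the candidate placements can be sketched as follows. The values +1 for a match and -1 for a mismatch are assumptions made for this example, and the offsets 0, 1 and 2 correspond to sliding the shorter sequence one position further right each time.

```python
def ungapped_score(long_seq, short_seq, offset, match=1, mismatch=-1):
    """Score short_seq placed under long_seq starting at offset, no gaps."""
    score = 0
    for i, ch in enumerate(short_seq):
        score += match if long_seq[offset + i] == ch else mismatch
    return score

for offset in range(3):
    print(offset, ungapped_score("TTACTATA", "TAGATA", offset))
# 0 0
# 1 -2
# 2 0
```

With these particular scores, offsets 0 and 2 tie, which is one motivation for allowing gaps.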


Gaps


Introduce gaps and a penalty score for gaps

TTACTATA
T_A_GATA

In gap scoring, a single indel which is two characters long is preferred to two indels which are each one character long

However, not all gaps are bad

TTGCAATCT
CAA

How do we align?

TTGCAATCT
---CAA---

These end gaps are not biologically significant

Semi-global alignments do not penalize them
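One common way to prefer a single long indel over several short ones is to charge a cost for opening a gap plus a smaller cost per gap character. The Python sketch below assumes illustrative values (-3 to open, -1 per character); neither the values nor the function name come from the lecture.

```python
def gap_penalty(gapped: str, gap_open=-3, gap_extend=-1, gap_char="_") -> int:
    """Total penalty: each run of gaps costs gap_open plus gap_extend per character."""
    penalty = 0
    in_gap = False
    for ch in gapped:
        if ch == gap_char:
            if not in_gap:
                penalty += gap_open
            penalty += gap_extend
            in_gap = True
        else:
            in_gap = False
    return penalty

print(gap_penalty("T_A_GATA"))  # two indels of length 1: -8
print(gap_penalty("T__AGATA"))  # one indel of length 2: -5
# Semi-global idea: strip the end gaps before scoring, so they cost nothing.
print(gap_penalty("---CAA---".strip("-"), gap_char="-"))  # 0
```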



Scoring Matrix


For DNA/protein sequence alignment we create a matrix of scores

A aligned with A: score is 1

A aligned with T: score is -5

A aligned with C: score is -1
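As a sketch, a scoring matrix can be held as a lookup table keyed by pairs of characters. Only the three scores given on the slide are filled in; treating the matrix as symmetric (A with T scores the same as T with A) is an assumption made here.

```python
# Only the three pair scores given on the slide; other pairs are not defined.
SCORES = {("A", "A"): 1, ("A", "T"): -5, ("A", "C"): -1}

def pair_score(x: str, y: str) -> int:
    """Look up a pair score, treating the matrix as symmetric."""
    if (x, y) in SCORES:
        return SCORES[(x, y)]
    return SCORES[(y, x)]  # KeyError if the pair was not given

def column_scores(top: str, bottom: str) -> int:
    """Sum pair scores over the aligned columns of two equal-length strings."""
    return sum(pair_score(x, y) for x, y in zip(top, bottom))

print(column_scores("AAA", "ATC"))  # 1 + (-5) + (-1) = -5
```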


Dynamic Programming


As the length of the query sequences increases, and the difference in length between the two sequences also increases, more gaps have to be inserted in various places

We cannot perform an exhaustive search

Combinatorial explosion occurs: too many combinations to search through

Dynamic programming avoids this: the best alignment is built up from the best alignments of shorter prefixes, so each sub-problem is solved only once
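A minimal sketch of dynamic programming for global alignment, in the style of the Needleman-Wunsch recurrence, is given below. It returns only the best score, uses a simple per-character gap penalty, and the score values (+1, -1, -2) are assumptions for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b (Needleman-Wunsch recurrence)."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] = best score for aligning a[:i] with b[:j]
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = table[i - 1][j] + gap      # a[i-1] aligned with a gap
            left = table[i][j - 1] + gap    # b[j-1] aligned with a gap
            table[i][j] = max(diag, up, left)
    return table[-1][-1]

print(global_alignment_score("AC", "AGC"))  # 0: A-C against AGC
print(global_alignment_score("TTACTATA", "TAGATA"))
```

The table has (len(a) + 1) x (len(b) + 1) cells and each is filled once, which is what keeps the search from exploding.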




Databases


Sequence information is stored in databases

So that it can be manipulated easily

The databases (listed below) are located at different places

They exchange information on a daily basis so that they are up-to-date and in sync

Primary db: sequence data

Major Primary DB

Nucleic Acid     Protein
EMBL (Europe)    PIR - Protein Information Resource
GenBank (USA)    MIPS
DDBJ (Japan)     SWISS-PROT (University of Geneva, now with EBI)
                 TrEMBL (a supplement to SWISS-PROT)
                 NRL-3D


Composite DB


As there are many db, which one should we search? Some are good in some aspects and weak in others

A composite db is the answer: it has several db as its base data

Searches on these db are indexed and streamlined so that the same stored sequence is not searched twice in different db




Composite DB


OWL has these as its primary db

SWISS-PROT (top priority)

PIR

GenBank

NRL-3D
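The idea of a non-redundant composite built from sources in priority order can be sketched as below. This is not how OWL is actually implemented; the entry IDs and sequences are made up, and build_composite is a hypothetical helper.

```python
def build_composite(sources):
    """Merge (db_name, records) pairs, given in priority order, keeping each
    distinct sequence once under the first (highest-priority) source."""
    composite = {}
    for db_name, records in sources:
        for entry_id, sequence in records:
            if sequence not in composite:
                composite[sequence] = (db_name, entry_id)
    return composite

# Hypothetical entries; IDs and sequences are made up for illustration.
sources = [
    ("SWISS-PROT", [("SP1", "MKTAYIAK"), ("SP2", "MGLSDGEW")]),
    ("PIR",        [("PIR1", "MKTAYIAK"), ("PIR2", "MVLSPADK")]),
]
composite = build_composite(sources)
print(len(composite))          # 3
print(composite["MKTAYIAK"])   # ('SWISS-PROT', 'SP1')
```

Because the duplicate sequence is stored once, a search over the composite never compares against it twice.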


Secondary db


Store secondary structure information or results of searches of the primary db

Secondary DB    Primary Source
PROSITE         SWISS-PROT
PRINTS          OWL


Database Searches


We have sequenced and identified genes, so we know what they do

The sequences are stored in databases

So if we find a new gene in the human genome, we compare it with the already known genes stored in the databases

Since the databases hold a large number of sequences, we cannot do a full sequence alignment against each and every one

So heuristics must be used to narrow the search
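One such heuristic is to keep only the database sequences that share short words (k-mers) with the query and to align just those; word-based seeding of this kind is the idea behind tools such as BLAST and FASTA. The sketch below is a simplified illustration, with a made-up database and assumed values for k and the threshold.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def candidates(query, database, k=3, min_shared=2):
    """Names of database sequences sharing at least min_shared k-mers with query."""
    query_words = kmers(query, k)
    return [name for name, seq in database.items()
            if len(query_words & kmers(seq, k)) >= min_shared]

database = {"seq1": "TTACTATA", "seq2": "GGGCCCGG", "seq3": "CTATAGGA"}
print(candidates("ACTATA", database))  # ['seq1', 'seq3']
```

Only the surviving candidates would then be passed to a full alignment.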


Areas in
Bioinformatics…



Genomics


In a multicellular organism, each cell type expresses its genes in a different way, although every cell has the same genetic content

I.e. all the information for a liver cell to be a liver cell is also present in a nose cell, so gene expression is the only thing that differentiates them


Genomics - Finding Genes

Gene in sequence data: a needle in a haystack

However, whereas a needle is different from the hay, genes do not look different from the rest of the sequence data

In a whole array of nt, we try to find a set of nt and mark its borders as a gene

This is one of the challenges of bioinformatics

Neural networks and dynamic programming are being employed
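As a far simpler stand-in for the neural network and dynamic programming methods mentioned above, the sketch below marks candidate gene borders by scanning for open reading frames: a start codon (ATG) followed, in the same frame, by a stop codon. It looks at one strand only, the sequence is made up, and min_codons is an assumed threshold.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end) spans that run from ATG to the first in-frame stop.

    Scans the given strand only; end is exclusive and includes the stop codon.
    """
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                break
    return orfs

dna = "CCATGAAATTTGGGTAACC"   # made-up sequence
print(find_orfs(dna))          # [(2, 17)]
```

Real gene finders must also cope with the reverse strand, introns and statistical signals, which is why the problem remains hard.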




Organism        Genome Size (Mb; 1 Mb = 1,000,000 bp)   Gene Number   Web Site
Yeast           13.5                                    6,241         http://genome-www.stanford.edu/Saccharomyces
Fruit Flies     180                                     13,601        http://flybase.bio.indiana.edu
Homo Sapiens    3,000                                   45,000        http://www.ncbi.nlm.nih.gov/genome/guide


Proteomics


The proteome is the sum total of an organism's proteins

More difficult than genomics

Genomics                  Proteomics
4 building blocks (nt)    20 building blocks (aa)
Simple chemical makeup    Complex chemical makeup
Can duplicate             Can't duplicate

We are entering the ‘post-genome era’

Meaning much has been done with the genes, not that it is all over


Proteomics…..


The relationship between an RNA and the protein it codes for is usually far from direct

After translation, proteins do change

So the aa sequence does not tell us anything about post-translational changes

Many proteins are not active until they are combined into a larger complex or moved to a relevant location inside or outside the cell

So the aa sequence only hints at these things

Also, proteins must be handled more carefully in labs, as they tend to change when in contact with an inappropriate material


Protein Structure Prediction


One of the biggest challenges of bioinformatics, and especially of biochemistry

There is currently no algorithm that consistently predicts the structure of proteins


Structure Prediction methods


Comparative Modeling


The target protein's structure is modeled on the structures of related proteins

Proteins with similar sequences are searched for known structures


Phylogenetics


The taxonomical system reflects evolutionary relationships

Phylogenetic trees depict evolutionary relationships as a picture/graph

Rooted trees: there is a single common ancestor

Unrooted trees: just show the relationships

Phylogenetic tree reconstruction algorithms are also an area of research
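Distance-based tree building methods typically begin by measuring how different each pair of sequences is and joining the closest pair first. The sketch below shows only that first step, on made-up, pre-aligned sequences of equal length; it is not a full reconstruction algorithm.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def closest_pair(seqs):
    """Pair of names whose sequences differ at the fewest positions."""
    return min(combinations(sorted(seqs), 2),
               key=lambda pair: hamming(seqs[pair[0]], seqs[pair[1]]))

# Made-up, pre-aligned sequences of equal length.
seqs = {"A": "GATTACA", "B": "GATTACC", "C": "GCTTGCA", "D": "TCTTGCC"}
print(closest_pair(seqs))  # ('A', 'B')
```

Repeating the step on the merged groups is what gradually produces a tree.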





Applications….


Medical Implications


Pharmacogenomics


Not all drugs work on all patients; some good drugs cause death in some patients

So by doing a gene analysis before the treatment, the offending drugs can be avoided

Also, drugs which cause death to most can be used on the minority whose genes are well suited to that drug

Volunteers wanted!

Customized treatment

Gene Therapy

Replace or supply the defective or missing gene

E.g. insulin, and Factor VIII for haemophilia

BioWeapons (??)








Diagnosis of Disease


Identification of the genes which cause a disease will help detect the disease at an early stage, e.g. Huntington's disease

Symptoms: uncontrollable dance-like movements, mental disturbance, personality changes and intellectual impairment

Death in 10-15 years

The gene responsible for the disease has been identified

It contains excessively repeated sections of CAG

So once analyzed, the couple can be counseled
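Counting repeated CAG sections is easy to sketch in Python. The sequence below is made up, and the function only reports the longest uninterrupted run; it does not attempt any clinical interpretation of the count.

```python
import re

def longest_cag_run(dna: str) -> int:
    """Length, in repeats, of the longest uninterrupted run of CAG."""
    runs = re.findall(r"(?:CAG)+", dna)
    return max((len(run) // 3 for run in runs), default=0)

print(longest_cag_run("ATGCAGCAGCAGCAGTTCAGCC"))  # 4
```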



Drug Design


Can take up to 15 years and $700 million

One of the goals of bioinformatics is to reduce the time and cost involved

The process

Discovery: computational methods can improve this

Testing


Discovery

Target identification


Identifying the molecule on which the germ relies for its survival

Then we develop another molecule, i.e. a drug, which will bind to the target

So the germ will not be able to interact with the target


Proteins are the most common targets




Discovery…


For example, HIV produces HIV protease, which is a protein and which in turn cuts other proteins

This HIV protease has an active site where it binds to other molecules

So an HIV drug will go and bind to that active site

Easier said than done!


Discovery…


Lead compounds are the
molecules that go and bind to
the target protein’s active site


Traditionally this has been a trial
and error method


Now this is being moved into the
realm of computers



Related Computer
Technology………….


PERL


Perl is commonly used for bioinformatics calculations because of its ability to manipulate character strings

The default CGI language

It started out as a scripting language but has become a fully fledged language

It has everything now, even web service support


http://bio.perl.org


The place of XML & Web
Services


Various markup languages are being created

Gene Markup language etc. to represent sequence/gene data

Web Services: program-to-program interaction, making the web application-centric as opposed to human-centric

So this has to be platform and language independent

Protocols like SOAP help in this regard

In bioinformatics, various databases, platforms, languages etc. are being used

So web services help achieve platform independence and program interaction

Since sequence databases are in various formats and on various platforms, SOAP also helps in this regard


The place of GRID


GRID - new kid on the block

Using many computers to fulfill a single computational task

Bioinformatics is an ideal application area, as it has to deal with a large amount of data in alignment and searches

E-science initiative in the UK

ORACLE 10g - billed as the world's first GRID database


Data bases and Mining


Many of the sequence databases are publicly available

As there is a DB involved, various data mining techniques are used to pull the data out

As there is a lot of literature (articles etc.) in this area, data mining on the literature, not on the sequence data, has also become a PhD topic for many


European Molecular Biology
Network (EMBnet)


A central system for sharing, training and centralizing up-to-date bio information

Some of the EMBnet sites are:

SEQNET - http://www.seqnet.dl.ac.uk

UCL - http://www.biochem.ucl.ac.uk/bsm/dbbrowser/embnet/

EBI (European Bioinformatics Institute) - www.ebi.ac.uk



References


Dan E. Krane and Michael L. Raymer, Fundamental Concepts of Bioinformatics

Arthur M. Lesk, Introduction to Bioinformatics

T.K. Attwood and D.J. Parry-Smith, Introduction to Bioinformatics

Dr Patrick Dixon, The Genetic Revolution

Prof David Gilbert’s site, http://www.brc.dcs.gla.ac.uk/~drg/


Thank You!