Introduction to Bioinformatics


Sep 29, 2013


Self-study compendium for master’s students
in bioinformatics

Authors: Jessica Carlsson, Angelica Lindlöf and Dan Lundh
University of Skövde
The compendium
This course compendium is intended as self-study material and provides you
with an introduction to the field of bioinformatics. It introduces basic biological
concepts, biological data and what bioinformatics is. At the end you will be given
some basic examples and exercises that you can do as preparation for the
master’s programme studies.


Table of Contents

1. What is bioinformatics?
2. Where does the data come from?
3. Organisms, Organs and Cells
4. What is DNA?
5. Genes and proteins
6. Molecular function and regulation
7. Genes and proteins – sequence data
8. Databases
9. Protein structure – structure data
10. Expression of proteins and genes
11. The Central Dogma and why bioinformatics is important
12. Examples of bioinformatics
13. Bioinformatics exercises



1. What is bioinformatics?

Bioinformatics is, in a wide sense, the development of methods, algorithms and
programs for analyzing and structuring biological data, i.e., the application of
information technology to molecular biological and biomedical data, and the use
of computer science to solve biological problems. Computer scientists have
increasingly been enlisted as “bioinformaticians” to assist molecular biologists,
amongst others, in their research. More recently, statistical knowledge has also
become sought after.



Figure 1. What is bioinformatics? Illustrates the use of computer science, statistics and
mathematics to solve biological problems and to store and analyze biological data. Biological
experiments are conducted that produce biological data; this data is thereafter stored in
databases and analyzed. After interesting biological findings have been identified, these are
again tested experimentally, which leads to new biological data, and so on.

Bioinformatics has grown as an interdisciplinary subject between molecular
biology/medicine, mathematics, statistics and computer science in order to
analyze the large amount of biological information that is available today. For
example, the genomes of more than 1000 organisms have been mapped and
sequenced, which means that the locations of millions of genes on the
chromosomes have now been mapped. To gain knowledge of the function of these
genes and their products, the proteins, you have to be able to handle the
information in databases easily and analyze it in a structured way, for example
through a database analysis.
The strongest reason to map the genetic material is to understand the function of
the many genes located in our genome. The genetic material consists of millions of
nitrogenous base pairs in the DNA strand, where some parts of the DNA
(sequences) make up genes. The largest part of the DNA consists of so-called junk
DNA, whose function is still not known. Since the amount of data (the number of
base pairs and genes) to be analyzed is so large, this would not have been possible
without computers.
To map the genetic material you first have to analyze all of the nitrogenous base
pairs. During sequencing the order of the base pairs is determined. The first step
is to break the DNA molecule into its components and then puzzle them back
together in the so-called assembly. To find genes and determine their function you
then produce maps, which can have different resolutions depending on what you
want to find, e.g., a gene or what separates two different individuals. All
the information produced is stored in databases. The most important reason to
store all information is that even though you have created maps, you are not
done with the genetic material. When the whole sequence is available you can
start to study parts of it for practical benefit. To make the information available,
different search tools have been developed, and the information can be presented
in different ways depending on the purpose.
But before we go deeper into what bioinformatics can be used for, you need
some basic knowledge of molecular and cell biology. We will therefore
start with some basic concepts in these areas. Thereafter, the following sections
outline some examples of what bioinformatics can be used for.

2. Where does the data come from?

Data used in bioinformatics comes from molecular biological and biomedical
experiments, which have, e.g., resulted in the mapping of the human genome and
the genomes of more than 1000 other organisms. The mapping of the human
genome has been more or less finished since the year 2000. The Human Genome
Organization project (HUGO) [1] is a project in which scientists from all over the
world have gathered to collaborate on this work.
Today we therefore have access to an almost complete map of the
genome, but we also need to learn how to interpret it. In this work, which is the
next large challenge, we have to find and assign functions to every single gene. This
is important so that we can understand and treat many of the genetic disorders
that affect humans in the future. With the methods used today, we expect to
have found all genes by the year 2050. However, this does not take into
account the time needed to assign functions to the genes. Even today we do not
know how much of the information gained we will be able to use, but probably
no one doubts that this is the first small step toward a long series of
research projects in bioinformatics.
We should point out that the HUGO project is only one of many sequencing
projects. Today the genomes of over 1000 organisms are fully sequenced. These
organisms range from viruses and bacteria to plants, humans and other animals.
All organisms can be divided into multicellular organisms (animals, plants),
unicellular organisms (bacteria) or viruses. These differ from each other, but the
underlying genomic principles are the same. The biological descriptions that
follow are mainly concerned with multicellular organisms, and mostly with humans.

3. Organisms, Organs and Cells

A human consists of a large number of cells, which group together and form
different organs. Depending on where the cells are in the body, they perform
different tasks, e.g., a nerve cell has a different function than a liver cell. This is
despite the fact that they have the same genomic setup (DNA).


[1] HUGO: http://www.hugo-international.org/

That different groups of cells (organs) have different functions is quite obvious,
but it gets more complex when you consider that a human develops from one
single cell, which through cell division has created all these organs with
different functions. Cells that do not have an assigned function from the
beginning (in contrast to, e.g., a liver cell with a distinct function) are
called stem cells.
Cells in different organs usually have different phenotypes (appearances), which
means that the shape of the cell depends on its function. If we compare a skin
cell with a nerve cell, they look completely different. We can also find differences
between cells of the same type, e.g., the phenotype of nerve cells varies
depending on where in the nervous system they are found.
Almost every cell contains the same genetic material, which is stored in the
chromosomes. A chromosome is a long chain of genes, but also of other areas that
may or may not contain any information. Every cell can contain a number of
chromosomes, which can contain the same genes or variants of them, but also
completely different genes. The number of chromosomes in a cell varies between
different organisms. For example, humans are diploid organisms and have 23
chromosomes in duplicate (23 pairs, of which one pair is the sex chromosomes),
which makes 46 chromosomes in total. Durum wheat, on the other hand, is
tetraploid (four sets of chromosomes) with a total of 28 chromosomes.

Figure 2. Human chromosomes during metaphase. From www.wikipedia.org/wiki/chromosomes.

A cell consists of different parts that all have their own task, such as the
cell membrane, the cell nucleus, mitochondria and other organelles. One task of
the cell membrane is to transport substances between the cell and the surrounding
environment, especially substances that the cell needs for its survival.
Another task of the cell membrane is to separate substances that the cell
wants to keep inside from substances it wants to keep out. This is performed by
receptors and channels in the membrane; only certain substances can bind to
these receptors or pass through the channels. The receptors and channels are
proteins (consisting of amino acids) that have been constructed by ribosomes.
The nucleus of the cell contains the genetic material, which is the blueprint for
how different proteins should be constructed. How this happens is discussed in
further detail later on.


Figure 3. Different parts of a cell [...] membrane and 9) mitochondria.
From www.wikipedia.org/wiki/Cell_(biology).

The nucleus contains the genetic material in the form of DNA [...] genes and where
each gene codes for a specific protein. The genetic material is divided into
different parts called [...] of chromosomes which actually consists of genes coding
for something that the cell needs. The non-coding r[...] used to produce any
protein (gene product). So how does the cell nucleus know when to start
producing a protein? The nucleus gets signals from different parts of the cell
that trigger the production of a certain [...] the truth. There are proteins that
[...] all the time. Other proteins can start to be produced based on changed
circumstances for the cell, e.g., [...] temperature, which requires the action and
production of specific proteins to tackle this change.
All cells contain a specific organelle called the mitochondria [...] produce energy
needed for the cell to survive and produce protein [...] contain its own DNA,
which means that it carries the necessary information for production of the
proteins it needs to produce energy. The endoplasmic reticulum (ER) is an
organelle specialized in synthesizing and transporting lipids (fats) and membrane
proteins, but also in the metabolism (transformation) of lipids. On the outside of
the ER there are ribosomes [...] (construction of proteins). The [...] involved in
transformation, sorting and packing of molecules that should be secreted from
(leave) the cell or be transported to other organelles. The cytoskeleton is similar
to a skeleton that creates the shape of the cell and makes it possible for the cell
to move. It is a[...] transportation of molecules to different parts of the cell.
Depending on which cell you are studying, there might also exist other organelles
than those present in human cells, [...] the photosynthesis and vacuoles that
carry out intracellular (within cell) digestion of nutrients.


coding for a protein. As if this was not enough, only a part of the gene is actually
coding for the protein. The parts of a gene coding for a protein are called exons;
they express a part of the protein. The different exons are put together to get the
whole protein. The parts of a gene that are not coding for a protein are called
introns. In this compendium, an area between two different genes is also referred
to as an intron (strictly speaking, such an area is an intergenic region).


Figure 5. Introns and exons. Illustrates how the gene on the chromosome contains both
coding (exons) and non-coding (introns) regions. From http://en.wikipedia.org/wiki/Gene.

A chromosome consists of two different strands of nucleic acid, where each
position on one strand holds the complementary (opposite) nucleotide to the one
on the other strand. If one DNA strand contains ATGTTGCA then the
complementary strand will contain the opposite nucleotides. This is because, as
we stated earlier, a nucleotide only binds to one other nucleotide (A binds to T,
and G binds to C). For example, take a look at the following piece of DNA
sequence:
DNA strand       ATGTTGCA
Binding          ||||||||
Complementarity  TACAACGT
This binding between the different nucleotides leads to a chromosome that is
double stranded. It also causes the DNA strands to fold into something
called a helix. A helix is a corkscrew (or spiral) shape, as illustrated in the figure above.
Genes can be found on both strands of the DNA. They also have a direction and
can either go from left to right (L2R) or from right to left (R2L). It is important to
notice that each gene only has one direction and that genes cannot overlap
each other in the DNA. In Table 1 below, some of the genes at the beginning of
chromosome II in thale cress (Arabidopsis thaliana) are listed. This information
was collected from the GenBank database [2]. From this list we can establish that a
gene can lie either on the DNA strand or on its complementary strand, denoted as
the plus and minus strand, respectively; sometimes also called the top and bottom
strand. You can also establish that the direction of the genes can vary.
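
The complementarity rule and the two reading directions described above can be sketched in a few lines of Python. This is only a minimal illustration; the names PAIRS, complement and reverse_complement are chosen here and do not come from the compendium.

```python
# Pairing rules: A binds to T, and G binds to C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the complementary strand, base by base, in the same order."""
    return "".join(PAIRS[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the complementary strand read in the opposite direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("ATGTTGCA"))          # TACAACGT, as in the example above
    print(reverse_complement("ATGTTGCA"))  # TGCAACAT
```

A gene on the minus strand is read from the reverse complement, which is why the direction of a gene matters when its sequence is extracted.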
In the table, the start and stop position (base position) of each gene on the
chromosome can also be found. By comparing the distance between genes (the
stop of one gene and the start of the next) we can also establish that the intron
length is about 8,400 bases between the first and the second gene, about 14,800
bases between the second and third, and 64,700 bases between the third and the
fourth. The longest intron below is about 270,000 bases.

[2] GenBank: http://www.ncbi.nlm.nih.gov/genbank/index.html

You can also see a suggested function in the table. Observe that many suggested
functions in a database are based on similarity (sequence similarity) with genes
from other organisms. If you are lucky you can see in the database where the
similarity comes from, e.g., that a gene has been found to be similar to a gene in
another specified organism. In other cases there is a suggested similarity with
another protein, that is, based on the sequence similarity it is judged that the gene
has a similar function to an already known protein. In the table we can see that
many of the proteins have a suggested function based on similarity, and this
similarity concept is an important one within bioinformatics. By using
similarity between sequences, together with more abstract forms of similarity
measurements, we can predict a probable function for about 70 % of all found
genes.
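
One very simple way to put a number on sequence similarity is the percentage of identical positions between two sequences of equal length. The sketch below assumes that the two sequences are already aligned position by position; real database searches use more elaborate alignment methods, and the function name percent_identity is hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two equally long sequences agree."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    # Two toy sequences that differ in exactly one of eight positions.
    print(percent_identity("ATGTTGCA", "ATGTAGCA"))  # 87.5
```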
Table 1. Genes and their function. Gene placement on a chromosome. Strand: denotes
on which of the DNA strands the gene is located; Direction: denotes in which direction the
gene is located (L2R: left to right; R2L: right to left); Start: start base on the chromosome;
Stop: stop base on the chromosome; Function: putative function of the gene. Arabidopsis
thaliana chromosome II.
Strand Direction Start Stop Function




5. Genes and proteins

The genome of an organism is stored in the genetic structure. Almost every cell
has the genetic material, which tells the cell how it should function once it has
developed. This means that a skin cell and a liver cell have the same genetic
material, but there is a big difference in their phenotypes and functions. So, what
is it that controls how the genome is expressed in different situations? To begin
with, you can establish that almost all cells contain DNA; exceptions are, e.g.,
red blood cells. If we assume that we have a cell containing DNA, how is this
DNA organized? We will start by looking closer at how a gene is
transformed to a protein. The protein carries out the function a gene codes for.
Thereafter we will look at how a gene knows when its function is needed.

A gene is not translated directly to a protein. The translation happens in several
steps. First the gene has to be transcribed (transferred) from DNA to
mRNA (messenger RNA; ribonucleic acid). This means that a copy of the gene is
created that only contains the information needed for translation to a protein.
You can say that the cell makes a copy of the region of the gene. When this copy is
created, the T (thymine) nucleotide is exchanged for a U (uracil). At the
transcription stage, only the strand containing the gene to be transcribed is
copied, and the copy is therefore single stranded. The area copied contains both
exons and introns, but not all of the copy is later translated to a protein.
Before this happens, the introns have to be removed (spliced), which means that
the introns are cut away and the exons are joined into a continuous chain
of ribonucleic acids. The copy of the gene thereafter only contains the regions
that should be translated to a protein. We should, however, point out that some
“flank introns” called Kozak sequences can be found in this copy. Depending on
which exons have been joined, we can get different variants of
the mRNA sequence. Most commonly the order of the exons is preserved; the
exception is genes in the immune system, which have been shown to
shuffle the exons without any particular order.
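
The two steps described above, transcription (T is exchanged for U) and splicing (introns are cut away and exons joined), can be illustrated with a short Python sketch. The toy gene and its exon coordinates are hypothetical and only serve to show the principle; the sketch also follows the simplification used in the text that the copy has the same base order as the gene.

```python
def transcribe(gene_dna: str) -> str:
    """Copy a gene region into pre-mRNA by exchanging T (thymine) for U (uracil)."""
    return gene_dna.replace("T", "U")


def splice(pre_mrna: str, exons: list) -> str:
    """Join the exon regions (start inclusive, end exclusive); everything else is cut away."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Hypothetical toy gene: exon 1, intron, exon 2, intron, exon 3.
GENE = "ATGGCC" + "GTAAGT" + "TTTGGA" + "GTCCAG" + "GAATAA"
EXONS = [(0, 6), (12, 18), (24, 30)]  # hypothetical exon coordinates

if __name__ == "__main__":
    pre_mrna = transcribe(GENE)
    print(splice(pre_mrna, EXONS))                 # AUGGCCUUUGGAGAAUAA
    # A splice variant that skips the second exon:
    print(splice(pre_mrna, [EXONS[0], EXONS[2]]))  # AUGGCCGAAUAA
```

Choosing a different subset of exons gives a different mRNA from the same gene, which is the idea behind the splice variants mentioned above.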


Figure 5. Examples of different splice variants. This figure illustrates how different splice
variants give rise to proteins with different properties and function. From
http://en.wikipedia.org/wiki/Alternative_splicing.


The copy can then be sent out of the nucleus to the ribosomes, which are
organelles whose task is to translate the mRNA sequences to proteins. This
process is called translation. In addition to the information needed for creating
the protein, the mRNA also contains flank introns and information about where in
the cell the protein is heading. The latter can be seen as an address tag on
the protein that tells where it should be transported, e.g., whether it should be
sent to the cell membrane after translation. A region that is not translated to a
protein is called an untranslated region, while a region that actually is
translated is called translated.
How mRNA should be translated to a protein is decided by other ribonucleic acids
called transfer RNA (tRNA). These interpret the genetic information and translate
mRNA to an amino acid sequence that makes up the protein. This is performed by
translating triplets of nucleotides in the mRNA to one amino acid. A group of
three nucleotides in the sequence is called a codon (see Figure 6).

Figure 6. Translation of mRNA into amino acid chain. Transfer RNA (tRNA) carries an
amino acid to the ribosome, which refers to a specific combination of three
nucleotides (codon). From http://en.wikipedia.org/wiki/Translation_(genetics).

There are 20 different amino acids that nucleotides code for. A group of three
nucleotides (a codon) gives 4^3 (4*4*4) = 64 possible combinations. This means
that there are more codons than amino acids, and the solution for this is that
several codons code for the same amino acid, e.g., both AAA and AAG code for
the amino acid lysine. This means that some codons are closer to each other,
meaning that a nucleotide can change without changing the amino acid. In such
cases, more than one nucleotide may have to change before the amino acid
changes. This in turn means that the translation is relatively insensitive to
mutations; mutations can happen without changes in the amino acid. These types
of mutations are sometimes called silent mutations. There are also codons that
code for a stop signal, which terminates the translation.


Figure 7. The genetic code. A group [...] There are four nucleotides available, which give
64 different combinations and, hence, available codons. From
http://en.wikipedia.org/wiki/Codon
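
The genetic code can be written down as a lookup table from codons to amino acids. The Python sketch below builds the 64 codons of the standard genetic code and translates a short mRNA until a stop codon is reached. The names CODON_TABLE and translate and the example mRNA are chosen here for illustration only; the sketch also ignores details such as searching for the start codon.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes for the 64 codons, ordered UUU, UUC, UUA, UUG, UCU, ...
# "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(mrna: str) -> str:
    """Translate an mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(len(CODON_TABLE))                        # 64 = 4*4*4
    print(CODON_TABLE["AAA"], CODON_TABLE["AAG"])  # K K (both lysine)
    print(translate("AUGGCCGAAUAA"))               # MAE
```

Because AAA and AAG map to the same letter, a mutation from one to the other leaves the protein unchanged, which is the silent mutation described in the text.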

When the translation has begun, the protein starts [...] bindings between the
amino acids are created, resulting in protein folding. [...] in the same way as the
DNA strands, the protein will take a shape that gives the protein its function. The
folding itself is much more complex than in DNA strands. This whole process from
a gene [...] translation and folding is usually called the central dogma [...].
The whole point of bioinformatics is to try to understand and predict the whole
process or parts of it.


Figure 8. The central dogma of molecular biology. [...] that DNA is transcribed into
mRNA, which is translated into amino acids and is thereafter folded into a 3D protein
structure.


What determines the shape and function of a protein is the properties of its
amino acids. Amino acids are denoted by the letters of an alphabet, which we have
already introduced in connection with codons and amino acids. A complete table of
amino acids with their one- and three-letter abbreviations can be seen below.

Table 2. List of amino acids. The table lists all amino acids available in the organism, their
names and abbreviations.
Amino acids and their abbreviations

G - glycine - gly
A - alanine - ala
L - leucine - leu
M - methionine - met
F - phenylalanine - phe
W - tryptophan - trp
K - lysine - lys
S - serine - ser
N - asparagine - asn
D - aspartic acid - asp
P - proline - pro
V - valine - val
I - isoleucine - ile
C - cysteine - cys
Y - tyrosine - tyr
H - histidine - his
R - arginine - arg
T - threonine - thr
Q - glutamine - gln
E - glutamic acid - glu
B - asparagine/aspartate - asx
Z - glutamine/glutamate - glx
X - unknown
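
In a program, the abbreviations in Table 2 are typically kept as a simple lookup table. The sketch below converts a protein written with one-letter codes into three-letter codes. The table gives no three-letter code for X; "xaa" is used here as an assumption, following the common IUPAC convention, and the function name to_three_letter is hypothetical.

```python
# One-letter to three-letter abbreviations, taken from Table 2.
THREE_LETTER = {
    "G": "gly", "A": "ala", "L": "leu", "M": "met", "F": "phe",
    "W": "trp", "K": "lys", "S": "ser", "N": "asn", "D": "asp",
    "P": "pro", "V": "val", "I": "ile", "C": "cys", "Y": "tyr",
    "H": "his", "R": "arg", "T": "thr", "Q": "gln", "E": "glu",
    "B": "asx", "Z": "glx",
    "X": "xaa",  # assumption: not given in Table 2
}


def to_three_letter(protein: str) -> str:
    """Rewrite a one-letter protein sequence with three-letter abbreviations."""
    return "-".join(THREE_LETTER[amino_acid] for amino_acid in protein)


if __name__ == "__main__":
    print(to_three_letter("MAE"))  # met-ala-glu
```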



In Table 2 there are 20 amino acids represented. In addition, three more
abbreviations have been added: B for asparagine or aspartate, Z for
glutamine or glutamate, and X for an unknown amino acid. The different amino
acids have different properties; some like being in contact with water (and are
hence hydrophilic), while others avoid contact with water (and are hence
hydrophobic). It is the properties of the amino acids that decide what shape
the protein will take. To illustrate some properties we use the following
diagram:


In this diagram, groupings of amino acids with respect to certain properties are
shown.

6. Molecular function and regulation

When a protein has folded it has the ability to perform a function,
e.g., to act as a receptor in the cell membrane so that the cell can accept necessary
substances and signals from the environment. The function of a protein is usually
called the molecular function. A molecular function means, e.g., that a protein can
bind to a substrate (substance) and transform it to another substance which can
be used by the cell. Normally, a substrate has to go through several steps of
transformation before it can be used by the cell. These reactions, or
transformations, of a substrate are usually called a metabolic pathway
(http://en.wikipedia.org/wiki/Metabolic_pathway). In this context we refer to
metabolism as a substrate being converted to another in several steps. For
example, the sugar glucose can be transformed in several sub-steps to
alcohol (ethanol). The first step is that glucose is transformed to hexose
monophosphate, which in turn is transformed to fructose 1,6-bisphosphate,
which is further transformed to ethanol.
A connected series of reactions that transforms one substrate into another is
usually called a metabolic function. The reactions involved are usually connected
so that several metabolic pathways cross each other, and this is called a metabolic
network. A metabolic network can correspond to a cellular process, e.g., an event
in the cell. Examples of this are translation, cell division and the conversion of
sugar to alcohol.
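
A metabolic pathway can be represented in a program as a mapping from each substrate to the substance it is transformed into. The sketch below stores the glucose example from the text in that way and follows it step by step. Only the steps named in the text are included (the intermediate reactions are left out), and the names REACTIONS and follow_pathway are chosen here for illustration.

```python
# Each substrate maps to the substance it is transformed into (steps named in the text).
REACTIONS = {
    "glucose": "hexose monophosphate",
    "hexose monophosphate": "fructose 1,6-bisphosphate",
    "fructose 1,6-bisphosphate": "ethanol",
}


def follow_pathway(start: str, reactions: dict) -> list:
    """Follow the transformations from a start substrate until no further step is known."""
    steps = [start]
    while steps[-1] in reactions and reactions[steps[-1]] not in steps:
        steps.append(reactions[steps[-1]])
    return steps


if __name__ == "__main__":
    print(" -> ".join(follow_pathway("glucose", REACTIONS)))
```

A metabolic network would need a richer structure, for example a mapping from each substrate to several possible products, since pathways cross each other.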
But how does the cell know when it should start to produce, or increase or decrease
the production of, a certain protein? Tied to the gene on the DNA strand there is a
region lying in front of the gene, which is called a promoter region. This region
tells the gene when it should be transcribed or whether the production of mRNA
should be increased or decreased.

Figure 9. Promoter location.
Schematic figure indicating the location of the promoter
region; on the chromosome and upstream of the start of a gene.

Promoter regions determine, among [...] organism’s development a gene should be
expressed [...] circumstances it should be expressed (e.g. [...] infections etc.).
The key to this regulation lies in th[...] transcription factors (TFs), can bind to
specific regions on the promoter region. Simply put, this means [...] when and in
what amount a gene should produce its mRNA, an mRNA [...] translated into a
protein.
Figure 10. Transcription factor binding. [...] the promoter region of a gene and thereby
control its expression.

7. Genes and proteins – sequence data

In previous parts we have covered how the DNA strands are shaped and how
regions of these strands [...] different things. But how do we get this data? In the
coming part we will show how gene and protein sequence data is created.
Sequencing can be performed either on the DNA molecule or on the mRNA
molecule. This depends on whether you do a genomic sequencing or a sequencing
of expressed genes.
Sequencing
Sequencing means that you determine the order of base pairs in [...] is a long line
of letters; A, T (or U), G and C, which represent the different bases. Methods for
sequencing were developed during [...] difficult, time-consuming process. [...]
American Craig Venter took computers to help and applied the method called
shotgun sequencing to the whole genome.

The goal of whole-genome sequencing is to get a complete picture of the
genome, that is, to find its sequence of bases. A sub-goal is to find genes in a DNA
molecule. Genome sequencing means that you get not only the genes that
exist but also the introns between genes and within genes. The fact that you get
information about introns is of great importance, since these regions contain
promoter regions that control gene regulation. To sequence mRNA means that
you sequence an expressed gene, i.e., you only sequence the part of the gene
that codes for a protein. It can also be used for estimating the level of
expression of a gene in a sample.
Genome sequencing
In spring 1998, Craig Venter stunned the whole world when he decided to sequence
the human genome with the shotgun method. This meant that the project could
be completed at a cost of just 1/10 of the budget for the original HUGO project. A
new company, Celera Genomics, was started for this purpose.
Venter cannot, however, take all the credit for the increased efficiency in DNA
reading. The basic method is still the one for which Frederick
Sanger received the Nobel Prize in 1980. During the 1990s this technique was
improved, and the technique for DNA replication and the chemical reaction in the
reading were refined. The results have improved and are now more
certain, and the error frequency is down to one base per 10,000 bases.
The method that Sanger developed is the foundation for one of the automated
DNA sequencing machines, and it works in the following way:
· DNA fragments are labeled with different fluorescent colors, one each for
A, T, G and C.
· The fragments are allowed to run through a gel. Depending on their
size, the fragments move through the gel at different speeds.
· Every color emits light at a different wavelength; a laser makes them glow
and the light is enhanced via a filter. Thanks to a photomultiplier,
digitized signals are obtained, which are traced by a computer
that can then tell us in what order the bases occur in the
fragment.
A couple of years ago a new method, the PSQ 96 system, was presented by
Swedish researchers. Instead of coloring the bases, different analysis liquids are
used. When the liquids are injected into the DNA sample, flashes of light emerge,
which are registered by a digital camera. A computer program then determines by
elimination whether a certain base position in the DNA is normal or
abnormal, which is useful information when tracking diseases. It is also much
faster than the traditional sequencing method.
How you choose to split the human DNA into fragments depends on the
strategy. Some strategies try to minimize the overlaps between the different
parts, since it is expensive and time consuming to sequence the same region several
times. Celera’s shotgun sequencing method is based on having large overlaps
between the different regions that you get when splitting the DNA into fragments.
They believe that this is necessary since the DNA contains long sequences of
repeats. By having larger overlaps you can identify with higher certainty how
different parts interlock, and thereby puzzle the DNA together correctly
again.
Strand                   Sequence
Original                 AGCATGCTGCAGTCATGCTTAGGCTA
First shotgun sequence   AGCATGCTGCAGTCATGCT-------
                         ---------------TGCTTAGGCTA
Second shotgun sequence  AGCATGCTGC----------------
                         ------CTGCAGTCATGCTTAGGCTA
Reconstruction           AGCATGCTGCAGTCATGCTTAGGCTA

Figure 11. Example of shotgun sequencing. The genome is split into fragments and
sequenced several times. Thereafter the fragments are put together to get the
entire genome sequence.
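
The reconstruction in Figure 11 can be sketched in a few lines of Python. This is only a minimal illustration, assuming error-free fragments that are already in the right order; the function names overlap_length and merge are our own, and real assemblers are far more elaborate.

```python
def overlap_length(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge(left, right):
    """Join two fragments, writing the overlapping part only once."""
    return left + right[overlap_length(left, right):]


# Fragments taken from Figure 11 (dashes removed).
original = "AGCATGCTGCAGTCATGCTTAGGCTA"
first_run = ("AGCATGCTGCAGTCATGCT", "TGCTTAGGCTA")
second_run = ("AGCATGCTGC", "CTGCAGTCATGCTTAGGCTA")

for left, right in (first_run, second_run):
    rebuilt = merge(left, right)
    print(overlap_length(left, right), rebuilt, rebuilt == original)
```

Both runs overlap by only four bases here. With such short overlaps, a repeated motif elsewhere in the sequence could easily produce a false join, which is the reason given above for preferring large overlaps.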

Shot-gun sequencing is now considered an old method and newer techniques
(called next-generation sequencing) such as pyrosequencing and sequencing by
ligation are being used instead.
mRNA sequencing
Sequencing mRNAs differs from genome sequencing. The big difference is that the
sequence to be sequenced has been expressed, meaning that we know that the gene
has been transcribed into an mRNA. Otherwise the differences compared to DNA
sequencing are not great, except that other experimental techniques are used,
such as RT-PCR and massively parallel signature sequencing (MPSS).
Data assemblage
The experimental methods produce data that makes it possible to identify the
different bases A, T (or U), G and C. The task of the computer is to put these
bases back together into small fragments of 200-500 base pairs. Since data
generation is far from error-free, the experiment must be repeated several
times. The more often it is repeated with the same result, the more certain you
can be. The results from the repeated experiments are then compared and a risk
assessment is performed, so that you know with what certainty you can trust the
results.
In the next phase the fragments are puzzled together into the region that you
planned to sequence. This is a complicated process since many factors affect
the result:
· Bad coverage of the DNA or mRNA, meaning that the fragments fail to cover
the whole region you want to sequence
· Sequencing errors, which tend to occur at the end of a sequence; the longer
the sequenced fragment, the more errors are introduced at the end
· Unknown strand, meaning that you do not know which of the two strands
you have sequenced in the different fragments
· Repetitions in the DNA or mRNA; large regions of repeats make it hard to
know exactly where the overlaps lie.
Assembling the fragments usually requires visual inspection by a researcher,
since there is no single solution to the problem. The method is not flawless,
since the researcher’s subjective judgment and the ambition to be the first to
report a sequence can influence the result. The idea is to minimize the number
of incorrect bases per position. We expect a couple of bases per 1,000 or
10,000 to be incorrect. To avoid errors, the sequencing is performed several
times for each DNA fragment, and it is common to repeat it at least twice.
Assembling the sequences means that you look for matching regions at the end
and the beginning of the fragments so that they can be joined together:

 AACCGTTTACGAAACCAGGTGC
 AACCGTTTACGAAACCAGGTGCGCGCGCCCGCGGGAA
TAACCGTTTACGAACCCGGTGC
                      CGCGCGCCCGTGGGAATCCTAAAAA
                    TGCGCGCGCCCGAGGGAATCCTAAAAA

 AACCGTTTACGAAaCCAGGTGCGCGCGCCCGcGGGAATCCTAAAAA consensus
Above is an illustration of the assemblage procedure. The fragments have been
assembled so that overlapping regions have been found and aligned. The
sequence at the bottom illustrates the assembled sequence (called a consensus)
and where small letters denote an uncertainty.
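
As a rough sketch of how such a consensus could be computed, the Python example below takes a majority vote in every column and writes the winning base in lowercase when the fragments disagree. The three short fragments are hypothetical and already aligned; "." marks positions a fragment does not cover, and every column is assumed to be covered by at least one fragment. The function name consensus is our own.

```python
from collections import Counter


def consensus(rows, blank="."):
    """Majority vote per column; lowercase marks a column where fragments disagree."""
    result = []
    for column in zip(*rows):
        bases = [base for base in column if base != blank]
        # On a tie, Counter keeps the base that was seen first in the column.
        winner, votes = Counter(bases).most_common(1)[0]
        result.append(winner if votes == len(bases) else winner.lower())
    return "".join(result)


# Hypothetical aligned fragments, padded to equal length with ".".
fragments = [
    "ACGTAC....",
    "ACGAACGT..",
    "..GTACGTTA",
]
print(consensus(fragments))
```

In the fourth column two fragments read T and one reads A, so the consensus shows a lowercase t there.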
IUPAC code
Sometimes bases are missing in the fragments, which results in insertions and
deletions in the consensus. This might give rise to spurious stop signals or
frame shifts, which make the sequence harder to interpret. Sometimes it is also
difficult to decide which nucleotide is the correct one in a certain position.
An alphabet, the IUPAC code, was constructed for this purpose and is commonly
used in mRNA analysis. With this code you can denote an uncertain position with
a letter that stands for several possible nucleotides. Using the code below we
can rewrite the consensus sequence above so that the uncertainty is stated
explicitly, by replacing each uncertain nucleotide with the corresponding
letter. The position where the fragments showed an “A” and a “C” is written as
“M”, and the position where they showed a “C”, a “T” and an “A” is written as
“H”. The IUPAC code is listed in Table 3. Observe that a sequence written with
this alphabet might look like a protein sequence, but it is actually DNA (or
mRNA).

Table 3. IUPAC code. Uncertain positions in the sequence can be denoted by a letter
indicating several possible nucleotides.
IUPAC code
Code   Nucleotide(s)
A      A
C      C
G      G
T/U    T (or U)
M      A, C
R      A, G
W      A, T
S      C, G
Y      C, T
K      G, T
V      A, C, G
H      A, C, T
D      A, G, T
B      C, G, T
X/N    G, A, T, C
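
Table 3 can be turned into a small lookup. The Python sketch below maps a set of observed nucleotides to its IUPAC letter; the dictionary is copied from the table (using N for the four-base case), while the function name iupac_symbol is our own.

```python
# Keys are the possible nucleotides in alphabetical order, values the IUPAC letter.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}


def iupac_symbol(observed):
    """Return the IUPAC letter for the nucleotides observed at one position."""
    # U (RNA) is treated as T, duplicates are dropped and the rest is sorted.
    key = "".join(sorted(set(observed.upper().replace("U", "T"))))
    return IUPAC[key]


print(iupac_symbol("AC"))   # M
print(iupac_symbol("CTA"))  # H
```

The two calls correspond to the two uncertain positions discussed above: A or C gives M, and C, T or A gives H.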


Gene identification and deciding function
Gene identification and deciding the function of a gene, also called sequence
analysis, means that you try to map the genome in several different ways. These
methods generate a number of different maps, which differ mainly in their
resolution. The physical map with the highest resolution is the one that
describes the complete genome sequence in terms of the nitrogen bases A, T, G
and C. This is the map that was generated in the first part of the HUGO project.
Other physical maps, e.g., chromosome maps, describe on which chromosome every
gene (or other identifiable DNA fragment) is located. The distance between
these genes or other fragments is then measured in number of bases.
A completely different type of map is based on genetic linkage, where you try
to find markers and their relative positions in a chromosome. You search for
markers that differ between individuals; a marker can be a single gene or a
region of DNA whose function is still unknown. Differences in the DNA sequence
are the most useful markers, since they are easy to characterize and can be
followed through inheritance from parent to child. The value of this kind of
analysis is that you can find where a genetic (inherited) disease is located in
the genome. This is done by finding markers shared by a number of individuals
who are affected by the same genetic disease. These markers are then compared
to DNA from individuals who are not affected by the disease. If the marker is
not found in the healthy individuals, you have probably found which chromosome
determines the genetic disease.
Deciding the function of a gene is not only about looking for genetic defects.
You are also interested in the normal behavior of the genes and what they code
for. One method is to compare a gene from the human genome with a gene from
another organism whose genes and their functions are well documented. If there
are large similarities between the two genes, you can assume that the human
gene has approximately the same function.
Deciding/predicting function
What does it mean that a gene has a certain function? First, you should
distinguish between two different concepts: deciding function and predicting
function. To say with certainty what the function of a gene or protein is, the
function has to be decided by experimental tests. Today we cannot decide the
function of a gene with the help of bioinformatics; we can only predict it.
Here, we will focus only on prediction of function.
Protein translation
The first step is to translate the sequence to a protein. If you have a DNA
sequence, you first have to find the exons so that you get the sequence coding
for the protein. Even if we assume that we have a sequence coding for a
protein, we also have to consider splice variants, that is, that the same gene
gives rise to several different protein sequences. This complicates the
comparison of protein sequences against other known sequences, but this will be
covered later on. Given that we have a sequence, we could translate it to a
protein sequence using the amino acid table above: we take three nucleotides at
a time and translate them to an amino acid. There is, however, a small problem.
As mentioned before, the mRNA sequence contains untranslated regions (UTRs) at
the beginning and at the end (in eukaryotes the start codon itself usually lies
within a so-called Kozak sequence), so we do not know which base is the first
that should be translated. Another problem is the direction of the gene. We
cannot assume that the gene is translated from left to right (L2R); it could
equally well lie on the complementary strand and be read in the reverse
direction (R2L). Theoretically this means that there are three possible
positions for the start codon in the L2R direction and three possible positions
for a start codon in the R2L direction.
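
The six possible translations can be sketched in Python as follows. This is an illustration only: it assumes the standard genetic code, a DNA sequence containing only A, C, G and T, and that reading R2L means translating the reverse complement. The example sequence and all names are our own.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """The complementary strand, read in the opposite direction."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate codon by codon; '*' marks a stop codon, leftover bases are ignored."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frames(dna):
    """Translations for the three L2R and the three R2L reading frames."""
    frames = {}
    for label, strand in (("L2R", dna), ("R2L", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{label} frame {offset + 1}"] = translate(strand[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

Only the first L2R frame of this toy sequence starts with Met (M) and ends with a stop (*); the other five frames give unrelated amino acid strings.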

Figure 12. ORFs. A DNA sequence with six different reading frames based on start from
different codons, three from the right and three from the left. Blue areas indicate
plausible regions that code for a protein. The longest region is the most likely sequence
resulting in a functioning protein. Program used for generating this figure is ORF finder,
http://www.ncbi.nlm.nih.gov/projects/gorf/.

Each way of dividing the sequence into codons is called a reading frame, and a
stretch within a frame that can be translated without meeting a stop codon is
called an ORF (open reading frame). As seen, the sequence gives rise to six
different proteins based on the different reading frames. The problem is that
only one of these six possible proteins is correct. There is no good technique
for telling that a certain ORF is the correct one; you need to verify the
protein sequence experimentally to be able to decide. There are, however, some
rules of thumb that can be applied for finding the correct ORF. One such rule
is that the ORF that gives the longest sequence without stop codons is normally
the correct one. Stop codons are a lot easier to find than start codons.
Another rule
says that the amino acid Met (methionine), i.e., the codon AUG/ATG, is common
as a start, but there are no guarantees for this. Finding better techniques for
identifying the correct protein sequence for a gene is an area under
development.
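
The two rules of thumb (start at ATG, prefer the longest stretch without a stop codon) can be combined in a small search. The Python sketch below is a simplification under our own assumptions: it looks only at the three L2R frames, and both the example sequence and the function name longest_orf are hypothetical. The R2L frames could be handled by passing in the reverse complement.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(dna):
    """Return (frame offset, ORF) for the longest ATG...stop stretch in the L2R frames."""
    best = (0, "")
    for offset in range(3):
        start = None
        for i in range(offset, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon seen in this frame
            elif start is not None and codon in STOP_CODONS:
                orf = dna[start:i + 3]
                if len(orf) > len(best[1]):
                    best = (offset, orf)
                start = None  # look for the next ORF in the same frame
    return best


# Hypothetical sequence with a short ORF in frame 0 and a longer one in frame 1.
dna = "ATGAAATAGCATGGGCTTTCCCTGA"
print(longest_orf(dna))
```

As stated above, the longest ORF is only the most likely candidate; the result still has to be verified experimentally.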
Search in databases
Eukaryotes (organisms that have cells with nuclei) are characterized by coding
and non-coding regions in the DNA. This means that the gene product (mRNA
sequence) of a certain gene can have different lengths depending on how the
gene is spliced, since not all exons have to be represented in the transcribed
gene. Although the order of the exons is generally preserved, the result is
that alternative variants of a protein might be expressed. A search in a
database of already known proteins can therefore be problematic, since the
match against known proteins can contain many deletions (see figure below).
Other aspects of database matching also exist, since there might be insertions
and deletions as a result of reading frame shifts. Often an mRNA sequence is a
subsequence of a sequence in the database, which might help to decide whether
you have found the correct ORF. Taking an mRNA sequence and comparing it
(translated) against the database is not always unproblematic: the mRNA does
not always contain only the coding regions of a gene, it might also contain
flanking regions.
8. Databases

Databases, or sequence libraries, are a central concept within sequencing
projects, since the first step in sequencing is to generate a large amount of
information that can later be used for research. The largest and best-known
database is called GenBank² and is administered by the National Institutes of
Health (NIH) in the USA. A corresponding database in Europe is called EMBL,
after the European Molecular Biology Laboratory. We will describe only GenBank,
since the two databases are very similar.
GenBank stores sequencing results from public research all over the world.
Private companies such as Celera have not published their results, since they
want to earn money from their research. The number of base pairs in GenBank has
grown 250-fold during the last decade. In August 2000, it was found that the
number of base pairs had doubled during the preceding seven months alone. The
need for a well-functioning database is easily understood when you know that
the amount of data being handled is growing exponentially.
GenBank is an indexed flat file, which means that it consists of an enormously
large text file (just like this Word document, but this will be discussed
later). The file contains information about sequences from many different
organisms. Every record in the database contains a lot of information together
with the sequence. This information includes, among other things:
· The size of the sequence
· Where it came from, that is, the Latin name of the species
· In which chromosome, and where in the chromosome, the sequence was
found
· Properties such as information about the protein sequence it codes for
· Key words describing the function of the sequence (if it is known)
· Who submitted the sequence
· References to scientific publications that use this sequence
· References to similar sequences in the database
GenBank is updated daily as new sequences are reported. Sadly, the reliability
of the sequences in GenBank is not the best. Many errors have been discovered:
information has been incorrectly reported, or parts of it are missing.
The databases are normally hosted on a Unix server, since Unix handles multiple
users well, which gives faster and more reliable access to the content of the
database. The text format has been chosen to offer maximum portability between
different computer environments. This is believed to be very important, since
sequencing projects such as HUGO are carried out all over the world. Another
advantage of this type of database is that you do not need a particular
database manager. A disadvantage is that searching is slower than in a
relational database, where more effective search strategies are used.
To search in GenBank it is enough to have a regular web browser, e.g., Netscape
or Internet Explorer. There are also other tools for other computer
environments that work in different ways. Network Entrez is an example of a
platform-dependent tool that supports text-based searches in several of the
sequence databases, while Network BLAST, which exists in different versions for
different environments, supports searches for sequences similar to one already
known.
Several problems arise when searching in databases. Assume that we have
assembled a sequence and want to search a database with it. How shall we then
determine whether the sequence contains a gene or is non-coding, or get ideas
about similarities with other proteins? If a search does not result in a
significant similarity with any existing sequence, it might be that we have
found a new gene. But it may also be that we have found a sequence that is part
of a non-coding region. If the sequence comes from genome sequencing, there is
no guarantee that it is actually coding. If it comes from EST sequencing, on
the other hand, it might be a coding region, provided that the EST sequence is
not from one of the flanking regions. So, to conclude, whether the sequence
matches a known sequence matters. If no match is found, regardless of
sequencing method, this might also be because the sequence does not yet exist
in the database.
9. Protein structure – structure data

All the information a protein needs for knowing how to fold can be found in its
amino acid sequence. It is the properties of the amino acids that determine how
the protein folds. The problem is that it is very hard to derive the protein
structure from the sequence. There are a number of experimental methods to
determine the structure of a protein, and in this chapter we will just give you
an overview of how a protein structure is determined.
Most protein structures are determined with physical methods: x-ray
crystallography or a technique called NMR (nuclear magnetic resonance), of
which x-ray crystallography is the most used. The goals of determining
structure data may differ depending on the method used, but 3D information is
desirable. Both crystallography and NMR can be used to obtain coordinates for
every atom in three-dimensional space.
X-ray crystallography
X-ray crystallography means that you let the protein molecules form a crystal,
in which the molecules are packed in a regular, repeating arrangement; the
smallest repeating unit is called a unit cell. By exposing the crystal to
x-rays, a diffraction pattern is created on a detector. Depending on the
structure of the protein, different diffraction patterns will be created,
because the x-rays are scattered differently by different arrangements of
atoms. By exposing the crystal from different angles, different diffraction
patterns are obtained. Different atoms have different numbers of electrons and
therefore scatter the x-rays with different strength, so you can get an idea of
where an atom is placed and also which kind of atom it might be. By
interpreting this information, you can get the protein structure.
NMR
NMR is based on the fact that certain atomic nuclei have magnetic properties
(H, hydrogen; C, carbon; N, nitrogen; and P, phosphorus). Most of these atoms
are also abundant in the amino acids that proteins are built of. NMR works by
placing the protein in a strong magnetic field, which makes the hydrogen nuclei
align with the field. When radio-frequency pulses are sent into the protein,
the hydrogen nuclei respond by emitting radio-frequency signals that can be
measured. The frequency of this radiation depends on the surroundings of the
atomic nucleus and differs between atoms that do not have the same
surroundings. In a simplified way you can say that NMR tries to find where in
three-dimensional space the radiation originates and to interpret what type of
radio-frequency radiation is emitted from a certain position. Since the atoms
form bonds (within amino acids and between amino acids), the emitted signals
will differ between positions. In this way you can work out where a certain
configuration of atoms is located. This explanation is very simplified; NMR can
be used for many things, e.g., to identify commonly occurring configurations of
amino acids that exist in most protein structures (so-called secondary
structures; the amino acid sequence is the primary structure).
Structure data
Protein structure data is data about the atoms’ positions in three-dimensional
space. Other information besides the X, Y and Z coordinates can also be coupled
to each atom. Protein structures are not static; proteins have some movement in
parts of their structure. A membrane protein functioning as a channel can, for
example, change its structure so that the channel opens or closes. The movement
is part of the functionality of the protein; it can exert its function thanks
to this movement in a part of the structure. Such movement is not universal:
there are proteins that do not need any movement at all to exert their
function. However, this movement
causes problems when it comes to determining the protein structure. You often
have to repeat the structure determination in a number of experiments to get a
mean value for the protein configuration. The movement in a protein molecule
also depends on the temperature and the surrounding environment. Some proteins
are like a shell: closed when inactive, opening when something binds, and then
closing again when the protein becomes active (exerts its function).
Another aspect that should be mentioned is the accuracy of the determined
protein structure. With x-ray crystallography it is hard to achieve high
accuracy. By accuracy we mean how accurately the position of an atom is known.
Structures determined with x-ray crystallography can have an accuracy of 1 to
2.5 Ångström and sometimes even more (this is one of a few areas where SI units
are not used). One Ångström is 0.0000000001 meter (10^-10 m). What does this
mean? It means that the position of a certain atom has been determined to
within 1-2.5 Ångström (mean value). An oxygen atom is about 1.5 Ångström in
diameter, which means that the position is known to within roughly the size of
an oxygen atom.
10. Expression of proteins and genes

Different techniques can be used to see which proteins are expressed in a
specific tissue or in a certain situation. We will mostly focus on two methods:
one that shows which proteins are found in a tissue or a situation (2D-PAGE),
and one that makes it possible to see which genes (with the help of mRNA) are
expressed in a tissue or in a situation (analysis of expression data). We will
thus look at two different data sets originating from two different levels: the
protein level and the gene level.
Protein level - 2DPAGE
To study the proteins expressed at the same time in a tissue, a method called
2D-PAGE (two-dimensional polyacrylamide gel electrophoresis) can be used. The
technique is based on the fact that proteins have different molecular weights
and different electrical charges. Strictly speaking the separation is based on
the acidity of the protein, but this in turn is coupled to its charge. By
experimentally separating proteins with respect to their weight and charge you
can clearly see which proteins are present in a tissue. It gets a little more
complicated when you run a two-dimensional gel, with weight on one axis (e.g.
the y-axis) and pH on the other (the x-axis), and have to identify which
proteins are present. Theoretical values for weight and pH can be calculated,
but they do not always agree with the real world; see the figure below.

Figure 13. 2D-PAGE. An example of what 2D-PAGE data can look like. The y-axis
shows molecular weight and the x-axis shows pH values. The arrows indicate the
experimental point for an arbitrary protein and its theoretical point.

As seen in the example, it is difficult to identify proteins on a gel. Many
spots are merged and many are unclear. The size of a spot reflects the amount
of protein that exists in the tissue, so it is difficult to find proteins that
exist in low quantities.
Besides showing which proteins are expressed, gels become interesting when you
compare two different gels with each other, for example one from a person with
healthy tissue and one from a person with diseased tissue. By comparing the two
gels and looking for spots that deviate, e.g., a protein that exists in the
diseased tissue but not in the healthy one or the other way around, you can
identify proteins that are involved in the disease. The same principle is used
to identify how different proteins are expressed in, e.g., development stages
(cell division, cell death, growth).
Analysis of expression data
Since it is difficult to measure the amount of all the cell’s proteins, other
techniques have evolved. One of these is the microarray, which can measure the
relative amount of mRNA for all genes in a cell, that is, the expression level
of the genes.
Many techniques have been developed to measure gene expression. Among these is
the DNA microarray technique³. With this technique it is possible to study the
expression pattern of thousands of genes at the same time and under different
conditions, in order to understand function, regulatory mechanisms and
interaction pathways in whole genomes. This type of data makes it possible to
define the roles of all genes in a genome and understand how they function.
Microarray experiments measure the concentration of mRNA present in the cell. A
high concentration of mRNA for a gene is believed to reflect a high expression
level of that gene. Microarrays are used today to understand how gene
expression changes, e.g., under different environmental stress factors in
yeast, or to compare expression profiles of tumors from cancer patients. The
microarray technique is, among other things, important for understanding how
gene regulation works, for disease diagnostics and for finding new medicines.

³ DNA microarrays, http://en.wikipedia.org/wiki/DNA_microarray

By measuring the expression level of genes in an organism during different
conditions, e.g., different development stages and different tissues, it is possible
to build up gene expression profiles that can be used to characterize the
functionality of genes. For example, see the following gene expression matrix:
YORF     NAME       10m    30m    50m    70m
YHR051W  COX6     -1.12  -1.18  -0.56   0.33
YKL181W  PRS       0.24   0.30   0.31   0.39
YHR124W  NDT80    -0.56  -0.15  -0.86  -0.67
YHL020C  OPI1      0.19   0.26   0.23   0.00
YGR072W  UPF3      0.15   0.06   0.01  -0.58
YGR145W  unknown   1.27   1.01   0.55   0.14
YGR218W  CRM1      0.10   0.01   0.06  -0.06
YGL041C  unknown  -0.17  -0.20  -0.09  -0.76
YOR202W  HIS3     -0.40  -0.49  -0.14  -0.27
YCR005C  CIT2     -1.79  -1.94  -0.25   1.17
Expression data collected from yeast (cell division cycle). YORF gives the ORF (coding gene
in yeast) and NAME gives the name of the protein (or a similar protein); if none has been
found, it is listed as unknown. The columns 10m, 30m, 50m and 70m denote the time points
at which the samples were taken.

Gene expression data can be represented in a matrix (as the one above), where
the rows represent genes and the columns represent samples, e.g., different
tissues, development stages or treatments. Every cell (box) contains a number
that gives the expression level of a certain gene in a certain sample.
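
To give a feeling for how such a matrix is handled in a program, the Python sketch below stores the yeast values from the matrix above in a dictionary and asks which gene changes the most over the four time points. The choice of data structure and the helper names are our own; in practice a dedicated table library would probably be used.

```python
TIME_POINTS = ["10m", "30m", "50m", "70m"]

# ORF -> (name, expression values at 10m, 30m, 50m, 70m), copied from the matrix above.
MATRIX = {
    "YHR051W": ("COX6", [-1.12, -1.18, -0.56, 0.33]),
    "YKL181W": ("PRS", [0.24, 0.30, 0.31, 0.39]),
    "YHR124W": ("NDT80", [-0.56, -0.15, -0.86, -0.67]),
    "YHL020C": ("OPI1", [0.19, 0.26, 0.23, 0.00]),
    "YGR072W": ("UPF3", [0.15, 0.06, 0.01, -0.58]),
    "YGR145W": ("unknown", [1.27, 1.01, 0.55, 0.14]),
    "YGR218W": ("CRM1", [0.10, 0.01, 0.06, -0.06]),
    "YGL041C": ("unknown", [-0.17, -0.20, -0.09, -0.76]),
    "YOR202W": ("HIS3", [-0.40, -0.49, -0.14, -0.27]),
    "YCR005C": ("CIT2", [-1.79, -1.94, -0.25, 1.17]),
}


def expression_range(values):
    """Difference between the highest and the lowest expression value of a gene."""
    return max(values) - min(values)


def peak_time(values):
    """Time point at which the gene has its highest expression value."""
    return TIME_POINTS[values.index(max(values))]


most_variable = max(MATRIX, key=lambda orf: expression_range(MATRIX[orf][1]))
name, values = MATRIX[most_variable]
print(most_variable, name, peak_time(values))
```

The ORF identifier is used as the key because the NAME column is not unique (two rows are listed as unknown).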

But how is the expression value of an mRNA obtained? To obtain the expression
level we use microarrays. A microarray consists of a glass slide (it could also
be silicon or nylon) where single-stranded DNA molecules have been attached at
predetermined positions (spots). These single-stranded DNA molecules are known
(you know the sequence and often also the function). On every glass slide there
are thousands of spots, where each spot is specific for one gene. A sample
(mRNA) is taken from a tissue and tagged with a fluorescent substance.
Single-stranded DNA binds to a complementary mRNA sequence, so when the sample
is applied to the glass slide (hybridized), the mRNA of each expressed gene
binds to the corresponding single-stranded DNA on the slide (unexpressed genes
do not have any mRNA). Then you wash away the mRNA that has not bound to any
DNA, and the last step is to scan the slide with a laser and read off the
amount of mRNA in each spot (the amount of fluorescent substance the mRNA
carries in each spot).


Figure 14. DNA microarray. A microarray with its thousands of spots,
approximately 40,000. From http://en.wikipedia.org/wiki/DNA_microarray.

From the amount of light, you can then calculate a relative amount of mRNA
(compare the numbers in the table above). When calculating the relative amount
of mRNA, different experimental factors need to be considered, such as
background light, background noise, experimental noise and how the light is
spread over the spot. These factors affect the results. We mention them here
since microarray experiments have been criticized for containing a lot of
noise, that is, they are considered to be fuzzy. The different factors are
taken into account when the relative amount of mRNA is calculated.
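
The compendium does not say exactly how the relative amount is calculated. One simple possibility, shown here purely as an assumption, is to subtract the background light from the spot intensity and compare the result with a reference measurement on a log2 scale, which gives positive and negative numbers of the kind seen in the table. The function name, its parameters and the numbers in the example are all hypothetical.

```python
import math


def log_ratio(sample_signal, sample_background, reference_signal, reference_background):
    """Background-corrected log2 ratio of a sample spot against a reference spot."""
    sample = sample_signal - sample_background
    reference = reference_signal - reference_background
    if sample <= 0 or reference <= 0:
        return None  # spot no brighter than its background: no meaningful ratio
    return math.log2(sample / reference)


# Hypothetical intensities: the sample spot is four times brighter than the reference.
print(log_ratio(900, 100, 300, 100))  # 2.0
# A spot that is fainter than its background gives no value.
print(log_ratio(90, 100, 300, 100))   # None
```

With this choice a value of 0 means equal amounts, positive values mean more mRNA than in the reference and negative values mean less.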
11. The Central Dogma and why bioinformatics is important

In this compendium we have tried to introduce some of the basics of
bioinformatics: partly a brief biological background, and partly how data is
produced and some factors that might affect the quality of the data. But why do
we need bioinformatics? We have tried to present situations that are complex
and where large amounts of data are handled. For example, GenBank doubles in
size every 14 months and today contains over 22 billion base pairs, distributed
over about 100,000 species. It is quite obvious that this information has to be
structured, analyzed and evaluated.
Today, more than 70% of gene functions can be predicted with the help of
bioinformatics tools and databases. The remaining part is harder, although
bioinformatics tools can still help. The process that we have tried to
highlight is what we call the central dogma: going from a chromosome to mRNA to
protein to folding. We have not yet talked about the final step, going from
proteins to metabolic pathways to cellular functions. This step requires that
knowledge about a metabolic pathway can be transferred to how the cell (and
eventually an organ) reacts when the pathway is changed (e.g., up- or
down-regulation in some part). The complexity increases dramatically when you
consider that several pathways cross each other and together create the
cellular function. This topic is covered by systems biology, which is closely
related to bioinformatics.
What we mean by the term function might be confusing, but the definition here
is how something works. The term is used at several levels and from several
angles within bioinformatics, and its use is often subjective. By molecular
function we usually mean the specific function a protein has. This is often,
e.g., the binding of a substrate (e.g. sugar) or binding to another protein
molecule (e.g. to something in
the cytoskeleton). A metabolic function usually refers to the function of a
number of combined reactions, e.g., production of amino acids. The metabolic
function is a global function lying above the detailed molecular functions. By
cellular function we mean the function a cell performs. A nerve cell cannot
have the same cellular function as a liver cell. The underlying causes of the
cells’ different functions are the metabolic functions that control each cell’s
function. This does not mean that the metabolic pathways in the two cell types
are different; they might just be adjusted to better suit a certain cellular
function.
12. Examples of bioinformatics

In this chapter we will outline three examples where bioinformatics is important
and useful.
Example 1: Sequence search
To analyze and get clues about what a protein (gene product) does, you can take
the protein sequence and compare it to sequences with already known functions
(genes or proteins) in a database. This means searching for similarities
between an unknown protein and already known proteins (unknown means that
nothing is known about the function, and known means that you have a pretty
good idea of the function of the protein). If the similarity between two
protein sequences is large enough, we can say that the sequences are related
with respect to function.
Assume that we have data from a sequence called Protein1 whose function is
unknown:
>Protein1
AGILVGRCTILV

As seen, the sequence data is divided into two parts: an identifier,
“Protein1”, and the sequence itself, “AGILVGRCTILV”. This format is called the
FASTA format and is the common way of storing sequences in flat file databases.
The first line, starting with ‘>’, commonly contains an identifier of the
protein or gene, a name of the protein or gene and sometimes a short sentence
describing its function.
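
Reading this format in a program is straightforward. The Python sketch below is a minimal reader, assuming that the identifier is the first word after ‘>’ and that a sequence may continue over several lines; the function name parse_fasta is our own, and established libraries offer more complete readers.

```python
def parse_fasta(text):
    """Return a dictionary that maps each identifier to its sequence."""
    records = {}
    identifier = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            # The identifier is taken to be the first word of the header line.
            identifier = line[1:].split()[0]
            records[identifier] = ""
        elif identifier is not None:
            # Sequence lines are appended to the most recent header.
            records[identifier] += line
    return records


example = """>Protein1
AGILVGRCTILV
"""
print(parse_fasta(example))  # {'Protein1': 'AGILVGRCTILV'}
```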
By taking the sequence and comparing it to a database of sequences we can
identify similar sequences. We want to maximize the similarity against the
sequences in the database. How this is done will be described later in the
course, but we can assume that we find some similarities like:

>Protein1             AGILVGRCTILV  100
>DNA binding protein  AG-LVGRCSILV   83
>Transport protein 2  AAILIGRCTIVL   67
>Ribosomal protein    --ILIGRCSVVL   41

By comparing the amino acids against each other we find the similarities, and
by inserting “-” in a sequence we can shift the amino acids one step forward so
that similar amino acids end up in the same position. Making this comparison is
also called creating
29

an “alignment”. In this schematically comparison we can see that the sequence
for Protein1 is most similar to itself (100% match), and thereafter the sequence
named “DNA binding protein” (83% match). We should then be able to conclude
that the unknown protein, Protein1, has a similar function to the identified
protein “DNA binding protein”. Note that this conclusion only can be made if the
similarity is large enough, something that will be discussed later on in the course.
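The match percentages in the listing can be approximated by counting identical positions in the aligned sequences. The sketch below assumes the sequences are already aligned (same length, with “-” marking gaps) and uses plain percent identity; the text does not state which scoring was actually used, so this is one possible reading rather than the method behind the listing. The function name percent_identity is chosen here for illustration.

```python
def percent_identity(query, hit):
    """Percentage of alignment columns where both sequences have the same residue."""
    if len(query) != len(hit):
        raise ValueError("aligned sequences must have the same length")
    # A gap ('-') never counts as a match.
    matches = sum(1 for a, b in zip(query, hit) if a == b and a != "-")
    return 100.0 * matches / len(query)


query = "AGILVGRCTILV"  # Protein1
hits = {
    "Protein1": "AGILVGRCTILV",
    "DNA binding protein": "AG-LVGRCSILV",
    "Transport protein 2": "AAILIGRCTIVL",
    "Ribosomal protein": "--ILIGRCSVVL",
}

# Rank the database hits from most to least similar to the query.
ranked = sorted(hits.items(),
                key=lambda item: percent_identity(query, item[1]),
                reverse=True)
for name, aligned in ranked:
    print(f"{name:<20} {aligned} {percent_identity(query, aligned):5.1f}")
```

With this simple count the four sequences share 12, 10, 8 and 5 of the 12 positions with Protein1, which gives percentages close to those in the listing; small differences can come from rounding or from a different scoring scheme.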

Example 2: Structure prediction
To be able to say that a certain protein has a certain function we need
knowledge about the structure of the protein. A protein structure is a
three-dimensional description of where the amino acids are placed in space.
It arises because proteins fold. The way the protein folds determines which
function it can have. By function we mean that the protein, in its
three-dimensional shape, can bind to other proteins, substrates or DNA. This
interaction with other proteins, substrates or DNA reflects the function of
the protein.
A common technique for predicting a structure is to compare the protein to
proteins with a known three-dimensional structure. Exactly how this works will
not be covered here, but the idea is that if you can derive a structure for a
protein you can use this information, for example, to see which amino acids
play a key role in binding to other proteins. These are important amino acids
that can form a foundation for understanding how a protein actually functions.
A three-dimensional structure that has been correctly predicted gives you
clues about the molecular function of the protein, and you can analyze in quite
some detail what the protein does. You can also think of it the other way
around: if you want to construct a protein with a specific function, the
structure will play an essential role in achieving that function. The function
of the protein might, for example, be to interact with other proteins in a
specific way.
Example 3: Expression analysis
Expression data are data that describe how active a gene (or protein) is under
certain conditions. This means that you might get ideas about which genes are
expressed in, e.g., a certain disease or when a cellular process takes place.
An underlying assumption here is that genes that are similarly expressed act in
the same cellular process, or that their gene products (proteins) act in the
same metabolic pathway (a series of reactions involving several different
proteins). By analyzing the expression of different genes and grouping the
genes with similar expression you might get ideas about their function and
about which cellular process or metabolic pathway they are active in.
Assume that we have two genes with known function (gene A and gene C) and one
gene with unknown function (gene B), and that the expression of these genes has
been measured over time. If the unknown gene B is expressed in a similar way to
a known gene (in this case, gene A), you can draw the conclusion that gene B is
probably involved in the same cellular process or metabolic pathway (reaction
pathway). This can be illustrated by creating a clustering tree (dendrogram)
where genes with the same expression profile are placed near each other and
genes with different expression profiles are placed far away from each other.
Figure 15. Gene expression clustering. Showing the clustering of a gene
expression matrix, where genes with similar profiles are located close to each
other in the tree. Red squares indicate up-regulation and green squares
down-regulation when compared to a control.
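The idea of comparing expression profiles can be sketched in code. The example below is an illustration only: the expression values are made up, and Pearson correlation is used as the similarity measure, which is an assumption since the text does not name a specific measure. A clustering method would build the tree from pairwise similarities of this kind.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression values at five time points (made up for illustration).
profiles = {
    "gene A": [1.0, 2.0, 4.0, 3.0, 1.0],
    "gene B": [1.2, 2.1, 3.8, 3.1, 0.9],  # the gene with unknown function
    "gene C": [4.0, 3.0, 1.0, 2.0, 4.0],
}

unknown = "gene B"
# Compare the unknown gene with every gene of known function.
scores = {
    name: pearson(profiles[unknown], values)
    for name, values in profiles.items()
    if name != unknown
}
closest = max(scores, key=scores.get)
print(f"{unknown} is most similar to {closest}")
```

In this made-up data gene B rises and falls together with gene A and opposite to gene C, so gene A comes out as the most similar known gene.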

13. Bioinformatics exercises

A) A search in Entrez.
Use the Entrez database, http://www.ncbi.nlm.nih.gov/Entrez/, and search for the
accession 1BNJ. Take a look at the Protein sequence database and answer the
following questions:
- What is the name of the protein?
- Which database source (dbsource) does it come from?
- Which organism does it come from?
- What is the function (class) of the protein?
- How many amino acids long is it?
B) A search in PubMed.
Use the PubMed database, http://www.ncbi.nlm.nih.gov/pubmed, and search for the
keyword aromatase. Then answer the following questions:
- How many hits in the library do you get?
- How many of the hits refer to review papers?
- Can you find out the function of this enzyme?
C) A refined search in PubMed.
Use the PubMed database again, http://www.ncbi.nlm.nih.gov/pubmed, but this time
use the search phrase aromatase AND "breast cancer". Thereafter answer the
following questions:
- How many hits did you get this time?
- How many of the hits refer to review papers?
- Can you find out why aromatase is linked to breast cancer development (tip:
  click on the link Review in the list in the upper right corner to get only
  review papers)?