Introduction to Bioinformatics 2. Biology Background

Oct 2, 2013

Introduction to Bioinformatics

2. Genetics Background

Course 341

Department of Computing

Imperial College, London


© Simon Colton

Coursework


1 coursework


worth 20 marks


Work in pairs



Retrieving information from a database



Using Perl to manipulate that information

The Robot Scientist


Performs experiments


Learns from results


Using machine learning


Plans more experiments


Saves time and money



Team member:


Stephen Muggleton

Biological Nomenclature


Need to know the meaning of:


Species, organism, cell, nucleus, chromosome, DNA


Genome, gene, base, residue, protein, amino acid


Transcription, translation, messenger RNA


Codons, genetic code, evolution, mutation, crossover


Polymer, genotype, phenotype, conformation


Inheritance, homology, phylogenetic trees


Substructure and Effect

(Top Down/Bottom Up)

[Diagram: hierarchy of Species, Organism, Cell, Nucleus, Chromosome, DNA strand, Gene and Base, shown alongside Protein and Amino Acid, with arrows labelled "Prescribes", "Folds into", "Affects the function of" and "Affects the behaviour of"]

Cells


Basic unit of life


Different types of cell:


Skin, brain, red/white blood


Different biological function


Cells produced by cells


Cell division (mitosis)


2 daughter cells


Eukaryotic cells


Have a nucleus

Nucleus and Chromosomes


Each cell has nucleus


Rod-shaped particles inside


Are chromosomes


Which we think of in pairs


Different number for species


Human (46), tobacco (48)


Goldfish (94), chimp (48)


Usually paired up


X & Y Chromosomes


Humans: Male (XY), Female (XX)


Birds: Male (ZZ), Female (ZW), the reverse pattern, with the sex chromosomes named Z and W

DNA Strands


Chromosomes are same in every cell of organism


Supercoiled DNA (Deoxyribonucleic acid)


Take a human, take one cell


Determine the structure of all chromosomal DNA


You’ve just read the human genome (for 1 person)


Human genome project


13 years, 3.2 billion chemicals (bases) in human genome


Other genomes being/been decoded:


Pufferfish, fruit fly, mouse, chicken, yeast, bacteria

DNA Structure


Double Helix (Crick & Watson)


2 coiled matching strands


Backbone of sugar phosphate pairs


Nitrogenous base pairs


Roughly 20 atoms in a base


Adenine pairs with Thymine [A,T]


Cytosine pairs with Guanine [C,G]


Weak bonds (can be broken)


Form long chains called polymers


Read the sequence on 1 strand


GATTCATCATGGATCATACTAAC
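
The pairing rule on this slide (A with T, C with G) means that reading one strand is enough to write down the other. A minimal Python sketch, assuming the sequence contains only the letters A, C, G and T; the name `complement` is illustrative and not part of the course material:

```python
# Base pairing from the slide: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the partner base for every position, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand)


# The example sequence from the slide and its matching strand.
print(complement("GATTCATCATGGATCATACTAAC"))
```

The two strands of real DNA run in opposite directions, so sequence tools usually report the reverse of this string (the reverse complement). The slides ignore strand direction, and so does this sketch.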

Differences in DNA


DNA differentiates:


Species/race/gender


Individuals


We share DNA with


Primates, mammals


Fish, plants, bacteria


Genotype


DNA of an individual


Genetic constitution


Phenotype


Characteristics of the
resulting organism


Nature and nurture




Genes


Chunks of DNA sequence


Between 600 and 1200 bases long


32,000 human genes, 100,000 genes in tulips


Large percentage of human genome


Is “junk”: does not code for proteins


“Simpler” organisms such as bacteria


Are much more evolved (have hardly any junk)


Viruses have overlapping genes (zipped/compressed)


Often the active part of a gene is split into exons


Separated by introns

The Synthesis of Proteins


Instructions for generating Amino Acid sequences


(i) DNA double helix is unzipped


(ii) One strand is transcribed to messenger RNA


(iii) RNA acts as a template


Ribosomes translate the RNA into the sequence of amino acids


Amino acid sequences fold into a 3d molecule


Gene expression


Every cell has every gene in it (has all chromosomes)


Which ones produce proteins (are expressed) & when?

Transcription


Take one strand of DNA


Write out the counterparts to each base


G becomes C (and vice versa)


A becomes T (and vice versa)


Change Thymine [T] to Uracil [U]


You have transcribed DNA into messenger RNA


Example:

Start: GGATGCCAATG

Intermediate: CCTACGGTTAC

Transcribed: CCUACGGUUAC
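
The two steps on this slide can be sketched in Python as follows. This is only an illustration of the slide's rule (write the counterpart of each base, then change T to U); the function names are assumptions, and the input is assumed to contain only A, C, G and T:

```python
# Step 1 table: each DNA base is replaced by its counterpart (G<->C, A<->T).
COUNTERPART = str.maketrans("GCAT", "CGTA")


def counterpart_strand(dna: str) -> str:
    """Write out the counterpart of each base (the 'Intermediate' line on the slide)."""
    return dna.translate(COUNTERPART)


def transcribe(dna: str) -> str:
    """Counterpart strand with thymine (T) changed to uracil (U)."""
    return counterpart_strand(dna).replace("T", "U")


start = "GGATGCCAATG"
print("Start:       ", start)
print("Intermediate:", counterpart_strand(start))
print("Transcribed: ", transcribe(start))
```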

Genetic Code


How the translation occurs



Think of this as a function:


Input: triples of base letters (codons)


Output: amino acid


Example: ACC becomes threonine (T)



Gene sequences end with:


TAA, TAG or TGA
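
Treating the genetic code as a function from codons to amino acids, a lookup table is the natural representation. The sketch below builds the standard genetic code in Python from the conventional compact listing (bases in T, C, A, G order, with `*` marking a stop codon); the names `CODON_TABLE` and `amino_acid` are illustrative assumptions:

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, with codons enumerated in
# TCAG order (third base changing fastest). "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every three-letter codon to its one-letter amino acid code.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def amino_acid(codon: str) -> str:
    """Look up one codon; accepts DNA (T) or RNA (U) spelling."""
    return CODON_TABLE[codon.upper().replace("U", "T")]


STOP_CODONS = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

print(amino_acid("ACC"))  # one-letter code for the slide's example codon
print(STOP_CODONS)        # the codons that end a gene sequence
```

Because several codons map to the same amino acid, the function cannot be inverted: 64 codons cover only 20 amino acids plus the stop signal.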

Genetic Code

A=Ala=Alanine

C=Cys=Cysteine

D=Asp=Aspartic acid

E=Glu=Glutamic acid

F=Phe=Phenylalanine

G=Gly=Glycine

H=His=Histidine

I=Ile=Isoleucine

K=Lys=Lysine

L=Leu=Leucine

M=Met=Methionine

N=Asn=Asparagine

P=Pro=Proline

Q=Gln=Glutamine

R=Arg=Arginine

S=Ser=Serine

T=Thr=Threonine

V=Val=Valine

W=Trp=Tryptophan

Y=Tyr=Tyrosine

Example Synthesis


TCGGTGAATCTGTTTGAT

Transcribed to:


AGCCACUUAGACAAACUA

Translated to:


SHLDKL
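
The whole example can be reproduced end to end. The following Python sketch assumes the standard genetic code, reads the messenger RNA three letters at a time from its first base, and stops at the first stop codon; `transcribe` and `translate` are illustrative names, not functions from the course:

```python
from itertools import product

# Standard genetic code over the RNA alphabet, codons enumerated in UCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("UCAG", repeat=3), AMINO_ACIDS)
}

# Counterpart of each DNA base, written directly in RNA letters (A pairs with U).
DNA_TO_MRNA = str.maketrans("GCAT", "CGUA")


def transcribe(dna: str) -> str:
    """Transcribe one DNA strand into messenger RNA."""
    return dna.translate(DNA_TO_MRNA)


def translate(mrna: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    residues = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        residues.append(aa)
    return "".join(residues)


dna = "TCGGTGAATCTGTTTGAT"
mrna = transcribe(dna)
print(mrna)
print(translate(mrna))
```

Any bases left over after the last complete codon are ignored, which is a simplification chosen for this sketch.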

Proteins


DNA codes for


strings of amino acids


Amino acid strings


Fold up into complex 3d molecule


3d structures: conformations


Between 200 & 400 “residues”


Folds are proteins


Residue sequences


Always fold to same conformation


Proteins play a part


In almost every biological process

Evolution of Genes: Inheritance


Evolution of species


Caused by reproduction and survival of the fittest


But actually, it is the genotype which evolves


Organism has to live with it (or die before reproduction)


Three mechanisms: inheritance, mutation and crossover


Inheritance: properties from parents


Embryo has cells with 23 pairs of chromosomes


Each pair: 1 chromosome from father, 1 from mother


Most important factor in offspring’s genetic makeup

Evolution of Genes: Mutation


Genes alter (slightly) during reproduction


Caused by errors, from radiation, from toxicity


3 possibilities: deletion, insertion, substitution


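Deletion: ACGTTGACTC becomes ACGTGACTC


Insertion: ACGTTGACTC becomes AGCGTTGACTC (G inserted after the first base)


Substitution: ACGTTGACTC becomes ACGATGACTT (fourth base changed to A, last base changed to T)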


Mutations are almost always deleterious


A single change can have a massive effect on translation (an insertion or deletion shifts every later codon)


Causes a different protein conformation
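
The three kinds of mutation are simple string operations. The Python sketch below reproduces the slide's examples and then splits the sequences into codons to show how a deletion shifts the reading frame. Function names and the 0-based positions are assumptions made for illustration:

```python
def delete(seq: str, pos: int) -> str:
    """Remove the base at 0-based position pos."""
    return seq[:pos] + seq[pos + 1:]


def insert(seq: str, pos: int, base: str) -> str:
    """Insert a base so that it ends up at 0-based position pos."""
    return seq[:pos] + base + seq[pos:]


def substitute(seq: str, pos: int, base: str) -> str:
    """Replace the base at 0-based position pos."""
    return seq[:pos] + base + seq[pos + 1:]


def codons(seq: str) -> list[str]:
    """Split into complete triples from the first base; leftovers are dropped."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


original = "ACGTTGACTC"
deleted = delete(original, 3)
inserted = insert(original, 1, "G")
substituted = substitute(substitute(original, 3, "A"), 9, "T")

print(deleted, inserted, substituted)
print(codons(original))  # reading frame before the deletion
print(codons(deleted))   # every codon after the deletion point has changed
```

In this particular example the deletion turns the second codon into TGA, one of the three codons that end a gene sequence, so translation read from the first base would stop early.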

Evolution of Genes:

Crossover (Recombination)


DNA sections are swapped


From male and female genetic input to offspring DNA
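
As a rough picture of swapping sections, the sketch below performs a single-point crossover on two equal-length strings. This is a deliberately simplified model (one swap point, chosen by the caller), not a description of how recombination is actually located along a chromosome; the function name and the toy inputs are assumptions:

```python
def crossover(maternal: str, paternal: str, point: int) -> tuple[str, str]:
    """Swap everything after the given 0-based point between two sequences."""
    if len(maternal) != len(paternal):
        raise ValueError("sequences must have the same length")
    child_a = maternal[:point] + paternal[point:]
    child_b = paternal[:point] + maternal[point:]
    return child_a, child_b


# Toy inputs chosen so the swapped sections are easy to see.
print(crossover("AAAAAA", "CCCCCC", 2))
```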

Bioinformatics Application #1

Phylogenetic trees


Understand our evolution


Genes are homologous


If they share a common ancestor


By looking at DNA sequences


For particular genes


See who evolved from whom


Example:


Mammoth most related to


African or Indian Elephants?


LUCA:


Last Universal Common Ancestor


Roughly 4 billion years ago
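
One simple way to compare the same gene across species is to count the fraction of positions at which two sequences differ; the pair with the smallest fraction is the candidate for the closest relationship. The Python sketch below assumes the sequences are already aligned and of equal length. The three sequences are invented placeholders for illustration only, not real data from any species:

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Invented toy sequences (hypothetical, for illustration only).
toy_gene = {
    "species_a": "ACGTTGACTC",
    "species_b": "ACGTTGACTT",
    "species_c": "ACGATGGCTT",
}

# Distance for every unordered pair of species.
distances = {
    pair: p_distance(toy_gene[pair[0]], toy_gene[pair[1]])
    for pair in combinations(toy_gene, 2)
}
closest = min(distances, key=distances.get)

for pair, d in distances.items():
    print(pair, d)
print("closest pair:", closest)
```

Building an actual phylogenetic tree from such distances needs a clustering method on top of this table, which is beyond this sketch.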

Genetic Disorders


Disorders have fuelled much genetics research


Remember that genes have evolved to function


Not to malfunction


Different types of genetic problems


Down's syndrome: three copies of chromosome 21


Cystic fibrosis:


Single base-pair mutation disables a protein


Restricts the flow of ions into certain lung cells


Lung is less able to expel fluids

Bioinformatics Application #2

Predicting Protein Structure


Proteins fold to set up an active site


Small, but highly effective (sub)structure


Active site(s) determine the activity of the protein


Remember that translation is a function


Always same structure given same set of codons


Is there a set of rules governing how proteins fold?


No one has found one yet


“Holy Grail” of bioinformatics

Protein Structure Knowledge


Both protein sequence and structure


Are being determined at an exponential rate


1.3+ million protein sequences known


Found with projects like Human Genome Project


20,000+ protein structures known


Found using techniques like X-ray crystallography


Takes between 1 month and 3 years


To determine the structure of a protein


Process is getting quicker

Sequence versus Structure

[Chart: number of known protein sequences and protein structures plotted by year, 1985 to 2000; vertical axis from 0 to 500,000]

Database Approaches


Slow(er) rate of finding protein structure


Still a good idea to pursue the Holy Grail


Structure is much more conservative than sequence


1.3m genes, but only 2,000 to 10,000 different conformations


First approach to structure prediction:


Store [sequence,structure] pairs in a database


Find ways to score similarity of residue sequences



Given a new sequence, find closest matches


A good match will possibly mean similar protein shape


E.g., sequence identity > 35% will give a good match


Rest of the first half of the course is about these issues
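
The database approach on this slide can be sketched in a few lines of Python. The scoring used here is plain percent identity over ungapped, position-by-position comparison, which is only one of many possible similarity scores. The database entries, structure labels and function names are hypothetical:

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of matching positions, compared over the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return 100.0 * sum(x == y for x, y in zip(a, b)) / n


def closest_matches(query, database, threshold=35.0):
    """Return (score, sequence, structure) hits above the threshold, best first."""
    scored = [
        (percent_identity(query, seq), seq, structure)
        for seq, structure in database
    ]
    return sorted((hit for hit in scored if hit[0] > threshold), reverse=True)


# Hypothetical [sequence, structure] pairs; the structure labels are placeholders.
database = [
    ("SHLDKL", "fold_1"),
    ("SHLEKL", "fold_1"),
    ("GGGGGG", "fold_2"),
]

for score, seq, structure in closest_matches("SHLDKV", database):
    print(f"{seq}  {score:.1f}%  {structure}")
```

The default threshold mirrors the 35% figure on the slide. Real residue sequences differ in length, so a practical scorer would first align them rather than compare position by position.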

Potential (Big) Payoffs

of Protein Structure Prediction


Protein function prediction


Protein interactions and docking



Rational drug design


Inhibit or stimulate protein activity with a drug



Systems biology


Putting it all together: “E-cell” and “E-organism”


In-silico modelling of biological entities and processes

Further Reading


Human Genome Project at Sanger Centre


http://www.sanger.ac.uk/HGP/



Talking glossary of genetic terms


http://www.genome.gov/glossary.cfm



Primer on molecular genetics


http://www.ornl.gov/TechResources/Human_Genome/publicat/primer/toc.html