Bioinformatics


ABE 2007

Kent Koster

Group 3

Why bioinformatics?


“Other techniques raise more questions
than they answer. Bioinformatics is what
answers the questions those techniques
generate.”

Outline


Bioinformatics Defined


Evolution of Bioinformatics


Bioinformatics History


Common Uses of Bioinformatics


Procedures and Tools of Bioinformatics


Our Procedure


Our Results


Resources

Bioinformatics Defined


Bioinformatics is a broad term covering the use of
computer algorithms to analyze biological data.


Differs from “computational biology” in that while
computational biology is the use of computer
technology to solve a single, hypothesis-based
question, bioinformatics is the omnibus use of
computerized statistical analysis to make
statistical or comparative inferences.



i.e. converting “data” to “information.”

The nebulous genesis of
bioinformatics


1977


Φ-X174 Phage Genome sequenced


1990


Paper published in the Journal of Molecular Biology
describes sequence alignment search algorithm


1990s


Software used to find fragment overlap
for the Human Genome Project


1992


NCBI takes over GenBank DNA
sequence database in response to the growing
number of gene patents


The nebulous genesis of
bioinformatics


1994


“Entrez” Global Query Cross-Database
Search System allows users to search GenBank
database


1995


Dr. Owen White writes software to help
find gene elements (promoters, start and stop
codons, etc.) in the sequenced Haemophilus
influenzae genome


1996


NCBI-BLAST created to provide powerful
heuristic searches against the GenBank
database

Genomics to Proteomics through
Bioinformatics


Because proteins are ultimately the tool of all* gene
expression, proteomics is, in effect, the “product” science
made possible by bioinformatics


A proteome is the collection of all proteins expressed in
a cell at a given time


Every organism has 1 genome, but many proteomes


In addition to “high throughput” protein analysis,
proteomics is researched through cDNA analysis (RT-PCR)


Proteomics represents a methodical addition of “large
scale biology” to traditional molecular biology, made
possible by bioinformatics

Common Uses of Bioinformatics


Homology and Comparative Modeling


Protein or gene homology refers to nucleotide or
amino acid sequences or domains shared between
different genes or proteins, whether from the same
or different organisms


Gene or Protein Identification


Searching databases for nucleotide or amino
acid sequences that match sequences in
unknown samples

So, how do ya do it?


DNA Sequencing


Sequence Formats


Sequence Homology Software Tools


Aligning Tools


Annotated Information


Protein Folding

DNA Sequencing


Sanger Method


New nucleotide chains of DNA being
replicated by DNA Polymerase are stopped
when dideoxynucleotides (ddNTPs, added in the
reaction mixture in ~1/100 ratio) are
incorporated into the chain

DNA Sequencing


Fluorescent dyes are bound to the
ddNTPs, allowing the molecule to be detected
when it is excited by a laser


Terminated DNA chains are run on a gel,
and fragments are resolved by size


By combining the fluorescence readings
from each size nucleotide chain, the DNA
sequence is computed
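
The last step can be illustrated with a minimal sketch. It assumes the fluorescence readings have already been reduced to one intensity per dye at each fragment size. The function name, the `min_ratio` threshold and the sample intensities are hypothetical; real base callers work on raw traces and are considerably more involved.

```python
def call_bases(peaks, min_ratio=2.0):
    """Call one base per fragment size from four dye intensities.

    peaks: list of dicts mapping "A", "C", "G", "T" to an intensity,
    ordered from shortest to longest fragment.
    A position is called "N" when the strongest signal is not at least
    min_ratio times the second strongest (a mixed or weak peak).
    """
    calls = []
    for intensities in peaks:
        ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
        best_base, best = ranked[0]
        second = ranked[1][1]
        calls.append(best_base if best >= min_ratio * second else "N")
    return "".join(calls)


# Hypothetical intensities, for illustration only.
example_peaks = [
    {"A": 900, "C": 40, "G": 30, "T": 20},
    {"A": 35, "C": 25, "G": 30, "T": 850},
    {"A": 20, "C": 30, "G": 780, "T": 45},
    {"A": 400, "C": 380, "G": 20, "T": 15},  # two overlapping signals
]
print(call_bases(example_peaks))  # ATGN
```

The "N" call in the last position is the kind of ambiguous reading that shows up as multiple signals in a chromatogram.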


Example Sequence
Chromatogram


Sequence Analysis


First Things First


Sequence File Formats:


Most common for nucleotides: FASTA / Multi-FASTA


“>” followed by any unicode text, entire line read as sequence title


Carriage return followed by continuous 5’-3’ nucleotide sequence or
protein sequence using 1-letter codes


Example:


>E. coli Globin-coupled chemotaxis sensory transducer (TM domain)
ATGGACCTGATCACAAATGCGATTTAGAGACCTGATCACAAATG
CGATGACCTGATCACAAATGCGATGACCTGATCACAAATGCGA
TGTAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGAT
CTAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGATT
AA
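
Reading this format can be sketched in a few lines. This is a minimal parser, assuming well-formed FASTA / Multi-FASTA text; the function name and the two sample records are made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) tuples from FASTA / Multi-FASTA text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            # Sequence lines are joined into one continuous string.
            chunks.append(line.upper())
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


# Two made-up records, for illustration only.
example = """>sample_1 forward read
ATGGACCTGATCAC
AAATGCGAT
>sample_2 forward read
ATGTAA
"""
for title, seq in parse_fasta(example):
    print(title, len(seq))
```

Note that the title must stay on a single line; a title wrapped onto a second line would be read as sequence.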


Sequence Homology Software


NCBI-BLAST


Run by the National Center for Biotechnology
Information


BLAST uses a heuristic algorithm based on the
Smith-Waterman algorithm


Algorithm searches database for a small string within
the query (default 11 for nucleotide searches), then
when it detects a match, searches for shared
nucleotides at each end of the seed to extend the
match


Gaps are taken into account, then the matches are
presented in order of statistical significance


http://www.ncbi.nlm.nih.gov/BLAST/
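
The seed-and-extend idea described above can be sketched as follows. This is a simplified illustration, not BLAST itself: it extends only while bases match exactly, with no gaps and no scoring or statistics. The function name and example sequences are hypothetical.

```python
def seed_and_extend(query, subject, word_size=11):
    """Find exact word matches (seeds) and extend them without gaps.

    Returns a set of (query_start, subject_start, length) tuples.
    """
    # Index every word of length word_size in the query.
    words = {}
    for i in range(len(query) - word_size + 1):
        words.setdefault(query[i:i + word_size], []).append(i)

    hits = set()
    for j in range(len(subject) - word_size + 1):
        for i in words.get(subject[j:j + word_size], []):
            # Extend left while the preceding bases match.
            q_start, s_start = i, j
            while q_start > 0 and s_start > 0 and query[q_start - 1] == subject[s_start - 1]:
                q_start -= 1
                s_start -= 1
            # Extend right while the following bases match.
            q_end, s_end = i + word_size, j + word_size
            while q_end < len(query) and s_end < len(subject) and query[q_end] == subject[s_end]:
                q_end += 1
                s_end += 1
            hits.add((q_start, s_start, q_end - q_start))
    return hits


# A short word size is used here only because the example is tiny.
print(seed_and_extend("GGATCCATGCA", "TTGATCCATGAA", word_size=5))  # {(1, 2, 8)}
```

Several seeds on the same diagonal extend to the same match, which is why the result is collected in a set.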

Different Types of BLAST


Nucleotide-nucleotide BLAST (BLASTN):


Basic nucleotide sequence searches


The BLAST that you used for your sequences


Protein-protein BLAST (BLASTP):


Similar technology used to search amino acid
sequences


Position-Specific Iterative BLAST (PSI-BLAST):


A more advanced protein BLAST useful for analyzing
relationships between divergently evolved proteins.

Different Types of BLAST


BLASTX and TBLASTN variants:


Use six-frame translation of the nucleotide query
(BLASTX) or of the nucleotide database (TBLASTN)
to search at the protein level


MegaBLAST:


Used for BLASTing several sequences at
once to cut down on processing load and
server reporting time
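
Six-frame translation, mentioned above, means translating a nucleotide sequence in all three reading frames on both strands. A minimal sketch using the standard genetic code follows; the function names are illustrative and this is not how BLAST implements it internally.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code; "*" marks a stop codon.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become "X"."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Return a dict of frame label ("+1".."+3", "-1".."-3") to protein string."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```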


Interpreting BLAST Results


Max/Total Score


Calculated from the number of matches and gaps.
Higher relative to your query length is better


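E Value: $E = K m n \, e^{-\lambda S}$
(m = query length, n = database length, S = alignment score;
K and λ are statistical scaling parameters)


Translation: the E Value is the number of matches scoring at least
this well that would be expected by random chance in a database of
this size. e.g. E = 1e-6 means a chance match this good would be
expected only about once in 1,000,000 searches of a database
this size


Smaller E Values are better


Values larger than E = 1e-5 are too likely to be due to
chance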
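
To make the formula concrete, here is a small sketch that evaluates it directly. The default `k` and `lam` values are placeholders chosen for illustration; BLAST reports the actual K and λ used for each search, and they depend on the scoring scheme.

```python
import math


def expected_hits(score, query_len, db_len, k=0.711, lam=1.37):
    """Evaluate E = K * m * n * exp(-lambda * S).

    k and lam are illustrative placeholder values, not the parameters
    of any particular search.
    """
    return k * query_len * db_len * math.exp(-lam * score)


# Hypothetical query of 500 bases against a database of 1e9 bases.
for score in (20, 30, 40):
    print(score, expected_hits(score, 500, 1e9))
```

The E value falls off exponentially as the score rises and grows in proportion to the database size, which is why the same alignment looks less significant against a larger database.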

Interpreting BLAST Results


Query Coverage


The percent of the query sequence matched
by the database entry


Max Ident


The percent identity, i.e. the percent that the
genes match up within the limits of the full
match (e.g. deletions or additions reduce this
value)
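
Both figures can be computed from a single gapped alignment. The sketch below assumes the alignment is given as two equal-length strings with "-" for gaps; the function name and example are hypothetical, and BLAST's own reporting may differ in edge cases (for example when several aligned segments are combined).

```python
def alignment_stats(aligned_query, aligned_subject, query_length):
    """Return (query_coverage, percent_identity) for one gapped alignment.

    aligned_query and aligned_subject are equal-length strings in which
    "-" marks a gap; query_length is the length of the full query.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_query)
    # Identical, non-gap columns count as matches.
    matches = sum(q == s and q != "-" for q, s in zip(aligned_query, aligned_subject))
    # Query bases that take part in the alignment.
    query_bases = sum(q != "-" for q in aligned_query)
    coverage = 100.0 * query_bases / query_length
    identity = 100.0 * matches / columns
    return coverage, identity


# 9 of 12 query bases aligned; 8 of 10 alignment columns identical.
print(alignment_stats("ACGT-ACGTA", "ACGTTACCTA", query_length=12))  # (75.0, 80.0)
```

The gap column lowers the identity, matching the note above that deletions or additions reduce this value.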


Sequence Aligning Software


Clustal (free)


ClustalX: Software


ClustalW: Web


DNAStar ($$$)


Functionality is similar, but difference is in
interface, tools, and speed of algorithms


http://www.ebi.ac.uk/clustalw/
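
For a sense of what an aligning tool does, here is a minimal global pairwise alignment (Needleman-Wunsch) sketch. It is not Clustal's algorithm, which builds multiple alignments progressively; the scoring values are arbitrary assumptions chosen for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (aligned_a, aligned_b, score) for one optimal alignment.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


top, bottom, total = global_align("GATTACA", "GATACA")
print(top)
print(bottom)
print(total)
```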

SMART


Simple Modular Architecture Research Tool


Run by EMBL (European Molecular
Biology Laboratory)


While BLAST compares nucleotide
sequences and then informs you of any
domains that may have been annotated to
them, SMART compares by domains

PFAM


Protein domain database


Manually curated, trading volume for quality


Uses “hidden Markov models” for domain
pattern recognition


Run by Sanger Institute in the UK


Heuristic server-load analysis predicts when key
protein analysis report is due and crashes server


http://www.sanger.ac.uk/Software/Pfam/

Interpro


Database of protein domains and
functional sites


Best source of annotation


Other tools sometimes draw annotation
from Interpro


Run by the European Bioinformatics
Institute


http://www.ebi.ac.uk/interpro/

Protein Folding


Lowest energy state folding


Ab initio: tremendously resource heavy, can
only be done for tiny proteins


Distributed computing is used for mid-sized
proteins


Folding@Home


Human Proteome Folding Project


Rosetta@Home


Predictor@Home

Protein Folding


Software-assisted manual folding


Use knowledge of biochemistry to fold protein
into predicted structure, then software to find
lowest energy state


Commercial Programs:


Protein Shop


Profold

Manual Motif Verification


Ramachandran Plot


plot of the Ψ angle against the Φ angle (backbone dihedral
angles on the C and N sides of the α-carbon) for each residue
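
Each point on a Ramachandran plot comes from two dihedral angles: Φ is defined by the atoms C(i-1), N, Cα, C and Ψ by N, Cα, C, N(i+1). A minimal sketch of the dihedral calculation from four 3-D coordinates is shown below; the function names and test points are illustrative, and reading coordinates from a structure file is left out.

```python
import math


def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Simple geometric test points (not real atom coordinates).
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))  # cis arrangement
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))  # quarter turn
```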

Our Procedure


Colonies were selected from nutrient plates


Each group selected two colonies to sequence


Colonies which survived ampicillin treatment were
possibly transformed by the vector, which contained
an ampicillin resistance gene


Presence of PDI insert was expected to disrupt ccdB
(lethal protein) and LacZα gene expression in vector
plasmid


LacZα expression resulted in some blue colonies, as
the colonies were able to cleave X-Gal substrate into
blue product




Initial Questions Guiding Colony
Selection


How did some blue colonies survive?


Did all blue colonies come from the PCR product?


Did the white colonies contain the PDI inserts?


Were some colonies able to survive without the
ampicillin resistance plasmid?


What was the actual sequence of the commercial
positive control insert?


Some samples were transformed with inserts collected
from PCR instead of gel electrophoresis. Could
non-PDI sequences have ligated to the vector and been
inserted into bacteria?


Procedure


Samples were prepared with T3 and T7
(forward and reverse) primers in solution
for sequencing


Samples were sent to UH Manoa lab for
sequencing


Chromatogram results were viewed with
Finch TV to determine quality

Procedure


Sequences were trimmed at 5’ and 3’
ends, then Finch TV was used to try to locate
the restriction enzyme sites on the vector
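
Locating restriction sites in a trimmed read can also be done with a simple string search. The sketch below is an illustration only: the slides do not name the enzymes used, so the EcoRI recognition site (GAATTC) and the sample read are assumptions, and the function names are hypothetical.

```python
def find_sites(seq, site):
    """Return 0-based start positions of every occurrence of site in seq."""
    seq, site = seq.upper(), site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def insert_between(seq, site):
    """Return the sequence between the first and last occurrence of site.

    Returns None when fewer than two sites are found.
    """
    positions = find_sites(seq, site)
    if len(positions) < 2:
        return None
    return seq.upper()[positions[0] + len(site):positions[-1]]


# Made-up read with an assumed EcoRI site (GAATTC) on each side of the insert.
read = "ccatGAATTCatggcctaaGAATTCttga"
print(find_sites(read, "GAATTC"))      # [4, 19]
print(insert_between(read, "GAATTC"))  # ATGGCCTAA
```

The length of the returned string gives the insert size directly, which is how an insert between two sites can be measured from a clean read.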

Procedure


Sequences were exported in FASTA format


Procedure was repeated for the other strands


Pair-wise alignment was performed for both
strands of each sample with EBI’s tools


Consensus sequence from pair-wise alignment
was searched for in BLAST


Gene information was located from BLAST
annotation and TAIR website


Results


General Remarks


Because colonies were selected prior to the identity of
the positive control insert being questioned, no control
colonies were sequenced


All sequenced white colonies definitively had PDI
gene insert, save for one interesting exception


Some blue colonies showed multiple nucleotide
chromatogram readings, suggesting either sample
contamination or separately transformed E. coli
growing as one colony

Group 3 Results


Sequenced 1 blue and 1 white colony from
same plate


Colonies were transformed with PCR
product, not gel-recovered DNA


White colonies had PDI insert


Blue colonies had 154 bp partial insert,
disrupting ccdB gene, but remaining in-frame
and allowing for a partially functional
LacZ alpha gene to be expressed

Group 3 White Colony


T7 strand definitively showed the presence
of a PDI insert

Group 3 White Colony


T3 and T7 strand consensus sequence
also showed PDI gene presence

Group 3 Blue Colony


Blue colony T3 showed multiple signals

Group 3 Blue Colony


However, T7 strand was salvageable


A 154 nucleotide sequence was found
between the restriction sites

Group 1 Results


White Colony from PCR product showed
PDI gene in both T3 and T7 strands


White colony from gel purification:


T7 strand sequenced as multiple signals


T3 strand sequenced excellently

Group 1 Gel White Colony


T3 sequence showed only nucleotides
1540-2320 of the vector

Group 2 Results


White Colony from gel purification


White colonies sequenced with PDI gene


Blue w/ White Ring Colony from PCR


Both T3 and T7 strand sequencing showed
consistent multiple signals

Group 4 Results


1 white colony from PCR and 1 white
colony from gel purification were
sequenced


Both showed PDI gene

Final Remarks


All white colonies had the PDI gene, except one with a modified vector


All blue colonies were transformed with the direct PCR product (not gel
purified)


Group 3 showed that a small (154 bp) insert that stays in-frame with the
LacZ gene can knock out the ccdB, while still allowing the expression of an
at least partially functioning LacZ gene


Some blue colonies with white rings could be 2 separate lines living
together


Bacteria transformed with ampicillin resistance gene could deplete area of
ampicillin, allowing bacteria without the gene to crowd the white bacteria out of
the area of depleted ampicillin


How could bacteria without the insert survive both ccdB expression and ampicillin
selection in broth?


ccdB gene could be lost due to mutation


Bacteria could have cut plasmid, deleting the ccdB, but retaining ampicillin
resistance and possibly LacZ genes


No group sequenced the positive control insert


sequence still a mystery!

Resources


http://www.bioinformatics.org


http://syntheticbiology.org/Tools.html


NCBI BLAST:
http://www.ncbi.nlm.nih.gov/BLAST/


SMART:
http://smart.embl-heidelberg.de/


PFAM:
http://www.sanger.ac.uk/Software/Pfam/


Interpro:
http://www.ebi.ac.uk/interpro/


Canadian Bioinformatics Helpdesk Newsletter (Ramachandran
Plot):
http://gchelpdesk.ualberta.ca/news/22sep05/cbhd_news_22sep05.php


Finch TV:
http://www.geospiza.com/finchtv/


EBI Pair-wise alignment:
http://www.ebi.ac.uk/emboss/align/index.html


TAIR:
http://www.arabidopsis.org