BIOINFORMATICS Introduction - Yale University

1 (c) Mark Gerstein,1999,Yale,bioinfo.mbb.yale.edu
BIOINFORMATICS
Introduction
Mark Gerstein, Yale University
bioinfo.mbb.yale.edu/mbb452a

Bioinformatics
Biological Data + Computer Calculations

What is Bioinformatics?
(Molecular) Bio-informatics
One idea for a definition?
Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical chemistry) and then applying informatics techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large scale.
Bioinformatics is MIS for Molecular Biology Information

Molecular Biology: an Information Science
Central Dogma of Molecular Biology
DNA -> RNA -> Protein -> Phenotype -> DNA
Molecules
◊ Sequence, Structure, Function
Processes
◊ Mechanism, Specificity, Regulation
Central Paradigm for Bioinformatics
Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype
Large Amounts of Information
◊ Standardized
◊ Statistical
(idea from D Brutlag, Stanford, graphics from S Strobel)
Genetic material
Information transfer (mRNA)
Protein synthesis (tRNA/mRNA)
Some catalytic activity
Most cellular functions are performed or facilitated by proteins.
Primary biocatalyst
Cofactor transport/storage
Mechanical motion/support
Immune protection
Control of growth/differentiation

Molecular Biology Information - DNA
Raw DNA Sequence
◊ Coding or Not?
◊ Parse into genes?
◊ 4 bases: AGCT
◊ ~1 K in a gene, ~2 M in genome
atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgca
gcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatac
atggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtg
aaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatcca
gcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattc
ttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaact
ggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgca
ggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgt
gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact
gcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgca
tcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacct
gcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgtt
gttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatc
aaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacact
gaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgca
gacgctggtatcgcattaactgattctttcgttaaattggtatc...
...caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaa
caacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtgg
cgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgctt
gctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgg
gttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact
acaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaacc
aatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtc
ggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaa
aaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg
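
One rough way to approach the "Coding or Not?" question is to translate a reading frame and look for stop codons. The sketch below is illustrative only: it assumes the standard genetic code and reads only frame 0 of the forward strand, and the helper names `translate` and `base_counts` are made up for this example. A real gene finder would scan all six frames and use statistical signals as well.

```python
from collections import Counter
from itertools import product

# Standard genetic code (an assumption of this sketch), with codons
# enumerated in the conventional T, C, A, G order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate frame 0 of a DNA string; '*' marks a stop codon."""
    dna = dna.lower()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for codons with unknown bases
        for i in range(0, usable, 3)
    )


def base_counts(dna):
    """Count how often each base occurs."""
    return Counter(dna.lower())


# First line of the raw sequence shown above.
first_line = "atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgca"
protein = translate(first_line)
print(protein)
print("stop codon in frame 0:", "*" in protein)
print(base_counts(first_line))
```

A long stretch without a stop codon is only a hint that a frame may be coding, not proof.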

Molecular Biology Information: Protein Sequence
20 letter alphabet
◊ ACDEFGHIKLMNPQRSTVWY but not BJOUXZ
Strings of ~300 aa in an average protein (in bacteria), ~200 aa in a domain
~200 K known protein sequences
d1dhfa_LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI
d3dfr__TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF
d1dhfa_LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_ISLIAALAVDRVIGMENAMPW-NLPADLAWFKRNTLD--------KPVIMGRHTWESI
d3dfr__TAFLWAQDRNGLIGKDGHLPW-HLPDDLHYFRAQTVG--------KIMVVGRRTYESF
d1dhfa_VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
d8dfr__VPEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
d4dfra_---G-RPLPGRKNIILS-SQPGTDDRV-TWVKSVDEAIAACGDVP------EIMVIGGGRVYEQFLPKA
d3dfr__---PKRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLDQ----ELVIAGGAQIFTAFKDDV
d1dhfa_-PEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
d8dfr__-PEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
d4dfra_-G---RPLPGRKNIILSSSQPGTDDRV-TWVKSVDEAIAACGDVPE-----.IMVIGGGRVYEQFLPKA
d3dfr__-P--KRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLD----QELVIAGGAQIFTAFKDDV
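
A simple number to compute from aligned rows like those above is percent identity. The sketch below uses one possible definition (identical residues divided by columns where neither row has a gap); other definitions exist, and the function name `percent_identity` is made up for this example. The three rows are copied from the first alignment block, without their identifiers.

```python
def percent_identity(row_a, row_b, gap_chars="-."):
    """Percent of identical residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(row_a, row_b):
        if x in gap_chars or y in gap_chars:
            continue  # skip columns containing a gap
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


d1dhfa = "LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI"
d8dfr = "LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI"
d4dfra = "ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI"

print(f"d1dhfa vs d8dfr:  {percent_identity(d1dhfa, d8dfr):.1f}%")
print(f"d1dhfa vs d4dfra: {percent_identity(d1dhfa, d4dfra):.1f}%")
```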

Molecular Biology Information: Macromolecular Structure
DNA/RNA/Protein
◊ Almost all protein
(RNA adapted from D Soll web page, right-hand top protein from M Levitt web page)

Molecular Biology Information: Protein Structure Details
Statistics on Number of XYZ triplets
◊ 200 residues/domain -> 200 CA atoms, separated by 3.8 Å
◊ Avg. residue is Leu: 4 backbone atoms + 4 sidechain atoms, 150 cubic Å
=> ~1500 xyz triplets (= 8 x 200) per protein domain
◊ 10 K known domains, ~300 folds
ATOM      1  C   ACE   0    9.401  30.166  60.595  1.00 49.88  1GKY   67
ATOM      2  O   ACE   0   10.432  30.832  60.722  1.00 50.35  1GKY   68
ATOM      3  CH3 ACE   0    8.876  29.767  59.226  1.00 50.04  1GKY   69
ATOM      4  N   SER   1    8.753  29.755  61.685  1.00 49.13  1GKY   70
ATOM      5  CA  SER   1    9.242  30.200  62.974  1.00 46.62  1GKY   71
ATOM      6  C   SER   1   10.453  29.500  63.579  1.00 41.99  1GKY   72
ATOM      7  O   SER   1   10.593  29.607  64.814  1.00 43.24  1GKY   73
ATOM      8  CB  SER   1    8.052  30.189  63.974  1.00 53.00  1GKY   74
ATOM      9  OG  SER   1    7.294  31.409  63.930  1.00 57.79  1GKY   75
ATOM     10  N   ARG   2   11.360  28.819  62.827  1.00 36.48  1GKY   76
ATOM     11  CA  ARG   2   12.548  28.316  63.532  1.00 30.20  1GKY   77
ATOM     12  C   ARG   2   13.502  29.501  63.500  1.00 25.54  1GKY   78
...
ATOM   1444  CB  LYS 186   13.836  22.263  57.567  1.00 55.06  1GKY 1510
ATOM   1445  CG  LYS 186   12.422  22.452  58.180  1.00 53.45  1GKY 1511
ATOM   1446  CD  LYS 186   11.531  21.198  58.185  1.00 49.88  1GKY 1512
ATOM   1447  CE  LYS 186   11.452  20.402  56.860  1.00 48.15  1GKY 1513
ATOM   1448  NZ  LYS 186   10.735  21.104  55.811  1.00 48.41  1GKY 1514
ATOM   1449  OXT LYS 186   16.887  23.841  56.647  1.00 62.94  1GKY 1515
TER    1450      LYS 186                                       1GKY 1516
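
The "CA atoms separated by 3.8 Å" figure can be checked directly against the coordinates listed above. The sketch below is a minimal illustration, not a PDB parser: it assumes the fields are whitespace-separated (real PDB files use fixed columns), and `parse_atoms` is a hypothetical helper written for this example. The three records are copied from the listing.

```python
import math

# Records copied from the listing above, with fields separated by spaces.
RECORDS = """\
ATOM 5 CA SER 1 9.242 30.200 62.974 1.00 46.62 1GKY 71
ATOM 10 N ARG 2 11.360 28.819 62.827 1.00 36.48 1GKY 76
ATOM 11 CA ARG 2 12.548 28.316 63.532 1.00 30.20 1GKY 77
"""


def parse_atoms(text):
    """Parse whitespace-separated ATOM records into dictionaries."""
    atoms = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "ATOM":
            continue  # skip TER and any other record types
        atoms.append({
            "name": fields[2],
            "residue": fields[3],
            "resnum": int(fields[4]),
            "xyz": tuple(float(v) for v in fields[5:8]),
        })
    return atoms


# Distance between the alpha carbons of residues 1 and 2.
ca = [a for a in parse_atoms(RECORDS) if a["name"] == "CA"]
d = math.dist(ca[0]["xyz"], ca[1]["xyz"])
print(f"CA-CA distance: {d:.2f} A")
```

For these two residues the distance comes out close to the 3.8 Å quoted on the slide.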

Genomes highlight the Finiteness of the World of Sequences
1995: Bacteria, 1.6 Mb, ~1600 genes [Science 269: 496]
1997: Eukaryote, 13 Mb, ~6 K genes [Nature 387: 1]
1998: Animal, ~100 Mb, ~20 K genes [Science 282: 1945]
2000?: Human, ~3 Gb, ~100 K genes [???]

Molecular Biology Information: Whole Genomes
The Revolution Driving Everything
Fleischmann, R.D., Adams, M.D., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage, A.R., Bult, C.J., Tomb, J.F., Dougherty, B.A., Merrick, J.M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J.D., Scott, J., Shirley, R., Liu, L.I., Glodek, A., Kelley, J.M., Weidman, J.F., Phillips, C.A., Spriggs, T., Hedblom, E., Cotton, M.D., Utterback, T.R., Hanna, M.C., Nguyen, D.T., Saudek, D.M., Brandon, R.C., Fine, L.D., Fritchman, J.L., Fuhrmann, J.L., Geoghagen, N.S.M., Gnehm, C.L., McDonald, L.A., Small, K.V., Fraser, C.M., Smith, H.O. & Venter, J.C. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae Rd." Science 269: 496-512.
(Picture adapted from TIGR website, http://www.tigr.org)
Integrative Data
1995, HI (bacteria): 1.6 Mb & 1600 genes done
1997, yeast: 13 Mb & ~6000 genes for yeast
1998, worm: ~100 Mb with 19 K genes
1999: >30 completed genomes!
2003, human: 3 Gb & 100 K genes...
Genome sequence now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading. -- G A Pekso, Nature 401: 115-116 (1999)

Gene Expression Datasets: the Transcriptosome
Young/Lander, Chips, Abs. Exp.
Brown, µarray, Rel. Exp. over Timecourse
Snyder, Transposons, Protein Exp.
Also: SAGE; Samson and Church, Chips; Aebersold, Protein Expression

Array Data
(courtesy of J Hager)
Yeast Expression Data in Academia: levels for all 6000 genes!
Can only sequence genome once but can do an infinite variety of these array experiments
at 10 time points, 6000 x 10 = 60 K floats
telling signal from background

Other Whole-Genome Experiments
Systematic Knockouts
Winzeler, E.A., Shoemaker, D.D., Astromoff, A., Liang, H., Anderson, K., Andre, B., Bangham, R., Benito, R., Boeke, J.D., Bussey, H., Chu, A.M., Connelly, C., Davis, K., Dietrich, F., Dow, S.W., El Bakkoury, M., Foury, F., Friend, S.H., Gentalen, E., Giaever, G., Hegemann, J.H., Jones, T., Laub, M., Liao, H., Davis, R.W. et al. (1999). Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285, 901-6
2 hybrids, linkage maps
Hua, S.B., Luo, Y., Qiu, M., Chan, E., Zhou, H. & Zhu, L. (1998). Construction of a modular yeast two-hybrid cDNA library from human EST clones for the human genome protein linkage map. Gene 215, 143-52
For yeast: 6000 x 6000 / 2 ~ 18 M interactions
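
The "~18 M" estimate is the number of unordered pairs among about 6000 proteins. As a quick check of the arithmetic (a sketch; `unordered_pairs` is a name made up here), the exact count is n(n-1)/2 without self-pairs, or n(n+1)/2 if a protein may also pair with itself:

```python
def unordered_pairs(n, include_self=False):
    """Number of unordered pairs among n items."""
    if include_self:
        return n * (n + 1) // 2
    return n * (n - 1) // 2


print(unordered_pairs(6000))                     # 17997000
print(unordered_pairs(6000, include_self=True))  # 18003000
```

Either way the count rounds to the 18 million possible interactions quoted on the slide.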

Molecular Biology Information: Other Integrative Data
Information to understand genomes
◊ Metabolic Pathways (glycolysis), traditional biochemistry
◊ Regulatory Networks
◊ Whole Organisms Phylogeny, traditional zoology
◊ Environments, Habitats, ecology
◊ The Literature (MEDLINE)
The Future....
(Pathway drawing from P Karp's EcoCyc, Phylogeny from S J Gould, Dinosaur in a Haystack)

Exponential Growth of Data Matched by Development of Computer Technology
CPU vs Disk & Net
◊ As important as the increase in computer speed has been, the ability to store large amounts of information on computers is even more crucial
Driving Force in Bioinformatics
(Internet picture adapted from D Brutlag, Stanford)
[Figure: plots against year (1979 to 1995) of CPU Instruction Time (ns), Num. Protein Domain Structures, and Internet Hosts]

Bioinformatics is born!
(courtesy of Finn Drablos)

Weber Cartoon

The Character of Molecular Biology Information: Redundancy and Multiplicity
Different Sequences Have the Same Structure
Organism has many similar genes
Single Gene May Have Multiple Functions
Genes are grouped into Pathways
Genomic Sequence Redundancy due to the Genetic Code
How do we find the similarities?.....
Integrative Genomics - genes ↔ structures ↔ functions ↔ pathways ↔ expression levels ↔ regulatory systems ↔ ...

New Paradigm for Scientific Computing
Because of increase in data and improvement in computers, new calculations become possible
But Bioinformatics has a new style of calculation...
◊ Two Paradigms
Physics
◊ Prediction based on physical principles
◊ Exact Determination of Rocket Trajectory
◊ Supercomputer, CPU
Biology
◊ Classifying information and discovering unexpected relationships
◊ globin ~ colicin ~ plastocyanin ~ repressor
◊ networks, federated database

General Types of Informatics in Bioinformatics
Databases
◊ Building, Querying
◊ Object DB
Text String Comparison
◊ Text Search
◊ 1D Alignment
◊ Significance Statistics
◊ AltaVista, grep
Finding Patterns
◊ AI / Machine Learning
◊ Clustering
◊ Datamining
Geometry
◊ Robotics
◊ Graphics (Surfaces, Volumes)
◊ Comparison and 3D Matching (Vision, recognition)
Physical Simulation
◊ Newtonian Mechanics
◊ Electrostatics
◊ Numerical Algorithms
◊ Simulation

Bioinformatics Topics -- Genome Sequence
Finding Genes in Genomic DNA
◊ introns
◊ exons
◊ promoters
Characterizing Repeats in Genomic DNA
◊ Statistics
◊ Patterns
Duplications in the Genome

Bioinformatics Topics -- Protein Sequence
Sequence Alignment
◊ non-exact string matching, gaps
◊ How to align two strings optimally via Dynamic Programming
◊ Local vs Global Alignment
◊ Suboptimal Alignment
◊ Hashing to increase speed (BLAST, FASTA)
◊ Amino acid substitution scoring matrices
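
The dynamic programming idea for global alignment can be sketched as follows. This is a minimal Needleman-Wunsch-style illustration, not the course's own code: the scores (match +1, mismatch -1, linear gap -2) are arbitrary assumptions, and a real protein alignment would replace the match/mismatch rule with a substitution matrix. Local alignment would differ by clamping cell scores at zero and tracing back from the best cell.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


s, x, y = global_alignment("ACGT", "AGT")
print(s)
print(x)
print(y)
```

The table has (n+1) x (m+1) cells, so time and memory both grow with the product of the sequence lengths, which is one reason hashing heuristics are used for database searches.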
Multiple Alignment and Consensus Patterns
◊ How to align more than one sequence and then fuse the result in a consensus representation
◊ Transitive Comparisons
◊ HMMs, Profiles
◊ Motifs
Scoring schemes and Matching statistics
◊ How to tell if a given alignment or match is statistically significant
◊ A P-value (or an e-value)?
◊ Score Distributions (extreme val. dist.)
◊ Low Complexity Sequences
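
If scores for unrelated sequences are assumed to follow an extreme value (Gumbel) distribution, a P-value for an observed score follows from its tail: P(S >= x) = 1 - exp(-exp(-(x - mu) / beta)). The sketch below evaluates that formula; the location `mu` and scale `beta` are hypothetical values here, since in practice they would have to be fitted to scores from random or unrelated comparisons.

```python
import math


def evd_p_value(score, mu, beta):
    """Probability of a score >= `score` under a Gumbel(mu, beta) distribution."""
    z = (score - mu) / beta
    # 1 - exp(-exp(-z)), written with expm1 to stay accurate for tiny P-values.
    return -math.expm1(-math.exp(-z))


# Hypothetical parameters, for illustration only.
mu, beta = 20.0, 5.0
for s in (20, 35, 50):
    print(s, evd_p_value(s, mu, beta))
```

For scores well above `mu` the P-value falls off roughly as exp(-(x - mu) / beta), so each additional `beta` units of score lowers it by about a factor of e.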

Bioinformatics Topics -- Sequence / Structure
Secondary Structure Prediction
◊ via Propensities
◊ Neural Networks, Genetic Alg.
◊ Simple Statistics
◊ TM-helix finding
◊ Assessing Secondary Structure Prediction
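
TM-helix finding is often introduced with a sliding-window hydropathy average. The sketch below uses the Kyte-Doolittle hydropathy scale; the 19-residue window and the 1.6 cutoff are commonly quoted heuristic values and should be treated as assumptions here, not as the method used in the course. The function names and the test sequence are made up for illustration.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Average hydropathy of every window of the given length."""
    values = [KD[aa] for aa in seq]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start positions (0-based) of windows whose average exceeds the cutoff."""
    profile = hydropathy_profile(seq, window)
    return [i for i, h in enumerate(profile) if h >= threshold]


# Made-up sequence: a hydrophobic stretch flanked by charged residues.
toy = "D" * 10 + "L" * 19 + "D" * 10
print(candidate_tm_windows(toy))
```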
Tertiary Structure Prediction
◊ Fold Recognition
◊ Threading
◊ Ab initio
Function Prediction
◊ Active site identification
Relation of Sequence Similarity to Structural Similarity

Topics -- Structures
Basic Protein Geometry and Least-Squares Fitting
◊ Distances, Angles, Axes, Rotations
Calculating a helix axis in 3D via fitting a line
◊ LSQ fit of 2 structures
◊ Molecular Graphics
Calculation of Volume and Surface
◊ How to represent a plane
◊ How to represent a solid
◊ How to calculate an area
◊ Docking and Drug Design as Surface Matching
◊ Packing Measurement
Structural Alignment
◊ Aligning sequences on the basis of 3D structure.
◊ DP does not converge, unlike sequences, what to do?
◊ Other Approaches: Distance Matrices, Hashing
◊ Fold Library

Topics -- Databases
Relational Database Concepts
◊ Keys, Foreign Keys
◊ SQL, OODBMS, views, forms, transactions, reports, indexes
◊ Joining Tables, Normalization
Natural Join as "where" selection on cross product
Array Referencing (perl/dbm)
◊ Forms and Reports
◊ Cross-tabulation
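
The statement that a natural join is a "where" selection on a cross product can be tried out with Python's built-in `sqlite3` module. The two tables and their rows below are invented purely for illustration; only the equivalence of the two queries is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: gene_id is the key shared by both.
    CREATE TABLE gene (gene_id TEXT PRIMARY KEY, organism TEXT);
    CREATE TABLE fold (gene_id TEXT REFERENCES gene(gene_id), fold_name TEXT);
    INSERT INTO gene VALUES ('g1', 'yeast'), ('g2', 'yeast'), ('g3', 'worm');
    INSERT INTO fold VALUES ('g1', 'TIM barrel'), ('g3', 'Ig'), ('g3', 'globin');
""")

# Natural join: rows are matched on the column name the tables share.
natural = conn.execute("""
    SELECT gene.gene_id, organism, fold_name
    FROM gene NATURAL JOIN fold
    ORDER BY gene.gene_id, fold_name
""").fetchall()

# Same result as a cross product filtered by a WHERE clause on the key.
cross_where = conn.execute("""
    SELECT gene.gene_id, organism, fold_name
    FROM gene, fold
    WHERE gene.gene_id = fold.gene_id
    ORDER BY gene.gene_id, fold_name
""").fetchall()

# The unfiltered cross product has 3 x 3 = 9 rows.
cross_size = conn.execute("SELECT COUNT(*) FROM gene, fold").fetchone()[0]

print(natural)
print(natural == cross_where, cross_size)
```

Gene `g2` has no row in `fold`, so it drops out of both results; keeping it would require an outer join.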
Protein Units?
◊ What are the units of biological information?
sequence, structure
motifs, modules, domains
◊ How classified: folds, motions, pathways, functions?
Clustering and Trees
◊ Basic clustering
UPGMA
single-linkage
multiple linkage
◊ Other Methods
Parsimony, Maximum likelihood
◊ Evolutionary implications
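
Single-linkage clustering, listed above, can be sketched in a few lines. This is a deliberately naive illustration: the distances are invented, the `cutoff` stopping rule is an assumption of this example, and the function name is hypothetical. The distance between two clusters is taken as the minimum over their members; UPGMA would use the average instead.

```python
def single_linkage(names, dist, cutoff):
    """Merge clusters while the closest pair of clusters is within `cutoff`.

    `dist` maps frozenset({a, b}) to the distance between items a and b.
    """
    clusters = [{name} for name in names]

    def cluster_distance(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(dist[frozenset((x, y))] for x in c1 for y in c2)

    while len(clusters) > 1:
        best, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best > cutoff:
            break
        clusters[i] |= clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters


# Made-up pairwise distances between four sequences.
pairs = {("A", "B"): 1.0, ("C", "D"): 1.5, ("A", "C"): 5.0,
         ("A", "D"): 6.0, ("B", "C"): 4.0, ("B", "D"): 7.0}
dist = {frozenset(k): v for k, v in pairs.items()}

print(single_linkage("ABCD", dist, cutoff=2.0))
```

Recording the order and height of the merges, rather than stopping at a cutoff, would give the tree.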
The Bias Problem
◊ sequence weighting
◊ sampling

Topics -- Genomics
Expression Analysis
◊ Time Courses clustering
◊ Measuring differences
◊ Identifying Regulatory Regions
Large scale cross referencing of information
Function Classification and Orthologs
The Genomic vs. Single-molecule Perspective
Genome Comparisons
◊ Ortholog Families, pathways
◊ Large-scale censuses
◊ Frequent Words Analysis
◊ Genome Annotation
◊ Trees from Genomes
◊ Identification of interacting proteins
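
Frequent words analysis amounts to counting overlapping substrings of a fixed length k (k-mers) in a sequence. The sketch below shows one simple way to do that; the function name and the choice of k are assumptions of this example. The input is the first line of the raw DNA sequence from the earlier slide.

```python
from collections import Counter


def frequent_words(seq, k, top=5):
    """Return the `top` most common overlapping k-letter words with counts."""
    seq = seq.lower()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts.most_common(top)


dna = "atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgca"
for word, count in frequent_words(dna, k=3):
    print(word, count)
```

Comparing the observed counts with what the base composition alone would predict is what makes a word over- or under-represented.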
Structural Genomics
◊ Folds in Genomes, shared & common folds
◊ Bulk Structure Prediction
Genome Trees


Topics -- Simulation
Molecular Simulation
◊ Geometry -> Energy -> Forces
◊ Basic interactions, potential energy functions
◊ Electrostatics
◊ VDW Forces
◊ Bonds as Springs
◊ How structure changes over time?
How to measure the change in a vector (gradient)
◊ Molecular Dynamics & MC
◊ Energy Minimization
Parameter Sets
Number Density
Poisson-Boltzmann Equation
Lattice Models and Simplification
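
The chain Geometry -> Energy -> Forces can be illustrated with a single bond treated as a spring. In this sketch the energy is the harmonic form E = (k/2)(r - r0)^2 and the force on an atom is the negative gradient of E with respect to that atom's position. The force constant, rest length and coordinates are made-up numbers, and the function names are hypothetical; the finite-difference routine is there only to check the analytic gradient.

```python
import math


def bond_energy(p, q, k, r0):
    """Harmonic bond energy for atoms at positions p and q."""
    r = math.dist(p, q)
    return 0.5 * k * (r - r0) ** 2


def bond_force_on_p(p, q, k, r0):
    """Analytic force on atom p: F = -dE/dp = -k (r - r0) (p - q) / r."""
    r = math.dist(p, q)
    scale = -k * (r - r0) / r
    return tuple(scale * (pi - qi) for pi, qi in zip(p, q))


def numerical_force_on_p(p, q, k, r0, h=1e-6):
    """Central-difference estimate of -dE/dp, used to check the formula."""
    force = []
    for axis in range(3):
        plus, minus = list(p), list(p)
        plus[axis] += h
        minus[axis] -= h
        slope = (bond_energy(plus, q, k, r0) - bond_energy(minus, q, k, r0)) / (2 * h)
        force.append(-slope)
    return tuple(force)


# Made-up example: a bond stretched to 2.0 with rest length 1.5.
p, q = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)
print(bond_energy(p, q, k=10.0, r0=1.5))
print(bond_force_on_p(p, q, k=10.0, r0=1.5))
print(numerical_force_on_p(p, q, k=10.0, r0=1.5))
```

The stretched bond pulls atom p toward q. Energy minimization and molecular dynamics both build on forces computed this way, summed over all interaction terms.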

Bioinformatics Schematic

Background
Need to Know Today
Math: Calculation of Standard Deviation, a Bell-shaped Distribution (of test scores), a 3D vector
Biology: DNA, RNA, alpha-helix, the cell nucleus, ATP
What You'll Learn
Math: Force is the Derivative (grad) of Energy, Rotation Matrices (3D), a P-value of .01 and an Extreme Value Distribution
Biology: Proteins are tightly packed, sequence homology twilight zone, protein families
Not really necessary.
Math: Poisson-Boltzmann Equation, Design a Hashing Function, Write a Recursive Descent Parser
Biology: What GroEL does, a worm is a metazoa, E. coli is gram negative, what chemokines are

Are They or Aren't They Bioinformatics? (#1)
Digital Libraries
◊ Automated Bibliographic Search and Textual Comparison
◊ Knowledge bases for biological literature
Motif Discovery Using Gibbs Sampling
Methods for Structure Determination
◊ Computational Crystallography
Refinement
◊ NMR Structure Determination
Distance Geometry
Metabolic Pathway Simulation
The DNA Computer

Are They or Aren't They Bioinformatics? (#1, Answers)
(YES?) Digital Libraries
◊ Automated Bibliographic Search and Textual Comparison
◊ Knowledge bases for biological literature
(YES) Motif Discovery Using Gibbs Sampling
(NO?) Methods for Structure Determination
◊ Computational Crystallography
Refinement
◊ NMR Structure Determination
(YES) Distance Geometry
(YES) Metabolic Pathway Simulation
(NO) The DNA Computer

Are They or Aren't They Bioinformatics? (#2)
Gene identification by sequence inspection
◊ Prediction of splice sites
DNA methods in forensics
Modeling of Populations of Organisms
◊ Ecological Modeling
Genomic Sequencing Methods
◊ Assembling Contigs
◊ Physical and genetic mapping
Linkage Analysis
◊ Linking specific genes to various traits

Are They or Aren't They Bioinformatics? (#2, Answers)
(YES) Gene identification by sequence inspection
◊ Prediction of splice sites
(YES) DNA methods in forensics
(NO) Modeling of Populations of Organisms
◊ Ecological Modeling
(NO?) Genomic Sequencing Methods
◊ Assembling Contigs
◊ Physical and genetic mapping
(YES) Linkage Analysis
◊ Linking specific genes to various traits

Are They or Aren't They Bioinformatics? (#3)
RNA structure prediction
Identification in sequences
Radiological Image Processing
◊ Computational Representations for Human Anatomy (visible human)
Artificial Life Simulations
◊ Artificial Immunology / Computer Security
◊ Genetic Algorithms in molecular biology
Homology modeling
Determination of Phylogenies Based on Non-molecular Organism Characteristics
Computerized Diagnosis based on Genetic Analysis (Pedigrees)

Are They or Aren't They Bioinformatics? (#3, Answers)
(YES) RNA structure prediction
Identification in sequences
(NO) Radiological Image Processing
◊ Computational Representations for Human Anatomy (visible human)
(NO) Artificial Life Simulations
◊ Artificial Immunology / Computer Security
◊ (NO?) Genetic Algorithms in molecular biology
(YES) Homology modeling
(NO) Determination of Phylogenies Based on Non-molecular Organism Characteristics
(NO) Computerized Diagnosis based on Genetic Analysis (Pedigrees)

Major Application I: Designing Drugs
Understanding How Structures Bind Other Molecules (Function)
Designing Inhibitors
Docking, Structure Modeling
(From left to right, figures adapted from Olsen Group Docking Page at Scripps, Dyson NMR Group Web page at Scripps, and from Computational Chemistry Page at Cornell Theory Center).

Major Application II: Finding Homologues
Find Similar Ones in Different Organisms
Human vs. Mouse vs. Yeast
◊ Easier to do Expts. on latter!
(Section from NCBI Disease Genes Database Reproduced Below.)
Best Sequence Similarity Matches to Date Between Positionally Cloned Human Genes and S. cerevisiae Proteins
Human Disease | MIM # | Human Gene | GenBank Acc# for Human cDNA | BLASTX P-value | Yeast Gene | GenBank Acc# for Yeast cDNA | Yeast Gene Description
Hereditary Non-polyposis Colon Cancer | 120436 | MSH2 | U03911 | 9.2e-261 | MSH2 | M84170 | DNA repair protein
Hereditary Non-polyposis Colon Cancer | 120436 | MLH1 | U07418 | 6.3e-196 | MLH1 | U07187 | DNA repair protein
Cystic Fibrosis | 219700 | CFTR | M28668 | 1.3e-167 | YCF1 | L35237 | Metal resistance protein
Wilson Disease | 277900 | WND | U11700 | 5.9e-161 | CCC2 | L36317 | Probable copper transporter
Glycerol Kinase Deficiency | 307030 | GK | L13943 | 1.8e-129 | GUT1 | X69049 | Glycerol kinase
Bloom Syndrome | 210900 | BLM | U39817 | 2.6e-119 | SGS1 | U22341 | Helicase
Adrenoleukodystrophy, X-linked | 300100 | ALD | Z21876 | 3.4e-107 | PXA1 | U17065 | Peroxisomal ABC transporter
Ataxia Telangiectasia | 208900 | ATM | U26455 | 2.8e-90 | TEL1 | U31331 | PI3 kinase
Amyotrophic Lateral Sclerosis | 105400 | SOD1 | K00065 | 2.0e-58 | SOD1 | J03279 | Superoxide dismutase
Myotonic Dystrophy | 160900 | DM | L19268 | 5.4e-53 | YPK1 | M21307 | Serine/threonine protein kinase
Lowe Syndrome | 309000 | OCRL | M88162 | 1.2e-47 | YIL002C | Z47047 | Putative IPP-5-phosphatase
Neurofibromatosis, Type 1 | 162200 | NF1 | M89914 | 2.0e-46 | IRA2 | M33779 | Inhibitory regulator protein
Choroideremia | 303100 | CHM | X78121 | 2.1e-42 | GDI1 | S69371 | GDP dissociation inhibitor
Diastrophic Dysplasia | 222600 | DTD | U14528 | 7.2e-38 | SUL1 | X82013 | Sulfate permease
Lissencephaly | 247200 | LIS1 | L13385 | 1.7e-34 | MET30 | L26505 | Methionine metabolism
Thomsen Disease | 160800 | CLC1 | Z25884 | 7.9e-31 | GEF1 | Z23117 | Voltage-gated chloride channel
Wilms Tumor | 194070 | WT1 | X51630 | 1.1e-20 | FZF1 | X67787 | Sulphite resistance protein
Achondroplasia | 100800 | FGFR3 | M58051 | 2.0e-18 | IPL1 | U07163 | Serine/threonine protein kinase
Menkes Syndrome | 309400 | MNK | X69208 | 2.1e-17 | CCC2 | L36317 | Probable copper transporter

Major Application II: Finding Homologues (cont.)
Cross-Referencing, one thing to another thing
Sequence Comparison and Scoring
Analogous Problems for Structure Comparison
Comparison has two parts:
(1) Optimally Aligning 2 entities to get a Comparison Score
(2) Assessing Significance of this score in a given Context
Integrated Presentation
◊ Align Sequences
◊ Align Structures
◊ Score in a Uniform Framework

Major Application III: Overall Genome Characterization
Overall Occurrence of a Certain Feature in the Genome
◊ e.g. how many kinases in Yeast
Compare Organisms and Tissues
◊ Expression levels in Cancerous vs Normal Tissues
Databases, Statistics
(Clock figures, yeast v. Synechocystis, adapted from GeneQuiz Web Page, Sander Group, EBI)

Simplifying Genomes with Folds, Pathways, &c
[Figure labels: ~1000 folds; ~100000 genes (human); ~1000 genes (T. pallidum)]

At What Structural Resolution Are Organisms Different?
[Figure: length scale with ticks at 1 Å, 10 Å, 100 Å and 1 m, labelled: individual atom (C, H, O...); helix, strand; super-secondary structure (ββ, TM-TM, αβαβ, ααα); protein fold (Ig); plant; person. Inset panels labelled (human) and (T. pallidum).]
Practical Relevance
Drug (Pathogen only folds as possible targets)