D2DBT9 - Genetic Analysis and Bioinformatics

Bioinformatics of Proteins in One and Three Dimensions


Dr. Jaume Bacardit

jaume.bacardit@nottingham.ac.uk


Learning outcomes


To gain practical experience in using protein-related web-based biological databases and extracting information from them


To gain practical experience in using public web-based protein structure prediction services


To have basic knowledge of how to use protein visualisation tools


To have basic practical experience of performing homology modelling

Protein we are going to use today…


In most examples we are going to use the AXR4 protein from Arabidopsis thaliana


MAIITEEEEDPKTLNPPKNKPKDSDFTKSESTMKNPKPQSQNPFPFWFYFTVVVSLATII
FISLSLFSSQNDPRSWFLSLPPALRQHYSNGRTIKVQVNSNESPIEVFVAESGSIHTETV
VIVHGLGLSSFAFKEMIQSLGSKGIHSVAIDLPGNGFSDKSMVVIGGDREIGFVARVKEV
YGLIQEKGVFWAFDQMIETGDLPYEEIIKLQNSKRRSFKAIELGSEETARVLGQVIDTLG
LAPVHLVLHDSALGLASNWVSENWQSVRSVTLIDSSISPALPLWVLNVPGIREILLAFSF
GFEKLVSFRCSKEMTLSDIDAHRILLKGRNGREAVVASLNKLNHSFDIAQWGNSDGINGI
PMQVIWSSEASKEWSDEGQRVAKALPKAKFVTHSGSRWPQESKSGELADYISEFVSLLPK
SIRRVAEEPIPEEVQKVLEEAKAGDDHDHHHGHGHAHAGYSDAYGLGEEWTTT

Biological databases


UniProt


NCBI Entrez


Pfam

UniProt


UniProt is a collaboration between the
European Bioinformatics Institute (EBI), the
Swiss Institute of Bioinformatics (SIB) and the
Protein Information Resource (PIR).


The main protein database


http://www.uniprot.org/


Querying UniProt with a protein name: AXR4


Uniprot ID
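The same query can also be made from a script rather than through the web form. The sketch below only builds the search URL; it assumes UniProt's current REST service at rest.uniprot.org and its `gene:` and `organism_id:` query fields, none of which appear in these slides, and the function name `uniprot_search_url` is made up for illustration.

```python
from urllib.parse import urlencode

# Assumption: search endpoint of UniProt's REST service (not part of the slides).
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def uniprot_search_url(gene, taxon_id=None, fmt="fasta", size=5):
    """Build a search URL for a gene name, optionally limited to one organism."""
    query = f"gene:{gene}"
    if taxon_id is not None:
        query += f" AND organism_id:{taxon_id}"
    params = {"query": query, "format": fmt, "size": size}
    return UNIPROT_SEARCH + "?" + urlencode(params)


# 3702 is the NCBI taxonomy identifier of Arabidopsis thaliana.
url = uniprot_search_url("AXR4", taxon_id=3702)
print(url)
```

The printed URL can be pasted into a browser, or fetched with `urllib.request.urlopen` if the machine has network access.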

Included in the AXR4 page….


Annotation of protein


Function, location, specificity, disruption phenotype


Gene Ontology


Sequence


Transmembrane potential


Bibliographic references


Cross-references to other databases


GenBank, PIR, KEGG, TAIR (Arabidopsis-specific)

Scrolling down through the AXR4 page….


If we click here….

Blasting the AXR4 sequence…


Returns these results


Now we select the closest homologs and press Align

Aligning the homologs (ClustalW)


ClustalW also generates phylogenetic trees


UniProt is not the only place with protein information….


The NCBI’s Entrez system returns this for AXR4

Pfam: sequence-based detection of protein families


Pfam returns three possible sequence
motifs (but no significant results)


Protein Data Bank (PDB)


Put your PDB ID Here

Each protein in PDB is identified by a 4-character code

Entry 2p31

Let’s click on Display PDB file

PDB file for 2p31

Sequence

Atomic coordinates of the amino acids
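The atomic coordinates sit in fixed-width ATOM records, so they can be read by slicing each line at fixed columns. The following is a minimal sketch assuming the standard PDB column layout; the names `Atom`, `parse_atom_line` and `ca_atoms` are hypothetical, and the example record is made up rather than taken from 2p31.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line):
    """Parse one fixed-width ATOM/HETATM record; return None for other records."""
    if not line.startswith(("ATOM", "HETATM")):
        return None
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain=line[21],                # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


def ca_atoms(lines, chain):
    """Return the alpha-carbon atoms of one chain."""
    atoms = (parse_atom_line(line) for line in lines)
    return [a for a in atoms if a and a.chain == chain and a.name == "CA"]


# Made-up record for illustration only, built with the same column widths.
example = "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f" % (
    2, " CA", "MET", "A", 1, 11.5, 22.25, -3.0)
print(parse_atom_line(example))
```

Passing the lines of a downloaded PDB file to `ca_atoms(lines, "A")` would give one entry per residue of chain A that has an alpha carbon.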

Other Biological Databases


Prediction sites


Secondary Structure Prediction


Prediction of structural aspects of residues


Tertiary structure prediction


Transmembrane prediction


Functional sites prediction



These servers perform very complex calculations


They sometimes take a day or two (or more) to reply


Generally users are notified by email when the results
are ready


PSIPRED: Secondary Structure Prediction


Results of PSIPRED…


3D structure prediction


3D Jury is a meta-server for 3D protein structure prediction (PSP)

Results of 3D-Jury


Good source of templates

Results of 3D Jury (scrolling to the right)


LOMETS


The quick server from the Zhang group


Zhang’s I-TASSER is the best publicly available PSP server


Unfortunately it is very overloaded (for AXR4 it took 8 days to return a model)


LOMETS performs fold recognition using several locally installed programs


It generates homology models from the alignments obtained in the fold recognition (FR) process


Another good source of distant templates

LOMETS results


mGenTHREADER Prediction results


More templates !!

Other 3D PSP servers


FUGUE


3D-JIGSAW


HHpred


SAM-T08


ROBETTA (David Baker’s server; heavily overloaded too)



Results of CASP8 (to see how these servers perform)

Infobiotic.net PSP server


Created here in Nottingham


It predicts a broad variety of structural aspects of residues

Results from the Infobiotic.net server


Firestar: Functional sites prediction


TMHMM: Transmembrane prediction


PyMOL


One of the best protein visualisation tools


Free for educational use


You can ask for a license at http://www.pymol.org/educational.html


I have a license, so if you would like to use it on your personal computers, you can download it from http://www.cs.nott.ac.uk/~jqb/pymol-1_1edu1-bin-win32.zip


I also have the Linux and MacOS versions


Please, do not distribute it



Let’s download 2p31 and open it from PyMOL

Controls are at the top right of the screen




A control (all) affects everything loaded into PyMOL


Also, you can control each loaded
protein/selection individually. Right now there
is only one protein (2p31)


Five types of controls:


Actions, Show, Hide, Label and Colour

To change to a cartoon visualisation…


2p31 → Hide → Everything


2p31 → Show → Cartoon


2p31 → Colour → Spectrum → Rainbow


Now click on the middle of the screen, drag
the mouse and this is what you obtain….


Visualising only chain A


As we saw in the PDB web site, this protein has two
chains


To visualise only one of them, we have to create a
selection


You have to type this at the pymol prompt:


PyMOL>select chainA, 2p31 and chain A



chainA is the label of the selection


Everything after the comma is the definition of the
selection


We can select chains, residues and even atoms


Type “help selection” to see all possible options

Visualising only chain A


All → Hide → Everything


chainA → Show → Cartoon


chainA → Colour → Spectrum → Rainbow


chainA → Action → Zoom


Showing the protein surface


chainA → Show → Surface


Type this: set transparency=0.5
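The menu clicks used for chain A can also be typed at the PyMOL prompt, which makes a view easy to reproduce. The sketch below generates the command lines as plain text; it assumes that the Colour → Spectrum → Rainbow menu entry corresponds roughly to PyMOL's `spectrum count, rainbow` command, and the helper name `chain_view_script` is invented for this example.

```python
def chain_view_script(obj, chain, sel_name, transparency=0.5):
    """Return PyMOL command lines showing one chain as a rainbow cartoon with a surface."""
    return [
        f"select {sel_name}, {obj} and chain {chain}",  # create the selection
        "hide everything, all",                          # All -> Hide -> Everything
        f"show cartoon, {sel_name}",                     # Show -> Cartoon
        f"spectrum count, rainbow, {sel_name}",          # assumed menu equivalent
        f"zoom {sel_name}",                              # Action -> Zoom
        f"show surface, {sel_name}",                     # Show -> Surface
        f"set transparency, {transparency}",
    ]


script = chain_view_script("2p31", "A", "chainA")
print("\n".join(script))
```

The printed lines can be typed one by one at the prompt, or saved to a text file and run from inside PyMOL.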

Simple Homology Modelling


We are going to use
Modeller


Free for academic use


http://salilab.org/modeller/9v6/modeller9v6.exe


Licence key: MODELIRANJE


1st step: Installing it.


When choosing the destination path, choose c:\temp (in B08/B09)


Modeller is a very sophisticated tool where you can control almost any aspect of the homology modelling process


Here we are only going to use the simplest options

Chain we are going to model

ENLYFQSMINSFYAFEVKDAKGRTVSLEKYKGK
VSLVVNVASDCQLTDRNYLGLKELHKEFGPSHF
SVLAFPCNQFGESEPRPSKEVESFARKNYGVTF
PIFHKIKILGSEGEPAFRFLVDSSKKEPRWNFWK
YLVNPEGQVVKFWRPEEPIEVIRPDIAALVRQVII
KKKEDL

T0388 LOC493869A, Homo sapiens

CASP target ID

1st step: BLAST against PDB


Selecting the template


The perfect match exists, because right now the structure for this target is already public


We are going to ignore it, and use chain A of protein 2p31 instead

2nd step: Creating an alignment


Modeller has a sophisticated alignment tool


Uses structural information from the template


It uses dynamic programming instead of the approximate method of BLAST


To create the alignment you need to:

1. Download the PDB file of the template

2. Put your sequence in PIR format (example)

3. Edit the alignment script to set the template and chain

4. Call Modeller: mod9v6.exe align.py

PIR file


Just replace the sequence with your own one


The last line in the sequence needs to end in *


Do not touch anything else from the file, or the alignment script will not work

>P1;target
sequence:target:::::::0.00: 0.00
ENLYFQSMINSFYAFEVKDAKGRTVSLEKYKGKVSLVVNVASDCQLTDRNYLGLKELHKE
FGPSHFSVLAFPCNQFGESEPRPSKEVESFARKNYGVTFPIFHKIKILGSEGEPAFRFLV
DSSKKEPRWNFWKYLVNPEGQVVKFWRPEEPIEVIRPDIAALVRQVIIKKKEDL*
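If you prefer not to edit the PIR file by hand, the same layout can be produced from a raw sequence. This is a minimal sketch that copies the two header lines of the example above and wraps the sequence at 60 columns; the function name `to_pir` is an assumption made for this example, not part of Modeller.

```python
def to_pir(sequence, code="target", width=60):
    """Format a protein sequence as a PIR entry like the example file."""
    seq = "".join(sequence.split()).upper()  # drop spaces and line breaks
    seq += "*"                               # the sequence must end in '*'
    body = [seq[i:i + width] for i in range(0, len(seq), width)]
    header = [f">P1;{code}", f"sequence:{code}:::::::0.00: 0.00"]
    return "\n".join(header + body) + "\n"


# Short made-up fragment, only to show the output layout.
print(to_pir("ENLYFQSM INSFYAF"))
```

Writing the returned text to `target.ali` gives a file with the name the alignment script expects.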

Align.py

from modeller import *
from modeller.automodel import *

env = environ()
aln = alignment(env)

# Just change the values of these 2 lines to match your template
template='2p31'
chain='A'

tc=template+chain

# Read only the chosen chain of the template structure
mdl = model(env, file=template, model_segment=('FIRST:'+chain,'LAST:'+chain))
aln.append_model(mdl, align_codes=tc, atom_files=template+'.pdb')
# Add the target sequence from the PIR file
aln.append(file='target.ali', align_codes='target')
# Align the target to the template using structural information
aln.align2d()
# Write the alignment in PIR and PAP formats
aln.write(file='target-'+tc+'.ali', alignment_format='PIR')
aln.write(file='target-'+tc+'.pap', alignment_format='PAP')

Results of the alignment


Alignment is different from that produced by BLAST


Modeller has ignored the residues lacking structural
information


_aln.pos          10        20        30        40        50        60
2p31A     -----Q----DFYDFKAVNIRGKLVSLEKYRGSVSLVVNVASECGFTDQHYRALQQLQRDLGPHHFNV
target    ENLYFQSMINSFYAFEVKDAKGRTVSLEKYKGKVSLVVNVASDCQLTDRNYLGLKELHKEFGPSHFSV
_consrvd       *     ** *      *  ****** * ********* *  **  *  *  *    ** ** *

_aln.pos  70        80        90       100       110       120       130
2p31A     LAFPCNQFGQQEPDSNKEIESFARRTYSVSFPMFSKIAVTGTGAHPAFKYLAQTSGKEPTWNFWKYLV
target    LAFPCNQFGESEPRPSKEVESFARKNYGVTFPIFHKIKILGSEGEPAFRFLVDSSKKEPRWNFWKYLV
_consrvd  *********  **   ** *****  * * ** * **   *    ***  *   * *** ********

_aln.pos   140       150       160       170
2p31A     APDGKVVGAWDPTVSVEEVRPQITALVR----------
target    NPEGQVVKFWRPEEPIEVIRPDIAALVRQVIIKKKEDL
_consrvd   * * **  * *    *  ** * ****
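The _consrvd line marks the columns where template and target carry the same residue, so it can be recomputed from the two aligned rows. The sketch below does that and also reports a simple percent identity; the function names and the choice to count identity over columns without gaps are assumptions for illustration, not Modeller's own definition.

```python
def conserved_line(seq_a, seq_b, gap="-"):
    """Return a string with '*' under every column where both residues are identical."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        "*" if a == b and a != gap else " " for a, b in zip(seq_a, seq_b)
    )


def percent_identity(seq_a, seq_b, gap="-"):
    """Identical columns as a percentage of the columns without gaps."""
    identical = conserved_line(seq_a, seq_b, gap).count("*")
    aligned = sum(1 for a, b in zip(seq_a, seq_b) if a != gap and b != gap)
    return 100.0 * identical / aligned if aligned else 0.0


# Last block of the alignment above (positions 137 to 174).
tail_template = "APDGKVVGAWDPTVSVEEVRPQITALVR" + "-" * 10
tail_target = "NPEGQVVKFWRPEEPIEVIRPDIAALVRQVIIKKKEDL"
print(conserved_line(tail_template, tail_target))
print(percent_identity(tail_template, tail_target))
```

Running it on the full aligned rows instead of one block gives the identity over the whole template chain.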

Creating the model


5 models are created


Each of them can be slightly different


Models are going to be assessed using 2 different criteria

from modeller import *
from modeller.automodel import *

log.verbose()
env = environ()

template='2p31'
chain='A'

tc=template+chain

class MyModel(automodel):
    # Name the output files target_1.pdb, target_2.pdb, ...
    def get_model_filename(self, sequence, id1, id2, file_ext):
        return sequence+'_'+str(id2)+file_ext

    # No extra restraints are added at this stage
    def special_restraints(self, aln):
        rsr = self.restraints

a = MyModel(env, alnfile='target-'+tc+'.ali',
            knowns=tc, sequence='target',
            assess_methods=(assess.DOPE, assess.GA341))
a.starting_model = 1
a.ending_model = 5
a.make()


Results of the modelling


According to the DOPE score, 3 is the best model and 2 the worst


The lower the DOPE score, the better


Let’s see how different the models are

>> Summary of successfully produced models:
Filename           molpdf      DOPE score    GA341 score
----------------------------------------------------------------------
target_1.pdb   1280.53101    -19077.32812        1.00000
target_2.pdb   1570.33606    -18480.83008        1.00000
target_3.pdb    960.32550    -19365.79102        1.00000
target_4.pdb   1415.41724    -18980.71094        1.00000
target_5.pdb   1463.82593    -19077.91016        1.00000
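Picking the best and worst model by eye works for five models, but the ranking is simple to automate. The sketch below parses the summary rows shown above and ranks them by DOPE score (lower is better); the parsing rule, four whitespace-separated fields starting with a .pdb file name, is an assumption based on this output only.

```python
SUMMARY = """\
target_1.pdb   1280.53101    -19077.32812        1.00000
target_2.pdb   1570.33606    -18480.83008        1.00000
target_3.pdb    960.32550    -19365.79102        1.00000
target_4.pdb   1415.41724    -18980.71094        1.00000
target_5.pdb   1463.82593    -19077.91016        1.00000
"""


def parse_summary(text):
    """Return (filename, molpdf, dope, ga341) tuples from the summary rows."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0].endswith(".pdb"):
            name, molpdf, dope, ga341 = parts
            rows.append((name, float(molpdf), float(dope), float(ga341)))
    return rows


rows = parse_summary(SUMMARY)
best = min(rows, key=lambda r: r[2])   # lowest DOPE score
worst = max(rows, key=lambda r: r[2])  # highest DOPE score
print("best:", best[0], "worst:", worst[0])
```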

Viewing the two models from pymol

1. Open model 3 as usual

2. But then, instead of double-clicking model 2, open it from inside PyMOL using File → Open

3. The models are not aligned. Type: align target_3,target_2


The only differences are in the two ends of the
chain

So how does the model compare to the real protein 3CYN?


The residues at both ends of the chain are
wrong

Can we do any better?


We can give modeller information about the
secondary structure of the target


We can get these predictions from PSIPRED





Then, the modelling script needs to be modified

CCCCCCCCCCCEEEEEEECCCCCEECHHHHCCCEEEEEECC
CCCCCCHHHHHHHHHHHHHHCCCCEEEEEEECCCCCCCCC
CCHHHHHHHHHHCCCCCHHEEEEEECCCCCCCHHHHHHHH
CCCCCCCCCCEEEEECCCCCEEEEECCCCCHHHHHHHHHHH
HHHHHHHHHCCC
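Turning the PSIPRED string into residue ranges by hand is error-prone, so it can help to let a script find the runs of H (helix) and E (strand). The following sketch, with invented helper names, lists each run with 1-based residue numbers and prints restraint lines in the style used by the modelling script; it assumes the prediction string has one character per target residue.

```python
from itertools import groupby


def ss_segments(ss, keep="HE"):
    """Return (state, first, last) for each run of H or E, numbering residues from 1."""
    segments = []
    pos = 1
    for state, run in groupby(ss):
        length = len(list(run))
        if state in keep:
            segments.append((state, pos, pos + length - 1))
        pos += length
    return segments


def restraint_lines(segments):
    """Format the segments as Modeller secondary structure restraint calls."""
    kind = {"H": "alpha", "E": "strand"}
    return [
        f"rsr.add(secondary_structure.{kind[s]}"
        f"(self.residue_range('{first}:', '{last}:')))"
        for s, first, last in segments
    ]


# PSIPRED prediction from the slide, joined into one string.
psipred = (
    "CCCCCCCCCCCEEEEEEECCCCCEECHHHHCCCEEEEEECC"
    "CCCCCCHHHHHHHHHHHHHHCCCCEEEEEEECCCCCCCCC"
    "CCHHHHHHHHHHCCCCCHHEEEEEECCCCCCCHHHHHHHH"
    "CCCCCCCCCCEEEEECCCCCEEEEECCCCCHHHHHHHHHHH"
    "HHHHHHHHHCCC"
)
for line in restraint_lines(ss_segments(psipred)):
    print(line)
```

The printed lines can be pasted into `special_restraints`, after checking them against the prediction.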

from modeller import *
from modeller.automodel import *

log.verbose()
env = environ()

template='2p31'
chain='A'
tc=template+chain

class MyModel(automodel):
    # Name the output files target_1.pdb, target_2.pdb, ...
    def get_model_filename(self, sequence, id1, id2, file_ext):
        return sequence+'_'+str(id2)+file_ext

    def special_restraints(self, aln):
        rsr = self.restraints
        # Predicted secondary structure (SS) information from PSIPRED
        rsr.add(secondary_structure.strand(self.residue_range('12:', '18:')))
        rsr.add(secondary_structure.strand(self.residue_range('24:', '25:')))
        rsr.add(secondary_structure.alpha(self.residue_range('27:','30:')))
        rsr.add(secondary_structure.strand(self.residue_range('34:', '39:')))
        rsr.add(secondary_structure.alpha(self.residue_range('48:','61:')))
        rsr.add(secondary_structure.strand(self.residue_range('66:', '72:')))
        rsr.add(secondary_structure.alpha(self.residue_range('84:','93:')))
        rsr.add(secondary_structure.alpha(self.residue_range('99:','100:')))
        rsr.add(secondary_structure.strand(self.residue_range('101:', '106:')))
        rsr.add(secondary_structure.alpha(self.residue_range('114:','121:')))
        rsr.add(secondary_structure.strand(self.residue_range('132:', '136:')))
        rsr.add(secondary_structure.strand(self.residue_range('142:', '146:')))
        rsr.add(secondary_structure.alpha(self.residue_range('152:','171:')))

a = MyModel(env, alnfile='target-'+tc+'.ali',
            knowns=tc, sequence='target',
            assess_methods=(assess.DOPE, assess.GA341))
a.starting_model = 1
a.ending_model = 5
a.make()

And here is the new model, compared to the real protein


Now at least we got one end of the protein right

Coursework


Each of you will be given a different protein
sequence


You need to tell me everything you can from
the sequence

Things to do

1. Identify the protein that the sequence comes from

2. Fetch information about the protein from all the databases that contain it


The protein itself


The gene that the protein comes from


Function (and functional sites) of the protein


Protein families


Structural classification, secondary structure


Protein interactions


Etc.


Things to do

3. Query prediction servers for


Functional sites


Secondary structure


Transmembrane segments

4. And compare these predictions to the actual structure/function

5. Do basic homology modelling, as shown in this practical


Do not use the real structure of the protein as the template



I don’t expect you to produce a perfect model


What I am interested in is that you carry out the modelling process (and show evidence of it)


Submission details


Write a report with all the information that you have collected


Be exhaustive but not verbose. Just put relevant information


Include the PDB file of the model plus images generated using
pymol and a comparison with the real protein


On the cover page include your name, the target number of the sequence that you have been assigned, and the sequence


You have to submit the coursework through
http://www.submit.ac.uk/


Class ID: 117409


Enrollment password: D2DBT9


Submission deadline:
May 22nd


Questions about the
coursework?

Summary of the session


We have seen how to query the most important protein-related biological databases and prediction servers


We have learnt to use the PyMOL protein visualisation software


We have employed Modeller to perform some basic homology modelling