Substrate predictions will be made by two methods: (i) a machine learning algorithm called a support vector machine (SVM) trained on known positives and negatives and (ii) a newly developed protease-peptide docking algorithm (Aim 4). The composition of the SVM training set will depend upon whether or not there are known macromolecular substrates for a given protease. In collaboration with Andrej Sali’s group at UCSF, we have developed an SVM method to train on both the primary sequence and structural attributes of cleavage sites in macromolecular substrates of GrB [17]. This algorithm has successfully identified several novel granzyme substrates that have been validated experimentally in lysates. The method can be applied to the other proteases in this proposal. In the case of an orphaned protease, the peptides identified as positives and negatives in our libraries will be used to train the SVM on primary sequences only. The structures will serve as the critical starting point for protease-peptide docking, while the peptides identified in our libraries will be used to benchmark the docking method. Finally, high confidence predictions will be validated experimentally and made available to the protease community. In summary, we hypothesize that the structure determination and functional profiling outlined in this proposal will result in rapid de-orphaning of proteases by enabling computational methods to accurately identify macromolecular substrates and thus facilitate efficient experimental follow-up.

Aim 4

Develop substrate identification algorithms using structure-function relationships developed in Aims 2 and 3

4a) Support Vector Machine

We have developed a method called “SVM Structure” that systematically uses the primary sequence and structural attributes of cleavage sites in macromolecular substrates to predict protease substrates by training a machine learning algorithm called a support vector machine (SVM) [17]. Structural attributes are evaluated by using solved structures in the PDB, high quality homology models, and sequence-based secondary structure and disorder prediction when neither a solved structure nor a homology model is available. Receiver operating characteristic (ROC) plots were generated to assess the ability of “SVM Structure” and two previous approaches [23, 24] to distinguish between positives and negatives (Figure 3). In a ROC curve, the true positive rate (TPR) is plotted against the false positive rate (FPR) at different scoring thresholds. A perfect predictor is a line from (0,0) to (0,1), followed by one to (1,1); a random predictor results in the diagonal from (0,0) to (1,1). The critical point of the ROC is where each curve intersects the line from (1,0) to (0,1); at this point the false positive rate equals the false negative rate, and it was used to compare the performance of different methods.
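
The ROC construction and its critical point can be illustrated with a minimal sketch. This is not the code used to produce Figure 3; the function names `roc_points` and `critical_point`, the toy scores, and the convention that higher scores indicate predicted substrates are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (FPR, TPR)


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Point]:
    """Sweep the score threshold from high to low and collect (FPR, TPR) points."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points: List[Point] = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Peptides with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def critical_point(points: Sequence[Point]) -> Point:
    """Return where the ROC curve crosses the line from (1,0) to (0,1)."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        g0 = x0 + y0 - 1.0  # negative while the curve is below the anti-diagonal
        g1 = x1 + y1 - 1.0
        if g0 <= 0.0 <= g1:
            t = -g0 / (g1 - g0)  # linear interpolation along the segment
            return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    raise ValueError("curve does not cross the anti-diagonal")


# Hypothetical scores for four peptides (1 = known substrate, 0 = negative).
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
fpr, tpr = critical_point(roc_points(toy_scores, toy_labels))
print(f"critical point: FPR={fpr:.2f}, TPR={tpr:.2f}")
```

At the returned point TPR = 1 - FPR, so the false negative rate equals the false positive rate; a method whose critical point lies closer to (0,1) performs better by this measure.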

Our SVM Structure method outperformed an SVM trained only on sequence features [24] and another method based on position specific scoring matrices (PSSMs) [23]. The “SVM Sequence” method nevertheless did well to discriminate between positives and negatives. These results have two important implications for the interplay between the experimentally obtained structure-function data and the substrate identification algorithms: (1) experimental identification of macromolecular substrates will enhance the predictive power of an SVM by incorporating the structural features of cleavage sites into the training set and (2) the minimum requirement for substrate identification by SVM is data obtained with PLUSS in Aim 2b. Prior knowledge of experimentally identified macromolecular substrates is not an absolute necessity.

Figure 3: Benchmark Results. Results of applying different methods to a GrB dataset that includes all known substrates. SVM (Structure) is our novel method; SVM (Sequence) was taken from a previous study that trained on cleavage sequence residue type only; PSSM implemented the GraBCas method for GrB substrates. True positive rate is abbreviated as TPR and is indicative of coverage. False positive rate is abbreviated as FPR and is indicative of accuracy.

4b) Protease-peptide docking

The application of computational docking to substrate prediction requires a method that can evaluate on the order of 10^6 peptides within the flexible binding cleft of a protease active site. Comparing the likelihood of binding across peptides necessitates the selection of an appropriate conformation scoring function, conformation sampling protocol, and peptide ranking metric. The scoring function will quantify the favorability of having a peptide in a particular conformation within a protease active site. We hypothesize that a scoring function that can accurately capture the orientation-dependent character of hydrogen bonds will be of significant advantage. The sampling protocol determines what conformations are evaluated for a particular peptide. It is not computationally feasible to sample and score all possible conformations; therefore a robust sampling protocol will explore enough of the conformational landscape to ensure that the best scoring conformation has been identified. We propose to sample different conformations of both peptide and sidechains in the protease active site, in an attempt to accurately capture instances of sub-site cooperativity. Finally, the ranking metric will facilitate “apples to apples” comparisons across peptides. The top ranking peptides above a certain threshold will be predicted as substrates.

To decrease the computational complexity of proteome-wide predictions, we will only consider octapeptides that fulfill the strict P1 requirement of granzymes and TTSPs. For example, GrB has an absolute requirement for Asp at P1, which restricts proteome-wide evaluation to approximately 800,000 human octapeptides with Asp in the fourth position [2, 17].
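
The P1 filter can be illustrated with a short sketch that slides an eight-residue window (P4 to P4') along a protein sequence and keeps only windows with the required residue in the fourth position. The function name `candidate_octapeptides` and the toy sequence are hypothetical; the proposal does not specify an implementation.

```python
from typing import Iterator, Tuple

WINDOW = 8    # octapeptide spanning P4-P3-P2-P1-P1'-P2'-P3'-P4'
P1_INDEX = 3  # zero-based index of P1, i.e. the fourth position


def candidate_octapeptides(sequence: str, p1_residue: str = "D") -> Iterator[Tuple[int, str]]:
    """Yield (start, octapeptide) for every window whose P1 matches p1_residue."""
    seq = sequence.upper()
    for start in range(len(seq) - WINDOW + 1):
        window = seq[start:start + WINDOW]
        if window[P1_INDEX] == p1_residue:
            yield start, window


# Toy sequence (not a real protein) containing a single Asp.
for start, peptide in candidate_octapeptides("AAAIEPDSGGGA"):
    print(start, peptide)
```

Applying the same filter to every protein in a proteome gives the reduced set of candidates to dock; for a protease with a different P1 requirement only `p1_residue` changes.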

To build our scoring function, we will derive and test a variety of atomic and residue statistical potentials [25-27] that are specialized for protein-peptide interactions, building on our experience with deriving statistical potentials for modeling individual protein structures and interfaces between them [28-35]. In particular, we will develop a statistical potential for relative orientations between two pairs of covalently-bound atoms (defined by a distance, two angles, and a dihedral angle, thus capable of explicitly describing the orientation-dependent character of hydrogen bonds). This atomic orientation-dependent statistical potential will be derived from a variety of samples, including protein-peptide complexes of known structure. Different reference states, atom pair type definitions, and sparse dataset treatments will be tested. Finally, we will also explore combining our statistical potentials with a number of scoring functions described by others, including physics-based terms such as de-solvation and entropy changes [36-40], using approaches we previously developed for multivariate model assessment [29, 30].
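
The four geometric parameters that define the relative orientation of two covalently bound atom pairs (a distance, two angles, and a dihedral angle) can be computed as in the sketch below. The atom naming (`a_root`-`a_tip` and `b_tip`-`b_root`, with the distance measured between the two tips), the sign convention of the dihedral, and all function names are assumptions for illustration; the statistical potential itself would be derived from the distribution of these parameters in known structures and is not shown.

```python
import math
from typing import NamedTuple, Tuple

Vec = Tuple[float, float, float]


def _sub(u: Vec, v: Vec) -> Vec:
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])


def _dot(u: Vec, v: Vec) -> float:
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def _cross(u: Vec, v: Vec) -> Vec:
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def _norm(u: Vec) -> float:
    return math.sqrt(_dot(u, u))


def angle_deg(a: Vec, b: Vec, c: Vec) -> float:
    """Angle a-b-c at vertex b, in degrees."""
    u, v = _sub(a, b), _sub(c, b)
    cos_t = _dot(u, v) / (_norm(u) * _norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def dihedral_deg(p0: Vec, p1: Vec, p2: Vec, p3: Vec) -> float:
    """Signed dihedral angle p0-p1-p2-p3 in degrees (atan2 formulation)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    return math.degrees(math.atan2(_norm(b2) * _dot(b1, n2), _dot(n1, n2)))


class Orientation(NamedTuple):
    distance: float
    angle_a: float
    angle_b: float
    dihedral: float


def relative_orientation(a_root: Vec, a_tip: Vec, b_tip: Vec, b_root: Vec) -> Orientation:
    """Describe bonded pair (a_root, a_tip) relative to bonded pair (b_tip, b_root)."""
    return Orientation(
        distance=_norm(_sub(b_tip, a_tip)),
        angle_a=angle_deg(a_root, a_tip, b_tip),
        angle_b=angle_deg(a_tip, b_tip, b_root),
        dihedral=dihedral_deg(a_root, a_tip, b_tip, b_root),
    )
```

For a hydrogen bond, the pairs could be taken as donor N-H and acceptor O=C, so that the distance is H...O and the two angles are N-H...O and H...O=C; this mapping is one possible atom pair type definition among those to be tested.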

The sampling protocol will simultaneously optimize both the peptide and protease binding site degrees of freedom (Cartesian coordinates of all atoms). We will explore a “divide-and-conquer” approach that automatically partitions the complete set of degrees of freedom into subsets of degrees of freedom, followed by combining best scoring solutions for subsets into a complete solution. The starting point is provided by our new inferential optimization implemented in the DOMINO module of Integrative Modeling Platform (IMP; http://salilab.org/imp/) [41]. This approach generalizes several other common means of improving the sampling of complex scoring functions, such as dividing a system into rigid (e.g., protein backbone) and flexible parts (e.g., protein sidechains), the “pre-calculation” of ligand conformations for independent docking into a rigid ligand-binding site, and the pre-calculation of ligand binding sites for independent docking of a flexible ligand. In contrast, for a given configuration, inferential optimization by DOMINO derives these subsets automatically from the information we have about the system, as encoded in the scoring function; for example, in DOMINO, subsets of independently optimized degrees of freedom may in fact share common variables (unlike separate rigid bodies).
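
The idea of optimizing subsets that share a variable and then combining their best solutions can be shown on a toy problem. The sketch below does not use IMP or the DOMINO API; the three discrete variables, the two score terms `f12` and `f23`, and the convention that lower scores are better are all invented for illustration. Because the total score is a sum of a term on {x1, x2} and a term on {x2, x3}, each subset can be optimized independently for every value of the shared variable x2.

```python
from itertools import product
from typing import Tuple

STATES = (0, 1, 2)  # hypothetical discrete states for each degree of freedom


def f12(x1: int, x2: int) -> int:
    """Toy score term coupling x1 and x2 (lower is better)."""
    return (x1 - 1) ** 2 + abs(x1 - x2)


def f23(x2: int, x3: int) -> int:
    """Toy score term coupling x2 and x3 (lower is better)."""
    return (x3 - 2) ** 2 + abs(x2 - x3)


def total_score(x1: int, x2: int, x3: int) -> int:
    return f12(x1, x2) + f23(x2, x3)


def brute_force() -> Tuple[int, Tuple[int, int, int]]:
    """Enumerate all 27 joint assignments."""
    best = min(product(STATES, repeat=3), key=lambda s: total_score(*s))
    return total_score(*best), best


def divide_and_conquer() -> Tuple[int, Tuple[int, int, int]]:
    """Optimize subsets {x1, x2} and {x2, x3} separately, joined on shared x2."""
    best = None
    for x2 in STATES:
        x1 = min(STATES, key=lambda v: f12(v, x2))  # best x1 given x2
        x3 = min(STATES, key=lambda v: f23(x2, v))  # best x3 given x2
        candidate = (total_score(x1, x2, x3), (x1, x2, x3))
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best


print(brute_force()[0], divide_and_conquer()[0])
```

The partitioned search evaluates far fewer combinations than full enumeration while reaching the same best score, which is the property that makes this kind of decomposition attractive when the number of degrees of freedom is large.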

The predicted protease-peptide conformation will be the best scoring conformation for a given peptide and protease. For substrate identification, all possible peptides will be docked into a given protease binding cleft and ranked by their “normalized” scores (e.g., corresponding to Z-scores for each peptide, based on the difference between the scores for its optimal and suboptimal conformations). The top ranking peptides above a certain threshold will be predicted as protease substrates. The method will be benchmarked to quantify its accuracy. The data from the PLUSS, including the identification of nulls, will be critical for this assessment. The protocols will be incorporated into IMP for free distribution to others (http://salilab.org/imp/).
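
One possible form of the normalized ranking is sketched below. The exact normalization is not fixed in this proposal; here we assume that lower docking scores are more favorable and define the Z-score as the gap between the best score and the mean of the suboptimal scores, divided by the standard deviation of the suboptimal scores. The function names, peptides, scores, and threshold are hypothetical.

```python
import statistics
from typing import Dict, List, Sequence, Tuple


def conformation_z_score(scores: Sequence[float]) -> float:
    """Z-score of the best (lowest) score relative to the suboptimal conformations."""
    ordered = sorted(scores)
    best, suboptimal = ordered[0], ordered[1:]
    if len(suboptimal) < 2:
        raise ValueError("need at least two suboptimal conformations")
    spread = statistics.stdev(suboptimal)
    if spread == 0.0:
        raise ValueError("suboptimal scores have zero spread")
    return (statistics.mean(suboptimal) - best) / spread


def rank_peptides(docking_scores: Dict[str, Sequence[float]],
                  threshold: float) -> List[Tuple[str, float]]:
    """Rank peptides by Z-score (highest first) and keep those at or above threshold."""
    z_scores = {pep: conformation_z_score(s) for pep, s in docking_scores.items()}
    ranked = sorted(z_scores.items(), key=lambda item: item[1], reverse=True)
    return [(pep, z) for pep, z in ranked if z >= threshold]


# Invented scores for sampled conformations of two peptides.
toy_scores = {
    "IEPDSGGG": [-10.0, -2.0, -1.0, 0.0, -3.0],
    "AAADAAAA": [-3.0, -2.5, -2.0, -1.5, -1.0],
}
for peptide, z in rank_peptides(toy_scores, threshold=3.0):
    print(peptide, round(z, 2))
```

Normalizing each peptide against its own suboptimal conformations lets peptides of different composition be compared on one scale; the threshold would be calibrated against the benchmark data, including the nulls.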

4c) Identify substrates for structure determination that are predicted to possess unusual structural attributes

Dogma suggests that proteases cleave exclusively at solvent accessible loops [42]. However, the Sali and Wells labs have shown in an analysis of solved structures and accurate comparative models of caspase substrates that many of these known cleavage sites are in α-helices and even occasionally on β-strands [43]; this was expanded upon in a more recent study [44]. Analysis of the structural attributes of GrB cleavage sites in macromolecular substrates shows the same tendencies (Figure 4; caspases shown for comparison). Indeed, more than 35% of known cleavage sequences for each substrate fall on a region that has regular secondary structure. In consultation with GNF, we will select experimentally validated substrates (Aim 5) for structure determination that contain cleavage sites predicted (by either a homology model or a sequence-based structure prediction) to possess non-dogmatic structural attributes. These structures will help define determinants of proteolysis.


Figure 4: Structural properties of protease cleavage sequence positives and negatives as assessed by DSSP for substrates where a solved structure or a good quality comparative model was available. Numbers may not add to 100% because some peptides did not have more than four residues in any one of the three secondary structure states. Structural properties were also assessed by the predictive methods PSI-PRED (secondary structure) and Disopred (disorder), which consider the primary amino acid sequence only. Here, negatives are defined as octapeptides in known substrates that are outside of the experimentally identified cleavage site and contain Asp in the fourth position. Therefore, it is possible that some of these negatives are in fact cut by GrB and were missed experimentally. One important advantage of using the PLUSS data from Aim 2b will be the use of experimentally validated negatives to train the SVM.

Aim 5


Validation of substrate predictions and distribution of results to the community

5a) Predicted substrates that are cut in lysates will be functionally validated by RNAi

Proteins that are predicted to be substrates by both algorithms will be given priority for experimental follow-up. Validation of substrates will be carried out in two phases: (1) predicted substrates that are expressed in biologically relevant cell lines and fulfill other criteria such as antibody availability will be assessed for cleavage in lysates and (2) substrates that are cleaved in vitro will be candidates for siRNA mediated knockdown followed by functional characterization in a cell based assay that is linked to the biology of the protease. Initial RNAi work will be conducted in collaboration with the Small Molecule Discovery Center (SMDC) at UCSF. The SMDC has considerable experience conducting high-throughput RNAi screens using a wide range of cell based assays. Should further screening be required, this will be done in collaboration with GNF.

The Craik laboratory is well equipped to perform both of these validation phases for human GrB (Figure 5) and MT-SP1 substrates (Figure 6). We hypothesize that proteolysis of biologically relevant substrates will be conserved across species. Therefore we will focus our efforts on proteins predicted to be substrates of both mouse and human GrB. Using mRNA expression databases, we will determine whether a predicted substrate is expressed in cell lines known to be highly susceptible to granzyme-induced cell death: K562 cells in the case of human GrB and YAC cells in the case of mouse GrB. Other criteria for experimental follow-up will include the availability of a literature validated antibody and evidence showing the predicted substrate plays a role in cell death.

Immunoblotting of lysates treated with varying concentrations of GrB will be performed to validate proteolysis in vitro. Controls with inhibitors will determine if GrB is the causative protease for the observed cleavage event. Predicted substrates shown to be cut by GrB in K562 and YAC lysates will be candidates for siRNA mediated knock-down. We will focus our efforts on human substrates because our lab has developed a flow-cytometry based assay to assess whether knock-down of a GrB substrate modulates the sensitivity of K562 cells to NK-92 induced cell death [3, 5]. We hypothesize that our criteria for evaluation of a predicted substrate in the RNAi screen are sufficiently stringent to identify biologically relevant GrB substrates.

Figure 5: Criteria for selection of granzyme substrates for knockdown by RNAi


Given the role of MT-SP1 in metastasis, the human colon adenocarcinoma cell line, HT-29, will be used to validate predicted substrates. MT-SP1 is highly expressed in these cells [45] and expression of predicted substrates in HT-29 can be determined by consulting mRNA databases. Co-localization criteria will reduce the number of candidates to pursue experimentally by focusing on only those substrates that are either localized to the plasma membrane or secreted. Cells will be treated with a highly specific MT-SP1 inhibitor that was developed in the Craik lab [13] to block proteolysis of all MT-SP1 substrates. Immunoblots of membrane enriched fractions or spent media from both inhibitor treated and untreated cells will validate whether a prediction is cleaved in vitro. These substrates will then be candidates for siRNA-mediated knock-down. A Matrigel invasion assay will be used to quantify the effect of substrate knock-down on the metastatic potential of the HT-29 cell line. Cells treated with the MT-SP1 inhibitor and an MT-SP1 knockdown will serve as positive controls.

5b) Make substrate predictions and informatic methodology publicly available

In addition to distributing predictions to the protease community for validation via the PSI-SG Knowledgebase (http://kb.psi-structuralgenomics.org/), we will make both our SVM model and protease-peptide docking protocol publicly available. The web server for our SVM model is currently under development and will allow users (i) to construct and apply a new SVM based on a user-provided training set, (ii) to benchmark the ability of the SVM to predict interaction partners for a protein of interest, (iii) to use the newly generated SVM to make proteome-wide predictions, and (iv) to make the SVM and its predictions publicly available for use by others. As a result, our approach may become a widely useful hypothesis-generator that can increase the pace of biological discovery by guiding future experiments in a variety of protein-peptide systems.

Figure 6: Criteria for selection of TTSP substrates for knockdown by RNAi

References

1. Bell, J.K. et al. The oligomeric structure of human granzyme A is a determinant of its extended substrate specificity. Nat Struct Biol 10, 527-534 (2003).

2. Harris, J.L., Peterson, E.P., Hudig, D., Thornberry, N.A. & Craik, C.S. Definition and redesign of the extended substrate specificity of granzyme B. J Biol Chem 273, 27364-27373 (1998).

3. Hostetter, D.R., Loeb, C.R., Chu, F. & Craik, C.S. Hip is a pro-survival substrate of granzyme B. J Biol Chem 282, 27865-27874 (2007).

4. Loeb, C.R., Harris, J.L. & Craik, C.S. Granzyme B proteolyzes receptors important to proliferation and survival, tipping the balance toward apoptosis. J Biol Chem 281, 28326-28335 (2006).

5. Mahrus, S. & Craik, C.S. Selective chemical functional probes of granzymes A and B reveal granzyme B is a major effector of natural killer cell-mediated lysis of target cells. Chem Biol 12, 567-577 (2005).

6. Mahrus, S., Kisiel, W. & Craik, C.S. Granzyme M is a regulatory protease that inactivates proteinase inhibitor 9, an endogenous inhibitor of granzyme B. J Biol Chem 279, 54275-54282 (2004).

7. Waugh, S.M., Harris, J.L., Fletterick, R. & Craik, C.S. The structure of the pro-apoptotic protease granzyme B reveals the molecular determinants of its specificity. Nat Struct Biol 7, 762-765 (2000).

8. Bugge, T.H., Antalis, T.M. & Wu, Q. Type II transmembrane serine proteases. J Biol Chem 284, 23177-23181 (2009).

9. Choi, S.Y., Bertram, S., Glowacka, I., Park, Y.W. & Pohlmann, S. Type II transmembrane serine proteases in cancer and viral infections. Trends Mol Med 15, 303-312 (2009).

10. Takeuchi, T. et al. Cellular localization of membrane-type serine protease 1 and identification of protease-activated receptor-2 and single-chain urokinase-type plasminogen activator as substrates. J Biol Chem 275, 26333-26342 (2000).

11. Bhatt, A.S. et al. Coordinate expression and functional profiling identify an extracellular proteolytic signaling pathway. Proc Natl Acad Sci U S A 104, 5771-5776 (2007).

12. Farady, C.J., Sun, J., Darragh, M.R., Miller, S.M. & Craik, C.S. The mechanism of inhibition of antibody-based inhibitors of membrane-type serine protease 1 (MT-SP1). J Mol Biol 369, 1041-1051 (2007).

13. Sun, J., Pons, J. & Craik, C.S. Potent and selective inhibition of membrane-type serine protease 1 by human single-chain antibodies. Biochemistry 42, 892-900 (2003).

14. Saleem, M. et al. A novel biomarker for staging human prostate adenocarcinoma: overexpression of matriptase with concomitant loss of its inhibitor, hepatocyte growth factor activator inhibitor-1. Cancer Epidemiol Biomarkers Prev 15, 217-227 (2006).

15. Vogel, L.K. et al. The ratio of Matriptase/HAI-1 mRNA is higher in colorectal cancer adenomas and carcinomas than corresponding tissue from control individuals. BMC Cancer 6, 176 (2006).

16. Welm, A.L. et al. The macrophage-stimulating protein pathway promotes metastasis in a mouse model for breast cancer and predicts poor prognosis in humans. Proc Natl Acad Sci U S A 104, 7570-7575 (2007).

17. Barkan, D.T. et al. Prediction of protease substrates using sequence and structural features. (manuscript submitted).

18. Takeuchi, T., Shuman, M.A. & Craik, C.S. Reverse biochemistry: use of macromolecular protease inhibitors to dissect complex biological processes and identify a membrane-type serine protease in epithelial cancer and normal tissue. Proc Natl Acad Sci U S A 96, 11054-11061 (1999).

19. Harris, J.L. et al. Rapid and general profiling of protease specificity by using combinatorial fluorogenic substrate libraries. Proc Natl Acad Sci U S A 97, 7754-7759 (2000).

20. Ruggles, S.W., Fletterick, R.J. & Craik, C.S. Characterization of structural determinants of granzyme B reveals potent mediators of extended substrate specificity. J Biol Chem 279, 30751-30759 (2004).

21. Farady, C.J., Egea, P.F., Schneider, E.L., Darragh, M.R. & Craik, C.S. Structure of an Fab-protease complex reveals a highly specific non-canonical mechanism of inhibition. J Mol Biol 380, 351-360 (2008).

22. Wu, L. et al. Structural basis for proteolytic specificity of the human apoptosis-inducing granzyme M. J Immunol 183, 421-429 (2009).

23. Backes, C., Kuentzer, J., Lenhof, H.P., Comtesse, N. & Meese, E. GraBCas: a bioinformatics tool for score-based prediction of Caspase- and Granzyme B-cleavage sites in protein sequences. Nucleic Acids Res 33, W208-213 (2005).

24. Wee, L.J., Tan, T.W. & Ranganathan, S. SVM-based prediction of caspase substrate cleavage sites. BMC Bioinformatics 7 Suppl 5, S14 (2006).

25. Miyazawa, S. & Jernigan, R.L. Estimation of effective inter-residue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules 18, 534-552 (1985).

26. Sippl, M.J. Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J Mol Biol 213, 859-883 (1990).

27. Tanaka, S. & Scheraga, H.A. Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins. Macromolecules 9, 945-950 (1976).

28. Davis, F.P. & Sali, A. PIBASE: a comprehensive database of structurally defined protein interfaces. Bioinformatics 21, 1901-1907 (2005).

29. Eramian, D., Eswar, N., Shen, M.Y. & Sali, A. How well can the accuracy of comparative protein structure models be predicted? Protein Sci 17, 1881-1893 (2008).

30. Eramian, D. et al. A composite score for predicting errors in protein structure models. Protein Sci 15, 1653-1666 (2006).

31. Melo, F. & Sali, A. Fold assessment for comparative protein structure modeling. Protein Sci 16, 2412-2426 (2007).

32. Melo, F., Sanchez, R. & Sali, A. Statistical potentials for fold assessment. Protein Sci 11, 430-448 (2002).

33. Sali, A. & Blundell, T.L. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234, 779-815 (1993).

34. Sali, A. & Overington, J.P. Derivation of rules for comparative protein modeling from a database of protein structure alignments. Protein Sci 3, 1582-1596 (1994).

35. Shen, M.Y. & Sali, A. Statistical potential for assessment and prediction of protein structures. Protein Sci 15, 2507-2524 (2006).

36. Brooks, B.R. et al. CHARMM: the biomolecular simulation program. J Comput Chem 30, 1545-1614 (2009).

37. Chen, J. & Brooks, C.L., 3rd. Can molecular dynamics simulations provide high-resolution refinement of protein structure? Proteins 67, 922-930 (2007).

38. Lazaridis, T. & Karplus, M. Discrimination of the native from misfolded protein models with an energy function including implicit solvation. J Mol Biol 288, 477-487 (1999).

39. Misura, K.M., Chivian, D., Rohl, C.A., Kim, D.E. & Baker, D. Physically realistic homology models built with ROSETTA can be more accurate than their templates. Proc Natl Acad Sci U S A 103, 5361-5366 (2006).

40. Wroblewska, L., Jagielska, A. & Skolnick, J. Development of a physics-based force field for the scoring and refinement of protein models. Biophys J 94, 3227-3240 (2008).

41. Lasker, K., Topf, M., Sali, A. & Wolfson, H.J. Inferential optimization for simultaneous fitting of multiple components into a CryoEM map of their assembly. J Mol Biol 388, 180-194 (2009).

42. Tyndall, J.D., Nall, T. & Fairlie, D.P. Proteases universally recognize beta strands in their active sites. Chem Rev 105, 973-999 (2005).

43. Mahrus, S. et al. Global sequencing of proteolytic cleavage sites in apoptosis by specific labeling of protein N termini. Cell 134, 866-876 (2008).

44. Timmer, J.C. et al. Structural and kinetic determinants of protease substrates. Nat Struct Mol Biol 16, 1101-1108 (2009).

45. Bhatt, A.S. et al. Quantitation of membrane type serine protease 1 (MT-SP1) in transformed and normal cells. Biol Chem 384, 257-266 (2003).