R01_MRx - Andrej Sali Lab

Dec 3, 2013

Project Summary

The broad goal is to develop and apply computational methods for building models of the structure and
dynamics of proteins and their assemblies. These models can give insights into how the assemblies work,
how they evolved, how they can be controlled, and how similar functionality can be designed. One successful
approach, integrative structure modeling, casts the building of such models as a computational optimization
problem where all knowledge about the assembly is encoded into the scoring function used to evaluate
candidate models. It is proposed here to extend and enhance the already successful open source Integrative
Modeling Platform (IMP; http://integrativemodeling.org), which provides programmatic support for developing
and distributing integrative structure modeling protocols. IMP allows representing molecules at a variety of
resolutions, using spatial restraints from many types of data, and searching for solutions by a variety of
sampling algorithms. So far, it has been applied mostly to electron microscopy, small angle X-ray scattering,
and various proteomics data. IMP is easily extensible to add support for new data sources and algorithms, and
is distributed under an open source license. IMP will be extended to address a greater range of biological
problems and to be more generally useful to the scientific community. Specifically, the traditional scoring
functions used by IMP will be supplemented with inference-based scoring functions that extract the maximum
possible information from the data, following a Bayesian approach with minimal assumptions and
approximations, to account for errors and incompleteness in the data as well as a heterogeneous sample.
Sampling of the scoring function landscape will be improved by a method that efficiently divides the complete
set of degrees of freedom into potentially overlapping subsets, finds optimal and suboptimal solutions for the
subsets independently by traditional optimizers or enumeration, and then combines compatible solutions to
obtain guaranteed highest-scoring solutions for the whole system. IMP will also be extended to make best use
of the wealth of information provided by mass spectrometry, used in combination with other data about the
system. To maximize the impact of IMP and its utility to the community, it will be interfaced to other
packages, including structure viewers such as Chimera, structure prediction and design programs such as
Rosetta, and web portals such as the Protein Model Portal. Finally, the software will be well tested and
documented, and the growing IMP community will be supported with mailing lists, examples, demonstrations at
workshops, and hosting of select users at UCSF.

Relevance

We propose to extend IMP, a computer program that can describe the three-dimensional shapes of large
macromolecular machines that are not amenable to solution with a single experimental technique. These
structures will allow us to better understand the workings of the cell, both under normal and disease conditions.

Specific Aims

Our broad goal is to develop and apply computational methods for building models of the structure and
dynamics of proteins and their assemblies. These models can give insights into how the assemblies work, how
they evolved, how they can be controlled, and how similar functionality can be designed. One successful
approach, integrative structure modeling, casts the building of such models as a computational optimization
problem where our knowledge about the assembly is encoded into the scoring function used to evaluate
candidate models [3-8]. During the previous funding period, we developed the Integrative Modeling Platform
(IMP; http://integrativemodeling.org), an open source software package that provides programmatic support for
developing and distributing integrative structure modeling protocols [8]. We also demonstrated its use by
application to several biological problems, including the Nuclear Pore Complex (NPC) [1] and 26S proteasome
[9]. IMP allows the representation of molecules at a variety of resolutions, using spatial restraints from
almost any type of data, and searches for solutions using a variety of sampling algorithms. So far, it has
been applied mostly to electron microscopy (EM) [7,10], small angle X-ray scattering (SAXS) [11], and various
proteomics data [3]. IMP is easily extensible to add support for new data sources and algorithms, and is
distributed under an open source license. We propose to build upon this foundation, to be able to address a
greater range of biological problems and make IMP more generally useful to the scientific community. Our
specific aims are:

Aim 1: Develop inference-based scoring functions. Experimental data are frequently limited by errors and
incompleteness. As a result, scoring functions traditionally used for structure determination may fail to
identify the full set of structures that gave rise to the data. We aim to develop inference-based scoring
functions that extract the maximum possible information from the data, following a Bayesian approach with
minimal assumptions and approximations.

Aim 2: Develop a divide-and-conquer family of sampling methods. Sampling all model structures consistent
with the data is generally a challenging problem, in part due to the rugged nature of the scoring function
landscape and its many local minima. We aim to design, implement, and test a sampling method that efficiently
divides the complete set of degrees of freedom into potentially overlapping subsets, finds optimal and
suboptimal solutions for the subsets independently by traditional optimizers or enumeration, and then combines
compatible solutions to obtain guaranteed best-scoring solutions for the whole system [12].

Aim 3: Develop methods for translating mass spectrometry data directly into spatial restraints. Native mass
spectrometry (MS) [13-16] and intermolecular cross-linking detected by MS [17-20] can yield a wealth of
information about proximity between assembly subunits, though this information is sometimes incorrect and not
uniquely assigned to specific residues or subunits. We propose to develop an IMP module to convert these
experimental data into explicit spatial restraints that can be used with restraints from other data, such as
EM maps and SAXS profiles, to produce hybrid models of macromolecular assemblies.

Aim 4: Interfacing IMP with other packages. IMP itself provides a toolbox of interchangeable components that
can be used to construct an integrative modeling protocol. We propose to extend this toolbox by developing
interfaces to other packages. In particular, we propose to maximize the impact of IMP and its utility to the
community by interfacing it with structure viewers such as Chimera [21], structure prediction and design
programs such as Rosetta [22], and web portals such as the Protein Model Portal [23,24].

Aim 5: Support the IMP developer and user communities. To maximize the usefulness of IMP, we will deliver a
robust, well-tested, and well-documented product. We will work closely with the community, providing
high-level usage documentation, demonstrating IMP at workshops, and hosting select users at UCSF. Literature
and input files related to our collaborative research projects with experimental biologists will also be made
available to illustrate the use of the software.

We will apply IMP to the NPC, 26S proteasome, bacterial type III and VI secretion systems, group II
chaperonins, spindle pole body, and host-pathogen complexes, in collaboration with experimentalists and funded
by other grants (Collaboration Letters); these projects will benefit from more accurate scoring functions,
more powerful sampling schemes, and protocols for using MS data developed in this grant.


Research Strategy

0. Introduction

0.1. Significance

Building models of a biological system that are consistent with the myriad data describing it is one of the
key challenges in biology. In particular, we are interested in models of the structure and dynamics of
proteins and their assemblies. These models can be helpful in understanding the function, evolution, control,
and design of macromolecular assemblies. One successful structure determination approach, integrative
structure modeling, casts the building of such models as a computational optimization problem where our
knowledge about the assembly is encoded into the scoring function used to evaluate candidate models
[1,6,25,26]. Integrative modeling has a number of advantages when compared to other modeling approaches; for
example, an ensemble of models that fits all the data is generally more accurate and precise than one based on
a subset of the data, new types of data can be easily added, the accuracy and precision of the data as well as
the models can be assessed, and preliminary models can guide the design of new experiments. Previous
applications of integrative modeling include modeling the 26S proteasome from a cryo-EM map, proteomics data,
and comparative protein structure models of components [9]; the bacterial type II pilus from sparse NMR data
and X-ray crystallography structures of constituent proteins [27]; chromatin from 5C data [28]; auxilin bound
to clathrin from an EM map and comparative models of components [29]; the human voltage-dependent anion
channel from NMR spectroscopy and X-ray crystallography structures of constituent proteins [30]; eukaryotic
initiation factor 3 from MS and proteomics data [31]; and the whole NPC from biophysical, proteomics, and EM
data [1].

The model of the yeast NPC [1] illustrates the value of integrative modeling. The sheer size and flexibility
of the NPC make it all but impossible to solve its molecular architecture by conventional atomic resolution
techniques, such as X-ray crystallography. However, integrating information from multiple sources, including
stoichiometry from protein quantification, protein proximities from subcomplex purification, protein positions
from immuno-EM, sedimentation analysis informative about the protein and subcomplex shapes, and the overall
NPC shape from cryo-EM and electron tomography, resulted in an ensemble of medium resolution models. The
models were summarized by a 3D probability map, resembling an EM map and localizing the 456 constituent
proteins with an average precision of ~5 nm. This map has revealed fundamental new insights into the function
and evolution of the NPC [32-35].

Publication of macromolecular structures has evolved from printed words and pictures to include deposition of
coordinates in the Protein Data Bank [36], and more recently deposition of raw input data such as X-ray
scattering factors [36], NMR restraints [37], and EM particle images [38]. However, the conversion of the raw
data to the final structures is often only briefly described and all too rarely available in a directly
usable form [39-41], making reproduction and use of the published results laborious or even impossible. If
published papers included integrative modeling data and protocols, a wide variety of researchers would
benefit. In particular, experimental labs, which are otherwise unlikely to go through the effort of modeling
systems themselves, would be able to use the state-of-the-art model in experiment planning by simulating how
much benefit would be achieved from new data. It would also be easy to see how much each new measurement
contributes to the current model as well as whether or not it is consistent with it. Other computational
groups could more easily experiment with new scoring, sampling, and analysis methods, without having to
re-implement the existing methods from scratch. Finally, the authors themselves would maximize the impact of
their work by increasing the odds that their results are incorporated into future modeling.

Achieving these goals requires good, well-supported software. First, to maximize the ability of scientists to
apply integrative modeling techniques to new problems, the software needs to be readily available, well
documented, and easy to learn. Second, as no one group can be an expert in all areas of structural biology, it
should be easy for any lab, not only the core software developers, to extend the software to handle new data
types by adding new restraints, optimizers, and other functionality.

0.2. Approach

IMP is a library designed to facilitate computational encoding of the standard scientific cycle of gathering
data, proposing hypotheses, and then gathering more data to test and refine those hypotheses. Structure
modeling in IMP proceeds through the following stages (Fig. 1) [3,8].

1) Gathering information: This information consists of data from wet lab experiments such as those listed
above, as well as statistical tendencies such as atomic statistical potentials, and physical laws such as
molecular mechanics force fields.

2) Designing model representation and evaluation: The resolution of the representation depends on the
quantity and resolution of the available information, and should be commensurate with the resolution of the
final models; different parts of a model may be represented at different resolutions and a part may be
represented at several different resolutions simultaneously. The scoring function evaluates whether or not a
given model is consistent with the input information, taking into account the uncertainty in the information.

3) Sampling good-scoring models: The search is performed using any of a variety of sampling and optimization
schemes (eg, the Monte Carlo method). There may be many models that score well if the data is incomplete, or
none if the data is inconsistent due to errors or unconsidered states of the assembly.

4) Analyzing models and information: The good-scoring models need to be clustered and analyzed to ascertain
their precision and accuracy, and to check for inconsistent information. Analysis can also suggest what are
likely to be the most informative experiments to perform in the next iteration.

The cycle terminates when a convergent ensemble of models is found fitting the current information (or a
sufficient subset of it) and the models have been assessed to be satisfactory. When new information is
gathered, whether by other scientists or techniques, the cycle is resumed.
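
The four stages can be illustrated with a toy example. The sketch below is plain Python and does not use the IMP API; the three point-like subunits on a line, the target distances, the score threshold, and the Metropolis Monte Carlo settings are all assumptions chosen only to keep the example small.

```python
import math
import random

# Stage 1: gather information. Hypothetical target distances between
# three subunits placed on a line (arbitrary units).
DATA = {("A", "B"): 4.0, ("B", "C"): 3.0}


def score(model):
    """Stage 2: sum of squared violations of the distance restraints."""
    return sum((abs(model[i] - model[j]) - d) ** 2 for (i, j), d in DATA.items())


def sample(n_steps=20000, temperature=0.05, threshold=0.05, seed=0):
    """Stage 3: Metropolis Monte Carlo; returns the good-scoring models visited."""
    rng = random.Random(seed)
    model = {name: rng.uniform(-5.0, 5.0) for name in "ABC"}
    current = score(model)
    good = []
    for _ in range(n_steps):
        name = rng.choice("ABC")
        old = model[name]
        model[name] = old + rng.gauss(0.0, 0.5)  # perturb one coordinate
        trial = score(model)
        accept = trial <= current or rng.random() < math.exp((current - trial) / temperature)
        if accept:
            current = trial
            if current < threshold:
                good.append(dict(model))
        else:
            model[name] = old  # reject: restore the previous coordinate
    return good


def analyze(models):
    """Stage 4: mean and spread of the A-B distance over the ensemble."""
    dists = [abs(m["A"] - m["B"]) for m in models]
    mean = sum(dists) / len(dists)
    spread = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return mean, spread


ensemble = sample()
mean_ab, spread_ab = analyze(ensemble)
print(f"{len(ensemble)} good-scoring models; A-B distance {mean_ab:.2f} +/- {spread_ab:.2f}")
```

Because the two restraints do not say on which side of B subunit C lies, the data are incomplete in the sense used above: models with an A-C distance near 7 and near 1 score equally well, and a further measurement would be needed to distinguish them.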

0.3. Progress Report

During the initial funding period (4/1/08-3/31/12), we achieved our goals as defined by the stated Specific
Aims and demonstrated by our 31 publications (Progress Report Publication List).

Previous Aim 1: Develop a software kernel for IMP. We have developed the basic infrastructure for IMP and
implemented a set of tools in C++ and Python for representing, scoring, sampling, and analyzing structures of
macromolecular assemblies (http://integrativemodeling.org/) [3,8]. This core functionality:

- allows easy representation of molecules at a variety of resolutions. When modeling large assemblies, we
  often do not have enough information to warrant a fully atomic or even a residue level model, so being able
  to handle coarse-grained as well as atomic models is essential.

- allows in principle almost any type of information to be used in modeling. We need to use all available
  information to maximize accuracy, precision, coverage, and efficiency of modeling efforts.

- is easily extensible. As a single research group cannot write the code to support all information types,
  any developer must be able to easily add and distribute support for new information types (eg, by
  implementing a single C++ or Python class). IMP is structured as a collection of modules, each of which
  groups together functionality based on, for example, a particular type of information and identity of the
  authors; these modules encompass source code, documentation, parameter files, authorship, etc. Since modules
  are self-contained, they can be developed and distributed separately from the main IMP code, while still
  benefiting from the IMP infrastructure. As a result, it is easy to build on other developers’ code and
  methods.

- provides a high level interface against which to write scripts to model biological systems. A high level
  interface in Python and C++ limits the amount of code that needs to be written, debugged, maintained, and
  documented when modeling a particular biological system. Thus, a third party can more easily add new data,
  tweak the representation, and improve the sampling scheme.

- reduces the maintenance burden on methods and application developers. By insulating application developers
  from the details of the platform on which the application is being run, the platform makes scripts easier to
  maintain and run.


Fig. 1. Integrative structure determination by satisfaction of spatial restraints. The four steps to
determine a structure by integrating varied data are illustrated using the NPC as an example [1]. First,
structural data are generated by experiments, such as cryo-EM (left), immuno-EM (center), and affinity
purification of subcomplexes (right). Many other types of information can also be included. Second, the data
and theoretical considerations are expressed as spatial restraints that ensure the observed symmetry and
shape of the assembly (from cryo-EM, left), the positions of constituent gold-labeled proteins (from
immuno-EM, center), and the proximities of the constituent proteins (from affinity purification, right). The
assembly is indicated in blue, and constituent proteins are indicated as colored circles. Third, an ensemble
of structural solutions that satisfy the data is obtained by minimizing the violations of the spatial
restraints (from left to right). Fourth, the ensemble is clustered into distinct sets of similar solutions
(left), and analyzed in different representations, such as protein positions (center) and protein-protein
contacts (right).

Previous Aim 2: Develop extension modules. Using the core functionality of IMP, we developed modules that
implement a wide variety of methods for scoring and sampling molecular models. Sources of spatial restraints
currently supported by IMP include 3D maps from EM [10,12,26,42-46], 2D images from EM [47-49], molecular
mechanics force fields [50], comparative protein structure modeling [51-53], atomic statistical potentials
[54-56], molecular docking of pairs of protein structures [11], comparative patch analysis [57,58],
comparative docking [59,60], SAXS profiles [11,61,62], and various proteomics datasets [1,7,9,49,63].
Moreover, IMP provides several sampling algorithms to generate possible solutions; these samplers include
physics-inspired algorithms such as molecular dynamics (MD) [64] and Monte Carlo with simulated annealing
[65], as well as methods such as conjugate gradients [66,67] and simplex [68,69]. We also implemented the
DOMINO sampler, which applies a divide-and-conquer approach [70] to efficiently find solutions with the
globally optimal score within a discrete sampling space (Fig. 2) [10,12].

In addition, we built a number of higher-level tools for solving common modeling problems. These include
Restrainer for converting proteomics data into spatial restraints on the configuration of multi-subunit
assemblies [7]; MultiFit for assembling multiple subunits based on an EM density map, proteomics data, and
molecular docking (http://salilab.org/multifit/) [10]; FoXS for computing a SAXS profile of a given structure
(http://salilab.org/foxs/) [61]; and an integration of protein-protein docking with information from other
experiments, such as EM, SAXS, and NMR spectroscopy, significantly outperforming docking alone
(http://salilab.org/foxsdock/) [47].

Previous Aim 3: Develop a sample application. Demonstrating the relative maturity and utility of the IMP
code and its predecessors, we have successfully applied it to a number of biological systems. Examples
include a eukaryotic ribosome [71], a mammalian ribosome [72], an RyR channel [73], the 26S proteasome
[9,74,75], the Hsp90 chaperonin [76], the TRiC/CCT chaperonin [77], the actin-scruin complex [78], and the NPC
[32]. We have also determined the structure of the heptameric Nup84 complex in baker’s yeast, a key component
of the NPC structural scaffold (Fig. 3a), as well as the configuration of all 12 subunits in the lid of the
19S subunit of the S. pombe 26S proteasome (Fig. 3b).

Previous Aim 4: Support the developer and user communities. Progress on this aim is summarized below
(Aim 5).

Future Work. Reflecting the accomplishment of previous Specific Aims and our research supported by other
grants, we now aim to extend IMP’s functionality by (i) developing more informative scoring functions
(Aim 1), (ii) improving sampling algorithms (Aim 2), (iii) incorporating new types of MS data that make use
of improved scoring and sampling (Aim 3), (iv) connecting with other software packages to make better use of
resources developed by others (Aim 4), and (v) expanding support of the user community (Aim 5). We will
ourselves apply IMP to the NPC, 26S proteasome, bacterial type III and VI secretion systems, group II
chaperonins, spindle pole body, and host-pathogen complexes, in collaboration with experimentalists and funded
by other grants (Collaboration Letters); these projects will benefit from more accurate scoring functions,
more powerful sampling schemes, and protocols for using MS data developed in this grant.


Fig. 2. Divide-and-conquer sampling. The DOMINO sampler decomposes variables into subsets, samples each
subset independently, and gathers all compatible solutions across subsets into a global solution.




Fig. 3. Recent sample structures determined by IMP. (left) Heptameric S. cerevisiae Nup84 complex. In
collaboration with M. Rout and B. Chait, we relied on homology modeling of subunits, affinity purification of
domain deletion constructs, and 2D EM data (in preparation). The structure delineates previously unknown
domain-domain interactions in the complex. (right) 19S subunit of the S. pombe 26S proteasome. In
collaboration with W. Baumeister and R. Aebersold, we relied on subunit homology models, an EM map of the
assembly at 8 Å resolution, intermolecular cross-links, and proteomics data (in preparation).

1. Aim 1: Develop inference-based scoring functions

Experimental data are frequently limited by incompleteness and errors. Thus, scoring functions traditionally
used for structure determination may fail to identify the full set of structures that gave rise to the data.
We aim to develop inference-based scoring functions that extract the maximum possible information from the
data while accounting for potential sources of error, following a Bayesian approach with minimal assumptions
and approximations. We are concerned here with the development of the general approach that is in principle
applicable to any data; the approach will be adapted to specific MS data in Aim 3 and to other types of data
in efforts funded by other grants.

1.1. Significance

Protein structure prediction or experimental determination of any kind proceeds by exploring a large
configurational space with a sampling algorithm and evaluating each configuration with a scoring function.
Thus, the accuracy of the output models depends on the accuracy of the scoring function and the thoroughness
of structural sampling. As a result, there is a need for more accurate scoring functions, such as the
inference-based scoring functions proposed here, especially when the datasets are incomplete and contain
errors.

Model precision should reflect the incompleteness of the data: it should not be higher or lower than what
the data allow. However, traditional scoring functions use arbitrarily weighted ad hoc terms that are often
over-designed to result in a single solution; thus, there is no guarantee that the solution contains the full
range of structures that produced the observed data.

Even with complete data, mutually inconsistent subsets of information can arise because of (i) errors in
data collection (eg, false protein-protein interactions from the yeast two-hybrid system [79]) and (ii)
incoherence resulting from a sample existing in multiple compositional and structural states. To produce
reliable models, data must be analyzed to identify self-consistent data subsets, assign the data subsets to
alternate structures, and remove the incorrect data. Failing to do so significantly deteriorates the accuracy
of the solutions by generating models that are a compromise between consistent and inconsistent data, leading
also to an over- or under-estimate of the model accuracy and precision.

1.2. Innovation

We propose to combine the representation capabilities of IMP with a Bayesian approach, based on Inferential
Structure Determination (ISD) [2]. The innovation is threefold: (i) IMP will provide a probability-based
scoring function that correctly ranks solutions when the data is incomplete and erroneous; (ii) the Bayesian
approach, once incorporated into IMP, will be extended to a variety of experimental data, including MS,
proteomics and EM, and benefit from IMP’s multi-scale representation and versatility; and (iii) we will
generalize the Bayesian approach to cope with inconsistent data. This will help with the detection of
alternative configurations of the system as well as the presence of incorrect data.

1.3. Approach

Inferential structure determination. In the Bayesian approach, the goal is to find highest probability
models, given all information (including experimental data and theoretical information). As a result,
optimally precise models are inferred even from uncertain and incomplete information, in stark contrast to
traditional scoring functions [81-83]. Models are obtained by sampling the scoring function defined
conveniently as the posterior probability $p(X,\sigma \mid D,I)$ of a model $X$ and uncertainty $\sigma$,
given data $D$ and prior information $I$ [2]. The uncertainty $\sigma$ determines how well model $X$ agrees
with data $D$: it accounts for the accuracy of both the data and the forward model that computes $D$ from
$X$. The posterior probability reflects the information content of the data and represents the complete set
of solutions. When it is zero everywhere except for a single model, the data determine the model uniquely; in
contrast, if it is uniform, the data is completely uninformative. The posterior probability is estimated by
Bayes theorem:

$p(X,\sigma \mid D,I) \propto p(D \mid X,\sigma,I)\, p(X \mid I)\, p(\sigma \mid I)$.

The model prior $p(X \mid I)$ expresses our initial knowledge about the structure of the system, such as a
molecular mechanics force field. The uncertainty prior $p(\sigma \mid I)$ is the distribution of the values
for $\sigma$ (eg, the Jeffreys prior [84]). The likelihood $p(D \mid X,\sigma,I)$ estimates how well the data
agree with the assessed model. The likelihood can often be formulated as a product of unimodal distributions,
each one peaked around a single experimental datum $d_i$,

$p(D \mid X,\sigma,I) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(d(X) - d_i)^2}{2\sigma^2}\right]$,

where $d(X)$ is the forward model. Importantly, the Bayesian approach simultaneously optimizes model $X$ and
uncertainty $\sigma$, and does not simply minimize the deviations between the experimental data $D$ and the
data computed from model $X$, as is the case for traditional “least-squares” procedures.
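
As an illustration of how such a scoring function might be evaluated, the sketch below computes the negative logarithm of the posterior above, up to an additive constant, for a Gaussian likelihood with a single uncertainty and a Jeffreys prior on it. It is plain Python rather than IMP or ISD code; the function names, the flat default model prior, and the residual values are assumptions made for the example.

```python
import math


def neg_log_posterior(residuals, sigma, neg_log_model_prior=0.0):
    """-log p(X, sigma | D, I) up to an additive constant.

    residuals: d(X) - d_i for each datum, computed by some forward model.
    sigma: the single uncertainty shared by all data.
    neg_log_model_prior: -log p(X | I); zero corresponds to a flat prior.
    """
    n = len(residuals)
    chi2 = sum(r * r for r in residuals)
    neg_log_likelihood = n * math.log(sigma) + chi2 / (2.0 * sigma ** 2)
    neg_log_sigma_prior = math.log(sigma)  # Jeffreys prior, p(sigma) ~ 1/sigma
    return neg_log_likelihood + neg_log_sigma_prior + neg_log_model_prior


def most_probable_sigma(residuals):
    """Sigma that minimizes the score for fixed residuals: sqrt(chi2 / (n + 1))."""
    chi2 = sum(r * r for r in residuals)
    return math.sqrt(chi2 / (len(residuals) + 1))


good_fit = [0.1, -0.2, 0.1]  # hypothetical residuals of a model that fits well
poor_fit = [1.0, -2.0, 1.5]  # hypothetical residuals of a model that fits poorly
for name, res in (("good", good_fit), ("poor", poor_fit)):
    s = most_probable_sigma(res)
    print(name, round(s, 3), round(neg_log_posterior(res, s), 3))
```

In this sketch the uncertainty is not a fixed weight: for each candidate model the most probable $\sigma$ follows from the residuals, so a poorly fitting model is assigned both a larger uncertainty and a worse score.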

Modeling based on inconsistent data. Since ISD was initially formulated with a single data uncertainty
$\sigma$, it cannot deal with inconsistent data (Fig. 4). When all the data cannot be satisfied by a single
model, we propose to formulate the likelihood as a “multiple uncertainty” function where each data point
$d_i$ is associated with its own uncertainty parameter $\sigma_i$. Therefore, if a subset of the data is
inconsistent with the bulk of the data, the corresponding uncertainty will increase, attenuating the
contribution of the inconsistent data to the scoring function (Fig. 4). Thus, we can now quantitatively assign
the degree of inconsistency for each piece of data, obviating the need for oversimplified classification of
data points into the “correct” and “incorrect” classes. Because inconsistent data can arise from either noise
in data collection or distinct compositions/configurations in the sample, the proposed Bayesian perspective
will be informative about both aspects (eg, a large subset of self-consistent data is unlikely to be erroneous
and thus was likely produced by a single configuration).
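
The attenuation effect can be seen in a deliberately simple setting. The sketch below is an assumption-laden illustration, not the proposed implementation: the model is a single number $x$, every datum is a direct measurement of it, the scoring function is maximized by alternating updates rather than sampled, and a lower bound `sigma_min` (a hypothetical parameter) stands in for a proper prior that keeps the uncertainties from collapsing to zero.

```python
import math


def fit_with_multiple_uncertainties(data, sigma_min=0.1, n_iter=20):
    """Alternate between updating the model x and the per-datum sigmas.

    The model is a single number x whose forward model predicts every datum.
    sigma_min is an assumed lower bound that keeps the sigmas from collapsing.
    """
    sigmas = [1.0] * len(data)
    x = sum(data) / len(data)
    for _ in range(n_iter):
        weights = [1.0 / s ** 2 for s in sigmas]
        # Best x for fixed sigmas: mean of the data weighted by 1 / sigma_i^2.
        x = sum(w * d for w, d in zip(weights, data)) / sum(weights)
        # Best sigma_i for fixed x under a Jeffreys prior: |residual| / sqrt(2).
        sigmas = [max(sigma_min, abs(d - x) / math.sqrt(2.0)) for d in data]
    return x, sigmas


data = [1.0, 1.1, 0.9, 5.0]  # the last value plays the role of an incorrect datum
x_single = sum(data) / len(data)  # single shared sigma: plain least squares
x_multi, sigmas = fit_with_multiple_uncertainties(data)
print(f"single sigma: x = {x_single:.2f}; multiple sigmas: x = {x_multi:.2f}")
print("sigmas:", [round(s, 2) for s in sigmas])
```

With a single shared uncertainty the estimate is pulled halfway toward the incorrect datum, whereas with per-datum uncertainties the incorrect datum acquires a large $\sigma_i$ and the estimate stays with the three mutually consistent values.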

Potential problems and alternative approaches. A number of difficulties may arise in implementing the
multiple uncertainty approach. First, the introduction of error parameters for each datum increases the number
of the degrees of freedom of the system and exacerbates the burden on the sampler. We will address this
problem by (i) utilizing advanced sampling techniques (eg, replica-exchange MD [85,86] and DOMINO (Aim 2)) as
well as (ii) testing for sampling convergence by analysis of solution subsets. Second, as is often the case in
Bayesian approaches, the choice of the optimal prior information may be uncertain. On one hand, if priors are
uninformative, finding statistically relevant solutions with sparse data may be difficult; alternatively, it
may result in over-fitting when the free parameters exceed the number of data points. On the other hand, if
priors are too restrictive, solutions may be inappropriately biased. We will address this challenge by
experimenting with various priors for the coordinates, such as molecular mechanics force fields [50] and
statistical potentials [54], and priors for uncertainty parameters obtained from the maximum entropy principle
[84]. Finally, a major difficulty exists in the comprehensive representation of the solutions. Our method will
produce either a unique cluster of solutions with associated uncertainty centered around a single conformation
or multiple distinct clusters centered around different conformations, as well as ensembles where certain
portions of the macromolecule are much less determined than others. We will work with the UCSF Chimera
developers to establish procedures for visualizing ensembles with associated uncertainties (Aim 4).

Benchmark and applications. We will use specific synthetic benchmarks to test the general Bayesian framework
above. We will begin by using a low-resolution representation, with a protein in an assembly represented as a
sphere and pairwise contact data generated from presumed native structures. We will test the ability of the
scoring function to detect incorrect data and assign consistent data to the corresponding system
configuration. We will also use the atomic representation for small proteins and peptides, Gō restraints [87]
computed from their native structures, and an increasing number of incorrect distance restraints combined with
different molecular mechanics force fields and statistical potentials as priors. More challenging and
realistic tests will be performed on MS data sets; see Aim 3 for a specific example of an inference-based
scoring function for a cross-link between two residues.

Fig. 4. Inconsistent data in structure determination. (a) Assessment of a traditional scoring function using
harmonic restraints. (b) ISD scoring function with a single uncertainty parameter. (c) Bayesian approach with
multiple uncertainties. The optimized system is a 16-residue peptide (inset panel (a), ribbon) and the
simulated data consist of 54 Cα-Cα distance restraints (inset, gray lines) obtained from the β-hairpin target
conformation. Three incorrect data points, restraints between Cα with incorrect distance values (inset panel
(a), blue, yellow and orange lines), are progressively included in the data set. The score is plotted against
Cα-RMSD from the target structure for solutions obtained by sampling each scoring function containing zero
(black), one (red) and three (green) incorrect data points. (d) The Bayesian approach with multiple
uncertainties is able to find in all cases the correct solution (low RMSD and low score) by attenuating the
contribution of the incorrect data points and minimizing the discrepancy with the correct data. Posterior
probability distribution was sampled by Replica Exchange and Gibbs Sampling scheme [2].

2. Aim 2: Develop a divide-and-conquer family of sampling methods

Sampling all model structures consistent with the data is a challenging problem, in part due to the rugged
nature of the scoring function landscape and its many local minima. A scoring function useful in structure
determination is frequently a sum of terms, each of which depends on a subset of system components (eg,
atoms, residues, and proteins). If not every component is coupled with every other component, it is
sometimes possible to assemble globally optimal solutions of the complete system from locally optimal and
sub-optimal solutions for overlapping subsets of the system, thus vastly improving the efficiency of
sampling [12,70,88]. We aim to implement a general framework for this approach in IMP, so that we can
apply it to sampling scoring functions for a variety of integrative modeling problems.

2.1. Significance

As stated above, model accuracy depends on the thoroughness of structural sampling, in addition to the
accuracy of the scoring function. Despite recent advancements, problems involving many degrees of freedom
still often prove to be computationally intractable. As a result, there is a need for more efficient
sampling methods, such as the DOMINO class of methods proposed here.

2.2. Innovation

The key innovation is a divide-and-conquer strategy for efficient sampling of scoring functions typically
used in integrative modeling [12,68,86]. DOMINO may be able to overcome the barriers on the sampled
landscape, which prevent traditional samplers from fully exploring the landscape, by dividing the system
into overlapping parts and exploring their states independently. This general approach will underlie the
development of specific sampling schemes specialized for modeling of atomic protein structures, prediction
of trans-membrane helix packing, as well as modeling of coarse assembly structures based on EM maps and
proteomics data.

2.3. Approach

Divide-and-conquer. DOMINO enumerates the global and sub-optimal solutions of a scoring function, over a
given discrete sampling space [12,70]; the variables of the scoring function define positions of system
components, such as atoms, secondary structure segments, protein domains, or whole proteins. DOMINO
proceeds in four stages. First, the scoring function is represented as a graph, where the nodes are the
variables to be sampled and the edges are pairwise terms acting on these variables (representation).
Second, the set of variables is decomposed into overlapping subsets of variables that are “loosely”
coupled (decomposition) [70,89]; these subsets are represented as nodes in a junction tree where edges are
drawn between subsets containing the same variables. Third, the possible discrete states for each subset
are generated by enumeration or traditional sampling schemes (sampling). Finally, self-consistent
combinations of these subset states that are good scoring global solutions are constructed by inference
(gathering) [70,89]. Subset states are gathered by finding solutions that are as optimal as possible
within each subset while still globally compatible (in contrast, traditional optimizers might explore
locally optimal solutions, but will take a long time before finding all local solutions simultaneously in
a single complete state). Gathering efficiency is achieved by insisting on common values of the variables
that are shared by neighboring subsets in the junction tree. In its simplest form, DOMINO has already been
applied to fit multiple assembly subunits into an EM density map [12]. Here, we propose significant
advancements to the algorithm that will make it more efficient as well as applicable to a larger variety
of pairwise scoring functions.
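
The four stages can be illustrated with a toy example. The following is a minimal sketch in plain Python,
not the IMP implementation: the variable names, the state grid, and the two pairwise terms are invented for
illustration. It assumes a three-variable chain in which each scoring term belongs to exactly one subset, so
subset scores can simply be added when states are merged.

```python
from itertools import product

# Toy discrete problem (illustrative values): variables on a chain A - B - C.
STATES = {"A": [0, 1, 2], "B": [0, 1, 2], "C": [0, 1, 2]}

# Representation: pairwise scoring terms (lower is better); A and C are not coupled.
TERMS = {
    ("A", "B"): lambda a, b: (a - b - 1) ** 2,
    ("B", "C"): lambda b, c: (b + c - 3) ** 2,
}

# Decomposition: two overlapping subsets that share the variable B.
SUBSETS = [("A", "B"), ("B", "C")]


def sample_subset(subset):
    """Sampling stage: enumerate and score every discrete state of one subset."""
    scored = []
    for values in product(*(STATES[v] for v in subset)):
        assignment = dict(zip(subset, values))
        score = sum(
            term(assignment[u], assignment[v])
            for (u, v), term in TERMS.items()
            if u in assignment and v in assignment
        )
        scored.append((assignment, score))
    return scored


def gather(left, right):
    """Gathering stage: merge subset states that agree on all shared variables."""
    merged = []
    for a_left, s_left in left:
        for a_right, s_right in right:
            shared = set(a_left) & set(a_right)
            if all(a_left[v] == a_right[v] for v in shared):
                merged.append(({**a_left, **a_right}, s_left + s_right))
    return merged


def solve():
    """Return all global solutions, best first, built from subset states."""
    solutions = sample_subset(SUBSETS[0])
    for subset in SUBSETS[1:]:
        solutions = gather(solutions, sample_subset(subset))
    return sorted(solutions, key=lambda item: item[1])


def brute_force():
    """Reference: score every complete state directly."""
    names = list(STATES)
    results = []
    for values in product(*(STATES[n] for n in names)):
        assignment = dict(zip(names, values))
        score = sum(t(assignment[u], assignment[v]) for (u, v), t in TERMS.items())
        results.append((assignment, score))
    return sorted(results, key=lambda item: item[1])
```

In this sketch `solve()` returns both the optimal and the sub-optimal solutions, and `brute_force()` serves
only as a check; for larger systems the benefit comes from filtering subset states before they are merged.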

Sampling stage. Previously, allowed values of subset variables were generated through enumeration over a
discrete grid. Here, we propose to sample a subset by using traditional optimizers, as exhaustive
enumeration is not tractable for large systems. As an example, we are integrating DOMINO with MD sampling
of atomic structures. In this case, the full system is sampled using MD. Atoms in the system interact with
each other, and for each MD step, their coordinates are saved. The system is decomposed into subsets at
the conclusion of the MD sampling, and the values of the variables in each subset are generated from their
coordinates for each MD step. Other methods for sampling subsets instead of MD, such as MC with simulated
annealing and replica exchange, will also be tested.
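
As a rough illustration of how saved trajectory frames could be turned into subset states, the sketch below
projects each frame onto each subset and keeps the distinct results. It is plain Python under stated
assumptions: the frame format (a dictionary from atom name to coordinates), the atom names, and the rounding
precision used to decide when two states are the same are all hypothetical.

```python
def subset_states_from_trajectory(frames, subsets, precision=1):
    """Project every saved frame onto each subset and keep the distinct states.

    frames  : list of dicts mapping atom name -> (x, y, z)
    subsets : list of tuples of atom names
    """
    states = {subset: [] for subset in subsets}
    for frame in frames:
        for subset in subsets:
            # A subset state is the tuple of (rounded) coordinates of its atoms.
            state = tuple(
                tuple(round(c, precision) for c in frame[atom]) for atom in subset
            )
            if state not in states[subset]:
                states[subset].append(state)
    return states


# Three hypothetical frames of a three-atom system.
frames = [
    {"a1": (0.0, 0.0, 0.0), "a2": (1.0, 0.0, 0.0), "a3": (2.0, 0.0, 0.0)},
    {"a1": (0.0, 0.0, 0.0), "a2": (1.0, 0.5, 0.0), "a3": (2.0, 1.0, 0.0)},
    {"a1": (0.0, 0.0, 0.0), "a2": (1.0, 0.5, 0.0), "a3": (2.0, 0.0, 0.0)},
]
subsets = [("a1", "a2"), ("a2", "a3")]
states = subset_states_from_trajectory(frames, subsets)
```

Because each subset keeps its own list of states, combinations that never occurred together in a single
frame remain available to the gathering stage.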

Gathering stage. Having too many overlapping states across subsets leads to combinatorial intractability.
Thus, it is desirable to eliminate unproductive subset states before they are combined with those for
another subset. To accomplish this goal, we will define application-specific filters (constraints) acting
across variables in a subset. For example, one filter will be a restraint threshold. Filters will also
allow for generation of suboptimal global solutions by relaxing the thresholds while still enabling
sampling efficiency.
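
A restraint-threshold filter of the kind described above might look like the following sketch. The
function name, the state format, and the scores are assumptions made for illustration; they are not taken
from IMP.

```python
def filter_subset_states(scored_states, threshold):
    """Keep only the subset states whose score does not exceed the threshold."""
    return [(state, score) for state, score in scored_states if score <= threshold]


# Hypothetical scored states of one subset (lower score is better).
scored = [
    ({"A": 0, "B": 0}, 1.0),
    ({"A": 1, "B": 0}, 0.0),
    ({"A": 2, "B": 0}, 4.0),
]

strict = filter_subset_states(scored, threshold=0.5)   # near-optimal states only
relaxed = filter_subset_states(scored, threshold=1.5)  # also admits suboptimal states
```

Raising the threshold trades a larger number of states to merge for access to suboptimal global solutions.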

We now outline three sampling methods that will benefit from the divide-and-conquer approach of DOMINO.

Atomic protein structures. DOMINO will be adapted for exploring conformational space at the atomic
resolution. Immediate applications include protein loop modeling and protein-ligand docking. First, MD is
used to sample the system as a whole, generating a sample of conformations for each DOMINO atom subset.
Compatible subset conformations are gathered, resulting in a global solution that often scores better than
any individual state on the trajectory (Fig. 5).

Benchmarking. We will test the protocol on a large set of loops in known protein structures [90] and the
peptiDB dataset of 103 protein-peptide structures [91]. The scoring function will be a distance-dependent
atomic statistical potential [54].

Trans-membrane helix packing. We are applying DOMINO to sampling the configurations of trans-membrane
spanning α-helices. Initially, helices predicted from sequence are treated as rigid bodies, but will be
made flexible in the future. Subsets of helices are defined based on sequence-connectivity and a
relatively accurate sequence-based prediction of interacting pairs of helices. Different algorithms are
being explored to sample states in the subsets, including enumeration of states and other canonical
sampling methods such as Monte Carlo and MD. We are relying on preferred tilt angles, predicted topology,
packing motifs and other features extracted from known membrane protein structures as filters to restrict
the sampling in each subset.

Benchmarking. Initially, we will benchmark our algorithm on small membrane proteins with fewer than 7
trans-membrane helices for which an X-ray structure with a resolution better than 3.5 Å is available in
the OPM database (http://opm.phar.umich.edu/).

Assembly structures from EM maps and proteomics data. For many complexes, EM can generate a density
map [38] and proteomic techniques can determine protein proximities [92]. We propose to create graphical
network representations for these two datasets, followed by matching the networks using DOMINO to predict
the structure of the assembly. First, the density map is discretized into a graph of anchor points, each
associated with a probability of being close to each of the assembly components (based on fitting each
component to the density in the vicinity of each anchor point). Second, the proteomics data is converted
into an interaction graph. Third, good scoring mappings of components to anchor points are enumerated by
aligning the interaction and anchor graphs. The possible arrangements are scored by the quality of the
best fit of the atomic structures to the map. We will also develop an extension of this method for solving
structures of single protein chains given their density map, using linear connectivity of the sequence as
the “proteomics” data.
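
The alignment of the interaction graph with the anchor graph can be sketched by brute-force enumeration on
a tiny example. This is a plain Python illustration, not the proposed DOMINO-based method: exhaustive
permutation is used here only because the example is small, and the component names, anchor names, and fit
scores are invented (scores are treated as costs, so lower is better).

```python
from itertools import permutations


def enumerate_mappings(components, anchors, interactions, anchor_edges, fit_score):
    """List mappings of components to anchor points that respect the interactions.

    interactions : pairs of components known to be in proximity
    anchor_edges : pairs of anchor points that are neighbors in the map
    fit_score    : dict (component, anchor) -> cost of placing it there
    """
    adjacent = {frozenset(edge) for edge in anchor_edges}
    results = []
    for chosen in permutations(anchors, len(components)):
        mapping = dict(zip(components, chosen))
        # Every interacting pair must be placed on neighboring anchor points.
        if all(frozenset((mapping[u], mapping[v])) in adjacent for u, v in interactions):
            score = sum(fit_score[(c, a)] for c, a in mapping.items())
            results.append((score, mapping))
    return sorted(results, key=lambda item: item[0])


components = ["A", "B", "C"]
anchors = ["p1", "p2", "p3"]
interactions = [("A", "B"), ("B", "C")]
anchor_edges = [("p1", "p2"), ("p2", "p3")]
fit_score = {(c, a): 1.0 for c in components for a in anchors}
fit_score.update({("A", "p1"): 0.2, ("B", "p2"): 0.1, ("C", "p3"): 0.3})

mappings = enumerate_mappings(components, anchors, interactions, anchor_edges, fit_score)
```

In this example only the two mappings that place B on the central anchor point survive, and the fit scores
decide between them.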

Benchmarking. We will use synthetic benchmarks based on known complex structures, as well as select
experimental entries in EMDB [38] and Biogrid [92] (eg, RNA polymerase II (RNAPII) [93] and 20S
proteasome [94,95]).

Potential problems and alternative approaches. When the number of compatible states across subsets is very
large, a huge amount of memory is needed in the gathering stage. We will explore ways to optimize memory
usage, including writing subset states to disk, parallelizing the algorithm to allow for distributing
different subsets across different nodes of a computational cluster, and turning off restraint score
caching.

3. Aim 3: Develop methods for translating mass spectrometry data into spatial restraints

Native MS [13-16] and intermolecular cross-linking detected by MS [17-20] can yield a wealth of
information about proximity between assembly subunits, although this information is sometimes incorrect
and not uniquely assigned to specific residues or subunits. We propose to develop an IMP module to convert
these experimental data into explicit spatial restraints that can be used with restraints from other data,
such as EM maps and SAXS profiles, to produce hybrid models of macromolecular assemblies. The resulting
scoring functions (Aim 1) can be sampled by methods developed in Aim 2. Implementation of the MS module of
IMP will benefit from our collaborations with expert mass spectrometrists on determining the structures of
specific protein complexes, based in part on native MS and cross-linking data (NPC with B. Chait;
host-pathogen complexes with A. Burlingame; 26S proteasome with R. Aebersold and W. Baumeister; and
eIF3-HIV protease complex with C. Robinson; Collaboration Letters).

Fig. 5. Application of DOMINO to sampling atomic structures. (a) Examples of four subsets of peptide atoms
in the protein-peptide system. Protein atoms are colored grey, peptide atoms are blue, and subsets are in
red, yellow, green, and magenta (additional subsets are not shown). (b) Score of a small peptide in a
protein-binding pocket over an MD time course (blue line) compared with the optimal score found by DOMINO
(red square). The DOMINO score is also better than those produced by relaxation of the local MD minima
(not shown).

3.1. Significance

Structural information from native MS and protein cross-linking techniques is complementary to other
medium- and low-resolution data, such as density maps from EM, radial distribution functions from SAXS,
and protein proximities from proteomics experiments (eg, yeast two-hybrid system and tandem affinity
purification). Therefore, conversion of the MS data into explicit spatial restraints is needed, so that
they can be used within the integrative structure determination framework, thus maximizing the impact of
MS on structural biology.

3.2. Innovation

We will formally take the errors and ambiguity of native MS and chemical cross-linking into account for
the first time. While both native MS and cross-linking have previously been used to provide valuable
insights into the structures of specific protein assemblies [96-98], these insights have largely been
limited by the lack of proper accounting of uncertainty and ambiguity as well as the lack of formal
integration with other kinds of data. IMP will bring benefits of integrative structure determination to
MS; for instance, if data from a set of cross-linking experiments, properly qualified, are unable to
resolve a protein-protein interface, but the interface is resolved upon integration with an EM map,
unnecessary experiments that might otherwise be carried out will be avoided.

A key hurdle in converting MS data into explicit spatial restraints is that the structural interpretation
of the data can be ambiguous (assignment ambiguity), similar to the uncertainty of assigning a specific
hydrogen atom pair to an observed methyl-methyl NMR NOE. For example, many affinity purification / MS
experiments were used in the determination of the molecular architecture of the NPC [32]. Each experiment
identified a set of proteins in a subcomplex, without revealing their stoichiometry or specific physical
interactions (except for binary complexes). Such ambiguity needs to be handled without over-interpretation
of the data, especially when multiple copies of the same protein type are present in the studied complex
(eg, symmetric complexes). We will develop new conditional restraints [1] to properly treat ambiguity
inherent in the data from native MS and chemical cross-linking experiments.

3.3. Approach

A conditional restraint resolves assignment ambiguity by choosing the best scoring assignment of the data
to the system components (eg, residues and proteins) among all possible assignments at each sampling step;
for instance, a conditional restraint might choose the lower scoring of two possible sites in a protein
for a lysine-lysine cross-linker, and then enforce only the restraint corresponding to the single, chosen
site. Thus, when conditional restraints are applied to a system, both assignment of restraints to specific
system components and the structure of the system result from structure optimization.
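
The selection step of a conditional restraint can be sketched as follows. This is an illustrative plain
Python example rather than IMP code: the harmonic form of the score, the particle names, the coordinates,
and the target distance are all assumptions chosen for the example.

```python
import math


def harmonic(distance, target, k=1.0):
    """Harmonic penalty on the deviation of a distance from its target."""
    return 0.5 * k * (distance - target) ** 2


def conditional_restraint(coords, candidate_pairs, target, k=1.0):
    """Score every candidate assignment and enforce only the best scoring one.

    coords          : dict mapping particle name -> (x, y, z)
    candidate_pairs : the possible assignments of one observed cross-link
    Returns the chosen pair and its score for the current configuration.
    """
    best_pair, best_score = None, math.inf
    for i, j in candidate_pairs:
        score = harmonic(math.dist(coords[i], coords[j]), target, k)
        if score < best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score


# One cross-link from lysine K10 to one of two candidate sites (hypothetical names).
coords = {"K10": (0.0, 0.0, 0.0), "K55a": (12.0, 0.0, 0.0), "K55b": (30.0, 0.0, 0.0)}
candidates = [("K10", "K55a"), ("K10", "K55b")]
pair, score = conditional_restraint(coords, candidates, target=11.0)
```

Because the choice is repeated whenever the coordinates change, the assignment can switch between
candidate sites during sampling.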

Protein configuration from native MS. Native MS analyzes intact protein complexes [13-16], resulting in
several types of low-resolution structural information. Most informatively, native MS can measure the
composition and stoichiometry of a family of nested subcomplexes using collision induced dissociation.
This tree of subcomplexes (Fig. 6) provides much more information about how the constituent proteins of
the complex are arranged than simply knowing the identity of each individual subcomplex. In addition,
every path from root to leaf in this tree describes a dissociation pathway; the reverse of this path may
correspond to the order in which the complex was assembled. Scoring a configuration based on this tree of
subcomplexes requires finding the best scoring set of edges such that all subcomplexes are “induced” by
those edges and each subcomplex is contained in its parent subcomplex in the tree (eg, the AC and BC leaf
sub-complexes must be contained in the ABC subcomplex in Fig. 6). We are borrowing ideas from our DOMINO
optimizer to implement this search efficiently, in a bottom up manner (Fig. 6).
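
One ingredient of such a search, selecting a minimal spanning tree over an allowed connectivity graph for
the current configuration, can be sketched with Kruskal's algorithm. The example below is a plain Python
illustration under stated assumptions: edge weights are simply the current distances between particle
centers, and the particle names and coordinates are invented. It does not cover the expansion of protein
types into particle sets or the merging of parent and child graphs.

```python
import math


def minimum_spanning_tree(coords, allowed_edges):
    """Kruskal's algorithm; edge weight is the current inter-particle distance.

    coords        : dict mapping particle name -> (x, y, z)
    allowed_edges : pairs of particles that may be connected
    Returns the selected edges as (u, v, distance) tuples.
    """
    parent = {name: name for name in coords}

    def find(x):
        # Union-find lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    weighted = sorted((math.dist(coords[u], coords[v]), u, v) for u, v in allowed_edges)
    tree = []
    for weight, u, v in weighted:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two separate components
            parent[root_u] = root_v
            tree.append((u, v, weight))
    return tree


coords = {"A": (0.0, 0.0, 0.0), "B": (1.0, 0.0, 0.0), "C": (5.0, 0.0, 0.0)}
tree = minimum_spanning_tree(coords, [("A", "B"), ("B", "C"), ("A", "C")])
```

The edges of the selected tree are the ones that would then be enforced as distance restraints, and the
selection would be repeated as the configuration changes.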

Collision cross-section from native MS. In addition to subcomplex composition and stoichiometry, one
native MS technique, ion mobility spectrometry [13,99-101], can be used to determine the average
cross-section of a complex (ie, collision cross-section). The amount of time required for an ion, under
the influence of a weak electric field, to move through a mobility chamber filled with neutral gas is
directly proportional to this cross-section. A theoretical collision cross-section for the model will be
calculated by averaging projections of the model using MOBCAL [102]. The IMP collision cross-section
restraint will penalize the deviation of the theoretical value from the experimentally determined
cross-section radius.

Chemical cross-linking. Chemical cross-linking experiments use a variety of reagents to covalently bridge
two spatially proximate residues within the same protein or two interacting proteins [17-20]. The
corresponding cross-linking restraint will account for ambiguity in cross-link position assignment and
potential multiple copies of the same protein in the complex. The restraint, consisting of a single
distance restraint or an explicit atomic representation of the linker, will be conditionally assigned to
the pair of particles that is most consistent with that restraint at each optimization step; depending on
the representation, these particles will be atoms, residues or whole proteins.


Inference-based scoring function. The MS restraints will be encoded in inference-based scoring functions
(Aim 1), in addition to traditional scoring functions. For instance, if the data D consists of a set of
specific lysine pairs connected by linkers of length l, the likelihood can be written as a multiple
uncertainty function:

$$p(D \mid X, \sigma_{ij}, I) = \prod_{ij} \frac{1}{\sqrt{2\pi\sigma_{ij}^{2}}} \exp\left(-\frac{(r_{ij} - l)^{2}}{2\sigma_{ij}^{2}}\right)$$

where $r_{ij}$ is the set of distances between ε-amino groups of all cross-linked lysines. The uncertainty
$\sigma_{ij}$ will account for the flexibility of the linker and the ambiguity of the restraint assignment
through a binary parameter that switches the likelihood term on and off. The prior probability of the
model can be obtained from an energy function, $p(X \mid I) \propto \exp(-E)$, that can be built depending
on the system representation. If we are determining the architecture of a complex where the structures of
the subunits are known, E can be the sum of the excluded volume and surface complementarity scores between
the subunits. Likewise, if we are determining the structure of a protein given the intra-chain cross-link
restraints, E can be the potential energy defined by a molecular mechanics force field.
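
To make the role of the per-pair uncertainties concrete, the sketch below evaluates the negative logarithm
of a Gaussian likelihood with one uncertainty per cross-linked pair. It is a plain Python illustration
under stated assumptions: the Gaussian form, the pair labels, the distances, and the linker length are
chosen for the example, and the binary on/off parameter mentioned in the text is not modeled.

```python
import math


def crosslink_neg_log_likelihood(distances, sigmas, linker_length):
    """Negative log of a product of Gaussian terms, one per cross-linked pair.

    distances     : dict mapping pair -> current distance r_ij in the model
    sigmas        : dict mapping pair -> uncertainty sigma_ij for that pair
    linker_length : expected distance l
    """
    nll = 0.0
    for pair, r in distances.items():
        sigma = sigmas[pair]
        nll += 0.5 * math.log(2.0 * math.pi * sigma ** 2)      # normalization term
        nll += (r - linker_length) ** 2 / (2.0 * sigma ** 2)   # discrepancy term
    return nll


# One consistent pair and one pair that disagrees with the linker length by 10 units.
distances = {("K1", "K2"): 11.0, ("K3", "K4"): 21.0}
tight = crosslink_neg_log_likelihood(
    distances, {("K1", "K2"): 1.0, ("K3", "K4"): 1.0}, linker_length=11.0
)
loose = crosslink_neg_log_likelihood(
    distances, {("K1", "K2"): 1.0, ("K3", "K4"): 10.0}, linker_length=11.0
)
```

Assigning a larger uncertainty to the discrepant pair lowers its contribution to the score, which is how a
multiple uncertainty function can attenuate the influence of an incorrect data point.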

Other data. Additional MS data, including data arising from chemical footprinting [103,104], H/D
exchange [105], and limited proteolysis experiments [106,107], will also be explored as a source of
spatial restraints.

Benchmarking. First, the new restraints will be tested on a synthetic benchmark [63]. We will generate
restraints based on simulated native MS and cross-linking data for known structures of 5 or more subunits,
optimize with those restraints, and determine the accuracy and precision of the resulting ensembles. For
example, restraints from lysine-lysine cross-linking can be generated between all solvent exposed lysine
side-chains whose distance is compatible with linker length. This benchmark will be enriched by generating
distinct composition states, obtained by removing one or more subunits from the complexes. A number of
incorrect restraints can be artificially introduced by randomly linking pairs of lysines in the simulated
data, or generating incorrect subcomplexes for native MS; the benchmark will also be used to assess the
predictive power of the procedure for detection of inconsistent data (Aim 1). Second, the new restraints
will also be tested with real data on assemblies of known structure. For the native MS restraints, we will
specifically focus on RNAPII, the DNA Clamp Loader, and eIF3, for which extensive native MS data are
available from our collaborator [16,108]; in the case of RNAPII, a crystal structure also exists. For the
cross-linking data, we will again benchmark using RNAPII, for which a large set of lysine-lysine
cross-link data is available [18].

Fig. 6. Protein configuration restraints from native MS. The native MS configuration restraint first
builds a tree whose root node is the full complex and whose branches contain nodes representing
subcomplexes identified in the experiments; the most strongly interacting subcomplexes are the leaves of
the tree. The edges represent possible transitions from a larger subcomplex to a smaller one (ie, the
breaking-off of one or more proteins by collision induced dissociation). The restraint recursively
traverses the tree. At each node, the set of protein types specified for a subcomplex (eg, ABC, second
level of the tree) is expanded to all possible sets of particles that satisfy that subcomplex (dashed
boxes). Possible connectivity graphs for each of these particle sets are then generated (green and red
boxes). The graphs for each node are merged with all possible graphs for that node’s children, and
connectivity graphs that are not compatible with the connections required by the child subcomplexes are
eliminated. At the root of the tree, a minimal spanning tree (MST) is generated for each possible
connectivity graph remaining after all merges; the best scoring of these MSTs, given the current
configuration of the complex, is selected, and the edges in the MST are enforced as distance restraints.
The generation of MSTs and selection of the best scoring tree is repeated at each configurational sampling
step.

Potential problems and alternative approaches. The accuracy of structure determination from MS data may be
affected by the presence of false positives as well as compositional and configurational heterogeneity of
the sample. To address these issues, the MS data will be expressed in the Bayesian scoring function
(Aim 1).

4. Aim 4: Interface IMP with other packages

IMP provides a toolbox of components that can be used to construct an integrative modeling protocol. We
propose to augment this toolbox by developing interfaces to other packages, in particular structure
viewers such as Chimera [21], structure prediction and design programs such as Rosetta [22], and web
portals such as the Protein Model Portal (http://proteinmodelportal.org/) [23,24].

4.1. Significance

Many of the problems encountered in integrative modeling have been addressed by other software packages.
Rather than reinvent the wheel, we would like to add tools to the IMP toolbox that expose capabilities of
external software within IMP as well as make the IMP toolbox available within other software packages. For
example, MODELLER can generate comparative structure models of individual proteins or complexes [52],
Rosetta [22] has been carefully tuned for a variety of modeling and docking problems at the atomic
resolution, Chimera visualizes protein and complex structures as well as several types of experimental
data (eg, EM density maps) [21], and Protein Model Portal serves as a convenient unified access point to
comparative models from a large number of different sources [23,24]. Stable, well-documented, and
efficient interfaces between IMP and these packages will allow a user to substantially advance the scope
of biological applications that can be addressed efficiently. In addition, the experience gained and code
written developing those interfaces will aid other developers to link further packages with IMP. The
interfaces as well as the existence of users who pick and choose multiple tools will encourage dialogue,
ultimately leading to improvements in all packages. Particularly significant linkages are those that
provide simpler interfaces to IMP by using visually-oriented packages such as Chimera and web portals such
as the Protein Model Portal; these interfaces will help make IMP more approachable for non-programmer
users.

4.2. Innovation

In contrast to many other software tools, IMP has been designed from the bottom up so that it can be
easily interfaced with other packages. The key design aspects are (i) the modular structure that allows
new functionality to be built and distributed without modifying (or even obtaining) the IMP source code;
(ii) the simplicity of the basic concepts that make it easy to use outside methods (eg, new data types are
introduced by writing a scoring function (a restraint) that merely has to return a score and derivatives
for a given model); and (iii) the dual Python and C++ interfaces that allow IMP to easily interact with
the majority of other software packages.
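
The idea that a restraint only needs to return a score and derivatives for a given model can be
illustrated with a small stand-alone class. This sketch is plain Python and deliberately does not use or
imitate the IMP API: the class name, the `evaluate` method, and the dictionary-based model are hypothetical.

```python
import math


class HarmonicDistanceRestraint:
    """Toy restraint on the distance between two particles of a model."""

    def __init__(self, i, j, target, k=1.0):
        self.i, self.j, self.target, self.k = i, j, target, k

    def evaluate(self, coords):
        """Return (score, derivatives) for coords: dict name -> (x, y, z)."""
        delta = [a - b for a, b in zip(coords[self.i], coords[self.j])]
        dist = math.sqrt(sum(d * d for d in delta))
        score = 0.5 * self.k * (dist - self.target) ** 2
        if dist == 0.0:
            # The direction is undefined for coincident particles; report no gradient.
            zero = (0.0, 0.0, 0.0)
            return score, {self.i: zero, self.j: zero}
        factor = self.k * (dist - self.target) / dist
        grad_i = tuple(factor * d for d in delta)  # d(score)/d(position of i)
        grad_j = tuple(-g for g in grad_i)         # equal and opposite for j
        return score, {self.i: grad_i, self.j: grad_j}


restraint = HarmonicDistanceRestraint("a", "b", target=2.0, k=2.0)
score, derivatives = restraint.evaluate({"a": (0.0, 0.0, 0.0), "b": (3.0, 0.0, 0.0)})
```

Any sampler or optimizer that consumes a score and its derivatives could use such an object without
knowing what kind of data it encodes.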

4.3. Approach

We have several active collaborations with the developers of complementary packages and web portals (eg,
Collaboration Letters). We propose to continue these efforts, formalizing and making robust the nascent
interfaces, to later serve as examples for interfacing IMP with other software.

Chimera. We are working with the group of T. Ferrin of the Computer Graphics Laboratory at UCSF to
integrate IMP with the Chimera visualization software (Fig. 7). For example, the latest version of Chimera
uses the IMP SAXS module to generate the SAXS profile for the currently loaded Chimera model, and to
display it fitted against the experimental profile [11,61]; and the IMP MultiFit module to fit multiple
protein structures into an EM map of their assembly [7,10,12,46,109].

We will implement and improve visualization of coarse-grained models, ensembles of models, as well as
spatial restraints and their violations. Ensembles of models will be visualized by probability densities
or other representations that give a visual indication of the dispersion of the ensemble. We will also
explore visualization of the spatial restraints and the optimization process itself. Such visual feedback
may give an indication of the regions of the model that are restrained by inconsistent data (Aim 1) or
need more data to be well resolved. To support this work, we will develop a file format for hierarchical
and/or coarse-grained structures as well as additional markup and annotations of the structure. Once the
format is developed, we will publish an open source library that is able to read and write it, so that it
can be easily incorporated into other visualization and modeling packages.

We will ensure stability of the interface between IMP and Chimera by meeting regularly with the Chimera
developers, adding IMP-specific test cases to the Chimera test suite, and processing bug reports from the
Chimera user community. We will provide the Chimera developers with copious IMP examples to assist in the
development of more coarse-grained visualization methods.

Rosetta. Rosetta is a state-of-the-art package that is efficient at sampling protein conformational space
and accurately scoring individual poses at the atomic resolution [22]. It has been successfully applied to
de novo protein structure prediction [110], high-resolution modeling and refinement [90,111], flexible
protein docking [112], as well as design of proteins [113] and protein-protein complexes with altered
affinity and specificity [114,115]. We will develop, test and disseminate a C++ interface between IMP and
Rosetta that allows components from both programs to be mixed and matched; for example, this would allow a
Rosetta user to employ an IMP statistical scoring function for assessment and prediction of protein
structures [54] or an IMP user to benefit from Rosetta’s high-resolution backbone and sidechain modeling.

Protein Model Portal. We will develop a range of new IMP web services (Aim 5) and link them to web
resources such as Protein Model Portal [23,24]. For example, we will provide a service to assess protein
structure models from Protein Model Portal with IMP by using our multivariate assessment
criteria [51,116-118].

5. Aim 5: Support the IMP developer and user communities

To maximize the usefulness of IMP, we will deliver a robust, well-tested, and well-documented product. We
will work closely with the community, providing high-level usage documentation, demonstrating IMP at
workshops, and hosting select users at UCSF. Literature and input files related to our collaborative
research projects with experimental biologists will also be made available to illustrate the use of the
software.

5.1. Significance

Since the release of IMP as open source software in March 2010, the community has grown rapidly. The IMP
1.0 software has over 300 registered users from all around the world, and the IMP mailing lists have
received several hundred posts. Continued growth requires continued active support. Well-informed users of
the software can more rapidly tackle biological systems that are amenable to existing methods implemented
in IMP, or even become IMP developers themselves and add support for novel methods. This community
development will lead to solution of new classes of structural modeling problems that no single lab can
approach alone.

5.2. Innovation

Through a combination of documentation and user support, we will encourage a workflow where a user
gradually moves to more powerful IMP tools. For example, they could be introduced to IMP via a web
interface or command line tool, and then adapt existing examples, using IMP building blocks and ad hoc
Python scripting, to add a new data source. They might then use the C++ interface to implement a scoring
function for that source. This support can eventually be formalized by adding a new module to IMP. IMP is
designed and supported to maximize its utility to other developers and users, not only those in our
research group.

5.3. Approach

Progress Report. The IMP code and information about it can be found at
http://www.integrativemodeling.org/. The web site provides a technical introduction, a tutorial, a variety
of examples, nightly tests, user and developer email lists, a wiki, and a bug tracker. The kernel and
extension modules that we developed during the initial funding period are documented using the Doxygen
package; we also include development policies, style guides, code examples, and compilation instructions
using the same system. To maximize code correctness, we also include a set of unit tests, which are run as
part of a nightly build system on several platforms. The IMP source code is stored in a Subversion
repository at http://svn.salilab.org/imp/ for easy integration of changes from multiple developers. The
source code is released under the terms of the GNU Lesser General Public License (LGPL) or GPL. In March
2010, we released the first stable version of IMP, 1.0, which was subject to much more testing to ensure
robustness, and for which we also provide binaries so that users can quickly get started with the
software, without needing to check out the code, compile it, and run tests.

Fig. 7. Chimera-IMP interface examples.

We have set up infrastructure to support users of IMP from outside of our lab. This includes discussion
lists, Bugzilla bug tracking, and a wiki. To encourage other research groups to add their own IMP modules,
we include an "example" module with documentation, and a set of scripts to easily set up a new module.

We have already hosted a number of visitors who were interested in learning about IMP (Davide Bau, Yannick
Spill, Emidio Capriotti, Joerg Gsponer, Ben Schwarz, Michael Nilges, Torsten Schwede, Ansgar Philippsen,
and Amrita Roy Choudhury). We also demonstrated IMP at external workshops (Jan 2010 NCMI Cryo-EM workshop
at Baylor College; Jun 2010 International School of Crystallography workshop in Erice, Italy; Jun 2010
Collaborative Computational Project for Biomolecular Simulation workshop in London, UK). In addition, we
have contributed two tutorials on using the IMP toolkit [26,109]. A number of other labs are already using
and developing IMP, including W. Baumeister at MPI München, M. Nilges at Pasteur Institute, G. Church at
Harvard, A. Dejaegere at IGBMC, M. Marti-Renom at CIPF, and F. Alber at USC. For example, Marti-Renom's
lab successfully used IMP in combination with chromosome conformation capture carbon copy (5C) data to
generate 3D models of chromatin at the megabase resolution [28]. With assistance from the IMP community
and some time spent in the Sali lab, they are now developing their own ‘5C’ IMP module to allow others to
generate similar models.

Future work.

In the next funding period, to continue to improve IMP’s robustness, we will (i) add to our existing automated test suite, (ii) use code coverage tools to report on the fraction of IMP C++ and Python code that is exercised by our existing tests, and (iii) add more unit tests to attempt to exercise all code paths. These tests will be supplemented with static code analysis tools (such as clang; http://clang-analyzer.llvm.org/) that attempt to find coding errors by examining the C++ source code, and more rigorous debugging runs within the Valgrind virtual machine (http://valgrind.org/). We will extend our build system to allow users that want to build the software on other platforms, such as cloud computing systems, supercomputers, or GPU-accelerated systems, to also test IMP automatically, by deploying the BuildBot open source system (http://trac.buildbot.net/).
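The kind of unit test described above can be sketched as follows. This is a minimal illustration only: `harmonic_score` and `distance` are hypothetical stand-ins for a simple distance restraint and are not part of the IMP API; the test class uses Python's standard `unittest` module.

```python
import math
import unittest


def distance(a, b):
    """Euclidean distance between two 3D points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def harmonic_score(d, mean, k=1.0):
    """Hypothetical restraint score: 0.5 * k * (d - mean)^2.

    Not an IMP function; it only illustrates a scoring term
    that is cheap to test exhaustively.
    """
    return 0.5 * k * (d - mean) ** 2


class HarmonicScoreTests(unittest.TestCase):
    def test_zero_at_mean(self):
        # A satisfied restraint should contribute no penalty.
        self.assertEqual(harmonic_score(5.0, 5.0), 0.0)

    def test_symmetric_about_mean(self):
        # Equal deviations on either side of the mean score the same.
        self.assertAlmostEqual(harmonic_score(4.0, 5.0),
                               harmonic_score(6.0, 5.0))

    def test_score_from_coordinates(self):
        # A 3-4-5 triangle gives a distance of exactly 5.
        d = distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
        self.assertAlmostEqual(harmonic_score(d, 5.0), 0.0)
```

Saved as a test module, such a file could be run with `python -m unittest`, and a coverage tool such as coverage.py could then report which lines the tests exercised; the choice of tool here is an assumption, since the text does not name one.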

IMP’s flexibility can be overwhelming for a new user. Thus, we will continue to develop IMP applications that provide a simplified user interface to a subset of IMP functionality. For example, we have already developed a command line and web interface (FoXS, http://salilab.org/foxs/ [61]) to the SAXS module, which fits a PDB structure against an experimental SAXS profile. We will continue to develop such applications, for instance to use the new features in the MultiFit module [7, 10, 12, 46, 109], or to process MS data (Aim 3), and supplement them with interfaces to other software packages or tie-ins to web portals (Aim 4). We will also make stable releases of the IMP software every six months.
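To make concrete what fitting a structure against an experimental SAXS profile involves, the following sketch compares a computed intensity profile with a measured one using a common chi-type goodness-of-fit measure with a least-squares scale factor. It is an illustration under stated assumptions, not the FoXS implementation: the function names are hypothetical, the computed profile is taken as given, and any further fitting parameters used by the real tool are ignored.

```python
import math


def fit_scale(i_exp, i_calc, sigma):
    """Scale factor c minimizing sum(((i_exp - c * i_calc) / sigma)^2).

    Setting the derivative with respect to c to zero gives
    c = sum(i_exp * i_calc / sigma^2) / sum(i_calc^2 / sigma^2).
    """
    num = sum(e * m / s ** 2 for e, m, s in zip(i_exp, i_calc, sigma))
    den = sum(m * m / s ** 2 for m, s in zip(i_calc, sigma))
    return num / den


def chi_fit(i_exp, i_calc, sigma):
    """Return (c, chi) for a computed profile against an experimental one.

    chi = sqrt(1/N * sum(((i_exp - c * i_calc) / sigma)^2)),
    where N is the number of points in the profile.
    """
    c = fit_scale(i_exp, i_calc, sigma)
    total = sum(((e - c * m) / s) ** 2
                for e, m, s in zip(i_exp, i_calc, sigma))
    return c, math.sqrt(total / len(i_exp))


# Hypothetical intensities sampled at the same scattering angles.
computed = [10.0, 8.0, 5.0, 2.0]
measured = [20.5, 15.8, 10.1, 3.9]
errors = [0.5, 0.4, 0.3, 0.2]
scale, chi = chi_fit(measured, computed, errors)
print(f"scale = {scale:.3f}, chi = {chi:.3f}")
```

A chi value near 1 would indicate that the computed profile agrees with the data to within the stated experimental errors; much larger values suggest the structure does not explain the profile.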

We will bridge the gap between simple usage examples and ‘real’ biological modeling problems by providing more fully worked examples using IMP with multiple sources of data, allowing the models to be reproduced. To educate users and developers, we will continue to demonstrate IMP at courses as well as national and international workshops, and host visitors in our lab. We will host workshops of our own if there is demand. This face-to-face discussion, when added to discussions on the IMP mailing lists and wiki, should help us to identify and correct deficiencies in IMP. Finally, we will explore preparing video tutorials for frequent uses of IMP and deposit them on YouTube (http://youtube.com).

IMP’s modular and open source design makes it straightforward for external labs to add new extension modules without being under the strict control of the core IMP developers. Such modules could be released under any license or even kept in-house. To maximize the exposure of these externally developed modules, we will link to them from the IMP website, provide Subversion hosting, or even accept the module into the core IMP distribution if both parties agree and adopt a suitable open source license.


In conclusion, we propose to continue the development of the IMP package, to support integrative structure determination for maximizing the accuracy, precision, coverage, and efficiency of structural characterization of macromolecular assemblies. This broad goal will be achieved by implementing inference-based scoring functions, divide-and-conquer sampling algorithms, conversion of often ambiguous, incomplete, and erroneous MS data into explicit spatial restraints, interfacing IMP with other software packages, and support of other developers and users of IMP.

References cited


1. Alber F, Dokudovskaya S, Veenhoff L, Zhang W, Kipper J, Devos D, Suprapto A, Karni-Schmidt O, Williams R, Chait B, Rout M, Sali A. Determining the architectures of macromolecular assemblies. Nature 450, 683-94, 2007.

2. Rieping W, Habeck M, Nilges M. Inferential structure determination. Science (New York, NY) 309, 303-6, 2005.

3. Alber F, Forster F, Korkin D, Topf M, Sali A. Integrating diverse data for structure determination of macromolecular assemblies. Annu Rev Biochem 77, 443-77, 2008.

4. Sali A, Glaeser R, Earnest T, Baumeister W. From words to literature in structural proteomics. Nature 422, 216-25, 2003.

5. Robinson C, Sali A, Baumeister W. The molecular sociology of the cell. Nature 450, 973-82, 2007.

6. Russel D, Lasker K, Phillips J, Schneidman-Duhovny D, Velazquez-Muriel J, Sali A. The structural dynamics of macromolecular processes. Curr Opin Cell Biol 21, 97-108, 2009 PMCID: PMC2774249.

7. Lasker K, Phillips JL, Russel D, Velazquez-Muriel J, Schneidman-Duhovny D, Webb B, Schlessinger A, Sali A. Integrative Structure Modeling of Macromolecular Assemblies from Proteomics Data. Mol Cell Proteomics 9, 1689-702, 2010 PMCID: PMC2938050.

8. Russel D, Lasker K, Webb B, Velazquez-Muriel J, Schneidman-Duhovny D, Tjioe E, Sali A. Putting the pieces together: integrative structure determination of macromolecular assemblies. submitted.

9. Forster F, Lasker K, Nickell S, Sali A, Baumeister W. Toward an integrated structural model of the 26S proteasome. Mol Cell Proteomics 9, 1666-77, 2010 PMCID: PMC2938054.

10. Lasker K, Sali A, Wolfson HJ. Determining macromolecular assembly structures by molecular docking and fitting into an electron density map. Proteins: Struct Funct Bioinform 78, 3205-11, 2010 PMCID: PMC2952722.

11. Schneidman-Duhovny D, Hammel M, Sali A. Macromolecular docking restrained by a small angle X-ray scattering profile. J Struct Biol 3, 461-71, 2011 PMCID: PMC3040266.

12. Lasker K, Topf M, Sali A, Wolfson H. Inferential optimization for simultaneous fitting of multiple components into a cryoEM map of their assembly. J Mol Biol 388, 180-94, 2009 PMCID: PMC2680734.

13. Heck A. Native mass spectrometry: a bridge between interactomics and structural biology. Nature methods 2008.

14. Benesch JLP, Robinson CV. Mass spectrometry of macromolecular assemblies: preservation and dissociation. Current opinion in structural biology 16, 245-51, 2006.

15. Robinson C. When proteomics meets structural biology. Trends in biochemical sciences 2010.

16. Taverner T, Hernández H, Sharon M, Ruotolo BT, Matak-Vinković D, Devos D, Russell RB, Robinson CV. Subunit architecture of intact protein complexes from mass spectrometry and homology modeling. Acc Chem Res 41, 617-27, 2008.

17. Fabris D. MS analysis of nucleic acids in the post-genomic era. Analytical chemistry 2011.

18. Rappsilber J. The beginning of a beautiful friendship: cross-linking/mass spectrometry and modelling of proteins and multi-protein complexes. Journal of structural biology 173, 530-40, 2011.

19. Chu F, Baker P, Burlingame A. Finding chimeras: a bioinformatics strategy for identification of cross-linked peptides. Molecular & Cellular … 2010.

20. Leitner A, Walzthoeni T, Kahraman A, Herzog F, Rinner O, Beck M, Aebersold R. Probing native protein structures by chemical cross-linking, mass spectrometry, and bioinformatics. Molecular & cellular proteomics: MCP 9, 1634-49, 2010.

21. Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE. UCSF Chimera--a visualization system for exploratory research and analysis. J Comput Chem 25, 1605-12, 2004.

22. Das R, Baker D. Macromolecular modeling with rosetta. Annu Rev Biochem 77, 363-82, 2008.

23. Arnold K, Kiefer F, Kopp J, Battey JN, Podvinec M, Westbrook JD, Berman HM, Bordoli L, Schwede T. The Protein Model Portal. J Struct Funct Genomics 10, 1-8, 2009 PMCID: 2704613.

24. Schwede T, Sali A, Honig B, Levitt M, Berman H, Jones D, Brenner S, Burley S, Das R, Dokholyan N, Dunbrack RJ, Fidelis K, Fiser A, Godzik A, Huang Y, Humblet C, Jacobson M, Joachimiak A, Krystek SJ, Kortemme T, Kryshtafovych A, Montelione G, Moult J, Murray D, Sanchez R, Sosnick T, Standley D, Stouch T, Vajda S, Vasquez M, Westbrook J, Wilson I. Outcome of a workshop on applications of protein models in biomedical research. Structure 17, 151-9, 2009 PMCID: PMC2739730.

25. Alber F, Chait BT, Rout MP, Sali A. Integrative Structure Determination of Protein Assemblies by Satisfaction of Spatial Restraints. In: Panchenko A, Przytycka T, editors. Protein-protein interactions and networks: identification, characterization and prediction. London, UK: Springer-Verlag; 2008. p. 99-114.

26. Webb B, Lasker K, Schneidman-Duhovny D, Tjioe E, Phillips J, Kim SJ, Velazquez-Muriel J, Russel D, Sali A. Modeling of Proteins and their Assemblies with the Integrative Modeling Platform. Methods in Molecular Biology, in press: Humana Press.

27. Simon B, Madl T, Mackereth CD, Nilges M, Sattler M. An efficient protocol for NMR-spectroscopy-based structure determination of protein complexes in solution. Angew Chem Int Ed Engl 49, 1967-70, 2010.

28. Bau D, Sanyal A, Lajoie BR, Capriotti E, Byron M, Lawrence JB, Dekker J, Marti-Renom MA. The three-dimensional folding of the alpha-globin gene domain reveals formation of chromatin globules. Nat Struct Mol Biol 18, 107-14, 2011 PMCID: 3056208.

29. Fotin A, Cheng Y, Grigorieff N, Walz T, Harrison SC, Kirchhausen T. Structure of an auxilin-bound clathrin coat and its implications for the mechanism of uncoating. Nature 432, 649-53, 2004.

30. Bayrhuber M, Meins T, Habeck M, Becker S, Giller K, Villinger S, Vonrhein C, Griesinger C, Zweckstetter M, Zeth K. Structure of the human voltage-dependent anion channel. Proceedings of the National Academy of Sciences of the United States of America 105, 15370-5, 2008 PMCID: 2557026.

31. Zhou M, Robinson CV. When proteomics meets structural biology. Trends in biochemical sciences 35, 522-9, 2010.

32. Alber F, Dokudovskaya S, Veenhoff L, Zhang W, Kipper J, Devos D, Suprapto A, Karni-Schmidt O, Williams R, Chait B, Sali A, Rout M. The molecular architecture of the nuclear pore complex. Nature 450, 695-701, 2007.

33. Devos D, Dokudovskaya S, Williams R, Alber F, Eswar N, Chait BT, Rout MP, Sali A. Simple fold composition and modular architecture of the nuclear pore complex. Proc Natl Acad Sci U S A 103, 2172-7, 2006.

34. DeGrasse JA, DuBois KN, Devos D, Siegel TN, Sali A, Field MC, Rout MP, Chait BT. The Establishment of Nuclear Pore Complex Architecture Occurred Early in Evolution. Mol Cell Proteomics 8, 2119-30, 2009 PMCID: PMC2742445.

35. Wente SR, Rout MP. The nuclear pore complex and nuclear transport. Cold Spring Harb Perspect Biol 2, a000562, 2010.

36. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res 28, 235-42, 2000 PMCID: 102472.

37. Ulrich EL, Akutsu H, Doreleijers JF, Harano Y, Ioannidis YE, Lin J, Livny M, Mading S, Maziuk D, Miller Z, Nakatani E, Schulte CF, Tolmie DE, Kent Wenger R, Yao H, Markley JL. BioMagResBank. Nucleic Acids Res 36, D402-8, 2008 PMCID: 2238925.

38. Ludtke SJ, Lawson CL, Kleywegt GJ, Berman HM, Chiu W. Workshop on the validation and modeling of electron cryo-microscopy structure of biological nanomachines - Workshop Introduction. Pac Symp Biocomput 369-73, 2011.

39. Mesirov JP. Computer science. Accessible reproducible research. Science 327, 415-6, 2010.

40. Barnes N. Publish your computer code: it is good enough. Nature 467, 753, 2010.

41. Merali Z. Computational science: ...Error. Nature 467, 775-7, 2010.

42. Topf M, Baker M, John B, Chiu W, Sali A. Structural characterization of components of protein assemblies by comparative modeling and electron cryo-microscopy. J Struct Biol 149, 191-203, 2005.

43. Topf M, Baker ML, Marti-Renom MA, Chiu W, Sali A. Refinement of protein structures by iterative comparative modeling and CryoEM density fitting. J Mol Biol 357, 1655-68, 2006.

44. Topf M, Lasker K, Webb B, Wolfson H, Chiu W, Sali A. Protein structure fitting and refinement guided by cryo-EM density. Structure 16, 295-307, 2008 PMCID: PMC2409374.

45. Topf M, Sali A. Combining electron microscopy and comparative protein structure modeling. Curr Opin Struct Biol 15, 578-85, 2005.

46. Tjioe E, Lasker K, Webb B, Wolfson H, Sali A. Multifit: A web server for fitting multiple protein structures into their electron microscopy density map. Nucleic Acids Res in press.

47. Schneidman-Duhovny D, Rossi A, Avila-Sakar A, Kim SJ, Velazquez-Muriel J, Strop P, Rajpal A, Krukenberg K, Liao M, Kim H, Sobhanifar S, Dotsch V, Agard D, Cheng Y, Sali A. Integrative Structure Determination of Binary Protein Complexes. submitted.

48. Velazquez-Muriel J, Lasker K, Phillip J, Russel D, Schneidman D, Webb B, Sali A. Determination of macromolecular assemblies by satisfaction of restraints from electron microscopy images. submitted.

49. Fernandez-Martinez J, Phillips J, Sekedat M, Diaz-Avalos R, Velazquez-Muriel J, Franke J, Williams R, Stokes D, Chait B, Sali A, Rout M. Structure-function Map for a Heptameric Component of the Nuclear Pore Complex. submitted.

50. Brooks BR, Brooks CL, 3rd, Mackerell AD, Jr., Nilsson L, Petrella RJ, Roux B, Won Y, Archontis G, Bartels C, Boresch S, Caflisch A, Caves L, Cui Q, Dinner AR, Feig M, Fischer S, Gao J, Hodoscek M, Im W, Kuczera K, Lazaridis T, Ma J, Ovchinnikov V, Paci E, Pastor RW, Post CB, Pu JZ, Schaefer M, Tidor B, Venable RM, Woodcock HL, Wu X, Yang W, York DM, Karplus M. CHARMM: the biomolecular simulation program. J Comput Chem 30, 1545-614, 2009 PMCID: 2810661.

51. Pieper U, Webb BM, Barkan DT, Schneidman-Duhovny D, Schlessinger A, Braberg H, Yang Z, Meng EC, Pettersen EF, Huang CC, Datta RS, Sampathkumar P, Madhusudhan MS, Sjolander K, Ferrin TE, Burley SK, Sali A. ModBase, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Res 39, 465-74, 2011 PMCID: PMC3013688.

52. Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234, 779-815, 1993.

53. Eswar N, John B, Mirkovic N, Fiser A, Ilyin VA, Pieper U, Stuart AC, Marti-Renom MA, Madhusudhan MS, Yerkovich B, Sali A. Tools for comparative protein structure modeling and analysis. Nucleic Acids Res 31, 3375-80, 2003.

54. Shen MY, Sali A. Statistical potential for assessment and prediction of protein structures. Protein Sci 15, 2507-24, 2006.

55. Melo F, Sanchez R, Sali A. Statistical potentials for fold assessment. Protein Sci 11, 430-48, 2002.

56. Fan H, Schneidman D, Irwin JJ, Dong G, Shoichet B, Sali A. Statistical Potential for Modeling and Ranking Protein-Ligand Interactions. submitted.

57. Korkin D, Davis F, Sali A. Localization of protein-binding sites within families of proteins. Protein Sci 14, 2350-60, 2005.

58. Korkin D, Davis FP, Alber F, Luong T, Shen MY, Lucic V, Kennedy MB, Sali A. Structural modeling of protein interactions by analogy: application to PSD-95. PLoS Computational Biology 2, e153, 2006.

59. Davis FP, Barkan DT, Eswar N, McKerrow JH, Sali A. Host pathogen protein interactions predicted by comparative modeling. Protein Sci 16, 2585-96, 2007.

60. Davis FP, Braberg H, Shen MY, Pieper U, Sali A, Madhusudhan MS. Protein complex compositions predicted by structural similarity. Nucleic Acids Res 34, 2943-52, 2006.

61. Schneidman-Duhovny D, Hammel M, Sali A. FoXS: A Web Server for Rapid Computation and Fitting of SAXS Profiles. Nucleic Acids Res 38, 541-4, 2010 PMCID: PMC2896111.

62. Forster F, Webb B, Krukenberg KA, Tsuruta H, Agard DA, Sali A. Integration of small-angle X-ray scattering data into structural modeling of proteins and their assemblies. J Mol Biol 382, 1089-106, 2008 PMCID: PMC2745287.

63. Alber F, Kim M, Sali A. Structural characterization of assemblies from overall shape and subcomplex compositions. Structure 13, 435-45, 2005.

64. Brooks CL, Karplus M, Pettitt BM. Proteins: a theoretical perspective of dynamics, structure, and thermodynamics. New York: J. Wiley; 1988.

65. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equation of state calculation by fast computing machines. Journal of Chemical Physics 21, 1087-92, 1953.

66. Hestenes MR, Stiefel E. Methods of Conjugate Gradients for Solving Linear Systems. Journal of Research of the National Bureau of Standards 49, 1952.

67. Polak E. Computational methods in optimization; a unified approach. New York: Academic Press; 1971.

68. Dantzig GB, Thapa MN. Linear programming 1: Introduction. Springer-Verlag; 1997.

69. Nelder J, Mead R. A simplex method for function minimization. The computer journal 1965.

70. Jordan MI. Graphical models. Stat Sci 19, 140-55, 2004.

71. Taylor DJ, Devkota B, Huang AD, Topf M, Eswar N, Sali A, Harvey SC, Frank J. Comprehensive Molecular Structure of the Eukaryotic Ribosome. Structure 17, 1591-604, 2009 PMCID: PMC2814252.

72. Chandramouli P, Topf M, Menetret J, Eswar N, Cannone J, Gutell R, Sali A, Akey C. Structure of the mammalian 80S ribosome at 8.7 A resolution. Structure 16, 535-48, 2008 PMCID: PMC2775484.

73. Serysheva I, Ludtke S, Baker M, Cong Y, Topf M, Eramian D, Sali A, Hamilton S, Chiu W. Subnanometer-resolution electron cryomicroscopy-based domain models for the cytoplasmic region of skeletal muscle RyR channel. Proc Natl Acad Sci U S A 105, 9610-5, 2008 PMCID: PMC2474495.

74. Forster F, Lasker K, Beck F, Nickell S, Sali A, Baumeister W. An Atomic Model AAA-ATPase/20S core particle sub-complex of the 26S proteasome. Biochem Biophys Res Commun 388, 228-33, 2009 PMCID: PMC2771176.

75. Nickell S, Beck F, Scheres SHW, Korinek A, Forster F, Lasker K, Mihalache O, Sun N, Nagy I, Sali A, Plitzko J, Carazo J-M, Mann M, Baumeister W. Insights into the Molecular Architecture of the 26S Proteasome. Proc Natl Acad Sci U S A 29, 11943-7, 2009 PMCID: PMC2715492.

76. Krukenberg K, Forster F, Rice L, Sali A, Agard D. Multiple conformations of E. coli Hsp90 in solution: insights into the conformational dynamics of Hsp90. Structure 16, 755-65, 2008 PMCID: PMC2600884.

77. Booth C, Meyer A, Cong Y, Topf M, Sali A, Ludtke S, Chiu W, Frydman J. Mechanism of lid closure in the eukaryotic chaperonin TRiC/CCT. Nat Struct Mol Biol 15, 746-53, 2008 PMCID: PMC2546500.

78. Cong Y, Topf M, Sali A, Matsudaira P, Dougherty M, Chiu W, Schmid M. Crystallographic conformers of actin in a biologically active bundle of filaments. J Mol Biol 375, 331-6, 2008.

79. Fields S. High-throughput two-hybrid analysis. The promise and the peril. The FEBS journal 272, 5391-9, 2005.

80. Habeck M, Rieping W, Nilges M. Weighting of experimental evidence in macromolecular structure determination. Proceedings of the National Academy of Sciences of the United States of America 103, 1756-61, 2006.

81. Levitt M. Refinement of large structures by simultaneous minimization of energy and R factor. Acta Crystallographica Section A: Crystal Physics 1978.

82. Brünger AT. Free R value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature 355, 472-5, 1992.

83. Brünger AT, Clore GM, Gronenborn AM, Saffrich R, Nilges M. Assessing the quality of solution nuclear magnetic resonance structures by complete cross-validation. Science (New York, NY) 261, 328-31, 1993.

84. Jaynes ET, Bretthorst GL. Probability theory: Cambridge Univ Pr; 2003.

85. Sugita Y. Replica-exchange molecular dynamics method for protein folding. Chemical Physics Letters 314, 141-51, 1999.

86. Mitsutake A, Sugita Y, Okamoto Y. Generalized-ensemble algorithms for molecular simulations of biopolymers. Biopolymers 60, 96-123, 2001.

87. Go N. Theoretical studies of protein folding. Annu Rev Biophys Bioeng 12, 183-210, 1983.

88. Lauritzen SL. Graphical models. Oxford, New York: Clarendon Press; Oxford University Press; 1996.

89. Bishop CM. Pattern Recognition and Machine Learning (Information Science and Statistics). 1 ed: Springer; 2007.

90. Mandell DJ, Coutsias EA, Kortemme T. Sub-angstrom accuracy in protein loop reconstruction by robotics-inspired conformational sampling. Nature methods 6, 551-2, 2009 PMCID: 2847683.

91. London N, Movshovitz-Attias D, Schueler-Furman O. The structural basis of peptide-protein binding strategies. Structure 18, 188-99, 2010.

92. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res 34, D535-9, 2006 PMCID: 1347471.

93. Cramer P, Bushnell DA, Fu J, Gnatt AL, Maier-Davis B, Thompson NE, Burgess RR, Edwards AM, David PR, Kornberg RD. Architecture of RNA polymerase II and implications for the transcription mechanism. Science 288, 640-9, 2000.

94. Groll M, Ditzel L, Lowe J, Stock D, Bochtler M, Bartunik HD, Huber R. Structure of 20S proteasome from yeast at 2.4 A resolution. Nature 386, 463-71, 1997.

95. Lowe J, Stock D, Jap B, Zwickl P, Baumeister W, Huber R. Crystal structure of the 20S proteasome from the archaeon T. acidophilum at 3.4 A resolution. Science 268, 533-9, 1995.

96. Bohn S, Beck F, Sakata E, Walzthoeni T, Beck M, Aebersold R, Forster F, Baumeister W, Nickell S. Structure of the 26S proteasome from Schizosaccharomyces pombe at subnanometer resolution. Proc Natl Acad Sci U S A 107, 20992-7, 2010 PMCID: 3000292.

97. Chen L, Pawlikowski B, Schlessinger A, More S, Stryke D, Johns SJ, Portman M, Ferrin TE, Sali A, Giacomini K. Role of organic cation transporter 3 (SLC22A3) and Its missense variants in the pharmacologic action of metformin. Pharmacogenet Genomics 20, 687-99, 2010 PMCID: PMC2976715.

98. Schreiber A, Stengel F, Zhang Z, Enchev RI, Kong EH, Morris EP, Robinson CV, da Fonseca PC, Barford D. Structural basis for the subunit assembly of the anaphase-promoting complex. Nature 470, 227-32, 2011.

99. Eiceman G, Karpas Z, Herbert H Hill J. Ion Mobility Spectrometry: CRC Pr I Llc; 2011.

100. Verbeck G, Ruotolo B, Sawyer H. A fundamental introduction to ion mobility mass spectrometry applied to the analysis of biomolecules. Journal of … 2002.

101. Rostom AA, Robinson CV. Detection of the Intact GroEL Chaperonin Assembly by Mass Spectrometry. J Am Chem Soc 121, 4718-9, 1999.

102. Mesleh M, Hunter J, Shvartsburg A. Structural Information from Ion Mobility Measurements: Effects of the Long-Range Potential. The Journal of … 1997.

103. Brenowitz M. Probing the structural dynamics of nucleic acids by quantitative time-resolved and equilibrium hydroxyl radical 'footprinting'. Current opinion in structural biology 2002.

104. Xu G, Chance MR. Hydroxyl radical-mediated modification of proteins as probes for structural proteomics. Chemical reviews 107, 3514-43, 2007.

105. Wales T. Hydrogen exchange mass spectrometry for the analysis of protein dynamics. Mass spectrometry reviews 2006.

106. Dokudovskaya S, Williams R, Devos D, Sali A. Protease accessibility laddering: a proteomic tool for probing protein structure. Structure (London, England : 1993) 2006.

107. Fontana A, de Laureto P, Spolaore B. Probing protein structure by limited proteolysis. ACTA BIOCHIMICA … 2004.

108. Zhou M, Sandercock AM, Fraser CS, Ridlova G, Stephens E, Schenauer MR, Yokoi-Fong T, Barsky D, Leary JA, Hershey JW, Doudna JA, Robinson CV. Mass spectrometry reveals modularity and a complete subunit interaction map of the eukaryotic translation factor eIF3. Proceedings of the National Academy of Sciences of the United States of America 105, 18139-44, 2008.

109. Lasker K, Velazquez-Muriel J, Webb B, Yang Z, Ferrin TE, Sali A. Macromolecular assembly structures by comparative modeling and electron microscopy. Methods in Molecular Biology, in press.

110. Bradley P, Misura KM, Baker D. Toward high-resolution de novo structure prediction for small proteins. Science 309, 1868-71, 2005.

111. Qian B, Raman S, Das R, Bradley P, McCoy AJ, Read RJ, Baker D. High-resolution structure prediction and the crystallographic phase problem. Nature 450, 259-64, 2007 PMCID: 2504711.

112. Gray JJ, Moughon S, Wang C, Schueler-Furman O, Kuhlman B, Rohl CA, Baker D. Protein-protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. J Mol Biol 331, 281-99, 2003.

113. Kuhlman B, Dantas G, Ireton GC, Varani G, Stoddard BL, Baker D. Design of a novel globular protein fold with atomic-level accuracy. Science 302, 1364-8, 2003.

114. Chevalier BS, Kortemme T, Chadsey MS, Baker D, Monnat RJ, Stoddard BL. Design, activity, and structure of a highly specific artificial endonuclease. Mol Cell 10, 895-905, 2002.

115. Kortemme T, Joachimiak LA, Bullock AN, Schuler AD, Stoddard BL, Baker D. Computational redesign of protein-protein interaction specificity. Nat Struct Mol Biol 11, 371-9, 2004.

116. Eramian D, Eswar N, Shen M, Sali A. How well can the accuracy of comparative protein structure models be predicted? Protein Sci 17, 1881-93, 2008 PMCID: PMC2578807.

117. Eramian D, Shen M, Devos D, Melo F, Sali A, Marti-Renom M. A composite score for predicting errors in protein structure models. Protein Sci 15, 1653-66, 2006.

118. Melo F, Sali A. Fold assessment for comparative protein structure modeling. Protein Sci 16, 2412-26, 2007.