QCDNA 2008, Sept 5, 2008
Rich Brower (Boston U.)
Disconnected Diagrams, Multi-grid, Nvidia & all that
Richard Brower (Boston University)
James Brannick (Penn)
Ron Babich (BU)
Kipton Barros (BU)
Mike Clark (BU)
George Fleming (Yale)
James Osborn (Argonne)
Claudio Rebbi (BU)
WARNING: Much here is a FUTURE plan NOT proven results but .....
Outline
Physics: (How strange is the proton?)
Algorithms: (Multi-grid to the rescue?)
Hardware: (GPU propagator farm?)
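Physics: Disconnected Diagrams
Connected vs. Disconnected
Want matrix element: [expression not recovered]
[Figure: nucleon (N) diagrams with source at t = 0, operator insertion X at t = t', and sink at t = t_f; insertions labeled u,d (connected) and u,d,s (disconnected)]
How strange is the proton?†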
Who cares?
Violation of Standard Model:
Dark Matter (Neutralino scattering):
NuTeV anomaly:
Nucleon Physics (include u/d + s quarks):
iso-scalar Form Factors, nucleon structure function,
Spin crisis for proton, matrix element etc.
† see Lattice 2008: http://conferences.jlab.org/lattice2008/parallel-bytopic-struct.html
S. Collins, G. Bali, A. Schafer, “Hunting for the strangeness ... nucleon”
Takumi Doi et al., “Strangeness and glue in the nucleon from lattice QCD”
Ron Babich et al., “Strange quark content of the nucleon”
Direct detection of dark matter
In SUSY, the neutralino scatters from a nucleon via Higgs exchange.
The strange scalar matrix element is a major uncertainty.
Uncertainty in $f_{Ts}$ gives up to a factor of 4 uncertainty in the cross-section!
Bottino et al., hep-ph/0111229; Ellis et al., hep-ph/0502001
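For orientation, $f_{Ts}$ is conventionally the fraction of the nucleon mass carried by the strange-quark scalar density. The normalization below is the one commonly used in the dark-matter literature; it is an assumption here, since the slide's own formula did not survive.

```latex
f_{Ts} \;=\; \frac{m_s \,\langle N | \bar{s} s | N \rangle}{m_N}
```

The product $m_s\,\bar{s}s$ is renormalization-invariant, so $f_{Ts}$ is as well.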
Nuclear Experiment
Pate et al., arXiv:0805.2889 [hep-ex]
J. Liu et al., arXiv:0706.0226 [nucl-ex] (see also Young et al., nucl-ex/0605010)
Parity-violating electron scattering (SAMPLE, HAPPEx, PVA4, G0)
PVES + BNL E734 (νp scattering)
Monte Carlo update (Long autocorrelation times)
Global Heatbath aka “Stochastic Estimator”: (Zero autocorrelations)
Find $\phi = D^{-1} \eta$ for $\eta$ Gaussian or Gauge or $Z_2$ (Zero autocorrelations!)
With $\langle \eta_y^{\dagger} \eta_x \rangle = \delta_{yx}$
Algorithm: $A_{xy}$
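A minimal sketch of the stochastic estimator described above, assuming real $Z_2$ noise and a small dense toy matrix in place of the Dirac operator. The names `toy_operator`, `solve`, `stochastic_trace` and `exact_trace` are hypothetical, and the 1D Laplacian is only a stand-in for $D$. Each noise vector $\eta$ gives one sample $\eta^{T} D^{-1} \eta$, whose average converges to $\mathrm{Tr}\, D^{-1}$ because $\langle \eta_y \eta_x \rangle = \delta_{yx}$.

```python
import random

def solve(D, b):
    """Solve D x = b by Gaussian elimination with partial pivoting (small dense D)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(D)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = s / M[r][r]
    return x

def toy_operator(n, mass=0.5):
    """Hypothetical stand-in for D: 1D periodic Laplacian plus a mass term."""
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        D[i][i] = 2.0 + mass
        D[i][(i + 1) % n] -= 1.0
        D[i][(i - 1) % n] -= 1.0
    return D

def stochastic_trace(D, n_noise, rng):
    """Average eta^T D^{-1} eta over independent Z2 noise vectors."""
    n = len(D)
    total = 0.0
    for _ in range(n_noise):
        eta = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        phi = solve(D, eta)                      # phi = D^{-1} eta
        total += sum(e * p for e, p in zip(eta, phi))
    return total / n_noise

def exact_trace(D):
    """Reference value: one solve per unit vector (only feasible for a toy problem)."""
    n = len(D)
    tr = 0.0
    for i in range(n):
        e = [0.0] * n
        e[i] = 1.0
        tr += solve(D, e)[i]
    return tr
```

Each sample is drawn independently, so successive estimates carry no autocorrelation. The error comes only from the off-diagonal entries of $D^{-1}$ and falls as the inverse square root of the number of noise vectors.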
Improving Stochastic Estimate
Variance reduction:
Dilution vs hopping parameter† (Short distance)
Multi-grid vs “deflation”/truncation† (Long distance)
Curing volume divergence
Trace versus Gauge fluctuations
Better and more sources (all to all?).
Full multi-grid O(N log N) Trace?
† S. Collins, G. Bali, A. Schafer, “Hunting for the strangeness ... nucleon”
Trace estimation
Two sources of error: gauge noise and error in trace. In this calculation, we largely eliminate the second source by calculating a “nearly exact” trace on four time-slices.
864 sources (x12 for color/spin). A given source is nonzero on 4 sites on each of 4 time-slices.
$4 \times 6^3 = 864$
Minimal spatial separation between sites is [value not recovered]. Small residual contamination is gauge-variant and averages to zero.
Equivalent to using a single stochastic source with “extreme dilution.”
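The effect of dilution can be illustrated on a toy problem. The sketch below is not the lattice layout used in the calculation. It assumes a 1D periodic lattice and a hypothetical matrix `toy_inverse` whose entries decay exponentially with distance, standing in for $D^{-1}$. A noise vector is split into `stride` sub-sources, each nonzero only on sites separated by `stride`, so the only contamination left comes from matrix elements between sites at least `stride` apart.

```python
import math
import random

def toy_inverse(n, decay=1.0):
    """Hypothetical stand-in for D^{-1}: entries fall off with periodic distance."""
    return [[math.exp(-decay * min(abs(x - y), n - abs(x - y)))
             for y in range(n)] for x in range(n)]

def diluted_trace(A, stride, rng):
    """One Z2 trace sample using `stride` diluted sub-sources.

    Sub-source k is nonzero only on sites k, k+stride, k+2*stride, ...
    stride = 1 is the undiluted estimator; stride = n is the exact trace.
    """
    n = len(A)
    total = 0.0
    for k in range(stride):
        sites = range(k, n, stride)
        eta = {x: rng.choice((-1.0, 1.0)) for x in sites}
        # eta^T A eta restricted to the support of this sub-source
        total += sum(eta[x] * A[x][y] * eta[y] for x in sites for y in sites)
    return total
```

With `stride` equal to the lattice size every sub-source is a point source and the result is exact. Intermediate values trade more inversions for smaller contamination, the regime the slide calls “extreme dilution.”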
Preliminary Methods
Configurations were provided by the LHPC “Spectrum Collaboration”
anisotropic lattice with 2 dynamical flavors, Wilson fermion and gauge actions
863 configurations
64 (x12) inversions per configuration at the light quark mass, for the nucleon correlators
864 (x12) inversions per configuration at the strange mass, for the trace
Strange scalar form factor
Ratio approach
Conventionally, one extracts the (e.g. zero-momentum) form factor from the large $t$ behavior of the ratio [expression not recovered] (or from a similar expression integrated over time).
Instead, we fit the numerator directly, since this allows us
to avoid contamination from backward-propagating states, which are problematic due to the short temporal extent of our lattice.
to explicitly take into account the contribution of (forward-propagating) excited states.
In the following, we always treat the system symmetrically with [condition not recovered].
Direct fit
First, we perform a fit to the nucleon two-point function, of the form [expression not recovered].
The coefficients and masses are very well determined, since we are required to calculate correlators from all initial times (a total of 863 x 64 = 55,232).
Next, we perform a fit to the three-point function, [expression not recovered].
Here $j_1$ and $j_2$ are the form factors for the proton and its first excited state, and $j_{12}$ is a transition matrix element between them. In practice, we expect $j_2$ and $j_{12}$ to absorb the contribution of still higher states, and trust only $j_1$ to be reliable.
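The fit functions themselves did not survive on the slide, so the sketch below assumes a generic two-state model and should not be read as the authors' expressions. The two-point function is taken to be a sum of two exponentials, and the three-point function carries a ground-state term ($j_1$), an excited-state term ($j_2$) and a transition term ($j_{12}$). The function names and the normalization of the transition term are hypothetical. Under this model the ratio of three-point to two-point function approaches $j_1$ only slowly, which is one motivation for fitting the numerator directly.

```python
import math

def two_point(t, A1, m1, A2, m2):
    """Assumed two-state form of the nucleon two-point function."""
    return A1 * math.exp(-m1 * t) + A2 * math.exp(-m2 * t)

def three_point(t, tp, A1, m1, A2, m2, j1, j2, j12):
    """Assumed two-state form: source at 0, operator at tp, sink at t."""
    ground = j1 * A1 * math.exp(-m1 * t)
    excited = j2 * A2 * math.exp(-m2 * t)
    # transition terms: one state propagates before the operator, the other after
    mixed = j12 * math.sqrt(A1 * A2) * (
        math.exp(-m1 * tp - m2 * (t - tp)) + math.exp(-m2 * tp - m1 * (t - tp)))
    return ground + excited + mixed

def ratio(t, p):
    """Ratio estimator with the operator placed midway between source and sink."""
    two = two_point(t, p["A1"], p["m1"], p["A2"], p["m2"])
    return three_point(t, t / 2, **p) / two

# Illustrative parameters only (not fitted values from the talk).
params = dict(A1=1.0, m1=0.5, A2=0.8, m2=1.0, j1=2.0, j2=1.0, j12=0.5)
```

At short source-sink separations the ratio is visibly contaminated by the $j_2$ and $j_{12}$ terms. A direct fit of the three-point function with masses and amplitudes fixed from the two-point fit is linear in $j_1$, $j_2$ and $j_{12}$.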
Strange scalar form factor
For the renormalization-invariant quantity $f_{Ts}$, we estimate [value not recovered], where we have inserted the physical nucleon mass. The first error is statistical, the second is the uncertainty in relating this mass to the lattice scale, and no other systematics are included.
Note that the matrix element in the numerator was calculated for a world with a 400 MeV pion. If we work consistently in such a world by inserting our calculated nucleon mass, the scale dependence drops out, and we find [value not recovered].
Momentum dependence of $G_S(q^2)$
PRELIMINARY
Strange axial form factor
PRELIMINARY
Results have not been renormalized.
Calculated value is distinct from zero at the 3-σ level.
Error = $O(L^{3/2})$ as $L^3 \to \infty$
For Exact Trace in a Connected correlator,
[Figure: correlator with source at t = 0, operator insertion X at t = t', and sink at t = t_f]
Most Important New Trick:
Multi-grid Variance Reduction
The signal and variance of the first term are down by 1 to 2 orders of magnitude because $D_c \sim D$
The Coarse level Trace for $D_c^{-1}$ is as cheap to calculate as the level down operator inverse.
This can of course be done recursively giving (I think) an O(N log N) trace calculation to fixed tolerance.
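The two-level idea can be sketched on a toy problem. The splitting assumed here is $\mathrm{Tr}\,D^{-1} = \mathrm{Tr}\,[D^{-1} - P D_c^{-1} P^{T}] + \mathrm{Tr}\,[D_c^{-1} P^{T} P]$, with a Galerkin coarse operator $D_c = P^{T} D P$. The exact form on the slide was not recovered, so this splitting, the piecewise-constant prolongator `prolongator` and the 1D operator `toy_operator` are all assumptions made for illustration. Only the first term needs a stochastic estimate, and with $Z_2$ noise its variance is twice the sum of squared off-diagonal entries, which `offdiag_sq` measures.

```python
def invert(M):
    """Invert a small dense matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def toy_operator(n, mass_sq=0.01):
    """Hypothetical stand-in for D: 1D periodic Laplacian plus a small mass term."""
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        D[i][i] = 2.0 + mass_sq
        D[i][(i + 1) % n] -= 1.0
        D[i][(i - 1) % n] -= 1.0
    return D

def prolongator(n, block=2):
    """Piecewise-constant aggregation: coarse site j covers `block` adjacent fine sites."""
    return [[1.0 if i // block == j else 0.0 for j in range(n // block)]
            for i in range(n)]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

def offdiag_sq(M):
    """Sum of squared off-diagonal entries (half the Z2 variance for symmetric M)."""
    n = len(M)
    return sum(M[i][j] ** 2 for i in range(n) for j in range(n) if i != j)

n = 32
D = toy_operator(n)
P = prolongator(n)
Pt = transpose(P)
Dc_inv = invert(matmul(Pt, matmul(D, P)))      # inverse of the Galerkin coarse operator
A = invert(D)                                  # exact fine inverse, for checking only
B = matmul(P, matmul(Dc_inv, Pt))              # coarse approximation to D^{-1}
diff = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
coarse_trace = trace(matmul(Dc_inv, matmul(Pt, P)))  # cheap: lives on the coarse grid
```

In this toy the coarse operator reproduces the near-zero mode exactly, so the difference term has far smaller off-diagonal weight than $D^{-1}$ itself. Applying the same splitting to $D_c$ gives the recursive scheme mentioned above.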
HARDWARE
Graphics hardware is well suited to highly parallel numerical tasks.
Hardware vendors provide development tools to support high performance computing.
NVIDIA's CUDA offers direct access to graphics hardware through a programming language similar to C.
Dirac-Wilson operator runs at an effective 68 Gigaflops on the Tesla C870 GPU.
The recently released GTX 280 GPU runs at 92 Gigaflops, and we expect improvement pending code optimization.
(Now 98 Gigaflops; hope to get O(150) Gigaflops.)
Nvidia GPU architecture
Two Generations: Consumer vs HPC GPUs
Consumer cards ⇒ High Performance (HPC) GPUs
I. 8800 GTX ⇒ Tesla C870 (16 multi-processors with 8 cores each)
II. GTX 280 ⇒ Tesla C1060 (30 multi-processors with 8 cores each)
C870 code using 60% of the memory bandwidth.
http://www.scala-lang.org/
Future software Plans
Need to find out why we are only saturating 60% of memory bandwidth
Further reduce memory traffic:
8 real numbers per SU(3) matrix (2/3 of the 12 used now)
share spinors in $4^3$ blocks (5/9 of that used now)
Generalize to clover Wilson & Domain Wall operator (slightly better flops/mem ratio).
DMA between GPUs on Quad system and network for cluster
Start to design SciDAC API for many-core technologies.
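The “12 used now” presumably refers to storing only the first two rows of each SU(3) link (6 complex = 12 real numbers) and rebuilding the third row on the fly. The sketch below shows that reconstruction under this assumption. The helper names `sample_su3`, `compress` and `reconstruct` are hypothetical, and the 8-parameter scheme mentioned in the plan is not implemented here.

```python
import cmath
import math

def matmul3(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def sample_su3(a=0.3, b=1.1, theta=0.7, phi=0.4):
    """Build a test SU(3) matrix from two rotations and a unit-determinant phase matrix."""
    c, s = math.cos(theta), math.sin(theta)
    r12 = [[c, s, 0], [-s, c, 0], [0, 0, 1]]
    c, s = math.cos(phi), math.sin(phi)
    r23 = [[1, 0, 0], [0, c, s], [0, -s, c]]
    phases = [[cmath.exp(1j * a), 0, 0],
              [0, cmath.exp(1j * b), 0],
              [0, 0, cmath.exp(-1j * (a + b))]]
    return matmul3(r12, matmul3(phases, r23))

def compress(U):
    """Keep the first two rows only: 12 real numbers instead of 18."""
    return [list(U[0]), list(U[1])]

def reconstruct(rows):
    """Rebuild the third row as the complex conjugate of (row 0) x (row 1)."""
    u, v = rows
    w = [(u[1] * v[2] - u[2] * v[1]).conjugate(),
         (u[2] * v[0] - u[0] * v[2]).conjugate(),
         (u[0] * v[1] - u[1] * v[0]).conjugate()]
    return [list(u), list(v), w]
```

The reconstruction costs a few extra floating-point operations per link but removes one third of the gauge-field memory traffic, a good trade when the kernel is bandwidth-bound.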
Tesla 10-Series: What’s the Big Deal?
Consumer Chip GTX 280 ⇒ Tesla C1060
1U Quad S1070 System $8K
CUDA 2.0 (Compute Unified Device Architecture)
Can compile CUDA code into highly efficient SSE-based multi-threaded C code
Need a GPU Dirac Propagator Farm
The Clark-Kennedy RHMC Paradox: (The faster you go, the harder it is to keep up)
Analysis is now the “Ἀχιλλεύς heel”
Solution: Dedicated Analysis farm.
GPU can deliver O(10) to O(100) gain in flops/$
Two quad Tesla ⇒ 1 Sustained Teraflop!
Two quad Tesla @ 25K ≡? One BG/L rack @ 2,000K
Commercial Break:
BOSTON POST DOC IN SEPT 2009
PetaAPPS/SciDAC fellow
(QCDNA in Boston Fall 2009?)
