End-to-end performance: issues and suggestions


TERENA 5th NRENs and Grids Workshop

Paris, June 2007

Mark Leese


Talk Emphasis

- monALISA = a monitoring tool/framework
- DANTE = a network operator
- EGEE-II = a Grid
- Mark = a pseudo-Grid end user

I'm not a real user, but I look at the issues from their viewpoint:
- Large Hadron Collider in the UK (GridPP)
- UK e-Science
- OGF

Aimed at a mixed audience (NRENs and Grid users), so some network/Grid things you will already know….Zzzzzzzzzzzz :)

Contents

Just two things:

1. What makes the Grid different to other network users, wrt performance?

2. What are the end-to-end performance (monitoring) issues? Any suggestions?

If the links in the presentation don't work, they are listed again on the last three slides.

1. What makes the Grid different to other network users, wrt performance?

The Grid

The Grid is all about:

- Sharing resources:
  - the obvious, e.g. databases
  - the specialised, e.g. remotely controlled telescopes
  - and new ideas, e.g. CPU time
  - co-allocating resources to a task to remove the limitations of the individual resources
  - most basic analogy: you can move house faster if you have two vans
- Sharing resources which are geographically distributed
- Sharing resources efficiently:
  - optimisation: selecting the "best" resources for the job

The Grid

Middleware sits between the OS of the resources (below) and the applications that run on the Grid:

- Get apps running on the "right" resources (wherever they are)
- Make disparate compute resources into a coherent whole

[Diagram: Grid applications (process TBs of particle physics data from CERN detectors; analyse the human genome; obtain radio astronomy images from remote telescopes) sit on the middleware, which spans Storage Elements, Compute Elements, a Chemical DB and the connecting network(s). Image courtesy of NRAO/AUI.]

Optimisation

It's a little like the checkout counters in a supermarket:

- There is a line of 10 checkouts to which you can take your big shopping basket
- Two checkouts you cannot use: they are for people with "five items or less" (caisse express)
- Another two checkouts cannot be used: they are reserved for something else (the staff's lunch break)
- Six are left: how big is each queue, and how long will it take each person to exit the queue (how many items in each basket)?
- If you choose wrong, you get delayed! You miss the train, you get home late, your partner has given your dinner to the dog
- To take the analogy to extremes: hopefully your basket does not have a broken wheel :)

Scheduling

- Grid job = the basic unit of work
- SEs provide storage resources and access to mass storage systems
- CEs provide processing power, e.g. a cluster of Worker Nodes (PC farm)

- Scheduling = deciding when a job will run, and with which resources

- Typically there will be many CEs capable of running a job
- If a CE already has lots of jobs queued, you would like to use another

- File replication = a proven technique for improving data access:
  - distribute multiple copies of the same file across a Grid
  - increases the number of CEs with good network connectivity to the data
  - extreme example: Pisa→Roma, or Pisa→Fermilab?

- So, typically there may also be several SEs holding the required data

Network Aware Scheduling (i)

- So we have a set of CEs {a,b,c,…} and SEs {x,y,z,…} capable of running a job
- We want a node from each list such that the job will complete the fastest
- Take account of:
  - the capability of the CEs
  - the size and number of jobs already waiting (queued) at the CEs
  - the performance of the network link for each CE-SE combination

- Further complicated by the compute/data intensity of the job:
  - computationally intensive job: lots of maths
  - data intensive job: lots and lots and lots of data
  - do we pull the data to the job, or push the job to the data? (a minimal selection sketch follows)
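
By way of illustration only (this is not how any production scheduler works), a minimal Python sketch of that selection problem; every name and number below is invented:

    # Minimal sketch of network-aware CE/SE selection: pick the pair with
    # the lowest estimated completion time. All figures are invented.

    def completion_time(ce, se, job_cpu_hours, input_gb, link_mbs):
        """Estimated hours until the job finishes on this CE/SE pair."""
        queue_hours = ce["queued_jobs"] * ce["avg_job_hours"]
        compute_hours = job_cpu_hours / ce["relative_speed"]
        # link_mbs maps (CE, SE) name pairs to measured throughput in MB/s
        transfer_hours = input_gb * 1024 / link_mbs[(ce["name"], se["name"])] / 3600
        return queue_hours + compute_hours + transfer_hours

    ces = [
        {"name": "ce-a", "queued_jobs": 4, "avg_job_hours": 2.0, "relative_speed": 1.0},
        {"name": "ce-b", "queued_jobs": 0, "avg_job_hours": 2.0, "relative_speed": 0.5},
    ]
    ses = [{"name": "se-x"}, {"name": "se-y"}]
    link_mbs = {("ce-a", "se-x"): 60.0, ("ce-a", "se-y"): 25.0,
                ("ce-b", "se-x"): 5.0, ("ce-b", "se-y"): 40.0}

    best_ce, best_se = min(
        ((ce, se) for ce in ces for se in ses),
        key=lambda pair: completion_time(pair[0], pair[1], 10.0, 50.0, link_mbs))
    print("run on", best_ce["name"], "reading from", best_se["name"])

A real scheduler would refresh the queue and link numbers from monitoring data rather than hard-coding them.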


Network Aware Scheduling (ii)

- In Utopia we would know about the current state of the network, and any future reserved bandwidth
- In reality we could use monitored network performance to make an estimate
- It's not perfect, but patterns (diurnal variation, chronic poor performance…) can be identified
- The following slides show iperf tests between dedicated test nodes at LHC sites in the UK (GridPP's gridmon infrastructure)

Network Aware Scheduling (iii.a)

[Graph: iperf throughput showing diurnal variation]

- Transfer at 00:00, yes. Transfer at 12:00, no. There's a big difference between 500 and 200 Mbps for data intensive jobs!

Network Aware Scheduling (iii.b)

[Graph]

- RAL Tier-2 → Tier-1: local transfers are likely the best performers

Network Aware Scheduling (iii.c)

[Graph]

- Here, you have absolutely no idea what performance you would get: avoid
- Summary: ignore the network at your peril :)

Network Aware Scheduling (iv)

Two good papers to read:

1. B. Volckaert, P. Thysebaert, M. De Leenheer, F. De Turck, B. Dhoedt, P. Demeester: "Network Aware Scheduling in Grids"

2. Richard McClatchey, Ashiq Anjum, Heinz Stockinger, Arshad Ali, Ian Willers, Michael Thomas: "Data Intensive and Network Aware (DIANA) Grid Scheduling"

We don't consider the potential uses in more detail (job placement, replica selection) because we don't know if it will happen!

Network Aware Scheduling (v)

- There are some -ve feelings:
  - "The network is not a problem. Over-provisioning will always keep us ahead. Either that or fibre and GigE everywhere"
  - The Report of the International Grid Performance Workshop 2005 concluded that "Performance simply is not on the critical path for many application projects. Applications that struggle to get code to execute correctly simply do not consider whether they are using resources efficiently or achieving good performance"
  - Personal experience suggests that there is so much to think about elsewhere that the network is often the last thing to be considered
  - Right now, Grid apps rely on the network being good, with no real checks

- And by way of real-life indications…

- EDG WP7 developed a "network cost function" (the idea is sketched below):
  - returned the cost of variable-size file transfers between source and destination Grid elements
  - based on periodic (WP7) iperf measurements
  - used by the WP2 Replica Optimization Service for:
    - job placement: where to start a job so that it is as close as possible to the required data
    - replica selection: from where to fetch the closest replica once a job had started
  - EDG was not a production Grid, and the work was not taken forward
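
To make the idea concrete, a hedged sketch of such a cost function (the real WP7 code differed; the hosts and rates here are invented):

    # Sketch of an EDG WP7-style "network cost function": the cost of moving
    # a file between two Grid elements, from the latest iperf measurement.
    measured_mbps = {("se.ral", "ce.gla"): 320.0,    # invented iperf results
                     ("se.cern", "ce.gla"): 95.0}

    def network_cost(src, dst, file_mb):
        """Estimated transfer time in seconds (lower = better)."""
        throughput = measured_mbps.get((src, dst))
        if throughput is None:
            return float("inf")           # no measurement: treat path as unusable
        return file_mb * 8 / throughput   # MB -> Mbit, divided by Mbit/s

    # Replica selection: fetch a 2 GB file from the cheapest source
    print(min(["se.ral", "se.cern"], key=lambda s: network_cost(s, "ce.gla", 2048)))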


Network Aware Scheduling (vi)

- In EGEE…
- Tommaso Coviello and Tiziana Ferrari proposed to use network performance data from EGEE-JRA4:

    CompletionTime(CE_i) = JobExecutionTime + max(InputDataTransferTime, QueueTime)

  - estimate file transfer times based on throughput
  - reject paths exhibiting packet loss
  - SE selection refined to prefer SEs on low-congestion links (jitter the suggested test)

- Some prototype work, but not taken forward:
  - QueueTime was found to be unreliable
  - data for 100 paths was required within 0.2 seconds of receiving a request
  - the Grid Information Service was not ready to hold the data
  - a problem for JRA4's Web Service interface (WS: accessible but slow)

(a toy version of the estimate follows)
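
A toy rendering of that estimate with invented numbers (the prototype would have fed it real JRA4 measurements):

    # CompletionTime(CE_i) = JobExecutionTime + max(InputDataTransferTime, QueueTime),
    # rejecting paths that exhibit packet loss. All numbers invented.
    candidates = [
        {"ce": "ce1", "exec_s": 3600, "transfer_s": 1200, "queue_s": 400, "loss": 0.0},
        {"ce": "ce2", "exec_s": 2700, "transfer_s": 300, "queue_s": 900, "loss": 0.02},
    ]

    def completion_time(c):
        return c["exec_s"] + max(c["transfer_s"], c["queue_s"])

    usable = [c for c in candidates if c["loss"] == 0.0]   # reject lossy paths
    best = min(usable, key=completion_time)
    print(best["ce"], completion_time(best))               # -> ce1 4800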


Network Aware Scheduling (vii)

- In WLCG/EGEE (if I understand correctly)…

- The "close SE" approach is applied:
  - each CE must have a "close" SE: the node with the "best" access for data retrieval from that CE
  - these relationships are statically defined in the Grid's Information Service, which provides information about the Grid resources and their status

    $ lcg-infosites --vo dteam closeSE
    Name of the CE: g02.phy.bg.ac.yu:2119/blah-pbs-dteam
        se.phy.bg.ac.yu
    Name of the CE: fangorn.man.poznan.pl:2119/jobmanager-lcgpbs-dteam
        se1.egee.man.poznan.pl
        se2.egee.man.poznan.pl

Network Aware Scheduling (viii)

- To run a job, the user submits a job description in JDL (Job Description Language) format
  - it defines which executable to run, any parameters, input data (Grid files) etc.

- A match-making process then takes place to identify a CE to execute the job:
  1. identify all CEs which:
     1. can run the job, i.e. match the user's requirements (JDL)
     2. are "close" to an SE holding the required input Grid files
  2. select the CE with the highest rank
     - by default, rank = an estimate of the time interval between the job being submitted and execution actually beginning
     - a function of the number of running and queued jobs at each CE
     - see the gLite User Guide for more info

- As already stated, the presence of replicas of data increases the number of CEs "close" to the data which can potentially execute the job
- But decisions are still made on the static declaration of "close" SEs

- Users are able to re-write the site selection code themselves (a toy match-maker is sketched below)
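
For illustration, a toy match-maker in the same spirit (the real logic lives in gLite's Workload Management System; everything below is invented):

    # Toy gLite-style match-making: filter CEs on JDL-like requirements and
    # on closeness to an SE holding the input file, then pick the best rank.
    # Rank here, as in the default, estimates wait time: fewer queued = better.

    ces = [  # invented CE descriptions
        {"name": "ce-a", "os": "SL3", "close_ses": {"se-x"}, "queued": 12, "running": 40},
        {"name": "ce-b", "os": "SL3", "close_ses": {"se-y"}, "queued": 1, "running": 10},
        {"name": "ce-c", "os": "SL4", "close_ses": {"se-y"}, "queued": 0, "running": 2},
    ]
    replicas = {"data.root": {"se-x", "se-y"}}  # SEs holding each Grid file

    def matches(ce, required_os, input_file):
        return (ce["os"] == required_os
                and bool(ce["close_ses"] & replicas[input_file]))

    def rank(ce):
        # Higher rank = shorter estimated wait; a crude stand-in.
        return -(ce["queued"] + 0.1 * ce["running"])

    eligible = [ce for ce in ces if matches(ce, "SL3", "data.root")]
    print(max(eligible, key=rank)["name"])  # -> ce-b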


Difference 1

So, difference 1…

The Grid *may* use network performance data to improve its decision making

Difference 2

Difference 2…

The Grid *will* exercise the network

Qualitative View

- By its very nature…
  - sharing lots of resources to build powerful "systems"…
  - to process complex, large data sets…
  - in geographically distributed teams
  - some in real-time, e.g. visualisation
  - so far there have been lots of "embarrassingly parallel" problems (completely independent tasks which can be executed in parallel), but what about tasks requiring inter-processor communication (MPI, Message Passing Interface)?

- …= a lot of data moving across the network:
  - high bandwidth
  - low latency
  - stable and guaranteed transmission rates

Quantitative View (i)

- The Large Hadron Collider is a collection of four experiments based at CERN (ALICE, ATLAS, CMS and LHCb) that will monitor the collision of accelerated particles
- ≈ 15 Petabytes of data generated every year
- Around 100,000 standard CPUs required to process it
- GridPP (UK) is contributing the equivalent of 10,000 PCs

Quantitative View (ii)

- My understanding is that the LHC, when operational, will be pushing out 700 Mbytes/s (≈ 5.6 Gbps) from the Tier-0 to each Tier-1

- 11 Tier-1s, linked to CERN with a 10 Gbps Optical Private Network
  - so no problems there

- Additional variable flows ≤ 4 Gbps are expected between the Tier-1s

- What about Tier-1s to Tier-2s?
  - > 150 Tier-2s, 18 in the UK
  - Tier-1s and Tier-2s are currently linked by standard research networks
  - are you going to commission dedicated fibres or lambdas for each?

Quantitative View (iii)

[Figure-only slide]

Rolls Royce Networks

- Lots of projects are working on adding extra intelligence into the network, and/or interfacing Grid applications with the network control plane for auto-provisioning of dedicated bandwidth:
  - Cisco's Network Based On-demand/Grid System (NBGS)
  - the NAREGI project
  - Enlightened Computing
  - http://www.g-lambda.net/

- These are still development projects
- Can fibre/lambdas be provided for all that need it?
  - even if the £/$ were provided, is there a temptation to spend it on CPU power instead?
- May still fall victim to end-system and "last mile" (e.g. firewall) problems

Is the Grid a lot of Hype?

- It's good to be skeptical about things. Every four years people say England will win the World Cup/Coupe du Monde ;-)
- The Grid is ambitious…
- …but so was the "World Wide Wait"
- Now everyone loves the Web, and it has become important to people:
  - Internet banking, online shopping (flights, holidays, music, supermarket…), e-Government etc. etc.
  - MySpace, Facebook, YouTube
- The Web also drove investment in the Net infrastructure, and as a result it can now support video conferencing, VoIP etc.

Summary of Differences

1. Network Operations: we can safely say that greater demands *will* be placed on the network:
   - massive datasets, 1000s of networked "resources"
   - geographically distributed: Long Fat Networks
   - high bandwidth, high availability, low latency
   - networks will need to be debugged for efficiency

2. Network Intelligence: the Grid *may* want to consume network performance data to improve its decision making

2. What are the end-to-end performance (monitoring) issues?

The Overall Issue

- We have seen that the Grid *could* use network performance data for decision making…
- …but we don't know whether it will
- As a result, we concentrate on debugging the network for Grid users

End-to-End?

- When I say "end-to-end" I mean PC-to-PC, not PoP to PoP or similar
- The core and metro area are normally fine
- Most problems are in the last mile:
  - end-system:
    - NIC
    - disc
    - TCP config
    - poor cabling
    - the application itself (e.g. older versions of scp)
    - I could go on for ever ("no, please don't!")
  - site firewall
  - off-site connections

So Many Issues

- Beyond the basics of which tests to run, and how to control/schedule them, there are *too many* end-to-end performance issues to consider when monitoring. Here, I mention a few and make some suggestions.

- TCP performance:
  - parallel TCP streams
  - different data transfer protocols (e.g. GridFTP vs HTTP)
  - new protocols, e.g. DCCP
  - TCP/IP is ubiquitous, so we stick with it; we can't necessarily wait for new protocols and network architectures

- Measurement types:
  - active vs passive
  - capture logs of real GridFTP transfers… is there Grid Information Service support?
  - can we monitor Grid workflows in real-time?

- Too many test paths. Can we plug in to VO data to test only the required paths?

Over-Provisioning

Q: Okay, so why don't we just throw some more bandwidth at the problem? Upgrade the links.

A: For want of a more interesting term to make sure you're still paying attention, this is what I call the Heroin Effect…
- You start off with a little, but that's not really doing it for you; it's not solving the problem. So you keep increasing the dose, yet it's never as good as you thought it would be.
- By analogy, you keep buying more and more bandwidth to take you to new highs, but it's never quite as good as you thought it would be.

- Simple over-provisioning is not sufficient:
  - it doesn't address the key issue of *end-to-end* performance
  - the network backbone in most cases is genuinely not the source of the problem
  - the last mile (campus network → end-user system → your app) is often the cause of the problem: firewall, wiring, hard disc, application and many more potential culprits

- Also, if simple over-provisioning were a total solution, there would not be so much other work going on, e.g. protocol research (high-speed TCPs)

Let's Put Fibre Everywhere (1)

- Fibre is cheaper than it was, but for large deployments it's still expensive
- We can see the benefits of fibre with the UKLight infrastructure and the ESLEA exploitation project, but it still doesn't address the end-to-end issue. Take a real-life ESLEA example (thanks to ESLEA for the figures):

  - The UK wanted to transfer data from FermiLab (Chicago) to UCL for analysis by physicists, before returning the results
  - datasets currently 1-50 TB
  - 50 TB would take > 6 months on the production network, or one week at 700 Mbps (arithmetic checked below)
  - so a 1 Gbps circuit-switched light path was provisioned
  - result = disc-to-disc transfers @ 250 Mbps, just 1/4 of the theoretical max
  - tests revealed a problem at an end site
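
Checking the back-of-the-envelope numbers (a sketch; the ~25 Mbps production rate is my inference from the "> 6 months" claim):

    # Transfer-time check for the ESLEA example: 50 TB at a sustained rate,
    # ignoring protocol overheads.
    def days_to_transfer(terabytes, mbps):
        bits = terabytes * 1e12 * 8
        return bits / (mbps * 1e6) / 86400

    print(f"{days_to_transfer(50, 700):.1f} days at 700 Mbps")   # ~6.6: "one week"
    print(f"{days_to_transfer(50, 25):.0f} days at 25 Mbps")     # ~185: "> 6 months"
    print(f"{days_to_transfer(50, 250):.1f} days at 250 Mbps")   # the rate achieved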



Let's Put Fibre Everywhere (2)

- UCL: RealityGrid, for modelling complex condensed matter systems: computational steering, visualisation
  - test node: 2 × 1.8 GHz Athlon, 4 GB, GigE, CentOS

- DL: the HPCx super computer
  - test node: 3 GHz P4, 2 GB, GigE, Scientific Linux

- RTT is always 9 ms
- TCP bandwidth is, errr.... [graph]

Mark's Tips

- There are lots of tools, frameworks and infrastructures out there:
  - massive list at http://www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
  - pick something that works for you; it's a balance of:
    - ongoing administration
    - deployment effort (e.g. persuading remote sites to install tools and allow you to run tests)
    - how intrusive the tests are

- Start your investigations in the last mile

- Do put real data over the network:
  - you can send 1 ping a second forever and see 10^-8 loss
  - you then run an iperf test and the performance is terrible

- Keep historic data: things change
  - you *will* want to look back, and you will want points of reference

- When you see a problem, follow it up and get information
  - not only is the problem fixed, but you get to demonstrate why this is useful, which helps with deployment, support, growing the user base…

- Remember the social aspects: be persistent but patient :)

Suggestions: Tools and Techniques

- Start with the local host:
  - as you would expect:
    - uname
    - netstat
    - ifconfig (watch error counters etc.)
  - LISA (Localhost Information Service Agent):
    - a component of MonALISA
    - almost complete system monitoring (load, CPU, memory, disk, disk I/O, paging, processes, network traffic and connectivity...)

- Check everything:
  - TCP configuration
  - machine load
  - disc (SAS, SATA, nasty old IDE?)

- If TCP is the problem, what UDP rates can you achieve? (a quick host-survey sketch follows)
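
A quick way to script those first checks (a sketch assuming a Linux host with these commands on the PATH; adapt to taste):

    # Collect a first-pass snapshot of the local host for later comparison.
    # Assumes a Linux box with these tools installed; output goes to stdout.
    import subprocess

    CHECKS = [
        ["uname", "-a"],        # kernel and architecture
        ["netstat", "-s"],      # protocol counters (retransmits etc.)
        ["ifconfig"],           # interface errors/drops
        ["sysctl", "net.ipv4.tcp_rmem", "net.ipv4.tcp_wmem"],  # TCP buffers
    ]

    for cmd in CHECKS:
        print("===", " ".join(cmd), "===")
        try:
            print(subprocess.run(cmd, capture_output=True, text=True).stdout)
        except FileNotFoundError:
            print("(not available on this host)")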


Suggestions: Tools and Techniques

- ping is still useful, but you need to send much faster than 1 per second, and for a long time, to see 10^-8 loss
  - "back of envelope" calculation: on Saturday I ran a 10 sec iperf test which transferred 624 MB in 480,000 packets, so ≈ 1.3 KB per packet
  - 1 loss every 100,000,000 packets ≈ 128 GB transferred before a loss causes your transfer rate to drop (checked in the sketch below)
  - can use the Synack tool (sparingly) if ICMP is blocked

- traceroute and reverse traceroutes: regularly measuring the routes to your most important collaborators is very useful

- dedicated monitoring boxes are useful here because they may be allowed (firewalls etc.) for ICMP
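
The arithmetic, checked (assuming the slide's figures are binary megabytes/gigabytes):

    # Back-of-envelope check of the loss calculation above.
    transferred = 624 * 2**20            # 624 MB moved in the 10 s iperf test
    packets = 480_000
    bytes_per_packet = transferred / packets
    print(f"{bytes_per_packet / 1024:.2f} KB per packet")      # ~1.33 KB

    # At 10^-8 loss, expect one lost packet per 1e8 sent:
    between_losses = bytes_per_packet * 1e8
    print(f"{between_losses / 2**30:.0f} GB between losses")   # ~127 GB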


Suggestions: Tools and Techniques

- As we will see, time series data is probably the most useful

- When did your problems start? When did things change?

- Unfortunately, this relies on there being proximity between your paths/devices and ones for which there is available data

- If you suspect the problem is in the core, you *may* be able to find the problem router (or rough location) through so-called "looking glass" servers: statistics of network operator performance

- ping and iperf are very useful here… but be wary:
  - in May 2004, Les Cottrell (SLAC) said… "As measured by NetFlow, 25% of the traffic on Abilene is iperf and ping type traffic"

Suggestions: Tools and Techniques

- Thrulay is an iperf-like tool for measuring TCP and UDP bandwidth
  - useful because it also gives you the RTT seen by the transfer, not ping/traceroute's estimate

- Two "detective" type tools:
  1. Tom Dunnigan and Rich Carlson's Network Diagnostic Tool (NDT)
     - client-server
     - useful because the client can be lightweight: a Java applet, runs in a Web browser on most systems
     - a command line client (compile and install) is also available
     - public servers (Linux boxes with Web100 kernels), although I think only one outside the US (thank you SWITCH)
     - detects problems, makes suggestions: duplex problems, TCP tuning, amongst others
  2. The SURFnet Detective

Suggestions: Tools and Techniques

[Screenshot: NDT's suggestion]

Suggestions: Tools and Techniques

We *could* do these, but *don't*, because there's too much data to process/correlate:

- Cisco NetFlow data:
  - routers record details of all traffic "flows" which they see:
    - src and dest IP addresses and ports
    - start and end time
    - amount of traffic transferred

- Parsing firewall logs (a parsing sketch follows):

    [root@gridmon2 ~]# iperf -c hepgrid7.ph.liv.ac.uk
    ------------------------------------------------------------
    Client connecting to hepgrid7.ph.liv.ac.uk, TCP port 5001
    TCP window size: 16.0 KByte (default)
    ------------------------------------------------------------
    [ 3] local 193.62.125.96 port 58316 connected with 138.253.178.107 port 5001
    [ 3]  0.0-10.0 sec   873 MBytes   732 Mbits/sec

    Jun 10 22:12:58: NetScreen device_id=gw-fw system-notification-00257(traffic):
    start_time="2007-06-10 22:15:55" duration=22 service=tcp/port:5001
    src zone=ESC-DMZ dst zone=Untrust action=Permit sent=948533470 rcvd=40793960
    src=<hidden> dst=<hidden> src_port=58316 dst_port=5001 session_id=995619

- Not wholly accurate (22 secs, not 10, and it ignores overheads), but it can be used for relative comparisons
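
A sketch of pulling a throughput estimate out of such a NetScreen log line (field names as in the log above; treat the result as relative only, for the reasons just given):

    # Estimate transfer rate from a NetScreen traffic log line. The duration
    # covers the whole session (here 22 s, not iperf's 10 s), so use the
    # result only for relative comparison between transfers.
    import re

    log = ('Jun 10 22:12:58: NetScreen device_id=gw-fw system-notification-'
           '00257(traffic): start_time="2007-06-10 22:15:55" duration=22 '
           'service=tcp/port:5001 src zone=ESC-DMZ dst zone=Untrust '
           'action=Permit sent=948533470 rcvd=40793960 src=<hidden> '
           'dst=<hidden> src_port=58316 dst_port=5001 session_id=995619')

    fields = dict(re.findall(r'(\w+)=("[^"]*"|\S+)', log))
    duration = int(fields["duration"])
    sent = int(fields["sent"])
    print(f"{sent * 8 / duration / 1e6:.0f} Mbit/s sent")  # ~345 Mbit/s over 22 s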


Suggestions: Tools and Techniques

- SNMP data is (understandably) impossible to obtain for non-networkers
- Sharing data with the OGF NM-WG XML schemas may improve things

- And now some quick examples from gridmon:
  - dedicated boxes
  - same spec, OS, configuration: makes life a lot easier (comparing like-for-like)
  - if running regular tests, get the results into an SQL database:
    - fast, repeatable queries
  - if no dedicated boxes are available, deploy a box for:
    - either the best performance possible
    - or something representative of systems at that end-site
  - sorry, no end-system examples here
    - we configured the boxes ourselves ;-)

Example 1

- Glasgow running transfer tests to Edinburgh over the weekend of 28-29th October
- Experiencing poor rates (80 Mbps)
- 1st thing: despite transferring at just 80 Mbps, residual TCP bandwidth drops by ≈ 400 Mbps
- Warning bells

Example 1

- Traceroute data reveals a suspect router (hop 3, where the RTT jumps)…

    traceroute to gridmon.epcc.ed.ac.uk (129.215.175.71), 30 hops max, 38 byte packets
     1  194.36.1.1 (194.36.1.1)  0.941 ms  0.882 ms  0.815 ms
     2  130.209.2.1 (130.209.2.1)  0.875 ms  0.831 ms  0.830 ms
     3  130.209.2.118 (130.209.2.118)  60.415 ms  55.453 ms  31.327 ms
     4  glasgowpop-ge1-2-glasgowuni-ge1-1-v152.clyde.net.uk (194.81.62.153)  32.420 ms  34.404 ms  29.424 ms
     5  glasgow-bar.ja.net (146.97.40.57)  43.467 ms  52.298 ms  39.349 ms
     6  po9-0.glas-scr.ja.net (146.97.35.53)  45.856 ms  44.445 ms  41.388 ms
     7  po3-0.edin-scr.ja.net (146.97.33.62)  51.509 ms  63.493 ms  31.435 ms
     8  po0-0.edinburgh-bar.ja.net (146.97.35.62)  22.454 ms  25.412 ms  31.381 ms
     9  146.97.40.122 (146.97.40.122)  44.602 ms  42.494 ms  35.492 ms
    10  gridmon.epcc.ed.ac.uk (129.215.175.71)  33.515 ms  34.623 ms  37.694 ms

Example 1

- The reverse route confirms it. Traceroutes are normal until we hit the suspect router (hop 8)…

    traceroute to gppmon-gla.scotgrid.ac.uk (194.36.1.56), 30 hops max, 38 byte packets
     1  vlan175.srif-kb1.net.ed.ac.uk (129.215.175.126)  0.435 ms  0.387 ms  0.380 ms
     2  edinburgh-bar.ja.net (146.97.40.121)  0.357 ms  0.329 ms  0.322 ms
     3  po9-0.edin-scr.ja.net (146.97.35.61)  0.564 ms  0.485 ms  0.485 ms
     4  po3-0.glas-scr.ja.net (146.97.33.61)  1.656 ms  1.511 ms  1.499 ms
     5  po0-0.glasgow-bar.ja.net (146.97.35.54)  1.850 ms  1.352 ms  1.422 ms
     6  146.97.40.58 (146.97.40.58)  1.679 ms  1.661 ms  1.569 ms
     7  glasgowuni-ge1-1-glasgowpop-ge1-2-v152.clyde.net.uk (194.81.62.154)  1.796 ms  1.677 ms  1.646 ms
     8  130.209.2.117 (130.209.2.117)  31.197 ms  34.615 ms  29.121 ms
     9  130.209.2.2 (130.209.2.2)  32.814 ms  32.158 ms  32.145 ms
    10  gppmon-gla.scotgrid.ac.uk (194.36.1.56)  41.634 ms  37.555 ms  24.635 ms

- Graphs and traceroutes provide evidence for further investigation

Example 1

- Further investigation revealed that the router had exhausted its CAM space
  - <see next slide if you want to know what this is>

- In simple terms, the router was forced to switch in software
- Because a particular lookup in a routing/switching/access table was not being hardware accelerated, problems were caused under certain flow conditions
- The solution: the CAM dynamic database was re-optimised (to free up CAM space) and the unit began switching in hardware again

Example 1

- CAM = Content-Addressable Memory
- A hardware (fast) implementation of associative memory:
  - a data word (not a memory address!) is used to access it
  - the CAM searches its entire contents to see if the data word is stored
  - if the word is found, the CAM returns a list of one or more corresponding storage addresses, or other data associated with those storage addresses
- CAM is used for switching and routing, e.g. Ethernet switches store learned MAC addresses and their associated switch port in CAM:

    MAC Address      Located on Port
    -------------    ---------------
    000039-0643f5    26
    000089-01af9a    5
    000102-162346    16

- When an Ethernet frame arrives at the switch with a destination address of 000089-01af9a, the switch searches its CAM for that address
- The CAM will return "5", so the switch sends this Ethernet frame out on port 5 (a toy model is sketched below)
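
A toy model of that lookup (a Python dict stands in for the CAM; in real hardware the search is a parallel match across all entries, not a hash lookup):

    # Toy CAM: content (MAC address) -> associated data (switch port).
    cam = {
        "000039-0643f5": 26,
        "000089-01af9a": 5,
        "000102-162346": 16,
    }

    def forward(dst_mac):
        port = cam.get(dst_mac)
        if port is None:
            return "flood frame out of all ports (address not learned)"
        return f"send frame out on port {port}"

    print(forward("000089-01af9a"))  # -> send frame out on port 5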


Example 2

- The local departmental firewall was reconfigured to switch off strict checking of TCP sequence numbers
- A potential minefield: SACK etc.

Example 3

- Almost constant 33% UDP packet loss
- Fatal to most/all applications using UDP
- Occasional dip to 0%

Example 3

- Zooming into a particular day shows a period of 0% loss
- The site firewall limits UDP to 1,000 packets per second, per endpoint pair
- This was temporarily raised to 20,000 pps for video conferences

The Answer

- Blair (vintage 1996), before he came to power…
- "Education, education, education" became a mantra for his party
- NRENs are ideally placed to provide this

[Cartoon dialogue from the slides:]
"Ask me my three main priorities for Government and I tell you: education, education, education."
"Yes, why don't you stupid English learn some French?"
"French? What's French?"

NFNN

As an example:

- Networks for Non-Networkers workshops
- Aimed at people working at the technical level in high-bandwidth dependent science
- Talks on TCP, LAN, diagnostic steps, security…
- http://gridmon.dl.ac.uk/nfnn/

Your Application

- Is your application making effective use of the network?

- Consider using multiple TCP sockets (i.e. multiple streams) for your data transfers
  - one thread per socket

- Keep your "pipe" full of data:
  - use asynchronous I/O, i.e. run computation and I/O in parallel
  - pre-fetch data you know you are going to need, again in parallel with other computation or I/O
  - when possible, read/write large blocks of data at a time: better to infrequently r/w 1 MB than frequently r/w 4 KB (see the sketch below)
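
A minimal sketch of overlapping I/O with computation (file name and block size are placeholders):

    # A reader thread pre-fetches large blocks into a bounded queue while
    # the main thread processes: computation and I/O run in parallel.
    import queue
    import threading

    BLOCK = 1024 * 1024  # 1 MB blocks: fewer, larger reads

    def reader(path, q):
        with open(path, "rb") as f:
            while block := f.read(BLOCK):
                q.put(block)         # blocks if the consumer falls behind
        q.put(None)                  # sentinel: end of file

    def process(block):
        return sum(block[:16])       # stand-in for real computation

    q = queue.Queue(maxsize=4)       # bounded buffer of pre-fetched blocks
    threading.Thread(target=reader, args=("bigfile.dat", q), daemon=True).start()

    while (block := q.get()) is not None:
        process(block)               # the next read proceeds in parallel

The bounded queue keeps memory use flat while still letting the reader run ahead of the computation.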


What Is Your Application Doing?

- Instrument your code, e.g. with Netlogger, a "Networked Application Logger"
- A methodology and set of tools
- Low overhead: can generate up to 5000/500 events/sec using the C/Java APIs with negligible impact on the app
- A simple and sensible methodology, e.g. Rule 3: "Log all of the following events: entering and exiting any program or software component, and begin/end of all I/O (disk and network)." (a sketch of the idea follows)
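
A sketch of that rule in plain Python (this is not the real NetLogger API, just the idea: timestamped events around components and I/O):

    import sys
    import time
    from contextlib import contextmanager

    def log_event(event, **fields):
        extra = " ".join(f"{k}={v}" for k, v in fields.items())
        print(f"ts={time.time():.6f} event={event} {extra}", file=sys.stderr)

    @contextmanager
    def instrumented(name, **fields):
        log_event(f"{name}.start", **fields)
        try:
            yield
        finally:
            log_event(f"{name}.end", **fields)

    with instrumented("transfer", file="bigfile.dat"):
        with instrumented("disk.read"):
            time.sleep(0.1)          # stand-in for the real read
        with instrumented("net.write", dest="remote-se"):
            time.sleep(0.2)          # stand-in for the real send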


Netlogger

[Graph: client-side GridFTP transfer instrumented with Netlogger]

- note the large overhead (≈ 8 s) of initial handshaking before real writing begins

Conclusion

- The Grid *could* use network performance data
- The reality is that it doesn't
- The Grid *will* exercise networks
- Core = fine. Metro = mostly fine. Most problems are in the last mile.
- Not every Grid app wants, needs or can afford dedicated λ's
- Education, education, education. But please, no wars!
- Tune your end systems *and* applications
- Instrument your application so you can see what's happening

For more information: m.j.leese@dl.ac.uk

Links (1)

- The GridPP (LHC in the UK) "gridmon" network monitoring infrastructure: http://gridmon3.dl.ac.uk/gridmon/

- Network Aware Scheduling in Grids:
  - "Network Aware Scheduling in Grids" paper: http://users.atlantis.ugent.be/bvolckae/papers/NOC2004.pdf
  - "Data Intensive and Network Aware (DIANA) Grid Scheduling" paper: http://hst.web.cern.ch/hst/publications/diana-JoGC.pdf

- Report of the International Grid Performance Workshop 2005: http://www-unix.mcs.anl.gov/~schopf/GPW2005/report.pdf

- EDG WP7 Final Report: https://edms.cern.ch/file/414132/2.1/DataGrid-07-D7-4-0206-2.0.pdf

- EGEE-JRA4: http://egee-jra4.web.cern.ch/EGEE-JRA4/

- gLite User Guide: https://edms.cern.ch/file/722398/gLite-3-UserGuide.html

Links (2)

- Rolls Royce Networks:
  - Cisco's Network Based On-demand/Grid System: http://www.terena.org/activities/nrens-n-grids/workshop-03/NBGS-Terena.pdf
  - The NAREGI project: http://www.naregi.org/index_e.html
  - Enlightened Computing: http://www.mcnc.org/index.cfm?fuseaction=page&filename=enlightened_computing.html
  - G-Lambda: http://www.g-lambda.net

- Monitoring Grid workflows in real-time: http://www.di.unipi.it/~augusto/seminars/200705_OGF20/2007-04-09_OGF-Slides.pdf

- Exploiting fibre infrastructures, UK ESLEA project closing conference: http://www.eslea.uklight.ac.uk/conf.html

- UCL RealityGrid project: http://www.realitygrid.org

- Daresbury Laboratory HPCx super computer: http://www.hpcx.ac.uk

Links (3)

- End host monitoring, LISA (Localhost Information Service Agent): http://monalisa.cacr.caltech.edu

- Synack, alternative ping tool: http://www-iepm.slac.stanford.edu/tools/synack/

- Thrulay, iperf-like tool: http://www.internet2.edu/~shalunov/thrulay/

- Network Diagnostic Tool: http://e2epi.internet2.edu/ndt/

- SURFnet Detective: http://detective.surfnet.nl/en/index_en.html

- Sharing network performance data, OGF Network Measurements Working Group: http://nmwg.internet2.edu/

- TCP Selective Acknowledgements (SACK): http://www.ietf.org/rfc/rfc2018.txt

- Netlogger (Networked Application Logger): http://dsd.lbl.gov/NetLogger/