4-scalabilityx - Users Site


4 Dec 2013


Scalability

1. Optimizing P2P Networks: Lessons learned from social networking
   a) Social Networks
   b) Lessons Learned
   c) Organizing P2P Networks

2. Node Topologies
   a) Centralized, Ring, Hierarchical & Decentralized
   b) Hybrid:
      o Centralized-Ring
      o Centralized-Centralized
      o Centralized-Decentralized
   c) Reflector Nodes

3. Gnutella Case Studies
   a) 3 case studies

4. DHTs
   a) What are they?
   b) Example

Social Networks

Stanley Milgram (then a professor at Harvard)
1967 social networking experiment
  o How many ‘social hops’ would it take for messages to traverse the US population (200 million)?
  o Posted 160 letters to randomly recruited people in Omaha, Nebraska
  o Asked them to try to pass these letters to a stockbroker working in Boston, Massachusetts

[Figure: map marking Omaha and Boston]

Rules:
  o use intermediaries whom they know on a first-name basis
  o intermediaries chosen intelligently
  o make a note at each hop

42 letters made it in one version of the experiment
  o Average of 5.5 hops

Demonstrated the ‘small world effect’
  o Suggests that the social network of the United States is indeed connected, with a path length (number of hops) of around 6
  o The 6 degrees of separation!

Does this mean that it takes 6 hops to traverse 200 million people??

Lessons Learned from Milgram’s Experiment

Social circles are highly clustered
A few members have wide-ranging connections
  o these form a bridge between far-flung social clusters
  o this bridging plays a critical role in bringing the network closer together

For example:
  o A quarter of all letters passed through a local storekeeper
  o A half were mediated by just 3 people

Lessons learned:
  o These people acted as gateways or hubs between the source and the wider world
  o A small number of bridges dramatically reduces the number of hops
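The effect of a few bridges on hop count can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Milgram's method: the network is a hypothetical ring of 400 nodes in which everyone knows only their 2 nearest neighbours on each side, and the "bridges" are 20 random long-range links. All names and sizes are chosen for illustration.

```python
import random
from collections import deque

def ring_lattice(n, k):
    """Each node links to its k nearest neighbours on each side (a clustered, local network)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for d in range(1, k + 1):
            j = (i + d) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj

def add_bridges(adj, count, rng):
    """Add `count` random long-range links between nodes that are not yet connected."""
    nodes = list(adj)
    added = 0
    while added < count:
        a, b = rng.sample(nodes, 2)
        if b not in adj[a]:
            adj[a].add(b)
            adj[b].add(a)
            added += 1

def average_hops(adj):
    """Mean shortest-path length over all reachable pairs, using BFS from every node."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

rng = random.Random(42)          # fixed seed so the run is repeatable
net = ring_lattice(n=400, k=2)
before = average_hops(net)
add_bridges(net, count=20, rng=rng)
after = average_hops(net)
print(f"average hops without bridges: {before:.1f}")
print(f"average hops with 20 bridges: {after:.1f}")
```

The 20 bridges are a small fraction of the 800 local links, yet every bridge shortens at least one path, so the average hop count can only go down.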

From Social Networks to Computer Networks…

There are a number of similarities to social networks
  o People = peers
  o Intermediaries = hubs, gateways or rendezvous nodes (JXTA speak...)
  o Number of intermediaries passed through = number of hops

Are P2P Networks Special then?

P2P networks are more like social networks than other types of computer network because they often:
  o are self-organizing
  o are ad hoc
  o employ clustering techniques based on prior interactions (like we form relationships)
  o use decentralized discovery and communication (like we form neighbourhoods, villages, cities etc.)

What about social networking sites?
  o Huge: “If Facebook were a country, it would be the eighth most populated in the world, just ahead of Japan, Russia and Nigeria.”
  o But the application overlay network does not reflect the social network
  o They use centralized data centers



Peer to Peer: What’s the problem?

Problem: how do we organize peers within ad-hoc, multi-hop pervasive P2P networks?
  o a network of self-organizing peers organized in a decentralized fashion
  o such networks can rapidly expand from a few hundred peers to several thousand or even millions

P2P Environment Recap:

Unreliable environments
  o Peers connecting/disconnecting, for reasons ranging from network failures to participation
  o Random failures, e.g. power outages, cable or DSL failure, hackers
  o Personal machines are much more vulnerable than servers
  o Algorithms have to cope with this continuous restructuring of the network core

P2P systems need to treat failures as normal occurrences, not freak exceptions
  o they must be designed in a way that promotes redundancy, with the tradeoff of a degradation of performance

So, how do we Organize Networks in Order to Get Optimum Performance?

For P2P, performance does not mean abstract numerical benchmarks, e.g. how many milliseconds will it take to compute this many millions of FFTs?

Rather, it means asking questions like:
  o How long will it take to retrieve this particular file?
  o How much bandwidth will this query consume?
  o How many hops will it take for my packet to get to a peer on the far side of the network?
  o If I add/remove a peer to/from the network, will the network still be fault tolerant?
  o Does the network scale as we add more peers?

Performance Issues in P2P Networks

3 main factors make P2P networks more sensitive to performance issues:

1. Communication
  o Fundamental necessity
  o Users connected via different connection speeds
  o Multi-hop

2. Searching
  o No central control, so more effort is needed
  o Each hop adds to total bandwidth

3. Equal Peers
  o Free riders
  o Imbalance in the harmony of the network
  o Degrades performance for others

Need to get this right and adjust accordingly
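The point that each hop adds to total bandwidth can be made concrete by counting messages in a flooded search. The sketch below assumes a hypothetical overlay of 50 peers with 4 neighbours each and Gnutella-style flooding with a time-to-live (TTL). The topology, the `flood` function and the numbers are illustrative only.

```python
from collections import deque

def flood(adj, source, ttl):
    """Flood a query from `source`; return (peers reached, messages sent).

    A peer forwards the first copy it receives to every neighbour except the
    one it came from. Duplicate copies still cost a message but are dropped.
    """
    seen = {source}
    messages = 0
    queue = deque([(source, None, ttl)])
    while queue:
        node, came_from, hops_left = queue.popleft()
        if hops_left == 0:
            continue
        for nb in adj[node]:
            if nb == came_from:
                continue
            messages += 1              # one transmission per link used
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, node, hops_left - 1))
    return len(seen) - 1, messages

# Hypothetical overlay: peer i is linked to peers i±1 and i±5 (mod 50).
N_PEERS = 50
adj = {i: sorted({(i + d) % N_PEERS for d in (1, -1, 5, -5)}) for i in range(N_PEERS)}

for ttl in range(1, 6):
    reached, messages = flood(adj, source=0, ttl=ttl)
    print(f"TTL {ttl}: reached {reached} peers using {messages} messages")
```

Raising the TTL reaches more peers, but the message count grows faster than the number of new peers because duplicate copies are also transmitted.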

Peer Topologies

Core
  o Centralized
  o Ring
  o Hierarchical
  o Decentralized

Hybrid
  o Centralized-Ring
  o Centralized-Centralized
  o Centralized-Decentralized

Centralized
  o Client/server
  o Web servers
  o Databases
  o Napster search
  o Instant Messaging

Ring
  o Failover clusters
  o Simple load balancing
  o Assumption: single owner
  o Co-ordination

Hierarchical
  o Tree structure
  o DNS
  o www.example.com

Decentralized
  o Gnutella
  o Freenet
  o Internet routing

Centralized + Ring
  o Robust web applications
  o High availability of servers

Centralized + Centralized
  o N-tier apps
  o Database heavy systems
  o Web services gateways
  o Google.com uses this topology to deliver their search engine

Centralized + Decentralized
  o New Wave of P2P
  o Clip2 Gnutella Reflector (next)
  o FastTrack
  o KaZaA
  o Morpheus
  o Email
  o Like Social Networks perhaps?

Reflector Nodes

[Figure: reflector index example. Labels: F1.mp3, F2.mp3, F3.mp3; 0, 1, 2; C; ID0:F1.mp3]

Known as ‘super peers’
  o in JXTA these are Rendezvous peers

Cache the file lists of connected users
  o maintain an index

When a query is issued, the Reflector does not retransmit it; it answers the query from its own memory

Do they remind you of anything?
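The reflector behaviour described above (cache the file lists of connected peers, then answer queries from memory) can be sketched as a small in-memory index. The `Reflector` class and its method names are hypothetical and are not taken from the Clip2 Reflector or from JXTA.

```python
from collections import defaultdict

class Reflector:
    """Hypothetical super peer that indexes the files of its connected peers."""

    def __init__(self):
        self.index = defaultdict(set)   # filename -> ids of peers holding it

    def register(self, peer_id, filenames):
        """Called when a peer connects and uploads its file list."""
        for name in filenames:
            self.index[name].add(peer_id)

    def unregister(self, peer_id):
        """Called when a peer disconnects; remove it from every entry."""
        for holders in self.index.values():
            holders.discard(peer_id)

    def query(self, filename):
        """Answer from the local index; the query is not forwarded to any peer."""
        return sorted(self.index.get(filename, ()))

reflector = Reflector()
reflector.register(0, ["F1.mp3"])
reflector.register(1, ["F1.mp3", "F2.mp3"])
reflector.register(2, ["F3.mp3"])
print(reflector.query("F1.mp3"))
```

Because the lookup never leaves the reflector, the connected peers see no query traffic at all, which is the same trade a central index makes.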

Napster = Gnutella?

[Figure: Napster compared with Gnutella. Labels: User, Napster.com; User, Napster, N2, N3 (Napster duplicated servers)]

Gnutella Super Peers:
1. Natural??
2. Reflector (clip2.com)

The Gnutella Network

The figure below is a view of the topology of a Gnutella network as shown on the LimeWire web site, the popular Gnutella file-sharing client. Notice how the power-law or centralized-decentralized structure is demonstrated.

[Figure: topology of a Gnutella network, from the LimeWire web site]

Another View of the Gnutella Network


Gnutella Studies 1: Free Riding

E. Adar and B.A. Huberman (2000), “Free Riding on Gnutella,” First Monday 5(10),
http://firstmonday.org/issues/issue5_10/adar/index.html

Two types of free riding:
1. users who download files but never provide any files for others to download
2. users whose shared content is undesirable

Findings:
  o 22,084 of the 33,335 peers in the network (66%) share no files
  o 24,347 (73%) share ten or fewer files
  o the top 1 percent (333 hosts) account for 37 percent of the total files shared
  o the top 20 percent (6,667 hosts) share 98% of the files

This shows that, even without Gnutella Reflector nodes, the Gnutella network naturally converges into a centralized + decentralized topology, with the top 20% of nodes acting as super peers or reflectors

Gnutella Studies 2: Equal Peers

Study on Reflector Nodes [clip], www.clip2.com
  o Studied Gnutella for one month
  o Noted an apparent scalability barrier when query rates went above 20 per second

Why??
  o In a network of roughly 1000 nodes, a servent must handle up to 20 queries per second
  o A dial-up 56K link cannot keep up with this amount of traffic
  o One node connected in the wrong place can grind the whole network to a halt, because it becomes a dead end
  o The network fragments

This is why P2P networks place slower nodes at the edges
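A back-of-the-envelope calculation shows why a 56K link struggles at this query rate. Only the 20 queries per second comes from the slide. The message size, the share of traffic that queries represent and the number of connections are assumptions chosen for illustration, so the result indicates scale and is not a measurement.

```python
QUERIES_PER_SECOND = 20         # rate quoted on the slide
BITS_PER_QUERY = 560            # assumption: average size of one query message
QUERY_SHARE_OF_TRAFFIC = 0.25   # assumption: queries are a quarter of all traffic
CONNECTIONS = 3                 # assumption: links the servent keeps open
DIALUP_BPS = 56_000             # nominal capacity of a 56K modem

# Bandwidth needed for queries alone on one link.
query_bps = QUERIES_PER_SECOND * BITS_PER_QUERY
# Scale up to all message types, then to every open connection.
total_bps = query_bps / QUERY_SHARE_OF_TRAFFIC * CONNECTIONS

print(f"query traffic per link : {query_bps:,} bit/s")
print(f"total across all links : {total_bps:,.0f} bit/s")
print(f"fits a 56K modem       : {total_bps <= DIALUP_BPS}")
```

Under these assumptions the required rate is more than twice what the modem can carry, so the dial-up peer drops or delays messages and becomes the dead end described above.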

Gnutella Studies 3: Communication

Peer-to-Peer Architecture Case Study: Gnutella Network
Matei Ripeanu, on-line at:
http://people.cs.uchicago.edu/~matei/PAPERS/P2P2001.pdf

Studied topology of Gnutella over several months & reported two findings:

1. Gnutella network shares the benefits and drawbacks of a power-law structure
  o networks that organize themselves so that most nodes have a few links and a small number of nodes have many
  o found to show an unexpected degree of robustness when facing random node failures
  o vulnerable to attacks, e.g. removing a few of the super nodes can have a massive effect on the function of the network as a whole

2. Gnutella network topology does not match well with the underlying Internet topology, leading to inefficient use of network bandwidth

He gave 2 suggestions:

1. use an agent to monitor the network and intervene by asking servents to drop/add links to keep the topology optimal
2. replace the Gnutella flooding mechanism with a smarter routing and group communication mechanism

Gnutella Studies

1. Gnutella shows properties associated with a power-law distribution
  o (e.g., a node with twice the connections is four times less frequent)

2. Power-law distributions happen all over the place in nature and society:
  o Word frequency distribution
  o Sizes of meteorites and sand particles
  o Sizes of cities
  o The Pareto principle (80-20 rule): 20% of the population own 80% of the wealth
  o Zipf distribution (and Zipf-Mandelbrot); Mandelbrot coined the term fractal
  o It has re-emerged recently as The Long Tail on the Web
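The "twice the connections, four times less frequent" example corresponds to a power law with exponent 2. As a worked form, assume that exponent (the slide implies it but does not state it) and write P(k) for the fraction of nodes with k connections, with C a normalizing constant:

```latex
P(k) = C\,k^{-\gamma}
\quad\Longrightarrow\quad
\frac{P(2k)}{P(k)} = \frac{C\,(2k)^{-\gamma}}{C\,k^{-\gamma}} = 2^{-\gamma},
\qquad
\gamma = 2 \;\Rightarrow\; \frac{P(2k)}{P(k)} = \frac{1}{4}.
```

The ratio does not depend on k, which is the defining property of a power law: doubling the degree cuts the frequency by the same factor everywhere in the distribution.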

Scalability Through Structure

Gnutella and Kazaa can be classified as ‘unstructured’ networks
  o interconnection of nodes is ad hoc, highly dynamic, and defined independently by each node according to individual requirements
  o settles into a topology with qualities associated with a power-law distribution

A class of P2P systems known as ‘structured’ evolved just after the millennium
  o Chord
  o CAN
  o Pastry
  o Tapestry

Generally a form of Distributed Hash Table (DHT)

What are DHTs?

A DHT is a topology that provides similar functionality to a typical hash table
  o put(key, value)
  o get(key)

Peers are buckets in the table
  o each with its own local hash table

Allows a peer to publish a resource onto a network, using a key to determine where the data will be stored (i.e. which peer will receive the data)

Using keys presupposes a logical ‘space’ which the keys map onto
  o The key is mapped to the space using a hashing function, to ensure equal distribution of resources across the network
  o Nodes are responsible for sections of this space
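The put/get interface and the idea of peers owning sections of a key space can be sketched in a few lines. This is a toy, single-process model and not any particular DHT: the `ToyDHT` class, the 16-bit space and the use of SHA-1 as the hashing function are all assumptions made for illustration.

```python
import hashlib
from bisect import bisect_left

SPACE_BITS = 16  # assumption: a small 16-bit logical key space

def to_space(text):
    """Hash a string onto the logical space [0, 2**SPACE_BITS)."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** SPACE_BITS)

class ToyDHT:
    def __init__(self, peer_names):
        # Each peer owns the section of the space that ends at its own position.
        self.positions = sorted((to_space(name), name) for name in peer_names)
        # Each peer is a bucket with its own local hash table.
        self.buckets = {name: {} for name in peer_names}

    def owner(self, key):
        """Return the peer responsible for the section that `key` hashes into."""
        point = to_space(key)
        idx = bisect_left(self.positions, (point, ""))
        return self.positions[idx % len(self.positions)][1]  # wrap around

    def put(self, key, value):
        self.buckets[self.owner(key)][key] = value

    def get(self, key):
        return self.buckets[self.owner(key)].get(key)

dht = ToyDHT(["peer-a", "peer-b", "peer-c", "peer-d"])
dht.put("F1.mp3", "held by 10.0.0.7")   # hypothetical value
print(dht.owner("F1.mp3"), dht.get("F1.mp3"))
```

Any peer that knows the key can compute the same owner, so a lookup goes straight to the responsible peer without flooding the network.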


Why DHTs?

Address the flooding issue without resorting to a centralized/decentralized architecture

Typically, search can be achieved in O(log n) hops, where n is the number of nodes in the network
  o only a few neighbors need to be known, typically O(log n)
  o small neighborhoods and a flat topology make for a robust network in which churn is easy to handle


Example: Chord Topology

Divides the key space into a circle
  o keys are n bits in size
  o the ring can contain up to 2^n nodes
  o keys can range from 0 to 2^n - 1

Consistent hashing (with a base hash function such as SHA-1) is used to evenly distribute keys around the ring
  o increases probability of robustness
  o allows nodes to join and leave without disrupting the network
  o only an O(1/N) fraction of keys are moved to a different location, where N is the number of nodes

Node IDs are distributed based on the key size and the number of nodes in the network
  o a node should be responsible for roughly (number of keys)/(number of nodes) keys
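The claim that a join moves only a small fraction of keys can be checked with a minimal consistent-hashing sketch. It assumes 32-bit identifiers, SHA-1 as the base hash, and hypothetical node and file names. It models only the key-to-node assignment, not Chord's join protocol.

```python
import hashlib
from bisect import bisect_left

M = 32  # assumption: 32-bit identifiers, so ring positions run from 0 to 2**M - 1

def ident(text):
    """Map a node name or key name onto the identifier ring."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big") % (2 ** M)

def successor(ring, key_id):
    """First node id at or after key_id, wrapping back to the start of the ring."""
    idx = bisect_left(ring, key_id)
    return ring[idx % len(ring)]

def assign(node_names, keys):
    """Return {key: id of the node responsible for it}."""
    ring = sorted(ident(name) for name in node_names)
    return {k: successor(ring, ident(k)) for k in keys}

nodes = [f"node-{i}" for i in range(20)]
keys = [f"file-{i}.mp3" for i in range(5000)]

before = assign(nodes, keys)
after = assign(nodes + ["node-new"], keys)   # one node joins
moved = sum(1 for k in keys if before[k] != after[k])

print(f"{moved} of {len(keys)} keys moved ({moved / len(keys):.1%})")
print(f"an even 1/N share with N = 21 would be {1 / 21:.1%}")
```

Every key that moves ends up on the joining node. Keys held by all other nodes stay where they were, which is why joins and departures do not disrupt the rest of the ring.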

Chord Finger Tables

Just knowing your predecessor and successor leads to very bad performance
  o O(n) hops to find a key (about n/2 expected)

Chord nodes have a routing (finger) table containing O(log n) nodes
  o the distance of nodes in the table increases exponentially
  o having this many nodes in the finger table means O(log n) hops are needed to find the key

For each query for key k there is a choice of O(log n) nodes. Choose the one whose id is closest to k without passing it
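Finger-table routing can be sketched on a tiny ring. The example assumes 6-bit identifiers and a hypothetical set of ten node ids. Each node's finger i points at the first node at or after its own id plus 2^i, and a lookup repeatedly jumps to the finger closest to the key without passing it. This is a simplified, single-process model and it omits joins, failures and stabilization.

```python
M = 6                        # assumption: 6-bit identifiers, ring positions 0..63
RING = 2 ** M
nodes = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])   # hypothetical node ids

def successor(ident):
    """First node at or after ident, wrapping around the ring."""
    ident %= RING
    for n in nodes:
        if n >= ident:
            return n
    return nodes[0]

def fingers(n):
    """Finger i points at successor(n + 2**i), so distances double each step."""
    return [successor(n + 2 ** i) for i in range(M)]

def in_interval(x, a, b):
    """True if x lies in the ring interval (a, b], walking clockwise from a."""
    if a < b:
        return a < x <= b
    return x > a or x <= b

def lookup(start, key):
    """Route from `start` to the node responsible for `key`; return (node, hops)."""
    node, hops = start, 0
    while True:
        table = fingers(node)
        succ = table[0]
        if in_interval(key, node, succ):
            return succ, hops + 1          # final hop to the responsible node
        # Otherwise jump to the farthest finger that does not pass the key.
        for f in reversed(table):
            if f != key and in_interval(f, node, key):
                node, hops = f, hops + 1
                break

print(fingers(8))
print(lookup(8, 54))
```

From node 8 the lookup for key 54 visits nodes 42 and 51 before arriving at node 56, which is 3 hops on a ring of 10 nodes. Following successors alone would take 8 hops from the same starting point.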

DHT Issues

DHTs are structured. Maintaining the structure has overhead.
  o presupposes equal capabilities in nodes
  o NOT a power-law distribution

Not always possible to have fuzzy or attribute-based queries. It’s a lookup facility
  o you need to know the key

Searching on a Gnutella network is open ended
  o you may get results, you may not

DHT algorithms are deterministic and designed for lookup
  o so data going missing is more problematic
  o replication needs to be employed to ensure data availability

Closing Remarks

1. Summary
   a) Centralized + Decentralized: understand the progression from the original Gnutella to the new models
   b) The role of Reflector nodes
   c) Structured topologies (DHTs): efficient lookup without centralization