An Architecture-based Framework For Understanding Large-Volume Data Distribution

Chris A. Mattmann


USC CSSE Annual Research Review

March 17, 2009

Agenda


Research Problem and Importance


Our Approach


Classification


Selection


Analysis


Evaluation


Precision, Recall, Accuracy Measurements


Speed


Conclusion & Future Work

Research Problem and
Importance


Content repositories are
growing rapidly in size


At the same time, we
expect more immediate
dissemination of this data


How do we distribute it…


In a performant manner?


Fulfilling system
requirements?

Data Distribution Scenarios

A medium-sized volume of data, e.g., on the order of a gigabyte, needs to be delivered across a LAN, using multiple delivery intervals consisting of 10 megabytes of data per interval, to a single user.

A Backup Site periodically connects across the WAN to the Digital Movie Repository to back up its entire catalog and archive of over 20 terabytes of movie data and metadata.

Data Distribution Problem
Space

Insight: Software Architecture


The definition of a system in the form of its canonical
building blocks


Software Components: the computational units in the system




Software
Connectors
: the communications and interactions
between software components




Software Configurations: arrangements of components and
connectors and the rules that guide their composition

Data Distribution Systems

[Diagram: a Data Producer sends data through an unknown connector ("???") to multiple Data Consumers.]

Insight: Use Software Connectors to model data distribution technologies

Impact of Data Distribution
Technologies


Broad variety of data
distribution technologies


Some are highly
efficient, some more
reliable


P2P, Grid, Client/Server, and Event-based


Some are entirely appropriate to use, some are not appropriate

Data Movement Technologies


Wide array of available OTS “large-scale” connector technologies


GridFTP, Aspera, HTTP/REST, RMI, CORBA, SOAP, XML-RPC, Bittorrent, JXTA, UFTP, FTP, SFTP, SCP, Siena, GLIDE/PRISM-MW, and more


Which one is the best one?


How do we compare them


Given our current architecture?


Given our distribution scenarios & requirements?

Research Question


What types of software connectors are best suited for delivering vast amounts of data to users, satisfying their particular scenarios, in a performant and scalable manner, in these hugely distributed data systems?

Broad variety of distribution
connector families


P2P, Grid, Client/Server, and Event-based


Though each connector family varies
slightly in some form or fashion


They all share 3 common atomic connector
constituents


Data Access, Stream, Distributor


Adapted from our group’s ICSE2000 Connector
Taxonomy

Connector Tradeoff Space


Surveyed properties of 13 representative distribution
connectors, across all 4 distribution connector
families and classified them


Client/Server


SOAP, RMI, CORBA, HTTP/REST, FTP, UFTP, SCP,
Commercial UDP Technology


Peer to Peer


Bittorrent


Grid


GridFTP, bbFTP


Event-based


GLIDE, Siena

Large Heterogeneity in
Connector Properties

How do experts make these
decisions?


Performed survey of 33 “experts”


Experts defined to be


Practitioners in industry, building data-intensive systems


Researchers in data distribution


Admitted architects of data
distribution technologies


General consensus?


They don't know the how and the why about which connector(s) are appropriate


They rely on anecdotal evidence
and “intuition”

45% of respondents claimed to be uncomfortable being addressed as a data distribution expert.


Why is it bad to have these
types of experts?


Employ a small set of COTS, and/or pervasive
distribution technologies, and stick to them


Regardless of the scenario requirements


Regardless of the capabilities at user’s institutions


Lack a comprehensive understanding of
benefits/tradeoffs amongst available distribution
technologies


They have “pet technologies” that they have used in similar
situations


These technologies are not always applicable, and frequently satisfy only one or two scenario requirements while ignoring the rest

Our Approach: DISCO


Develop a software framework for:


Connector Classification


Build metadata profiles of connector technologies, describing their intrinsic properties (DCPs)


Connector Selection


Adaptable, extensible algorithm development framework
for selecting the “right” connectors (and identifying wrong
ones)


Connector Selection Analysis


Measurement of accuracy of results


Connector Performance Analysis

DISCO in a Nutshell

Scenario Language


Describes distribution scenarios

[Diagram: the Data Distribution scenario model and its dimensions:]

Total Volume (e.g., 10 MB, 100 GB; int + higher-order unit)

Delivery Schedule: Number of Intervals, Volume Per Interval, Timing of Interval

Performance Requirements: Consistency, Scalability, Dependability, Efficiency (1-10, computed scale)

Access Policies (e.g., SSL/HTTP 1.0, Linux File System Perms; string from controlled value range)

Geographic Distribution: WAN, LAN

Number of Users (e.g., 1, 10; int)

Number of User Types: Producers, Consumers; Automatic, Initiated (e.g., 1, 10; int)

Number of Data Types (e.g., 1, 10; int)

Types of Data: Data, Metadata
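A scenario along the dimensions above could be encoded as a simple record. This is only an illustrative sketch; the field names below are hypothetical stand-ins, not DISCO's actual scenario-language syntax.

```python
from dataclasses import dataclass, field

@dataclass
class DistributionScenario:
    """Illustrative encoding of a data distribution scenario.

    Field names are hypothetical; DISCO defines its own scenario language.
    """
    total_volume_bytes: int           # e.g., 10 MB, 100 GB
    num_intervals: int                # delivery schedule
    volume_per_interval_bytes: int
    num_users: int
    num_user_types: int               # producers, consumers
    num_data_types: int
    geographic_distribution: str      # "WAN" or "LAN"
    access_policies: list = field(default_factory=list)   # e.g., ["SSL/HTTP 1.0"]
    performance_reqs: dict = field(default_factory=dict)  # 1-10 scale values

# The gigabyte-over-LAN scenario from the earlier slide:
lan_scenario = DistributionScenario(
    total_volume_bytes=1_000_000_000,
    num_intervals=100,
    volume_per_interval_bytes=10_000_000,
    num_users=1,
    num_user_types=1,
    num_data_types=1,
    geographic_distribution="LAN",
)
```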

Distribution Connector Model


Developed model for distribution
connectors


Identified combination of primitive
connectors that a distribution
connector is made from



Model defines important properties of
each of the important “modules” within
a distribution connector


Defines value space for each property


Defines each property


Properties are based on the
combination of underlying “primitive”
connector constituents


Model forms the basis for a metadata description (or profile) of a distribution connector


Distribution Connector Model
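A metadata profile built from the model might look like the following dictionary, organized by the three atomic constituents named earlier (data access, stream, distributor). The property names and values here are illustrative stand-ins, not the model's actual DCP vocabulary.

```python
# Hypothetical metadata profile for one distribution connector,
# keyed by its primitive connector constituents.
gridftp_profile = {
    "name": "GridFTP",
    "family": "Grid",
    "data_access": {
        "locality": "remote",
        "cardinality": "many-senders-to-many-receivers",
    },
    "stream": {
        "delivery": "exactly-once",
        "throughput": "high",
        "buffering": True,
    },
    "distributor": {
        "naming": "hierarchical",
        "routing": "static",
    },
}

def property_value(profile, constituent, prop):
    """Look up one property within a constituent, or None if absent."""
    return profile.get(constituent, {}).get(prop)
```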

Selection Algorithms


So far


Let data system architects encode the data
distribution scenarios within their system using
scenario language


Let connector gurus describe important properties
of connectors using architectural metadata
(connector model)


Selection Algorithms


Use scenario(s) and connector properties to identify the “best” connectors for the given scenario(s)

Selection Algorithms


Formal Statement of the problem




Selection algorithm interface



[Diagram: the selection algorithm takes a Connector KB and a scenario, and produces a ranked list:]

(bbFTP, 0.157)

(FTP,0.157)

(GridFTP,0.157)

(HTTP/REST, 0.157)

(SCP, 0.157)

(UFTP, 0.157)

(Bittorrent, 0.021)

(CORBA, 0.005)

(Commercial UDP Technology,
0.005)

(GLIDE, 0.005)

(RMI, 0.005)

(Siena, 0.005)

(SOAP, 0.005)

This interface is desirable
because it allows a user to rank
and compare how “appropriate”
each connector is, rather than
having a binary decision
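The interface in the diagram (knowledge base plus scenario in, ranked (connector, score) list out) could be expressed as follows. The scoring and normalization here are a trivial placeholder to show the shape of the interface, not DISCO's algorithms.

```python
def select_connectors(knowledge_base, scenario, score):
    """Rank every connector in the KB against a scenario.

    knowledge_base: dict mapping connector name -> metadata profile
    scenario: the distribution scenario to satisfy
    score: a function (profile, scenario) -> non-negative float
    Returns a list of (connector, score) pairs, best first,
    normalized so the scores sum to 1.
    """
    raw = {name: score(profile, scenario)
           for name, profile in knowledge_base.items()}
    total = sum(raw.values()) or 1.0  # avoid division by zero
    return sorted(((name, s / total) for name, s in raw.items()),
                  key=lambda pair: pair[1], reverse=True)
```

As the slide notes, returning a ranked list rather than a yes/no answer lets the user compare how "appropriate" each connector is.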

Selection Algorithms

Selection Algorithm Approach


White box


Consider the internal properties of a connector (e.g., its internal architecture) when selecting it for a distribution scenario


Black box


Consider the external (observable) properties of the connector (such as performance) when selecting it for a distribution scenario

Develop complementary
selection algorithms


Users familiar with connector technologies develop score functions


Relating observable properties (performance reqs) of connector to scenario dimensions


Software architects fill out Bayesian domain profiles containing conditional probabilities


Likelihood a connector, given attribute A and its value, and given scenario requirement, is appropriate for scenario S
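The Bayesian side could combine an architect-supplied domain profile of conditional probabilities roughly as follows. The independence assumption, the default value, and all probability numbers here are illustrative, not the framework's exact formulation.

```python
def bayesian_score(connector, scenario_reqs, domain_profile):
    """Score a connector by combining conditional probabilities.

    domain_profile[(connector, attribute, value)] is an architect's
    estimate of P(connector is appropriate | attribute = value).
    Attributes are treated as independent, so estimates multiply;
    0.5 is used when the profile has no opinion.
    """
    score = 1.0
    for attribute, value in scenario_reqs.items():
        score *= domain_profile.get((connector, attribute, value), 0.5)
    return score

# Hypothetical domain profile covering two connectors.
profile = {
    ("GridFTP", "total_volume", "large"): 0.9,
    ("GridFTP", "geography", "WAN"): 0.8,
    ("SOAP", "total_volume", "large"): 0.1,
    ("SOAP", "geography", "WAN"): 0.6,
}
reqs = {"total_volume": "large", "geography": "WAN"}
```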

Selection Analysis


How do we make decisions based on a
rank list?


Insight:
looking at the rank list, it is
apparent that many connectors are
similarly ranked, while many are not


Appropriate versus Inappropriate?


Selection Analysis

appropriate:

(bbFTP, 0.1579)

(FTP, 0.1579)

(GridFTP, 0.1579)

(HTTP/REST, 0.1579)

(SCP, 0.1579)

(UFTP, 0.1579)

inappropriate:

(Bittorrent, 0.0211)

(CORBA, 0.0053)

(Commercial UDP Technology, 0.0053)

(GLIDE, 0.0053)

(RMI, 0.0053)

(Siena, 0.0053)

(SOAP, 0.0053)

Selection Analysis

Selection Analysis


Employed k-means data clustering algorithm


k parameter defines how many sets data is partitioned into


Allows for clustering of data points (x, y) around a “centroid” or mean value


We developed an exhaustive connector clustering algorithm based on k-means


clusters connectors into 2 groups: appropriate and inappropriate


uses connector rank value as y parameter (x is the connector name)


exhaustive in the sense that it iterates over all possible connector clusters (vanilla k-means is heuristic & possibly incomplete)
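An exhaustive two-cluster split of a one-dimensional rank list can check every cut point of the score-sorted list and keep the split minimizing within-cluster squared deviation; for 1-D data the optimal two clusters are always contiguous in sorted order, so checking cuts is exhaustive. This is a sketch of the idea, not DISCO's implementation.

```python
def exhaustive_two_clusters(ranked):
    """Split (name, score) pairs into appropriate/inappropriate groups.

    Tries every cut point of the score-sorted list and keeps the
    split with the smallest total within-cluster sum of squared
    deviations -- exhaustive, unlike heuristic k-means.
    """
    def sse(scores):
        # Sum of squared deviations from the cluster mean.
        if not scores:
            return 0.0
        mean = sum(scores) / len(scores)
        return sum((s - mean) ** 2 for s in scores)

    ordered = sorted(ranked, key=lambda p: p[1], reverse=True)
    best = None
    for cut in range(1, len(ordered)):
        hi, lo = ordered[:cut], ordered[cut:]
        cost = sse([s for _, s in hi]) + sse([s for _, s in lo])
        if best is None or cost < best[0]:
            best = (cost, hi, lo)
    _, appropriate, inappropriate = best
    return appropriate, inappropriate
```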


Tool Support


Allows a user to utilize different connector
knowledge bases, configure selection
algorithms and execute them and visualize
their results




Decision Process

[Results chart: 87% and 80.5%.]
Precision - the fraction of connectors correctly identified as appropriate for a scenario


Accuracy - the fraction of connectors correctly identified as appropriate or inappropriate for a scenario

Decision Process: Speed

Conclusions & Future Work


Conclusions


Domain experts (gurus) rely on tacit knowledge and
often cannot explain design rationale


Disco provides a quantification of & framework for
understanding an ad hoc process


Bayesian algorithm has a higher precision rate


Future Work


Explore the tradeoffs between white-box and black-box approaches


Investigate the role of architectural mismatch in
connectors for data system architectures

Thank You!


Questions?

Backup

Related Work


Software Connectors


Mehta00 (Taxonomy), Spitznagel01,
Spitznagel03, Arbab04, Lau05


Data Distribution/Grid Computing


Crichton01, Chervenak00, Kesselman01


COTS Component/Connector selection


Bhuta07, Mancebo05, Finkelstein05


Data Dissemination


Franklin/Zdonik97