presentations from this session combined - eBay Research Labs


eBay Research Labs

Distributed Computing

Sunil Mohan

Evan Chiu

Paul Strong



28th September 2007

eBay Inc. Confidential


Overview


Distributed Systems/Grid Computing


Network scale


more resources


Inherent resilience


multiple instances and replicas


Integrated system



Leverages trends


Bandwidth and latency improvements turn discrete
components into a fabric


From server centric applications to network distributed
services



Opportunities & Challenges



Opportunities


Solve previously insoluble problems


Faster time to result


Greater agility


Economies of scale


Challenges


Managing the inherent complexity


Presentations


Scalable Web Log Processing (Sunil Mohan)


Using compute clusters to scale analytics, specifically Sojourner log file
processing


Mobius (Evan Chiu)


Using compute clusters to scale analytics, specifically analyzing and
visualizing user activity patterns (page flow) within eBay.com.


Grid Computing & Enterprise Management (Paul Strong)


Understanding the challenges of managing at scale


Modeling massive, distributed systems


Using Semantic Web technologies to help understand our infrastructure


Building infrastructure visualization, navigation and management tools



eBay Research Labs

Scalable Web Log Processing

Evan Chiu, Sunil Mohan



September 26, 2007


Agenda


Distributed Architecture for Processing Sojourner Logs


eBay Trends & Merchandising


Web Log Mining Takes a Looooong Time

eBay Trends

Tracking Activity Volume for Specific Keywords


Hundreds of GBytes of data (eBay Web Logs)


Days to run


Timely results were important

Parallelize the Task


Limited to a single computer and a small number of processes


Distribute Across Several Machines?

Easily?

Support Multiple Users?


Sojourner: eBay’s Web Session Log

Sojourner Logs Page Traffic


GUID (pseudo-UserId) and Session-ID


Page with Modal Parameters


Built on CAL



[Diagram: App Server, App Server]
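
A minimal sketch of how such a log record might be represented and grouped into sessions. The `SojournerEvent` class, its field names and the sample values are hypothetical; the slides list only the kinds of fields, not the actual Sojourner format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SojournerEvent:
    """Hypothetical page-view record; not the real Sojourner schema."""
    guid: str           # pseudo-UserId
    session_id: str
    page: str
    params: tuple = ()  # modal parameters as (name, value) pairs


def group_by_session(events):
    """Group events by (guid, session_id), preserving input order."""
    sessions = defaultdict(list)
    for event in events:
        sessions[(event.guid, event.session_id)].append(event)
    return dict(sessions)


# Made-up events for illustration.
events = [
    SojournerEvent("g1", "s1", "Home"),
    SojournerEvent("g2", "s7", "Search", (("query", "ipod"),)),
    SojournerEvent("g1", "s1", "ViewItem"),
]
sessions = group_by_session(events)
print([e.page for e in sessions[("g1", "s1")]])
```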


Mining Sojourner Logs

In Production, e.g. for Finding


Query Traffic


Demand Services


Related Products, Queries, etc.




UED


Many reports, including page flow analysis for revenue attribution

Research Labs


Page Flows (e.g. Mobius)


User Behavior and Session Analysis


eBay Trends


Distributable

Sojourner Produces a Lot of Data

Over 2 Billion Events / Day !

Sojourner Listener for Finding


5 machines in parallel to process live logs

eRL Sojourner Feed (from UED)


~ 25% of scrubbed data



0.5 Billion Events / day



~ 50 GBytes / day, after compression

Data is Organised


80 files / day


Each file is organized into complete user sessions
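
As a rough worked example, the figures above imply the following per-file and per-event sizes. Two assumptions are not stated in the slides: 1 GByte is taken as 10^9 bytes, and events and bytes are assumed to be spread evenly across the daily files.

```python
# Back-of-the-envelope figures for the eRL Sojourner feed.
# Assumptions (not from the slides): 1 GByte = 10**9 bytes, and events
# and bytes are spread evenly across the daily files.
EVENTS_PER_DAY = 0.5e9            # ~0.5 billion events / day
COMPRESSED_BYTES_PER_DAY = 50e9   # ~50 GBytes / day after compression
FILES_PER_DAY = 80

events_per_file = EVENTS_PER_DAY / FILES_PER_DAY
bytes_per_file = COMPRESSED_BYTES_PER_DAY / FILES_PER_DAY
bytes_per_event = COMPRESSED_BYTES_PER_DAY / EVENTS_PER_DAY

print(f"{events_per_file:,.0f} events per file")
print(f"{bytes_per_file / 1e6:,.0f} MBytes per file (compressed)")
print(f"{bytes_per_event:,.0f} compressed bytes per event")
```

Because each file holds complete user sessions, a session-level analysis can in principle process files independently of one another.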


Example Sojourner Log Mining Application

eBay Trends

Associating eBay activity with current events

Current Events Have Impact on eBay Activity


Events


New Product Release, e.g. “Apple’s iPhone”


Celebrity, e.g. “Kobe Bryant scores 50”, or “Kurt Vonnegut Dies”



Activity


Queries, Item-Views


Bids, BINs


Listings


eBay Marketing Message

Urgency

News


eBay Trends: Sample Event

Event: “Kurt Vonnegut Dies”.

Date: 2007-04-11






eBay Trends: eRL Prototype

[Diagram labels: News Web Sites, News Crawler, Sojourner Logs, Sojourner Miner, decision “Activity Up?” (Yes), Event Flagged, eBay Merchandising Engine]

Quick Detection
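
The “Activity Up?” decision in the diagram could be approximated as follows. This is only a sketch: the slides do not say how the prototype decides that activity is up, so the ratio test, the `min_ratio` threshold and the sample counts are all assumptions.

```python
def activity_is_up(baseline_counts, current_count, min_ratio=2.0):
    """Return True if current activity exceeds the baseline mean by min_ratio.

    baseline_counts: daily activity counts (e.g. queries) for a keyword over
    a preceding window; current_count: the count on the day of the event.
    """
    if not baseline_counts:
        return False
    baseline_mean = sum(baseline_counts) / len(baseline_counts)
    if baseline_mean == 0:
        return current_count > 0
    return current_count / baseline_mean >= min_ratio


# Made-up daily query counts for a keyword before and on the event day.
print(activity_is_up([100, 120, 80], 450))  # ratio 4.5 against a mean of 100
print(activity_is_up([100, 120, 80], 110))  # ratio 1.1 against a mean of 100
```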


eBay Trends: Sample Output

Event: “Apple iPhone Goes On Sale”

Date: 2007-06-29

Apple iPhone goes on sale today

Apple iPhone Goes on Sale Today - Christian Broadcasting Network

Apple’s Long-Awaited iPhone Goes on Sale - New York Times

Apple iPhone Reviews

USA TODAY REVIEW: Apple's iPhone isn't perfect, but it's worthy of ... - Detroit Free Press

Apple's IPhone May Wow New Yorkers; AT&T's Service, Not So Much - Bloomberg


Can We Spot New Trends Quickly?

Early Detection of Trends


Merchandise


Advertise


Sojourner Log Processing is the Bottleneck!


Surely We Can Do Better …


Reliable Scalable Computing with MapReduce


Designed to handle Vast Data Volumes


Focus on data intensive processing


Programmers Don’t Need Special Parallel Programming Knowledge


Map: (key, value) → intermediate (key', value') pairs


Reduce: intermediate (key', value') pairs → reduced (aggregated) data


Automatic Parallelization and Execution


Data Partitioning


Scheduling on machines in cluster


Handles machine failure


Handles inter-machine communication


Schedules tasks on nodes close to data


Designed to be executed on Commodity Hardware


One Company’s Deployment: 1800 dual-2GHz Xeon machines, 4GB RAM, 320 GB Disk, and Gigabit Ethernet
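
The Map and Reduce steps can be illustrated with a single-process sketch that counts keyword occurrences in queries. The function names and sample records are hypothetical, and the `run_mapreduce` driver only imitates on one machine the grouping that a real framework performs across a cluster; partitioning, scheduling and failure handling are not modeled.

```python
from collections import defaultdict


def map_event(key, value):
    """Map: (key, value) -> intermediate (key', value') pairs.

    Here the value is a query string and each word is emitted with count 1.
    """
    for word in value.lower().split():
        yield word, 1


def reduce_counts(key, values):
    """Reduce: intermediate (key', value') pairs -> aggregated data."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    return dict(reducer(k, v) for k, v in grouped.items())


# Made-up (event id, query) records.
records = [(1, "apple iphone"), (2, "iphone case"), (3, "kurt vonnegut")]
counts = run_mapreduce(records, map_event, reduce_counts)
print(counts)
```

The programmer writes only the two small functions; everything in `run_mapreduce` is what the framework takes over.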


eRL Deployment of MapReduce using Hadoop



Java based implementation of MapReduce


Apache Project


Test Deployment on ERL Cluster


Nine 4-CPU (i386) machines, 5 processes / machine


Customized Sojourner APIs


Java


Perl


Performance Increase


Sample eBay Trends task processing 4 days of logs:


Perl with 4 processes on 1 machine: 9 hrs, 45 mins.


Java, Hadoop: 48 mins.


2.3 Billion Events, 191 GBytes


Sample Mobius Task: 7 days of logs:


Java with 5 threads, 1 machine: Over 20 hours


Java, Hadoop: ~ 2 hours
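
The speed-ups implied by these timings can be worked out directly. The Mobius figure is approximate because the slide gives only “over 20 hours” and “~ 2 hours”; the throughput figure assumes the 2.3 billion events refer to the 4-day eBay Trends sample.

```python
# Speed-ups implied by the timings above.
perl_minutes = 9 * 60 + 45   # Perl, 4 processes, 1 machine
hadoop_minutes = 48          # Java on Hadoop
trends_speedup = perl_minutes / hadoop_minutes

mobius_speedup = 20 / 2      # "over 20 hours" vs "~ 2 hours": approximate

events = 2.3e9               # assumed to be the 4-day eBay Trends sample
events_per_second = events / (hadoop_minutes * 60)

worker_processes = 9 * 5     # 9 machines x 5 processes / machine

print(f"eBay Trends speed-up: {trends_speedup:.1f}x")
print(f"Mobius speed-up: about {mobius_speedup:.0f}x")
print(f"Hadoop throughput: {events_per_second:,.0f} events / second")
print(f"Worker processes on the test cluster: {worker_processes}")
```

Note that the eBay Trends comparison also changes language (Perl to Java), so not all of the speed-up can be attributed to distribution alone.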


Conclusion


Sojourner Log Mining can be Scaled Up by a Large Factor (roughly 10x or more on the eRL test cluster)


Cheap commodity hardware


Current Events provoke spikes in eBay Activity


eBay Trends


New Applications through Real-Time or Near-Real-Time Scale-up


Merchandising by early notification of eBay Trends


Niche Merchandising


Researchers


Evan Chiu


Sunil Mohan

eBay Research Labs

Mobius

A user behavior visualization tool

Chi-Hsien Chiu

Ideas: Sundaresan, Neel

Mentor: Gupta, Raghav


Introduction



Organizations often generate and collect large volumes of data; most of
this information is usually generated automatically by Web servers and
collected in server log. Analyzing such data can help these organizations to
determine the value of particular customers, cross marketing strategies
across products and the effectiveness of promotional campaigns, etc.




Web Mining, wiki.


Sojourner
:



“A framework developed by eBay, it’s not any single product. This framework
allows us to track the entire progression of any specific user’s activity on the eBay
website, and provides a mechanism to track this exhaustively for each and every
user. In other words, it allows us to track user “sojourns” on the eBay site.”


from Sojourner Analytics Platform User Guide.




From the UBI Sojourner Training Slides


What’s Missing?


Effective visualization of the activity patterns of individual users or groups of
users on the eBay web properties.


Example Use Cases


eBay Motors site


Do users prefer the newly launched site or the previous one?


Do users browse the newly launched site following the patterns we expected?


Google AdWords


Are we paying for effective keywords?


How does performance differ before and after?


Before vs. the Aug. 2007 new launch


Did users go to page X after they entered eBay? And how?


Which pages did users go to after they visited page Y?


How did users leave the eBay site? What is the difference between them?
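
Questions such as “which pages did users go to after page Y?” and “how did users leave?” reduce to counting transitions within sessions. The sketch below assumes each session is already available as an ordered list of page names; the function names and sample sessions are hypothetical and do not reflect Mobius's actual implementation.

```python
from collections import Counter


def next_pages(sessions, page):
    """Count which page users visited immediately after `page`.

    sessions: iterable of page-name lists, one list per session, in visit order.
    """
    counts = Counter()
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            if current == page:
                counts[following] += 1
    return counts


def exit_pages(sessions):
    """Count the last page of each non-empty session (where users left)."""
    return Counter(pages[-1] for pages in sessions if pages)


# Made-up sessions; page names are illustrative only.
sessions = [
    ["Home", "Search", "ViewItem"],
    ["Home", "MotorsHome", "Search"],
    ["Search", "ViewItem", "Bid"],
]
print(next_pages(sessions, "Search").most_common())
print(exit_pages(sessions).most_common())
```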


Challenge


Sojourner logs are text files, so it is hard to run complex queries on them
(50 GB compressed per day, over a billion page views per day).


Collecting data across several days takes a long time (more than 1 TB, perhaps
more than 10 hours).


Collected path information is usually huge: millions of flows through the web
site in a single day are common. How should it be displayed to the user?


How can data be generated for multiple users at the same time?


Solution & Tools Used


Solution:


An interactive Web application that lets people visualize eBay users’ access patterns
with customized criteria.


Tools


Sojourner:


“A framework developed by eBay, it’s not any single product. This framework allows us
to track the entire progression of any specific user’s activity on the eBay website, and
provides a mechanism to track this exhaustively for each and every user. In other words,
it allows us to track user “sojourns” on the eBay site.”


from Sojourner Analytics Platform User Guide.


Apache Hadoop:


“A software platform that lets one easily write and run applications that process vast
amounts of data.”


from the Hadoop web site.


eRL Sojourner Java library:



A Java library developed by eRL for processing Sojourner logs.


Dojo:



An open source JavaScript/AJAX toolkit. We used its GFX module for the visualization
part.





A Simple Use Case Walk Through


Show me what the users did after they entered the newly launched eBay site.


Criteria:


Show traffic in eBay US only,


Ignore pages that are not visible to the user.

[Diagram labels: Analyst, Query, Mobius on Hadoop Cluster, Shared resource, Sojourner Logs]

Thanks to the eBay UBI team for providing us the 25% Sojourner logs.
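
The two criteria in this walk-through amount to a filter applied before any page-flow analysis. In the sketch below the record layout (`page`, `site`, `visible`) is an assumption made for illustration; the slides do not show how site or page visibility is encoded in the logs.

```python
def apply_criteria(events, site="US"):
    """Keep events for one site and drop pages not visible to the user."""
    return [e for e in events if e["site"] == site and e["visible"]]


# Made-up events; the real log fields are not shown in the slides.
events = [
    {"page": "Home", "site": "US", "visible": True},
    {"page": "Tracking", "site": "US", "visible": False},
    {"page": "Home", "site": "UK", "visible": True},
    {"page": "Search", "site": "US", "visible": True},
]
print([e["page"] for e in apply_criteria(events)])
```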


Demo






Mobius project is set up at:

http://erl03.arch.ebay.com:8080/Mobius/


Future Enhancements


Efficiency of storing the result into database.


Page type grouping.


Aggregating repeat/similar patterns.


Splitting a page into sub-page groups.


More comprehensive query options.


Front end/UI filters to enhance viewability and usability.


Editing existing queries.


Related ratio comparison.


Allow users to view any kind of time series.


Exporting the graph into GraphML.

eBay Research Labs

Grid Computing

Paul Strong

Distinguished Engineer, eBay Research Labs

Chair, Open Grid Forum (OGF)

Co-Chair, OGF Reference Model Working Group



28th September 2007


eBay’s Drivers


Extreme Scale


241m Registered Users, 100m+ Items, 6m+ New Items Per Day


Extreme Growth


Near exponential growth in listings for most of history


12 years


Extreme Agility


Roll code to the site every 2 weeks


Constant, predictable presence


Must be 24x7x365


Efficiency

Failure To Keep Up Is Not An Option!


eBay Marketplace’s Scale


241 million registered users


103+ million Items


6+ million new items per day


34 billion SQL transactions per day


600+ production database instances (inc replicas)


100+ clusters


3+ PB SAN storage


3000 SAN ports


16000+ V3 Service Instances


8000+ Servers for V3


3000-ish Servers for Voyager

eBay Is BIG!
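
To put the daily figures in per-second terms, a small worked calculation. These are averages only; the slides say nothing about peak load, and “6+ million” is treated as a lower bound.

```python
# Average rates implied by the daily figures above (a day has 86,400 seconds).
SECONDS_PER_DAY = 24 * 60 * 60

sql_per_second = 34e9 / SECONDS_PER_DAY       # 34 billion SQL transactions / day
new_items_per_second = 6e6 / SECONDS_PER_DAY  # 6+ million new items / day

print(f"{sql_per_second:,.0f} SQL transactions / second on average")
print(f"{new_items_per_second:,.0f}+ new items / second on average")
```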


The Big Problem

[Chart: # Relationships plotted against # Components]

Management complexity scales with this
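
One way to see why relationships outgrow components: if every component could relate to every other, n components would have n(n-1)/2 possible pairwise relationships. This worst case is an assumption for illustration; the slide does not give a formula.

```python
def max_pairwise_relationships(n_components):
    """Upper bound on undirected pairwise relationships among n components."""
    return n_components * (n_components - 1) // 2


for n in (10, 100, 1000, 10000):
    print(n, max_pairwise_relationships(n))
```

A tenfold increase in components gives roughly a hundredfold increase in potential relationships.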


Understanding Relationships

Service A is composed of

Persistence Sub-Service B

Business Logic Sub-Service C

Presentation Sub-Service D

[Diagram labels: A, B, C, D]


Understanding Relationships

Business Logic Sub-Service C is composed of

A Load Balancing Service

Several Application Instances

[Diagram labels: A, B, C, D, App, App, LBS]


Understanding Relationships

The Application Instances are hosted on

Operating System Instances

The Load Balancing Service is hosted on

A Load Balancer Operating System

[Diagram labels: A, B, C, D, App, App, LBS, OS, OS, LB]


Understanding Relationships

The Operating System Instances are hosted on

Servers or Virtual Servers, which are in turn hosted on servers

The Load Balancer OS is hosted on

A Physical Load Balancer

[Diagram labels: A, B, C, D, App, App, LBS, OS, OS, LB, Svr, Svr, LB, VS]
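
The “composed of” and “hosted on” relationships built up over these slides can be written down as edges in a graph and queried. In this sketch the instance names (`App1`, `OS1`, `Svr1`, `VS1`, `LB-OS` and so on) and the choice of which OS sits on a virtual server are hypothetical labels for the unnamed boxes in the diagrams.

```python
# Relationships from the slides as (subject, relation, object) edges.
# Instance names such as "App1" and "OS1" are hypothetical labels for the
# unnamed boxes in the diagrams.
EDGES = [
    ("A", "composed_of", "B"),
    ("A", "composed_of", "C"),
    ("A", "composed_of", "D"),
    ("C", "composed_of", "LBS"),
    ("C", "composed_of", "App1"),
    ("C", "composed_of", "App2"),
    ("App1", "hosted_on", "OS1"),
    ("App2", "hosted_on", "OS2"),
    ("LBS", "hosted_on", "LB-OS"),
    ("OS1", "hosted_on", "Svr1"),
    ("OS2", "hosted_on", "VS1"),
    ("VS1", "hosted_on", "Svr2"),
    ("LB-OS", "hosted_on", "LB"),
]


def impacted_by(component, edges=EDGES):
    """Return everything that directly or indirectly depends on `component`."""
    dependents = {}
    for subject, _relation, obj in edges:
        dependents.setdefault(obj, set()).add(subject)
    impacted, stack = set(), [component]
    while stack:
        for parent in dependents.get(stack.pop(), ()):
            if parent not in impacted:
                impacted.add(parent)
                stack.append(parent)
    return impacted


print(sorted(impacted_by("LB")))
print(sorted(impacted_by("Svr2")))
```

Walking the edges upward from a piece of hardware answers “what is affected if this fails?” without any component-specific code.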


Relationships Are Everything!


Everything is interconnected


Changing one thing causes ripples


How you connect things together determines business functionality
and business value


Agility is the ability to change these relationships dynamically
(easier with loosely coupled services)


Virtualization is about standardizing a relationship and
interposing/isolating one end from the other


Understanding these relationships allows you to


Tie business processes to the infrastructure they run on


Map value to cost


Understand and manage traffic flow


Understand and manage provisioning etc.


It’s all about managing relationships, not things!


Research Areas Of Focus


How do you simplify the management of
complex Grids/Distributed Systems?


How do you model Grids/Distributed Systems?


How do you serialize this model?


How do you visualize infrastructure to make it
easy to diagnose problems and to manage it?


Work Areas


Standards


Modeling effort leverages and drives Open Grid Forum reference
model


Engaged with DMTF and SNIA


Modeling Search Back End


Shared by Dagwood (see later), CMDB and Search Back End
Management Tool


Dagwood


Pilot production management tool


Due in Jan 2008



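Categorizing The Components

[Diagram: the components from the preceding slides (A, B, C, D, App, App, LBS, OS, OS, LB, Svr, Svr, LB, VS) placed on a grid. Column labels: Data, Business Logic, Presentation. Layer labels, in extraction order: Physical, Virtualized, Physical, OE, Virtualized, OE, Platform Instance, Virtualized, Platform, Biz Process/Service.]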


Categorizing The Components

[Diagram: a grid of example components. Layer labels, in extraction order: Physical, Virtualized, Physical, OE, Virtualized, OE, Platform Instance, Virtualized, Platform, Biz Process/Service. Column labels: Storage, Network, Compute, Aggregations.

Labels shown in the grid:

Aggregations: Web Server Farm, Federation, Clusters, Load Balanced Farms

Business services: eBay Search, eBay Sell Item, eBay Buy Item, eBay Auction, PayPal

Platform: Database, Network File systems - NFS, CIFS, LDAP, Web Server, Application Server

Storage: File Systems; LUNs, Volumes; Disks, Array Controllers, SAN Switches

Compute: Virtualized OS, e.g. Solaris Containers, BSD Jails etc.; OS, e.g. AIX, HP/UX, Linux, Solaris, Windows etc.; VMMs & Hypervisors; Hardware Partitions; Servers, Blades etc.

Network: Load Balancers, Global IP in clusters; IP, TCP, UDP etc.; VLANs; Switches, Routers etc.]

N.B. Diagram shamelessly borrowed from the Open Grid Forum Reference Model (formerly EGA Reference Model)


Standard Models


The Open Grid Forum


Modeling Search Back End


What Is Dagwood?


A layer on top of existing configuration and monitoring tools that allows



Simpler diagnosis of problems through better visualization and
navigation of the infrastructure


Quicker and more reliable workflow by providing context sensitive
access (right click in navigation paradigm) to existing tools


Integrates and adds value to


Search Back-End Management Tool (Phase 1 deliverable)


CMDB, RDS, ODB etc. (Future phases)


CAL, SN-Mon etc.


Phase 1 Use Cases


Fault diagnosis of Search Back End infrastructure query flow


Diff view of what has changed


What’s Really Different About Dagwood


Innovative use of Semantic Web technologies



i.e. Ontology


RDF, OWL etc.


Makes tooling extensible: semantics, i.e. knowledge of
the structure of eBay.com, can be built into the data
rather than into the plug-ins, thus obviating the need for
new/updated code when things change
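
The idea of keeping semantics in the data rather than in plug-in code can be sketched with plain triples. This is not RDF or OWL tooling and not Dagwood's implementation; the vocabulary (`hostedOn`, `ownedBy`, `isTransitive`) and the instance names are invented for illustration. The generic query function knows nothing about hosting; it only follows whatever relations the data itself declares transitive.

```python
# Facts and schema-level statements live in the same triple store.
# All names here are invented for illustration.
TRIPLES = {
    ("hostedOn", "isTransitive", "true"),  # schema-level statement
    ("App1", "hostedOn", "OS1"),
    ("OS1", "hostedOn", "VS1"),
    ("VS1", "hostedOn", "Svr1"),
    ("App1", "ownedBy", "SearchTeam"),
}


def related(subject, relation, triples=TRIPLES):
    """Objects related to `subject`, following the relation transitively
    only if the data itself declares the relation transitive."""
    transitive = (relation, "isTransitive", "true") in triples
    found, frontier = set(), [subject]
    while frontier:
        node = frontier.pop()
        for s, r, o in triples:
            if s == node and r == relation and o not in found:
                found.add(o)
                if transitive:
                    frontier.append(o)
    return found


print(sorted(related("App1", "hostedOn")))
print(sorted(related("App1", "ownedBy")))
```

Introducing a new relationship type then requires only new triples, not new query code.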


Comparing Dagwood With CMDB


Traditional CMDB


Proprietary model


Typically focused on hardware and
physical assets (software)


Answers questions like


How many of X do I have


What is the asset tag of server Y


Show me who owns server Z


OS/CMS


Open, extensible model


Focused on relationships between
assets and the business they support


Answers questions (in addition to trad.
CMDB) like


Show me current performance against my
business SLAs


Show me all the components and client
interactions that are impacted when load
balancer Y is down


Show me the current bandwidth utilization
of the physical network connections that
support this user interaction


Show me the route through the site for this
type of transaction


Compare the state of the site now with the
state of it the last time that failure X
occurred


Effectively delivers integration of data warehouse-like functionality
and config management within one service


Dagwood Phase 1

[Diagram labels:]

Dagwood Semantic Query (Production Quality)

Dagwood Query, Navigation & Control User Interface

XUNI Core

Dagwood Stack (Non Production Quality, for testing & validation only)

Search Back-End Management Tool

Adaptors/Plug-Ins


Dagwood Phase 1++

[Diagram labels:]

Dagwood Semantic Query (Production Quality)

CMDB, RDS, ODB

Dagwood Query, Navigation & Control User Interface

XUNI Core

Dagwood Stack (Non Production Quality, for testing & validation only)

Search Back-End Management Tool

Adaptors/Plug-Ins


Demo


More Information


Research Labs Wiki


http://research.arch.ebay.com/projects/Grid/


Open Grid Forum (OGF)


http://www.ogf.org/


OGF Reference Model Working Group


https://forge.gridforum.org/sf/wiki/do/viewPage/projects.rm-wg/wiki/HomePage


RDF


http://www.w3.org/RDF/


http://planetrdf.com/guide/


OWL


http://www.w3.org/2004/OWL/


A Good Book on XML, RDF and OWL


“A Semantic Web Primer” by Grigoris Antoniou & Frank van Harmelen