USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION



Tools to build your big data application


Ameya Kanitkar

That's me!



Big Data Infrastructure Engineer @ Groupon, Palo Alto, USA
(Working on Deal Relevance & Personalization Systems)

ameya.kanitkar@gmail.com
http://www.linkedin.com/in/ameyakanitkar
@aktwits

Agenda

- Basics of Hadoop & HBase
- How you can use Hadoop & HBase for a big data application
- Case Study: Deal Relevance and Personalization Systems at Groupon with Hadoop & HBase

Big Data Application Examples

- Recommendation Systems
- Ad Targeting
- Personalization Systems
- BI / DW
- Log Analysis
- Natural Language Processing

So what is Hadoop?

- General purpose framework for processing huge amounts of data
- Open source
- Batch / offline oriented



Hadoop - HDFS

- Open source distributed file system
- Stores large files. Can easily be accessed via applications built on top of HDFS (see the sketch after this list)
- Data is distributed and replicated over multiple machines
- Linux-style commands, e.g. ls, cp, mv, touchz etc.
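As a quick illustration of programmatic access, here is a minimal Java sketch that reads a file through the HDFS FileSystem API. The path /data/example.log is a hypothetical stand-in.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    // Picks up cluster settings from core-site.xml / hdfs-site.xml
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical path, for illustration only
    Path path = new Path("/data/example.log");
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(path)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}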






Hadoop - HDFS Example:

hadoop fs -dus /data/
185453399927478 bytes =~ 168 TB

(One of the folders from one of our Hadoop clusters)


Hadoop - Map Reduce

- Application framework built on top of HDFS to process your big data
- Operates on key-value pairs
- Mappers filter and transform input data
- Reducers aggregate mapper output

Example

Given web logs, calculate the landing page conversion rate for each product.

So basically we need to see how many impressions each product received, and then calculate the conversion rate for each product.

Map Reduce Example

Map Phase:

- Map 1: Process log file. Output: key (product ID), value (impression count)
- Map 2: Process log file. Output: key (product ID), value (impression count)
- Map N: Process log file. Output: key (product ID), value (impression count)

Reduce Phase:

- Reducer: Here we receive all data for a given product. Just run a simple for loop to calculate the conversion rate. Output: (product ID, conversion rate). A code sketch follows.
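A minimal Java sketch of such a job, using the org.apache.hadoop.mapreduce API. The tab-separated log format (timestamp, product ID, event type) is a hypothetical stand-in for real web logs; and where the slide summarizes the mapper output as impression counts, this sketch emits the raw event type per product so the reducer can count both impressions and purchases in one loop, since a conversion rate needs both.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ConversionRate {

  // Mapper: filter/transform each log line into (productId, eventType)
  public static class LogMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      if (fields.length == 3) { // timestamp, productId, eventType
        context.write(new Text(fields[1]), new Text(fields[2]));
      }
    }
  }

  // Reducer: receives all events for one product; a simple loop counts
  // impressions and purchases, then emits the conversion rate
  public static class RateReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text productId, Iterable<Text> events, Context context)
        throws IOException, InterruptedException {
      long impressions = 0, purchases = 0;
      for (Text event : events) {
        if ("impression".equals(event.toString())) impressions++;
        else if ("purchase".equals(event.toString())) purchases++;
      }
      if (impressions > 0) {
        context.write(productId, new DoubleWritable((double) purchases / impressions));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "conversion-rate");
    job.setJarByClass(ConversionRate.class);
    job.setMapperClass(LogMapper.class);
    job.setReducerClass(RateReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}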

Recap

- We just processed terabytes of data, and calculated conversion rates across millions of products.
- Note: This is a batch process only. It takes time. You cannot start this process after someone visits your website.
- How about we generate recommendations in a batch process and serve them in real time?



HBase

- Provides real time random read/write access over HDFS
- Built on Google's 'Big Table' design
- Open source
- This is not an RDBMS, so no joins. Access patterns are generally simple, like get(key), put(key, value) etc. (see the sketch below)
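A minimal sketch of those access patterns, using the classic (pre-1.0) HBase Java client; newer releases replace HTable with Connection/Table and Put.add with Put.addColumn. The table name "users" and the column names are hypothetical.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class UserStore {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "users");

    // put(key, value): overwrite this user's recommendations blob
    Put put = new Put(Bytes.toBytes("user1"));
    put.add(Bytes.toBytes("cf1"), Bytes.toBytes("recommendations"),
            Bytes.toBytes("{\"deals\": [101, 202, 303]}"));
    table.put(put);

    // get(key): random read of the same row in real time
    Result result = table.get(new Get(Bytes.toBytes("user1")));
    byte[] value = result.getValue(Bytes.toBytes("cf1"),
                                   Bytes.toBytes("recommendations"));
    System.out.println(Bytes.toString(value));

    table.close();
  }
}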










Row    | Cf:<qual> | Cf:<qual>  | ...
-------|-----------|------------|----------
Row 1  | Cf1:qual1 | Cf1:qual2  |
Row 11 | Cf1:qual2 | Cf1:qual22 | Cf1:qual3
Row 2  | Cf2:qual1 |            |
Row N  |           |            |


- Dynamic column names. No need to define columns upfront.
- Both rows and columns are sorted (lexicographically), which is why "Row 11" sorts before "Row 2" above.

Row    | Cf:<qual>                                          | ...
-------|----------------------------------------------------|----------------------------------
user1  | Cf1:click_history:{actual_clicks_data}             | Cf1:purchases:{actual_purchases}
user11 | Cf1:purchases:{actual_purchases}                   |
user20 | Cf1:mobile_impressions:{actual mobile impressions} | Cf1:purchases:{actual_purchases}


Note: Each row has different columns, so think of this as a hash map rather than a table with rows and columns. A sketch of reading such a row follows.
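To make the hash-map view concrete, here is a hedged sketch that fetches one row and iterates over whatever columns it happens to have. The table and family names are hypothetical, matching the example above.

import java.util.Map;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RowAsMap {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "users");
    Result row = table.get(new Get(Bytes.toBytes("user20")));
    // Qualifier -> value map for one family; the column set varies per row
    Map<byte[], byte[]> columns = row.getFamilyMap(Bytes.toBytes("cf1"));
    for (Map.Entry<byte[], byte[]> e : columns.entrySet()) {
      System.out.println(Bytes.toString(e.getKey()) + " = "
                         + Bytes.toString(e.getValue()));
    }
    table.close();
  }
}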

Putting it all together

Store data in HDFS -> Analyze data (Map Reduce) -> Generate recommendations (Map Reduce) -> Serve real time requests (HBase) -> Web and Mobile clients

Do offline analysis in Hadoop, and serve real time requests with HBase.

Use Case: Deal Relevance & Personalization @ Groupon

What are Groupon Deals?

Our Relevance Scenario

Users: How do we surface relevant deals?

- Deals are perishable (deals expire or are sold out)
- No direct user intent (as in traditional search advertising)
- Relatively limited user information
- Deals are highly local



Two Sides to the Relevance Problem

Algorithmic Issues: How to find relevant deals for individual users given a set of optimization criteria.

Scaling Issues: How to handle relevance for all users across multiple delivery platforms.

Developing Deal Ranking Algorithms

- Exploring Data: Understanding signals, finding patterns
- Building Models/Heuristics: Employ both classical machine learning techniques and heuristic adjustments to estimate user purchasing behavior
- Conducting Experiments: Try out ideas on real users and evaluate their effect

Data Infrastructure

[Chart: infrastructure growth from 20+ (2011) to 400+ (2012) to 2000+ (2013)]

Growing Deals, Growing Users

- 100 Million+ subscribers
- We need to store data like user click history, email records, service logs etc. This amounts to billions of data points and TBs of data.

Deal Personalization Infrastructure Use Cases

- Deliver Personalized Emails (offline system)
- Deliver Personalized Website & Mobile Experience (online system)

Email: Personalize billions of emails for hundreds of millions of users.

Web & Mobile: Personalize one of the most popular e-commerce mobile & web apps for hundreds of millions of users & page views.

Architecture

[Diagram: a data pipeline feeds Map/Reduce jobs that write into an offline HBase system; HBase replication copies the results to an online HBase system. Email relevance is served from the offline system, real time relevance from the online system.]

- We can now maintain different SLAs on the online and offline systems.
- We can tune the HBase clusters differently for the online and offline systems.



HBase Schema Design

Row key: User ID (unique identifier for users)

Column Family 1: User history and profile information. Overwrite user history and profile info on each update.

Column Family 2: Email history for users. Append the email history for each day as a separate column. (On avg each row has over 200 columns.)

- Most of our data access patterns are via "User Key"
- This makes it easy to design the HBase schema
- The actual data is kept in JSON (a sketch of the two write patterns follows)
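A hedged sketch of the two write patterns under this schema, again using the classic client API; the table name, family names, qualifiers, and JSON payloads are hypothetical illustrations.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class UserProfileWriter {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "user_profiles");
    byte[] row = Bytes.toBytes("user42"); // row key is the user key

    // Column family 1: overwrite user history and profile info (JSON)
    Put profile = new Put(row);
    profile.add(Bytes.toBytes("cf1"), Bytes.toBytes("profile"),
                Bytes.toBytes("{\"clicks\": 17, \"categories\": [\"food\"]}"));
    table.put(profile);

    // Column family 2: append email history as one new column per day,
    // so a row accumulates columns over time (200+ on average)
    Put email = new Put(row);
    email.add(Bytes.toBytes("cf2"), Bytes.toBytes("email_2013-10-16"),
              Bytes.toBytes("{\"deals_sent\": [101, 202]}"));
    table.put(email);

    table.close();
  }
}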

Cluster Sizing

Hadoop + HBase Cluster:
- 100+ machine Hadoop cluster; this runs heavy map reduce jobs
- The same cluster also hosts a 15 node HBase cluster

Online HBase Cluster:
- 10 machine dedicated HBase cluster to serve real time SLAs
- Fed from the offline cluster via HBase replication


Machine Profile:
- 96 GB RAM (HBase: 25 GB)
- 24 virtual CPU cores
- 8 x 2TB disks

Data Profile:
- 100 Million+ records
- 2TB+ data
- Over 4.2 billion data points




Questions?

Thank You!

(We are hiring!)
www.groupon.com/techjobs