REAL-TIME RECOMMENDATIONS FOR RETAIL: ARCHITECTURE, ALGORITHMS, AND DESIGN

Posted 13 Dec 2013

Juliet Hougland and Jonathan Natkins

Who Are We?

Jonathan Natkins
  Field Engineer at WibiData
  Before that, Cloudera Software Engineer
  Before that, Vertica Software/Field Engineer

Juliet Hougland
  Data Scientist, previously at WibiData
  MS in Applied Math
  BA in Math-Physics

Recommendations in Retail

Personalized versus Non-Personalized

Recommender Contexts

Taste History
  Based on everything you know about a user
  Interests over months/years

Current Taste
  Based on a user’s immediate history
  Interests over minutes/hours

Ephemeral
  Extreme version of current taste
  For example, location

Demographic*
  Similar to taste history, but less subjective
  Geographic region, age bracket, etc.

Why Does Real-Time Matter?

Relevancy

I am a Special Snowflake

Natty

Requirements for a Real-Time System

General System Requirements
  Handle millions of customers/users
  Support collection and storage of complex data
    Static and event-series

Real-Time System Requirements
  Quickly retrieve subsets of data for a single user
  Aggregate/derive new, first-class data per user

What is Kiji?

The Kiji project is a modular, open-source framework for building real-time applications that collect, store, and analyze entity-centric data

kiji.org
github.com/kijiproject

Three Challenges

Developing models for use in real-time
Scoring models in real-time
Deploying models into a production environment

How Can We Make Real-Time Models?

Population interests change slowly
  Models don’t need to be retrained frequently
Individual interests change quickly
  Application of a model should be fast

A Common Workflow

Train a model over the entire dataset
Save fitted model parameters to a file or another table
Access the model parameters when generating new recommendations based on new data

This is EXPENSIVE
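
The three workflow steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the "model" is a plain popularity count, the parameters are stored as JSON, and the names `train`, `save_params`, `load_params`, and `recommend` are hypothetical rather than part of Kiji. The costly part is `train`, which has to touch every record, while `recommend` only reads the stored parameters.

```python
import json
import os
import tempfile
from collections import Counter


def train(purchases):
    """Batch step: count item popularity over the entire dataset."""
    counts = Counter(item for _, item in purchases)
    return {"popularity": dict(counts)}


def save_params(params, path):
    """Persist the fitted parameters so scoring does not need to retrain."""
    with open(path, "w") as f:
        json.dump(params, f)


def load_params(path):
    """Read previously fitted parameters back from disk."""
    with open(path) as f:
        return json.load(f)


def recommend(params, already_bought, n=2):
    """Scoring step: rank by stored popularity, skipping items already owned."""
    ranked = sorted(params["popularity"].items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked if item not in already_bought][:n]


# Made-up (user, item) purchase events.
purchases = [
    ("beaker", "banana slicer"),
    ("bunsen", "banana slicer"),
    ("bunsen", "pineapple slicer"),
    ("animal", "drum kit"),
]
path = os.path.join(tempfile.mkdtemp(), "model.json")
save_params(train(purchases), path)
recs = recommend(load_params(path), already_bought={"banana slicer"})
```

Because population interests change slowly, the `train` and `save_params` calls could run rarely, with `recommend` called on every request.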

Developing Models

KijiExpress
  Scala interface for interacting with Kiji data
  Uses Scalding for designing complex dataflows

Model Lifecycle
  Allows analysts and data scientists to break apart a model into phases

Scoring Models in Real-Time

Batch isn’t real-time

[Figure: plot with axes "Number of Users" and "Number of Interactions". A few users have many interactions; a lot of users have few interactions.]

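Fresheners Compute Lazily

[Diagram: Client, KijiScoring Server, HBase]

Client reads a column
KijiScoring Server gets the stored value from HBase
Freshness Policy checks whether the stored value is fresh
  Yes: return to client
  No: run the Scorer, return the new value to client, and write it back for next time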
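
The read path in the diagram can be sketched as a small class. This is a hedged illustration, not the KijiScoring API: `LazyFreshener`, its age-based freshness rule, and the dict standing in for HBase are all assumptions made for the example.

```python
import time


class LazyFreshener:
    """Hypothetical stand-in for the read path above; not the KijiScoring API."""

    def __init__(self, store, scorer, max_age_seconds, clock=time.time):
        self.store = store              # dict standing in for HBase: key -> (value, written_at)
        self.scorer = scorer            # function that computes a fresh value for a key
        self.max_age = max_age_seconds  # assumed freshness policy: age-based
        self.clock = clock

    def is_fresh(self, written_at):
        """Freshness policy: a value is fresh if it is young enough."""
        return self.clock() - written_at <= self.max_age

    def read(self, key):
        cached = self.store.get(key)
        if cached is not None and self.is_fresh(cached[1]):
            return cached[0]                     # Yes: return to client
        value = self.scorer(key)                 # No: run the scorer
        self.store[key] = (value, self.clock())  # write back for next time
        return value


calls = []


def scorer(user):
    calls.append(user)
    return f"recs for {user} v{len(calls)}"


now = [0.0]  # fake clock so the example is deterministic
freshener = LazyFreshener({}, scorer, max_age_seconds=60, clock=lambda: now[0])
first = freshener.read("beaker")   # nothing stored yet, so the scorer runs
second = freshener.read("beaker")  # stored value is fresh, scorer is skipped
now[0] = 120.0
third = freshener.read("beaker")   # stored value is stale, scorer runs again
```

Scores are only computed for users who actually show up, which suits a population where most users have few interactions.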

Kiji Application Stack

Deployment Challenges

Kiji Model Repository

Link between application and models
Stores Freshener metadata
  FreshnessPolicy, Scorer, attached column
  Location of trained model
Stores Scorer code
  Code repository makes model scoring code available to the application from a central location
New models can be deployed to the Model Repository and made immediately available to the application

Retail Recommendation

Types of Recommenders

Recommendation Algorithms
  Collaborative Filtering Methods
    Memory-Based
    Model-Based
  Content-Based Methods

Content-Based Recommenders

Orange-Nosed
Lab Assistant
Meeps a lot

Build models around entities using features that we think reflect inherent characteristics

Content-Based Recommenders

safer

faster

knife

Pandora: Content-Based

Expertly-Characterized Music

Collaborative Filtering

Represent user-item affinities as a sparse matrix

Beaker
Banana Slicer
Pineapple Slicer

Users ≈ Rows
Items ≈ Columns

Aspirational Ratings

I put in my queue…

I actually watch

Collaborative Filtering

Represent user-item affinities as a sparse matrix

Beaker
Banana Slicer
Pineapple Slicer

Users ≈ Rows
Items ≈ Columns

Simple aggregate predictors
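
As a rough illustration of the sparse user-item matrix and of what a simple aggregate predictor might look like: the slide's own formulas did not survive extraction, so the three averages below (global, per-item, per-user) are assumed examples, and the ratings are made up.

```python
# Sparse matrix as a dict of dicts: users are rows, items are columns,
# and only observed ratings are stored.
ratings = {
    "beaker": {"banana slicer": 5.0, "pineapple slicer": 3.0},
    "bunsen": {"banana slicer": 4.0},
    "animal": {"pineapple slicer": 1.0},
}


def global_mean(ratings):
    """Average of every observed rating."""
    values = [r for row in ratings.values() for r in row.values()]
    return sum(values) / len(values)


def item_mean(ratings, item):
    """Average rating of one item (one column); falls back to the global mean."""
    values = [row[item] for row in ratings.values() if item in row]
    return sum(values) / len(values) if values else global_mean(ratings)


def user_mean(ratings, user):
    """Average rating given by one user (one row); falls back to the global mean."""
    row = ratings.get(user, {})
    return sum(row.values()) / len(row) if row else global_mean(ratings)
```

The fallback to the global mean gives an answer for unknown users and items, at the price of no personalization at all.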




Collaborative Filtering: How It Works

Similar Users

Similar Products

Similar Entities

What do we mean by similar?

Jaccard Index: a measure of set similarity
Cosine Similarity: the cosine of the angle between two vectors
Pearson Correlation: statistical measure, similar to cosine
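
A minimal sketch of the three measures, using plain Python lists and sets as an assumed input format. Pearson correlation is written as the cosine of the mean-centered vectors, which is what makes it "similar to cosine".

```python
import math


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0


def pearson(x, y):
    """Cosine similarity after subtracting each vector's mean."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine([p - mean_x for p in x], [q - mean_y for q in y])
```

Jaccard suits implicit events (bought or not bought); cosine and Pearson suit numeric ratings, with Pearson correcting for users who rate everything high or low.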

Naively, we could compare every entity to every other entity

…But that would not scale well with increasing numbers of entities
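
To make the scaling problem concrete: comparing every entity with every other takes n(n-1)/2 comparisons. The sketch below is an illustration only; the entity names, the set-based `jaccard` helper, and `all_pairs_similarity` are assumptions for the example.

```python
from itertools import combinations


def count_pairs(n):
    """Number of unordered pairs among n entities: n(n-1)/2."""
    return n * (n - 1) // 2


def jaccard(a, b):
    """Set similarity, used here only as an example measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def all_pairs_similarity(entities, similarity):
    """Naive approach: one similarity computation per unordered pair."""
    return {
        (a, b): similarity(entities[a], entities[b])
        for a, b in combinations(sorted(entities), 2)
    }


entities = {
    "beaker": {"banana slicer", "pineapple slicer"},
    "bunsen": {"banana slicer"},
    "animal": {"drum kit"},
    "gonzo": {"pineapple slicer", "drum kit"},
}
sims = all_pairs_similarity(entities, jaccard)
pairs_for_a_million = count_pairs(1_000_000)
```

Four entities need only six comparisons, but the count grows quadratically, so a million entities already need roughly half a trillion.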

Building the Similarity Matrix

Collaborative Filtering: Is This Useful?

Problem: Too much data!
  Tracking user preferences and all their events generates huge amounts of data
Problem: Too little data!
  Dimensions of user-space and item-space are usually very large
  More variables make it more difficult to generate user preferences
Problem: Cold start
  If you don’t know anything about a user, what should you recommend?
Problem: More ratings mean slower computations
  Identifying neighborhoods of entities is expensive

Collaborative Filtering: Why Is It Useful?

Because it works

Content-agnostic
All that matters is co-occurrence of events

Amazon: Item-Item Collaborative Filtering

Used for personalized recommendations
Fill screen real estate with related items
Produces specific, but non-creepy recommendations

Linden, G.; Smith, B.; York, J., "Amazon.com recommendations: item-to-item collaborative filtering," IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, Jan/Feb 2003

Item-Item Collaborative Filtering

Beaker buys a banana slicer

Then:
  Generate list of candidate items to predict ratings for
  Predict ratings for candidate items
  Select Top-N items
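
The three steps above could look roughly like this. It is a sketch under assumptions: the item similarities are made-up numbers given as a precomputed dict, the predictor is a similarity-weighted average of the user's own ratings (one common choice, not necessarily the one used in the talk), and `recommend` is a hypothetical name.

```python
def recommend(user_ratings, item_sims, n=2):
    """Item-item CF in three steps: candidates, predicted ratings, top N."""
    # Step 1: candidates are neighbors of rated items that the user has not rated.
    candidates = {
        other
        for item in user_ratings
        for other in item_sims.get(item, {})
        if other not in user_ratings
    }
    # Step 2: predict each candidate's rating as a similarity-weighted average
    # of the ratings the user has already given.
    predictions = {}
    for candidate in candidates:
        numerator = denominator = 0.0
        for item, rating in user_ratings.items():
            sim = item_sims.get(candidate, {}).get(item, 0.0)
            numerator += sim * rating
            denominator += abs(sim)
        if denominator > 0:
            predictions[candidate] = numerator / denominator
    # Step 3: keep the N highest predictions (ties broken by name).
    return sorted(predictions.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Assumed, symmetric item-item similarities.
item_sims = {
    "banana slicer": {"pineapple slicer": 0.9, "apple corer": 0.5},
    "pineapple slicer": {"banana slicer": 0.9, "drum kit": 0.1},
    "apple corer": {"banana slicer": 0.5, "drum kit": 0.5},
    "drum kit": {"pineapple slicer": 0.1, "apple corer": 0.5},
}
beaker = {"banana slicer": 5.0, "drum kit": 1.0}
top = recommend(beaker, item_sims)
```

Here the pineapple slicer ranks first because it is much closer to the highly rated banana slicer than to the poorly rated drum kit.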

Accessing External Data

KeyValueStore API enables external data access when applying a model
External data might be…
  Trained model parameters
  Hierarchical/Taxonomic data
  Geo-lookup
Store external data flexibly
  Text files, sequence files, Kiji tables, etc.
Data access is decoupled from use during execution
If the data doesn’t fit in memory, put it in a table
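
The decoupling idea can be illustrated with two interchangeable stores behind the same `get` method. This is a generic sketch, not Kiji's KeyValueStore API: `InMemoryStore`, `JsonFileStore`, `regional_offer`, and the sample geo-lookup data are all hypothetical.

```python
import json
import os
import tempfile


class InMemoryStore:
    """Hypothetical store backed by a dict held in memory."""

    def __init__(self, data):
        self._data = dict(data)

    def get(self, key, default=None):
        return self._data.get(key, default)


class JsonFileStore:
    """Hypothetical store backed by a JSON file, loaded on first access."""

    def __init__(self, path):
        self._path = path
        self._data = None

    def get(self, key, default=None):
        if self._data is None:
            with open(self._path) as f:
                self._data = json.load(f)
        return self._data.get(key, default)


def regional_offer(region, store):
    """Scoring code sees only `get`; it does not know where the data lives."""
    return store.get(region, "no regional offer")


geo_lookup = {"CA": "surfboard wax", "CO": "ski wax"}
path = os.path.join(tempfile.mkdtemp(), "geo.json")
with open(path, "w") as f:
    json.dump(geo_lookup, f)

from_memory = regional_offer("CO", InMemoryStore(geo_lookup))
from_file = regional_offer("CO", JsonFileStore(path))
missing = regional_offer("NY", JsonFileStore(path))
```

Swapping the backing store requires no change to `regional_offer`, which is the point of keeping data access separate from use.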

How Much Less Work Can We Do?

We can choose a predictor that allows us to truncate a sum
There are two ways terms in the sum of our predictor can be small
  No rating: ignore unrated items
  Small similarity: ignore dissimilar items
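
A hedged sketch of truncating the sum: the predictor formula from the slide is not in the extracted text, so a similarity-weighted average is assumed here, and `predict`, `min_similarity`, and the sample numbers are illustrative only.

```python
def predict(user_ratings, sims_to_candidate, min_similarity=0.0):
    """Weighted-average prediction that skips small terms.

    Returns (prediction, number_of_terms_used).
    """
    numerator = denominator = 0.0
    terms = 0
    # Iterating over the user's ratings means unrated items never enter the sum.
    for item, rating in user_ratings.items():
        sim = sims_to_candidate.get(item, 0.0)
        # Dissimilar items contribute almost nothing, so drop them.
        if sim == 0.0 or abs(sim) < min_similarity:
            continue
        numerator += sim * rating
        denominator += abs(sim)
        terms += 1
    prediction = numerator / denominator if denominator else None
    return prediction, terms


user_ratings = {"a": 5.0, "b": 4.0, "c": 1.0}
# Similarity of each item to the candidate; "d" is similar but unrated by this user.
sims_to_candidate = {"a": 0.9, "b": 0.8, "c": 0.01, "d": 0.7}

full, full_terms = predict(user_ratings, sims_to_candidate)
truncated, truncated_terms = predict(user_ratings, sims_to_candidate, min_similarity=0.1)
```

On this toy data the truncated prediction uses one term fewer and moves by only about 0.02; on real catalogs the dropped terms are the overwhelming majority.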

How Much Less Work Can We Do?

If we only present a few recommendations, we don’t need to predict ratings for all items
Choose your candidate set to estimate ratings wisely or infer from nearest neighbors

Organizing Data in Item-Item CF

Accessing Data During Freshening

Want to Know More?

The Kiji Project
  kiji.org
  github.com/kijiproject

Questions about this presentation?
  Twitter: @JulietHougland or @nattyice
  Email: natty@wibidata.com