Personalized Query Classification

Bin Cao, Qiang Yang, Derek Hao Hu, et al.

Computer Science and Engineering

Hong Kong UST

Query Classification and Online Advertisement



QC as Machine Learning

Inspired by the KDDCUP’05 competition


Classify a query into a ranked list of categories


Queries are collected from real search engines


Target categories are organized in a tree with each node being a category



Our QC Demo


http://q2c.cs.ust.hk/q2c/


Personalization


The aim of Personalized Query Classification is to classify a user query Q into a ranked list of predefined categories for different users.

Queries           Categories
golf              Car; Sports; Places
bass              Entertainment/Music; Living/Fishing
Michael Jordan    Information/Research; Sports/Basketball; Shopping
PQC: Personalized Query Classification



Question: Can we personalize search without user registration info?



Profile based PQC



Context based PQC



Conclusion

Difficulties


Web queries are:
  Short, sparse: “adi”, “cs”, “ps”
  Noisy: “contnt”, “gogle”
  New words are emerging all the time: “windows7”

Training data are hard for humans to label:
  Experts may have different understandings of the same ambiguous query
  E.g. “Apple”, “Office”, etc.


Method 1: Profile Based


Profile (U) = { <Q, Search-Result, Clicked-URL> } in the past


Profile based Personalized Query Classification

[Figure: lists of past search results, each entry marked √ or -, with the query “Michael Jordan”]

Method 2: Context Based


Context = a session of user-submitted queries

Example: context “Graphical Model”, “Machine Learning”, “UCB”; query “Michael Jordan”

Outline


Introduction



Profile based PQC



Context based PQC



Conclusion

How to construct a user profile?


To achieve personalized query classification, under an independence assumption (the query q and the user u are independent given the category c):

  p(c|q,u) ∝ p(q|c) p(u|c) p(c)

ACM KDDCUP 2005 solution: estimating p(q|c)
Focus: estimating p(u|c) for personalization
Difficulty: sparseness
  Too many possible categories
  Limited information for each user

[Figure: Search Engines, Clickthrough Data, and Categorized Clickthrough Data (Too Few!)]
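
The factorization p(c|q,u) ∝ p(q|c) p(u|c) p(c) can be turned into a ranking step once the three factors have been estimated. The following is a minimal sketch under that assumption; the function name `rank_categories` and all probability values are hypothetical illustrations, not figures from the talk, and every probability is assumed to be strictly positive.

```python
import math

def rank_categories(p_q_given_c, p_u_given_c, p_c):
    """Rank categories by p(c|q,u) ∝ p(q|c) * p(u|c) * p(c).

    Each argument is a dict keyed by category name (hypothetical values,
    assumed strictly positive). Returns (category, posterior) pairs, best first.
    """
    # Sum logs instead of multiplying to avoid underflow on tiny probabilities.
    log_scores = {
        c: math.log(p_q_given_c[c]) + math.log(p_u_given_c[c]) + math.log(p_c[c])
        for c in p_c
    }
    # Normalize so the scores sum to 1 over the candidate categories.
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnormalized.values())
    posterior = {c: v / total for c, v in unnormalized.items()}
    return sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical estimates for the query "Michael Jordan" and a single user.
p_q_given_c = {"Sports/Basketball": 0.020, "Information/Research": 0.010, "Shopping": 0.005}
p_u_given_c = {"Sports/Basketball": 0.001, "Information/Research": 0.010, "Shopping": 0.002}
p_c = {"Sports/Basketball": 0.3, "Information/Research": 0.3, "Shopping": 0.4}

ranking = rank_categories(p_q_given_c, p_u_given_c, p_c)
```

With these made-up numbers the user term p(u|c) moves Information/Research ahead of Sports/Basketball, even though the query term alone favors the latter.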

Collaborative Classification


Leverage information from similar users: user-class matrix

          C1   C2   C3   C4   C5
User A    √    X    √    ?    X
User B    √    √    ?    X    √
User C    X    X    √    ?    X
User D    √    ?    √    √    X

√: interested in
X: not interested in
An entry can also be a value indicating the degree of interest.

Extending Collaborative Filtering (CF) Model to Ranking
(Liu and Yang, SIGIR 2008)


Previous methods for CF:
  Memory-based approach: find users with similar interests to help predict missing values
  Model-based approach: estimate probabilities based on a new user’s known values

We propose a collaborative ranking model to improve the model-based approach:
  Uses preferences or rankings instead of values
  Better at estimating users’ preferences
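
For reference, the memory-based approach listed above can be sketched on a small user-class matrix. This is an illustration only, not the collaborative ranking model proposed in the talk: the toy matrix, the 1/0 encoding (1 = interested, 0 = not interested, None = missing), and the cosine-weighted average are all assumptions.

```python
import math

def cosine_on_shared(a, b):
    """Cosine similarity over the columns that both users have observed."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return 0.0
    dot = sum(x * y for x, y in shared)
    norm_a = math.sqrt(sum(x * x for x, _ in shared))
    norm_b = math.sqrt(sum(y * y for _, y in shared))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0

def predict_missing(matrix, user, col):
    """Similarity-weighted average of other users' known values in column `col`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, row in matrix.items():
        if other == user or row[col] is None:
            continue  # skip the user itself and users with no value to contribute
        w = cosine_on_shared(matrix[user], row)
        weighted_sum += w * row[col]
        weight_total += abs(w)
    return weighted_sum / weight_total if weight_total > 0 else None

# Toy user-class matrix over categories C1..C5 (hypothetical values).
matrix = {
    "User A": [1, 0, 1, None, 0],
    "User B": [1, 1, None, 0, 1],
    "User C": [0, 0, 1, None, 0],
    "User D": [1, None, 1, 1, 0],
}

# Estimate User A's missing interest in C4 (column index 3).
estimate = predict_missing(matrix, "User A", 3)
```

Here User D agrees with User A on every shared column, so D's value dominates the estimate; User C contributes nothing because C4 is also missing for that user.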


Nathan Liu and Qiang Yang. EigenRank: Collaborative Filtering via Rank Aggregation. In ACM SIGIR Conference (ACM SIGIR 08), Singapore, 2008.


Collaborative Ranking Framework

[Figure: Rating Database and Active User Ratings; Rating Prediction; Predicted Ratings; Item List; Sort; Ranking (1. Item y2, 2. Item y3)]

Collaborative Ranking for Intention Mining

Input: Preference Matrix (dimension labels: |User|, |Category|, |Preference = {(URL1 < URL2)}|)
Output: Interest Score Matrix P(U|C) (dimension labels: |user, or user group|, |Intention category|)

Our objective is to uncover the interest probability P(U|C) consistent with the given observed preference for each query.

Solution: Automatically Generate Labeled Data (to assist human labelers)

Clickthrough
  Connects queries and URLs
  Contains users’ personal interpretation of a query

[Figure: User A’s query leads to url a (C1) and User B’s query leads to url b (C2); the two queries are the same (||)]

We need the category information for URLs.

Experimental Results: F1 metric

How to enlarge training set?

[Figure: a few human labeled data; a HUGE number of clickthrough logs without labels]

Online Knowledge Bases, such as ODP, Wikipedia

[Figure: an online knowledge base such as Wikipedia, with plentiful documents, links, and a meaningful ontology]

“Label” Retrieval from Online KB

Taking Online Commercial Intention as an example

[Figure: Wikipedia Concept Graph]

Labels on result pages:
  Shopping: Commercial
  Sports: non-Commercial
  Video Games: Commercial
  Research: non-Commercial

Use labeled result pages as “seeds” to retrieve the most relevant documents as training data.

Obtain “Pseudo-Relevance” Data

[Figure: a few human labeled data; a HUGE number of clickthrough logs]

We learn a classifier using the retrieved “labeled” documents.
We apply the classifier to “label” the HUGE clickthrough log.
We can use the HUGE “labeled” clickthrough log for evaluation.
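
The train-then-label loop described here can be sketched as follows. The talk does not say which classifier is used, so the term-overlap scorer below is a stand-in chosen for brevity; the seed documents, queries, and function names are hypothetical.

```python
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification)."""
    return text.lower().split()

def train(labeled_docs):
    """Collect per-label term counts from (text, label) pairs."""
    model = {}
    for text, label in labeled_docs:
        model.setdefault(label, Counter()).update(tokenize(text))
    return model

def classify(model, text):
    """Return the label whose training vocabulary best covers the text."""
    tokens = tokenize(text)

    def score(label):
        counts = model[label]
        # Fraction of the label's training term mass matched by the text.
        return sum(counts[t] for t in tokens) / sum(counts.values())

    # Sorting the labels makes tie-breaking deterministic.
    return max(sorted(model), key=score)

# Step 1: hypothetical "labeled" documents retrieved from a knowledge base.
seed_docs = [
    ("buy cheap shoes online store price", "commercial"),
    ("basketball history rules players league", "non-commercial"),
]
model = train(seed_docs)

# Step 2: apply the classifier to unlabeled log entries (made-up queries).
click_log = ["jordan shoes price", "basketball league history"]
pseudo_labels = [(entry, classify(model, entry)) for entry in click_log]
```

The resulting pseudo-labels are only as reliable as the classifier that produced them, which is presumably why the talk puts “label” in quotation marks.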

Preliminary results on F(URL) → C

We evaluated the performance of the classifier trained with the relevant documents retrieved from Wikipedia.
AOL query data set, 10,000 held out for test.

F1 for 18 classes on AOL Query Classification task

Number of labeled   Training queries enriched   Training documents
seed queries        by search snippets          retrieved from Wikipedia
100                 12%                         28% (5,000 instances)
200                 21%                         36% (10,000 instances)
400                 31%                         38% (15,000 instances)

Outline


Introduction



Profile based PQC



Context based PQC: Hao Hu, Huanhuan Cao, et al. @ SIGIR 2009, ACML 2009.


Conclusion

Context based PQC for Online Commercial Intention

The commercial intention of the same query can be identified given its context information.

Context: “Allen Iverson”, “shoes”, “T-shirt”; query: “Michael Jordan”
Commercial! Offer ads!

Context based PQC for Online Commercial Intention [Cao et al., SIGIR’09]

The commercial intention of the same query can be identified given its context information.

Context: “Graphical Model”, “Machine Learning”, “UCB”; query: “Michael Jordan”
Non-Commercial! Redirect to scholar search!

Two questions:


How do we model query context?




How do we detect whether two queries are semantically similar?

Feature Generation/Enrichment

Graphical Models

Conditional Random Field

Motivation: model the query logs as a conditional random field, so that the relationships between consecutive queries, and even between skip (non-consecutive) queries, can be modeled.

Question: How do we decide whether two “skip queries” (non-consecutive queries) are related and should be linked?

Semantic Relationship between queries


Given Query A and Query B, how do we determine the degree of relevance between the two queries at the semantic level?

  Send queries to search engines
  Obtain search results
  Determine distance between search results
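
The three steps above can be sketched as below. No real search engine is called: the snippets are hypothetical stand-ins for retrieved results, and the talk does not specify the distance measure, so bag-of-words cosine similarity is an assumption (a distance can be taken as 1 minus the similarity).

```python
import math
import re
from collections import Counter

def bag_of_words(snippets):
    """Lowercased term counts over all result snippets for one query."""
    counts = Counter()
    for text in snippets:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts

def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical snippets standing in for real search results.
results = {
    "graphical model": [
        "probabilistic graphical model for machine learning",
        "graph structure encodes conditional independence",
    ],
    "machine learning": [
        "machine learning algorithms learn from data",
        "statistical learning and graphical model methods",
    ],
    "shoes": [
        "buy running shoes online",
        "discount shoes free shipping",
    ],
}
vectors = {query: bag_of_words(snips) for query, snips in results.items()}

sim_related = cosine_similarity(vectors["graphical model"], vectors["machine learning"])
sim_unrelated = cosine_similarity(vectors["graphical model"], vectors["shoes"])
```

Comparing result snippets rather than the query strings themselves sidesteps the shortness of web queries: “graphical model” and “machine learning” share no words, but their results do.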


Evaluation


Using context information vs. not using context information

Preliminary Experimental Results of PQC for Online Commercial Intention

Dataset
  AOL Query Log data
  About 20M Web queries
  About 650K Web users
  Data are sorted by anonymous user ID and arranged sequentially.
  Each item of clickthrough log data contains {AnonID, Query, QueryTime, ItemRank, ClickURL}
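
Records with these five fields can be grouped into per-user query streams as sketched below. The tab-separated layout with a header row is an assumption about the file format, and the sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict

def group_by_user(lines):
    """Group log records by AnonID, preserving their order in the file.

    Assumes tab-separated lines whose first row is the header
    AnonID, Query, QueryTime, ItemRank, ClickURL.
    """
    by_user = defaultdict(list)
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        by_user[row["AnonID"]].append(row)
    return dict(by_user)

# Invented sample; ItemRank and ClickURL are empty when nothing was clicked.
sample = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "1\tgraphical model\t2006-03-01 10:00:00\t\t\n"
    "1\tmichael jordan\t2006-03-01 10:02:10\t1\thttp://example.edu\n"
    "2\tshoes\t2006-03-02 09:00:00\t2\thttp://example.com\n"
)

logs = group_by_user(io.StringIO(sample))
queries_of_user_1 = [row["Query"] for row in logs["1"]]
```

Because the data are already sorted by user and time, each user's list can be read as a query stream and then cut into sessions by whatever rule one prefers.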


Preliminary Results

In our preliminary experimental studies, we annotated the clickthrough logs of four users with the OCI (commercial / non-commercial) status.

Larger-scale experimental studies will follow.

Evaluation metric: standard F1-measure

Baseline classifier: the classifier in Dai’s WWW 2006 work (http://adlab.msn.com/OCI/OCI.aspx)
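
As a reminder of how the F1-measure is computed, here is a minimal sketch for the binary commercial / non-commercial case. Treating “commercial” as the positive class is an assumption, and the label sequences are made up.

```python
def f1_score(y_true, y_pred, positive="commercial"):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up annotations and predictions for five queries.
y_true = ["commercial", "commercial", "non-commercial", "commercial", "non-commercial"]
y_pred = ["commercial", "non-commercial", "non-commercial", "commercial", "commercial"]

score = f1_score(y_true, y_pred)
```

In this toy case there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.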

F1 for users on AOL Data

Model                    User 1   User 2   User 3   User 4
Baseline (non-context)   83.4%    82.3%    84.0%    83.1%
Context based PQC        92.7%    94.2%    91.3%    92.6%

Preliminary Results

The parameter we tune is the threshold used to decide whether to add “skip edges” to the CRF model.
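
The role of this threshold can be illustrated by building only the graph structure over a session; CRF training and inference are out of scope here. The function name, the similarity values, and the use of `>=` for the comparison are assumptions.

```python
def build_crf_edges(num_queries, similarity, threshold):
    """Return (chain_edges, skip_edges) over query positions 0..num_queries-1.

    `similarity` is any callable (i, j) -> float, for example a measure
    computed from the two queries' search results.
    """
    # Linear-chain edges always link consecutive queries.
    chain_edges = [(i, i + 1) for i in range(num_queries - 1)]
    # Skip edges link non-consecutive queries that are similar enough.
    skip_edges = [
        (i, j)
        for i in range(num_queries)
        for j in range(i + 2, num_queries)
        if similarity(i, j) >= threshold
    ]
    return chain_edges, skip_edges

session = ["Graphical Model", "Machine Learning", "UCB", "Michael Jordan"]
# Hypothetical similarities for the three non-consecutive pairs.
toy_similarity = {(0, 2): 0.1, (0, 3): 0.6, (1, 3): 0.5}

def lookup(i, j):
    return toy_similarity[(i, j)]

chain, skip = build_crf_edges(len(session), lookup, threshold=0.4)
_, strict_skip = build_crf_edges(len(session), lookup, threshold=0.55)
```

Raising the threshold from 0.4 to 0.55 drops the edge between “Machine Learning” and “Michael Jordan”, which shows how the parameter trades graph density against the risk of linking unrelated queries.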

Ongoing work: Personalized Query Classification


Efficiency



More ground truth data for evaluation

PQC and Personalized Search

Similar input:
  Query Log, Clickthrough Data, IP Address, etc.

Different output:
  Personalized Search: ranked results
  PQC: discrete intention categories
    Application: advertisements etc.



Conclusions: PQC


Have user profile information?
  Profile = <User, Query, URLs>
  Output = Class
  Method = Collaborative Ranking

Have query stream information?
  Context = <User, Query-Stream, URLs>
  Output = Class
  Method = CRF-based method


Q & A