talk slides - Penn State University


RefSeerX: Citation Recommendation

1. Context-based

2. Without user supervision

Qi He, The Pennsylvania State University, 2010



Recommendations




Movies

TV Genius

While search engines help you find things you know you are looking for, a recommender helps you find the rest.

Music

Our focus: citation recommendation from the social network perspective


RefSeerX: in the scientific domain (CiteSeerX)

Can be generalized to other linked corpora, e.g., the Web



paper

citations

The authors are unaware of related work

they do not know what they are looking for

a recommender helps them find the citations

Previous work


document recommendation systems


recommend to reviewers/readers

[Basu01][Wang08][Chandrasekaran08]

user profile: published papers; homepage etc.

Full-text similarity

[Shaparenko09]

Based on partial list of citation
r




full list r

Collaborative Filtering: [McNee02][Torres04][Zhou08]


paper-citation, author-paper, paper-venue graphs

Relevance features between paper and citation:


[Strohman07] train text features, citation graph features on r



[Tang09] topical similarity between r


of a placeholder and its

context


r


burden

Previous work


link prediction


Pair-wise citation probability: [Nallapati08][Tang09]

- Scalability:

- modeled for every pair of papers

- Iterative training process

- Has to retrain when a new document is added

- Topic inconsistency:











[Diagram: Paper, Topic, Citation, Topic, with "generate" and "sample" links; the two topics could be the same. Consistency?]

Our idea: citation context


Citation context: words surrounding each placeholder


Impact of context length [Ritchie08]


We used 50 words before and after each placeholder
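The windowing described above can be sketched as follows. This is a minimal illustration, not the RefSeerX implementation: the `[CIT]` placeholder token, whitespace tokenization, and the function name `citation_contexts` are all assumptions made for the example.

```python
def citation_contexts(tokens, placeholder="[CIT]", window=50):
    """For each placeholder, collect up to `window` words before and after it."""
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == placeholder:
            before = tokens[max(0, i - window):i]
            after = tokens[i + 1:i + 1 + window]
            # Other placeholders that fall inside the window carry no words, so drop them.
            contexts.append([w for w in before + after if w != placeholder])
    return contexts


# Hypothetical sentence with two citation placeholders; a small window keeps the demo readable.
text = "frequent itemset mining was introduced in [CIT] and later extended in [CIT] to sequences"
ctxs = citation_contexts(text.split(), window=3)
for ctx in ctxs:
    print(" ".join(ctx))
```

With the default `window=50` the same function yields the 50-words-before-and-after contexts mentioned on the slide.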



Context can be used as features to model citation motivations [Aya05]


Context-based recommendation



The context may not be consistent with the text of citation


Document similarity may not work well then



Context can be used to complement the content of citation [Huang04]


What the author writes (text of citation) + what some others read (citation context)



RefSeerX: Context-aware Citation Recommendation



Context is used to link the paper and the citation



Similar contexts lead to similar citations

Missing important citations in a specific place;



Know the citation context.

A public demo: http://cxs06.ist.psu.edu/demo




Advantages over general search engines


Use keywords in the citation context to query a literature search engine like Google Scholar or CiteSeerX

The classic paper that others read as the “frequent itemset mining” work.

Not found on the 1st page of Google Scholar & CiteSeerX!

Challenges

How to model citations

Context tends to be biased and static.

From the social network perspective: aggregation over others is required so that we can get a robust representation of a citation.



How to use contexts


Explainable: the citation is relevant/authoritative to the specific citation context.


Cross-context effect: the citation needs to take into account the various contexts in the manuscript.










How to handle the sparsity problem (the number of citations follows a power law)

Papers without citations (no in-link citations)

We use the document similarity.


2.1 Mixture Models

……

2.2 Nonparametric Distribution

Citations to nonparametric mixture models may be ranked high (save quota)

Formalize the problems

Global Recommendation

Input: a manuscript (i.e., a global context and a set of out-link local contexts)

Output: a ranked list of citations as the bibliography of the manuscript

Local Recommendation

Input: a few sentences (i.e., an out-link local context) as the query

Output: a ranked list of citations for the placeholder associated with the out-link local context.

Two steps

Step 1: the candidate set - quickly retrieve a large candidate set which has good coverage over the possible citations.

Step 2: modeling context-based citation relevance - for each placeholder or for the bibliography, rank the citations by relevance and return the top K.

The candidate set: context-oblivious candidates

4 types of candidates

$d_1$: manuscript (query)

1: GN (e.g., G100) - the top N documents with abstract and title most similar to $d_1$.

2: Author - the documents that share authors with $d_1$.

3: CitHop - the papers cited by documents (1 and 2).

4: AuthHop - the documents written by the authors whose papers are in 1, 2 and 3.

The candidate set: context-aware candidates

2 types of candidates

$c^*$: a local context (query)

$c'$: another local context in the same manuscript

5: LN (e.g., L100) - the top N papers whose in-link contexts are most similar to a local context.

6: LCN (e.g., LC100) - the papers containing the top-N out-link contexts most similar to a local context.

Evaluate the candidate set

Training: 456,787 docs, 1,810,917 contexts

Testing: 1,612 docs

Methods                        Coverage per Doc    Candidate Set Size per Doc
                               (higher better)     (smaller better)
G1000                          0.44                1,000
L100                           0.55                341
LC100                          0.63                674
L1000                          0.69                2,844
LC1000                         0.78                5,692
Author                         0.05                17
L100+CitHop                    0.61                1,327
L1000+CitHop                   0.72                9,049
LC100+CitHop                   0.73                3,561
G1000+CitHop                   0.73                3,790
LC1000+CitHop                  0.83                22,629
Author+CitHop                  0.15                63
L100+G1000                     0.75                1,312
LC100+G1000                    0.79                1,618
(L100+CitHop)+G1000            0.79                2,279
(LC100+CitHop)+G1000           0.85                4,460
(LC1000+G1000)+CitHop          0.92                24,793
LC100+G1000+(Author+CitHop)    0.79                1,674
(LC100+G1000)+AuthHop          0.88                39,496

We used one of these sets (highlighted on the slide) as the final candidate set.
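The two table columns can be computed along these lines. The slides do not define "Coverage per Doc" precisely, so this sketch assumes it is the fraction of a test document's original citations that appear in its candidate set, averaged over test documents; the function name and the toy data are hypothetical.

```python
def coverage_and_size(candidates_by_doc, citations_by_doc):
    """Return (average coverage, average candidate set size) over test documents."""
    coverages, sizes = [], []
    for doc, cited in citations_by_doc.items():
        candidates = candidates_by_doc.get(doc, set())
        if cited:
            # Assumed definition: share of true citations retrieved as candidates.
            coverages.append(len(cited & candidates) / len(cited))
        sizes.append(len(candidates))
    avg_cov = sum(coverages) / len(coverages) if coverages else 0.0
    avg_size = sum(sizes) / len(sizes) if sizes else 0.0
    return avg_cov, avg_size


# Toy data: two test documents with their original citations and retrieved candidates.
citations = {"docA": {"p1", "p2", "p3", "p4"}, "docB": {"p5", "p6"}}
candidates = {"docA": {"p1", "p2", "p9"}, "docB": {"p5", "p6", "p7"}}
coverage, size = coverage_and_size(candidates, citations)
print(coverage, size)
```

Combining candidate types such as LC100+G1000 corresponds to taking the set union of the per-type candidate sets before calling the function.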

Context-Based Relevance Model (CRM)

Conditional similarity

$d_1$: manuscript, with contexts $c_1, c_2, \ldots, c_{k_1}$

$d_2$: citation candidate, with contexts $b_1, b_2, \ldots, b_{k_2}$

$c_1$ and $b_1$: global contexts

others: local contexts

conditional similarity: $sim(d_1, d_2; C)$ or $sim(d_1, d_2; c^*)$

A heuristic and its disadvantages

Let's look at an angle measure like cosine similarity:

if $d_1$ is not similar to $d_2$, $d_1$ is similar to $c$, and $d_2$ is similar to $c$; then $d_1$ may be relevant to (cite) $d_2$ under the context $c$.

A simple heuristic:

$$sim(d_1, d_2) = \cos(d_1, c)\,\cos(d_2, c).$$

Problems:

1) Comparing a context with a paper: they may not be consistent with each other!

2) If $d_1$ is the manuscript and $c$ is the query context, the ranking of $d_2$ is irrelevant to $\cos(d_1, c)$. No cross-context effect.
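Problem 2 can be checked numerically. The sketch below assumes bag-of-words vectors and made-up candidate texts; it shows that, for a fixed manuscript and context, ranking candidates by the product of cosines gives the same order as ranking by the candidate-to-context cosine alone.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def heuristic_sim(d1, d2, c):
    """The simple heuristic: cos(d1, c) * cos(d2, c)."""
    return cosine(d1, c) * cosine(d2, c)


# Hypothetical manuscript, query context, and candidates.
d1 = Counter("mining frequent patterns in large databases".split())
c = Counter("frequent itemset mining".split())
candidates = {
    "apriori": Counter("fast algorithms for mining association rules frequent itemset".split()),
    "svm": Counter("support vector machines for classification".split()),
    "fpgrowth": Counter("mining frequent patterns without candidate generation".split()),
}

by_heuristic = sorted(candidates, key=lambda n: heuristic_sim(d1, candidates[n], c), reverse=True)
by_context_only = sorted(candidates, key=lambda n: cosine(candidates[n], c), reverse=True)
print(by_heuristic, by_context_only)
```

Because $\cos(d_1, c)$ is a positive constant across candidates, it scales every score equally and cannot change the order.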

Improve the heuristic

Use contexts to model papers

$$d_1 = [\, c_1 \;\; c_2 \;\; \cdots \;\; c_{k_1} \,], \qquad d_2 = [\, b_1 \;\; b_2 \;\; \cdots \;\; b_{k_2} \,]$$

Each context is a dimension

All in-link contexts are equally important (not always! Just a simple assumption)

Local recommendation: $d_1 = [\, c_1 \;\; c^* \;\; \cdots \;\; c_{k_1} \,]$

Global recommendation = A mixture of local recommendations

Measure a context with a vector of contexts

Geometric distance

Each $d$ has a real vector space $V$ (words of $d$; contexts are unit column vectors in $V$)

If $d$ contains all possible contexts, $d = \mathbb{R}^n = V$. In general, $d$ defines a subspace of $V$. For simplicity, denoted as $V$.

$$p(V) = p(v_1 \oplus v_2 \oplus v_3) = p(v_1) + p(v_2) + p(v_3) = 1$$

Distance between $d$ and $c$:

1) If $c$ is linearly spanned over $d$, span({$c$}) $= V$; then $p(c) = 1$;

2) If $c$ does not share any word with $d$, $p(c) = 0$.

$p_d(c)$ = information kept after projecting $d$ on span({$c$})

So, $p_d(c)$ defines an effective probabilistic and geometric relevance measure for $c$ to document $d$.

Gleason's Theorem

Compute $p_d(c)$:

$$p_d(c) = \mathrm{Trace}(T_d P_c) = \mathrm{Trace}(T_d\, c c^T) = c^T T_d\, c$$

- $T_d$ is a symmetric positive semidefinite matrix and $\mathrm{Trace}(T_d) = 1$

- $T_d$ represents and characterizes $d$.

- $P_c$ is the projection matrix for the subspace spanned by $c$.
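A small numeric check of the quadratic form $c^T T_d\, c$ may help. The matrix below is a made-up density matrix (symmetric, positive semidefinite, trace 1) over a three-word vocabulary, and `relevance` is a hypothetical helper name.

```python
import math


def relevance(T_d, c):
    """p_d(c) = c^T T_d c for a unit column vector c."""
    n = len(c)
    return sum(c[i] * T_d[i][j] * c[j] for i in range(n) for j in range(n))


# Made-up density matrix: document d puts equal weight on the first two dimensions.
T_d = [
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0],
]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
diagonal = [1 / math.sqrt(2), 1 / math.sqrt(2), 0.0]

for c in basis + [diagonal]:
    print(c, relevance(T_d, c))
```

Summing $p_d$ over an orthonormal basis gives $\mathrm{Trace}(T_d) = 1$, which is what makes the measure behave like a probability.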

Represent $d$: $T_d$

Assume an unknown generative distribution $p_{gen}$ for contexts.

$p_{gen}$ independently generates $k$ contexts and these contexts are then independently judged to be relevant to $d$:

$$\prod_{i=1}^{k} p_{gen}(c_i)\, p_d(c_i)$$

Optimization:

$$T_d = \arg\max_{T_d} \left[ \sum_{i=1}^{k} \log p_{gen}(c_i) + \sum_{i=1}^{k} \log\left( c_i^T T_d\, c_i \right) \right]$$

If there is only one context in $d$ ($k = 1$), then $T_d = c_1 c_1^T$ and $p_d(c_1) = 1$ (similarity to itself); however, how about the general case with a large number of contexts?
Complexity of generating $T_d$

Proposition 1: $T_d$ can be represented as $\sum_{i=1}^{r} t_i t_i^T$, where the $t_i$ are a set of at most $r$ orthogonal column vectors with $\sum_{i=1}^{r} (t_i^T t_i) = 1$, and $r$ is the number of linearly independent contexts. There are $O(r^2)$ parameters to determine, and numerical techniques will scale as a polynomial in $r^2$.

1) A popular paper usually has thousands of in-link citation contexts, so $r$ could be a large number.

2) Hundreds of thousands of density matrices need to be estimated (one for each document).

3) The addition of new documents to the corpus will cause the addition of new in-link contexts, requiring a re-computation of all the $T_d$.

Closed-form solution of generating $T_d$

Observation: For each context $c_i$, the maximum likelihood estimate of $T_d$ given that $c_i$ is relevant is $c_i c_i^T$. Then, it is reasonable to assume that the overall estimator of $T_d$ should be similar to all $c_i c_i^T$.

A new closed-form optimization:

$$T_d = \arg\min_{T_d} \sum_{i=1}^{k} \left\| T_d - c_i c_i^T \right\|_F^2$$

Taking derivatives, we get

$$T_d = \frac{1}{k} \sum_{i=1}^{k} c_i c_i^T$$
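The closed form is simply the average of the outer products of the context vectors. Here is a minimal pure-Python sketch, assuming each context is already a unit-length vector over a shared vocabulary; the helper names `unit` and `density_matrix` and the toy vectors are assumptions.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def density_matrix(contexts):
    """T_d = (1/k) * sum_i c_i c_i^T for unit context vectors c_i."""
    k, n = len(contexts), len(contexts[0])
    T = [[0.0] * n for _ in range(n)]
    for c in contexts:
        for i in range(n):
            for j in range(n):
                T[i][j] += c[i] * c[j] / k
    return T


# Three toy in-link contexts over a three-word vocabulary.
contexts = [unit([1.0, 1.0, 0.0]), unit([0.0, 1.0, 1.0]), unit([1.0, 0.0, 0.0])]
T_d = density_matrix(contexts)
trace = sum(T_d[i][i] for i in range(3))

# With a single context, T_d = c c^T and the context has relevance 1 to itself.
c1 = unit([3.0, 4.0])
T_single = density_matrix([c1])
self_relevance = sum(c1[i] * T_single[i][j] * c1[j] for i in range(2) for j in range(2))
print(trace, self_relevance)
```

Each unit-length $c_i$ contributes $c_i^T c_i / k = 1/k$ to the trace, so the result has trace 1 without any iterative fitting.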
Relevance for global recommendation $sim(d_1, d_2; C)$

Proposition 2: Let $T_{d_1} = \frac{1}{k_1} \sum_{i=1}^{k_1} c_i c_i^T$ and $T_{d_2} = \frac{1}{k_2} \sum_{i=1}^{k_2} b_i b_i^T$. Then,

$$sim(d_1, d_2; C) = \frac{1}{k_1 k_2} \sum_{i=1}^{k_1} \sum_{j=1}^{k_2} \left( c_i^T b_j \right)^2$$

Given a query $d_1$, we rank citation candidates $d_2$ using Proposition 2.
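Proposition 2 reduces to averaging squared dot products over all context pairs. The sketch below assumes unit-length context vectors; the function names and the toy manuscript and candidates are hypothetical.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def global_sim(d1_contexts, d2_contexts):
    """sim(d1, d2; C) = 1/(k1*k2) * sum_i sum_j (c_i . b_j)^2."""
    total = sum(dot(c, b) ** 2 for c in d1_contexts for b in d2_contexts)
    return total / (len(d1_contexts) * len(d2_contexts))


# Toy manuscript with two contexts and three candidates described by their in-link contexts.
manuscript = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
candidates = {
    "p1": [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    "p2": [[0.0, 0.0, 1.0]],
    "p3": [[1.0, 0.0, 0.0]],
}
scores = {name: global_sim(manuscript, ctxs) for name, ctxs in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

No per-document matrix needs to be materialized: the score only touches the context vectors themselves.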

Relevance for local recommendation $sim(d_1, d_2; c^*)$

Proposition 4 (applying Gleason's Theorem twice): Let $T_{d_1} = \frac{1}{k_1} \sum_{i=1}^{k_1} c_i c_i^T$ and $T_{d_2} = \frac{1}{k_2} \sum_{i=1}^{k_2} b_i b_i^T$. Then,

$$sim(d_1, d_2; c^*) = \frac{1}{k_1 k_2} \sum_{i=1}^{k_1} \sum_{j=1}^{k_2} \left( c_i^T c^* \right)^2 \left( b_j^T c^* \right)^2$$

Given a query $d_1$ and $c^*$, we rank citation candidates $d_2$ using Proposition 4.








Proposition 3: Let $T_{d_2} = \frac{1}{k_2} \sum_{i=1}^{k_2} b_i b_i^T$,

$$sim(c^*, d_2) = \frac{1}{k_2} \sum_{j=1}^{k_2} \left( b_j^T c^* \right)^2$$

Given only $c^*$, we rank citation candidates $d_2$ using Proposition 3.
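The two local-recommendation scores (Propositions 3 and 4) can be sketched as below, assuming unit-length context vectors. The function names are hypothetical, and the Proposition 4 double sum is computed as a product of two averages, which is algebraically the same quantity.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def context_relevance(c_star, contexts):
    """Proposition 3 form: (1/k) * sum_j (b_j . c*)^2."""
    return sum(dot(b, c_star) ** 2 for b in contexts) / len(contexts)


def local_sim(d1_contexts, d2_contexts, c_star):
    """Proposition 4 form: 1/(k1*k2) * sum_i sum_j (c_i . c*)^2 (b_j . c*)^2.

    The double sum factorizes into the product of the two single-document averages.
    """
    return context_relevance(c_star, d1_contexts) * context_relevance(c_star, d2_contexts)


# Toy query context, manuscript contexts, and one candidate's in-link contexts.
c_star = [1.0, 0.0, 0.0]
d1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
d2 = [[0.6, 0.8, 0.0], [1.0, 0.0, 0.0]]

single = context_relevance(c_star, d2)
cross = local_sim(d1, d2, c_star)
print(single, cross)
```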

Evaluation Metric (1)

Recall: we removed the original citations from the testing documents. The recall is defined as the percentage of original citations that appear in the top K recommended citations.
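Recall at K as defined above can be sketched in a few lines. The paper identifiers and the function name are made up for illustration.

```python
def recall_at_k(ranked, original, k):
    """Fraction of the original (held-out) citations found in the top-k recommendations."""
    top_k = set(ranked[:k])
    return len(top_k & set(original)) / len(original)


# Toy ranked list and the citations that were removed from the test document.
ranked = ["p3", "p7", "p1", "p9", "p4"]
original = {"p1", "p4", "p8"}
r3 = recall_at_k(ranked, original, 3)
r5 = recall_at_k(ranked, original, 5)
print(r3, r5)
```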

Evaluation Metric (2)

Co-cited probability: for each pair $\langle d_i, d_j \rangle$, where $d_i$ is an original citation and $d_j$ is a recommended one, calculate the probability that they have been co-cited by other papers in the past:

$$P = \frac{\text{number of papers citing both } d_i \text{ and } d_j}{\text{number of papers citing } d_i \text{ or } d_j}$$

Then averaged over all $l \cdot K$ pairs, where $l$ is the number of original citations.
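The co-cited probability is a ratio of set sizes over the papers that cite each document. In this sketch the `citers` mapping (document to the set of papers citing it) is toy data and the function names are hypothetical.

```python
def co_cited_probability(citers_i, citers_j):
    """Papers citing both d_i and d_j, divided by papers citing d_i or d_j."""
    union = citers_i | citers_j
    return len(citers_i & citers_j) / len(union) if union else 0.0


def average_co_cited(original, recommended, citers):
    """Average over all l*K pairs of (original citation, recommended citation)."""
    pairs = [(di, dj) for di in original for dj in recommended]
    total = sum(co_cited_probability(citers[di], citers[dj]) for di, dj in pairs)
    return total / len(pairs)


# Toy citation index: which papers cite each document.
citers = {"a": {1, 2, 3}, "b": {2, 3, 4}, "x": {2, 3}, "y": {9}}
avg = average_co_cited(["a", "b"], ["x", "y"], citers)
print(avg)
```

Here $l = 2$ original citations and $K = 2$ recommendations give four pairs, two of which are co-cited with probability 2/3.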

Evaluation Metric (3)

NDCG: $d_1$ (testing doc), $d_2$ (candidate)

Use the average co-cited probability of $d_2$ with all original citations of $d_1$ to weigh the citation relevance score of $d_2$ to $d_1$.

Then, we sort all $d_2$ w.r.t. this score (suppose $P_{max}$ is the highest score) and define a 5-scale relevance number for them as the ground truth: 4, 3, 2, 1, 0 for documents in $(3P_{max}/4, P_{max}]$, $(P_{max}/2, 3P_{max}/4]$, $(P_{max}/4, P_{max}/2]$, $(0, P_{max}/4]$ and $0$.
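The grading rule maps each score to one of five relevance levels by its position relative to $P_{max}$. The slides do not say which DCG variant was used, so the gain $(2^{rel} - 1)$ and the $\log_2$ discount below are an assumption (a common choice); the function names and scores are also hypothetical.

```python
import math


def relevance_grade(score, p_max):
    """Map an average co-cited probability to the 5-scale ground truth (4..0)."""
    if score <= 0 or p_max <= 0:
        return 0
    if score > 0.75 * p_max:
        return 4
    if score > 0.5 * p_max:
        return 3
    if score > 0.25 * p_max:
        return 2
    return 1


def dcg(grades):
    # Assumed DCG form: gain (2^rel - 1), discount log2(rank + 1).
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(grades))


def ndcg_at_k(grades_in_ranked_order, k):
    ideal = dcg(sorted(grades_in_ranked_order, reverse=True)[:k])
    return dcg(grades_in_ranked_order[:k]) / ideal if ideal else 0.0


# Toy scores for five candidates, listed in the order the recommender ranked them.
scores = [0.8, 0.1, 0.0, 0.5, 0.3]
p_max = max(scores)
grades = [relevance_grade(s, p_max) for s in scores]
print(grades, ndcg_at_k(grades, 5))
```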

Global recommendation baselines

Baselines       Descriptions
HITS            Ranked by the authority scores in the candidate set subgraph
Katz            Ranked by Katz distance
l-count         Ranked by the number of citations in the candidate set subgraph
g-count         Ranked by the number of citations in the whole corpus
textsim         Ranked by similarity with the query manuscript
diffusion       Ranked by topical similarities generated by the multinomial diffusion kernel to the query manuscript
mix-features    Ranked by the weighted linear combination of the above 6 features

Global recommendation performances

Why is Katz also good?

Recall: G1000+CitHop 0.73 (recall) 3,790 (size); LC100+G1000 0.79 (recall) 1,618 (size)

The jump of HITS

Some candidates with moderate authority scores are targets (people stop citing the most well-known papers after they become the standard techniques and shift their attention to other recent good papers).

Top 25 (1st page): 40% of the original citations can be found

How about conference papers only?

[Figure: recall (0.3 to 1) vs. top k recommendations per document (k = 25 to 250)]

Randomly selected 100 conference papers for testing.

0.865 recall on the first page!

Local recommendation performances

CRM-crosscontext: local contexts help each other

already beats the baselines of global recommendation in a harder problem

CRM-singlecontext: treat local contexts independently

similar to the simple cosine similarity heuristic (not the same), which is worse

the heuristic:

$$sim(d_1, d_2) = \cos(d_1, c)\,\cos(d_2, c).$$

now,

$$sim(d_1, d_2; c^*) = \frac{1}{k_1 k_2} \sum_{i=1}^{k_1} \sum_{j=1}^{k_2} \left( c_i^T c^* \right)^2 \left( b_j^T c^* \right)^2$$
How about new papers?

New papers without enough in-link citation contexts are hard to recommend.

Document-based similarity is not good.

RefSeerX: Citation Recommendation

1. Context-based

2. Without user supervision



Citation recommendation without user supervision

Totally unaware of related work

do not know the context

…To tackle the problem, we develop a context-aware approach. The core idea is to design a novel non-parametric probabilistic model which can measure the context-based relevance between a citation context and a document…

1. Truly novel?

2. Or just unaware of related work?

Identify the citation context

When you write papers, how many times do you want to make some citations at a place but you are not sure which papers to cite? Do you wish to have a recommendation system which can recommend a small number of good candidates for every place that you want to make some citations? In this paper, we present our initiative of building a context-aware citation recommendation system. High quality citation recommendation is challenging: not only should the citations recommended be relevant to the paper under composition, but also should match the local contexts of the places citations are made. Moreover, it is far from trivial to model how the topic of the whole paper and the contexts of the citation places should affect the selection and ranking of citations. To tackle the problem, we develop a context-aware approach. The core idea is to design a novel non-parametric probabilistic model which can measure the context-based relevance between a citation context and a document. Our approach can recommend citations for a context effectively. Moreover, it can recommend a set of citations for a paper with high quality. We implement a prototype system in CiteSeerX. An extensive empirical evaluation in the CiteSeerX digital library against many baselines demonstrates the effectiveness and the scalability of our approach.

The whole text as a long query?

or

automatically identify the citation context as the query

Context-level language model

Drawback: the testing window could be uniformly distributed over contexts and is not directly relevant to certain citations.

Document-level relevance model

Drawbacks:

1) Sparsity of contexts for the good citations;

2) Not scalable.

Topic-level relevance model

Drawback: A single feature is not enough.

Citation contexts need features

Variable-length Markov learning process:

Problem: the relations among features are non-linear.

Inference with the ensemble of decision trees

An unbiased score for a single decision tree is:

The least-variance estimator of the score is the weighted sum where the weights are the inverse of the variances.
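Inverse-variance weighting can be sketched as follows. The per-tree scores and variances are made-up numbers, the function name is hypothetical, and normalizing the weights so that they sum to one is an assumption (it keeps the combined score unbiased when each tree's score is).

```python
def combine_scores(scores, variances):
    """Weighted sum of per-tree scores with weights proportional to 1/variance."""
    weights = [1.0 / v for v in variances]
    # Assumption: weights are normalized to sum to 1.
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Two hypothetical trees: the first is more certain (smaller variance), so it dominates.
combined = combine_scores([0.2, 0.8], [0.01, 0.04])
print(combined)
```

The result lies closer to the low-variance tree's score than a plain average would.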

Evaluation Metrics

$W_g$: the part of a testing paper covered by the ground truth contexts

$W_t$: the part covered by testing contexts generated by our methods

Results

Other Applications - Verbose Search

Allow users to submit verbose queries to portray their information needs in search engines.

Instead of context decomposition, match to search log contexts once arrived in the target Web pages

What did Steve Jobs say about the iPad? Rumor said that their next product would be iMattress.

Search Log

Other Applications - Local Search

Associate the results with local contexts.

Search “the hot bar in San Francisco”

The results labeled/reviewed/talked about by local bar guys are ranked higher

Free: easy to cheat

The results associated with local contact numbers are ranked higher

The results called often by local people are ranked higher

Expensive: not easy to cheat

More data, more reliable

Differentiate the bursty pattern from the constant pattern

The hot bar in San Francisco

Other Applications - Online Advertising

Deliver advertisements to the user in some place within the actual body text of a Web page. The advertising context is the HTML content surrounding the advertising place.

Sell advertising contexts to advertisers, rather than asking them to bid on entity keywords.

Can we ask Apple to buy this place for the iMattress ad? Even with no such entity keyword.

Conclusions

Define the context-based citation recommendation problem

Use contexts to model citations

Build the fast context-conditioned relevance model

A public demo and extensive empirical evaluations