Identifying Temporal Patterns in Collections of Documents



Rich Caruana, Thorsten Joachims, Johannes Gehrke, Benyah Shaparenko


Cornell University


{caruana,tj,johannes,benyah}@cs.cornell.edu

KDD Challenge 2005

Open Task

Introduction and Goals


Evolution of Document Collections


How can one identify and compactly
summarize the temporal development of
topics in a data stream?


Identifying Influential Ideas


What are key documents that drive the
development and change?


Identifying Influential Authors


Which authors have the largest impact on

the development?


Identifying Key Documents:


Goals


Identify “leading” papers:


Which documents introduce new ideas?


Which documents most influence future work?


Constraints


Analysis is not limited to scientific papers



Must work without citation data


Only text and timestamp may be used

Identifying Key Documents:
Methods


Document Lead/Lag Index


Find k nearest neighbors (cosine distance)


Raw lead/lag score:


$$LL_{raw}(d) = (\#\,\text{later NN}) - (\#\,\text{earlier NN})$$


Scaled lead/lag score (avoid edge effects):


$$LL_{norm}(d) = \frac{LL_{raw}(d)}{\mathrm{AVG}_{d' \in \mathrm{year}(d)}\; LL_{raw}(d')}$$

[Figure: timeline of the years 1997 to 2003 with a document d and documents d1 to d4]
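
The two scores above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes documents are already tokenized, that TF-IDF weighting is used for the cosine comparison, and that the raw score is the difference between later and earlier neighbors. The function names (`tfidf_vectors`, `cosine`, `raw_lead_lag`, `scaled_lead_lag`) are hypothetical. The slides do not say what happens when a year's average raw score is zero, so the sketch returns `None` in that case.

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(docs):
    """Turn tokenized documents into unit-length sparse TF-IDF dictionaries."""
    n = len(docs)
    df = Counter(term for tokens in docs for term in set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vec = {t: c * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def raw_lead_lag(vectors, years, k):
    """(# later NN) - (# earlier NN) among the k most similar documents."""
    scores = []
    for i, vec in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: cosine(vec, vectors[j]), reverse=True)
        neighbors = others[:k]
        later = sum(years[j] > years[i] for j in neighbors)
        earlier = sum(years[j] < years[i] for j in neighbors)
        scores.append(later - earlier)
    return scores

def scaled_lead_lag(raw, years):
    """Divide each raw score by the mean raw score of its publication year."""
    by_year = defaultdict(list)
    for score, year in zip(raw, years):
        by_year[year].append(score)
    mean = {y: sum(s) / len(s) for y, s in by_year.items()}
    # Assumption: a zero yearly mean leaves the scaled score undefined (None).
    return [s / mean[y] if mean[y] != 0 else None for s, y in zip(raw, years)]
```

Calling `scaled_lead_lag(raw_lead_lag(tfidf_vectors(docs), years, k), years)` gives one score per document. The brute-force neighbor search is quadratic in the number of documents, which is fine for a sketch but would need an index for a large collection.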

Identifying Key Documents:
Biobase Results

| Score | Year | Cites | Paper Title and Authors |
| --- | --- | --- | --- |
| 1.153 | 2000 | 45 | “characterization of a new human b7-related protein: b7rp-1 is the ligand to the co-stimulatory protein icos” by t. boone, d. brankow, m.a. coccia, t. dai, j. delaney, h. han, t. horan, h. hui, s.d. khare, t. kohno, r. manoukian, k. miner, j. pistillo, m. sonnenberg, j.s. whoriskey, s.k. yoshinaga, m. zhang |
| 1.082 | 2000 | 2 (168) | “mycobacterium tuberculosis and human macrophage: the bacillus with ‘environment-sensing’” by v. colizzi, m. fraziano, f. mariani |
| 1.082 | 2000 | 242 | “taci and bcma are receptors for a tnf homologue implicated in b-cell autoimmune disease” by h. blumberg, c.h. clegg, s.r. dillon, r. enselman, k. foley, d. foster, j.a. gross, a. grossman, k. harrison, h. haugen, j. johnston, w. kindsvogel, a. littau, c. lofton-day, k. madden, m. moore, s. mudri, j. parrish-novak, w. xu |
| 1.082 | 2000 | 111 | “mouse inducible costimulatory molecule (icos) expression is enhanced by cd28 costimulation and regulates differentiation of cd4 t cells” by v.a. boussiotis, t.t. chang, p.-j. chen, t. chernova, j.s. duke-cohan, e.a. greenfield, c. jabs, v.k. kuchroo, v. ling, a.e. lumelsky, n. malenkovich, a.j. mcadam, a.h. sharpe |

Identifying Key Documents:

NIPS Results

| Score | Year | Cites | Paper Title and Authors |
| --- | --- | --- | --- |
| 1.167 | 1996 | 128 | “improving the accuracy and speed of support vector machines” by chris j.c. burges, b. schoelkopf |
| 1.128 | 1999 | 17 (466) | “using analytic qp and sparseness to speed training of support vector machines” by john c. platt |
| 0.986 | 1999 | 18 | “regularizing adaboost” by gunnar raetsch, takashi onoda, klaus-robert mueller |
| 0.953 | 1996 | 41 (3711) | “support vector method for function approximation, regression, and signal processing” by v. vapnik, s. golowich, a. smola |
| 0.945 | 1998 | 27 | “training methods for adaptive boosting of neural networks” by holger schwenk, yoshua bengio |
| 0.945 | 1997 | 3 | “modeling complex cells in an awake macaque during natural image viewing” by william e. vinje, jack l. gallant |
| 0.934 | 1998 | 17 | “em optimization of latent-variable density models” by chris bishop, markus svensen, chris williams |
| 0.934 | 1995 | 584 | “a new learning algorithm for blind signal separation” by s. amari, a. cichocki, h. h. yang |

Aggregated Lead/Lag Index:

Goals


Identify “leading” authors


Who are the major players?


Which authors most influence future work?


Constraints


Analysis is not limited to scientific papers



Must work without citation data


Only text and timestamp may be used

Aggregated Lead/Lag Index:
Methods


Author Lead/Lag Index


Assume author a has documents d_1, …, d_n


Compute scaled lead/lag score for each document and average


$$LL_{norm}(a) = \frac{1}{n}\left(LL_{norm}(d_1) + \dots + LL_{norm}(d_n)\right)$$


Compute variance v of the scaled scores of a's documents and rank by


$$LL_{norm}(a) - 2\sqrt{v / n}$$


Use smoothing to avoid small sample artifacts
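
The author ranking can be sketched as follows. This is only an illustration under assumptions: the function name `author_lead_lag` and its input layout are hypothetical, the ranking criterion is taken to be the mean minus two standard errors, and the variance is the population variance of each author's document scores, since the slides do not name an estimator. The smoothing step is not specified, so it is left out here.

```python
import math
from collections import defaultdict

def author_lead_lag(doc_scores, doc_authors):
    """Rank authors by mean scaled lead/lag score minus two standard errors.

    doc_scores:  scaled lead/lag score of each document
    doc_authors: list of author names for each document, in the same order
    Returns a list of (author, mean, lower_bound), best lower bound first.
    """
    per_author = defaultdict(list)
    for score, authors in zip(doc_scores, doc_authors):
        for author in authors:
            per_author[author].append(score)

    ranking = []
    for author, scores in per_author.items():
        n = len(scores)
        mean = sum(scores) / n
        # Assumption: population variance of the author's document scores.
        v = sum((s - mean) ** 2 for s in scores) / n
        lower_bound = mean - 2 * math.sqrt(v / n)
        ranking.append((author, mean, lower_bound))
    ranking.sort(key=lambda row: row[2], reverse=True)
    return ranking
```

Without smoothing, an author with a single document has zero variance, so the lower bound equals that one score. This is the kind of small-sample artifact the last bullet refers to.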

Aggregated Lead/Lag Index:
Biobase Results


Aggregated Lead/Lag Index:
NIPS Results

Temporal Cluster Histograms:
Goals


What are the main topics in a collection?


Identify key topics.


What fraction of the documents is on each topic?


How do topics develop?


What are new emerging topics?


Which topics are fading?


When did particular topics peak in popularity?

Temporal Cluster Histograms:
Methods


K-Means Clustering


Measure distance via TFIDF cosine on text vectors


Different values of k: 7, 13, 30


10 runs; select the run with the least squared error


Visualization


Find how many documents are in each cluster
in each year


Plot these numbers in a “stacked” histogram


Label clusters with the 5 words of highest
value in centroid vector
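
The clustering and counting steps above could look roughly like this in plain Python. It is a sketch under assumptions, not the authors' code: documents are assumed to be dense, unit-length TF-IDF vectors (for unit vectors the squared Euclidean distance equals 2 - 2·cosine, so minimizing squared error agrees with cosine distance), initial centroids are drawn at random from the documents, and all function names are hypothetical. Plotting is omitted; `yearly_counts` only produces the numbers a stacked histogram would show.

```python
import random
from collections import Counter

def sq_dist(u, v):
    """Squared Euclidean distance between two dense vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans_once(vectors, k, rng, iters=50):
    """One k-means run; returns (squared_error, assignments, centroids)."""
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                      for v in vectors]
        if new_assign == assign:
            break  # assignments are stable, so the run has converged
        assign = new_assign
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    error = sum(sq_dist(v, centroids[a]) for v, a in zip(vectors, assign))
    return error, assign, centroids

def best_kmeans(vectors, k, runs=10, seed=0):
    """Repeat k-means and keep the run with the least squared error."""
    rng = random.Random(seed)
    results = [kmeans_once(vectors, k, rng) for _ in range(runs)]
    return min(results, key=lambda result: result[0])

def yearly_counts(assign, years):
    """counts[year][cluster] = number of documents in that cluster and year."""
    counts = {}
    for cluster, year in zip(assign, years):
        counts.setdefault(year, Counter())[cluster] += 1
    return counts

def cluster_labels(centroids, vocab, top=5):
    """Label each cluster with the highest-valued terms of its centroid."""
    labels = []
    for centroid in centroids:
        order = sorted(range(len(vocab)), key=lambda i: centroid[i], reverse=True)
        labels.append([vocab[i] for i in order[:top]])
    return labels
```

A typical call would be `error, assign, centroids = best_kmeans(vectors, k=13)`, followed by `yearly_counts(assign, years)` and `cluster_labels(centroids, vocab)`.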

Temporal Cluster Histograms:
Biobase Results

12: /inf >, inf >, 2 <, <, 4 <

11: dc, csf, dcs, gm, cells

10: il, 12, gamma, cells, production

9: cells, cell, protein, hla, mhc

8: hiv, infected, cd4, virus, aids

7: alpha, &, tnf, ifn, gamma

6: isolates, pylori, strains, pcr, infection

5: patients, ra, sle, disease, hla

4: mice, vaccine, responses, immunization, immune

3: transplantation, patients, graft, transplant, rejection

2: asthma, ige, allergic, allergen, allergens

1: hcv, hepatitis, hbv, rna, liver

0: <, /sup >, sup >, cells, cd4<

[Figure: "1/6 of Biobase (15 papers min)". Stacked histograms of the clusters by Year: Number of Papers (0 to 3000) and percentage of papers (0% to 100%).]

Temporal Cluster Histograms:
NIPS Results

[Figure: "NIPS k-means clusters (k=13)". Stacked histogram of Number of Papers (0 to 180) by Year.]

Temporal Cluster Histograms:
NIPS Results

12: chip, circuit, analog, voltage, vlsi

11: kernel, margin, svm, vc, xi

10: bayesian, mixture, posterior, likelihood, em

9: spike, spikes, firing, neuron, neurons

8: neurons, neuron, synaptic, memory, firing

7: david, michael, john, richard, chair

6: policy, reinforcement, action, state, agent

5: visual, eye, cells, motion, orientation

4: units, node, training, nodes, tree

3: code, codes, decoding, message, hints

2: image, images, object, face, video

1: recurrent, hidden, training, units, error

0: speech, word, hmm, recognition, mlp

Topic Drift:

Goals


Identify change within topics


Did topic drift in focus?


Did new ideas get introduced without
abandoning a topic?

Topic Drift:

Methods


Identify drift in description of topics


Use clusters from k-means clustering


Compute centroids of each cluster for


The earlier half of the years


The later half of the years


Extract the 5 highest scoring terms in each
centroid and time period
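
The drift comparison can be sketched like this, assuming each document already has a dense term vector, a year, and a cluster assignment. The function names are hypothetical, and the rule for splitting an odd number of years (the middle year goes to the later half) is an assumption, since the slides do not say how the halves are formed.

```python
def period_centroid(vectors):
    """Mean of a non-empty list of equal-length dense vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def top_terms(centroid, vocab, top=5):
    """The `top` vocabulary terms with the highest centroid values."""
    order = sorted(range(len(vocab)), key=lambda i: centroid[i], reverse=True)
    return [vocab[i] for i in order[:top]]

def topic_drift(vectors, years, assign, vocab, top=5):
    """For each cluster, return (early_terms, late_terms)."""
    all_years = sorted(set(years))
    # Assumption: with an odd number of years, the middle year counts as "later".
    split = all_years[len(all_years) // 2]
    drift = {}
    for cluster in sorted(set(assign)):
        early, late = [], []
        for vec, year, a in zip(vectors, years, assign):
            if a == cluster:
                (late if year >= split else early).append(vec)
        drift[cluster] = (
            top_terms(period_centroid(early), vocab, top) if early else [],
            top_terms(period_centroid(late), vocab, top) if late else [],
        )
    return drift
```

Comparing the two term lists for a cluster shows whether its description changed between the two periods. A cluster with no documents in one half gets an empty list for that half.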

Topic Drift:

Biobase Results


Topic Drift:

NIPS Results


Conclusions


Scaled document lead/lag index



Works well without citation information


Ideas vs. citations (e.g. Platt’s paper)


Author lead/lag index finds key authors


Can find influential authors


K-means finds meaningful clusters



Clusters correspond to known topics in NIPS


Effectively shows development of topics


Areas for Future Work


Temporal flow clustering


Determine flow in stream of ideas


Analyze how topics “fork” and “merge” over time


Explicitly exploit time in distance metric


Characterizing author behavior


Are there patterns of how authors move from topic to topic?


What marks emerging trends early? (e.g. prominent authors)


Who are early vs. late adopters of trends?


What characterizes authors that publish on few/many topics?


Improved clustering algorithms


Guidance in determining distance metric, number of clusters


Meta-clustering


Burst (new topic) detection and how bursts align with events


Corpora beyond scientific literature


Email, web pages, news, etc.