Hierarchical Topic Detection: UMass at TDT 2004



Ao Feng

James Allan

Center for Intelligent Information Retrieval

University of Massachusetts Amherst


Task this year

- The corpus is four times the size of TDT4: 407,503 stories in three languages
- Many clustering algorithms are not feasible: any algorithm with complexity Ω(n²) will take too long (see the estimate below)
- Time is limited: one month
- This year is a pilot study
- We need a simple algorithm that can finish in a short time
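
A rough estimate of why Ω(n²) is prohibitive here (my arithmetic; the throughput figure is an assumption for illustration). Computing one similarity per story pair requires

\binom{407{,}503}{2} = \frac{407{,}503 \times 407{,}502}{2} \approx 8.3 \times 10^{10}

similarity computations; at an assumed 10^6 comparisons per second, that is already about a day of computation before any merging work begins.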

HTD system of UMass

- Two-step clustering
  - Step 1: k-NN event threading
  - Step 2: agglomerative clustering
- At each step, two units are merged only if their similarity exceeds a threshold

Step 1: event threading

- Why event threading?
  - Event: something that happens at a specific time and location
  - An event contains multiple stories
  - Each topic is composed of one or more related events
  - Events have temporal locality
- What we do (sketched below):
  - Each story is compared only to a limited number of previous stories
  - For simplicity, events do not overlap (a deliberately false assumption)
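
A minimal sketch of this step in Python. The `Event` class, the `similarity` callback, and the window/threshold defaults (taken from the baseline run described later) stand in for details the slides leave open; this is not the exact UMass implementation.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass(eq=False)
class Event:
    story_ids: list = field(default_factory=list)
    centroid: dict = field(default_factory=dict)   # term -> summed weight

def _add_to_centroid(event, vector):
    for term, weight in vector.items():
        event.centroid[term] = event.centroid.get(term, 0.0) + weight

def thread_stories(stories, similarity, window=120, threshold=0.3):
    """Single-pass event threading: each story is compared only to events
    seen among the last `window` stories (temporal locality) and joins the
    best match above `threshold`; otherwise it starts a new event.
    Events never overlap -- the deliberately false assumption above."""
    events = []
    recent = deque(maxlen=window)      # events of the most recent stories
    for story_id, vector in stories:   # stories arrive in time order
        best, best_sim = None, threshold
        for event in set(recent):      # distinct events in the window
            sim = similarity(vector, event.centroid)
            if sim > best_sim:
                best, best_sim = event, sim
        if best is None:               # nothing recent is close enough
            best = Event()
            events.append(best)
        best.story_ids.append(story_id)
        _add_to_centroid(best, vector)
        recent.append(best)
    return events
```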


Step 2: agglomerative clustering

- Agglomerative clustering has complexity Ω(n²), so modification is required
- Online clustering algorithm (sketched below):
  - Limited window size
  - Merge until 1/3 of the clusters in the window are left
  - The older half of the clusters is then removed and new events come in
  - Clusters do not overlap
- Assumption: stories from the same source are more likely to be in the same topic
  - Clusters from the same source are merged first
  - Then clusters in the same language
  - Finally clusters across all languages
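
A sketch of the windowed agglomeration, reusing `Event` from the previous sketch. The 1/3 stopping point and half-window eviction come from the slide; the greedy closest-pair loop is an assumed detail, and the source-then-language-then-all merge ordering is omitted for brevity.

```python
def _merge(keep, absorb):
    """Fold cluster `absorb` into `keep` (members unioned, centroids summed)."""
    keep.story_ids.extend(absorb.story_ids)
    for term, weight in absorb.centroid.items():
        keep.centroid[term] = keep.centroid.get(term, 0.0) + weight

def _closest_pair(clusters, similarity, threshold):
    best, best_sim = None, threshold
    for i, a in enumerate(clusters):
        for b in clusters[i + 1:]:
            sim = similarity(a.centroid, b.centroid)
            if sim > best_sim:
                best, best_sim = (a, b), sim
    return best

def online_agglomerate(events, similarity, window=120, threshold=0.3):
    """Online agglomerative clustering over the events from step 1.
    Whenever the window fills, merge greedily until 1/3 of the clusters
    remain, then retire the older half to make room for new events --
    so clusters never overlap and far-apart events can never meet."""
    active, finished = [], []
    for event in events:
        active.append(event)
        if len(active) < window:
            continue
        while len(active) > window // 3:
            pair = _closest_pair(active, similarity, threshold)
            if pair is None:           # no pair is similar enough
                break
            keep, absorb = pair
            _merge(keep, absorb)
            active.remove(absorb)
        half = len(active) // 2        # the older half can no longer change
        finished.extend(active[:half])
        active = active[half:]
    return finished + active
```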

Official runs

- We submitted 3 runs for each condition
- UMASSv1 (UMass3): baseline run
  - tf-idf term weights
  - Cosine similarity (sketched below)
  - Threshold = 0.3
  - Window size = 120
- UMASSv12 (UMass2): smaller clusters get higher priority in agglomerative clustering
- UMASSv19 (UMass1): like UMASSv12, but with double the window size
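
The baseline run's term weighting and similarity as a minimal sketch; the slide does not specify the exact tf-idf variant UMass used, so the log-tf/log-idf form here is an assumption.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists -> one {term: weight} dict per document.
    Log-tf times log-idf; the precise weighting variant is an assumption."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({term: (1 + math.log(count)) * math.log(n / df[term])
                        for term, count in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# In the pipeline above, two units merge only when cosine(...) > 0.3.
```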

Evaluation results

site


score



Condition

TNO

ICT

UMass

CUHK

eng,nat

0.0262

(0.0377
0.0040)

TNO2

0.0898
(0.0966

0.0767)

ICT3d

0.2125
(0.3204
0.0030)

UMass1

0.3273

(0.4674
0.0554)

CUHK1

mul,eng

0.0275

(0.0403
0.0027)

TNO3

0.1118

(0.1212

0.0934)

ICT1e

0.1942

(0.2910
0.0063)

UMass1

0.2783

(0.3969
0.0481)

CUHK1

Our result is not good. Why?

- Online clustering algorithm
  - Reduces complexity
  - But stories far apart in time can never end up in the same cluster
  - The time-locality assumption does not hold for topics
- Non-overlapping clusters
  - Increase the miss rate
  - The correct granularity is easily missed and hard to find
- The UMass HTD system is reasonably quick but ineffective
  - One day per run

What did TNO do?

- TNO achieved 1/8 of UMass's detection cost with similar travel cost. How?
- Four steps (sketched below):
  - Build the similarity matrix for a sample of 20,000 stories
  - Run agglomerative clustering to build a binary tree
  - Simplify the tree to reduce travel cost
  - For each story not in the sample, find the 10 closest sample stories and add it to all the relevant clusters
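
A self-contained sketch of those four steps. This is my reconstruction: the `Node` class, the naive cubic-time merge loop, and routing each story to the ancestors of its nearest sample leaves are all assumptions, and step 3 is only stubbed.

```python
import random

class Node:
    """Binary-tree cluster node; leaves are sampled stories."""
    def __init__(self, vector, left=None, right=None):
        self.vector, self.left, self.right = vector, left, right
        self.parent = None
        self.extra = []            # out-of-sample stories attached in step 4

def _merge_vec(u, v):
    merged = dict(u)
    for term, weight in v.items():
        merged[term] = merged.get(term, 0.0) + weight
    return merged

def tno_style_htd(vectors, similarity, sample_size=20_000, k=10):
    # Step 1: restrict the O(m^2) similarity work to a fixed-size sample,
    # so the cost is bounded regardless of corpus size.
    sample = random.sample(vectors, min(sample_size, len(vectors)))
    leaves = [Node(v) for v in sample]
    # Step 2: agglomerative clustering over the sample into a binary tree
    # (naive cubic-time merging here; a real system would be smarter).
    roots = list(leaves)
    while len(roots) > 1:
        i, j = max(((a, b) for a in range(len(roots))
                    for b in range(a + 1, len(roots))),
                   key=lambda p: similarity(roots[p[0]].vector,
                                            roots[p[1]].vector))
        right, left = roots.pop(j), roots.pop(i)   # pop higher index first
        parent = Node(_merge_vec(left.vector, right.vector), left, right)
        left.parent = right.parent = parent
        roots.append(parent)
    # Step 3 (simplifying the tree to cut travel cost) is omitted here.
    # Step 4: every out-of-sample story joins the clusters (ancestors) of
    # its k nearest sampled stories, so clusters overlap.
    sampled_ids = {id(v) for v in sample}
    for vector in vectors:
        if id(vector) in sampled_ids:
            continue
        nearest = sorted(leaves, key=lambda lf: similarity(vector, lf.vector),
                         reverse=True)[:k]
        ancestors = {}
        for leaf in nearest:
            node = leaf
            while node is not None:
                ancestors[id(node)] = node
                node = node.parent
        for node in ancestors.values():
            node.extra.append(vector)
    return roots[0]
```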

Why is TNO successful?

- To deal with the large corpus, TNO clustered only a 20,000-document sample
- The clustering tree is binary, which captures the most possible granularities
- A branching factor of 2 or 3 reduces travel cost
- Each story can be assigned to up to 10 clusters, which greatly increases the probability of finding a perfect or nearly perfect cluster

Detection cost

- Overlapping clusters
  - By TNO's observation, adding a story to multiple clusters decreases the miss rate significantly
- Branching factor
  - A smaller branching factor keeps more possible granularities; in our experiments, limiting the branching factor improved performance
- Similarity function
  - There is no evidence that different similarity functions make a large difference
- Time locality
  - Our experiments contradict the assumption: a larger window size gives better results


Travel cost

- With the current parameter setting, a smaller branching factor is preferred (optimal value 3)
- Comparison of travel cost (eng,nat / mul,eng):
  - ICT: 0.0767 / 0.0934
  - CUHK: 0.0554 / 0.0481
  - UMass: 0.0030 / 0.0063
  - TNO: 0.0040 / 0.0027
- Reason: branching factors
- The current normalization factor is very large, making the normalized travel cost negligible compared to the detection cost

Toy example

- Most topics are small: only 20 (8%) have more than 100 stories
- Generate all possible clusters of size 1 to 100 and put them in a binary tree (see the arithmetic below)
- Detection cost for 92% of the topics is 0!
- Adding the empty cluster and the whole set, the other 8% cost at most 1
- The normalized travel cost stays small, so the combined cost is comparable to most participants'!
- With careful arrangement of the binary tree, it can easily be improved further
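
To make the scale of this construction concrete (my arithmetic, assuming clusters are drawn from all N = 407,503 stories):

\sum_{k=1}^{100} \binom{N}{k} \;\approx\; \binom{407{,}503}{100} \;\approx\; 10^{403},
\qquad \log_2 10^{403} \approx 1340

so even a balanced binary tree over all these clusters is only about 1,340 levels deep, and with a very large normalization factor that depth contributes almost nothing to the combined cost.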


What is wrong?

- The point of the travel cost is to prevent cheating experiments such as submitting the power set
- The normalized travel cost and the detection cost should be comparable
- With the current parameter setting, a small branching factor reduces both travel cost and detection cost
- Suggested modifications:
  - Use a smaller normalization factor, like the old one: the travel cost of the optimal hierarchy
  - If the normalized travel cost is too large, give it a smaller weight
  - Increase C_TITLE and decrease C_BRANCH so that the optimal branching factor is larger (5~10?)
  - Consider other evaluation measures, such as expected travel cost (still too expensive; an approximation algorithm is needed)

Summary

- This year's evaluation shows that overlapping clusters and a small branching factor give better results
- The current normalization scheme for travel cost does not work well
  - It needs modification
  - Or new evaluation methods?



Reference

Allan, J., Feng, A., and Bolivar, A., "Flexible Intrinsic Evaluation of Hierarchical Clustering for TDT," in Proceedings of CIKM 2003, pp. 263-270.