COMP3410 DB32:

Technologies for Knowledge Management

Feedback on coursework;

and Evaluation in IR and KM

By Eric Atwell, School of Computing,
University of Leeds

(including re-use of teaching resources from other sources,
esp. Stuart Roberts, School of Computing, Univ of Leeds)

today’s objectives


Feedback on db32cw1


We need to understand how we evaluate IR systems


Evaluation is also important in other technologies for
Knowledge Management, including Data-Mining

db32cw1: Evaluating Information
Retrieval Tools and Techniques for
Academic Knowledge Management

Each report should be up to 5 pages (not including
references), and should be graded for quality of:
Introduction; Methods; Results - including full
references in an Appendix; and Conclusions.

These four components will be weighted 1:1:2:2 and
combined into a single overall grade.

All grades will be on a letter scale from A** down to
F*

I Introduction:

- state the task (i.e. find impact of 3+ topics, AND do this to evaluate 3+ IR tools)

- which research topics you chose to investigate,

- and why: a methodical rationale, eg a contrast between topics with many v few papers

COMMON PROBLEM: fundamental misunderstanding, eg of the difference between “journal” and “paper”


M Methods:

- explain which IR/KM tools you investigated,

- and why: a methodical rationale, eg comparing a tool for a specialised domain (eg CiteSeer or Medline) v a general tool (eg Cite-U-Like or Google Scholar);

- what search terms you tried;

- and why: a methodical rationale, eg try the paper title, and if this does not work well try “Atwell 1987”

GOOD IDEA: comparing more than 3 tools

R Results:

- state how successful you were in finding citations in EACH of the research topics you chose,

- state different results using each of the tools you investigated;

- state different results using different search terms;

- cite EXAMPLES OF the papers you found for each topic, eg (Smith 2003);

- and list FULL references for the papers you cited in an Appendix.

GOOD IDEA: A fuller list of citation references

GOOD IDEA: A methodical overview summary of results, eg in a table

C Conclusions:

- Explain which of the research topics you chose had the most “impact”, in terms of follow-on citations

GOOD IDEA: Intelligent analysis, eg self-citation is more common but carries less impact than citation by others

- If not many (or any) follow-on papers were found for one or more of the research topics you chose, suggest why not;

- State which of the search tools was most useful;

- Why? Justify this choice;

- State recommendations for a fellow student setting out on a similar task, eg for a Final Year Project.

GOOD IDEA: no one tool is best overall (best is a sensible use of several tools)

GOOD IDEA: Referencing other comparisons of search tools (eg Harzing paper in Handouts)

GOOD IDEA: summary of pros and cons for each IR tool (cf Harzing)

Evaluation/Effectiveness measures


effort - required by the users in formulation of queries

time - between receipt of user query and production of list of ‘hits’; or between input and results/output of DM

presentation - of the output

coverage - of the collection

user satisfaction - with the retrieved items, rules etc.

recall - the fraction of relevant items retrieved

precision - the fraction of retrieved items that are relevant

recall and precision

Imagine a user has issued a query and received a set of items in response:

let R = set of items in a collection, relevant to the user’s needs; |R| = number of items in this set

let H = set of items retrieved (hits); |H| = number of hits

Precision = |R ∩ H| / |H| = |relevant hits| / |all hits|
- the fraction of retrieved items that are relevant

Recall = |R ∩ H| / |R| = |relevant hits| / |all relevant|
- the fraction of relevant items that are retrieved.

For a perfect system Precision = Recall = 1.

BUT if we retrieve ALL items in the collection we are guaranteed a recall value of 1, and minimum precision!
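
A minimal sketch (in Python, with invented document IDs) of how these two definitions turn into a calculation; this is just an illustration of the set arithmetic, not part of the coursework:

# Precision and Recall from sets of document IDs (IDs invented for illustration)
relevant = {"d1", "d3", "d4", "d7", "d9"}    # R: items relevant to the user's need
hits = {"d1", "d2", "d3", "d8"}              # H: items retrieved by the system

relevant_hits = relevant & hits              # R ∩ H

precision = len(relevant_hits) / len(hits)       # |R ∩ H| / |H|  -> 2/4 = 0.50
recall = len(relevant_hits) / len(relevant)      # |R ∩ H| / |R|  -> 2/5 = 0.40

print(f"Precision = {precision:.2f}, Recall = {recall:.2f}")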

Precision is easier to calculate


Precision can be measured, provided a user can say
whether each retrieved item is relevant to his/her query
or not. BUT only for limited set of hits, eg first page


Recall cannot normally be measured exactly since the
user is unlikely to know just how many relevant items
there are in the collection as a whole.


eg: How many pages are there on the Internet that will
give you information that would help you find a
holiday?


Nevertheless the measures are a useful way of
understanding what effects attempts to improve a
system might have. Depending on what sort of
information need we have we might try to improve
precision (usually at the expense of recall) or vice versa.

measuring Precision and Recall

How do we know (or measure) whether an item is relevant to a user or not?

Consider the query:

client AND server AND architecture

first three hits:

1. research paper on C/S architecture

2. text book on C/S architecture

3. web site of company offering C/S solutions

Different judges give different relevance judgements for the same three hits:

IR specialist: 1: relevant; 2: relevant; 3: relevant

User A (student trying to understand C/S architecture): 1: relevant; 2: relevant; 3: not relevant

User B (author of item 1, who already has item 2): 1: not relevant; 2: not relevant; 3: not relevant

User C (IT Manager looking for a C/S IT consultant): 1: not relevant; 2: not relevant; 3: relevant

measuring Recall and Precision


these measures depend on the concept of ‘relevance’


how best to determine ‘relevance’ needs careful thought


objective view: the degree to which a document deals with the
subject of interest to the user


subjective view: takes account of the user’s prior knowledge and
contents of other retrieved documents.


Problem: How to account for the ‘gap’ between the user’s interests
and the expression of interest in the form of the query?


Don’t be surprised if you develop a search engine as part of a
knowledge management system, and discover that the users
don’t rate it as highly as you rated it. IS Evaluation should
always involve end users of course, but this subjective
concept of relevance is central to the metrics

measuring Recall


how do we know what the set of all relevant
documents (in the collection) is?


detailed knowledge of complete collection usually not
practicable


under experimental conditions get users to ask queries for
which the complete set of relevant documents is known


for ranked output, keep going down the ranked list until
user has enough information. Assume there are no further
relevant documents.


Precision and Recall in ranked list of hits

[Figure: Precision (P) and Recall (R) plotted for each ranked item 1–7, compared with ideal curves P(ideal) and R(ideal); y-axis 0 to 1.2]

rank  relevant?
1     Y
2     N
3     Y
4     N
5     Y
6     Y
7     N

Assume 10 relevant items in collection
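
A minimal sketch (in Python) of the calculation behind the P and R curves, assuming the relevance judgements in the table above and the stated 10 relevant items in the collection:

# Precision and recall at each cut-off in the ranked list above
judgements = [True, False, True, False, True, True, False]   # Y N Y N Y Y N from the table
total_relevant = 10                                           # assumed relevant items in collection

relevant_so_far = 0
for rank, is_relevant in enumerate(judgements, start=1):
    if is_relevant:
        relevant_so_far += 1
    precision = relevant_so_far / rank
    recall = relevant_so_far / total_relevant
    print(f"rank {rank}: P = {precision:.2f}  R = {recall:.2f}")

Recall can only rise (or stay flat) as we go down the list, while precision rises and falls depending on where the relevant items sit.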

Gold Standards to measure precision and recall


For experiments, eg to compare performance of rival IR algorithms, researchers can select a sample and “hand-annotate”: decide in advance whether each document is relevant (and annotate/mark it as relevant)

The IR algorithm cannot “see” the annotation; it selects “hits” using other metrics

Then count the hits which are “really” relevant, and compute Precision and Recall

This also works for Machine Learning and Data Mining:

eg evaluate Clustering against pre-defined Classes

In general, the ML model predicts an annotation (tag), and evaluation compares it against the “real” tag (assigned by a human annotator)
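
A minimal sketch (in Python, with invented tags and documents) of that general case: predicted tags compared against gold-standard tags, giving per-tag precision and recall:

# Evaluating predicted tags against gold-standard tags from a human annotator
# (the tag names and the six "documents" are invented for illustration)
gold      = ["sport", "politics", "sport", "finance", "politics", "sport"]
predicted = ["sport", "sport",    "sport", "finance", "politics", "finance"]

for tag in sorted(set(gold)):
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == tag and p == tag)
    predicted_as_tag = sum(1 for p in predicted if p == tag)
    gold_as_tag = sum(1 for g in gold if g == tag)
    precision = true_pos / predicted_as_tag if predicted_as_tag else 0.0
    recall = true_pos / gold_as_tag
    print(f"{tag}: P = {precision:.2f}  R = {recall:.2f}")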


Precision and Recall: rival techniques


exact match Boolean models likely to give high
precision, low recall (eg Web of Science)


Probabilistic vector model provides ranking; precision & recall depend on cut-off, good recall if we go down the list (eg Google Scholar)


stemming leads to more matches, so expect recall to
improve overall, precision to deteriorate.


stemming will alter the rank ordering of hits; not
clear whether precision and recall will be improved
at top of ranked list.
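
A toy sketch (in Python, with an invented three-document collection and a deliberately crude suffix-stripping “stemmer”, not a real stemming algorithm) of why stemming produces more matches:

# Crude suffix-stripping "stemmer" applied to both query and documents
def crude_stem(word):
    for suffix in ("ural", "ures", "ure", "s"):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

docs = {
    "d1": "client server architecture overview",
    "d2": "architectural patterns for client server systems",
    "d3": "server room furniture catalogue",
}
query = "architecture"

exact_hits   = {d for d, text in docs.items() if query in text.split()}
stemmed_hits = {d for d, text in docs.items()
                if crude_stem(query) in {crude_stem(w) for w in text.split()}}

print("exact match  :", sorted(exact_hits))     # only d1
print("with stemming:", sorted(stemmed_hits))   # d1 and d2: more matches, better recall

Note that the same crude stripping also conflates unrelated words (eg "furniture" loses its suffix too), which is how stemming can let in spurious matches and hurt precision.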


Quantitative v Qualitative Evaluation

Easy to find a “number”, eg Precision and Recall, F-score, …

Harder to find “reasons”, examples, general classes

Eg pros and cons of each tool you are comparing

Problems common to all systems

eg reasons why papers/citations may not be found by
ANY IR tool
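
(For reference, the F-score mentioned above is the standard way of folding the two numbers into one: F1 = 2 × Precision × Recall / (Precision + Recall), the harmonic mean of Precision and Recall, which is only high when both are reasonably high.)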


Query Broadening may improve
performance


A user unaware of collection characteristics is likely to formulate a ‘naïve’ query


query broadening aims to replace the initial query
with a new one featuring one or other of:



new index terms


adjusted term weights


One method uses feedback information from the user


Good for evaluation: based on “relevance to user”
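
A minimal sketch of one well-known form of feedback-based query broadening (Rocchio-style relevance feedback), assuming queries and documents are represented as term-weight dictionaries; the weights and the alpha/beta/gamma constants below are illustrative assumptions, not the module's prescribed method:

# Rocchio-style query broadening from user relevance feedback
def broaden_query(query, relevant_docs, nonrelevant_docs,
                  alpha=1.0, beta=0.75, gamma=0.15):
    """Keep the original terms, boost terms from documents the user marked
    relevant, and penalise terms from documents marked non-relevant."""
    terms = set(query)
    for d in relevant_docs + nonrelevant_docs:
        terms |= set(d)
    new_query = {}
    for t in terms:
        weight = alpha * query.get(t, 0.0)
        if relevant_docs:
            weight += beta * sum(d.get(t, 0.0) for d in relevant_docs) / len(relevant_docs)
        if nonrelevant_docs:
            weight -= gamma * sum(d.get(t, 0.0) for d in nonrelevant_docs) / len(nonrelevant_docs)
        if weight > 0:
            new_query[t] = round(weight, 3)
    return new_query

original    = {"client": 1.0, "server": 1.0, "architecture": 1.0}
relevant    = [{"client": 0.8, "server": 0.9, "architecture": 0.7, "middleware": 0.6}]
nonrelevant = [{"client": 0.2, "furniture": 0.9}]

print(broaden_query(original, relevant, nonrelevant))
# New index terms appear (eg "middleware") and existing term weights are adjusted.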

Db32cw2: simple linux tools


I created the files using some simple linux tools


/home/www/db32/lectures/typescript


http://www.comp.leeds.ac.uk/db32/lectures/typescript


Q: Why not “normalise” to merge The, the etc?


Change to lower case, delete non-alphanumerics


… but this may not be necessary …


Precision/Recall type metrics will differ, but the
evaluation involves the question “are the results
plausible and convincing?”
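
The module files were produced with simple linux tools; purely as an illustration of the normalisation idea above (lower-casing and deleting non-alphanumerics before counting), here is a rough Python stand-in; it does not reproduce the exact pipeline used for the typescript file:

# Normalise text (lower case, keep alphanumeric runs only) and count word frequencies
import re
from collections import Counter

def word_counts(text):
    text = text.lower()                        # merge "The", "the", "THE"
    words = re.findall(r"[a-z0-9]+", text)     # drop non-alphanumeric characters
    return Counter(words)

sample = "The cat sat on the mat. THE END!"
for word, count in word_counts(sample).most_common():
    print(count, word)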


Knowledge Discovery: Key points


Knowledge Discovery (Data Mining) tools semi-automate the process of discovering patterns in data.


Tools differ in terms of what concepts they discover (differences, key-terms, clusters, decision-trees, rules)…


… and in terms of the output they provide (eg clustering
algorithms provide a set of subclasses)


Selecting the right tools for the job is based on business
objectives: what is the USE for the knowledge discovered


Human analysis and judgement are also essential

Evaluation: key points


where indexing is based on keywords from the
documents and users have little understanding of the
collection, precision and recall are likely to be poor.


difficult to measure precision and recall, as ‘relevance’ is difficult to define


Evaluation should also be qualitative: relative pros
and cons of each system being evaluated


Consider further problems which NONE of the systems handle well, eg documents not in the online collection (eg not in English, books, pay-for journals, …)