LBSC670_Class09_cv_110711x - Erik Mitchell


15 Nov 2013


LBSC 670

Information Organization


Today


Guest Speaker

Jeremy York


HathiTrust



Classification Thoughts and CV


Overview & History


Related concepts


Examples


A note on MARC specifications



Classification concepts

Aboutness, specificity, granularity



“Words have power”: classification systems exist within a socio-political context


Classification methods
Manual/automatic, pre/post-coordinate, hierarchical/faceted, formal/social

CV overview


What are controlled vocabularies?


Types


Basic concepts


How are CVs created and maintained?


Metadata standards


Example Systems


When does a CV turn into a KO?


Term Lists, Thesauri, Taxonomies, Ontologies



Controlled Vocabularies


Organized lists of words and phrases, or notation systems, that are used to initially tag content, and then to find it through navigation or search.

(Warner via Leise, Fast)



The primary purpose of vocabulary control is to achieve consistency in the description of content objects and to facilitate retrieval.

(ANSI Z39.19)



Knowledge Organization



Tools that present the organized interpretation of knowledge structures

(Hjørland)




Classification schemes that organize materials at a general level…, subject headings that provide more detailed access, and authority files that control variant versions of key information

(Hodge)


Uses of controlled vocabulary


Define scope, content, and context of a body of knowledge

Support discovery - navigation, search, browsing

Map information objects to user terminology

Enforce term consistency and relationships

A good CV. . .


Removes ambiguity



Defines relationships between things

Contextualizes information

CV Concepts


Content Analysis


Ambiguity


Synonymy


Exhaustivity


Specificity


Co-extensivity


Aboutness


Semantic structure


Warrant (User, Literary, Organizational)



Form Analysis


Linguistics


Grammar


Semiotics


Single / Multiple terms



Indexing & Retrieval


Pre vs. Post Coordinate


Recall vs. Precision


Natural language processing (NLP)

http://bit.ly/lbsc_670_cv

Content Analysis


Ambiguity


Each term should relate to a single concept


Synonymy


Each concept should be identified by a single entry


Specificity


Using the most specific words or phrase expressing the subject


Exhaustivity


The extent to which the entire document is indexed (summarization, depth)


Co-extensivity

Assign as many terms as needed to bring out the main theme and, according to guidelines, sub-themes. (p. 29, Lancaster)

Nothing more, nothing less



Semantic Structure


Terms can be related through equivalence, hierarchical, or associative relationships (USE, See, NT, BT, RT)




Content Analysis (2)


Aboutness = Subject/topic?

Wilson (1968)
Author intent, topicality, relationship to other resources, textual analysis

Fairthorne (1969)
Intentional aboutness (author), extensional aboutness (document)

Maron (1977)
Objective about (document), subjective about (user), and retrieval about (information retrieval)

Hjørland (2001)
Closely related to theories of meaning, interpretation, and epistemology



Content Analysis (3)


Wilson's criteria for evaluating aboutness (1968)

Identify author's purpose (intent)

Weigh the predominant topics, elements (topical analysis)

Group/count a document's use of concepts and references (bibliometrics)

Identify essential elements (text analysis)

Content Analysis (4)


Literary Warrant



The inclusion of a vocabulary term in a controlled vocabulary based on its appearance in one or more content items. For example, a medical text may use the term "oncology." Based on literary warrant, that term would be included in the controlled vocabulary even though the general public uses the term "cancer."

(Glosso-Thesaurus)



User Warrant



The inclusion of a vocabulary term in a controlled vocabulary based on use by users. Such terms can be identified through search log analysis or free listing.

(Glosso-Thesaurus)



Organizational Warrant



Justification for the...selection of a preferred term due to the characteristics and context of the organization using the resource


(ANSI Z39.19)



Form Analysis


Linguistics


Syntax/Form (grammar)


Morphology (internal word structure)


Semantics (meaning)


Pragmatics, discourse analysis (word/phrase use)

Semiotics

Study of signs/symbols



Lexical structure


Document layout, markup, tags (think DOM)

Indexing & Retrieval


Pre/Post-Coordinate


Organization prior to retrieval


Organization at the point of retrieval


Recall / Precision


Recall: number of relevant docs retrieved / total number of relevant docs in collection

Precision: number of relevant docs retrieved / total number of docs retrieved


Natural language processing


Uses semantics and syntax to automatically distill aboutness


Recall & Precision


A collection of 100 documents


Searches



Vocabularies



Recall 100/100 = 1


Precision 100/100 = 1



Facet



Recall 20/100= .2


Precision 20/28 = .71



OWL



Recall 1/100 = .01


Precision 1/1 = 1

CV Entry                   # of docs
Controlled Vocabularies    100
Faceted analysis           20
Ontologies                 5
OWL                        1
RDF                        3

Recall = # of relevant docs retrieved / total # of relevant docs in collection

Precision = # of relevant docs retrieved / total # of docs retrieved
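
A minimal Python sketch of the two measures, using the standard definitions: recall is relevant documents retrieved over all relevant documents in the collection, and precision is relevant documents retrieved over all documents retrieved. The function names are illustrative, and reading the "Facet" search as 20 relevant hits among 28 retrieved documents, with 100 relevant documents in the collection, is an assumption about the slide's numbers.

```python
def recall(relevant_retrieved: int, relevant_in_collection: int) -> float:
    """Share of all relevant documents that the search actually found."""
    return relevant_retrieved / relevant_in_collection


def precision(relevant_retrieved: int, total_retrieved: int) -> float:
    """Share of the retrieved documents that are actually relevant."""
    return relevant_retrieved / total_retrieved


# Assumed reading of the "Facet" search: 20 relevant documents retrieved,
# 28 documents retrieved in total, 100 relevant documents in the collection.
facet_recall = recall(20, 100)
facet_precision = precision(20, 28)
print(f"recall={facet_recall:.2f} precision={facet_precision:.2f}")
```

A narrow term such as "OWL" shows the usual trade-off: precision can be perfect (1/1) while recall stays very low (1/100).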

Types of Controlled Vocabularies


Term Lists


Glossaries, Dictionaries, Gazetteers, Folksonomies



Synonym rings


Z39.19 example


Oracle Text



Taxonomies


Website navigation scheme



Thesauri / Ontologies


Authority files, subject thesauri, topic maps

http://www.taxotips.com/

Thesauri & taxonomy examples


List of vocabularies

http://www.slais.ubc.ca/resources/indexing/database1.htm



Taxonomy warehouse


Two Examples


Health & Ageing Thesaurus


Thesaurus of Geographic names


CV Structures


Organization structures


Hierarchical systems


Term Lists / Enumerative systems


Hierarchies


Trees


Facets / Associative relationships


Folksonomies



Hierarchies


Features


Inclusiveness



"Is-a" relationship

Inheritance

Transitivity

Systematic

Mutually exclusive

Necessary and sufficient
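
Transitivity and inheritance in an "is-a" hierarchy can be sketched in a few lines of Python. The terms and the `BROADER` mapping below are hypothetical examples, not taken from any real vocabulary, and the sketch assumes each term has at most one broader term.

```python
# Hypothetical hierarchy: each term points to its single broader term.
BROADER = {
    "poodles": "dogs",
    "dogs": "mammals",
    "mammals": "animals",
}


def ancestors(term: str) -> list:
    """Walk up the hierarchy and return every broader term, nearest first."""
    chain = []
    seen = {term}
    while term in BROADER:
        term = BROADER[term]
        if term in seen:  # guard against an accidental cycle in the data
            break
        seen.add(term)
        chain.append(term)
    return chain


def is_a(narrow: str, broad: str) -> bool:
    """Transitivity: narrow is-a broad if broad is anywhere above narrow."""
    return broad in ancestors(narrow)


print(ancestors("poodles"))
print(is_a("poodles", "animals"))
```

Because the relationship is transitive, a document indexed under "poodles" can also be retrieved by a search on "animals" without being assigned that term directly.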






From http://bit.ly/lbsc_670_cv

Relationships


Equivalence (Term Lists)
use, see, isVersionOf, isFormatOf

Hierarchical (Thesauri, Taxonomies)
Generic
is a
Partitive
is part of, has part, has conceptual part, member of
Instance

Associative (Facets, Ontologies)
isReferencedBy, isRequiredBy, hasDerivative


Faceted vocabularies

Multi-dimensional, multi-relationship driven, Subject, Object, Predicate

From http://bit.ly/lbsc_670_cv
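
One way to picture the subject, predicate, object idea is as a list of triples that can be filtered along several facets at once. This is only a sketch under stated assumptions: the document identifiers, the predicate names (`hasTopic`, `hasFormat`), and the helper functions are all hypothetical.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical descriptions of two documents along two facets.
TRIPLES = [
    Triple("doc1", "hasTopic", "thesauri"),
    Triple("doc1", "hasFormat", "slides"),
    Triple("doc2", "hasTopic", "thesauri"),
    Triple("doc2", "hasFormat", "article"),
]


def match(predicate: Optional[str] = None, obj: Optional[str] = None) -> List[Triple]:
    """Return triples that agree with every part of the pattern that is given."""
    return [
        t for t in TRIPLES
        if (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


def facet_filter(**facets: str) -> List[str]:
    """Return subjects that satisfy every predicate=value facet supplied."""
    subjects = {t.subject for t in TRIPLES}
    for predicate, value in facets.items():
        subjects &= {t.subject for t in match(predicate, value)}
    return sorted(subjects)


print(facet_filter(hasTopic="thesauri", hasFormat="slides"))
```

Each facet narrows the result set independently, which is what makes the description multi-dimensional rather than a single fixed hierarchy.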

Folksonomy


Features


Single level description

Open vocabulary list

User supplied/harvested tags

http://trendistic.indextank.com/

Term List Examples



Authority files


Maps to preferred terms


Library of Congress


Encoded Archival Context


Union List of Artist Names


Glossaries/Dictionaries

Words & definitions, sometimes topic focused

Glosso-Thesaurus

Folksonomies

Contextualization, Trend discovery, Personal Information

Synonym rings

Used for back-end equivalence in searching


Princeton Wordnet
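
Back-end equivalence with a synonym ring can be sketched as simple query expansion. The rings, sample documents, and function names below are hypothetical, and the substring matching is deliberately naive (a real system would tokenize the text first).

```python
# Hypothetical synonym rings: every member is treated as equivalent.
SYNONYM_RINGS = [
    {"car", "automobile", "motorcar"},
    {"film", "movie", "motion picture"},
]


def expand_query(term: str) -> set:
    """Return the whole ring for a term, or just the term if it has no ring."""
    term = term.lower()
    for ring in SYNONYM_RINGS:
        if term in ring:
            return set(ring)
    return {term}


def search(term: str, documents: dict) -> list:
    """Return ids of documents whose text contains any equivalent term."""
    variants = expand_query(term)
    return [
        doc_id
        for doc_id, text in documents.items()
        if any(variant in text.lower() for variant in variants)
    ]


docs = {
    "d1": "A history of the automobile",
    "d2": "Movie posters",
    "d3": "Gardening",
}
print(search("car", docs))
```

Unlike a thesaurus, the ring has no preferred term: the user never sees the expansion, it simply raises recall behind the scenes.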



Choosing a framework


Use questions


Who is your user, what are their needs?


What systems are your users familiar with?


Will this system be internal/external?


Content questions


How extensive, defined is the information?


Is your subject matter static or fluid?


What organizational framework best describes your content?


System Questions


What access are you trying to provide?


What external pressures exist?


What external entities/theories will interact with this system?

Thesauri Definitions



Guide to use of terms, showing relationships between them, for the purpose of providing standardized, controlled vocabulary for information storage and retrieval

(Monash)

A list of words showing similarities, differences, dependencies, and other relationships to each other

(USG)

Creating a CV (1)


Design methods


Re-use existing, start with content & desired use ideas

Committee / community approach

Top-down

Concept driven

Bottom-up

Document driven

Empirical approach

Deductive approach

Select terms, create relationships, perform term control

Inductive approach

Establish CV at outset, build hierarchies on an as-needed basis

Creating a CV (2)


Top-Down (deductive)

Identify audience

Identify all topics, concepts, uses, and context of the domain

Sort topics identified into an appropriate organization scheme (enumerative, hierarchical, faceted)

Solidify structure and clean up gaps & redundancies

Assign documents to categories, test retrieval


Bottom-up (inductive)

Identify audience

Survey documents for topics/concepts

Build system on the fly; let content drive structure and limits of system

Identify gaps & redundancies in system

Test retrieval



Creating a CV (3)


Think about scope, use, content, maintenance


Gather Terms


Based on existing systems, content


Based on user needs/expectations


Investigate issues of specificity, exhaustivity, granularity


Build hierarchies, relationships


Broader/narrower terms, Related terms, Use/Use for, see/see also


Establish Rules


Implement


Evaluate


Maintain

http://www.boxesandarrows.com/view/creating_a_controlled_vocabulary

Evaluating a CV


Goals


Determine if the CV solves retrieval needs of user/system

Determine if CV matches user's content model/term expectations

Methods

Expert evaluation of CV

User-based card sorting compared to actual CV

Identification of non-included documents

Analysis of use of system - HCI

CV Maintenance


Primary responsibility


Editor, board, committee


New terms


Is it really new or a different view


What is the proper form & placement


Modified terms


Include a change log


Use a "USE" reference to point to new term

Deleted terms

Unused / Overused terms

May want to keep for historical retrieval purposes


Modification history


Use modification notes, date/time stamps
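
The maintenance steps for a modified term (keep the old form, point it to the new one with a USE reference, and log the change with a timestamp) might look like the following sketch. The data layout, the example terms, and the function names are assumptions made for illustration only.

```python
from datetime import datetime, timezone

# Hypothetical vocabulary: "use" is None for a preferred term.
vocabulary = {"Aeroplanes": {"use": None}}
change_log = []


def rename_term(old: str, new: str, note: str) -> None:
    """Add the new preferred term and keep the old one as a USE reference."""
    vocabulary[new] = {"use": None}
    vocabulary[old] = {"use": new}
    change_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "old": old,
        "new": new,
        "note": note,
    })


def resolve(label: str) -> str:
    """Follow USE references until a preferred term is reached."""
    seen = set()
    while vocabulary[label]["use"] is not None and label not in seen:
        seen.add(label)
        label = vocabulary[label]["use"]
    return label


rename_term("Aeroplanes", "Airplanes", "Spelling updated to match user warrant")
print(resolve("Aeroplanes"))
print(change_log[0]["note"])
```

Keeping the old label in the vocabulary means documents indexed under it remain retrievable, which is the historical-retrieval concern raised for deleted terms.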



Case study - MeSH

http://www.nlm.nih.gov/bsd/disted/video/

Thesauri Concepts


Preferred terms


Non-preferred terms

Semantic relations between terms

How to apply terms (guidelines, rules)

Scope notes

Adding terms (how to produce terms that are not listed explicitly in the thesaurus)

Common thesaural identifiers


SN     Scope Note (instruction, e.g. don't invert phrases)
USE    Use (another term in preference to this one)
UF     Used For
BT     Broader Term
NT     Narrower Term
RT     Related Term
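
As a rough illustration, one thesaurus entry carrying these identifiers could be modeled as a small Python record. The `Term` class, its field names, and the example terms are hypothetical and do not come from any particular thesaurus or standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    """One thesaurus entry; each field mirrors a thesaural identifier."""
    label: str
    scope_note: str = ""                                 # SN
    use: Optional[str] = None                            # USE (set on non-preferred terms)
    used_for: List[str] = field(default_factory=list)    # UF
    broader: List[str] = field(default_factory=list)     # BT
    narrower: List[str] = field(default_factory=list)    # NT
    related: List[str] = field(default_factory=list)     # RT


# Hypothetical entries: one preferred term and one non-preferred term.
thesaurus: Dict[str, Term] = {
    "Automobiles": Term(
        "Automobiles",
        scope_note="Passenger road vehicles",
        used_for=["Cars"],
        broader=["Motor vehicles"],
        narrower=["Electric automobiles"],
        related=["Roads"],
    ),
    "Cars": Term("Cars", use="Automobiles"),
}


def preferred_label(label: str) -> str:
    """Return the preferred term for a label, following its USE reference."""
    entry = thesaurus[label]
    return entry.use or entry.label


print(preferred_label("Cars"))
```

USE and UF are reciprocal: the non-preferred entry points to the preferred term, and the preferred entry lists the forms it replaces.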

Thesauri Guides


National Information Standards Organization. (2005). Guidelines for the construction, format, and management of monolingual controlled vocabularies. ANSI/NISO Z39.19-2005. Bethesda, MD: NISO Press.
http://www.niso.org/standards/resources/Z39-19-2005.pdf?CFID=5559601&CFTOKEN=31747314

Aitchison, Jean & Gilchrist, Alan. Thesaurus Construction: A Practical Guide. 3rd ed. London: Aslib, 1997.

Willpower Information Management Consultants
http://www.willpower.demon.co.uk/thesprin.htm



Thesaurus Exploration


http://www.getty.edu/research/tools/vocabularies/tgn/



Protégé introduction and tour


What is Protégé?


What is it used for?


How will we use it this semester?

When is a CV an Ontology?



The study of being or existence

A specification of a conceptualization (Gruber)

An ontology formally defines a common set of terms that are used to describe and represent a domain. (OWL)

Webster's Dictionary

Webster's Third New International Dictionary defines Ontology as:

1. A science or study of being, specifically a branch of metaphysics* relating to the nature and relations of being.

2. A theory concerning the kinds of entities and specifically the kinds of abstract entities that are to be admitted to a language system.

* Metaphysics: Nature of being or existence.

Next Week


Work time for Protégé


Exploration of ontologies