The Question of Quality


Nov 29, 2013 (3 years and 8 months ago)



Most of this presentation is based on the work of Marcos Gonçalves, as cited in the references.

Goals for this class


Consider quality in digital libraries


How do we define quality


How do we measure quality


How does quality control impact a user?

Understanding Quality in a DL


Quality indicators: proposed descriptions of
quantities or observable variables that may
be related to quality


“measures” = stronger term. Requires validation


Gonçalves et al. provide an analysis of quality conditions and
recommend specific quantities to be used.


Dimensions of quality


Proposed indicators


Application to DL concerns

Getting the data


Where does the data come from?


Logging


Surveys


Focus Groups


Know what information is needed, then
choose the method most likely to provide the
data.


More about the sources of data after we see what
we need to know.

What are we looking for?


What characteristics of a digital library raise
questions about quality?


Data objects


Metadata


Collection


Catalog


Repository


Services


What characteristics do we want each of
those to have?

Dimensions of Quality



Digital Object


Accessibility


Pertinence


Preservability


Relevance


Similarity


Significance


Timeliness


Metadata Specification


Accuracy


Completeness


Conformance


Collection


Completeness


Catalog


Completeness


Consistency


Repository (may hold more than one collection)


Completeness


Consistency


Services


Composability


Efficiency


Effectiveness


Extensibility


Reusability


Reliability

Spot check


For your digital library project,


how will you define quality for each of
these factors?


Data objects


Metadata


Collection


Catalog


Repository


Services


What is your intention, or your
goals, for each of these?

I will ask each group to
present two of these (briefly),
but prepare all of them.

Information need - Digital Objects


Accessibility


What collection?


# of structured streams


Rights management
metadata


Communities to be served


Relevance


Feature frequency


Inverse document frequency


Document size


Document structure


Query size


Collection size


Significance


Citation/link patterns




Preservability


Fidelity (lossiness)


Migration cost


Digital object complexity


Stream formats


Pertinence


Context


Information content


Information need


Similarity


All the same features as in
relevance


Also: citation/link patterns


Timeliness


Age


Time of latest citation


Collection freshness


Information need - Metadata Specification


Accuracy


Accurate attributes


# attributes in the record


Completeness


Missing attributes


Schema size


Conformance


Conformant attributes


Schema size
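
The quantities listed above pair up naturally as ratios. The sketch below is one plausible way to combine them, not a formula taken from this slide: the function names and the example counts are hypothetical, and it assumes each indicator is normalized to the range 0 to 1.

```python
def accuracy(accurate_attributes: int, attributes_in_record: int) -> float:
    """Fraction of the attributes present in a record that are judged accurate."""
    if attributes_in_record == 0:
        return 0.0
    return accurate_attributes / attributes_in_record


def completeness(missing_attributes: int, schema_size: int) -> float:
    """One minus the fraction of schema attributes missing from the record."""
    return 1.0 - missing_attributes / schema_size


def conformance(conformant_attributes: int, schema_size: int) -> float:
    """Fraction of schema attributes whose values follow the schema's rules."""
    return conformant_attributes / schema_size


# Hypothetical record described with a 15-attribute schema:
# 10 attributes filled in, 9 of them accurate, 12 conformant.
record_quality = {
    "accuracy": accuracy(9, 10),
    "completeness": completeness(5, 15),
    "conformance": conformance(12, 15),
}
print(record_quality)
```
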

Information - Collection and Catalog


Completeness of the Collection


Collection size


Size of an “ideal” collection


Completeness of the Catalog


# of digital objects with no metadata


Item level metadata


Size of the collection


Catalog Consistency


# of metadata specifications per digital object
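
As a rough illustration of how these counts could be turned into indicators, the sketch below treats completeness as a simple ratio and flags catalog inconsistency when a digital object does not have exactly one metadata specification. The function names, the ratio forms, and the sample numbers are assumptions made for illustration only.

```python
def collection_completeness(collection_size: int, ideal_size: int) -> float:
    """Share of an 'ideal' collection that the actual collection covers."""
    return collection_size / ideal_size


def catalog_completeness(objects_without_metadata: int, collection_size: int) -> float:
    """Share of digital objects that have at least one metadata record."""
    return 1.0 - objects_without_metadata / collection_size


def inconsistent_objects(specs_per_object: dict) -> list:
    """Objects whose number of metadata specifications is not exactly one."""
    return sorted(obj for obj, count in specs_per_object.items() if count != 1)


# Hypothetical numbers for a small collection.
print(collection_completeness(800, 1000))
print(catalog_completeness(40, 800))
print(inconsistent_objects({"do1": 1, "do2": 0, "do3": 2}))
```
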

Information about the Repository


Completeness


# of collections


Consistency


# of collections


Catalog/collection match


How well do the catalogs match the collections?


Are the catalogs for all the collections at the
same level of detail?

Service Information Need


Composability (ability to be combined to form new services)


Extensibility


Reusability


Efficiency



Response time


Effectiveness


Precision/recall (of search)


Classification


Extensibility


# extended services


# services in the DL


# lines of code per service
manager


Reusability


# reused services


# services in the DL


# lines of code per service
manager


Reliability


# service failures


# accesses
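
Two of the service indicators above reduce to short calculations. The sketch below assumes reliability is one minus the failure rate and uses the standard set-based definitions of precision and recall; the function names and sample values are hypothetical.

```python
def reliability(service_failures: int, accesses: int) -> float:
    """One minus the fraction of accesses that ended in a service failure."""
    return 1.0 - service_failures / accesses


def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Standard precision and recall for one search result set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical log counts and one hypothetical query.
print(reliability(3, 1000))
print(precision_recall({1, 2, 3, 4}, {2, 4, 5}))
```
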


Making more concrete


Each of the measures listed gives an
idea of the information need


Exactly what do we measure?


How do we combine numbers obtained
to get a usable result?


Following pages describe specific
measures and formulas for combining
those.

Digital Object Accessibility


Basic requirement


If a user cannot access the DO, there is little point in
having it in the DL


Identified measures:


Collection, # structured streams, rights management
metadata, communities


Say it another way:


Is it present in a collection in the repository?


Is there a service that can retrieve and display the content?


Is the rights management open enough for access by this
user?

Digital Object Accessibility - formally

Define $do_x$ = a specific digital object and $ac_y$ = a specific actor.

$$
Acc(do_x, ac_y) =
\begin{cases}
0, & \text{if there is no collection } C \text{ in the DL repository } R \text{ such that } do_x \in C \\[4pt]
\dfrac{\sum_{z \in struct\_streams(do_x)} r_z(ac_y)}{|struct\_streams(do_x)|}, & \text{otherwise}
\end{cases}
$$

where $r_z(ac_y)$ is a rights management rule defined as

$$
r_z(ac_y) =
\begin{cases}
1, & \text{if } z \text{ has no access constraints, or } z \text{ has access constraints and } ac_y \in cm_z \\
0, & \text{otherwise}
\end{cases}
$$

Here $cm_z \in Soc(1)$ is a community that has the right to access $z$.

This does not deal with accessibility related to accessing the streams
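
The definition above can be sketched in a few lines of Python. This is an illustrative reading, not an implementation from the source: the `Stream` class, the idea of representing an actor by the set of communities it belongs to, and the example stream names are all assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class Stream:
    """One structured stream of a digital object (hypothetical model)."""
    name: str
    # None means the stream has no access constraints.
    allowed_communities: Optional[FrozenSet[str]] = None


def rights_rule(stream: Stream, actor_communities: Set[str]) -> int:
    """r_z(ac_y): 1 if the actor may access stream z, else 0."""
    if stream.allowed_communities is None:
        return 1
    return 1 if actor_communities & stream.allowed_communities else 0


def accessibility(streams: List[Stream], in_collection: bool,
                  actor_communities: Set[str]) -> float:
    """Acc(do_x, ac_y): share of the object's streams the actor can access."""
    if not in_collection or not streams:
        return 0.0
    return sum(rights_rule(s, actor_communities) for s in streams) / len(streams)


# Hypothetical object with four streams, two of them restricted to "campus".
streams = [
    Stream("abstract"),
    Stream("chapter1", frozenset({"campus"})),
    Stream("chapter2"),
    Stream("appendix", frozenset({"campus"})),
]
print(accessibility(streams, True, set()))        # outside actor
print(accessibility(streams, True, {"campus"}))   # campus member
```
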


An illustration


NDLTD is the Networked Digital Library of
Theses and Dissertations


Some institutions require that all theses and
dissertations be stored in this DL


Student chooses how visible to make the
document.


Parts of the document may be visible while other parts
are not


The document, or parts of it, may be visible to a
restricted community.

Accessibility case


$etd_x$ is a specific electronic thesis or dissertation of interest

$$
acc(etd_x, ac_y) =
\begin{cases}
0, & \text{if it is not in the collection} \\[4pt]
\dfrac{\sum_{z \in struct\_streams(etd_x)} r_z(ac_y)}{|struct\_streams(etd_x)|}, & \text{otherwise}
\end{cases}
$$

where $r_z(ac_y) = 1$ if $etd_x$ is marked “world wide access”, or if $etd_x$ is marked “local institution only” and $ac_y \in C$, where $C$ is defined as the identifiable members of the local institution

and $r_z(ac_y) = 0$ otherwise

With the numbers


An example from VT


For authors' names beginning with A (219 entries):


Unrestricted ETDs: 164


Restricted ETDs: 50


Mixed ETDs: 5


Fraction unrestricted for each of the 5 mixed ETDs: (0.5, 0.5, 0.167, 0.1875, 0.6)


Overall measure of accessibility outside VT:


(164 * 1 + 50 * 0 + 0.5 + 0.5 + 0.167 + 0.1875 + 0.6) / 219 ≈ 0.76
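
The arithmetic for the VT example can be checked with a short script. The numbers are the ones given on this slide; the variable names are chosen here for illustration.

```python
# Counts for VT ETDs whose author names begin with "A" (from the slide).
unrestricted = 164   # fully visible outside VT: score 1 each
restricted = 50      # not visible outside VT: score 0 each
# Fraction of streams that are unrestricted in each of the 5 mixed ETDs.
mixed_fractions = [0.5, 0.5, 0.167, 0.1875, 0.6]

total = unrestricted + restricted + len(mixed_fractions)
overall = (unrestricted * 1.0 + restricted * 0.0 + sum(mixed_fractions)) / total

print(total)              # 219
print(round(overall, 2))  # 0.76
```
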

Spot check


What is the accessibility for theses at
VT for authors' names beginning with
D?


(See Table 3.3 of the Quality chapter)

Solidifying Pertinence


How do we measure something like
pertinence?


Relation between the information
content of a digital object and the need
of the user


Depends on the user’s situation: background, current context, etc.

Pertinence


$Inf(do_i)$ represents the information content of digital object $do_i$


$IN(ac_j)$ is the Information Need of actor (user) $ac_j$


$Context(ac_j, k)$ is the combined effects of social factors that determine the pertinence of $do_i$ to $ac_j$ at time $k$

Two communities of actors


Users, whose information needs we try to satisfy


External Judges, who are responsible for judging the relevance of a document in response to a query


Non-overlapping groups

Pertinence formula


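$Pertinence(do_i, ac_j, k)$: $Inf(do_i) \times IN(ac_j) \times Context(ac_j, k)$, defined as

$$
Pertinence(do_i, ac_j, k) =
\begin{cases}
1, & \text{if } Inf(do_i) \text{ is judged by } ac_j \text{ to be informative with regard to } IN(ac_j) \text{ in context } Context(ac_j, k) \\
0, & \text{otherwise}
\end{cases}
$$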


Rather complex way to say that the
information is relevant if either the user or a
qualified independent judge says it is

Preservability


Property of a digital object that
describes its state relative to changes in
hardware and software, representation
format standards


Ex. new recording technologies
(replacement of VHS video tapes by
DVDs)


New versions of software such as Word or
Acrobat


New image standards such as JPEG 2000

Digital preservation techniques


Migration


Transform from one format to another


Ex. Open the document in one format and save in another or do an
automated transformation


Emulation


Reproducing the effect of the environment originally used to
display the material


Keep an old version of the software, or have new software that can
read the old format


Wrapping


Keep the original format, but add enough human-readable
metadata so that it can be decoded in the future


Note that the material is not directly usable


Refreshing


Copy the stream of bits from one location to another


Particularly suitable for guarding against the physical deterioration of
the medium


Most commonly used

Preservability issues


Obsolescence


How out of date is the digital object?


Many versions of the software?


Old storage media?


Difficult to migrate


Appropriate tools? Expertise?


Fidelity


How different is the migrated version from the original?


Distortion = loss of information


Preservability of a digital object in a digital library is a function
of the fidelity of the migration and the obsolescence of the
object


$Preservability(do_i, dl) = (fidelity(do_i, format_x, format_y),\ obsolescence(do_i, dl))$


Two values to reflect the two dimensions of the concept: fidelity and
obsolescence

Miniclip

Internet Archive

Preservability factors


Capital direct costs


Software


Developing software to create new versions of the object
or obtaining licenses for new versions of the original
software


Hardware


For processing the migration and for storing the results


Indirect operating costs


Monitoring digital objects for migration needs


Maintaining up-to-date intellectual property rights


Storage


Staff training

Calculating Obsolescence


$obsolescence(do_i, dl)$ = cost of converting/migrating the digital object $do_i$
within the context of a specific digital library

Calculating fidelity


Fidelity is the inverse of distortion.

$$
fidelity(do_i, format_x, format_y) = \frac{1}{distortion(mp(format_x, format_y)) + 1.0}
$$



One common measure of distortion


mean squared error (mse)


Let $\{x_n\}$ be a stream of $do_i$ and $\{y_n\}$ be the converted stream

$$
mse(\{x_n\}, \{y_n\}) = \frac{1}{N} \sum_{n=1}^{N} (x_n - y_n)^2
$$

Use mse for distortion:

$$
fidelity(do_i, format_x, format_y) = \frac{1}{mse(\{x_n\}, \{y_n\}) + 1.0} = \frac{1}{\frac{1}{N} \sum_{n=1}^{N} (x_n - y_n)^2 + 1.0}
$$

No distortion must yield a fidelity of 1.0 (hence the + 1.0 in the denominator).
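
A minimal sketch of the fidelity calculation, assuming each stream is available as an equal-length sequence of numeric samples (for example, pixel values before and after conversion). The function names and sample data are illustrative.

```python
from typing import Sequence


def mse(original: Sequence[float], converted: Sequence[float]) -> float:
    """Mean squared error between the original and converted streams."""
    if len(original) != len(converted) or len(original) == 0:
        raise ValueError("streams must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(original, converted)) / len(original)


def fidelity(original: Sequence[float], converted: Sequence[float]) -> float:
    """Fidelity = 1 / (distortion + 1.0), with mse used as the distortion."""
    return 1.0 / (mse(original, converted) + 1.0)


# A lossless migration keeps fidelity at 1.0; a lossy one lowers it.
print(fidelity([10, 20, 30], [10, 20, 30]))  # 1.0
print(fidelity([0, 0], [2, 2]))              # mse = 4, fidelity = 0.2
```
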

A Preservation Scenario

From Gonçalves, adapted from one of his sources


Librarian learns that special collection of 1,000 digital images, stored
in TIFF v5.0, is in danger of obsolescence because the latest
version of the display software does not support that version.


Librarian decides to migrate all images to JPEG 2000, now the
de facto image preservation standard, recommended by the Research
Libraries Group (RLG)


Librarian does search for options, finds a tool costing $500, that
converts TIFF 5.0 to JPEG 2000


About 20 hours needed to order, install, learn, apply the software to
all images. Hourly rate of $66.60 per library employee.


To save space, choose to use a compression rate that produces
average mse = 8 per image.


Preservability of each image = preservability(image-TIFF5.0, dl) =
(1/9, ($500 + $66.60 * 20)/1000) = (0.11, $1.83)

The first value is the fidelity, 1/(distortion + 1): higher is better, and 1.0 means no loss.

The second value is the obsolescence cost per image, (tool cost + hourly rate * hours) / # images: lower is better.
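
The scenario's two numbers can be reproduced directly. The figures below come from the scenario; the variable names are chosen here for illustration.

```python
tool_cost = 500.00    # price of the TIFF 5.0 to JPEG 2000 converter, in dollars
hourly_rate = 66.60   # cost of one hour of library staff time, in dollars
hours = 20            # time to order, install, learn, and apply the tool
n_images = 1000       # size of the special collection
average_mse = 8       # distortion introduced by the chosen compression rate

fidelity = 1 / (average_mse + 1)
obsolescence_cost = (tool_cost + hourly_rate * hours) / n_images

print(round(fidelity, 2))           # 0.11
print(round(obsolescence_cost, 2))  # 1.83 dollars per image
```
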

Relevance


$$
Relevance(do_i, q) =
\begin{cases}
1, & \text{if } do_i \text{ is judged by an external judge to be relevant to query } q \\
0, & \text{otherwise}
\end{cases}
$$


Measure of the distance between the vector
representing the object and the vector representing
the query


The “external judge” requirement makes the
measure objective and independent of local
contextual issues.

Relevance has a consistency, independent of the momentary information need.


Pertinence is a measure of usefulness within a particular information need.
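
The vector comparison mentioned above can be illustrated with a small TF-IDF and cosine similarity sketch, which draws on the feature frequency, inverse document frequency, and collection size inputs listed earlier. This is one common way to estimate relevance, not necessarily the one used in the cited work; the three toy documents and the query are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, n_docs):
    """Weight each term by term frequency times inverse document frequency."""
    counts = Counter(tokens)
    return {t: tf * math.log(n_docs / doc_freq[t])
            for t, tf in counts.items() if doc_freq.get(t)}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy collection (hypothetical).
docs = {
    "d1": "digital library quality model".split(),
    "d2": "image migration and preservation cost".split(),
    "d3": "quality of metadata in a digital library".split(),
}
n_docs = len(docs)
doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))

query_vec = tfidf_vector("metadata quality".split(), doc_freq, n_docs)
scores = {name: cosine(tfidf_vector(tokens, doc_freq, n_docs), query_vec)
          for name, tokens in docs.items()}
print(sorted(scores, key=scores.get, reverse=True))  # d3 ranks first
```

A score like this only estimates relevance; under the definition on this slide, the final 0 or 1 value still comes from an external judge.
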

Significance



Significance is an expression of the
absolute usefulness of a given digital
object, independent of particular user
needs.


Citation records of objects in digital
libraries offer one measure of significance.
(This disadvantages the most recently
obtained objects, since they have had less
time to be cited by others.)

Look at ACM DL and the citation counts, for example.

Life Cycle and Quality


The quality indicators relate to the core components of a
digital library


creation, use, finding, distribution.


Creation


Authoring, modifying


Describing, Organizing, Indexing


Use


Access, filtering


Finding (seeking)


Searching, Browsing, recommending


Distribution


Storing


Archiving


Networking


Quality and Life Cycle - 2

Quality and Life Cycle - 3


Note that some elements repeat



Timeliness is relevant to the content and to
the metadata that describes the content


Accessibility affects both usefulness and
distribution.


References


Gonçalves, M. A., Moreira, B. L., Fox, E. A., and Watson, L. T. “Quality Model for Digital Libraries”.