Digital Libraries & Document Image Analysis
Henry S. Baird
Statistical Pattern & Image Analysis research
Information Sciences & Technologies Lab
ICDAR Aug 4, 2003 - HSB
DLs as seen by a DIA Researcher
15 years in DIA R&D
Lucky to have known/collaborated with:
– PARC DL enthusiasts: Masinter, Street, Bloomberg, et al
– UC Berkeley Digital Library project: Wilensky, Fateman, et al
– CMU Universal Library project: Thibadeau, Hauptmann, et al
– Xerox Scanning Service Bureaus: Wallis, et al
– … many others with an interest in DLs
What challenges do DLs pose to DIA R&D?
Digital Library Dreams
Electronic networked DLs promise to provide:
– more books, journals, etc
– to more people
– faster
– at more places & times
than physical libraries can hope to….
The Ideal DL: an international, interoperable, sustainable body of rich cultural materials in digital form
Document Images’ Usefulness in DLs
raster image -> display, print
+ OCRed text -> + retrieve (more or less well)
+ correct text -> + retrieve well, reuse, summarize, translate, …
+ links (e.g. HTML) -> + Web publishing
+ functional tags (e.g. XML) -> + “semantic web”
+ layout format (e.g. RTF) -> + reprinting
+ metadata (title, author, …) -> + index, catalogue
Advantages of Digital Displays versus Ink-on-Paper
Many…
– networked -- potentially unbounded content
– rapidly rewritable -- supports animation
– radiant -- legible in the dark
– sensitive -- markable, interactive
Generally thought to be overwhelming, but …
Advantages of Ink-on-Paper versus Digital Displays
PAPER | DISPLAYS today | DISPLAYS in future
cheap | expensive | less expensive
large, many | small, few | larger, more
high-resolution | low-resolution | higher-resolution
lightweight | heavy | lighter
thin | thick | thinner
unpowered | powered | lower power
stable | requires maintenance |
eBooks, e-paper, notebooks, laptops, PDAs, …
A. Dillon, “Reading from Paper versus Screens: a critical review of the empirical literature,” Ergonomics 35(10): 1297-1326, 1992.
The fact is, for many uses
Paper is Still Widely Preferred
“Paper [remains today] the medium of choice for reading, even when most high-tech [display] technologies are to hand”
-- Sellen & Harper (2002)
Why is this? Paper allows:
– flexible navigation through documents
– cross-referencing of several documents
– annotations
– interweaving of reading and writing
A. J. Sellen & R. H. R. Harper, The Myth of the Paperless Office, The MIT Press, Cambridge, MA, 2002.
Document Images are Doubly Disadvantaged within DLs
They fail to support most uses that symbolically encoded, tagged data do
They lose many key advantages they enjoyed on paper
A Threat: ‘If it’s not in Google, I don’t need it!’
Can they be made as useful in DLs as encoded data?
Can they sometimes work better in DLs than encoded data?
…these are challenges to us, the DIA R&D community.
The British Library
‘The World’s Knowledge’
38.8M items catalogued
website: 18.4M page hits/year
Compare Google:
• >3B pages
• 150M searches/day
“[Reinforcing] the Library’s role as the pre-eminent global document supplier, digital scanning from print and microfilm originals will give researchers rapid, high quality delivery from over 100 million research articles, reports, and conference papers direct to their desktop.”
-- Lynne Brindley, Chief Executive, 2002-2003 Annual Report
Bibliothèque nationale de France
The Digital Library
– digitization of both printed books and graphic material
– primarily in image mode to begin with
– most out-of-copyright
Gallica 2000
– multimedia documents: Middle Ages -> early 20th century
– 35,000 printed volumes: images
– 1000 titles full text
– “one of the largest DLs free of charge on the web”
Million Book DL Project
1M books to be scanned by 2005
– bitonal, 600 dpi
Free-to-read, universally accessible
Searchable by full text (where OCR is possible)
– ABBYY FineReader OCR
Books in public domain or copyrighted but out of print
Fifteen partners:
– US, India, China; est. 4000 person-years of clerical labor
– Multinational, multilingual (mainly English)
20 Tbyte trusted repository
Research testbed for summarization, OCR, automatic extraction of metadata, machine translation
Reddy, Raj and Gloriana St. Clair, “The Million Book Project,” CMU, Dec. 1, 2001.
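The project figures above imply a rough per-book budget. The sketch below only divides the quoted totals by the 1M books. Reading "20 Tbyte" as decimal terabytes and taking a person-year as about 2000 working hours are both assumptions, not figures from the talk.

```python
# Figures quoted on the slide.
BOOKS = 1_000_000
REPOSITORY_BYTES = 20 * 10**12   # "20 Tbyte", read as decimal terabytes (assumption)
PERSON_YEARS = 4000              # estimated clerical labor

# Assumption, not from the slide: one person-year is about 2000 working hours.
HOURS_PER_PERSON_YEAR = 2000


def per_book_budget(books=BOOKS, repo_bytes=REPOSITORY_BYTES,
                    person_years=PERSON_YEARS,
                    hours_per_year=HOURS_PER_PERSON_YEAR):
    """Return (megabytes per book, clerical hours per book)."""
    mb_per_book = repo_bytes / books / 10**6
    hours_per_book = person_years * hours_per_year / books
    return mb_per_book, hours_per_book


mb, hours = per_book_budget()
print(f"about {mb:.0f} MB of storage and {hours:.0f} clerical hours per book")
```

Under these assumptions the repository allows about 20 MB per book, and the labor estimate works out to about one working day per book.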
Google Catalogs
“1000’s” of scanned mail-order catalogs
free for publishers, ‘few days’ turnaround
– for a fee: link products to web sites
free to users: download page images
indexed by: vendor, date, page numbers, etc
(not by full text content)
Amazon.com plan
‘Look Inside the Book II’
~500k books: in-copyright, non-fiction
Scan (full color), OCR cover-to-cover
Full-text search, download sample pages
Free but limited access to page images
———
Can Google be far behind…?
search document image files found on Web
David D. Kirkpatrick, “Amazon Plan Would Allow Searching Text of Many Books,” The New York Times, July 21, 2003.
Capturing Document Images
To digitize a book: $4-$1000 each!
cheaply: bitonal, low quality, mass scanning, …
expensively: color, quality control, custom handling, …
“The Price of Digitization,” Proc., NINCH Symposium (National Initiative for a Networked Cultural Heritage), New York, April 8, 2003.
Breakdown of costs:
1/3 cataloging, description, indexing
1/3 scanning, OCR, correction, markup
1/3 quality control, file maintenance, admin
NOTE: DIA can help with all three
Document Image Capture Operations
Usually, large-scale batch operations
Sometimes destructive:
– cut off spines, discard covers, wear & tear
– hot debate over ‘scan-and-discard’ policies
Image quality standards are often subjective
– usually: “completeness”; no missing pages, text
– seldom: checked for human, machine legibility
– rarely: guaranteed suitable for future uses
Scan once, for ever:
– seldom rescanned (Lesk: “not for 5-10y”)
M. Lesk, Practical Digital Libraries: Books, Bytes, & Bucks. Morgan Kaufmann, San Francisco, CA, 1997.
The PARC Rare Book Scanner
• Bulk scanning w/out damaging books
• Zero force on binding
• Book is open 90 degrees
• Pages turned manually
• 280 dpi
• 9.25”x11.75” field
• Throughput
  • 8-bit grey: 450 pages/h
  • 24-bit color: 120 pages/h
Bob Street & Steve Ready, PARC.
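The scanner figures above fix the raw size of each page image. The sketch below works that out, assuming the full 9.25” x 11.75” field is captured at 280 dpi and stored uncompressed. Both are assumptions; the talk does not state the file format.

```python
DPI = 280
FIELD_INCHES = (9.25, 11.75)   # width, height of the scan field


def raw_page_bytes(bits_per_pixel, dpi=DPI, field=FIELD_INCHES):
    """Uncompressed size in bytes of one full-field page image."""
    width_px = round(field[0] * dpi)    # 2590 pixels
    height_px = round(field[1] * dpi)   # 3290 pixels
    return width_px * height_px * bits_per_pixel // 8


def raw_rate_mb_per_hour(bits_per_pixel, pages_per_hour):
    """Raw data produced per hour, in decimal megabytes."""
    return raw_page_bytes(bits_per_pixel) * pages_per_hour / 10**6


for label, bits, pages in (("8-bit grey", 8, 450), ("24-bit color", 24, 120)):
    print(label, raw_page_bytes(bits), "bytes/page,",
          round(raw_rate_mb_per_hour(bits, pages)), "MB/h")
```

Under these assumptions a grey page is about 8.5 MB and a color page about 25.6 MB, so both modes produce raw data at a similar rate of roughly 3 to 4 GB per hour.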
GUI & IP for Image Capture
• Capturing Metadata
  • automatic page numbering: 1,2,3,.../ i,ii,iii,.../ I,II,III,…
  • section labels
  • comments (manual)
• Image Processing
  • performed on the fly:
  • contrast, cleaning, etc
  • crop, skew-correct
  • processing templates
• Assuring Quality
  • visual inspection
• Calibration
  • color test targets
  • per-pixel gain/offset map
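A per-pixel gain/offset map is commonly built by flat-field calibration: image a dark frame and a uniform white target, then rescale every pixel so that dark maps to 0 and white maps to full scale. The sketch below shows that standard technique on plain nested lists. It is an illustration, not the PARC scanner's actual procedure, and the function names are hypothetical.

```python
def build_gain_offset(dark, white, target=255.0):
    """Derive per-pixel (gain, offset) maps from a dark frame and a white-target frame."""
    offset = [row[:] for row in dark]
    gain = [[target / (w - d) if w > d else 0.0   # dead pixels get gain 0
             for d, w in zip(dark_row, white_row)]
            for dark_row, white_row in zip(dark, white)]
    return gain, offset


def correct(raw, gain, offset, max_value=255):
    """Apply corrected = (raw - offset) * gain per pixel, clamped to [0, max_value]."""
    out = []
    for raw_row, gain_row, off_row in zip(raw, gain, offset):
        out.append([min(max_value, max(0, round((r - o) * g)))
                    for r, g, o in zip(raw_row, gain_row, off_row)])
    return out


# Tiny 2x2 example: each pixel has its own dark level and sensitivity.
dark = [[10, 20], [12, 18]]
white = [[210, 120], [250, 200]]
gain, offset = build_gain_offset(dark, white)
print(correct(white, gain, offset))   # the white target becomes uniform
```

After correction, uneven illumination and sensor response no longer show up as brightness differences across the page.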
DIA R&D for Image Quality Control
Measuring document image quality
– new test target designs
– image processing algorithms
– rigorous, quantitative standards
Assuring quality
– fast algorithms for on-the-fly image quality estimation
Predicting human & machine legibility
What image quality features correlate well with human and OCR legibility?
… and with other, later DIA tasks?
K. Summers, “Document Image Improvement for OCR as a Classification Problem,” Proc., DR&R X, Santa Clara, CA, Jan 2003.
E. H. Barney Smith & X. Qiu, “Relating Statistical Image Differences & Degradation Features,” Proc., 5th DAS, Princeton, NJ, Aug 2002.
When Quality Control Goes Wrong
Front Page, 1852 Edition of the New York Times
The Historical New York Times Project, CMU/NYT, 1999.
Scanned from microfilm.
Extracting & Recognizing Content
These are central DIA R&D goals
But existing doc image understanding systems cannot guarantee high accuracy across the full range of documents, and DLs’ scholarly & historical docs are often harder (in parentheses):
- typefaces, h/w styles (old fashioned)
- image qualities (poor & variable)
- layout geometries (deformed)
- writing systems (obsolete)
- languages (rare)
- domains of discourse (arcane)
S. Rice, G. Nagy, T. Nartker, OCR: An Illustrated Guide to the Frontier, Kluwer Academic Publishers: 1999.
Rare Botanical Reference Book
• Jepson’s A Flora of California, 1943.
• Authoritative, still in demand by scholars
• Only a few copies are left
• Difficult to OCR well
• Scanned at PARC, all page images put on the Univ. California, Berkeley Digital Library website
Richly Meaningful Typographical Book Designs
Cut into Word-box Images: layout analysis without OCR
Reflow Word Boxes into Textlines to Fit the Display Geometry
T. Breuel, W. Janssen, K. Popat, H. Baird, “Paper to PDA,” Proc., ICPR, Quebec City, 2002.
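Reflowing can be pictured as greedy line filling over word-box images in reading order. The sketch below illustrates that idea only and is not the algorithm of the cited paper. The `WordBox` type, the gap parameters, and the top-aligned placement are assumptions; a real system would also align baselines and scale the images.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    """A cropped word image, described only by its pixel size (no OCR needed)."""
    width: int
    height: int


def reflow(words, display_width, word_gap=8, line_gap=4):
    """Greedily pack word boxes, in reading order, into lines no wider than display_width.

    Returns a list of (word_index, x, y) placements, where y is the top of the line.
    """
    placements = []
    x = y = 0
    line_height = 0
    for i, w in enumerate(words):
        # Start a new line when the word would overflow, unless the line is empty.
        if x > 0 and x + w.width > display_width:
            y += line_height + line_gap
            x = 0
            line_height = 0
        placements.append((i, x, y))
        x += w.width + word_gap
        line_height = max(line_height, w.height)
    return placements


words = [WordBox(40, 10), WordBox(50, 12), WordBox(30, 10), WordBox(60, 10)]
print(reflow(words, display_width=100))
```

Because only box sizes are used, the same page can be re-laid out for any display width without recognizing a single character.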
Make Doc-Images Highly Portable, Legible Everywhere
No OCR errors! (Only layout errors.)
Preserve meaningful appearance
Challenges: reading order, non-text, navigation, linking
Other ‘Pure-Image’ DIA for DLs
Not Dependent on Accurate Recognition
For Text: seems feasible
– Summarization of doc images w/out OCR
– Outlining, condensing, linking
– Reflowing tables
For Non-text: seems dauntingly hard
– Mathematics
– Chemical formulae
– Line-art drawings
– Graphics generally
Vitally important to try since recognition & encoding are highly problematic
Personal Digital Libraries
People are beginning to
– collect
– manage
– share
their own small DLs
Scanned & encoded documents, mixed together
How to assist ‘productive reading’
These users lack specialized skills
DIA tools need to be deskilled to a clerical level
… and to work together far better
Thanks to: Jon Hull et al, Ricoh Innovations; Robert Wilensky et al, UC Berkeley; Larry Spitz, DocRec; Kris Popat et al, PARC.
Interactive Digital Libraries
Today’s DIA tools leave many errors in recognition, encoding, tagging, etc
How can these mistakes affordably be fixed?
Invite volunteer help:
– e.g. Project Gutenberg, Open Mind Initiative
Challenge: provide interactive tools to
– accept corrections on-line
– enforce review, verification
– efficiently make the most of every correction
– DIA tools able to benefit from correction
Thanks to: George Nagy, David Stork, Dan Lopresti.
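One simple way to "enforce review, verification" is to accept a volunteer correction only when several independent volunteers propose the same text. The sketch below assumes that agreement rule. It is an illustration rather than a method from the talk, and the function name and `min_votes` threshold are hypothetical.

```python
from collections import Counter


def accepted_correction(proposals, min_votes=2):
    """Return the text that at least min_votes volunteers agree on, else None.

    A tie between the two most popular proposals is treated as unresolved.
    """
    counts = Counter(proposals).most_common(2)
    if not counts:
        return None
    text, votes = counts[0]
    tied = len(counts) > 1 and counts[1][1] == votes
    return text if votes >= min_votes and not tied else None


# Three volunteers correct the same OCR token; two of them agree.
print(accepted_correction(["the", "the", "tbe"]))
```

Unresolved cases can be routed to further volunteers, so a reviewer's effort is spent only where volunteers disagree.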
Collaborative DLs: DIA for the Masses
Enable non-professionals to collaborate in improving, manually, on the best that automatic DIA tools can do, e.g.
– one person may correct thresholding
– another corrects OCR errors
– yet another adds tags
Offer DIA tools downloadable from the web, possibly under GPL-like licenses
Dimp? -- document image processing toolkit
interoperable via common data structures & file formats
Thanks to: Tom Breuel, Kris Popat, Bill Janssen.
DIA R&D Opportunities for DLs
Making Document Images as Useful as Symbolically Encoded Data
Image capture, quality control
Image improvement, rectification, etc
Content extraction, recognition, & analysis
Legibility, presentation, reflowing
Markup, indexing, retrieval, summarization
Personal & interactive DLs
Offering DIA tools to DL users
… many more, no doubt
An Urgent Responsibility?
Vast, irreplaceable, culturally vital legacy collections of paper documents are competing ineffectively for attention with billions of digital documents
Thus paper archives are threatened with neglect, perceived irrelevance, … & eventually, oblivion?
The DIA community is uniquely qualified to help the DL community rescue them.
Contact
Henry S. Baird
Statistical Pattern & Image Analysis
baird@parc.com
www.parc.com/baird
+1-650-812-4481
FAX -4374