Digital Libraries & Document Image Analysis



Henry S. Baird





Statistical Pattern & Image Analysis research


Information Sciences & Technologies Lab

ICDAR Aug 4, 2003 - HSB

DLs as seen by a DIA Researcher


15 years in DIA R&D


Lucky to have known/collaborated with:


PARC DL enthusiasts: Masinter, Street, Bloomberg, et al


UC Berkeley Digital Library project: Wilensky, Fateman, et al


CMU Universal Library project: Thibadeau, Hauptmann, et al


Xerox Scanning Service Bureaus: Wallis, et al


… many others with an interest in DLs



What challenges do DLs pose to DIA R&D?



Digital Library Dreams


Electronic networked DLs promise to provide:


more books, journals, etc


to more people


faster


at more places & times


than physical libraries can hope to….

The Ideal DL: an international, interoperable, sustainable body of rich cultural materials in digital form


Document Images’ Usefulness in DLs


raster image: display, print

+ OCRed text: + retrieve (more or less well)

+ correct text: + retrieve well, reuse, summarize, translate, …

+ links (e.g. HTML): + Web publishing

+ functional tags (e.g. XML): + “semantic web”

+ layout format (e.g. RTF): + reprinting

+ metadata (title, author, …): + index, catalogue


Advantages of Digital Displays versus Ink-on-Paper

Many…

networked -- potentially unbounded content

rapidly rewritable -- supports animation

radiant -- legible in the dark

sensitive -- markable, interactive

Generally thought to be overwhelming, but …


Advantages of Ink-on-Paper versus Digital Displays

PAPER             DISPLAYS today         DISPLAYS in future
cheap             expensive              less expensive
large, many       small, few             larger, more
high-resolution   low-resolution         higher-resolution
lightweight       heavy                  lighter
thin              thick                  thinner
unpowered         powered                lower power
stable            requires maintenance

eBooks, e-paper, notebooks, laptops, PDAs, …


A. Dillon, “Reading from Paper versus Screen: a critical review of the empirical literature,” Ergonomics 35(10): 1297-1326, 1992.


The fact is, for many uses Paper is Still Widely Preferred

“Paper [remains today] the medium of choice for reading, even when most high-tech [display] technologies are to hand”

Sellen & Harper (2002)

Why is this? Paper allows:

flexible navigation through documents

cross-referencing of several documents

annotations

interweaving of reading and writing

A. J. Sellen & R. H. R. Harper, The Myth of the Paperless Office, The MIT Press, Cambridge, MA, 2002.


Document Images are Doubly Disadvantaged within DLs

They fail to support most uses that symbolically encoded, tagged data do

They lose many key advantages they enjoyed on paper

A Threat: ‘If it’s not in Google, I don’t need it!’

Can they be made as useful in DLs as encoded data?

Can they sometimes work better in DLs than encoded data?

…these are challenges to us, the DIA R&D community.


The British Library

‘The World’s Knowledge’

38.8M items catalogued

website: 18.4M page hits/year

Compare Google:

>3B pages

150M searches/day

“[Reinforcing] the Library’s role as the pre-eminent global document supplier, digital scanning from print and microfilm originals will give researchers rapid, high quality delivery from over 100 million research articles, reports, and conference papers direct to their desktop.”

-- Lynne Brindley, Chief Executive, 2002-2003 Annual Report


Bibliothèque nationale de France


The Digital Library


digitization of both printed books and graphic material


primarily in image mode to begin with


most out-of-copyright


Gallica 2000


multimedia documents: Middle Ages -> early 20th century


35,000 printed volumes: images


1000 titles full text


“one of the largest DLs free of charge on the web”


Million Book DL Project


1M books to be scanned by 2005


bitonal, 600 dpi


Free-to-read, universally accessible


Searchable by full text (where OCR is possible)


ABBYY FineReader OCR


Books in public domain or copyrighted but out of print


Fifteen partners: US, India, China; est. 4000 person-years of clerical labor


Multinational, multilingual (mainly English)


20Tbyte trusted repository


Research testbed for summarization, OCR, automatic extraction of metadata, machine translation

Reddy, Raj and Gloriana St. Clair, “The Million Book Project,” CMU, Dec. 1, 2001.
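
The project figures above imply some rough per-book budgets. The short calculation below is only a back-of-envelope sketch: it assumes that the 20 Tbyte repository holds all 1M books, that "Tbyte" means 10^12 bytes, and that the 4000 person-years cover the whole collection. None of these assumptions is stated on the slide.

```python
# Figures quoted on the slide.
BOOKS = 1_000_000
PERSON_YEARS = 4000

# Assumption: 20 Tbyte read as decimal terabytes, shared by all books.
REPOSITORY_BYTES = 20 * 10**12

bytes_per_book = REPOSITORY_BYTES // BOOKS
books_per_person_year = BOOKS // PERSON_YEARS

print(f"storage budget: {bytes_per_book / 10**6:.0f} MB per book")
print(f"clerical pace:  {books_per_person_year} books per person-year")
```

Under those assumptions each book gets about 20 MB of storage and each person-year of labor covers 250 books.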


Google Catalogs


“1000’s” of scanned mail-order catalogs


free for publishers, ‘few days’ turnaround


for a fee: link products to web sites


free to users: download page images


indexed by: vendor, date, page numbers, etc


(not by full text content)



Amazon.com plan


‘Look Inside the Book II’


~500k books: in-copyright, non-fiction

Scan (full color), OCR cover-to-cover

Full-text search, download sample pages


Free but limited access to page images

———

Can Google be far behind…?


search document image files found on Web

David D. Kirkpatrick, “Amazon Plan Would Allow Searching Text of Many Books,” The New York Times, July 21, 2003.


Capturing Document Images

To digitize a book: $4 - $1000 each!

cheaply: bitonal, low quality, mass scanning, …

expensively: color, quality control, custom handling, …

“The Price of Digitization,” Proc., NINCH Symposium (National Initiative for a Networked Cultural Heritage), New York, April 8, 2003.

Breakdown of costs:

1/3   cataloging, description, indexing
1/3   scanning, OCR, correction, markup
1/3   quality control, file maintenance, admin

NOTE: DIA can help with all three


Document Image Capture Operations


Usually, large-scale batch operations


Sometimes destructive:


cut off spines, discard covers, wear & tear


hot debate over ‘scan-and-discard’ policies


Image quality standards are often subjective


usually: “completeness”; no missing pages, text


seldom: checked for human, machine legibility


rarely: guaranteed suitable for future uses


Scan once, for ever:


seldom rescanned (Lesk: “not for 5-10y”)

M. Lesk, Practical Digital Libraries: Books, Bytes, & Bucks. Morgan Kaufmann, San Francisco, CA, 1997.


The PARC Rare Book Scanner


Bulk scanning w/out damaging books

Zero force on binding

Book is open 90 degrees

Pages turned manually

280 dpi

9.25”x11.75” field

Throughput

8-bit grey: 450 pages/h

24-bit color: 120 pages/h

Bob Street & Steve Ready, PARC.
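
The scanner figures above translate into image sizes and scanning times with simple arithmetic. The sketch below assumes the full 9.25”x11.75” field is captured and stored uncompressed, and uses a hypothetical 300-page book; neither assumption comes from the slide.

```python
DPI = 280
FIELD_INCHES = (9.25, 11.75)   # width, height of the scan field


def page_pixels(dpi, field_inches):
    """Pixel dimensions of one full-field scan."""
    width_in, height_in = field_inches
    return round(width_in * dpi), round(height_in * dpi)


def raw_bytes(dpi, field_inches, bits_per_pixel):
    """Uncompressed size of one page image in bytes."""
    width_px, height_px = page_pixels(dpi, field_inches)
    return width_px * height_px * bits_per_pixel // 8


def hours_for_book(pages, pages_per_hour):
    """Scanning time at the quoted throughput."""
    return pages / pages_per_hour


BOOK_PAGES = 300  # hypothetical book length, for illustration only

print(page_pixels(DPI, FIELD_INCHES))        # pixels per page
print(raw_bytes(DPI, FIELD_INCHES, 8))       # bytes per 8-bit grey page
print(raw_bytes(DPI, FIELD_INCHES, 24))      # bytes per 24-bit color page
print(hours_for_book(BOOK_PAGES, 450))       # grey scanning time, hours
print(hours_for_book(BOOK_PAGES, 120))       # color scanning time, hours
```

Under these assumptions a page is 2590 x 3290 pixels, about 8.5 MB in grey and 25.6 MB in color, and the hypothetical book takes 2.5 hours to scan in color.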


GUI & IP for Image Capture


Capturing Metadata


automatic page numbering

1,2,3,.../ i,ii,iii,.../ I,II,III,…


section labels


comments (manual)
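
The three page-numbering styles listed above (1,2,3 / i,ii,iii / I,II,III) can be generated automatically once the operator supplies a starting number and a style. The sketch below is illustrative only: the function names and the style keywords are hypothetical, and nothing here describes how the PARC capture software actually did it.

```python
def to_roman(n):
    """Upper-case Roman numeral for 1 <= n <= 3999."""
    if not 0 < n < 4000:
        raise ValueError("Roman numerals are defined here for 1..3999")
    numerals = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    parts = []
    for value, symbol in numerals:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def page_labels(start, count, style="arabic"):
    """Labels for `count` consecutive scanned pages, starting at `start`.

    style is one of the hypothetical keywords
    "arabic", "roman_lower", "roman_upper".
    """
    numbers = range(start, start + count)
    if style == "arabic":
        return [str(n) for n in numbers]
    if style == "roman_upper":
        return [to_roman(n) for n in numbers]
    if style == "roman_lower":
        return [to_roman(n).lower() for n in numbers]
    raise ValueError(f"unknown style: {style}")


# Front matter in lower-case Roman, then the body in Arabic numerals.
print(page_labels(1, 4, "roman_lower") + page_labels(1, 3))
```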



Image Processing



performed on the fly:



contrast, cleaning, etc



crop, skew-correct



processing templates



Assuring Quality



visual inspection



Calibration



color test targets



per-pixel gain/offset map
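
A per-pixel gain/offset map is commonly built by flat-field calibration: scan a dark reference and a uniform white reference, then rescale every pixel so that the two references map to 0 and full scale. The slide does not say how the PARC scanner derives its map, so the sketch below shows that common approach as an assumption. Images are plain nested lists and the function names are hypothetical.

```python
def build_gain_offset(dark, white, target=255.0):
    """Derive per-pixel gain and offset from two reference scans.

    dark  : scan with no light (sensor offset per pixel)
    white : scan of a uniform white target
    """
    gain, offset = [], []
    for dark_row, white_row in zip(dark, white):
        gain_row = []
        for d, w in zip(dark_row, white_row):
            span = w - d
            # A dead pixel (no response) gets gain 0 rather than a division error.
            gain_row.append(target / span if span > 0 else 0.0)
        gain.append(gain_row)
        offset.append(list(dark_row))
    return gain, offset


def apply_gain_offset(raw, gain, offset, max_value=255):
    """Correct a raw scan: subtract the offset, multiply by the gain, clamp."""
    corrected = []
    for raw_row, gain_row, offset_row in zip(raw, gain, offset):
        corrected.append([
            min(max_value, max(0, round((r - o) * g)))
            for r, g, o in zip(raw_row, gain_row, offset_row)
        ])
    return corrected


# Two pixels with different sensitivity: both should read 255 on white.
dark = [[10, 20]]
white = [[210, 120]]
gain, offset = build_gain_offset(dark, white)
print(apply_gain_offset(white, gain, offset))
print(apply_gain_offset(dark, gain, offset))
```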


DIA R&D for Image Quality Control


Measuring document image quality

new test target designs

image processing algorithms

rigorous, quantitative standards

Assuring quality

fast algorithms for on-the-fly image quality estimation

Predicting human & machine legibility

What image quality features correlate well with human and OCR legibility?

… and with other, later DIA tasks?

K. Summers, “Document Image Improvement for OCR as a Classification Problem,” Proc., DR&R X, Santa Clara, CA, Jan 2003.

E. H. Barney Smith & X. Qiu, “Relating Statistical Image Differences & Degradation Features,” Proc., 5th DAS, Princeton, NJ, Aug 2002.


When Quality Control Goes Wrong

Front Page, 1852 Edition of the New York Times

The Historical New York Times Project,

CMU/NYT, 1999.

Scanned from microfilm.


Extracting & Recognizing Content

These are central DIA R&D goals

But existing doc image understanding systems cannot guarantee high accuracy across the full range of documents:

Full range of documents    DL’s scholarly & historical docs are often harder
- typefaces, h/w styles    old fashioned
- image qualities          poor & variable
- layout geometries        deformed
- writing systems          obsolete
- languages                rare
- domains of discourse     arcane

S. Rice, G. Nagy, T. Nartker, OCR: An Illustrated Guide to the Frontier, Kluwer Academic Publishers: 1999.


Rare Botanical Reference Book



Jepson’s A Flora of California, 1943.

Authoritative, still in demand by scholars

Only a few copies are left

Difficult to OCR well

Scanned at PARC, all page images put on the Univ. California, Berkeley Digital Library website

Richly Meaningful Typographical Book Designs


Cut into Word-box Images: layout analysis without OCR
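
One simple way to cut a text line into word boxes without any recognition is to look for wide gaps in the column projection of ink. The sketch below illustrates that generic technique. It is not necessarily the segmentation method used in the work shown on this slide, and the bitmap format, function name, and `min_gap` threshold are assumptions made for the example.

```python
def word_boxes(line, min_gap=2):
    """Split one binarized text line into horizontal word spans.

    line    : list of pixel rows, each a list of 0/1 values (1 = ink)
    min_gap : a run of at least this many ink-free columns separates two words
    Returns a list of (x_start, x_end) spans, with x_end exclusive.
    """
    width = len(line[0]) if line else 0
    # Column projection: does any row have ink in column x?
    has_ink = [any(row[x] for row in line) for x in range(width)]

    boxes = []
    start = None       # first ink column of the word being built
    last_ink = None    # most recent ink column seen
    for x, ink in enumerate(has_ink):
        if not ink:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_gap:
            # The blank run since the last ink column is wide enough: close the word.
            boxes.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        boxes.append((start, last_ink + 1))
    return boxes


# One-row toy line: a 1-column gap stays inside a word, a 3-column gap splits.
toy_line = [[1, 1, 0, 1, 0, 0, 0, 1, 1, 0]]
print(word_boxes(toy_line, min_gap=2))
```

In practice the gap threshold would have to adapt to type size and letter spacing; a fixed value is used here only to keep the sketch short.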


Reflow Word Boxes into Textlines to Fit the Display Geometry

T. Breuel, W. Janssen, K. Popat, H. Baird, “Paper to PDA,” Proc., ICPR, Quebec City, 2002.
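
Once word-box images are available in reading order, fitting them to a display can be sketched as greedy line filling, the same idea a text formatter uses for words. The code below is a minimal illustration under that assumption and is not taken from the cited paper. It works on box widths only, and the function name and the `space` parameter are hypothetical.

```python
def reflow(word_widths, display_width, space=4):
    """Greedily pack word boxes into text lines no wider than the display.

    word_widths   : widths in pixels of the word-box images, in reading order
    display_width : usable width of the target display in pixels
    space         : inter-word gap in pixels
    Returns a list of lines, each a list of word indices.
    """
    lines = []
    current = []   # indices of the words on the line being filled
    used = 0       # width consumed on the current line
    for index, width in enumerate(word_widths):
        needed = width if not current else used + space + width
        if current and needed > display_width:
            # The word does not fit: close this line and start a new one.
            lines.append(current)
            current, used = [index], width
        else:
            # Fits, or the line is empty (an over-wide word gets its own line).
            current.append(index)
            used = needed
    if current:
        lines.append(current)
    return lines


widths = [30, 20, 50, 10, 40]
print(reflow(widths, display_width=60))
```

A real reflow system would also have to scale boxes, align baselines, and handle non-text regions, which connects to the challenges listed on the next slide.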


Make Doc-Images Highly Portable, Legible Everywhere

No OCR errors! (Only layout errors.)

Preserve meaningful appearance

Challenges:

reading order

non-text


navigation


linking



Other ‘Pure-Image’ DIA for DLs

Not Dependent on Accurate Recognition

For Text: seems feasible

Summarization of doc images w/out OCR

Outlining, condensing, linking

Reflowing tables

For Non-text: seems dauntingly hard

Mathematics

Chemical formulae

Line-art drawings

Graphics generally

Vitally important to try since recognition & encoding are highly problematic


Personal Digital Libraries


People are beginning to collect, manage, share their own small DLs


Scanned & encoded documents, mixed together


How to assist ‘productive reading’


These users lack specialized skills


DIA tools need to be deskilled to a clerical level


… and to work together far better

Thanks to: Jon Hull et al, Ricoh Innovations; Robert Wilensky et al, UC Berkeley; Larry Spitz, DocRec; Kris Popat et al, PARC.


Interactive Digital Libraries


Today’s DIA tools leave many errors in recognition, encoding, tagging etc

How can these mistakes affordably be fixed?

Invite volunteer help: e.g. Gutenberg Project, Open Mind Initiative

Challenge: provide interactive tools to

accept corrections on-line


enforce review, verification


efficiently make the most of every correction


DIA tools able to benefit from correction

Thanks to: George Nagy, David Stork, Dan Lopresti.


Collaborative DLs: DIA for the Masses

Enable non-professionals to collaborate in improving, manually, on the best that automatic DIA tools can do, e.g.

one person may correct thresholding

another corrects OCR errors

yet another adds tags

Offer DIA tools downloadable from the web, possibly under GPL-like licenses

Dimp? document image processing toolkit

interoperable via common data structures & file formats


Thanks to: Tom Breuel, Kris Popat, Bill Janssen.


DIA R&D Opportunities for DLs

Making Document Images as Useful as Symbolically Encoded Data

Image capture, quality control


Image improvement, rectification, etc


Content extraction, recognition, & analysis


Legibility, presentation, reflowing


Markup, indexing, retrieval, summarization


Personal & interactive DLs


Offering DIA tools to DL users



many more, no doubt


An Urgent Responsibility?


Vast, irreplaceable, culturally vital legacy collections of paper documents are competing ineffectively for attention with billions of digital documents

Thus paper archives are threatened with neglect, perceived irrelevance, … & eventually, oblivion?

The DIA community is uniquely qualified to help the DL community rescue them.



Contact

Henry S. Baird

Statistical Pattern & Image Analysis


baird@parc.com

www.parc.com/baird


+1-650-812-4481   FAX 4374