A. VIDI: Visual Information Dissemination for visually Impaired individuals


S. Chandu Ravela, Department of Computer Science
Allen Hanson, Computer Vision Laboratory
University of Massachusetts, Amherst MA 01003


A visually impaired individual seeks information in a manner similar to a sighted person. For example, when a sighted person arrives at a new airport they navigate around by looking at signs, asking attendants for directions or, in some cases, reading a map. Such information is also valuable to the visually impaired. Our world is populated with visual information. Signs (textual or otherwise) can be seen marking buildings, streets, entrances, floors and a myriad of other places. The large body of work in developing electronic travel aids does not address the issue of presenting information to the user that is inherently visual and useful. The research presented here will have a tremendous impact on the growing population of visually impaired in the country and indeed the world. Our contention is that the field of computer vision has matured to the point where a completely portable system capable of finding and reading signs and other textual events in the environment in close to real time is feasible. What a gift it would be if we could provide the visually impaired with a way to obtain crucial information from the environment that is now totally inaccessible (except through intervention of a sighted person).


We present a three year plan to develop a system that runs on a wearable computer, uses a gyroscopically referenced head mounted camera, detects and recognizes text and signs in the environment and disseminates them in response to a user query using synthesized voice. The proposed system's operation begins with the user engaging the head mounted camera to acquire information. A compact representation of the continuous video of the scene is generated in the form of a mosaic, within which the system detects text and signs. We will develop an object detection and recognition system that uses multi-scale features that are invariant to changes in viewing angle and illumination. Recognition is accomplished by building multi-resolution spatial networks of distributions of these features and matching them against a database of known (customized to locale) signs (e.g. international highway signs) using a coarse to fine matching technique that is both robust and efficient. Text is detected using the same techniques and is recognized using a commercial OCR system. We will also examine the OCR's intermediate representations, and a word dictionary, for tolerance to character recognition errors. For both text and signs, information from the gyroscopically referenced head mounted camera is used to filter the potential set of match candidates for improved efficiency. We will significantly leverage existing work on image retrieval from large databases using image content, face recognition, image indexing using color properties, and text detection and recognition in complex images, most of which has had NSF support.


Research is conducted in three phases roughly corresponding to a year each. Both design and evaluation are conducted in conjunction with Lighthouse International, which specializes in rehabilitation training for the visually impaired. In the first phase individual components, which include camera stabilization, mosaics, text and sign recognition, are developed. In the second phase, a head mounted unit, attitude inference, a wearable computer, a camera, speech synthesizer and user interface model are developed. Additionally, the use of log-polar cameras is examined for fast computation of multi-scale features and for its effectiveness in simultaneously addressing the field of view and resolution issues. The last phase consists of a series of user and system evaluations and refinements.





B. Table of Contents

Automatically generated; we do nothing here.

C. VIDI: Visual Information Dissemination for visually Impaired individuals

The development of an effective visual information system will significantly improve the degree to which the visually impaired can interact with the environment. There is, unfortunately, a growing population of visually impaired individuals and, while exact numbers are unknown, a World Health Organization study in 1994 estimates approximately 148 million with visual impairment and 48 million completely blind [7]. In the United States alone, roughly a million people are projected to be legally blind in the year 2000 [9] and approximately three million with serious visual impairment. There are numerous causes of visual impairment including accidents, congenital blindness, neurological disorders, disease and aging. While some of these causes can be prevented or cured, there is a significant population that experiences varying degrees of social dependence. By 2010 the number of blind people alone (not counting those with serious impairment) is projected at more than 100 million worldwide [54].


There are several difficulties in day-to-day life that diminish an impaired person's independence. Consider the following true story as an example. On a train trip from Washington D.C. to Amherst, MA we observed a visually impaired passenger step off at New York's Penn Station. A few minutes later, he re-entered the coach and proceeded to find his seat. He missed his row, the first row, and continued up the aisle, probing. He would attempt to sit and then quickly retract because the seats would face the opposite direction. Continuing in this fashion, about halfway past the coach, he found a row of seats that matched the orientation of his row. Unfortunately, he nearly bumped into the person in the far (window) seat. At this point we, who knew where he sat, walked up to him and guided him back. This poignant example is just one of many that illustrate the difficulty in mobility, orientation and, in general, of extracting information from the environment. In this document we present a proposal for the development of a system that will aid the visually impaired to access information from the environment by recognizing signs and other textual events and reading them aloud in response to a query.


Travel aids to the impaired have been a part of our social effort for quite some time and include human assistants and guide dogs. The cane, introduced in the 1940s, dramatically improved the mobility of blind pedestrians with some training. Since then, several electronic travel aids (ETA) have been developed to address different aspects of mobility and orientation. Table 1 summarizes various electronic designs available today. These designs can roughly be placed in two categories. The first category (items 1-14) relates to mobility, where spatial structure is communicated. The second category (items 15-20) relates to orientation by using vision, GPS, maps and other GIS systems.

It has been argued that "an ETA should be indisputably significant and should enable independent, efficient, effective, and safe travel in unfamiliar surroundings" [9]. Table 1 shows that mobility and orientation related devices have been evolving, but there are several reasons why they have not become more popular than the long cane. Although the long cane seems of limited capability, in practice it has proven to communicate surprisingly rich environmental structure that is useful for travel surface discrimination, clear path determination, and obstacle detection and characterization. ETAs for mobility employ rudimentary auditory and tactile transduction methods for communication that can interfere with natural sensing. Further, ETAs have been very expensive and difficult to maintain when compared with the long cane. Indeed, mobility and orientation trainers argue that a significant degree of mobility skills can be acquired using the long cane [56]. For example, at Lighthouse International, safety and mobility training is accomplished for 90% of new students each year within 30 hours of training. For the traveler on the train, it is unclear if an ETA would fare better than the cane to find his seat. Thus, although improvements to the cane seem a worthy and obvious goal, building an effective mobility device is a significant research and engineering challenge.


Device (Developer) | Input | Output | Functionality
1. CDM-90 (Shanghai Acoustical Lab) | Ultrasonic, handheld | Beep rate | Transcode
2. Mowat (Pulse Data International Ltd.) | -- | Vibration, sound | Transcode
3. Polaron (Nurion Industries) | -- | Tone pitch | Transcode
4. Russell Pathsounder (Russell, Lindsay) | Ultrasonic, chest mounted | Vibration, beeps | Transcode
5. Sensory 6 (Brytech Inc.) | Ultrasonic, head mount | Pitch tone | Transcode
6. Walkmate (SafeTech International) | Ultrasonic, handheld | Pitch, vibration | Transcode
7. Nottingham Obstacle Detector (University of Nottingham) | -- | Musical scale | Transcode
8. Pilot Light, Mini-Radar (Corso, Francia) | Infrared, handheld | Vibration, sound | Transcode
9. Laser Cane (Nurion Industries) | Laser, long cane | Tones, vibration | Transcode
10. Sonic Pathfinder (Perceptual Alternatives) | Ultrasonic array, head mount | Musical scale, selective stereo | Obstacle detection
11. SonicGuide (Sonic Vision) | Wide-angle ultrasonic, head mounted | Binaural sound synthesis | Object tracking, object texture
12. Guide Cane (University of Michigan) | Ultrasonic, size of a vacuum cleaner | Tactual; pulls along clear path | Obstacle avoidance
13. PAM-AID (Varty Research) | Sonar, vacuum-cleaner size | Motion along clear path | Avoidance
14. Strider (Arkenstone Inc.) | GPS, map index | Voice synthesis | Navigation
15. MoBic (University of Magdeburg) | GPS based attitude, map index | Voice synthesis | Landmark identification, navigation
16. Nomad (University of Nottingham) | Map, GIS | Tactile | Navigation
17. (Sheffield University) | Map, GIS | Tactile | Navigation
18. Talking Signs (Talking Signs Company) | Infrared receiver | Voice | Beacon indicating landmark
19. Movis (University of Hamburg) | Cameras | Sound | Acquire, record, recall landmarks

Table 1: A summary of mobility and orientation electronic travel aids (ETAs)


It has been argued that a visually impaired person seeks the same sort of cognitive visual information that a sighted person does [9,56]. For example, when a sighted person arrives at a new airport they navigate around by looking at signs, asking attendants for directions or, in some cases, reading a map. Such information is also valuable to the visually impaired. Our world is populated with visual information. Signs (textual or otherwise) can be seen marking buildings, streets, entrances, floors and a myriad of other places. The available body of ETA related work focuses predominantly on mobility and does little to provide access to visual information in the form of environmental text. In this proposal, we focus on detecting, recognizing and reading signs and other forms of text. Cognition of text and signs can (and should) be performed in conjunction with mobility aids such as the long cane or other ETAs. This information is complementary to cognition through other senses because, in general, one cannot expect to hear signs or text. Although talking signs have been proposed, they are unlikely to be ubiquitous in the environment, or even become abundant in the near future. Text and signs are also valuable landmarks for orientation and navigation. A system that can acquire them can co-exist with GPS or tactile maps (see Table 1) and supplement them when they do not work. For example, GPS does not work indoors or in cluttered environments (NYC). More generally, the detection/reading of text and signs presents a user with a rich source of information that can be used beyond mobility and orientation, both in habitual and new environments. It would be tremendously useful to be able to locate the new restaurant that opened on the street, read a menu in a deli, read package labels, or recognize a road construction sign that suggests an alternate path must be taken. Our traveler might have been able to use a portable text reader to locate seat signs.

Figure 1: System Overview (user sweep and query; system mosaic, text/sign detection, attitude-indexed objects and orientation, speech-synthesized response)


We present a three year plan to develop a system that runs on a wearable computer, uses a head mounted camera, detects and recognizes text and signs in the environment and disseminates this data in response to a query using synthesized voice. The information disseminated is inherently visual and cannot generally be obtained by other means. Text and signs can also serve as landmarks for navigation and increase an impaired person's cognition of the environment.

C1. Proposed Research

Figure 1 shows an overview of the proposed system. The system's operation begins with the user standing still and engaging the head mounted camera to acquire information. A compact representation of the continuous video acquired from a sweep of the scene is generated in the form of a mosaic [63], within which the system recognizes relevant objects (in this proposal, text and signs). Text is passed through an OCR [57] system to generate ASCII labels. Signs are recognized by comparing the image data with a database of signs and retrieving their semantics (see next section) [36-39]. Detected objects are indexed by the head pose (angle of the head around the neck axis relative to the body). The head pose is measured using differential gyroscopic attitude inference. A user can query the system using typed text from a small handheld keyboard. Note that this form of input can be replaced with voice input in the long term. Queries can include specific terms or simply request a readout of the text, signs, or both. The responses are delivered to the user using speech synthesis.


In this proposal, we develop an object detection and recognition framework that involves building multi-resolution spatial networks of color-differential feature distributions to detect signs and text. The formulation of the problem is in the form of constructing a classifier that uses coarse resolution representations of color-differential features to detect one of the known object classes. The classifier is built using an expectation maximization algorithm [3,17] to build a mixture of experts. Within detected regions objects are recognized using a coarse to fine matching using models falling within the appropriately detected class. Additional work in text detection includes detection of "curved" and skewed text due to the relative camera orientation. Recognition at run time is made efficient by using attitude inference to selectively match known object classes and spatially align models, the use of mixture models for detection, and image indexing techniques.

Note: figures are in color. If your printed copy is not in color, please visit http://vis-www.cs.umass.edu/VIDI/

Figure 2: Mosaic over an 80° sweep along a walkway on Hallock Street looking for Pease Place

Figure 3: Detected text in portion of mosaic in Figure 2.


Information is broadly sought under two conditions. The first is when the user is stationary and the second is when she is in motion. Both these modes of operation result in different paradigms for information dissemination and are discussed separately. In the first, stationary mode, the user engages the system by sweeping across a field of view. In this mode, the user is querying for information in the environment. For example, our traveler could come to the aisle, sweep across and then query the system, "read out text". The system operates on the entire video sequence and reports on the recognized text, which includes aisle numbers, "10 degrees left, A 17 B, next token, straight ahead A 16 B…", and possibly other text from the environment. Alternately, a query over a supported repertoire of symbols or text could be posed. For example, on a road trip from Pease Place to the Computer Science building exactly eight road crossings and four bus stops are encountered. A person on the street might ask "Pease Sign" and the system might report, "Pease sign detected straight ahead" (Figures 2-3). In the second mode, the user is in motion and uses the system to filter the evolving visual environment for desired objects. The use of the system in motion places several constraints on the system. First, the system must execute in a timely manner; second, the system must handle fluctuations due to the user's motion. Third, the system interface must not interfere with existing sensory processing or activity that the user is engaged in. These are significant issues that we look at as long-term goals. In contrast, in the stationary mode the user is explicitly engaging the system to query the environment for information and hence the user interface model is somewhat simpler, but a precursor to a more general one. Similarly, generating stable representations from video while the camera is being swept involves stabilization [61,64], but this is a little easier to tackle than when walking. Finally, the user can tolerate a somewhat higher system latency than while in motion. For these reasons, we believe that focusing on the stationary mode of operation will yield a better experience for the duration of funded research.
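The following is a minimal Python sketch of how the stationary-mode readout described above might be organized: detected items carry an OCR or sign label plus a head-pose-indexed bearing, and a typed query either asks for all text or filters by a term before the response string is handed to the speech synthesizer. The data fields, query keywords and phrasing are illustrative assumptions, not the proposal's actual interface.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class DetectedItem:
        label: str          # ASCII text from OCR, or the name of a recognized sign
        kind: str           # "text" or "sign"
        bearing_deg: float  # bearing relative to the body axis, from gyro attitude

    def describe_bearing(bearing_deg: float) -> str:
        # Spoken form of a signed bearing, e.g. -10 -> "10 degrees left"
        if abs(bearing_deg) < 5:
            return "straight ahead"
        side = "right" if bearing_deg > 0 else "left"
        return f"{abs(round(bearing_deg))} degrees {side}"

    def answer_query(query: str, items: List[DetectedItem]) -> str:
        # Build the response string that would be passed to the speech synthesizer.
        query = query.strip().lower()
        if query == "read out text":
            matches = [it for it in items if it.kind == "text"]
        else:
            matches = [it for it in items if query in it.label.lower()]
        if not matches:
            return "nothing found"
        return ", next token, ".join(
            f"{describe_bearing(it.bearing_deg)}, {it.label}" for it in matches)

    # Example: the aisle-sign readout used as an illustration in the text above.
    items = [DetectedItem("A 17 B", "text", -10.0), DetectedItem("A 16 B", "text", 0.0)]
    print(answer_query("read out text", items))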


This proposal contains a three-phase plan for acquiring and disseminating information in the stationary mode. In the first phase individual elements are developed; these include scene stabilization [64], mosaicing [62,63], and text [58] and sign recognition [36-39]. In the second phase, the device is constructed, including gyroscopic attitude inference and the system and user interface for use in the stationary mode. In the third phase, we continue to develop the integrated system and evaluate it with visually impaired users. Our work is performed in consultation with Lighthouse International during the design and evaluation phases. In the remainder of this document research towards developing the system and its components is discussed.

Figure 4: System components and data flow (head-mounted camera and 3-axis gyro, headset, body-mounted 3-axis gyro, CPU, mini keyboard)

C1.1.1 Device Construction and System Interface

The proposed system in Figure 1 must be portable and have sufficient computational and IO bandwidth to accommodate the demanding tasks. The proposed system data flow is shown in Figure 4 and has five distinct elements. The first is the processing unit, and we propose to use a light laptop that can be placed in a backpack; in the future this might be replaced by a distributed wearable computer [11,28]. The laptop chosen must satisfy several constraints: lightweight, significant computational capacity, large memory address space, ease of interface with peripherals, and it must execute the NT operating system. We choose this configuration because the system can be a wearable and drivers are readily available in the market to interface with peripherals. This saves a significant amount of development time.


The second element is the camera used. The choice of camera is influenced by the range over which text and signs must be detected and hence the resolution and field of view. The choice is also influenced by size, weight, usability and lens controllability. We propose to experiment with two kinds of cameras. Initially, we plan to use a standard CCD analog (or IEEE 1394 digital) camera with controllable focus and zoom lenses. In the second case, we plan to investigate the use of a log-polar camera [10] with similar zoom/focus controllability. The log-polar camera has spatially varying resolution. Therefore, it has the potential to simultaneously address the field of view and resolution issue (see Section C1.1.2). Furthermore, it promises to increase computational efficiency because it has nice properties with respect to scaling and it is possible that differential features can be generated (at least for the mosaic) in a much more elegant way. This can be done by developing mosaics simultaneously corresponding to a high resolution foveal view and a coarse resolution peripheral view that very efficiently produces a multi-scale representation (see Section C1.1.2). To our knowledge, this hasn't been tried before. One question we wish to address concerns the tradeoff between traditional video formats and the log-polar representation with respect to object and text recognition algorithms and their relative effectiveness. It may turn out that two approximately bore-sighted cameras with wide angle and narrow angle lenses are more effective than a log-polar retina in dealing with field of view and resolution.


The third element consists of obtaining attitude inference. Attitude inference is useful for several reasons. First, it can be used to index models, that is, to estimate the kind of objects that might be present, given that we know the direction in which the camera is pointed. Second, it is used to index the objects recovered from the scene. Third, it can be used to bring object models into alignment with the observation of the scene (see Section C1.1.3). Our approach utilizes two three-axis gyros. The first is placed on the head mount and the second is attached to the body. Since we are using rate gyros, attitude is inferred by sampling and integration of generated precession pulses over time. Such integration is prone to the well-known gyro drift problem, especially with simpler commercial grade gyros. Also gyros need to be mounted on stable platforms to obtain accurate measurements. In the stationary mode, we believe the gyros will provide sufficiently accurate information. The gyro output at the time when the user engages the system is used as initial reference for subsequent integration. Since sweeps are of short durations, and the user is stationary, integration is not likely to produce large drift errors.
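As a concrete illustration of this attitude inference, the sketch below integrates rate-gyro yaw samples from the moment of engagement (which defines the reference) and indexes detections by head yaw relative to body yaw. The sampling rate, constant-bias handling and the single-axis simplification are assumptions made for brevity; the actual system uses two three-axis gyros.

    import numpy as np

    def integrate_yaw(rates_dps, dt, bias_dps=0.0):
        # Integrate yaw-rate samples (deg/s) into a yaw angle relative to engagement.
        # Uncorrected bias is the main source of the drift discussed above.
        rates = np.asarray(rates_dps, dtype=float) - bias_dps
        return float(np.sum(rates) * dt)

    def head_relative_to_body(head_yaw_deg, body_yaw_deg):
        # Head pose used to index detections: head yaw minus body yaw.
        return head_yaw_deg - body_yaw_deg

    # Example: a 2-second, 40 deg/s sweep sampled at 100 Hz on the head gyro,
    # with the body gyro essentially stationary.
    dt = 0.01
    head_rates = [40.0] * 200
    body_rates = [0.1] * 200
    head_yaw = integrate_yaw(head_rates, dt)
    body_yaw = integrate_yaw(body_rates, dt)
    print(head_relative_to_body(head_yaw, body_yaw))   # roughly 79.8 degrees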


The fourth element is the input device for query generation and user control; initially we will use a small handheld keyboard. As mentioned earlier, we choose this input mechanism for its simplicity. We also believe it can be efficient. We choose typed text as the mechanism for querying in this proposal so as to focus on the performance of the visual aspects of the algorithm, but this aspect of our proposal can be extended to voice input using currently available voice recognition technologies. Finally, the fifth element consists of synthesized speech output in response to a query. Here we plan to use a commercial speech synthesis system.
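The proposal intends a commercial synthesizer for this fifth element; the fragment below only illustrates the call pattern (response string in, spoken audio out), using the open-source pyttsx3 package as a stand-in.

    import pyttsx3

    def speak(response: str) -> None:
        engine = pyttsx3.init()      # initialize the platform text-to-speech driver
        engine.say(response)         # queue the response string
        engine.runAndWait()          # block until speech completes

    speak("Pease sign detected straight ahead")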

C1.1.2 Compact high resolution scene representations

In the stationary mode, the visual sweep results in an accumulation of video frames into a compact yet high resolution representation of the scene called a mosaic [45]. A sequence of video frames contains a significant amount of redundant data, especially when the scene being viewed has a low level of independent activity. A large area, high-resolution image can be constructed from a single narrow field-of-view video camera by mosaicing sequential frames. As the camera's field-of-view sweeps out larger and larger areas, the mosaic can be built up by simply saving the new image information associated with each additional frame (see Figure 2). Video mosaics have a number of advantages over traditional kinds of imagery [21,22,53,62,63]:




- Video mosaics are a very compact representation of long video sequences.
- Video mosaics preserve three-dimensional information when the video sensor motion contains a translation component.
- Video mosaics can be generated in real-time.
- During construction, independently moving objects can be detected and modeled separately.


Using mosaics in place of the video sequence significantly reduces both storage requirements and the computational overhead required to detect significant structures. Many of the current successful image mosaic algorithms generate 2D mosaics (either a 360-degree panorama or a full sphere omni-directional image) from a camera rotating around its nodal point [5,45,51,60,63].

For the system proposed here, a large field of view is necessary to gather information in the environment around the person and a high resolution is necessary for good detection and recognition accuracy. The generation of a suitable mosaic that satisfies these two constraints and that will serve as the basic image representation in the proposed system is a low risk component of our work. We have already developed an approach (under other NSF funding) to automatic construction of a dynamic and multi-resolution panoramic (mosaic) representation from image sequences taken during camera rotation and zoom [61,63]. We utilize a pyramid-based image matching algorithm and robust parametric estimation of pose parameters to generate a precise cylindrical panoramic mosaic (which can be unwarped into a planar image). Moving objects are detected and separated from the mosaic image based on motion information; they are then represented as separate objects. If required, a multi-resolution representation can be built for the more interesting areas of the scene by means of camera zooming. The method is fast, robust and automatic. No camera calibration, feature extraction, image segmentation or complicated nonlinear optimization processing is required. We point out that because the expected rotation of the camera is not through the camera's nodal point (that is, the camera's optical axis is not aligned with the rotational axis of the head) it is possible to extract three-dimensional information during mosaic construction (if required) [18,32,51].

The interesting element of mosaic generation comes with the use of the log-polar camera [10]. The log-polar transformation is defined as follows. Let (x, y) be the spatial coordinates. Using polar notation (ρ, θ) with the mapping x = ρ cos(θ) and y = ρ sin(θ), the log-polar coordinates (γ, ξ) can be written as γ = qθ and ξ = ln(ρ/ρ₀), where ρ₀ is the minimum radius and q the angular resolution. Thus, this mapping implements a space-variant representation of the image. The retinal mapping on a CMOS array is shown in Figure 5. This representation can be converted to the familiar Cartesian image at frame rate. The space-variant representation simultaneously provides a high-resolution narrow field of view and a low-resolution wide field of view, making it ideal for our application. An example is depicted in Figure 6. At the center one can observe a high resolution and towards the periphery lower resolution. This representation can be used to generate mosaics at multiple scales by generating separate mosaics for individually selected foveal regions. In this research multi-scale features are generated from the mosaic for detecting both signs and text (see Section C1.1.3). However, generating multi-scale features can be computationally expensive. The log-polar nature of the camera allows us to generate multi-scale mosaics, possibly at almost near frame rate, thereby simplifying feature computations tremendously. The development of multi-scale mosaics is an exciting issue.

Figure 5: Log-polar retina developed by Gulio Sandini.

Figure 6: A spatially variant image generated from log-polar sensing. Example courtesy of Gulio Sandini.
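The numpy fragment below is a software sketch of the mapping just defined (γ = qθ, ξ = ln(ρ/ρ₀)); the log-polar retina performs the equivalent sampling in hardware. The output sizes, ρ₀ and nearest-neighbour sampling are illustrative choices.

    import numpy as np

    def log_polar_remap(image, center, rho0=2.0, n_xi=128, n_gamma=256):
        # Resample a grayscale image onto (xi, gamma) log-polar coordinates,
        # where each column corresponds to one gamma = q*theta sample.
        h, w = image.shape
        cx, cy = center
        rho_max = np.hypot(max(cx, w - cx), max(cy, h - cy))
        xi = np.linspace(0.0, np.log(rho_max / rho0), n_xi)       # xi = ln(rho/rho0)
        theta = np.linspace(0.0, 2.0 * np.pi, n_gamma, endpoint=False)
        rho = rho0 * np.exp(xi)[:, None]                          # invert xi -> rho
        # x = rho*cos(theta), y = rho*sin(theta): back to Cartesian pixel coordinates
        x = np.clip(cx + rho * np.cos(theta[None, :]), 0, w - 1).astype(int)
        y = np.clip(cy + rho * np.sin(theta[None, :]), 0, h - 1).astype(int)
        return image[y, x]                                        # nearest-neighbour sample

    # Example: the fovea (small xi) keeps fine detail, the periphery is sampled coarsely.
    img = (np.random.rand(480, 640) * 255).astype(np.uint8)
    print(log_polar_remap(img, center=(320, 240)).shape)          # (128, 256)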

C1.1.3 From Image Retrieval to Sign Recognition

The system operates on the mosaic image, and detects and recognizes signs and text. Signs (including signs with text, such as Figure 3 and Figure 9) can be thought of as patterns that need to be found in the image, and a pattern recognition framework is well suited. At the Vision Lab and the Multimedia Indexing and Retrieval Group, we have been examining the use of color [8] and appearance based representations [36-40] for image retrieval, and this problem is structurally similar to the sign recognition problem. The existing approach to retrieval assumes the availability of a database of images against which a query image is matched. Similarly, database images are ranked by the match measure in response to the query and returned to the user in rank order. Thus, the recognition problem is the same as retrieval at rank 1. In the appearance-based approach, distributions of intensity surface features are compared to measure similarity. The features are based on multi-scale Gaussian derivatives of images (MGDF) [12,19,24,26,35,38-40,46,47]. Some of the low order features are shown in Table 2. These features are computed by filtering the image with an appropriate Gaussian derivative after suitable energy normalization [40].


Differential Feature | Common Term | Deformation Tolerance
1. I | Intensity | Not used
2. Ix, Iy, Ixx, Ixy, Iyy | Spatial derivatives [19] | Rotation by steering [13]
3. Ix^2 + Iy^2 | Gradient magnitude [12,24] | Rotation
4. Ixx + Iyy | Laplacian [4,6,12,24] | Rotation
5. Ix^2 Ixx + Iy^2 Iyy + Ix Iy Ixy | | Rotation
6. Ixx^2 + Iyy^2 + Ixy^2 | | Rotation
7. Ix Iy Ixy - Ix^2 Iyy - Iy^2 Ixx | Isophote component [12,19,24] | Illumination, rotation
8. Ix Iy (Iyy - Ixx) + Ixy (Ix^2 - Iy^2) | Flowline component [12,19,24] | Illumination, rotation
9. tan^-1(Ix/Iy) | Gradient orientation [36-39] | Illumination, scale
10. tan^-1[(Isophote + Flowline)/(Isophote - Flowline)] | Shape index [20,30,36,37] | Illumination, scale, rotation

Table 2: Low order differential features and their deformation tolerances
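To ground Table 2, the sketch below computes several of the listed invariants at a single Gaussian scale with scipy and collects global histograms of two of them (gradient orientation and the shape index); the full MGDF representation repeats this at several scales. The scale, bin counts and the small epsilon are illustrative choices.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def table2_features(I, sigma=2.0):
        # Gaussian derivative responses at one scale (order = (d/dy, d/dx)).
        d = lambda oy, ox: gaussian_filter(I.astype(float), sigma, order=(oy, ox))
        Ix, Iy = d(0, 1), d(1, 0)
        Ixx, Iyy, Ixy = d(0, 2), d(2, 0), d(1, 1)
        eps = 1e-9
        grad_mag    = Ix**2 + Iy**2                                   # row 3
        laplacian   = Ixx + Iyy                                       # row 4
        isophote    = Ix*Iy*Ixy - Ix**2*Iyy - Iy**2*Ixx               # row 7
        flowline    = Ix*Iy*(Iyy - Ixx) + Ixy*(Ix**2 - Iy**2)         # row 8
        orientation = np.arctan(Ix / (Iy + eps))                      # row 9
        shape_index = np.arctan((isophote + flowline) /
                                (isophote - flowline + eps))          # row 10
        return grad_mag, laplacian, orientation, shape_index

    def feature_histograms(I, sigma=2.0, bins=32):
        # Global distributions of two deformation-tolerant features (rows 9 and 10).
        _, _, orientation, shape_index = table2_features(I, sigma)
        h1, _ = np.histogram(orientation, bins=bins, range=(-np.pi/2, np.pi/2), density=True)
        h2, _ = np.histogram(shape_index, bins=bins, range=(-np.pi/2, np.pi/2), density=True)
        return np.concatenate([h1, h2])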

Multi-scale Gaussian differential features (MGDF) capture an image at several center frequencies and bandwidths or, from a spatial perspective, provide a regularized approximation to the local intensity surface in the Taylor series sense [37,38]. MGDFs have been used to construct local object representations [35,39,46] (using spatial networks of responses around a point) and global ones [30,37,38,47] (distributions of responses) and have been used for several tasks including object recognition [35,47], image retrieval [30,36-40,46], image matching [26,40] and feature detection [12,24].

Figure 7: Some examples of signs.

Figure 8: Examples of MGDF representations for retrieval. The first image is the query image and the next four images are the highest ranked images retrieved from the database.

Figure 9: High-resolution view of detected bus stop sign, shown by a cross.

In our existing work, we developed a retrieval system for global appearance matching using distributions of MGDFs and their transformations so as to seek features with tolerance to illumination and scale change. Specifically, we have developed a representation using histograms of principal curvatures [12,20,24,30,36-38] (of the intensity surface) and orientation [36-38]. The principal curvatures are invariant to rotations and their ratios are relatively insensitive to scale changes. Further, this feature is invariant to monotonic intensity changes. Orientation (of the gradient) presents information complementary to curvature, and is insensitive to scale and to moderate amounts of illumination change. Two images are compared by comparing their histograms. In other work, other differential invariants shown in Table 2 have been used [38]. We have applied this technique to retrieve textures, trademarks (binary and gray-scale) and faces. For example, in Figure 8 four sets of retrievals are shown, each containing 4 of the top ranked images from the database. In each "strip" the first image is the query and the remainder are the retrievals in rank order. Evaluations suggest that this method is superior to using Gabor filters [25] in textures, moment based representations in binary shapes [41,59], and principal components [55] as well as correlation based methods in face recognition based on implementations shown in [50]. In face recognition, on the FERET [33] and ORL [31] face sets we have demonstrated a 97% recognition rate using the evaluation criteria established in [50].

The success of these results encourages us to design a framework for sign detection using MGDFs. The problem is very similar, except that a sign could be anywhere in the image, and hence the search is over the entire mosaic. For example, in Figure 7, two commonly occurring signs are shown. The first is a bus stop and the second a cross-walk sign that a pedestrian might encounter. In Figures 9 and 10 the locations where the corresponding signs in Figure 7 match are shown. While the MGDF approach seems to provide good results, we believe there are limitations. Approaches to date that are too coarse tend to over-generalize (i.e. have high bias); one such example is the use of global distributions of features (over the entire object) [30,37,47]. Other approaches are too fine. They tend to under-generalize (i.e. have low bias) and become mired in the variability of the object class; one such example is that of [35,39,46], which uses a set of response vectors. Neither method extends very well and both typically do not have high recognition rates due to the false positive and/or false negative responses from the recognition system. Finding the right representation both for detection and recognition is therefore an important question.


In our approach to color based retrieval we have demonstrated the use of spatial relationships of color for
Figure 10: Example of a detected sign in
the field of view.

Figure 11: Color image retrieval of logos. The query is outlined in the first picture and the remainder are
retrieved from a database. Note the surf picture is mad
e by the same manufacturer and a the same color
arrangement is used.


11

finding logos. Logos can be found in many places including advertisements and product packages. For this
proposal, signs can be visualized as logos as well. By definition a logo has a spatial pattern of colors and
exploiting this fact has yielded good resu
lts. In the approach presented by Das[8] we represent objects (logos)
by a graph whose nodes represent the dominant color and whose arcs connect adjacent dominant colors. To
retrieve images their color adjacency graphs are matched. This “spatial representa
tion” is good trade
-
off

between local and global representations (as discussed briefly above) and accommodates large coordinate and
moderate illumination deformations. In Figure 11 the first image shows a region of interest that was selected
by the user as

the query and the subsequent images are those that were retrieved from the database in order of
measured similarity. This technique has also been extended to other specialized databases such as flower
patents and birds.
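A much-simplified sketch of the color-adjacency idea is given below: dominant quantized colors are the nodes, colors that occur next to each other contribute edges, and two regions are compared by the overlap of their edge sets. The 4-level quantization, dominance threshold and Jaccard score are simplifying assumptions and not the published matching procedure of Das [8].

    import numpy as np

    def color_adjacency_graph(rgb, levels=4, min_frac=0.02):
        # Quantize each channel, keep "dominant" colors, and record which dominant
        # colors are horizontally or vertically adjacent in the image region.
        q = rgb.astype(int) * levels // 256
        labels = q[..., 0] * levels * levels + q[..., 1] * levels + q[..., 2]
        counts = np.bincount(labels.ravel(), minlength=levels**3)
        dominant = set(np.flatnonzero(counts >= min_frac * labels.size))
        edges = set()
        for a, b in [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]:
            for u, v in zip(a.ravel(), b.ravel()):
                if u != v and u in dominant and v in dominant:
                    edges.add((min(u, v), max(u, v)))
        return edges

    def graph_similarity(e1, e2):
        # Jaccard overlap of the two adjacency-edge sets.
        return len(e1 & e2) / max(1, len(e1 | e2))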


Our proposal for detecting and recognizing signs adopts a consistent object detection and recognition paradigm drawn from the strengths of the MGDF and color work. Additional work on text detection also uses similar notions and is discussed separately in the next section. To recognize signs, let us look at the problem from the top. The user sweeps across the scene and a mosaic is generated. Text and signs that could be present at an unknown orientation, scale, and under different illumination conditions need to be recognized. In this section we focus on signs. One approach could be to simply match the signs by applying a windowing operator over the image. Further, since the signs can be present at several different sizes we might choose to run multiple window sizes. This approach is inefficient and, although some operations are unavoidable, several steps can be taken to make recognition efficient.


Our approach is depicted in Figure 12. When an image (or mosaic) is presented for analysis, a multi-resolution region of interest is windowed over the image. First, inferred attitude (or pose) from the gyroscope is used to select a subset of known signs. For example, if the person is looking down then the "Bus Stop Sign" would be disallowed in certain portions of the image, and instead the system might allow a Zebra crossing sign. Second, color and MGDF distributions within a window are either rejected or classified as being one of several possible object classes. Classes are pre-determined and include text, text with signage (and hence thought of as a sign), similar looking signs (such as cross-walks) and others. Third, classified windows are matched with corresponding subsets of models. Using a series of steps involving measurement of dominant orientation and surface parameters [14,15], use of head orientation, size and intensity normalization, and built-in feature invariances (to deformations), a model and window are brought into alignment with respect to each other. Finally, a coarse to fine match of feature distributions between the image and model is conducted to measure similarity. In the remainder of this section these operations are elaborated upon.


Figure 12: Illustration of detection and recognition framework for text and signs (patch, feature distributions, classify/reject; signs: model alignment and spatial match; text: group, unwarp/rectify, OCR, e.g. "Bus Stop, PVTA, No Parking").

The classifier (Figure 13) acts as an object detector that reduces recognition complexity by clustering objects into classes and detecting any instance without having to match each one of them. Note that classification is not being used to achieve deformation tolerance; we believe that, as much as possible, invariances (or tolerances) should be provably constructed in the feature, so as to alleviate the complexity of building (or learning) the classifier. Rather, the classifier is designed to group "similar" looking objects. For example, consider the Pease Place sign shown in Figure 3. Although each street sign is distinct, there is an appearance similarity across different street signs. Usually, the sign has a flat colored background with a strip of text on it. It makes little sense to match each one of these signs if we are attempting to answer the question, "Is there a street sign in the field of view?" The proposed classifier will take the distributions of features (color and MGDF) and produce as output the class C with the maximum a posteriori (MAP) estimate. The formulation we choose to adopt is the construction of a mixture of experts that uses the feature distributions as vectors and compares them with learned class distributions. Learning is accomplished in a maximum likelihood framework using Kullback-Leibler divergence [3,17] given the conditionals {P(distribution|Object), P(distribution|Not Object)} and priors from a large collection of natural images. Further, the individual classifier outputs are merged using the "softmax" function [3,17]. There are several other possible supervised and non-supervised clustering choices including principal components, independent components, support vector machines or back-propagation [3,17]. The choice of algorithm is a work item in the research. It should be noted that the use of classification for detection is, in part, motivated by recent successes in learning and probabilistic techniques [23,29,43,48,52]. However, the manner in which we proceed relies on the use of "good" features; that is, features that by construction exhibit tolerance to deformations. Further, in building the mixture model we are adopting an approach based on language models [34], and seeking to answer the question: how likely is it to generate a query (the window) from a statistical "language model" (distributions)? This framework is rather uniform and applicable to both text (ASCII) and image retrieval.
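The fragment below is a single-expert stand-in for this formulation: a window's feature histogram is scored against each learned class distribution with a Kullback-Leibler term plus a log prior, the scores are merged with a softmax, and low-confidence windows are rejected. Representing the learned models as mean training histograms and the particular reject threshold are simplifying assumptions.

    import numpy as np

    def kl_divergence(p, q, eps=1e-9):
        # KL(p || q) between two (possibly unnormalized) histograms.
        p = p / (p.sum() + eps)
        q = q / (q.sum() + eps)
        return float(np.sum(p * np.log((p + eps) / (q + eps))))

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    def classify_window(window_hist, class_models, priors, reject_below=0.5):
        # MAP-style class choice for one window; returns a class name or "reject".
        names = list(class_models)
        scores = np.array([np.log(priors[n]) - kl_divergence(window_hist, class_models[n])
                           for n in names])
        posterior = softmax(scores)
        best = int(np.argmax(posterior))
        return names[best] if posterior[best] >= reject_below else "reject"

    # Toy example with three made-up classes over a 64-bin feature histogram.
    rng = np.random.default_rng(0)
    models = {c: rng.random(64) ** 4 for c in ("text", "street-sign", "background")}
    priors = {"text": 0.25, "street-sign": 0.25, "background": 0.5}
    window = models["street-sign"] + 0.01 * rng.random(64)
    print(classify_window(window, models, priors))   # expected: street-sign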



Combining information from the head pose (or attitude) and the classifier produces an a posteriori estimate of the likelihood of a known object model. Recognition is achieved by matching representations of object models with windows containing large estimates (e.g. the most likely locations to look in detail). The matching function/algorithm must satisfy three constraints. First, it must be a good measure of similarity. Second, it must tolerate object deformations with respect to the model and, third, it must be efficient. In order to address these issues it is necessary to discuss the features, their representations and the similarity measure, and below we elaborate on these points.


In general an object representation can be parameterized by resolution, region, feature, and, for MGDFs, the scales of the feature. At a coarse resolution, the object is represented in its entirety and at fine resolutions, by a spatial network of regions. Thus the first aspect of the representation is the use of multi-resolution object models. The use of multi-resolution models is not new [4,6] and its importance in matching and representation has been well studied. We are motivated by similar considerations and use coarse resolutions to act as fast pre-filters for matching objects at finer resolution. The next element, spatial partitioning, is related to the resolution of the object. At a coarse resolution the entire object forms the representation, while at fine resolution spatial partitions of the object contribute to its representation. The use of spatial partitioning has proven to be extremely significant in our work. For example, in existing image retrieval work, we have used histograms of MGDFs for face recognition [36]. In this context, instead of using one distribution to represent the entire face, we produced spatial partitions of three regions roughly corresponding to the forehead, mid-face and chin regions. Doing so has consistently improved results. On the FERET [33] data set our recognition accuracy jumped from 93% to 97% [36].

Figure 13: Mixture of experts framework to learn and combine multiple object classifiers (training images yield differential and color distributions that feed N individual classifiers, whose outputs are combined into a class estimate C).


In this proposal we want to examine the more general issue of spatial networks of feature distributions. In particular, we examine the following question: is it better to partition uniformly [36], or do we use regions around interesting object features, thereby generating object specific features [36,52]? For example, is it better to partition a face, as discussed above, rather than using regions around the eyes, mouth, nose, etc.? The latter has the advantage of reduced model complexity, because only the most interesting aspects are represented, and tolerance to coordinate deformations, because spatial relationships of regions can be used to build affine invariants [41]. However, when we consider multiple object models, it becomes difficult to match them in an image in an economical way because each model comes with its own network. Uniform partitioning, on the other hand, can have a uniformly high model complexity but matching becomes easier. Since the spatial network is fixed across objects due to uniform partitioning, multi-dimensional indexing [1,2,16,42,44] can be applied to implement matching efficiently. Tolerance to deformations can be achieved up to similarity transformations, by using a multi-resolution match to handle scale, and using head pose and surface orientation estimates to handle rotations. It is not immediately clear which is better, and in the context of sign recognition this is one of the issues we plan to investigate.
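The sketch below shows the uniform-partition variant in its simplest form: level 0 is a single histogram over the whole window, level 1 a 2x2 grid of per-cell histograms, and the coarse level acts as a cheap pre-filter before the finer comparison. Intensity histograms, an L1 distance and the pre-filter threshold stand in for the MGDF/color distributions and match measures discussed above.

    import numpy as np

    def grid_histograms(patch, grid, bins=16):
        # Concatenated per-cell intensity histograms over a grid x grid partition
        # of an 8-bit grayscale patch.
        h, w = patch.shape
        hists = []
        for i in range(grid):
            for j in range(grid):
                cell = patch[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
                hist, _ = np.histogram(cell, bins=bins, range=(0, 256), density=True)
                hists.append(hist)
        return np.concatenate(hists)

    def l1(a, b):
        return float(np.abs(a - b).sum())

    def coarse_to_fine_match(window, models, prefilter=2.0):
        # Coarse (grid=1) histograms prune candidates; survivors are compared at grid=2.
        coarse_w = grid_histograms(window, grid=1)
        survivors = [m for m in models
                     if l1(coarse_w, grid_histograms(models[m], grid=1)) < prefilter]
        fine_w = grid_histograms(window, grid=2)
        scored = [(l1(fine_w, grid_histograms(models[m], grid=2)), m) for m in survivors]
        if not scored:
            return None, float("inf")
        dist, name = min(scored)
        return name, dist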

Irrespective of the choice of representation of the spatial network, feature distributions are the basic representation for an object region. We seek that this representation be tolerant to illumination, rotation, view and scale. Color distributions are relatively stable to deformations; however, in the case of differential distributions a more careful choice is required at the feature level.

MGDFs can be made (2D) rotationally invariant [12,13,24] in a rather nice way. However, since we need multiple scales (regularization at a single scale is weak), the choice of scales becomes important. This can be explained using an example. Suppose we filter a sign using a Gaussian (derivative) at scale (standard deviation) σ around some point. To compare a reduced version of the sign by scale factor s (s < 1) appropriately, we need to filter the corresponding point at a scale sσ. But since s is unknown, a scale-space search might ensue, which is computationally very expensive. To address this problem, we adopt a natural scale selection paradigm where the image structure is used to estimate the most natural choice of scales of the filter, thus eliminating the scale-space search.
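A small sketch of this selection step follows: a few rotation-tolerant invariants from Table 2 are computed over a window at each candidate σ, and the σ at which a moment of those responses (here the determinant of their covariance, a generalized variance) is extremal is taken as the natural scale. The candidate scale list, the particular invariants and the normalization are illustrative choices rather than the exact heuristic of the proposal.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def invariants(patch, sigma):
        # A few rotation-tolerant Table 2 invariants at one scale, scale-normalized
        # so that responses at different sigmas are comparable.
        d = lambda oy, ox: gaussian_filter(patch.astype(float), sigma, order=(oy, ox))
        Ix, Iy, Ixx, Iyy, Ixy = d(0, 1), d(1, 0), d(0, 2), d(2, 0), d(1, 1)
        feats = [sigma**2 * (Ix**2 + Iy**2),             # gradient magnitude (row 3)
                 sigma**2 * (Ixx + Iyy),                 # Laplacian (row 4)
                 sigma**4 * (Ixx**2 + Iyy**2 + Ixy**2)]  # second-derivative energy (row 6)
        return np.stack([f.ravel() for f in feats])

    def natural_scale(patch, sigmas=(1, 2, 3, 4, 6, 8, 12)):
        # Pick the sigma at which the generalized variance of the invariants peaks.
        scores = [np.linalg.det(np.cov(invariants(patch, s))) for s in sigmas]
        return sigmas[int(np.argmax(scores))]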


Figure 14: Natural scale selection as extrema of differential feature covariances.

Figure 15: Sign with detected text.

In our present work on scale estimation, we develop a framework wherein the moments of differential invariants at several scales are examined, and extrema therein are marked as a significant scale. Thus instead of choosing scales 1, 3, 5 (say) we choose them to be functions of the natural scale. This approach is very similar to the approaches by Lindeberg et al. [24]; however, we induce a more general scale selection heuristic by picking a larger order of features as opposed to the first or second order choices, such as the outer moment matrix or Laplacian, made by Lindeberg [24]. For example, in Figure 14 the application of natural scale selection using maximization of the covariance (of differential features) is shown. The top graph plots the objective function vs. the scale of the Gaussian used, and the bottom graph shows the neighborhood that the choice of scale corresponds to. In this example it results in extracting "blobs" by finding a scale that roughly "resonates" with the object size. Other experiments on faces reveal that it can be used to isolate structures such as eyes, nose and lips [36]. The use of natural scale selection and rotational invariants gives features that are quite stable to similarity transformations, and also tolerant to moderate view changes. In practice, this is achieved by filtering the image at multiple scales. In Section C1.1.2 we discussed the use of a log-polar camera for generating a multi-scale image directly, as opposed to filtering the acquired image. Thus, this operation, which can otherwise be computationally expensive, can be performed efficiently. Since the features have tolerance to deformations, so do their distributions. However, illumination is another aspect and here we propose to investigate several techniques including a) intensity normalization, b) picking specific invariants such as the curvature (or ratios of principal curvatures) which have monotone intensity invariance properties, or c) making illumination part of learning the feature distribution by using independent components of exemplars generated at several illuminations. In summary, our existing approach to detection and recognition involves the following components:



- MGDF and color distributions are by design made tolerant to deformations.
- Spatial networks of distributions and coarse-to-fine representations for robust, efficient matching.
- Learning object classifiers using a hierarchical mixture of experts approach.
- Use of intensity surface properties.
- Use of head pose for model-image alignment.

C1.1.4 Text Detection and Recognition

OCR [57] technologies have proven to be quite successful on clean, machine readable documents such as this one. However, they have not been successful when applied to images containing sparse text. Note that we are not considering handwritten text, which is a separate problem. Poor OCR performance is in part due to font size variability and in part due to the difficulty of isolating text patches within an image. In existing work in the context of image and video indexing at UMASS, we have developed techniques for finding text in images, extracting text from the image, and recognizing it using commercially available off-the-shelf OCR [57]. Our current approach [58] makes the assumption that text can be thought of as a texture that occupies distinct spatial frequencies. The key parts of this system are as follows. An image is parsed at multiple resolutions to extract strokes, defined as responses to first order Gaussian derivatives at multiple scales. Strokes with spatial cohesion are accumulated into chips and bounding boxes are produced around chips [58]. These chips are then submitted to an OCR engine. In Figures 3, 15 and 16 examples of the performance of the existing detection technique are shown. In Figures 3 and 15 the hypothesized text is marked by red boxes and in Figure 16, it is segmented from the image. Note that although there are false positives in Figure 3, these are eliminated during the OCR stage.


     | Total in Images | Total Detected | Total OCRed
Char | 21820           | 91%            | 84%
Word | 4406            | 86%            | 77%

Table 3: Text detection and recognition performance on 48 images

In our current work, experimental results (see Table 3) indicate that for a large variety of images such as package labels, signs, and advertisements, 86% of words were ultimately detected and 77% of the original words could be successfully recognized [58].
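A rough sketch of the stroke/chip pipeline is given below: first-order Gaussian derivative energy at a few scales marks candidate strokes, morphological closing provides the spatial cohesion that groups strokes into chips, and each chip's bounding box is handed to an OCR engine (the open-source pytesseract stands in here for the commercial OCR of [57,58]). The thresholds, scales and minimum chip size are illustrative assumptions.

    import numpy as np
    from scipy.ndimage import gaussian_filter, binary_closing, label, find_objects
    from PIL import Image
    import pytesseract

    def text_chips(gray, sigmas=(1, 2, 4), k=2.0):
        # Multi-scale first-derivative (stroke) energy, thresholded and grouped into chips.
        g = gray.astype(float)
        energy = np.zeros_like(g)
        for s in sigmas:
            gx = gaussian_filter(g, s, order=(0, 1))
            gy = gaussian_filter(g, s, order=(1, 0))
            energy = np.maximum(energy, np.hypot(gx, gy))
        strokes = energy > energy.mean() + k * energy.std()
        cohesive = binary_closing(strokes, structure=np.ones((3, 9)))  # join nearby strokes
        labelled, n = label(cohesive)
        return find_objects(labelled)                                  # bounding-box slices

    def read_text(gray):
        # OCR each chip; tiny chips are discarded as likely false positives.
        lines = []
        for chip in text_chips(gray):
            crop = gray[chip]
            if min(crop.shape) < 8:
                continue
            lines.append(pytesseract.image_to_string(Image.fromarray(crop)).strip())
        return [t for t in lines if t]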




The existing approach to text detection is limited to nearly horizontal text. The proposed framework for text detection is outlined in Figure 12. First, a window operating on the mosaic is classified as possibly containing text. This classifier is learned using the same methodology described in Section C1.1.3, and is feasible for the same reason that detection is possible in the existing method. That is, text can be thought of as a "texture" pattern with a certain spatial cohesion. Once classification is complete, adjacent text windows are merged to generate strips and these strips are rectified before passing to the OCR. This is done to include "curved" text and text on surfaces that might be skewed with respect to the camera. Our approach is as follows. First, we will compute spatial cohesion along a parametric curve. This is done by using the scale selection heuristic (discussed earlier) to obtain a filter scale that roughly "resonates" with the scale of a character. Smoothing it any more will cause the texture to disappear. For text with inter-character spacing that is smaller than the size of the character, this will result in the generation of a blob that extends to the size of the word. That is, there will be continuity in the sense of closeness of differential feature response. The local orientation at sampled points along this curve, together with the points, is used to enforce C2 continuity along the blob, and hence curved paths as well as skewed text can be recovered. Then we unwarp these strips so they can be fed to an OCR. Evidence that this approach is feasible already exists [27] where, in the context of handwritten word segmentation, it is shown that recovering the natural scale segments the words. The key question is to reduce the number of scales used to estimate the natural "character" scale by using heuristics of expected size at a certain distance from the camera.


C1.1.5 User Evaluation

Lighthouse International will play a major role in the evaluation and will develop a pool of evaluators with varying degrees of impairment. They will assist in evaluating the VIDI system with emphasis on the following non-inclusive questions:

- What range of distances is best from a user's point of view and how well does the current system function within that range? How can any discrepancies be addressed?
- What signs are most useful for determining orientation?
- Under what conditions does text reading fail (and why)?



- Is it necessary to have perfect recognition and speech synthesis, or is the user able to tolerate translation imperfections? Can the user be provided feedback; for example, if a sign matches only coarsely or is of too small a resolution, can the system ask the user to "move forward or turn head right"? Further, we wish to examine whether we can look up OCR output and intermediate representations through a dictionary to examine the confidence of recognition and, based on this, request that the user re-scan, perhaps more slowly.
- How can the user efficiently save a sign or piece of text, and what is the recall accuracy and reliability?
- How easy is it to use the hand-held keyboard in conjunction with a mobility device such as the cane?
- How useful is the information provided by VIDI?
- How easy is the system to use?

Figure 16: Text detection using existing techniques.

Figure 17: Gantt chart describing the research and evaluation plan (Phases 1-3 over months 0-36: data collection, stabilization, mosaics, text and sign recognition, system design and construction, technical evaluation, refinement, user evaluation with Lighthouse International, and final report; staffing by Ravela, RA1-RA3 and Lighthouse).



C2. Work Plan

The chart shown in Figure 17 describes the work plan for research and evaluation. Our plan can be roughly divided into three phases. In the first phase, covering the first year of effort, the basic algorithms for sign recognition, text detection and their interface to a commercial OCR engine will be developed. During the same period, work on camera stabilization is conducted and, finally, during the first six months, data will be collected from throughout the UMASS Amherst campus. This data will pertain to signs and textual information that is available in the external environment. In the second half of the first phase, we will continue work on generating compact scenes in the form of mosaics and integrating these with the algorithms being developed for recognition of text and signs. Following this there will be a technical evaluation of the algorithms in the laboratory, and it is expected that the buffer period at the end of the first phase will produce the first set of disseminable results.


The second phase begins with the actual device construction. Again, several tasks are executed simultaneously. One task is to develop the head mount consisting of the headset, gyro, and camera. The second task consists of development of the wearable computer. These two developments are integrated with the core stabilization and mosaicing algorithm. At the end of the first part of the second phase we plan to produce a basic system that can be field tested with respect to a) how stabilization and mosaicing work, b) the response times for detection of text and signs (from a small starting database) and c) the ease of use for a user. This will, no doubt, suggest several refinements that will be carried out as they are discovered. Although our planning and evaluation partners at Lighthouse International will be involved from the very inception, it is at this stage that we bring the system to user trials. The main purpose of these trials is to collect data and obtain feedback as to system usability. The issues highlighted in Section C1.1.5 will be addressed. Again we plan to disseminate the results of this phase through publications and demonstrations of the device and the user study. In the final phase of the project (lasting an anticipated eight months) we conduct a series of system refinements and user trials in cooperation with Lighthouse International, and examine both places where the system can be refined as well as improvements in the techniques. The result of this phase will be a comprehensive report that details our successes and failures, and will suggest directions for further investigation.

C3. Long-term Impact

The success of the research presented here will have a tremendous impact on the growing population of visually impaired in the country and indeed the world. Our contention is that the field of computer vision has matured to the point where a completely portable system capable of finding and reading signs and other textual events in the environment in close to real time is feasible. What a gift it would be if we could provide the visually impaired with a way to obtain crucial information from the environment that is now totally inaccessible (except through intervention of a sighted person).


The design we propose here was developed specifically to be extensible. For example, the object detection and recognition system is very general and the underlying technology has been used in a wide variety of applications [4,6,26,27,30,35,36-39,46,47,58]. Of particular interest here, though, is that this system can be extended to include face detection and recognition, for example, or unique landmark detection for enhanced mobility. The underlying mosaic representation is capable of producing three dimensional representations of the environment [63] in near real time (approximately 5 frames per second on a Pentium III laptop in an environmental monitoring application [52]), although how the user would interact with this data is an interesting problem in transcoding. However, such an environment could convey important mobility information at a range well beyond current devices.


We view the effort proposed here as the beginning of a long-term effort to apply work in computer vision to the problem of providing the visually impaired with a true 'view' of the world. We expect that as the system evolves, its interaction with the user will grow much more sophisticated. For example, the system should know when one or more of its components fails and should interact with the user at an appropriate level when this occurs: "I've detected a sign off to the right but can't read the text." If the user indicates that the text is important, the system should suggest an action the user could take to improve the likelihood that the text can be read: "If you can, move approximately ten feet to your left and turn your head to the right." As 'smart devices' become more widespread (e.g. the 'smart home'), the user can 'mesh' into the surrounding network through the wearable computer; this symbiotic relationship should provide an extremely rich environment.
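To make the intended query-and-feedback exchange concrete, the short sketch below mirrors the scenario described above. The component and function names (Detection, describe, handle_user_reply) are hypothetical placeholders for illustration, not parts of an existing implementation.

# Hypothetical interaction loop; names are placeholders, not an existing API.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str        # e.g. "sign" or "text"
    bearing: str      # e.g. "off to the right"
    readable: bool    # whether the sign matcher / OCR succeeded
    advice: str       # suggested user action if the item could not be read

def describe(det: Detection) -> str:
    """Turn a detection into a spoken message for the user."""
    if det.readable:
        return f"I've detected a {det.label} {det.bearing}."
    return f"I've detected a {det.label} {det.bearing} but can't read the text."

def handle_user_reply(det: Detection, reply: str) -> str:
    """If the user says the unreadable item matters, suggest a corrective action."""
    if not det.readable and reply.strip().lower() in {"yes", "important"}:
        return det.advice
    return "Okay, ignoring it."

det = Detection("sign", "off to the right", False,
                "If you can, move approximately ten feet to your left and turn your head to the right.")
print(describe(det))
print(handle_user_reply(det, "yes"))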


Finally, we observe that the VIDI concept provides a very rich development environment for basic computer vision research. It will augment and strengthen our NSF research on environmental monitoring (NSF Grant #EIA-9726401) through the development of faster and more robust methods for creating mosaics from video data. It will provide yet another application for the object detection and recognition system originally developed under DARPA and NSF funding [DARPA ESC/AXS F19628-95-C-0235, NSF IRI-9619117, Multimedia CDA-9502639], initially for retrieval of images from large databases using image content. Finally, it will provide a foundation on which other capabilities can be built, including work on separation of object from background, efficient representations of the 3D world, and how information of this kind can be efficiently and effectively transcoded into other sensory modalities for use by the visually impaired.





D. References


[1] N. Beckmann, H.-P. Kriegel, R. Schneider and B. Seeger, The R*-tree: an efficient and robust access method for points and rectangles, ACM SIGMOD, pp. 322-331, 1990.
[2] J. L. Bentley, K-d trees for semidynamic point sets, ACM Symposium on Computational Geometry, pp. 152-161, 1995.
[3] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1994.
[4] P. J. Burt and E. H. Adelson, The Laplacian Pyramid as a Compact Image Code, IEEE Transactions on Communications, 31(4), pp. 532-540, 1983.
[5] S. E. Chen, QuickTime VR - an image based approach to virtual environment navigation, Proc. SIGGRAPH 95, pp. 29-38, 1995.
[6] J. L. Crowley and A. C. Sanderson, Multiple Resolution Representation and Probabilistic Matching of 2-D Gray-Scale Shape, IEEE Trans. PAMI, vol. 9, no. 1, pp. 113-121, 1987.
[7] Global Initiative for the Elimination of Avoidable Blindness, United Nations Fact Sheet N° 213, Revised February 2000.
[8] M. Das, R. Manmatha and E. M. Riseman, Indexing Flower Patent Images using Domain Knowledge, IEEE Intelligent Systems, vol. 14, no. 5, p. 24, 1999.
[9] Electronic Travel Aids: New Directions for Research, Working Group on Mobility Aids for the Visually Impaired and Blind, Committee on Vision, National Research Council, National Academy Press, Washington, D.C., 1986.
[10] F. Ferrari, J. Nielsen, P. Questa and G. Sandini, Space variant imaging, Sensor Review, 15(2): 17-20, 1995.
[11] S. Finger, M. Terk, E. Subrahmanian, C. Kasabach, F. Prinz, D. Siewiorek, A. Smailagic, J. Stivorek, and L. Weiss, Rapid Design and Manufacture of Wearable Computers, Communications of the ACM, Vol. 39, No. 2, February 1996.
[12] L. M. J. Florack, The Syntactic Structure of Scalar Images, University of Utrecht, 1993.
[13] W. T. Freeman and E. H. Adelson, The design and use of steerable filters, IEEE Trans. Patt. Anal. and Mach. Intelligence, 13(9), pp. 891-906, 1991.
[14] J. Garding, Shape from texture and contour by weak isotropy, Artificial Intelligence, vol. 64, pp. 243-297, 1993.
[15] M. M. Gorkani and R. W. Picard, Texture Orientation for Sorting Photos 'at a glance', Proc. 12th Int. Conf. on Pattern Recognition, pp. A459-A464, October 1994.
[16] A. Guttman, R-trees: A Dynamic Index Structure for Spatial Searching, ACM SIGMOD, pp. 47-57, 1984.
[17] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, 1999.
[18] H. Ishiguro, M. Yamamoto, and S. Tsuji, Omni-directional stereo for making global map, Proc. IEEE ICCV'90, pp. 540-547, 1990.
[19] J. J. Koenderink, The Structure of Images, Biological Cybernetics, vol. 50, pp. 363-396, 1984.
[20] J. J. Koenderink and A. J. Van Doorn, Surface Shape and Curvature Scales, Image and Vision Computing, 10(8), 1992.
[21] R. Kumar, H. Sawhney, J. Asmuth, J. Pope and S. Hsu, Registration of Video to Geo-referenced Imagery, Proc. IAPR ICPR'98, vol. 2, pp. 1393-1400, 1998.
[22] R. Kumar, P. Anandan, M. Irani, J. Bergen and K. Hanna, Representation of scenes from collections of images, IEEE Workshop on Representation of Visual Scenes, pp. 10-17, 1995.
[23] A. Lakshmi-Ratan, O. Maron, W. E. L. Grimson, T. Lozano-Perez, A Framework for Learning Query Concepts in Image Classification, Proc. CVPR'99, vol. I, pp. 423-429, 1999.
[24] T. Lindeberg, Scale-Space Theory in Computer Vision, Kluwer Academic Publishers, 1994.
[25] W. Y. Ma and B. S. Manjunath, Texture-Based Pattern Retrieval from Image Databases, Multimedia Tools and Applications, 2(1), pp. 35-51, January 1996.
[26] R. Manmatha, Matching Affine-Distorted Images, Ph.D. Dissertation, University of Massachusetts at Amherst, 1997 (Allen Hanson, Chair).
[27] R. Manmatha and N. Srimal, Scale space technique for word segmentation in handwritten documents, Proc. Second International Conference on Scale-Space Theories in Computer Vision, pp. 22-33, 1999.
[28] S. Mann, Wearable Computing: A first step towards personal imaging, Computer, 30(2), pp. 25-32, 1997.
[29] B. Moghaddam and A. Pentland, Probabilistic Visual Learning for Object Representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, July 1997.
[30] C. Nastar, The image shape spectrum for image retrieval, Technical Report 3206, INRIA, June 1997.
[31] Olivetti Research Labs, Face Dataset, http://www.cam-orl.co.uk/facedatabase.html
[32] S. Peleg and M. Ben-Ezra, Stereo panorama with a single camera, Proc. IEEE CVPR'99, pp. 395-401, 1999.
[33] P. Jonathon Phillips, H. Moon, S. A. Rizvi and P. J. Rauss, The FERET Evaluation Methodology for Face-Recognition Algorithms, IEEE PAMI (to appear).
[34] J. Ponte and W. B. Croft, A Language Modeling Approach to Information Retrieval, Proceedings of the 21st International Conference on Research and Development in Information Retrieval, pp. 275-281, 1998.
[35] R. Rao and D. Ballard, Object Indexing Using an Iconic Sparse Distributed Memory, Proc. IEEE International Conference on Computer Vision, pp. 24-31, 1995.
[36] S. Ravela, Multiscale representations for face recognition, SPIE Conference on Technologies for Law Enforcement (to appear), Nov 5, 2000, Boston, MA.
[37] S. Ravela and C. Luo, Appearance-based Global Similarity Retrieval of Images, in Advances in Information Retrieval, W. Bruce Croft (Editor), Kluwer Academic Publishers, 2000.
[38] S. Ravela and R. Manmatha, Gaussian Processes in Image Filtering and Representation, Encyclopedia of Electrical and Electronic Engineering, John Webster (Editor), John Wiley, 1999.
[39] S. Ravela and R. Manmatha, Retrieving Images by Appearance, Proc. of the International Conf. on Computer Vision (ICCV), Bombay, India, Jan 1998.
[40] S. Ravela, R. Manmatha, and E. M. Riseman, Retrieval from Image Databases using Scale Space Matching, Proc. of the European Conf. on Computer Vision (ECCV '96), Cambridge, U.K., pp. 273-282, Springer, April 1996.
[41] T. H. Reiss, Recognizing Planar Objects Using Invariant Image Features, Springer-Verlag, 1993.
[42] J. T. Robinson, The k-D-B-Tree: A Search Structure for Large Multidimensional Indexes, ACM SIGMOD, pp. 10-18, 1981.
[43] H. Rowley, S. Baluja, T. Kanade, Neural Network-Based Face Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 20, No. 1, January 1998.
[44] H. Samet, The Design and Analysis of Spatial Data Structures, Addison-Wesley, 1989.
[45] H. S. Sawhney, R. Kumar, G. Gendel, J. Bergen, D. Dixon, V. Paragano, VideoBrush™: Experiences with consumer video mosaicing, Proc. IEEE Workshop on Applications of Computer Vision (WACV), pp. 56-62, 1998.
[46] C. Schmid, R. Mohr, Local Grayvalue Invariants for Image Retrieval, PAMI, 19(5), pp. 530-535, May 1997.
[47] B. Schiele and J. L. Crowley, Object Recognition Using Multidimensional Receptive Field Histograms, Computer Vision - ECCV '96, Bernard Buxton and Roberto Cipolla (Eds.), Lecture Notes in Computer Science, Springer, April 1996.
[48] H. Schneiderman and T. Kanade, A Statistical Model for 3D Object Detection Applied to Faces and Cars, IEEE Conference on Computer Vision and Pattern Recognition, June 2000.
[49] R. Schultz and R. Stevenson, Extraction of high resolution frames from video sequences, IEEE Trans. Image Processing, 5(6), pp. 996-1011, 1996.
[50] T. Sim, R. Sukthankar, M. Mullin, and S. Baluja, High-Performance Memory-based Face Recognition for Visitor Identification, 1999 (see http://www.ri.cmu.edu/pubs/pubs_2772.html).
[51] H.-Y. Shum and R. Szeliski, Panoramic Image Mosaics, Microsoft Research Technical Report MSR-TR-97-23, 1997.
[52] K. Sung and T. Poggio, Example-Based Learning for View-Based Human Face Detection, IEEE PAMI, Vol. 20, No. 1, pp. 39-51, 1998.
[53] R. Szeliski and S. B. Kang, Direct methods for visual scene reconstruction, IEEE Workshop on Representation of Visual Scenes, pp. 26-33, 1995.
[54] B. Thylefors, A.-D. Négrel, R. Pararajasegaram, K. Y. Dadzie, Global data on blindness, Bulletin of the World Health Organization, 73(1): 115-121, 1995.
[55] M. Turk and A. Pentland, Eigenfaces for Recognition, Jrnl. Cognitive Neuroscience, Vol. 3, pp. 71-86, 1991.
[56] M. Yablonski, Personal Communication, Lighthouse International, NY, NY.
[57] WordScan OCR Software, CAERE Corp.
[58] V. Wu, R. Manmatha, E. M. Riseman, TextFinder: An Automatic System to Detect and Recognize Text in Images, IEEE Transactions on Pattern Analysis & Machine Intelligence, pp. 1224-1228, November 1999.
[59] J. K. Wu, B. M. Mehtre, Y. J. Gao, P. C. Lam, A. D. Narasimhalu, STAR - A Multimedia Database System for Trademark Registration, Lecture Notes in Computer Science: Applications of Databases, vol. 819, pp. 109-122, 1994.
[60] Y. Xiong, K. Turkowski, Registration, calibration, and blending in creating high quality panoramas, Proc. IEEE WACV'98, pp. 69-74, 1998.
[61] Zhigang Zhu, Guangyou Xu, Xueyin Lin, Panoramic EPI Generation and Analysis of Video from a Moving Platform with Vibration, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Fort Collins, Colorado, June 23-25, 1999, vol. 2, pp. 531-537.
[62] Z. Zhu, E. M. Riseman, A. R. Hanson, H. Schultz, Automatic Geo-Correction of Video Mosaics for Environmental Monitoring, Technical Report TR #99-28, Computer Science Department, University of Massachusetts at Amherst, April 1999.
[63] Z. Zhu, G. Xu, E. M. Riseman, A. R. Hanson, Fast Generation of Dynamic and Multi-Resolution 360° Panorama from Video Sequences, Proceedings of the IEEE International Conference on Multimedia Computing and Systems, Florence, Italy, June 7-11, 1999, vol. 1, pp. 400-406.
[64] Z. Zhu, G. Xu, Y. Yang, J. S. Jin, Camera stabilization based on 2.5D motion estimation and inertial motion filtering, IEEE Int. Conf. on Intelligent Vehicles, Stuttgart, Germany, October 28-30, 1998, vol. 2, pp. 329-334.
















E. Biographical Sketches

S. (Chandu) Ravela

Computer Science Dept.

University of Massachusetts

Amherst, Massachusetts 01003

Tel: (413) 545-0728; Fax: (413) 545-1249

ravela@cs.umass.edu


Education:

Ph.D., Computer Science, University of Massachusetts at Amherst, Mar 2002 (expected)

M.S., Computer Science, University of Massachusetts at Amherst, 1994

B.S., Computer Engineering and Computer Science, Regional Engineering College, 1991


Professional Employment

1993-                Research Assistant, Computer Science Department, Univ. of Massachusetts, Amherst, MA
1998-1999            Software Engineer, Corel Corp., Ottawa, Canada
May 1997-Sept 1997   Research Consultant, Sovereign Hill Software, Hadley, MA
May 1991-Aug 1991    Network Engineer, ALTOS India Pvt. Ltd., New Delhi, India


Summary of Research Interests

For the past seven years Chandu Ravela has been involved in computer vision research. His research career started with the development of stereo-based reflexive avoidance systems, and continues through research in tracking, visual servo control, navigation, image matching, model-image registration and pose computation, augmented reality, multi-scale representations, face detection and recognition, and image databases. His interests lie in vision problems, primarily object detection and recognition methods.


Publications Related to Proposed Work:

1. Ravela, S. and On Viewpoint Control
2. Ravela S. and Hanson A.,
3. Ravela, S. and Luo, C., (2000) Appearance-based Global Similarity Retrieval of Images. In Advances in Information Retrieval, W. Bruce Croft, Editor, Kluwer Academic Publishers, 2000.
4. Ravela, S. and Manmatha, R., (1999) Gaussian Processes in Image Filtering and Representation. Encyclopedia of Electrical and Electronic Engineering, ed. John Webster, John Wiley, 1999.
5. S. Ravela, R. Manmatha and E. Riseman, Image Retrieval by Syntactic Characterization of Appearance, US Patent 5,987,456, Nov. 16, 1999.
6. Ravela, S. and Manmatha, R. (1998). Retrieving Images by Appearance. Proc. of the International Conf. on Computer Vision (ICCV), Bombay, India, Jan 1998.


Other Publications:

1. S. Ravela, T. Schnackertz, R. Grupen, R. Weiss, E. Riseman, and A. Hanson, (1995) Temporal Registration for Assembly, Proc. Workshop on Vision for Robots, pp. 64-69, Pittsburgh, August 6, 1995.
2. S. Ravela, B. Draper, J. Lim, R. Weiss, (1995) Adaptive Tracking and Model Registration Across Distinct Aspects, Proc. IEEE International Conference on Intelligent Robots and Systems (IROS), Vol. 1, pp. 174-180, Pittsburgh, August 5-9, 1995.
3. S. Badal, S. Ravela, B. Draper, A. Hanson, (1994) A Practical Obstacle Detection and Avoidance System, Proc. IEEE Workshop on Applications of Computer Vision, pp. 97-104, Sarasota, Florida, Dec. 5-7, 1994.


Graduate Research Supervision:

Chen Luo (1998-2000)
Allison Clayton (2001-)
Ravi Uppala (2001-)
Deepak Karuppiah (2001-)
Piyanuch Silapachote (2001-)






Undergraduate Supervision:

NSF REU Program:

Jerod J. Weinman (Rose-Hulman, 2000), Michael Remillard (Siena College, 2000)


Research:


Joe Daverin (1998-1999), Adam P. Jenkins (1997-1998)


Collaborators in the last 48 months:


Mary V. Andrianopoulos (UMass Amherst)
W. Bruce Croft (UMass Amherst)
Andrew Fagg (UMass Amherst)
Roderic A. Grupen (UMass Amherst)
Allen R. Hanson (UMass Amherst)
David Jacobs (NEC Research Institute)
R. Manmatha (UMass Amherst)
Edward M. Riseman (UMass Amherst)
Rohini Srihari (SUNY Buffalo)
Zhongfei Zhang (SUNY Buffalo)


PhD Thesis Advisor:

Dr. Allen Hanson





F. Budget

G. Current and Pending Support

H. Computing Facilities

The Computer Vision Research Laboratory at the University of Massachusetts maintains a large computing base and a wide assortment of support equipment. All of the computers are connected via 100BaseT Ethernet to the Computer and Information Science Department computing facilities at large. Equipment currently available in the lab includes:




Computer Facilities:


1   DEC Alpha 5/4100 with 4 x 400 MHz CPUs and 1 GB physical memory
1   DEC 500/400 graphics workstation
5   SUN workstations
1   SGI PowerSeries 340GTX 4-node multiprocessor
1   SGI Onyx InfiniteReality
7   PC workstations (Linux and NT), including 486, PI, PII and PIII based machines
1   dual Pentium 400 MHz portable computer with instrument interface cards





CCD Sensor Calibration Facility:

The UMass CCD calibration facility is capable of performing highly accurate geometric and radiometric calibration for a wide variety of electronic optical CCD sensors and lenses. The major system components are:




2 meter horizontal translation stage with a 1 m x 1 m glass target. In this system a target point can be located in x,y,z-space to 25 microns over a 2 meter range.

5-axis (3 translations and 2 rotations) sensor pointing stage with 5 micron translation accuracy and 0.001 degree rotation accuracy.

Calibrated, integrating-sphere light source that is stable to less than 0.1% per hour and is spatially uniform across the aperture to less than 0.1%.


The system is capable of providing precise calibration data for recovering intrinsic, extrinsic, and radiometric calibration parameters. After calibration, a point in space with a known location can be projected onto the image plane of a 1k x 1k CCD array to within 0.25 pixels, and the response of a CCD with a 10-bit dynamic range to a target with a known radiance can be predicted to within 1 gray level.
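As a rough illustration of what the geometric calibration output is used for, the sketch below projects a known 3D point onto the image plane with the standard pinhole model x = K [R | t] X. The intrinsic and extrinsic values shown are made-up placeholders, not measurements from this facility.

# Pinhole projection sketch; all numeric values are illustrative placeholders.
import numpy as np

# Hypothetical intrinsics for a 1k x 1k sensor (focal lengths and principal point in pixels).
K = np.array([[1200.0,    0.0, 512.0],
              [   0.0, 1200.0, 512.0],
              [   0.0,    0.0,   1.0]])

# Hypothetical extrinsics: rotation R (identity here) and translation t in meters.
R = np.eye(3)
t = np.array([0.0, 0.0, 2.0])   # camera placed 2 m from the target plane

def project(X_world):
    """Project a 3D point (meters, world frame) to pixel coordinates."""
    X_cam = R @ X_world + t      # world -> camera coordinates
    x_hom = K @ X_cam            # camera -> homogeneous image coordinates
    return x_hom[:2] / x_hom[2]  # perspective divide -> pixel coordinates

print(project(np.array([0.05, -0.02, 0.0])))   # e.g. a point on the calibration target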