Wavelet-Gradient-Fusion for Video Text Binarization


a Sangheeta Roy, b Palaiahnakote Shivakumara, c Partha Pratim Roy and b Chew Lim Tan

a Tata Consultancy Services, Kolkata, India
b School of Computing, National University of Singapore, Singapore
c Laboratoire d'Informatique, Université François Rabelais, Tours, France

a roy.sangheeta@tcs.com
b {shiva, tancl}@comp.nus.edu.sg
c partha.roy@univ-tours.fr


Abstract

Achieving a good character recognition rate on video images is not as easy as on scanned documents because of the low resolution and complex background of video images. In this paper, we propose a new method that fuses the horizontal, vertical and diagonal information obtained by the wavelet and the gradient on text line images to enhance the text information. We apply k-means with k=2 on row-wise and column-wise pixels separately to extract possible text information. The union of the row-wise and column-wise clusters provides the text candidate information. With the help of the Canny edge map of the input image, the method identifies disconnections based on a mutual nearest neighbor criterion on end points, and it compares the disconnected area with the text candidates to restore the missing information. Next, the method uses connected component analysis to merge sub-components based on a nearest neighbor criterion. The foreground (text) and background (non-text) are separated based on the new observation that the color values at the edge pixels of a component are larger than the color values of the pixels inside the component. Finally, we use the Google Tesseract OCR engine to validate our results, and the results are compared with baseline thresholding techniques to show that the proposed method is superior to existing methods in terms of recognition rate on 236 video and 258 ICDAR 2003 text lines.


Keywords — Wavelet-Gradient-Fusion, Video text lines, Video text restoration, Video character recognition

I. INTRODUCTION


Character recognition in document analysis is the most successful application in the field of pattern recognition. However, if we test the same OCR engine on video scene characters, it reports poor accuracy, because the OCR was developed mainly for scanned document images, which have simple backgrounds and high contrast, and not for video images, which have complex backgrounds and low contrast. It is evident from the natural scene character recognition methods [1-6] that the document OCR engine does not work for camera-based natural scene images because binarization fails to handle non-uniform backgrounds and non-uniform illumination. Therefore, a poor character recognition rate (67%) is reported for the ICDAR 2003 competition data [7]. This shows that, despite the high contrast of camera images, the best accuracy reported so far is 67%; thus achieving a better character recognition rate for video images is still an elusive goal, because of the lack of a good binarization method that can tackle both low contrast and complex background to separate foreground and background accurately [8]. It is noted that the character recognition rate varies from 0% to 45% [8] if we apply OCR directly to video text, which is much lower than scene character recognition accuracy. Our experiments with the existing baseline methods of Niblack [9] and Sauvola et al. [10] show that thresholding techniques give poor accuracy for video and scene images. It is reported in [11] that the performance of these thresholding techniques is not consistent, because the character recognition rate changes as the application and dataset change. In this paper, we attempt to address this by proposing a new method for the separation of foreground (text) and background (non-text) such that the OCR engine provides better accuracy than reported in the literature.

Several papers have addressed the video text binarization and localization problem based on edge, stroke, color and corner information to improve the character recognition rate. Ntirogiannis et al. [12] proposed a binarization method based on baseline and stroke width extraction to obtain the body of the text information, followed by convex hull analysis with adaptive thresholding to obtain the final text information. However, this method focuses on artificial text, where pixels have uniform color, and not on both artificial and scene text, where pixels do not have uniform color values. An automatic binarization method for color text areas in images and video based on a convolutional neural network was proposed by Saidane and Garcia [13]; its performance depends on the number of training samples. Recently, an edge-based binarization method for video text images was proposed by Zhou et al. [14] to improve the video character recognition rate. This method takes the Canny edge map of the input image as input and proposes a modified flood fill algorithm to fill small gaps on the contour. It works well for small gaps but not for big gaps on the contours. In addition, the method's primary focus is graphics text and big fonts, but not both graphics and scene text.

From the above discussion, it can be concluded that there are methods to improve the video character recognition rate through binarization, but these methods concentrate on big-font graphics text in video rather than on both graphics and scene text, where we can expect much more variation in contrast and background compared to graphics text. Therefore, improving video character recognition through binarization irrespective of text type, contrast and background complexity is challenging. Hence, in this work, we propose a new Wavelet-Gradient-Fusion (WGF) method based on the fusion of wavelet and gradient information, together with a new way of obtaining text candidates, to overcome the above problems.

II. PROPOSED METHOD

While we note that there are several sophisticated methods for text line detection in video irrespective of contrast, text type, orientation and background variation, we use our earlier method [15], based on a Laplacian approach and skeleton analysis, to segment the text lines from the video frames. The output of this text detection method is therefore the input to the method proposed in this work. The multi-oriented text lines segmented from the video frames are converted to horizontal text lines based on the direction of the text lines; hence non-horizontal text lines are treated as horizontal text lines to make the implementation easier. The proposed method is structured into four sub-sections. In Section A, we propose a novel method to fuse wavelet and gradient information to enhance the text information in video text lines. Text candidates are obtained by a new way of clustering on the enhanced image in Section B. The possible text information is restored with the help of the Canny edge map of the input image and the text candidate image in Section C. Finally, in Section D, the method to separate foreground and background is presented based on the color features of the edge pixels and the pixels inside the components.


A. Wavelet-Gradient-Fusion Method

It is noted that wavelet decomposition is good for enhancing low contrast pixels in the video frame because its multi-resolution analysis gives horizontal (H), vertical (V) and diagonal (D) information, while the gradient operation in the same directions on the video image gives fine detail of the edge pixels in the video text line image. To overcome the problems of unpredictable video characteristics, the work presented in [16] suggested fusing the values given by the low bands of the input images to increase the resolution of the image. Inspired by this work, we propose an operation that chooses the highest pixel value among the pixel values of the different sub-bands corresponding to the wavelet and the gradient as the fusion criterion. This is shown in Figure 1, where one can see how the sub-bands of the wavelet are fused with the gradient images and the final fusion image is obtained by fusing the Fusion-1, Fusion-2 and Fusion-3 images. For example, for the input image shown in Figure 2(a), the method compares the pixel values in the horizontal wavelet (Figure 2(b)) with the corresponding pixel values in the horizontal gradient (Figure 2(c)) and chooses the highest pixel value to obtain the fusion image shown in Figure 2(d). In the same way, the method obtains the fusion image for the vertical wavelet and the vertical gradient as shown in Figure 2(e)-(g), and for the diagonal wavelet and the diagonal gradient as shown in Figure 2(h)-(j). The same operation is performed on the three fused images to get the final fused image shown in Figure 2(k), where we can see that the text information is sharpened compared to the results shown in Figure 2(d), (g) and (j).
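As an illustration only, the following Python sketch shows one way to implement this fusion rule. The choice of a one-level Haar wavelet (via PyWavelets), Sobel derivatives as the directional gradient images, nearest-neighbor resizing of the half-size sub-bands back to the input size, and min-max scaling of every map to [0, 255] before the pixel-wise maximum are all assumptions, since the paper does not fix these details; the function name wavelet_gradient_fusion is ours.

import numpy as np
import cv2
import pywt

def wavelet_gradient_fusion(gray):
    # Sketch of the Wavelet-Gradient-Fusion (WGF) enhancement.
    # Assumed details (not specified in the paper): one-level Haar wavelet,
    # Sobel derivatives as the H/V/D gradient images, and min-max
    # normalization of every map to [0, 255] before comparison.
    gray = gray.astype(np.float32)

    # One-level wavelet decomposition: horizontal, vertical, diagonal sub-bands.
    _, (wH, wV, wD) = pywt.dwt2(gray, 'haar')

    # Directional gradients of the input image.
    gH = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)   # responds to horizontal structure
    gV = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)   # responds to vertical structure
    gD = cv2.Sobel(gray, cv2.CV_32F, 1, 1, ksize=3)   # crude diagonal response

    def prep(img, size):
        # Resize to the input size and scale the absolute response to [0, 255].
        img = cv2.resize(np.abs(img).astype(np.float32), size,
                         interpolation=cv2.INTER_NEAREST)
        return cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX)

    size = (gray.shape[1], gray.shape[0])
    # Fusion-1/2/3: pixel-wise maximum of each wavelet sub-band and the
    # gradient image of the same direction.
    fusion1 = np.maximum(prep(wH, size), prep(gH, size))
    fusion2 = np.maximum(prep(wV, size), prep(gV, size))
    fusion3 = np.maximum(prep(wD, size), prep(gD, size))

    # Final fused image: pixel-wise maximum of the three fusion images.
    return np.maximum(np.maximum(fusion1, fusion2), fusion3).astype(np.uint8)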


B. Text Candidates

It is observed from the result of the previous section that the WGF method widens the gap between text and non-text pixels. Therefore, to classify text and non-text pixels, we use k-means clustering with k=2 in a novel way, applying it on each row and each column separately, as shown in Figure 3(a) and (b), where the result of row-wise clustering loses some text information while the result of column-wise clustering does not. In each case, the cluster with the higher mean of the two is considered the text cluster. This is the advantage of the new way of row-wise and column-wise clustering, as it helps in restoring the possible text information. The union of the row-wise and column-wise clustering results is considered as the text candidates for separating text and non-text information, as shown in Figure 3(c), where it is seen that the union operation includes some background information in addition to text.
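A minimal sketch of this step is given below, assuming 8-bit input and scikit-learn's KMeans as the k=2 clusterer run independently on every row and every column; the cluster with the higher mean is kept as text and the two binary results are combined with a logical OR. The function names higher_mean_cluster and text_candidates are illustrative only.

import numpy as np
from sklearn.cluster import KMeans

def higher_mean_cluster(values):
    # 1-D k-means with k=2; return a boolean mask of the cluster whose
    # mean is higher, which is taken as the text cluster.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values.reshape(-1, 1))
    text_label = int(np.argmax(km.cluster_centers_.ravel()))
    return km.labels_ == text_label

def text_candidates(fused):
    # Row-wise and column-wise k=2 clustering on the fused image,
    # followed by the union (logical OR) of the two binary results.
    h, w = fused.shape
    row_mask = np.zeros((h, w), dtype=bool)
    col_mask = np.zeros((h, w), dtype=bool)
    for i in range(h):                       # cluster each row separately
        row_mask[i, :] = higher_mean_cluster(fused[i, :].astype(np.float32))
    for j in range(w):                       # cluster each column separately
        col_mask[:, j] = higher_mean_cluster(fused[:, j].astype(np.float32))
    return row_mask | col_mask               # union of row-wise and column-wise clusters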


Figure 1. Flow diagram for the wavelet-gradient-fusion: the input text line image is decomposed by the wavelet (H, V, D) and the gradient (H, V, D); the corresponding sub-bands give Fusion-1, Fusion-2 and Fusion-3, which are combined into the final fused image.

Figure 2. Intermediate results for the WGF method: (a) input text line image, (b) horizontal wavelet, (c) horizontal gradient, (d) Fusion-1 of (b) and (c), (e) vertical wavelet, (f) vertical gradient, (g) Fusion-2 of (e) and (f), (h) diagonal wavelet, (i) diagonal gradient, (j) Fusion-3 of (h) and (i), (k) fusion of Fusion-1, Fusion-2 and Fusion-3.

Figure 3. Text candidates for text binarization: (a) k-means clustering row-wise, (b) k-means clustering column-wise, (c) union of (a) and (b).


C. Smoothing

It is observed from the text candidates that the shape of each character is almost preserved, although the image may also contain other background information. Therefore, the method considers the text candidate image as the reference image to clean up the background. The method identifies disconnections in the Canny edge map of the input image by testing a mutual nearest neighbor criterion on end points, as shown in Figure 4(a), where disconnections are marked by red rectangles. The mutual nearest neighbor criterion is defined as follows: if P1 is nearest to P2 then P2 should be nearest to P1, where P1 and P2 are two end points. This step is needed because Canny gives good edge information for video text line images but at the same time produces many disconnections due to low contrast and complex background. Each identified disconnection area is matched with the same position in the text candidate image locally to restore the missing text information, since the text candidate image does not lose much text information compared to the Canny edge image, as shown in Figure 4(b), where almost all components are filled by a flood fill operation. However, noisy pixels can still be seen in the background. To eliminate them, we perform projection profile analysis, which results in clear text information with a clean background, as shown in Figure 4(c).
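The mutual nearest neighbor test on end points could be sketched as below. The end-point list itself (e.g. Canny pixels with a single edge neighbor) and the use of Euclidean distance are assumptions; only the mutuality condition comes from the text.

import numpy as np

def nearest_neighbor(idx, pts):
    # Index of the closest other end point to pts[idx] (Euclidean distance).
    d = np.linalg.norm(pts - pts[idx], axis=1)
    d[idx] = np.inf                                  # ignore the point itself
    return int(np.argmin(d))

def mutual_nearest_pairs(end_points):
    # Pairs (i, j) of end points that are each other's nearest neighbor;
    # each such pair is treated as the two sides of one disconnection.
    pts = np.asarray(end_points, dtype=np.float32)
    pairs = []
    for i in range(len(pts)):
        j = nearest_neighbor(i, pts)
        if i < j and nearest_neighbor(j, pts) == i:  # mutual, listed once
            pairs.append((i, j))
    return pairs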



D. Foreground and Background Separation

The method considers the text in the smoothed image obtained from step C as connected components and analyses them by fixing a bounding box to merge any sub-components based on the nearest neighbor criterion, as shown in Figure 5(a). For each component in the merged image, the method extracts the maximum color value from the input image over the pixels of that component. It is found from the results of this maximum color extraction that the extracted color values come from the border/edge of the components. This is valid because the color values at or near the edges are usually higher than those of the pixels inside the components when there are holes inside the component. This observation helps us to find the hole of each component by setting low values to black and high values to white, as shown in Figure 5(b). After separating text and non-text, the result is fed to the OCR engine [17] to test the recognition results. For example, for the result shown in Figure 5(b), the OCR engine recognizes the whole text correctly, as shown in Figure 5(c), where the recognition result is given in quotes.
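A rough sketch of this separation is given below, assuming an 8-bit grayscale input and OpenCV connected components on the smoothed mask. Otsu's threshold inside each component is used here as a stand-in for the low/high cut between edge values and interior values, which the paper describes but does not formalize; separate_foreground is an illustrative name.

import numpy as np
import cv2

def separate_foreground(gray, smoothed_mask):
    # Per-component split of the input gray values into low (background, black)
    # and high (text, white). The cut inside each component is taken here as
    # Otsu's threshold on that component's values (an assumption).
    n_labels, labels = cv2.connectedComponents(smoothed_mask.astype(np.uint8))
    out = np.zeros_like(gray, dtype=np.uint8)
    for lab in range(1, n_labels):                   # label 0 is the background
        comp = labels == lab
        vals = gray[comp].astype(np.uint8)
        if vals.size < 2:
            continue
        t, _ = cv2.threshold(vals.reshape(-1, 1), 0, 255,
                             cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        out[comp] = np.where(vals > t, 255, 0).astype(np.uint8)
    return out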

III. EXPERIMENTAL RESULTS

As there is no standard database for evaluating the proposed method, we create our own dataset, which includes 236 text lines selected from different news video sources and 258 text lines selected randomly from ICDAR 2003 competition scene images. In total, 494 text line images are considered. To measure the performance of the proposed method, we use the character recognition rate. For the comparative study, we implement two baseline binarization methods [9, 10], and the methods are evaluated in terms of recognition rate.

Sample results of the proposed and existing methods on both video and ICDAR data are shown in Table 1, where we consider input images with low contrast, complex background, distorted text and different fonts. It is noticed from the recognition results (given in quotes in Table 1) that the OCR engine recognizes almost all the results given by the proposed method, while Niblack's method gives better results than Sauvola's method and worse results than the proposed method. For Sauvola's method, the OCR returns nothing (" ") for almost all the input images. The reason for these poor results lies in the use of thresholds for binarization, because it is hard to fix optimal thresholds for video text lines due to their unpredictable characteristics. On the other hand, the proposed method does not fix any threshold; it takes advantage of the Wavelet-Gradient-Fusion and of the color features for foreground and background separation. However, the proposed method sometimes fails for very low contrast images, as shown in the last rows of the video and ICDAR data in Table 1. Therefore, there is scope for further improvement.

The OCR engine is also used to calculate the recognition rate for the input images without binarization (the "Before" column in Table 2 and Table 3), and the results are reported in Table 2 and Table 3 for the video data and the ICDAR data, respectively. The OCR engine gives slightly better results for the ICDAR data than for the video data. This is expected because the ICDAR data contains high contrast images with complex backgrounds, whereas the video data is of low contrast and also contains complex backgrounds. The results reported in Table 2 and Table 3 show that the proposed method provides improvements of 16.08% for the video data and 15.79% for the ICDAR data compared to the recognition results before binarization.

Figure 4. Process of smoothing: (a) gap identification based on mutual nearest neighbor criteria, (b) disconnections are filled and noisy pixels identified, (c) clear connected components.

Figure 5. Foreground and background separation by analyzing the color values at edge pixels and inside the components: (a) color values of edge pixels and inside the character, (b) foreground and background separated, (c) recognition result "successive year".

Table 1. Sample results of the proposed and existing methods (OCR output in quotes; the input images are not reproduced here)

Video data:
  Proposed (WGF)         Niblack [9]          Sauvola [10]
  "1-800 EH-70000"       "K•{.-Gab 605 0*"    "••1 .;* 000 0"
  "$70°°"                "ECE?"               "FHEIEE"
  "INFOHME"              " "                  " "
  "Rapld"                "Rama'"              " "
  "ODOUR"                " "                  " "
  "and Connect"          " "                  "••n¤•¤¤¤n• c•¤"
  "$IILIGa gl"           "EZTJIEIIH"          " "

ICDAR 2003 competition data:
  Proposed (WGF)         Niblack [9]          Sauvola [10]
  "DISCOVER"             " "                  " "
  "skimmed"              " "                  " "
  "The"                  " "                  " "
  "EXIT"                 " "                  " "
  "AT HE-E * T I CS"     " "                  " "
  "G E E K"              " "                  " "
  "fa(UE'HTS"            "GENTS"              "A C}££a`|'1"


Table 2. Recognition rate of the proposed and existing methods on video data (in %)

Method      Before    After    Improvement
Proposed    48.49     64.57    +16.08
Niblack     48.49     47.03    -1.46
Sauvola     48.49     17.26    -31.23

(The "Before" column is the recognition rate of the OCR without binarization and is common to all methods.)

Table 3. Recognition rate of the proposed and existing methods on ICDAR data (in %)

Method      Before    After    Improvement
Proposed    51.62     67.41    +15.79
Niblack     51.62     42.30    -9.32
Sauvola     51.62     19.98    -31.64

IV. CONCLUSION

In this work, we have proposed a new fusion method based on wavelet sub-bands and gradients of different directions, and we have shown that this fusion helps in enhancing text information. We used the k-means clustering algorithm in a row-wise and column-wise fashion to obtain text candidates. A mutual nearest neighbor criterion is proposed to identify true pairs of end pixels and restore the missing text information. To separate foreground and background, we exploit the color values at the edges and inside the components. The experimental results of the proposed and existing methods show that the proposed method outperforms the existing methods in terms of recognition rate. However, the reported recognition rate is not as high as in document analysis because the Tesseract OCR engine is not font independent and robust. We therefore plan to explore learning-based methods to improve the recognition rate on a larger dataset.

ACKNOWLEDGMENT

This work was done jointly by the National University of Singapore (NUS), Singapore and the Département Informatique, Polytech'Tours, France. This research is also supported in part by A*STAR grant 092 101 0051 (WBS no. R252-000-402-305).

REFERENCES

[1] D. Doermann, J. Liang and H. Li, "Progress in Camera-Based Document Image Analysis", In Proc. ICDAR, 2003, pp. 606-616.
[2] J. Zhang and R. Kasturi, "Extraction of Text Objects in Video Documents: Recent Progress", In Proc. DAS, 2008, pp. 5-17.
[3] K. Wang and S. Belongie, "Word Spotting in the Wild", In Proc. ECCV, 2010, pp. 591-604.
[4] X. Tang, X. Gao, J. Liu and H. Zhang, "A Spatial-Temporal Approach for Video Caption Detection and Recognition", IEEE Trans. Neural Networks, 2002, pp. 961-971.
[5] M. R. Lyu, J. Song and M. Cai, "A Comprehensive Method for Multilingual Video Text Detection, Localization, and Extraction", IEEE Trans. CSVT, 2005, pp. 243-255.
[6] A. Mishra, K. Alahari and C. V. Jawahar, "An MRF Model for Binarization of Natural Scene Text", In Proc. ICDAR, 2011, pp. 11-16.
[7] L. Neumann and J. Matas, "A Method for Text Localization and Recognition in Real-World Images", In Proc. ACCV, 2011, pp. 770-783.
[8] D. Chen and J. M. Odobez, "Video Text Recognition Using Sequential Monte Carlo and Error Voting Methods", Pattern Recognition Letters, 2005, pp. 1386-1403.
[9] W. Niblack, "An Introduction to Digital Image Processing", Prentice Hall, Englewood Cliffs, 1986.
[10] J. Sauvola, T. Seppänen, S. Haapakoski and M. Pietikäinen, "Adaptive Document Binarization", In Proc. ICDAR, 1997, pp. 147-152.
[11] J. He, Q. D. M. Do, A. C. Downton and J. H. Kim, "A Comparison of Binarization Methods for Historical Archive Documents", In Proc. ICDAR, 2005, pp. 538-542.
[12] K. Ntirogiannis, B. Gatos and I. Pratikakis, "Binarization of Textual Content in Video Frames", In Proc. ICDAR, 2011, pp. 673-677.
[13] Z. Saidane and C. Garcia, "Robust Binarization for Video Text Recognition", In Proc. ICDAR, 2007, pp. 874-879.
[14] Z. Zhou, L. Li and C. L. Tan, "Edge Based Binarization of Video Text Images", In Proc. ICPR, 2010, pp. 133-136.
[15] P. Shivakumara, T. Q. Phan and C. L. Tan, "A Laplacian Approach to Multi-Oriented Text Detection in Video", IEEE Trans. PAMI, 2011, pp. 412-419.
[16] G. Pajares and J. M. Cruz, "A Wavelet-Based Image Fusion Tutorial", Pattern Recognition, 2004, pp. 1855-1872.
[17] Tesseract OCR, http://code.google.com/p/tesseract-ocr/.