

Scott Clements, Monash University Software Engineering, Copyright 2003.

Web Document Analysis: Improving Search Technology using Image Processing



Scott Clements

Bachelor of Software Engineering

Monash University

www.csse.monash.edu.au/~sdcle1/



Supervisor: Dr. Sid Ray


Interests and Expertise

Dr. Sid Ray



Image Processing expert


Scott Clements


Internet Technology


Software Engineering


Database Management


Interface Design


Union of Expertise


Engineering a product which uses:



Image processing & Internet Technology





Primary Goals



- To improve search quality using image processing.

- To investigate:
  - Image histogram matching to find similar images
  - Colour predominance in images





Secondary Goals



To engineer a product which has industry potential.


Project Management


Interface Design


Database Management


Information Retrieval


Background



Popular search technology [mcbryan94, brin98, pinkerton00]

- Text based
- Quality of results can be poor
- Difficult to find images

Multimedia search technology [ogle95, smith97]

- Text, image and video based
- Poor interface design
- Aimed at image processing experts
- Good use of database management systems


Software Engineering Methods












Stages



Initial program: Grey-scale image matching



Refinement 1: Colour image matching



Refinement 2: Colour predominance image matching


[Diagram of the development process. Labels: Initial program; System Refinement stages; Implement; Test; Integration; Document Results and Findings]

Image processing technique





Data types:



Histogram Data



Colour Predominance Data

[Diagram: Image (input) -> Image processing -> Data (output)]

System Architecture








Pre-processing
- Image analysis
- Format information
- Add information to database

Database

Post-processing
- Send query information
- Process query information
- Calculate search results
- Return search results

User

System Architecture continued

Pre-processing
- C
- Monash Image Lib.
- PHP/HTML

Database
- MySQL

Post-processing
- PHP/HTML

User

Colour histogram matching


Method:


Using:


Group 16 configuration


Total difference Algorithm


Requirements


Database design


Histogram analysis


Investigate:


Interface design


Relevance Feedback




Histograms (Group 16 Configuration)

Colour Histograms

- Count the number of occurrences of each colour intensity
- 256 intensities for each RGB component (24-bit image)
- Insert this information into the database

Problem: Excessive amount of information

Solution: Convert to Group 16 Configuration.
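
The reduction from 256 intensities to 16 groups per component can be sketched as follows. This is only an illustration: the slides do not say how the groups are formed, so the sketch assumes 16 equal-width groups of 16 consecutive intensities each, and the function name and sample pixels are hypothetical.

```python
from typing import Iterable, List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each in 0..255

BINS = 16                # "Group 16": 16 groups per colour component
BIN_WIDTH = 256 // BINS  # assumption: equal-width groups of 16 intensities


def group16_histograms(pixels: Iterable[Pixel]) -> Tuple[List[int], List[int], List[int]]:
    """Return (red, green, blue) histograms, each holding 16 grouped counts."""
    red = [0] * BINS
    green = [0] * BINS
    blue = [0] * BINS
    for r, g, b in pixels:
        # Integer division maps intensities 0..15 to group 0, 16..31 to group 1, ...
        red[r // BIN_WIDTH] += 1
        green[g // BIN_WIDTH] += 1
        blue[b // BIN_WIDTH] += 1
    return red, green, blue


# Hypothetical four-pixel image: two reds, one mid grey, one white.
sample = [(255, 0, 0), (250, 10, 5), (128, 128, 128), (255, 255, 255)]
red_hist, green_hist, blue_hist = group16_histograms(sample)
```

Under this assumption each image needs 48 stored values (16 per component) instead of 768.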


Database Design

Table r2red: identification, r1, r2, r3, r4, r5, r6, r7, r8, ...., r15, r16

Table r2blue: identification, b1, b2, b3, b4, b5, b6, b7, b8, ...., b15, b16

Table r2green: identification, g1, g2, g3, g4, g5, g6, g7, g8, ...., g15, g16

Table r2: identification, name, ...

**Reserved Space**

Algorithm


Aim: To find other similar images

Method: Compare each of the histograms with the query histogram

Algorithm: Total difference




Total Difference Algorithm

- Query Image versus images in the database
- Compare each histogram
- Find the positive difference between each histogram (Total Difference)
- Convert 0-300% range to a similarity rating between 0-100%
- Return the results which are within a user defined similarity rating
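
The steps above can be sketched in a few lines. The slides do not give the exact formula, so this is one plausible reading rather than the original implementation: counts are converted to percentages, each colour component contributes a difference between 0 and 100% (half the summed absolute difference), the three components add up to the 0-300% range, and the similarity rating is 100% minus a third of that total. All function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Histograms = Dict[str, Sequence[int]]  # keys: "red", "green", "blue"
CHANNELS = ("red", "green", "blue")


def to_percent(hist: Sequence[int]) -> List[float]:
    """Scale raw counts so that the histogram sums to 100%."""
    total = sum(hist)
    if total == 0:
        return [0.0] * len(hist)
    return [100.0 * count / total for count in hist]


def channel_difference(a: Sequence[int], b: Sequence[int]) -> float:
    """Difference for one component, 0..100% (assumed formula)."""
    pa, pb = to_percent(a), to_percent(b)
    return sum(abs(x - y) for x, y in zip(pa, pb)) / 2.0


def total_difference(query: Histograms, candidate: Histograms) -> float:
    """Sum of the three component differences, 0..300%."""
    return sum(channel_difference(query[c], candidate[c]) for c in CHANNELS)


def similarity(query: Histograms, candidate: Histograms) -> float:
    """Map the 0..300% total difference onto a 0..100% similarity rating."""
    return 100.0 - total_difference(query, candidate) / 3.0


def search(query: Histograms, database: Dict[str, Histograms],
           min_similarity: float) -> List[Tuple[str, float]]:
    """Return (name, similarity) pairs at or above the user-defined rating."""
    scored = [(name, similarity(query, h)) for name, h in database.items()]
    hits = [(name, s) for name, s in scored if s >= min_similarity]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical data: one image identical to the query, one with no overlap.
dark_img = {c: [4] + [0] * 15 for c in CHANNELS}
light_img = {c: [0] * 15 + [4] for c in CHANNELS}
results = search(dark_img, {"same": dark_img, "other": light_img}, 50.0)
```

Working in percentages is a choice made for this sketch so that the 0-300% range holds regardless of image size.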


Interface Design


Relevance Feedback

User Feedback:
- Clicking the similarity button
- Indicating interest in a particular image

Relevance
- Sorting results: most similar to least similar


Results and Findings














Method                           Accuracy
Grey-scale histogram matching    64%
Colour histogram matching        84%


Test Set:
- Real life photos
- Computer generated images

Weakness
- Grey-scale histogram matching (unacceptable results)
- Images with many different colours
- Spatial arrangements
- Needing to resize the images (standardisation for histograms)



Colour predominance



- Assign each pixel a colour value (if possible)
- Found that RGB was not suitable in this case
- With HSB it was much easier to find colour ranges
- Method: Using an image program, find the Hue, Saturation and Brightness ranges for each colour.
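
A minimal sketch of range-based pixel labelling in HSB follows. The slides do not list the ranges that were actually measured, so every threshold and hue boundary below is a hypothetical placeholder; only the idea (convert to HSB, then look up a range) comes from the slide. Python's standard `colorsys.rgb_to_hsv` is used, where the V component corresponds to brightness.

```python
import colorsys
from typing import Optional

# Hypothetical hue ranges in degrees, [start, end). Not the original values.
HUE_RANGES = [
    ("red", 0, 15),
    ("orange", 15, 45),
    ("yellow", 45, 70),
    ("green", 70, 165),
    ("cyan", 165, 200),
    ("blue", 200, 260),
    ("purple", 260, 290),
    ("magenta", 290, 345),
    ("red", 345, 360),
]
DARK_MAX_BRIGHTNESS = 0.2    # assumption: at or below this, the pixel is "dark"
BRIGHT_MIN_BRIGHTNESS = 0.9  # assumption: washed-out pixels above this are "bright"
MIN_SATURATION = 0.25        # assumption: below this, the hue is not trusted


def classify_pixel(r: int, g: int, b: int) -> Optional[str]:
    """Return a colour name for an 8-bit RGB pixel, or None if none applies."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    if v <= DARK_MAX_BRIGHTNESS:
        return "dark"
    if s < MIN_SATURATION:
        # Too little colour to name a hue: either very bright or unclassified.
        return "bright" if v >= BRIGHT_MIN_BRIGHTNESS else None
    hue_degrees = h * 360.0
    for name, start, end in HUE_RANGES:
        if start <= hue_degrees < end:
            return name
    return None
```

Returning `None` reflects the "if possible" qualifier: a mid grey, for example, has no usable hue and is neither dark nor bright under these thresholds.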



Algorithm Design

Analysis
- Count each occurrence of a certain colour
- Convert the occurrence result to a percentage of predominance between 0-100%

Query
- Query the database to find images which have predominant colours.
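
The analysis and query steps might look like the sketch below. It assumes each pixel has already been given a colour label (or `None` when no colour could be assigned) and that percentages are taken over all pixels, including unlabelled ones; both points are assumptions, as are the function names. The colour list matches the columns of the r3predominance table.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

COLOURS = ["red", "magenta", "purple", "blue", "cyan",
           "green", "yellow", "orange", "dark", "bright"]


def predominance(labels: Iterable[Optional[str]]) -> Dict[str, float]:
    """Convert per-pixel colour labels into percentages between 0 and 100."""
    labels = list(labels)
    total = len(labels)
    if total == 0:
        return {colour: 0.0 for colour in COLOURS}
    counts = Counter(label for label in labels if label is not None)
    return {colour: 100.0 * counts[colour] / total for colour in COLOURS}


def find_predominant(database: Dict[str, Dict[str, float]], colour: str,
                     min_percent: float) -> List[Tuple[str, float]]:
    """Return (image name, percent) pairs where the colour reaches min_percent."""
    hits = [(name, p[colour]) for name, p in database.items()
            if p[colour] >= min_percent]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical four-pixel image: two red pixels, one blue, one unlabelled.
example = predominance(["red", "red", "blue", None])
matches = find_predominant({"example.png": example}, "red", 40.0)
```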


Database Refinement

Table r3red: identification, r1, r2, r3, r4, r5, r6, r7, r8, ...., r15, r16

Table r3blue: identification, b1, b2, b3, b4, b5, b6, b7, b8, ...., b15, b16

Table r3green: identification, g1, g2, g3, g4, g5, g6, g7, g8, ...., g15, g16

Table r3: identification, name, ...

**Reserved Space**

Table r3predominance: identification, red, magenta, purple, blue, cyan, green, yellow, orange, dark, bright
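
As an illustration of how the r3predominance table could be queried, here is a small sketch using Python's built-in `sqlite3` in place of the MySQL database named in the architecture slides. Column types, the key relationship between r3 and r3predominance, the helper names and the sample rows are all assumptions; the columns of r3 that the slide elides are simply left out.

```python
import sqlite3

COLOURS = ["red", "magenta", "purple", "blue", "cyan",
           "green", "yellow", "orange", "dark", "bright"]
COLUMN_LIST = ", ".join(COLOURS)

conn = sqlite3.connect(":memory:")
# Only the two r3 columns named on the slide are created here.
conn.execute("CREATE TABLE r3 (identification INTEGER PRIMARY KEY, name TEXT)")
column_defs = ", ".join(f"{colour} REAL" for colour in COLOURS)
conn.execute(
    f"CREATE TABLE r3predominance (identification INTEGER PRIMARY KEY, {column_defs})"
)


def add_image(ident, name, percents):
    """Store an image name and its predominance percentages (missing = 0)."""
    conn.execute("INSERT INTO r3 (identification, name) VALUES (?, ?)", (ident, name))
    placeholders = ", ".join("?" for _ in COLOURS)
    values = [percents.get(colour, 0.0) for colour in COLOURS]
    conn.execute(
        f"INSERT INTO r3predominance (identification, {COLUMN_LIST}) "
        f"VALUES (?, {placeholders})",
        [ident] + values,
    )


def images_with_predominant(colour, min_percent):
    """Return (name, percent) rows, most predominant first."""
    if colour not in COLOURS:  # column names cannot be bound as parameters
        raise ValueError(f"unknown colour: {colour}")
    sql = (
        f"SELECT r3.name, p.{colour} FROM r3predominance AS p "
        f"JOIN r3 ON r3.identification = p.identification "
        f"WHERE p.{colour} >= ? ORDER BY p.{colour} DESC"
    )
    return conn.execute(sql, (min_percent,)).fetchall()


# Hypothetical sample rows.
add_image(1, "sunset.jpg", {"orange": 60.0, "red": 20.0, "dark": 20.0})
add_image(2, "forest.jpg", {"green": 75.0, "dark": 25.0})
green_images = images_with_predominant("green", 50.0)
```

One row of ten percentages per image is far smaller than the three 16-column histogram rows, which is consistent with the later remark that less information is stored for this method.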

Interface design


Interface design continued


Relevance Feedback



- Not fully suitable for colour predominance
- Use a subset of Relevance Feedback to improve usability
- Sort the results from most to least relevant


Results and Findings


Test set:
- Real life photos
- Computer generated images

Findings:
- Easy method to understand for users
- Less information stored in the database
- Accurate and efficient method to use

Algorithm              Similarity results
Colour Predominance    86%


Conclusion and Applications

Small to Medium sized system
- Example: local image database
- Colour histogram matching
- Colour predominance

Medium to Large system
- Example: Internet search engine
- Only Colour predominance
  - More efficient
  - Less information to store about images
  - Easy to understand



Future Research



- Parallelism in image analysis
- Alternative image data for histogram matching (e.g. HSB)
- Replace or extend Monash Image Library (MIL) to directly support popular internet image formats.
- Improve the documentation for colour image manipulation in MIL.
- More extensive testing of colour predominance
- Addition of predominance levels


Questions?

Are there any questions?