Defining Metrics to Automate the Quantitative Analysis of Textual Information within a Web Page

Nov 18, 2013

by Professor Vasile AVRAM, PhD
Informatics in Economy Department
Academy of Economic Studies
Bucharest, ROMANIA

1 Search Engines


1st. Collect information (keywords, URL, content, links in/out, etc.);

2nd. Analyze the collected information:
- ranked
- indexed

3rd. Store in the database (compressed?)

[Figure: search engine architecture. Recoverable labels: Search Engine; Crawler (follow links); Spider (find pages); Web Pages; Downloads Pages; Indexer (Analyze Information); Database (Cataloged Information); Results Engine; Search request; Search results.]
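
The collect, analyze, and store steps above can be illustrated with a toy inverted index. This is a minimal sketch, not how any real search engine is built: the `pages` dictionary, its example URLs, and the function names `build_index` and `search` are assumptions made for illustration, and ranking is left out entirely.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, term):
    """Return the URLs cataloged for one search term, sorted for stable output."""
    return sorted(index.get(term.lower(), set()))

# Hypothetical pages standing in for what a crawler would download.
pages = {
    "http://example.com/a": "web page metrics",
    "http://example.com/b": "hidden text in a web page",
}
index = build_index(pages)
print(search(index, "hidden"))
```

A search request is answered from the cataloged information only; the pages themselves are not read again at query time.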

2 Ranking and SEO

Common page ranking criteria:

- Location: the position of the keyword;

- Frequency: the frequency with which the search term appears on the page;

- Links: the type and number of links on a web page;

- Click-throughs: the number of click-throughs the site has versus the click-throughs of the other pages shown in the page ranking.
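
The first two criteria, location and frequency, can be measured directly on the page text. The sketch below is one possible reading, assuming that "location" means the index of the first matching word and that matching is case-insensitive on whole words; the function name `keyword_stats` is hypothetical.

```python
import re

def keyword_stats(text, keyword):
    """Return (first_position, frequency) of keyword among the words of text.

    first_position is the 0-based index of the first matching word,
    or None when the keyword does not occur.
    """
    words = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    positions = [i for i, word in enumerate(words) if word == target]
    first = positions[0] if positions else None
    return first, len(positions)

print(keyword_stats("SEO tips: learn SEO fast", "seo"))
```

Links and click-throughs need data from outside the page, so they are not covered here.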

3 Hiding information by exploiting CSS features

Figure 1 Aspect of a webpage with CSS enabled (left) and CSS disabled (right)

Figure 2 The source of the web page

Figure 3 What a spider sees in the page
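
The gap between what the user sees and what a spider sees can be reproduced with a few lines of code. The sketch below is an assumption-laden stand-in for a real spider: it uses Python's standard `html.parser`, ignores CSS completely, and therefore returns text that a browser would hide. The class name `TextSpider`, the helper `spider_text`, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser

class TextSpider(HTMLParser):
    """Collect text nodes, skipping the content of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # CSS rules are never applied, so hidden elements are kept.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def spider_text(html):
    """Return all text nodes of the document joined by single spaces."""
    parser = TextSpider()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

html = """<html><head><style>.h{display:none}</style></head>
<body><p>Visible offer</p><p class="h">cheap cheap cheap keywords</p></body></html>"""
print(spider_text(html))
```

A browser with CSS enabled would render only "Visible offer", while the extracted text also contains the paragraph hidden by `display:none`.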

4 Determining the effective amount of text information (EATI)
within a web page




Figure 4 A snapshot of the page (IECapt; [6])

The Effective Amount of Text Information (EATI) is defined as the ratio of the amount of text information obtained by applying optical character recognition (OCR) to the snapshot of the web page (figure 4), denoted ATIOCR, to the text information extracted by the spider (figure 3), denoted TIES:

$$\mathrm{EATI} = \frac{\mathrm{ATIOCR}}{\mathrm{TIES}} \qquad (1)$$




The value of the ratio can be:

- less than 1, in which case the page contains hidden information in inverse proportion to the value of the metric (the smaller the metric, the larger the amount of hidden text);

- equal to 1, the ideal case, when what is shown is what is contained;

- greater than 1, in which case there is extra text information, signaling that the page has images containing text, which in most cases is not considered when ranking. The larger the value, the more extra text there is.
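
The ratio and its three cases can be written down as a short sketch. The slides do not say how an "amount of text" is measured, so the word count used in `amount_of_text` is an assumption, as are the function names and the labels returned by `interpret`.

```python
def amount_of_text(text):
    """Assumed measure of text amount: number of whitespace-separated words."""
    return len(text.split())

def eati(ocr_text, tags_text):
    """EATI as the ratio ATIOCR / TIES.

    ocr_text  -- text recognized by OCR on the page snapshot (ATIOCR)
    tags_text -- text extracted by the spider (TIES)
    """
    ties = amount_of_text(tags_text)
    if ties == 0:
        raise ValueError("TIES is zero: the spider extracted no text")
    return amount_of_text(ocr_text) / ties

def interpret(value):
    """Map an EATI value to one of the three cases listed above."""
    if value < 1:
        return "hidden text"
    if value > 1:
        return "extra text in images"
    return "shown equals contained"

value = eati("Visible offer", "Visible offer cheap cheap cheap keywords")
print(round(value, 3), interpret(value))
```

In practice OCR output is noisy, so a tolerance band around 1 would probably be more useful than the exact comparison used here.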

The working procedure used to evaluate the metric involves the following three steps and the corresponding types of tools:

1st. Use a spider to extract the text information within the web page and determine the TIES value required in formula (1). The spider we built is based on the theory in [4], the libraries available at [6], and our own functions to clean up the extracted text;

2nd. Use a snapshot application that can be called from within a robot to take a snapshot of the page from step one and save it in an image format;

3rd. Apply an OCR tool (here, ReadIRIS Pro 11) to the image saved in the previous step and obtain the recognized text required to determine ATIOCR in (1).
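
The three steps can be chained as in the sketch below. It is only an outline of the workflow: `extract_text`, `take_snapshot`, and `run_ocr` are hypothetical callables standing in for the spider, the snapshot program, and the OCR tool, none of which is specified here beyond what the slides name. Text amounts are again assumed to be word counts.

```python
from typing import Callable, Optional

def measure_page(url: str,
                 extract_text: Callable[[str], str],
                 take_snapshot: Callable[[str], str],
                 run_ocr: Callable[[str], str]) -> dict:
    """Run the three steps for one page and return TIES, ATIOCR and EATI."""
    tags_text = extract_text(url)      # step 1: the spider extracts the text
    image_path = take_snapshot(url)    # step 2: save a snapshot, return its path
    ocr_text = run_ocr(image_path)     # step 3: OCR the saved image

    ties = len(tags_text.split())
    atiocr = len(ocr_text.split())
    ratio: Optional[float] = atiocr / ties if ties else None
    return {"TIES": ties, "ATIOCR": atiocr, "EATI": ratio}

# Stub tools used only to show the call shape.
result = measure_page(
    "http://example.com",
    extract_text=lambda url: "visible offer hidden keywords",
    take_snapshot=lambda url: "snapshot.png",
    run_ocr=lambda path: "visible offer",
)
print(result)
```

Passing the tools in as parameters keeps the outline independent of any particular spider, snapshot utility, or OCR product.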

A. Determining the textual information contained by graphic elements (TIG) metric

The procedure used to determine the textual information contained by graphic elements (denoted TIG) within a web page is:

1st. Use a spider to extract the graphic elements (images, pictures, shapes, etc.) together with their positional coordinates, and recompose a working web page of the same size as the original that contains only those graphic elements, positioned at their proper coordinates;

2nd. Use a snapshot application that can be called from within a robot to take a snapshot of the working page from step one and save it in an image format accepted as input by the OCR tool;

3rd. Apply an OCR tool to the image saved in the previous step and obtain the recognized text required to determine the TIG value.
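
The recomposition in the first step can be sketched as follows, assuming the coordinates of each graphic element have already been obtained (for example from a rendering engine, which is outside this sketch). The function name `compose_graphics_page` and the dictionary keys `src`, `x`, `y`, `w`, `h` are hypothetical.

```python
from html import escape

def compose_graphics_page(width, height, images):
    """Build a working page of the given size containing only the images.

    images -- iterable of dicts with keys src, x, y, w, h (pixels).
    """
    parts = [f'<div style="position:relative;width:{width}px;height:{height}px">']
    for img in images:
        parts.append(
            f'<img src="{escape(img["src"], quote=True)}" '
            f'style="position:absolute;left:{img["x"]}px;top:{img["y"]}px;'
            f'width:{img["w"]}px;height:{img["h"]}px">'
        )
    parts.append("</div>")
    return "\n".join(parts)

page = compose_graphics_page(
    800, 600, [{"src": "logo.png", "x": 10, "y": 20, "w": 100, "h": 50}]
)
print(page)
```

The resulting markup can then be handed to the snapshot and OCR steps in place of the original page.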

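B. Determining the quantity of textual information shown to the user (QTISU) metric

QTISU is the OCR-recognized text that remains after the text contained by graphic elements is subtracted:

$$\mathrm{QTISU} = \mathrm{ATIOCR} - \mathrm{TIG} \qquad (2)$$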

C. Determining the text information shown to the user (TISU) from tags metric

$$\mathrm{TISU} = \frac{\mathrm{QTISU}}{\mathrm{TIES}} \times 100 \qquad (3)$$

- TISU = 100: what is shown equals what is extracted by the spider (no hidden information is used);

- TISU < 100: only part of the textual information contained by tags is shown. The lower the value, the more textual information is hidden.

D. The percent of textual information revealed by graphic elements to the user (TIRGU) metric

$$\mathrm{TIRGU} = \frac{\mathrm{TIG}}{\mathrm{ATIOCR}} \times 100 \qquad (4)$$

- TIRGU = 100: the entire text information shown to the user is contained only by the graphic elements;

- TIRGU < 100: the percentage of the textual information revealed to the user by graphic elements. The lower the value, the more of the shown textual information comes from tags.
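
A small worked example may help tie metrics (2) to (4) together. The sketch assumes the formulas read QTISU = ATIOCR - TIG, TISU = 100 * QTISU / TIES, and TIRGU = 100 * TIG / ATIOCR, with all amounts counted in the same unit (for instance words); the function name `derived_metrics` and the sample figures are hypothetical.

```python
def derived_metrics(atiocr, tig, ties):
    """Compute QTISU, TISU (%) and TIRGU (%) from the three measured amounts.

    atiocr -- amount of text recognized by OCR on the full page snapshot
    tig    -- amount of text recognized by OCR on the graphics-only page
    ties   -- amount of text extracted by the spider from tags
    """
    if atiocr <= 0 or ties <= 0:
        raise ValueError("ATIOCR and TIES must be positive")
    qtisu = atiocr - tig              # formula (2)
    tisu = 100 * qtisu / ties         # formula (3)
    tirgu = 100 * tig / atiocr        # formula (4)
    return {"QTISU": qtisu, "TISU": tisu, "TIRGU": tirgu}

# Assumed sample: 120 words seen by OCR, 30 of them inside images,
# 100 words extracted from tags by the spider.
print(derived_metrics(atiocr=120, tig=30, ties=100))
```

With these sample figures, 90 of the 100 words in tags reach the user (TISU = 90), and a quarter of the visible text comes from graphic elements (TIRGU = 25).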

5 Conclusions

References

[1] Jorge Cardoso (ed.), Semantic Web Services: Theory, Tools and Applications, IGI Global © 2007, Books24x7. <http://common.books24x7.com/book/id_20775/book.asp>

[2] Vasile Avram, “Effective Amount of Text Information (EATI) in a Web Page - A Proposal for a New Metric and Method to Determine”, The Proceedings of the 9th International Conference on Informatics in Economy, May 2009, Editura Economică, ISBN 978-606-505-172-2, pp. 163-168

[3] Jerri L. Ledford, SEO Search Engine Optimization Bible, Wiley Publishing, 2008

[4] Google, Hidden text and links, Webmaster Tools, www.google.com

[5] Michael Schrenk, Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL, No Starch Press, 2007, Books24x7. <http://common.books24x7.com/book/id_22218/book.asp>

[6] http://www.sourceforge.org, Open Source PHP libraries for robots development

[7] P.J. Deitel, H.M. Deitel, Internet and World Wide Web How to Program, fourth edition, Prentice Hall, 2008, pages 160-190

[8] World Wide Web Consortium, The Specification of Standards for HTML, XHTML, CSS, XML: http://www.w3.org

[9] Vasile Avram, Internet Technologies for Business: Documents and Websites - structure and description languages, http://www.avrams.ro/lecture-notes.htm

[10] Yahoo! Search Content Quality Guidelines, www.yahoo.com

[11] SEO tools - Search Engine marketing, www.seologic.com