
ASK-IT International Conference, October 2006

A Personalized Web Search System by Information Recovery and SVM

Gerardo Vega-Amabilis, Jaime Mora Vargas, Miguel González-Mendoza

ITESM-CEM. Carretera Lago de Guadalupe Km. 3.5, Atizapán de Zaragoza, Estado de México, C.P. 52926, México
A00309752, jmora@itesm.mx

LAAS-CNRS. 7, avenue du Colonel Roche, 31077 Toulouse Cedex 4, France.
mgonza@laas.fr




Abstract

One of the main activities for any Internet user is to search for information. To do this, many search engines are available to retrieve web pages containing information related to keywords or phrases given by the user. Even though search engines are very fast, the enormous amount of information available turns web searching into a manual process in which the user selects the most useful and relevant documents by hand. This process is usually very time-consuming. In this work, we propose a method to help users make a personalized on-line classification of search results, based on Support Vector Machines used as a front end to popular search engines. We use a fixed set of categories for the initial classification and allow the system to learn new categories from user feedback.

Keywords

Web search, Support Vector Machines, Agent systems.

1. Introduction

Recently, web mining technologies have grown considerably. Web mining services are able to improve the scalability, accuracy and flexibility of recommender systems. Personalized web search consists of a set of actions that improves the web experience of users.

Nowadays, it is well known that the Internet is the principal way people look for important information, [15]. The number of search requests in an engine like Google can reach 250 million on a normal day. Search engines are very popular among Internet users; in consequence, information search is one of the earliest activities people try when they first start using the Internet. According to the Pew Internet and American Life Project, [15], users have confidence in their search skills. Some statistics of this report state that:

- 84 % of Internet users use a search engine,
- 92 % of search engine users declare that they trust web search capabilities, and 52 % consider the results very trustworthy,
- 87 % of users declare that most of their searches are successful, and 17 % state that they always find the information they need.

Even if the results are satisfactory, they can be improved by intelligent tools that raise the percentage of success in finding information. It is clear that not every Internet document can be read to decide whether it is useful or not. The principal problem at this moment is to decide whether a document is useful to our search, given the large amount of information retrieved by current search engines.

Since the early 1990s, the Web has experienced exponential growth, starting with the development of a graphical interface for hypertext, also called the World Wide Web, and the introduction of Web browsers such as Mosaic (later Netscape). From 1991 to 1994, the amount of traffic on the first HTTP server was multiplied by 1000. According to the Internet Systems Consortium report, [11], there are more than 280 million registered servers nowadays.

While the Web grows in size and diversity, it has also acquired great value as an active knowledge base, due to its automatic evolution. For the first time in history, the number of authors comes close in magnitude to the number of readers.

This wealth of Web content has made it very difficult to reach the value of the information. Current search engines allow finding pages containing a group of fixed words or phrases. For every query, millions of qualified pages may exist that satisfy the search by containing the user's keywords.

This growth of information on the Web imposes new difficulties: current search engines are not customized for the individual needs of a normal user. They can return the same information independently of the user making the search. For example, if a search is made with the word "networks", some users may be interested in artificial neural networks and others in wireless networks. However, the tool will present the same pages, since the keyword fixing the search appears in the text. In a normal search, a user can obtain a group of interesting pages from the web search, but he must select the relevant ones manually from the universe of pages, spending a lot of time. This is the reason to think of a customized searching algorithm that takes the user's special needs into account.

This paper is organized as follows: Section 2 reviews related work. Section 3 provides a general background in information retrieval and text categorization methods, as well as intelligent agents and Support Vector Machines. Section 4 describes our method for performing personalized and intelligent web searches. Finally, Section 5 gives the conclusions and discusses further work.



2. Related Work

There are different works related to web search systems; some of the most important are those that organize search results into categories, [18], [20]. These kinds of systems consist of two main components: a text classifier which classifies web pages, and a user interface that presents the web pages in a category structure (allowing manipulation). The text classifier is an SVM classifier which is trained using manually classified web pages; it is then able to classify new pages on-line.


Shen et al., [19], showed that logs can become an important resource for Web mining, since, by analyzing the query log, a new relationship between Web pages, implicit links, is defined. The work compares the link-based method and the content-based method, and the results show that implicit links can improve classification performance (using SVM) compared to the baseline method or methods based on explicit links. Vassilvitskii and Brill, [21], proposed the use of interactive search engines to obtain relevance feedback that improves web search results. Their work uses the web-graph distance between two documents as a robust measure of their relative relevancy. This approach provides the user an easy way to select the desired category and also prevents spam search results.

3. Background

The web can be seen as a great encyclopedia that contains millions of pages. Immediate access to the whole scientific literature is the dream of Web researchers. Web search engines make a great quantity of scientific literature available from a variety of accessible sources in seconds. This enormous repository of documents contains interesting information for everyone; the problem is to find it over the web.

Web search engines have their basis in Information Retrieval Systems, [4], in which index keywords are used to perform a query; the answer is a compilation of documents ranked by relevance. The query languages offered by the most popular search engines allow looking for web pages in which the keywords or phrases are contained.

One of the first applications of a web-based Information Retrieval System was Archie, which supported searches across file servers using the File Transfer Protocol (FTP). In the mid-nineties, search engines started to be applied on the web. New applications such as AltaVista exploited a peculiarity called hypertext: documents are connected by semantic forms, expressed through labels (or tags), [4]. Hypertext documents can be duplicated, and in many instances they misrepresent their content in order to obtain a better ranking for queries on fixed words.

The solution to this great volume of information can be described as a searching tool that helps the user find the information that really matches his/her needs. Text categorization systems can help, since they group similar documents in such a way that the user finds personalized documents.

3.1 Search Engines

Search engines can be classified according to their evolution into three stages, [2]:

- The first generation (1995-1997): they mainly use the content of a single page and are similar to traditional Information Retrieval Systems. Examples are AltaVista, Excite and WebCrawler.
- The second generation (1997-2005): they use information outside the page, such as analysis of links and hypertexts. These are the most used at present; the first web search tool of this type was Google.
- The third generation (2005-??): this stage is scarcely beginning, and these search engines are in an experimental stage. They are based on information relevant to the user, not just on keyword matching. The experimental engines use semantic analysis, context-based determination, and selection among multiple databases, among other methods.

Since web pages are more than just text documents, web classification methods have to consider different characteristics typical of web pages, such as links and HTML tags. Given the enormous size of the web, a simple text search is not sufficiently selective to obtain a manageable result. For example, the Google index is composed of more than 8,000 million URL addresses, one of the best of its kind, constituting one of the most detailed collections of the most useful pages on the Internet. The heart of the Google software is PageRank, a classification system for web pages developed by the founders Larry Page and Sergey Brin at Stanford University, [6].

PageRank is based on the web's democratic nature and uses its extensive link structure as an indicator of the value of an individual page. Google interprets a link from page A to page B as a vote from page A for page B. Google also examines other characteristics beyond the number of votes a page receives. Votes cast by important pages carry more weight and help to make other pages "important" as well. The classification given by PageRank is thus democratic in the sense that no particular page is given special preference.
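As a rough illustration of the voting idea just described, the following minimal Python sketch runs the classical power-iteration form of PageRank on a toy three-page link graph; the graph, damping factor and iteration count are illustrative assumptions, not Google's actual implementation.

# Minimal power-iteration sketch of the PageRank idea described above.
# The tiny link graph, damping factor and iteration count are illustrative assumptions.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # A link from q to p acts as a "vote" weighted by q's own rank.
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new_rank[p] = (1 - damping) / len(pages) + damping * incoming
        rank = new_rank
    return rank

# Example: page A links to B and C, B links to C, C links back to A.
print(pagerank({"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}))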

3.2 Text Categorization

One of the principal techniques used to handle and organize information is Text Categorization (TC). TC is used to classify news, to find information on the web, or to filter out unwanted e-mail. The goal of TC is the classification of documents into a fixed, predefined number of categories. Every document can belong to zero, one, or multiple categories. Using machine learning, the goal is to train a system to classify from examples in a supervised way. Since categories can overlap, each category can be treated as a binary classification problem.

One of the problems in text classification is to convert a document, composed of characters, into a representation suited to the classification algorithms. Research in information retrieval, [4], suggests that words work well as representation units. This is a text representation through attribute-value pairs. Every distinct word corresponds to a feature, and the value TF(d) is the number of occurrences of the word in the document d. To avoid useless representations, a word is considered only if it is present more than three times. The representation is thus a vector space model. Documents are represented as vectors in a Euclidean multidimensional space. Every axis of this space corresponds to a token (word) in the text. This representation scheme leads to a high-dimensional feature space; it can contain 10,000 or more dimensions. It should be considered that not all the axes are important. Many conjunctions and prepositions (and, of, before, from, ...) appear many times in a document but are irrelevant for text categorization. There may also be less frequent terms that are important for the classification process. A technique called Inverse Document Frequency (IDF), [4], seeks to consider rare terms as important as frequent terms. If D is the set of documents and D_t is the set of documents that contain the term t, a way of representing IDF is:

IDF(t) = log((1 + |D|) / |D_t|)                                        (1)

If |D_t| << |D| then t will have a large IDF factor. Another important aspect is to normalize the resulting vector by dividing by the document length. This is done so that, for the same number of term matches, long documents are not considered more important than short documents.

In this way the value of a feature is TF(d) * IDF(t) / |d|.
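The following minimal Python sketch computes the feature value TF(d) * IDF(t) / |d| exactly as defined above, using equation (1) for IDF; the three-document mini-corpus is only an illustrative assumption.

# Toy sketch of the feature weighting described above: TF(d) * IDF(t) / |d|,
# with IDF(t) = log((1 + |D|) / |D_t|) as in equation (1).
import math

docs = [
    "support vector machines classify web pages".split(),
    "web search engines index web pages".split(),
    "music pages and movie pages".split(),
]

def idf(term, corpus):
    d_t = sum(1 for d in corpus if term in d)       # |D_t|
    return math.log((1 + len(corpus)) / d_t) if d_t else 0.0

def feature_value(term, doc, corpus):
    tf = doc.count(term)                            # TF(d)
    return tf * idf(term, corpus) / len(doc)        # length-normalized weight

print(feature_value("web", docs[0], docs))
print(feature_value("music", docs[2], docs))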

In order to reduce the effects of large differences in term frequency, some authors have used the ltc weight, [5]. Using the ltc weight, words with very high frequency do not dominate. This is based on the reasoning that it is important that a word appears several times, but the difference between appearing 50 or 51 times is not as great as the difference between appearing once or not at all. The ltc weighting for a feature i in a document k, for a feature set M, is:

ltc(i, k) = ((1 + log TF(i, k)) * IDF(i)) / sqrt( sum_{j in M} ((1 + log TF(j, k)) * IDF(j))^2 )        (2)
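As a small Python sketch of equation (2), the function below applies log-dampened term frequencies weighted by IDF and cosine-normalizes them over the feature set; the standard SMART-style ltc form is assumed, and the toy term frequencies and IDF values are illustrative.

# Minimal sketch of the ltc weighting in equation (2): log-dampened TF times IDF,
# cosine-normalized over the feature set M. The toy inputs are illustrative assumptions.
import math

def ltc_weights(term_freqs, idf):
    # term_freqs: {term: TF in this document}; idf: {term: IDF over the corpus}
    raw = {t: (1 + math.log(tf)) * idf[t] for t, tf in term_freqs.items() if tf > 0}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

# A term appearing 50 vs. 51 times barely changes its weight, as argued above.
print(ltc_weights({"networks": 50, "wireless": 3}, {"networks": 0.3, "wireless": 1.2}))
print(ltc_weights({"networks": 51, "wireless": 3}, {"networks": 0.3, "wireless": 1.2}))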

Another problem in the representation is reducing the dimensionality of the space. Several approaches may be used; expected entropy loss can be very useful in this kind of problem.

Expected entropy loss is a statistical measure that has been applied to feature selection in information retrieval systems. The goal of feature selection is to select a small number of features that carry as much information as possible. Entropy loss is calculated separately for each feature. It assigns a low value to features that are common to positive and negative samples, and a high value to features that are discriminant. This technique is used because it eliminates terms that carry less information.

Let C be the event indicating whether the document is a member of the specified category. Let f denote the event that the document contains the specified feature. Let ~C and ~f be their respective negations and P their probability. The theory of expected entropy loss states that the prior entropy of the class distribution is:

e = - P(C) lg P(C) - P(~C) lg P(~C).                                        (3)

The posterior entropy of the class when the feature is present is:

e_f = - P(C | f) lg P(C | f) - P(~C | f) lg P(~C | f).                      (4)

Likewise, the posterior entropy of the class when the feature is absent is:

e_~f = - P(C | ~f) lg P(C | ~f) - P(~C | ~f) lg P(~C | ~f).                 (5)

Thus, the expected posterior entropy is e_f P(f) + e_~f P(~f), and the expected entropy loss is:

e - e_f P(f) - e_~f P(~f).                                                   (6)
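The following Python sketch scores a single candidate feature with equations (3) to (6); the toy category labels and feature indicators are illustrative assumptions.

# Sketch of expected entropy loss (equations 3-6) for one candidate feature.
# `labels` marks whether each document belongs to category C, `has_feature`
# whether it contains feature f; the toy data is an illustrative assumption.
import math

def lg(x):
    return math.log2(x) if x > 0 else 0.0   # treat 0 * lg(0) as 0

def entropy(p):
    return -p * lg(p) - (1 - p) * lg(1 - p)

def expected_entropy_loss(labels, has_feature):
    n = len(labels)
    p_f = sum(has_feature) / n
    p_c = sum(labels) / n
    with_f = [c for c, f in zip(labels, has_feature) if f]
    without_f = [c for c, f in zip(labels, has_feature) if not f]
    e = entropy(p_c)                                                    # prior entropy (3)
    e_f = entropy(sum(with_f) / len(with_f)) if with_f else 0.0         # (4)
    e_nf = entropy(sum(without_f) / len(without_f)) if without_f else 0.0  # (5)
    return e - e_f * p_f - e_nf * (1 - p_f)                             # (6)

# A discriminant feature (present only in positive documents) scores high.
print(expected_entropy_loss([1, 1, 0, 0], [1, 1, 0, 0]))
# A feature common to both classes scores low.
print(expected_entropy_loss([1, 1, 0, 0], [1, 1, 1, 1]))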


3.3 Intelligent Agents

Google works with a combination of hardware and advanced software. On the one hand, the search speed offered over the web can be attributed in part to the efficiency of the search algorithm and in part to the thousands of interconnected low-cost PCs that form a parallel hyper-searching machine. On the other hand, efforts to offer customization integrated into current search engines are based on a previous categorization of the documents. Google's personalized search offers results based on a profile that contains the interests of the user, [7]. This approach requires a hardware infrastructure that can be very expensive. Another way to tackle the problem is to have a personal assistant whose goal is to perform the search using the current engines and then classify the list of results. At the top of the list, the user may find the customized, and thus preferred, documents.

To offer this functionality, the goal is to work with intelligent algorithms such as agents and artificial neural networks (Support Vector Machines in this case). On the one hand, autonomous agents, [1], can adapt very easily to changes and can communicate with other agents. On the other hand, Support Vector Machines have learning capabilities that are very useful to achieve this goal.




3.4 Support Vector Machines

Support Vector Machines (SVMs) constitute a well-known technique for training separating functions in pattern recognition tasks and for function estimation in regression problems. In several problems they have shown their generalization capabilities. The mathematical problem and its solution were settled by Vapnik and Chervonenkis, [17]. In classification tasks, the main idea can be stated as follows: given a training data set characterized by patterns x_i, i = 1, ..., l, belonging to two possible classes y_i in {-1, +1}, there exists a solution represented by the following optimization problem:


Minimize:

    W(α) = (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) - Σ_i α_i               (7)

Subject to:

    Σ_i α_i y_i = 0,    0 <= α_i <= C,   i = 1, ..., l                        (8)

and a decision function:

    f(x) = sign( Σ_i α_i y_i K(x_i, x) + b )                                  (9)

The solution to the problem formulated in eqs. (7)-(9) is a vector α* with α_i* >= 0; the patterns x_i whose α_i* are strictly greater than zero are the Support Vectors (SVs). While solving the optimization problem, the constraints in (8) are satisfied and the objective is minimized.

The above formulation is a standard Quadratic Programming (QP) problem. It is possible to define the problem in matrix notation:

Minimize:

    (1/2) α^T Q α - e^T α,   with Q_ij = y_i y_j K(x_i, x_j)                  (10)

Subject to:

    y^T α = 0,    0 <= α_i <= C                                               (11)

Implementations of SVMs include [14] and [9]. The goal is to have a fast training method.
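As a minimal illustration (not the authors' code) of the quadratic program in (7)-(11), the Python sketch below trains an off-the-shelf RBF SVM and inspects the patterns whose α_i are non-zero, i.e. the support vectors; scikit-learn and the toy two-dimensional data are assumptions.

# Minimal illustration of the QP in (7)-(11): an off-the-shelf SVM solver
# returns the non-zero alpha_i, i.e. the support vectors. The toy data and
# the (C, gamma) values are illustrative assumptions.
from sklearn.svm import SVC

X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
y = [-1, -1, 1, 1]

clf = SVC(kernel="rbf", C=32, gamma=8).fit(X, y)

print(clf.support_vectors_)                     # training patterns with alpha_i > 0
print(clf.dual_coef_)                           # y_i * alpha_i for those support vectors
print(clf.predict([[0.1, 0.0], [1.0, 0.9]]))    # class labels from decision function (9)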

3.4.1 Using SVM for text classification

Some of the advantages of using SVM for text categorization are, [12]:

- Ability to work with high-dimensional spaces. As we have seen, the dimension of text classifiers is on the order of 10,000. SVMs are well suited to perform well in such high-dimensional spaces.
- Few irrelevant features: one way to avoid a high-dimensional space is to assume that the greater part of the input features are irrelevant. Feature selection tries to determine these irrelevant features. Unfortunately, in text categorization few irrelevant features exist.
- The information in the feature vector is sparse: for every document, the corresponding vector contains few entries greater than zero. Some works, [13], suggest that SVMs are well suited to cases with sparse information.
- Most text categorization problems are linearly separable: using SVM, a linearly separable problem is solved using a small number of units.


4. Proposed Method

In this work we aim to build a system, based on intelligent agents, that offers personalized searching by means of Support Vector Machines (SVMs). An essential part of the agent is learning and the use of text categorization techniques, in such a way that it can present the results ordered in accordance with the user's preferences. As the agent learns the user's preferences, it becomes capable of ordering every new search with greater precision. The typical flow of the system, sketched in code after the list, is as follows:

1. The system displays one input field for the search.
2. The user introduces his search argument.
3. The system sorts the results in accordance with the user-selected categories.
4. The user can read the summary of each of the found web pages.
5. The user surfs the links of interest to him.
6. The user can select documents to be used as training samples.
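The Python sketch below ties the listed steps together; search_web, vectorize and the per-category classifiers are hypothetical placeholders for the web-service interface of Section 4.1, the representation of Section 3.2 and the SVMs of Section 3.4, so only the orchestration is shown.

# Sketch of the flow above, tying the pieces together. search_web, vectorize
# and classifiers are hypothetical placeholders; only the orchestration is shown.
def personalized_search(query, search_web, vectorize, classifiers):
    results = search_web(query)                        # step 2: query the engine
    for page in results:
        x = vectorize(page["summary"])                 # bag-of-words features (Section 3.2)
        # score the page against every category's SVM and keep the best-matching one
        page["category"] = max(
            classifiers,
            key=lambda c: classifiers[c].decision_function([x])[0],
        )
    # step 3: present results grouped by the user's selected categories
    return sorted(results, key=lambda p: p["category"])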


4.1 Interface with search engines

As stated earlier, indexing the web requires a great infrastructure of hardware and software; Google has more than 8 billion pages. Instead of creating such a system, we intend to take advantage of the existing infrastructure and build an interface to access the Google search engine.

We use an interface to popular search engines to initiate a search. Two of the most popular engines, Google and Yahoo, offer a Web Services interface to their engines. We use this API to access the services using the SOAP (Simple Object Access Protocol) protocol over HTTP, exchanging information in XML. Web Services provide a standard way to interoperate between different applications. One of the main advantages of Web Services is that they can be used on a large variety of platforms and provide a loosely coupled architecture. In this case we use a Java API to access these web services. The web services deliver the results of the search in a structured format.
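The paper's implementation uses a Java API; purely to illustrate the SOAP-over-HTTP exchange described above, the following Python sketch uses the zeep library, and the WSDL URL, operation name and result fields are hypothetical placeholders rather than the real Google or Yahoo API.

# Illustration only of the SOAP-over-HTTP pattern described above (the system
# itself used a Java API). The WSDL URL, operation name and result fields are
# hypothetical placeholders, not the real Google or Yahoo search API.
from zeep import Client

client = Client("http://search.example.com/service?wsdl")   # hypothetical WSDL

# zeep builds the XML envelope, posts it over HTTP and parses the XML reply
# into Python objects, which the rest of the system can classify and sort.
response = client.service.doSearch(query="support vector machines", maxResults=10)

for item in response.results:          # structure depends on the service's WSDL
    print(item.title, item.url)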


4.2 Training

The first step in training is to create a set of samples from which the system will learn. Using the web services API, we conducted a search using keywords that describe a category (e.g. "music", "education"). The search results are stored in a database. After a document is obtained, it is necessary to extract its content and convert it into a representation adapted to the learning process. For each document, the text was extracted using HTMLParser, [10]. Sun et al., [16], suggest that a better classification is obtained if the text from the body, the title and the anchors of the document is considered. We use text, title and anchors, but only words appearing in a vocabulary database (with approximately 80,000 words) were considered as valid features. Unlike Sun's work, we use a single vector space model to represent body, title and anchor text, without distinguishing where the word was found. From these features, we used the expected entropy loss to select the features that carry more information. With the selected features we create a vector that represents each document. Selecting the features with higher entropy loss, the vector was reduced from 9,000 dimensions to approximately 500 dimensions.
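As a hedged stand-in for the HTMLParser-based extraction described above, the Python sketch below collects body, title and anchor text into a single bag of words and keeps only vocabulary terms; the html.parser module and the tiny vocabulary are illustrative assumptions.

# Sketch of the extraction step: collect body, title and anchor text into one
# bag of words, keeping only vocabulary terms. The paper used the Java
# HTMLParser library; this stand-in and its tiny vocabulary are assumptions.
from html.parser import HTMLParser

class PageTextExtractor(HTMLParser):
    def __init__(self, vocabulary):
        super().__init__()
        self.vocabulary = vocabulary
        self.words = []                     # body, title and anchor words together

    def handle_data(self, data):
        for w in data.lower().split():
            w = w.strip(".,;:!?()")
            if w in self.vocabulary:
                self.words.append(w)

vocab = {"music", "education", "web", "search", "classification"}
extractor = PageTextExtractor(vocab)
extractor.feed("<html><head><title>Music education</title></head>"
               "<body><p>Web search and classification.</p>"
               "<a href='/m'>music pages</a></body></html>")
print(extractor.words)   # words from body, title and anchors that passed the filter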

For the SVM training phase, we used the LIBSVM software (Chang and Lin, [3]) with the RBF kernel. When using RBF kernels, two parameters must be chosen properly: C and γ. The goal is to identify a good (C, γ) pair so that the classifier can accurately predict unknown data. A commonly used technique is to separate the training data into two parts, one of which is treated as unknown when training the classifier. The prediction accuracy on this held-out set then more precisely reflects the performance on classifying unknown data. An improved version of this procedure is cross-validation. Chang and Lin recommend a "grid search" on C and γ using cross-validation: pairs (C, γ) are tried and the one with the best cross-validation accuracy is selected.
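The grid search can be sketched as follows; the paper uses LIBSVM's own cross-validation tools, so this scikit-learn version (whose SVC wraps LIBSVM), its parameter grid, fold count and random toy data are only illustrative assumptions.

# Sketch of the (C, gamma) grid search with cross-validation described above.
# The parameter grid, fold count and random stand-in data are assumptions.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))            # e.g. 500-dimensional document vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # stand-in category labels

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [2 ** k for k in range(-1, 13, 2)],
                "gamma": [2 ** k for k in range(-7, 5, 2)]},
    cv=5,                                   # 5-fold cross-validation accuracy
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)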

In this way, training is performed based on documents that the user chooses while surfing. In other words, documents that generate some interest in the user serve as training cases to learn the user's preferences. These documents are stored in the database using either a previously defined category or a new user-defined category. The training phase is performed in batch and, the next time the user makes a query, the system can classify documents incorporating the user's selection.


4.3 Results

The following results show the validation accuracy obtained when using cross-validation:

Table 1. Validation accuracy using cross-validation for fixed categories

Category                  C      γ      Validation Accuracy
Movies                    32     8      98.7889 %
Music                     2048   2      99.308 %
Home                      512    0.5    97.7509 %
Education                 128    8      96.5398 %
Business                  128    8      94.8097 %
Health                    128    2      90.8304 %
Science and Technology    2048   0.5    96.7128 %
Society                   32     8      95.6747 %

The results show that SVMs provide excellent generalization for Web document classification.

5. Conclusions

In this paper we have shown that Support Vector Machines can be used to classify web search results, with good results for intelligent, personalized web searches. The method exploits characteristics of SVMs such as their handling of high dimensionality and their generalization capabilities. The user can submit a query to the system, which returns the results sorted according to the selected categories. Once a document is classified, it is easier for the user to decide whether it is relevant or not. Future work can include handling multiple document formats and multi-agent systems.

Handling multiple document formats is necessary because a large part of academic and scientific information exists not only in a simple format such as HTML.

Categorization and personalization can give different results depending on the technical tools used. It is possible to think about creating a multi-agent architecture in such a form that it offers the user the best answer, going beyond the individual answer and capacity of each agent.




6. References

1. Bigus, Schlosnagle, Pilgrim, Mills, Diao. ABLE: a toolkit for building multiagent autonomic systems. IBM Systems Journal, Vol. 41, No. 3, 2002.
2. Broder, Andrei. A taxonomy of web search. ACM SIGIR Forum, Volume 36, Issue 2, Fall 2002.
3. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
4. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers, 2003.
5. Debole, Franca and Sebastiani, Fabrizio. Supervised term weighting for automated text categorization. Proceedings of the 2003 ACM Symposium on Applied Computing.
6. Google: http://www.google.com.
7. Google: http://labs.google.com/personalized.
8. Google: http://www.google.com/googleblog/2004/11/googles-index-nearly-doubles.html.
9. Ariel García-Gamboa. Increasing training speed of Support Vector Machines by Barycentric Correction Procedure. WSEAS Transactions on Systems, Issue 3, Volume 3, May 2004. ISSN 1109-2777.
10. HTMLParser: http://sourceforge.net/projects/htmlparser.
11. Internet Systems Consortium: http://www.isc.org.
12. Joachims. Text categorization with support vector machines: learning with many relevant features. In Proc. of ECML 1998, pages 137-142, Chemnitz, DE, 1998.
13. Kivinen, Warmuth, Auer. The perceptron algorithm vs. Winnow: linear vs. logarithmic mistake bounds when few input variables are relevant. Conference on Computational Learning Theory, 1995.
14. Osuna, E., Freund, R., Girosi, F. Training Support Vector Machines: an application to face detection. Proceedings of Computer Vision and Pattern Recognition, 1997, pp. 130-136.
15. PEW/Internet: http://www.pewinternet.org/report_display.asp?r=80 and http://www.pewinternet.org/PPF/r/146/report_display.asp.
16. A. Sun, E.-P. Lim, and W.-K. Ng. Web classification using support vector machine. In Proceedings of the Fourth International Workshop on Web Information and Data Management. ACM Press, 2002.
17. V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition. Akademie-Verlag, 1979.
18. Hao Chen and Susan Dumais. Bringing order to the Web: automatically categorizing search results. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 145-152, 2000.
19. Dou Shen, Jian-Tao Sun, Qiang Yang, Zheng Chen. A comparison of implicit and explicit links for web page classification. Proceedings of the 15th International Conference on World Wide Web, 643-650, 2006.
20. Bill Kules and Jack Kustanowitz. Categorizing web search results into meaningful and stable categories using fast-feature techniques. Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, 210-219, 2006.
21. Sergei Vassilvitskii and Eric Brill. Using web-graph distance for relevance feedback in web search. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 147-153, 2006.