
SHOPPING ASSISTANT

PROJECT REPORT

Submitted in partial fulfillment of the requirements

For the award of B.Tech Degree in Computer Science & Engineering

of the University of Kerala

By

ANUSHA M P

DEEPA P

DIVYA RAJ

Eighth Semester, B.Tech Computer Science and Engineering



DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

COLLEGE OF ENGINEERING

TRIVANDRUM





COLLEGE OF ENGINEERING TRIVANDRUM



DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

CERTIFICATE


This is to certify that the project report entitled "SHOPPING ASSISTANT" is a bonafide record of the project done by ANUSHA M P, DEEPA P and DIVYA RAJ, Eighth semester, during the year 2010, in partial fulfillment of the requirements for the award of B.Tech Degree in Computer Science & Engineering of the University of Kerala.





ACKNOWLEDGEMENT




The culmination of our project requires more than a word of thanks to all those who were part of the efforts in undertaking this project and preparing the report. First of all, we thank the Almighty God for hearing our prayers and showing benevolence to us. We express our deepest gratitude to our project guide Mrs. Liji P I and project coordinator Mrs. Shreelekshmi R. for their inspiring and untiring help throughout the course of the project work. We also express our sincere gratitude to the Head of the Department, Prof. Rajasree M.S., for the help and co-operation extended to us in the completion of the project. We would also like to take this opportunity to thank all the staff members for their support and help in completing our project.








ANUSHA M P

DEEPA P

DIVYA RAJ












ABSTRACT

Search engines have become a vital part of the Internet, which provides a wealth of information scattered all over the world. A Web search engine is a tool designed to search for information on the World Wide Web. Most of the search engines used nowadays make use of an exact match procedure: the engine crawls the web, develops a database of keywords, and for exact matches lists the items in the order of some evaluation criteria of the sites. Better search results can be provided if the case based reasoning (CBR) approach, a field of artificial intelligence, is used; fuzzy databases use case based reasoning to look up entries. The basic motivation behind this project is to find the best match for a given item by specifying its related attributes and, more importantly, to do so in real time. This is different from the exact-match search engines described above, which can be used only if we know the exact name of the item.

The application helps customers purchase computers. The goal was to have the application emulate how a salesperson might act: asking simple, easy to understand questions and then making a recommendation. At its simplest, the application asks the customer to enter the desired price and performance, then recommends a computer along with a list of several alternates should the customer not like the recommended system.



CONTENTS

1. INTRODUCTION
   1.1 PROJECT OVERVIEW
   1.2 PROBLEM DEFINITION
2. REQUIREMENT ANALYSIS
   2.1 SYSTEM SPECIFICATIONS
3. SYSTEM FEATURES
   3.1 CASE BASED REASONING
   3.2 WEB CRAWLER DESIGN
   3.3 INDEXING DESIGN
   3.4 SEARCHING DESIGN
       3.4.1 SIMILARITY BASED MATCHING
       3.4.2 NEAREST NEIGHBOUR ALGORITHM
4. IMPLEMENTATION
5. USE CASE DIAGRAM
6. DESIGN
7. SCREENSHOTS
8. FUTURE ENHANCEMENTS
9. CONCLUSION




1. INTRODUCTION

A search engine using case based reasoning finds the best match for a given item from its related attributes. This is different from other search engines, which rely on exact matches. It uses case based reasoning, an AI technique.



Shopping Assistant is a web search engine built as a realistic and useful application for evaluating artificial intelligence tools for use in retail. The main aim of this search engine is to build a product search system that uses CBR to guess what product a customer is searching for.

Many products are differentiated by features. For example, a laptop has features like RAM, hard disk, maker, model, speed, warranty, etc. It is uncommon for a customer to know every single item they need to make their primary purchase work. Many products rely on services. Many of these are problems that can be ameliorated by the intelligent application of technology. This report and the project it describes focus on the product selection problem and, in particular, on how case based reasoning can be used to help shop for certain types of items.



The major goals are:

- To develop a realistic CBR-based application: specifically, to build a sales advisor system that helps computer shoppers determine which products to purchase. The sales advisor is targeted at the e-commerce environment and is implemented as a stateless Java applet. The success of the application is judged on the accuracy of its recommendations, although attention is also paid to other aspects of expert systems, most notably system maintenance.
- To investigate issues in data representation and to illustrate ways to model data that make application development easier. It is our belief that substantially more deployed expert systems fail because of data representation than because of the underlying expert system technology.
- To understand the features that make a CBR engine successful.

A web search engine works by storing information about web pages, which it retrieves from the WWW itself. The pages are retrieved by a Web crawler (sometimes also known as a spider), an automated Web browser which follows every link it sees. The contents of each page are then analyzed to determine how the page should be indexed. Data about web pages are stored in an indexed database for use in later queries.

Another approach that is somewhat common is to use manually built cross-sell tables. A database table would hold a list of products to recommend if a customer purchased a specific product. For example, if you bought a camera, a related-items table might tell the system to recommend batteries, film and a carrying case. This approach can give fairly good recommendations but requires a large amount of data entry and is prone to data maintenance errors.

The PC Shopping Assistant uses a general-purpose case based reasoning engine that relies on a dynamic, brute-force, k-nearest neighbor algorithm. The nearest neighbor algorithm decides how similar two items are by, oddly enough, using a variation of the Pythagorean theorem. Conceptually, it plots all the items on a graph and then determines which item is closest to what you're looking for. The closer the item, the more similar it is. The most similar item is considered to be the best match. There are other names for this, such as sparse matrices and vector space models (which, to the best of our limited knowledge, are pretty much just CBR), but the concept is pretty simple.

When a user enters a query into the search engine, the engine examines the index and provides a listing of the best matching web pages according to its criteria.

The key to a good CBR system is the design of the user interface. The most common approach is to give the user a list of all traits and ask them to specify the values for each one.

This is a first step towards semantic search. Most of the search engines used nowadays use the exact keyword match procedure, but here the best match for a given item can be found by specifying its related attributes. This report and the project it describes focus on the product selection problem and, in particular, on how case based reasoning can be used to help shop for certain types of items, namely electronic appliances.


1.1 PROJECT OVERVIEW

Search engines have become a vital part of the Internet. The Internet provides a wealth of information scattered all over the world, and a Web search engine is a tool designed to search for information on the World Wide Web. Most of the search engines used nowadays make use of the exact match procedure: the engine crawls the web and develops a database of keywords, and for exact matches it lists the items in the order of some evaluation criteria of the sites. Better search results can be provided if the case based reasoning approach, a field of artificial intelligence, is used. Fuzzy databases use case based reasoning to look up entries. The basic motivation behind this project is to find the best match for a given item by specifying related attributes and, more importantly, to do so in real time. This is different from the exact-match search engines described above, which can be used only if we know the exact name of the item.

The application helps customers purchase computers. The goal was to have the application emulate how a salesperson might act: asking simple, easy to understand questions and then making a recommendation. At its simplest, the application asks the customer to enter the desired price and performance. The shopping assistant then recommends a computer along with a list of several alternates should the customer not like the recommended system.


1.2 PROBLEM DEFINITION

Create an efficient search engine which uses case based reasoning, a very relevant area in the field of artificial intelligence. The search engine produces the best match rather than the exact match for a given item.










2. REQUIREMENT ANALYSIS

This phase deals with understanding the problem, the goals, the constraints, etc. Requirement analysis starts with some general "statement of need" or a high-level "problem statement". During analysis, the problem domain and environment are modeled in an effort to understand the system behavior, the constraints of the system, its inputs and its outputs, etc. The basic purpose of the activity is to obtain a thorough understanding of what the software needs to provide. This understanding of the requirements leads to a requirement specification. Analysis produces a large amount of information and knowledge, possibly with redundancies; properly organizing and describing the requirements is the important goal of this activity.

2.1 SYSTEM SPECIFICATIONS

HARDWARE SPECIFICATIONS

Processor: Any processor of the present generation. Higher capability processors running the web servers provide faster replies.

Main Memory: 256 MB and above.

Operating System: Windows XP, Linux (portable across platforms).

SOFTWARE SPECIFICATIONS

Programming Environment: Java jdk-6u6.

Database: MySQL.









3. SYSTEM FEATURES

3.1 CASE BASED REASONING


In case-based reasoning (CBR) systems, expertise is embodied in a library of past cases. Each case typically contains a description of the problem, plus a solution and/or the outcome. The knowledge and reasoning process used by an expert to solve the problem is not recorded, but is implicit in the solution.

To solve a current problem, the problem is matched against the cases in the case base, and similar cases are retrieved. The retrieved cases are used to suggest a solution, which is reused and tested for success. If necessary, the solution is then revised. Finally, the current problem and the final solution are retained as part of a new case.
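This retrieve, reuse, revise, retain cycle can be pictured with a minimal Java sketch. The Case class, the similarity measure and the trivial adapt() rule below are placeholders chosen for illustration, not the classes actually used in the Shopping Assistant:

    import java.util.ArrayList;
    import java.util.List;

    // Minimal sketch of the CBR cycle: retrieve, reuse/revise, retain.
    class Case {
        double[] features;   // description of the problem
        String solution;     // recorded solution or outcome
        Case(double[] f, String s) { features = f; solution = s; }
    }

    class CaseBase {
        private final List<Case> cases = new ArrayList<>();

        void retain(Case c) { cases.add(c); }       // retain: store a new case

        Case retrieve(double[] problem) {           // retrieve: best matching case
            Case best = null;
            double bestSim = -1;
            for (Case c : cases) {
                double sim = similarity(problem, c.features);
                if (sim > bestSim) { bestSim = sim; best = c; }
            }
            return best;
        }

        String solve(double[] problem) {
            Case similar = retrieve(problem);       // retrieve
            String solution = adapt(similar);       // reuse (and possibly revise)
            retain(new Case(problem, solution));    // retain the solved problem
            return solution;
        }

        private double similarity(double[] a, double[] b) {
            double d = 0;
            for (int i = 0; i < a.length; i++) d += (a[i] - b[i]) * (a[i] - b[i]);
            return 1.0 / (1.0 + Math.sqrt(d));      // closer items are more similar
        }

        private String adapt(Case c) {              // trivial adaptation rule
            return c == null ? "no solution" : c.solution;
        }
    }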

Although case-based reasoning is used as a generic term, the typical case-based reasoning methods have some characteristics that distinguish them from the other approaches listed here. First, a typical case is usually assumed to have a certain degree of richness of information contained in it, and a certain complexity with respect to its internal organization. That is, a feature vector holding some values and a corresponding class is not what we would call a typical case description. What we refer to as typical case-based methods also have another characteristic property: they are able to modify, or adapt, a retrieved solution when applied in a different problem solving context. Paradigmatic case-based methods also utilize general background knowledge, although its richness, degree of explicit representation, and role within the CBR processes varies. Core methods of typical CBR systems borrow a lot from cognitive psychology theories.


Different types of case based reasoning tools exist; here we use memory-based reasoning.

Memory-based reasoning: This approach emphasizes a collection of cases as a large memory, and reasoning as a process of accessing and searching in this memory. Memory organization and access is a focus of the case-based methods. The utilization of parallel processing techniques is a characteristic of these methods and distinguishes this approach from the others. The access and storage methods may rely on purely syntactic criteria, or they may attempt to utilize general domain knowledge.


CBR COMPONENTS

A general case-based expert system consists of:




A Case Base: A case base functions as a repository of prior cases. The cases are indexed so that they can be quickly recalled when necessary. A case contains the general descriptions of old problems.


Retriever: The Retrieve task starts with a (partial) problem description, and ends when a best matching previous case has been found. Its subtasks are referred to as Identify Features, Initially Match, Search, and Select, executed in that order. The identification task basically comes up with a set of relevant problem descriptors; the goal of the matching task is to return a set of cases that are sufficiently similar to the new case, given a similarity threshold of some kind; and the selection task works on this set of cases and chooses the best match (or at least a first case to try out).



Adapter: An adapter examines the differences between these cases and the current problem. It then applies rules to modify the old solution to fit the new problem. The adapter looks at how the problem was solved in the retrieved case. The retrieved case holds information about the method used for solving the retrieved problem, including a justification of the operators used, subgoals considered, alternatives generated, failed search paths, etc. Derivational reuse then reinstantiates the retrieved method to the new case and "replays" the old plan in the new context (general problem solving systems can usually be seen here as planning systems). During the replay, successful alternatives, operators, and paths will be explored first while failed paths will be avoided; new subgoals are pursued based on the old ones, and old subplans can be recursively retrieved for them. An example of derivational reuse is the Analogy/Prodigy system, which reuses past plans guided by commonalities of goals and initial situations, and resumes a means-ends planning regime if the retrieved plan fails or is not found.


Refiner: The refiner task takes the result from applying the solution in the real environment (asking a teacher or performing the task in the real world). This is usually a step outside the CBR system, since it, at least for a system in normal operation, involves the application of a suggested solution to the real problem. The results from applying the solution may take some time to appear, depending on the type of application. In a medical decision support system, the success or failure of a treatment may take from a few hours up to several months. The case may still be learned, and be available in the case base in the intermediate period, but it has to be marked as a non-evaluated case. A solution may also be applied to a simulation program that is able to generate a correct solution. This is used in CHEF, where a solution (i.e. a cooking recipe) is applied to an internal model assumed to be strong enough to give the necessary feedback for solution repair.



Executer: Once a solution is critiqued, an executer applies the refined solution to the current problem.



Evaluator: In CBR, the case base is updated no matter how the problem was solved. If it was solved by use of a previous case, a new case may be built or the old case may be generalized to subsume the present case as well. If the problem was solved by other methods, including asking the user, an entirely new case will have to be constructed. In any case, a decision needs to be made about what to use as the source of learning. Relevant problem descriptors and problem solutions are obvious candidates. But an explanation or another form of justification of why a solution is a solution to the problem may also be marked for inclusion in a new case. In CASEY and CREEK, for example, explanations are included in retained cases, and reused in later modification of the solution. CASEY uses the previous explanation structure to search for other states in the diagnostic model which explain the input data of the new case, and to look for causes of these states as answers to the new problem. This focuses and speeds up the explanation process, compared to a search in the entire domain model. The last type of structure that may be extracted for learning is the problem solving method, i.e. the strategic reasoning path, making the system suitable for derivational reuse.




CBR CASE DIAGRAM

[Figure: CBR case diagram.]

In a CBR expert system, the system searches its memory (the case base) to see if it has seen this situation before. If it finds a similar situation, it uses that situation's diagnosis. This seems more human-like than rule-based systems.

Case based reasoning is good for finding the items most like the specified criteria. This is useful for finding primary items but is not nearly as good at cross selling and recommending accessories as collaborative filtering is.














3.2 WEB CRAWLER DESIGN


A search engine crawler is a program or automated script that browses the World Wide Web in a methodical manner in order to provide up-to-date data to the particular search engine. The process of web crawling starts with a website URL to be visited, called the seed; the crawler visits the web page, identifies all hyperlinks on the page, and adds them to the list of pages to crawl. URLs from the list are revisited occasionally according to the policies in place for the search engine. WebCrawler is a Web service that assists users in their Web navigation by automating the task of link traversal, creating a searchable index of the web, and fulfilling searchers' queries from the index. Conceptually, WebCrawler is a node in the Web graph that contains links to many sites on the net, shortening the path between users and their destinations.
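The crawl loop just described can be sketched in Java as follows. This is a minimal illustration: fetchPage() and extractLinks() are hypothetical stand-ins for real HTTP and HTML-parsing code, not the project's downloader classes:

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;

    // Sketch of the crawl loop: start from a seed URL, fetch each page,
    // and add every newly seen hyperlink to the list of pages to crawl.
    public class CrawlLoop {
        public static void crawl(String seed, int maxPages) {
            Queue<String> frontier = new ArrayDeque<>();
            Set<String> visited = new HashSet<>();
            frontier.add(seed);
            while (!frontier.isEmpty() && visited.size() < maxPages) {
                String url = frontier.poll();
                if (!visited.add(url)) continue;    // skip already-visited URLs
                String html = fetchPage(url);       // download the page
                for (String link : extractLinks(html)) {
                    if (!visited.contains(link)) frontier.add(link);
                }
            }
        }

        // Placeholders: a real crawler would use an HTTP client and an HTML parser.
        static String fetchPage(String url) { return ""; }
        static Set<String> extractLinks(String html) { return new HashSet<String>(); }
    }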



A simplification of the Web experience is important for several reasons. First, WebCrawler saves users time when they search, instead of their trying to guess at a path of links from page to page. Often, a user will see no obvious connection between the page he is viewing and the page he seeks. For example, he may be viewing a page on one topic and desire a page on a completely different topic, one that is not linked from his current location. In such cases, by jumping to WebCrawler, either using its address or a button on the browser, the searcher can easily locate his destination page. Second, WebCrawler's simplification of the Web experience makes the Web a more friendly and useful tool. The policies of the search engine can be different for each search engine, and may include cautionary actions to ensure that pages added to the index earlier have not become spam. The crawler "clicks" on a link and off it goes to read, index and store another Web site.


The software spider often reads and then indexes the entire text of each Web site into the main database of the search engine it is working for. Here, however, the crawler reads the entire text of each site but indexes and stores only the required features in the database. For example, when searching for electronic equipment, say laptops, the crawler searches the corresponding site, extracts the features of the laptop and stores them in the database. A software spider is like an electronic librarian who cuts out the table of contents of each book in every library in the world, sorts them into a gigantic master index, and then builds an electronic bibliography that stores information on which texts reference which other texts.



When a software spider visits your site, it notes any links from your page to other sites. In any search engine's vast database, all links between sites are recorded. The search engine knows which sites you linked to and, more importantly, which ones linked to you. Many engines will even use the number of links to your site as an indication of popularity, and will then boost your ranking based on this factor. Here, the crawler doesn't rank the pages; it goes through the pages, extracts the necessary features, and stores the extracted features in the database. Search crawlers are also smart enough to follow links they find on pages. They may follow these links as they find them, or they may store them and visit them later.



3.3 INDEXING DESIGN


Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics and computer science.

The purpose of storing an index is to optimize speed and performance. Without an index, the search engine would have to scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off against the time saved during information retrieval.



DATABASE DESIGN LOGIC

The database for the search engine consists of a single table, used to store the extracted features. It consists of the following attributes: cid, maker, model, hd, price, speed, ram, warranty and url. The SQL for creating this table is shown below.

CREATE TABLE COMPUTERTAB (
    CID      VARCHAR(20) PRIMARY KEY,
    MAKER    VARCHAR(20),
    MODEL    VARCHAR(20),
    HD       INT,
    PRICE    INT,
    SPEED    INT,
    RAM      INT,
    WARRANTY INT,
    URL      VARCHAR(255)
);


These features are extracted from the web pages and stored in the database. For further computation they are fetched from the database and used.
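As a rough sketch of how the extracted features could reach this table, a DAO might issue a JDBC insert along the following lines. The connection URL, the credentials and the ComputerDAO name are illustrative assumptions, not details taken from the project code:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // Illustrative DAO insert for the COMPUTERTAB table defined above.
    public class ComputerDAO {
        public void insert(String cid, String maker, String model, int hd,
                           int price, int speed, int ram, int warranty, String url)
                throws SQLException {
            String sql = "INSERT INTO COMPUTERTAB VALUES (?,?,?,?,?,?,?,?,?)";
            // The JDBC URL and credentials below are placeholders.
            try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost/shopdb", "user", "password");
                 PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setString(1, cid);
                ps.setString(2, maker);
                ps.setString(3, model);
                ps.setInt(4, hd);
                ps.setInt(5, price);
                ps.setInt(6, speed);
                ps.setInt(7, ram);
                ps.setInt(8, warranty);
                ps.setString(9, url);
                ps.executeUpdate();
            }
        }
    }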




3.4 SEARCHING DESIGN

3.4.1 SIMILARITY BASED MATCHING

Suppose we want to find all items that match at least seven of ten criteria. With similarity-based matching, we can find items that match most of our search criteria, but not all of them. Suppose some of our data is incomplete, and some of the fields we use in our search criteria were left blank. With similarity-based matching, we can find items even when data is missing. Suppose we want to see all items that are around a certain price. With similarity-based matching, we can find items that are close to what we want without having to specify a hard cut-off. Suppose we want to sort our items by a combination of attributes rather than on a column-by-column basis.

If we knew exactly what we were looking for and had that item's "key" (a part's part number, a person's social security number, an order's order number, etc.), we would not use similarity-based matching. Similarity-based matching is meant to be used when you don't know what options you have or when you want to have your items ranked.


Real world systems have used case based reasoning to help lawyers find cases, TV viewers find interesting TV shows, customers find the perfect computer, help desk workers find answers (by far the most common application), technicians configure X-ray machines, credit card companies find good credit risks, auditors find fraud, mechanics maintain aircraft and music lovers find songs.


Similarity based matching also makes it easy to rank items. And suppose certain attributes are more important than others: with similarity-based matching, we can assign weights to each attribute, allowing some data attributes to count for more than others, as the sketch below illustrates.
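A short Java sketch of a weighted, missing-data-tolerant distance follows. The decision to skip null attributes and the boxed Double representation are illustrative choices, not the project's exact rules:

    // Weighted distance: attributes with larger weights matter more, and
    // null (missing) attributes are skipped rather than penalized.
    public class WeightedDistance {
        static double distance(Double[] query, Double[] item, double[] weights) {
            double sum = 0;
            for (int i = 0; i < query.length; i++) {
                if (query[i] == null || item[i] == null) continue; // missing data
                double diff = query[i] - item[i];
                sum += weights[i] * diff * diff;
            }
            return Math.sqrt(sum);
        }
    }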



3.4.2 NEAREST NEIGHBOUR ALGORITHM

This is the algorithm we use for finding the best match when items have numerous attributes. Say, for example, that our items have price, performance, reliability and size, each rated from 1 to 10. We want to find an item with price 1, performance 10, reliability 10 and size 1, which as a graph point would be (1,10,10,1). Item A might be (4,4,4,4), B might be (7,7,2,8), C might be (10,9,6,10) and D might be (1,6,8,3). For item A:

(1,10,10,1) - (4,4,4,4) = (-3, 6, 6, -3)

a^2 + b^2 + c^2 + d^2 = e^2
(-3)^2 + 6^2 + 6^2 + (-3)^2 = e^2
9 + 36 + 36 + 9 = e^2
e^2 = 90
e = 9.49


Since each attribute ranges from 1 to 10, the maximum possible distance is sqrt(4 * 9^2) = 18, so item A is 1 - 9.49/18, or about 47%, similar to what we're looking for.

If, on every request, you manually compute the distances and return the best match, that is a brute-force nearest neighbor search. A k-nearest neighbor search returns the k closest items, meaning that if you ask for the top five matches, you get the top five matches.


We plot an n-dimensional graph with each attribute on a corresponding axis, plot the target item on the graph, and then use an extension of the Pythagorean theorem to find the item nearest to the target item. If the attributes have weights associated with them, we consider those as well while finding the best match. Answering a query with this brute-force algorithm takes O(n*d) time for n stored items with d attributes.

% similarity between items = 1 - (distance between items / maximum distance)
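The worked example above can be checked with a few lines of Java. This is only a sketch of the arithmetic, not the project's search code:

    // Reproduces the worked example: distance 9.49, similarity about 47%.
    public class SimilarityExample {
        // Euclidean distance between two equal-length attribute vectors.
        static double distance(double[] a, double[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                sum += diff * diff;
            }
            return Math.sqrt(sum);
        }

        public static void main(String[] args) {
            double[] target = {1, 10, 10, 1};
            double[] itemA  = {4, 4, 4, 4};
            double d = distance(target, itemA);         // sqrt(90) = 9.49
            double maxDistance = Math.sqrt(4 * 9 * 9);  // 18, attributes range 1..10
            double similarity = 1 - d / maxDistance;    // about 0.47, i.e. 47%
            System.out.printf("distance=%.2f similarity=%.0f%%%n", d, similarity * 100);
        }
    }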



















4. IMPLEMENTATION

4.1 WEB CRAWLER IMPLEMENTATION

The crawler in the Shopping Assistant follows the design principles and features mentioned above. The input to the crawler is a list of URLs of product sites, namely rediff.com, shopper.cnet.com, ebay.com, etc. The crawler processes the web page of each input URL and downloads every hyperlink and URL in the page being processed. After completing the download of the input URL's web page, another HTML link is given to the web crawler, which then applies the same processing steps to it.

The logic for identifying the pages containing details of electronic appliances varies from site to site. Therefore, we used a factory design pattern, where DownloaderFactory returns the downloader specific to a site, as sketched below. The filling of the URL list is implemented in the site-specific downloader. The download() function of the PageDownloader takes each URL from the list and downloads the pages.
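A hedged sketch of this factory arrangement is shown below, using the class and method names from section 4.4. The site-specific bodies and the URL string are placeholders:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the downloader factory: each site gets its own subclass
    // that knows the site's URL pattern.
    abstract class PageDownloader {
        protected List<String> urlList = new ArrayList<>();
        abstract void fillUrlList();            // site-specific URL pattern
        void download() {
            for (String url : urlList) {
                // fetch the page for this URL and hand it to the extractor
            }
        }
    }

    class RediffDownloader extends PageDownloader {
        void fillUrlList() {
            urlList.add("http://example.com/laptops?page=1"); // placeholder URL
        }
    }

    class DownloaderFactory {
        static PageDownloader getDownloader(String site) {
            if (site.equals("rediff")) {
                PageDownloader d = new RediffDownloader();
                d.fillUrlList();                // build the site-specific URL list
                return d;
            }
            throw new IllegalArgumentException("unknown site: " + site);
        }
    }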

The crawler crawls a number of different web sites and stores the downloaded web pages in a database. WebCrawler's implementation is, from the perspective of a Web user, entirely on the Web; no client-side elements outside the browser are necessary. The service is composed of two fundamental parts: crawling, the process of finding documents and constructing the index; and serving, the process of receiving queries from searchers and using the index to determine the relevant results. Though WebCrawler's implementation has evolved over time, these basic components have remained the same.


4.2 INDEXING IMPLEMENTATION

Each site has its own feature extractor for each product, for example RediffPCExtractor, ShopperPCExtractor, etc. The feature extractor extracts the values of the attributes of a product using start and end markers. For example, RediffPCExtractor extracts the values of price, hard disk capacity, RAM, etc. using markers designed for the pages that contain details of laptops on rediff.com.

The FeatureExtractorFactory returns the site- and product-specific extractor. The extracted details are inserted into the database using a DAO (Database Access Object). This, too, is implemented using the factory design pattern; there are DAOs for PC, camera, etc.
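The marker-based extraction itself can be pictured with a small sketch. The markers here are invented for illustration; the real extractors use markers designed for each site's page layout:

    // Returns the text between startMarker and endMarker, or null if absent.
    public class MarkerExtractor {
        static String extract(String page, String startMarker, String endMarker) {
            int start = page.indexOf(startMarker);
            if (start < 0) return null;
            start += startMarker.length();
            int end = page.indexOf(endMarker, start);
            if (end < 0) return null;
            return page.substring(start, end).trim();
        }

        public static void main(String[] args) {
            String page = "<td>Price:</td><td>Rs. 45000</td>";
            // Hypothetical markers for the price field on a product page.
            System.out.println(extract(page, "<td>Price:</td><td>", "</td>"));
        }
    }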

4.3 SEARCHING IMPLEMENTATION

Searching is implemented using the Nearest Neighbour algorithm. The user inputs some or all of the features that the item must possess, and the search engine returns the items that best satisfy the expected features. The similarity (in percentage) of all items in the database with the given query is calculated using an extension of the Pythagorean theorem, and the 20 most similar items are displayed.
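The ranking step might be sketched as follows; the Scored pair and the in-memory list are simplifications of what Querrier::list() does against the database rows:

    import java.util.List;

    // Illustrative ranking step: sort scored items by similarity and keep
    // the 20 most similar. Scored pairs stand in for database rows.
    public class TopMatches {
        static class Scored {
            final String cid;
            final double similarity;   // percentage similarity with the query
            Scored(String cid, double similarity) {
                this.cid = cid;
                this.similarity = similarity;
            }
        }

        static List<Scored> top20(List<Scored> scored) {
            scored.sort((a, b) -> Double.compare(b.similarity, a.similarity)); // most similar first
            return scored.subList(0, Math.min(20, scored.size()));
        }
    }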



4.4 FUNCTION SPECIFICATION

Spyder::start()
    Initiates the downloading of pages and feature extraction.

PageDownloader::download()
    Establishes a connection with the web server and downloads the page corresponding to the given URL.

DownloaderFactory::getDownloader()
    Returns an object of PageDownloader corresponding to the given site.

SiteDownloader::fillUrlList()
    Builds the URL list of the given site. Each site has a particular pattern for URLs; the site-specific downloader uses this pattern for filling the URL list.

FeatureExtractorFactory::createExtractor()
    Returns the SiteProductExtractor (which extends FeatureExtractor).

FeatureExtractor::extractFeature()
    Extracts all features. The feature extractor extracts the values of the attributes of a product using start and end markers.

FeatureExtractor::extractProduct()
    Calls the function extractFeature() and sets all the attributes of the product.

DAOFactory::getDAOFor()
    Returns the ProductDAO for the category.

ProductDAO
    All the functions in this class implement database operations, viz. insert, update, fetch, delete, etc.

Querrier::list()
    Searches the database and lists all the items that satisfy the given input. The similarity (in percentage) of all items in the database with the given query is calculated using an extension of the Pythagorean theorem.

ProductSource
    This class stores the site, category and the specification page.

Product
    This class stores all the attributes of a product.







5. USE CASE DIAGRAM

[Use case diagram: the USER submits a Query to the Querrier and receives a Result; the Querrier looks up the Database, which is populated by the Feature Extractor from pages fetched by the Downloader; the Downloader takes URLs from the URL queue and retrieves pages from the Web server.]


6. DESIGN

[Design diagram not reproduced.]
7. SCREENSHOTS

[Screenshots not reproduced.]

8. FUTURE ENHANCEMENTS

Currently the shopping assistant is used only for shopping for laptops. The object oriented design of the project allows easy enhancement, so that more products like cameras, mobiles, etc. can be added. Also, more product sites like ebay.com and shopper.cnet.com can be crawled.

The similarity is calculated using the Pythagorean theorem, so only numerical features are considered. String attributes can be taken into account for finding the similarity if Natural Language Processing techniques are implemented. Also, each feature can be given a priority/weight that is considered while searching.


















9. CONCLUSION

Current search engines like Google return exact matches for a given search item. Google has now launched a new site called Froogle which, for a given search item, lists the features of that item as produced by different companies, with an option to list the items cost-wise. In the near future they will be adding additional features which will help us give a detailed specification of the item we are searching for and return the item which matches our specification more closely. The underlying logic for this is the nearest neighbor algorithm, which is the basic principle used in our CBR engine. In a real-time search engine, using the plain nearest neighbor algorithm may not give results in real time, so we have to use the k-nearest neighbor algorithm, in which only the k nearest items are retained for computing similarity.














