CONTEXTUAL DESKTOP SEARCH ARCHITECTURE






PROJECT REPORT



Submitted in partial fulfillment of the requirements

For the award of B.Tech Degree in Computer Science & Engineering

of the University of Kerala




By

Noel Mathew

Prathik George


Eighth Semester, Computer Science and Engineering







DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

COLLEGE OF ENGINEERING

Thiruvananthapuram

2010







DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

COLLEGE OF ENGINEERING

TRIVANDRUM




CERTIFICATE


This is to certify that the project report entitled "Contextual Desktop Search Architecture" is a bona fide record of the project done by NOEL MATHEW and PRATHIK GEORGE under our guidance and supervision, in partial fulfillment of the requirements for the award of B.Tech Degree in Computer Science & Engineering of the University of Kerala, during the year 2010.



Project Guide







Mr. GOPAKUMAR G.

Lecturer

Dept. of Computer Science &
Engineering

College of Engineering

Trivandrum


Project Coordinator






Mrs. SHREELEKSHMI R.

Asst. Professor

Dept. of Computer Science &
Engineering

College of Engineering

Trivandrum



Head of the Department






Dr. RAJASREE M.S.

Professor and Head

Dept. of Computer Science &
Engineering

College of Engineering

Trivandrum



Submitted for the Project Viva Voce examination held on ………………




Internal Examiner







External Examiner




ABSTRACT


Unlike web search, desktop search lacks the hyperlink structure that link-analysis algorithms such as PageRank rely on. A desktop search is therefore usually limited to text-based methods, which places the onus on the user to provide descriptive search queries in order to obtain the required data. This often reduces the quality of results.

Search engines crawl the file system and index the data without knowing the properties encoded by the data or the views provided by the application. As a result, a search engine returns documents that are not relevant to the search, or misses documents that are relevant. This calls for an architecture which extends the search feature to analyze the data in a contextual manner.

We can design the above search architecture by dividing it into two basic parts:

• Contextual Search Algorithm (which includes Indexing)

• Resource Monitor or a Relation Window

Temporal context gives desktop search an alternate way to understand the data as perceived by the user. The need arises for a new generation of search engines which sees the data with the eyes of the user.

Apart from the indexer, we need a component known as the Relation Window (RW), which monitors the file events that occurred in the last n seconds. It keeps track of the read and write events on files, thus helping to build a relationship graph of the files.

The Contextual Desktop Search Architecture is a tool that departs from the traditional desktop search paradigm to incorporate these contextual relationships and increase the relevance of the search results.















ACKNOWLEDGEMENT




The successful completion of our project requires more than a word of thanks to all those who were part of the efforts in undertaking this project and completing it. First of all, we thank Almighty God for hearing our prayers and for the benevolence shown to us. We express our deepest gratitude to our project guide, Mr. Gopakumar G., for his inspiring and untiring help throughout the course of the project work.

We express our heartfelt and profound gratitude to Dr. Rajasree M.S., the HoD of the Department of Computer Science and Engineering, for providing all the necessary equipment and facilities required for the completion of this project. We also express our sincere gratitude to our project in-charge, Mrs. Sreelekshmi R., for her help and co-operation. Our special thanks go to all the staff of the computer department for their moral support and guidance throughout.

We would also like to take this opportunity to thank our parents and friends, who were there to guide us and to help us keep our spirits up during the entire development of this project.








Noel Mathew

Prathik George























TABLE OF CONTENTS


LIST OF FIGURES
LIST OF SYMBOLS & ABBREVIATIONS

CHAPTERS

1. INTRODUCTION
     OBJECTIVES
     SCOPE OF THE PROJECT
2. LITERATURE SURVEY
3. SYSTEM STUDY OF THE PROPOSED PROJECT
4. SYSTEM ANALYSIS
5. SYSTEM DESIGN
     5.1 ARCHITECTURAL FRAMEWORK
          5.1.1 DATA FLOW DIAGRAM
          5.1.2 USE CASE DIAGRAM
     5.2 COMPONENT DESIGN
     5.3 DATABASE DESIGN
     5.4 INTERFACE DESIGN
6. IMPLEMENTATION METHODOLOGY
     6.1 PROBLEM STATEMENT
     6.2 PROBLEM DESCRIPTION
     6.3 FEATURES OF THE PROJECT
7. PSEUDO CODE
8. SOFTWARE DESCRIPTION
9. TESTING AND VALIDATION
10. LIMITATIONS OF THE PROJECT AND SCOPE FOR FURTHER WORK
11. CONCLUSION

APPENDIX A: WORKING ENVIRONMENT
     a.) SOFTWARE SPECIFICATION
     b.) HARDWARE SPECIFICATION
APPENDIX B: SCREEN SHOTS
REFERENCES



LIST OF FIGURES

Figure 2.1: Search query and result processing
Figure 5.1: DFD of CoDeSeAr
Figure 5.2: Use Case Diagram
Figure 5.3: Content-Based Indexing
Figure 5.4: Content-Based Searching
Figure 5.5: Contextual-Side Architecture
Figure 5.6: Design of Forward Index
Figure 5.7: Design of Inverted Index
Figure 5.8: Sample Menu
Figure 5.9: Sample List Box
Figure 5.10: Translation Acceleration Processing
Figure B.1: Startup form
Figure B.2: Easy to use menus
Figure B.3: Indexing process-1
Figure B.4: Indexing process-2
Figure B.5: Relationship Window-1
Figure B.6: Relationship Window-2
Figure B.7: Index Monitor-1
Figure B.8: Index Monitor-2
Figure B.9: Index Monitor-3
Figure B.10: Index Monitor-4
Figure B.11: Content-based and Contextual Result
Figure B.12: Opening files directly from application
Figure B.13: A sample relationship graph
Figure B.14: About us



LIST OF SYMBOLS & ABBREVIATIONS



LIST OF ABBREVIATIONS


CoDeSeAr  -  Contextual Desktop Search Architecture
RM        -  Resource Monitor
RW        -  Relationship Window
RG        -  Relationship Graph
GUI       -  Graphical User Interface
DFD       -  Data Flow Diagram
SQL       -  Structured Query Language



CHAPTER 1

INTRODUCTION




The information boom witnessed in recent years has brought its own set of problems with it. One of the most critical issues is getting the exact piece of information from the ocean of knowledge available. Initially, there was only limited content to be searched. All we needed for information retrieval was a straightforward method which would scan through the available content and return the relevant result.

The old searching techniques were basically simple. There was a database which contained the index of the information available. A search meant performing a linear search through the database and returning the result. Since the amount of data to be searched was small, the results were more or less accurate.

Things have changed now. Even though linear search has been replaced by faster search methods, the large volume of data has rendered ordinary search results unreliable. In many cases, the result set returned contains a large number of false positives, while relevant items are missed (false negatives).



OBJECTIVE:

In addition to the cost of searching the database, there is also the risk of not getting the relevant result or, in other words, getting the wrong results. This irrelevancy only keeps increasing as the volume of data handled goes up. Such irrelevant results can translate into huge losses for a company. This is usually regarded as the more serious shortcoming of improper search techniques. Clearly, there is a need for a better mechanism. This mechanism should view the data in a manner different from the conventional search mechanism. More stress should be given to the way the user perceives the data and how the data items are related to each other. The main objective is to give relevant and accurate results, even if it is at the expense of time.



SCOPE OF THE PROJECT




CoDeSeAr is a concept with wide applications in a world that is witnessing huge amounts of knowledge transfer. It is also very relevant at a time when users want the machine to give the results they like, rather than forcing themselves to like the results the machine gives them.

The contextual architecture focuses on fine-tuning the system so that the user's usage patterns are better understood and recorded. These usage patterns are then used to generate the contextual result. Some of the typical areas where contextual search can be very useful are:

• Research: Very few files are named according to the content they contain. A lot of supporting files are usually given reference numbers. With contextual search, when a main subject is searched for, the supporting files get listed automatically. This saves the user from having to remember the exact file name and/or manually check each file to get what they want.

• Corporate world: Again, a lot of data files are given reference numbers, which are usually unrelated to the content they contain. The contextual search will pick out the relevant files.

• Individual users: There is a high chance that an individual user does not remember the exact name of the file he/she wants, but remembers other related files. In this case, a search for a related file will contextually return the file that he/she wants.

The main attraction of this project is that the results returned are more accurate. There is a tradeoff between speed and relevancy but, as studies have shown, relevancy is given preference even if it means waiting for a couple of seconds more.



CHAPTER 2

LITERATURE SURVEY

Desktop search is the name for the field of search tools which search the contents of a user's own computer files, rather than searching the Internet. These tools are designed to find information on the user's PC, including web browser histories, e-mail archives, text documents, sound files, images and video.

Desktop search is emerging as a concern for large firms for two main reasons: untapped productivity and security. A commonly cited statistic states that 80% of a company's data is locked up inside unstructured data: the information stored on an end user's PC, the files and directories they have created on a network, documents stored in repositories such as corporate intranets, and a multitude of other locations. Moreover, many companies have structured or unstructured information stored in older file formats to which they don't have ready access.

There are a variety of commonly used desktop search applications available. Some of the prominent ones are Windows Search, Copernic Desktop Search, Google Desktop and Beagle. In the general case, a desktop search has two main parts:


1) Indexing

The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval.

Index Design Factors

Major factors in designing a search engine's architecture include:

Merge factors
How data enters the index, or how new files and directories are added to the index during traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the data collection policy.

Storage techniques
How to store the index data, that is, whether information should be compressed or filtered.

Index size
How much computer storage is required to support the index.

Lookup speed
How quickly a word can be found in the inverted index. The speed of finding an entry in a data structure, compared with how quickly it can be updated or removed, is a central focus of computer science.

Maintenance
How the index is maintained over time.

Fault tolerance
How important it is for the service to be reliable. Issues include dealing with index corruption, determining whether bad data can be treated in isolation, dealing with bad hardware, partitioning, as well as replication.

Challenges in Parallelism

A major challenge in the design of search engines is the management of parallel computing processes. There are many opportunities for race conditions and coherent faults. For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries. This is a collision between two competing tasks. Consider that authors are producers of information, and the RM is the consumer of this information, picking up each change. The forward index is the consumer of the information produced by the file system, and the inverted index is the consumer of the information produced by the forward index. This is commonly referred to as a producer-consumer model. The indexer is the producer of searchable information and users are the consumers that need to search. The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve distributed computing, where the search engine consists of several machines operating in unison. This increases the possibilities for incoherency and makes it more difficult to maintain a fully-synchronized, distributed, parallel architecture.
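The producer-consumer model described above can be sketched on a single machine with a thread-safe queue. This is only an illustrative sketch: the function name run_pipeline, the (path, text) event shape and the dictionary used as an index are assumptions made for the example, not part of CoDeSeAr.

```python
import queue
import threading


def run_pipeline(file_events):
    """Feed (path, text) events to an indexer thread through a queue."""
    events = queue.Queue()         # hand-off point between producer and consumer
    index = {}                     # path -> set of words (hypothetical index shape)
    index_lock = threading.Lock()  # guards the index against concurrent access
    stop = object()                # sentinel telling the consumer to finish

    def indexer():
        # Consumer: takes events off the queue and updates the index.
        while True:
            item = events.get()
            if item is stop:
                break
            path, text = item
            with index_lock:
                index[path] = set(text.lower().split())

    worker = threading.Thread(target=indexer)
    worker.start()
    for event in file_events:      # Producer: the file system / monitor side
        events.put(event)
    events.put(stop)
    worker.join()
    return index


index = run_pipeline([
    ("a.txt", "The cow says moo"),
    ("b.txt", "the cat and the hat"),
])
```

The queue decouples the two sides, so the producer never touches the index directly; the lock is where a query thread would also synchronize if searches had to run while the index is being updated.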


Two types of indexes are commonly used:

- The Forward Index

The forward index stores a list of words for each document. The following is a simplified form of the forward index:

Forward Index

Document      Words
Document 1    the, cow, says, moo
Document 2    the, cat, and, the, hat
Document 3    the, dish, ran, away, with, the, spoon

The rationale behind developing a forward index is that as file systems are being traversed, it is better to immediately store the filenames per directory. The forward index is essentially a list of pairs consisting of a filename and its path. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index.
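The conversion described above can be illustrated with the sample forward index from the table. The sketch below is a minimal illustration, assuming the forward index is held as a dictionary from document name to word list; the function name invert is hypothetical.

```python
from collections import defaultdict

# Sample forward index taken from the table above.
forward_index = {
    "Document 1": ["the", "cow", "says", "moo"],
    "Document 2": ["the", "cat", "and", "the", "hat"],
    "Document 3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}


def invert(forward):
    """Turn document -> words into word -> documents by sorting the pairs."""
    # Flatten into unique (word, document) pairs and sort them by word.
    pairs = sorted({(word, doc) for doc, words in forward.items() for word in words})
    inverted = defaultdict(list)
    for word, doc in pairs:
        inverted[word].append(doc)
    return dict(inverted)


inverted_index = invert(forward_index)
```

Because the pairs are sorted by word before grouping, every word ends up next to the complete list of documents that contain it, which is exactly the "word-sorted forward index" view.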

- Inverted indices

Many search engines incorporate an inverted index when evaluating a search query to quickly locate documents containing the words in a query and then rank these documents by relevance. Because the inverted index stores a list of the documents containing each word, the search engine can use direct access to find the documents associated with each word in the query in order to retrieve the matching documents quickly. The following is a simplified illustration of an inverted index:

Inverted Index

Word    Documents
the     Document 1, Document 3, Document 4, Document 5
cow     Document 2, Document 3, Document 4
says    Document 5
moo     Document 7

This index can only determine whether a word exists within a particular document, since it stores no information regarding the frequency and position of the word; it is therefore considered to be a Boolean index. Such an index determines which documents match a query but does not rank matched documents. Therefore, additional calculations have to be performed to assign relevance.
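A Boolean index of this kind can answer a multi-word query by intersecting the document sets of the individual words. The sketch below uses the sample inverted index from the table; the function name boolean_and and the choice of AND semantics are assumptions made for the example.

```python
# Sample inverted index from the table above, with documents as numbers.
inverted_index = {
    "the": {1, 3, 4, 5},
    "cow": {2, 3, 4},
    "says": {5},
    "moo": {7},
}


def boolean_and(index, query):
    """Return the documents that contain every word of the query."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    if not postings:
        return []
    # Intersect the posting sets; the result is unranked, so sort by number.
    return sorted(set.intersection(*postings))


both = boolean_and(inverted_index, "the cow")
neither = boolean_and(inverted_index, "cow moo")
```

The result only states which documents match. Since no frequency or position is stored, any ordering by relevance has to come from a separate ranking step.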

2) Search Query and Result Processing

Indexing the desktop files is only half the battle. To give the most suitable results, it is not enough to take the search query input as it is and perform a search, nor can the result set obtained simply be displayed as it is. Even though we can never guarantee that the exact result the user wants appears at the top of the list, we can increase the probability of this to a considerable extent.

The search query and result processing is shown below.

Figure 2.1: Search query and result processing

On the query side, the search input string usually undergoes several transformations. Common transformations include spell checking and stemming, which reduce the onus on the user to give the exact query value.
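The kind of query transformation mentioned above can be sketched as follows. This is a simplified illustration and not the project's actual pipeline: the suffix-stripping rule is a crude stand-in for a real stemmer, the vocabulary is a hypothetical list of indexed keywords, and spelling correction is approximated with the standard library's difflib.

```python
import difflib
import re

VOCABULARY = ["report", "search", "index", "desktop"]  # hypothetical indexed keywords


def crude_stem(word):
    """Strip a few common suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize_query(query, vocabulary=VOCABULARY):
    """Lower-case, tokenize, stem and spell-correct a raw query string."""
    terms = []
    for raw in re.findall(r"[a-z0-9]+", query.lower()):
        stem = crude_stem(raw)
        # Replace the term with the closest known keyword, if one is close enough.
        match = difflib.get_close_matches(stem, vocabulary, n=1, cutoff=0.8)
        terms.append(match[0] if match else stem)
    return terms


terms = normalize_query("Searching Reports")
```

With these assumptions, "Searching Reports" and a misspelt "desktp indexes" are both mapped onto keywords that exist in the index, so the user does not have to type the exact stored form.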


Another optimization is performed after the result set is obtained from the indices. Each result is given a certain weightage, and the results are arranged in descending order of weightage, with the highest-weighted entry at the top. This increases the chances of the user getting the required result in the first attempt itself. The different steps performed in result processing, as shown in the figure, are:

• Filtering: Removal of duplicate entries.

• Merging: Grouping of search results from different searches.

• Processing: Prioritizing the results in the result set based on a suitable ranking system.
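The three steps listed above can be sketched in a few lines. The weights, file names and the rule of keeping the highest weight for a duplicate are assumptions made for the illustration; the report does not specify the actual ranking system.

```python
def process_results(*result_lists):
    """Merge (path, weight) result lists, drop duplicates and rank by weight."""
    best = {}
    for results in result_lists:                  # Merging: combine all searches
        for path, weight in results:
            # Filtering: keep one entry per file, with its highest weight.
            if path not in best or weight > best[path]:
                best[path] = weight
    # Processing: highest weight first, ties broken by path for stable output.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


exact = [("notes/plan.txt", 0.9), ("old/plan.bak", 0.4)]
partial = [("notes/plan.txt", 0.6), ("misc/planning.doc", 0.5)]
ranked = process_results(exact, partial)
```

The file that appears in both result lists is reported once, and the final list is ordered so that the most heavily weighted entry is shown first.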






CHAPTER 3

SYSTEM STUDY OF THE PROPOSED PROJECT

EXISTING SYSTEM

Searching for specific information in huge pools of information is a long-standing problem. At each stage of improvement, the focus has mainly been on improving the relevancy of the results as well as the search speed. The use of indexes has greatly increased retrieval speed, and relevancy has been improved by optimizing search queries and results. But since the volume of information keeps increasing, the relevancy achievable with these techniques has taken a hit. Content-based search still continues to be the most popular type of search.


PROPOSED SYSTEM


The proposed system allows setting up a context-based desktop search engine which aims at increasing the relevancy of the result set by bringing in the context factor. In this system, we focus not on the individual existence of the files, but on the relationships that exist between the files. To achieve this, we use the concept of temporal locality: we assume that files accessed within nearby time periods have some relationship between them.

Features of Proposed System

• Provides content-based desktop search results.

• Context-based results provide relevant results and reduce false positives.

• The Resource Monitor watches over all file manipulations and keeps the index up-to-date.

• The user has the option to start or stop the Resource Monitor at will.

• The weight of an edge in the relationship graph indicates the strength of the relationship between the nodes it connects.



CHAPTER 4

SYSTEM ANALYSIS

Project Requirements

Input

• Search query string

• Trigger to start indexing

• Trigger to start resource monitor

Output

• Content-based search result

• Context-based search result

Modules

1. Indexing Module
2. Content-based Search Module
3. Resource Monitor Module
4. Relationship Graph Module
5. Contextual Search Module


1. Indexing Module

This module contains the functionality for creating the indexes. This is a one-time process and is usually done during the first run of the project. The problem with the indexing operation is that a huge amount of time is spent on it, since it involves a large number of files. But the time taken for this one-time indexing process is traded off against the time saved during information retrieval, which occurs far more frequently.

The whole computer file system is indexed in two different types of indexes: the forward index and the inverted index. Also, each combination of file name and directory path is given a unique file id, which identifies the file. The id is useful in creating the inverted index as well as in creating the relationship graph. Additional provision is made to distinguish files based on their types, i.e. music files, video files, document files etc.
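A minimal sketch of this traversal is given below. It assumes that file ids are simply assigned in the order in which files are visited and that file types are derived from the extension; the extension table and the function name build_forward_index are hypothetical, and the demonstration runs on a temporary directory rather than the whole file system.

```python
import os
import tempfile

# Hypothetical mapping from extension to file type.
FILE_TYPES = {".mp3": "music", ".avi": "video", ".txt": "document", ".doc": "document"}


def build_forward_index(root):
    """Walk a directory tree and record id, filename, path and type per file."""
    index = []
    next_id = 1
    for directory, _subdirs, filenames in os.walk(root):
        for name in sorted(filenames):
            extension = os.path.splitext(name)[1].lower()
            index.append({
                "id": next_id,          # unique file id
                "filename": name,
                "path": directory,
                "type": FILE_TYPES.get(extension, "other"),
            })
            next_id += 1
    return index


# Small demonstration on a temporary directory with two empty files.
with tempfile.TemporaryDirectory() as root:
    for name in ("song.mp3", "report.txt"):
        open(os.path.join(root, name), "w").close()
    demo_index = build_forward_index(root)
```

Each entry corresponds to one row of the forward index, and the id is the value that the inverted index and the relationship graph would refer to.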





2. Content-based Searching Module

To give the most suitable results, it is not enough to take the search query input as it is and perform a search, nor can the result set obtained simply be displayed as it is. Even though we can never guarantee that the exact result the user wants appears at the top of the list, we can increase the probability of this to a considerable extent.

The search query processing includes steps like creating different types of queries from the given search input string. These queries are executed one after another until a sufficiently large result set is obtained.

The search result processing includes prioritizing the individual results obtained from the searches done earlier. Each result is assigned a particular weight, and the result set is sorted in descending order of weight.
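The idea of trying several query types in turn can be sketched as below. The three strategies (exact name, prefix, substring), their order and the minimum result count are assumptions made for the illustration; the report does not list the actual query types used.

```python
# Hypothetical file names standing in for the indexed files.
FILENAMES = ["project_report.doc", "report_2010.txt", "reporting_notes.txt", "budget.xls"]


def exact_match(term, names):
    """Files whose name (without extension) equals the term."""
    return [n for n in names if n.rsplit(".", 1)[0] == term]


def prefix_match(term, names):
    """Files whose name starts with the term."""
    return [n for n in names if n.startswith(term)]


def substring_match(term, names):
    """Files whose name contains the term anywhere."""
    return [n for n in names if term in n]


def cascaded_search(term, names, minimum=3):
    """Run stricter queries first and stop once enough results are found."""
    results = []
    for strategy in (exact_match, prefix_match, substring_match):
        for name in strategy(term, names):
            if name not in results:
                results.append(name)
        if len(results) >= minimum:
            break
    return results


hits = cascaded_search("report", FILENAMES)
```

Results from stricter queries come first in the list, which gives a natural starting point for the weights assigned during result processing.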


3. Resource Monitor Module

The RM is the real brain of the project. This module is what provides the context concept to the program. Basically, it is a daemon process which runs in the background. It fulfills two purposes:

• User File Usage: It keeps track of the files that are being accessed by the user and, whenever it finds two or more files having a potential relationship, it updates the edge between those nodes in the RG.

• Assist Indexer: Whenever there is a file modification such as addition, renaming or deletion of a file, it notifies the indexer so that the change is reflected in the index. Otherwise, the index would become inconsistent with the file system.
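The "Assist Indexer" role can be sketched as a small handler that applies each file event to the index. The event names, the class name IndexUpdater and the path-to-id dictionary are assumptions made for the example; the events are supplied by hand here, whereas the real RM would receive them from the operating system.

```python
import itertools


class IndexUpdater:
    """Keeps a path -> file id mapping in step with file-system events."""

    def __init__(self):
        self.ids = {}                        # path -> file id
        self._next_id = itertools.count(1)   # ids are never reused

    def handle(self, kind, path, new_path=None):
        if kind == "created":
            if path not in self.ids:
                self.ids[path] = next(self._next_id)
        elif kind == "renamed":
            # The file keeps its id, so its graph relationships survive a rename.
            if path in self.ids:
                self.ids[new_path] = self.ids.pop(path)
        elif kind == "deleted":
            self.ids.pop(path, None)


updater = IndexUpdater()
updater.handle("created", "a.txt")
updater.handle("created", "b.txt")
updater.handle("renamed", "a.txt", new_path="c.txt")
updater.handle("deleted", "b.txt")
```

Keeping the id stable across a rename is a design choice of this sketch: it lets entries that refer to the file by id remain valid without any further update.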


4. Relationship Graph Module

The RG module is responsible for creating the graph which maintains the contextual relationships between different files. It communicates with the RM module's Relationship Window (RW) and efficiently updates the various edges and relations between the nodes in the graph. This RG module is used for contextual searching.

The key features of the RG are:

• It is a graph which has the files as nodes.

• The graph keeps growing as the user keeps using the system.

• Temporal locality is used to create relationships between the nodes.

• The greater the weight of an edge, the stronger the relationship.
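How a time window can drive the edge weights is sketched below. This is one possible reading of the RW and RG interaction, not the project's actual code: it assumes that every access to a file strengthens, by one, the edge to each other file seen in the last n seconds, and the class and method names are hypothetical.

```python
from collections import defaultdict, deque


class RelationshipGraph:
    """Weighted graph of files, built from accesses that fall in one time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.window = deque()            # (timestamp, file_id), oldest first
        self.weights = defaultdict(int)  # frozenset({a, b}) -> edge weight

    def record_access(self, timestamp, file_id):
        # Drop events that are older than the last n seconds.
        while self.window and timestamp - self.window[0][0] > self.window_seconds:
            self.window.popleft()
        # Strengthen the edge to every other file still inside the window.
        for other in {f for _, f in self.window if f != file_id}:
            self.weights[frozenset((file_id, other))] += 1
        self.window.append((timestamp, file_id))

    def weight(self, a, b):
        return self.weights.get(frozenset((a, b)), 0)


graph = RelationshipGraph(window_seconds=30)
for t, f in [(0, 1), (10, 2), (20, 1), (100, 3), (105, 2)]:
    graph.record_access(t, f)
```

In this run, files 1 and 2 are accessed close together twice and end up with the strongest edge, while files 1 and 3 never share a window and stay unrelated.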


5. Contextual Search Module

The key factors involved in the contextual search are:

• It refines the content-based search.

• Its input is the content-based result.

• This content-based result is passed into the RG.

• The result is thus based on the relationship graph, which was created from the user's usage pattern.

• The BFS (breadth-first search) traversal method is used to traverse the graph.
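The steps above can be sketched as a breadth-first expansion of the content-based hits. The graph representation (a dictionary of neighbour weights), the depth limit and the rule of visiting heavier edges first are assumptions made for the illustration, and the file names are invented.

```python
from collections import deque


def contextual_search(graph, seeds, max_depth=1):
    """Expand content-based hits with related files found by BFS."""
    seen = set(seeds)
    order = list(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue
        # Visit stronger relationships first; ties are broken by name.
        neighbours = sorted(graph.get(node, {}).items(), key=lambda kv: (-kv[1], kv[0]))
        for neighbour, _weight in neighbours:
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                frontier.append((neighbour, depth + 1))
    return order


# Hypothetical relationship graph: file -> {related file: edge weight}.
graph = {
    "thesis.doc": {"ref_017.pdf": 5, "fig_3.png": 2},
    "ref_017.pdf": {"thesis.doc": 5, "ref_018.pdf": 1},
    "fig_3.png": {"thesis.doc": 2},
    "ref_018.pdf": {"ref_017.pdf": 1},
}
related = contextual_search(graph, ["thesis.doc"])
```

A search that matches only thesis.doc by content also returns the reference-numbered files that were used together with it, which is the behaviour described for the research scenario.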






CHAPTER 5

SYSTEM DESIGN

System design is the process of planning a new system or one to replace an existing system. During this stage the analyst works with the user to develop a physical model of the system. The modeling process and its outcome depend to a certain extent upon the system and on whether or not object-oriented design is followed. The detailed steps followed in arriving at the model are known as the methodology. There are several methodologies available, but currently the most popular is known as 'the unified process'.

Input design is the process of converting user-originated input into a computer-based format. It includes determining the record media, methods of input, speed of capture and entry into the system. The goal of designing input data is to make data entry easy. Thus the objective of the designer is to achieve the highest possible level of accuracy and to ensure that the input is acceptable to and understood by the user and the staff. A formatted data entry form is also provided, which prompts the user to enter the data in the appropriate location.

A quality output is one which meets the user requirements and presents information clearly. Output design is an important step in system design. Computer output is the most direct and important information source to the user. Efficient, intelligible output design should improve the system's relationship with the user and help in decision making. The primary consideration in the design of output is the information requirement and objectives of the end users. The major function of the output is to convey information, and so its layout and design need careful consideration.




5.1 ARCHITECTURAL FRAMEWORK

5.1.1 DATA FLOW DIAGRAM

A data flow diagram (DFD) is a graphic representation of the flow of data through a system. It consists of data flows, processes, sources, destinations and stores. The DFD is a central tool on which the other design tools are based. Logical DFDs show the transformation of data from input to output through processes. The resulting system model is called a data flow diagram.

Figure 5.1: DFD of CoDeSeAr



5.1.2 USE CASE DIAGRAM


Use Case diagrams identify the functionality provided by the system (use cases), the users who interact with the system (actors), and the association between the users and the functionality. Use Cases are used in the analysis phase of software development to articulate the high-level requirements of the system. Use Cases extend beyond pictorial diagrams. In fact, text-based use case descriptions are often used to supplement diagrams and explore use case functionality in more detail.

The primary goals of Use Case diagrams include:

• Providing a high-level view of what the system does

• Identifying the users ("actors") of the system

• Determining areas needing human-computer interfaces

Use Case Diagram Components

Use Case - Use cases are drawn using ovals. These ovals are labeled with verbs that represent the system's functions.

System - System boundaries are drawn using a rectangle that contains use cases.

Actors - Actors are the users of a system.

Relationships - Illustrate relationships between an actor and a use case with a simple line.

Figure 5.2: Use Case Diagram

5.2 COMPONENT DESIGN





CoDeSeAr is an addition to the traditional content-based search engine. The whole architecture can be divided into two components.

1) Content-Based Architecture

As mentioned earlier, this is the part where CoDeSeAr is similar to a traditional search engine. The content-based architecture can be divided into two components:

a) Indexing: The indexer handles the indexing of the files on the desktop. As can be seen in the figure, it also communicates with the Resource Monitor for dynamic updating.

Figure 5.3: Content-Based Indexing

b) Searching: This part of the content-based architecture handles the search query and result processing. The output of this querying is passed into the contextual search engine.

Figure 5.4: Content-Based Searching

2) Context-Based Architecture

The similarity between CoDeSeAr and the traditional search engines ends here. This is where the context-based architecture brings in a result set based on user-preference patterns. The important modules in this section are:

a) Resource Monitor

b) Relationship Graph

c) Contextual Search

The result set, after passing through the last module, is expected to be user-specific and likely to be more relevant because it is based on the user's usage patterns.








Figure 5.5: Contextual-Side Architecture

5.3 DATABASE DESIGN

A database is recognized as a standard component of management information systems and is available for virtually every computer system. The general theme behind a database is to integrate all the information. A database is an integrated collection of data and provides centralized access to the data.

Databases consist of software-based "containers" that are structured to collect and store information so that users can retrieve, add, update or remove such information in an automatic fashion. Database programs are designed so that users can add or delete any information needed. The structure of a database is tabular, consisting of rows and columns of information.





In CoDeSeAr, two tables are mainly used: csindex and csiindex. csindex maintains the forward index, and csiindex maintains the inverted index.

Figure 5.6: Design of forward index storing details of files in the system

Figure 5.7: Design of inverted index storing details of keywords in the system

The constraints associated with the table csindex are:

• The id (file id) is unique.

• The combination of filename and path is unique.

The constraints associated with the table csiindex are:

• The keyword entries are unique.

• Common English words and digits have to be avoided while creating the inverted index.
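The constraints listed above can be expressed directly in a table definition. The sketch below uses SQLite through Python purely for illustration; the report does not state which database engine was used here, the column file_ids and the stop-word list are assumptions, and only the table names and the id, filename, path and keyword columns come from the text.

```python
import sqlite3

STOP_WORDS = {"the", "and", "of", "a", "to"}  # hypothetical list of common words

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE csindex (
    id       INTEGER PRIMARY KEY,      -- unique file id
    filename TEXT NOT NULL,
    path     TEXT NOT NULL,
    UNIQUE (filename, path)            -- a file is identified by name and path
);
CREATE TABLE csiindex (
    keyword  TEXT PRIMARY KEY,         -- each keyword appears only once
    file_ids TEXT NOT NULL             -- assumed: ids of the files containing it
);
""")


def is_indexable(word):
    """Skip common English words and pure digits when building csiindex."""
    return not word.isdigit() and word.lower() not in STOP_WORDS


# Inserting the same filename and path twice violates the UNIQUE constraint.
conn.execute("INSERT INTO csindex (filename, path) VALUES (?, ?)", ("report.txt", "C:/docs"))
try:
    conn.execute("INSERT INTO csindex (filename, path) VALUES (?, ?)", ("report.txt", "C:/docs"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Letting the database enforce uniqueness means the indexer cannot create two entries for the same file, even if the same directory is traversed twice.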



5.4 INTERFACE DESIGN



The interface design is an important part of any software. The goal of the user interface is to make the user's interaction as simple and efficient as possible in terms of accomplishing user goals, which is often called user-centered design.

Our system provides a user-friendly interface. The interface has been built using the Visual Studio Resource Editor. The resource components used in our system are Dialog, Textfield, Button, ListBox, Label, MenuBar, Accelerator Keys and Icons.


Dialogs



Standalone applications typically hav
e a main window that both displays the main data over
which the application operates and exposes the functionality to process that data through user
interface (UI) mechanisms like menu bars, tool bars, and status bars. A non
-
trivial
application may also di
splay additional windows to do the following:


•Display specific information to users.


•Gather information from users.


•Both display and gather information.


These types of windows are known as dialog boxes, and there are two types: modal and
modeless.


A modal dialog box is displayed by a function when the function needs additional data from
a user to continue. Because the function depends on the modal dialog box to gather data, the
modal dialog box also prevents a user from activating other windows in t
he application while
it remains open. In most cases, a modal dialog box allows a user to signal when they have
finished with the modal dialog box by pressing either an OK or Cancel button. Pressing the
OK button indicates that a user has entered data and w
ants the function to continue
processing with that data. Pressing the Cancel button indicates that a user wants to stop the
function from executing altogether. The most common examples of modal dialog boxes are
shown to open, save, and print data.


A modeless dialog box, on the other hand, does not prevent a user from activating other
windows while it is open. For example, if a user wants to find occurrences of a particular
word in a document, a main window will often open a dialog box to ask a user what word
they are looking for. Since finding a word doesn't prevent a user from editing the document,
however, the dialog box doesn't need to be modal. A modeless dialog box at least provides a
Close button to close the dialog box, and may provide additional buttons to execute specific
functions, such as a Find Next button to find the next word that matches the find criteria of a
word search.



The advantage of a Dialog over a Frame is that a Dialog can be made modal (it waits for user
interaction) and it always remains on top of its parent.


Textfields - A text field is a text component that allows the editing of a single line of text.
Text fields are used to get information from users. They are used wherever data entry is
needed, as in places where a user id, password, job name, or source and destination database
details are to be entered.


Button - A button is a component the user clicks to trigger a specific action. In this software,
navigation is made possible with the help of buttons, i.e. for searching, opening files, OK etc.
Key functionalities like selecting options, performing operations like fetch, insert, update,
delete and test, and choosing which dialog is to be displayed after the current dialog, are also
handled with the help of buttons.


Label - A display area for a short text string or an image, or both. A label does not react to
input events. Labels are used to make the software understandable, i.e. they are placed along
with text fields to make users understand what exactly needs to be entered.


MenuBar



A Menu is a control that allows hierarchical organization of elements associated with
commands or event handlers. Each Menu can contain multiple MenuItem controls. Each
MenuItem can invoke a command or invoke a Click event handler. A MenuItem can also
have multiple MenuItem elements as children, forming a submenu.


The following illustration shows the three different states of a menu control. The default state
is when no device such as a mouse pointer is resting on the Menu. The focus state occurs
when the mouse pointer is hovering over the Menu, and the pressed state occurs when a mouse
button is clicked over the Menu.


Menus in different states





5.8 Sample Menu


ListBox


A Windows Forms ListBox control displays a list from which the user can select one or
more items. If the total number of items exceeds the number that can be displayed, a scroll
bar is automatically added to the ListBox control. When the MultiColumn property is set to
true, the list box displays items in multiple columns and a horizontal scroll bar appears.
When the MultiColumn property is set to false, the list box displays items in a single column
and a vertical scroll bar appears. When ScrollAlwaysVisible is set to true, the scroll bar
appears regardless of the number of items. The SelectionMode property determines how
many list items can be selected at a time. The list box in our package is used to display the
search results of various filenames present in the database.



5.9 Sample Listbox


Accelerator Keys

Accelerators are closely related to menus: both provide the user with access to an
application's command set. Typically, users rely on an application's menus to learn the
command set and then switch over to using accelerators as they become more proficient with
the application. Accelerators provide faster, more direct access to commands than menus do.
At a minimum, an application should provide accelerators for the more commonly used
commands. Although accelerators typically generate commands that exist as menu items,
they can also generate commands that have no equivalent menu items.




Accelerator Tables

An accelerator table consists of an array of ACCEL structures, each defining an individual
accelerator. Each ACCEL structure includes the following information:


•The accelerator's keystroke combination.

•The accelerator's identifier.

•Various flags. This includes one that specifies whether the system is to provide visual
feedback by highlighting the corresponding menu item, if any, when the accelerator is used



To process accelerator keystrokes for a specified thread, the developer must call the
TranslateAccelerator function in the message loop associated with the thread's message
queue. The TranslateAccelerator function monitors keyboard input to the message queue,
checking for key combinations that match an entry in the accelerator table. When
TranslateAccelerator finds a match, it translates the keyboard input (that is, the WM_KEYUP
and WM_KEYDOWN messages) into a WM_COMMAND or WM_SYSCOMMAND
message and then sends the message to the window procedure of the specified window. The
following illustration shows how accelerators are processed.



5.10 TranslateAccelerator Processing


The WM_COMMAND message includes the identifier of the accelerator that caused
TranslateAccelerator to generate the message. The window procedure examines the identifier
to determine the source of the message and then processes the message accordingly.




Accelerator tables exist at two different levels. The system maintains a single, system-wide
accelerator table that applies to all applications. An application cannot modify the system
accelerator table. For a description of the accelerators provided by the system accelerator
table, see Accelerator Keystroke Assignments.


The system also maintains accelerator tables for each application. An application can define
any number of accelerator tables for use with its own windows. A unique 32-bit handle
(HACCEL) identifies each table. However, only one accelerator table can be active at a time
for a specified thread. The handle to the accelerator table passed to the TranslateAccelerator
function determines which accelerator table is active for a thread. The active accelerator table
can be changed at any time by passing a different accelerator-table handle to
TranslateAccelerator.


Icons

The system uses icons throughout the user interface to represent objects such as files, folders,
shortcuts, applications, and documents. The icon functions enable applications to create,
load, display, arrange, animate, and destroy icons.


CHAPTER 6

IMPLEMENTATION METHODOLOGY




Implementation includes placing the system into operation and providing the
users and operation personnel with the necessary documentation to use and maintain the new
system. Implementation includes all those activities that take place to convert from the old
system to the new. The new system may be totally new, replacing an existing system. Proper
implementation is essential to provide a reliable system that meets the organizational
requirements. Successful implementation may not guarantee improvement in the organization
using the new system, but improper installation will prevent any improvement. There
are four methods for handling a system conversion.


Parallel approach:- The old system is operated alongside the new system.

Direct cut over method:- The old system is replaced with the new one.

Pilot approach:- A working version of the system is implemented in one part of the organization.
Based on the feedback, changes are made and the system is installed in the rest of the
organization by one of the other methods.

Phase-in method:- The system is implemented gradually across all users.

Since the proposed system is a project module administrator's system, the phase-in method is
suitable.


6.1 PROBLEM STATEMENT



The need for finding information based on the user's preference becomes
important. A resource monitor watches over the user usage patterns and develops the
relationship graph. It also keeps the index up-to-date. The search should return a
content-based as well as a context-based result.


6.2 PROBLEM DESCRIPTION



The project aims at bringing the contextual factor into desktop search along
with the conventional content-based search. The system in its first run creates an index of the
complete file system. Subsequently, whenever there is some file to be searched, the query is
performed on this database. This gives out the content-based result set.




The resource monitor keeps track of all the activities happening in the
computer. Whenever there is a file operation like addition, deletion or renaming of a file, the
index is updated accordingly. For this it communicates with the indexer of the content-based
architecture. On the other hand, the resource monitor is also responsible for creating the
relations between the files as and when they are used. This is a reflection of the user usage
pattern. Once the content side returns the content-based results, this result set is passed into
the relationship graph, from where the contextually related files are found. Thus, the
context-based result is displayed to the user.



6.3 FEATURES OF THE PROJECT




• Provides content-based desktop search results.

• Context-based results provide relevant results and reduce false negatives.

• Resource Monitor watches over all file manipulations and keeps the index up-to-date.

• User has the option to start or stop the resource monitor at will.

• The weight of edges gives an indication of the strength of the relationship between
different nodes.

• The system employs a GUI and is user-friendly.

• Optimization of search queries and search results is done wherever possible.

• The system can be customized according to the user preference.














CHAPTER 7

PSEUDO CODE



INDEXING MODULE


Drivelister.cpp



drivelister(): This function lists all the logical drives attached to a computer. The
drive list is then passed on to the indexer function, which recursively traverses the
directories to get the file names.


Lister.cpp

main(): Fetches the drive list and passes the drives one by one into the indexer function.

indexer(): The indexer gets the directory name as its argument, scans all the files in that
directory and lists them into the database. Additionally, it recursively calls the indexer
function on the subdirectories and keeps on adding their files.

pathsetter(): A small but very useful function. It is used to add double slashes into
path names, so that string operations are not affected.
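The escaping step described for pathsetter() can be illustrated with a minimal sketch. The signature and the use of std::string are assumptions made for illustration; the project's actual implementation is not reproduced in this report.

```cpp
#include <cassert>
#include <string>

// Hypothetical re-creation of pathsetter(): every backslash in the path is
// doubled so that the path survives being embedded in a quoted string
// (for example, in a database query).
std::string pathsetter(const std::string& path) {
    std::string escaped;
    escaped.reserve(path.size() * 2);  // worst case: every character doubles
    for (std::string::size_type i = 0; i < path.size(); ++i) {
        escaped.push_back(path[i]);
        if (path[i] == '\\') {
            escaped.push_back('\\');   // emit the second backslash
        }
    }
    return escaped;
}
```

With this sketch, the path D:\Test becomes D:\\Test, while a string without backslashes is returned unchanged.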



CONTENT BASED SEARCHING MODULE


CntntSearch.cpp



noofwords(): Returns the number of words in the search query. Very useful in performing
the search query optimization.

main(): Performs the search query optimization and communicates with the database
to get the result set. It then prioritizes the result set and passes the processed
results to the contextual search engine.
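A word count of the kind noofwords() is described as returning could be sketched as follows. This assumes that words are separated by whitespace; the report does not state how the original function tokenizes the query.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical sketch of noofwords(): counts whitespace-separated words in
// the search query. Leading, trailing and repeated spaces are ignored.
std::size_t noofwords(const std::string& query) {
    std::istringstream in(query);
    std::string word;
    std::size_t count = 0;
    while (in >> word) {  // operator>> skips whitespace between words
        ++count;
    }
    return count;
}
```

A query optimizer could use this count, for instance, to decide whether a query is a single keyword or a phrase that needs to be split before it is sent to the database.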







RESOURCE MONITOR MODULE

Watcher.cpp



pathsetter(): Changes the address path from single slash to double slash, e.g.
'D:\Test' to 'D:\\Test'. Used in string operations.

databasehandler(): Responsible for keeping the indexes up to date. Whenever a
file manipulation occurs, the change is reflected in the indexes.

directorywatch(): Initiates the directory watch.

directorychanges(): Monitors the changes.

Filechanges(): Returns information about file changes.

filerelationcreator(): Creates relationships among files being created or modified.

timer(): Implements the timer that defines the relationship window.

MyEnumProc(): Used to track open windows.

getfileid(): Gets the unique file id for the filename stored in index.txt.

PidFinder(): Tracks the process and returns pids to handle.
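One plausible reading of how timer() and filerelationcreator() cooperate is that two files become related when they are used within the same time window. The sketch below illustrates that idea only; the AccessEvent structure, the related_pairs() function and the window length are all hypothetical and are not taken from Watcher.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical record of one file access: which file, and when (in seconds).
struct AccessEvent {
    int file_id;
    long seconds;
};

// Returns every pair of distinct files whose accesses lie at most
// window_seconds apart. The events are assumed to be sorted by time.
std::vector<std::pair<int, int> > related_pairs(
        const std::vector<AccessEvent>& events, long window_seconds) {
    std::vector<std::pair<int, int> > pairs;
    for (std::size_t i = 0; i < events.size(); ++i) {
        for (std::size_t j = i + 1; j < events.size(); ++j) {
            if (events[j].seconds - events[i].seconds > window_seconds) {
                break;  // sorted input: later events are even further away
            }
            if (events[i].file_id != events[j].file_id) {
                pairs.push_back(
                    std::make_pair(events[i].file_id, events[j].file_id));
            }
        }
    }
    return pairs;
}
```

Each returned pair would then be handed to the relationship graph, either creating a new edge or strengthening an existing one.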


RELATIONSHIP GRAPH MODULE


Headers.h



save_relationgraph(): Saves the relation graph into data.txt.

read_relationgrah(): Reads the relation graph from data.txt into vertex[].

addedge(): Adds edges to the relation graph.

creategraph(): Creates the relation graph.

checkedge(): Checks whether an edge exists between node x and node y.

updateweight(): Updates the weight based on frequency of visit.

checkedgecondition(): Conditions for edge formation.

createadjacencymatrix(): Creates the adjacency matrix.

initialise(): Lays out the nodes.

getfilename(): Gets the filename for the given id.
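The edge operations listed above can be sketched on top of an adjacency matrix. The RelationGraph class below is a hypothetical illustration: it assumes an undirected graph with integer weights that grow by one per repeated co-use, which the report does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical adjacency-matrix relation graph. weight_[x][y] == 0 means
// "no edge"; a positive value is the strength of the relationship.
class RelationGraph {
public:
    explicit RelationGraph(std::size_t nodes)
        : weight_(nodes, std::vector<int>(nodes, 0)) {}

    // True if an edge exists between node x and node y.
    bool checkedge(std::size_t x, std::size_t y) const {
        return weight_.at(x).at(y) > 0;
    }

    // Adds an edge of weight 1 unless it exists already (no self-loops).
    void addedge(std::size_t x, std::size_t y) {
        if (x == y || checkedge(x, y)) return;
        weight_.at(x).at(y) = 1;
        weight_.at(y).at(x) = 1;
    }

    // Strengthens the relationship each time x and y are used together.
    void updateweight(std::size_t x, std::size_t y) {
        if (x == y) return;
        if (!checkedge(x, y)) {
            addedge(x, y);
            return;
        }
        ++weight_.at(x).at(y);
        ++weight_.at(y).at(x);
    }

    int weight(std::size_t x, std::size_t y) const {
        return weight_.at(x).at(y);
    }

private:
    std::vector<std::vector<int> > weight_;
};
```

An adjacency matrix keeps checkedge() a constant-time lookup, at the cost of memory that grows with the square of the number of indexed files.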




CONTEXTUAL SEARCH MODULE


contentresultoperations.h



read_contentresult(): Gets the prioritized result set from the content-based search
architecture.

bfs(): Performs a breadth-first search (BFS) in the graph.

Displayresult(): Displays the contextual search results.
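How a breadth-first search can turn content-based results into contextual ones is sketched below. The related_files() function, its depth limit and the matrix representation are assumptions for illustration, not the code in contentresultoperations.h.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// weight[x][y] > 0 means files x and y are related (square matrix assumed).
// Starting from the content-based results (seeds), returns the file ids
// reachable within max_depth hops in breadth-first order, without the seeds.
std::vector<std::size_t> related_files(
        const std::vector<std::vector<int> >& weight,
        const std::vector<std::size_t>& seeds,
        std::size_t max_depth) {
    const std::size_t n = weight.size();
    std::vector<bool> visited(n, false);
    std::queue<std::pair<std::size_t, std::size_t> > frontier;  // (node, depth)

    for (std::size_t i = 0; i < seeds.size(); ++i) {
        if (seeds[i] < n && !visited[seeds[i]]) {
            visited[seeds[i]] = true;
            frontier.push(std::make_pair(seeds[i], std::size_t(0)));
        }
    }

    std::vector<std::size_t> related;
    while (!frontier.empty()) {
        const std::pair<std::size_t, std::size_t> current = frontier.front();
        frontier.pop();
        if (current.second == max_depth) continue;  // do not expand further
        const std::vector<int>& row = weight[current.first];
        for (std::size_t next = 0; next < n && next < row.size(); ++next) {
            if (row[next] > 0 && !visited[next]) {
                visited[next] = true;
                related.push_back(next);
                frontier.push(std::make_pair(next, current.second + 1));
            }
        }
    }
    return related;
}
```

Limiting the depth keeps the contextual result set close to the files the content-based search actually matched; a ranking by edge weight could be layered on top of this traversal.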


GUI MODULE





CoDeSeAr, even though a powerful tool, is incomplete without a GUI which
facilitates user interaction. The CoDeSeAr GUI is a simple-to-use interface that requires no
additional instructions on how to use it. With the required text boxes, list boxes and
menus, the GUI has been designed keeping simplicity and clarity in mind.






CHAPTER 8

SOFTWARE DESCRIPTION


This software is implemented using Visual C++. The software provides an interactive
graphical user interface (GUI). The user interface is complete with the required text boxes, list
boxes, menus and message boxes.




When the software is run, a small window appears. It contains a text box for entering the
search query. The user enters the search string and then clicks on content search.
Once the content-based search results are ready, the user clicks on context-based results. A
new window appears which shows the content-based as well as the context-based search
results side-by-side.



Ample menus are provided on top of the software window to start important
operations like the resource monitor, the index monitor and the indexing operation. For users
interested in knowing the relationship graph formed, we have provided an option to show the
graph formed along with the weights between the nodes.




The relevant message boxes keep informing the users of the current status of the
software. Thus, the software along with the GUI is easy to use even for a not so
computer-savvy person, and caters for all kinds of users.




CHAPTER 9

TESTING AND VALIDATION



Software testing is the process of checking whether the developed system is
working according to the original objectives and requirements. The system should be tested
experimentally with test data so as to ensure that the system works according to the required
specification. Once the system is found to be working, it is tested with actual data and its
performance is checked. Software testing is a critical element of software quality assurance and
represents the ultimate review of specification, design and coding.


Need for Testing




• Detect the existence of program defects or inadequacies.

• Verify that the software behaves as intended by its designer.

• Verify conformance with the requirement specification/user needs.

• Assess the operational reliability of the system.

• Reflect the frequency of actual user inputs.

• Find the fault which caused the output anomaly.

• Detect flaws and deficiencies in the requirements.

• Check whether the software is operationally useful.

• Exercise the program using data like the real data processed by the program.

• Test the system capabilities.

• Check whether or not the program is usable in practice.



This application was tested with various test cases. A lot of unexpected
errors came up while testing and these were rectified. The testing also provided an insight
into the potential of scaling the software. Overall, the known bugs have been rectified. But since
the concept is still at a nascent stage, there is a chance of issues creeping in at later stages,
though none that cannot be rectified.




CHAPTER 10

LIMITATIONS OF THE PROJECT AND SCOPE FOR FURTHER WORK


Most of the limitations in CoDeSeAr are actually a platform for further
enhancements. Currently, one limitation is poor memory efficiency. Another
drawback is the speed with which the results are returned. These were mainly due to the
time constraint we had in integrating the different modules, which were developed at different
places and at different times.


The CoDeSeAr has got a wide scope for future work. Some of the prominent ones
include:




• Individual User Relationship Graphs: Providing different users the ability to create
their own relationship graph. This will help in a better one-to-one correspondence.

• Scalability: Migration to bigger computers with huge amounts of data.

• Distributed Searching: Searching for documents located at various nodes in a
distributed network.

• Strength of relationship between files to be perfected.

• Search query pruning: Use of stemmers etc.
















CHAPTER 11


CONCLUSION





The CoDeSeAr tool has lived up to its expectations. The results returned were
more relevant and more user-specific. The usage patterns were successfully imprinted in the
relationship graphs, which were ultimately reflected in the relevant results obtained.




Though there were limitations such as speed and memory inefficiency, in the
end, these limitations took a back seat when the relevant results were obtained. Of course,
these limitations will be rectified in the coming editions. Also, since there is a huge scope for
contextual search and a wide choice of future work available, it may not be long before the
world shifts from content-based search paradigms to context-based search paradigms.























APPENDIX A: WORKING ENVIRONMENT


a.) SOFTWARE SPECIFICATION




Visual C++ 2009



Windows XP, Vista, 7 (Currently tested on these platforms)



MySQL









b.) HARDWARE SPECIFICATION



Pentium III 733 MHz or above



50 MB free hard disk space



512 MB RAM or above





















APPENDIX B: SCREEN SHOTS






B.1 Startup Form

B.2 Easy to use Menus and associated shortcut keys

B.3 Indexing Process-1

B.4 Indexing Process-2

B.5 Relationship Window-1

B.6 Relationship Window-2

B.7 Index Monitor-1

B.8 Index Monitor-2

B.9 Index Monitor-3

B.10 Index Monitor-4

B.11 Content-based and Contextual Result

B.12 Opening the files obtained as results directly from the application

B.13 A sample relationship graph (for study purposes only)

B.14 About us




REFERENCES




Books



Understanding Search Engines: Mathematical Modeling and Text Retrieval by
Michael W. Berry and Murray Browne.

Search Engine Optimization Bible by Jerry L. Ledford.

Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto.

Using Context to Enhance File Search by Craig A. N. Soules and Gregory R. Ganger.



Wiki Links



Indexing and Document Parsing - http://en.wikipedia.org/wiki/Index_(search_engine)

Anatomy of Search Architecture - http://infolab.stanford.edu/~backrub/google.html

Millions of other pages providing small but important support information
which cannot be named here due to the space constraint.