Why Desktop Search Engine - Google Code

gasowner, Data Management

31 Jan 2013



CSCI 311

Search Engine Report

CSCI 311 Software Process Management



**********To be completed******************

3.3 Desktop Search Engine Technologies ----- This one can be excluded.

6.2 Non-Functional Requirements ----- Crimson

7.2 Distribution ----- Jizong

7.3 Crawler ----- Desktop search engine, no crawler needed, am I right?

**********To be completed******************




System Report




Table of Contents

1 Application Specification
1.1 Overview
1.2 Search Engine Type
1.3 IDE
1.4 Open Source Search Engine
1.5 Lucene Features
2 Why Desktop Search Engine
3 Introduction to Desktop Search Engines
3.1 What is a Desktop Search Engine
3.2 How Desktop Search Engines Work
3.3 Desktop Search Engine Technologies
4 List of Search Engines - Desktop Search
4.1 Main Features and Benefits
4.2 Building a Desktop Search Engine
4.3 Open-source Search Engines in the market
4.4 Use Lucene Indexing
5 Objective
5.1 Setting Goals and Objective
5.2 Sub Objective
5.3 Budget Breakdown
6 Functional and Non-Functional Specification
6.1 Functional Requirements
6.2 Non-Functional Requirements
7 Technical Specification
7.1 System Architecture
7.2 Distribution
7.3 Crawler
8 Project Planning
8.1 Software Methodology
8.1.1 Waterfall
8.1.2 Rational Unified Process (RUP)
8.1.3 Proposed Model
8.2 Risk and Counter Measure
8.3 Use-Case Diagram
8.4 Sequence Diagram
8.5 Network Activity Diagram
8.6 Role and Tabular Summary
8.7 Effort Estimation
8.8 Timeline
9 Test Cases
10 Version Control Ideology
10.1 Introduction
10.2 Type
10.3 Setup
10.3.1 Repository
10.3.2 Client
10.3.3 Tortoise SVN Log
11 Screenshots
12 Appendices
12.1 Project Meeting Minutes
12.2 References

1 Application Specification

1.1 Overview

This is a distributed search engine built on Apache Lucene version 3.4.0. It can search the files on all local hard drives within a very short time, even when the volume of files is large. The search engine consists of two main parts that run independently of each other: the SearchIndexer and the SearchEngine.

SearchIndexer
Walks through every drive or folder and indexes the files into a local index database. This database is then used by the SearchEngine.

SearchEngine
Using the index database created by the SearchIndexer, the system can access the contents of files and search for any keyword. The results are returned once the search has completed.
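The split between the two parts can be illustrated with a minimal sketch. It uses only the Java standard library rather than the Lucene API, and the class names `SketchIndexer` and `SketchSearcher`, the tab-separated index file layout and the in-memory map of file contents are all assumptions made for illustration, not the project's actual code. The point it shows is that the searcher touches nothing except the index file written by the indexer, so the two can run at different times.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical stand-in for the SearchIndexer: writes one "term<TAB>file,file" line per term. */
class SketchIndexer {
    static void buildIndex(Map<String, String> contentByFileName, Path indexFile) {
        // term -> sorted set of the file names that contain it
        Map<String, TreeSet<String>> postings = new TreeMap<>();
        for (Map.Entry<String, String> doc : contentByFileName.entrySet()) {
            for (String term : doc.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, k -> new TreeSet<>()).add(doc.getKey());
                }
            }
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, TreeSet<String>> entry : postings.entrySet()) {
            lines.add(entry.getKey() + "\t" + String.join(",", entry.getValue()));
        }
        try {
            Files.write(indexFile, lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

/** Hypothetical stand-in for the SearchEngine: reads only the index file, never the documents. */
class SketchSearcher {
    static List<String> search(Path indexFile, String keyword) {
        String wanted = keyword.toLowerCase(Locale.ROOT);
        try {
            for (String line : Files.readAllLines(indexFile, StandardCharsets.UTF_8)) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2 && parts[0].equals(wanted)) {
                    return Arrays.asList(parts[1].split(","));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Collections.emptyList();
    }
}
```

This toy file format breaks if a file name contains a comma or a tab; a real index library such as Lucene manages its own on-disk structures and avoids that kind of limitation.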


Specification | Chosen Option
Search Engine Type | Desktop
Platform | Windows
Programming Language | Java
Integrated Development Environment (IDE) | NetBeans
Open Source Search Engine | Lucene

























1.2 Search Engine Type

Desktop search engines

A desktop search engine is a tool that searches the contents of a user's own computer files rather than the Internet. These tools are designed to find information on the user's PC, including web browser histories, e-mail archives, text documents, sound files, images and video. A desktop search engine can either build and maintain an index database, which gives reasonable performance when searching several gigabytes of data, or walk the file directories and search the files directly at query time. When indexing files, desktop search tools collect three types of information:

1. File and directory names
2. Metadata, such as titles, authors and comments, in file types such as MP3, PDF and JPEG
3. Content of supported documents
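As a rough illustration of these three kinds of information, the sketch below gathers them for a single file using only the Java standard library. The `FileRecord` class and its field names are hypothetical. File size and modification time stand in for metadata here, because reading titles or authors out of MP3, PDF or JPEG files needs format-specific parsers that are outside this sketch, and only `.txt` files are treated as a supported document type.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Locale;

/** Hypothetical record of what an indexer might store for one file. */
class FileRecord {
    final String name;          // 1. file name
    final String directory;     // 1. directory name
    final long sizeBytes;       // 2. metadata (file system attributes only)
    final long lastModifiedMs;  // 2. metadata (file system attributes only)
    final String content;       // 3. content, or null for unsupported file types

    FileRecord(String name, String directory, long sizeBytes, long lastModifiedMs, String content) {
        this.name = name;
        this.directory = directory;
        this.sizeBytes = sizeBytes;
        this.lastModifiedMs = lastModifiedMs;
        this.content = content;
    }

    /** Collects name, basic metadata and (for .txt files only) the text content. */
    static FileRecord describe(Path file) {
        try {
            BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
            String name = file.getFileName().toString();
            Path parent = file.toAbsolutePath().getParent();
            String content = null;
            if (name.toLowerCase(Locale.ROOT).endsWith(".txt")) {
                content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            }
            return new FileRecord(name, parent == null ? "" : parent.toString(),
                    attrs.size(), attrs.lastModifiedTime().toMillis(), content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```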


Web search engines

Web search engines work by storing information about many web pages, which they retrieve from the HTML itself. These pages are retrieved by a web crawler, an automated web browser that follows every link on a site. Crawler-based search engines, such as Google and Yahoo, compile their listings automatically: they "crawl" or "spider" the web, and people search through the resulting listings. These listings make up the search engine's index or catalogue.

A web search engine operates in the following order:

1. Web crawling
2. Indexing
3. Searching

Hybrid search engines

A hybrid search engine presents results from human-edited web directories alongside algorithmically generated results based on web crawling. More and more search engines are moving to a hybrid model. Examples of hybrid search engines are:

1. Yahoo (www.yahoo.com)
2. Google (www.google.com)

1.3 IDE

NetBeans is both a platform framework for Java desktop applications and an integrated development environment (IDE) for developing with Java, JavaScript and other common languages. NetBeans IDE 7.0, the latest version at the time of writing, was used for this project.

The NetBeans IDE is written in Java and runs anywhere a compatible JVM is installed, including Windows, Mac OS, Linux and Solaris. A JDK is required for Java development functionality, but not for development in other common programming languages such as C, C++ and PHP.

There are other well-known Java development tools, such as Eclipse and JCreator. However, we chose NetBeans for the following reasons.

1. Everything we need is available
Many of the features we require are available out of the box. We can simply download the pack we need and start using it right away. The drag-and-drop GUI controls require no hand-written code, which saves us a lot of time.

2. Support for multiple packs
NetBeans is feature-rich: we can select the appropriate packs for our different needs and resolve incompatible libraries easily.

3. Past experience
Most of the group members have experience using NetBeans for Java development of either console or GUI applications.

1.4 Open Source Search Engine

Below is the list of options our team shortlisted:

Name | Description
Sphinx | Sphinx is a free software search engine designed with indexing database content in mind. It currently supports MySQL and PostgreSQL natively. It is distributed under the terms of the GNU General Public License v2.
PhpDig | A web spider and search engine written in PHP, using a MySQL database and flat files. It builds a glossary of the words found in indexed static and dynamic pages. On a search query, it displays a result page containing the search keys, ranked by occurrence. It includes a template system and can index PDF, Word, Excel and PowerPoint documents using external tools.
Lucene | Apache Lucene is a high-performance, full-featured, cross-platform text search engine library written entirely in Java. It is an open source project available for free download.

We chose Lucene for the following reasons:

Active development community
This helps us pick up the search engine API quickly, since references and help are easy to find on the Internet.

Supported language
Lucene is written in Java, which our team is most comfortable with. This reduces research time. Being based on Java also makes our product platform independent.

1.5 Lucene Features

Features provided by Lucene

Scalable, High-Performance Indexing
- over 95GB/hour on modern hardware
- small RAM requirements -- only 1MB heap
- incremental indexing as fast as batch indexing
- index size roughly 20-30% the size of text indexed

Powerful, Accurate and Efficient Search Algorithms
- ranked searching -- best results returned first
- many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
- fielded searching (e.g., title, author, contents)
- date-range searching
- sorting by any field
- multiple-index searching with merged results
- allows simultaneous update and searching

Cross-Platform Solution
- available as Open Source software under the Apache License, which lets you use Lucene in both commercial and Open Source programs
- 100%-pure Java
- implementations in other programming languages available that are index-compatible

2 Why Desktop Search Engine

Our group decided to develop a desktop search engine rather than a web search engine because the desktop is a simpler environment than the web. A web search engine would require us to deal with additional issues such as network protocols, content formatting and HTML parsing. We therefore believe that, with the experience and knowledge we have, we can design and implement a desktop search engine within the given time frame.

3 Introduction to Desktop Search Engines

3.1 What is a Desktop Search Engine

A desktop search engine is a tool that a user uses to search the files on their own computer. It is a software application that sorts through the large amount of data on one or more hard disks using some algorithm, and tries to locate what the user is searching for as quickly and accurately as possible.

Desktop search is emerging as a concern for large firms for two main reasons: untapped productivity and security. A commonly cited statistic states that 80% of a company's data is locked up as unstructured data: the information stored on end users' PCs, the files and directories they have created on a network, documents stored in repositories such as corporate intranets, and a multitude of other locations. Moreover, many companies have structured or unstructured information stored in older file formats to which they do not have ready access.

3.2 How Desktop Search Engines Work

A search engine works by indexing: it collects, parses and stores data to facilitate fast and accurate information retrieval.

The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would have to scan every document on the hard disk for each query, which requires considerable processing power and time. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours.
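The difference between the two approaches can be sketched in a few lines of plain Java. This is a simplified in-memory inverted index, not Lucene's implementation, and the class and method names (`IndexVersusScan`, `scan`, `buildIndex`, `lookup`) are made up for the example. `scan` re-reads every document on every query, whereas `lookup` is a single map access once `buildIndex` has paid the one-off indexing cost.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class IndexVersusScan {
    /** Splits text into lower-case words. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Sequential scan: tokenizes every document again for each query. */
    static Set<Integer> scan(List<String> docs, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        Set<Integer> hits = new TreeSet<>();
        for (int id = 0; id < docs.size(); id++) {
            if (tokens(docs.get(id)).contains(wanted)) {
                hits.add(id);
            }
        }
        return hits;
    }

    /** One-off indexing pass: maps each term to the ids of the documents containing it. */
    static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String t : tokens(docs.get(id))) {
                index.computeIfAbsent(t, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    /** Indexed query: one hash lookup, independent of how long the documents are. */
    static Set<Integer> lookup(Map<String, Set<Integer>> index, String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
    }
}
```

Both methods return the same document ids for the same term; only the amount of work done per query differs.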


When indexing files, desktop search tools store three main types of data:

1) File and directory names
2) Metadata, such as titles, authors and comments, in file types such as MP3, PDF and JPEG
3) Content of supported documents

For example, when the Google Desktop search application is installed, it starts indexing the PC's main drive. This process, which only takes place when the computer has been idle for 30 seconds or more, can take anywhere from several hours to a few days, depending on the volume of data. After the drive has been scanned, indexing takes place in real time with little effect on the computer's performance.

To perform a search, you simply type in the keywords or phrases you are looking for and click the search button. You then get a list of all the files that relate to your keywords.

A desktop search engine supports different types of searches, for example keyword, phrase and Boolean searches. The test cases in section 9 give example queries of each type.

4 List of Search Engines - Desktop Search

Name | Platform | Remarks | License
Autonomy | Windows | IDOL Enterprise Desktop Search. | Proprietary, commercial
Beagle | Linux | Open source desktop search tool for Linux based on Lucene. | A mix of the X11/MIT License and the Apache License
Copernic Desktop | Windows | Considered best overall search engine in 2005 UW benchmark study. | Free for home use
Docco | Cross-platform (Java) | Based on Apache's indexing and search engine Lucene, and it requires a Java Runtime Environment. | BSD License
Docfetcher | Cross-platform | Open source desktop search tool for Windows and Linux, based on Apache Lucene. | Eclipse Public License
dtSearch Desktop | Windows | - | Proprietary (30 day trial)
Easyfind | Mac OS | - | Freeware
Everything | Windows | Find files and folders by name instantly on NTFS volumes. | Freeware
Google Desktop | Linux, Mac OS, Windows | Integrates with the main Google search engine page. 5.9 Release now supports x64 systems. | Freeware
GNOME Storage | Linux | Open Source desktop search tool for Unix/Linux. | GPL
imgSeek | Linux, Mac OS, Windows | Desktop content-based image search. | GPL v2
InSight Desktop Search | Windows | Metadata-based search utility. | Freeware
ISYS Search Software | Windows | ISYS: desktop search software. | Proprietary (14 day trial)
Likasoft Archivarius 3000 | Windows | - | Proprietary (30 day trial)
Meta Tracker | Linux, Unix | Open Source desktop search tool for Unix/Linux. | GPL v2
Recoll | Linux, Unix | Open Source desktop search tool for Unix/Linux. | GPL
Spotlight | Mac OS | Found in Apple Mac OS X "Tiger" and later OS X releases. | Proprietary
Strigi | Linux, Unix, Solaris, Mac OS X and Windows | Cross-platform open source desktop search engine. | LGPL v2
Terrier Search Engine | Linux, Mac OS, Unix | Desktop search for Windows, Mac OS X (Tiger), Unix/Linux. | MPL
Tropes Zoom | Windows | Semantic Search Engine. | Freeware and commercial
Windows Search | Windows | Part of Windows Vista and later OSs. Available as Windows Desktop Search for Windows XP and Server 2003. Does not support indexing UNC paths on x64 systems. | Proprietary, freeware

4.1 Main Features and Benefits

Ordinary search tools are extremely slow because they scan the hard disk for every search. This is very time consuming, as hard drives can contain hundreds of gigabytes of data. The time taken for every search is practically the same, whether the target is a plain text file or an e-mail. For example, the Windows 'Search Companion' tool can only search through Windows files and folders, not e-mail or contact databases, and it is slow unless you enable the Indexing Service. With a desktop search engine, all you need to do is index the folders or drives you want to search. The indexing might take quite a while, but once it is complete, search results are returned in just a few seconds, and every later search takes the same few seconds.

4.2 Building a Desktop Search Engine

We have created a simple desktop search engine with a graphical user interface (GUI) built using Java Swing, in which the user first indexes folders and then searches them. To keep it simple, we index a specific folder rather than the whole hard disk, just to demonstrate how the Lucene API and indexing work.

4.3 Open-source Search Engines in the market

Based on our group's research, we have identified some of the available open source search engines that best fit our requirements.

A table comparing the indexing performance over this Twitter data set across the selected vertical search solutions:

Lucene was the only solution that produced an index smaller than the input data size. Running it in optimize mode shaves off an additional 5 megabytes, at the cost of another ten seconds of indexing time. Sphinx and Zettair index the fastest.


4.4 Use Lucene Indexing

Based on these preliminary results, and on anecdotal information collected from the web and from people in the field, we would recommend Lucene for many vertical search indexing applications. Lucene is an IR library, so a wrapper platform such as Solr with Nutch is needed for dressings like snippets, crawlers and servlets.

The reasons for doing so are as follows:

- Open Source

5 Objective

5.1 Setting Goals and Objective

To develop a fully operational desktop search engine, with complete documentation, by 25 November 2011, meeting the requirements specified in CSCI311 Assignment 1. The estimated budget is $20000.

5.2 Sub Objective

To achieve our objective, we need to meet the sub-objectives stated below:

- Allocate the tasks evenly among all members of the team
- Reach consensus on decisions and on the tools to be used
- Test the search engine thoroughly
- Do not exceed the allocated budget


5.3 Budget Breakdown

Items | Cost | Quantity | Subtotal
Laptop | $2000 | 5 | $10000
Labour | $100 x 20 days | 5 | $10000
Total | | | $20000

6 Functional and Non-Functional Specification

6.1 Functional Requirements

The system should be able to perform the following functions:

1. Crawl:
- The crawler must be able to crawl all available local drives.
- It must be able to search the content of txt files.
- It must be able to index files with the desired suffixes.

2. Query:
- It must accept Boolean operators.
- It must return the correct results.
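The crawl requirements above could be met along the following lines. This is a sketch using only the Java standard library; the class name `SuffixCrawler` and its methods are hypothetical, and suffixes are assumed to be given in lower case (for example `.txt`). `File.listRoots()` returns the available file system roots, which on Windows are the drive letters, and `Files.walkFileTree` visits every file under a starting folder while skipping entries that cannot be read.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SuffixCrawler {
    /** Returns all available file system roots (the drive letters on Windows). */
    static List<Path> localRoots() {
        List<Path> roots = new ArrayList<>();
        File[] found = File.listRoots();
        if (found != null) {
            for (File root : found) {
                roots.add(root.toPath());
            }
        }
        return roots;
    }

    /** Walks the tree under start and returns, in sorted order, the files ending in a wanted suffix. */
    static List<Path> crawl(Path start, Set<String> lowerCaseSuffixes) {
        List<Path> matches = new ArrayList<>();
        try {
            Files.walkFileTree(start, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
                    for (String suffix : lowerCaseSuffixes) {
                        if (name.endsWith(suffix)) {
                            matches.add(file);
                            break;
                        }
                    }
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult visitFileFailed(Path file, IOException exc) {
                    // Skip files and folders that cannot be read instead of aborting the crawl.
                    return FileVisitResult.CONTINUE;
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(matches);
        return matches;
    }
}
```

Crawling every drive would then amount to calling `crawl` once for each entry of `localRoots()`.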


6.2 Non-Functional Requirements

- A search query must return results within 5 seconds 99.99% of the time.
- The system must be built for a total cost of $1000.
- The system must run on the Windows XP, Windows Vista and Windows 7 operating systems.

7 Technical Specification

7.1 System Architecture

7.2 Distribution

<How we are going to make it distributed>

7.3 Crawler

<The scope of the crawler>

<Where it is supposed to run>

8 Project Planning

8.1 Software Methodology

8.1.1 Waterfall

The waterfall model is a sequential design process in which the project phases flow from top to bottom. This model is suitable for software management because it produces specific artifacts at each stage. Each stage is distinct and must be completed before proceeding to the next.

Because of this rigid design, there is no feedback to previous stages. A requirement change during implementation is very hard to accommodate. No working prototype is created until a late stage, which makes it harder to integrate and visualise the final product.

8.1.2 Rational Unified Process (RUP)

This model is an iterative software development process framework. The 4 phases of this model are:

- Inception:
  o Gather basic requirements, identify business risks, establish the scope of the system
- Elaboration:
  o Design, implement and baseline an executable architecture; address major technical risks
- Construction:
  o Deploy internal and alpha releases, address users' needs, and end the phase by providing a fully functional release with supporting documentation
- Transition:
  o Ensure the software meets user requirements; fine-tune, configure and install the final product

RUP is deliberately flexible, making use of its 4 phases, multiple disciplines and iterations. It can resolve the project risks associated with a client's evolving requirements, which call for careful change request management. Less time is required for integration, as it goes on throughout the entire project.

The downside of this model is that it requires all team members to be experts in their fields. The development process is also complex and may not be easy to understand.

8.1.3 Proposed Model

Our proposed model is the waterfall model. Despite its disadvantages, it is still suitable for this project.

Here are our key reasons for choosing this model:

- The well-defined requirements mean there will be little or no change during the entire project.
- The small project scale and tight schedule suit this model.
- The distinct stages make it easy to understand the progress of the project. This is useful because our team has little experience in project management.




8.2 Risk and Counter Measure

No | Risk | Description | Impact | Likelihood | Counter Measure | Applicable
1 | Duration | Duration for the entire project is tight | 5 | 5 | Develop a project plan and monitor project progress closely | All
2 | Management | Team members' knowledge of project management might be limited | 4 | 3 | Choose a simpler method and start off slowly | All
3 | Technical | Experience of developing software might be limited. | 3 | 3 | Use the programming language most members are proficient with. | System Architect, System Analyst
4 | Development cost | Project budget affects the tools we use to develop the software | | | As there is no budget, we will opt to use open-source tools. | All
5 | Communication | As group members have other commitments, it will be hard to arrange meetings. | 4 | 2 | Meet regularly after lessons and hold online meetings to update on project progress via MSN and emails. | All
6 | Unrealistic project goal | If the project goal is unrealistic, planning for it will face many difficulties. | 4 | 3 | Review the project goal regularly according to progress | All
7 | Poor reporting of project status | A team member giving a poor report will result in poor decisions by the project management, which in turn affects project progress | 4 | 2 | Each module needs to be checked and integrated where possible. This helps to detect possible errors earlier. | System Analyst, tester, designer
8 | Exceed budget | This will affect the project cost | 5 | 1 | As there is no budget for this project, we will opt for options that are open-source | All
9 | Missing deadline | If the deadline of a stage is delayed, the other stages cannot proceed. This will affect the entire project timeline. | 3 | 3 | Hold regular meetings to track team progress and make contingency plans for stages that are likely to be delayed | All


8.3 Use-Case Diagram

8.4 Sequence Diagram

8.5 Network Activity Diagram

Activity Table

Event ID | Event | Duration | Early Start | Early End | Late Start | Late Finish
A | Feasibility Study | 5 | 0 | 5 | 0 | 5
B | Requirement Specification | 4 | 5 | 9 | 5 | 9
C | Project Planning | 5 | 9 | 14 | 9 | 14
D | Design | 5 | 14 | 19 | 14 | 19
E | Development | 10 | 19 | 29 | 19 | 29
F | Documentation | 29 | 0 | 29 | 0 | 29
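The start and finish times in the activity table can be reproduced with a forward and a backward pass over the activity network. The sketch below is one way to do that in Java. The dependencies are an assumption read off the table: A to E are taken to run one after another, and F (Documentation) is taken to run in parallel from day 0, since the activity diagram itself is not reproduced here. The class name `ActivityNetwork` and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ActivityNetwork {
    static final class Activity {
        final String id;
        final int duration;
        final List<String> predecessors;
        int earlyStart, earlyFinish, lateStart, lateFinish;

        Activity(String id, int duration, String... predecessors) {
            this.id = id;
            this.duration = duration;
            this.predecessors = Arrays.asList(predecessors);
        }
    }

    /**
     * Fills in the early and late times of every activity and returns the project length.
     * Activities must be listed so that predecessors come before their successors.
     */
    static int schedule(List<Activity> activities) {
        Map<String, Activity> byId = new HashMap<>();
        int projectEnd = 0;
        // Forward pass: an activity starts when its latest predecessor finishes.
        for (Activity a : activities) {
            a.earlyStart = 0;
            for (String p : a.predecessors) {
                a.earlyStart = Math.max(a.earlyStart, byId.get(p).earlyFinish);
            }
            a.earlyFinish = a.earlyStart + a.duration;
            projectEnd = Math.max(projectEnd, a.earlyFinish);
            byId.put(a.id, a);
        }
        // Backward pass: an activity must finish before its earliest successor has to start.
        for (Activity a : activities) {
            a.lateFinish = projectEnd;
        }
        for (int i = activities.size() - 1; i >= 0; i--) {
            Activity a = activities.get(i);
            a.lateStart = a.lateFinish - a.duration;
            for (String p : a.predecessors) {
                Activity pred = byId.get(p);
                pred.lateFinish = Math.min(pred.lateFinish, a.lateStart);
            }
        }
        return projectEnd;
    }

    /** Durations from the activity table; the dependency chain is assumed, not given. */
    static List<Activity> reportPlan() {
        return Arrays.asList(
                new Activity("A", 5),
                new Activity("B", 4, "A"),
                new Activity("C", 5, "B"),
                new Activity("D", 5, "C"),
                new Activity("E", 10, "D"),
                new Activity("F", 29));
    }
}
```

Under these assumptions every activity has its late times equal to its early times, so both the A to E chain and F have no slack.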


Activity Diagram






8.6 Role and Tabular Summary

Roles and descriptions

Project Manager
- Responsible for the progress of the project
- Schedule meetings
- Allocate tasks to members
- Monitor and ensure progress to meet deadlines

System Architect
- Design the architecture
- Identify key components of the system

System Analyst
- Research existing solutions
- Design and implement modules

Tester
- Design test cases
- Carry out the tests to ensure functionality

Designer
- Design the user interface of the system to integrate the functionality

Due to the tight schedule and small project scope, team members have to take up multiple roles to speed up project progress. Below is the role allocation table:

Role | Boon Ping | Crimson | Jizong | Kelvin | Ziyun
Project Manager | | | | | x
System Architect | | x | x | |
System Analyst | x | x | x | x |
Tester | x | | | |
Designer | | | | x |
Documentation | x | x | x | x | x

8.7 Effort Estimation

Albrecht complexity multipliers

External user type (function type) | Low complexity | Average complexity | High complexity
EI | 3 | 4 | 6
EO | 4 | 5 | 7
EQ | 3 | 4 | 6
LIF | 7 | 10 | 15
EIF | 5 | 7 | 10

Logical Interface Files (LIF): Indexed Database (count: 1)
External Interface Files (EIF): N.A.
External Input (EI) types: Indexer/Crawler (count: 1)
External Output (EO) types: Search Results (count: 1)
External Inquiry (EQ) types: Search Input (count: 1)

Assuming all function types are of average complexity and the team implements 3 function points per day:

Total complexity: 10 + 4 + 5 + 4 = 23
Days taken to implement the project: 23 / 3 = 7.7 days
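The same arithmetic can be written as a small Java calculation, which makes it easy to redo the estimate if a count or a complexity level changes. The class name `FunctionPointEstimate` and its methods are hypothetical; the multipliers and counts are the ones listed above, with EIF counted as 0 because it is marked N.A.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FunctionPointEstimate {
    // Column positions in the multiplier table.
    static final int LOW = 0;
    static final int AVERAGE = 1;
    static final int HIGH = 2;

    /** Albrecht multipliers as {low, average, high}, copied from the table above. */
    static Map<String, int[]> multipliers() {
        Map<String, int[]> table = new LinkedHashMap<>();
        table.put("EI", new int[] {3, 4, 6});
        table.put("EO", new int[] {4, 5, 7});
        table.put("EQ", new int[] {3, 4, 6});
        table.put("LIF", new int[] {7, 10, 15});
        table.put("EIF", new int[] {5, 7, 10});
        return table;
    }

    /** Number of items of each function type identified for this project. */
    static Map<String, Integer> reportCounts() {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("LIF", 1); // indexed database
        counts.put("EIF", 0); // not applicable
        counts.put("EI", 1);  // indexer/crawler
        counts.put("EO", 1);  // search results
        counts.put("EQ", 1);  // search input
        return counts;
    }

    /** Sums count x multiplier over all function types at one complexity level. */
    static int totalPoints(Map<String, Integer> counts, int complexity) {
        Map<String, int[]> table = multipliers();
        int total = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            total += entry.getValue() * table.get(entry.getKey())[complexity];
        }
        return total;
    }

    /** Converts function points into working days at a given productivity. */
    static double days(int points, double pointsPerDay) {
        return points / pointsPerDay;
    }
}
```

With the counts above, `totalPoints(reportCounts(), AVERAGE)` gives 23, and `days(23, 3)` gives roughly 7.67, which the report rounds to 7.7 days.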




8.8 Timeline

9 Test Cases

S/N | Test Case | Input | Expected Output | Actual Output | Result | Remarks
1 | Empty Field | empty | No results found | No results found | Pass | N.A.
2 | Single word that does not exist | ASDQWEASD | No results found | No results found | Pass | N.A.
3 | Single word | Microsoft | All TXT files containing Microsoft | All TXT files containing Microsoft | Pass | N.A.
4 | SPACE | Microsoft Oracle | All TXT files containing Microsoft Oracle | All TXT files containing Microsoft Oracle | Pass | N.A.
5 | AND | Microsoft AND Oracle | All TXT files containing Microsoft and Oracle | All TXT files containing Microsoft and Oracle | Pass | N.A.
6 | OR | Microsoft OR Oracle | All TXT files containing either Microsoft or Oracle | All TXT files containing either Microsoft or Oracle | Pass | N.A.
7 | - | Microsoft -Oracle | All TXT files containing Microsoft and without Oracle | All TXT files containing Microsoft and without Oracle | Pass | N.A.
8 | “” | “Microsoft Oracle” | All TXT files containing the exact words Microsoft Oracle | All TXT files containing the exact words Microsoft Oracle | Pass | N.A.
9 | AND and OR | Microsoft AND Oracle OR Apple | All TXT files containing Microsoft Oracle and Microsoft Apple | All TXT files containing Microsoft Oracle and Microsoft Apple | Pass | N.A.
10 | AND and OR and - | Microsoft AND Oracle OR Apple -HP | All TXT files containing Microsoft Oracle without HP and Microsoft Apple without HP | All TXT files containing Microsoft Oracle without HP and Microsoft Apple without HP | Pass | N.A.
11 | AND and OR and - and “” | Microsoft AND Oracle OR Apple -“HP” | All TXT files containing Microsoft Oracle without HP and Microsoft Apple without “HP” | All TXT files containing Microsoft Oracle without HP and Microsoft Apple without “HP” | Pass | N.A.
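The operators exercised by these test cases can be modelled as composable predicates over a document's text. The sketch below does this with `java.util.function.Predicate` from the Java standard library; it is not the Lucene query parser, and the class name `BooleanQuerySketch` and its helper methods are made up for illustration. Queries are composed in code rather than parsed from a string, so the grouping in a query such as test case 10 is an assumption taken from its expected output: Microsoft AND (Oracle OR Apple), without HP.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

class BooleanQuerySketch {
    /** Splits text into lower-case words. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Keyword query: the document contains the word (case-insensitive). */
    static Predicate<String> term(String word) {
        String wanted = word.toLowerCase(Locale.ROOT);
        return text -> tokens(text).contains(wanted);
    }

    /** Phrase query: the words appear next to each other, in order. */
    static Predicate<String> phrase(String words) {
        List<String> wanted = tokens(words);
        return text -> Collections.indexOfSubList(tokens(text), wanted) >= 0;
    }

    /** Returns the names of the matching documents in sorted order. */
    static List<String> search(Map<String, String> contentByName, Predicate<String> query) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> doc : new TreeMap<>(contentByName).entrySet()) {
            if (query.test(doc.getValue())) {
                hits.add(doc.getKey());
            }
        }
        return hits;
    }
}
```

AND, OR and the minus operator map onto the built-in `and`, `or` and `negate` methods of `Predicate`. Test case 10, for example, could be written as `term("Microsoft").and(term("Oracle").or(term("Apple"))).and(term("HP").negate())`.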



10 Version Control Ideology

10.1 Introduction

Version control is a critical tool for a software development team because of the frequent changes made to code and documents throughout the project. It also allows multiple team members to update the same files at the same time.

Each change made to a document is recorded. This promotes accountability and makes it easier to solve problems by rolling back to an earlier version if a serious mistake is made.

The 2 main models for version control are the client-server model and the distributed model. In the client-server model, there is a single repository on the server, and team members use a client to update or retrieve the main copy in that repository. In the distributed approach, each team member works directly with their own repository, and changes are shared between repositories as a separate step.


10.2

Type

Open-source and Proprietary

As there is no budget for this project, we chose an open-source tool.


Client-server and Distributed

Main features of a distributed system:

- Easier to work without a network connection, because you can commit changes to your own repository
- Possible to have multiple ‘central’ branches for different uses, such as development and stable branches
- Actions such as committing and viewing the history log are very fast, as there is no need to access the central server


Main features of a client-server system:

- Easier for a single person to keep control of the whole history and of project access
- A master copy is kept centrally rather than having multiple competing versions


We chose the traditional client-server approach over the distributed approach because our team has some experience with it and its workflow is easier to understand. We have no restrictions on network connectivity, so needing access to the central server should not be an issue.





10.3

Setup

10.3.1

Repository

Our ideal repository has to be:

- reliable, so that team members can save changes at any time
- secure, to prevent unauthorized access to our source code and documents
- easy to set up, due to the tight project schedule
- easy to use, as not all of our team members are familiar with version control concepts


Our chosen repository is Google Code.

The project host can be set up via http://code.google.com/hosting/






10.3.2

Client

We chose TortoiseSVN as the client because of its ease of use: all team members need to be able to pick it up within a short period.


The TortoiseSVN client can be downloaded at http://tortoisesvn.net/. The URL of our repository is:

https://csci311-distributed-search-engine.googlecode.com/svn/trunk/

To check out, use the checkout function and enter the URL:








10.3.3

TortoiseSVN Log


Log Messages







Log Statistics




11


Product Screenshots



1. Text Box
   a. Enter the keywords for your search
   b. Operators such as AND, OR, -, “” and combinations of operators are supported
2. Search Button
   a. Click this button to begin searching
3. Total Hits
   a. The total number of results returned by the search
4. File Name
   a. Name of a file that contains the searched keywords
   b. Double-click the cell to open the file directly
5. Directory
   a. Directory where the file resides
   b. Double-click the cell to open the directory
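
The report does not show how the double-click actions in items 4 and 5 are implemented. One possible approach, sketched below under stated assumptions, is to hand the file or directory to the operating system through the standard `java.awt.Desktop` class. The class name, the column indices and the helper methods are hypothetical and are not taken from the project's source code.

```java
import java.awt.Desktop;
import java.io.File;
import java.io.IOException;

// Illustrative only: maps a double-clicked result cell to a file or directory
// and asks the operating system to open it.
final class OpenResultSketch {

    // Hypothetical column positions in the result table.
    static final int FILE_NAME_COLUMN = 0;
    static final int DIRECTORY_COLUMN = 1;

    // Decide what to open: the directory itself, or the file inside it.
    static File targetFor(int column, String directory, String fileName) {
        File dir = new File(directory);
        return column == DIRECTORY_COLUMN ? dir : new File(dir, fileName);
    }

    // Open the target with the platform's default application.
    // Returns false when the target is missing or the platform cannot open it.
    static boolean open(File target) {
        if (!target.exists() || !Desktop.isDesktopSupported()) {
            return false;
        }
        Desktop desktop = Desktop.getDesktop();
        if (!desktop.isSupported(Desktop.Action.OPEN)) {
            return false;
        }
        try {
            desktop.open(target);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

A mouse listener on the result table could call `open(targetFor(column, directory, fileName))` when it sees a double-click. Returning `false` rather than throwing lets the user interface decide how to report a file that has been moved or deleted since it was indexed.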




12


Appendices


12.1

Project Meeting Minutes

Meeting No: 01
Date: 28th October 2011
Location: School Lab Level 5-17C/D
Time / Duration: 2100 Hrs / 1 hour
Present: Crimson Thia, Kelvin Yap, Marcus Lin, Ng Boon Ping, Zhuo Jizong

Topics Discussed:

- Everyone to research crawlers
- Outlined project requirements

Raw Information (notes taken down during meeting):

- First of all, we need to understand what a crawler is. It is something we have been using often, but we have no experience in implementing one.
- Open-source projects which we can refer to.
- What are the requirements of the project, both functional and non-functional?
- What is the timeline given? Based on the timeline, we need to know what the tasks are in each phase of development.






Meeting No: 02
Date: 4th November 2011
Location: School Lab Level 5-17C/D
Time / Duration: 2100 Hrs / 1 hour
Present: Crimson Thia, Kelvin Yap, Marcus Lin, Ng Boon Ping, Zhuo Jizong

Topics Discussed:

- Decision on desktop or web search engine
- Task allocation
- Tools to be used

Raw Information (notes taken down during meeting):

- Based on our research and after discussion, we have decided to develop a desktop search engine. From the group members' perspective, a desktop search engine is easier to implement and test.
- Tasks are distributed to all the members.
- Decision on which programming language to use: we decided to use Java, as there is more source code released in Java.
- Decision on which development tool to use: we decided to use NetBeans, as it is comprehensive compared to other development tools.
- Decision on which version control client to use: TortoiseSVN was chosen, as we have group members who have a lot of experience using it.







Meeting No: 03
Date: 11th November 2011
Location: School Lab Level 5-17C/D
Time / Duration: 2100 Hrs / 1 hour
Present: Crimson Thia, Kelvin Yap, Marcus Lin, Ng Boon Ping, Zhuo Jizong

Topics Discussed:

- Prototype of search engine
- Documentation first draft
- Distributed system

Raw Information (notes taken down during meeting):

- Demonstration of basic indexing and searching using a console application. Discovered various problems such as duplicate indexing and failure to index.
- Integration of all documentation work done by group members. Reviewed the critical sections such as development methodology and application specification.
- Decision on the distribution techniques to be applied by our search engine.







Meeting No: 04
Date: 18th November 2011
Location: School Lab Level 5-17C/D
Time / Duration: 2100 Hrs / 1 hour
Present: Crimson Thia, Kelvin Yap, Marcus Lin, Ng Boon Ping, Zhuo Jizong

Topics Discussed:

- Demonstration of product
- Documentation second draft
- Distributed system

Raw Information (notes taken down during meeting):

- Demo of indexing and searching with the GUI and Boolean operators. Suggestions were given to improve the product, such as displaying results in a better format.
- Integration of all documentation work done by group members. Discussed which sections can be improved and any missing information that should be added.
- Discussed the distribution method, on which no conclusion has been reached since it was first brought up. Various suggestions were given, such as multiple processes, multithreading, and even giving up on this part if we pursue a high-quality search engine.







Meeting No: 05
Date: 19th November 2011
Location: School Lab Level 5-17C/D
Time / Duration: 1400 Hrs / 1 hour
Present: Crimson Thia, Kelvin Yap, Marcus Lin, Ng Boon Ping, Zhuo Jizong

Topics Discussed:

- Product integration
- Product testing
- Documentation finalization

Raw Information (notes taken down during meeting):

- Discussed how, when and by whom the integration will be implemented; it will be carried out by Zyiun and Kelvin.
- Testing to be done after the meeting; team members will review the test outcome.
- All incomplete sections to be completed by the appointed members; other members may help in the finalization.







12.2

References

Apache Lucene, Apache Lucene - Overview, http://lucene.apache.org/java/docs/index.html

Apache Lucene, Lucene 3.4.0 core API, http://lucene.apache.org/java/3_4_0/api/core/index.html

TortoiseSVN, About TortoiseSVN, http://tortoisesvn.net/

Wikipedia The Free Encyclopedia, Index (search engine), http://en.wikipedia.org/wiki/Search_index

Wikipedia The Free Encyclopedia, Web search engine, http://en.wikipedia.org/wiki/Web_search_engine

Wikipedia The Free Encyclopedia, Desktop search, http://en.wikipedia.org/wiki/Desktop_search

Wikipedia The Free Encyclopedia, Systems development life-cycle, http://en.wikipedia.org/wiki/Systems_development_life-cycle

Wikipedia The Free Encyclopedia, Software development methodology, http://en.wikipedia.org/wiki/Software_development_methodologies