For ITCS 6265


Dec 13, 2013


Professor: Wensheng Wu

Presented by TA: Xu Fei

What is Lucene


“Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.”


high-performance, scalable Information Retrieval (IR) library


a project in the Apache Software Foundation


mature, free, open-source


implemented in Java

Full-text indexing and searching


“In text retrieval, full text search refers to a technique for searching a computer-stored document or database. In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.”
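The definition above can be illustrated with a minimal sketch that scans every word of every stored document for the user's search words. This is not Lucene code; the names `tokenize` and `full_text_search` and the two sample documents are assumptions made for illustration only.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def full_text_search(documents, query):
    """Return the ids of documents that contain every word of the query.

    Every word of every document is examined on each call, which is the
    naive form of full-text search (no index is built).
    """
    wanted = set(tokenize(query))
    hits = []
    for doc_id, text in documents.items():
        if wanted <= set(tokenize(text)):
            hits.append(doc_id)
    return hits


# Hypothetical sample documents.
documents = {
    "doc1": "Apache Lucene is a text search engine library",
    "doc2": "Full text search examines every stored document",
}

print(full_text_search(documents, "text search"))
print(full_text_search(documents, "lucene"))
```

Scanning everything on each query gets slow as the collection grows, which is the motivation for building an index ahead of time.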


“Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.”

Lucene is popular


a number of ports or integrations to other programming languages


C/C++, C#, Ruby, Perl, Python, PHP, etc.


1500+ installations:



HP, FedEx, Iron Mountain, Akamai, DSpace, IBM/Yahoo, Healthline, Webmail, CNET, Lookout (acquired by Microsoft), webshots.com (100M docs, 4M queries/day), Siderean, Monster…

Lucene is just a hammer!


NOT a ready-to-use search application, like Google


a software library, a toolkit


a single compact JAR file (less than 1 MB!)


A number of full-featured search applications have been built on top of Lucene.

What Lucene can do for you


add search capabilities to your application


index and make searchable any data that you can extract text from


Lucene doesn’t care about the source of the data, its format, or even its language, as long as you can derive text from it.


You can even index data stored in your databases, indirectly!
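One way to read "indirectly" is that the application first pulls rows out of the database and turns each row into plain text, which is then handed to the indexer. The sketch below shows only that extraction step, using an in-memory SQLite database; the `articles` table, its columns, and the function `rows_to_documents` are hypothetical.

```python
import sqlite3


def rows_to_documents(connection):
    """Turn each database row into an (id, text) pair an indexer could consume."""
    cursor = connection.execute(
        "SELECT id, title, body FROM articles ORDER BY id"
    )
    return [(row[0], f"{row[1]} {row[2]}") for row in cursor]


# Hypothetical table and rows, created in memory for the example.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
)
connection.executemany(
    "INSERT INTO articles (title, body) VALUES (?, ?)",
    [("Lucene", "A search library"), ("Apache", "A software foundation")],
)

docs = rows_to_documents(connection)
for doc_id, text in docs:
    print(doc_id, text)
```

The search library never talks to the database itself; it only sees the text derived from each row.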

Search Application

Figure 1. Typical components of search application; the shaded components show which parts Lucene handles.


Components for indexing


Acquire Content


Build Document


Analyze Document


Index Document


Components for searching


Search User Interface


Build Query


Search Query


Render Results


Others


Administration Interface


Analytics Interface


Scaleout
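The indexing and searching components listed above can be sketched as a chain of small functions, one per component. This is a toy pipeline under stated assumptions, not Lucene's API: every function name, the dict-based document, and the two sample files are made up for illustration.

```python
import re
from collections import defaultdict


def acquire_content():
    # Hypothetical raw content; a real application would read files,
    # web pages, or database rows here.
    return [
        ("a.txt", "Lucene adds search to your application"),
        ("b.txt", "Index any data you can extract text from"),
    ]


def build_document(name, raw_text):
    # A document is modeled as a dict of named fields.
    return {"name": name, "contents": raw_text}


def analyze(text):
    # Analysis step: lowercase the text and split it into tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def index_document(index, document):
    # Record, for every token, which document it occurs in.
    for token in analyze(document["contents"]):
        index[token].add(document["name"])


def build_query(user_input):
    # The query goes through the same analysis as the documents.
    return analyze(user_input)


def search(index, query_terms):
    # Return the documents that contain all query terms.
    if not query_terms:
        return set()
    postings = [index.get(term, set()) for term in query_terms]
    return set.intersection(*postings)


def render_results(hits):
    return [f"{rank}. {name}" for rank, name in enumerate(sorted(hits), start=1)]


index = defaultdict(set)
for name, raw_text in acquire_content():
    index_document(index, build_document(name, raw_text))

hits = search(index, build_query("search application"))
print(render_results(hits))
```

In terms of Figure 1, the analyze, index, query, and search steps correspond to the parts a search library handles, while acquiring content and rendering results stay with the application.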

Ranking formula

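\[
\mathrm{score}(Q,D) = \mathrm{coord}(Q,D) \cdot \mathrm{queryNorm}(Q) \cdot \sum_{t \in Q} \Big( \mathrm{tf}(t \in D) \cdot \mathrm{idf}(t)^{2} \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(D) \Big)
\]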




tf-idf weight (term frequency–inverse document frequency)
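A small worked sketch may help to see how the factors of the ranking formula combine. The concrete choices below are assumptions: tf as the square root of the term count, idf as 1 + ln(numDocs / (docFreq + 1)), coord as the fraction of query terms matched, queryNorm as 1 / sqrt(sum of squared term weights), and norm(D) as 1 / sqrt(document length) with no field or document boosts. They are modeled on the defaults described in the Lucene 2.4 Similarity documentation, but this is not Lucene code and the toy corpus is made up.

```python
import math


def tf(freq):
    # Assumed default: square root of the term's count in the document.
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    # Assumed default: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def score(query_terms, doc_tokens, corpus, boosts=None):
    """Score one tokenized document against a list of query terms."""
    boosts = boosts or {}
    if not query_terms or not doc_tokens:
        return 0.0
    num_docs = len(corpus)
    doc_freq = {t: sum(1 for d in corpus if t in d) for t in query_terms}
    weights = {t: idf(doc_freq[t], num_docs) for t in query_terms}
    # queryNorm(Q): the same for every document, so it does not change ranking.
    sum_of_squares = sum((weights[t] * boosts.get(t, 1.0)) ** 2 for t in query_terms)
    query_norm = 1.0 / math.sqrt(sum_of_squares)
    # norm(D): length normalization only; boosts on fields or documents are left out.
    norm = 1.0 / math.sqrt(len(doc_tokens))
    matched = [t for t in query_terms if t in doc_tokens]
    # coord(Q,D): fraction of the query terms found in the document.
    coord = len(matched) / len(query_terms)
    total = sum(
        tf(doc_tokens.count(t)) * weights[t] ** 2 * boosts.get(t, 1.0) * norm
        for t in matched
    )
    return coord * query_norm * total


# Toy corpus of already-tokenized documents (hypothetical).
corpus = [
    ["penn", "state", "football", "football"],
    ["football", "players", "state"],
]
query = ["penn", "football"]
for number, doc in enumerate(corpus, start=1):
    print(f"Doc {number}: {score(query, doc, corpus):.3f}")
```

With this query the first document matches both terms and the second only one, so the first ranks higher; the rarer term "penn" also carries a larger idf than "football", which occurs in both documents.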


Key index files in Lucene


Segments file


Fields information file


Text information file


Frequency file


Position file

Inverted Index Example


Doc 1: Penn State Football … football

Doc 2: Football players … State

Posting Table

id   word       doc     offset
1    football   Doc 1   3
                Doc 1   67
                Doc 2   1
2    penn       Doc 1   1
3    players    Doc 2   2
4    state      Doc 1   2
                Doc 2   13
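A posting table like this one can be built with a few lines of code. The sketch below is a minimal illustration, not Lucene's index format: it records word positions (1-based) rather than real offsets, the function name `build_inverted_index` is an assumption, and the two documents are cut off where the slide elides their text.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to a list of (document name, 1-based word position)."""
    index = defaultdict(list)
    for name, text in documents.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words, start=1):
            index[word].append((name, position))
    # Sort the words so that ids can be assigned in alphabetical order.
    return dict(sorted(index.items()))


# Only the visible beginning of each document; the elided text is omitted.
documents = {
    "Doc 1": "Penn State Football",
    "Doc 2": "Football players",
}

index = build_inverted_index(documents)
for word_id, (word, postings) in enumerate(index.items(), start=1):
    print(word_id, word, postings)
```

For these truncated documents the positions agree with the table entries 3, 1, 1, 2, and 2. The entries 67 and 13 come from text the slide does not show, so they are not reproduced here.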

Demo


How to install Lucene and run the demo


Boolean retrieval example



apache


lucene


apache + lucene


apache lucene


Luke: http://www.getopt.org/luke/


An online demo (PHP + Lucene): http://tiny.cc/JCA9K
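The Boolean retrieval example above can be mimicked with set operations on posting lists. The sketch below is not Lucene's query parser. It assumes the usual Lucene convention that a bare term is optional and a "+" marks a required term, so "apache lucene" behaves like OR; whether the slide's "apache + lucene" was meant as AND is also an assumption. The three sample documents and the function names are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search_all(index, terms):
    """AND: documents that contain every term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def search_any(index, terms):
    """OR: documents that contain at least one term."""
    return set().union(*(index.get(term, set()) for term in terms))


# Hypothetical sample documents.
documents = {
    1: "The Apache Software Foundation",
    2: "Apache Lucene search library",
    3: "Lucene in Action",
}
index = build_index(documents)

print("apache         ->", sorted(search_all(index, ["apache"])))
print("lucene         ->", sorted(search_all(index, ["lucene"])))
print("both required  ->", sorted(search_all(index, ["apache", "lucene"])))
print("either term    ->", sorted(search_any(index, ["apache", "lucene"])))
```

Requiring both terms narrows the result to the intersection of the two posting lists, while allowing either term widens it to their union.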


References:


Lucene: http://lucene.apache.org/


Apache: http://www.apache.org/


“Lucene in Action” Chapter 1 and code: Link


Lucene index: http://www.ibm.com/developerworks/library/wa-lucene/


http://lucene.apache.org/java/2_4_0/scoring.html


http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html


http://en.wikipedia.org/wiki/Full_text_search


http://en.wikipedia.org/wiki/Index_%28search_engine%29


http://en.wikipedia.org/wiki/Tf-idf