Lucene & Nutch



Lucene


Project name


Started as text index engine


Nutch


A complete web search engine, including:


Crawling, indexing, searching


Index 100M+ pages, crawl >10M/day


Provide distributed architecture


Written in Java


Ports to other languages are works in progress

Lucene


Open source search project


http://lucene.apache.org


Index & search local files


Download lucene-2.2.0.tar.gz from http://www.apache.org/dyn/closer.cgi/lucene/java/


Extract files


Build an index for a directory


java org.apache.lucene.demo.IndexFiles dir_path


Try a search at the command line:


java org.apache.lucene.demo.SearchFiles
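
The two demo classes above wrap the core Lucene API. A minimal sketch of the same index-then-search flow, written against the Lucene 2.x API in the tarball above (the class name, field names, and sample text are illustrative, not the demo's actual source):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class TinyLucene {
    public static void main(String[] args) throws Exception {
        // Build an index in ./index containing a single document
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("path", "/tmp/hello.txt", Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("contents", "hello lucene world", Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.optimize();
        writer.close();

        // Search the index we just built and print the stored path of each hit
        IndexSearcher searcher = new IndexSearcher("index");
        Query query = new QueryParser("contents", new StandardAnalyzer()).parse("lucene");
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i).get("path"));
        }
        searcher.close();
    }
}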

Deploy Lucene


Copy luceneweb.war to your {tomcat-home}/webapps


Browse to
http://localhost:8080/luceneweb


Tomcat will deploy the web app.


Edit webapps/luceneweb/configuration.jsp


Point “indexLocation” to your indexes (see the note below)


Search at
http://localhost:8080/luceneweb
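
The index location in configuration.jsp is a plain JSP variable assignment. Roughly (the exact surrounding markup may differ, and the path here is only an example):

String indexLocation = "/home/me/index";   // point at the directory the IndexFiles demo created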

Nutch


A complete search engine
http://lucene.apache.org/nutch/release/


Modes


Intranet/local search


Internet search


Usage


Crawl


Index


Search

Intranet Search


Configuration


Input URLs: create a directory and seed file


$ mkdir urls


$ echo http://www.cs.ucsb.edu > urls/ucsb


Edit conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with cs.ucsb.edu


Edit conf/nutch-site.xml (see the sketch just below)
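
A rough sketch of those two edits. The filter line mirrors the stock crawl-urlfilter.txt template; the agent name in nutch-site.xml is only an example value:

# conf/crawl-urlfilter.txt: accept anything under cs.ucsb.edu
+^http://([a-z0-9]*\.)*cs.ucsb.edu/

<!-- conf/nutch-site.xml: overrides nutch-default.xml; at minimum, name your crawler -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>my-test-crawler</value>
  </property>
</configuration>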


Intranet: Running the Crawl


Crawl options include:


-dir dir names the directory to put the crawl in.


-threads threads determines the number of threads that will fetch in parallel.


-depth depth indicates the link depth from the root page that should be crawled.


-topN N determines the maximum number of pages that will be retrieved at each level up to the depth.


E.g.


$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50


Intranet Search


Deploy the Nutch war file


rm -rf TOMCAT_DIR/webapps/ROOT*


cp nutch-0.9.war TOMCAT_DIR/webapps/ROOT.war


The webapp finds indexes in ./crawl, relative to where you start Tomcat (see the note below)


TOMCAT_DIR/bin/catalina.sh start


Search at
http://localhost:8080/


CS.UCSB domain demo:
http://hactar.cs.ucsb.edu:8080
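
If starting Tomcat from the crawl directory is awkward, the index location can be set explicitly instead. A sketch, assuming the standard searcher.dir property placed in the webapp's WEB-INF/classes/nutch-site.xml (the path is only an example):

<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/home/me/nutch-0.9/crawl</value>
  </property>
</configuration>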

Internet Crawling


Concept


crawldb: all URL info


linkdb: list of known links to each url


segments: each is a set of urls that are fetched as a unit


indexes: Lucene-format indexes
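
On disk, a crawl built this way typically ends up looking like the layout below (the segment timestamp is only an example):

kids/
  crawldb/              status and metadata for every known URL
  linkdb/               incoming links (and anchor text) for each URL
  segments/
    20131117123456/     one generate/fetch round
  indexes/              Lucene-format index parts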

Internet Crawling Process

1. Get seed URLs

2. Fetch

3. Update crawl DB

4. Compute top URLs, go to step 2

5. Create index

6. Deploy

(Each step's commands are collected in the sketch below and detailed on the following slides.)
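
Pulling the per-step commands from the following slides together (directory names follow the kids/ example; seed-dir stands in for whatever directory holds your seed-URL files):

bin/nutch inject kids/crawldb seed-dir/                       (1: seed URLs)
bin/nutch generate kids/crawldb kids/segments -topN 50000     (4: pick top URLs for the next round)
s1=`ls -d kids/segments/2* | tail -1`
bin/nutch fetch $s1                                           (2: fetch)
bin/nutch updatedb kids/crawldb $s1                           (3: update crawl DB)
... repeat generate / fetch / updatedb for more rounds ...
bin/nutch invertlinks kids/linkdb kids/segments/*             (5: invert links, then index)
bin/nutch index kids/indexes kids/crawldb kids/linkdb kids/segments/*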

Seed URL


URLs from the DMOZ Open Directory


wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz


gunzip content.rdf.u8.gz


mkdir dmoz


bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls


Kids-search seed URLs from ask.com


Inject URLs


bin/nutch inject kids/crawldb 67k-url/


Edit conf/nutch-site.xml

Fetch


Generate a fetchlist from the database


$ bin/nutch generate kids/crawldb kids/segments


Save the name of the fetchlist in the variable s1


s1=`ls -d kids/segments/2* | tail -1`


Run the fetcher on this segment


bin/nutch fetch $s1

Update Crawl DB and Re-fetch


Update the crawl db with the results of the fetch


bin/nutch updatedb kids/crawldb $s1


Generate top-scoring 50K pages


bin/nutch generate kids/crawldb kids/segments -topN 50000


Refetch


s1=`ls -d kids/segments/2* | tail -1`


bin/nutch fetch $s1

Index, Deploy, and Search


Invert the links (build the link database)


bin/nutch invertlinks kids/linkdb kids/segments/*


Index the segments


bin/nutch index kids/indexes kids/crawldb kids/linkdb kids/segments/*


Deploy & Search


Same as in Intranet search


Demo of 1M pages (570K + 500K)


Issues


The default re-crawl cycle is 30 days, the same for all URLs


Duplicates are pages that have the same URL or the same MD5 hash of the page content


The JavaScript parser uses regular expressions to extract URL literals from the code (see the toy sketch below).
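
A toy illustration of that last point, in Java. This is not Nutch's actual parser or pattern, only a sketch of the regex-literal approach and of what it misses:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsUrlGrep {
    // Naively pull quoted http(s) literals out of JavaScript source text.
    private static final Pattern URL_LITERAL =
            Pattern.compile("[\"'](https?://[^\"'\\s]+)[\"']");

    public static void main(String[] args) {
        String js = "var a = 'http://www.cs.ucsb.edu/'; load(base + '/grad/');";
        Matcher m = URL_LITERAL.matcher(js);
        while (m.find()) {
            // Prints only the literal URL; the concatenated one is invisible to a regex.
            System.out.println(m.group(1));
        }
    }
}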