Searching for Search Solutions


Harvard IT Summit

June 23, 2011

Randy Stern | randy_stern@harvard.edu | HUL/OIS

David Heitmeyer | david_heitmeyer@harvard.edu | HUIT




Searching the Web

Searching a Site

Searching a Collection

Searching Geospatially

Search at Harvard

Web
Collections
People
Courses
Grants
Libraries
...many other things
Federated

Search Models

“To oversimplify, there's the Google model and the faceted navigation model.”
Morville & Callender in Search Patterns

Keyword (“Google”)
  Keyword search against an index
Advanced Search
  Searching or selecting specific fields
Faceted Search (“Guided Navigation”)
  Integrated search and browse
  Keyword search
  Browse by category metadata
  “No dead ends”

Advanced Search

Faceted Search

Search Technologies: Summary

| Technology | Products | Examples at Harvard |
|---|---|---|
| Web Search | Google, Yahoo, Bing | everywhere |
| Site Search | Google Search Appliance, Nutch, Sphinx, Elasticsearch | www.harvard.edu |
| Relational Database | Oracle, MySQL, PostgreSQL | PeopleSoft, Aleph, DRS, HOLLIS Classic |
| XML Database | Tamino, eXist | VIA, OASIS, Virtual Collections |
| Spatially enabled | ArcSDE, PostGIS | Harvard Geospatial Library, WorldMap |
| Archived web search | NutchWAX / Lucene | Library Web Archiving Service |
| Full text and faceted search | Apache Solr / Lucene, Endeca, Autonomy, MS FAST | Library Full Text Search Service, HOLLIS, iSites, Course Catalog |
| Federated search | Ex Libris MetaLib | Library Cross Search |

Apache Lucene

Open source from Apache
High-performance, full-featured text search engine library written entirely in Java
Text-based inverted index
Documents of name/value pairs
Stemming and tokenizers for various applications and languages
Query syntax
  and/or/not/near
Highlighter
**FAST**
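To make the "text-based inverted index" and "documents of name/value pairs" ideas concrete, here is a minimal Python sketch. It is not Lucene's implementation or API: the `tokenize`, `build_index`, and `search_and` names, the crude tokenizer, and the second sample document (id `24680`) are all illustrative assumptions.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a crude stand-in for a real analyzer)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_index(docs):
    """Map each term to the set of document ids whose field values contain it."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for value in fields.values():
            for term in tokenize(value):
                index[term].add(doc_id)
    return index


def search_and(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Documents are plain name/value pairs; the second one is made up for illustration.
docs = {
    "13579": {"title": "Mona Lisa", "creator": "Leonardo DaVinci", "genre": "painting"},
    "24680": {"title": "Bronze Horse", "creator": "Leonardo DaVinci", "genre": "sculpture"},
}
index = build_index(docs)
print(sorted(search_and(index, "davinci painting")))
```

A lookup only touches the posting sets for the query terms, which is why this structure answers keyword queries without scanning every document.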

Apache Solr

“Solr is the popular, blazing fast open source enterprise search platform from Apache”
A REST Web Service on top of Lucene for indexing and querying
XML and JSON output
Caching for faster response
Faceting
Web management interface
XML schema configuration files
“did you mean?” and “more like this” support
Scalable server model
Very active development community

http://lucene.apache.org/solr/

[Diagram: several Solr instances, each on top of Lucene]
Highly scalable with Hadoop cluster

Apache Solr/Lucene Ecology

[Diagram labels: Library catalogs, Enterprise databases, Web Archives; Nutch, NutchWAX; Lucene, Solr; Text, Fielded data]

Solr Indexing

Indexing: HTTP POST to http://mysolrserver/solr/update

<add>
  <doc>
    <field name="id">13579</field>
    <field name="title">Mona Lisa</field>
    <field name="creator">Leonardo DaVinci</field>
    <field name="year">1519</field>
    <field name="genre">painting</field>
  </doc>
</add>
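As a rough illustration of the indexing call, the following Python sketch builds the same `<add>` message from a dictionary and prepares the HTTP POST using only the standard library. The helper names `build_add_xml` and `build_update_request` are hypothetical, `mysolrserver` is the placeholder host from the slide, and the request is constructed but not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Serialize one document (field name -> value) as a Solr <add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="utf-8")


def build_update_request(base_url, doc):
    """Prepare (but do not send) the HTTP POST to the update handler."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/update",
        data=build_add_xml(doc),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


mona_lisa = {
    "id": "13579",
    "title": "Mona Lisa",
    "creator": "Leonardo DaVinci",
    "year": 1519,
    "genre": "painting",
}
request = build_update_request("http://mysolrserver/solr", mona_lisa)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Against a real server, the request would be sent with urllib.request.urlopen(request).
```

Note that Solr makes newly added documents searchable only after a commit, so a real client would also issue one (for example by posting a `<commit/>` message to the same handler).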

Solr Searching

http://mysolrserver/solr/select?q=Davinci&start=0&rows=2&fl=title,genre

<response>
  <result numFound="43" start="0">
    <doc>
      <str name="title">Mona Lisa</str>
      <str name="genre">painting</str>
    </doc>
    <doc>
      <str name="title">Bronze Horse</str>
      <str name="genre">sculpture</str>
    </doc>
  </result>
</response>
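A client might assemble the query URL and read the XML result along these lines. This is a sketch under assumptions: `build_select_url` and `parse_xml_response` are hypothetical helper names, and the sample response is the simplified one from the slide (a real Solr response carries additional elements such as a response header).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_select_url(base_url, query, start=0, rows=10, fields=()):
    """Build a select URL with q, start, rows and an optional fl field list."""
    params = {"q": query, "start": start, "rows": rows}
    if fields:
        params["fl"] = ",".join(fields)
    # safe="," keeps the comma in the fl list readable instead of %2C.
    return base_url.rstrip("/") + "/select?" + urlencode(params, safe=",")


def parse_xml_response(xml_text):
    """Return (total hit count, list of docs as dicts) from an XML response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = [{el.get("name"): el.text for el in doc} for doc in result.findall("doc")]
    return int(result.get("numFound")), docs


SAMPLE_RESPONSE = """<response>
  <result numFound="43" start="0">
    <doc>
      <str name="title">Mona Lisa</str>
      <str name="genre">painting</str>
    </doc>
    <doc>
      <str name="title">Bronze Horse</str>
      <str name="genre">sculpture</str>
    </doc>
  </result>
</response>"""

url = build_select_url("http://mysolrserver/solr", "Davinci", start=0, rows=2,
                       fields=("title", "genre"))
total, docs = parse_xml_response(SAMPLE_RESPONSE)
print(url)
print(total, docs)
```

The `start` and `rows` parameters page through the result set: 43 documents match, but only the first two are returned.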

Solr Searching

http://mysolrserver/solr/select?q=Davinci&start=0&rows=2&fl=title,genre&wt=json

{"response" : {
   "numFound" : 43,
   "start" : 0,
   "docs" :
   [
     {"title":"Mona Lisa",
      "genre":"painting"},
     {"title":"Bronze Horse",
      "genre":"sculpture"}
   ]
 }
}
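With `wt=json` the same result needs no XML handling on the client. The sketch below parses the simplified response shown on the slide; the `summarize` helper is a hypothetical name, and grouping titles by genre is just one example of what a client might do with the documents.

```python
import json

SAMPLE_JSON = """
{"response": {
   "numFound": 43,
   "start": 0,
   "docs": [
     {"title": "Mona Lisa", "genre": "painting"},
     {"title": "Bronze Horse", "genre": "sculpture"}
   ]
 }
}
"""


def summarize(json_text):
    """Return the total hit count and the returned titles grouped by genre."""
    response = json.loads(json_text)["response"]
    titles_by_genre = {}
    for doc in response["docs"]:
        titles_by_genre.setdefault(doc["genre"], []).append(doc["title"])
    return response["numFound"], titles_by_genre


total, titles_by_genre = summarize(SAMPLE_JSON)
print(total, titles_by_genre)
```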

Use of Solr Exploding

Whitehouse.gov, FCC.gov, Comcast / xfinity, AT&T Interactive, AOL (Yellow Pages, Music, NFL Sports, Recipes), Sears, Ticketmaster, Digg, Netflix, Zappos.com, and many more
Open source library catalogs
  Blacklight (Ruby), VuFind (PHP)
Open source digital repositories
  Fedora, DSpace
Support available from Lucid Imagination (Solr creators)

Source: http://wiki.apache.org/solr/PublicServers

Harvard University Course Catalog

coursecatalog.harvard.edu

Solr & Course Catalog

9,000+ courses from 13 schools/programs
15 MB index size
fields are indexed and stored
Search + Faceted Navigation
  School, calendar period, term, department, day, time, cross-registration status, credit level
Updated daily
REST interface
HTTP POST of XML files
XSLT/XPath 2 processing of XML data from Solr

Course Catalog: Searching and Facets

[Screenshot callouts: Search Terms, Facets]
Facets shown: School, Semester, Department, Credit Level, Day of Week, Cross Registration, Term within School, Time of Day Offered
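A faceted search like the one on this slide can be expressed with Solr's standard `facet`, `facet.field`, and `fq` request parameters. The sketch below is an assumption-laden illustration, not the Course Catalog's actual code: the field names (`school`, `department`), the sample school value, the counts, and the helper names are all made up.

```python
from urllib.parse import urlencode


def build_facet_query(keywords, facet_fields, selected=None):
    """Build a query string asking for facet counts, narrowed by chosen facet values."""
    params = [("q", keywords), ("facet", "true"), ("wt", "json")]
    params += [("facet.field", field) for field in facet_fields]
    for field, value in (selected or {}).items():
        # Each selected facet value becomes a filter query (fq).
        params.append(("fq", f'{field}:"{value}"'))
    return urlencode(params)


def pair_facet_counts(flat):
    """Turn an alternating [value, count, value, count, ...] list into a dict.

    Assumes the flat list layout Solr uses by default for JSON facet output.
    """
    return dict(zip(flat[0::2], flat[1::2]))


query = build_facet_query("ethics", ["school", "department"],
                          selected={"school": "Law School"})
print(query)

# Hypothetical facet counts as they might appear in a JSON response.
print(pair_facet_counts(["Law School", 12, "Divinity School", 3]))
```

Because every facet value offered to the user comes back with a non-zero count, following a facet link never produces an empty result, which is the "no dead ends" property of guided navigation.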

Course Catalog

Access to data for other applications
OpenSearch browser plugins

iSites

5,500 course websites each year
20,000 websites
16,000 students
8 student portals
33,000 users on a peak day

Search within iSites

Solr & iSites

4.5 million items
  File, topic, forum, image, page, html, sign-up event, video, audio, site, link, wiki, announcement, podcast
Crawlers use database and file system
  MS Office, PDF, Audio (metadata), OpenDocument, RTF, Text, HTML, XML
35 GB index size
Updated hourly
Master and slave

Search Tool - Permissions

Search

New Ways of Navigating

Harvard Library Full Text Search Service

Full Text Search Service

Uses Lucene directly
Full text index of OCR page text for digitized books and other page-turned objects
Relevance ranked searching
Hits in context
~81,000 objects so far, 7.2 million pages
Index size 8.5GB
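"Hits in context" means showing each match together with the words around it. The Python sketch below illustrates the idea on a single page of text; it is not Lucene's Highlighter, and the `hits_in_context` name, the bracket markers, and the sample sentence are assumptions made for the example.

```python
import re


def hits_in_context(page_text, term, width=30):
    """Return one snippet per case-insensitive match, with `width` characters of context."""
    snippets = []
    for match in re.finditer(re.escape(term), page_text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(page_text))
        prefix = "..." if start > 0 else ""
        suffix = "..." if end < len(page_text) else ""
        snippet = (
            page_text[start:match.start()]
            + "[" + match.group(0) + "]"   # mark the hit itself
            + page_text[match.end():end]
        )
        snippets.append(prefix + snippet + suffix)
    return snippets


page = "The whale surfaced near the ship. Later the Whale dived."
for snippet in hits_in_context(page, "whale", width=10):
    print(snippet)
```

Producing snippets like these requires the original page text at query time, so a service offering hits in context has to keep that text available alongside the index.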

Harvard Library Web Archiving Service

Web Archiving Service

Lucene plus NutchWAX full text index of harvested web pages and harvested resources
Indexing HTML, PDFs, Word docs, PPTs, etc. and collection metadata
Currently a “small” web archive
  265 web sites
  13M web pages
  100M web resources, 1TB of archived web data
Index size 170GB and growing
  80-90% of index size is full text required for “hit in context” search results
3-5 sec search result times on ordinary dual core Linux box

DRS 2 Web Administrator

[Screenshot callout: Facets to come!!]

Solr for digital object management searching
  Digital preservation objects have many fields that may be important for collection management or preservation planning
  Faceted browse by user tags, content type, owners, etc.
  Full text searching for descriptions and process info
  Easy to configure, update, and use (HTTP and simple URLs)
Indexing metadata plus full text embedded in object descriptors, rather than the content of files themselves
Scoped at release:
  152 fields
  30 million records, index size of 60GB
  master/slave configuration

Email Archiving Service

Why Solr for email object management?
  relevance ranking
  facets
  full text searching of both email body and header fields
Indexing email header fields, rights and collection metadata, plus full text from emails
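As an illustration of turning an email into an indexable document of name/value pairs, the Python sketch below extracts a few header fields and the body with the standard `email` module. The field names, the `collection` value, the sample message, and the `email_to_index_doc` helper are hypothetical; it handles only single-part plain-text messages and is not the Email Archiving Service's actual schema.

```python
from email import message_from_string


def email_to_index_doc(raw_email, collection):
    """Flatten one plain-text email into header fields, collection metadata, and body text."""
    msg = message_from_string(raw_email)
    # Multipart messages would need each part walked; this sketch skips them.
    body = "" if msg.is_multipart() else msg.get_payload()
    return {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "date": msg["Date"],
        "collection": collection,
        "body": body.strip(),
    }


RAW_EMAIL = (
    "From: alice@example.edu\n"
    "To: bob@example.edu\n"
    "Subject: Draft agenda\n"
    "Date: Thu, 23 Jun 2011 09:00:00 -0400\n"
    "\n"
    "Please find the draft agenda below.\n"
)

doc = email_to_index_doc(RAW_EMAIL, collection="sample-collection")
print(doc)
```

Keeping headers as separate fields is what allows them to be searched individually or offered as facets, while the body field supports full text search.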

Searching for Search Solutions

Integrating multiple forms of data (text, images, audio, maps, etc.) into single searchable indexes
Aggregating Indexes
  Google, Google Books, Google Scholar
  Licensed cloud services for articles, books, media, everything
  Library Cloud
  DPLA
Semantic Web
  Linked Data, RDF, HTML5’s Microdata, Microformats
Mobile (Localized)
Specialized search vs. general search
  there’s an app for that

Thank You

Randy Stern | randy_stern@harvard.edu | HUL

David Heitmeyer | david_heitmeyer@harvard.edu | HUIT