
Analysis on best technologies to be used for the search engine and catalogue of BISE


























Introduction
TEXT SEARCH: desired features
Lucene
    Basic Concepts
Solr
    Pros and Cons
    Solr application
ElasticSearch
    Why elasticSearch?
Compass
Hibernate Search





Introduction

This document reviews the different technologies that can help us build the metadata catalogue for BISE.


TEXT SEARCH: desired features



(http://www.packtpub.com/article/text-search-your-database-or-solr)


Text analysis: This is the processing of the original text into indexed terms, and there's a lot to it. Being able to configure the tokenization of words could mean that a search for “Mac” will match documents containing the word “MacBook”. And then there's synonym processing so that users can search for similar words. You might want both a common-language dictionary and also hand-picked synonyms for your data. There's the ability to smartly handle desired languages instead of the pervasive English. And then there's stemming, which normalizes word variations so that, for example, “working” and “work” can be indexed the same. Yet another variation of text analysis is phonetic indexing, to find words that sound like the search query.
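
To make this concrete, here is a minimal sketch of text analysis using Lucene's EnglishAnalyzer (stemming plus stop-word removal); the field name "body" and the sample sentence are invented for the example, and the Version constant assumes the Lucene 3.6 line referenced later in this document:

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class AnalysisDemo {
        public static void main(String[] args) throws Exception {
            EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_36);
            // Run the analysis chain over a sample sentence.
            TokenStream stream = analyzer.tokenStream("body", new StringReader("Working with the MacBook"));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                // Prints the indexed terms: "work" (stemmed, lowercased) and "macbook";
                // "with" and "the" are dropped as stop words.
                System.out.println(term.toString());
            }
            stream.close();
        }
    }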



Relevancy ranking: This is the logic behind ranking the search results that most closely match the query. In Lucene/Solr, there are a variety of factors in an overall equation, with the opportunity to adjust factors based on matching certain fields, certain documents, or using field values in a configurable equation. By comparison, the commercial Endeca platform allows configuration of a variety of matching rules that behave more like a strict sort.



Query features & syntax: From boolean logic to grouping to phrases to fuzzy searches, to score boosting... there are a variety of queries that you might perform and combine. Many apps would prefer to hide this from users, but some may wish to expose it for “advanced” searches.
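
For illustration, a few queries in the Lucene/Solr query syntax (the field names title and body are hypothetical):

    title:"search engine"                        (phrase query)
    title:solr AND body:(index OR catalogue)     (boolean logic with grouping)
    roam~                                        (fuzzy search; also matches e.g. "roams", "foam")
    title:metadata^4 body:metadata               (boost matches in title over matches in body)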



Result highlighting: Displaying a text snippet of a matched document containing the matched word in context is clearly very useful. We have all seen this in Google.



Query spell correction (i.e. “did you mean”): Instead of unhelpfully returning no results (or very few), the search implementation should be able to try and suggest variation(s) of the search that will yield more results. This feature is customarily based on the actual indexed words and not a language dictionary. The variations might be based on the so-called edit distance, which is basically the number of alterations needed, or they might be based on phonetic matching.



Faceted navigation: This is a must-have feature which enables search results to include aggregated values for designated fields that users can subsequently choose to filter the results on. It is commonly used on e-commerce sites, to the left of the results, to navigate products by various attributes like price ranges, vendors, etc.



Term-suggest (AKA search auto-complete): As seen on Google, as you start typing a word in the search box, it suggests possible completions of the word. These are relevancy sorted and filtered to those that are also found with any words prior to the word you are typing.



Sub-string indexing: In some cases you need to match arbitrary sub-strings of words instead of being limited to complete words. Unlike an SQL LIKE clause, the data is indexed in such a way that this is quick.



Geo-location search: Given the coordinates of a location on the globe, with records containing such coordinates, you should be able to search for matching records from a user-specified coordinate. An extension to Solr allows a radius-based search with appropriate ranking, but it is also straightforward to box the search based on a latitude & longitude.
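
As a sketch of how Solr 3.x exposes this, assuming a LatLonType field named store holds the coordinates, a radius filter looks like:

    q=*:*&fq={!geofilt sfield=store pt=45.15,-93.85 d=5}

This restricts results to within 5 km of the given latitude,longitude point.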



Field/facet suggestions: The Endeca search platform can determine that your search query matches some field values used for faceting and then offer a convenient filter for them. For example, given a data set of employees, the search box could have a pop-up suggestion that the word in the search box matches a department code, and then offer the choice of navigating to those matching records. This can be easier and faster than choosing facets to filter on, especially if there are a great number of facet-able fields. Solr doesn't have this feature, but it would not be a stretch to implement it based on its existing foundation.



Clustering: This is another aid to navigating search results besides faceting. Search result clustering will dynamically divide the results into multiple groups called clusters, based on statistical correlation of terms in common. It is a bit of an exotic feature, but it is useful with lots of results with lots of text information, after any faceted navigation is done, if applicable.



Lucene

http://www.lucenetutorial.com/index.html

Basic Concepts

Lucene is a full-text search library which makes it easy to add search functionality to an application or website.

It does so by adding content to a full-text index. It then searches this index and returns results ranked by either their relevance to the query or by an arbitrary field such as a document's last modified date.

Searching and Indexing

Lucene is able to achieve fast search responses because, instead of searching the text directly, it searches an index. This would be the equivalent of retrieving pages in a book related to a keyword by searching the index at the back of the book, as opposed to searching the words in each page of the book.

This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) to a keyword-centric data structure (word->pages).

Documents

In Lucene, a Document is the unit of search and index.

An index consists of one or more Documents.

Indexing involves adding Documents to an IndexWriter, and searching involves retrieving Documents from an index via an IndexSearcher.

A Lucene Document doesn't necessarily have to be a document in the common English usage of the word. For example, if you're creating a Lucene index of a database table of users, then each user would be represented in the index as a Lucene Document.

Fields

A Document consists of one or more Fields.

A Field is simply a name-value pair. For example, a Field commonly found in applications is title. In the case of a title Field, the field name is title and the value is the title of that content item.

Indexing in Lucene thus involves creating Documents comprising one or more Fields, and adding these Documents to an IndexWriter.
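
A minimal indexing sketch against the Lucene 3.6 era API (the field names and values are invented, and the in-memory RAMDirectory stands in for a real index directory):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class IndexingDemo {
        public static void main(String[] args) throws Exception {
            Directory dir = new RAMDirectory(); // in-memory index, for the demo only
            IndexWriterConfig config =
                new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
            IndexWriter writer = new IndexWriter(dir, config);

            // A Document is a bag of name-value Fields.
            Document doc = new Document();
            doc.add(new Field("title", "BISE metadata catalogue",
                              Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("body", "Biodiversity datasets and their metadata.",
                              Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();
        }
    }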


Searching

Searching requires an index to have already been built. It involves creating a Query (usually via a QueryParser) and handing this Query to an IndexSearcher, which returns a list of Hits.
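
A matching search sketch (same Lucene 3.6 era API; dir is the Directory populated by the indexing sketch above):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.Version;

    public class SearchDemo {
        public static void search(Directory dir, String userInput) throws Exception {
            // Parse the user's query against the "body" field...
            QueryParser parser =
                new QueryParser(Version.LUCENE_36, "body", new StandardAnalyzer(Version.LUCENE_36));
            Query query = parser.parse(userInput);

            // ...and fetch the ten best-ranked hits.
            IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("title")); // stored field
            }
            searcher.close();
        }
    }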

Queries

Lucene has its own mini-language for performing searches; read more about the Lucene Query Syntax.

The Lucene query language allows the user to specify which field(s) to search on, which fields to give more weight to (boosting), the ability to perform boolean queries (AND, OR, NOT), and other functionality.



Solr

Pros and Cons


Cons

No synchronization

Probably the biggest obstacle to using Solr for the text index is that you need to synchronize data between your database and Solr, assuming the data isn't going to be in Solr alone (an option discussed later). If it is satisfactory to fully recreate the Solr index on occasion, perhaps nightly, then this should be relatively easy. Getting a little more complex is augmenting that scheme to synchronize just changes (create, update, delete) on a more frequent basis. Perhaps the most difficult is attempting to update the index automatically when changes occur in the database. Solr does not yet support “near real time search”, which would allow changes to be searchable almost immediately after a change is submitted to Solr. However, depending on data size and performance targets, changes can often be searchable less than a minute after submission, if you need it that fast.

Solr includes a module it calls the “Data Import Handler”, which is mostly for pulling in data from databases. It can even handle updates if the records contain a date. But getting data to Solr “just-in-time” will in all but one circumstance require some work on your part to coordinate. The exception here is using the “acts_as_solr” plugin available for the Ruby-on-Rails framework, which synchronizes data model changes automatically. The closest thing in the Java landscape is Hibernate Search, which is an extension to Hibernate that uses Lucene. Since it is not based on Solr, some notable features like faceting are not available.

No relational schema mapping

Solr has a flat schema and does not have any relational search capabilities. It does at least have multi-valued field support, which in most database schemas is done with additional tables. Your relational database schema is of course relational, and you will have to devise a mapping to put data into Solr. Often this is a straightforward mapping, perhaps requiring some de-normalization (i.e. in-lining related data, causing some duplication). However, there can be cases that are particularly difficult or impossible to map when there is a one-to-many or many-to-many relationship with multiple searchable fields. If you suspect challenges with mapping your relational database schema to Solr, then be sure to pose your scenario to solr-users@apache.org for advice.


Pros

Vendor Lock-In

You probably already know that the SQL standard diverges into different dialects across database vendors. Text search features are especially different from vendor to vendor because they are not governed by SQL or any other standard. Even if SQL standardized the query syntax, it wouldn't be enough because it wouldn't govern tokenization, stop words, and various other configuration aspects of search. And consider that databases are becoming increasingly commoditized and standardized, such that more and more applications support many databases. Object Relational Mapping (ORM) frameworks like Hibernate and ActiveRecord help make this possible. It's an easier proposition to add Lucene or Solr onto such database-agnostic systems than to propose that the system contort itself to each vendor's text-index features, perhaps in a least-common-denominator way.

Fewer Features

This is the main reason to avoid database search. To a database, text search simply isn't as important compared to all the other things a database does. But this is what Solr is. You can count on your database for the basics: fast term search with some basic query types and relevancy-ranked results, but you probably have very little control over text analysis or the extensive advanced query features already mentioned. Furthermore, Solr has a variety of extensibility points to customize it to your needs, and it is also open source if you'd prefer to tweak existing functionality that is close to what you want. This happens frequently in the Solr community; it is not just a theoretical point. Conversely, modifying one's open-source database to suit a project's needs is unheard of (well, rare, and certainly requiring more effort than modifying Solr). And if you're a Java programmer and your database is not Java based, Solr is all the more approachable to modification.


Faceted Navigation

Faceting is a headliner feature in Solr because it is a powerful method of navigating search results and because it's not a widely available capability, especially in open source. It may not seem hard at first, but it turns out that it's very difficult to derive the right SQL queries to generate facets from a database. And if you're an SQL pro and figured it out, it will be an anticlimactic accomplishment when in all likelihood you discover that your queries run slowly, especially for sizable data sets. Comparatively, Solr includes highly optimized code to both optimally filter on chosen facets and to generate facets with their counts. By the way, this is a capability in Solr, not Lucene, thus making this a reason to use Solr instead of other Lucene-based solutions (e.g. Hibernate Search, Compass, ...).
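
For comparison, requesting facets from Solr is a one-liner per field; a sketch with the SolrJ 3.x client (the vendor field and server URL are placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FacetDemo {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
            SolrQuery query = new SolrQuery("*:*");
            query.setFacet(true);            // turn faceting on
            query.addFacetField("vendor");   // aggregate counts per vendor value
            QueryResponse response = solr.query(query);
            for (FacetField.Count count : response.getFacetField("vendor").getValues()) {
                System.out.println(count.getName() + ": " + count.getCount());
            }
        }
    }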


Scalability

If your scenario is such that you are getting vast amounts of text from different sources (not databases under your direct control) and you want to do text search, a database would simply be inappropriate. With Solr, you can choose to only index the data without storing it, thus saving perhaps 75% of the data storage requirements that a database would impose. Solr is also optimized for fast, very fast, queries (indexing is no slouch either), and it's fairly straightforward to scale out to more servers, whereas your database is built for on-line transaction processing (OLTP) usage or data warehousing.


Solr application

Download: (http://lucene.apache.org/solr/)

Read the file "Readme.txt" placed at "\apache-solr-3.6.1\example" and follow the instructions.

To index: java -jar post.jar fileName (name.extension, *.extension, *.*, etc.)

There are many other ways to import your data into Solr. One can:

- Import records from a database using the Data Import Handler (DIH).
- Load a CSV file (comma separated values), including those exported by Excel or MySQL.
- POST JSON documents.
- Index binary documents such as Word and PDF with Solr Cell (ExtractingRequestHandler).
- Use SolrJ for Java or other Solr clients to programmatically create documents to send to Solr (see the sketch below).
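
As referenced in the last item, a minimal SolrJ indexing sketch (SolrJ 3.x API; the id and name fields exist in the example schema shipped with Solr, and the URL is the default example server):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrIndexDemo {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "bise-0001");              // unique key
            doc.addField("name", "BISE metadata record"); // example field
            solr.add(doc);
            solr.commit(); // make the document searchable
        }
    }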

Admin console

(http://lucidworks.lucidimagination.com/display/solr/Overview+of+the+Solr+Admin+UI)

Purpose:

- view Solr configuration details
- run queries and analyze document fields in order to fine-tune a Solr configuration
- access online documentation and other help

[Index of the site: http://lucidworks.lucidimagination.com/display/solr/Apache+Solr+Reference+Guide]


Sample application:

More URLs:

http://www.solrtutorial.com
http://lucene.apache.org/solr/
http://lucene.apache.org/solr/api-3_6_1/doc-files/tutorial.html



ElasticSearch

http://www.elasticsearch.org

Why elasticSearch?

So, you build a web site or an online system and you want to add search to it. But then it hits you: getting search to work is hard. You want the search solution to be fast, painless to set up, and to scale. You want to be able to index data simply, using JSON over HTTP, without having to pre-define schemas for the index. You want the search server to always be available and to start small but potentially scale large, Big Data large. You want to create as many indices as you see fit, which will support a diverse set of document types. As you search, you want it to be as close to real-time search as possible. Oh yeah… and it would be great if this search solution were built for the cloud.

“This should be easier… and cool, bonsai cool!”, we declared.

elasticsearch was created as a solution for these requirements and more. It is an Apache 2 licensed open source, distributed and RESTful search engine, built on top of Apache Lucene.

Schema Free & Document Oriented

The data model of a search engine has its roots in schema-free, document-oriented databases, and as shown by the #nosql movement, this model proves to be very effective for building applications. elasticsearch's model is JSON, which is slowly emerging as the de-facto standard for representing data these days. Moreover, with JSON it is simple to provide semi-structured data with complex entities, and JSON is programming-language neutral with first-level parsers.

Schema Mapping

elasticsearch is schema-less: just toss it a typed JSON document and it will automatically index it. Types such as numbers and dates are automatically detected and treated accordingly. That said, as we all know, search engines are quite sophisticated. Fields in documents can have boost levels that affect scoring, analyzers can be used to control how text gets tokenized into terms, certain fields should not be analyzed at all, and so on. elasticsearch allows you to completely control how a JSON document gets mapped into the search engine on a per-type and per-index level.

GETting Some Data

Indexing data is always done using a unique identifier (at the type level). This is very handy since many times we wish to update or delete the actual indexed data, or just GET it. Getting data could not be simpler: all that is needed is the index name, the type, and the id. What we get back is the actual JSON document used to index the specific data, effectively behaving as a distributed key/value store for structured documents.
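
A sketch of that round trip using plain JSON over HTTP from Java (the index bise, type dataset, id 1 and the document body are invented; assumes an elasticsearch node on the default port 9200):

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class EsGetDemo {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:9200/bise/dataset/1");

            // PUT indexes a JSON document under an explicit id...
            HttpURLConnection put = (HttpURLConnection) url.openConnection();
            put.setRequestMethod("PUT");
            put.setDoOutput(true);
            OutputStream out = put.getOutputStream();
            out.write("{\"title\":\"Wetlands metadata\",\"year\":2013}".getBytes("UTF-8"));
            out.close();
            System.out.println("index response: " + put.getResponseCode());

            // ...and GET returns the original JSON document back.
            HttpURLConnection get = (HttpURLConnection) url.openConnection();
            InputStream in = get.getInputStream();
            int c;
            while ((c = in.read()) != -1) System.out.print((char) c);
            in.close();
        }
    }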


Search

What it all boils down to, at the end, is being able to search, and with elasticsearch it could not be simpler. Issuing queries is a simple call, hiding away the sophisticated distributed search support elasticsearch provides. Search can be executed using either a simple Lucene-based query string or an extensive JSON-based search query DSL.

Search does not end with just queries. Facets, highlighting, custom scripts, and more are all there to be used when needed.
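
Both styles, side by side (index and field names invented; the JSON form uses the query DSL's match query):

    GET http://localhost:9200/bise/_search?q=title:wetlands

    POST http://localhost:9200/bise/_search
    {
      "query": { "match": { "title": "wetlands" } }
    }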


Multi Tenancy

A single index is already a major step forward, but what happens when we need more than one index? In many cases, multiple indices are required. An example can be storing an index per week of log files, or even having different indices with different settings (one with memory storage, and one with file system storage). elasticsearch easily enables the creation of as many indices as required, allowing cross-index queries to be executed and index grouping using advanced aliasing functionality.

Settings

The ability to configure is a double-edged sword. We want the ability to start working with the system as fast as possible, with no configuration, and still be able to control almost every aspect of the application if need be. elasticsearch is built with this notion in mind. Almost everything is configurable and pluggable. Moreover, each index can have its own settings, which can override the master settings. For example, one index can be configured with memory storage and have 10 shards with 1 replica each, and another index can have file-based storage with 1 shard and 10 replicas. All the index-level settings can be controlled when creating an index, using either a YAML or JSON format.

Distributed

One of the main features of elasticsearch is its distributed nature. Indices are broken down into shards, each shard with 0 or more replicas. Each data node within the cluster hosts one or more shards, and acts as a coordinator to delegate operations to the correct shard(s). Rebalancing and routing are done automatically and behind the scenes.

Cloud

elasticsearch has been purposely built with the cloud in mind, starting with options like auto-discovery of nodes when running in AWS EC2, and extending to being an adaptive distributed system that automatically handles machines coming and going and adjusts to a dynamic environment.



Compass

http://www.compass-project.org

Simple

Compass provides a simple API for working with Lucene. If you know how to use an ORM, then you will feel right at home with Compass, with simple operations for save, delete, and query.

Lucene

Building on top of Lucene, Compass simplifies common usage patterns of Lucene such as google-style search and index updates, as well as more advanced concepts such as caching and index sharding (sub indexes). Compass also uses built-in optimizations for concurrent commits and merges.

Mapping

Compass provides support for mapping different data "formats", indexing and storing (caching) them in the Search Engine: Object to Search Engine Mapping (using annotations or xml), JSON to Search Engine Mapping (explicit or dynamic), XML to Search Engine Mapping (using simple xpath expressions), and the low-level Resource to Search Engine Mapping.

Tx

Compass provides a transactional API on top of the Search Engine, supporting different transaction isolation levels. Externally, Compass provides a local transaction manager as well as integration with external transaction managers such as JTA (Sync and XA), Spring, and ORM ones.

ORM

Compass integrates seamlessly with most popular ORM frameworks, allowing automatic mirroring to the index of the changes in data performed via the ORM tool. Compass has generic support for JPA as well as embedded support for Hibernate, OpenJPA, TopLink Essentials, and EclipseLink, allowing you to add Compass in three simple steps.

Spring

Compass integrates seamlessly with Spring. Compass can be easily configured using Spring, integrates with Spring transaction management, has support for Spring MVC, and has Spring aspects built in for reflecting operations to the search engine.

Distributed

Compass simplifies the creation of a distributed Lucene index by allowing the Lucene index to be stored in a database, as well as storing the index with Data Grid products such as GigaSpaces, Coherence, and Terracotta.




Hibernate Search

http://www.hibernate.org/subprojects/search.html

Hibernate Search brings the power of full text search engines to the persistence domain model by combining Hibernate Core with the capabilities of the Apache Lucene™ search engine.

Full text search engines like Apache Lucene™ are very powerful technologies for adding efficient free text search capabilities to applications. However, Lucene suffers several mismatches when dealing with object domain models. Amongst other things, indexes have to be kept up to date, and mismatches between index structure and domain model, as well as query mismatches, have to be avoided.


Hibernate Search addresses these shortcomings. It indexes your domain model with the help of a few annotations, takes care of database/index synchronization, and brings back regular managed objects from free text queries. Hence, it solves:

- The structural mismatch: Hibernate Search takes care of the object/index translation.
- The duplication mismatch: Hibernate Search manages the index, keeps changes synchronized with your database, and optimizes index access transparently.
- The API mismatch: Hibernate Search lets you query the index and retrieve managed objects as any regular Hibernate query would.
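
A minimal sketch of those annotations (the Dataset entity and its fields are invented; annotations per the Hibernate Search 3.x/4.x API):

    import javax.persistence.Entity;
    import javax.persistence.Id;
    import org.hibernate.search.annotations.DocumentId;
    import org.hibernate.search.annotations.Field;
    import org.hibernate.search.annotations.Indexed;

    @Entity
    @Indexed          // mirror this entity into a Lucene index
    public class Dataset {
        @Id
        @DocumentId   // database primary key doubles as the Lucene document id
        private Long id;

        @Field        // tokenized, indexed, searchable
        private String title;

        @Field
        private String description;
    }

Free-text queries then go through a FullTextSession (obtained via Search.getFullTextSession) and return managed Dataset instances like any other Hibernate query.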

Even though Hibernate Search uses Apache Lucene™ under the hood, you can always fall back to the native Lucene APIs if the need arises.

Depending on application needs, Hibernate Search works well in non-clustered and clustered mode, and provides synchronous and asynchronous index updates, allowing you to make an active choice between response time, throughput, and index update time.

Last but not least, Hibernate Search works perfectly with all traditional Hibernate patterns, especially the conversation pattern used by Seam.