Analysis of the best technologies to be used for the text search and catalogue of BISE
TEXT SEARCH: desired features
Pros and Cons
This document reviews different technologies that can help us build the metadata catalogue for BISE.
TEXT SEARCH: desired features
Text analysis: This is the processing of the original text into indexed terms, and there's a lot to it. Being able to configure the tokenization of words could mean that a search for "Mac" will find documents in which the word "MacBook" appears. And then there's synonym processing, so that users can search for similar words. You might want both a common language dictionary and also hand-picked synonyms for your data. There's the ability to smartly handle the desired languages instead of only the pervasive English. And then there's stemming, which normalizes word variations so that, for example, "working" and "work" are indexed the same. Yet another variation of text analysis is phonetic indexing, to find words that sound like the search query.
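To make the pipeline concrete, here is a deliberately tiny sketch of the analysis steps just described (tokenize, lowercase, stem, expand synonyms). The stemming rule and the synonym list are invented for illustration; real analyzers such as Lucene's are far more sophisticated.

```java
import java.util.*;

// Toy text-analysis pipeline: tokenize, lowercase, stem, expand synonyms.
// The suffix-stripping "stemmer" and the synonym map are invented examples.
public class ToyAnalyzer {
    static final Map<String, String> SYNONYMS = Map.of("laptop", "notebook");

    // Very naive stemmer: strips a couple of common English suffixes.
    static String stem(String term) {
        if (term.endsWith("ing") && term.length() > 4) return term.substring(0, term.length() - 3);
        if (term.endsWith("s") && term.length() > 3) return term.substring(0, term.length() - 1);
        return term;
    }

    // Produces the indexed terms for a piece of text.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            String stemmed = stem(token);
            terms.add(stemmed);
            // Synonym expansion: index the synonym alongside the original term.
            String syn = SYNONYMS.get(stemmed);
            if (syn != null) terms.add(syn);
        }
        return terms;
    }

    public static void main(String[] args) {
        // "working" normalizes to "work"; "laptop" also indexes "notebook".
        System.out.println(analyze("Working on my laptop"));
    }
}
```

Because "working" and "work" produce the same indexed term, a query for either one matches documents containing the other.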
Relevancy: This is the logic behind ranking the search results that most closely match the query. In Lucene/Solr, there are a variety of factors in an overall equation, with the opportunity to adjust factors based on matching certain fields, matching certain documents, or using field values in a configurable equation. By comparison, the commercial Endeca platform allows configuration of a variety of matching rules that behave more like a strict sort.
Query features: From boolean logic to grouping to phrases to fuzzy searches to score boosting... there are a variety of queries that you might perform and combine. Many apps would prefer to hide this from users, but some may wish to expose it for "advanced" searches.
Highlighting: Displaying a text snippet of a matched document containing the matched word in context is clearly very useful. We have all seen this in Google.
Spell-check suggestions (did-you-mean): Instead of unhelpfully returning no results (or very few), the search implementation should be able to suggest variation(s) of the search that will yield more results. This feature is customarily based on the actual indexed words and not a language dictionary. The variations might be based on the so-called edit distance, which is basically the number of alterations needed, or they might be based on phonetic matching.
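The edit distance mentioned above (Levenshtein distance) counts the single-character insertions, deletions, and substitutions needed to turn one word into another; spell-check suggesters compare the query term against indexed terms and offer those within a small distance. A minimal implementation:

```java
// Levenshtein edit distance: the number of single-character insertions,
// deletions, and substitutions needed to turn string a into string b.
public class EditDistance {
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;       // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;       // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,     // deletion
                                            d[i][j - 1] + 1),    // insertion
                                   d[i - 1][j - 1] + cost);      // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // "kitten" -> "sitting" takes 3 edits.
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```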
Faceting: This is a must-have feature which enables search results to include aggregated values for designated fields that users can subsequently choose to filter the results on. It is commonly used on e-commerce sites, to the left of the results, to navigate products by various attributes like price ranges, vendors, etc.
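As a rough illustration of what facet generation computes, the sketch below counts the distinct values of a designated field across a result set so the UI can offer them as filters with counts. The field name and records are invented for the example; engines like Solr do this with heavily optimized index structures, not per-record loops.

```java
import java.util.*;

// Toy facet generation: count distinct values of one field across results.
public class FacetCounter {
    static Map<String, Integer> facet(List<Map<String, String>> results, String field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> record : results) {
            String value = record.get(field);
            if (value != null) counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, String>> results = List.of(
            Map.of("name", "mouse", "vendor", "Acme"),
            Map.of("name", "keyboard", "vendor", "Acme"),
            Map.of("name", "monitor", "vendor", "Globex"));
        // {Acme=2, Globex=1}: shown next to the results for filtering.
        System.out.println(facet(results, "vendor"));
    }
}
```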
Autocomplete (search suggest): As seen on Google, as you start typing a word in the search box, it suggests possible completions of the word. These are relevancy sorted and filtered to those that are also found together with any words prior to the word you are typing.
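A minimal sketch of the completion step, assuming the indexed terms are kept in a sorted dictionary with stored frequencies (the frequency-based ranking here is an invented stand-in for real relevancy sorting): a prefix becomes a range scan over the sorted terms.

```java
import java.util.*;

// Toy autocomplete: range-scan a sorted term dictionary for completions
// of a prefix, then rank them by a stored term frequency.
public class Autocomplete {
    static List<String> suggest(NavigableMap<String, Integer> termFreqs, String prefix, int max) {
        // All terms >= prefix that still start with the prefix.
        List<String> matches = new ArrayList<>();
        for (String term : termFreqs.tailMap(prefix).keySet()) {
            if (!term.startsWith(prefix)) break;
            matches.add(term);
        }
        // Sort completions by descending frequency as a crude relevancy.
        matches.sort((a, b) -> termFreqs.get(b) - termFreqs.get(a));
        return matches.subList(0, Math.min(max, matches.size()));
    }

    public static void main(String[] args) {
        NavigableMap<String, Integer> terms = new TreeMap<>(
            Map.of("mac", 10, "macbook", 50, "machine", 20, "monitor", 5));
        // "macbook" ranks first because it is the most frequent completion.
        System.out.println(suggest(terms, "mac", 2)); // prints [macbook, machine]
    }
}
```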
Sub-string matching: In some cases, it is necessary to match arbitrary sub-strings of words instead of being limited to complete words. Unlike what happens with an SQL LIKE clause, the data is indexed in such a way that this is quick.
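One common way engines make sub-string search fast is n-gram indexing: at index time every word is decomposed into short character sequences (trigrams below), so a sub-string query becomes ordinary term lookups instead of a scan like SQL's LIKE '%...%'. This is a simplified sketch; real implementations verify candidates, since matching grams can occur non-contiguously.

```java
import java.util.*;

// Toy trigram index: map each 3-character gram to the words containing it.
public class NGramIndex {
    final Map<String, Set<String>> gramToWords = new HashMap<>();

    void add(String word) {
        for (int i = 0; i + 3 <= word.length(); i++) {
            gramToWords.computeIfAbsent(word.substring(i, i + 3), g -> new TreeSet<>()).add(word);
        }
    }

    // Candidate words containing a sub-string of at least 3 characters:
    // intersect the posting sets of all the query's grams.
    Set<String> search(String sub) {
        Set<String> result = null;
        for (int i = 0; i + 3 <= sub.length(); i++) {
            Set<String> words = gramToWords.getOrDefault(sub.substring(i, i + 3), Set.of());
            if (result == null) result = new TreeSet<>(words);
            else result.retainAll(words);
        }
        return result == null ? Set.of() : result;
    }

    public static void main(String[] args) {
        NGramIndex index = new NGramIndex();
        index.add("macbook");
        index.add("notebook");
        // Both words contain the sub-string "book".
        System.out.println(index.search("book")); // prints [macbook, notebook]
    }
}
```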
Geospatial search: Given the coordinates of a location on the globe, with records containing such coordinates, you should be able to search for matching records from a user-specified coordinate. An extension to Solr allows a radial-based search with appropriate ranking, but it is also straightforward to box the search based on a latitude & longitude range.
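The "box" variant mentioned above amounts to a simple rectangle test on latitude and longitude (the coordinates below are invented; real engines index the coordinates so the filter does not scan every record):

```java
// Toy bounding-box filter: is a record's coordinate within a square of
// halfSize degrees around the user-specified center?
public class BoundingBox {
    static boolean inBox(double lat, double lon,
                         double centerLat, double centerLon, double halfSize) {
        return Math.abs(lat - centerLat) <= halfSize
            && Math.abs(lon - centerLon) <= halfSize;
    }

    public static void main(String[] args) {
        // Is (40.7, -74.0) within 0.5 degrees of (40.5, -74.2)?
        System.out.println(inBox(40.7, -74.0, 40.5, -74.2, 0.5)); // prints true
    }
}
```

A radial search refines this by computing the actual great-circle distance to the center, which also gives a natural ranking (nearest first).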
Field-value suggestions: The Endeca search platform can determine that your search query matches some field values used for faceting and then offer a convenient filter for them. For example, given a data set of employees, the search box could have a pop-up suggestion that the word in the search box matches a department code, and then offer the choice of navigating to those matching records. This can be easier and faster than choosing facets to filter on, especially if there are a great number of facetable fields. Solr doesn't have this feature, but it would not be a stretch to implement it based on its existing foundation.
Search result clustering: This is another aid to navigating search results besides faceting. Search result clustering dynamically divides the results into multiple groups called clusters, based on statistical correlation of terms in common. It is a bit of an exotic feature, but it is useful with lots of results carrying lots of text information, after any faceted navigation is done, if applicable.
Lucene is a full-text search library which makes it easy to add search functionality to an application or website. It does so by adding content to a full-text index. It then searches this index and returns results ranked by either the relevance to the query or by an arbitrary field such as a document's last modification date.
Searching and Indexing
Lucene is able to achieve fast search responses because, instead of searching the text directly, it searches an index. This is the equivalent of retrieving pages in a book related to a keyword by searching the index at the back of the book, as opposed to searching the words in each page of the book. This type of index is called an inverted index, because it inverts a page-centric data structure (page->words) into a keyword-centric data structure (word->pages).
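A toy inverted index makes the book-index analogy concrete: instead of storing document -> words, store word -> documents, so looking up a keyword is a single map access. This is only an illustration of the data structure, not Lucene's implementation.

```java
import java.util.*;

// Toy inverted index: word -> set of document ids containing that word.
public class InvertedIndex {
    final Map<String, Set<Integer>> wordToDocs = new HashMap<>();

    void add(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                wordToDocs.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Keyword lookup is a single map access, regardless of corpus size.
    Set<Integer> search(String word) {
        return wordToDocs.getOrDefault(word.toLowerCase(), Set.of());
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.add(1, "Lucene is a search library");
        index.add(2, "Solr is built on Lucene");
        System.out.println(index.search("lucene")); // prints [1, 2]
    }
}
```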
In Lucene, a Document is the unit of search and index. An index consists of one or more Documents. Indexing involves adding Documents to an IndexWriter, and searching involves retrieving Documents from an index via an IndexSearcher.
A Lucene Document doesn't necessarily have to be a document in the common English usage of the word. For example, if you're creating a Lucene index of a database table of users, then each user would be represented in the index as a Lucene Document.
A Document consists of one or more Fields.
A Field is simply a name-value pair. For example, a Field commonly found in applications is title. In the case of a title Field, the field name is "title" and the value is the title of that content item.
Indexing in Lucene thus involves creating Documents comprising one or more Fields, and adding these Documents to an IndexWriter.
Searching requires an index to have already been built. It involves creating a Query (usually via a QueryParser) and handing this Query to an IndexSearcher, which returns a list of Hits.
Lucene has its own mini query language for performing searches. The query language allows the user to specify which field(s) to search on, which fields to give more weight to (boosting), the ability to perform boolean queries (AND, OR, NOT), and more.
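A few representative query strings in that mini language (the field names here are illustrative, but the syntax is Lucene's standard query parser syntax):

```
title:lucene                    term search scoped to the "title" field
title:"search library"          exact phrase
title:lucene AND body:solr      boolean combination of fields
title:lucene^4 body:lucene      boost matches in "title" by a factor of 4
roam~                           fuzzy search (also matches foam, roams, ...)
```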
Pros and Cons
Probably the biggest obstacle to using Solr for the text index is that you need to synchronize data between your database and Solr, assuming it isn't going to be in Solr alone (an option discussed later). If it is satisfactory to fully re-create the Solr index on occasion, perhaps nightly, then this should be relatively easy. Getting a little more complex is augmenting that scheme to synchronize just the changes (create, update, delete) on a more frequent basis. Perhaps the most difficult is attempting to update the index automatically when changes occur in the database. Solr does not yet support "near real time search", which would allow changes to be searchable almost immediately after a change is submitted to Solr. However, depending on data size and performance targets, changes can often be searchable less than a minute after submission if you need it this fast.
Solr includes a module it calls the "Data Import Handler", which is mostly for pulling in data from databases. It can even handle updates if the records contain a date. But getting data to Solr "just in time" will in all but one circumstance require some work on your part to coordinate. The exception here is using the "acts_as_solr" plugin available for the Ruby on Rails framework, which synchronizes data model changes automatically. The closest thing in the Java landscape is Hibernate Search, which is an extension to Hibernate that uses Lucene. Since it is not based on Solr, some notable features like faceting are not available.
No relational schema mapping
Solr has a flat schema and does not have any relational search capabilities. It does at least have multi-valued field support, which in most database schemas is done with additional tables. Your relational database schema is of course relational, and you will have to devise a mapping to put data into Solr. Often this is a straightforward mapping, perhaps requiring some de-normalization (i.e. in-lining related data, causing some duplication). However, there can be cases that are particularly difficult or impossible to map, such as when there is a one-to-many or many-to-many relationship with multiple searchable fields. If you suspect difficulties with mapping your relational database schema to Solr, then be sure to pose your scenario to the Solr community.
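The de-normalization described above typically means in-lining a child table into the parent's document as a multi-valued field. A small sketch (the product/tags tables and the field names are invented for the example):

```java
import java.util.*;

// Flatten a one-to-many relationship (product -> tag rows) into a single
// Solr-style flat document with a multi-valued "tags" field.
public class FlattenExample {
    static Map<String, Object> toSolrDoc(int id, String name, List<String> tagRows) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", id);
        doc.put("name", name);
        // The rows of the child table become one multi-valued field.
        doc.put("tags", new ArrayList<>(tagRows));
        return doc;
    }

    public static void main(String[] args) {
        // Two rows of a "tags" table in-lined into the product document.
        System.out.println(toSolrDoc(7, "MacBook", List.of("laptop", "apple")));
    }
}
```

The duplication cost appears when the same child row belongs to many parents: the flattened value is repeated in each parent document, and every parent must be re-indexed when the child changes.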
You probably already know that the SQL standard largely diverges into different dialects across database vendors. Text search features are especially different from vendor to vendor, because they are not governed by SQL or any other standard. Even if SQL standardized the query syntax, it wouldn't be enough, because it wouldn't govern tokenization, stop words, and various other configuration aspects of search. And consider that databases are becoming increasingly commoditized and standardized, such that more and more applications support many databases. Object Relational Mapping (ORM) frameworks like Hibernate and ActiveRecord help make this possible. It's an easier proposition to add Lucene or Solr on to such database-agnostic systems than to propose that the system contort itself to each vendor's text search features, perhaps in a least-common-denominator way.
This is the main reason to avoid database search. To a database, text search simply isn't as important compared to all the other things a database does. But this is what Solr is dedicated to. You can count on your database for the basics: fast term search with some basic query types and relevancy-ranked results, but you probably have very little control over text analysis or the extensive advanced query features already described. Moreover, Solr has a variety of extensibility points to customize it to your needs, and it is also open source if you'd prefer to tweak existing functionality that is close to what you want. This happens frequently in the Solr community; it is not just a theoretical point. Conversely, modifying one's open-source database to suit a project's needs is unheard of (well, rare, and certainly requiring more effort than modifying Solr). And if you're a Java programmer and your database is not Java based, Solr is all the more approachable to modification.
Faceting is a headliner feature in Solr, because it is a powerful method of navigating search results and because it's not a widely available capability, especially in open source. It may not seem hard at first, but it turns out that it is tricky to derive the right SQL queries to generate facets from a database. And if you're an SQL pro and figured it out, it will be an anticlimactic accomplishment when in all likelihood you discover that your queries run slowly for sizable data sets. Comparatively, Solr includes highly optimized code to both optimally filter on chosen facets and to generate facets with their counts. By the way, this is a capability in Solr, not Lucene, thus making this a reason to use Solr instead of other Lucene-based solutions (e.g. Hibernate Search, Compass, ...).
If your scenario is such that you are getting vast amounts of text from different sources (not databases under your direct control) and you want to do a text search, a database would simply be inappropriate. With Solr, you can choose to only index the data without storing it, thus saving much of the data storage a database would require. Solr is also optimized for fast queries (indexing is no slouch either), and it's fairly straightforward to scale out to more servers, whereas your database is built for on-line transaction processing (OLTP) usage or data warehousing.
Read the file "Readme.txt" placed at "..." and follow the instructions.
To index: java -jar post.jar fileName (name.extension, *.extension, *.*, etc.)
There are many other ways to import your data into Solr... one can:
- Import records from a database using the Data Import Handler (DIH)
- Load a CSV file (comma separated values), including those exported by Excel or MySQL
- POST JSON documents
- Index binary documents such as Word and PDF with Solr Cell (ExtractingRequestHandler)
- Use SolrJ for Java, or other Solr clients, to programmatically create documents to send to Solr
- Use the admin interface to submit queries and analyze document fields in order to fine-tune the Solr configuration
So, you build a web site or an online system and you want to add search to it. But then it hits you: getting search to work is hard. You want the search solution to be fast, painless to set up, and to scale. You want to be able to index data simply using JSON over HTTP, without having to pre-define schemas for the index. You want the search server to always be available and to start small but potentially scale large (Big Data large). You want to create as many indices as you see fit, which will support a diverse set of document types. As you search, you want it as close to real-time search as possible. Oh yeah... and it would be great if this search solution is built for the cloud.
"This should be easier... and cool, bonsai cool!"
Elasticsearch was created as a solution for these requirements and more. It is an Apache 2 licensed open source, distributed, and RESTful search engine, built on top of Apache Lucene.
Schema Free & Document Oriented
The data model of a search engine has its roots in schema-free, document-oriented databases, and, as shown by the #nosql movement, this model proves to be very effective for building applications. Elasticsearch's model is JSON, which is slowly emerging as the de-facto standard for representing data these days. Moreover, with JSON it is simple to provide structured data with complex entities, as well as being programming-language neutral, with first-level parsers.
Elasticsearch is schema-less: just toss it a typed JSON document and it will automatically index it. Types such as numbers and dates are automatically detected and treated accordingly. That said, as we all know, search engines are quite sophisticated. Fields in documents can have boost levels that affect scoring, analyzers can be used to control how text gets tokenized into terms, certain fields should not be analyzed at all, and so on... Elasticsearch allows you to completely control how a JSON document gets mapped into the search engine, on a per-type and per-index level.
GETting Some Data
Indexing data is always done using a unique identifier (at the type level). This is very handy, since many times we wish to update or delete the actual indexed data, or just GET it. Getting data could not be simpler: all that is needed is the index name, the type, and the id. What we get back is the actual JSON document used to index the specific data, with Elasticsearch effectively behaving as a distributed key/value store for structured documents.
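To illustrate the index/type/id addressing described above, here is what the round trip looks like with Elasticsearch's classic REST API (the "twitter" index, "tweet" type, and field values are only example names):

```
# Index a JSON document under index "twitter", type "tweet", id 1:
PUT /twitter/tweet/1
{ "user": "kimchy", "message": "trying out Elasticsearch" }

# Retrieve the same JSON document back by index, type, and id:
GET /twitter/tweet/1
```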
What it all boils down to in the end is being able to search, and with Elasticsearch it could not be simpler. Issuing queries is a simple call, hiding away the sophisticated distributed search support Elasticsearch provides. Search can be executed using either a simple Lucene-based query string or an extensive JSON-based search query DSL. Search does not end with just queries: facets, highlighting, custom scripts, and more are all there to be used when needed.
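The two query styles just mentioned, against Elasticsearch's classic REST API (index, type, and field names are illustrative):

```
# Lucene query-string style:
GET /twitter/tweet/_search?q=user:kimchy

# The same search expressed in the JSON query DSL:
GET /twitter/tweet/_search
{ "query": { "term": { "user": "kimchy" } } }
```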
A single index is already a major step forward, but what happens when we need more than one index? In many cases, multiple indices are required. An example can be storing an index per week of log files, or even having different indices with different settings (one with memory storage, and one with file system storage). Elasticsearch easily enables the creation of as many indices as required, allowing cross-index queries to be executed, and index grouping using advanced aliasing functionality.
The ability to configure is a double-edged sword. We want the ability to start working with the system as fast as possible, with no configuration, and still be able to control almost every aspect of the application if need be. Elasticsearch is built with this notion in mind. Almost everything is configurable and pluggable. Moreover, each index can have its own settings, which can override the master settings. For example, one index can be configured with memory storage and have 10 shards with 1 replica each, and another index can have file-based storage with 1 shard and 10 replicas. All the index-level settings can be controlled when creating an index, using either a YAML or JSON format.
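For example, per-index shard and replica counts can be set at creation time via the real `number_of_shards` and `number_of_replicas` settings (the index name here is invented):

```
PUT /logs-2012-week1
{
  "settings": {
    "number_of_shards": 10,
    "number_of_replicas": 1
  }
}
```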
One of the main features of Elasticsearch is its distributed nature. Indices are broken down into shards, each shard with 0 or more replicas. Each data node within the cluster hosts one or more shards and acts as a coordinator to delegate operations to the correct shard(s). Rebalancing and routing are done automatically and behind the scenes.
Elasticsearch has been purposely built with the cloud in mind, starting with options like auto-discovery of nodes when running on AWS EC2, through to being an adaptive distributed system that automatically handles machines coming and going and adjusts to a dynamic environment.
Compass provides a simple API for working with Lucene. If you know how to use an ORM, then you will feel right at home with Compass, with simple operations for save, delete & query. Building on top of Lucene, Compass simplifies common usage patterns of Lucene such as google-style search and index updates, as well as more advanced concepts such as caching and index sharding (sub indexes). Compass also uses built-in optimizations for concurrent commits and merges.
Compass provides support for mapping different data "formats" and storing (caching) them in the Search Engine: Object to Search Engine Mapping (using annotations or xml), JSON to Search Engine Mapping, XML to Search Engine Mapping (using simple xpath expressions), and the low-level Resource to Search Engine Mapping.
Compass provides a transactional API on top of the Search Engine, supporting several transaction isolation levels. Externally, Compass provides a local transaction manager, as well as integration with external transaction managers such as JTA (Sync and XA), Spring, and ORM ones.
Compass integrates seamlessly with most popular ORM frameworks, allowing automatic mirroring, to the index, of the changes in data performed via the ORM tool. Compass provides generic support for JPA, as well as embedded support for specific implementations, allowing Compass to be added in three simple steps.
Compass integrates seamlessly with Spring. Compass can be easily configured using Spring, integrates with Spring transaction management, has support for Spring MVC, and has built-in support for reflecting operations to the search engine.
Compass simplifies the creation of a distributed Lucene index by allowing the Lucene index to be stored in a database, as well as storing the index with Data Grid products.
Hibernate Search brings the power of full text search engines to the persistence domain model by combining Hibernate Core with the capabilities of Apache Lucene™. Full text search engines like Apache Lucene™ are very powerful technologies for adding efficient free-text search capabilities to applications. However, Lucene suffers several mismatches when dealing with object domain models. Amongst other things, indexes have to be kept up to date, and mismatches between index structure and domain model, as well as query mismatches, have to be avoided. Hibernate Search addresses these shortcomings. It indexes your domain model with the help of a few annotations, takes care of database/index synchronization, and brings back regular managed objects from free-text queries.
Hence, it solves:
- The structural mismatch: Hibernate Search takes care of the object/index translation
- The duplication mismatch: Hibernate Search manages the index, keeps changes synchronized with your database, and optimizes index access transparently
- The API mismatch: Hibernate Search lets you query the index and retrieve managed objects as any regular Hibernate query would do
Even though Hibernate Search uses Apache Lucene™ under the hood, you can always fall back to the native Lucene APIs if the need arises. Depending on application needs, Hibernate Search works well in both non-clustered and clustered mode, and provides synchronous and asynchronous index updates, allowing you to make an active choice between response time, throughput, and index update time. Last but not least, Hibernate Search works perfectly with all traditional Hibernate patterns,
especially the conversation pattern used by