OpenNLP Similarity component
This component performs text relevance assessment: it accepts two portions of text (phrases, sentences, paragraphs) and
returns a similarity score.
The Similarity component can be used on top of search to improve relevance, computing a similarity score between a
question and all search results.
Also, this component is useful for web mining of images, videos, forums, and other media with textual
descriptions. Applications such as content generation and filtering of meaningless speech recognition results are
included in the sample applications of this component. Relevance assessment is based on machine learning of
syntactic parse trees (constituency trees). The similarity score is calculated as the size of all maximal common
subtrees for sentences from a pair of texts.
The objective of the Similarity component is to give an application engineer a tool for text relevance which can be
used as a black box, with no need to understand computational linguistics or machine learning.
The entry point to the Similarity component is
SentencePairMatchResult matchRes = sm.assessRelevance(sentence1,sentence2);
The returned SentencePairMatchResult includes the similarity score (weighted number of common terms) and the set of
maximum common parse trees.
1. First use case of Similarity component: search
To start with this component, please refer to
est.java in package
public void testSearchOrder()
runs web search using Bing API and improves search relevance.
Look at the code of
public List<HitBase> runSearch(String query)
and then at
, String searchQuery)
which gets search results from Bing and re-ranks them based on the computed similarity score.
The main entry to Similarity component is
SentencePairMatchResult matchRes = sm.assessRelevance(snapshot,
where we pass the search query and the snapshot and obtain the similarity assessment structure, which includes
the similarity score.
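The re-ranking step described above can be sketched as follows. This is only an illustration under stated assumptions: the Hit class, the rerank method and the wordOverlap scorer are hypothetical names introduced here, and the toy scorer counts shared words rather than common parse subtrees. In a real setup the scorer argument would be backed by the score obtained from sm.assessRelevance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToDoubleBiFunction;

// Illustrative sketch only: Hit, rerank and wordOverlap are hypothetical stand-ins,
// not classes or methods of the Similarity component.
class RerankSketch {

    // Hypothetical search hit: a snippet plus the relevance score assigned to it.
    static final class Hit {
        final String snippet;
        final double score;

        Hit(String snippet, double score) {
            this.snippet = snippet;
            this.score = score;
        }
    }

    // Toy scorer (an assumption): number of distinct lower-cased words shared by both texts.
    static double wordOverlap(String a, String b) {
        Set<String> common = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\W+")));
        common.retainAll(new HashSet<>(Arrays.asList(b.toLowerCase().split("\\W+"))));
        common.remove(""); // split() yields an empty token when a text starts with punctuation
        return common.size();
    }

    // Scores every snippet against the query and returns the hits sorted by descending score.
    static List<Hit> rerank(String query, List<String> snippets,
                            ToDoubleBiFunction<String, String> scorer) {
        List<Hit> hits = new ArrayList<>();
        for (String snippet : snippets) {
            hits.add(new Hit(snippet, scorer.applyAsDouble(snippet, query)));
        }
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return hits;
    }

    public static void main(String[] args) {
        List<String> snippets = Arrays.asList(
            "Cheap flights to Paris",
            "How to remove a coffee stain from a carpet",
            "Coffee brewing guide");
        for (Hit h : rerank("remove coffee stain from carpet", snippets, RerankSketch::wordOverlap)) {
            System.out.println(h.score + "\t" + h.snippet);
        }
    }
}
```

Passing the scorer as a parameter keeps the sorting logic independent of how relevance is computed, so the toy scorer can be swapped for a parse-tree-based one without touching rerank.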
To run this test, you need to obtain a search API key from Bing and specify it in
public class BingQueryRunner in
protected static final String APP_ID.
2. Solving a unique problem: content generation
To demonstrate the usability of the Similarity component for tackling a problem which is hard to solve without such a
technology, we introduce a content generation application.
The entry point here is the function call
hits = f.generateContentAbout
which writes a biography of the given person
by finding sentences on the web about various kinds of that person's
activities (such as 'born', 'graduate', 'invented' etc.).
The key here is to compute similarity between a seed expression like
"invented relativity theory"
and a search result like
College of Medicine | Medical Education | Biomedical ...
College of Medicine is one of the nation's premier institutions for me
and filter out irrelevant search results.
This is done in function
public HitBase augmentWithMinedSentencesAndVerifyRel
SentencePairMatchResult matchRes = sm.assessRelevance(pageSentence + " " +
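The filtering idea described above (score each mined sentence against the seed expression and drop the ones that score too low) can be sketched as follows. This is a minimal sketch under stated assumptions: keepRelevant, sharedWords and the threshold value are hypothetical and introduced only for illustration. The actual component derives its score from common parse subtrees, not from shared words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Illustrative sketch only: keepRelevant, sharedWords and the threshold are assumptions,
// not the Similarity component's actual API or tuning.
class RelevanceFilterSketch {

    // Keeps only the candidates whose similarity to the seed expression reaches minScore.
    static List<String> keepRelevant(String seed, List<String> candidates,
                                     ToDoubleBiFunction<String, String> scorer,
                                     double minScore) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (scorer.applyAsDouble(candidate, seed) >= minScore) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    // Toy stand-in scorer: counts the seed words that occur in the candidate
    // as whole words, ignoring case.
    static double sharedWords(String candidate, String seed) {
        List<String> words = Arrays.asList(candidate.toLowerCase().split("\\W+"));
        int shared = 0;
        for (String w : seed.toLowerCase().split("\\W+")) {
            if (!w.isEmpty() && words.contains(w)) {
                shared++;
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList(
            "The physicist invented the theory of relativity",
            "College of Medicine is one of the nation's premier institutions");
        // Hypothetical threshold: at least two seed words must be present.
        System.out.println(keepRelevant("invented relativity theory", candidates,
            RelevanceFilterSketch::sharedWords, 2.0));
    }
}
```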
You can consult the results in gen.txt, where an essay on
These are examples of generated articles, given the article title
3. Solving a high-importance problem: filtering out meaningless speech recognition results
Speech recognition SDKs usually produce a number of phrases as results, such as
"remember to buy milk tomorrow from trader
"remember to buy milk tomorrow from 3 to
One can see that the former is meaningful, and the latter is meaningless (although similar in terms of how it is
pronounced).
We use web mining and the Similarity component to detect a meaningful option (a mistake caused by
trying to interpret a meaningless request by a query understanding system, such as one for iPhone, can be costly).
SpeechRecognitionResultsProcessor.java does the job:
ranks the phrases in order of decreasing meaningfulness.
4. Package structure
Similarity component internals are in the package
does parsing of two portions of text and matching the
resultant parse trees to assess similarity between
these portions of text.
private static String MODEL_DIR = "r
needs to be specified.
The key function
public SentencePairMatchResult assessRelevance(String para1, String para2)
takes two portions of text and does similarity assessment by finding the set of all maximum common subtrees
of the set of parse trees for each portion of text.
It splits paragraphs into sentences, parses them, o
information and produces grouped phrases
, prepositional etc.):
public synchronized List<List<ParseTreeChunk>>
and then attempts to find common subtrees:
List<List<ParseTreeChunk>> res =
Phrase matching functionality is in pa
Here's the key matching function which takes two phrases, aligns them and finds a set of maximum common sub
public List<ParseTreeChunk> generalize
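To give a feel for what aligning two phrases and generalizing them can mean, here is a deliberately simplified sketch. It is not the component's implementation: the Token class, the "*" wildcard and the rule used (take the single longest contiguous run of matching part-of-speech tags, keeping a lemma only when both phrases agree on it) are assumptions made for illustration. The real function works on ParseTreeChunk objects and returns a list of results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: a simplified stand-in for phrase generalization,
// not the algorithm used inside the Similarity component.
class GeneralizeSketch {

    // Hypothetical token: a part-of-speech tag and a lemma.
    static final class Token {
        final String pos;
        final String lemma;

        Token(String pos, String lemma) {
            this.pos = pos;
            this.lemma = lemma;
        }
    }

    // Finds the longest contiguous run of positions where both phrases have the same POS tag
    // and returns it as "POS-lemma" items, with "*" wherever the lemmas differ.
    static List<String> generalize(List<Token> a, List<Token> b) {
        int[][] run = new int[a.size() + 1][b.size() + 1]; // run[i][j]: matching run ending at a[i-1], b[j-1]
        int best = 0;
        int endA = 0;
        int endB = 0;
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).pos.equals(b.get(j - 1).pos)) {
                    run[i][j] = run[i - 1][j - 1] + 1;
                    if (run[i][j] > best) {
                        best = run[i][j];
                        endA = i;
                        endB = j;
                    }
                }
            }
        }
        List<String> result = new ArrayList<>();
        for (int k = best; k >= 1; k--) {
            Token x = a.get(endA - k);
            Token y = b.get(endB - k);
            result.add(x.pos + "-" + (x.lemma.equals(y.lemma) ? x.lemma : "*"));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Token> first = Arrays.asList(new Token("VB", "buy"), new Token("DT", "a"),
            new Token("JJ", "digital"), new Token("NN", "camera"));
        List<Token> second = Arrays.asList(new Token("DT", "the"), new Token("JJ", "digital"),
            new Token("NN", "camera"), new Token("IN", "for"), new Token("NN", "kid"));
        System.out.println(generalize(first, second));
    }
}
```

In this toy version the length of the returned list could serve as a crude similarity score. That is a simplification of the weighted score mentioned earlier in this document.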
Package structure is as follows:
opennlp.tools.similarity.apps: 3 main applications
opennlp.tools.similarity.apps.utils: utilities for above applications
parser which converts text into a form for matching parse trees
opennlp.tools.textsimilarity: parse tree matching functionality