SKOS-2-HIVE
UNT workshop
Introductions
Craig Willis (craig.willis@unc.edu)
Afternoon Session Schedule
Overview
Using HIVE as a service
Installing and configuring HIVE
Using HIVE Core API
Understanding HIVE Internals
HIVE supporting technologies
Developing and customizing HIVE
Block 1: Introduction
Workshop Overview
Schedule
Interactive, less structured
Hands-on (work together)
Activities:
Installing and configuring HIVE
Programming examples (HIVE Core API, HIVE REST API)
Background and Interests
What are you most interested in getting out of this part of the
workshop?
What is your background?
Cataloging, indexing, and classification
Programming and databases
Systems administration
What is your level of familiarity with the following technologies?
Java, Tomcat, Lucene
REST
RDF, SPARQL, SKOS, Sesame
HIVE Technical Overview
HIVE consists of many technologies combined to provide a
framework for vocabulary services
System for management of multiple controlled vocabularies in
SKOS/RDF format
Java-based web services that can run in any Java application server
Demonstration website (http://hive.nescent.org/)
Google Code project (http://code.google.com/p/hive-mrc/)
Architecture
HIVE Architecture
SPARQL: RDF query language (W3C recommendation)
REST: Web-based API and software architecture
Triple store: Database for the storage and retrieval of RDF data. Supports queries using SPARQL.
Sesame: Open source triple store
Elmo: Sesame API for common ontologies (OWL, Dublin Core, SKOS)
Lucene: Java-based search engine
KEA++: Algorithm and Java API for automatic subject suggestions from controlled vocabularies.
HIVE Functions
Conversion of vocabularies to SKOS
Rich internet application (RIA) for browsing and searching
multiple SKOS vocabularies
Java API and REST application interfaces for programmatic
access to multiple SKOS vocabularies
Support for natural language and SPARQL queries
Automatic keyphrase indexing using multiple SKOS
vocabularies. HIVE supports two indexers:
KEA++ indexer
Basic Lucene indexer
Block 2:
Using HIVE as a service
Using HIVE as a Service
HIVE web application
http://hive.nescent.org/
Developed by José Ramón Pérez Agüera and Lina Huang
Java servlet, Google Web Toolkit (GWT)
http://code.google.com/p/hive-mrc/wiki/AboutHiveWeb
HIVE REST service
http://hive.nescent.org/rs
Developed by Duane Costa, Long-Term Ecological Research Network
http://code.google.com/p/hive-mrc/wiki/AboutHiveRestService
Activity: Calling HIVE-RS
Demonstrate calling the HIVE-RS web service (Java)
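As a starting point for this activity, a plain-HTTP call to the service can be sketched as follows. The resource path "/rs/schemes" is a hypothetical example used only for illustration; see the AboutHiveRestService wiki page for the actual resource URLs exposed by the service.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Minimal sketch of calling the HIVE-RS service over plain HTTP.
 * The "/rs/schemes" resource path is an assumption for illustration.
 */
public class HiveRsClient {

    /** Joins the service base URL and a resource path. */
    static String buildUrl(String base, String resource) {
        return base + resource;
    }

    /** Performs an HTTP GET and returns the response body as a string. */
    static String get(String url) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        String url = buildUrl("http://hive.nescent.org", "/rs/schemes");
        System.out.println(url);
        if (args.length > 0) {   // only contact the server when asked
            System.out.println(get(url));
        }
    }
}
```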
Block 3:
Install and Configure HIVE
Installing and Configuring
HIVE
Requirements
Java 1.6
Tomcat
(HIVE is currently using 6.x)
Detailed installation instructions:
http://code.google.com/p/hive-mrc/wiki/InstallingHiveWeb
http://code.google.com/p/hive-mrc/wiki/InstallingHiveRestService
Installing and Configuring HIVE-web
Detailed installation instructions (hive-web)
http://code.google.com/p/hive-mrc/wiki/InstallingHiveWeb
Quick start (hive-web)
Download and extract Tomcat 6.x
Download and extract the latest hive-web war
Download and extract sample vocabulary
Configure hive.properties and agrovoc.properties
Start Tomcat
http://localhost:8080/
Properties files
hive.properties
Specifies enabled vocabularies and selected indexing algorithm
http://code.google.com/p/hive-mrc/source/browse/trunk/hive-web/war/WEB-INF/conf/hive.properties
<vocabulary>.properties
Specifies location of vocabulary databases/indexes on the local filesystem
http://code.google.com/p/hive-mrc/source/browse/trunk/hive-web/war/WEB-INF/conf/lcsh.properties
Installing and Configuring HIVE-web from source
Detailed installation instructions (hive-web)
http://code.google.com/p/hive-mrc/wiki/DevelopingHIVE
http://code.google.com/p/hive-mrc/wiki/InstallingHiveWeb
Requirements
Eclipse IDE for J2EE Developers
Subclipse plugin
Google Eclipse Plugin
Apache Ant
Google Web Toolkit 1.7.1
Tomcat 6.x
Installing and Configuring
HIVE REST Service
Detailed installation instructions (hive-rs)
http://code.google.com/p/hive-mrc/wiki/InstallingHiveRestService
Quick start (hive-rs)
Download and extract latest webapp
Download and extract sample vocabulary
Configure hive.properties
Start Tomcat
Importing SKOS Vocabularies
http://code.google.com/p/hive-mrc/wiki/ImportingVocabularies
Note memory requirements for each vocabulary
http://code.google.com/p/hive-mrc/wiki/HIVEMemoryUsage
java -Xmx1024m -Djava.ext.dirs=path/to/hive/lib edu.unc.ils.mrc.hive.admin.AdminVocabularies [/path/to/hive/conf/] [vocabulary] [train]
Block 4:
Using the HIVE Core Library
HIVE Core Interfaces
HIVE Core Packages
edu.unc.ils.mrc.hive.api
Main interfaces and implementations
edu.unc.ils.mrc.hive.converter
SKOS converters (MeSH, ITIS, NBII, TGN)
edu.unc.ils.mrc.hive.lucene
Lucene index creation and searching
edu.unc.ils.mrc.hive.ir.tagging
KEA++ and “dummy” tagger implementations
edu.unc.ils.mrc.hive.api
SKOSServer: Provides access to one or more vocabularies
SKOSSearcher: Supports searching across multiple vocabularies
SKOSTagger: Supports tagging/keyphrase extraction across multiple vocabularies
SKOSScheme: Represents an individual vocabulary (location of vocabulary on file system)
SKOSServer
SKOSServer is the top-level class used to initialize the vocabulary server.
It reads the hive.properties file and initializes the SKOSScheme (vocabulary management), SKOSSearcher (concept searching), and SKOSTagger (indexing) instances based on the vocabulary configurations.
edu.unc.ils.mrc.hive.api.SKOSServer
TreeMap<String, SKOSScheme> getSKOSSchemas();
SKOSSearcher getSKOSSearcher();
SKOSTagger getSKOSTagger();
String getOrigin(QName uri);
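As a sketch of how these methods fit together, the following initializes the server and lists the configured vocabularies. The concrete implementation class (SKOSServerImpl) and its constructor argument are assumptions here; check the hive-mrc source for the exact class names in your version.

```java
import edu.unc.ils.mrc.hive.api.SKOSServer;
// The concrete implementation class below is an assumption;
// consult the hive-mrc source tree for the actual class name.
import edu.unc.ils.mrc.hive.api.impl.elmo.SKOSServerImpl;

public class ListVocabularies {
    public static void main(String[] args) {
        // Path to the hive.properties file described on the earlier slide
        SKOSServer server =
                new SKOSServerImpl("/usr/local/hive/conf/hive.properties");

        // getSKOSSchemas() returns a TreeMap of vocabulary name -> SKOSScheme
        for (String name : server.getSKOSSchemas().keySet()) {
            System.out.println(name);
        }
    }
}
```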
SKOSSearcher
Supports searching across one or more configured
vocabularies.
Keyword queries using Lucene, SPARQL queries using
OpenRDF/Sesame
edu.unc.ils.mrc.hive.api.SKOSSearcher
searchConceptByKeyword(keyword)
searchConceptByURI(uri, lp)
searchChildrenByURI(uri, lp)
SPARQLSelect()
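A hypothetical keyword search using these methods might look like the following. The implementation class and the SKOSConcept accessors (getPrefLabel, getQName) are assumptions; verify them against the hive-mrc source before use.

```java
import java.util.List;

import edu.unc.ils.mrc.hive.api.SKOSConcept;
import edu.unc.ils.mrc.hive.api.SKOSSearcher;
import edu.unc.ils.mrc.hive.api.SKOSServer;
// Concrete implementation class assumed; check the hive-mrc source.
import edu.unc.ils.mrc.hive.api.impl.elmo.SKOSServerImpl;

public class KeywordSearch {
    public static void main(String[] args) {
        SKOSServer server =
                new SKOSServerImpl("/usr/local/hive/conf/hive.properties");
        SKOSSearcher searcher = server.getSKOSSearcher();

        // Keyword search across all configured vocabularies
        List<SKOSConcept> results =
                searcher.searchConceptByKeyword("water quality");
        for (SKOSConcept c : results) {
            // getPrefLabel()/getQName() are assumed accessor names
            System.out.println(c.getPrefLabel() + " <" + c.getQName() + ">");
        }
    }
}
```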
SKOSTagger
Keyphrase extraction using multiple vocabularies
Depends on setting in hive.properties: “dummy” or “KEA”
edu.unc.ils.mrc.hive.api.SKOSTagger
List<SKOSConcept> getTags(String text, List<String> vocabularies, SKOSSearcher searcher);
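Using the getTags signature above, a tagging call can be sketched as follows. The implementation class, the vocabulary name "agrovoc", and the getPrefLabel accessor are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;

import edu.unc.ils.mrc.hive.api.SKOSConcept;
import edu.unc.ils.mrc.hive.api.SKOSServer;
import edu.unc.ils.mrc.hive.api.SKOSTagger;
// Concrete implementation class assumed; check the hive-mrc source.
import edu.unc.ils.mrc.hive.api.impl.elmo.SKOSServerImpl;

public class TagText {
    public static void main(String[] args) {
        SKOSServer server =
                new SKOSServerImpl("/usr/local/hive/conf/hive.properties");
        SKOSTagger tagger = server.getSKOSTagger();

        String text = "Climate change affects biodiversity in coastal ecosystems.";
        // getTags signature taken from the slide above; "agrovoc" is a
        // hypothetical vocabulary name from hive.properties
        List<SKOSConcept> tags = tagger.getTags(
                text, Arrays.asList("agrovoc"), server.getSKOSSearcher());
        for (SKOSConcept c : tags) {
            System.out.println(c.getPrefLabel());
        }
    }
}
```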
SKOSScheme
Represents an individual vocabulary, based on settings in
<vocabulary>.properties
Supports querying of statistics about each vocabulary (number of concepts, number of relationships, etc.)
Activity
Demonstrate a simple Java class that allows the user to query
for a given term.
Demonstrate a simple Java class that can read a text file and
call the tagger.
Block 5:
Understanding HIVE Internals
Architecture
Data Directory Layout
/usr/local/hive/hive-data
vocabulary/
vocabulary.rdf: SKOS RDF/XML
vocabularyAlphaIndex: Serialized map
vocabularyH2: H2 database (used by KEA)
vocabularyIndex: Lucene index
vocabularyKEA: KEA model and training data
vocabularyStore: Sesame/OpenRDF store
topConceptIndex: Serialized map of top concepts
Keyword Search
Indexing
HIVE Internals: Data Models
Lucene Index: Index of the SKOS vocabulary (view with Luke)
Sesame/OpenRDF Store: Native/Sail RDF repository for the vocabulary
KEA++ Model: Serialized KEAFilter object
H2 Database: Embedded DB containing the SKOS vocabulary in the format used by KEA. (Can be queried using the H2 command line.)
Alpha Index: Serialized map of concepts
Top Concept Index: Serialized map of top concepts
HIVE Internals: HIVE Web
GWT Entry Points:
HomePage
ConceptBrowser
Indexer
Servlets
VocabularyService: Singleton vocabulary server
FileUpload: Handles the file upload for indexing
ConceptBrowserServiceImpl
IndexerServiceImpl
HIVE Internals: HIVE-RS
Java API for RESTful Web Services (JAX-RS)
Classes
ConceptsResource
SchemesResource
Block 6:
HIVE Supporting Technologies
HIVE supporting technologies
Lucene
http://lucene.apache.org
Sesame
http://www.openrdf.org/
KEA
http://www.nzdl.org/Kea/
H2
http://www.h2database.com/
GWT
http://code.google.com/webtoolkit/
Activity
Explore Lucene index with Luke
http://luke.googlecode.com/
Explore Sesame store with SPARQL
http://www.xml.com/pub/a/2005/11/16/introducing-sparql-querying-semantic-web-tutorial.html
http://www.cambridgesemantics.com/2008/09/sparql-by-example/
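As a starting point for exploring the Sesame store, a query like the following lists concepts and their preferred labels from a SKOS vocabulary. Here the query string is only built and printed; in HIVE it could be passed to SKOSSearcher.SPARQLSelect() (shown on the earlier API slide), though the exact invocation may differ in your version.

```java
/**
 * Builds a sample SPARQL query for exploring a SKOS vocabulary store:
 * list concepts and their skos:prefLabel values, capped at `limit` rows.
 */
public class SparqlExample {

    static String conceptLabelQuery(int limit) {
        return "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
             + "SELECT ?concept ?label\n"
             + "WHERE {\n"
             + "  ?concept a skos:Concept ;\n"
             + "           skos:prefLabel ?label .\n"
             + "}\n"
             + "LIMIT " + limit + "\n";
    }

    public static void main(String[] args) {
        System.out.println(conceptLabelQuery(10));
    }
}
```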
Block 7:
Customizing HIVE
Obtaining Vocabularies
Several vocabularies can be freely downloaded
Some vocabularies require licensing
HIVE Core includes converters for each of the supported
vocabularies.
List of HIVE vocabularies
http://code.google.com/p/hive-mrc/wiki/VocabularyConversion
Converting Vocabularies to SKOS
Additional information
http://code.google.com/p/hive-mrc/wiki/VocabularyConversion
Each vocabulary has different requirements
LCSH
Available in SKOS RDF/XML
NBII
Convert from XML to SKOS RDF/XML (SAX)
ITIS
Convert from RDB (MySQL) to SKOS RDF/XML
TGN
Convert from flat-file to SKOS RDF/XML
LTER
Available in SKOS RDF/XML
AGROVOC
Available in SKOS RDF/XML
MeSH
Convert from XML to SKOS RDF/XML (SAX)
Converting Vocabularies to SKOS
A Method to Convert Thesauri to SKOS (van Assem et al.)
Prolog implementation
IPSV, GTAA, MeSH
http://thesauri.cs.vu.nl/eswc06/
Converting MeSH to SKOS for HIVE
Java SAX-based parser
http://code.google.com/p/hive-mrc/wiki/MeshToSKOS
LTER Sample Service
http://scoria.lternet.edu:8080/lter-hive-prototypes
Block 8:
KEA++
About KEA++
http://www.nzdl.org/Kea/
Algorithm and open-source Java library for extracting keyphrases from documents using SKOS vocabularies.
Developed by Alyona Medelyan (KEA++), based on earlier work by Ian Witten (KEA), from the Digital Libraries and Machine Learning Lab at the University of Waikato, New Zealand.
Problem:
How can we automatically identify the topic of
documents?
Automatic Indexing
Free keyphrase indexing (KEA)
Significant terms in a document are determined based on intrinsic properties
(e.g., frequency and length).
Keyphrase indexing (KEA++)
Terms from a controlled vocabulary are assigned based on intrinsic
properties.
Controlled indexing/term assignment:
Documents are classified based on content that corresponds to a controlled
vocabulary.
e.g., Pouliquen, Steinberger, and Camelia (2003)
Medelyan, O. and Witten, I.H. (2008). “Domain-independent automatic keyphrase indexing with small training sets.”
Journal of the American Society for Information Science and Technology, 59(7): 1026-1040.
KEA++ at a Glance
KEA++ uses a machine learning approach to keyphrase extraction
Two stages:
Candidate identification: Finds terms that relate to the document’s content
Keyphrase selection: Uses a model to identify the most significant terms
KEA++: Candidate identification
Parse tokens based on whitespace and punctuation
Create word n-grams based on the longest term in the CV
Stem to grammatical root (Porter)
Stem terms in vocabulary (Porter)
Replace non-descriptors with descriptors using CV relationships
Match stemmed n-grams to vocabulary
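The first two steps above can be sketched in plain Java (a toy illustration, not KEA++'s actual implementation; stemming and vocabulary matching are omitted):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Toy sketch of KEA++-style candidate generation: tokenize on
 * whitespace/punctuation, then emit word n-grams up to a maximum
 * length (in KEA++, the length of the longest vocabulary term).
 */
public class CandidateNgrams {

    static List<String> ngrams(String text, int maxLen) {
        // Split on runs of whitespace or punctuation
        String[] tokens = text.toLowerCase().split("[\\s\\p{Punct}]+");
        List<String> candidates = new ArrayList<>();
        for (int n = 1; n <= maxLen; n++) {
            for (int i = 0; i + n <= tokens.length; i++) {
                candidates.add(String.join(" ",
                        Arrays.copyOfRange(tokens, i, i + n)));
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("Organizing information, informally.", 2));
    }
}
```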
KEA++: Candidate identification
Original                      Stemmed
“information organization”    “inform organ”
“organizing information”      “inform organ”
“informative organizations”   “inform organ”
“informal organization”       “inform organ”
Stemming is not perfect ...
KEA++: Feature definition
Term Frequency/Inverse Document Frequency:
Compares the frequency of a phrase’s occurrence in a document with its frequency in general use.
Position of first occurrence:
Distance from the beginning of the document. Candidates with very high or very low values are more likely to be valid (introduction/conclusion).
Phrase length:
Analysis suggests that indexers prefer to assign two-word descriptors.
Node degree:
Number of relationships between the term and other terms in the CV.
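As a rough illustration of the first feature, a TF*IDF score for a candidate phrase can be computed as below. This is a toy sketch: KEA++'s actual feature derives its "general use" statistics from a training corpus rather than a simple document count.

```java
/**
 * Toy TF*IDF computation for a candidate phrase.
 * tf  = occurrences of the phrase in the document / document length
 * idf = log(total documents / documents containing the phrase)
 */
public class TfIdfExample {

    static double tfIdf(int phraseCount, int docLength,
                        int totalDocs, int docsWithPhrase) {
        double tf = (double) phraseCount / docLength;
        double idf = Math.log((double) totalDocs / docsWithPhrase);
        return tf * idf;
    }

    public static void main(String[] args) {
        // A phrase occurring 5 times in a 1000-token document,
        // appearing in 10 of 1000 reference documents.
        System.out.println(tfIdf(5, 1000, 1000, 10));
    }
}
```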
DummyTagger
Primarily intended as a baseline for analysis of KEA++
Uses LingPipe for part-of-speech identification (limits indexing to certain parts of speech)
Uses the Lucene vocabulary index
Simple TF*IDF implementation
Configurable in hive.properties
Who’s Using HIVE
NESCENT/Dryad
Evaluating HIVE for automatic term suggestion from multiple
vocabularies for scientific article metadata.
Long-Term Ecological Research Network (LTERNet)
http://scoria.lternet.edu:8080/lter-hive-prototypes/
Prototype application for automatic term suggestion for EML
metadata files.
Library of Congress Web Archives
Evaluating HIVE for automatic term suggestion for web archive
(WARC) files
Plans
Automatic updates to vocabularies
Integration of other concept extraction algorithms
Maui
Dryad integration
Other
Maven integration
Spring integration
Data directory and property file restructuring
Concept browser updates
Discussion
Pros and Cons
HIVE Core vs. HIVE Web vs. HIVE-RS
Brainstorm applications that could benefit from HIVE, discuss
implementations
Credits
Ryan Scherle
José Ramón Pérez Agüera
Lina Huang
Alyona Medelyan
Ian Witten
Questions / Comments
Craig Willis
craig.willis@unc.edu