Reproduction prohibited without authorization by Computas AS ©

Implementation of Topic Centered Portals
David Norheim, Computas AS, Norway
Robert Engels, ESIS AS, Norway

Motivation
The system
Challenges and lessons learned
Future work

Computas
23 years of experience in knowledge management, expert systems, and process modeling
Special focus on government and the oil and gas sector
The major semantic web company in Norway

Computas’ semantic web activities
Sectors
• Oil and gas industry
• Government
Types of applications
• Knowledge management
• Semantic search support
• Research and commercial projects

Background
A clear shift towards open source and open standards
• Linux for Schools, Open Document formats in the public sector
• National semantic registry, large governmental information portals based on semantic standards
• The government, through the Norwegian Archive, Library and Museum Authority (ABM-utvikling): development of open-standards-based, open-source software for creation and maintenance of topic-driven portals.
“there is a need for a targeted effort to create a framework based on Semantic Web to enable professional users to organize information and to make libraries build and maintain metadata-driven search solutions.”
A digital culture and knowledge policy? EFN.no

A topic driven portal
For a library it is as natural to evaluate, describe and enable retrieval of any resource on the web as printed material
Quality evaluated collection of information resources organized according to some topic structure and published online.
Retrieval through search and navigation in topics
Source: Ellen Aarbakken, Oslo Public Library (Deichmanske Bibliotek)

Why a topic centric portal tool and not search?
Yahoo! provided the first subject driven portal, but focused on the most popular aspects -> replaced by search (e.g. Google)
However, the words in the long tail are context dependent, and generic web search will frequently pollute results due to ambiguity
Examples of long tail portals
• Medical information for laymen
• Primary school educational resources
• Public information for immigrants
• Legal information for laymen
• Norwegian architecture portal

Why not Web 2.0?
• Folksonomies
  • Collaborative “categorization”
  • Freely chosen keywords
  • Manual “tagging”, practically no existing metadata
  • Mostly acting as a popularity measure
• Topic tools
  • Conceptual level with navigation
  • Quality evaluated with metadata
  • Manual “tagging”, but support for more automation

SUBject oriented tool for LIbraries, Museums and Archives
Several roads to the same destination
Key requirements in developing the tool
• Handle metadata of various sources and vocabularies (e.g. Dublin Core)
• Interoperability among portals based on the same tool and same protocols (SPARQL, SRU)
• Open source and open (semantic web) standards
• Combining free text search and navigation through models
• Handling both informal and formal models (e.g. SKOS and OWL DL) - future

Two initial portals
Scandinavian Medical Information for Laymen (SMIL) is a Scandinavian international cooperation to offer quality controlled metadata with references to pages related to health, illnesses and treatments. Contributing partners to the portal are librarians and nurses from the Nordic countries. The current SMIL base consists of 8,500 records, creating around 250,000 triples.
Detektor targets public schools. Resources are annotated by public libraries. The portal consists of about 1,850 topics and 4,600 resources, which results in about 100,000 triples.

Portal technical characteristics (grounding technologies)
Technology | Name, release | Comment
Operating system | Linux Ubuntu | Also tested on Red Hat, Windows and OS X
Database | PostgreSQL (under Jena), indexing with Lucene | Should work with any storage that supports SPARQL and SPARUL
Document repository | The Web, any URLs |
Web server | Apache Tomcat v.5.5 and 6.0 | Also tested on Resin
Applied ontology | Domain ontology and portal ontology (object types) |
Ontology language | SKOS, RDF/S | Currently implementing OWL support
Export/Import | RDF/XML, Turtle |
Reuse and interoperability | Voc.: DC, FOAF, SIOC, Powder, Lingvoj. Query lang.: SPARQL, CQL | Also using SPARUL
Inference engine | None | Will implement an inference engine with OWL DL support Q4 2008
Ontology editor | Internal web-based, Protégé (external) | Export ontology and continue to work in any RDF/OWL compliant ontology editor
User interface | HTML, Apache Cocoon |
License | Open source, CDDL license |
Evaluation criteria inspired by the Esperonto project

Architecture
[Architecture diagram. Component labels: Web client; Search and navigation; SPARQL dispatcher; SPARQL queries; Local endpoint; Indexing; Topic ontology; Metadata store; Ontology maintenance; External clients; SRU client; External servers; SRU server; SPARQL endpoint; Crawler; Portal configuration; SPARQL update; Web resources; Open search; SPARQL client]
The client consists of a search interface allowing users to search using free text and metadata search. The search string is transformed into a structured SPARQL query.
Interoperability at the query layer
The system accepts queries from both SPARQL and SRU/CQL
The backend consists of an RDF store with SPARQL interfaces. Free-text indexing uses Lucene/LARQ.
The system can query external SPARQL and SRU/CQL services
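The step from search string to structured query could look roughly like the sketch below. It is not taken from the Sublima code base: the function names, the choice of dc:title as the matched property, and the "all terms required" policy are assumptions made for illustration. The pf:textMatch property is the LARQ property function in ARQ; Lucene special characters in the user input are not escaped here.

```python
# Sketch: turn a free-text search string into a SPARQL query that uses
# the LARQ free-text index. All names here are illustrative assumptions.

LARQ_PREFIX = "PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>"
DC_PREFIX = "PREFIX dc: <http://purl.org/dc/elements/1.1/>"


def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_search_query(search_string: str) -> str:
    """Build a DESCRIBE query for resources whose title matches every term."""
    terms = search_string.split()
    if not terms:
        raise ValueError("search string is empty")
    # Lucene query syntax: a leading '+' marks a term as required.
    lucene_query = " ".join("+" + term for term in terms)
    return "\n".join([
        LARQ_PREFIX,
        DC_PREFIX,
        "DESCRIBE ?resource",
        "WHERE {",
        f'  ?literal pf:textMatch "{escape_literal(lucene_query)}" .',
        "  ?resource dc:title ?literal .",
        "}",
    ])


if __name__ == "__main__":
    print(build_search_query("breastfeeding twins"))
```

The resulting string would be sent to the local SPARQL endpoint. Matching only on dc:title is the simplest case and is exactly what the indexing lessons later in the deck show to be insufficient.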

Sublima
Ontologies generally provide the structure for the navigation of the results, support browsing and classification.
Ontologies allow for term disambiguation, query rewriting and semantic distance measures
In Sublima we use informal SKOS for
• Navigating through subjects, showing the subject relations (“fish eye”)
• Search expansion; synonyms, common misspellings
• Faceted filtering; topics as well as other metadata
Future versions will also support OWL DL
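Search expansion over SKOS labels can be illustrated with a small in-memory sketch. It assumes that synonyms are stored as alternative labels and common misspellings as hidden labels, which is how SKOS intends those label types to be used; the Concept class and expand_term function are hypothetical and not part of Sublima.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Concept:
    """A SKOS-like concept reduced to its labels (illustrative only)."""
    pref_label: str
    alt_labels: List[str] = field(default_factory=list)     # synonyms
    hidden_labels: List[str] = field(default_factory=list)  # misspellings

    def labels(self) -> List[str]:
        return [self.pref_label, *self.alt_labels, *self.hidden_labels]


def expand_term(term: str, concepts: List[Concept]) -> List[str]:
    """Return all labels of the first concept matching `term`, else [term]."""
    needle = term.casefold()
    for concept in concepts:
        if any(needle == label.casefold() for label in concept.labels()):
            # dict.fromkeys removes duplicates while keeping order.
            return list(dict.fromkeys(concept.labels()))
    return [term]


if __name__ == "__main__":
    concepts = [Concept("influenza", ["flu"], ["influensa"])]
    print(expand_term("Flu", concepts))
```

A search for any one label is then rewritten into a search for all labels of the same concept, so a misspelled query still reaches resources annotated with the preferred term.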

Good and bad choices, lessons learned the hard way
• Keeping the semantics
• Living with free-text indexing and structured queries
• Tool maturity
• Scalability
Keep in mind this is NOT a research project, but one with real and demanding customers expecting everything to work

Preserving the semantics
We needed flexibility for users to add any metadata without touching code
SPARQL SELECT loses the meaning, returning only bindings, hence clients become static. We therefore used SPARQL DESCRIBE extensively
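Why DESCRIBE keeps clients flexible can be shown with a minimal sketch. A DESCRIBE result is an RDF graph, so a client can render whatever properties it finds instead of a fixed list of SELECT variables. The triples are modelled as plain tuples here, and the property ex:readingLevel is a made-up example of metadata added later without any code change.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]


def render_description(subject: str, triples: Iterable[Triple]) -> List[str]:
    """Render every property of `subject` found in a DESCRIBE result.

    No property name is hard-coded, so new metadata shows up automatically.
    """
    by_predicate = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            by_predicate[p].append(o)
    return [
        f"{predicate}: {', '.join(sorted(objects))}"
        for predicate, objects in sorted(by_predicate.items())
    ]


if __name__ == "__main__":
    graph = [
        ("ex:r1", "dc:title", "Twins and breastfeeding"),
        ("ex:r1", "dc:language", "en"),
        ("ex:r1", "ex:readingLevel", "layman"),  # hypothetical new property
    ]
    for line in render_description("ex:r1", graph):
        print(line)
```

A SELECT-based client would need a new variable and new rendering code for each added property, which is the static behaviour the slide refers to.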

Living with free-text indexing and structured queries
Indexing with respect to structure
Our breastfeeding twin-problem
• Not sufficient to index all literals, as users expect hits on the combination of dc:title and dc:description
• And even worse: the combination of dc:title and dc:subject/skos:prefLabel
Scoring/ranking
• Easy with SELECT, but not with DESCRIBE
• How do you rank results from a structured query?
No universal way to handle structured and unstructured information
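One way to picture the breastfeeding twin problem is a per-resource index that combines the title, the description and the labels of the linked topics, so that "breastfeeding twins" hits a resource titled "Feeding twins" whose topic is labelled "Breastfeeding". The sketch below is an assumption-laden toy, not the approach used in the project: the field weights, the dictionary layout and the "every term must match somewhere" rule are all invented for illustration.

```python
from typing import Dict, List, Set

# Assumed weights: a hit in the title counts more than one in the description.
FIELD_WEIGHTS = {"title": 3, "subject": 2, "description": 1}


def build_index(resources: Dict[str, dict],
                topic_labels: Dict[str, str]) -> Dict[str, Dict[str, Set[str]]]:
    """Tokenize title, description and topic labels per resource."""
    index = {}
    for uri, meta in resources.items():
        subject_text = " ".join(
            topic_labels.get(topic, "") for topic in meta.get("dc:subject", [])
        )
        fields = {
            "title": meta.get("dc:title", ""),
            "description": meta.get("dc:description", ""),
            "subject": subject_text,
        }
        index[uri] = {name: set(text.casefold().split())
                      for name, text in fields.items()}
    return index


def search(index: Dict[str, Dict[str, Set[str]]], query: str) -> List[str]:
    """Return URIs where every term matches some field, best score first."""
    terms = query.casefold().split()
    if not terms:
        return []
    hits = []
    for uri, fields in index.items():
        score = 0
        for term in terms:
            weights = [FIELD_WEIGHTS[name]
                       for name, tokens in fields.items() if term in tokens]
            if not weights:
                break  # a required term is missing from every field
            score += max(weights)
        else:
            hits.append((score, uri))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [uri for _, uri in hits]
```

The score here comes from the index, not from the structured query, which mirrors the ranking question raised on the slide: the graph returned by DESCRIBE carries no ranking of its own.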

Consistent tool maturity and missing links
Some “small” issues
• Support for Turtle in Protégé -> needed to convert to RDF/XML
• Resources identified with URLs in Protégé
• Tools mostly geared towards one dialect of RDF/OWL
• Nondeterministic RDF/XML serialization for XSLT processing
• Lacking a binding from OWL classes to OO languages
The simple things sometimes turn out to be the hardest…
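The serialization issue arises because the same RDF graph can be written as RDF/XML in many equivalent shapes, which makes XSLT templates brittle. One possible workaround, sketched here as an assumption and not as the fix used in the project, is to emit a fixed-shape XML document from sorted triples before any XSLT step. The element and attribute names are invented, and blank nodes are ignored.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]


def to_stable_xml(triples: Iterable[Triple]) -> str:
    """Serialize triples into XML whose shape does not depend on input order."""
    root = ET.Element("resources")
    by_subject = {}
    # Sorting (and de-duplicating) makes the output order deterministic.
    for s, p, o in sorted(set(triples)):
        by_subject.setdefault(s, []).append((p, o))
    for subject, properties in by_subject.items():
        resource = ET.SubElement(root, "resource", {"uri": subject})
        for predicate, value in properties:
            prop = ET.SubElement(resource, "property", {"name": predicate})
            prop.text = value
    return ET.tostring(root, encoding="unicode")
```

An XSLT stylesheet can then match on resource and property elements without caring how the RDF toolkit happened to nest or abbreviate the graph.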

Scalability
Response time varies with store size and query complexity
• Too much complexity in queries
Moving from 500k triples to tens of millions
• Need to refactor into smaller, faster queries
• Federation of queries
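Refactoring into smaller queries could, for example, mean splitting one large query into a cheap paged SELECT for resource URIs followed by a DESCRIBE of just that page. This is a sketch under assumptions: the function names are hypothetical, and linking resources to topics through dc:subject is inferred from the indexing slide, not confirmed as the Sublima data model.

```python
from typing import List


def select_page(topic_uri: str, page: int, page_size: int = 20) -> str:
    """Step 1: fetch only the URIs for one page of a topic's resources."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size > 0")
    return (
        "PREFIX dc: <http://purl.org/dc/elements/1.1/>\n"
        "SELECT ?resource\n"
        f"WHERE {{ ?resource dc:subject <{topic_uri}> }}\n"
        "ORDER BY ?resource\n"
        f"LIMIT {page_size} OFFSET {page * page_size}"
    )


def describe_batch(uris: List[str]) -> str:
    """Step 2: describe only the resources on the current page."""
    if not uris:
        raise ValueError("no URIs to describe")
    return "DESCRIBE " + " ".join(f"<{uri}>" for uri in uris)
```

Each query stays small regardless of store size, at the cost of one extra round trip per page.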

Some good lessons
New standards (e.g. SPARQL), proposals for standardization (e.g. SPARUL), new tools (e.g. Jena), open source (e.g. Tomcat, Apache), and lack of good documentation all say high risk!
However, the support and maintenance from the W3C community and open source developers (e.g. the Jena team) has been impressive; the support through IRC channels, mailing lists etc. has been invaluable for the project.

Some good lessons
Good experiences with reusing metadata schemas
• FOAF, Dublin Core, Powder, SKOS, SIOC, Lingvoj
Extensive dereferencing of URIs: any topic and resource URI pasted in the browser results in a DESCRIBE query for that URI.
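The dereferencing step can be sketched as a function that maps a requested URI to a DESCRIBE query. The function name is hypothetical. The character check follows the IRI reference rule in the SPARQL grammar (no control characters, spaces, or any of the characters < > " { } | ^ ` and backslash) and is included so that a pasted URI cannot break out of the angle brackets.

```python
import re

# Characters that may not appear inside <...> in a SPARQL query.
_FORBIDDEN_IN_IRI = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def describe_query_for(uri: str) -> str:
    """Return the DESCRIBE query for a dereferenced topic or resource URI."""
    if not uri or _FORBIDDEN_IN_IRI.search(uri):
        raise ValueError(f"not a usable IRI: {uri!r}")
    return f"DESCRIBE <{uri}>"


if __name__ == "__main__":
    print(describe_query_for("http://example.org/topic/asthma"))
```

The graph returned by the endpoint can then be rendered as HTML for browsers or passed on as RDF for other clients.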

Living with informal and formal ontologies
Current ontologies are modeled informally with W3C Simple Knowledge Organization System (SKOS)
• No distinction between part-of, contains, is-a
• No reasoning support
• Possible with small datasets
Sublima will also support models using formal ontologies
• Formal IS-A
• DL reasoning
• Required for large datasets
[Figure labels: Expressivity; Reasoning; Large data sets; Smaller data sets]

Future work
• Integration with other SPARQL-based portals.
• Interoperability with ISO Topic Maps models
• Graphical visualization with touch screen, clever UIs
• High-quality multimedia resources
The code base is now in use in more projects

Conclusion
We clearly found that the technology currently available is starting to reach a certain state of maturity when it comes to functionality.
BUT STILL RISKS!
Careful evaluation of tools and scalability is needed as content increases.
Query interoperability
Do not eat the whole menu at once!
[Figure labels: Recording companies; Broadcasters; High quality metadata; Open metadata, e.g. Wikipedia]

Thank you for your attention
david.norheim@computas.com
We welcome sharing our experiences with yours!
Welcome to upcoming conferences in Norway next year
• Mid February in Oslo - hands-on tutorials
• May in Stavanger - Semantic Days focusing on the oil and gas industry
• September 2008 - initiating Scandinavian Semantic Web Conference