Implementation of Topic Centered Portals


15 Nov 2013


Reproduction prohibited without authorization by Computas AS ©


David Norheim, Computas AS, Norway

Robert Engels, ESIS AS, Norway



Motivation


The system


Challenges and lessons learned


Future work


Computas


23 years of experience in knowledge management, expert systems, and process modeling

Special focus on government and the oil and gas sector


The major semantic web company in Norway


Computas’ semantic Web activities


Sectors:

Oil and gas industry

Government

Type of applications:

Knowledge management

Semantic search support

Research and commercial projects


Background


A clear shift towards open source and open standards

Linux for Schools, Open Document formats in the public sector

National semantic registry, large governmental information portals based on semantic standards

The government, through the Norwegian Archive, Library and Museum Authority (ABM-utvikling): development of open-standards-based, open-source software for the creation and maintenance of topic-driven portals.

“there is a need for a targeted effort to create a framework based on Semantic Web to enable professional users to organize information and to make libraries build and maintain metadata-driven search solutions.”

A digital culture and knowledge policy?

EFN.no


A topic driven portal


For a library it is as natural to evaluate, describe and enable retrieval of any resource on the web as of printed material

A quality-evaluated collection of information resources, organized according to some topic structure and published online.

Retrieval through search and navigation in topics

Source: Ellen Aarbakken, Oslo Public Library (Deichmanske Bibliotek)


Why a topic centric portal tool and not search?


Yahoo! provided the first subject-driven portal, but focused on the most popular aspects -> replaced by search (e.g. Google)

However, the words in the long tail are context-dependent, and generic web search will frequently pollute results due to ambiguity

Examples of long-tail portals:

Medical information for laymen

Primary school educational resources

Public information for immigrants

Legal information for laymen

Norwegian architecture portal



Why not Web 2.0?


Folksonomies:

Collaborative “categorization”

Freely chosen keywords

Manual “tagging”, practically no existing metadata

Mostly acting as a popularity measure

Topic tools:

Conceptual level with navigation

Quality evaluated with metadata

Manual “tagging”, but support for more automation




SUBject-oriented tool for LIbraries, Museums and Archives


Several roads to the same destination


Key requirements in developing the tool


Handle metadata from various sources and vocabularies (e.g. Dublin Core)

Interoperability among portals based on the same tool and the same protocols (SPARQL, SRU)

Open source and open (semantic web) standards

Combining free-text search and navigation through models

Handling both informal and formal models (e.g. SKOS and OWL DL), the latter in a future version



Two initial portals

Scandinavian Medical Information for Laymen (SMIL) is an international Scandinavian cooperation offering quality-controlled metadata with references to pages related to health, illnesses and treatments. Contributing partners to the portal are librarians and nurses from the Nordic countries. The current SMIL base consists of 8,500 records, yielding around 250,000 triples.

Detektor targets public schools. Resources are annotated by public libraries. The portal consists of about 1,850 topics and 4,600 resources, which results in about 100,000 triples.


Portal technical characteristics (grounding technologies)

Technology | Name, release | Comment
--- | --- | ---
Operating system | Linux Ubuntu | Also tested on Red Hat, Windows and OS X
Database | PostgreSQL (under Jena), indexing with Lucene | Should work with any storage supporting SPARQL and SPARUL
Document repository | The Web, any URLs |
Web server | Apache Tomcat v5.5 and 6.0 | Also tested on Resin
Applied ontology | Domain ontology and portal ontology (object types) |
Ontology language | SKOS, RDF/S | Currently implementing OWL support
Export/Import | RDF/XML, Turtle |
Reuse and interoperability | Voc.: DC, FOAF, SIOC, POWDER, Lingvoj. Query lang.: SPARQL, CQL | Also using SPARUL
Inference engine | None | Will implement an inference engine with OWL DL support in Q4 2008
Ontology editor | Internal web-based, Protégé (external) | Export the ontology and continue to work in any RDF/OWL-compliant ontology editor
User interface | HTML, Apache Cocoon |
License | Open source, CDDL license |

Evaluation criteria inspired by the Esperonto project


Architecture

[Architecture diagram. Component labels: Web client; Search and navigation; SPARQL dispatcher; SPARQL queries; Local endpoint; Indexing; Topic ontology; Metadata store; Ontology maintenance; External clients; SRU client; SPARQL client; Open search; External servers; SRU server; SPARQL endpoint; Crawler; Portal configuration; SPARQL update; Web resources]

The client consists of a search interface allowing users to search using free text and metadata. The search string is translated into a structured SPARQL query.

Interoperability at the query layer

The system accepts queries from both SPARQL and SRU/CQL clients.

The backend consists of an RDF store with SPARQL interfaces. Free-text indexing uses Lucene/LARQ.

The system can query external SPARQL and SRU/CQL services.
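
As a rough illustration of how a search string might be turned into a structured SPARQL query, here is a minimal Python sketch. It is not the portal's actual code: the function names, the optional language constraint and the use of a plain SPARQL regex FILTER (instead of the Lucene/LARQ index the slides mention) are all assumptions made for this example.

```python
import re
from typing import Optional

DC = "http://purl.org/dc/elements/1.1/"


def sparql_literal(text: str) -> str:
    """Quote text as a SPARQL string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_query(search_string: str, language: Optional[str] = None) -> str:
    """Build a DESCRIBE query that requires every search term in dc:title.

    Hypothetical helper: a stand-in for the free-text lookup, using a
    case-insensitive regex FILTER per term rather than a text index.
    """
    terms = search_string.split()
    if not terms:
        raise ValueError("search string must contain at least one term")
    lines = [
        f"PREFIX dc: <{DC}>",
        "DESCRIBE ?resource WHERE {",
        "  ?resource dc:title ?title .",
    ]
    if language is not None:
        # Metadata search: an exact-match constraint on one property.
        lines.append(f"  ?resource dc:language {sparql_literal(language)} .")
    for term in terms:
        pattern = sparql_literal(re.escape(term))
        lines.append(f'  FILTER regex(?title, {pattern}, "i")')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_query("breastfeeding twins", language="en"))
```

Each whitespace-separated term becomes its own FILTER, so all terms must match. A real deployment would more likely delegate the text match to the index and keep SPARQL for the structured part.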


Sublima


Ontologies generally provide the structure for navigating the results, and support browsing and classification.

Ontologies allow for term disambiguation, query rewriting and semantic distance measures

In Sublima we use informal SKOS for:

Navigating through subjects, showing the subject relations (“fish eye”)

Search expansion: synonyms, common misspellings

Faceted filtering: topics as well as other metadata

A future version will also support OWL DL
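
The search expansion mentioned above can be sketched in a few lines of Python. The thesaurus below is invented for illustration, and mapping synonyms to skos:altLabel and common misspellings to skos:hiddenLabel is an assumption about how such labels could be modeled, not a description of the Sublima data.

```python
# Hypothetical mini-thesaurus keyed by concept identifier. The keys mirror
# skos:prefLabel, skos:altLabel (synonyms) and skos:hiddenLabel (misspellings).
THESAURUS = {
    "ex:diabetes": {
        "prefLabel": "diabetes",
        "altLabel": ["diabetes mellitus"],
        "hiddenLabel": ["diabetis"],
    },
    "ex:influenza": {
        "prefLabel": "influenza",
        "altLabel": ["flu"],
        "hiddenLabel": [],
    },
}


def labels_of(concept: str) -> list:
    """Return all labels of a concept, preferred label first."""
    entry = THESAURUS[concept]
    return [entry["prefLabel"], *entry["altLabel"], *entry["hiddenLabel"]]


def expand(term: str) -> list:
    """Expand a search term to every label of the concept it matches.

    Terms that match no concept are returned unchanged (lower-cased).
    """
    needle = term.strip().lower()
    for concept in THESAURUS:
        labels = labels_of(concept)
        if needle in labels:
            return labels
    return [needle]


if __name__ == "__main__":
    print(expand("Diabetis"))
    print(expand("flu"))
```

A misspelled query such as "Diabetis" is thereby widened to the preferred label and its synonyms before the actual search runs.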


Good and bad choices, lessons learned the hard way


Keeping the semantics

Living with free-text indexing and structured queries

Tool maturity

Scalability

Keep in mind this is NOT a research project, but one with real and demanding customers expecting everything to work


Preserving the semantics

We needed flexibility for users to add any metadata without touching code

SPARQL SELECT loses the meaning, returning only a binding, hence clients become static. We therefore used SPARQL DESCRIBE extensively
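
To make the SELECT versus DESCRIBE point concrete, the sketch below renders whatever triples a description contains, without naming any property in the code. The triples, the `ex` vocabulary and the helper names are hypothetical. The idea is only that a client driven by full descriptions picks up a newly added property automatically, whereas a client written against a fixed SELECT projection would need a code change.

```python
from collections import defaultdict

# Hypothetical triples, such as a DESCRIBE query might return for one resource.
DESCRIPTION = [
    ("http://example.org/r/1", "http://purl.org/dc/elements/1.1/title", "Breastfeeding twins"),
    ("http://example.org/r/1", "http://purl.org/dc/elements/1.1/language", "en"),
    # A property added later by a user; the rendering code does not know it.
    ("http://example.org/r/1", "http://example.org/vocab#audience", "parents"),
]


def local_name(uri: str) -> str:
    """Return the part of a URI after the last '#' or '/'."""
    return uri.rsplit("#", 1)[-1].rsplit("/", 1)[-1]


def render(triples, subject: str) -> list:
    """Render every property of `subject` as 'name: value, value' lines."""
    grouped = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            grouped[p].append(o)
    return [f"{local_name(p)}: {', '.join(values)}" for p, values in sorted(grouped.items())]


if __name__ == "__main__":
    for line in render(DESCRIPTION, "http://example.org/r/1"):
        print(line)
```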


Living with free-text indexing and structured queries

Indexing with respect to structure

Our “breastfeeding twin” problem

Not sufficient to index all literals, as users expect hits on the combination of dc:title and dc:description

And even worse: the combination of dc:title and dc:subject/skos:prefLabel

Scoring/ranking

Easy with SELECT, but not with DESCRIBE

How do you rank results from a structured query?

No universal way to handle structured and unstructured information
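
One possible reading of the indexing problem above is that a query such as "breastfeeding twins" should match a resource whose title contains one word and whose description, or subject label, contains the other. The Python sketch below illustrates that reading by building one combined bag of words per resource. The data, the identifiers and the function names are invented for the example, and this is not how the Lucene/LARQ index in the portal is actually configured.

```python
import re
from collections import defaultdict

DC = "http://purl.org/dc/elements/1.1/"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical metadata store content as (subject, predicate, object) triples.
TRIPLES = [
    ("ex:r1", DC + "title", "Breastfeeding"),
    ("ex:r1", DC + "description", "Advice for mothers of twins"),
    ("ex:r2", DC + "title", "Twins in school"),
    ("ex:r2", DC + "subject", "ex:topic/education"),
    ("ex:r3", DC + "title", "Feeding guide"),
    ("ex:r3", DC + "subject", "ex:topic/breastfeeding"),
    ("ex:topic/education", SKOS + "prefLabel", "Education"),
    ("ex:topic/breastfeeding", SKOS + "prefLabel", "Breastfeeding"),
]


def tokens(text: str) -> set:
    """Split text into a set of lower-cased words."""
    return set(re.findall(r"\w+", text.lower()))


def build_index(triples) -> dict:
    """Combine title, description and subject labels into one word set per resource."""
    labels = {s: o for s, p, o in triples if p == SKOS + "prefLabel"}
    index = defaultdict(set)
    for s, p, o in triples:
        if p in (DC + "title", DC + "description"):
            index[s] |= tokens(o)
        elif p == DC + "subject" and o in labels:
            # Follow dc:subject to the topic and index its preferred label.
            index[s] |= tokens(labels[o])
    return dict(index)


def search(index: dict, query: str) -> list:
    """Return resources whose combined word set contains every query word."""
    wanted = tokens(query)
    return sorted(r for r, words in index.items() if wanted <= words)


if __name__ == "__main__":
    idx = build_index(TRIPLES)
    print(search(idx, "breastfeeding twins"))
    print(search(idx, "feeding breastfeeding"))
```

The sketch only answers whether a resource matches; it does not address the ranking question raised on the slide.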


Consistent tool maturity and missing links

Some “small” issues:

Support for Turtle in Protégé -> needed to convert to RDF/XML

Resources identified with URLs in Protégé

Tools mostly geared towards one dialect of RDF/OWL

Non-deterministic RDF/XML serialization for XSLT processing

Lacking a binding from OWL classes to OO languages

The simple things sometimes turn out to be the hardest…


Scalability


Response time varies with store size and query complexity

Too much complexity in queries

Moving from 500k triples to tens of millions

Need to refactor into smaller, faster queries

Federation of queries
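
The slides do not say how the queries were refactored. As one hedged illustration of "smaller, faster queries", the Python sketch below splits a single large DESCRIBE over many URIs into several small ones, which a client could issue one at a time or in parallel. The function name and the batch size are assumptions made for the example.

```python
def batched_describe(uris, batch_size: int = 50) -> list:
    """Split one large DESCRIBE over many URIs into several smaller queries.

    Hypothetical helper: `uris` is assumed to hold already validated
    absolute URIs, for instance the hits of an earlier, cheap lookup.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    queries = []
    for start in range(0, len(uris), batch_size):
        batch = uris[start:start + batch_size]
        # DESCRIBE accepts several IRIs in one query.
        queries.append("DESCRIBE " + " ".join(f"<{u}>" for u in batch))
    return queries


if __name__ == "__main__":
    hits = [f"http://example.org/r/{i}" for i in range(5)]
    for query in batched_describe(hits, batch_size=2):
        print(query)
```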


Some good lessons


New standards (e.g. SPARQL), proposals for standardization (e.g. SPARUL), new tools (e.g. Jena), open source (e.g. Tomcat, Apache), lack of good documentation: all say high risk!

However, the support and maintenance from the W3C community and open source developers (e.g. the Jena team) has been impressive; the support through IRC channels, mailing lists etc. has been invaluable for the project.


Some good lessons


Good experiences with reusing metadata schemas

FOAF, Dublin Core, POWDER, SKOS, SIOC, Lingvoj

Extensive dereferencing of URIs: any topic or resource URI pasted in the browser results in a DESCRIBE query for that URI.
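
The dereferencing behaviour described above amounts to mapping a requested URI to a DESCRIBE query. A minimal Python sketch of that mapping follows; the function name and the validation rules are assumptions for illustration, not the portal's actual implementation.

```python
from urllib.parse import urlsplit

# Characters that are not allowed inside a SPARQL IRI reference (<...>).
FORBIDDEN = set('<>"{}|\\^` ')


def describe_query(uri: str) -> str:
    """Return the DESCRIBE query for a dereferenced topic or resource URI."""
    parts = urlsplit(uri)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an absolute http(s) URI: {uri!r}")
    if any(ch in FORBIDDEN for ch in uri):
        raise ValueError(f"URI contains characters not allowed in an IRI: {uri!r}")
    return f"DESCRIBE <{uri}>"


if __name__ == "__main__":
    print(describe_query("http://example.org/topic/diabetes"))
```

Rejecting angle brackets and whitespace keeps a pasted URI from breaking out of the IRI reference in the generated query.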


Living with informal and formal ontologies


Current ontologies are modeled informally with the W3C Simple Knowledge Organization System (SKOS)

No distinction between part-of, contains, is-a

No reasoning support

Possible with small datasets

Sublima will also support models using formal ontologies

Formal IS-A

DL reasoning

Required for large datasets

[Diagram labels: Expressivity; Reasoning; Large data sets; Smaller data sets]


Future work


Integration with other SPARQL-based portals.

Interoperability with ISO Topic Maps models

Graphical visualization with touch screen, clever UIs

High-quality multimedia resources

The code base is now in use in more projects


Conclusion


We clearly found that the technology currently available is starting to reach a certain state of maturity when it comes to functionality.

BUT STILL RISKS!


Careful evaluation of tools and scalability is needed as content
increases.

Query interoperability

Do not eat the whole menu at once!

[Diagram labels: Recording companies; Broadcasters; High quality metadata; Open metadata, e.g. Wikipedia]


Thank you for your attention

david.norheim@computas.com

We welcome sharing our experiences with yours!

Welcome to upcoming conferences in Norway next year


Mid February in Oslo - hands-on tutorials

May in Stavanger - Semantic Days focusing on the oil and gas industry

September 2008 - initiating Scandinavian Semantic Web Conference