Proposal for a CLARIN Service CMDI Components


Marc Kemps-Snijders (MPI Nijmegen), Nuria Bel (UPF Barcelona), Daan Broeder (MPI Nijmegen)

September 2009



Content

1. Introduction
1.1 Embedding in WP2 Activities
1.2 Service discovery
2. CLARIN Metadata
2.1 CMDI introduction
2.2 Profile matching
3. Technical Metadata for web services
3.1 Introduction
3.2 Service Component
3.3 Operation Component
3.4 Parameter Component
3.5 TechnicalMetadata Component
3.6 ContentEncoding component
3.7 AnnotationLevel
4. Comparison to D-Spin approach
5. Conclusions and state of discussion
6. References
Acronyms
Related Documents



Formal Reference:

The proposed work is a continuation of various other tasks described in the Technical Annex, in particular the continuation of the web services and workflow oriented tasks.

1. Introduction

This note describes a proposed CLARIN Metadata Infrastructure (CMDI) metadata specification for web services. The work is outlined in the Technical Annex as Task 4 (Guide the specification of a generic component model for various tools and services) and is a continuation of Task 6 (Analyze web services requirements). UPF and MPI are responsible for these tasks. The work builds on all requirements specifications and prototypical implementations developed so far in WP2, in particular those in the areas of CMDI infrastructure and of web services and workflow.

In addition, the CMDI specification for web services should be designed and developed with the following criteria in mind:

- It should provide a mechanism through which resources and web services can be matched.
- It should be extensible to new web service types and be suitable for integration by other projects.

1.1 Embedding in WP2 Activities

With respect to the requirements already developed in WP2 we can state the following:

- It should be compliant with the work on the component based metadata infrastructure and build on it.
- It should extend the work on web services that has been started in various national programs.

This proposal is a result of the first discussions between the UPF and MPI teams, which took place in Barcelona in September 2009.

1.2 Service discovery

The overall goal of CLARIN is to make Language Resources and Technology available to its end users, who are commonly understood to be humanities researchers. For registration of web services in the final CLARIN registry it has become clear that neither UDDI nor ebXML is adequate to describe the web service characteristics for this domain. All web services and tools will be described using the CLARIN MetaData Initiative, which provides a flexible metadata framework with references to well-defined data categories in the DCR.

A list of data categories has been identified which serves as a basis for automated service discovery and thus provides the basic building elements for the CMDI component proposal.

2. CLARIN Metadata

2.1 CMDI introduction

The CLARIN Metadata Infrastructure (CMDI) has been extensively described in the Metadata Infrastructure for Language Resources and Technology (CLARIN-2008-5) (footnote 1). The core model is shown below. It is inspired by the ISO TC37/SC4 approach taken for LMF and consists of the core model and an extension mechanism using components that is used to describe various aspects of a resource. Components may be registered in a separate component registry for reuse. A component can consist of components taken from a component registry and data categories taken from the ISO 12620 Data Category Registry (ISOcat). The elements that have been specified so far can be used to describe media resources, text resources, annotations, lexica, lists, tools and services. A number of main structural parts were worked out that should be provided to describe various aspects of the resources and tools/services: metadata self, creation, access, content, participants and resources. These descriptions cover administrational, functional and inspectional elements; the details are available through the CLARIN site.

Footnote 1: http://www.clarin.eu/system/files/private/metadata-doc-v3.pdf

Figure 1: UML diagram of a CLARIN Metadata Description (CMD).


2.2 Profile matching

In this context only those elements are of direct relevance that describe functional aspects that are needed, for example, to do profile matching, i.e. to check whether a resource can be processed by a tool or web service as indicated in the following figure. Profile matching is to take place on the input and output parameters of the services.


This figure indicates the principle of profile matching. A resource can be consumed by a succeeding processing step if the functional characteristics of the resource description match those that are specified for the input of the tool or web service. The tool or web service will create additional metadata so that for the next processing step the same argument holds.

Figure 2: Principle of profile matching

In the following table the elements that have currently been identified as being most relevant in this process are listed.

| Category | Definition | Comment |
|---|---|---|
| ResourceFormat | Specifies the annotation format that is used, since often the mime type will not be sufficient for machine processing. | This is at the heart of the I/E problem; if mime type etc. don't work then this field could be used. |
| AnnotationType | An indication whether the annotation was created inline or in a stand-off fashion. | Could be relevant for some tools, although in future all tools should operate on both types. NOTE: The AnnotationType is mandatory for the TechnicalMetadata component and determines the manner in which resource metadata is generated. |
| ApplicationDomain | An indication of the application domain for which the resource or tool/service is intended. | Could be relevant as a rough indicator. |
| CharacterEncoding | Name of the character encoding used in the resource or accepted by the tool/service. | Is obvious. |
| CreationTool | An indication of the tool with help of which the resource or the annotations in the resource were created. | This is even more detailed than the "AnnotationFormat" descriptor. |
| DeploymentTool | An indication of a specific tool that may be used for the deployment of the resource. | Not really relevant. |
| LanguageID | Identifier of the language as defined by ISO 639 that is included in the resource or supported by the tool/service. | Check whether the language is supported. |
| LanguageName | A human understandable name of the language that is used in the resource or supported by the tool/service. | Check whether the language is supported, although the ID should be used. |
| LanguageScript | Indication of the writing system used to represent the language in the form of a four letter code as defined in ISO 15924. | Of no relevance I guess, since this has to do with visualization. |
| MediaType | Specification of the media type of the resource or the media types the tool/service is suitable for. | It is a rough indicator of which media the tool is operating on. |
| MimeType | Specification of the mime-type of the resource, which is a formalized specifier for the format included, or a mime-type that the tool/service is accepting. | This is at the heart of the I/E problem; in many cases the mime type should be sufficient for format matching. |
| Modalities | A listing of all modalities that are contained in the recording such that they can be subject of analysis, or that are supported by the tool/service. | Check whether the tool supports certain modalities. |
| Resolution | Specification of the spatial resolution of images or movies. | Some programs might have restrictions. |
| Samplerate | Specification of the sample rate that is used for the recording. | Some programs might have restrictions. |
| Tagset | Specifies the tag set used in the annotation of the resource or used by the tool/service, or contains a URL that points to the information about the tag set. | Could be relevant to check the content of the annotation. |
| TagsetLanguage | Indicates the language of the tag set itself, expressed in the two-letter language codes of ISO 639. | Could be required by some tools. |
| AnnotationLevelType | Specifies the types of annotation tiers provided by the resource. | Could be interesting to quickly check whether the resource includes appropriate information. |

Table 1: List of relevant data categories for profile matching

The Technical Metadata Component describes the technical characteristics of a resource in terms of the data categories from Table 1. Not all data categories are taken into consideration at this stage; it is expected that this component will be refined as it is used to describe a growing number of resources.

3. Technical Metadata for web services

3.1 Introduction

Metadata for web services comprises both descriptive and technical metadata. Descriptive metadata is intended for human end users, while technical metadata is used by automated procedures such as profile matching. This proposal currently only covers the technical metadata needed for profile matching. Descriptive metadata and technical metadata for other purposes will need to be added through the CMDI extension mechanism.


3.2 Service Component

The Service Component describes a service and its capabilities. The Service component for this proposal currently only covers the technical metadata required to support the process of profile matching and will need to be extended with descriptive components, like documentation, organizational information, contact etc., intended for human end users.

Figure 3: CMDI component specification for web services

The service component contains information on the service type (REST/SOAP), its location, the operations and the input and output parameters it supports. To a large extent the information for a Service Component as presented here may be automatically generated from the WSDL/WADL file specifications of the web service.

The components and the data categories applicable to the Service Component are listed below. No data categories are reported for the Input and Output Components. Figure 3 provides a full overview of the Service CMDI component.


Service Component data categories

| Category | Definition | ISOcat | Comment |
|---|---|---|---|
| type | Refers to the web service type | - | Conceptual domain must be one of REST, SOAP or XML/RPC. MorphoSyntax profile contains type (http://www.isocat.org/datcat/DC-1971) |
| PersistentIdentifier | Specification of a persistent identifier that refers to the resource or tool/service this metadata information describes. | http://www.isocat.org/datcat/DC-2573 | Persistent identifier of the service. Debatable whether a PID is justified for a service. |
| ResourceName | A short name to identify the language resource. | http://www.isocat.org/datcat/DC-2544 | |
| URL | A URL referring to another resource that can be used in various contexts. | http://www.isocat.org/datcat/DC-2546 | Refers to the WSDL/WADL |

Table 2: Service component data categories
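The nesting of the Service Component (a service holding operations, each with input and output parameters) can be sketched as a small data model. This is only an illustration under stated assumptions: the class and attribute names are hypothetical Python renderings of the data categories in this proposal, the example URL is a placeholder, and the sketch is not the CMDI component definition itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Allowed values taken from the comment on the "type" data category.
SERVICE_TYPES = ("REST", "SOAP", "XML/RPC")


@dataclass
class Parameter:
    name: str
    type: str
    url: Optional[str] = None  # schema location, for complex XML types


@dataclass
class Operation:
    name: str
    persistent_identifier: Optional[str] = None
    inputs: List[Parameter] = field(default_factory=list)
    outputs: List[Parameter] = field(default_factory=list)


@dataclass
class Service:
    type: str           # one of SERVICE_TYPES
    resource_name: str  # short name of the service
    url: str            # location of the WSDL/WADL description
    persistent_identifier: Optional[str] = None
    operations: List[Operation] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the conceptual domain of the "type" data category.
        if self.type not in SERVICE_TYPES:
            raise ValueError(f"unknown service type: {self.type!r}")


# Hypothetical example record; the URL is a placeholder.
tokenizer = Service(
    type="REST",
    resource_name="Tokenizer",
    url="http://example.org/tokenizer.wadl",
    operations=[
        Operation(
            name="tokenize",
            inputs=[Parameter(name="text", type="string")],
            outputs=[Parameter(name="tokens", type="complex",
                               url="http://example.org/tokens.xsd")],
        )
    ],
)
```

In practice such a record would mostly be filled from the WSDL or WADL description of the service, as noted above, rather than written by hand.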


3.3 Operation Component

The Operation Component describes the metadata characteristics for an operation. Operation information, including input and output parameters, may initially be extracted from the WSDL or WADL file.

Operation Component data categories

| Category | Definition | ISOcat | Comment |
|---|---|---|---|
| PersistentIdentifier | Specification of a persistent identifier that refers to the resource or tool/service this metadata information describes. | http://www.isocat.org/datcat/DC-2573 | Persistent identifier of the service. Debatable whether a PID is justified for a service. |
| Name | | - | |

Table 3: Operation component data categories


3.4 Parameter Component

The Parameter Component specifies the metadata associated with the parameters. It describes the name of the parameter and its type. If the type is an XML complex type, the URL refers to the schema location.

Parameter Component data categories

| Category | Definition | ISOcat | Comment |
|---|---|---|---|
| Name | | | |
| Type | | | |
| URL | A URL referring to another resource that can be used in various contexts. | http://www.isocat.org/datcat/DC-2546 | Refers to the schema location if the parameter is a complex XML type |

Table 4: Parameter Component data categories


3.5 TechnicalMetadata Component

Technical metadata describes the technical characteristics of a resource or a service operation parameter. Its purpose is to capture, in a uniform manner, the technical information about a resource that is needed for profile matching.

The profile matching mechanism as described in figure 4 takes place by matching the TechnicalMetadata components on both resources and service parameters. Although the technical characteristics described in the TechnicalMetadata component and associated components may often be described in other sections of the metadata as well, the flexible and extensible nature of the CMDI model makes it impossible to predict where in the metadata relevant technical information may be expressed without a predefined context. For this reason any technically relevant information that is used in the profile matching scenario must be expressed as part of the TechnicalMetadata or one of its associated components.

If the TechnicalMetadata characteristics of a resource are a superset of those specified for a web service input parameter, then the resource may be considered appropriate to serve as an input parameter for the service operation. TechnicalMetadata for output parameters specifies the characteristics of the resources that are produced as a result of the service operation. Parameters that have no resource specific requirements, e.g. configuration parameters, will have no TechnicalMetadata component attached.
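The superset rule can be illustrated with a minimal sketch. It assumes, purely for illustration, that a TechnicalMetadata component is flattened into a dictionary keyed by data category name; the function name `matches` and the example values are hypothetical and not part of the proposal.

```python
from typing import Dict, Optional


def matches(resource_tm: Dict[str, str],
            required_tm: Optional[Dict[str, str]]) -> bool:
    """Return True if the resource's TechnicalMetadata is a superset of the
    characteristics specified for an input parameter."""
    if not required_tm:
        # Parameters without a TechnicalMetadata component (for example
        # configuration parameters) impose no resource specific requirements.
        return True
    # Every required data category must be present with the same value.
    return all(resource_tm.get(key) == value
               for key, value in required_tm.items())


# Illustrative values only.
resource = {
    "LanguageID": "deu",
    "CharacterEncoding": "UTF-8",
    "MimeType": "text/plain",
}
german_tokenizer_input = {"LanguageID": "deu", "MimeType": "text/plain"}
english_tagger_input = {"LanguageID": "eng", "MimeType": "text/plain"}

german_ok = matches(resource, german_tokenizer_input)   # superset: accepted
english_ok = matches(resource, english_tagger_input)    # LanguageID differs
config_ok = matches(resource, None)                     # no requirements
```

The sketch compares values for exact equality. A real matcher would probably need looser comparisons for some data categories (for example a service accepting several languages), which this proposal does not yet specify.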



Figure 4: TechnicalMetadata CMDI component



The TechnicalMetadata component for output parameters can be applied for the metadata specification of the resulting resources, i.e. the TechnicalMetadata for the resulting resource contains the TechnicalMetadata information as specified in the output parameter. The following situations are relevant here:

- The AnnotationType of the output parameter is stand-off. The resulting resource is a separate resource from the original one. The TechnicalMetadata of the output parameter is placed directly onto the resulting resource.
- The web service only adds information to the resource, i.e. inline processing. Here, in principle, the TechnicalMetadata of the output parameter is added to the metadata of the original resource.

In particular, the inline web service metadata generation depends upon the availability of the original metadata of the resource. This assumes that only one resource is involved in the original input. However, there may be cases where multiple input parameters declare the use of a resource. In this case there is no direct method of assessing which of the resources is the one to take the metadata from. This suggests that as an additional step it is necessary to describe the metadata relationships that exist between input and output parameters, i.e. which metadata record to use as a starting point for the resulting resource(s). For inline services this mapping is mandatory if more than one resource is involved at the input side.

The manner in which this mapping is to be expressed is to be further specified.
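The two situations for a single input resource can be sketched as follows. The dictionary representation, the function name `result_metadata` and the choice to let output values override original values on conflict are all assumptions made for illustration; the proposal itself does not fix these details.

```python
from typing import Dict


def result_metadata(original_tm: Dict[str, str],
                    output_tm: Dict[str, str]) -> Dict[str, str]:
    """Derive the TechnicalMetadata of the resulting resource from the
    original resource and the output parameter (single input resource)."""
    if output_tm.get("AnnotationType") == "stand-off":
        # Stand-off: the result is a separate resource, so the output
        # parameter's TechnicalMetadata is placed directly onto it.
        return dict(output_tm)
    # Inline: the output parameter's TechnicalMetadata is added to the
    # metadata of the original resource. Assumption: on conflict the
    # output parameter's value wins.
    merged = dict(original_tm)
    merged.update(output_tm)
    return merged


# Illustrative values only.
original = {"LanguageID": "deu", "MimeType": "text/plain"}
standoff_out = {"AnnotationType": "stand-off", "MimeType": "text/xml"}
inline_out = {"AnnotationType": "inline", "MimeType": "text/xml"}

standoff_result = result_metadata(original, standoff_out)
inline_result = result_metadata(original, inline_out)
```

With several input resources the function would additionally need the input/output mapping discussed above to select `original_tm`; that selection is deliberately left out of the sketch.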

TechnicalMetadata Component data categories

| Category | Definition | ISOcat | Comment |
|---|---|---|---|
| LanguageID | Identifier of the language as defined by ISO 639 that is included in the resource or supported by the tool/service. | http://www.isocat.org/datcat/DC-2482 | |
| CharacterEncoding | Name of the character encoding used in the resource or accepted by the tool/service. | http://www.isocat.org/datcat/DC-2564 | |
| AnnotationStandoff | Indicates whether the annotation was created inline or in a stand-off fashion. | http://www.isocat.org/datcat/DC-2507 | Corresponds to AnnotationType in Table 1 |
| MimeType | Specification of the mime-type of the resource, which is a formalized specifier for the format included, or a mime-type that the tool/service accepts. | http://www.isocat.org/datcat/DC-2571 | |



3.6 ContentEncoding component

The ContentEncoding component describes the content characteristics of a resource. It consists of a URL (referring to the schema location), the ResourceFormat and a list of AnnotationLevel elements which describe the specific annotation levels supported by the resource.


ContentEncoding Component data categories

| Category | Definition | ISOcat | Comment |
|---|---|---|---|
| URL | A URL referring to another resource that can be used in various contexts. | http://www.isocat.org/datcat/DC-2546 | Refers to the schema location |
| ResourceFormat | | | Currently not represented in DCR (check!!) |


3.7 AnnotationLevel

The AnnotationLevel component describes the characteristics of a specific annotation present in the resource. It consists of an AnnotationLevelType, a TagSet and a TagSetLanguage.

AnnotationLevel Component data categories

| Category | Definition | ISOcat | Comment |
|---|---|---|---|
| AnnotationLevelType | Specifies the types of annotation levels (tiers) provided by the resource. | http://www.isocat.org/datcat/DC-2462 | |
| TagSet | Specifies the tag set used in the annotation of the resource or used by the tool/service, or contains a URL that points to the information about the tag set. | http://www.isocat.org/datcat/DC-2497 | |
| TagSetLanguage | Indicates the language of the tag set itself, expressed in the two-letter language codes of ISO 639. | http://www.isocat.org/datcat/DC-2498 | |
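How ContentEncoding and AnnotationLevel could take part in profile matching is sketched below. The proposal does not define matching rules at this level, so the rule used here (every required annotation level must be present in the resource, with the same tag set when one is required) is an assumption, as are the class names, the function `provides` and the example values.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class AnnotationLevel:
    level_type: str                         # AnnotationLevelType
    tag_set: Optional[str] = None           # TagSet
    tag_set_language: Optional[str] = None  # TagSetLanguage


@dataclass
class ContentEncoding:
    url: Optional[str] = None               # schema location
    resource_format: Optional[str] = None   # ResourceFormat
    annotation_levels: List[AnnotationLevel] = field(default_factory=list)


def provides(resource: ContentEncoding, required: ContentEncoding) -> bool:
    """Assumed rule: the resource satisfies the requirement if the format
    agrees (when one is required) and every required level is present."""
    if (required.resource_format is not None
            and resource.resource_format != required.resource_format):
        return False

    def level_present(req: AnnotationLevel) -> bool:
        return any(
            have.level_type == req.level_type
            and (req.tag_set is None or have.tag_set == req.tag_set)
            for have in resource.annotation_levels
        )

    return all(level_present(req) for req in required.annotation_levels)


# Illustrative values only.
tagged_text = ContentEncoding(
    resource_format="TCF",
    annotation_levels=[
        AnnotationLevel("tokens"),
        AnnotationLevel("pos", tag_set="STTS", tag_set_language="de"),
    ],
)
needs_pos = ContentEncoding(annotation_levels=[AnnotationLevel("pos", "STTS")])
needs_lemmas = ContentEncoding(annotation_levels=[AnnotationLevel("lemmas")])

has_pos = provides(tagged_text, needs_pos)
has_lemmas = provides(tagged_text, needs_lemmas)
```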



4. Comparison to D-Spin approach

The German CLARIN D-Spin project is currently working out a solution where web services can be combined based on the characteristics of the wrappers described in a web service repository. The following proprietary characteristics are used to describe the input and output parameters (footnote 2):

| Name | Standard |
|---|---|
| Language | German, English, French, it |
| Text | Is present |
| Tokens | Is present |
| Lemmas | Is present |
| POS-Tags | STTS, stei, penntb |
| Parsing | TigerTb, penntb, Negra |
| Source | GermaNet |
| SemLexRels | Is present, GermaNet |
| format | TXT, RTF, DOC |
| Sentences | Is present |
| Frequency | Is present |
| Baseform | Is present |
| Morphology | STTS |

Table 5: D-Spin profile matching characteristics



The following diagram shows the mapping onto the proposed TechnicalMetadata CMDI component (footnote 3):

Figure 5: Mapping of D-Spin service characteristics onto TechnicalMetadata component

The following figures present some examples of D-Spin services mapped onto the Service CMDI component specification.

Footnote 2: Obtained from ASV DSpin Registry Management Tool v 0.01a

Footnote 3: Source has not been mapped here as it has not yet been represented in the DCR.



Tokenizer (IMS, TCF0.2, deutsch)

BBAW Tagger (TCF 0.2)

POS Tagger (IMS, TCF0.3, deutsch)

5. Conclusions and state of discussion

Use of the TechnicalMetadata component for resources and web services implies that current resources must be curated to be able to participate in the profile matching mechanism. In most cases the information may already be present in the current metadata of existing resources, such as IMDI. The metadata of these resources must be extended with the proper TechnicalMetadata specifications to allow participation in profile matching scenarios.

For inline web services that have more than one input parameter containing a TechnicalMetadata component, a mapping must be made between input and output parameters to determine which resource is to be used as a starting point for generating the metadata of the resulting resource. This mapping is to be further discussed and specified.

6. References

Acronyms

| Reference | Abbreviation of | Link |
|---|---|---|
| [ISOcat] | | http://www.isocat.org |
| [LAF] | Linguistic Annotation Framework | |
| [LMF] | Lexical Markup Framework | http://en.wikipedia.org/w/index.php?title=Lexical_Markup_Framework&oldid=255448197 |
| [PMH] | Protocol for Metadata Harvesting | http://www.openarchives.org/pmh/ |
| [SRU] | Search/Retrieve via URL | http://www.loc.gov/standards/sru/ |
| [SRW] | Search/Retrieve Web Service | http://en.wikipedia.org/wiki/Search/Retrieve_Web_Service |

Related Documents

| Reference | Title | Document | Date | Link |
|---|---|---|---|---|
| [CLARIN_NEWS_4] | CLARIN news letter Dec 2008 | | | http://www.clarin.eu/filemanager/active?fid=231 |
| [CLARIN_WS_NOTE] | CLARIN note on web services | | | http://www.clarin.eu/filemanager/active?fid=270 |
| [D-SPIN_PRES] | D-SPIN workshop report and Presentations | | | http://www.clarin.eu/wp2/wg-26/wg-26documents/web-service-presentations-at-the-wp2-workshop-in-oxford |
| [CLARIN_MD_SHRT] | CLARIN Component Metadata Shortguide | | | http://www.clarin.eu/files/metadata-CLARIN-ShortGuide.pdf |
| [CLARIN-2008-1] | Centers Types | CLARIN-2008-1 | February 2009 | http://www.clarin.eu/files/wg2-1-center-types-doc-v5.pdf |
| [CLARIN-2008-2] | Persistent and Unique Identifiers | CLARIN-2008-2 | February 2009 | http://www.clarin.eu/files/wg2-2-pid-doc-v4.pdf |
| [CLARIN-2008-3] | Centers | CLARIN-2008-3 | February 2009 | http://www.clarin.eu/files/wg2-1-centers-doc-v8.pdf |
| [CLARIN-2008-4] | Language Resource and Technology Federation | CLARIN-2008-4 | February 2009 | http://www.clarin.eu/files/wg2-2-federation-doc-v6.pdf |
| [CLARIN-2008-5] | Metadata Infrastructure for Language Resource and Technology | CLARIN-2008-5 | February 2009 | http://www.clarin.eu/files/wg2-4-metadata-doc-v5.pdf |
| [CLARIN-2008-6] | Report on web services | CLARIN-2008-6 | March 2009 | http://www.clarin.eu/files/state-report-WP2-6-v2.pdf |
| [CLARIN-2009-1] | Requirement Specification Web Services and Workflow systems | CLARIN-2009-1 | August 2009 | http://www.clarin.eu/files/wg2-6-requirements-doc-v2.pdf |