Standards and Tools: DOBES and CLARIN Views

Nov 5, 2013

- résumé after about 8 years -

Peter Wittenburg, André Moreira
The Language Archive - Max Planck Institute
CLARIN European Research Infrastructure

Content

1. CLARIN vs. DOBES - differences?
2. Tools vs. Standards - differences?
3. Overall Comparison
4. TLA Team - Landscape and Strategy
5. Technology - Mainstream influences
6. Conclusions

DOBES vs. CLARIN

DOBES is about the documentation of endangered languages
(as many other comparable initiatives)

documentation teams are under time pressure
  thus efficiency is required (transcription: 1-35, translation: 1-25)
  can be facilitated by good tools

documentation certainly is for this generation of researchers, speech communities, students, public, etc.
(primary focus of DOBES and teams)

documentation is also for future generations
  documents part of our cultural heritage
  languages encode knowledge about natures and cultures
  historical material helps finding our identity

therefore DOBES has a short-term and a long-term challenge
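
To make the efficiency figures above more tangible, here is a minimal sketch. It assumes that "1-35" and "1-25" are ratios of recording time to working time (one recorded hour needing roughly 35 hours of transcription and 25 hours of translation); the slide does not spell this out. The function name and the 10-hour corpus are purely illustrative.

```python
# Assumption (not stated on the slide): "1-35" and "1-25" are read as
# recording-time : working-time ratios.
TRANSCRIPTION_RATIO = 35  # working hours per recorded hour (assumed reading)
TRANSLATION_RATIO = 25    # working hours per recorded hour (assumed reading)


def annotation_effort_hours(recorded_hours: float) -> dict:
    """Estimate working hours needed to transcribe and translate recordings."""
    transcription = recorded_hours * TRANSCRIPTION_RATIO
    translation = recorded_hours * TRANSLATION_RATIO
    return {
        "transcription": transcription,
        "translation": translation,
        "total": transcription + translation,
    }


if __name__ == "__main__":
    # Hypothetical corpus of 10 recorded hours.
    print(annotation_effort_hours(10))
```

Under this reading even a small corpus ties up a team for months, which is why tool efficiency matters so much to documentation teams.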

DOBES vs. CLARIN

CLARIN is about an interoperable + persistent infrastructure for LRT

landscape is fragmented and nothing fits together
  thus researchers working on data can't be efficient
  (knowledge workers spend 40% of time on finding resources, making things compatible etc.)
  can be facilitated by good standards and agreements

infrastructure certainly is for this/next generation of
  researchers, students, "citizen scientists", etc.
  enable "better" research if it is "data-driven"

infrastructure is also for future generations
  ensuring access to our research records
  lots of data is highly endangered !!!
  comparing "old" data with "new" data

therefore CLARIN has a short-term and a long-term challenge

DOBES vs. CLARIN: interoperability

DOBES
community of documenting field linguists
  is interoperability an issue? well, I still don't know
  interoperable with whom?
  cross-corpus work based on data is still to come
  of course some practical barriers (language)

CLARIN
infrastructure covering "all" language resources & tools
(named entity recognition relevant for everyone)
  is interoperability an issue: YES - it's in the focus
  otherwise always barriers to tackle relevant questions
  otherwise data-driven research too expensive

it seems that there is a clear difference in primary objectives

DOBES and CLARIN

                      DOBES                                                 CLARIN
researcher focus      "comprehensive" documentation                         give seamless access to all relevant data
main characteristic   efficiency in annotating, lexicon creation etc.       efficiency in finding things and combining them
addressees            communities, researchers, students, pupils, public    researchers, students, "citizen scientists"
short-term task       give access now                                       improve access now
long-term task        preserve cultural heritage (second priority)          ensure access in future (part of the concept)
interoperability      not first priority                                    first priority
Ulrike - Nicoletta    "standard" not a topic                                "standard" a major topic

thus very much in common - but also some differences

Tools vs. Standards

who dares to doubt that

tools determine our "productivity"
  tools influence attractiveness of solutions
  people are used to tools - who wants to learn new stuff?

tools need to be egocentrically built
  development is expensive (UI)
  fast development cycles are necessary
  SW management is very expensive and eats up person power
  ~ 80 % of all software developments fail
  a lot of SW developed will die quickly since there is not enough money to maintain it
  tools have a short lifecycle of about 10 years on average

[Figure: functionality over time]

Tools vs. Standards

who dares to doubt that

standards live almost forever

de facto lifetime comparatively high
  standards are in general not attractive for users, except for some XML "fans"
  standards should be hidden and only experts need to read all documents

standards building has some form of altruism (if big industry is not involved)
  costs a lot of time and effort (ISO TC37/SC4 started 2002 at LREC)
  risk of being quickly outdated
  will a standard be accepted?
  implementing standards in tools can be expensive (moving target, complexity of standard, etc.)

Tools and Standards

                      Tools                   Standards
lifetime              comparatively short     comparatively long
attractiveness        high                    low
creation costs        high                    high
maintenance costs     high                    low
short-term success    high                    low (requires time)
long-term "factor"    low                     potentially high

thus tools are important for short-term success
standards are important for long-term success

all together

for CLARIN no separation - symbiosis between short-term tool support and long-term interoperability facilitation

for DOBES there seems to be a difference

          Tools                                                                               Standards
CLARIN    relevant for short and long term development (stability, generic, standards-based)  relevant for interoperability on short and long term
DOBES     clear interest in short term efficiency                                             relevant only for those who focus on long-term aspects

Landscape for TLA Team

being archivist and providing access to stored material in DOBES (+ MPI)
being in the core of CLARIN/EUDAT infrastructure development

a few major questions:

how can we preserve bit streams and interpretability over a long period?
how can we give access to heterogeneous resources and also support resource creation and manipulation/enrichment?
  have about 71 lexica (and many different annotation types): 61 in the archive, 10 active in LEXUS
  created by different tools
  using different structures
  using different categories (lexical attributes)

how can we build "generic" tools and frameworks that can cope with heterogeneity - cannot build/maintain SW too specifically targeted?
how can we build SW in a scenario where there are so many smart developers out there?
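
The heterogeneity problem described above (lexica using different categories for the same thing) can be sketched as follows. This is only an illustration of the idea of a shared category mapping: the attribute names, the shared category names and the sample entries are all invented here and do not come from the TLA archive or from any registry.

```python
from typing import Dict, List, Tuple

# Hypothetical registry: tool-specific attribute name -> shared category name.
SHARED_CATEGORIES: Dict[str, str] = {
    "lx": "lemma",
    "headword": "lemma",
    "ps": "partOfSpeech",
    "pos": "partOfSpeech",
    "ge": "gloss",
    "meaning": "gloss",
}


def harmonize(entry: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Rename known attributes to shared categories; report the unknown ones."""
    harmonized: Dict[str, str] = {}
    unmapped: List[str] = []
    for name, value in entry.items():
        shared = SHARED_CATEGORIES.get(name)
        if shared is None:
            # Keep the original name and flag it for manual review.
            unmapped.append(name)
            harmonized[name] = value
        else:
            harmonized[shared] = value
    return harmonized, unmapped


if __name__ == "__main__":
    # Two invented entries, as they might come from two different tools.
    lexicon_a = {"lx": "wasi", "ps": "n", "ge": "house"}
    lexicon_b = {"headword": "wasi", "pos": "noun", "meaning": "house", "tone": "H"}
    for entry in (lexicon_a, lexicon_b):
        print(harmonize(entry))
```

A generic tool built this way never hard-codes one lexicon's attribute names; attributes without a mapping are passed through and reported rather than dropped.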

Strategy for TLA Team

Rule 1: have a coherent archive of 34/75 TB
  i.e. convert "everything" to stable formats with explicit syntax/encoding and check quality
  otherwise long term curation and access too expensive
  costs for late curation and manual migration are extreme

Rule 2: base tool development on open and "generic" formats
  EAF for annotations turned out to be flexible enough over 10 years
  LMF is a flexible model for lexicon structures
    "LEGO" approach makes some people frightened
    but flexibility not even sufficient for field linguists
    yet no agreement on an exchange format - a disaster
  ISOcat for registering semantics (is it generic enough?)

Rule 3: provide converters and interfaces for major tools/formats
  Toolbox, CLAN, Transcriber, PRAAT, other XML
  time consuming effort (cyclic flow almost impossible)
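
As an illustration of what a Rule 3 converter has to do, the following sketch reads Toolbox-style text (backslash markers, records starting with \lx) into a generic structure that could then be mapped to an open format. It is not the converter used by TLA; the function name and the sample records are assumptions made for this example, and real Toolbox files have more conventions than are handled here.

```python
from typing import Dict, List, Optional


def parse_toolbox(text: str, record_marker: str = "lx") -> List[Dict[str, List[str]]]:
    """Split Toolbox-style text into records; each record maps marker -> values."""
    records: List[Dict[str, List[str]]] = []
    current: Dict[str, List[str]] = {}
    last_marker: Optional[str] = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            if marker.startswith("_"):
                # Skip file header markers such as \_sh.
                last_marker = None
                continue
            if marker == record_marker and current:
                # A new record starts: store the one collected so far.
                records.append(current)
                current = {}
            current.setdefault(marker, []).append(value.strip())
            last_marker = marker
        elif last_marker is not None:
            # Continuation line: append to the most recent field value.
            current[last_marker][-1] += " " + line
    if current:
        records.append(current)
    return records


# Invented sample data for illustration only.
SAMPLE = r"""
\lx wasi
\ps n
\ge house

\lx yaku
\ps n
\ge water
"""

if __name__ == "__main__":
    for record in parse_toolbox(SAMPLE):
        print(record)
```

Going the other way (writing the generic structure back into the tool's own format without loss) is the harder part, which is what the remark about an almost impossible cyclic flow refers to.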

Is our Strategy Successful?

very difficult to answer - what are the criteria?

strategy allows us to be coherent with both DOBES and CLARIN

strategy was broad enough to help establish TLA

although
  LMF turned out to be very expensive for us
    much time investment to participate in x meetings
    little understanding from NLP hardcore guys
    can't even claim to be 100% compliant, or?
    some years of instability of the model thus changes of code
    thus slowing down development
    invent own interchange format for archiving purposes (RELISH ??)
  modern lexica are complex objects with inclusions of objects (images, a/v fragments, internal and archived resources, etc.)

finally an approach based on flexible standards will pay off, but it takes more time

Technology (IT) Issues

technology innovation is moving ahead with the web as driving force
  designs and tools need to be web-ready
  visibility from everywhere
  access from everywhere
  collaboration support
  annotation (incl. relation drawing) support (there are so many knowledgeable people around)

web-technology subject of high innovation rate
  frequent re-design of components
  what is the stable core to keep costs low and make code maintenance feasible?



Conclusions

research communities naturally more interested in tools
research infrastructure work needs to find a balance between short- and long-term aspects

however, need to store data following general IT principles: explicit syntax, declared semantics, open formats

need to build better tools to support standards and/or to convince companies to adopt standards
but tool building based on standards can be more expensive and time consuming

RELISH is very good to compare TEI, LMF and LIFT
RELISH is very good to compare ISOcat and GOLD
we need a strategy for TLA to support one (or two) exchange formats and one needs to be based on a standard (data will go into the archive)
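
The three storage principles named in the conclusions (explicit syntax, declared semantics, open formats) can be illustrated with a small sketch. The element names, the `category` attribute and the registry address below are hypothetical; they are not LMF, ISOcat or any TLA format, only a stand-in that shows the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical base URI for a category registry; not a real service address.
REGISTRY = "https://example.org/categories/"


def entry_to_xml(entry: dict) -> bytes:
    """Serialise one lexicon entry as UTF-8 XML with category references."""
    root = ET.Element("entry")
    for category, value in entry.items():
        field = ET.SubElement(root, "field")
        # Declared semantics: each field points to a registered category.
        field.set("category", REGISTRY + category)
        field.text = value
    # Explicit syntax and encoding: XML with a declaration naming the encoding.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    # Invented entry for illustration only.
    print(entry_to_xml({"lemma": "wasi", "partOfSpeech": "n"}).decode("utf-8"))
```

XML is used here simply because it is an open, tool-independent format; the point is that a future reader needs neither the original tool nor guesswork to interpret the stored entry.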