"Communication in Slovene"

Oct 24, 2013

"Communication in Slovene"

with an emphasis on the Slovene lexical database and corpora

Simon Krek
Amebis, d.o.o., Kamnik, Slovenia
Jožef Stefan Institute, Ljubljana, Slovenia

European Union & Slovene Ministry of Education and Sport

The operation is partly financed by the European Union, the European Social Fund, and the Ministry of Education and Sport of the Republic of Slovenia. The operation is being carried out within the operational programme Human Resources Development for the period 2007-2013, developmental priorities: improvement of the quality and efficiency of educational and training systems 2007-2013.


Communication in Slovene
http://www.slovenscina.eu
Leading partner: Amebis, d. o. o., Kamnik
Duration: June 2008 - December 2013
Total value: 3.2 million euro


Project consortium:
    Amebis, d. o. o., Kamnik
    Jožef Stefan Institute
    University of Ljubljana
    Scientific Research Centre of the Slovenian Academy of Sciences and Arts
    Trojina, Institute for Applied Slovene Studies

Goals
    Natural Language Processing Tools and Resources
    Didactics
    Language description (and standardization)
    Language Data
Today

Slovene Lexical Database

Timeline
Number of lexical units: minimum 2,500
June-October 2008: preparation
November 2008-June 2009: specifications
June 2010
June 2011
June 2012

Legal aspects
Creative Commons
    Attribution
    Share Alike
    Noncommercial
Availability
    On-line (http://www.termania.net/)
    Dataset (http://www.slovenscina.eu/)
Owner: Ministry of Education and Sports
Future: Slovene HLT Agency?

Past experience
International (early):
    GENELEX (1990-94)
    LE PAROLE (1993-98)
    SIMPLE (1998-2002)
    -----------------------------------------------
    ACQUILEX I, II (-1995)
    ILC - DELIS …
Individual languages: elexico (DE), CLIPS (IT), CORNETTO (NL), ALFALEX (FR), STO (DK), ADESSE (SP), GRIAL (SP), CEGLEX (PL), SALDO (S), BLF (FR), PRALED (CZ), ...
Important for us: FrameNet, Corpus Pattern Analysis, DANTE, COBUILD

Basics
corpus data analysis
lexicogrammatical approach
    semantics and syntax are not separated
    valency, colligation, collocation
meaning = meaning potential
    is not stable (norms & exploitations)
lumpers vs. splitters = splitters
lexicography first, NLP second

Lexicogrammatical continuum
    semantic indicator
    semantic frame
    syntactic structure & pattern
    syntactic combination
    collocation
    extended collocation
    example
    phraseology


I. LEXICAL UNIT
    headword: to squeeze
    part-of-speech: verb

II. SENSE
    indicator
        1. grip firmly
        2. press out liquid
    frame
        1. If a PERSON squeezes an OBJECT, s|he presses it firmly, usually with his|her hands.
        2. If a PERSON squeezes a LIQUID or a SOFT SUBSTANCE out of an OBJECT, s|he gets the liquid or substance out by pressing the object.
    multi-word unit (only nouns and adjectives)

III. SYNTAX
    structure
        1. vb-obj
        2. vb-out-obj
    pattern
        1. sb squeezes sth
        2. sb squeezes sth out
    combination (to squeeze your eyes shut)

IV. COLLOCATIONS
    collocation
        1. to squeeze (sb's) [hand, arm]
        2. to squeeze [the poison, the venom] out

V. EXAMPLES
    example
        1. I squeezed her hand gratefully.
        2. She immediately squeezed the poison out and that probably saved her life.

VI. PHRASEOLOGY
    phraseological unit: to squeeze a quart into a pint pot
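The six levels of the "to squeeze" entry can be pictured as a nested record. The sketch below is only an illustration: the class and field names are assumptions made here, and the database itself is described later in the talk as using a custom DTD rather than Python objects.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sense:
    """One sense of a lexical unit (levels II-V of the entry); names are hypothetical."""
    indicator: str                                          # II. short sense-menu label
    frame: str                                              # II. full-sentence semantic frame
    structures: List[str] = field(default_factory=list)     # III. formal structures
    patterns: List[str] = field(default_factory=list)       # III. verbalized patterns
    collocations: List[str] = field(default_factory=list)   # IV.
    examples: List[str] = field(default_factory=list)       # V.


@dataclass
class LexicalUnit:
    """Level I (headword, part of speech) plus senses and level VI phraseology."""
    headword: str
    part_of_speech: str
    senses: List[Sense] = field(default_factory=list)
    phraseology: List[str] = field(default_factory=list)


squeeze = LexicalUnit(
    headword="squeeze",
    part_of_speech="verb",
    senses=[
        Sense(
            indicator="grip firmly",
            frame="If a PERSON squeezes an OBJECT, s|he presses it firmly, "
                  "usually with his|her hands.",
            structures=["vb-obj"],
            patterns=["sb squeezes sth"],
            collocations=["to squeeze (sb's) [hand, arm]"],
            examples=["I squeezed her hand gratefully."],
        ),
        Sense(
            indicator="press out liquid",
            frame="If a PERSON squeezes a LIQUID or a SOFT SUBSTANCE out of an "
                  "OBJECT, s|he gets the liquid or substance out by pressing the object.",
            structures=["vb-out-obj"],
            patterns=["sb squeezes sth out"],
            collocations=["to squeeze [the poison, the venom] out"],
            examples=["She immediately squeezed the poison out and that "
                      "probably saved her life."],
        ),
    ],
    phraseology=["to squeeze a quart into a pint pot"],
)


def sense_menu(unit: LexicalUnit) -> List[str]:
    """Build the numbered sense menu from the semantic indicators."""
    return [f"{number}. {sense.indicator}" for number, sense in enumerate(unit.senses, 1)]
```

Calling `sense_menu(squeeze)` yields the two indicator lines that a general user would see first, while the frames, structures and patterns stay available for NLP use.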

I. Lexical Unit
link to the lexicon
    morphosyntactic information
    corpus frequency
    pronunciation etc.
additional grammatical information
    can be inferred (un/countability etc.)
    manual (part-of-speech subtypes etc.)

II. Semantic Level
Semantic Indicators
    simple EFL-like explanations or synonyms forming a sense menu
    self-explanatory in relation to each other
Semantic Frames
    COBUILD / FrameNet / Corpus Pattern Analysis
    combination of the systems

Semantic Indicators
pršet (to rain), sloveso (verb)
    1 padat déšť (of rain: to fall)
    1.1 o věcech (of things)
    2 objevovat se ve velkém množství (to appear in large quantities)

Semantic Frames
identification of verb/semantic arguments
    prototypical pattern: "the norm" (Hanks)
    the headword in its syntactic environment
identification of semantic types in particular syntactic positions
the semantic scenario
    a full-sentence definition making a link between the arguments and the situation (FN) typical for a particular sense

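Semantic Frame
1 padat déšť
    když prší, padají kapky z mraků na zem (when it rains, drops fall from the clouds to the ground)
1.1 o věcech
    když VĚCI nebo jejich SOUČÁSTI prší, padají jako kapky deště na zem (when THINGS or their PARTS rain, they fall to the ground like raindrops)
2 objevovat se ve velkém množství
    když KRITIKY nebo DOTAZY prší, znamená to, že je jich hodně (when CRITICISMS or QUESTIONS rain, it means there are many of them)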

III. Syntactic Level
semantic frame (between semantics and syntax)
    semantic arguments in capital letters (ID-ed)
    linked with collocates via syntax
syntactic structures (formal)
    clause and phrase level (all POS; only for NLP)
    the number of syntactic structures is finite (SLB ~290)
    source: word sketches (Sketch Engine)
syntactic patterns (verbalized)
    valency (only verbs; for lexicography and NLP)
syntactic combinations
    more than basic patterns: "to squeeze your eyes shut"





Syntactic Structures
2 objevovat se ve velkém množství (to appear in large quantities)
    když KRITIKY nebo DOTAZY prší, znamená to, že je jich hodně (when CRITICISMS or QUESTIONS rain, it means there are many of them)
    NP/S+pršet
    ADV+pršet

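Syntactic Patterns
2 objevovat se ve velkém množství (to appear in large quantities)
    když KRITIKY nebo DOTAZY prší, znamená to, že je jich hodně (when CRITICISMS or QUESTIONS rain, it means there are many of them)
    NP/S+pršet
        co prší (something rains)
        co prší na co/koho (something rains on something/somebody)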

IV. Collocation Level

SEMANTIC FRAME:
    1 když prší, padají kapky z mraků na zem (when it rains, drops fall from the clouds to the ground)
    2 když KRITIKY nebo DOTAZY prší, znamená to, že je jich hodně (when CRITICISMS or QUESTIONS rain, it means there are many of them)

SYNTACTIC STRUCTURES AND PATTERNS:
    1 NP/S+pršet, LOKACE (location)
        co prší
        pršet na co/koho
        pršet na čem
    2 NP/S+pršet
        co prší
        co prší na co/koho

If parts of a syntactic pattern are collocational, they are shown on the collocation level.

COLLOCATIONS
    [kapky, déšť] pršet
    [kritika, dotazy] prší
    pršet na [zem]
    pršet na [hlavu]
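The bracket notation in the collocation lists can be read as a slot holding alternative collocates. Assuming that reading (it is an interpretation of the slide, not a documented format), a short sketch that expands one template into its concrete collocations might look like this; the function name is hypothetical.

```python
import re
from typing import List


def expand_collocation(template: str) -> List[str]:
    """Expand the first [a, b, ...] slot of a template into one string per collocate."""
    match = re.search(r"\[([^\]]*)\]", template)
    if match is None:
        # No slot: return the template itself with normalized spacing.
        return [" ".join(template.split())]
    collocates = [item.strip() for item in match.group(1).split(",")]
    expanded = []
    for collocate in collocates:
        text = template[:match.start()] + collocate + template[match.end():]
        expanded.append(" ".join(text.split()))  # collapse stray whitespace
    return expanded
```

For example, `expand_collocation("[kapky, déšť] pršet")` gives one string per collocate in the slot, and a single-collocate template such as `pršet na [zem]` gives a one-element list.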





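V. Examples

COLLOCATIONS
    [kapky, déšť] pršet
    [kritika, dotazy] prší
    pršet na [zem]
    pršet na [hlavu]

EXAMPLES (TBL + GDEX)
    Dívám se z okna, jak prší déšť. (I watch from the window as the rain falls.)
    Tato klenba zadržuje vodu, která pak skrze průduchy prší na zem. (This vault holds back water, which then rains to the ground through the vents.)
    Nevýhodou přilby s otvory je, že při dešti Vám prší na hlavu. (The disadvantage of a helmet with openings is that in the rain it rains on your head.)
    Na nakladatelství pršely dotazy, zda kniha vyjde i česky. (Questions rained down on the publisher about whether the book would also come out in Czech.)
    Zdrcující kritika pršela na adresu vlády i na tiskové konferenci, kterou v úterý uspořádal Svaz obchodu a cestovního ruchu (SOCR). (Crushing criticism also rained down on the government at the press conference held on Tuesday by the Svaz obchodu a cestovního ruchu (SOCR).)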



FOR WHAT: reference
FOR WHOM: general user, school population, Slovene as foreign language
WHAT: semantic info (menus + frames), collocations, corpus examples

FOR WHAT: natural language processing
FOR WHOM: computer linguist
WHAT: semantic frames, syntactic structures, syntactic patterns, other grammatical info

Corpus Data & Authoring Tools
FidaPLUS, Gigafida
Sketch Engine: www.sketchengine.co.uk
    Slovene Sketch Grammar (LBS syn. structures)
    Tick-box Lexicography
    GDEX
IDM Dictionary Production System
    http://www.idm.fr/products/dictionary_writing_system/27/
    custom DTD

FidaPLUS (Gigafida)
precursor: FIDA (1997-2000, 100 million)
621 million tokens
tagged and lemmatized (85% accuracy, rule-based tagger)
taxonomy
    text types
    medium
    linguistic proof-reading
time span: 1990-2006
concordancers
    http://www.fidaplus.net/
    http://www.sketchengine.co.uk/


SLB sketch grammar + TBL: to love

TBL: examples by GDEX

TBL: Entry Editor

GDEX
Good Dictionary Examples
system for evaluation (ranking) of sentences with respect to their suitability to serve as dictionary examples
sorting sentences so that good examples do not have to be searched for in hundreds of unusable sentences
initially trained on English, it did not give good results for other languages
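The ranking idea can be sketched with a toy scorer. This is not the GDEX implementation or its configuration: the three features (sentence length, well-formedness, share of uncommon words), their weights and the function names are all assumptions chosen here for illustration.

```python
from typing import List, Set


def gdex_like_score(sentence: str, common_words: Set[str],
                    min_len: int = 5, max_len: int = 20) -> float:
    """Score a sentence between 0 and 1; higher means a better example candidate."""
    tokens = sentence.rstrip(".!?").split()
    if not tokens:
        return 0.0
    score = 1.0
    # Penalize sentences that are too short or too long (illustrative weight).
    if not (min_len <= len(tokens) <= max_len):
        score -= 0.4
    # Penalize fragments: no initial capital or no final punctuation.
    if not (sentence[0].isupper() and sentence.endswith((".", "!", "?"))):
        score -= 0.3
    # Penalize words outside the supplied list of common words.
    rare = sum(1 for token in tokens
               if token.lower().strip(",;:") not in common_words)
    score -= 0.3 * rare / len(tokens)
    return max(score, 0.0)


def rank_examples(sentences: List[str], common_words: Set[str]) -> List[str]:
    """Sort sentences so that the best-scoring candidates come first."""
    return sorted(sentences,
                  key=lambda s: gdex_like_score(s, common_words),
                  reverse=True)
```

Because the word list and thresholds are passed in, the same scorer could be re-tuned per language, which is the kind of adaptation the last point on the slide suggests was needed.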

Evaluation

Authoring & search tools
IDM Dictionary Production System
    currently used by lexicographers
iLex (http://www.emp.dk/)
    in the process of evaluation
T-Lex (http://tshwanedje.com/)
    evaluated, stand-by
ABBYY (http://www.abbyy.com/lingvo_content/)
    in the process of evaluation
Termania (http://www.termania.net/)
    online search and visualization tool

Corpora and web concordancers

Corpora
Gigafida: corpus of written texts
KRES: smaller and more carefully balanced corpus of written texts
GOS (Govorjena slovenščina): corpus of spoken Slovene
Šolar: corpus of school essay transcriptions with teachers’ corrections

CONCORDANCERS
Gigafida, KRES, Šolar: written, http://demo.gigafida.net/
GOS: spoken, http://www.korpus-gos.net/



Gigafida
new generation in the written corpus series
    FIDA (2000), FidaPLUS (2006), Gigafida (2011)
1,148,350,213 tokens (1.15 billion)
simplified taxonomy
changed copyright status
    10% can be used freely (downloadable as a data set)
    no authentication for web access
new annotation tools




Corpus annotation
new statistical tagger: 92.17%
meta-tagger
    a combination of the Amebis rule-based tagger and the new statistical tagger
new lemmatizer: 98-99%
new parser under development: MSTParser
training corpus:
    500,000 words: manually verified POS tags
    200,000 words (~11,300 sentences): manually verified dependency treebank with only 10 labels
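The slide does not say how the meta-tagger combines its two inputs. One common way to combine taggers, shown here purely as an assumed illustration, is to accept the statistical tag when the taggers agree or the statistical tagger is confident, and to fall back to the rule-based tag otherwise. The function name, the confidence scores and the 0.8 threshold are hypothetical.

```python
from typing import List, Tuple


def meta_tag(rule_tags: List[str],
             stat_tags: List[Tuple[str, float]],
             threshold: float = 0.8) -> List[str]:
    """Merge per-token tags from a rule-based and a statistical tagger.

    rule_tags: one tag per token from the rule-based tagger.
    stat_tags: one (tag, confidence) pair per token from the statistical tagger.
    """
    if len(rule_tags) != len(stat_tags):
        raise ValueError("both taggers must tag the same number of tokens")
    merged = []
    for rule_tag, (stat_tag, confidence) in zip(rule_tags, stat_tags):
        if rule_tag == stat_tag or confidence >= threshold:
            merged.append(stat_tag)   # agreement, or a confident statistical tag
        else:
            merged.append(rule_tag)   # disagreement with low confidence
    return merged
```

Any real combination scheme would be tuned on the manually verified training corpus listed above; this sketch only shows the shape of such a decision rule.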


Taxonomy (share of corpus, %)
PRINTED 87.1
    BOOKS 6.5
        FICTION 2.1
        NON-FICTION 4.4
    PERIODICALS 79.9
        NEWSPAPERS 57.7
        MAGAZINES 22.2
    OTHER 0.7
INTERNET 12.9

KRES & free corpus
KRES (in development)
    100 million words
    online
    balanced
Free corpus (in development)
    100 million words
    10% of each corpus document
    downloadable data set
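The slide states only that the free corpus holds 10% of each corpus document. How that 10% is chosen is not described, so the following sketch makes two assumptions: the sampling unit is the paragraph, and paragraphs are picked at random with a fixed seed so the data set is reproducible. The function name is hypothetical.

```python
import random
from typing import List


def sample_document(paragraphs: List[str], share: float = 0.10,
                    seed: int = 0) -> List[str]:
    """Return roughly `share` of a document's paragraphs, in original order."""
    if not paragraphs:
        return []
    # Keep at least one paragraph so that short documents are still represented.
    count = max(1, round(len(paragraphs) * share))
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(paragraphs)), count))
    return [paragraphs[index] for index in chosen]
```

Sampling a fraction of every document, rather than whole documents, keeps the text-type distribution of the source corpus while making it hard to reconstruct any complete copyrighted text.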

Taxonomy KRES (share of corpus, %)
PRINTED 80
    BOOKS 35
        FICTION 17
        NON-FICTION 18
    PERIODICALS 40
        NEWSPAPERS 20
        MAGAZINES 20
    OTHER 5
INTERNET 20
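Both taxonomies are hierarchical, so each category should equal the sum of its subcategories and the top level should total 100. The sketch below checks this for the Gigafida and KRES figures given on the slides, with decimal commas written as points; the nested-dictionary layout and function names are assumptions made for this example.

```python
from decimal import Decimal

# Each node is (share in percent, subcategories); figures are taken from the slides.
GIGAFIDA = {
    "PRINTED": ("87.1", {
        "BOOKS": ("6.5", {"FICTION": ("2.1", {}), "NON-FICTION": ("4.4", {})}),
        "PERIODICALS": ("79.9", {"NEWSPAPERS": ("57.7", {}), "MAGAZINES": ("22.2", {})}),
        "OTHER": ("0.7", {}),
    }),
    "INTERNET": ("12.9", {}),
}
KRES = {
    "PRINTED": ("80", {
        "BOOKS": ("35", {"FICTION": ("17", {}), "NON-FICTION": ("18", {})}),
        "PERIODICALS": ("40", {"NEWSPAPERS": ("20", {}), "MAGAZINES": ("20", {})}),
        "OTHER": ("5", {}),
    }),
    "INTERNET": ("20", {}),
}


def inconsistent_nodes(tree, path=()):
    """List categories whose share differs from the sum of their subcategories."""
    bad = []
    for name, (share, children) in tree.items():
        if children:
            child_sum = sum(Decimal(s) for s, _ in children.values())
            if child_sum != Decimal(share):
                bad.append("/".join(path + (name,)))
            bad.extend(inconsistent_nodes(children, path + (name,)))
    return bad


def top_level_total(tree):
    """Sum the shares of the top-level categories."""
    return sum(Decimal(share) for share, _ in tree.values())
```

`Decimal` is used instead of floats so that sums such as 2.1 + 4.4 compare exactly. The comparison also makes the design difference visible: books make up 6.5% of Gigafida but 35% of the balanced KRES.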

GOS
the first corpus of spoken Slovene
120 hours of speech
one million words
criteria
    demographic
    speech type/situation
    additional (language learning, 15%)
transcription
    pronunciation-based
    standardized


Web concordancers
Log analysis of FidaPLUS concordancer
FidaPLUS web survey
Analysis of existing corpus tools
Analysis of popular web tools (Google etc.)
Final goal
    use in classroom and by general public
    linguists can use existing tools (SkE, CWB, etc.)

Survey findings
Simple search
    regularly used by 72% of users
Advanced search
    rarely used (only 8% use it regularly)
Lack of intuitiveness
    The manual is almost essential for learning how to use a corpus tool
    “…if you are not using the interface for a while, you forget what the search commands are, and you don’t (want to) bother with looking into the manual”
    “…the interface should have a modern design, it should be more user-friendly, and its use should be clear and transparent”

Main design principles
similarity to well-known non-linguistic tools (e.g. Google)
No registration
Minimum navigation
No redundant functions (less is more)
Simplicity of searches
Help and tips in pop-up windows
Simple descriptions of functionality (no terminology)

The result
two concordancers
    written corpora: Gigafida, other written corpora
    spoken corpus: GOS
only one meta-character: quotation marks
extensive use of filters
    multiple possible lemmas
    use of capital letters
    immediate access to meta-information
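A query language with quotation marks as its only meta-character is simple to parse. The slide does not state what the quotation marks do, so the sketch below assumes one plausible reading: a quoted string is matched as an exact word form or phrase, and an unquoted word is looked up as a lemma. The function name and the `"form"` / `"lemma"` labels are hypothetical.

```python
import re
from typing import List, Tuple


def parse_query(query: str) -> List[Tuple[str, str]]:
    """Split a search string into ("form", text) and ("lemma", word) terms."""
    terms = []
    # First alternative: a quoted span; second: any run of non-space characters.
    for quoted, bare in re.findall(r'"([^"]*)"|(\S+)', query):
        if quoted:
            terms.append(("form", quoted))   # assumed: exact form or phrase
        elif bare:
            terms.append(("lemma", bare))    # assumed: search all forms of the lemma
    return terms
```

With this reading, a user who types a plain word gets every inflected form, which matters for a richly inflected language, and only needs quotation marks to narrow the search. Filtering by lemma choice, capitalization or metadata would then happen in the interface rather than in the query syntax.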