A NEW INTERFACE MODEL FOR THE “JAZYKI MIRA”/“LANGUAGES OF THE WORLD”
TYPOLOGICAL DATABASE¹


Oleg Belyaev


ABSTRACT

Aside from its use as an object for quantitative studies, any
linguistic database, and “Jazyki Mira” (JM) in particular, can be used
as a reference tool by researchers or students and, as such, requires a
user-friendly graphical interface. The existing software for accessing
JM data, while full-featured, suitable for data input and equipped with
extensive search capabilities, is lacking in terms of accessibility for
the ordinary user. The new interface model, focused specifically on
such users, aims to make browsing the database and analyzing its
data more convenient. This is achieved in part by several new
extensions to the database, which allow for including the geographic
and genetic classifications, feature annotations, examples and
references to the source books in .pdf format.


KEYWORDS

Software, typology, Jazyki Mira, database, computational linguistics,
quantitative linguistics


INTRODUCTION

As an electronic database with 3827 binary feature values for each of its 317
languages, “Jazyki Mira” contains a massive amount of linguistic data. While it has
already been successfully used for quantitative analysis in several studies², it has
not gained much popularity as a reference tool, which is surprising given its
extensive coverage of the languages of Eurasia. The main reason the database has
been relatively unknown outside a small group of linguists working in the field of
quantitative typology is its lack of a full-featured graphical user interface aimed at
ordinary researchers or students.

¹ The research is supported by RFBR grant (www.rfbr.ru), № 07-06-00229а.

² E. g., Novikov and Yaroslavtseva (1985), Polyakov and Solovyev (2006),
Polyakov, Solovyev and Akhtyamov (2006), Vinogradov, Novikov and
Yaroslavtseva (2003), Vinogradov et al. (2003).

For the last 6 years the main tool for browsing and editing the database has been
a program developed in 2002 by V. Polyakov and V. Logunov (cf. Polyakov and
Solovyev (2006) for a detailed overview). It is a complete environment which can
be used both to browse through all the information in the database and to modify
its contents. It also includes some advanced search capabilities. What it lacks is
usability as reference material, that is, ease of browsing and a natural
representation of the data. The second shortcoming is best illustrated by the way
the feature sets are presented: instead of a true hierarchical structure, a dot-based
notation (with the number of dots indicating the level of a feature in the hierarchy)
is used, as in the source files of the database. While this may be useful for editing or
reviewing the database, seeing the source data “as-is” is unnecessary for the
ordinary user.
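
The conversion from the dot-based notation to a true hierarchy can be illustrated with a minimal sketch. It is written in Python purely for brevity and is not the code of either application. The exact layout of the JM source files is not given here, so the sketch assumes one feature per line, with zero dots for a top-level feature and one additional leading dot per level; the feature names in the usage example are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    name: str
    children: list["Feature"] = field(default_factory=list)


def parse_dot_notation(lines):
    """Build a feature tree from lines whose leading dots encode depth.

    Assumption (not taken from the JM files): zero dots = top level,
    each extra leading dot = one level deeper.
    """
    root = Feature("ROOT")
    stack = [root]  # stack[d + 1] is the latest feature seen at depth d
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # ignore blank lines
        depth = len(text) - len(text.lstrip("."))
        node = Feature(text.lstrip(".").strip())
        if depth + 1 > len(stack):
            raise ValueError(f"feature {node.name!r} skips a hierarchy level")
        del stack[depth + 1:]            # leave the deeper branches we were in
        stack[depth].children.append(node)
        stack.append(node)
    return root


def render(node, indent=0):
    """Return the tree as indented lines, one feature per line."""
    out = []
    for child in node.children:
        out.append("  " * indent + child.name)
        out.extend(render(child, indent + 1))
    return out


if __name__ == "__main__":
    sample = ["Phonology", ".Vowels", "..Monophthongs", ".Consonants", "Syntax"]
    print("\n".join(render(parse_dot_notation(sample))))
```

Keeping a stack of the most recent feature at each depth means the file is read in a single pass, and a line that jumps more than one level deeper is reported rather than silently attached to the wrong parent.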

This lack of a modern and full-featured environment led the JM team to the
decision that a new interface model should be developed. The project, however,
includes not only the development of a new computer application, but also the
extension of the database with a framework for several new datasets: regional and
genetic classifications of the languages, feature annotations, examples and
references to the texts of the source books in .pdf format. It should be noted that, as
of yet, only the genetic classification has been filled with data and is fully
functional.

In the course of the project a basic computer application, with support for
all the necessary data structures, has been developed. For the structures not yet
filled, sample data has been provided. The aim of this paper is to give a general
description of these results and an outline of future plans for the development of
the product. The main body of the paper is organized as follows: Section 2
describes the main structure of the user interface; Section 3 describes the language
viewer; Section 4 describes the feature annotations, the examples system, and the
framework for accessing the source books; Section 5 describes the search interface;
Section 6 describes the region and family browsers; finally, in Section 7 the
principal details of the software implementation are presented, and Section 8
concludes the paper, outlining prospects for the future.


SECTION 2. THE MAIN USER INTERFACE

The main window of the application is organized in a TDI (tabbed document
interface) fashion: that is, each new language, as well as the region and family
browsers and the search interface, is opened in a separate tab in the working area to
the right. This paradigm should be familiar to users of several popular web
browsers. The general look of the main window is presented in Figure 1.

The left pane shows a list of all the languages present in the database.
Clicking on one of them opens a new tab with information on the selected language.
The structure of such tabs is described in detail in the next section.


The main menu strip of the application contains the following items:

1. The File menu


The File menu is shown in Figure 2. This menu allows the user to save the list
of currently open languages and to load a language list saved earlier. This
can be useful when frequent reference to the same set of languages is required.

Figure 1. The main interface of the application




2. The Language menu


In the current version of the application, the Language menu (shown in Figure 3)
contains only an option to open the Search tab, which is described in detail in
Section 5.



3. The View menu


The View menu (Figure 4) allows the user to access the region browser and
the family browser, which are described in detail in Section 6.


The Help menu does not serve any purpose in the current version of the
application, but an HTML help system is planned for the future.

Figure 2. The File menu

Figure 3. The Language menu

Figure 4. The View menu


SECTION 3. THE LANGUAGE VIEWER

The language viewer, seen on the right in Figure 1, is the core component of
the program. It is here that the individual information for each language is shown.
Its left pane displays a text-based description of the language, the so-called
dossier (Russian индивидуальная часть, lit. 'individual part'). This is essentially an
overview of those sections of the source article which could not be encoded in the
form of binary features. On the right, the features themselves are shown in a
tree-like fashion so that the hierarchy is easy to see. A tick marks the presence of a
feature, while an empty check box marks its absence.

Right-clicking on a feature in the right pane brings up a context menu, from
which one can access the examples for the selected feature, as well as an
annotation and a reference to the source books.
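
The tick-box presentation of binary features can be sketched as follows. This is a Python illustration under stated assumptions, not the application's C# code: the tree is given as nested `(feature_id, name, children)` tuples, a language is represented by the set of identifiers of the features it has, and all identifiers and names in the usage example are hypothetical.

```python
def render_checklist(tree, present, indent=0):
    """Render a feature tree as text, ticking the features a language has.

    tree    -- list of (feature_id, name, children) tuples (assumed layout)
    present -- set of feature ids whose binary value is "present"
    """
    lines = []
    for feature_id, name, children in tree:
        box = "[x]" if feature_id in present else "[ ]"
        lines.append(f"{'  ' * indent}{box} {name}")
        lines.extend(render_checklist(children, present, indent + 1))
    return lines


if __name__ == "__main__":
    # Hypothetical feature tree and language profile.
    tree = [(1, "Alignment", [(2, "accusative", []), (3, "ergative", [])])]
    print("\n".join(render_checklist(tree, present={1, 2})))
```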


SECTION 4. THE EXAMPLE VIEWER, THE ANNOTATION VIEWER
AND SOURCE REFERENCES

The interfaces for viewing the examples for a given feature, its annotation or a
reference to the source .pdf can be brought up from the context menu for this
feature.

The example viewer (Figure 5) and the annotation viewer (Figure 6) are
organized in a very similar way. Each of them is, internally, a table with several
columns. Both contain a column for the number (identifier) of the feature for which
an example or annotation is given, and a column for the source of this information.
The annotation table additionally contains a column for the text that should be
displayed, while the examples table contains columns for the language number, for
the text of the example (in IPA), for its translation into English, and a comment
field.

Figure 5. A placeholder example for Russian for the 'accusative alignment' feature




If the language number is specified as zero, the example is treated as the
default and is displayed when no entry for the selected language can be
found. A placeholder default example is shown in Figure 7.
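
A minimal sketch of this lookup is given below, in Python rather than the application's C#. The field names are hypothetical renderings of the columns listed above, and the sketch assumes at most one example per feature and language; only the convention that language number zero marks the default comes from the description.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_LANGUAGE = 0  # language number zero marks a default example


@dataclass(frozen=True)
class Example:
    feature_id: int
    language_id: int
    text_ipa: str
    translation: str
    comment: str
    source: str


def find_example(examples, feature_id, language_id) -> Optional[Example]:
    """Return the language's own example, else the default one, else None."""
    by_key = {(e.feature_id, e.language_id): e for e in examples}
    own = by_key.get((feature_id, language_id))
    if own is not None:
        return own
    return by_key.get((feature_id, DEFAULT_LANGUAGE))


if __name__ == "__main__":
    # Placeholder rows; identifiers and texts are invented for illustration.
    rows = [
        Example(10, 0, "<default IPA text>", "<default translation>", "", "<source>"),
        Example(10, 17, "<IPA text>", "<translation>", "", "<source>"),
    ]
    print(find_example(rows, 10, 17).text_ipa)  # the language's own entry
    print(find_example(rows, 10, 99).text_ipa)  # falls back to the default
```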



Both viewers utilize a so-called fallback mechanism, which works as follows: if
no information (individual or default) can be found for a given feature, the
information for its parent feature is displayed, and so on recursively up the
hierarchy to the nearest ancestor containing any information. This is done because
of the peculiarities of the binary structure of JM, where individual lowest-level
features often do not mean much by themselves, but are actually values of their
parent.
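
The fallback mechanism can be sketched as a walk up the feature hierarchy. The Python below is an illustration, not the application's code: it assumes the hierarchy is available as a mapping from each feature number to its parent (`None` for a top-level feature) and the table as a mapping from `(feature, language)` pairs to text, with language zero as the default. Both structures and all values in the usage example are hypothetical.

```python
def lookup_with_fallback(feature_id, language_id, parent_of, entries):
    """Find the nearest feature, from feature_id upwards, that has an entry.

    parent_of -- dict: feature id -> parent feature id (None at top level)
    entries   -- dict: (feature id, language id) -> text; language 0 = default
    Returns (feature id that supplied the text, text), or None if no ancestor
    has anything.
    """
    current = feature_id
    while current is not None:
        # At each level, prefer the individual entry over the default one.
        for lang in (language_id, 0):
            if (current, lang) in entries:
                return current, entries[(current, lang)]
        current = parent_of.get(current)
    return None


if __name__ == "__main__":
    parent_of = {3: 2, 2: 1, 1: None}            # hypothetical hierarchy
    entries = {(2, 0): "default note for feature 2"}
    print(lookup_with_fallback(3, 5, parent_of, entries))
```

Returning the identifier of the feature that actually supplied the text lets the viewer indicate that the displayed information belongs to a parent feature rather than to the one selected.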

Figure 6. A placeholder annotation for the 'accusative alignment' feature

Figure 7. A placeholder default example for the 'accusative alignment' feature



The example viewer currently displays the glosses in a non-standard and
not very convenient way. A standard interlinear glossing format is planned
for future versions.

Selecting the 'show reference' option opens a .pdf file with the corresponding
source issue of the “Jazyki Mira” encyclopaedia. Few of these have been
digitized so far, but the functionality is in place should such a digitization
project be completed.


SECTION 5. THE SEARCH INTERFACE

The search interface (Figure 8) operates in a separate tab. Currently, it is only
capable of filtering languages by a given set of features, but an extension of the
application's search capabilities is planned for the future.

On the left, a list of all the available features is displayed in the same way as in
the language viewer, but with all features unticked. By clicking the checkboxes
next to the feature names, the user selects the features to search for. On the right,
a complete list of all the languages in the database is initially displayed. Clicking
the 'Search properties' button filters the list so that only the languages having the
selected features are displayed. Any subsequent search is applied to the previously
filtered list. To return the list to its initial state, the user should click the
'Restore list' button.
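
The filtering behaviour described above, including the cumulative effect of repeated searches, can be sketched as follows. This is a Python illustration under the assumption that each language is represented by the set of features it has; the class name, method names and sample data are hypothetical and do not reflect the application's C# code.

```python
class LanguageSearch:
    """Cumulative feature filter with a 'restore' operation (a sketch)."""

    def __init__(self, features_by_language):
        # language name -> set of feature ids the language has
        self._all = {lang: set(f) for lang, f in features_by_language.items()}
        self.current = sorted(self._all)

    def search(self, selected):
        """Keep only languages in the current list having every selected feature."""
        selected = set(selected)
        self.current = [lang for lang in self.current
                        if selected <= self._all[lang]]
        return self.current

    def restore(self):
        """Return the list to its initial, unfiltered state."""
        self.current = sorted(self._all)
        return self.current


if __name__ == "__main__":
    data = {"Language A": {1, 2, 3}, "Language B": {1, 2}, "Language C": {2, 3}}
    s = LanguageSearch(data)
    print(s.search({1}))   # first search
    print(s.search({3}))   # applied to the already filtered list
    print(s.restore())
```

Because each search narrows the previous result, two successive searches are equivalent to a single search for the union of the selected features.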







Figure 8. The search interface (feature 'number of monophthongs' displayed on the left, with the value
'eight' selected)



SECTION 6. THE REGION AND FAMILY BROWSERS

The region and family browsers (Figures 9 and 10) operate in an identical
manner. Internally, each is organized as two tables. The first describes the
regions/families, with columns for the region/family identifying number, its name,
and the number of its parent (a zero value marks the region/family as root-level).
The second consists of two columns and assigns a language number to a
region/family number. Languages are supposed to be included only in the
lowest-level nodes.
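
The two-table organization can be turned into a browsable tree as in the following sketch. It is written in Python for illustration and is not the application's code. The row layouts follow the description above, with a parent number of zero marking a root-level node; the function name, the separate mapping from language numbers to names, and all sample rows are hypothetical.

```python
def build_browser(nodes, membership, language_names):
    """Render the region/family hierarchy as indented text lines.

    nodes          -- rows of (node id, name, parent id); parent id 0 = root level
    membership     -- rows of (language id, node id)
    language_names -- dict: language id -> name (assumed to come from the
                      main database)
    """
    children = {}
    for node_id, name, parent_id in nodes:
        children.setdefault(parent_id, []).append((node_id, name))
    languages = {}
    for language_id, node_id in membership:
        languages.setdefault(node_id, []).append(language_names[language_id])

    def subtree(parent_id, depth):
        lines = []
        for node_id, name in children.get(parent_id, []):
            lines.append("  " * depth + name)
            lines.extend("  " * (depth + 1) + "- " + lang
                         for lang in languages.get(node_id, []))
            lines.extend(subtree(node_id, depth + 1))
        return lines

    return subtree(0, 0)


if __name__ == "__main__":
    nodes = [(1, "Family A", 0), (2, "Branch A1", 1), (3, "Family B", 0)]
    membership = [(10, 2), (11, 3)]
    names = {10: "Language X", 11: "Language Y"}
    print("\n".join(build_browser(nodes, membership, names)))
```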

Clicking on a language opens it in a new tab.

It should be noted that the family browser has already been filled with complete
data. However, the classification used in it is based on the one in the “Jazyki Mira”
encyclopaedia, which includes areal groupings like “Caucasian” or “Palaeoasiatic”
on the same level as true language families, and unifies isolates under one group,
whereas they should be represented by a number of different families, each
consisting of one language. This makes it useful for reference purposes only, not
for computational studies, although a more comprehensive classification could be
integrated for such tasks. If the existing classification is kept, it will still need
to be translated into English.






Figure 9. The region browser (with sample data)

Figure 10. The family browser


SECTION 7. IMPLEMENTATION DETAILS

The main application is written in the C# programming language, using the
Microsoft .NET Framework (the Common Language Infrastructure, CLI³). This is
done both to simplify the development process and to ensure portability
across different systems: the open-source Mono framework⁴ allows binary
compatibility on a number of different platforms, including Linux, Apple Mac OS,
and FreeBSD⁵. The version of the runtime required is 2.0 or higher.

³ ECMA standard, available at:
http://www.ecma-international.org/publications/standards/Ecma-335.htm.

Both the main database and the supplements provided as part of this project are,
internally, tables, which makes it possible to present them in virtually any format
suitable for such a data structure. In the current version of the program, the
Microsoft Excel (.xls) format is used. This is, however, not the best design
decision, since Excel files incorporate a lot of supplementary data (font sizes,
typefaces, etc.) which is not required for the task at hand and substantially
increases the size of the files. In a future version of the application, the data will
probably be represented in a simple plain-text format like .csv (comma-separated
values), using Unicode, in order to decrease its size, simplify the program's access
to it, and generally make the implementation more economical and internally
consistent.
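
What such a plain-text representation could look like is sketched below, using Python's standard `csv` module and UTF-8 as the Unicode encoding. Since the future format has not been fixed, everything here is an assumption: the column names are hypothetical, and the sketch only shows that a table with non-ASCII text survives a round trip through comma-separated values.

```python
import csv
import io


def dump_rows(header, rows):
    """Serialize a table to UTF-8 encoded CSV bytes."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


def load_rows(data):
    """Parse UTF-8 encoded CSV bytes into a list of dicts keyed by column name."""
    reader = csv.DictReader(io.StringIO(data.decode("utf-8"), newline=""))
    return list(reader)


if __name__ == "__main__":
    # Hypothetical annotation table: feature number, text, source.
    header = ["feature_id", "text", "source"]
    rows = [(1, "индивидуальная часть, lit. 'individual part'", "<source>")]
    data = dump_rows(header, rows)
    print(load_rows(data))
```

Note that all values come back as strings, so numeric columns such as feature numbers would have to be converted explicitly on loading.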


SECTION 8. CONCLUSIONS AND PLANS FOR FUTURE DEVELOPMENT

While the application cannot yet be considered a complete and finished product
due to unimplemented features and the lack of complete data for its supplementary
materials, it can already be successfully used for browsing the database and
working with the data it contains. It is hoped that this new tool will help many new
researchers and students get acquainted with the “Jazyki Mira” typological
database.

The prospects for the future, which have for the most part already been outlined
in the preceding sections, can be summed up as follows. The primary goal today is
to fill all the supplementary tables with the relevant data in order to make the new
features fully functional. Another goal is to implement new features, particularly
with regard to the Search function, to make the user interface more convenient and
streamlined, and to ensure true portability and compatibility of the application
across different platforms. These are the main directions in which the software has
to be developed; however, to make the database truly useful internationally, a
complete translation into English has to be made, which makes it one of the top
priorities of the project. Only with these goals fulfilled will it be possible for JM to
become a truly international cooperative project.

⁴ Available at www.mono-project.com. It is preinstalled on most modern Linux
distributions which use the GNOME desktop environment.

⁵ Currently, the user interface is implemented using the Windows Forms
libraries, which hinders portability since they are not completely supported in
Mono, but a version using the portable and free software GTK libraries is in
development.


REFERENCES

1. Novikov, A. I. and Yaroslavtseva, E. I. (1985) База лингвотипологических
данных и принципы её функционирования. [A database for linguistic
typology and the principles of its functioning] Вести АН СССР, 1985, №3.

2. Polyakov, V. N. and Solovyev, V. D. (2006) Компьютерные модели и
методы в типологии и компаративистике. [Computer models and methods
in typology and comparative linguistics] Kazan, 2006.

3. Polyakov, V. N., Solovyev, V. D. and Akhtyamov, R. B. (2006) Сложное
предложение в базе данных “Языки мира”: статистика и методы
исследования. [Complex sentence in the “Jazyki Mira” database: statistics
and research methods] In Труды лаборатории языков народов Сибири.
Tomsk: TGPU, 2006.

4. Vinogradov, V. A., Novikov, A. I. and Yaroslavtseva, E. I. (2003) База
данных “Языки мира” как инструмент лингвистического исследования.
[The “Jazyki Mira” database as a tool for linguistic research] Вопросы
языкознания, 2003, №3.

5. Vinogradov, V. A., Novikov, A. I., Yaroslavtseva, E. I., Polyakov, V. N. and
Logunov, V. V. (2003) База данных “Языки мира” и лингвистические
исследования. [The “Jazyki Mira” database and linguistic research] In II
Международные Бодуэновские чтения (Казань, 11-13 декабря 2003 г.).
Труды и материалы. Vol. 1, p. 39-41.