Design of a Multimedia Corpus of Austronesian Linguistics



Zhemin Lin, Li-May Sung, I-wen Su


Graduate Institute of Linguistics of the National Taiwan University

Abstract

This paper introduces the design of an integrated platform of multimedia online corpora that aims to serve both linguists and the public, along with its database schema and programming details. Compared with the Formosan language archive of Academia Sinica, our design places more emphasis on normalization, accessibility and interoperability of the system. The design of an automatically generated dictionary with cross-references and the capability of searching the entire database in various ways are also described here.

1 Introduction

The development of natural language processing techniques and dynamic web pages has generated wide interest in the construction of integrated platforms that enable people to submit, browse and search collected texts in corpora. However, most online corpora are built specifically for experts; they are sentence-based and do not provide multimedia content. The NTU corpus of Austronesian languages¹ introduced in this paper is an attempt to construct a multilingual online corpus with multimedia content that meets the needs of both linguists and the public. In the following sections, we briefly review previous work and then focus on the features of our current work.




¹ http://corpus.linguistics.ntu.edu.tw










2 Formosan Language Archive of Academia Sinica

Zeitoun et al. (2003) discussed some of the problems in the conservation of Formosan Austronesian languages. The continuous enhancement of their work with many newly designed tools is further described in Zeitoun and Yu (2005). As discussed in the two articles, fieldwork data are rarely shared in the linguistic community. Collected materials are sometimes inaccessible even in the office where they are stored, owing to changes of storage media or data damage. One of the most serious problems is that, although there are elicited sentences and recordings, few of them are rearranged and published. In response to these problems, researchers at Academia Sinica have built a Formosan language archive, i.e., a set of online corpora with texts, translations, word glosses and sound recordings from native speakers of 14 languages and dialects.²


Despite their labour, there are insufficiencies in their system, one of them being theoretical: the Sinica corpora are sentence-based, so that pauses, pause fillers, repetitions, intonation contours, intonation unit (IU) boundaries and other discourse cues are either discarded or missing. A sentence-based corpus excludes important linguistic information that is present only in discourse. Words in the system are written in an ad hoc mixed style based on the International Phonetic Alphabet (IPA), a transcription style that prevents native speakers of the respective languages from using the data directly. Nearly every word is altered to some extent. Example (1) is a Saisiyat passage extracted from the Sinica archive.


(1)
(a) yao noka maʔiiæh ... hayðaʔ ʔæhæʔ maʔiiæh la m-waaiʔ, yao mina-ŋaʔŋaʔ nak hini mina-ʃaaəŋ.
(b) ʔinʔalay hikor may nak hini yakin, ʃβət yakin ho.
(c) ʔok-ik ʃəβət, m-waaiʔ nak hini pa-paʃœʃ, yao h<œm>ʃœʃ atomalan.

(Extracted from 05.002a--05.002c of “5. 我的故事” of the Sinica archive.)

There is so far no dictionary with a cross-referencing function in the Sinica corpus, even though cross-referencing in an online corpus is essential for researchers dealing with elicited or authentic data. As with KWIC (keyword-in-context; cf. Luhn (1960)), cross-referencing lets a user trace a word back to the context where it occurs and browse its surrounding IUs. Zeitoun et al. (2003) planned a data schema that ran on Microsoft Access. Their design, however, cannot take advantage of the SQL92 query language. Moreover, they designed their own XML dialect to improve interoperability, which does not encourage researchers to share their collected data in a convenient way. The Sinica archive, though primitive in design, is the first attempt to provide public access to linguistic data from nearly extinct languages, an effort that is highly respectable in itself.

² http://formosan.sinica.edu.tw/formosan/ch/select_corpus.htm

3 NTU Corpus of Austronesian Languages

The system designed in this paper is based on the NTU corpus of Austronesian languages. The NTU corpus, first described in Huang, Su, and Sung (2003), is composed of spoken texts in various languages. Currently the NTU Saisiyat corpus contains 22 texts, 3081 intonation units (IUs) and approximately 10635 words, whose transcription follows the conventions of Du Bois (1993). There are one conversation, eight narratives of indigenous legends, and thirteen elicited narratives based on “Pear Stories” (5 narratives based on a six-minute silent color film made by Wallace Chafe; see Chafe (1980)) and “Frog Stories” (8 narratives based on a picture book by Mayer (1980)). An example of an original data segment follows:

(2)
9.   ... (1.7)   m-wa:i'      'aehae'   ka
                 AF-come      one       NOM
10.  ...(1.1)    ma'iaeh      ima       h<oem>oehoe'   ka     siri'
                 person       ASP       <AF>pull       ACC    goat
11.  ...         may                    hiza
                 pass.by[AF]            there
12.  ...(1.9)    ilahiza                kabih
                 move.to.that.place     side

     “(The man pulling a goat) passed by this way and went that way.” (Pear 3:9-12)

A spoken corpus, in contrast to a written corpus, is composed of utterances shorter than or equal to sentences, which are transcribed according to certain criteria, such as turn-taking, pauses, and ruptures in the intonation contours of a monologue (Tao 1996:35). Fig. 1 shows a unified intonation contour in a praat³ window.



Fig. 1. A unified intonation contour


Once a corpus is transcribed, tagged and analyzed, one needs a means to make it accessible to the public. An integrated platform that stores and presents the collected data and lowers the technological barriers to their further use is thus necessary. With the insufficiencies of the Sinica archive in mind, normalization, accessibility and interoperability are emphasized in the design of our system. The following guidelines are thus proposed.


(3) Guidelines of the integrated platform

(a) Easy to customize for most Austronesian languages
(b) Standardized procedures of transcription, annotation and processing
(c) Automatic extraction of morphosyntactic information to reduce repetition of human labor
(d) Web-based, unified input/output interface
(e) Searchable corpus that fits the needs of both linguists and the public
(f) Multimedia representation of collected texts
(g) Interoperable with other systems
(h) Cross-platform, operating system independent

³ praat is a programmable phonetic analyser written by Paul Boersma and David Weenink, Institute of Phonetic Sciences, University of Amsterdam. It is licensed under the GNU General Public License; we are grateful to the authors for their outstanding work and for the free software. Cf. http://www.fon.hum.uva.nl/praat/


Below is a description of the input, processing and output of our system design.

3.1 Standardization of text commitment and standards of committed texts

The standardization covers the procedure for handling transcribed texts, the transcription itself, and the morphosyntactic and discourse codes used in the transcription. The procedure for handling collected texts is designed with low coupling in order to reduce complexity. The dependence on human manipulation in the system is therefore almost unidirectional, as can be seen in Figure 2. Whenever a spoken text is collected, a worker transcribes it. Once the transcription is complete, it is given to the database maintainer for processing and storage. The web interface shows the corpus in the database, so that people on the other end of the Internet can browse and search the corpus.
















Fig. 2. Use cases of the system


The transcription follows Du Bois (1993), a de facto standard in the linguistic community. Word glosses and annotations follow a standardized coding list inherited from conventional mark-ups (cf. Appendix A) and the Leipzig glossing rules⁴. A standard operating procedure is also set for the database maintainer to handle fieldwork collections, as shown in Figure 3.



⁴ http://www.eva.mpg.de/lingua/files/morpheme.html











Fig. 3. Standard operation of text commitment


The corpus in our system is stored in Unicode (UTF-8 encoding) to accommodate IPA, Japanese, and annotations in other languages. If some of the tribes decide to adopt non-ASCII letters, such as “ɖ ʈ ɼ ɫ ʔ”, into their writing systems, the programs can process them correctly with no need of modification. Because a Unicode BOM (byte-order mark, U+FEFF) is placed at the beginning of the file in Microsoft Windows and is absent in Unix-based systems, the mark may cause problems when reading files edited in different operating systems. It is properly dealt with in order to fulfil the criterion of platform independence.
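As an illustration of the BOM handling described above, the following is a minimal Python sketch, not the code of the actual system. The function name decode_committed is hypothetical; the only library feature relied on is Python's standard "utf-8-sig" codec, which decodes UTF-8 and skips a single leading BOM if one is present.

```python
import codecs


def decode_committed(data: bytes) -> str:
    """Decode the raw bytes of a committed file (hypothetical helper).

    "utf-8-sig" behaves like "utf-8" but drops one BOM at the start,
    so files saved on Windows and on Unix decode to the same text.
    """
    return data.decode("utf-8-sig")


# The same header line, once with and once without a leading BOM.
with_bom = codecs.BOM_UTF8 + "Topic: Pear story\n".encode("utf-8")
without_bom = "Topic: Pear story\n".encode("utf-8")

print(decode_committed(with_bom) == decode_committed(without_bom))
```

Decoding at the point where files are read keeps the rest of the processing independent of the editor or operating system that produced the file.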


A set of metadata is defined in the head of each committed file. An example of the head of a committed text is given in (4), and the fields are described in Table 1.


(4)
Topic: Pear story
Type: Narrative
Language: Kavalan
Dialect: Xinshe
Speaker: Imui, 潘金妹, F, 1952
Time: 00:01:15
Total IUs: 31
Collected: 2003-05-30
Revised: 2003-11-11
Transcribed by: 葉俞廷, 王以勤
Double checked: 鍾曉芳, 沈嘉琪, 葉俞廷

Table 1. Metadata of committed data

Field name        Description                    Format
Topic             Topic of text                  String (e.g., Pear Story)
Type              Style of text                  Narrative|Conversation|...
Language          Language of text               String, first letter in capital
Dialect           Dialect or district            String
Speaker           Base data of the informant     Native/Chinese name, Gender, Age
Time              Length of recording            hh:mm:ss
Total IUs         Number of IUs in text          Numeric
Collected         Date of recording              yyyy-mm-dd
Revised           Date of latest revision        yyyy-mm-dd
Transcribed by    Transcribers and annotators    Comma separated string
Double checked    Inspectors of text             Comma separated string
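The header format in (4) and Table 1 can be read with a few lines of Python. The sketch below is only an illustration under stated assumptions: the function name parse_metadata is hypothetical, each field is assumed to occupy one "Name: value" line, and the two name lists are assumed to be comma separated as Table 1 states.

```python
def parse_metadata(header: str) -> dict:
    """Parse 'Field: value' header lines into a dictionary (hypothetical helper)."""
    meta = {}
    for line in header.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        # Split at the first colon only, so "Time: 00:01:15" keeps its value intact.
        name, _, value = line.partition(":")
        meta[name.strip()] = value.strip()
    # Fields that Table 1 defines as comma separated strings become lists.
    for field in ("Transcribed by", "Double checked"):
        if field in meta:
            meta[field] = [n.strip() for n in meta[field].split(",") if n.strip()]
    return meta


HEADER = """Topic: Pear story
Type: Narrative
Language: Kavalan
Dialect: Xinshe
Time: 00:01:15
Total IUs: 31
Collected: 2003-05-30
Revised: 2003-11-11
Transcribed by: 葉俞廷, 王以勤
"""

meta = parse_metadata(HEADER)
print(meta["Language"], meta["Time"], meta["Transcribed by"])
```

A validity check such as the one canon.py is said to perform could then compare each value against the formats of Table 1; that step is left out of this sketch.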


The format of the text following the metadata is described below.


(5)
5.                                [IU #, with a period in the end]
.. qay- .. qay-byabas 'nay ,_     [words separated by spaces]
QAY-guava that                    [English gloss separated by spaces]
QAY-芭樂                           [Chinese gloss separated by spaces]

6.
... razat 'nay nani.\
person that DM
DM

#e That person picked guavas. Then,
#c 那個人採芭樂。然後,
#n Elicitation notes
#n (More elicitation notes)











Lines beginning with a sharp (#) are processing instructions (PI). “#e” indicates a line of English translation of a paragraph composed of the IUs from the last translation to the current one. “#c” marks a Chinese translation, and “#n” marks elicitation notes. It is possible to have more than one note.

The alignment of native words and glosses is done automatically. Morpheme boundaries, morphological information and word senses are extracted using the techniques introduced in Lin (2005: Chapters 2 and 4.2).
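To make the body format concrete, here is a minimal Python sketch of a reader for it. It is an assumption-laden illustration rather than the parser used in the system: the name parse_body and the returned dictionary keys are hypothetical, the input is assumed to be well formed, and “#e” is assumed to come before the “#c” and “#n” lines of the same paragraph, as in (5).

```python
import re

IU_NUMBER = re.compile(r"^(\d+)\.$")  # an IU number followed by a period


def parse_body(text: str):
    """Return (ius, paras) for the body of a committed text (hypothetical helper)."""
    ius, paras = [], []
    block_start = None  # first IU of the paragraph currently being collected
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        number = IU_NUMBER.match(line)
        if number:
            ius.append({"no": int(number.group(1)), "lines": []})
            if block_start is None:
                block_start = ius[-1]["no"]
        elif line.startswith("#"):
            code, _, content = line[1:].partition(" ")
            if code == "e":
                # An English translation closes the block of IUs since the last one.
                paras.append({"von": block_start, "bis": ius[-1]["no"],
                              "eng": content, "chn": "", "notes": []})
                block_start = None
            elif code == "c":
                paras[-1]["chn"] = content
            elif code == "n":
                paras[-1]["notes"].append(content)
        else:
            # Native words first, then the gloss lines, in the order they appear.
            ius[-1]["lines"].append(line)
    return ius, paras


SAMPLE = "\n".join([
    "5.",
    ".. qay- .. qay-byabas 'nay ,_",
    "QAY-guava that",
    "6.",
    "... razat 'nay nani.",
    "person that DM",
    "#e That person picked guavas. Then,",
    "#n Elicitation notes",
    "#n (More elicitation notes)",
])

ius, paras = parse_body(SAMPLE)
print(len(ius), paras[0]["von"], paras[0]["bis"], len(paras[0]["notes"]))
```

The keys von and bis mirror the field names of Table “para” in Appendix B, so each parsed paragraph maps onto one row of that table.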

As the transcription is supposed to reflect, more or less, the actual pronunciation of an informant, spelling may vary slightly from word to word. So that the system is not confused by these variations, a feature vector is configured for each Formosan language. A vector describes how to reduce the variants to a simpler form. For example, the pronunciations of a and ae are quite similar in Saisiyat, and glottal stops are sometimes omitted: 'aehae' “one” is usually spelled 'ahae or aehae. Below are the feature vectors of Saisiyat and Kavalan.⁵



Saisiyat:  ae → a,  oe → o,  S → s,  ' → (deleted)

Kavalan:   th → l,  d → l,  ' → (deleted)

This string substitution is executed before any operation in the database in order to prevent duplicated entries; otherwise full-text search may fail to work.
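The reduction of spelling variants can be sketched as a sequence of plain string substitutions. The Python below is an illustration only: the names FEATURES and simplify are hypothetical, the substitution pairs are copied from the feature vectors above and applied in the order listed, and the removal of morpheme hyphens is an assumption based on the simplified forms shown elsewhere in this paper (e.g., h-oem-angaw stored as homangaw).

```python
# Feature vectors as ordered (variant, reduced form) pairs, copied from the text.
FEATURES = {
    "Saisiyat": [("ae", "a"), ("oe", "o"), ("S", "s"), ("'", "")],
    "Kavalan": [("th", "l"), ("d", "l"), ("'", "")],
}


def simplify(word: str, language: str, strip_hyphens: bool = True) -> str:
    """Reduce a spelling variant to its simplified search form (hypothetical helper)."""
    if strip_hyphens:
        # Assumption: morpheme hyphens are dropped in the simplified form.
        word = word.replace("-", "")
    for variant, reduced in FEATURES[language]:
        word = word.replace(variant, reduced)
    return word


# The three spellings of Saisiyat "one" collapse to a single form.
print({simplify(w, "Saisiyat") for w in ["'aehae'", "'ahae", "aehae"]})
# A Kavalan query and the stored word meet at the same simplified form.
print(simplify("tabatathan", "Kavalan"), simplify("ta-batad-an", "Kavalan"))
```

Because both the stored text and the user's query pass through the same function, variants meet at a common key without the database having to enumerate them.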

3.2 Database design

Database design affects the efficiency of search and storage. To simplify the programming logic and allow high-speed queries, we propose a schema that differs from that of the Sinica archive. Any relational database engine that follows the SQL92 standard can be used to implement the schema. Among relational database systems, SQLite⁶ is recommended for the following reasons:




⁵ Kavalan is an Austronesian language spoken in Hualien County, east Taiwan.

⁶ http://www.sqlite.org











1. It is lightweight, fast and platform independent.
2. A database is stored in a single file and is thus easy to maintain.
3. It supports UTF-8 encoding.
4. It is free software.


Each Formosan language is placed in its own database and is thus stored in a single file. The schema of every language should be the same, so that a cross-linguistic search can be executed in a single page. It is often argued that a database has to be normalized to the third normal form.⁷ To be realistic, our system is instead designed for the sake of efficiency. The relational diagram of the tables in the database is shown in Figure 4. A full listing of the database schema is given in Appendix B.



⁷ There is a good tutorial about database normalization at http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html











Fig. 4. Relational diagram of tables in the database



The text is mainly stored in Table “iu”. In contrast to the word-based design in the Sinica archive, every intonation unit is stored in one row. For example,

article : pear3
nat     : ...(1.2) ima h-oem-angaw kasna'itol ray kahoey babaw
sim     : . ima homangaw kasnaitol ray kahoy babaw
eng     : . Asp set_a_ladder-AF move_up-AF Loc tree above


For a full-text search, a simple query of “%keyword%” against every field listed above returns the correct results. The simplified spelling is stored for searching among spelling variants. Words in the database are separated by a single space, so that they are easily processed in programs by a single function (explode() in PHP and split() in Python). Places where no gloss is available are occupied by a period (“.”); thus, words and glosses are always aligned across the fields.
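The one-row-per-IU design can be tried out with Python's built-in sqlite3 module. The following is a sketch under assumptions, not the production schema: the column types are simplified to TEXT, the IU number 1 and the empty Chinese gloss are placeholders invented for the demonstration, and the helper names make_demo_db, search and align are hypothetical.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding the example row from the text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE iu (article TEXT, no INTEGER, nat TEXT, sim TEXT, eng TEXT, chn TEXT)"
    )
    conn.execute(
        "INSERT INTO iu VALUES (?, ?, ?, ?, ?, ?)",
        ("pear3", 1,  # the IU number is a placeholder
         "...(1.2) ima h-oem-angaw kasna'itol ray kahoey babaw",
         ". ima homangaw kasnaitol ray kahoy babaw",
         ". Asp set_a_ladder-AF move_up-AF Loc tree above",
         ""),  # Chinese gloss not shown in the example
    )
    return conn


def search(conn: sqlite3.Connection, keyword: str):
    """Full-text search: match %keyword% against every text field."""
    pattern = f"%{keyword}%"
    return conn.execute(
        "SELECT article, no, nat, eng FROM iu "
        "WHERE nat LIKE ? OR sim LIKE ? OR eng LIKE ? OR chn LIKE ?",
        (pattern, pattern, pattern, pattern),
    ).fetchall()


def align(nat: str, eng: str):
    """Pair native words with glosses; both fields are split on single spaces."""
    return list(zip(nat.split(" "), eng.split(" ")))


conn = make_demo_db()
for article, no, nat, eng in search(conn, "kahoy"):
    print(article, no, align(nat, eng))
```

The query for the simplified spelling kahoy finds the row whose native form is kahoey, and splitting the two fields on spaces pairs each word with its gloss because the period placeholder keeps both fields the same length.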

Another specialized data structure is designed in Table “lemma”. In order to search for an affix properly, the stem is marked for every word in the dictionary. The morpheme before the stem is a prefix and the one after it a suffix. For example, Saisiyat kapapama'an 'bicycle' is stored as ka-#papama'#-an in the table. If one looks for a prefix ka- or a suffix -an, one can always obtain the right answer by taking the elements before the first sharp (#) or after the second sharp. Since infixation is simple in the two languages, it is currently analyzed on the fly by external programs.⁸
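The stem-marking convention lends itself to a very small routine. The Python sketch below illustrates the idea under the assumption that every stored form contains exactly two sharps and that affixes are joined by hyphens; the function name split_lemma is hypothetical.

```python
def split_lemma(marked: str):
    """Split a form such as ka-#papama'#-an into (prefixes, stem, suffixes)."""
    # Exactly two sharps are assumed, one on each side of the stem.
    before, stem, after = marked.split("#")
    prefixes = [p for p in before.split("-") if p]
    suffixes = [s for s in after.split("-") if s]
    return prefixes, stem, suffixes


print(split_lemma("ka-#papama'#-an"))
```

Everything before the first sharp is read as prefixes and everything after the second as suffixes, which is the lookup rule stated in the paragraph above.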

3.3 Back-end programs and the POS-tagger

The database maintainer commits a pre-processed transcription into the database through a batch of back-end programs. Commitment is preferably done on the command line, so that mismatches in alignment or failures of automated morphological analysis may be corrected immediately and interactively. A prototype has been implemented to prove the workability of the system. Here is a list of the programs.


features.py
    defines language-specific feature vectors and provides the connection DSN.

simplify.py
    is the common library for reducing spelling variants.

canon.py
    checks input validity, including metadata and text format. It writes the data into the database when the check passes.

extractmorph.py
    defines morphological and discourse codes and extracts them from the texts.

makedict.py
    extracts information from imported texts and updates the dictionary.

mp3splt.py/mpgsplt.py
    splits .mp3 / .mpg files according to the time-file (see below).

tidy.py
    is a utility to convert Chinese punctuation into ASCII and remove unnecessary Microsoft Word mark-ups.

⁸ After the corpus complies with the Leipzig glossing rules, infixation will be marked by < and >.


The coupling of the modules is fairly low. “features.py” and “simplify.py” provide the necessary functions for all programs.

Once texts have been put into the database, they are tagged by a TBL tagger (cf. Lin 2005: Chapter 2), and the dictionary is updated at the same time. When a user looks up a word, the part-of-speech information can be obtained along with its frequency in the corpus. Any time the database maintainer finds an error in the tagged corpus, it can be corrected on-line as immediate feedback to the tagger. The tagger can later be retrained by a single click.

3.4 Unified output interface

For the corpus to be accessible to the public, a unified user-friendly interface is built. The system follows HTML 4.01 (loose) proposed by the World-Wide Web Consortium⁹ and is designed to be viewed with a web browser, because this is one of the major means of accessing data on the Internet. For a dynamic and interactive presentation, the Document Object Model (DOM)¹⁰ and JavaScript 1.2 are preferably used. Popular browsers, such as Internet Explorer 5.0, Mozilla 1.7, Firefox 0.9 and Opera 4, are compliant with these standards. It is important to support major browsers for the purpose of accessibility. Figure 5 is a screen dump of the web site under construction.





⁹ http://www.w3.org/TR/REC-html40/

¹⁰ http://www.w3.org/DOM/











Fig. 5. Screen dump of the web site (under construction)


The interface is composed of the following parts: a window with the informant's photo where movie clips are played, a list of metadata in the upper-left corner, several switches to adjust browsing effects, and a frame at the bottom of the screen that displays the selected article in a format following linguistic convention. A dictionary pops up whenever a user clicks on an unknown word (see Figure 6). The pages are being revised for a better visual effect.

Ethnological notes and examples are preferably given in the dictionary with cross-references. The search interface is simple in design, yet complicated and specialized linguistic queries are still possible. For example, by typing tabatathan a user can find the occurrences of the Kavalan word ta-batad-an; typing 'ahae or aehae results in 'aehae' for Saisiyat, and so on. Interfaces to user-defined functions (UDF) are also kept for further improvement.










Fig. 6. Pop-up dictionary with cross-references


As the bandwidth is quite limited, it is suggested that multimedia data be stored and transferred as 16 kbps, 11 kHz MPEG-1 Layer 3 for audio data and as MPEG-1 for video data.

3.5 Interoperability

It is important to share the corpus with the linguistic community. The Extensible Mark-up Language (XML)¹¹ is a simple and flexible language used to exchange data between different systems. It is now a de facto standard on the web. For researchers of natural language processing to profit easily from our collected data, the corpus should be exportable in XML. The morphological information, gloss and part-of-speech of every word may be output in a uniform manner. An example of the exported format is given below.




¹¹ http://www.w3.org/XML/











<?xml version="1.0" encoding="utf-8" ?>
<article id="pear_imui">
  <topic>Pear Story</topic>
  <language>Kavalan</language>
  <dialect>Xinshe</dialect>
  <speaker>
    <natname>imui</natname>
    <chnname>潘金妹</chnname>
    <gender>F</gender>
    <age-of-record>51</age-of-record>
  </speaker>
  <duration>00:01:15</duration>
  <total-iu>31</total-iu>
  <collected>2003-05-30</collected>
  <revised>2003-11-11</revised>
  <transcriber>葉俞廷</transcriber>
  <transcriber>王以勤</transcriber>
  <doublecheck>鍾曉芳</doublecheck>
  <doublecheck>沈嘉琪</doublecheck>
  <doublecheck>葉俞廷</doublecheck>
  <text>
    <iu id="iu_1">
      <word>
        <nat>tangi</nat>
        <sim>tangi</sim>
        <eng>today</eng>
        <chn>今天</chn>
        <pos>RB</pos>
      </word>
      <word>
        ...
      </word>
    </iu>
    <iu id="iu_2"> ... </iu>
    ...
    <para von="1" bis="4">
      <eng>I just saw a person there ...</eng>
      <chn>我剛剛看到 ...</chn>
      <notes>Some elicitation notes</notes>
    </para>
    ...
  </text>
</article>
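An export routine producing a subset of the format above could look like the following Python sketch, which uses the standard xml.etree.ElementTree module. It is an illustration only: the function name export_article and the shape of its arguments are hypothetical, and the speaker, date and paragraph elements are left out to keep the example short.

```python
import xml.etree.ElementTree as ET


def export_article(article_id: str, meta: dict, ius: list) -> str:
    """Serialize one text to XML (hypothetical helper covering a subset of the format).

    `ius` is a list of IUs; each IU is a list of word dictionaries
    with the keys nat, sim, eng, chn and pos.
    """
    article = ET.Element("article", id=article_id)
    for tag in ("topic", "language", "dialect"):
        ET.SubElement(article, tag).text = meta[tag]
    text = ET.SubElement(article, "text")
    for number, words in enumerate(ius, start=1):
        iu = ET.SubElement(text, "iu", id=f"iu_{number}")
        for fields in words:
            word = ET.SubElement(iu, "word")
            for tag in ("nat", "sim", "eng", "chn", "pos"):
                ET.SubElement(word, tag).text = fields[tag]
    # encoding="unicode" returns a str instead of bytes
    return ET.tostring(article, encoding="unicode")


xml_text = export_article(
    "pear_imui",
    {"topic": "Pear Story", "language": "Kavalan", "dialect": "Xinshe"},
    [[{"nat": "tangi", "sim": "tangi", "eng": "today", "chn": "今天", "pos": "RB"}]],
)
print(xml_text)
```

Building the tree with a library rather than by string concatenation guarantees that the output is well formed and that special characters in glosses or notes are escaped.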

4 Concluding Remarks

The online version of the NTU corpus of Austronesian languages is still under construction, and more texts are to be added. The Leipzig glossing rules will be adopted in the near future. As normalization, accessibility and interoperability are emphasized in the system, it should be useful and helpful for linguists, teachers and even native speakers of Austronesian languages. It is hoped that our work can contribute to the language communities. The system can be extended to process other languages once the proper feature vector is set. The implementation is still at its experimental stage. As Saisiyat and most Formosan languages are on the verge of being endangered, more people are urged to participate in, use and promote the enlargement of the corpora.









Appendix A. Coding Lists




Table 2. Morphological coding list

English code    Chinese code    Description
1SG             1SG             1st person singular
2SG             2SG             2nd person singular
3SG             3SG             3rd person singular
1IPL.NOM        1IPL.主格       1st person plural, Inclusive, Nominative
1EPL.NOM        1EPL.主格       1st person plural, Exclusive, Nominative
1PL             1PL             1st person plural
2PL             2PL             2nd person plural
3PL             3PL             3rd person plural
ACC             受格            Accusative
AF              主焦            Agent Focus
ASP             動貌            Aspect
AUX             助動詞          Auxiliary
BC              BC              Back Channel / Reactive Token
BF              予焦            Benefactive Focus
CAU             使役            Causative
CLF             量詞            Classifier
CLF.HUM         人量詞          Human Classifier
CLF.NHUM        非人量詞        Non-human Classifier
COM             ?               Comitative
COMP            補語詞          Complementizer
COND            條件詞          Conditional Marker
DAT             予格            Dative
DEF             定指            Definite









DET             限定詞          Determiner
DIST            遠距            Distal
DM              DM              Discourse Marker
EXCL            排除            Exclusive
EXIST           存在            Existential
EXPER           經驗            Experiential
FIL             FIL             Pause Filler
[missing]       [missing]       False Start
FUT             未來            Future
GEN             屬格            Genitive
[missing]       工焦            Instrumental Focus
IMP             祈使            Imperative
INCL            包含            Inclusive
INDF            不定指          Indefinite
INS             工具格          Instrument
INT             感嘆            Interjection
INVIS           不可見          Invisible
IRR             非實現          Irrealis
[missing]       處焦            Locative Focus
LNK             連詞            Linker
LOC             處格            Locative
NCM             Ncm             Non-common Name Marker
NEG             否定            Negative
NEU             中性格          Neutral
NMZ             名物化          Nominalizer/Nominalization
NOM             主格            Nominative
NRFUT           即將            Near Future
OBL             斜格            Oblique
[missing]       受焦            Patient Focus










PFV             完成            Perfective
[missing]       人名/地名       proper name/place name
POSS            所有格          Possessive
PROG            進行            Progressive
PROX            近距            Proximal/Proximate
Q               疑問            Question Marker
[missing]       [missing]       Quotative
REC             交互            Reciprocal
RED             重疊            Reduplication
REL             關係詞          Relativizer
REFL            反身            Reflexive
[missing]       指焦            Referential Focus
TOP             主題            Topic
VIS             可見            Visible
VOC             呼格            Vocative
X               X               Uncertain Hearing

mazmun          many.HUM        many (humans)
mwaza           many.NHUM       many (animals)
this            這個
that            那個


Table 3. Discourse coding list (adopted from Du Bois (1993))

Meaning                                 Marker

Units
  Intonation Unit                       ((newline))
  Truncated IU                          --









  Word                                  ((space))
  Truncated word                        -
  Speaker identity / turn start         :
  Speech Overlap                        [ ]

Transitional Continuity
  Final                                 .
  Continuing                            ,
  Appeal                                ?

Terminal Pitch Direction
  Fall                                  \
  Rise                                  /
  Level                                 _

Accent and Lengthening
  Primary accent                        ^
  Secondary accent                      `
  High booster                          !
  Low booster                           ;
  Lengthening                           ==

Tone
  Fall                                  \
  Rise                                  /
  Fall-Rise                             \/









  Rise-fall                             /\
  Level                                 _

Pause
  Long                                  ...(N)
  Medium                                ...
  Short                                 ..
  Latching                              0

Vocal Noises
  Vocal noises                          (CAPITAL LETTERS)
  Inhalation                            (H)
  Exhalation                            (Hx)
  Glottal stop                          %
  Laughter                              @

Quality
  Quality                               <Y Y>
  Laugh quality                         <@ @>
  Quotation quality                     <Q Q>

Phonetics
  Phonetic / phonemic transcription     (/ /)

Transcriber's Perspective
  Researcher's comment                  (( ))
  Uncertain hearing                     <X X>









  Indecipherable syllable               X

Specialized Notations
  Duration                              (N)
  IU boundary                           &
  Accent unit boundary                  |
  Embedded IU                           <| |>
  Restart                               {Capital Initial}
  False start                           < >
  Code switching                        <L2 L2>
  Nontranscription line                 $

Reserved Symbols
  Phonetic / orthographic symbols       '
  Morphosyntactic coding                + * # { }
  User-definable symbols                " ~


Appendix B. Database Schema


Table meta: metadata of text

Field name    Format         Description                               Example
article       varchar(80)    Filename                                  pear_imui
topic         varchar(80)    Text name                                 Pear Story
texttype      varchar(40)    Text style                                narrative
language      varchar(40)    Text language                             Kavalan
dialect       varchar(40)    Dialect or district                       Xinshe
spknat        varchar(80)    Native name of informant                  imui
spkhan        varchar(80)    Chinese name of informant                 潘金妹
spkgdr        char(1)        Gender of informant (M|F)                 F
spkage        integer        Age of informant at time of recording
duration      time           Length of the recording                   00:01:15
totaliu       integer        Number of intonation units
collected     date           Date of recording                         03/5/30
revised       date           Date of last revision                     03/11/11
transcr       blob           Comma separated names of transcribers     A, B, C
dblchk        blob           Names of people who double check the text D, E, F


Table iu: storage of a single intonation unit

Field name    Format         Description
article       varchar(80)    Text name (foreign key of meta.article)
no            integer        IU #
nat           blob           Native words, space separated
sim           blob           Simplified native words
eng           blob           English gloss, space separated
chn           blob           Chinese gloss, space separated











Table para: translation of a block of intonation units

Field name    Format         Description
article       varchar(80)    Text name (foreign key of meta.article)
von           integer        IU# where the block begins
bis           integer        IU# where the block ends
eng           blob           English translation
chn           blob           Chinese translation
note          blob           Elicitation notes (multi-line, separated by two semi-colons)


Table dict: the dictionary

Field name    Format         Description
word          varchar(80)    Word of index (simplified word)
lemma         blob           Word forms, comma separated
eng           blob           English gloss with morphological marks
chn           blob           Chinese gloss with morphological marks
note          blob           Notes (multi-line, separated by two semi-colons)
ex            blob           Example of how the word is used


Table lemma: analysis of a lemma

Field name    Format          Description
word          varchar(80)     Lemma of a word (foreign key to an element of dict.lemma)
lemma         varchar(80)     Prefix-#stem#-suffix
enmorph       varchar(255)    Morphological marks in English, comma separated
zhmorph       varchar(255)    Morphological marks in Chinese, comma separated
enstem        varchar(255)    Sense of the stem in English, comma separated
zhstem        varchar(255)    Sense of the stem in Chinese, comma separated


Table affix: dictionary of affixes

Field name    Format          Description
affix         varchar(20)     Affix (prefix-, -infix-, -suffix)
englos        varchar(255)    Morphological analysis in English
zhglos        varchar(255)    Morphological analysis in Chinese
note          blob            Notes


Table xref: cross-reference of a word

Field name    Format         Description
word          varchar(80)    Word (simplified)
xref          blob           Article:IU#.number_of_word
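As an illustration of how the xref entries could be generated from the rows of Table iu, here is a minimal Python sketch. The function name build_xref is hypothetical, word positions are assumed to be counted from 1, and period placeholders are assumed to be skipped; the format of each entry follows the Article:IU#.number_of_word pattern given above.

```python
from collections import defaultdict


def build_xref(rows):
    """Map each simplified word to its occurrences (hypothetical helper).

    `rows` holds (article, IU number, simplified words) tuples taken from Table iu.
    """
    xref = defaultdict(list)
    for article, no, sim in rows:
        # Assumption: word positions are counted from 1 within the IU.
        for position, word in enumerate(sim.split(" "), start=1):
            if word != ".":  # a period only marks a slot without a word
                xref[word].append(f"{article}:{no}.{position}")
    return dict(xref)


# The IU number 1 is a placeholder; the words come from the pear3 example.
rows = [("pear3", 1, ". ima homangaw kasnaitol ray kahoy babaw")]
xref = build_xref(rows)
print(xref["kahoy"])
```

With such an index, the pop-up dictionary can jump from a word to every IU in which it occurs without scanning the whole corpus.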


References

Chafe, Wallace L., ed. 1980. The pear stories: Cognitive, cultural, and linguistic aspects of narrative production. Norwood, NJ: Ablex Publishing Corp.

Du Bois, J. W. 1993. Talking data: Transcription and coding in discourse research, chapter Outline of discourse transcription, 45-89. Hillsdale, NJ: Lawrence Erlbaum Associates.

Huang, Shuan-Fan, Lily I-wen Su, and Li-May Sung. 2003. Syntax and cognition in SaiSiyat. NSC 93-2411-H-022-094.

Lin, Zhemin. 2005. Automatic processing of languages with small-scaled corpus: Part-of-speech tagging and partial parsing Saisiyat and applications. Master's thesis, National Taiwan University.

Luhn, H. P. 1960. Keyword-in-context index for technical literature (KWIC index). American Documentation 11:288-295.

Mayer, Mercer. 1980. Frog, where are you? NY: Dial Books.

Tao, Hongyin. 1996. Units in Mandarin conversation: Prosody, discourse and grammar. Amsterdam: John Benjamins.

Zeitoun, Elizabeth, Ching-hua Yu, and Cui-xia Weng. 2003. The Formosan language archive: Development of a multimedia tool to salvage the languages and oral traditions of the indigenous tribes of Taiwan. Oceanic Linguistics 42(1):218-232.

Zeitoun, Elizabeth, and Ching-Hua Yu. 2005. The Formosan language archive: Linguistic analysis and language processing. Computational Linguistics and Chinese Language Processing 10(2):167-200.