The Rosetta Project ALL Language Archive

A Project of the Long Now Foundation & A National Science Digital Library

www.rosettaproject.org

Presented by:

Laura Buszard-Welcher

The Rosetta Project / University of California, Berkeley

Primary Goals


Support the documentation of the world’s nearly 7000 languages by building:


A digital archive of language documentation


A linguistically sophisticated site that is also useful and interesting for the general public


Networks of speakers, educators, and linguists


Contribute to the effort to document endangered languages


Promote linguistic diversity by educating the public about languages with small numbers of speakers

Secondary Goals


Support metadata standardization and
interoperability


OLAC


EMELD


Develop tools for collaborative linguistic research


Endangered Language Query Room


Wordlist Tool


Collaborative document editing/creation (new site)

Roles


The Long Now Foundation


Parent organization of The Rosetta Project


Projects and seminars on topics that foster long-term thinking


The National Science Digital Library


U.S. National Science Foundation Program


Goal is to bring high-quality STEM (Science, Technology, Engineering, and Math) educational resources online


Sponsor of Rosetta Project (NSF 333727)


Stanford University


Online and offline storage of Rosetta materials

The Long Now Foundation

The National Science Digital
Library

Stanford University Libraries

Project History:

The 1000 Language Archive


Initiated by The Long Now Foundation


Wanted to experiment with new microetching technology and was looking for suitable content


Decided to collect basic descriptive information for 1000 of the world’s approximately 7000 languages

Why language information?


Most natural human languages are products of millennia of human history (and therefore a good long-term thinking project)


Repositories of cultural information


Languages showcase


Human intellectual sophistication


Cultural diversity


To draw attention to the critical issue of language
endangerment

The Rosetta Disk



Next generation microfiche


Micro-etched 2" nickel disk at densities of up to 200,000 page images per disk


Developed by Los Alamos Laboratories and Norsam Technologies


Reading the disk requires a microscope, either optical or electron, depending on the density of encoding

The Rosetta Stone


Not us! (196 BC)


Parallel text written in
three scripts:


Hieroglyphic


Demotic (script form)


Greek


The key to deciphering
Egyptian Hieroglyphs

Rosetta Stone Language Learning Software


(Also not us!)

Design of the Disk


Original design has human-eye readable text (Genesis text) and micro-etched text inside an index


New design has human-eye readable text (instructions) on one side and microetched images on the reverse

In-House Scanning


HP CapShare Scanners


Scan printed page in
multiple passes, any
direction


Page is ‘assembled’ into
one image


Stores about 50 pages at a
time (300 dpi bitmap .tif)


Uploads numerically
sequenced images to
computer by infrared port

In-House Scanning


Minolta PS 7000 Overhead


Bitmap and grayscale scans up
to 600 dpi


Multiple sizes, orientations


Single page / double page
spread (good for text
collections with verso
annotations)


Best for fragile books,
manuscripts that would be
damaged by hand scanning

Categories of Collection (1)

Ethnologue description: General information from www.ethnologue.com about language affiliation, where spoken, number of speakers, dialects, and alternate language names.

General description: General description of the language. Origin and current distribution of the language, number of speakers, family, typology, history, etc.

Maps: Maps of the geographic distribution of a language and its relationship to other languages in the region.

Orthography: Writing system(s) of the language with any accompanying guide to pronunciation, use, etc.

Phonology: A description of the basic sound units in a language (phonemes) and how they combine to form utterances.

Categories of Collection (2)

Grammar: How a language combines the smallest units of meaning (morphemes) to create words, and words to create sentences.

Core Word Lists: A common word list of 100 or 200 terms typically collected in linguistic fieldwork (“Swadesh Lists”), often used for comparative purposes.

Numbers: A description of the numbering system(s) in a language with a list of basic terms.

Parallel Texts: A common text with translation for each language. Initially Genesis Chapters 1-3 (a commonly collected text). Now also the UN Universal Declaration of Human Rights.

Glossed Texts: Transcribed indigenous texts with word glosses, free translations, and grammatical markup.

Language Curation

Ethnologue Description

Grammar (1167)

General Description (1651)

Core Word Lists (3098)

Maps (376)

Numbering Systems (215)

Orthography (1052)

Main Parallel Texts (1109)

Phonology (1731)

Glossed Vernacular Texts (869)

Rosetta Project Web Site


Welcome


Search for a language


Language overview page


Browse (by name, family, country)


Wordlist tool

Welcome

Search

Language Overview

Browse

Projects


Endangered Language Query Rooms


Digital Online Curation Services for
Endangered Language Archives
(DOCS)


Wordlist Tool


LangGator

Endangered Language Query Rooms

http://emeld.rosettaproject.org/

Query Room Virtual Keyboard

Potawatomi Query Room

Re: Bozho

by Donald Perrot (host) on July 9 2004, 8:53 PM

Nmedagwe'ndan e'gi nebye'ge'yen ngom. Neaseno ndesh ne kas ge' nin, mine E'shkanabe'
e'nda ge' nin. I like what you have written. I am called Neaseno (Southwind) myself, and I
live in Escanaba, MI.


Re: Bozho

by Justin Neely on September 7 2004, 1:16 PM

Bozho Neaseno mine Lameen Zagnenibi ndeznekas. Nishnabe ndaw ipi Bodewadmi
ndaw. Shi shi ban nee yek ndebendagwes. Zego ndotem. Kansas City,Mo ndoch bya.
Eskanabe edayen ge nin. Bama pi ngom Zagnenibi ndeznekas


[Hello Neaseno and Lameen, my name is Zagnenibi. I’m Native and Potawatomi. I belong
to the Citizen Band. I’m Crane Clan. I’m from Kansas City, Missouri. I also live in
Escanaba. Bye for now, Zagnenibi.]

Taking Conversational Risks

by [TL] on July 17 2004, 10:30 AM

mbesuk onago ngi zhyamen . nseze wgi bye tot i jiman ewi nepamshkamen be gishek.
wabek nuwi zhya men ibe eje shna mbesuk . ngi wabmak gode chemokmanuk demojgewat.
wabek nin gezhe ni demojgeyan gnebech. bama mine mtego


[I went to the lake yesterday. My brother brought a canoe so we could float around all day.
Tomorrow we’ll go there to the lake. I saw the white folks fishing. Tomorrow I’ll fish too,
maybe. So long for now, Mtego.]


Re: onago egi zhejkeyak

by [JN] on July 17 2004, 8:12 PM

mbesek ndazhya ngom. Mbish ksenyak shode. Nedwendan ode Mbish gshatek. Megwa
Nwinebyege ode bodewadmi kiktowenen bama. Megwetch Zagnenibi nin se


[I should go to the lake today. The water is cold here. I wish the water were warm. I’ll
write more of this Potawatomi conversation later. Thanks, yours truly Zagnenibi.]

Factors in query room success

                   Nias               Potawatomi
Speech community   500,000            ~25 native
Robust use         On Nias            Nowhere
Diaspora           Indonesia, West    US, Ontario
Internet access    Only in diaspora   US-normal
Online community   Preexisting        Preexisting
Rooms requested    By speaker         By speaker

DOCS Project


Digital Online Curation
Services for Endangered
Language Archives


Many small language
archives are beginning to
digitize their materials


Lack technical
infrastructure to bring
resources online


Goal is to provide access
through Rosetta

DOCS Project Archives


Endangered Language Fund (ELF)


Survey of California and Other Indian Languages (SCOIL)


The Alaska Native Language Center (ANLC)


Max Planck Institute for Evolutionary Anthropology (Leipzig)

Wordlist Tool


Swadesh lists (100, 200, 207 terms) from:


Tryon's Comparative Austronesian Dictionary (rekeyed)


Tim Usher's Indo-Pacific database (2002 version)


Paul Whitehouse's Australian and New Guinea database (2002 version)


George Starostin's Dravidian database


Ilya Peiros' Mon Khmer database


Total of 1,384 languages, 3,090 lists online


Additional 3000 lists, up to 1850 terms per list, most 300-500 words in length.
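Core word lists are often used for comparative purposes, and a small sketch may help show what a first-pass comparison could look like. The example below is not the Wordlist Tool's method; it swaps in a plain normalized edit distance over orthographic forms, which is only a rough proxy for real comparative work. The function names and the tiny English/German sample are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1.0 means identical forms."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def compare_lists(list_a: dict, list_b: dict) -> dict:
    """Score every meaning that appears in both word lists."""
    shared = sorted(set(list_a) & set(list_b))
    return {m: similarity(list_a[m].lower(), list_b[m].lower()) for m in shared}


# Illustrative sample keyed by meaning (not taken from the archive).
english = {"hand": "hand", "water": "water", "night": "night"}
german = {"hand": "Hand", "water": "Wasser", "night": "Nacht"}

scores = compare_lists(english, german)
for meaning, score in scores.items():
    print(f"{meaning}: {score:.2f}")
```

Comparing spellings rather than phonemic transcriptions is a deliberate simplification here; lists transcribed in a shared phonetic alphabet would make such scores more meaningful.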

LangGator


A linguistic “Wayback Machine”


Language resource location and aggregation


Use alternate language names, spellings


Deutsch, Hochdeutsch, High German, Allemande


Fadicca, Fadicha, Fedija, Fadija, Fiadidja, Fiyadikkya, and Fedicca


Character identification (inventory, distribution)


Dera (Chadic, Nigeria)


Dera (Trans-New Guinea, Indonesia)


Seed crawler with Wordlist terms (see previous slide), weighted towards longer terms


Archiving through Internet Archive


Serve results through the Rosetta site
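Two of the ideas on this slide can be sketched in a few lines: resolving alternate names (where one name such as Dera may point to more than one language) and choosing crawler seed terms with a bias toward longer words. This is a minimal sketch under assumptions, not LangGator's implementation: the function names, the "German" label, and the length-proportional weighting are all hypothetical, and the seed terms are simply words that appear in the Potawatomi query room posts quoted earlier.

```python
import random

# Names on the left come from the slide; the labels on the right are
# hypothetical identifiers chosen for this example.
ALTERNATE_NAMES = {
    "Deutsch": ["German"],
    "Hochdeutsch": ["German"],
    "High German": ["German"],
    "Allemande": ["German"],
    "Dera": ["Dera (Chadic, Nigeria)", "Dera (Trans-New Guinea, Indonesia)"],
}


def candidates(name, index=ALTERNATE_NAMES):
    """Return every language a name or spelling might refer to."""
    folded = {key.casefold(): langs for key, langs in index.items()}
    return folded.get(name.casefold(), [])


def seed_weights(terms):
    """Weight each term by its length so longer terms are favored."""
    total = sum(len(t) for t in terms)
    return {t: len(t) / total for t in terms}


def pick_seeds(terms, k, rng):
    """Draw k seed terms (with replacement), biased toward longer terms."""
    weights = seed_weights(terms)
    return rng.choices(terms, weights=[weights[t] for t in terms], k=k)


terms = ["mbish", "jiman", "bodewadmi", "nishnabe"]
weights = seed_weights(terms)
seeds = pick_seeds(terms, k=3, rng=random.Random(0))

print(candidates("dera"))   # two candidates: the name alone is ambiguous
print(weights)
print(seeds)
```

An ambiguous name returns several candidates, which is where a further signal such as the character inventory of a page could help decide between them.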

Collaborations


Electronic Metastructure for Endangered Languages Data (E-MELD)


General Ontology for Linguistic Description (GOLD)


Open Language Archives Community (OLAC)

E-MELD


Electronic Metastructure for Endangered Languages Data


School of Best Practice
http://emeld.org/school/index.html


Guidelines and examples for putting linguistic data
into best practice digital formats


XML with XML Schema or DTD


Mapping terminology to ontology (GOLD)


FIELD lexical database tool
http://emeld.org/tools/field/beta/


Online collaborative tool to build linguistic
dictionaries, backed by ontology (GOLD)
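To make the idea of putting lexical data into an XML format concrete, here is a minimal sketch that writes one lexicon entry as XML and reads it back. The element and attribute names (`lexicon`, `entry`, `form`, `gloss`, `language`, `lang`) are hypothetical and are not an E-MELD or FIELD schema; the single entry uses the greeting and translation that appear in the Potawatomi query room posts. Python's standard library does not validate against an XML Schema or DTD, so that step is left out.

```python
import xml.etree.ElementTree as ET


def build_lexicon(language, entries):
    """Build a small XML lexicon from (form, gloss) pairs."""
    root = ET.Element("lexicon", {"language": language})
    for form, gloss in entries:
        entry = ET.SubElement(root, "entry")
        ET.SubElement(entry, "form").text = form
        # The gloss language is recorded so other gloss languages can be added.
        ET.SubElement(entry, "gloss", {"lang": "en"}).text = gloss
    return root


root = build_lexicon("Potawatomi", [("bozho", "hello")])
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)

# Round trip: parse the serialized text and recover the form/gloss pairs.
parsed = ET.fromstring(xml_text)
glosses = {e.findtext("form"): e.findtext("gloss") for e in parsed.iter("entry")}
print(glosses)
```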

GOLD


General Ontology for Linguistic Description


Built in OWL (Web Ontology Language), linked to
SUMO (Suggested Upper Merged Ontology)


Best practice resources should include a mapping, known as the ‘profile’, between the researcher’s terms and a standard set


‘independent’ (mine) = ‘main clause’ (GOLD)


‘obviative’ (mine) = ‘fourth person’ (GOLD)


The standard terminology set can then allow
sophisticated searches across disparate resources.
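The profile idea can be illustrated with a small sketch: each resource carries its own mapping from local terms to the standard set, and a search is run over the standard terms. The two term pairs come from the slide; the resource names, the data layout, and the function names are assumptions made for the example and do not reflect how GOLD tooling is actually built.

```python
def to_standard(term, profile):
    """Translate a researcher's term to the standard term, if mapped."""
    return profile.get(term, term)


def search(resources, standard_term):
    """Return names of resources that use a term, under any local label."""
    hits = []
    for res in resources:
        standard = {to_standard(t, res["profile"]) for t in res["terms"]}
        if standard_term in standard:
            hits.append(res["name"])
    return hits


# Hypothetical resources: A uses local terminology plus a profile,
# B already uses the standard terminology, C uses neither term.
resources = [
    {"name": "Resource A",
     "terms": ["obviative", "independent"],
     "profile": {"obviative": "fourth person", "independent": "main clause"}},
    {"name": "Resource B", "terms": ["fourth person"], "profile": {}},
    {"name": "Resource C", "terms": ["plural"], "profile": {}},
]

print(search(resources, "fourth person"))
```

Searching for the standard term finds Resource A even though its author wrote 'obviative', which is the cross-resource search the profile is meant to enable.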

GOLD Community Model

OLAC


Open Language Archives Community


Set of 23 metadata elements and controlled
vocabularies (based on Dublin Core)


Subject.language (language described, rather than audience
language) uses SIL language codes


Type.linguistic (grammar, lexicon, text)


IMDI (ISLE MetaData Initiative) has 135 elements


Recommended extensions (Discourse Types, Linguistic Field, Participant roles)


Enables searches across a network of archives that use OLAC metadata
http://www.language-archives.org/tools/search/
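As a rough illustration of why shared metadata elements make cross-archive search possible, the sketch below filters records from two archives on the two elements named above. The records, archive names, and titles are invented for the example, the language code values are placeholders, and real OLAC records are XML documents harvested from each archive rather than Python dictionaries.

```python
# Hypothetical records; only the element names come from the slide.
RECORDS = [
    {"archive": "Archive A", "title": "Sketch grammar",
     "Subject.language": "pot", "Type.linguistic": "grammar"},
    {"archive": "Archive B", "title": "Field word list",
     "Subject.language": "pot", "Type.linguistic": "lexicon"},
    {"archive": "Archive B", "title": "Collected stories",
     "Subject.language": "nia", "Type.linguistic": "text"},
]


def search(records, language=None, linguistic_type=None):
    """Filter records on the language described and the resource type."""
    hits = []
    for record in records:
        if language is not None and record.get("Subject.language") != language:
            continue
        if linguistic_type is not None and record.get("Type.linguistic") != linguistic_type:
            continue
        hits.append(record)
    return hits


for hit in search(RECORDS, language="pot"):
    print(hit["archive"], "-", hit["title"])
```

Because every archive fills in the same elements with values from the same controlled vocabularies, one query can be answered from all of them without knowing how each archive stores its data internally.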

URLs


Electronic Metastructure for Endangered Languages Data (E-MELD)
http://www.emeld.org
(School of Best Practice, FIELD Tool)


Endangered Language Query Rooms
http://rosettaproject.org:8080/emeldbase/


The Ethnologue
http://www.ethnologue.com


General Ontology for Linguistic Description (GOLD)
http://www.linguistics-ontology.org


ISLE MetaData Initiative (IMDI)
http://www.mpi.nl/IMDI/


National Science Digital Library (NSDL)
http://nsdl.org


Open Language Archives Community (OLAC)
http://www.language-archives.org


The Rosetta Project
http://www.rosettaproject.org/live
A preview of the new Web site (currently under construction) is available at
http://preview.rosettaproject.org

Credits


This project is funded by the US National
Science Digital Library (NSF 333727)