The Rosetta Project ALL Language Archive


Oct 22, 2013 (4 years and 6 months ago)


The Rosetta Project

ALL Language Archive

A Project of the Long Now Foundation & the National Science Digital Library

Presented by:

Laura Buszard

The Rosetta Project / University
of California, Berkeley

Primary Goals

Support the documentation of the world’s nearly
7000 languages through building

A digital archive of language documentation

A linguistically sophisticated site that is also useful and
interesting for the general public

Networks of speakers, educators, linguists

Contributes to the effort to document endangered languages

Promotes linguistic diversity by educating the public
about languages with small numbers of speakers.

Secondary Goals

Support metadata standardization and



Develop tools for collaborative linguistic research

Endangered Language Query Room

Wordlist Tool

Collaborative document editing/creation (new site)


The Long Now Foundation

Parent organization of The Rosetta Project

Projects and seminars on topics that foster long-term thinking

The National Science Digital Library

U.S. National Science Foundation Program

Goal is to bring online high quality STEM (Science,
Technology, Engineering, and Math) resources for education

Sponsor of Rosetta Project (NSF 333727)

Stanford University

Online and offline storage of Rosetta materials

The Long Now Foundation

The National Science Digital Library

Stanford University Libraries

Project History:

The 1000 Language Archive

Initiated by The Long Now Foundation

Wanted to experiment with new
microetching technology, looking for
suitable content

Decided to collect basic descriptive
information for 1000 of the world’s
approximately 7000 languages

Why language information?

Most natural human languages are products of
millennia of human history (therefore a good
long-term thinking project)

Repositories of cultural information

Languages showcase

Human intellectual sophistication

Cultural diversity

To draw attention to the critical issue of language

The Rosetta Disk

Next generation microfiche

etched 2" nickel disk at
densities of up to 200,000
page images per disk

Developed by Los Alamos
Laboratories and Norsam

Reading the disk requires a
microscope, either optical or
electron, depending on the
density of encoding

The Rosetta Stone

Not us! (196 BC)

Parallel text written in
three scripts:

Hieroglyphic

Demotic (script form)

Greek

The key to deciphering
Egyptian Hieroglyphs

Rosetta Stone



(Also not us!)

Design of the Disk

Original design has human
eye readable text (Genesis
text) and micro-etched text
inside an index

New design has human
readable text (instructions)
on one side and microetched
images on the reverse

In-House Scanning

HP CapShare Scanners

Scan printed page in
multiple passes, any

Page is ‘assembled’ into
one image

Stores about 50 pages at a
time (300 dpi bitmap .tif)

Uploads numerically
sequenced images to
computer by infrared port

In-House Scanning

Minolta PS 7000 Overhead Scanner

Bitmap and grayscale scans up
to 600 dpi

Multiple sizes, orientations

Single page / double page
spread (good for text
collections with verso

Best for fragile books,
manuscripts that would be
damaged by hand scanning

Categories of Collection (1)

Ethnologue Description

General information from the Ethnologue about
language affiliation, where spoken, number of speakers,
dialects, alternate language names.

General Description

General description of the language. Origin and current
distribution of language, number of speakers, family,
typology, history, etc.

Maps

Maps of the geographic distribution of a language and its
relationship to other languages in the region.

Orthography

Writing system(s) of the language with any
accompanying guide to pronunciation, use, etc.

Phonology

A description of the basic sound units in a language
(phonemes) and how they combine to form utterances.

Categories of Collection (2)

Grammar

How a language combines the smallest units of meaning
(morphemes) to create words and words to create
sentences.

Core Word List

A common word list of 100 or 200 terms typically
collected in linguistic fieldwork (“Swadesh Lists”), often
used for comparative purposes.

Numbering System

A description of the numbering system(s) in a language
with a list of basic terms.

Parallel Texts

A common text with translation for each language.
Initially Genesis Chapters 1-3 (a commonly collected
text). Now also the UN Declaration of Human Rights.

Glossed Texts

Transcribed indigenous texts with word glosses, free
translations and grammatical markup.

Language Curation

Ethnologue Description

Grammar (1167)

General Description (1651)

Core Word Lists (3098)

Maps (376)

Numbering Systems (215)

Orthography (1052)

Main Parallel Texts (1109)

Phonology (1731)

Glossed Vernacular Texts (869)

Rosetta Project Web Site


Search for a language

Language overview page

Browse (by name, family, country)

Wordlist tool



Language Overview



Endangered Language Query Rooms

Digital Online Curation Services for
Endangered Language Archives

Wordlist Tool


Endangered Language Query Rooms

Query Room Virtual Keyboard

Potawatomi Query Room

Re: Bozho

by Donald Perrot (host) on July 9 2004, 8:53 PM

Nmedagwe'ndan e'gi nebye'ge'yen ngom. Neaseno ndesh ne kas ge' nin, mine E'shkanabe'
e'nda ge' nin. I like what you have written. I am called Neaseno (Southwind) myself, and I
live in Escanaba, MI.

Re: Bozho

by Justin Neely on September 7 2004, 1:16 PM

Bozho Neaseno mine Lameen Zagnenibi ndeznekas. Nishnabe ndaw ipi Bodewadmi
ndaw. Shi shi ban nee yek ndebendagwes. Zego ndotem. Kansas City,Mo ndoch bya.
Eskanabe edayen ge nin. Bama pi ngom Zagnenibi ndeznekas

[Hello Neaseno and Lameen, my name is Zagnenibi. I’m Native and Potawatomi. I belong
to the Citizen Band. I’m Crane Clan. I’m from Kansas City, Missouri. I also live in
Escanaba. Bye for now, Zagnenibi.]

Taking Conversational Risks

by [TL] on July 17 2004, 10:30 AM

mbesuk onago ngi zhyamen . nseze wgi bye tot i jiman ewi nepamshkamen be gishek.
wabek nuwi zhya men ibe eje shna mbesuk . ngi wabmak gode chemokmanuk demojgewat.
wabek nin gezhe ni demojgeyan gnebech. bama mine mtego

[I went to the lake yesterday. My brother brought a canoe so we could float around all day.
Tomorrow we’ll go there to the lake. I saw the white folks fishing. Tomorrow I’ll fish too,
maybe. So long for now, Mtego.]

Re: onago egi zhejkeyak

by [JN] on July 17 2004, 8:12 PM

mbesek ndazhya ngom. Mbish ksenyak shode. Nedwendan ode Mbish gshatek. Megwa
Nwinebyege ode bodewadmi kiktowenen bama. Megwetch Zagnenibi nin se

[I should go to the lake today. The water is cold here. I wish the water were warm. I’ll
write more of this Potawatomi conversation later. Thanks, yours truly Zagnenibi.]

Factors in query room success



Speech community


~25 native

Robust use

On Nias



Indonesia, West

US, Ontario

Internet access

Only in diaspora


Online community



Rooms requested

By speaker

By speaker

DOCS Project

Digital Online Curation
Services for Endangered
Language Archives

Many small language
archives are beginning to
digitize their materials

Lack technical
infrastructure to bring
resources online

Goal is to provide access
through Rosetta

DOCS Project Archives

Endangered Language Fund (ELF)

Survey for California and Other American
Indian Languages (SCOIL)

The Alaska Native Languages Center

Max Planck Institute for Evolutionary
Anthropology (Leipzig)

Wordlist Tool

Swadesh lists (100, 200, 207 terms) from:

Tryon's Comparative Austronesian Dictionary (rekeyed)

Tim Usher's Indo-Pacific database (2002 version)

Paul Whitehouse's Australian and New Guinea database (2002)

George Starostin's Dravidian database

Ilya Peiros' Mon-Khmer database

Total of 1,384 languages, 3,090 lists online

Additional 3000 lists, up to 1850 terms per
list, most 300-500 words in length.
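
Since the lists above are meant for comparative use, a small sketch may help show how they can be lined up by concept. This is an illustration only, not the Wordlist Tool's actual code: the `WORDLISTS` table and the `align` function are hypothetical names, and the few English, German and Dutch forms are sample data rather than entries taken from the archive.

```python
# Sample data for illustration; not drawn from the Rosetta archive.
# Each language maps a concept gloss to the form recorded for it.
WORDLISTS = {
    "English": {"water": "water", "stone": "stone", "hand": "hand", "two": "two"},
    "German": {"water": "Wasser", "stone": "Stein", "hand": "Hand", "two": "zwei"},
    "Dutch": {"water": "water", "stone": "steen", "hand": "hand"},
}


def align(wordlists, languages):
    """Return one row per concept attested in every requested language.

    Each row is (concept, form_1, form_2, ...) in the order of `languages`.
    Concepts missing from any one list are skipped.
    """
    shared = set.intersection(*(set(wordlists[lang]) for lang in languages))
    return [
        (concept,) + tuple(wordlists[lang][concept] for lang in languages)
        for concept in sorted(shared)
    ]


rows = align(WORDLISTS, ["English", "German", "Dutch"])
for row in rows:
    print("\t".join(row))
```

Keying every list by concept rather than by position is one way to cope with lists of different lengths (100, 200, 207 or more terms), since only the overlap is compared.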


A linguistic “Wayback Machine”

Language resource location and aggregation

Use alternate language names, spellings

Deutsch, Hochdeutsch, High German, Allemande

Fadicca, Fadicha, Fedija, Fadija, Fiadidja, Fiyadikkya, and Fedicca

Character identification (inventory, distribution)

Dera (Chadic, Nigeria)

Dera (Trans-New Guinea, Indonesia)

Seed crawler with Wordlist terms (see previous slide),
weighted towards longer terms

Archiving through Internet Archive

Serve results through the Rosetta site
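
The two ideas on this slide, looking a language up under any of its alternate names and seeding a crawler with wordlist terms weighted towards longer terms, can be sketched roughly as follows. Everything here is an assumption made for illustration: the function names, the use of term length as the weight, and the minimum length cutoff are not described in the source. The German aliases come from the slide; the sample terms are words from the Potawatomi posts shown earlier, standing in for real wordlist entries.

```python
import random

# Alternate names from the slide; the table layout is hypothetical.
ALIASES = {
    "German": ["Deutsch", "Hochdeutsch", "High German", "Allemande"],
}


def canonical_name(name, aliases=ALIASES):
    """Map any known alternate name or spelling to its canonical name."""
    wanted = name.casefold()
    for canonical, alternates in aliases.items():
        if wanted == canonical.casefold():
            return canonical
        if wanted in (alt.casefold() for alt in alternates):
            return canonical
    return None


def seed_weights(terms, min_len=3):
    """Weight each distinct term by its length (an assumed weighting)."""
    return {t: len(t) for t in set(terms) if len(t) >= min_len}


def pick_seeds(terms, k, rng=None):
    """Draw up to k distinct seed terms, favouring longer ones."""
    rng = rng or random.Random()
    weights = seed_weights(terms)
    pool = sorted(weights)
    chosen = []
    while pool and len(chosen) < k:
        pick = rng.choices(pool, weights=[weights[t] for t in pool], k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    return chosen


# Stand-in terms; a real run would use entries from the wordlists.
TERMS = ["mbish", "jiman", "ndeznekas", "se"]
print(canonical_name("hochdeutsch"))
print(pick_seeds(TERMS, 2))
```

Longer terms are less likely to collide with words in unrelated languages, which is presumably why the slide weights towards them. A name such as Dera, shared by two unrelated languages, would still need the character identification step to tell them apart.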


Electronic Metastructure for Endangered
Languages Data (E-MELD)

General Ontology for Linguistic Description

Open Language Archives Community


Electronic Metastructure for Endangered
Languages Data

School of Best Practice

Guidelines and examples for putting linguistic data
into best practice digital formats

XML with XML Schema or DTD

Mapping terminology to ontology (GOLD)

FIELD lexical database tool

Online collaborative tool to build linguistic
dictionaries, backed by ontology (GOLD)


General Ontology for Linguistic Description

Built in OWL (Web Ontology Language), linked to
SUMO (Suggested Upper Merged Ontology)

Best practice resources should include a mapping
between the researcher’s terms, and a standard set,
known as the ‘profile’

‘independent’ (mine) = ‘main clause’ (GOLD)

‘obviative’ (mine) = ‘fourth person’ (GOLD)

The standard terminology set can then allow
sophisticated searches across disparate resources.
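
A rough sketch of how such a profile could support searching across resources is given below. It is not GOLD tooling: the resource records, the `find` function and the second resource's term 'matrix clause' are hypothetical, and only the two mappings for the first resource come from the slide.

```python
# Each resource carries its own profile: researcher term -> standard term.
# grammar_a uses the mappings shown on the slide; grammar_b is invented
# here to show a second researcher using different terminology.
RESOURCES = [
    {
        "name": "grammar_a",
        "profile": {"independent": "main clause", "obviative": "fourth person"},
        "tags": ["independent", "obviative"],
    },
    {
        "name": "grammar_b",
        "profile": {"matrix clause": "main clause"},
        "tags": ["matrix clause"],
    },
]


def find(resources, standard_term):
    """Return names of resources that use a term mapped to standard_term."""
    hits = []
    for resource in resources:
        profile = resource["profile"]
        # Translate the resource's own tags; unmapped tags pass through.
        standard = {profile.get(tag, tag) for tag in resource["tags"]}
        if standard_term in standard:
            hits.append(resource["name"])
    return hits


print(find(RESOURCES, "main clause"))
print(find(RESOURCES, "fourth person"))
```

A query for the standard term reaches both resources even though neither researcher used that label, which is the kind of cross-resource search the profile is meant to allow.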

GOLD Community Model


Open Language Archives Community

Set of 23 metadata elements and controlled
vocabularies (based on Dublin Core)

Subject.language (language described, rather than audience
language) uses SIL language codes

Type.linguistic (grammar, lexicon, text)

IMDI (ISLE Metadata Initiative) has 135 elements

Recommended extensions (Discourse Types, Linguistic
Field, Participant roles)

Enables searches across a network of archives that use
OLAC metadata
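
To make the idea of searching across archives concrete, here is a minimal sketch that models just the two elements named above. It is not the OLAC schema or harvesting protocol: the `Record` class, its field names, the archive names and the titles are all hypothetical, and the code "pot" (the ISO 639-3 code for Potawatomi) stands in for whichever SIL code a real record would carry.

```python
from dataclasses import dataclass

# The three values listed on the slide for Type.linguistic.
LINGUISTIC_TYPES = {"grammar", "lexicon", "text"}


@dataclass(frozen=True)
class Record:
    """A pared-down metadata record; real records have many more elements."""
    title: str
    subject_language: str  # language described, not the audience language
    linguistic_type: str
    archive: str

    def __post_init__(self):
        # Enforce the controlled vocabulary at construction time.
        if self.linguistic_type not in LINGUISTIC_TYPES:
            raise ValueError(f"unknown linguistic type: {self.linguistic_type}")


def search(records, subject_language, linguistic_type=None):
    """Filter records from any number of archives by shared metadata."""
    return [
        r for r in records
        if r.subject_language == subject_language
        and (linguistic_type is None or r.linguistic_type == linguistic_type)
    ]


# Hypothetical records from two hypothetical archives.
RECORDS = [
    Record("Sketch grammar", "pot", "grammar", "archive_one"),
    Record("Word list", "pot", "lexicon", "archive_two"),
    Record("Story collection", "xxx", "text", "archive_two"),
]

for record in search(RECORDS, "pot"):
    print(record.archive, record.title)
```

Because every archive describes its holdings with the same elements and the same controlled values, one filter works over all of them without knowing anything about each archive's internal catalogue.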


Electronic Metastructure for Endangered Languages Data (E-MELD)

(School of Best Practice, FIELD Tool).

Endangered Language Query Rooms

The Ethnologue

General Ontology for Linguistic Description (GOLD)

ISLE MetaData Initiative (IMDI)

National Science Digital Library (NSDL)

Open Language Archives Community (OLAC)

The Rosetta Project,
. A preview of the
new Web site (currently under construction) is available at


This project is funded by the US National
Science Digital Library (NSF 333727)