Linguistic concerns in teaching with language corpora

Natalie Kübler and Pierre-Yves Foucou (Paris, France)

We show how the Web-based environment we developed for language teaching is currently
being adapted and extended. We present the implications from a linguistic point of view
and show how they lead to a strongly pedagogical approach to teaching linguistics and,
more generally, to multilingual access to natural language processing.

Goal & issues

Our goal is to provide access to various multilingual resources that can be used for
different purposes. Teachers and researchers must be provided with corpora and corpus
tools, as well as CALL tools. Our department offers several types of courses and syllabuses:
English for specific purposes, specialized translation, and language engineering applications
or cultural studies.

English as a specialist or general language is already being taught using our Web
environment: students have access to sets of concordances accompanied by corpus-based
grammatical remarks; they can then access corpus-based exercises related to the grammar
points. While working with the environment and building pedagogical material, we uncovered
specific linguistic needs, and we are gradually furnishing a linguistic description of the p[...]
needed. This led us to use the gaps and drawbacks of our tools to teach linguistic and NLP
issues to our second- and third-level students.

Data & Processing

The tools embedded in the environment are robust enough to deal with large corpora and
various types of texts. Users access the corpora and tools on a Web site. Current corpora are
monolingual and bilingual in French, English, and Spanish. They are extracted from
newspapers, as well as from various Web-based documents in specialized subject areas.
Corpus query tools consist of a concordancer and a tokenizer that can sort words by frequency
order, both using Perl-like regular expressions; advanced users can also have access to our
exercise-generating tools. The corpora accessible to students are tagged without any
disambiguation. Students can thus discover the issues at stake in linguistic analysis, which
leads them to NLP issues. Some examples:

Sentence segmentation: Students are asked to think of possible sentence definitions and
then to test their hypotheses. Known problems are abbreviations, symbols and special
characters, capital letters, punctuation, etc. (Mr., e.g., .38).

Tokenization: Students work on word boundaries and the role of spaces, hyphens, and
apostrophes in multi-word units and sentence components.

Morphology: Lemmatization and word formation are dealt with.

Syntax: Students are confronted with POS ambiguity; they must study syntactic structures
and design sequences allowing the system to disambiguate words.

Translation problems: Using aligned corpora, students work on lexical or structural
ambiguities between two languages, or on collocations and idioms.
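To illustrate the sentence segmentation example above, here is a minimal Python sketch of how two competing sentence definitions might be tested against the known problem cases. It is not the environment's own tool: the function names, the short abbreviation list, and the sample sentence are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical abbreviation list (an assumption); a real one would be far longer.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}


def naive_split(text):
    """Hypothesis 1: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def refined_split(text):
    """Hypothesis 2: same end marks, but never after a known abbreviation,
    and only when the next token starts with a capital letter (or text ends)."""
    sentences, current = [], []
    tokens = text.split()
    for i, token in enumerate(tokens):
        current.append(token)
        following = tokens[i + 1] if i + 1 < len(tokens) else ""
        ends_with_mark = token[-1] in ".!?"
        is_abbreviation = token.lower() in ABBREVIATIONS
        next_is_start = not following or following[0].isupper()
        if ends_with_mark and not is_abbreviation and next_is_start:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


# Hypothetical test sentence containing an abbreviation, "e.g." and ".38".
sample = "Mr. Smith bought a .38 revolver, e.g. for sport. He never used it."

print(naive_split(sample))    # the first hypothesis over-segments
print(refined_split(sample))  # the second hypothesis keeps two sentences
```

Comparing the two outputs shows where the first definition fails, which is the kind of hypothesis testing described above. The refined rule is still only a heuristic and would break on other inputs, for instance an abbreviation at the very end of a sentence.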

Linguistics exercises are also generated with our specific tools. Working with tagged
corpora allows us to generate gap-filling exercises on POS tagging, for example.
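A gap-filling exercise on POS tagging could be derived from a corpus tagged without disambiguation roughly as in the following Python sketch. The data format (each token paired with all of its candidate tags), the tag names, and the function `make_gap_exercise` are assumptions made for this illustration and do not describe the actual exercise-generating tools.

```python
# Hypothetical tagged sentence: every token carries all of its candidate tags,
# because no disambiguation has been applied (format and tags are assumptions).
tagged = [
    ("the", ["DET"]),
    ("students", ["NOUN"]),
    ("work", ["NOUN", "VERB"]),
    ("on", ["PREP", "ADV"]),
    ("a", ["DET"]),
    ("project", ["NOUN", "VERB"]),
]


def make_gap_exercise(tagged_sentence):
    """Replace the tag of each ambiguous token with a numbered gap.

    Returns the exercise text and a dict mapping gap numbers to the
    candidate tags the student must choose from.
    """
    parts, choices = [], {}
    for word, tags in tagged_sentence:
        if len(tags) > 1:
            number = len(choices) + 1
            parts.append(f"{word}/__({number})__")
            choices[number] = sorted(tags)
        else:
            parts.append(f"{word}/{tags[0]}")
    return " ".join(parts), choices


text, choices = make_gap_exercise(tagged)
print(text)
for number, tags in choices.items():
    print(f"({number}) choose one of: {', '.join(tags)}")
```

In this sketch only ambiguous tokens become gaps, so the exercise focuses the student on exactly the words a tagger would have to disambiguate.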

Results in the classroom

Using corpora in the teaching of linguistics prompts enthusiastic reactions among the
students: because they can test hypotheses and obtain results, they grasp the problems
much more quickly. They can also see when and why the analysis they give is wrong, because the
query immediately returns incorrect results. Students are then shown how taggers work and
which choices are made. In the next phase, they work on a project that consists of building their
own bilingual corpus and querying it for various purposes (terminology, translation,
extraction, etc.). Further explorations in the use of multilingual corpora in the field of cultural
studies are envisaged.
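As a rough idea of what querying a corpus with a concordancer and a frequency-sorting tokenizer driven by regular expressions can look like, the following Python sketch may be useful. It is an assumption-laden illustration rather than the environment's implementation: the functions `tokenize`, `frequency_list`, and `concordance`, the token pattern, and the tiny sample text are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Rough tokenizer: runs of letters, optionally joined by hyphens or apostrophes."""
    return re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", text.lower())


def frequency_list(text):
    """Return (word, count) pairs by decreasing frequency, then alphabetically."""
    counts = Counter(tokenize(text))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def concordance(text, pattern, width=20):
    """Return one keyword-in-context line per match of a regular expression."""
    lines = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


# Hypothetical miniature corpus used only to exercise the functions.
corpus = ("The corpus is tagged. Students query the corpus "
          "and test hypotheses about corpora.")

for line in concordance(corpus, r"\bcorp(?:us|ora)\b"):
    print(line)
print(frequency_list(corpus)[:3])
```

The regular expression in the query matches both the singular and the plural form, which is the sort of pattern a student might write when extracting terminology from a self-built corpus.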