Natural Language Processing
I will communicate with my class TRA via these NLP memos. These memos will be
available at two websites:
In NLP memo #1, I gave a very brief discussion of the term NLP, and of how
students will be assessed in this course. In addition to these memos, I will
also occasionally make available powerpoin
My office is in 229 Ho Sing Hang Engineering Bldg, tel. 26098456, which
can receive voice messages.
I plan to go to my office after each class for an hour or so for conversation
with students.
traveling quite a bit t
surer to reach me by
Since a major purpose of the course is to encourage research in the area of
natural language processing, the student should begin exploring the various
resources as early as possible in order to identify a project of interest to him.
Here are some useful websites.
When using the internet for study and research purposes, it is important to
keep in mind that the quality and reliability of the resources may be quite
uneven.
is maintained by the Summer Institute of Linguistics, a missionary organization.
It is often associated with Barbara Grimes. It gives the most up-to-date
global picture of the 6000-some languages in the world, based on the fieldwork
of numerous
points to the massive materials of the Linguistic Data Consortium, maintained
by the University of Pennsylvania.
One of its founders is Mark
The Digital Signal Processing Laboratory of the CUHK EE dept
has a complete subscription to these materials, under the care of Mr. Arthur
is aimed at revealing the evolution of human language, and does this by
compiling large etymological dictionaries.
It is based at the Santa Fe Institute, under the direction of Murray
Gell-Mann, a Nobel laureate in physics, and involves Sergei Starostin [Moscow]
and Merritt Ruhlen
This database contains as a subset a corpus of pronunciations in
some 20 Chinese dialects, called DOC, originally compiled at Berkeley
under the direction of W.S
was started under the direction of the psychologist George Miller, and
contains a wealth of grammatical and semantic information on English.
The Child Language Data Exchange System is directed by the psychologist Brian
MacWhinney of Carnegie Mellon University. It contains transcribed data of
children learning their first language, in several different languages.
European Language Resources Association
International Computer Archive of Modern English
Oxford Text Archive
The Brown University corpus was initiated by two linguists, Nelson Francis
and Henry Kucera, on American English. It is perhaps the first such large
The British National Corpus.
ation of the course.
[#indicates that the
The nature of language, and the nature of translation.
, Dimensions of fidelity in translation, with special reference to Chinese.
1967.
Y.Wang. The Chinese language.
By Friday 10.21 at the latest, each student should submit a project
prospectus on what he wishes to work on.
By Friday 10.28 at the latest, students should have
received approval on their prospectus by consultation
with the instructor, by email, telephone, or in person.
The sounds of language
how they are produced,
, and perceived.
Acoustic properties of speech sounds as revealed by computer
analysis, PRAAT software; statistical methods in speech
recognition, Hidden Markov Models.
Databases for speech technology: Cantonese, Putonghua,
Review quiz on materials covered in lectures.
Final form of project due.
Ideally, your project should be based on some topic that has interested you
for some time. Hopefully, your project report can be based on enough work
and original thinking that it can be accepted for publication by a major
journal. However, if you are looking around for ideas, here are some
possibilities to get you started:
• Hong Kong is a multilingual society, with three major languages [Cantonese,
Putonghua, and English] and many other languages with fewer speakers from
South Asia and Southeast Asia. One often hears a great deal of code switching
and code mixing on radio and TV, and sees them in written materials, such as
cartoons. What is the nature of the problems people face in such a context?
How are these problems different in nature for the
• A major impetus for a language to change is contact with other languages.
Because of its complex sociolinguistic context, Hong Kong Cantonese is
undergoing rapid change at many levels of its structure, presumably
differently from Guangzhou Cantonese. What are the major changes going
on, and how does one go about studying such processes?
• The most difficult problem facing computer analysis of texts, spoken or
written, is the very high degree of ambiguity in any corpus. Parsing
programs that have been developed in natural language processing are
notoriously unsuccessful. What are the major types of ambiguity in
Cantonese, and how may some of them be resolved by computer?
• Choose two pieces of text, either in Chinese or in English, one from
literature and one from natural science. Translate these two pieces into the
other language. What are the different difficulties you encounter with these
two genres? How would you relate your efforts to the three dimensions of
translation discussed by Y.R. Chao?
• Explore some online corpus, such as WordNet, and perform some
semantic analysis by algorithm.
• Explore some online corpus, such as DOC, and perform some
phonological analysis by algorithm.
• Use some phonetic software, such as PRAAT, to perform some
phonetic analysis on any linguistic problem of theoretical interest.
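
For students drawn to the ambiguity question among the possibilities above,
here is a minimal sketch in Python, offered only as a starting point and not
as a prescribed method, of how a computer can count the competing analyses of
a sentence. It uses the standard CKY chart algorithm over a tiny English
grammar in Chomsky normal form. The grammar, the lexicon, and the function
name count_parses are illustrative assumptions made up for this example; a
project on Cantonese would need its own grammar and word segmentation.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (an assumption for illustration only).
# Each rule is (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # prepositional phrase modifies the verb phrase
    ("NP", "NP", "PP"),   # prepositional phrase modifies the noun phrase
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]

# Toy lexicon (also an assumption): word -> possible categories.
LEXICON = {
    "I": ["NP"],
    "saw": ["V"],
    "the": ["Det"],
    "a": ["Det"],
    "man": ["N"],
    "telescope": ["N"],
    "park": ["N"],
    "with": ["P"],
    "in": ["P"],
}


def count_parses(words, start="S"):
    """Return the number of distinct parse trees rooted in `start`."""
    n = len(words)
    # chart[i][j][X] = number of trees with root X covering words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]

    # Fill in the single-word spans from the lexicon.
    for i, word in enumerate(words):
        for category in LEXICON.get(word, []):
            chart[i][i + 1][category] += 1

    # Build longer spans from every way of splitting them in two.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                left, right = chart[i][k], chart[k][j]
                for parent, b, c in BINARY_RULES:
                    if b in left and c in right:
                        chart[i][j][parent] += left[b] * right[c]

    return chart[0][n].get(start, 0)


print(count_parses("I saw the man with a telescope".split()))  # prints 2
```

The two analyses differ in whether "with a telescope" attaches to the verb
phrase (the seeing was done with a telescope) or to "the man" (the man had the
telescope). Counting analyses in this way only exposes the ambiguity; choosing
among them requires further information, such as rule probabilities estimated
from a corpus.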