Natural Language Processing


NLP memo #2

Natural Language Processing

William S-Y. Wang

September 10, 2005




Occasionally I will communicate with my class TRA-7204 via these NLP memos.
These memos will be available at two websites:

www.cuhk.edu.hk/tra/macat/nlp/

www.ee.cuhk.edu.hk/~wsywang/


In NLP memo #1, I gave a very brief discussion of the term NLP, and how
students will be assessed in this course. In addition to these memos, I will
also occasionally make available PowerPoint files.


My office is in 229 Ho Sing Hang Engineering Bldg, tel. 26098456, which can
receive voice messages. Typically, I plan to go to my office after each class
for an hour or so for conversation with students. Since I will be traveling
quite a bit this semester, the surest way to reach me is by email:
wsywang@ee.cuhk.edu.hk.


Since a major purpose of the course is to encourage research in the area of
natural language processing, the student should begin exploring the various
resources as early as possible in order to identify a project of interest to
him. Here are some useful websites worth exploring. When using the internet
for study and research purposes, it is important to keep in mind that the
quality and reliability of the resources may be quite uneven.


- http://www.ethnologue.com/
  is maintained by the Summer Institute of Linguistics, a missionary
  organization. It is often associated with Barbara Grimes. It gives the most
  up-to-date global picture of the some 6000 languages in the world, based on
  the fieldwork of numerous linguists.


- http://www.ldc.upenn.edu/
  points to the massive materials of the Linguistic Data Consortium,
  maintained by the University of Pennsylvania. One of its founders is Mark
  Liberman. The Digital Signal Processing Laboratory of the CUHK EE dept has
  a complete subscription to these materials, under the care of Mr. Arthur
  Luk.




- http://ehl.santafe.edu/
  is aimed at revealing the evolution of human language, and does this by
  compiling large etymological dictionaries. It is based at the Santa Fe
  Institute, under the direction of Murray Gell-Mann, a Nobel laureate in
  physics, and involves Sergei Starostin [Moscow] and Merritt Ruhlen
  [Stanford]. This database contains as a subset a corpus of pronunciations
  in some 20 Chinese dialects, called DOC, originally compiled at Berkeley
  under the direction of W.S-Y. Wang.


- http://wordnet.princeton.edu/
  was started under the direction of the psychologist George Miller, and
  contains a wealth of grammatical and semantic information on English words.


- http://childes.psy.cmu.edu/
  The Child Language Data Exchange System is directed by the psychologist
  Brian MacWhinney of Carnegie Mellon University. It contains transcribed
  data of children learning their first language in several different
  languages.


- http://www.elra.info/
  European Language Resources Association

- http://helmer.aksis.uib.no/icame.html
  International Computer Archive of Modern English

- http://ota.ahds.ac.uk/
  Oxford Text Archive


- http://clwww.essex.ac.uk/w3c/corpus_ling/content/corpora/list/private/brown/brown.html
  The Brown University corpus of American English was initiated by two
  linguists, Nelson Francis and Henry Kucera. It is perhaps the first such
  large-scale corpus.


- http://www.comp.lancs.ac.uk/computing/research/ucrel/corpora.html#bnc
  The British National Corpus.





Organization of the course.

[# indicates that the reading is downloadable.]


I. 9.10
The nature of language, and the nature of translation.
# Y.R. Chao. Dimensions of fidelity in translation, with special reference
  to Chinese. 1967.
# W.S-Y. Wang. The Chinese language. Scientific American 1973.

II. 9.17

III. 9.24

10.1 National Day

IV. 10.8

V. 10.15
By Friday 10.21 at the latest, students should submit by email a project
prospectus on what they wish to work on.

VI. 10.22
By Friday 10.28 at the latest, students should have received approval on
their prospectus by consultation with the instructor, by email, telephone, or
in person.


VII. 10.29
The sounds of language: how they are produced, classified, and perceived.


VIII. 11.5
Speech technology 1 [Dr. PENG Gang]
Acoustic properties of speech sounds as revealed by computer analysis, PRAAT
software; statistical methods in speech recognition, Hidden Markov Models.


IX. 11.12
Speech technology 2 [Dr. PENG Gang]
Databases for speech technology: Cantonese, Putonghua, and English.
Review quiz on materials covered in lectures.


X. 11.19
Student presentations on projects.

XI. 11.26
Student presentations on projects.

XII. 12.3
Student presentations on projects.
Final form of project due.



Ideally, your project should be based on some topic that has interested you
for some time. Hopefully, your project report can be based on enough work and
original thinking that it can be accepted for publication by a major journal.
However, if you are looking around for ideas, here are some possibilities to
get you started:


[1] Hong Kong is a multilingual society, with three major languages
[Cantonese, Putonghua, and English] and many other languages with fewer
speakers from South Asia and Southeast Asia. One often hears a great deal of
code switching and code mixing on radio and TV, and sees them in written
materials, such as cartoons. What is the nature of the problems people face
in such a context? How are these problems different in nature for the
computer?


[2] A major impetus for a language to change is contact with other languages.
Because of its complex sociolinguistic context, Hong Kong Cantonese is
undergoing rapid change at many levels of its structure, presumably
differently from Guangzhou Cantonese. What are the major changes going on,
and how does one go about studying such processes?


[3] The most difficult problem facing computer analysis of texts, spoken or
written, is the very high degree of ambiguity in any corpus. Parsing programs
that have been developed in natural language processing are notoriously
unsuccessful. What are the major types of ambiguity in Cantonese, and how may
some of them be resolved by computer?


[4] Choose two pieces of text, either in Chinese or in English, one from
literature and one from natural science. Translate these two pieces into the
other language. What are the different difficulties you encounter with these
two genres? How would you relate your efforts to the three dimensions of
translation discussed by Y.R. Chao?


[5] Explore some online corpus, such as WordNet, and perform some semantic
analysis by algorithm.

[6] Explore some online corpus, such as DOC, and perform some phonological
analysis by algorithm.

[7] Use some phonetic software, such as PRAAT, to perform some phonetic
analysis on any linguistic problem of theoretical interest.
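
As one illustration of what "semantic analysis by algorithm" in [5] might
look like, here is a minimal sketch in Python. It does not use WordNet
itself: it assumes a tiny hand-made table of "is a kind of" links
(hypothetical data, made up for this example) and measures how closely two
words are related by counting the links to their nearest shared ancestor.
All names in it (HYPERNYM, hypernym_chain, path_distance, path_similarity)
are invented for this sketch. A real project would replace the toy table
with data drawn from the actual resource.

```python
# Hypothetical toy taxonomy (word -> more general word).
# This is illustrative data only; it is NOT taken from WordNet.
HYPERNYM = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}


def hypernym_chain(word):
    """Return [word, parent, grandparent, ...] up to the top of the table."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def path_distance(a, b):
    """Count the links from a up to the nearest shared ancestor and down to b.

    Returns None when the two words share no ancestor in the table.
    """
    steps_from_b = {node: i for i, node in enumerate(hypernym_chain(b))}
    for steps_from_a, node in enumerate(hypernym_chain(a)):
        if node in steps_from_b:
            return steps_from_a + steps_from_b[node]
    return None


def path_similarity(a, b):
    """Map a distance to a score in (0, 1]; identical words score 1.0."""
    distance = path_distance(a, b)
    return None if distance is None else 1.0 / (1.0 + distance)


if __name__ == "__main__":
    for pair in [("dog", "wolf"), ("dog", "cat"), ("dog", "dog")]:
        print(pair, path_distance(*pair), path_similarity(*pair))
```

In this toy table "dog" and "wolf" are two links apart while "dog" and "cat"
are four links apart, so the first pair receives the higher score. Whether
such a path count is a good measure of meaning is itself a question a project
could examine.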