Speech Perception and Spoken Word Recognition

joinherbalistΤεχνίτη Νοημοσύνη και Ρομποτική

17 Νοε 2013 (πριν από 3 χρόνια και 7 μήνες)

47 εμφανίσεις

(Introduction to) Language History and Use: Psycholinguistics

-

1

-

Speech Perception and Spoken Word Recognition

Rachael
-
Anne Knight

Speech Perception and Spoken Word Recognition

I.

Introduction

How do we recognise words when people speak to us? In this lecture we explore how we
find the correct entry in our word store (mental lexicon). NB. Here we are dealing
purely with accessing the
entries for individual words. We are not dealing with how we
extract the meaning from larger chunks of speech. This will be covered in the lecture on
Sentence and Discourse comprehension.

II.

Speech Perception

The first stage in the recognition process is th
e perception of speech.

A.

Is Speech Special?

Humans seem to process speech sounds in different ways to other sounds (Cf. sine wave
speech.) We can understand speech at a rate of 20 phonemes per second but can only
identify sequences of non
-
speech sounds at
a rate slower than 1.5 sounds per second.
Speech is at an advantage over non
-
speech sounds when heard against background noise.

B.

The Nature of speech

There is no one to one mapping between the acoustic, physical properties of speech and
the sounds we perce
ive. This is due to 2 related problems.

1.

The Invariance and Segmentation problems

The same phoneme may be produced in different ways depending on the context i.e. there
is a lack of
invariance
. Phonemes vary according to the surrounding context. They tak
e
on some of the properties of the surrounding phonemes (assimilation) as the vocal tract
begins to move to the position for the next sound (coarticulation). This also means that
segmentation

is impossible. Sounds slur together and the signal cannot be d
ivided into
discrete time
-
slices each representing a single phoneme.

a)

Disadvantage of these properties

Sounds cannot be identified by matching then to a mental template.

b)

Advantage of these properties

For speakers, speech can be faster as each segment does n
ot have to be produced
separately. For listeners, information about each segment is spread over time and each
point in time may carry information about more than one segment.

C.

Adult Word Segmentation

The segmentation problem does not just apply to phonemes
. It would also be difficult to
identify word boundaries based solely on phonological information. Look at the
following pairs:

Ice cream



I scream

Nitrate




Night rate

Take Gray to London


Take Greater London

(Introduction to) Language History and Use: Psycholinguistics

-

2

-

Speech Perception and Spoken Word Recognition

Rachael
-
Anne Knight

1.

Strategies for determining the location of

word boundaries.

a)

Stress Patterns

The majority of English words have a
strong
-
weak

stress pattern (
trochaic
). Therefore a
strategy based on expecting word boundaries before strong syllables could be efficient.
One such strategy is the Metrical Segmentati
on Strategy. Cutler and Butterfield (1992)
played faint utterances such as “con
duct

a
scents

up
hill
”. Subjects reported hearing e.g.
“The
doc
tor
sends

the
bill
,” suggesting they had inserted word boundaries before the
strong syllables.

b)

Phonotactics

Differ
ent languages allow different patterns of phonemes to occur. For example /kn/ is
not a legal onset in English but is in Dutch. So English speakers hearing /kn/ can
segment words by assuming that there must be a word boundary between them.

c)

Allophones

With
in a language a single phoneme may be produced in different ways depending on its
position within a word (allophones). E.g. /p/ is aspirated in ‘pin’ but is unaspirated in
‘spin’. Therefore hearing an aspirated /p/ suggests that it is word initial i.e. t
hat there is a
word boundary before it. Smith and Hawkins (2002) constructed nonsense words in
which real words were embedded such as ‘puzoath’. Listeners could spot ‘oath’ faster if
it had originally been produced with a word boundary before it suggesti
ng that listeners
are sensitive to patterns of allophonic detail.

d)

Knowledge of real words

Listeners do not like to leave sounds unattached to possible words. The possible
-
word
constraint (Norris
et al

1997) suggests that “fill a green bucket” will be pref
erred to
“filigree n bucket” because in the second possibility the /n/ does not form part of a word.

e)

Summary

There are several different types of cues that depend upon sound and prosodic patterns of
language. The implication is that speakers of different
languages will employ different
strategies for segmenting words.

D.

Infant Speech Perception

What are the speech perception abilities of infants like? How do these abilities develop
over the first year of life? (Jusczyk (1997) provides a comprehensive review
of the
following issues).

1.

Native vs. non
-
native phonetic contrasts

a)

Birth

Very young babies can discriminate both native and non
-
native phonetic contrasts.

(Introduction to) Language History and Use: Psycholinguistics

-

3

-

Speech Perception and Spoken Word Recognition

Rachael
-
Anne Knight

b)

6
-
12 months

In the second six months of life infants lose the ability to discriminate many of the
c
ontrasts that are not present in the ambient language

2.

Features used in adult word segmentation

a)

Are infants sensitive to the features of language that
adults use to segment speech?

(1)

Stress

Infants prefer to listen to speech containing the dominant stress pat
terns of their own
language such as trochaic patterns for English. 9 month old Americans were played lists
of English words that were either weak
-
strong or strong
-
weak. They listened longer to
the lists of strong
-
weak words even when the lists were
low
-
p
ass filtered

(removing
phonotactic

cues). 6 month olds showed no preference for either list suggesting that this
sensitivity develops between 6 and 9 months of age.

(2)

Phonotactics

Infants prefer to listen to speech with the phonotactics of their native lang
uage. 9 month
olds were played lists of words either from their native language (English) or another
language (Dutch). The infants listened longer to the lists of words from their native
language. There was no difference in listening time between the two

lists when they
were low
-
pass filtered or when they were played to 6 month olds. This suggests that
somewhere between 6 and 9 months old infants become sensitive to the phonotactics of
their own language.

(3)

Allophones

Infants can distinguish between pairs
such as ‘night rate’ and ‘nitrate’ even when they are
cross
-
spliced so that only allophonic and not prosodic differences remain.

(4)

Summary

Between 6 and 9 months of age infants develop sensitivity to the patterns of their native
language that adults used fo
r word segmentation.

b)

Do infants use these strategies to segment speech?

(1)

Stress

7.5 month olds familiarised with isolated strong
-
weak words such as ‘doctor’ and
‘candle’ listen longer to sentences containing these words than to sentences containing
controls

suggesting that they can segment the individual words. They cannot perform the
same task when familiarised with weak
-
strong words.

(2)

Phonotactics

10.5 month olds can perform the above task with weak
-
strong words if there are good
phonotactic cues.

(Introduction to) Language History and Use: Psycholinguistics

-

4

-

Speech Perception and Spoken Word Recognition

Rachael
-
Anne Knight

(3)

Allophon
es

10.5 month olds familiarised with either ‘night rate’ or ‘nitrate’ listen significantly longer
to sentences containing the familiar item than to ones containing the unfamiliar item
suggesting they can segment the familiar from fluent speech on the basis

of allophonic
cues. 9 month olds cannot perform this task.

(4)

Summary

Infants begin to segment using something like the Metrical Segmentation Strategy but
then begin to use other cues (by 10.5 months old) enabling them to segment words with
weak
-
strong patt
erns. Jusczyk (1997) suggests that babies’ speech perception abilities
develop in order to facilitate the segmentation of individual words from the speech
stream.

III.

Spoken Word Recognition

A.

Important Issues



What are the processes by which adults recognise sp
oken words?



Are these processes autonomous or interactive (i.e. can processes see each other
and feed back as well as forwards)



What use is made of context?

B.

Experimental Methods

1.

Lexical Decision

An item is played to the subject who must say whether it is
a real word (film) or a non
-
word (flim). The time taken to make the decision is measured. Researchers investigate
how the reaction time is affected by context, word frequency etc.

2.

Priming

Two words are played. The subject must say if the second word is
a real word or a non
-
word and their reaction is timed. Researchers vary the first word to see how reaction time
and errors are affected.

3.

Gating

Increasingly longer stretches of a word are played. The subject must identify the word as
soon as they can. Re
searchers can measure the point at which different words are
recognised and how this is affected by factors such as context.

4.

Shadowing

Subjects must repeat back speech as they are hearing it (the delay is usually 150
-
250ms).
The input speech contains erro
rs such as missing or replaced phonemes. Researchers
measure if these errors are corrected in the subject’s repetition (fluent restoration). The
location of errors and the type of speech (e.g. semantically unpredictable) is varied to see
if this affects
the subject’s repetition

(Introduction to) Language History and Use: Psycholinguistics

-

5

-

Speech Perception and Spoken Word Recognition

Rachael
-
Anne Knight

C.

Fundamental findings

There are some classical findings that have been replicated many times. Any model of
Spoken Word Recognition must attempt to explain these findings.

1.

Word Frequency Effects

Words that are frequent are recognise
d more quickly than words that are
infrequent

In lexical decision tasks, frequent real words are recognised more quickly than infrequent
real words. E.g. ‘rain’ vs. puddle.

2.

Word Supremacy Effect

People are quicker to decide that a given target is a word t
han to recognise it as a
non
-
word.

In lexical decision tasks people are quicker to identify ‘film’ than to reject. ‘flim’.

3.

Context Effects

Words in context are recognised more quickly than isolated words.

In the gating task, isolated words are recognised i
n 333ms (mean) whereas words in
context are recognised in 199ms (mean). E.g. ‘
camel
’ vs. ‘at the zoo the children rode on
a
camel’.


However, the situation is not a simple one and it is important to distinguish between
different types of context (e.g. lex
ical, syntactic, semantic) and also to consider how and
when context has its effect. For example, can context affect which words are considered
for recognition or only help to eliminate candidates or only help to integrate a recognised
word into the large
r utterance? (see Harley for a review).

4.

Distortion Effects

Words that are distorted at the beginning are recognised more slowly than words
that are distorted at the end.

In shadowing tasks, distortions are more likely to be restored if they occur in the 3
rd

rather than the 1
st

syllable e.g. trage
t
y vs.
d
ragedy.

D.

A word of warning!

The terms 'word recognition' and 'lexical access' are used differently by many researchers.
In this lecture 'word recognition' is the point where you know a word exists in your
lexicon and you know what the word itself is but don't know anything about it (i.e.
whether it’s a noun or a verb, what it means). 'Lexical access' is the point where all the
semantic and syntactic information about the word is available to you. Please b
ear in
mind that this is not the case in every article or textbook. Sometimes the words are used
interchangeably and sometimes the definitions I give here are reversed. This is very
confusing so be careful and also make sure you know how you are using th
e terms.

(Introduction to) Language History and Use: Psycholinguistics

-

6

-

Speech Perception and Spoken Word Recognition

Rachael
-
Anne Knight

E.

Models of SWR

1.

The Original Cohort Model

a)

Summary

This model assumes that on the basis of the first 250 ms of speech a cohort of possible
words is set up. As more speech input is heard items from that cohort are eliminated.
Word Recognition occurs

when only one item is left in the cohort.


Process to recognise the word 'slave'


a. Listener hears /sl/ and sets up an initial cohort (access stage)







b.1 Listener hears /ei/ and eliminates non
-
matching candidates (selection stage)




b.2.
Listener hears /v/ and eliminates final non
-
matching candidate





c. Listener uses knowledge about syntax and semantics to integrate the word into the
sentence (integration stage)

b)

How well does the Cohort Model account for the
evidence?

Word Frequency


The original model does not explain why frequent words are recognised quickly.


Word Supremacy



You have to eliminate all the words in the cohort before you can decide a target is not a
real word whereas real words can even be recognised before they are f
inished.


Context Effects


Context can be used to help to eliminate candidates from the cohort so will improve
recognition speed.


But, context might to be too powerful here.


slow

slip

slack

slide

slave

sleigh

slave

sleigh

slave

(Introduction to) Language History and Use: Psycholinguistics

-

7

-

Speech Perception and Spoken Word Recognition

Rachael
-
Anne Knight

Distortion Effects


An initial distortion will be more problematic as you will s
et up your cohort wrongly.
A later distortion may come after the recognition point and other (contextual) cues can be
used to help you.


However in a way this works too well. How can you ever recognise the word if you've
set up the wrong cohort?

2.

The Revi
sed Cohort Model

a)

Modifications

The principle is the same as in the original model in that listeners still set up an initial
cohort of candidates. However, the elimination process is no longer all
-
or
-
nothing.
Items that do not receive further positive inf
ormation decay in activation rather than
being eliminated and a word is recognised when it has a higher relative activation than
other words in the cohort. This allows for backtracking if you mishear a word or it is
distorted. Word frequency is built int
o this model by saying that frequent words may
become activated more quickly than infrequent words. Context loses some of its power,
as it cannot be used to influence the items that form the initial cohort.

3.

The TRACE Model


The TRACE model is a connectio
nist network. Important features of these models are:




There are lots of simple processing units



Processing units are arranged into levels



Units are joined together by weighted, bi
-
directional connections



Input units are activated by incoming information



Activation spreads along the connections to other units

a)

Summary

The levels of processing units in TRACE are features, phonemes and words. Within a
single level the connections between units are inhibitory. Connections between
consistent units on differen
t levels are excitatory. Information flows in both directions
and top
-
down information (context) affects activation at lower levels.

(Introduction to) Language History and Use: Psycholinguistics

-

8

-

Speech Perception and Spoken Word Recognition

Rachael
-
Anne Knight























Figure
1

A simplified diagram of TRACE


An example
-

to recognise the wo
rd
-

/pan/

a.

Listener hears input with the features
-
sonorant and
-
voice. These features become
activated. This activation inhibits other feature nodes

b.

The activation spreads to the next level and activates the phonemes with these
features (/p/, /k/) which

then exerts an inhibitory influence on the surrounding
phonemes. Activation also feeds back to the feature level to reinforce the activation
of
-
sonorant and
-
voice

c.

The /p/ and /k/ phonemes activate the words /pan/ and /kan/ in the word level which
inhib
it other word nodes. Activation also feeds back to activate /p/ and /k/ and in turn
the relevant features.

d.

All the time, contextual information is used which helps to activate word nodes that
are syntactically or semantically appropriate for the context

b)

How well does TRACE account for the evidence?

Word Frequency


The model does not currently explain why frequent words are recognised quickly but
modifying it in the same way as the Cohort model would help.


Word Supremacy



Real words feed back down to the

lower levels to help reinforce earlier information.
Nonwords do not have this advantage. However it is not clear exactly how the model
would finally make a decision that a word didn't exist in the inventory.

pan

ban

nan

can

/p
/

/b/

/k/

+voi

-
nas

-
son

-
voi

CONTEXT

SENSORY

INPUT

(Introduction to) Language History and Use: Psycholinguistics

-

9

-

Speech Perception and Spoken Word Recognition

Rachael
-
Anne Knight


Context Effects


Context feeds down to affec
t the perceptual level.


But, again, context might to be too powerful here.


Distortion Effects


It’s easy to recover from distortions because you are not so reliant on hearing
everything perfectly.


Word initial sounds contribute more to the activation of

word nodes than later sounds.


IV.

Conclusions

Both of the models discussed can explain much of the experimental evidence concerning
recognition. However, no single model seems to be able to explain all of the evidence
and many models make predictions that a
re not supported by the evidence. The role of
context and the ability to segment the speech stream are still problematic for many
models of Spoken Word Recognition.

V.

Glossary

Strong syllable

A syllable that is stressed and does not have a reduced vowel

Wea
k syllable


A syllable that is not stressed and may contain a reduced vowel

Trochaic


A strong
-
weak stress pattern (e.g.
doc
tor)

Phonotactics


Sequential constraints that operate on adjacent phonetic segments

Low
-
pass filtering

Removal of high frequencies
(removes segmental identity but
preserves prosodic characteristics)


(Introduction to) Language History and Use: Psycholinguistics

-

10

-

Speech Perception and Spoken Word Recognition

Rachael
-
Anne Knight


VI.

Reading and References

Reading


Harley, T. (2001)
The Psychology of Language
, Cambridge CUP,
Chapter 8


Goldinger, S., D. Pisoni, & P. Luce (1996) “Speech perception and spoken word
rec
ognition: research and theory”. In N. Lass, (ed.),
Principles of Experimental
Phonetics
, St. Louis: Moseby.
277
-
327
.


Harris, M. and Coltheart, M. (1989)
Language Processing in Children and Adults
,
London: Routledge,
159
-
171
(available on reserve in the M
ML library


ask at the front
desk)


Lively, S. and Goldinger, S. (1994) “Spoken Word Recognition: Research and Theory”
In M. Gernsbacher (ed,)
Handbook of Psycholinguistics
, San Diego: Academic Press
265
-
301


Jusczyk, P. (1997)
The Discovery of Spoken
Language
, Cambridge, Massachusetts: MIT
Press,
Chapters 2 and 3

(for an incredibly comprehensive and easy to read survey of
infant speech perception abilities)


Grosjean, F. and Frauenfelder, U. (eds.) (1997)
A Guide to Spoken Word Recognition
Paradigms, H
ove: Psychology Press



References

Cutler, A. and Butterfield, S. (1992) “Rhythmic cues to speech segmentation: Evidence
from juncture misperception.”
Journal of Memory and Language, 31,

218
-
236


Norris, D., McQueen, J., Cutler, A., and Butterfield, S. (1
997) “The possible
-
word
constraint in the segmentation of continuous speech.”
Cognitive Psychology, 34,

191
-
243


Smith, R. and Hawkins S. (2000) “Allophonic influences on word spotting experiments.”
Proceedings of the ISCA Workshop on Spoken Word Access
Processes, Nijmegen, The
Netherlands, 139
-
142