Mandarin Chinese Speech Recognition

spectacularscarecrowAI and Robotics

Nov 17, 2013

Mandarin Chinese


Tonal language (inflection matters!)

1st tone: High, constant pitch (like saying “aaah”)

2nd tone: Rising pitch (“Huh?”)

3rd tone: Low pitch (“ugh”)

4th tone: High pitch with a rapid descent (“No!”)

“5th tone”: Neutral, used for de-emphasized syllables
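The tone descriptions above can be made concrete with Chao tone numerals, a common five-level pitch notation (1 = lowest, 5 = highest). The contour values below are the usual textbook citation forms, not values taken from these slides, and the names `TONE_CONTOURS` and `describe` are purely illustrative.

```python
# Conventional Chao tone numerals for the four full Mandarin tones
# (assumed textbook values; pitch levels run from 1 = low to 5 = high).
TONE_CONTOURS = {
    1: (5, 5),     # high, constant pitch
    2: (3, 5),     # rising pitch
    3: (2, 1, 4),  # low, dipping pitch
    4: (5, 1),     # high pitch with a rapid descent
}

def describe(tone):
    """Return a coarse label for a tone number; anything else is 'neutral'."""
    contour = TONE_CONTOURS.get(tone)
    if contour is None:
        return "neutral"          # the "5th tone" has no fixed contour
    if len(contour) == 3:
        return "dipping"
    start, end = contour
    if start == end:
        return "level"
    return "rising" if end > start else "falling"

for t in range(1, 6):
    print(t, describe(t))
```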


Monosyllabic language


Each character represents a single base syllable and tone


Most words consist of 1, 2, or 4 characters


Heavily contextual language


Mandarin Chinese and Speech Processing


Acoustic representations of Chinese syllables


Structural Form


(consonant) + vowel + (consonant)

Mandarin Chinese and Speech Processing


Phone Sets


Initial/final phones [1]


e.g. shi, ge, zi = (sh + ib), (g + e), (z + if)


Initial phones: unvoiced


1 phone


Final phones: voiced (tones 1-5)


Can consist of multiple phones
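A minimal sketch of the initial/final split described above, assuming tone-numbered pinyin input such as `shi4`. The `ib` and `if` labels follow the slide's example for shi and zi; applying them to the other retroflex (zh, ch, r) and dental (c, s) initials is an assumption made here, and `split_syllable` is a hypothetical helper, not part of the phone set in [1].

```python
# Pinyin initials, with two-letter initials first so "sh" matches before "s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]

RETROFLEX = {"zh", "ch", "sh", "r"}   # assumed to take the "ib" final
DENTAL = {"z", "c", "s"}              # assumed to take the "if" final

def split_syllable(pinyin):
    """Split e.g. 'shi4' into (initial, final, tone); no digit means tone 5."""
    tone = 5
    if pinyin[-1].isdigit():
        tone = int(pinyin[-1])
        pinyin = pinyin[:-1]
    for initial in INITIALS:
        if pinyin.startswith(initial):
            final = pinyin[len(initial):]
            # The written "i" after these initials is a different vowel,
            # which the slide's phone set labels "ib" or "if".
            if final == "i" and initial in RETROFLEX:
                final = "ib"
            elif final == "i" and initial in DENTAL:
                final = "if"
            return initial, final, tone
    return "", pinyin, tone           # zero-initial syllable such as "ai"

for syllable in ["shi4", "ge4", "zi3", "ai4"]:
    print(syllable, split_syllable(syllable))
```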

Mandarin Chinese and Speech Processing


Strong tonal recognition is crucial to distinguish between homonyms [3] (especially without context)


Creating tone models is difficult


Discontinuities exist in the F0 contour between voiced and unvoiced regions
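One common way to handle these discontinuities is to fill the unvoiced gaps before modeling. The sketch below assumes unvoiced frames are marked with 0 and uses plain linear interpolation; it is an illustration only, not the smoothing method of any of the cited papers, and `interpolate_f0` is a hypothetical name.

```python
def interpolate_f0(f0):
    """Fill unvoiced frames (value 0) by interpolating between voiced frames."""
    voiced = [i for i, value in enumerate(f0) if value > 0]
    if not voiced:
        return list(f0)               # nothing to anchor the interpolation on
    out = [float(value) for value in f0]
    # Leading and trailing unvoiced frames hold the nearest voiced value.
    for i in range(voiced[0]):
        out[i] = out[voiced[0]]
    for i in range(voiced[-1] + 1, len(out)):
        out[i] = out[voiced[-1]]
    # Frames inside a gap lie on the straight line between its two edges.
    for a, b in zip(voiced, voiced[1:]):
        for i in range(a + 1, b):
            frac = (i - a) / (b - a)
            out[i] = out[a] + frac * (out[b] - out[a])
    return out

print(interpolate_f0([0, 100, 0, 0, 130, 0]))
```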

Prosody


Prosody: “the rhythmic and intonational aspect of language” [2]


Embedded Tone Modeling [4]


Explicit Tone Modeling [4]


Tone Modeling


Embedded Tone Modeling


Tonal acoustic units are joined with spectral features at each frame [4]


Explicit Tone Modeling


Tone recognition is completed independently and combined after post-processing [4]


Pitch, energy, and duration (prosody), combined with lexical and syntactic features, improve tonal labeling


Coarticulation


Variations in syllables can cause variations in tone:

Bu4 + Dui4 = Bu2 Dui4 (means “wrong”)

Ni3 + Hao3 = Ni2 Hao3 (means “hello”)
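The two examples above can be written as rewrite rules over (syllable, tone) pairs. This is a simplified sketch: it only looks at adjacent pairs and at the underlying tone of the following syllable, so a run of three 3rd tones becomes 2-2-3 regardless of phrasing, which real speech does not always do. The function name `apply_sandhi` is hypothetical.

```python
def apply_sandhi(syllables):
    """Map underlying tones to surface tones for a list of (pinyin, tone)."""
    out = list(syllables)
    for i in range(len(syllables) - 1):
        syllable, tone = syllables[i]
        next_tone = syllables[i + 1][1]
        if tone == 3 and next_tone == 3:
            out[i] = (syllable, 2)    # 3rd + 3rd -> 2nd + 3rd (ni3 hao3)
        elif syllable == "bu" and tone == 4 and next_tone == 4:
            out[i] = (syllable, 2)    # bu4 before a 4th tone -> bu2 (bu4 dui4)
    return out

print(apply_sandhi([("ni", 3), ("hao", 3)]))
print(apply_sandhi([("bu", 4), ("dui", 4)]))
```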


Tone Modeling

Embedded Tone Modeling: Two Stream Modeling

Ni, Liu, Xu


Spectral Stream

MFCCs (Mel-frequency cepstral coefficients)



Describe vocal tract information


Distinctive for phones (short time duration)


Pitch/Tone Stream


requires smoothing


Describe vibrations of the vocal cords


Independent of Spectral features


d/dt(pitch), aka tone, and d²/dt²(pitch) are added


Embedded in an entire syllable


Affected by coarticulation (requires a longer time window)


e.g. tone sandhi


context dependency
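The pitch stream described above (pitch plus its first and second time derivatives) might be assembled roughly as follows. This sketch uses simple finite differences and assumes the F0 track has already been smoothed; speech front-ends typically compute deltas by regression over a window of several frames instead. `delta` and `pitch_stream` are hypothetical names.

```python
def delta(seq):
    """Per-frame difference: central inside the track, one-sided at the ends."""
    n = len(seq)
    if n < 2:
        return [0.0] * n
    out = []
    for t in range(n):
        lo, hi = max(t - 1, 0), min(t + 1, n - 1)
        out.append((seq[hi] - seq[lo]) / (hi - lo))
    return out

def pitch_stream(f0):
    """Return one [pitch, d/dt pitch, d2/dt2 pitch] vector per frame."""
    d1 = delta(f0)
    d2 = delta(d1)
    return [list(frame) for frame in zip(f0, d1, d2)]

for frame in pitch_stream([100, 110, 130, 160]):
    print(frame)
```

In a two-stream system these vectors would sit alongside, not inside, the MFCC vectors, each stream with its own output distribution.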


Embedded Tone Modeling: Two Stream Modeling [4]


Tonal Identification Features


F0


Energy


Duration


Coarticulation (continuous speech)


Initially use 2 stream embedded model followed by explicit modeling during lattice rescoring (alignment?)


Explicit tone modeling uses a maximum entropy framework [4] (discriminative model)
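A maximum entropy classifier assigns each tone a probability proportional to the exponential of a weighted feature sum. The sketch below shows only that scoring step, with made-up weights and two placeholder features; it is not the trained model or feature set from [4], and `maxent_posterior` is a hypothetical name.

```python
import math

def maxent_posterior(features, weights):
    """P(tone | features) = exp(w_tone . f) / sum over tones of exp(w . f)."""
    scores = {tone: sum(w * f for w, f in zip(ws, features))
              for tone, ws in weights.items()}
    top = max(scores.values())        # subtract the max for numerical stability
    exps = {tone: math.exp(s - top) for tone, s in scores.items()}
    total = sum(exps.values())
    return {tone: e / total for tone, e in exps.items()}

# Illustrative weights only: one weight vector per tone class.
weights = {1: [1.0, 0.0], 2: [0.0, 1.0]}
print(maxent_posterior([2.0, 0.0], weights))
```

During lattice rescoring, the log of such a posterior could be added, with a tunable weight, to each path's acoustic and language model scores.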


Explicit Tone Modeling [4]

No. | Feature Description | # of Features

1 | Duration of current, previous, and following syllables | 3

2 | Whether the previous syllable is sp (short pause) | 1

3 | Slope and intercept of F0 contour of current syllable, its delta, and delta-delta | 6

4 | Statistical parameters of pitch and log-energy of current syllable (e.g. max, min, mean) | 10

5 | Normalized max and mean of pitch and energy in each syllable in the context window | 12

6 | Location of current syllable within word | 1

7 | Tones of preceding and following syllables | 2
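As an illustration of feature group 3 in the table (slope and intercept of the F0 contour, its delta, and its delta-delta, six values in total), here is a rough sketch using an ordinary least-squares line fit against the frame index. How [4] actually computes the deltas and the fit is not stated on the slide, so the successive-difference delta and the names `line_fit`, `diff`, and `f0_shape_features` are assumptions.

```python
def line_fit(values):
    """Least-squares (slope, intercept) of values against frame index 0..n-1."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x

def diff(seq):
    """Successive differences (assumed stand-in for the delta contour)."""
    return [b - a for a, b in zip(seq, seq[1:])]

def f0_shape_features(f0):
    """Six features: slope and intercept of F0, delta F0, and delta-delta F0.

    Needs at least three frames so that every contour is non-empty.
    """
    d1 = diff(f0)
    d2 = diff(d1)
    features = []
    for contour in (f0, d1, d2):
        features.extend(line_fit(contour))
    return features

print(f0_shape_features([100, 110, 120, 130]))
```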

Other Work

Chang, Zhou, Di, Huang, & Lee [1]


3 Methods


Powerful Language Model (no tone modeling)


CER = 7.32%


Embedded 2 Stream


Tone Stream + Feature Stream


CER = 6.43%


Embedded 1 Stream


Developed Pitch extractor


pitch track added to feature vector


CER = 6.03%
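CER (character error rate) is conventionally the character-level edit distance between the reference and the recognizer output, divided by the reference length. A minimal sketch under that standard definition follows; the example strings are invented and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: substitutions, insertions, and deletions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate of a hypothesis against a non-empty reference."""
    return edit_distance(ref, hyp) / len(ref)

print(cer("你好世界", "你好视界"))  # one substituted character out of four
```

Relative to the 7.32% language-model-only baseline, the reported 6.43% and 6.03% correspond to relative CER reductions of roughly 12% and 18%.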


Other Work

Qian, Soong [3]


F0 contour smoothing


Multi-Space Distribution (MSD)


Models 2 probability spaces


Unvoiced: Discrete


Voiced (F0 Contour): Continuous
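In its simplest form, the two-space idea can be sketched as a single state whose observation is either a discrete "unvoiced" symbol or a continuous F0 value. The version below assumes one Gaussian for the voiced space and invented parameter values; the models in [3] are richer, and `msd_likelihood` is a hypothetical name.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def msd_likelihood(obs, w_unvoiced, w_voiced, mean, var):
    """Likelihood of one frame under a two-space distribution.

    obs is None for an unvoiced frame, otherwise an F0 value (log-F0 here).
    The two space weights are assumed to sum to 1.
    """
    if obs is None:
        return w_unvoiced                 # discrete space: just its weight
    return w_voiced * gaussian_pdf(obs, mean, var)   # continuous space

# Illustrative parameters: 30% unvoiced mass, voiced log-F0 centred on 5.0.
print(msd_likelihood(None, 0.3, 0.7, 5.0, 0.04))
print(msd_likelihood(5.1, 0.3, 0.7, 5.0, 0.04))
```

Because unvoiced frames are modeled directly, the F0 track does not have to be interpolated across unvoiced regions first.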


Other Work

Lamel, Gauvain, Le, Oparin, Meng [6]


Multi-Layer Perceptron Features


Combined with MFCCs and Pitch features


Compare Language Models


N-Gram: Back-off Language Model


Neural Network Language Model


Language Model Adaptation
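To illustrate what "back-off" means for an N-gram model, here is a toy bigram scorer that falls back to unigram counts when a bigram was never seen. It uses a fixed back-off factor in the style of "stupid backoff", so the scores are not normalized probabilities; it is not the Katz or Kneser-Ney estimator usually used in practice, and nothing here is taken from [6]. The toy corpus and all names are invented.

```python
from collections import Counter

def train(sentences):
    """Count unigrams and bigrams over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams

def score(prev, word, unigrams, bigrams, alpha=0.4):
    """Relative bigram frequency if seen, else a discounted unigram frequency."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return alpha * unigrams[word] / sum(unigrams.values())

corpus = [["wo", "shi", "xuesheng"], ["wo", "shi", "laoshi"]]
unigrams, bigrams = train(corpus)
print(score("wo", "shi", unigrams, bigrams))       # seen bigram
print(score("xuesheng", "wo", unigrams, bigrams))  # unseen, backs off
```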


Other Work

O. Kalinli [7]


Replace prosodic features with biologically inspired auditory attention cues


Cochlear filtering, inner hair cell, etc.


Other features are extracted from the auditory spectrum


Intensity


Frequency contrast


Temporal contrast


Orientation (phase)


Other Work

Qian, Xu, Soong [8]


Cross-Lingual Voice Transformation


Phonetic mapping between languages


Difficult for Mandarin and English


Very different prosodic features



References

[1] Eric Chang, Jianlai Zhou, Shuo Di, Chao Huang, & Kai-Fu Lee, “Large Vocabulary Mandarin Speech Recognition with Different Approaches in Modeling Tones”

[2] Merriam-Webster Dictionary, http://www.merriam-webster.com/

[3] Yao Qian & Frank Soong, “A Multispace Distribution (MSD) and Two Stream Tone Modeling Approach to Mandarin Speech Recognition”, Science Direct, 2009

[4] Chongjia Ni, Wenju Liu, & Bo Xu, “Improved Large Vocabulary Mandarin Speech Recognition Using Prosodic and Lexical Information in Maximum Entropy Framework”

[5] Yi Liu & Pascale Fung, “Pronunciation Modeling for Spontaneous Mandarin Speech Recognition”, International Journal of Speech Technology, 2004

[6] Lori Lamel, J.L. Gauvain, V.B. Le, I. Oparin, S. Meng, “Improved Models for Mandarin Speech to Text Transcription”, ICASSP, 2011

[7] O. Kalinli, “Tone and Pitch Accent Classification Using Auditory Attention Cues”, ICASSP, 2011