Distributed Speech Recognition

Nov 17, 2013

David Pearce
Motorola Labs
bdp003@motorola.com

“Where is 358 Madison Avenue”

Voice & Multimodal
Voice-enabled services: user enters commands via speech or keypad; system responds with speech or sounds.
Multimodal-enabled services: keypad in and speech in; audio out and screen out (graphics, text).

Distributed Speech Recognition
[Figure: client devices connect over a [wireless] packet data network to a voice gateway / server (VoiceXML / multimodal browser; speech resources such as ASR and TTS), which reaches content servers over an IP network.]
Conventional (circuit-switched mobile voice channel): speech coder -> speech decoder -> ISDN -> ASR front-end -> ASR decoder
DSR (packet data channel, e.g. GPRS or CDMA 1x): ASR front-end -> packet data channel -> ASR decoder
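
The DSR path above can be illustrated with a toy sketch in which the terminal computes per-frame features and only those features cross the packet data channel, while decoding stays at the server. Everything here is an assumption made for illustration: the function names, the 25 ms window and 10 ms shift at 8 kHz, and the single log-energy feature sent as uncompressed 32-bit floats. It is not the ETSI front-end or its compression scheme.

```python
import math
import struct

FRAME_LEN = 200    # samples per frame: 25 ms at 8 kHz (assumed)
FRAME_SHIFT = 80   # samples between frame starts: 10 ms at 8 kHz (assumed)


def client_front_end(samples):
    """Toy terminal-side front-end: one log-energy value per frame."""
    feats = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_SHIFT):
        frame = samples[start:start + FRAME_LEN]
        energy = sum(s * s for s in frame)
        feats.append(math.log(max(energy, 1e-10)))  # floor avoids log(0)
    return feats


def pack_features(feats):
    """Serialise features for the packet data channel (no compression)."""
    return struct.pack(f"<{len(feats)}f", *feats)


def server_unpack(payload):
    """Server side: recover the feature sequence handed to the ASR decoder."""
    count = len(payload) // 4
    return list(struct.unpack(f"<{count}f", payload))


if __name__ == "__main__":
    # One second of a 440 Hz tone sampled at 8 kHz stands in for speech.
    audio = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
    payload = pack_features(client_front_end(audio))
    print(len(server_unpack(payload)), "frames in", len(payload), "bytes")
```

In this sketch the server never sees the waveform, only the feature stream, which is the point of the split shown in the figure.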

Benefits of DSR
Improves performance over wireless channels
Minimises impact of codec & channel errors
Consistent performance over coverage area
Improved performance in background noise
53% reduction in error rate
Ease of integration of combined speech and data applications
Use packet data channel for both DSR and other data
[Figure: word accuracy (%), 80 to 100, versus GSM signal strength (baseline error-free, strong, medium, weak) for EFR coded speech and DSR.]

DSR Standards
Distributed Speech Recognition
Speech Enabled Services
Fixed point DSR standard created
DSR selected as the recommended codec for SES (Approved June 04)
DSR Advanced Front-end (Oct 2002)
DSR Extended Advanced Front-end (Nov 2003)
Speech Enabled Services New Work Item (Approved Jan 2005)
3GPP2
IETF
RTP payload formats for DSR
Specifications standardised: RFC 4060
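
Since the IETF work covers carrying DSR frames over RTP, a minimal sketch of wrapping an opaque payload in the 12-byte fixed RTP header (RFC 3550) may help. The payload bytes, the dynamic payload type 96, and the function names are illustrative assumptions. The DSR payload layout itself is defined in RFC 4060 and is not reproduced here.

```python
import struct

RTP_VERSION = 2


def build_rtp_packet(payload, seq, timestamp, ssrc, payload_type=96, marker=False):
    """Prefix an opaque payload with the fixed RTP header (no CSRC list, no extension)."""
    if not 0 <= payload_type <= 127:
        raise ValueError("payload type must fit in 7 bits")
    byte0 = RTP_VERSION << 6                      # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | payload_type     # M bit plus 7-bit payload type
    header = struct.pack(
        "!BBHII",                                 # network byte order, 12 bytes
        byte0,
        byte1,
        seq & 0xFFFF,
        timestamp & 0xFFFFFFFF,
        ssrc & 0xFFFFFFFF,
    )
    return header + payload


def parse_rtp_header(packet):
    """Decode the fixed header fields and return them with the payload."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[12:],
    }


if __name__ == "__main__":
    fake_dsr_payload = bytes(12)  # placeholder bytes, not a real DSR frame pair
    packet = build_rtp_packet(fake_dsr_payload, seq=1, timestamp=160, ssrc=0x1234)
    print(parse_rtp_header(packet))
```

Keeping the payload opaque here is deliberate: the sketch shows only the transport framing and leaves the frame-pair format to the RFC.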

DSR Advanced Front-end (ES 202 050)
Noise robust front-end
Half the error rate compared with mel-cepstrum in background noise
Double Wiener filtering noise suppression
Waveform processing
Blind equalisation
Representation: 12 cepstral coefficients, C0, logE
Compression gives bit rate of 4.8 kbit/s
[Figure: feature extraction block diagram. Input signal (8 & 16 kHz) -> noise reduction -> waveform processing -> cepstrum calculation -> blind equalization -> to feature compression, with a VAD block.]

DSR Extension (ES 202 212)
Enables speech waveform reconstruction at server for human listening
Adds 800 bps containing pitch (total 5.6 kbps)
Assists recogniser with tonal language recognition (e.g. Mandarin, Cantonese)
[Figure: block diagram. Speech in -> ETSI standard DSR front-end (MFCC & log-E @ 4800 bps) plus pitch & class estimation (pitch & class @ 800 bps) -> channel -> DSR back-end, with pitch tracking and smoothing, speech reconstruction (speech out) and tonal information.]
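
The bit rates quoted for the front-end and its extension can be turned into an average per-frame bit budget. The sketch below assumes a 10 ms frame shift (100 feature frames per second), which the slides do not state; the constant and function names are illustrative.

```python
FRAMES_PER_SECOND = 100   # assumption: 10 ms frame shift
FRONT_END_BPS = 4800      # MFCC & log-E stream, from the slide
PITCH_BPS = 800           # pitch & class stream, from the slide


def bits_per_frame(bit_rate_bps, frames_per_second=FRAMES_PER_SECOND):
    """Average number of bits available per feature frame at a given bit rate."""
    return bit_rate_bps / frames_per_second


total_bps = FRONT_END_BPS + PITCH_BPS  # 5600 bps, matching the 5.6 kbps total

if __name__ == "__main__":
    print("front-end only:", bits_per_frame(FRONT_END_BPS), "bits/frame")
    print("with extension:", bits_per_frame(total_bps), "bits/frame")
```

These are averages over the whole bitstream, so they include whatever framing and error protection the format carries in addition to the quantised features.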

Results of ASR vendor evaluations in 3GPP

8 kHz                 Databases   AMR 4.75 average       DSR average            Average
                      tested      absolute performance   absolute performance   improvement
Digits                11          13.2                   7.7                    39.9%
Sub-word              5           9.1                    6.5                    30.0%
Tone confusability    1           3.6                    3.1                    14.8%
Channel errors        4           6.1                    2.4                    52.8%
Weighted average                                                                36%

Extensive testing on 21 different speech databases
Covering different languages, tasks and environments
Tests performed with IBM and Scansoft commercial recognisers
Results above are for low data-rate comparison for packet data (< 8 kbit/s)

Packet Switched Channel Errors
[Figure: robustness to block errors, narrow-band (8 kHz). Word accuracy (%), 86.0 to 98.0, versus block error rate (%), 0 to 4, for DSR, AMR 12.2 and AMR 4.75.]
Aurora-3 Italian speech database
GPRS network simulation for distribution of errors
3GPP Feb 2004

Coded speech vs DSR (Aurora-3 Italian)

                 DSR     AMR 4.75   Degradation
Well matched     96.5    94.4       -57%
Med mismatch     90.4    83.9       -68%
High mismatch    88.6    76.8       -104%
Average          92.4    86.3       -73%

                 DSR     EVRC       Degradation
Well matched     96.5    90.6       -165%
Med mismatch     90.4    75.9       -151%
High mismatch    88.6    70.5       -160%
Average          92.4    80.4       -159%
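
The degradation column appears to be the relative increase in word error rate of the coded-speech system over DSR, assuming the DSR and codec columns are word accuracies in percent (the slide does not label them). A minimal sketch under that assumption, with an illustrative function name:

```python
def degradation_percent(dsr_accuracy, codec_accuracy):
    """Relative change in word error rate when moving from DSR to coded speech.

    Both arguments are word accuracies in percent (an assumption about the table).
    A negative result means the coded-speech system makes more errors than DSR.
    """
    dsr_wer = 100.0 - dsr_accuracy
    codec_wer = 100.0 - codec_accuracy
    return -100.0 * (codec_wer - dsr_wer) / dsr_wer


if __name__ == "__main__":
    # (DSR accuracy, AMR 4.75 accuracy) taken from the first table above.
    amr_rows = {
        "Well matched": (96.5, 94.4),
        "Med mismatch": (90.4, 83.9),
        "High mismatch": (88.6, 76.8),
    }
    for name, (dsr, amr) in amr_rows.items():
        print(f"{name}: {degradation_percent(dsr, amr):.0f}%")
```

Under this reading the medium and high mismatch rows for AMR 4.75 come out at -68% and -104%, as in the table. Other rows agree only to within a few points, so the rounding of the tabulated accuracies, or the slide's exact method, may differ from this sketch.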

Distributed Multimodal Architecture
Handset device
  Input modalities (i.e., DSR, keypad input, pen entry)
  Output media (e.g., visual rendering, decoded speech output)
  Application environment (Java or WAP browser)
  Protocols (SIP / RTP, multimodal remote control)
Multimodal Gateway
  DSR decoder
  Multimodal VoiceXML browser
  Protocols
Applications and content
  Content authoring
  Content delivery
[Figure: handset (J2ME application, multimodal browser, DSR front end) connected by RTP & SIP over a GPRS or 3G network to the MM gateway (multimodal browser, DSR ASR decoder), which reaches a content server holding multimodal applications and content via VoiceXML over HTTP.]