How to deal with the noise in real systems?

PCS Research & Advanced Technology Labs

Speech Lab

How to deal with the noise in real systems?


Hsiao-Chun Wu

Motorola PCS Research and Advanced Technology Labs, Speech Laboratory

richardw@srl.css.mot.com

Phone: (815) 884-3071


PCS Research & Advanced Technology Labs

Speech Lab

November 14, 2000

Why do we need to study noise?

Noise exists everywhere, and it degrades the performance of signal processing systems in practice. Since system engineers cannot avoid noise, modern "noise-processing" technology has been researched and designed to overcome this problem. Many related research areas have therefore emerged, such as signal detection, signal enhancement/noise suppression, and channel equalization.





How to deal with noise? Cut it off!

Spectral Truncation

  Spectral Subtraction (1989): R(f) = S(f) + N(f);  S~(f) = R(f) - N~(f)

Time Truncation

  Signal Detection: capture r(n) only up to the detected end point T_noise

Spatial and/or Temporal Filtering

  Equalization: r(t) = h(t) * s(t);  s~(t) = w(t) * r(t) ≈ s(t)

Array Signal Separation (Blind Source Separation):

  S~(t) = W(t) R(t) = W(t) H(t) S(t) ≈ S(t)
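The spectral subtraction idea (subtract a noise magnitude estimate N~(f) from each frame spectrum R(f)) can be sketched with NumPy. This is a minimal illustration, not the exact 1989 algorithm: the frame length, the leading-frames noise estimate, and the zero floor are all assumptions of ours.

```python
import numpy as np

def spectral_subtraction(r, frame_len=256, noise_frames=5):
    """Subtract an estimated noise magnitude spectrum |N~(f)| from each
    frame's spectrum R(f) to form the enhanced estimate S~(f)."""
    n_frames = len(r) // frame_len
    frames = r[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.fft.rfft(frames, axis=1)                    # R(f) per frame
    noise_mag = np.abs(spectra[:noise_frames]).mean(axis=0)  # |N~(f)| from leading noise-only frames
    mag = np.maximum(np.abs(spectra) - noise_mag, 0.0)       # |S~(f)|, floored at zero
    clean = np.fft.irfft(mag * np.exp(1j * np.angle(spectra)), n=frame_len, axis=1)
    return clean.reshape(-1)

# Noisy sine whose leading frames are noise-only, so the estimate is sensible
rng = np.random.default_rng(0)
t = np.arange(4096) / 8000.0
sig = np.concatenate([np.zeros(1280), np.sin(2 * np.pi * 440 * t[1280:])])
noisy = sig + 0.1 * rng.standard_normal(len(sig))
enhanced = spectral_subtraction(noisy)
print(enhanced.shape)
```

Flooring the magnitude at zero prevents negative spectra; practical systems use a small positive floor instead to reduce musical noise.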












Session 1. On-line Automatic End-of-speech Detection Algorithm
(Time Truncation)

1. Project goal.

2. Review of current methods.

3. Introduction to voice metric based end-of-speech detector.

4. Simulation results.

5. Conclusion.



1. Project Goal:

Problem

  Digit-dial recognition with unknown digit string length

Solution 1

  A fixed-length window such as 10 seconds? (inconvenient to users)

Solution 2

  Dynamic termination of data capture? (needs a robust detection algorithm)



Research and design a robust dynamic termination mechanism for the speech recognizer:

  a new on-line automatic end-of-speech detection algorithm with small computational complexity.

Design a more robust front end to improve the recognition accuracy of speech recognizers:

  the new algorithm can also reduce the excessive feature extraction caused by redundant noise.


2. Review of Current Methods:

Most speech detection algorithms fall into three categories.

Frame energy detection

  Short-term frame energy (20 msec) can be used for speech/noise classification.

  It is not robust at high background noise levels.

Zero-crossing rate detection

  The short-term zero-crossing rate can also be used for speech/noise classification.

  It is not robust across a wide variety of noise types.

Higher-order-spectral detection

  Short-term higher-order spectra can be used for speech/noise classification.

  It implies a heavy computational complexity, and its threshold is difficult to predetermine.
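The first two categories can be sketched directly; the frame length matches the 20 msec mentioned above, while both decision thresholds are illustrative placeholders rather than values from the slides.

```python
import numpy as np

def classify_frames(x, fs=8000, frame_ms=20, energy_thresh=0.01, zcr_thresh=0.25):
    """Label each 20-ms frame as speech (True) or noise (False) using
    short-term frame energy and zero-crossing rate."""
    n = int(fs * frame_ms / 1000)
    labels = []
    for i in range(0, len(x) - n + 1, n):
        frame = x[i:i + n]
        energy = np.mean(frame ** 2)                          # short-term energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)    # zero-crossing rate
        # High energy with a moderate crossing rate suggests voiced speech;
        # both thresholds are illustrative, not tuned values.
        labels.append(bool(energy > energy_thresh and zcr < zcr_thresh))
    return labels

# 0.1 s of near-silence followed by 0.1 s of a strong 200-Hz tone
fs = 8000
t = np.arange(int(0.1 * fs)) / fs
x = np.concatenate([1e-3 * np.random.default_rng(1).standard_normal(len(t)),
                    0.5 * np.sin(2 * np.pi * 200 * t)])
print(classify_frames(x))
```

As the review notes, a classifier this simple fails at high noise levels: once the noise energy approaches the speech energy, no fixed threshold separates the two.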



3. Introduction to Voice Metric Based End-of-speech Detector:

End-of-speech detection using voice metric features is based on the Mel energies. Voice metric features are robust over a wide variety of background noise. The voice metric based speech/noise classifier was originally applied in the IS-127 CELP speech coder standard. We modify and enhance the voice-metric features to design a new end-of-speech detector for the Motorola voice recognition front end (VR LITE III).
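The termination rule implied above (declare end of speech once silence has persisted for a duration threshold after speech began; the simulations below use 1.85 seconds) can be sketched generically. The actual VR LITE III logic is not shown here, and every name in this sketch is our own.

```python
def end_of_speech(frame_is_speech, frame_ms=20, silence_thresh_s=1.85):
    """Return the index of the frame at which data capture may stop:
    the first frame by which silence has lasted silence_thresh_s,
    counted only after speech has started.  Returns None otherwise."""
    needed = int(silence_thresh_s * 1000 / frame_ms)   # frames of silence required
    started, silent = False, 0
    for i, is_speech in enumerate(frame_is_speech):
        if is_speech:
            started, silent = True, 0                  # speech resets the silence count
        elif started:
            silent += 1
            if silent >= needed:
                return i                               # terminate data capture here
    return None

# 1 s of speech followed by 2 s of silence, in 20-ms frames
flags = [True] * 50 + [False] * 100
print(end_of_speech(flags))
```

Counting silence only after speech has started keeps a leading pause (the user not yet speaking) from terminating capture prematurely.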



voice metric score table


[Block diagram of the End-of-speech Detector attached to the Original VR LITE Front End. Components: raw data, FFT, Mel-Spectrum, SNR Estimate, Voice Metric, Pre-S/N Classifier, Post-S/N Classifier, Threshold Adaptation, EOS Buffer, Speech Start?, and Silence Duration Threshold. Voice metric scores from the front end feed the End-of-speech Detector; on a "yes" decision data capture stops, on "no" processing continues.]


[Flowchart: speech input enters the front end with end-of-speech detector, which performs segmentation of speech into frames. For each frame i it asks "end of speech?"; if no, it proceeds to the next frame i+1, and if yes, data capture terminates. Feature vectors accumulate in the feature vector frame buffer and feed the VR LITE recognition engine.]


[Waveform figure: string "2-2-9-1-7-8" in a car at 55 mph. Raw data: 6.51 seconds; true end point: 3.78 seconds; detected end point: 4.81 seconds.]


[Histogram figure: correct detection time error and false detection time error (in seconds) relative to the end point, for string "2-2-9-1-7-8" in a car at 55 mph.]


4. Simulation Results:
(The simulation is done over the Motorola digit-string database, including 16 speakers and 15,166 variable-length digit strings in 7 different conditions. The silence threshold is 1.85 seconds.)

A. Receiver Operating Curve (ROC):

The ROC curve is the relationship between the end-of-speech detection rate and the false (early) detection rate. We compare two different methods, namely, (1) the new voice-metric based end-of-speech detector and (2) the old speech/noise flag based end-of-speech detector.




[ROC curve figure: detection rate versus false detection rate (%).]



B. String-accuracy-convergence (SAC) curve:

The SAC curve is the relationship between the string recognition accuracy and the false (early) detection rate. We compare two different methods, namely, (1) the new voice-metric based end-of-speech detector and (2) the old speech/noise flag based end-of-speech detector.



[SAC curve figure: string recognition accuracy (%) versus false detection rate (%).]


C. Table of detection results:
(This table illustrates the results over the Madison sub-database, including data files with 1.85 seconds or more of silence after the end of speech.)

Condition         | Avg Time Error | Avg False Det. Time Error | Avg Correct Det. Time Error | False Det. Rate | String Numbers | Total Det. Rate
Overall           | 1.98 sec       | 1.68 sec                  | 1.85 sec                    | 0.47%           | 7,418          | 86.08%
Office Close-talk | 1.97 sec       | 0 sec                     | 1.93 sec                    | 0%              | 907            | 94.82%
Office Arm-length | 1.98 sec       | 0 sec                     | 1.93 sec                    | 0%              | 988            | 93.62%
Café Close-talk   | 2.17 sec       | 0 sec                     | 2.00 sec                    | 0%              | 1,147          | 81.87%
Café Arm-length   | 2.31 sec       | 0.14 sec                  | 2.00 sec                    | 0.11%           | 898            | 57.57%
Car Idle (HF)     | 1.91 sec       | 1.02 sec                  | 1.84 sec                    | 0.08%           | 1,210          | 93.97%
Car 35 mph (HF)   | 1.93 sec       | 0.96 sec                  | 1.77 sec                    | 0.71%           | 1,130          | 87.61%
Car 55 mph (HF)   | 1.66 sec       | 2.00 sec                  | 1.59 sec                    | 2.20%           | 1,138          | 89.63%

(This table illustrates the results over the small database collected by Motorola PCS CSSRL. All digit strings are recorded in a 15-second fixed window.)

Condition         | Avg Time Error | Avg False Det. Time Error | Avg Correct Det. Time Error | False Det. Rate | String Numbers | Total Det. Rate | String Rec. Accuracy (w/ EOS) | String Rec. Accuracy (w/o EOS)
Overall           | 1.82 seconds   | 0 seconds                 | 1.82 seconds                | 0%              | 121            | 96.69%          | 50.41%                        | 29.75%
Office Close-talk | 1.85 seconds   | 0 seconds                 | 1.85 seconds                | 0%              | 21             | 100%            | 66.67%                        | 61.90%
Office Arm-length | 1.84 seconds   | 0 seconds                 | 1.84 seconds                | 0%              | 20             | 100%            | 65.00%                        | 65.00%
Café Close-talk   | 1.76 seconds   | 0 seconds                 | 1.76 seconds                | 0%              | 40             | 100%            | 40.00%                        | 15.00%
Café Arm-length   | 1.85 seconds   | 0 seconds                 | 1.85 seconds                | 0%              | 40             | 90%             | 45.00%                        | 10.00%


Analysis of the Simulation Result:
Why didn’t EOS
detection work well in babble noise?


Optimal Detection Decision

Bayes classifier:

  decide H_s if log[f(x | H_s)] > log[f(x | H_n)]; otherwise decide H_n.

Likelihood Ratio Test:

  L(x) = log[f(x | H_s)] - log[f(x | H_n)];  decide H_s if L(x) > T_Bayes, H_n otherwise,

  where T_Bayes = log[ P(H_n) / P(H_s) ].
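The likelihood ratio test becomes concrete under an assumed pair of unit-variance Gaussian hypotheses (the means below are our choice, purely for illustration); with equal priors the Bayes threshold log[P(H_n)/P(H_s)] is zero.

```python
import math

def log_likelihood_ratio(x, mu_s=1.0, mu_n=0.0, sigma=1.0):
    """L(x) = log f(x|Hs) - log f(x|Hn) for two Gaussians with common sigma."""
    def log_gauss(v, mu):
        return -0.5 * math.log(2 * math.pi * sigma ** 2) - (v - mu) ** 2 / (2 * sigma ** 2)
    return log_gauss(x, mu_s) - log_gauss(x, mu_n)

def decide(x, p_s=0.5, p_n=0.5):
    """Choose Hs when L(x) exceeds the Bayes threshold log(P(Hn)/P(Hs))."""
    t_bayes = math.log(p_n / p_s)
    return "Hs" if log_likelihood_ratio(x) > t_bayes else "Hn"

print(decide(0.9))   # sample near mu_s -> speech hypothesis
print(decide(-0.2))  # sample near mu_n -> noise hypothesis
```

For these parameters L(x) reduces to x - 0.5, so the equal-prior decision boundary sits midway between the two means, as expected.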








Digit “one” in close-talking mic, quiet office


Digit “one” in handsfree mic, 55 mi/h car


Digit “one” in far-talking mic, cafeteria


5. Conclusion:

The new voice-metric based end-of-speech detector is robust over a wide variety of background noise.

The new voice-metric based end-of-speech detector brings only a small increase in computational complexity and can be implemented in real time.

The new voice-metric based end-of-speech detector can improve recognition performance by discarding the extra noise due to the fixed data capture window.

The new voice-metric based end-of-speech detector needs further improvement in babble noise environments.





Session 2. Speech Enhancement Algorithms: Blind Source Separation Methods
(Spatial and Temporal Filtering)

1. Motivation and research goal.

2. Statement of the "blind source separation" problem.

3. Principles of blind source separation.

4. Criteria for blind source separation.

5. Application to blind channel equalization for digital communication systems.

6. Simulation and comparison.

7. Summary and conclusion.



1. Motivation:

Mimic the human auditory system to differentiate the subject signals from other sounds, such as interfering sources and background noise, for clear recognition of the subject contents.

'One of the most striking facts about our ears is that we have two of them -- and yet we hear one acoustic world; only one voice per speaker.' (E. C. Cherry and W. K. Taylor. Some further experiments on the recognition of speech, with one and two ears. Journal of the Acoustical Society of America, 26:554-559, 1954)

The ‘‘cocktail party effect’’ -- the ability to focus one’s listening attention on a single talker among a cacophony of conversations and background noise -- has been recognized for some time. This specialized listening ability may be due to characteristics of the human speech production system, the auditory system, or high-level perceptual and language processing.



Research Goal:

Design a preprocessor with digital signal processing speech enhancement algorithms. The input signals are collected through multiple-sensor (microphone) arrays. After the computation of the embedded signal processing algorithms, we obtain clearly separated signals at the output.


Audio Input → Blind Source Separation Algorithms → Enhanced Output


2. Problem Statement of Blind Source Separation:

What is "Blind Source Separation"?

Signals 1 through M reach sensors 1 through N as the received input signals. Given the N linearly mixed received input signals, we need to recover the M statistically independent sources as much as possible (M ≤ N).


Formulation of the Blind Source Separation Problem:

A received signal vector from the array, X(t), is the original source vector S(t) passed through the channel distortion H(t), such that X(t) = H(t)S(t), where

  X(t) = [x_1(t), ..., x_N(t)]^T,  S(t) = [s_1(t), ..., s_M(t)]^T

and

  H(t) = [h_ij(t)], the N x M matrix with entries h_11(t), ..., h_NM(t).

We need to estimate a separator W(t) such that

  S~(t) = [s~_1(t), ..., s~_M(t)]^T = W(t)X(t),

where

  W(t) = [w_pq(t)], the N x N matrix with entries w_11(t), ..., w_NN(t).
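The model X(t) = H(t)S(t) can be illustrated numerically. This sketch mixes M = 2 independent sources through a known instantaneous mixing matrix H and applies the oracle separator W = H^{-1}; a blind algorithm must estimate such a W without knowing H.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 1000
# M = 2 statistically independent sources (uniform and Laplacian)
S = np.vstack([rng.uniform(-1, 1, T), rng.laplace(0, 1, T)])
H = np.array([[1.0, 0.6],
              [0.4, 1.0]])          # N x M mixing matrix (here N = M = 2)
X = H @ S                           # received mixtures X(t) = H S(t)

W = np.linalg.inv(H)                # oracle separator; BSS must find W blindly
S_tilde = W @ X                     # S~(t) = W X(t)
print(np.allclose(S_tilde, S))
```

With the oracle W the recovery is exact; the blind criteria in the following slides exist precisely because H, and hence this W, is unknown in practice.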





3. Principles of Blind Source Separation:

The independence measurement: Shannon's mutual information.

  I(y_1, y_2, ..., y_N) = sum_{i=1}^{N} H(y_i) - H(y_1, y_2, ..., y_N) >= 0

  I(y_1, y_2, ..., y_N) = E{ log[ f_Y(y_1, y_2, ..., y_N) ] } - sum_{i=1}^{N} E{ log[ f_{Y_i}(y_i) ] }
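For discrete distributions the mutual information above can be computed directly: it is zero for independent outputs and positive otherwise. The helper names in this sketch are ours.

```python
import math
from collections import Counter
from itertools import product

def entropy(probs):
    """Shannon entropy in bits of a probability list."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(y1, y2) = H(y1) + H(y2) - H(y1, y2) for a joint pmf dict {(a, b): p}."""
    p1, p2 = Counter(), Counter()
    for (a, b), p in joint.items():
        p1[a] += p
        p2[b] += p
    return entropy(p1.values()) + entropy(p2.values()) - entropy(joint.values())

# Independent bits give I = 0; identical bits give I = 1 bit
indep = {(a, b): 0.25 for a, b in product([0, 1], repeat=2)}
ident = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(indep), mutual_information(ident))
```

Minimizing this quantity over the separator outputs is exactly what the criteria on the next slide approximate in different ways.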




4. Criteria to Separate Independent Sources:

Constrained Entropy (Wu, IJCNN99):

  J_1 = -log[ det(W) ] - sum_{i=1}^{N} log[ f_i(y_i, theta_i) ]

Hadamard Measure (Wu, ICA99):

  J_2 = log[ det( diag(E[Y Y^T]) ) ] - log[ det( E[Y Y^T] ) ]

Frobenius Norm (Wu, NNSP97):

  J_3 = || E[Y Y^T] - diag(E[Y Y^T]) ||_F^2

Quadratic Gaussianity (Wu, NNSP99):

  J_4 = sum_i integral ( f_{Y_i}(y_i) - f_G(y_i) )^2 dy_i
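Two of the criteria, the Hadamard measure J_2 and the Frobenius norm J_3, depend only on the output correlation matrix E[YY^T] and can be evaluated on sample data. The formulas follow our reconstruction above; this sketch only checks that mixing increases both measures relative to independent outputs.

```python
import numpy as np

def hadamard_measure(Y):
    """J2 = log det(diag(R)) - log det(R), R = E[YY^T]; 0 iff R is diagonal."""
    R = Y @ Y.T / Y.shape[1]
    return float(np.sum(np.log(np.diag(R))) - np.log(np.linalg.det(R)))

def frobenius_criterion(Y):
    """J3 = || E[YY^T] - diag(E[YY^T]) ||_F^2; 0 iff outputs are uncorrelated."""
    R = Y @ Y.T / Y.shape[1]
    off = R - np.diag(np.diag(R))
    return float(np.sum(off ** 2))

rng = np.random.default_rng(3)
Y_indep = rng.standard_normal((2, 50000))                  # nearly uncorrelated rows
Y_mixed = np.array([[1.0, 0.0], [0.9, 0.4]]) @ Y_indep     # strongly correlated rows
print(hadamard_measure(Y_indep) < hadamard_measure(Y_mixed))
print(frobenius_criterion(Y_indep) < frobenius_criterion(Y_mixed))
```

Both are second-order (decorrelation) measures; J_1 and J_4 additionally use the output densities and so can distinguish independence from mere uncorrelatedness.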







5. Application to Blind Single Channel Equalization for Digital Communication Systems:

We apply the minimization of the modified constrained entropy

  J = -log(w_0) - sum_{i=1}^{N} log[ f_i(y_i, theta_i) ]

to adapt an equalizer w(t) = [w_0, w_1, ...] for a digital channel h(t). Assume a PAM signal constellation with symbols s(t) = , passing through a digital channel

  h(t) = [ c(t, 0.11) + 0.8 c(t-1, 0.11) - 0.4 c(t-3, 0.11) ] W_6T(t),

where c(t, beta) is the raised-cosine function with roll-off factor beta,

  c(t, beta) = sinc(t/T) cos(pi beta t / T) / (1 - 4 beta^2 t^2 / T^2),

and W_6T(t) = rect(t / 6T) is a rectangular window. The input signal to the equalizer is

  x(t) = h(t) * s(t) + n(t),

where n(t) is the background noise. We applied generalized anti-Hebbian learning to adapt w(t) such that

  w(t) * h(t) = delta(t).
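The raised-cosine pulse and the three-tap channel h(t) can be sketched as follows. The pulse formula matches our reconstruction of the slide (a symbol period T = 1 is assumed), and the singularity at |t| = T/(2*beta) is handled with the standard limit of the raised-cosine expression.

```python
import numpy as np

def raised_cosine(t, beta, T=1.0):
    """c(t, beta) = sinc(t/T) cos(pi beta t/T) / (1 - 4 beta^2 t^2 / T^2)."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    denom = 1.0 - 4.0 * beta ** 2 * t ** 2 / T ** 2
    with np.errstate(divide="ignore", invalid="ignore"):
        out = np.sinc(t / T) * np.cos(np.pi * beta * t / T) / denom
    # At |t| = T/(2 beta) the expression is 0/0; its limit is (pi/4) sinc(1/(2 beta))
    sing = np.isclose(np.abs(t), T / (2.0 * beta))
    out[sing] = (np.pi / 4.0) * np.sinc(1.0 / (2.0 * beta))
    return out

# Channel from the slide: h(t) = c(t, 0.11) + 0.8 c(t-1, 0.11) - 0.4 c(t-3, 0.11),
# sampled over a 6-symbol span (the rectangular window W_6T)
t = np.arange(-3.0, 3.0, 0.125)
h = raised_cosine(t, 0.11) + 0.8 * raised_cosine(t - 1, 0.11) - 0.4 * raised_cosine(t - 3, 0.11)
print(h[np.isclose(t, 0.0)])   # main tap: the echoes vanish at integer offsets
```

Because c(t, beta) is zero at nonzero integer multiples of T, the main tap at t = 0 is 1 while the 0.8 and -0.4 echoes create the intersymbol interference the equalizer must undo.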

[Figure: results versus signal-to-noise ratio (dB).]


[Figure: bit error rate versus signal-to-noise ratio (dB).]


6. Simulation and Comparison:

Simulation results compare our generalized anti-Hebbian learning, the SDIF algorithm, and Lee's Infomax method (Lee, IJCNN97) over three real recordings downloaded from the Salk Institute, University of California at San Diego.



New VR LITE Frontend: Blind Source Separation + End-of-speech Detection

Schemes  | Avg Detection Time Error | Avg False Det. Time Error | Avg Correct Det. Time Error | Number of Strings | False Det. Rate | Total Det. Rate
EOS only | 0.256 seconds            | 0.155 seconds             | 0.317 seconds               | 14                | 7.14%           | 42.86%
BSS+EOS  | 0.236 seconds            | 0.125 seconds             | 0.322 seconds               | 14                | 7.14%           | 50.00%

7. Conclusion and Future Research:

The computational complexity of blind source separation needs to be reduced.

Test BSS for EOS detection with microphone arrays of the same kind.

Incorporate other array signal processing techniques (beamforming?) to improve speech detection and recognition.