Occasion:

HUMAINE / WP4 / Workshop
"From Signals to Signs of Emotion and Vice Versa"
Santorini / Fira, 18th - 22nd September, 2004
Talk:

Ronald Müller

Speech Emotion Recognition
Combining Acoustic and Semantic Analyses

Institute for Human-Machine Communication
Technische Universität München

Slide -2-



Outline

- System Overview
- Emotional Speech Corpus
- Acoustic Analysis
- Semantic Analysis
- Stream Fusion
- Results

Slide -3-

System Overview

Speech signal
  -> Prosodic features -> Classifier (SVM)
  -> ASR-unit -> Semantic interpretation (Bayesian Networks)
  both streams -> Stream fusion (MLP) -> Emotion

Slide -4-



Emotional Speech Corpus

Emotion set:
- Anger, disgust, fear, joy, neutrality, sadness, surprise

Corpus 1: Practical course
- 404 acted samples per emotion (7 x 404 = 2828 samples in total)
- 13 speakers (1 female)
- Recorded within one year

Corpus 2: Driving simulator
- 500 spontaneous emotion samples
- 200 acted samples (disgust, sadness)
- 700 samples in total

Slide -5-

System Overview (recap)

Slide -6-

Acoustic Analysis

Low-level features
- Pitch contour (AMDF, low-pass filtering)
- Energy contour
- Spectrum
- Signal

High-level features
- Statistical analysis of contours
- Elimination of mean, normalization to standard deviation
- Duration of one utterance (1-5 seconds)
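The low-level/high-level split above can be sketched in code. The AMDF pitch estimator and the contour statistics below are illustrative stand-ins for the talk's feature extractor, not its exact implementation; all function names are ours.

```python
import numpy as np

def amdf_pitch(frame, sr, f_lo=50, f_hi=400):
    """Rough pitch estimate via the Average Magnitude Difference Function.

    The AMDF dips at lags matching the pitch period; we search the lag
    range corresponding to f_lo..f_hi Hz. Illustrative sketch only.
    """
    lags = np.arange(int(sr / f_hi), int(sr / f_lo))
    amdf = np.array([np.mean(np.abs(frame[l:] - frame[:-l])) for l in lags])
    return sr / lags[np.argmin(amdf)]

def contour_stats(contour):
    """High-level statistical features over a contour (pitch or energy)."""
    c = np.asarray(contour, dtype=float)
    grad = np.diff(c)
    return {
        "mean": c.mean(),
        "std": c.std(),
        "max_gradient": grad.max(),  # cf. the top-ranked pitch feature
    }

def normalize(features):
    """Eliminate the mean and normalize to standard deviation."""
    f = np.asarray(features, dtype=float)
    return (f - f.mean()) / (f.std() + 1e-12)
```

A real system would compute these per frame over the 1-5 second utterance and collect the statistics into the feature vector passed to the classifier.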

Slide -7-

Acoustic Analysis

Feature selection (1/2)
- Initial set of 200 statistical features
- Ranking 1: single performance of each feature (nearest-mean classifier)
- Ranking 2: Sequential Forward Floating Search, wrapped by a nearest-mean classifier
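Sequential Forward Floating Search with a nearest-mean wrapper can be sketched as follows. The per-size bookkeeping (`best`) follows the usual SFFS formulation; the data handling and names are illustrative, not the talk's code.

```python
import numpy as np

def nearest_mean_acc(X, y, subset):
    """Wrapper criterion: nearest-class-mean accuracy on the chosen features."""
    Xs = X[:, subset]
    classes = np.unique(y)
    means = np.array([Xs[y == c].mean(axis=0) for c in classes])
    dists = ((Xs[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    return (classes[np.argmin(dists, axis=1)] == y).mean()

def sffs(X, y, k):
    """Sequential Forward Floating Search: greedy inclusion, plus a
    'floating' backward step that drops a feature whenever the smaller
    set beats the best set of that size seen so far."""
    selected, best = [], {}
    while len(selected) < k:
        # forward: add the single best remaining feature
        rest = [f for f in range(X.shape[1]) if f not in selected]
        f = max(rest, key=lambda f: nearest_mean_acc(X, y, selected + [f]))
        selected.append(f)
        best[len(selected)] = max(best.get(len(selected), 0.0),
                                  nearest_mean_acc(X, y, selected))
        # floating backward step
        while len(selected) > 2:
            drop = max(selected, key=lambda g:
                       nearest_mean_acc(X, y, [h for h in selected if h != g]))
            acc = nearest_mean_acc(X, y, [h for h in selected if h != drop])
            if acc > best.get(len(selected) - 1, 0.0):
                selected.remove(drop)
                best[len(selected)] = acc
            else:
                break
    return selected
```

The backward step is what distinguishes SFFS from plain forward selection: a feature that looked good early can be evicted once a better combination appears.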


Slide -8-

Acoustic Analysis

Feature selection (2/2): top 10 features

Acoustic Feature                                            SFFS-Rank   Single Perf.
Pitch, maximum gradient                                         1           31.5
Pitch, std. deviation of distance between reversal points       2           23.0
Pitch, mean value                                               3           25.6
Signal, number of zero-crossings                                4           16.9
Pitch, standard deviation                                       5           27.6
Duration of silences, mean value                                6           17.5
Duration of voiced sounds, mean value                           7           18.5
Energy, median of fall-time                                     8           17.8
Energy, mean distance between reversal points                   9           19.0
Energy, mean of rise-time                                      10           17.6

Slide -9-

Acoustic Analysis

Classification: evaluation of various classification methods (33 features)

Classifier   Error (%), speaker-indep.   Error (%), speaker-dep.
kMeans               57.05                       27.38
kNN                  30.41                       17.39
GMM                  25.17                       10.88
MLP                  26.86                        9.36
SVM                  23.88                        7.05
ML-SVM               18.71                        9.05

Output: vector of (pseudo-) recognition confidences

Slide -10-

Acoustic Analysis

Classification: Multi-Layer Support Vector Machines

acoustic feature vector
  -> ang, ntl, fea, joy  /  dis, sur, sad
  -> ang, ntl / fea, joy          dis, sur / sad
  -> ang / ntl    fea / joy       dis / sur
  -> ang   ntl   fea   joy        dis   sur   sad

No confidence vector to forward to fusion
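The cascade of binary decisions can be sketched generically. A nearest-mean stand-in replaces the actual SVMs to keep the sketch dependency-free; the ceil-half split mirrors the slide's first 4/3 layer. All names are illustrative.

```python
import numpy as np

class NearestMeanBinary:
    """Dependency-free stand-in for a binary SVM."""
    def fit(self, X, y):
        self.m0 = X[y == 0].mean(axis=0)
        self.m1 = X[y == 1].mean(axis=0)
        return self
    def predict(self, x):
        return int(np.sum((x - self.m1) ** 2) < np.sum((x - self.m0) ** 2))

class Node:
    def __init__(self, classes):
        self.classes = classes
        self.left = self.right = self.clf = None

def train(X, y, classes, make_clf):
    """Recursively split the emotion set in two and train one binary
    classifier per inner node (ang,ntl,fea,joy / dis,sur,sad -> ...)."""
    node = Node(classes)
    if len(classes) == 1:
        return node
    mid = (len(classes) + 1) // 2
    lset, rset = classes[:mid], classes[mid:]
    mask = np.isin(y, classes)
    side = np.isin(y[mask], rset).astype(int)  # 0 = left half, 1 = right half
    node.clf = make_clf().fit(X[mask], side)
    node.left = train(X, y, lset, make_clf)
    node.right = train(X, y, rset, make_clf)
    return node

def classify(node, x):
    """Walk the cascade down to a single emotion label."""
    while len(node.classes) > 1:
        node = node.right if node.clf.predict(x) else node.left
    return node.classes[0]
```

Note the drawback stated on the slide: the cascade emits a single hard label per path, so there is no 7-dim confidence vector to forward to the fusion stage.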

Slide -11-

System Overview (recap)

Slide -12-

Semantic Analysis

ASR-Unit
- HMM-based
- 1300-word German vocabulary
- No language model
- 5-best phrase hypotheses
- Recognition confidences per word

Example output (first hypothesis):

Word:         I      can't   stand   this   every   tray   traffic-jam
Confidence:   69.3   34.6    72.1    20.0   36.1    15.9   55.8

Slide -13-

Semantic Analysis

Conditions
- Natural language
- Erroneous speech recognition
- Uncertain knowledge
- Incomplete knowledge
- Superfluous knowledge

=> Probabilistic spotting approach
=> Bayesian Belief Networks

Slide -14-

Semantic Analysis

Bayesian Belief Networks
- Acyclic graph of nodes and directed edges
- One state variable X_i per node (here states x_i, not-x_i)
- Setting node-dependencies via conditional probability matrices
- Setting initial probabilities in root nodes
- Observation A causes evidence in a child node (i.e. P(x_C) is known)
- Inference to direct parent nodes and finally to root nodes
- Bayes' rule:

  P(X_Parent | X_Child) = P(X_Child | X_Parent) * P(X_Parent) / P(X_Child)
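The inference step from a child observation back to its parent is plain Bayes' rule over discrete states. A minimal sketch (all state names illustrative):

```python
def posterior(prior_parent, cond_child_given_parent, child_state):
    """Infer the parent distribution from child evidence via Bayes' rule:
    P(X_Parent | X_Child) is proportional to
    P(X_Child | X_Parent) * P(X_Parent).

    prior_parent: dict parent_state -> probability
    cond_child_given_parent: dict (child_state, parent_state) -> probability
    """
    unnorm = {p: cond_child_given_parent[(child_state, p)] * pr
              for p, pr in prior_parent.items()}
    z = sum(unnorm.values())          # normalizer = P(child_state)
    return {p: v / z for p, v in unnorm.items()}
```

In a full network this update is propagated from the observed leaf node through the direct parents up to the root nodes.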

Slide -15-

Semantic Analysis

Emotion modelling

Example phrase: "I can't stand this nasty every tray traffic-jam"

Levels: input level -> words -> superwords -> phrases -> superphrases,
connected by spotting, clustering and sequence handling.
Example nodes along the way: "can't", "stand", "nasty" -> cannot, stand, bad ->
first_person, I_hate, I_like, disgusting -> Abhorrence, Good, Bad, Positive, Negative ->
Joy, Anger, Disgust

Interpretation: Disgust

Output: vector of "real" recognition confidences
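A toy version of the spotting/clustering pass over an ASR hypothesis with per-word confidences; the lexicon, concept clusters, and cue table are entirely hypothetical stand-ins for the network shown on the slide.

```python
# word -> concept ("superword") cluster; illustrative lexicon only
SUPERWORDS = {
    "can't": "cannot", "stand": "stand", "nasty": "bad",
    "hate": "I_hate", "love": "I_like",
}
# concept cues -> emotion; illustrative cue table only
EMOTION_OF = {
    ("cannot", "stand"): "disgust",
    ("bad",): "disgust",
    ("I_hate",): "anger",
}

def spot(words, confidences):
    """Keep only lexicon words, carrying their ASR confidences along."""
    return [(SUPERWORDS[w], c) for w, c in zip(words, confidences)
            if w in SUPERWORDS]

def interpret(concepts):
    """Accumulate emotion confidences from the spotted concept cues."""
    hits = [c for c, _ in concepts]
    conf = {c: v for c, v in concepts}
    scores = {}
    for cue, emo in EMOTION_OF.items():
        if all(c in hits for c in cue):
            avg = sum(conf[c] for c in cue) / len(cue)
            scores[emo] = scores.get(emo, 0.0) + avg
    return scores
```

Because the cue scores inherit the per-word ASR confidences, the output is a vector of "real" recognition confidences rather than hard decisions, which is what the fusion stage needs.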

Slide -16-

System Overview (recap)

Slide -17-

Stream Fusion

Pairwise mean:

  P_fusion(E_n) = ( P_acoustic(E_n) + P_semantic(E_n) ) / 2

Discriminative fusion applying MLP:
- Input layer: 2 x 7 confidences
- Hidden layer: 100 nodes
- Output layer: 7 recognition confidences

Decision: E = argmax_n P_fusion(E_n)
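The pairwise-mean fusion and the final argmax decision are a few lines of code; a minimal sketch (the discriminative variant would instead feed the concatenated 14 confidences through a trained 14-100-7 MLP):

```python
import numpy as np

EMOTIONS = ["ang", "dis", "fea", "joy", "ntl", "sad", "sur"]

def fuse_mean(p_acoustic, p_semantic):
    """Pairwise mean of the two 7-dim confidence vectors:
    P_fusion(E_n) = (P_acoustic(E_n) + P_semantic(E_n)) / 2"""
    return (np.asarray(p_acoustic) + np.asarray(p_semantic)) / 2

def decide(p_fusion):
    """Final decision: the emotion with maximal fused confidence."""
    return EMOTIONS[int(np.argmax(p_fusion))]
```

This is why both streams must output confidence vectors rather than hard labels: the fusion stage has nothing to combine otherwise.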






Slide -18-

Results

Acoustic recognition rates (SVM):

Emotion   ang    dis    fea    joy    ntl    sad    sur    Mean
%         95.5   61.3   78.7   75.1   78.5   62.1   68.3   74.2

Semantic recognition rates:

Emotion   ang    dis    fea    joy    ntl    sad    sur    Mean
%         78.4   71.2   53.4   57.7   56.0   35.0   65.5   59.6

Slide -19-

Results

Recognition rates after discriminative fusion:

Emotion   ang    dis    fea    joy    ntl    sad    sur    Mean
%         98.0   78.7   88.3   95.9   98.2   91.7   95.8   92.0

Overview:

       Acoustic      Language      Fusion     Fusion
       information   information   by mean    by MLP
%      74.2          59.6          83.1       92.0

Slide -20-

Summary

- Acted emotions
- 7 discrete emotion categories
- Prosodic feature selection via
  - single-feature performance
  - Sequential Forward Floating Search
- Evaluative comparison of different classifiers
  - SVMs outperform the others
- Semantic analysis applying Bayesian Networks
- Significant gain by discriminative stream fusion

Slide -21-