
Seminar "Multimodale Räume" (Multimodal Rooms), SS 2003
Introduction, May 7
Rainer Stiefelhagen

Multimodal Rooms
"Smart Rooms"
"Intelligent Environments"

Seminar SS 03




User Interfaces

In the beginning: "Wimpy" computing
Windows, Icons, Menus, Pointing



2nd Generation: Human-Machine Interaction

"Please show me… hm… all hotels in THIS area.. er.. part of the city"

Speaking
Pointing, gesturing
Hand-writing
Drawing
Presence / focus of attention
Combination: speech + hand-writing + gesture
Repair
Multimodal NLP & dialog


"Perceptual" User Interfaces

Perceptive
human-like perceptual capabilities (what is the user saying, who is the user, where is the user, what is he doing?)

Multimodal
people use multiple modalities to communicate (speech, gestures, facial expressions, …)

Multimedia
text, graphics, audio and video

(Matthew Turk (Ed.), Proceedings of the 1998 Workshop on Perceptual User Interfaces)

Next: Pervasive Computing

Human-computer interaction is not the only exchange: humans want to interact with other humans

Computers in the Human Interaction Loop (CHIL)
The transparent, invisible computer
Computers need to be context aware
Should require little or no learning or attention
Should be proactive rather than command driven
Produce little or no distraction
Permit a mix of HCI and CHIL


Smart/Intelligent Rooms

Use of computation to enhance everyday activity
Integrate computers seamlessly into the real world (e.g. offices, homes)
Use "natural" interfaces for communication (voice, gesture, etc.)
The computer should adapt to the human, not vice versa!



Perception

In order to respond appropriately, objects/rooms need to pay attention to
people and
context

Machines have to be aware of their environment:
Who, What, When, Where and Why?

Interfaces must be adaptive to
the overall situation
the individual user



Intelligent Environments

Classroom 2000 (Georgia Tech)
Mozer's Adaptive House
Enhanced meeting rooms
Kids Room (MIT)
Enhanced objects such as whiteboards, desks, chairs, …

See also the Intelligent Environments Resource Page
(http://www.research.microsoft.com/ierp/)

Intelligent Rooms, Univ. of California, San Diego


Classroom 2000

Capturing activity in a classroom:
Speaker's voice
Video
Slides
Handwritten notes

Classroom 2000

Presenting (recorded) lectures through a web-based interface:
Integration of slides, notes, audio, video
Searching
Adding additional material




Microsoft Easy Living Project

XML-based distributed agent system
Computer vision for person tracking and visual user interaction
Multiple sensor modalities combined
Use of a geometric model of the world to provide context
Automatic or semi-automatic sensor calibration and model building
Fine-grained events and adaptation of the user interface
Device-independent communication and data protocols
Ability to extend the system in many ways

Mozer's Adaptive House

Operated as an ordinary home
Usual light switches, thermostats, doors etc.
Adjustments are measured and used to train the house to
automatically adjust temperature
adjust lighting
choose music or TV channel
The house infers the users' desires from their actions and behaviours

Adaptive House (Mozer)

Sensors:
Light level
Sound level
Temperature
Motion
Door status
Window status
Light settings
Fan
Heaters

(M. Mozer, Univ. of Colorado, Boulder)

Issues in Perception

Visual
Face detection / tracking
Body tracking
Face recognition
Gesture recognition
Action recognition
Gaze tracking / tracking focus of attention

Auditory
Speech recognition
Speaker tracking
Auditory scene analysis
Speaker identification

Other: haptic, olfactory, … ?


Enhanced Meeting Rooms

Capturing of meetings:
Transcription
Summarization
Dialog processing
Who was there?
Who talked to whom?


Work at ISL

Face tracking
Facial feature tracking (eyes, nose, mouth)
Head pose estimation / gaze tracking
Lip-reading (audio-visual speech recognition)
3D person tracking
Pointing gesture tracking

Other modalities: speech (!!!, see John), dialogue, translation, handwriting, ...

Tracking of Human Faces

A face provides different functions:
identification
perception of emotional expressions

Human-computer interaction requires tracking of faces:
lip-reading
eye/gaze tracking
facial action analysis / synthesis

Video conferencing / video telephony applications:
tracking the speaker
achieving low bit-rate transmission



Demo: FaceTracker

Color-Based Face Tracking

Human skin colors:
cluster in a small area of a color space
skin colors of different people mainly differ in intensity!
variance can be reduced by color normalization
distribution can be characterized by a Gaussian model

Chromatic colors:
r = R / (R + G + B)
g = G / (R + G + B)
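As a concrete illustration, here is a minimal sketch (not the original ISL tracker) of a Gaussian skin-color model in the normalized rg chromaticity space defined above; the function names and the simple thresholding step at the end are assumptions.

```python
import numpy as np

def rg_chromaticity(image):
    """Convert an RGB image (H x W x 3) to chromatic r, g coordinates."""
    rgb = image.astype(np.float64) + 1e-6        # avoid division by zero
    s = rgb.sum(axis=2)
    r = rgb[..., 0] / s                          # r = R / (R + G + B)
    g = rgb[..., 1] / s                          # g = G / (R + G + B)
    return np.stack([r, g], axis=-1)

def fit_skin_gaussian(skin_patch):
    """Estimate mean and covariance of skin samples in rg space."""
    rg = rg_chromaticity(skin_patch).reshape(-1, 2)
    return rg.mean(axis=0), np.cov(rg, rowvar=False)

def skin_likelihood(image, mean, cov):
    """Per-pixel (unnormalized) Gaussian likelihood of being skin."""
    diff = rg_chromaticity(image) - mean
    inv_cov = np.linalg.inv(cov)
    mahalanobis = np.einsum('...i,ij,...j->...', diff, inv_cov, diff)
    return np.exp(-0.5 * mahalanobis)

# Thresholding the likelihood map yields a skin mask; the largest connected
# blob can then serve as the face hypothesis for tracking.
```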

Color Model

Advantages:
very fast
orientation invariant
stable object representation
not person-dependent
model parameters can be quickly adapted

Disadvantages:
environment dependent (light sources heavily affect the color distribution)

Tracking Gaze and Focus of Attention

In meetings:
to determine the addressee of a speech act
to track the participants' attention
to analyse who was in the center of focus
for meeting indexing / retrieval

Interactive rooms:
to guide the environment's focus to the right application
to suppress unwanted responses

Virtual collaborative workspaces (CSCW)
Human-robot cooperation
Cars (driver monitoring)

Tracking a User's Focus of Attention

Focus-of-attention tracking:
to detect a person's interest
to know what a user is interacting with
to understand his actions/intentions
to know whether a user is aware of something

In meetings:
to determine the addressee of a speech act
to understand the dynamics of interaction
for meeting indexing / retrieval

Other areas:
smart environments
video conferencing
human-robot interaction


Head Pose Estimation

Model-based approaches:
Locate and track a number of facial features
Compute head pose from 2D-to-3D correspondences
(Gee & Cipolla '94, Stiefelhagen et al. '96, Jebara & Pentland '97, Toyama '98)

Example-based approaches:
estimate new pose with a function approximator (such as an ANN)
(Beymer et al. '94, Schiele & Waibel '95, Rae & Ritter '98)
use a face database to encode images
(Pentland et al. '94)


Model-Based Head Pose Estimation

[Figure: facial features tracked in the image are matched against a 3D head model in real-world coordinates (X, Y, Z); feature tracking feeds pose estimation.]

Find correspondences between points in a 3D model and points in the image

Iteratively solve a linear equation system to find the pose parameters (r_x, r_y, r_z, t_x, t_y, t_z)
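A compact sketch of this 2D-3D correspondence idea, using OpenCV's solvePnP in place of the iterative linear-system solution described on the slide; the 3D facial-feature coordinates and the camera intrinsics below are illustrative placeholders, not the ISL head model.

```python
import numpy as np
import cv2

# Illustrative 3D facial feature coordinates (eyes, nose tip, mouth corners),
# in millimetres in a head-centred model frame.
MODEL_POINTS = np.array([
    [-30.0,  35.0, -30.0],   # right eye
    [ 30.0,  35.0, -30.0],   # left eye
    [  0.0,   0.0,   0.0],   # nose tip
    [-25.0, -40.0, -30.0],   # right mouth corner
    [ 25.0, -40.0, -30.0],   # left mouth corner
], dtype=np.float64)

def estimate_head_pose(image_points, focal_length, image_size):
    """image_points: 5x2 array of tracked 2D feature positions (pixels)."""
    cx, cy = image_size[0] / 2.0, image_size[1] / 2.0
    K = np.array([[focal_length, 0.0, cx],
                  [0.0, focal_length, cy],
                  [0.0, 0.0, 1.0]])
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS,
                                  np.asarray(image_points, dtype=np.float64),
                                  K, None)
    # rvec encodes (r_x, r_y, r_z) as a Rodrigues vector, tvec is (t_x, t_y, t_z)
    return rvec, tvec
```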



Demo: Facial Feature Tracking

Demo: Model-based Head Pose

Model-Based Head Pose

Pose estimation accuracy depends on correct feature localization!

Problems:
Choice of good features
Occlusion due to strong head rotation
Fast head movement
Detection of tracking failure / re-initialization
Requires good image resolution

[Video]


Estimating Head Pose with ANNs

Train a neural network to estimate head orientation
Preprocessed image of the face used as input



Network Architecture

Input retina: up to 3 x 20x30 pixels = 1,800 units
Hidden layer: 40 to 150 units
Output: pan (or tilt, respectively)
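A minimal sketch of such a network in the dimensions given above (1,800 inputs, one hidden layer, a single angular output); the weight initialization, the tanh activation and the choice of 80 hidden units are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class PanNet:
    """Feed-forward net: 3 x 20x30 preprocessed face images -> pan angle."""

    def __init__(self, n_in=3 * 20 * 30, n_hidden=80):
        self.W1 = rng.normal(0.0, 0.05, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.05, (n_hidden, 1))
        self.b2 = np.zeros(1)

    def forward(self, x):
        """x: (batch, 1800) flattened input retina; returns pan in degrees."""
        h = np.tanh(x @ self.W1 + self.b1)     # hidden layer (40-150 units)
        return h @ self.W2 + self.b2           # linear output unit
```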

Tracking People in a Panoramic View

[Figure: camera view, panoramic view and perspective view of the room.]

Training

Separate nets for pan and tilt
Trained with standard backpropagation with a momentum term

Datasets:
Training on 6100 images from 12 users
Cross-evaluation on 750 images from the same users
Tested on 750 images from the same users

Additional user-independent test set:
1500 images from two new users
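For reference, a minimal sketch of the backpropagation-with-momentum weight update mentioned above; the learning rate and the momentum constant are assumed values, the slide does not give them.

```python
def momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One standard backprop update with a momentum term (assumed constants)."""
    velocity = momentum * velocity - lr * grad   # low-pass filtered gradient
    return w + velocity, velocity
```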

Results

Average error in degrees (pan / tilt):

           training set   test set     new users
  histo    6.6 / 5.0      9.4 / 6.9    11.3 / 9.1
  edges    6.0 / 2.6      10.8 / 7.1   13.3 / 10.8
  both     1.4 / 1.5      7.8 / 5.4    9.9 / 10.3

histo: histogram-normalized image used as input
edges: horizontal and vertical edge images used as input
both: histogram-normalized image plus edge images used as input



Demo

Spatial Awareness in Smart Rooms

Tracking people indoors:
To focus sensors on people
To resolve spatial relationships
To avoid bumping into humans
To analyze activity


Person Tracking

Vision-based localization of people/objects:

Single perspective:
Pfinder
W3S
Hydra
etc.

Multiple perspective:
AVIARY
Easy Living

Person Tracking in the ISL Smart Room

[Figure: four cameras (Cam0-Cam3) observe the people in the room; per-camera feature extractors send features to a central tracking agent.]

Person Tracking with Multiple Cameras

Goal: 3D tracking of people in rooms

Segmentation of foreground objects in each image
"3D intersection" of the rays through the object centers
Kalman filter
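A minimal sketch of the "3D intersection" step, assuming calibrated cameras: given one ray per camera (camera centre plus the direction through the segmented object centre), compute the least-squares point closest to all rays; the function name and interface are illustrative. The resulting 3D point per frame is then smoothed over time with the Kalman filter mentioned above.

```python
import numpy as np

def intersect_rays(origins, directions):
    """origins, directions: (N, 3) arrays, one ray per camera."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(np.asarray(origins), np.asarray(directions)):
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)   # projector onto the plane normal to d
        A += P
        b += P @ o
    return np.linalg.solve(A, b)         # 3D point minimizing distance to all rays
```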


Adaptive Silhouette Extraction

Background subtraction:
Adaptive multi-Gaussian background model [Stauffer et al., CVPR 1998]
Morphological operators smooth the foreground output
Connected components form the silhouettes
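A sketch of this pipeline using OpenCV stand-ins for the components named on the slide (mixture-of-Gaussians background model, morphological smoothing, connected components); the history length, kernel size and minimum area are assumed parameters.

```python
import cv2

bg_model = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

def extract_silhouettes(frame, min_area=500):
    """Return (stats, centroid) for each sufficiently large foreground blob."""
    mask = bg_model.apply(frame)                               # adaptive multi-Gaussian model
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)      # remove speckle noise
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)     # close small holes
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    return [(stats[i], centroids[i]) for i in range(1, n)      # label 0 is background
            if stats[i, cv2.CC_STAT_AREA] >= min_area]
```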

Locating People

[Figure: silhouettes a and b, seen from cameras 1-3, lead to location hypotheses i) (X,Y) and ii) (X,Y).]

Extract a reference point: the centroid
Use calibrated sensors to calculate the absolute position
Create a list of location hypotheses

Tracking People

Best-hypothesis tracking:
Match location hypotheses to tracks
Smooth tracks with a Kalman filter

[Figure: hypotheses i) (X,Y) and ii) (X,Y) are assigned to Track 1 and Track 2.]
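A minimal constant-velocity Kalman filter for smoothing matched (X, Y) hypotheses into a track, as described above; the time step and the noise covariances are assumptions, not values from the slides.

```python
import numpy as np

class TrackKalman:
    """Constant-velocity Kalman filter over the state (x, y, vx, vy)."""

    def __init__(self, x0, y0, dt=1.0 / 15, q=1e-2, r=5e-2):
        self.x = np.array([x0, y0, 0.0, 0.0])
        self.P = np.eye(4)
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = dt           # position += velocity * dt
        self.H = np.eye(2, 4)                      # only the position is observed
        self.Q = q * np.eye(4)
        self.R = r * np.eye(2)

    def update(self, z):
        """z: matched (X, Y) location hypothesis; returns the smoothed position."""
        self.x = self.F @ self.x                   # predict
        self.P = self.F @ self.P @ self.F.T + self.Q
        S = self.H @ self.P @ self.H.T + self.R    # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```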

Tracking Problems

Imperfect and merged silhouettes

Counterstrategies:
Better vision algorithms
Probabilistic multi-hypothesis tracking
Reference point: head



Reference Point: Head

Use the head as the reference point instead of the centroid

The head tracker has a significantly lower tracking error and false alarm rate

[Figure: bar chart (scale 0 to 0.1) comparing tracking error and false alarm rate for head vs. centroid reference points.]




Demo

Recognition of Pointing Gestures

Goals:
Recognize human pointing gestures
Extract the pointing direction in 3D

Application areas:
Human-robot interaction
Smart rooms

Requirements:
Person-independent
Real-time operation
Camera motion possible

Recognition of Pointing Gestures

Stereo camera
Left/right image

3D Tracker: Processing Steps

[Figure: camera image, skin color, disparity]

3D clustering of skin-color pixels yields cues for the positions of the head and the hands.

Gesture Recognition: Movement Phases

Pointing gestures consist of three intuitively distinguishable movement phases:
Begin
Hold
End

Precise localization of the hold phase is important for determining the pointing direction.

Average duration of the movement phases:

                     μ [sec]   σ [sec]
  Complete gesture   1.75      0.48
  Begin              0.52      0.17
  Hold               0.76      0.40
  End                0.47      0.12

Gesture Recognition: Models

The three phases are modeled with separate models
Continuous HMMs with 2 Gaussians per state
A null model serves as the threshold for the phase models
Training on hand-labeled data


Gesture Recognition: Detection

A pointing gesture is recognized when three time points t_B < t_H < t_E are found such that

P_E(t_E) > P_B(t_E) and P_E(t_E) > 0
P_B(t_B) > P_E(t_B) and P_B(t_B) > 0
P_H(t_H) > 0
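A minimal sketch of this detection rule over per-frame phase-model scores (assumed here to be already normalized against the null model, so that a positive value means "better than the threshold"); the scan order over candidate time points is an assumption.

```python
def detect_pointing_gesture(p_begin, p_hold, p_end):
    """Return (t_B, t_H, t_E) satisfying the detection conditions, or None."""
    n = len(p_begin)
    for t_e in range(n - 1, -1, -1):                    # candidate end points
        if p_end[t_e] > p_begin[t_e] and p_end[t_e] > 0:
            for t_b in range(t_e):                      # begin must precede end
                if p_begin[t_b] > p_end[t_b] and p_begin[t_b] > 0:
                    for t_h in range(t_b + 1, t_e):     # hold phase in between
                        if p_hold[t_h] > 0:
                            return t_b, t_h, t_e
    return None
```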




Gesture Recognition: Features

Feature vector: (r, Δθ, Δy)

Experiments: cylindrical coordinates work better than spherical and Cartesian ones

Hand measured relative to the head → independent of the position in the room

Δθ, Δy → no adaptation to the pointing targets seen during training

Spline interpolation of the feature sequences to a constant 40 Hz.
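A sketch of the per-frame feature computation: the hand position relative to the head in cylindrical coordinates, with frame-to-frame deltas for the angle and height as listed above; the axis convention (y vertical) and the function interface are assumptions.

```python
import numpy as np

def pointing_features(head_xyz, hand_xyz, prev_state=None):
    """Return the feature vector (r, d_theta, d_y) and the state for the next frame."""
    dx, dy, dz = np.asarray(hand_xyz, float) - np.asarray(head_xyz, float)
    r = np.hypot(dx, dz)             # horizontal hand-head distance
    theta = np.arctan2(dz, dx)       # azimuth around the vertical axis
    y = dy                           # vertical hand-head offset
    if prev_state is None:           # first frame: deltas undefined, use zero
        return (r, 0.0, 0.0), (theta, y)
    d_theta = theta - prev_state[0]
    d_y = y - prev_state[1]
    return (r, d_theta, d_y), (theta, y)
```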

Pointing Direction

Head-hand line
Line of sight from eye to hand
Easy to measure

Forearm line
Potentially superior with a bent arm
Harder to measure

Audio-Visual Speech Recognition

Lip Tracking Module

Feature based
Detects localization failures and automatically recovers from failures
Tracks facial features (pupils, nostrils, lips)

Audio-Visual Recognition

hyp_c = λ_a · hyp_a + λ_v · hyp_v,   with λ_a + λ_v = 1

Combination methods:
SNR weights
Entropy weights
Trained weights
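A sketch of this weighted combination with λ_a + λ_v = 1, here deriving the acoustic weight from an entropy-based confidence (one of the three weighting schemes listed above; the exact mapping from entropy to weight is an assumption).

```python
import numpy as np

def entropy(scores):
    p = np.asarray(scores, dtype=float)
    p = p / p.sum()
    return -np.sum(p * np.log(p + 1e-12))

def combine_av_scores(scores_audio, scores_video):
    """Weighted combination of per-hypothesis acoustic and visual scores."""
    conf_a = 1.0 / (entropy(scores_audio) + 1e-12)   # lower entropy = higher confidence
    conf_v = 1.0 / (entropy(scores_video) + 1e-12)
    lam_a = conf_a / (conf_a + conf_v)               # lambda_a + lambda_v = 1
    lam_v = 1.0 - lam_a
    return lam_a * np.asarray(scores_audio) + lam_v * np.asarray(scores_video)
```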

Fusion Levels

Word level (vote, decide based on the audio and video scores)
Phoneme level (combine by different weighting schemes)
Feature level (combine the features)

Audio-Visual Speech

[Figure: bar chart (0-100%) comparing acoustic, visual and combined recognition results for clean speech, 16 dB SNR and 8 dB SNR.]
Possible Topics

Person tracking
Gesture recognition
Attentive interfaces
Face detection
Lip reading (audio-visual speech recognition)
Audio-visual tracking
Emotion recognition
Person identification
Microphone arrays
Sensor fusion
Smart room infrastructure
Intelligent camera control
Self-calibration
Other smart room projects (MIT, Georgia Tech, IM2)
Other sensors: pressure, IR, etc.
Speech recognition
in meetings
far-field
efficient
microphone arrays
