Performance Assessment and System Design in Human Robot Interaction


Applied Informatics, Faculty of Technology
& CITEC – Central Lab Facilities
Performance Assessment and System Design
in Human Robot Interaction
Sven Wachsmuth
Bielefeld University
May, 2011



"... supercomputing power is already on the order of estimated human brain capacity, but intelligent or human-simulating machines do not yet exist ..."
[futurememes.blogspot.com/2010/04]

What are the FLOPS of cognitive systems?


"... [Richard Murphy] says they've designed the benchmark [Graph 500] to spur both researchers and industry toward mastering architectural problems of next-generation supercomputers."

Beyond FLOPS ...


Limits of benchmarking

Evaluation and benchmarking pose an inherently multi-dimensional problem (how do we define progress?)

Benchmarks significantly influence the design of
system architectures

Evaluation metrics do not necessarily make us aware
of architectural bottlenecks

Benchmarks do not capture the richness of
applications


Benchmarks need to be scalable
[Perona, ICCV Workshop, 2007]


Limits of offline datasets

Ground truth is not always easy to capture

Image datasets ignore the acquisition step (sensing)

Image datasets ignore the relevance of results

Offline processing ignores system aspects

Focus on experimental studies

Need for live systems

Need for live users/interaction partners


Human-Robot Interaction

Human-Robot Interaction scenarios

Home tour (navigation tasks / human-initiative teaching)

Curious robot (manipulation tasks / mixed-initiative learning)

Museum guide (assistive tasks / robot-initiative explanation)



Challenges in defining benchmarks

How to measure progress?

Multi-dimensionality

System complexity

Small datasets

How to define ground truth?

User behavior is highly variable

How to prevent architectural bottlenecks?

Tests are task and platform specific


Evaluation criteria / interacting levels

Human:

User experience / User performance

System:

Task performance

Architecture:

Reliability / robustness

Simplicity

Components:

Accuracy / efficiency


Overview of methodologies (each level)


Interaction between levels

Systemic Interaction Analysis (SinA) cycle:

define a prototypical script of the task

annotation & system logging (expectation-driven, based on video data)

identify deviation patterns (statistical analysis)

identify causes for deviation patterns on system and interaction level (system analysis)

estimate the impact of deviation patterns (judging results)

feed results back as component changes, architectural changes, and feedback changes

(a minimal code sketch of this loop follows the reference below)
Lohse, M., Hanheide, M., Pitsch, K., Rohlfing, K., and Sagerer, G. (2009). "Improving HRI design by applying Systemic Interaction Analysis (SinA)", Interaction Studies (Special Issue: Robots in the Wild: Exploring HRI in naturalistic environments), 10(3), John Benjamins Publishing Company, pp. 299-324.
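To make the SinA loop concrete, here is a minimal Python sketch (not the authors' tooling) of the deviation-detection step: logged interaction events from several trials are matched against a hypothetical prototypical task script, and recurring deviation patterns are counted to estimate their impact. Event names and the script are invented for illustration.

```python
from collections import Counter

# Hypothetical prototypical script for a "follow me" style task (SinA step 1).
PROTOTYPICAL_SCRIPT = ["greet_user", "confirm_task", "follow_user", "report_done"]

def find_deviations(logged_events, script=PROTOTYPICAL_SCRIPT):
    """Compare a logged event sequence against the prototypical script and
    return deviation patterns: missing, unexpected, or out-of-order events."""
    deviations = []
    expected = iter(script)
    pending = next(expected, None)
    for event in logged_events:
        if event == pending:
            pending = next(expected, None)
        elif event in script:
            deviations.append(("out_of_order", event))
        else:
            deviations.append(("unexpected", event))
    # Anything still pending was never observed in this trial.
    while pending is not None:
        deviations.append(("missing", pending))
        pending = next(expected, None)
    return deviations

def deviation_statistics(trials):
    """Aggregate deviation patterns over many annotated trials to estimate
    how frequent (and thus how impactful) each pattern is."""
    counts = Counter()
    for events in trials:
        counts.update(find_deviations(events))
    return counts

if __name__ == "__main__":
    trials = [
        ["greet_user", "confirm_task", "user_repeats_command", "follow_user"],
        ["greet_user", "follow_user", "confirm_task", "report_done"],
    ]
    for pattern, n in deviation_statistics(trials).most_common():
        print(pattern, n)
```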


Statistical Analysis of ELAN files in
Matlab (SALEM)
Hanheide, M., Lohse, M., and Dierker, A. (2010). "SALEM – Statistical AnaLysis of Elan files in Matlab", Workshop on Multimodal Corpora:
Advances in Capturing, Coding and Analyzing Multimodality, 7th Intl. Conf. on Language Resources and Evaluation (LREC), Malta, pp. 121-123.
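SALEM itself is a Matlab toolbox; purely as an illustration of the underlying idea, the Python sketch below parses the time-aligned tiers of an ELAN .eaf file (plain XML) and reports simple per-tier statistics. The file name is a placeholder, and only alignable (time-anchored) annotations are considered.

```python
import xml.etree.ElementTree as ET
from statistics import mean

def tier_durations(eaf_path):
    """Parse an ELAN .eaf file and return, per tier, the list of
    annotation durations in seconds."""
    root = ET.parse(eaf_path).getroot()
    # TIME_ORDER maps time-slot ids to millisecond values.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.findall("./TIME_ORDER/TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }
    durations = {}
    for tier in root.findall("TIER"):
        values = []
        for ann in tier.findall("./ANNOTATION/ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            if start is not None and end is not None:
                values.append((end - start) / 1000.0)
        durations[tier.get("TIER_ID")] = values
    return durations

if __name__ == "__main__":
    # "trial01.eaf" is a placeholder file name.
    for tier_id, values in tier_durations("trial01.eaf").items():
        if values:
            print(f"{tier_id}: n={len(values)}, mean={mean(values):.2f}s, total={sum(values):.1f}s")
```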


How to measure progress?

Benchmarking questions:

Did the overall number of different problem-related tasks decrease?

Did the percentage of time the users spent on problem-related tasks (compared to social and functional tasks) decrease?

Did the mean duration of problem-related tasks decrease?

Did the handling of problem-related tasks improve?

When did the problem-related tasks occur in the task
structure?
Siepmann, F., Lohse, M., & Wachsmuth, S., Towards robot architectures for user-driven system
design (in preparation).
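The questions above boil down to a few statistics over annotated task intervals. Below is a minimal sketch with hypothetical category labels ("problem", "social", "functional") and invented durations, comparing two system iterations; it is not the analysis code from the paper in preparation.

```python
from statistics import mean

def benchmark(intervals):
    """Compute the quantities behind the benchmarking questions:
    count, share of time, and mean duration of problem-related tasks.
    Each interval is a (category, duration_in_seconds) pair."""
    problem = [d for cat, d in intervals if cat == "problem"]
    total = sum(d for _, d in intervals)
    return {
        "problem_task_count": len(problem),
        "problem_time_share": sum(problem) / total if total else 0.0,
        "problem_mean_duration": mean(problem) if problem else 0.0,
    }

if __name__ == "__main__":
    # Hypothetical annotations for two system iterations.
    old_run = [("functional", 40), ("problem", 25), ("social", 10), ("problem", 35)]
    new_run = [("functional", 45), ("problem", 12), ("social", 15)]
    print("old:", benchmark(old_run))
    print("new:", benchmark(new_run))
```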


Challenges in defining benchmarks

How to measure progress?

Multi-dimensionality

System complexity

Small datasets

How to define ground truth?

User behavior is highly variable

How to prevent architectural bottlenecks?

Tests are task and platform specific


Social cues in teaching scenarios

Judging valence from non-verbal cues (facial expressions)

Reduction to the evaluation of a single skill

How to provoke natural user behavior?
[Lang et al., ROMAN, 2009]


Uncertain ground truth in HRI

Human judgements (without sound)

44 judges, 88 video sequences of 11 subjects
(success videos vs. failure videos)
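One way to handle such uncertain ground truth is to derive labels from the judges themselves, for example by majority vote with an agreement threshold. The sketch below is only an illustration with toy data, not the analysis used in the study.

```python
from collections import Counter

def judged_ground_truth(ratings, min_agreement=0.75):
    """Derive ground truth from multiple human judges: majority vote per
    video, kept only if agreement reaches the threshold."""
    truth = {}
    for video, votes in ratings.items():
        label, count = Counter(votes).most_common(1)[0]
        truth[video] = label if count / len(votes) >= min_agreement else "ambiguous"
    return truth

if __name__ == "__main__":
    # Toy ratings; in the study, 44 judges rated 88 video sequences.
    ratings = {
        "seq01": ["success"] * 40 + ["failure"] * 4,
        "seq02": ["success"] * 25 + ["failure"] * 19,
    }
    print(judged_ground_truth(ratings))
```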


Assistance in real applications

Supporting cognitively disabled persons in activities of daily living (ADLs)
(epilepsy, autism, learning disorders, hemiparesis)

Cooperation with Bodelschwinghsche Anstalten Bethel

WOZ study (23 trials with 7 users): teeth cleaning

Feedback by audio/video prompts


Individual reaction behavior

WOZ study: user reactions to prompts


wizard (WIZ) vs. caregiver (CG)


audio (A) vs. audio/video (A/V)


Challenges in defining benchmarks

How to measure progress?

Multi-dimensionality

System complexity

Small datasets

How to define ground truth?

User behavior is highly variable

How to prevent architectural bottlenecks?

Tests are task and platform specific


Scalability and transfer of system
frameworks and skills
[Timeline 1993–2011: scenarios evolve from interactive manipulation over service robotics and task assistance to tutoring, receptionist, and motivation; frameworks and skills carried along include DACS, ASR, dialog, multi-modal anchoring, person attention, XCF, Active Memory, task state pattern, dialog framework, BonSAI, social feedback, working memory, and application design.]


Scalability in competitions

RoboCup@Home
Graz, 2009
Singapore, 2010


RoboCup@Home
Desired abilities

Navigation

Fast and easy setup

Object recognition

Object manipulation

Recognition of humans

Human robot interaction

Speech recognition

Gesture recognition

Robot applications

Ambient intelligence
Tests

Robot inspection

Follow me

Go get it

Who is who

Open challenge

Enhanced who is who

General purpose service robot

Shopping Mall

Demo challenge

Final


RoboCup@Home
Tests are not completely pre-specified ...

Open Challenge allows free performance

General Purpose Test includes task specification

Shopping Mall includes a real, unknown environment

Demo Challenge focuses on application domains
Points are given for (partial) task completion (time limit)
Judging is (partially) subjective!
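As a purely illustrative sketch (not the official RoboCup@Home rulebook), partial-credit scoring with a time limit and a subjective jury component might look like this:

```python
def test_score(subtasks, time_limit_s, jury_bonus=0.0):
    """Illustrative RoboCup@Home-style scoring (not the official rules):
    each subtask earns its points only if completed within the time limit;
    an optional, partially subjective jury bonus is added on top."""
    earned = sum(points for points, t_done in subtasks if t_done <= time_limit_s)
    return earned + jury_bonus

if __name__ == "__main__":
    # Hypothetical (points, completion time in seconds) per achieved subtask.
    follow_me = [(100, 60), (200, 210), (200, 320)]  # last subtask finished too late
    print(test_score(follow_me, time_limit_s=300, jury_bonus=50))  # -> 350
```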


System development is an implicit part of the competition

Team effort of 10-12 people

Major team change from
2009 to 2010

Large number of modules

Limited computing power

Prototyping of tasks

Short evaluation cycles

Robot needs to perform
instantly


Conclusions

Benchmarking cognitive systems is inherently multi-dimensional
(there is no FLOPS measure)

Evaluation needs to be based on live systems
(performance is not characterized by offline error rates)

System frameworks and skills significantly profit from transfer
to other scenarios and platforms

System integration and evaluation is costly
(there is no free lunch)

Internal system analysis and external interaction analysis
need to be coupled


Conclusions

Benchmarking tasks should not be overspecified

Human behavior is shaped by the system response
(human input cannot be standardized)

Ground truth needs to be defined by the setup
(otherwise it might be ill-defined)

Human behavior is highly individual
(there is no "average user")

Competitions in HRI are inherently not completely fair,
but they are good for research


Thanks to a lot of people ...