[WG2: Crowdsourcing]

(Status: 2011/11/15)

Members: Tobias Hoßfeld (UWue), Raimund Schatz (FTW), Patrick Le Callet (UNantes), Matthias Hirth (UWue), Bruno Gardlo (UZil)

This document describes the Qualinet WG2 Task Force on “Crowdsourcing”. Its goals are

- to identify the scientific challenges and problems of QoE assessment via crowdsourcing, but also its strengths and benefits,

- to derive a methodology and setup for crowdsourcing-based QoE assessment,

- to compare the crowdsourcing approach to QoE assessment with the usual “lab” methodologies,

- to develop mechanisms and statistical approaches for identifying reliable ratings from remote crowdsourcing users, and

- to define requirements for crowdsourcing platforms that improve QoE assessment.

As a result, a common roadmap for this task force is envisioned and joint activities among the members of Qualinet are to be stimulated.

Motivation and problem statement



Subjective user studies are typically carried out by a test panel of real users in a laboratory environment. While many and possibly even diverging views on the quality of the media consumption can be taken into account, entailing accurate results and a good understanding of the QoE and its sensitivity, lab-based user studies can be time-consuming and costly, since the tests have to be conducted with a large number of users to obtain statistically relevant results. Costs and time demands further increase if the design and the execution of the tests, as well as the analysis of the user ratings, are performed in an iterative way, i.e. the QoE models are developed through repeated cycles of design, implementation, and statistical analysis of the tests. This iterative approach is unavoidable when touching new QoE aspects, e.g. for 3D video or cloud applications.

We will also discuss the limitations of test panels in these standard approaches.




Crowdsourcing seems to be an appropriate alternative approach. Crowdsourcing means outsourcing a job (like video quality testing) to a large, anonymous crowd of users in the form of an open call. Crowdsourcing platforms on the Internet, like Amazon Mechanical Turk or Microworkers, offer access to a large number of internationally widespread users and distribute the work submitted by an employer among them. With crowdsourcing, subjective user studies can be conducted efficiently and at low cost with adequate user numbers for obtaining statistically significant QoE scores. In addition, the desktop-PC based setting of crowdsourcing provides a highly realistic setting for scenarios like online video streaming or other Internet applications. However, the reliability of results cannot be taken for granted due to the anonymity and remoteness of participants. Some subjects may submit incorrect results in order to maximize their income by completing as many jobs as possible; others may simply not work correctly due to the lack of supervision. Therefore, it is necessary to develop an appropriate methodology that addresses these issues and ensures consistent behavior of the test subjects throughout a test session, and thus yields reliable QoE results.

Furthermore, due to the remoteness of the participants it is necessary to monitor the users' environment, e.g. the viewing distance, since the environment cannot be set up equally for all users.

We could also address the issue of how to trigger and sustain sufficient interest of the assessors.




Social networks can also be seen as a crowdsourcing platform, since the crowdsourcing platform is often used only for acquiring the users, while the test user survey itself is implemented on a different web server (belonging to the researcher issuing the user survey). Hence, the same user survey can be conducted with social network users. The same problems as with crowdsourcing occur, since the reliability of user ratings is not guaranteed.

Which other problems occur? Please feel free to add/modify/comment.

Key applications for crowdsourcing-based QoE assessment



Possible: Crowdsourcing platforms are typically accessible via a web browser and the crowdsourcing jobs (like a user study) also run in the web browser. The desktop-PC based setting of crowdsourcing provides a highly realistic setting for scenarios like online video streaming or other Internet applications. Thus, key applications are all kinds of applications running in a web browser:

o Web browsing

o Online video streaming

o Cloud applications like cloud gaming

o etc.



Not possible: The applications must be feasible in an Internet setting. For example, QoE assessment of 3D HD video streaming via crowdsourcing is currently not feasible, since the available Internet access bandwidth is not sufficient. In addition, special hardware may be required, e.g. for 3D television, which is not available to the end users. In general, crowdsourcing tests can hardly be conducted if special hardware is needed; these tests still require a laboratory environment.



Development of a specific video player suitable for crowdsourcing tests, in order to pre-upload all the necessary material?



The QoE of the crowdsourcing platform itself has not been investigated so far.



Please feel free to add/modify/comment.

Required methodology



The overall goal is to define a 'certified' (i.e. agreed) methodology for QoE assessment based on crowdsourcing. This includes test design methods, monitoring of tests and environments, and statistical measures for reliability analysis.



Test design methods are important to include in crowdsourcing studies, since they allow a filtering of the users. The following task design methods are illustrated for YouTube QoE tests (a combined screening sketch follows this list).

o Gold Standard Data. The most common mechanism to detect unreliable workers and to estimate the quality of the results is to use questions for which the correct answers are already known. These gold standard questions are interspersed among the normal tasks the worker has to process. After the worker submits the results, the answers are compared to the gold standard data. If the worker did not process the gold standard questions correctly, the non-gold-standard results should be assumed to be incorrect, too. For example, the worker affirms the question “Did you notice any pauses in the video?”, although the video was played without any interruptions.

o Consistency Tests. In this approach, the worker is asked the same question multiple times in a slightly different manner. For example, at the beginning of the survey the worker is asked how often she visits the YouTube web page, and at the end of the survey she is asked how often she watches videos on YouTube. The answers can differ slightly but should lie within the same order of magnitude. Another example is to ask the user about her country of origin at the beginning and about her continent of origin at the end. The ratings of the participant are disregarded if not all answers to the test questions are consistent. An unresolved problem concerns subjects who are not willing to provide correct personal data and therefore give inconsistent answers. In that case, the user ratings are rejected, although they could be valid quality ratings.

o Content Questions. After watching a video, the users are asked to answer simple questions about the video clip, e.g. “Which sport was shown in the clip? A) Tennis. B) Soccer. C) Skiing.” or “The scene was from the TV series... A) Star Trek Enterprise. B) Sex and the City. C) The Simpsons.” Only correct answers allow the user's ratings to be considered in the QoE analysis.

o Application-based User Monitoring. Monitoring users during task completion can also be used to detect cheating workers. The most common approach here is to measure the time the worker spends on the task. If the worker completes a task very quickly, this might indicate that she worked sloppily. However, it has to be noted that the reaction times of different subjects may differ significantly depending on the actual person. A more robust method is to monitor browser events in order to measure the focus time, which is the time interval during which the browser focus is on the website belonging to the user test.
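To make the combination of these checks concrete, the following is a minimal screening sketch in Python that accepts a worker's ratings only if the gold-standard answers, a content question, the consistency check and the measured focus time all pass. All field names, correct answers and thresholds are illustrative assumptions for a hypothetical YouTube test, not part of any platform API.

```python
# Minimal sketch of a reliability screening step for crowdsourced QoE ratings.
# All field names, correct answers and thresholds are illustrative assumptions.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WorkerResult:
    worker_id: str
    gold_answers: Dict[str, str]      # question id -> answer to a gold-standard question
    content_answers: Dict[str, str]   # question id -> answer to a content question
    consistency_ok: bool              # e.g. country/continent and YouTube-usage answers agree
    focus_time_s: float               # time the browser focus was on the test page
    ratings: List[int]                # the actual quality ratings (e.g. 1..5)


GOLD_TRUTH = {"pauses_q": "no"}        # the video was played without interruptions
CONTENT_TRUTH = {"sport_q": "soccer"}  # correct answer to the content question
MIN_FOCUS_TIME_S = 60.0                # assumed minimum plausible focus time


def is_reliable(w: WorkerResult) -> bool:
    """Accept a worker's ratings only if all cheat-detection checks pass."""
    gold_ok = all(w.gold_answers.get(q) == a for q, a in GOLD_TRUTH.items())
    content_ok = all(w.content_answers.get(q) == a for q, a in CONTENT_TRUTH.items())
    focus_ok = w.focus_time_s >= MIN_FOCUS_TIME_S
    return gold_ok and content_ok and w.consistency_ok and focus_ok


def filter_ratings(results: List[WorkerResult]) -> List[WorkerResult]:
    """Keep only the submissions that passed all screening checks."""
    return [w for w in results if is_reliable(w)]
```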



Monitoring and downloading of the test: Since the tests are conducted remotely and the users have to download the contents of the user test (within the web browser), two options are possible.

o In order to avoid effects of the Internet transmission on the QoE test, the user survey is downloaded completely (e.g. the video clips to be evaluated) before the test starts.

o The user test has to be monitored on the network and application layer, in order to take into account additional impairments from the network transmission on the application layer (e.g. insufficient bandwidth leads to artifacts or pauses in the video); see the stalling-detection sketch after this list.
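For the second option, application-layer monitoring essentially means logging player events and reconstructing impairments such as stalling from them. The following is a minimal sketch, assuming the player reports timestamped "stall"/"resume" events; the event names and log format are our own assumptions, not a prescribed interface.

```python
# Sketch: derive stalling statistics from logged player events (illustrative event names).
from typing import List, Tuple


def stalling_stats(events: List[Tuple[float, str]]) -> Tuple[int, float]:
    """events: list of (timestamp_in_seconds, event_name), event_name in {'stall', 'resume'}.
    Returns (number of stalling events, total stalling duration in seconds)."""
    events = sorted(events)
    count, total, stall_start = 0, 0.0, None
    for t, name in events:
        if name == "stall" and stall_start is None:
            stall_start = t
            count += 1
        elif name == "resume" and stall_start is not None:
            total += t - stall_start
            stall_start = None
    return count, total


# Example: two interruptions of 2 s and 3.5 s during playback
print(stalling_stats([(5.0, "stall"), (7.0, "resume"), (20.0, "stall"), (23.5, "resume")]))
# -> (2, 5.5)
```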



Statistical measures to identify unreliable user ratings (a small computation sketch follows this list):

o Inter- and intra-rater reliability, e.g. based on Spearman's rank correlation coefficient as a non-parametric measure of the statistical dependence between user ratings and test conditions.

o Confidence intervals are often misused as a measure for the reliability of subjective tests.

o SOS hypothesis: Considering the MOS values alone does not allow drawing any conclusions about unreliable users and the credibility of the presented subjective test results. However, considering the standard deviation of opinion scores (SOS) in addition to the MOS values helps to identify non-credible results.

o Other tools: e.g. PanelCheck, http://www.panelcheck.com/
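As a small illustration of the first and third measure, the sketch below computes Spearman's rank correlation between one worker's ratings and the test conditions, and estimates the parameter a of the SOS hypothesis, SOS(x)^2 = a * (-x^2 + 6x - 5) for MOS values x on a 5-point scale (the formulation by Hoßfeld et al.). All numbers are made up for illustration only.

```python
# Sketch: per-worker Spearman correlation and SOS-parameter estimation (illustrative data).
import numpy as np
from scipy.stats import spearmanr

# One worker's ratings (1..5) for a set of test conditions ordered by increasing impairment.
conditions = np.array([1, 2, 3, 4, 5, 6])   # e.g. number of stalling events
ratings = np.array([5, 4, 4, 3, 2, 1])      # the worker's opinion scores

rho, p_value = spearmanr(conditions, ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")  # strong negative monotonic relation expected

# SOS hypothesis: SOS(x)^2 = a * (-x^2 + 6x - 5) for MOS x on a 5-point scale.
# Estimate a by least squares from per-condition MOS and SOS over all accepted workers.
mos = np.array([4.6, 4.1, 3.4, 2.7, 1.9, 1.3])   # illustrative per-condition MOS
sos = np.array([0.5, 0.7, 0.9, 1.0, 0.8, 0.5])   # illustrative per-condition SOS
basis = -mos**2 + 6 * mos - 5
a = float(np.sum(basis * sos**2) / np.sum(basis**2))
print(f"estimated SOS parameter a = {a:.2f}")
```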



An important aspect of designing the test methods is that the user should not be overloaded with too many questions of the same kind. When investigating QoE via crowdsourcing or social network platforms, it must be stressed that the users evaluate mostly in their home environment. If they feel bored by exhausting questionnaires, they will simply lose interest in further evaluation and quit the application, or, perhaps even worse, they will just rate the given sequence with a random grade.

The same issue applies to the overall duration of the assessment. For subjective testing in the laboratory, it is generally advised not to exceed an overall duration of 30 minutes. However, this is inadequate for online testing and crowdsourcing. Whereas in the laboratory it is possible to present users with, for example, 30 videos in a single testing session, for the crowdsourcing scenario it is more efficient to break these 30 videos into several testing sets, present them to several groups of users, and then combine the results (see the sketch below).
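A minimal sketch of this split-and-combine step is given below; the stimulus names and the set size are made up, and in practice one might additionally share a few anchor stimuli across the sets to check for systematic differences between the groups.

```python
# Sketch: split a long test into short sets per worker group and combine ratings per stimulus.
from collections import defaultdict
from typing import Dict, List


def split_into_sets(stimuli: List[str], set_size: int) -> List[List[str]]:
    """Partition the stimulus list into consecutive sets of at most set_size items."""
    return [stimuli[i:i + set_size] for i in range(0, len(stimuli), set_size)]


def combine_mos(ratings: List[Dict]) -> Dict[str, float]:
    """ratings: {'stimulus': str, 'score': int} records collected from all groups.
    Returns the MOS per stimulus, regardless of which group rated it."""
    scores = defaultdict(list)
    for r in ratings:
        scores[r["stimulus"]].append(r["score"])
    return {stim: sum(v) / len(v) for stim, v in scores.items()}


# Example: 30 videos split into 3 sets of 10, each presented to a different worker group.
videos = [f"video_{i:02d}" for i in range(1, 31)]
sets = split_into_sets(videos, 10)
print(len(sets), [len(s) for s in sets])   # 3 sets of 10 stimuli each
```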



Concerning the downloading of the test set: if one wants to evaluate the effect of long sequences (e.g. 30-minute video sequences) on QoE, it is also possible to use RTMP streaming with a Flash player, which is able to adapt the video source to the current Internet connection. Video sequences with various bitrates would already be prepared on the server side. In my opinion, this kind of approach is more reasonable and better reflects real-world conditions than requesting the users to download one big file over a slow Internet connection. Of course, application monitoring on the user side is then necessary, but easy to implement.

Comparison of Different QoE Testing Methods



There are basically four different types of subjective user studies:

o laboratory studies (with paid users in a controlled test environment, with a test moderator)

o crowdsourcing studies (with paid users conducting a test remotely)

o social networking studies (without any payment; with some additional, but unreliable, "social" information about the users conducting the test remotely)

o field trials (in an uncontrolled, but realistic environment)



The main differences between crowdsourcing and lab studies are compared qualitatively, i.e. considering the various effects emerging in subjective studies and their test design, and quantitatively, i.e. considering the impact on concrete user ratings and QoE models.



Uncontrolled tests require monitoring of the environment.



Please feel free to add/modify/comment.

Desired features of Crowdsourcing platforms



The following features would be nice to have integrated in a crowdsourcing platform. Since UniWue has very good connections to one crowdsourcing provider (microworkers.com), it may be possible to integrate some desired features directly into the platform.



Special demands on user demographics are of interest. For example, one could specify that 50% of the test subjects are younger than 30 years and 50% are older than 30 years.



Targeting different types of users is possible in the Facebook social network.

Collaboration with Microworkers.com Provider



UniWue has collaboration with Microworkers.com which allows us to ask for integrating
features. Since UniWue is actively using this platform, we can offer som
e support for
Qualinet members, e.g. how to design tests, how to launch tests, etc.



Currently available features on Microworkers.com:

o Microworkers.com provides access to more than 200,000 registered users, which can be identified by unique, public user IDs. The crowd is distributed all over the world, with a large majority located in Asia. Details can be found in [1].

o The location of the individual workers is verified by IP monitoring and a postal identification during the payout process. The home country of a worker is available in her public user profile.
o Tasks are described in a simple textual manner and the workers submit their work results using a web form. For QoE tests the workers can be pointed to an external server hosting the test environment. A payment code generated at the end of the test is then submitted at Microworkers as proof of work (a possible code-generation scheme is sketched after this list).

o Besides tasks which are offered to the whole crowd, the platform offers means to choose only a selected group of workers for certain tasks. The workers can be chosen, e.g., based on their location or on their performance in previous tasks.

o To support more sophisticated tasks, the interface will be redesigned in the coming months, and an API for automated interactions with the platform is currently being developed.
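The payment code mentioned above could, for example, be generated on the test server as a keyed hash of the campaign and worker identifiers, so that a valid code cannot be guessed without finishing the test. The sketch below is only one possible scheme under our own assumptions; it is not the mechanism prescribed by the Microworkers.com platform.

```python
# Sketch: generate and verify a payment code as an HMAC of campaign and worker IDs.
# One possible scheme, not an API of the Microworkers.com platform.
import hmac
import hashlib

SECRET_KEY = b"replace-with-a-real-secret"   # known only to the test server


def payment_code(campaign_id: str, worker_id: str) -> str:
    msg = f"{campaign_id}:{worker_id}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()[:12]


def verify(campaign_id: str, worker_id: str, code: str) -> bool:
    return hmac.compare_digest(payment_code(campaign_id, worker_id), code)


# The code is shown to the worker at the end of the test and submitted as work proof.
code = payment_code("youtube_qoe_2011", "worker_12345")
print(code, verify("youtube_qoe_2011", "worker_12345", code))
```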



Available support from UniWue:

o We can provide introductions and support for the usage of the Microworkers.com platform, like the account and task creation. Furthermore, we can help with the task design, which highly affects the result quality.

o The results from the current crowdsourcing tests can be used to form an initial pool of trustworthy workers for QoE tests. This pool can be extended and adapted depending on the results of other QoE tests.

o Resulting from our current tests, we have ready-to-use hardware to run web-based crowdsourcing tasks and validated mechanisms to integrate these tasks into the microworkers.com platform, like payment-key generation strategies.

Interaction with other Working Groups / Subgroups / Task Forces



WG1 subgroup on Web and Cloud Applications




Please feel free to add/modify/comment.

Joint Activities



Submission to standardization

o Example: Updated subjective test methodologies and metrics for Web Browsing QoE (ITU-T)



Cross-validation of test methodologies and results

o Example: YouTube QoE tests based on crowdsourcing vs. lab vs. field



Creation of common test data sets which would allow cross-validation



Please feel free to add/modify/comment.

Requirements for other WGs



WG1: Application areas



WG3: Quality metrics



WG4: Databases and validation

o Databases: yet to be created (by Qualinet members)



WG5: Standardization, certification and dissemination

o Guidelines for quality testing based on crowdsourcing (ITU-T, ETSI)

References

(not intended as self-promotion, but to show own activities in this direction; papers can be sent to interested people)

[1] Matthias Hirth, Tobias Hoßfeld, Phuoc Tran-Gia. Anatomy of a Crowdsourcing Platform - Using the Example of Microworkers.com. Workshop on Future Internet and Next Generation Networks (FINGNet), Seoul, Korea, June 2011.

[2] Tobias Hoßfeld, Raimund Schatz, et al. Quantification of YouTube QoE via Crowdsourcing. Currently under submission, available as Technical Report No. 483 at the University of Würzburg: http://www3.informatik.uni-wuerzburg.de/TR/tr483.pdf

[3] Clemens Horch, Christian Keimel, Klaus Diepold. QualityCrowd - Crowdsourcing for Subjective Video Quality Tests. Technical Report (in German), available online: https://mediatum.ub.tum.de/node?id=1082600

[4] Kuan-Ta Chen, Chen-Chi Wu, Yu-Chun Chang, and Chin-Laung Lei. 2009. A crowdsourceable QoE evaluation framework for multimedia content. In Proceedings of the 17th ACM International Conference on Multimedia (MM '09). ACM, New York, NY, USA, 491-500.

[5] Kuan-Ta Chen, Chi-Jui Chang, Chen-Chi Wu, Yu-Chun Chang, and Chin-Laung Lei. 2010. Quadrant of euphoria: a crowdsourcing platform for QoE assessment. IEEE Network 24, 2 (March 2010), 28-35.

[6] B. Gardlo, M. Ries, M. Rupp, and R. Jarina, "A QoE evaluation methodology for HD video streaming using social networking," in Multimedia (ISM), 2011 IEEE International Symposium on, Dec. 2011, pp. 222-227.