Data Triangulation in a User Evaluation of the Sealife Semantic Web Browsers

Helen Oliver

Patty Kostkova

Ed de Quincey

City eHealth Research Centre (CeRC)

City University London

User-Centred Evaluation of Semantic Web Browsers


The Semantic Web for Life Sciences


Browse for meaning


Find answers to critical questions faster


Computer scientists love SWBs!


First-ever user-centred evaluation of SWBs, recruiting REAL-WORLD users


Do real users love SWBs too?


Realistic user-centred evaluation has been neglected for SWBs!




User-Centred Evaluation of Semantic Web Browsers


Use Triangulation to consider all angles


Essential to our innovative evaluation framework

(Quantitative data: web server logs + questionnaire results)
+ (Qualitative data: semi-structured interviews)
= (Validation AND Completeness)


Triangulation has been neglected in user-centred evaluations of SWBs!
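
As a minimal sketch of what this framework looks like in practice (all participant IDs, field names, and values below are invented for illustration), the idea is to line up each participant's log metrics, questionnaire scores, and interview notes so the three sources can be read against each other:

```python
# Minimal sketch of triangulation: one row per participant, with all three
# data sources side by side. Everything here is hypothetical example data.

log_metrics = {   # derived from web server logs (invented)
    "p01": {"task_time_s": 95,  "semantic_links_used": 4},
    "p02": {"task_time_s": 310, "semantic_links_used": 0},
}
questionnaire = { # 1-5 Likert scores (invented)
    "p01": {"likeability": 5, "findability": 4},
    "p02": {"likeability": 2, "findability": 2},
}
interviews = {    # only a subset of participants were interviewed
    "p01": "liked the concept hierarchy, found link boxes distracting",
}

triangulated = {
    pid: {**log_metrics.get(pid, {}),
          **questionnaire.get(pid, {}),
          "interview": interviews.get(pid)}
    for pid in sorted(set(log_metrics) | set(questionnaire))
}

for pid, row in triangulated.items():
    print(pid, row)
```
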




Group A1: Infectious Disease
Professionals


CORESE-based SWB vs NeLI

COHSE vs NeLI

Group A2: Microbiologists

GoPubMed/GoGene vs PubMed

Use of Triangulation for Semantic Web


Quantitative Data
Sources:


Web Form Questionnaires


Pre-questionnaire


Post-task questionnaires


Post-questionnaire


Web Server Logs


Qualitative Data Sources:


Semi-Structured Interviews (subset of participants)


Evaluation Settings:


Online


Workshops


Value of Data Triangulation in
Interpreting the Results


Questionnaires


Findability


Usability


System Speed


Relevance


Likeability


Web Server Logs (see the sketch below)


Task Completion Time


Usage of Semantic Links


# of External Pages
Viewed


Views of Target
Documents




Semi-Structured Interviews


Answers to questions we
didn’t think to ask…


Observe participants to
assess system
intuitiveness
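
The four log metrics above are straightforward to derive once each participant's requests are isolated. A minimal sketch, assuming a simplified, hypothetical clickstream format rather than the real server log schema:

```python
# Sketch of deriving the four log metrics from one participant's clicks.
# The log format, URLs, and target-document set are invented examples.

clicks = [  # (seconds since task start, URL requested)
    (0,   "/neli/task1/start"),
    (42,  "/cohse/linkbox?concept=MRSA"),   # a semantic link activation
    (70,  "http://example.org/guideline"),  # an external page
    (118, "/neli/doc/mrsa-treatment"),      # the target document
]
TARGET_DOCS = {"/neli/doc/mrsa-treatment"}  # pages that answer the task

task_completion_time = clicks[-1][0] - clicks[0][0]
semantic_links_used = sum("linkbox" in url for _, url in clicks)
external_pages_viewed = sum(url.startswith("http") for _, url in clicks)
target_docs_viewed = sum(url in TARGET_DOCS for _, url in clicks)

print(task_completion_time, semantic_links_used,
      external_pages_viewed, target_docs_viewed)
```
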

Sealife Results


COHSE:

67 respondents

39 online

28 in workshops

CORESE: 14 respondents

2 online (only 1 completed)

12 in workshops

GoPubMed:

137 online

4 in workshop

GoGene + Extended GoPubMed:

14 in workshop


Qualitative results not statistically significant (few interviews
conducted)

Web Server Logs



PubMed was faster than GoGene


Faster => Better…


So, users liked PubMed better than
GoGene


right?


Web Server Logs Don’t Lie!

Questionnaires


Best for:


Likeability, Information Findability, Relevance, System Speed: GoPubMed/GoGene


Usability: COHSE


Highest Number of Positive Ratings:


GoPubMed/GoGene


Largest Positive Mode Differences Between Control and Intervention:


GoPubMed/GoGene


Fewest Negative Mode Ratings Compared to Control:


GoPubMed/GoGene NEVER had worse mode scores than PubMed!
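
A minimal sketch of the mode comparison behind these claims, with invented 1-5 Likert ratings standing in for the real questionnaire data:

```python
# Per question, compare the modal rating for the intervention (GoPubMed)
# against the control (PubMed). All scores are hypothetical.
from statistics import multimode

ratings = {
    "likeability":  {"PubMed": [3, 3, 4, 2], "GoPubMed": [4, 5, 4, 4]},
    "findability":  {"PubMed": [3, 4, 3, 3], "GoPubMed": [4, 4, 5, 3]},
    "system_speed": {"PubMed": [4, 4, 5, 4], "GoPubMed": [4, 4, 3, 4]},
}

for question, by_system in ratings.items():
    control = max(multimode(by_system["PubMed"]))      # break ties upward
    intervention = max(multimode(by_system["GoPubMed"]))
    print(f"{question}: control mode {control}, intervention mode "
          f"{intervention}, difference {intervention - control:+d}")
```
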

Semi-Structured Interviews


So the winner is GoPubMed/GoGene


COHSE was rated the most usable


what more could we want?


Well…


Critiques in GoPubMed/GoGene interviews were about the details


Critiques in COHSE/CORESE interviews were about being able to
use the systems
at all


It turned out that, at first, some users could not tell the control from the intervention!


When asked for critiques of COHSE or CORESE, users gave abundant
detail… about NeLI!


Yes, but what about COHSE?
“Those awful little boxes? They were really
distracting, I didn’t really understand what they were.”


Presentations explaining the SWBs improved users’
understanding





Validation


We were expecting discrepancies between logs, questionnaires, and interviews


True for COHSE’s findability ratings


Workshop users rated it as adequate or good


Logs showed that none of these users had found the answer


Triangulation revealed discrepancies in plausible results


Otherwise users were generally consistent


We suspected one user of giving fake answers because she was
exceptionally positive in her questionnaires and interview


Task logs showed that she was one of the fastest (1-2 min per task)

»
…but 2 others were faster!


Logs showed that she activated 4 link boxes

»
…matching the median for all respondents


Logs showed that she viewed only 1 external page

»
…but some users didn’t view any and of those who did, 1 page was the
mode


Triangulation validated the suspicious-looking results
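
A minimal sketch of this validation check, with invented numbers standing in for the real task logs: rank the suspect respondent's speed, then compare her link-box and external-page counts against the cohort median and mode:

```python
# Cross-check one suspiciously positive respondent against the cohort.
# All metrics are hypothetical example values.
from statistics import median, multimode

cohort = {  # per-participant metrics from the task logs
    "p01": {"task_time_s": 100, "link_boxes": 4, "external_pages": 1},
    "p02": {"task_time_s": 80,  "link_boxes": 6, "external_pages": 0},
    "p03": {"task_time_s": 95,  "link_boxes": 4, "external_pages": 1},
    "p04": {"task_time_s": 300, "link_boxes": 2, "external_pages": 3},
}
suspect = "p01"

times = sorted(p["task_time_s"] for p in cohort.values())
rank = times.index(cohort[suspect]["task_time_s"]) + 1
print(f"speed rank: {rank} of {len(times)}")  # fast, but others were faster

print("link boxes:", cohort[suspect]["link_boxes"],
      "| cohort median:", median(p["link_boxes"] for p in cohort.values()))
print("external pages:", cohort[suspect]["external_pages"],
      "| cohort mode:", multimode(p["external_pages"] for p in cohort.values()))
```
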




Completeness


Logs showed that interviewees who spoke
negatively about COHSE often had spent a long time
on it


Longer than 5 minutes


Longer than they spent on the control platform


Several users spent more time on GoGene than on
PubMed or the extended GoPubMed,
but:


Said GoGene was their favourite


Rated it highly on the questionnaires


Triangulation shows the whole picture


Faster ⇏ better


Slower ⇏ worse
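
A minimal sketch of this completeness check, with invented dwell times and ratings: flag participants whose logs and questionnaires point in "opposite" directions:

```python
# Flag participants who spent longer on the intervention (GoGene) yet
# still preferred it. Times and ratings are hypothetical examples.
participants = [
    {"id": "p01", "t_gogene_s": 420, "t_pubmed_s": 250,
     "rating_gogene": 5, "rating_pubmed": 3},
    {"id": "p02", "t_gogene_s": 200, "t_pubmed_s": 260,
     "rating_gogene": 4, "rating_pubmed": 4},
]

for p in participants:
    slower = p["t_gogene_s"] > p["t_pubmed_s"]
    preferred = p["rating_gogene"] > p["rating_pubmed"]
    if slower and preferred:
        print(p["id"], "spent longer on GoGene yet preferred it:",
              "faster does not imply better")
```
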





Discussion


GoPubMed/GoGene workshop confirmed positive impressions


CORESE workshop confirmed negative questionnaire results


GoPubMed/GoGene workshop also confirmed:


That problems with this SWB were the most trivial


That somewhat higher questionnaire results masked dramatically better user experiences


Impressions that COHSE was more usable were quashed by contact with users at the workshops


Severity of problems would have gone undetected without interviews


Low number of interviews means triangulation was not complete


Recruitment difficult given time pressures on user base


Workshops are resource-intensive


Future work: carefully sample a subset for interview


Time constraints prevented gathering of observational data in situ


Future work: use video and/or eye tracking software







Conclusion


We have developed a method of triangulating quantitative and qualitative data in user-centred evaluation of SWBs


This addresses a need for greater attention to a technique
which is essential for accurate interpretation of data


Having applied our evaluation framework, we triangulated:


Quantitative data from the web server logs and from
questionnaires


Qualitative data from semi-structured interviews eliciting users’ opinions on matters they identified as important







Conclusion


Triangulation was indispensable for an accurate view
of the results


Log data gave system speed


Questionnaires and interviews gave the meaning of the log
data


Log data showed usage of semantic links


Log data showed whether users found the answers


Comparing the logs with questionnaires and interviews revealed discrepancies between what users said and what they did


Questionnaires showed system intuitiveness


Only the interviews showed the full significance of the
questionnaire results


Only triangulation could answer the ultimate
questions about user satisfaction


If any one data source had been left out, the results could
have been misinterpreted