21 Oct 2013

Managing Information Quality in e-Science using Semantic Web technology

Alun Preece, Binling Jin, Edoardo Pignotti

Department of Computing Science, University of Aberdeen

Paolo Missier, Suzanne Embury, Mark Greenwood

School of Computer Science, University of Manchester

David Stead, Al Brown


Molecular and Cell Biology, University of Aberdeen

www.qurator.org

Describing the Quality of Curated e-Science Information Resources

Combining the strengths of UMIST and The Victoria University of Manchester

Information and quality in e-science

- Scientists are required to place their data in the public domain
- Scientists use other scientists' experimental results as part of their own work

[Figure: e-science experiment; lab experiment; in silico experiments (e.g. workflow-based); public BioDBs]

“How can I decide whether I can trust this data?”

- Variations in the quality of the data
- No control over the quality of public data
- Difficult to measure and assess quality: no standards


A concrete scenario

Qualitative proteomics: identification of proteins in a cell sample

[Figure: wet lab (Step 1 … Step n) → candidate data for matching (peptide peak lists) → match algorithm in the information service (“dry lab”), using reference DBs → hit list]

Reference DBs:
- MSDB
- NCBI
- SwissProt/Uniprot

Hit list: {ID, Hit Ratio, Mass Coverage, …}

False negatives: incompleteness of reference DBs, pessimistic matching

False positives: optimistic matching


Quality is personal

- Scientists tend to express their quality requirements for data by giving acceptability criteria
- These are personal and vary with the expected use of the data
- “What is the right trade-off between false positives and false negatives?”
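
The trade-off can be made concrete with a small sketch. The scores and correctness labels below are invented purely for illustration, and `count_errors` is a hypothetical helper rather than anything from the slides. It shows how moving an acceptance threshold exchanges false positives for false negatives.

```python
from typing import List, Tuple

# Invented example data: (matching score, whether the match is truly correct).
HITS: List[Tuple[float, bool]] = [
    (0.95, True), (0.80, True), (0.65, False),
    (0.60, True), (0.40, False), (0.30, True),
]


def count_errors(hits: List[Tuple[float, bool]], threshold: float) -> Tuple[int, int]:
    """Return (false_positives, false_negatives) when accepting score >= threshold."""
    false_pos = sum(1 for score, correct in hits if score >= threshold and not correct)
    false_neg = sum(1 for score, correct in hits if score < threshold and correct)
    return false_pos, false_neg


if __name__ == "__main__":
    # A low threshold is "optimistic", a high one "pessimistic".
    for threshold in (0.35, 0.62, 0.90):
        fp, fn = count_errors(HITS, threshold)
        print(f"threshold={threshold:.2f}  false positives={fp}  false negatives={fn}")
```

Which threshold is "right" depends on the intended use of the data, which is why the criterion is left to the individual scientist.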




Requirements for IQ ontology

1. Establish a common vocabulary
   - Let scientists express quality concepts and criteria in a controlled way
   - Within homogeneous scientific communities
   - Enable navigation and discovery of existing IQ concepts

2. Sharing and reuse: let users contribute to the ontology while ensuring consistency
   - Achieve cost reduction

3. Making IQ computable in practice
   - Automatically apply acceptability criteria to the data


Quality Indicators

Quality Indicators: measurable quantities that can be used to define acceptability criteria
- “Hit Ratio”, “Mass Coverage”, “ELDP”
- provided by the matching algorithm

[Figure: match algorithm in the information service (“dry lab”) → hit list: {proteinID, Hit Ratio, Mass Coverage, …}]

Experimentally established correlation between these indicators and the probability of mismatch



Data acceptability criteria

- Indicators used as indirect “clues” to assess quality
- Quality Assertions (QA) formally capture these clues as functions of indicators
- Data classification or ranking functions:
  - ex: PIClassifier defined as
    f(proteinID, Hit Ratio, Mass Coverage, ELDP) → { (proteinID, rank) }
  - This provides a custom ranking of the match results
- Formalized acceptability criteria are conditions on QAs
  - accept(proteinID) if PIClassifier(proteinID,…) > X OR …
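
A minimal sketch of how a quality assertion and an acceptability criterion might look in code. The slides do not say how PIClassifier combines its indicators, so the weighted sum, the weights, the threshold, the protein IDs, and the indicator values (assumed normalized to [0, 1]) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hit:
    protein_id: str
    hit_ratio: float
    mass_coverage: float
    eldp: float


def pi_classifier(hits: List[Hit]) -> Dict[str, float]:
    """Hypothetical quality assertion: map each proteinID to a rank score."""
    # Assumed weights; a real classifier would be calibrated experimentally.
    return {
        h.protein_id: 0.5 * h.hit_ratio + 0.3 * h.mass_coverage + 0.2 * h.eldp
        for h in hits
    }


def accept(ranks: Dict[str, float], x: float) -> List[str]:
    """Acceptability criterion: keep proteins whose rank score exceeds x."""
    return sorted(pid for pid, rank in ranks.items() if rank > x)


# Invented hit list standing in for the output of a match algorithm.
HIT_LIST = [
    Hit("P1", hit_ratio=0.8, mass_coverage=0.6, eldp=0.5),
    Hit("P2", hit_ratio=0.4, mass_coverage=0.3, eldp=0.2),
    Hit("P3", hit_ratio=0.6, mass_coverage=0.5, eldp=0.9),
]

if __name__ == "__main__":
    print(accept(pi_classifier(HIT_LIST), x=0.5))
```

Keeping the assertion (`pi_classifier`) separate from the criterion (`accept`) mirrors the split on the slide: the assertion can be shared, while each user picks a personal threshold.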




IQ ontology backbone

Class restrictions:
- MassCoverage: is-evidence-for . ImprintHitEntry
- PIScoreClassifier: assertion-based-on-evidence . HitScore
- PIScoreClassifier: assertion-based-on-evidence . MassCoverage

Properties:
- assertion-based-on-evidence: QualityAssertion → QualityEvidence
- is-evidence-for: QualityEvidence → DataEntity
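
The backbone can be pictured as a small typed graph. The sketch below uses plain Python dictionaries rather than an OWL toolkit. The class and property names come from the slide, but the subclass links are assumptions inferred from how the properties are used, and `conforms` is a hypothetical helper.

```python
from typing import Dict, Optional, Tuple

# Assumed class hierarchy (child -> parent), inferred from the slide.
SUBCLASS_OF: Dict[str, str] = {
    "MassCoverage": "QualityEvidence",
    "HitScore": "QualityEvidence",
    "PIScoreClassifier": "QualityAssertion",
    "ImprintHitEntry": "DataEntity",
}

# Property -> (domain, range), as listed on the slide.
PROPERTIES: Dict[str, Tuple[str, str]] = {
    "assertion-based-on-evidence": ("QualityAssertion", "QualityEvidence"),
    "is-evidence-for": ("QualityEvidence", "DataEntity"),
}


def is_a(cls: Optional[str], ancestor: str) -> bool:
    """Walk up the hierarchy to see whether cls specializes ancestor."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def conforms(subject: str, prop: str, obj: str) -> bool:
    """Check a class-level statement against the property's domain and range."""
    domain, rng = PROPERTIES[prop]
    return is_a(subject, domain) and is_a(obj, rng)


if __name__ == "__main__":
    print(conforms("MassCoverage", "is-evidence-for", "ImprintHitEntry"))
    print(conforms("ImprintHitEntry", "is-evidence-for", "MassCoverage"))
```

This only checks that a specialization stays within the declared domain and range; it is not a substitute for a description logic reasoner.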


Quality properties

Users may add to a collection of generic quality properties

Generic quality properties (part of the backbone):
- Accuracy
- Currency
- Consistency
- Completeness
- Conformity
- Timeliness
- Conciseness

User-defined quality property: PI-acceptability (?)

How do we ensure consistent specialization?




Specializations of base ontology concepts

Concrete assertion (informal):
“the property Accuracy of Protein Identification is based upon the Hit Ratio indicator for Protein Hit data”

Abstract assertion (informal):
“a Quality Property is based upon one or more Quality Indicators for a Data Entity”

[Figure: base concepts Data Entity, Quality Indicator, and Quality Property; specializations Protein Hit, Hit Ratio, Accuracy Property, and Accuracy of Protein identification; domain labels Proteomics and Protein identification]


Maintaining consistency by reasoning

Axiomatic definition for Accuracy:

(QtyProperty-from-QtyAssertion . (QA-based-on-evidence . ConfidenceEvidence))

[Figure: ontology fragment with node labels PI-TopK, PMF-Match Ranking, PI-acceptability, Mass Coverage, Hit Ratio, PIMatch, Confidence Characterization, and Accuracy, and edge labels QtyProperty-from-QtyAssertion, Pref-based-on-evidence, Based-on, Output-of, Has-quality characterization, and Is a]
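
A rough sketch of the kind of check a reasoner performs when deciding whether a user-defined property belongs under Accuracy. It assumes an existential reading of the axiomatic definition (the property derives from some assertion that is based on some confidence evidence), which the extracted slide text does not confirm. The facts below are hypothetical: some names echo labels in the figure, while `Freshness` and `AgeCheck` are invented as a counter-example.

```python
from typing import Dict, Set, Tuple

# Hypothetical typing facts: which evidence counts as ConfidenceEvidence.
TYPES: Dict[str, Set[str]] = {
    "HitRatio": {"ConfidenceEvidence"},
    "MassCoverage": {"ConfidenceEvidence"},
}

# Hypothetical (subject, property, object) facts.
EDGES: Set[Tuple[str, str, str]] = {
    ("PI-acceptability", "QtyProperty-from-QtyAssertion", "PI-TopK"),
    ("PI-TopK", "QA-based-on-evidence", "HitRatio"),
    ("Freshness", "QtyProperty-from-QtyAssertion", "AgeCheck"),
}


def fillers(subject: str, prop: str) -> Set[str]:
    """All objects linked to subject through prop."""
    return {o for s, p, o in EDGES if s == subject and p == prop}


def is_accuracy(quality_property: str) -> bool:
    """True if the property derives from an assertion based on confidence evidence."""
    return any(
        "ConfidenceEvidence" in TYPES.get(evidence, set())
        for assertion in fillers(quality_property, "QtyProperty-from-QtyAssertion")
        for evidence in fillers(assertion, "QA-based-on-evidence")
    )


if __name__ == "__main__":
    print(is_accuracy("PI-acceptability"))
    print(is_accuracy("Freshness"))
```

This is a closed-world structural check over a handful of facts; a DL reasoner works over class definitions under open-world semantics, so the sketch only conveys the shape of the inference.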


Computing quality in practice

Annotation model: representation of indicator values as semantic annotations
- model: RDF schema
- annotation instances: RDF metadata

Binding model: representation of the mapping between
- Data ontology classes → data resources
- Functions ontology classes → service resources

Goal: to make quality assertions defined in the ontology computable in practice


Data resource annotations

Resource = data items at various granularities

Data item → indicator values
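
A minimal sketch of the idea of attaching indicator values to data items. Plain Python tuples stand in for RDF triples here; the `AnnotationStore` class, the `urn:example:` identifiers, and the values are all hypothetical.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, float]


class AnnotationStore:
    """Holds (data item, indicator, value) triples, in the spirit of RDF metadata."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def annotate(self, data_item: str, indicator: str, value: float) -> None:
        """Attach one indicator value to a data item."""
        self._triples.add((data_item, indicator, value))

    def indicators_for(self, data_item: str) -> Dict[str, float]:
        """Collect every indicator value recorded for a data item."""
        return {p: v for s, p, v in self._triples if s == data_item}


if __name__ == "__main__":
    store = AnnotationStore()
    # Hypothetical identifier for a single hit entry.
    store.annotate("urn:example:hit/P1", "HitRatio", 0.8)
    store.annotate("urn:example:hit/P1", "MassCoverage", 0.6)
    print(store.indicators_for("urn:example:hit/P1"))
```

Because the data item is just an identifier, the same store could hold annotations at different granularities, for example a whole hit list as well as a single hit entry.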


Data resource bindings

Data class → data resource


Account for different granularities, data types


Service resource bindings


Function class → (Web) service implementation

E.g. annotation function, QA function
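
One way to picture a service binding is as a registry from function-class names to callables. In the sketch below, local Python functions stand in for Web services. The registry, the class names `HitRatioAnnotation` and `HitRatioAssertion`, and the 0.5 cut-off are hypothetical.

```python
from typing import Any, Callable, Dict

# Function class name -> implementation (a local stand-in for a Web service).
SERVICE_BINDINGS: Dict[str, Callable[..., Any]] = {}


def bind(function_class: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that registers an implementation for an ontology function class."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        SERVICE_BINDINGS[function_class] = fn
        return fn
    return register


@bind("HitRatioAnnotation")
def annotate_hit_ratio(hit: Dict[str, float]) -> float:
    """Annotation function: extract an indicator value from a data item."""
    return hit["hit_ratio"]


@bind("HitRatioAssertion")
def assert_quality(hit_ratio: float) -> str:
    """QA function: classify a hit using an assumed cut-off of 0.5."""
    return "high" if hit_ratio > 0.5 else "low"


def invoke(function_class: str, *args: Any) -> Any:
    """Resolve a function class to its bound implementation and call it."""
    return SERVICE_BINDINGS[function_class](*args)


if __name__ == "__main__":
    value = invoke("HitRatioAnnotation", {"hit_ratio": 0.8})
    print(invoke("HitRatioAssertion", value))
```

Callers name the ontology concept rather than the implementation, so an implementation can be swapped without changing how quality assertions are expressed.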


The complete quality model


IQ Service Example


Summary

- An extensible OWL DL ontology for Information Quality
  - Consistency maintained using DL reasoning
- Used by e-scientists to share and reuse:
  - Quality indicators and metrics
  - Formal criteria for data acceptability
- Annotation model: generic schema for associating quality metadata with data resources
- Binding model: generic schema for mapping ontology concepts to (data, service) resources
- Model tested on data for proteomics experiments