The Data Mining Visual Environment


Nov 20, 2013


Motivation



Major problems with existing DM systems

- They are based on non-extensible frameworks.
- They provide a non-uniform mining environment: the user is presented with totally different interface(s) across implementations of different DM techniques.


Major needs


- An overall framework that can support the entire Knowledge Discovery (KD) process (accommodate and integrate all KD phases seamlessly).
- Placing the user at the center of the entire KD process/in the framework. In fact, the corresponding system should provide a consistent, uniform and flexible visual interaction environment that supports the user throughout the entire discovery process.

System Architecture

Primary layers

- User layer
- Engine layer
- Data layer

Main features

- Open
- Modular, with well-defined modification/extension points
- Possible integration of different tasks, e.g. output reuse by another task
- User flexibility and enablement to:
  - process data and knowledge,
  - drive and guide the entire KD process

At present, there is a partial prototype; the complete implementation is underway.
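
The modular design with extension points and output reuse can be pictured with a minimal Java sketch. Everything named here (`MiningTask`, `TaskRegistry`, `runPipeline`, and the two toy tasks) is a hypothetical illustration, not the prototype's actual API; the "clustering" task only writes a placeholder value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical extension point: every mining task plugs in through this interface.
interface MiningTask {
    String name();

    // Consumes named input values and returns them together with the task's output.
    Map<String, Object> run(Map<String, Object> input);
}

// Hypothetical engine-layer registry; new tasks are added without modifying it.
class TaskRegistry {
    private final Map<String, MiningTask> tasks = new LinkedHashMap<>();

    void register(MiningTask task) {
        tasks.put(task.name(), task);
    }

    // Runs the named tasks in order, feeding each task's output to the next one.
    Map<String, Object> runPipeline(Map<String, Object> data, String... names) {
        Map<String, Object> current = data;
        for (String n : names) {
            MiningTask task = tasks.get(n);
            if (task == null) {
                throw new IllegalArgumentException("Unknown task: " + n);
            }
            current = task.run(current);
        }
        return current;
    }
}

class ArchitectureSketch {
    // Builds a registry with two toy tasks standing in for real DM techniques.
    static TaskRegistry demoRegistry() {
        TaskRegistry registry = new TaskRegistry();
        registry.register(new MiningTask() {
            public String name() { return "clustering"; }
            public Map<String, Object> run(Map<String, Object> input) {
                Map<String, Object> out = new LinkedHashMap<>(input);
                out.put("clusterCount", 3); // placeholder result, not a real algorithm
                return out;
            }
        });
        registry.register(new MiningTask() {
            public String name() { return "summary"; }
            public Map<String, Object> run(Map<String, Object> input) {
                // Reuses the output of the previous task.
                Map<String, Object> out = new LinkedHashMap<>(input);
                out.put("summary", "clusters found: " + input.get("clusterCount"));
                return out;
            }
        });
        return registry;
    }

    public static void main(String[] args) {
        Map<String, Object> data = new LinkedHashMap<>();
        data.put("dataset", "customers");
        Map<String, Object> result =
                demoRegistry().runPipeline(data, "clustering", "summary");
        System.out.println(result.get("summary")); // prints "clusters found: 3"
    }
}
```

Registering a further task only requires another `MiningTask` implementation, which is one way to read "well-defined modification/extension points".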

System Architecture: The Prototype

Primary layers

- User Interface/GUI Container (interacts with specific DM visual environments). Developed using Java.
- Abstract DM Engine/Wrapper of DM algorithms (interacts with specific DM algorithms). Developed using Java.

Note

- The DM algorithms may be implemented by third parties in possibly any language.
- DM methods (but not limited to): metaqueries (MQs), association rules (ARs), and clustering

Stephen

Stefano


Visual Environment

A consistent, uniform, flexible and intuitive GUI, with support throughout the whole DM process. The principal focus is to support the user in:

- Visual construction of the task-relevant dataset: The user directly interacts with data. For this task, there are two intuitive interaction spaces.
- Visual construction of the mining query: The user directly interacts with data and other parameters (e.g. threshold values) in making queries. E.g., in the Metaquery Environment, the user can suggest patterns by linking attributes, while the Association Rule Environment offers 'visual baskets'.
- Visual output presentation and interaction: Exploiting relevant effective visualizations and, where necessary, novel visualizations that we have designed.
- Planning: E.g., advertising relevant prior knowledge.
- Handling the non-static nature of the user's quest: E.g., enabling the user to adjust.
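
To make the role of threshold values in an association rule query concrete, here is a minimal Java sketch. It uses the standard definitions of support and confidence; the class and method names, the sample transactions, and the idea that a 'visual basket' maps to a plain set of item names are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BasketQuerySketch {
    // Fraction of transactions that contain every item in `items`.
    static double support(List<Set<String>> transactions, Set<String> items) {
        if (transactions.isEmpty()) {
            return 0.0;
        }
        long hits = transactions.stream().filter(t -> t.containsAll(items)).count();
        return (double) hits / transactions.size();
    }

    // confidence(body -> head) = support(body and head together) / support(body).
    static double confidence(List<Set<String>> transactions,
                             Set<String> body, Set<String> head) {
        Set<String> both = new HashSet<>(body);
        both.addAll(head);
        double bodySupport = support(transactions, body);
        return bodySupport == 0.0 ? 0.0 : support(transactions, both) / bodySupport;
    }

    // True when the rule passes both user-chosen thresholds.
    static boolean accept(List<Set<String>> transactions, Set<String> body,
                          Set<String> head, double minSupport, double minConfidence) {
        Set<String> both = new HashSet<>(body);
        both.addAll(head);
        return support(transactions, both) >= minSupport
                && confidence(transactions, body, head) >= minConfidence;
    }

    // Hypothetical stand-in for a 'visual basket': a set of item names.
    static Set<String> basket(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Made-up sample data for the demonstration.
    static List<Set<String>> sampleTransactions() {
        return Arrays.asList(
                basket("bread", "milk"),
                basket("bread", "butter"),
                basket("bread", "milk", "butter"),
                basket("milk"));
    }

    public static void main(String[] args) {
        List<Set<String>> tx = sampleTransactions();
        // Rule bread -> milk: support 2/4, confidence 2/3.
        System.out.println(accept(tx, basket("bread"), basket("milk"), 0.4, 0.6)); // true
        System.out.println(accept(tx, basket("bread"), basket("milk"), 0.4, 0.7)); // false
    }
}
```

Raising the confidence threshold from 0.6 to 0.7 rejects the same rule, which is the kind of adjustment a user might make interactively.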

Visual Environment: Overall

Visual Environment: Tree View ('Progress Companion')

- Before user settings
- After user settings
- After DM results

Visual Environment: Clustering - Input

Visual Environment: Clustering - Output

... for more on the prototype, demo

Usability

- Usability heuristics: Done, but we will continue to refer to them regularly.
- Mock-up tests: Done with DM experts. The experts gave encouraging feedback and even suggestions on how to improve the interface. (These tests were done at the end of 2001.)
- Questionnaire experiments:
  - The experiments involved: the application simulation, a case study, data schema and user tasks corresponding to the case study, and a questionnaire.
  - Positive interface features: consistency, layout/organization, visual exploration.
  - Negative interface features: some visual elements were too small or too big.
  - (These tests were done in July 2002.)
- Formal usability tests: In the pipeline.


The Clustering Engine

- Clustering method: Generalizations of three techniques: homogeneity, separation, density.
- Clustering based on homogeneity/separation: Homogeneity (separation) is a global measure of the similarity between points belonging to the same cluster (to different clusters).
- Clustering based on density: Clusters are regions of the object space where objects are located "most frequently".
- Clustering based: The system selects the "best" clustering according to a cost function.
  - For homogeneity/separation-based clustering, the cost function is computed by evaluating pointwise, clusterwise, and partitionwise similarity/dissimilarity.
  - For density-based clustering, the cost function is derived from an estimated density function.
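
Selecting the "best" clustering by a cost function can be sketched in Java as follows. The cost used here (mean within-cluster distance minus mean between-cluster distance, on one-dimensional points) is an assumption chosen for brevity; it is a partition-level toy and not the engine's actual pointwise, clusterwise, and partitionwise evaluation.

```java
import java.util.Arrays;
import java.util.List;

class ClusteringCostSketch {
    // Mean distance over point pairs that share a label (sameCluster = true)
    // or that carry different labels (sameCluster = false).
    static double meanPairDistance(double[] points, int[] labels, boolean sameCluster) {
        double sum = 0.0;
        int pairs = 0;
        for (int i = 0; i < points.length; i++) {
            for (int j = i + 1; j < points.length; j++) {
                if ((labels[i] == labels[j]) == sameCluster) {
                    sum += Math.abs(points[i] - points[j]);
                    pairs++;
                }
            }
        }
        return pairs == 0 ? 0.0 : sum / pairs;
    }

    // Assumed cost: within-cluster spread minus between-cluster spread.
    // Lower is better: tight clusters that lie far apart.
    static double cost(double[] points, int[] labels) {
        return meanPairDistance(points, labels, true)
                - meanPairDistance(points, labels, false);
    }

    // Returns the candidate labeling with the lowest cost.
    static int[] selectBest(double[] points, List<int[]> candidates) {
        int[] chosen = null;
        double lowest = Double.POSITIVE_INFINITY;
        for (int[] labels : candidates) {
            double value = cost(points, labels);
            if (value < lowest) {
                lowest = value;
                chosen = labels;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 2.0, 10.0, 11.0};
        List<int[]> candidates = Arrays.asList(
                new int[] {0, 1, 0, 1},   // mixes near and far points
                new int[] {0, 0, 1, 1});  // groups near points together
        System.out.println(Arrays.toString(selectBest(points, candidates)));
        // prints "[0, 0, 1, 1]"
    }
}
```

For the sample points, the grouping `{0, 0, 1, 1}` has cost 1 - 9 = -8, while `{0, 1, 0, 1}` has cost 9 - 5 = 4, so the first is selected.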

Formal Semantics of the Input Environment

- Visual language: abstract syntax + semantics
- Abstract syntax: defined in terms of multi-graphs
  - Visual components are vertices of the multi-graph
  - Spatial relations between visual components are edges of the multi-graph
- Semantics:
  - Clustering: defined by a mapping between multi-graphs and cost functions and predicates expressing optimality
  - Metaqueries/association rules: defined by a mapping between multi-graphs and rules
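
The multi-graph abstract syntax can be illustrated with a small Java data structure. This is only a sketch under assumptions: the class name `VisualMultigraph`, the component names, and the relation labels ("linked-to", "left-of") are invented for the example and are not taken from the formal definition.

```java
import java.util.ArrayList;
import java.util.List;

class VisualMultigraph {
    // An edge records one spatial relation between two visual components.
    static final class Edge {
        final String from;
        final String to;
        final String relation;

        Edge(String from, String to, String relation) {
            this.from = from;
            this.to = to;
            this.relation = relation;
        }
    }

    private final List<String> vertices = new ArrayList<>();
    private final List<Edge> edges = new ArrayList<>();

    // Adds a visual component as a vertex (duplicates are ignored).
    void addComponent(String name) {
        if (!vertices.contains(name)) {
            vertices.add(name);
        }
    }

    // Parallel edges are allowed: the same pair of components may be
    // connected by several relations, which is what makes this a multi-graph.
    void relate(String from, String relation, String to) {
        if (!vertices.contains(from) || !vertices.contains(to)) {
            throw new IllegalArgumentException("Unknown component");
        }
        edges.add(new Edge(from, to, relation));
    }

    // Lists every relation recorded from one component to another.
    List<String> relationsBetween(String from, String to) {
        List<String> found = new ArrayList<>();
        for (Edge e : edges) {
            if (e.from.equals(from) && e.to.equals(to)) {
                found.add(e.relation);
            }
        }
        return found;
    }

    int vertexCount() { return vertices.size(); }

    int edgeCount() { return edges.size(); }

    public static void main(String[] args) {
        VisualMultigraph g = new VisualMultigraph();
        g.addComponent("attributeA"); // hypothetical visual components
        g.addComponent("attributeB");
        g.relate("attributeA", "linked-to", "attributeB");
        g.relate("attributeA", "left-of", "attributeB");
        System.out.println(g.relationsBetween("attributeA", "attributeB"));
        // prints "[linked-to, left-of]"
    }
}
```

A semantics in the sense above would then be a function from such a graph to a rule or to a cost function; that mapping is not attempted here.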



Operational Specification

1. Concrete, high-level syntax of the tasks proposed in the usability tests
   - Describes "legal" click-streams allowed to occur during operation
   - Standard grammar notation
2. Communication protocol between the abstract clustering engine and the data mining engines
   - XML DTDs based on PMML 2.0
   - Extension of PMML 2.0 to:
     1. Specification of input
     2. Broader spectrum of clustering methods
3. Concrete semantics of the clustering task by mapping symbols in the tasks grammar to structures of the communication protocol
   - Interpretation function recursively defined on the grammar rules of the high-level syntax of tasks
   - The interpretation of a legal click-stream is an XML document satisfying the DTD of the input specification
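
The idea of an interpretation function defined recursively on grammar rules can be sketched in Java. The grammar, the click tokens, and the XML element names (`ClusteringInput`, `Attribute`) below are all hypothetical; they are not the project's task grammar and not PMML 2.0 elements. Assumed toy grammar:

    task       ::= dataset attributes "run"
    attributes ::= attribute | attribute attributes
    dataset    ::= "dataset:" NAME
    attribute  ::= "attribute:" NAME

```java
import java.util.Arrays;
import java.util.List;

class ClickStreamSketch {
    // task ::= dataset attributes "run"
    static String interpretTask(List<String> clicks) {
        if (clicks.size() < 3
                || !clicks.get(0).startsWith("dataset:")
                || !"run".equals(clicks.get(clicks.size() - 1))) {
            throw new IllegalArgumentException("Illegal click-stream: " + clicks);
        }
        String dataset = name(clicks.get(0).substring("dataset:".length()));
        String body = interpretAttributes(clicks.subList(1, clicks.size() - 1));
        return "<ClusteringInput dataset=\"" + dataset + "\">" + body
                + "</ClusteringInput>";
    }

    // attributes ::= attribute | attribute attributes
    // The interpretation follows the rule: first attribute, then the rest.
    static String interpretAttributes(List<String> clicks) {
        String head = interpretAttribute(clicks.get(0));
        if (clicks.size() == 1) {
            return head;
        }
        return head + interpretAttributes(clicks.subList(1, clicks.size()));
    }

    // attribute ::= "attribute:" NAME
    static String interpretAttribute(String click) {
        if (!click.startsWith("attribute:")) {
            throw new IllegalArgumentException("Illegal click: " + click);
        }
        return "<Attribute name=\""
                + name(click.substring("attribute:".length())) + "\"/>";
    }

    // NAME is restricted to simple identifiers so the emitted XML stays well-formed.
    static String name(String raw) {
        if (!raw.matches("[A-Za-z0-9_]+")) {
            throw new IllegalArgumentException("Illegal name: " + raw);
        }
        return raw;
    }

    public static void main(String[] args) {
        List<String> clicks = Arrays.asList(
                "dataset:customers", "attribute:age", "attribute:income", "run");
        System.out.println(interpretTask(clicks));
    }
}
```

A click-stream that does not match the grammar, such as one missing the dataset selection, is rejected with an exception instead of being mapped to a document. Validation of the result against a DTD is left out of this sketch.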