20 Nov 2013


Knowledge Discovery in Grid Datasets: Goals, Design Concepts and the Architecture

Peter Brezany

University of Vienna



Collecting Data

Data Repositories

Satellites

Laboratories (microscopes, MRI/CT scanners, ...)

Computer simulations

Experiments (high energy physics, ...)

Analysis

Business



Motivation

Computational Grid: a new-generation infrastructure

Challenge: advanced analysis of data managed by the Grid

Typical data in modern Grid applications: files, file collections, relational and XML DBs, virtual data, data objects

The data is often large and geographically distributed, and its complexity is increasing; some applications require special security precautions.

Our research aims:

Phase 1: Knowledge discovery Grid system (GridMiner)

Phase 2: Intelligent Grid system (WisdomGrid)



Outline


Motivation



Background and Related Work



Basic Concepts and GridMiner Architecture



Grid Data Integration System



Data Mining Layer



Implementation Issues and Experiments



Future Research



Conclusions



Background and Related Work


Basic Grid development (Globus 1): metacomputing



Data Grid (Globus 2, DataGrid of CERN, etc.)



Semantic Grid (myGrid)



Open Grid Services Architecture (Globus 3, OGSA-DAIS)



Parallel and Distributed Data Mining and Data Warehousing



Knowledge Grid (GridMiner and work of others)



Web Intelligence



GridMiner Requirements


Open architecture



Data distribution, complexity, heterogeneity, and large data size



Applying different kinds of analysis strategies



Compatibility with existing Grid infrastructure



Openness to tools and algorithms



Scalability



Grid, network, and location transparency



Security and data privacy



OLAP support



GridMiner (Layered) Abstract Architecture

[Figure: layered architecture. Layer labels: Computational & Data Grid, Information Grid, Knowledge Grid. Other labels: Data to Knowledge, Control, User Interface.]

Built on K.G. Jeffery's proposal



GridMiner Conceptual Architecture

[Figure: conceptual architecture diagram. Surviving label: Job Control.]



Service Architecture

Based on OGSA-DAIS



Data Distribution Scenarios

1. Single data source

2. Federated data sources with different types of partitioning



Example

Vertical and horizontal distribution of the virtual data source
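
The two kinds of distribution can be illustrated with a small sketch. It assumes each source is a list of dictionaries and that vertically partitioned fragments share a key column. The function names `union_horizontal` and `join_vertical` and the sample columns are hypothetical; they are not taken from GridMiner.

```python
from typing import Dict, List

Row = Dict[str, object]


def union_horizontal(*fragments: List[Row]) -> List[Row]:
    """Horizontal partitioning: every fragment has the same columns
    but holds different rows, so the virtual source is their union."""
    merged: List[Row] = []
    for fragment in fragments:
        merged.extend(fragment)
    return merged


def join_vertical(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Vertical partitioning: fragments hold different columns of the
    same rows, so the virtual source is a join on the shared key."""
    right_by_key = {row[key]: row for row in right}
    joined: List[Row] = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


# Hypothetical sample data: two sites hold different patients (horizontal),
# a third site holds an extra column for the same patients (vertical).
site_a = [{"id": 1, "age": 34}]
site_b = [{"id": 2, "age": 57}]
site_c = [{"id": 1, "outcome": "good"}, {"id": 2, "outcome": "poor"}]

virtual_source = join_vertical(union_horizontal(site_a, site_b), site_c, key="id")
print(virtual_source)
```

A mediation service would do the same work against remote databases; the in-memory lists only keep the example self-contained.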



Mapping Schema




Grid Data Mediation Services



Architecture of a Data Mining System



Components of the Data Mining Layer


GridMiner Service Factory



GridMiner Service Registry



GridMiner Data Mining Service



GridMiner Preprocessing Service



GridMiner Presentation Service



GridMiner Orchestration Service



Centralized Data Mining

[Figure (a): service interactions for centralized data mining. Components shown: Client, GMSR, GMSF, GDSF, GMPPS, GMDMS, GDS 1, GDS 2, GDT, DataSource. Arrows are labelled query SDEs, notifications, <read> and <write>.]

Steps numbered in the figure:

1. browse factory GSHs
2. create GMPPS
3. create GDS
4. use it
5. use it
6. create GMDMS
7. create GDS
8. perform
9. use it
10. evaluate Model


Parallel and Distributed Data Mining

[Figure (b): service interactions for parallel and distributed data mining. Components shown: Client, GMSR, GMSF, GMDMS 0 to GMDMS 4, and the data sets dat1 to dat4. Arrows are labelled query SDEs, notifications, <read> and control. Communication is labelled SOAP / RMI / JXTA / MPI / etc.]

Steps numbered in the figure:

1. browse factory GSHs
2. create GMDMS
3. create
4. create
5. create
6. create
7. perform DataMining
8. control (also labelled 8. perform)
9. evaluate Model
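
The slide does not say which algorithm the distributed services run. As a minimal sketch of the coordinator/worker pattern in the figure, the example below assumes that each worker computes partial statistics (here, class counts) over its own partition and that a coordinating service merges them. The partition contents and function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical partitions standing in for dat1..dat4; each record is a class label.
partitions: Dict[str, List[str]] = {
    "dat1": ["survived", "survived", "died"],
    "dat2": ["survived", "died"],
    "dat3": ["died", "died"],
    "dat4": ["survived"],
}


def local_class_counts(records: List[str]) -> Counter:
    """Work done by one worker service on its own partition."""
    return Counter(records)


def coordinate(parts: Dict[str, List[str]]) -> Counter:
    """Work done by the coordinating service: start one worker per
    partition, then merge the partial results into a global summary."""
    total: Counter = Counter()
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        for partial in pool.map(local_class_counts, parts.values()):
            total.update(partial)
    return total


global_counts = coordinate(partitions)
print(dict(global_counts))
```

Threads stand in for the remote services here; in the figure the instances talk over SOAP, RMI, JXTA, MPI or similar transports.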


GridMiner Orchestration Service

[Figure: orchestration of a data mining workflow. Components shown: Client, GMSR, GMSF, GMOrchS, Workflow Engine, GMPPS 1, GMPPS 2, GMDMS, GMPRS. Arrows are labelled query SDEs, notifications, <read> and <write>.]

Steps numbered in the figure:

1. browse GSHs
2. create
3. execute Workflow
4. create
5. perform Activity
6. create
7. perform Activity
8. create
9. perform Activity
10. create
11. perform Activity

Workflow outline (GridMiner Job Description):

Header
Resource Declarations
Workflow
  Activity: use GMPPS for filling missing values and removing noise
  Activity: use GMPPS for selection and preliminary aggregations
  Activity: use GMDMS for generating a decision tree
  Activity: use GMPRS for a graphical, interactive representation
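
The workflow on this slide is a sequence of activities, each consuming the output of the previous one. The sketch below shows only that sequential control flow, with two stand-in preprocessing activities. The function names and record layout are assumptions; the real engine would invoke Grid services instead of local functions.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[float]]
Activity = Callable[[List[Record]], List[Record]]


def fill_missing(records: List[Record]) -> List[Record]:
    """Preprocessing activity: replace missing ages by the mean known age."""
    known = [r["age"] for r in records if r["age"] is not None]
    default = mean(known)
    return [{**r, "age": default if r["age"] is None else r["age"]} for r in records]


def select_adults(records: List[Record]) -> List[Record]:
    """Selection activity: keep only records with age >= 18."""
    return [r for r in records if r["age"] >= 18]


def run_workflow(activities: List[Activity], data: List[Record]) -> List[Record]:
    """Execute the activities in order, passing each result to the next."""
    for activity in activities:
        data = activity(data)
    return data


workflow = [fill_missing, select_adults]
result = run_workflow(workflow, [{"age": 30.0}, {"age": None}, {"age": 10.0}])
print(result)
```

Keeping each activity as a function with the same signature is what lets the engine treat preprocessing, mining and presentation steps uniformly.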


GridMiner Job Specification Language



Implementation Prototype


Implementation of the Mediation Service for horizontal data partitioning

Implementation of Data Mining Services for decision tree construction as OGSA-compliant Grid services, based on the Globus Toolkit 3 release

We use:

the freely available Java-based data mining system Weka for data preprocessing and data mining tasks (main-memory oriented)

a home-grown Java implementation of the SPRINT algorithm (disk-oriented)
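
SPRINT selects splits using the gini index. The sketch below is a simplified, in-memory illustration of evaluating splits on one numeric attribute. It is not the authors' Java implementation and is not disk-oriented; the toy data and function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def gini(labels: Sequence[str]) -> float:
    """Gini index of a set of class labels: 1 - sum(p_c ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_numeric_split(values: List[float], labels: List[str]) -> Tuple[float, float]:
    """Return (threshold, weighted gini) of the best split 'value <= threshold'.

    Candidate thresholds are midpoints between consecutive distinct values
    of the attribute after sorting.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("nan"), float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best


# Hypothetical toy data: age versus outcome.
ages = [20.0, 25.0, 60.0, 70.0]
outcomes = ["good", "good", "poor", "poor"]
threshold, score = best_numeric_split(ages, outcomes)
print(threshold, score)
```

For clarity this version recomputes the label lists at every candidate; a scalable implementation would instead update class histograms incrementally during a single scan of the sorted attribute list.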



Experimental Environment


Test data suites:
  synthetic data (generated by an extended version of the IBM Quest Synthetic Data Generation Code)
  TBI (Traumatic Brain Injury) databases

Grid testbed: Vienna, CERN, Dublin, Zagreb, Cracow

Goals in the first phases:
  Verifying model accuracy
  Measuring the overhead of the service layers



Extending the Functionality



OLAM



Example: Mining Patterns for Data Classification and Associations

use database dat1, dat2
mine classifications
analyze patient_outcome
using g_parsimony
display as tree

use database DBs attributes
mine associations
using method_attributes
display as rules



Workflow 1: Interactive Mode



Workflow 2: Batch Mode



Workflow 3: Hybrid Mode



Execution Model Based on Static Workflow



Execution Model Based on Dynamic Workflow



Towards the Wisdom Grid (WG)



WG Architecture

[Figure: Wisdom Grid architecture. Labels: Wisdom Grid, Agent Grid Service, Knowledge Base Service, Knowledge Discovery Service, Agent Platform, External Services, External Knowledge Base, Domain Knowledge Agents, Knowledge Explorer Agent, End User (personal) Agent, Grid, KB.]



Workflow

[Figure: Wisdom Grid workflow. Labels: End User Agent, Knowledge Agent, Knowledge Explorer Agent, External Agents, Agent Service, Knowledge Base service, Knowledge Base, Knowledge discovery service, Services, ...]



Knowledge Discovery Service

Client for other services

Knowledge Discovery in Databases

GridMiner


data mining


on-line analytical processing (OLAP)

Web Mining

semantic web

Online libraries

Web/Grid Services

Knowledge Explorer Agent




Knowledge Base Service / KB

KBS: Search, Query, Expand Knowledge Base

KB: Database that stores particular data about real objects, the relations between these objects, and their properties

Consists of ontologies and instances

Information about resources (location, query language):
  on the Web
  web/grid services, agents
  references to the online database

Languages: XML/RDF/DAML+OIL/DAML-S/OWL



Ontology - example

Patient is Human; Human has Age.

DAML+OIL Language:

<daml:Class rdf:ID="Human">
  <rdfs:subClassOf>
    <daml:Restriction daml:cardinality="1">
      <daml:onProperty rdf:resource="#Age"/>
    </daml:Restriction>
  </rdfs:subClassOf>
</daml:Class>

<daml:DatatypeProperty rdf:ID="Age">
  <rdfs:domain rdf:resource="#Human"/>
</daml:DatatypeProperty>

<daml:Class rdf:ID="Patient">
  <rdfs:subClassOf rdf:resource="#Human"/>
</daml:Class>
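
What the ontology buys in practice is inheritance: because Patient is a subclass of Human, a Patient also has an Age. The sketch below mirrors the example with plain dictionaries. It is an illustration of the reasoning step only, not a DAML+OIL processor, and the function names are hypothetical.

```python
from typing import Dict, Optional, Set

# Class hierarchy and property domains transcribed from the example above.
subclass_of: Dict[str, Optional[str]] = {"Human": None, "Patient": "Human"}
property_domain: Dict[str, str] = {"Age": "Human"}


def ancestors(cls: str) -> Set[str]:
    """Return the class itself plus all of its superclasses."""
    found: Set[str] = set()
    current: Optional[str] = cls
    while current is not None:
        found.add(current)
        current = subclass_of.get(current)
    return found


def properties_of(cls: str) -> Set[str]:
    """A class has every property whose domain is the class or a superclass."""
    lineage = ancestors(cls)
    return {prop for prop, domain in property_domain.items() if domain in lineage}


print(properties_of("Patient"))
```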



Knowledge Base - example

[Figure: knowledge-base graph. Node labels: Patient, Human, Temperature, Database, Tables, Attribute, Value, jdbc://foo/hospital, table:PATIENTS, attribute:PAT_ID. Edge labels: is, has.]



Semantic mediator



Distributed heterogeneous databases:
  Different database schemas
  Different query languages
  Different names of attributes/tables…
  but the same semantics!

WG enables semantic mediation at a higher level




Semantic mediator (cont.)

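[Figure: two hospital databases (Database in Hospital X, Database in Hospital Z) that hold the same information under different schemas, linked to a shared ontology.]

Table PATIENTS: PAT_ID, PAT_AGE, PAT_BLOOD_TYPE, ...

Table PAT_TAB: ID, AGE, BT, ...

Ontology labels: Patient, Human, Age, Blood Type; relations: is, has

Mappings:
  AGE samePropertyAs PAT_AGE
  BT samePropertyAs PAT_BLOOD_TYPE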
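
The samePropertyAs links let one ontology-level request be rewritten for each database. A minimal sketch, assuming the mediator keeps a per-table lookup from ontology property to column name: the table and column names come from the slide, while the dictionary layout, the function name and the use of SQL are assumptions.

```python
from typing import Dict, List

# Per-table column names for each ontology property. The table and column
# names come from the slide; the dictionary layout is an assumption.
column_for: Dict[str, Dict[str, str]] = {
    "PATIENTS": {"Age": "PAT_AGE", "Blood Type": "PAT_BLOOD_TYPE"},
    "PAT_TAB": {"Age": "AGE", "Blood Type": "BT"},
}


def rewrite_query(properties: List[str], table: str) -> str:
    """Translate a list of ontology properties into SQL for one table."""
    columns = [column_for[table][prop] for prop in properties]
    return f"SELECT {', '.join(columns)} FROM {table}"


for table in column_for:
    print(rewrite_query(["Age", "Blood Type"], table))
```

The caller only ever names ontology properties; the schema differences stay inside the lookup table.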



Distributed Knowledge base

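[Figure: classes and properties from different namespaces linked into one knowledge base. Node labels: uri:fooX#Patient, uri:fooY#Human, uri:fooZ#Temperature, uri:fooX#Ill_Person. Type labels: Class, property. Edge labels: is subclass, has property, is same class as.]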



Agent Grid Service

Gives the system the ability to communicate with the outside world in standard languages

FIPA Standards

ACL: Agent Communication Language

KQML: Knowledge Query and Manipulation Language

Agent Platform (JADE, FIPA-OS)

Agents:
  Domain Knowledge Agent
  Knowledge Explorer Agent
  End-user Agent (personal)




Querying


End-user agent
  with own ontology: subset of ontology, merging of ontologies
  without own ontology: negotiating about domain of interest

Queries created from ontology

Templates

<Patient rdf:ID="ID001">
  <Temperature/>
</Patient>



Answers



Mined Knowledge (GridMiner)


Decision trees/rules (clinical pathways)


Association rules



Instances of domain ontology


Particular data


References


Links to Web sites


Information about other knowledge providers



Case Study - Medical Application

[Figure: query flow in the medical case study. Labels: End User (personal) Agent, Knowledge Agent, Knowledge Explorer Agent, Knowledge Discovery Service, GridMiner, Knowledge Base, Hospital Databases, Training set, Testset, Semantic Web/Grid, resources.]

Q: Outcome? + data about patient's condition

A: probability of survival + references to the diagnoses



Conclusions and Future Work


Application and extension of the Grid technology to knowledge discovery: an important but non-traditional Grid application domain



Introduction of a new Grid Data Mediation Service



Future work


Performance evaluation on large synthetic data volumes


Coupling of the Data Mining services architecture with the OLAP services architecture


Development of a knowledge discovery oriented Grid Workflow
Language and the appropriate Workflow Engine


Application of GridMiner to a real medical application (management
of patients with severe traumatic brain injuries)


Development of the Wisdom Grid