Data Stream Mining Applications: Toward Inductive DSMS

CS240B Notes
by Carlo Zaniolo
UCLA Computer Science Department
Spring 2008




http://wis.cs.ucla.edu
21-Mar-08

Data Stream Mining and DSMS


Mining data streams: an emerging area of important applications
  Many fast & light algorithms developed for mining data streams: Ensembles, Moment, SWIM, etc.
Deployment of these algorithms on data streams is a challenge
  To deal with bursty arrivals, synopses, QoS, scheduling
  Analysts want to focus on high-level mining tasks, leaving such lower-level issues to the DSMS
Integration of mining methods and DSMS technology is needed, but it faces difficult research challenges:
  Data mining: a big problem for SQL-based DBMS
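
To make the "fast & light" ensemble idea above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the algorithm from the course: the stream is cut into fixed-size chunks, a tiny nearest-centroid learner is trained on each chunk, only the most recent learners are kept, and predictions are made by majority vote. The class names `CentroidClassifier` and `StreamEnsemble` and the parameters `chunk_size` and `max_members` are hypothetical.

```python
from collections import Counter, deque


class CentroidClassifier:
    """Tiny base learner: one mean vector (centroid) per class label."""

    def __init__(self, chunk):
        sums, counts = {}, Counter()
        for features, label in chunk:
            acc = sums.setdefault(label, [0.0] * len(features))
            for i, value in enumerate(features):
                acc[i] += value
            counts[label] += 1
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }

    def predict(self, features):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(features, centroid))

        # The label whose centroid is closest to the new case wins.
        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


class StreamEnsemble:
    """Keeps learners trained on the most recent chunks and lets them vote."""

    def __init__(self, chunk_size=4, max_members=3):
        self.chunk_size = chunk_size
        self.members = deque(maxlen=max_members)  # the oldest learner is evicted
        self.buffer = []

    def learn(self, features, label):
        # One pass over the stream: buffer the tuple, train when a chunk is full.
        self.buffer.append((features, label))
        if len(self.buffer) == self.chunk_size:
            self.members.append(CentroidClassifier(self.buffer))
            self.buffer = []

    def predict(self, features):
        if not self.members:
            return None  # nothing learned yet
        votes = Counter(member.predict(features) for member in self.members)
        return votes.most_common(1)[0][0]
```

Evicting the oldest learner is one simple way to follow a drifting concept. Buffering, scheduling, and load shedding are exactly the lower-level issues the slide proposes to leave to the DSMS.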



Road Map for Next Three Weeks


Data Mining query languages and systems
  The Inductive DBMS dream and the reality:
  Oracle, IBM DB2, MS DMX, Weka
Fast & Light Algorithms for Mining Data Streams
  Classifiers and Classifier Ensembles,
  Clustering methods,
  Association Rules,
  Time series
Supporting these Algorithms in a DSMS
  Data Mining Query Languages and support for the mining process



The DM Experience for DBMS: from dreams to reality

Initial attempts to support mining queries in relational DBMS: Unsuccessful
OR-DBMS do not fare much better [Sarawagi '98].
In 1996, a 'high-road' approach was proposed by Imielinski & Mannila who called for a quantum leap in functionality based on:
  High-level declarative languages for Data Mining (DM)
  Technology breakthrough in DM query optimization.
The research area of Inductive DBMS was thus born
  Inspiring significant work: DMQL, Mine Rule, MSQL, …
  Suffer from limited generality and performance issues.



DB2 Intelligent Miner


Model creation
Training:
    CALL IDMMX.DM_buildClasModelCmd('IDMMX.CLASTASKS',
        'TASK', 'ID', 'HeartClasTask',
        'IDMMX.CLASSIFMODELS',
        'MODEL', 'MODELNAME', 'HeartClasModel' );
Prediction
Stored procedures and virtual mining views
Outside the DBMS (like Cache Mining)
Data transfer delays
http://www-306.ibm.com/software/data/iminer/



Oracle Data Miner


Algorithms
  Adaptive Naïve Bayes
  SVM regression
  K-means clustering
  Association rules, text mining, etc.
PL/SQL with extensions for mining
Models as first class objects
  Create_Model, Prediction, Prediction_Cost, Prediction_Details, etc.
http://www.oracle.com/technology/products/bi/odm/index.html



OLE DB for DM (DMX)


Model creation
    Create mining model MemCard_Pred (
        CustomerId long key, Age long continuous,
        Profession text discrete,
        Income long continuous,
        Risk text discrete predict)
    Using Microsoft_Decision_Tree;

Training
    Insert into MemCard_Pred OpenRowSet(
        'sqloledb', 'sa', 'mypass',
        SELECT CustomerId, Age,
               Profession, Income, Risk from Customers
    )

Prediction Join
    Select C.Id, C.Risk, PredictProbability(MemCard_Pred.Risk)
    From MemCard_Pred AS MP Prediction Join Customers AS C
    Where MP.Profession = C.Profession and MP.Income = C.Income
        AND MP.Age = C.Age;



Defining a Mining Model


Define
  The format of "training cases" (top-level entity)
  Attributes, Input/output type, distribution
  Algorithms and parameters
Example
    CREATE MINING MODEL CollegePlanModel
    (
        StudentID LONG KEY,
        Gender TEXT DISCRETE,
        ParentIncome LONG NORMAL CONTINUOUS,
        Encouragement TEXT DISCRETE,
        CollegePlans TEXT DISCRETE PREDICT
    ) USING Microsoft_Decision_Trees




Training

    INSERT INTO CollegePlanModel
        (StudentID, Gender, ParentIncome,
         Encouragement, CollegePlans)
    OPENROWSET(<provider>, <connection>,
        SELECT StudentID,
               Gender,
               ParentIncome,
               Encouragement,
               CollegePlans
        FROM CollegePlansTrainData
    )




Prediction Join

    SELECT t.ID, CPModel.Plan
    FROM CPModel PREDICTION JOIN
        OPENQUERY( ,
            SELECT * FROM NewStudents
        ) AS t
    ON CPModel.Gender = t.Gender AND
       CPModel.IQ = t.IQ

[Figure: prediction join between the CPModel mining model and the NewStudents table; columns shown: ID, Gender, IQ, Plan]
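
The semantics of the prediction join above can be sketched in Python. This is only a rough analogy under loud assumptions: the "model" here is a plain lookup table that remembers the most frequent Plan for each (Gender, IQ) combination seen in training, whereas a real decision tree would generalize to unseen combinations. The function names `train_lookup_model` and `prediction_join` are hypothetical and are not part of DMX.

```python
from collections import Counter, defaultdict


def train_lookup_model(training_rows, input_columns, predict_column):
    """Learn the most frequent target value for each combination of inputs."""
    counts = defaultdict(Counter)
    for row in training_rows:
        key = tuple(row[c] for c in input_columns)
        counts[key][row[predict_column]] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def prediction_join(model, new_rows, input_columns, id_column):
    """Pair each new case with the model's prediction, matched on input columns."""
    for row in new_rows:
        key = tuple(row[c] for c in input_columns)
        # Like the ON clause: model inputs are matched against the case's columns.
        yield row[id_column], model.get(key)  # None when the model has no match
```

With a hypothetical training table of (ID, Gender, IQ, Plan) rows, `list(prediction_join(model, new_students, ["Gender", "IQ"], "ID"))` plays the role of `SELECT t.ID, CPModel.Plan`: one output row per new student, carrying the predicted Plan.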



OLE DB for DM (DMX) (cont.)


Mining objects as first class objects
  Schema rowsets
    Mining_Models
    Mining_Model_Content
    Mining_Functions
Other features
  Column value distribution
  Nested cases
http://research.microsoft.com/dmx/DataMining/



Summary of Vendors’ Approaches


Built-in library of mining methods
Script language or GUI tools
Limitations
  Closed systems (internals hidden from users)
  Adding new algorithms or customizing old ones: difficult
  Poor integration with SQL
  Limited interoperability across DBMSs
Predictive Model Markup Language (PMML) as a palliative



PMML


Predictive Model Markup Language
XML-based language for vendor-independent definition of statistical and data mining models
Share models among PMML-compliant products
A descriptive language
Supported by all major vendors




PMML Example
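
The original example for this slide did not survive extraction. As a stand-in, the sketch below embeds a small hand-written PMML fragment in Python and reads it with the standard library, to show what "descriptive" means here: the document lists the fields and the tree, and any consumer can inspect them. The model content (a one-split tree over ParentIncome), the PMML version, and the helper name `describe` are assumptions made for illustration; the fragment has not been validated against the PMML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PMML fragment (not produced by any vendor tool).
PMML_DOC = """<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4">
  <Header description="Hypothetical college-plans tree"/>
  <DataDictionary numberOfFields="3">
    <DataField name="Gender" optype="categorical" dataType="string"/>
    <DataField name="ParentIncome" optype="continuous" dataType="double"/>
    <DataField name="CollegePlans" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="CollegePlanModel" functionName="classification">
    <MiningSchema>
      <MiningField name="Gender"/>
      <MiningField name="ParentIncome"/>
      <MiningField name="CollegePlans" usageType="target"/>
    </MiningSchema>
    <Node score="no">
      <True/>
      <Node score="yes">
        <SimplePredicate field="ParentIncome" operator="greaterThan" value="50000"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
"""

NS = {"pmml": "http://www.dmg.org/PMML-4_4"}


def describe(pmml_text):
    """Summarize a PMML tree model: name, function, field types, target."""
    root = ET.fromstring(pmml_text)
    fields = {
        f.get("name"): f.get("optype")
        for f in root.findall("pmml:DataDictionary/pmml:DataField", NS)
    }
    model = root.find("pmml:TreeModel", NS)
    target = [
        m.get("name")
        for m in model.findall("pmml:MiningSchema/pmml:MiningField", NS)
        if m.get("usageType") == "target"
    ]
    return {
        "model": model.get("modelName"),
        "function": model.get("functionName"),
        "fields": fields,
        "target": target,
    }
```

Because the model is plain XML, a tool that did not build it can still load and apply it, which is how PMML-compliant products can share models.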

The Data Mining Software Vendors


Market Competition




The Data Mining World According to

Disclaimer


This presentation contains preliminary information that may be changed substantially prior to final
commercial release of the software described herein.


The information contained in this presentation represents the current view of Microsoft Corporation on the
issues discussed as of the date of the presentation. Because Microsoft must respond to changing
market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft
cannot guarantee the accuracy of any information presented after the date of the presentation.


This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES,
EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.


Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property
rights covering subject matter in this presentation. Except as expressly provided in any written license
agreement from Microsoft, the furnishing of this information does not give you any license to these
patents, trademarks, copyrights, or other intellectual property.


© 2005 Microsoft Corporation. All rights reserved.

Major Data Mining Vendors


Platforms
  IBM
  Oracle
  SAS
Tools
  SPSS
  Angoss
  KXEN
  Megaputer
  FairIsaac
  Insightful

Competition

Product
  SQL Server 2005: SQL Server Analysis Services
  Oracle 10g: Oracle Data Mining
  IBM: DB2 Intelligent Miner, WebSphere
  SAS: Enterprise Miner
Link
  Oracle 10g: http://otn.oracle.com/products/bi/odm/odmining.html
  IBM: http://www-306.ibm.com/software/data/iminer/
  SAS: http://www.sas.com/technologies/analytics/datamining/miner/factsheet.pdf
API
  SQL Server 2005: OLEDB/DM, DMX, XMLA, ADOMD.Net
  Oracle 10g: Java DM, PL/SQL
  IBM: SQL MM/6 based on UDF, SQL SPROC
  SAS: SAS Script
Algorithms
  SQL Server 2005: 7 (+2)
  Oracle 10g: 8
  IBM: 6
  SAS: 8+
Text Mining
  SQL Server 2005: Yes
  Oracle 10g: Yes
  IBM: Yes
  SAS: Yes
Marketing Pages
  SQL Server 2005: N/A
  Oracle 10g: 18
  IBM: 10
  SAS: Dozens
Client Tools
  Embeddable Viewers, Reporting Services
  Analysis tools, Web-based targeted reports
  Discoverer
  WebSphere Portal (vertical solution)
  IM Visualization
  Excel AddIn
  None
Distribution
  SQL Server 2005: Included
  Oracle 10g: Additional Package
  IBM: Additional Packages
  SAS: Separate Product
Target
  SQL Server 2005: Developers
  Oracle 10g: Developers
  IBM: DB2 IM Scoring module is for developers; other modules are for analysts.
  SAS: Analysts
Strengths
  SQL Server 2005: Powerful yet simple API. Integration with other BI technologies. New GUI.
  Oracle 10g: Good credibility with enterprise customers. New GUI, Leader of JDM API. CRM Integration.
  IBM: Mature product (6 years). Good service model. Scoring inside relational engine. Strong partnership with SAS.
  SAS: Mature, Market Leader. Extensive customization and modelling abilities. Robust, industry tested and accepted algorithms and methodologies. Export to DB2 Scoring.
Weaknesses
  SQL Server 2005: Not in-process with relational engine. Lacking statistical functions. Poor Analyst experience.
  Oracle 10g: API overly complex. Inconsistent.
  IBM: High price. Standard Functionality. Poor API (SQL MM). Confusing product line.
  SAS: Expensive. Proprietary. Customer relations range from congenial to hostile.

Major DM Vendors

SAS Institute (Enterprise Miner)
IBM (DB2 Intelligent Miner for Data)
Oracle (ODM option to Oracle 10g)
SPSS (Clementine)
Unica Technologies, Inc. (Pattern Recognition Workbench)
Insightful (Insightful Miner)
KXEN (Analytic Framework)
Prudsys (Discoverer and its family)
Microsoft (SQL Server 2005)
Angoss (KnowledgeServer and its family)
DBMiner (DBMiner)
etc.

ORACLE Strengths

Oracle Data Mining (ODM) integrated into relational engine
  Performance benefits
  Management integration
  SQL Language integration
ODM Client
  "Walks through" Data Mining Process
  Data Mining tailored data preparation
  Generates code
Integration into Oracle CRM
  "EZ" Data Mining for customer churn, other applications
Full suite of algorithms
  Typical algorithms, plus text mining and bioinformatics
Nice marketing/user education

ORACLE Weaknesses

Additional Licensing Fees (base $400/user, $20K proc)
Confusing API Story
  Certain features only work with Java API
  Certain features only work with PL/SQL API
  Same features work differently with different APIs
Difficult to use
  Different modeling concepts for each algorithm
Poor connectivity
  ORACLE only


SAS

Entrenched Data Mining Leader
  Market Share
  Mind Share
  "Best of Breed"
  Always will attract the top ?% of customers
Overall poor product
  Only for the expert user (SAS Philosophy)
  Integration of results generally involves source code
Integrated with ETL, other SAS tools
Partnership with IBM
  Model in SAS, deploy in DB2



Our View ...


Progress toward high-level data models and integration with SQL, but
  Closed systems,
  Lacking in coverage and user-extensibility.
Not as popular as dedicated, stand-alone DM systems, such as Weka.




Weka



A comprehensive set of DM algorithms and tools.
Generic algorithms over arbitrary data sets.
  Independent of the number of columns in tables.
Open and extensible system based on Java.
These are the features that we want in our Inductive DSMS, starting from SQL rather than Java!
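
What "generic" and "independent of the number of columns" can mean in practice is sketched below in Python. This is an illustration, not Weka's implementation: a Naive Bayes learner that stores counts keyed by (column, value) pairs, so the same code handles tables of any width. The class name `GenericNaiveBayes` is hypothetical, all attributes are assumed to be categorical, and Laplace smoothing is an added assumption.

```python
import math
from collections import Counter, defaultdict


class GenericNaiveBayes:
    """Naive Bayes over rows given as dicts; any number of categorical columns."""

    def __init__(self, target):
        self.target = target
        self.class_counts = Counter()
        self.value_counts = defaultdict(Counter)  # (column, value) -> class counts
        self.column_values = defaultdict(set)     # column -> distinct values seen

    def learn(self, row):
        label = row[self.target]
        self.class_counts[label] += 1
        for column, value in row.items():
            if column != self.target:
                self.value_counts[(column, value)][label] += 1
                self.column_values[column].add(value)

    def predict(self, row):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)  # log prior
            for column, value in row.items():
                if column == self.target or column not in self.column_values:
                    continue  # skip the target and columns never seen in training
                k = len(self.column_values[column] | {value})
                seen = self.value_counts.get((column, value), {}).get(label, 0)
                score += math.log((seen + 1) / (n + k))  # Laplace smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

The same class can be trained on a two-column table and then, as a fresh instance, on a four-column one without any change to the code, which is the property the slide asks an Inductive DSMS to provide from SQL.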




References


[Imielinski '96] Tomasz Imielinski and Heikki Mannila. A database perspective on knowledge discovery. Commun. ACM, 39(11):58-64, 1996.

Carlo Zaniolo. Mining Databases and Data Streams with Query Languages and Rules. Invited Talk, Fourth International Workshop on Knowledge Discovery in Inductive Databases, KDID 2005.



Thank you!