Data Mining and Data Visualization

Nov 20, 2013

SOM 485

Fall 2007

Getting Started


What is Data Mining?


Online Analytical Processing


Data Mining Techniques


Market Basket Analysis


Limitations and Challenges to Data Mining


Data Visualization


Siftware Technologies

What is Data Mining (DM)?


Group of activities used to find different patterns in data

Information provided through a Data Warehouse

Provides valuable information for different types of research.

Applications of DM

Customer Relationship Management (CRM) software is an application that can benefit from DM

Activities of CRM


One-to-One Marketing

Sales Force Automation

Sales Campaign Management


Marketing Encyclopedia


Call Center Automation

Verification of DM


Requires a lot of prior knowledge on the decision maker’s part

Used mainly in casinos

e.g., can determine if a new customer is a high roller, a souvenir buyer, a ticket purchaser, etc.

Uses Siftware to help discover new patterns of customer spending habits

Allows effective targeting of a specific group of customers


Online Analytical Processing


Online Analytical Processing (OLAP) was introduced by E. F. Codd in 1993

OLAP: computer process that allows a user to extract data from different viewpoints

Scientific and academic organizations store about 1 terabyte (1 trillion bytes) of new data each day.
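As a rough illustration of extracting the same data from different viewpoints, the sketch below totals a handful of sales facts along whichever dimensions are requested. The fact rows, the dimension names, and the roll_up helper are all hypothetical; a real OLAP server does far more than this.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales)
FACTS = [
    ("East", "Cola", "Q1", 100),
    ("East", "Pizza", "Q1", 150),
    ("West", "Cola", "Q1", 80),
    ("East", "Cola", "Q2", 120),
    ("West", "Pizza", "Q2", 200),
]
DIMENSIONS = ("region", "product", "quarter")

def roll_up(facts, *dims):
    """Total sales grouped by the chosen dimensions (one 'viewpoint')."""
    idx = [DIMENSIONS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]  # last column holds the sales figure
    return dict(totals)

by_region = roll_up(FACTS, "region")
by_product_quarter = roll_up(FACTS, "product", "quarter")
print(by_region)
print(by_product_quarter)
```

The same facts answer both questions; only the grouping key changes.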

OLAP continued…

Codd’s 12 Rules for OLAP

1. Multidimensional View
2. Transparent to the User
3. Accessible
4. Consistent Reporting
5. Client-Server architecture
6. Generic Dimensionality
7. Dynamic Sparse Matrix Handling
8. Multi-user Support
9. Cross-Dimensional Operations
10. Intuitive Data Manipulation
11. Flexible Reporting
12. Infinite Levels of Dimension and Aggregation

OLAP: MOLAP & ROLAP


OLAP data is stored in a Multidimensional Database (MDB)

MOLAP: OLAP application that accesses data from a multidimensional database

MDBs are frequently created using input from an existing Relational Database

ROLAP: OLAP application that accesses data on a Relational Database server and can work with SQL for portability and scalability.

DATA MINING TECHNIQUES

FOUR MAJOR CATEGORIES

1. Classification
2. Association
3. Sequence
4. Cluster

CLASSIFICATION

- Mining processes intended to discover rules that define whether an item belongs to a particular class of data

- Two sub-processes:

1) Building a Model

2) Predicting Classifications
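The two sub-processes can be sketched with a nearest-centroid classifier, used here only because it is short; the slides do not say which classification method is meant. The training data, the feature choice (visits per year, average spend), and the class labels are hypothetical.

```python
import math

# Hypothetical training data: (visits per year, average spend) -> class label
TRAINING = [
    ((20, 900.0), "high roller"),
    ((25, 1100.0), "high roller"),
    ((2, 40.0), "souvenir buyer"),
    ((3, 60.0), "souvenir buyer"),
]

def build_model(examples):
    """Sub-process 1: learn one centroid (mean point) per class."""
    sums, counts = {}, {}
    for features, label in examples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in total)
            for label, total in sums.items()}

def predict(model, features):
    """Sub-process 2: assign the class whose centroid is nearest."""
    return min(model, key=lambda label: math.dist(model[label], features))

model = build_model(TRAINING)
print(predict(model, (22, 1000.0)))
print(predict(model, (1, 25.0)))
```

The model is built once from labeled examples and then reused to classify new, unlabeled items.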


ASSOCIATION


Techniques that employ association search all details from operational systems for patterns with a high probability of repetition

Example: Market Basket Analysis



SEQUENCE


Time series analysis methods relate events in time based on a series of preceding events

Through analysis, various hidden trends, often highly predictive of future events, can be discovered.

Example: Mail Industry
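A very reduced sketch of relating an event to what preceded it: count how often each event is directly followed by each other event, then read off the most likely follower. Looking at only one preceding event is a simplification, and the event log and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical event log, ordered in time
EVENTS = ["browse", "cart", "buy", "browse", "cart", "browse", "cart", "buy"]

def transition_counts(events):
    """Count how often each event is directly followed by each other event."""
    counts = defaultdict(Counter)
    for current, following in zip(events, events[1:]):
        counts[current][following] += 1
    return counts

def most_likely_next(counts, event):
    """Return the most frequent follower of `event` and its relative frequency."""
    followers = counts[event]
    nxt, n = followers.most_common(1)[0]
    return nxt, n / sum(followers.values())

counts = transition_counts(EVENTS)
print(most_likely_next(counts, "cart"))
```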

CLUSTER


Creates partitions so that all members of each set are similar according to some metric

A cluster is simply a set of objects grouped together by virtue of their similarity or proximity to each other

Example: Credit Card Transactions
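One common way to form such partitions is k-means; the slides do not name a specific algorithm, so this is only an assumed illustration. The sketch clusters hypothetical transaction amounts in one dimension, using absolute difference as the metric and fixed starting centers so the run is repeatable.

```python
def k_means_1d(values, centers, iterations=10):
    """Partition values around the given centers (metric: absolute difference)."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical credit card transaction amounts
amounts = [12.0, 15.0, 9.0, 480.0, 510.0, 11.0, 495.0]
centers, clusters = k_means_1d(amounts, centers=[0.0, 100.0])
print(centers)
print(clusters)
```

Small everyday purchases and large purchases end up in separate groups without anyone labeling them in advance.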

DATA MINING TECHNOLOGIES

Providing new answers to old questions

Developing new knowledge and understanding through discovery

Statistical Analysis: statistically evaluating products and making a decision based on logical reasoning

Neural Networks: attempt to mirror the way the human brain works in recognizing patterns by developing mathematical structures with the ability to learn
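To make "mathematical structures with the ability to learn" a little more concrete, here is a single artificial neuron trained with the classic perceptron rule. It is far simpler than a real neural network, and the pattern it learns (logical AND) is a hypothetical toy example.

```python
def train_perceptron(samples, epochs=10, rate=1.0):
    """Learn weights for a single artificial neuron with the perceptron rule."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            output = 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0
            error = target - output  # 0 when the neuron is already right
            weights[0] += rate * error * x1
            weights[1] += rate * error * x2
            bias += rate * error
    return weights, bias

def classify(weights, bias, x1, x2):
    """Fire (1) when the weighted sum of the inputs exceeds zero."""
    return 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0

# Hypothetical pattern to learn: output 1 only when both inputs are 1
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_SAMPLES)
print([classify(weights, bias, a, b) for (a, b), _ in AND_SAMPLES])
```

The weights start at zero and are nudged only when the neuron gets an example wrong, which is the "learning" in miniature.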



DATA MINING TECHNOLOGIES (CONT’D)

Genetic Algorithms and Fuzzy Logic: machine learning techniques that derive meaning from complicated and imprecise data and can extract patterns from and detect trends within the data that are far too complex to be noticed by humans

Decision Trees: assist in data mining applications by the classification of items or events contained within the warehouse


NEW APPLICATIONS FOR DATA MINING

Two new categories of applications

1) Text Mining: summarizes, navigates, and clusters documents contained in a database

2) Web Mining: integrates data and text mining within a Web site; enhances the Web site with intelligent behavior, such as suggesting related links or recommending new products to the consumer

Market Basket Analysis


Market Basket Analysis (MBA) is an algorithm that examines a long list of transactions in order to determine which items are most frequently purchased together.

It takes its name from the idea of a person in a supermarket throwing all of their items into a shopping cart (a "market basket").

Market basket analysis is one of the most common and useful types of data analysis for marketing.

With the data gathered from MBA, marketers can identify products that customers like and group them together.

Market basket analysis can improve the effectiveness of marketing and sales tactics.


Benefits of Market Basket Analysis:


A good indication of consumer behavior


Increase in sales


Improves customer satisfaction


Tracks which types of products interest consumers and finds related alternatives to introduce to them.






ASSOCIATION RULES for MBA



Support



Confidence



Lift


Method

Association rules are a common undirected data mining technique and complement market basket analysis.

These rules are unidirectional:

Left-hand side IMPLIES Right-hand side

ex. Pasta IMPLIES Wine, but Wine IMPLIES Pasta may not hold

40% of transactions that contain Pasta also contain Wine. 4% of transactions contain both of these items.

Support: the percentage of baskets in which the association rule is true, i.e. baskets that contain both the left-hand side and the right-hand side.

ex. 4% of transactions contain both

Confidence: the probability that the right-hand side item is present given that the left-hand side item is present.

ex. 40% of transactions that contain Pasta… p = .40

Lift: compares the confidence of the rule with the likelihood of finding the right-hand side item in any random basket. It measures how well an association rule performs by comparing it with how well the item sells without the other item (improvement).
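The three measures can be computed directly from a list of baskets. The sketch below builds 100 hypothetical transactions that reproduce the Pasta and Wine figures quoted above (4% contain both, 40% of Pasta baskets contain Wine); the 20% overall share of Wine baskets is an assumption added so that lift can be calculated, and rule_metrics is an illustrative name.

```python
# Hypothetical 100 transactions built to match the Pasta/Wine figures;
# the 20% overall share of Wine is an assumption, not a figure from the slides.
TRANSACTIONS = (
    [{"pasta", "wine"}] * 4      # both items
    + [{"pasta"}] * 6            # pasta without wine
    + [{"wine"}] * 16            # wine without pasta
    + [{"bread"}] * 74           # neither item
)

def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs IMPLIES rhs."""
    n = len(transactions)
    both = sum(1 for t in transactions if lhs in t and rhs in t)
    lhs_count = sum(1 for t in transactions if lhs in t)
    rhs_count = sum(1 for t in transactions if rhs in t)
    support = both / n                    # share of baskets with both sides
    confidence = both / lhs_count         # P(rhs | lhs)
    lift = confidence / (rhs_count / n)   # confidence relative to P(rhs)
    return support, confidence, lift

print(rule_metrics(TRANSACTIONS, "pasta", "wine"))
print(rule_metrics(TRANSACTIONS, "wine", "pasta"))
```

Reversing the rule keeps support and lift the same but changes confidence, which is why Pasta IMPLIES Wine can hold while Wine IMPLIES Pasta does not. A lift above 1 suggests the left-hand item makes the right-hand item more likely than it is in a random basket.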

Method

               Frozen Pizza   Milk   Cola   Potato Chips   Pretzels
Frozen Pizza        2           1      2         0            0
Milk                1           3      1         1            1
Cola                2           1      3         0            1
Potato Chips        0           1      0         1            0
Pretzels            0           1      1         0            2
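A co-occurrence table like the one above can be produced by counting, for every pair of items, the baskets that contain both. The five baskets below are hypothetical; they were chosen so that the resulting counts match the table, but the slides do not list the underlying transactions.

```python
from itertools import combinations_with_replacement

ITEMS = ["Frozen Pizza", "Milk", "Cola", "Potato Chips", "Pretzels"]

# Hypothetical baskets, chosen so the counts match the table
BASKETS = [
    {"Frozen Pizza", "Cola", "Milk"},
    {"Milk", "Potato Chips"},
    {"Cola", "Frozen Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]

def co_occurrence(baskets, items):
    """Count, for every pair of items, the baskets that contain both."""
    table = {a: {b: 0 for b in items} for a in items}
    for a, b in combinations_with_replacement(items, 2):
        count = sum(1 for basket in baskets if a in basket and b in basket)
        table[a][b] = table[b][a] = count  # the table is symmetric
    return table

table = co_occurrence(BASKETS, ITEMS)
for item in ITEMS:
    print(f"{item:<13}", [table[item][other] for other in ITEMS])
```

The diagonal holds how many baskets contain each item at all, and the off-diagonal cells show which pairs sell together (here Frozen Pizza and Cola stand out).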

Market Basket Analysis


Market Basket Analysis: determines what products customers purchase together

Limits to Market Basket Analysis










A large amount of data is required to obtain meaningful results, but accuracy is compromised if the products do not all occur with similar frequency.

ex. Milk sells in almost every transaction, but Elmer’s glue sells sporadically; it's not effective to put them in the same basket analysis.

Sometimes presents results that are actually due to the success of previous marketing campaigns.

ex. Discounted price of cola with purchase of pizza.









Using Data from MBA


Once information has been gathered about different items and how they sell with respect to other items, a store may want to change its layout to improve profits.

ex. Lunchboxes and School Supplies

Businesses without an actual storefront may want to offer promotions for products that sell together, increasing sales.

MARKET BASKET ANALYSIS In a Nutshell

Current Limitations and
Challenges to Data Mining


New and underdeveloped field



Identification of missing information


Most companies run legacy systems


Not DW (data warehouse) friendly


DW designers have to convert existing ODSs (operational data stores) to the homogeneous form of the DW

Current Limitations & Challenges to Data Mining

Not all knowledge about application domains is present in the data

ODSs are normally limited to the data needed by the operational application associated with that DB

Data warehouse designers need to include mechanisms for “inventorying” data

Data noise & missing values



Most operational databases contain data errors in their values and/or classification

Errors lead to misclassification

Future data mining systems must incorporate more sophisticated mechanisms for treating “noisy data”

Bayesian technique: a statistical technique
Large Databases & high dimensionality

Databases are large & dynamic

Contents are always changing

Data patterns must be constantly updated

New discovery applications have to partition problems into smaller chunks of manageable data without losing any essential attributes of the data
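One way to picture working in manageable chunks: summarize each chunk separately and merge the partial results, so that the merged summary equals what a single pass over everything would give. The transaction log, the chunk size, and the helper names below are hypothetical, and item counting stands in for whatever pattern is actually being mined.

```python
from collections import Counter

def chunks(records, size):
    """Yield successive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]

def count_items(baskets):
    """Pattern summary for one chunk: how many baskets contain each item."""
    counts = Counter()
    for basket in baskets:
        counts.update(set(basket))
    return counts

# Hypothetical transaction log
LOG = [["milk", "cola"], ["milk"], ["cola", "pizza"], ["milk", "pizza"], ["cola"]]

totals = Counter()
for chunk in chunks(LOG, size=2):
    totals += count_items(chunk)   # merge each chunk's partial result

print(totals)
```

Because the partial counts simply add up, new chunks can also be merged in as the database changes, without reprocessing the older data.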

Data Visualization


Process by which numerical data are converted into meaningful 3-D images

Example

Intended to analyze complex data

Data from: satellite photos, sonar measurements, surveys, or computer simulations


History of Data Visualization


Originated from statistics and science

Example of 2-D

Advancement credited to NCSA (National Center for Supercomputing Applications)

Newest developments by Xerox PARC in virtual reality


Human Visual Perception


Human visual cortex dominates our perception

Accelerates the identification of hidden patterns in data


“A picture is worth a thousand words”


Geographical Information Systems (GIS)

A special-purpose DB in which a common spatial coordinate system is the primary means of reference

Requires:

1. Data input
2. Data storage, retrieval, and query
3. Data transformation, analysis, and modeling
4. Data reporting

Integrates info. and aids in decision making

GIS continued


Spatial Data: elements stored in map form

Contains three basic components:

1. Points
2. Lines
3. Polygons

Attribute Data: describes spatial data
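A minimal sketch of how spatial data and attribute data might sit together in code: each feature carries its coordinates in one shared coordinate system plus a dictionary of descriptive attributes. The Feature class, the sample features, and the two measurement helpers are hypothetical and not taken from any particular GIS product.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Feature:
    """A spatial element plus the attribute data that describes it."""
    kind: str                       # "point", "line", or "polygon"
    coords: list                    # (x, y) pairs in a common coordinate system
    attributes: dict = field(default_factory=dict)

def line_length(feature):
    """Sum of straight segment lengths along a line feature."""
    return sum(math.dist(a, b) for a, b in zip(feature.coords, feature.coords[1:]))

def polygon_area(feature):
    """Area of a simple polygon via the shoelace formula."""
    pts = feature.coords
    twice_area = sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]))
    return abs(twice_area) / 2

store = Feature("point", [(2, 3)], {"name": "Store 12"})
road = Feature("line", [(0, 0), (3, 4), (3, 10)], {"name": "Main St"})
lot = Feature("polygon", [(0, 0), (4, 0), (4, 3), (0, 3)], {"zoning": "retail"})

print(line_length(road), polygon_area(lot))
```

Because every feature uses the same coordinate system, lengths, areas, and distances between features can be computed and then reported alongside the attributes.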


Example of GIS

Applications of Data Visualization Techniques


Retail Banking


Government


Insurance


Health Care and Medicine


Telecommunications


Transportation


Capital Markets


Asset Management

Siftware Technologies


IBM


Informix


Red Brick


DB2


Oracle


Silicon Graphics


Sybase




Offers several Data Mining solutions, depending on the user's needs.



IBM Information Warehouse Solutions



IBM Visualizer



Red Brick

Informix


Three-tier model

Tier 1: “Client” presentation layer

Tier 2: Hewlett-Packard hardware

Tier 3: Data layer, INFORMIX-OnLine database




Sybase Warehouse WORKS


Assemble data from many sources

Transform data for a consistent and understandable view

Distribute data where needed

Provide high-speed access to the data





Leading company for large-scale data mining

Data spread across multiple databases

Data spread across processors for faster queries





Discover new patterns and trends that may not be realized using traditional SQL

Three-dimensional Visualization

Visual models can save days and even months from the review process

Review


Data mining (DM)



Techniques used to mine data



Market Basket Analysis: The King of DM Algorithms

Review continued…

Current Limitations and Challenges to Data Mining



Data Visualization



Siftware Technologies