Data mining - Institute of Technology Sligo

Nov 20, 2013 (3 years and 8 months ago)

Data Mining

Knowledge Discovery in Databases

Data 3

1

Data Mining


Data mining is a capability to support the recognition of previously unknown but potentially useful relationships within large databases/data warehouses.



Aim: find useful patterns in the data.


Uses statistical, mathematical, artificial intelligence, and machine-learning techniques.



Data Mining Tools


Data mining tools use statistical or rules-based methods to identify patterns and create predictive models.


Tools look for patterns using a variety of models


Statistical methods e.g. correlation


Decision trees


Case-based reasoning


Neural computing


Intelligent agents


Genetic algorithms



Text Mining


Analyse text documents.




Find hidden content


Group by themes


Determine relationships between documents


Process of Data Mining/ Knowledge Discovery

[Figure: stages of the knowledge discovery process. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

What does it let you do?


Data mining automates the process of sifting through historical data in order to discover new information.

Data mining techniques enable users to identify patterns and correlations within a set of data.

These can then be used as predictive models that anticipate behaviour or events based on trends in the data.

Correlation versus Causation


Correlation


A statistical relation between two or more variables such that changes in the value of one variable are accompanied by changes in the value of the other.

Causation

Changes in one variable cause changes in another.
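
As a rough illustration of the correlation side of this distinction, the sketch below computes a Pearson correlation coefficient by hand. The two data series are invented for the example and are not from the slides. A value of r close to 1 only says the two series move together; it says nothing about whether one causes the other (here, a third factor such as hot weather could plausibly drive both).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical monthly figures, invented for illustration only.
ice_cream_sales = [20, 25, 40, 60, 80, 95]
sunburn_cases = [2, 3, 5, 8, 11, 13]

r = pearson(ice_cream_sales, sunburn_cases)
print(f"r = {r:.3f}")  # strongly positive, yet neither variable causes the other
```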

What do you need for Data Mining?


Massive data collection


Powerful computers


Data mining algorithms


Five Basic Operations



Clustering

Identifies groups of items that share a particular characteristic

Classification

Infers the defining characteristics of a certain group

Association

Identifies relationships between events that occur at the one time

Sequencing

Identifies relationships over time

Forecasting

Estimates future values based on patterns within large sets of data

Clustering


The process of identifying relationships between similar records without any preconceived notion of what that similarity might involve.

Examples:

Disease clusters

Similarities in customers' telephone usage

Often used as an exploratory exercise before further data mining using a classification technique.
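
One common way to cluster records is k-means, sketched below in plain Python. The slides do not name a specific algorithm, so the choice of k-means, the function name `kmeans`, and the telephone-usage figures are all assumptions made for illustration. No labels are supplied: the groups emerge only from how close the records are to one another.

```python
def kmeans(points, k, iterations=20):
    """Very small k-means for 2-D points; the first k points seed the centres."""
    centres = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centres]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centre to the mean of its cluster.
        new_centres = []
        for cluster, old in zip(clusters, centres):
            if cluster:
                new_centres.append((sum(p[0] for p in cluster) / len(cluster),
                                    sum(p[1] for p in cluster) / len(cluster)))
            else:
                new_centres.append(old)  # an empty cluster keeps its old centre
        if new_centres == centres:
            break  # converged: no centre moved
        centres = new_centres
    return centres, clusters

# Hypothetical customers: (calls per week, average call length in minutes)
usage = [(2, 1.0), (30, 12.0), (3, 1.5), (28, 10.0), (1, 2.0), (32, 11.0)]
centres, clusters = kmeans(usage, k=2)
print(centres)
```

With this data the light and heavy users separate into two groups of three, which could then be labelled and fed into a classification technique.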

Classification


The data mining (DM) system learns from examples of the data how to partition or classify the data, i.e. it formulates classification rules which can be used for prediction.

Example: a bank classifies customers and may offer them differing levels of service, different offers and different charges. It can also build loan approval models.
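
To make "learning a classification rule from examples" concrete, here is a minimal sketch that learns a single-threshold rule from labelled loan records. It is far simpler than a real loan approval model. The function names and the training data are hypothetical and not taken from the slides.

```python
def learn_threshold_rule(examples):
    """Learn a rule 'approve if value >= threshold' from (value, label) pairs.

    Tries every observed value as a threshold and keeps the one that
    classifies the most training examples correctly.
    """
    best_threshold, best_correct = None, -1
    for threshold in sorted({value for value, _ in examples}):
        correct = sum((value >= threshold) == label for value, label in examples)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold

# Hypothetical history: (annual income in thousands, loan repaid?)
history = [(18, False), (22, False), (25, False), (31, True),
           (40, True), (55, True), (28, False)]
threshold = learn_threshold_rule(history)

def approve(income):
    """Apply the learned rule to a new applicant."""
    return income >= threshold

print(threshold, approve(35), approve(20))
```

Decision trees generalise this idea by stacking many such threshold tests on different attributes.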

Association


Looks for links between records in a data set, e.g. items purchased at the one time.

Patterns can be identified to indicate probabilities, e.g.

500,000 transactions in total

20,000 contain nappies

30,000 contain beer

10,000 contain both nappies and beer

Beer and nappies occur together in 2% of transactions.

“When people buy beer they buy nappies 1/3 of the time”

“When people buy nappies they buy beer 50% of the time”
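
The percentages on this slide follow directly from the four counts. The short calculation below reproduces them. The names "support" and "confidence" are the usual terms for these ratios in association rule mining. The slide does not use these terms, so treat the variable names as an assumption.

```python
transactions = 500_000
with_nappies = 20_000   # transactions that include nappies
with_beer = 30_000      # transactions that include beer
with_both = 10_000      # transactions that include both

# Support: share of all transactions containing both items.
support = with_both / transactions
# Confidence of "beer -> nappies": of the beer buyers, how many also bought nappies.
conf_beer_to_nappies = with_both / with_beer
# Confidence of "nappies -> beer": of the nappy buyers, how many also bought beer.
conf_nappies_to_beer = with_both / with_nappies

print(f"support = {support:.0%}")                       # 2%
print(f"beer -> nappies = {conf_beer_to_nappies:.1%}")  # 33.3%
print(f"nappies -> beer = {conf_nappies_to_beer:.0%}")  # 50%
```

Note that the rule is not symmetric: the same 10,000 shared transactions give different confidence values depending on which item is taken as the starting point.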

Sequential Analysis


A form of association used to track relationships over time.

E.g. health insurance claims.

E.g. 10% of customers who bought a tent bought a backpack within one month.

Weather patterns, e.g. a tidal wave in Hawaii follows a hurricane in the N. Atlantic x% of the time.
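
A figure such as the tent and backpack percentage can be derived from a dated purchase log. The sketch below shows one possible way to do it. The function `follow_on_rate`, the 30-day window standing in for "one month", and the purchase records are all assumptions made for this example.

```python
from datetime import date, timedelta

def follow_on_rate(purchases, first_item, second_item, window_days=30):
    """Share of first_item buyers who bought second_item within window_days afterwards."""
    by_customer = {}
    for customer, item, day in purchases:
        by_customer.setdefault(customer, []).append((item, day))

    buyers = followers = 0
    for items in by_customer.values():
        first_days = [d for i, d in items if i == first_item]
        if not first_days:
            continue  # this customer never bought the first item
        buyers += 1
        start = min(first_days)
        window = timedelta(days=window_days)
        # Only purchases on or after the first item's date count as follow-ons.
        if any(i == second_item and timedelta(0) <= d - start <= window
               for i, d in items):
            followers += 1
    return followers / buyers if buyers else 0.0

# Hypothetical purchase log: (customer, item, date)
purchases = [
    ("c1", "tent", date(2013, 5, 1)), ("c1", "backpack", date(2013, 5, 20)),
    ("c2", "tent", date(2013, 5, 3)), ("c2", "backpack", date(2013, 8, 1)),
    ("c3", "tent", date(2013, 6, 10)),
    ("c4", "backpack", date(2013, 6, 1)), ("c4", "tent", date(2013, 6, 15)),
    ("c5", "stove", date(2013, 6, 2)),
]
rate = follow_on_rate(purchases, "tent", "backpack")
print(f"{rate:.0%} of tent buyers bought a backpack within 30 days")
```

Unlike plain association, the order of events matters here: customer c4 bought both items but the backpack came first, so that pair does not count.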

Forecasting


Concerns the prediction of continuous variables, e.g. sales, share values, stock market levels, oil prices etc.

Often done with regression functions: statistical methods for examining the relationship between variables in order to predict a future value.

2 types:

Forecasting a single continuous value based on unordered examples, e.g. predicting income based on personal details.

Predicting one or more values based on a sequential pattern, i.e. time series forecasting.
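
As a minimal sketch of regression-based forecasting, the code below fits a straight line to a short sales series by ordinary least squares and extends it one step ahead. The sales figures are invented for illustration, and a straight-line trend is only one of many possible regression models.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical quarterly sales (in thousands), invented for this example.
quarters = [1, 2, 3, 4, 5, 6]
sales = [102, 108, 115, 119, 127, 131]

slope, intercept = fit_line(quarters, sales)
forecast_q7 = slope * 7 + intercept
print(f"forecast for quarter 7: {forecast_q7:.1f}")
```

The same `fit_line` function covers the first forecasting type as well if `xs` holds an unordered attribute (for example years of experience) rather than a time index.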

Data Mining Tools in more detail


Case-based reasoning

Use historical cases to identify patterns.

Neural computing

Examine historical data for pattern recognition, e.g. identify potential customers for a new product.

Intelligent agents

Retrieve information from large databases.

Other tools, e.g. decision trees, rule induction, data visualisation.

Some Key Applications Areas


Data mining is used in many different areas.

Two big areas are:

Market analysis and management

Initial data gathered from:
credit card transactions, loyalty cards, discount coupons, customer complaint calls, lifestyle studies, focus groups

Fraud detection and management



Examples: Market analysis and management

Target marketing

Find clusters of “model” customers who share the same characteristics, e.g. interests, income

Determine customer purchasing patterns over time

Cross-market analysis uses associations/correlations between product sales and predicts based on the association information

Customer profiling

What types of customers buy what products

Identifying customer requirements

Identifying the best products for different customers; use prediction to find what factors will attract new customers


Fraud detection and management


Used in health care, retail, credit card services, telecommunications (phone card fraud), etc.

Use historical data to build models of fraudulent behaviour and use data mining to help identify similar instances.

Examples

Auto insurance: detect a group of people who stage accidents to collect on insurance

Money laundering: detect suspicious money transactions

Medical insurance: detect professional patients, rings of doctors and rings of references

Text Mining

Application of data mining to unstructured or less structured files.

Text mining operates with less structured information and helps organisations to:

Find hidden content of documents, including useful relationships.

Relate documents across unnoticed divisions, e.g. customers in two product divisions have the same characteristics.

Group documents by themes, e.g. all customers who have similar complaints.
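
Grouping documents by theme can be approximated very crudely by comparing the words they share. The sketch below uses Jaccard similarity on word sets with a greedy grouping pass. The slides do not prescribe this method, and the stop-word list, the 0.25 threshold and the complaint texts are assumptions chosen for illustration. Real text mining tools use much richer representations.

```python
def tokens(text):
    """Lower-cased word set, ignoring a few very common words."""
    stop = {"the", "a", "is", "my", "was", "and", "to", "i", "for", "on"}
    return {w.strip(".,!?").lower() for w in text.split()} - stop

def jaccard(a, b):
    """Overlap between two word sets: shared words divided by all words."""
    return len(a & b) / len(a | b) if a | b else 0.0

def group_by_theme(documents, threshold=0.25):
    """Greedy grouping: a document joins the first group whose seed it resembles."""
    groups = []  # each group is a list of document indices; the first is the seed
    for index, text in enumerate(documents):
        words = tokens(text)
        for group in groups:
            if jaccard(words, tokens(documents[group[0]])) >= threshold:
                group.append(index)
                break
        else:
            groups.append([index])  # no similar group found: start a new theme
    return groups

# Hypothetical customer complaints, invented for this example.
complaints = [
    "Delivery was late and the parcel arrived damaged",
    "My bill is wrong, I was charged twice",
    "Parcel arrived late, delivery took two weeks",
    "Charged twice on my bill this month",
]
groups = group_by_theme(complaints)
print(groups)
```

Here the delivery complaints and the billing complaints end up in separate groups, even though nobody labelled them in advance.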

Some more example applications by area


Marketing: predicting which customers will respond to internet banners or buy a product; segmenting customer demographics.

Banking: forecasting bad loans and fraudulent credit card usage, credit card spending by new customers, and which customers will respond best to new loan offers.

Retailing and Sales: predicting sales, correct stock levels and distribution schedules.

Manufacturing and Production: predicting when to expect machinery failures; finding key factors that control the optimisation of manufacturing capacity.


Brokerage and Securities Trading: predicting when bond prices will change, forecasting range of stock fluctuation for particular issues, determining when to trade stock.


Insurance: forecasting claim amounts, medical coverage
costs, classifying the most important elements that affect
medical coverage, predicting which customers will buy
new policies.


Computer Hardware and Software: Predicting drive
failure, forecasting creation time for new chips, predicting
potential security violations.



Government and Defence: Forecasting cost of moving
military equipment, testing strategies for potential
military engagements, predicting resource consumption.



Airlines: capturing data on where customers are flying and the destination of those who change carriers mid-flight.

Healthcare: correlating demographics of patients with critical illnesses.

Broadcasting: predicting which programs are best shown in prime time and how to maximise returns by inserting advertisements.




Police: tracking crime patterns, locations, criminal
behaviour and attributes to help crack criminal
cases.

Problems with data mining


Need clear business objectives and access to the appropriate data.

Need the right data.

Bad data quality can lead to spurious results.

Models are not fail-safe.

Privacy, property and other legal and ethical issues.

Companies must change mode of operation and maintain the effort (e.g. loyalty programs such as air miles).

Conclusion


Data mining is an attractive-sounding technology which is still evolving.

The key is that the algorithms discover useful relationships, unlike standard research where researchers hypothesise correlations and then search for them.

There are ethical issues, e.g. criminal profiling.
