Topic 7: Data Mining
Business Applications of data mining
Data Mining Activities
Data Mining Techniques
How to Apply Data Mining
Data Mining Development Methodology
References
Berry, M., & Linoff, G., Mastering Data Mining, John Wiley & Sons, New York, 2000.
Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., & Zanasi, A., Discovering Data Mining: From Concept to Implementation, Prentice Hall, Englewood Cliffs, NJ, 1998.
Dhar, V., & Stein, R., "Deriving Rules from Data", in Seven Methods for Transforming Corporate Data into Business Intelligence, Prentice Hall, 1997, pp. 167.
Ganti, V., Gehrke, J., & Ramakrishnan, R., "Mining Very Large Databases", IEEE Computer, Vol. 32, No. 8, August 1999, pp. 38.
Hirji, K., "Exploring Data Mining Implementation", Communications of the ACM, Vol. 44, No. 7, July 2001, pp. 87.
Web site on Data Mining and Web Mining
Why data mining?
The advent of information technology in the last four decades has resulted in an abundance of data generated or captured electronically. Some of the sources for this proliferation of data are: data generated by point-of-sale (POS) devices such as bar code scanners, customer call detail databases, and web log files in e-commerce systems.
While the generation, storage and transmission of data are becoming more and more efficient, organizations are ending up with huge amounts of mostly day-to-day transaction data, which are being stored away in database files for possible future use. A more organised and useful approach to storing data has given rise to organizational data warehouses, which are central repositories of cleaned and integrated data. As an example of the magnitude of business data, the US Library of Congress has 17 million books, equivalent to about 17 terabytes of data, which is also reported to be the volume of data in the database of the UPS shipping company.
Data is being collected mostly for improving the efficiency of underlying operations, not for analysis and prediction. It has become obvious that businesses can gain competitive advantage if useful information on market and customer behaviour can be extracted by "mining" the data. Such information may indicate important underlying trends, associations or patterns in market behaviour, which can help obtain answers to questions like "Based on past buying behaviour, which customers should be targeted for direct marketing?"
According to (Hirji 2001), "A practical and applied definition of data mining is the analysis and non-trivial extraction of data from databases for the purpose of discovering new and valuable information, in the form of patterns and rules, from relationships between data elements."
The term data mining stands for the process of, rather than a product for, exploring and analysing data to discover new and useful information. This concept is not new; data analysis using statistical methods such as regression is already a well-established practice. But statistical methods suffer from a scalability problem: they work well with relatively small data sets and a manageable number of variables, but not with millions of records and thousands of variables. The advent of intelligent techniques such as artificial neural networks and decision trees has made it possible to perform data mining involving large volumes of data more effectively and efficiently.
Current interest in data mining, in particular for gaining competitive advantage in business, is growing. Its application in areas such as direct target marketing campaigns, fraud detection, and the development of models to aid in financial predictions is likely to intensify in the coming years.
Data mining projects involve both the utilization of established algorithms from statistics, database systems and information visualization, and the development of new methods and algorithms targeted at large data mining problems. Nowadays data mining efforts have gone beyond crunching databases of credit card usage or stored transaction records. They have been focusing on data collected in the health care system, art, design, medicine, biology and other areas of human endeavour.
On-Line Analytical Processing (OLAP), Data Mining and Knowledge Discovery
Unlike data warehouses, which aim to present a single centralised view of the data in an organization, OLAP databases (known as data marts) offer improved speed and responsiveness by limiting themselves to a single view of the data specific to the department that owns the database. OLAP databases are organised using a multidimensional structure (called a cube) built on dimensions of the business such as time, product type and geography. This enables the analyst to segregate and analyse data along the different aspects of business activity. Sometimes a data warehouse can consist of a collection of data marts.
Both data mining and OLAP are tools for decision support. While OLAP attempts to find out what is happening and how it is happening, data mining aims to find the underlying characteristics of data in order to be able to predict what is likely to happen. The difference between OLAP and data mining is best understood in terms of the questions they allow the user to ask, for example: "How has the sale of product X varied from quarter to quarter during the past year over all the geographic regions in which it is sold?" (OLAP), versus "Which customers are more likely to buy product X?" (data mining).
Data mining belongs to the wider field of knowledge discovery, which concerns the automated extraction of useful information from a diverse range of sources including textual, multimedia and web data. Data mining is knowledge discovery applied predominantly to data held in structured databases.
The main objectives of this topic are to:
Understand the role of data mining in business
Distinguish between different data mining techniques
Understand how to go about making use of data mining
Business Applications of Data Mining
The data mining segment is one of the fastest growing segments in the business intelligence market. Companies are increasingly investigating the potential of data mining technology to deliver competitive advantage. It is being regarded as an integral and necessary component of an organization's portfolio of analytical techniques. The use of data mining techniques as an intelligent system tool in a number of aspects of business is outlined below.
Data mining for marketing
Many of the most successful applications of data mining are in marketing. In this area, data mining is used both for reducing cost and for increasing revenue. Databases contain collections of data on prospective targets of a marketing campaign. These data relate to information on customers obtained from operational systems such as point-of-sale systems, or from commercial databases available for a fee. Data mining can be used to reduce marketing costs by eliminating calls and letters to people who are unlikely to respond to an offer.
Data mining for customer relationship management
Good customer relationship management involves anticipating customers' needs and responding to them proactively. This can be made possible in a large enterprise only through the application of data mining techniques to the records in its customer database.
Data mining in R&D
Data mining can lower costs during the research and development phase of the product life cycle. One example is the pharmaceutical industry, which is characterised by the generation of large amounts of test data. These data have to be analysed using sophisticated prediction techniques to determine which chemicals are likely to produce useful drugs. An entire discipline of bioinformatics has grown up to mine and interpret the data being generated by high-throughput screening and other biological sources.
Data Mining Activities
The different objectives of performing data mining may be categorised into two broad groups:
directed data mining
undirected data mining
In directed data mining, we know what we are looking for: we aim to find the value of a pre-defined target variable in terms of a collection of input variables, eg, classifying insurance claims. Undirected data mining takes a bottom-up approach: it finds patterns in data and leaves it to the user to find the significance of these patterns, eg, an analysis of customer profiles may result in identifying groups of customers with similar buying patterns.
The different types of data mining tasks are listed below:
Classification
Estimation
Prediction
Finding affinity grouping or association rules
Clustering
Description and visualisation
Classification, estimation and prediction are examples of directed data mining, while the remaining three tasks belong to the undirected data mining group.
Classification
Classification assigns a given object to a predefined category (class) based on the object's attributes (features). In the business data mining context, the objects to be classified are generally represented by database records. Examples of classification tasks include:
Assigning keywords to articles
Classifying credit applicants as low, medium or high risk
Assigning customers to predefined customer segments
Estimation
While classification produces discrete outcomes (yes or no; low, medium, high; etc), estimation deals with continuously varying outcomes, such as income, probability of a customer leaving (known in data mining circles as churn) or average number of children in a family. Outcomes of such estimation can also be used for classification, by ranking the values and applying a threshold to categorise them.
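The link between estimation and classification can be illustrated with a short sketch. The churn scores and the 0.5 cut-off below are made-up illustrative values:

```python
# Turn continuous churn-probability estimates into discrete classes
# by applying a threshold to each estimated value.
# The scores and the 0.5 threshold are illustrative only.

def classify_by_threshold(scores, threshold):
    """Map each (customer, estimate) pair to a churn/stay class label."""
    return {customer: ("churn" if score >= threshold else "stay")
            for customer, score in scores.items()}

estimates = {"alice": 0.82, "bob": 0.35, "carol": 0.55}  # estimated churn probabilities
labels = classify_by_threshold(estimates, threshold=0.5)
print(labels)  # {'alice': 'churn', 'bob': 'stay', 'carol': 'churn'}
```

Moving the threshold trades off the two kinds of misclassification, which is exactly the ranking-and-cut-off idea described above.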
Prediction
A prediction can be a classification or estimation task performed to predict some future behaviour. Examples include:
Predicting which customers will churn in the next six months
Predicting the size of a balance that will be transferred
Finding affinity grouping or association rules
The task of affinity grouping is to find out which things go together. The prototypical example is determining which things go together in a supermarket shopping trolley. Supermarkets make use of such information for arranging items on shelves or in catalogues, and for identifying cross-selling opportunities (eg, people who buy product X also buy products Y and Z).
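A minimal sketch of affinity grouping is to count how often each pair of items appears in the same trolley; frequent pairs are candidate cross-selling rules. The baskets below are invented for illustration:

```python
from itertools import combinations
from collections import Counter

# Count how often each pair of products appears in the same trolley.
# The baskets are invented illustrative data.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"beer", "crisps"},
    {"bread", "milk"},
]

pair_counts = Counter()
for basket in baskets:
    # Each unordered pair of items in the basket counts once.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts[("bread", "butter")])  # prints 2
```

Real association-rule tools go further (support and confidence thresholds, rules over larger itemsets), but the underlying co-occurrence counting is the same.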
Clustering
Clustering segments a group of diverse records into subgroups or clusters containing similar records. There are no predefined classes in clustering; records are grouped based on similarities in their attributes, eg, people with similar buying habits. Different clusters of music CD purchases may indicate different age or cultural sub-groups. It is left to the data miner to interpret such clusters and decide what use can be made of them.
Description and visualisation
Data mining can be used simply to describe a database, to help increase our understanding of the people, products or processes that produced the data in the first place. A good description can also provide an explanation of their behaviour, or at least indicate where to look for an explanation. Data visualisation is a powerful form of descriptive data mining. Although visualisation may not be meaningful in all cases, it can be very effective in explaining things by exploiting our ability to make use of visual clues.
Data Mining Techniques
The field of data mining spans a number of disciplines including statistics, databases and machine learning. In this topic, we'll aim to find out when to apply data mining techniques, how to interpret their results, and how to evaluate their performance. To be able to do so, we need a basic understanding of the inner workings of these techniques. The three major approaches to data mining are: decision trees, automatic cluster detection and artificial neural networks (supervised and unsupervised).
Decision Trees
A decision tree may be regarded as the visual representation of a reasoning process. Decision trees are particularly suitable for solving classification problems. A decision tree consists of nodes, branches and leaves. In it, each internal node represents a test on an attribute, called a splitting attribute, and each leaf node is labelled with a class label. The class label is decided by the class of the records that ended up in that leaf during training. A leaf node may also contain a value, depending upon the average of the values of such records. Each edge originating from an internal node is labelled with a splitting predicate that involves only the node's splitting attribute. The splitting predicates have the property that any record will take a unique path from the root to exactly one leaf node.
To help us understand how the decision tree works, we can view each record with n attributes as a point in an n-dimensional record space. Each branch in a decision tree is a test on a single variable that splits the space into two or more regions. After the split at the root node of the tree, each of these regions will have a mix of records with different values for most, if not all, attributes except the one which was tested at the root node. With each successive test and split, the resulting regions become more and more segregated, with increasing homogeneity among the records in each region. Ultimately, the leaf nodes will contain the purest batches of records. For example, in the decision tree of Figure 1, any employed person aged less than 41 and earning a salary of more than $50,000 will be classified as belonging to group A.
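The path through such a tree can be written directly as nested tests. The tree below is a hypothetical reconstruction in the spirit of Figure 1, not the actual figure; the thresholds and group labels are illustrative:

```python
# A decision tree rendered as explicit nested tests.
# Hypothetical tree in the spirit of Figure 1 (the actual figure is
# not reproduced here); thresholds and group labels are illustrative.

def classify(employed, age, salary):
    """Trace a record from the root to a leaf and return its class label."""
    if employed:                  # root node: splitting attribute 'employed'
        if age < 41:              # internal node: splitting attribute 'age'
            if salary > 50_000:   # internal node: splitting attribute 'salary'
                return "group A"
            return "group B"
        return "group C"
    return "group D"

print(classify(employed=True, age=35, salary=60_000))  # group A
```

Each record satisfies exactly one chain of predicates, so it lands in exactly one leaf, matching the unique-path property described above.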
It is possible to build a decision tree that correctly classifies every single record, assuming no two records have the same set of attributes (input variables) but belong to two different classes (target variables). This is not desirable though, as it gives rise to the problem known as overfitting. Such a tree describes the training data very well but is unlikely to generalise to new data sets. As an example, a large decision tree may be built to identify every inhabitant of a small town: each leaf node will be labelled with a name (assuming no two persons with the same name live in this town). But this tree won't be able to do anything useful, like determining whether a given person belongs to the group of overweight male teenage students (no representative leaf node). To prevent overfitting, a test data set is used to prune a decision tree once it has been built using the training data set.
Figure 1 Sample decision tree for a catalogue mailing (Ganti et al. 1999).
There are a number of different types of decision trees, depending mainly upon the number of splits allowed at each level, how these splits are chosen when the tree is built, and how the tree is pruned to prevent overfitting. More broadly, decision trees can be grouped as either classification trees (leaves represent classes) or regression trees (leaves represent a numeric value). There are various algorithms for building decision trees; the most notable among them are CHAID, C4.5/C5.0 and CART. Typical data mining software tools these days allow the user to choose among several splitting criteria and pruning strategies, and to control parameters such as maximum tree depth, to allow approximation of any of these algorithms.
How decision trees are built
Decision trees are built through a process known as recursive partitioning. Recursive partitioning is an iterative process of splitting the data up into partitions (regions of record space). Initially all the records are in a training set: the preclassified records that are used to determine the structure of the decision tree. An algorithm splits up the data, trying every possible binary split on every field of the records. The algorithm chooses the split that partitions the data into two parts that are purer than the original data (the training set). The splitting process is then applied to each of the new parts, and so on, until no more useful splits can be found.
The most important task in building a decision tree is to decide which of the attributes (independent fields in a record) gives the best split. The best split is defined as one that creates partitions in which a single class predominates. The measure used to evaluate a potential splitter is the reduction in diversity (increase in purity). There are several methods of calculating the index of diversity for a set of records. For a set with two classes, one measure, called the Gini index in data mining circles, is given by the formula 2p(1 - p), where p is the probability of class one. Two other diversity indexes are min(p, 1 - p), and -p log p - (1 - p) log(1 - p), known as entropy.
To choose the best splitter at a node, the decision tree algorithm considers each input field in turn. Then, every possible split is tried. The diversity measure is calculated for the two new partitions, and the best split is the one with the largest reduction in diversity. The field which yields the best split is chosen as the splitter for that node. If a field takes on only one value, it is eliminated from consideration, since there is no way it can be used to create a split. When no split can be found that significantly decreases the diversity of a given node, then that node is a leaf node. Eventually only leaf nodes remain, and the full decision tree has been grown.
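The split-selection step can be sketched as follows, using the two-class Gini index 2p(1 - p). The toy records and the salary field are made up for illustration:

```python
# Choose the best binary split on a numeric field by the reduction in
# diversity, measured with the two-class Gini index 2p(1 - p).
# The toy records and the salary field are illustrative.

def gini(records):
    """Diversity of a set of (value, class) records; 0 means pure."""
    if not records:
        return 0.0
    p = sum(1 for _, cls in records if cls == 1) / len(records)
    return 2 * p * (1 - p)

def best_split(records):
    """Try each observed value as a threshold; return the (threshold,
    reduction) pair with the largest weighted reduction in diversity."""
    parent = gini(records)
    best = None
    for threshold, _ in records:
        left = [r for r in records if r[0] <= threshold]
        right = [r for r in records if r[0] > threshold]
        if not left or not right:
            continue  # a one-sided split is not a split
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(records)
        reduction = parent - weighted
        if best is None or reduction > best[1]:
            best = (threshold, reduction)
    return best

# (salary in $000s, class) pairs: class 1 responds to the mailing.
data = [(20, 0), (30, 0), (45, 1), (60, 1), (80, 1)]
print(best_split(data))  # threshold 30 separates the classes perfectly
```

Applied recursively to each resulting partition, this is exactly the recursive partitioning process described above.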
As mentioned earlier, the full decision tree needs to be pruned to improve its performance. Pruning is done by removing leaves and branches (edges leading to leaves) that fail to generalise. There are a number of pruning methods, one of which uses the actual performance of the tree on a separate set of preclassified data, called the test set. A tree is pruned back to the subtree that minimises error on the test set.
Application of decision trees
Decision trees are useful when the data mining task is classification of records or prediction of outcomes. They are used when the goal is to assign each record to one of a few broad categories. Decision tree methods are also chosen for their ability to generate understandable rules, which can be explained and translated into SQL or a natural language. For any classified record, the rule for its classification can be generated by simply tracing the path from the root to the leaf where the record ended up. Most decision tree tools provide this capability.
Like artificial neural networks, decision trees represent a class of machine learning algorithms, since they are capable of generating rules from training data. One major difference between these two paradigms, however, is that unlike a neural net, whose rules (or input-output mappings) are implicit in its weights, the rules in a decision tree are explicit.
Automatic Cluster Detection
Cluster detection aims to discover structure in a complex data set as a whole in order to carve it up into simpler groups. Examples of cluster detection tasks include finding products that should be grouped together in a catalogue, or identifying groups of customers with similar tastes in music. There are many methods for finding clusters in data, a prominent one among which is described below.
The K-means clustering algorithm is available in a wide variety of commercial data mining tools. It divides the data set into a predetermined number, K, of clusters. These clusters are initially centred at random points in the record space. Records are assigned to the clusters through an iterative process that moves the cluster means (also called cluster centroids) around until each one is actually at the centre of some cluster of records.
In the first step, K data points are selected more or less arbitrarily to be the seeds. Each of these seeds is an embryonic cluster with only one element. In the example shown in Figure 2, K = 3.
In the second step, each record is assigned to the cluster whose centroid is nearest to that record. This forms the three clusters shown in Figure 3, with their intercluster boundaries. Note that the boxed record, which was assigned to cluster 2 (seed 2) initially, later becomes part of cluster 1 (Figure 4).
Figure 2 Initial cluster seeds (from Berry & Linoff 2000).
Figure 3 Initial clusters and intercluster boundaries (from Berry & Linoff 2000).
The centroid of a cluster of records is calculated by taking the average of each field over all the records in that cluster. For measuring distances between a record and a cluster's centroid, the Euclidean distance is most commonly used by data mining software.
In the K-means method, the original choice of the value of K determines the number of clusters that will be found. Unless advance knowledge is available on the likely number of clusters, it is worth experimenting with different values of K. Best results are obtained when K matches the underlying structure of the data.
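The seed-assign-recompute loop described above can be sketched in a few lines. The 2-D points and the choice K = 2 are invented for illustration:

```python
import math
import random

# A minimal K-means sketch: pick K seeds, assign each record to the
# nearest centroid, recompute centroids, and repeat until stable.
# The 2-D points and K = 2 are invented illustrative data.

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(records, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(records, k)          # step 1: choose K seeds
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for r in records:                       # step 2: assign to nearest centroid
            nearest = min(range(k), key=lambda i: euclidean(r, centroids[i]))
            clusters[nearest].append(r)
        for i, members in enumerate(clusters):  # step 3: move each centroid to
            if members:                         # the mean of its members
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(points, k=2)
print(sorted(len(c) for c in clusters))  # two clusters of three records each
```

Commercial tools add refinements (better seeding, convergence tests, multiple restarts over different K), but the core loop is this one.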
A strength of automatic cluster detection is that it is an undirected data mining technique: we can look for something useful without knowing what we are looking for. But this also means that we may not recognize it when we find something useful!
The most frequently used approaches to understanding clusters are:
Building a decision tree with the cluster labels as target variables, and using it to derive rules explaining how to assign new records to the correct cluster.
Using visualisation to see how the clusters are affected by changes in input variables.
Examining the differences in the distributions of variables from cluster to cluster, one variable at a time.
Application of clustering
Cluster detection is used when it is suspected that there are natural groupings, which may represent groups of customers or products that have a lot in common with each other. These may turn out to be commonly occurring customer segments for which customised marketing approaches are justified. Clustering is also useful when there are many competing patterns in the data, making it hard to spot any single pattern. Creating clusters of similar data records reduces the complexity within clusters so that other data mining techniques are more likely to succeed.
The Euclidean distance between two points P(x1, .., xn) and Q(y1, .., yn) is given by d(P, Q) = sqrt((x1 - y1)^2 + .. + (xn - yn)^2).
Figure 4 New clusters, their centroids marked by crosses, and intercluster boundaries (from Berry & Linoff 2000).
Artificial Neural Networks
Going back to the main data mining activities mentioned earlier, classification, estimation and prediction are the three types of tasks performed in directed data mining, where a (dependent) target variable is described in terms of the values of a group of other (independent) variables. In practice, classification, where we look for a categorical (yes/no; low, medium, high; etc) answer, can be an estimation problem in which some threshold is applied to estimated output values to come up with categories. Prediction, on the other hand, can be viewed as an estimation or classification task, with the difference that the estimated value or class can only be verified at some future point.
As we found in topic 3, the main generic application of artificial neural networks is pattern recognition or classification, which assigns an input object to a class based on the values of its attributes. The best known artificial neural network model for performing classification is the backpropagation network (or multilayer perceptron), which learns to classify data through supervised training.
The artificial neural network model particularly suited for the data mining task of clustering is the Kohonen net or the self-organising map (SOM). With SOMs, the learning algorithms are unsupervised. Groups of similar input pattern vectors (data records) that are near each other in the record space are mapped to neurons that are also close to one another in the map, forming representative clusters for similar data records. The SOM can thus serve as a clustering tool as well as a visualisation tool for high-dimensional data. See topic 3 for more details on both the backpropagation and SOM neural network models.
SOMs have been claimed to be often more effective than classical clustering algorithms such as K-means. While clustering techniques often have very restrictive assumptions or a limited ability to find complex shapes, SOMs are much more able to isolate clusters in high-dimensional space, particularly those clusters that are not tight little balls of data but rather twisting, curving regions in n-space. SOMs take data in n-space and produce graphs in 2-dimensional space that reveal key associations among the original input vectors in n-space.
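A minimal sketch of the SOM idea: a small grid of neurons, each holding a weight vector in the input space, where the nearest neuron to each input is nudged towards it together with its grid neighbours. The grid size, learning rate, radius and 2-D data below are invented for illustration:

```python
import random

# A minimal self-organising map sketch: a 1-D grid of neurons, each
# with a weight vector in the input space. For each input record the
# nearest neuron (best matching unit) and its grid neighbours are
# nudged towards the record, so nearby neurons come to represent
# similar records. Grid size, learning rate, radius and the 2-D data
# are invented illustrative values.

def train_som(records, n_neurons=5, epochs=50, lr=0.3, radius=1, seed=0):
    rng = random.Random(seed)
    # Initialise each neuron's weights as a copy of a random record.
    weights = [list(rng.choice(records)) for _ in range(n_neurons)]
    for _ in range(epochs):
        for x in records:
            # Best matching unit: neuron whose weights are nearest to x.
            bmu = min(range(n_neurons),
                      key=lambda j: sum((w - v) ** 2
                                        for w, v in zip(weights[j], x)))
            # Move the BMU and its grid neighbours towards the input.
            for j in range(n_neurons):
                if abs(j - bmu) <= radius:
                    weights[j] = [w + lr * (v - w)
                                  for w, v in zip(weights[j], x)]
    return weights

data = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8)]
weights = train_som(data)
print(weights)
```

A full SOM uses a 2-D grid with a decaying learning rate and neighbourhood radius, which is what produces the 2-dimensional maps described above; this sketch keeps only the BMU-plus-neighbours update that makes it work.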
Artificial neural networks can produce very good results, but they require extensive data preparation involving normalisation and conversion of categorical values to numeric values. The main drawback of ANNs is that they are difficult to understand, because they represent complex non-linear models that, unlike decision trees, do not readily produce rules.
Application of neural nets
Neural networks are a good choice for most classification and prediction tasks when the results are more important than understanding how the model works. Neural nets do not work well when there are many hundreds or thousands of input features. Large numbers of features can make it more difficult for the network to find patterns, and can result in long training times.
How to Apply Data Mining
There are essentially four ways of utilising data mining expertise in business:
By purchasing readymade scores (such as on credit worthiness for a loan applicant) from outside vendors.
By purchasing software that embodies data mining expertise designed for a particular application, such as credit approval, fraud detection or churn prevention.
By hiring outside consultants to perform data mining for special projects.
By developing the organisation's own data mining skills in-house.
The advantage of purchasing scores is that it is quick and easy; but with the intelligence being limited to single score values, its usefulness is also limited.
Data mining expertise can be embodied in software in one of two ways. The software may be an actual model, in the form of a set of rules for decision support or a fully-trained neural network, applied to a particular domain. Alternatively, it may embody knowledge of the process of building models appropriate to a particular domain, in the form of a model-creation wizard or template.
Purchasing models developed somewhere else can work well, but only to the extent that the products, customers, and market conditions match those that were used to develop the model. Applications developed to meet the needs of a particular industry are often called vertical applications; they attempt to cover every layer, from data handling at the bottom to report generation at the top. A general purpose data mining tool, on the other hand, is horizontal, since it provides broad applicability to many problems. One example of a data mining tool which is a vertical application is Falcon, the embedded neural network model for predicting fraud in credit cards from the California-based company HNC. In 1998, Falcon monitored over 250 million card accounts worldwide.
Model building software tools automate the process of creating candidate models and selecting the ones that perform best, once the proper input and output variables have been identified. Such software enables novice model builders to be taken through the process of creating models based on their own data. Although this gives the flexibility of reflecting local conditions, it leaves a number of significant tasks for the user (these are part of the overall data mining development methodology discussed later):
Choosing a suitable business problem to be addressed by data mining.
Identifying and collecting data that is likely to contain the information needed to answer the business question.
Preprocessing the data so that the data mining tool can make use of it.
Transforming the database so that the input variables needed by the model are available.
Designing a plan of action based on the model and implementing it in the marketplace.
Measuring the results of the actions and feeding them back into the database, where they will be available for future mining.
As one example of model building software, Model 1 from Unica Technologies contains four modules, each of which addresses a particular direct marketing challenge. These are called Response Modeller, Cross Seller, Customer Valuator, and Customer Segmenter.
Hiring Outside Experts
This approach is recommended if an organisation is in the early stages of integrating data mining into its business, and especially if the data mining activity is to be a one-off process, eg, fixing a manufacturing problem. If instead it is to be an ongoing process, eg, data mining for customer relationship management, it is more worthwhile to consider developing the necessary skill among an organisation's own staff through in-house training.
Outside expertise for data mining is likely to be available in three possible places:
From a data mining software vendor. If the data mining software has already been selected, the company providing the software is the first place to look for help.
From data mining centres. These are usually collaborations between universities and private companies.
From consulting companies. Ideally the consulting company chosen should have had experience specifically in the area of interest to the organisation seeking help.
Any business serious about converting corporate data to business intelligence should consider making data mining one of its core competencies. This applies particularly to companies which have many products and customers.
Data Mining Development Methodology
According to (Hirji 2001), no quantitative or qualitative study of how to actually perform data mining has been undertaken. Cabena et al. (Cabena 1998) have proposed a five-stage model of implementing and using data mining. The stages in the model are:
Business objective determination
Data preparation
Data mining
Results analysis
Knowledge assimilation
Business objective determination is concerned with clearly identifying the business problem to be mined. Data preparation involves data selection, preprocessing and transformation. Data mining is concerned with algorithm selection and execution. Results analysis is concerned with the question of whether anything new or interesting has been found. Knowledge assimilation aims to formulate ways of exploiting the new information extracted.
A case study involving a large fast food outlet, described in (Hirji 2001), brought out some deficiencies of the methodology described above. A new set of stages for data mining development and use has been proposed, as given below:
Business objective determination
Data preparation
Interactive data mining and results analysis
Back end data mining
Results synthesis and presentation
The case study used IBM's Intelligent Miner for Data on AIX as the data mining tool, and took 20 actual days of effort across the six stages above. Back end data mining involves data enrichment and additional data mining algorithm execution by the data mining specialist. 45% of the total data mining project effort was taken up by stages 4, 5 and 6, and 30% was required by the data preparation stage.
The 30% of total time required by the data preparation stage, as compared with the 70% predicted in the earlier model of Cabena et al., may be explained by the use of an existing data warehouse in the organization. For a data mining project without this advantage, more project resources would be required to perform tasks such as selecting, cleaning, transforming, coding and loading the data.
Important aspects of the interactive data mining and results analysis stage were linking data mining results with business strategy, and using application software such as spreadsheets to perform sensitivity analysis of results obtained. The objective would be to demonstrate how data mining results support business strategy. For example, patterns of fast food product combinations would be identified as a basis for developing strategies for recombining some existing product offerings.
Further quantitative studies involving other industries are required to further validate the methodology outlined above. It is expected that a set of best practices for data mining implementation projects will emerge through further refinement of this methodology.