Big Data and Data Mining


Nov 20, 2013 (3 years and 4 months ago)

Professor Tom Fomby

Director, Richard B. Johnson Center for Economic Studies

Department of Economics

SMU

May 23, 2013

Big Data: Many Observations on Many Variables

Data File

OBS No.     Target Var.   Var. 1   Var. 2   ...   Var. 100
1           0             63       .        ...   .
2           1             54       .        ...   .
3           0             44       .        ...   .
.           .             .        .        ...   .
1,500,000   1             32       .        ...   .

Types of Problems

Customer and Student Retention

Employee Churn

Credit Scoring (Auto or Home Loans)

Bond Ratings

What Characteristics Make for a Successful Mary Kay Representative?

Detection of Fraudulent Insurance Claims

Is a Newly Introduced Product Meeting with Consumer Acceptance or Rejection?

Who is a likely Donor to your Charity?

Early Detection of a Stolen or Compromised Credit Card

Types of Problems

What kind of genetic markers imply certain susceptibilities to specific diseases?

Netflix and recommendations of Related and Suggested Movies

Recommendations for Book Purchases: Amazon Side-Bars

Click Stream Analysis of Optimal Web Base Design

Statistical Hypothesis Testing Versus Prediction

Example of Statistical Hypothesis Testing

A Clinical Trial of 400 people: 200 randomly selected into a Control (Placebo) Group and the Other 200 into a Treatment Group

Question: Does the Drug Treatment Significantly Reduce a Person’s Cholesterol Count?

Method: Conventional Statistical Methods Like the t-Test of Significant Difference in Population Means
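
As a rough illustration of the method named above, the sketch below computes a two-sample (Welch) t statistic for a 200-versus-200 trial. The cholesterol numbers are simulated and entirely hypothetical, and the function name welch_t is an assumption made for this example; it is not taken from the talk.

```python
import math
import random
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df


rng = random.Random(0)
# Simulated cholesterol counts; the means (220, 205) and spread (30) are made up.
control = [rng.gauss(220, 30) for _ in range(200)]
treatment = [rng.gauss(205, 30) for _ in range(200)]

t_stat, df = welch_t(control, treatment)
print(f"t = {t_stat:.2f}, approximate df = {df:.0f}")

# One-sided test of "treatment lowers cholesterol": with several hundred
# degrees of freedom the 5% critical value is roughly 1.65.
print("significant at 5%:", t_stat > 1.65)
```

The one-sided comparison matches the question as posed (does the drug reduce the count?). A two-sided test would compare the absolute value of t against a critical value of roughly 1.97.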

Example of a Prediction Problem

Early Detection of a Stolen or Compromised Credit Card

Not So Interested in How or Why the Credit Card was Stolen but Instead Whether Recent Transactions are Indicative of a Stolen or Compromised Credit Card


Tool


Box Plot


Getting Gems From the Data

Crankshaft Cartoon

The Task of Constructing a Meaningful Data Warehouse

Data Rich, Information Poor

The Amount of Raw Data Stored in Corporate Databases is Exploding

Most of this information is recorded instantaneously and with minimal cost

Databases are measured in gigabytes and terabytes (One terabyte = one trillion bytes. A terabyte is equivalent to about 2 million books!)

Walmart uploads 20 million point-of-sale transactions to 500 parallel processing storage devices each day.

Raw data by itself, however, does not provide much information. That is where Data Mining comes in!

What is Data Mining?

“Extracting useful information from large datasets” (Hand et al., 2001)

“Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.” (Berry and Linoff, 1997, 2000)

“Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques” (Gartner Group, 2004)


Four Distinct Characteristics of Data Mining Projects

Partitioning given data into Training, Validation, and Test Parts

Cross Validation: using the Validation and Test Parts to gauge the worthiness of competing models

Using Ensemble Methods to increase predictive accuracy. (There is no such thing as a correct model!)

Continual Monitoring of a PA (predictive analytics) system to guard against structural change and to maintain predictive accuracy
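
The first two characteristics can be sketched in a few lines. In this minimal example, assuming a 60/20/20 split and two deliberately simple competing models, the training part fits the models, the validation part picks the winner, and the test part gives a final error estimate. The data, the split fractions, and all function names are hypothetical.

```python
import random


def partition(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


def mean_squared_error(model, rows):
    """Average squared prediction error of model over (x, y) rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


# Hypothetical data: y is roughly 2x plus noise.
rng = random.Random(1)
data = [(x, 2.0 * x + rng.gauss(0, 1)) for x in range(100)]
train, valid, test = partition(data)

# Two competing models, both fitted on the training part only.
mean_y = sum(y for _, y in train) / len(train)
slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
candidates = {
    "constant": lambda x: mean_y,
    "line through origin": lambda x: slope * x,
}

# The validation part chooses between the models ...
best_name = min(candidates, key=lambda name: mean_squared_error(candidates[name], valid))
# ... and the untouched test part estimates how the winner will perform.
test_error = mean_squared_error(candidates[best_name], test)
print(best_name, round(test_error, 3))
```

Keeping the test part out of model selection is the point of the three-way split: the validation error of the chosen model is optimistically biased because it was used to make the choice.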

More Detailed Discussion of Specific Data Mining Applications

Text Mining (Classification of Documents and Evolution of Opinions on Blogs)

Target Marketing

Credit Scoring

Bond Ratings: Calculating Default Probabilities on Bonds (Bond rating services like Moody’s, Standard & Poor’s, Fitch, etc.)

Fraud Detection

Customer Retention

Franchise Locations and Performance

Customer Segmentation

Affinity Analysis (i.e. “Market Basket” Analysis)

Link Analysis (Webpage design)

Many Other Fields including Clinical Science, Statistical Genetics, Political Science, Real Estate Assessment, and College Admissions Practices

Text Mining

Text Mining: Converting Unstructured Data to Structured Data

Text -> Frequencies of Words and Phrases -> Numbers for Prediction
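
A minimal sketch of that conversion: raw text is tokenized, word frequencies are counted, and the result is a fixed-length row of numbers that a predictive model could consume. The tokenizer, the marker-word vocabulary, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lower-cased word occurrences in a piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def to_feature_row(text, vocabulary):
    """Turn text into relative frequencies of the given vocabulary words."""
    counts = word_frequencies(text)
    total = sum(counts.values()) or 1  # avoid dividing by zero on empty text
    return [counts[word] / total for word in vocabulary]


# Hypothetical marker words and a made-up sentence.
vocabulary = ["upon", "while", "whilst"]
doc = "Upon reflection, the people rely upon the laws while the laws rely upon the people."

row = to_feature_row(doc, vocabulary)
print(dict(zip(vocabulary, row)))
```

Each document becomes one row of the kind of data file shown earlier, with one column per word or phrase.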

Who Wrote the Federalist Papers?

Frederick Mosteller and David Wallace

“Inference in an Authorship Problem” JASA, June 1963


Comparing Two Documents

Doc 1

Doc 2

Target Marketing

Target Marketing is the process of choosing specific customers to advertise to and/or to offer discounts to in order to increase the sales of the company

Target Marketing usually proceeds in two stages: (1) Determining the probability that the solicited customer will purchase products from the company once solicited and (2) Once the solicited customer decides to purchase items from the company, estimating the profit that will likely be generated by the customer’s purchases.

Thus the goal is to advertise only to those potential customers that represent expected profits that exceed the cost of advertising to the customer

We then need to use data mining techniques to determine (1) the probability of purchase and (2) conditional on purchase, the expected profit of purchase.

Expected Profit of Purchase = (Probability of Purchase) x (Expected profits from purchase, conditional on purchase)
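
The decision rule above can be sketched directly. In practice the two inputs would come from fitted models; here the probabilities, profits, advertising cost, and function names are hypothetical numbers chosen only to show the arithmetic.

```python
def expected_profit(p_purchase, profit_if_purchase):
    """Probability of purchase times expected profit conditional on purchase."""
    return p_purchase * profit_if_purchase


def should_solicit(p_purchase, profit_if_purchase, advertising_cost):
    """Advertise only when expected profit exceeds the cost of advertising."""
    return expected_profit(p_purchase, profit_if_purchase) > advertising_cost


# Hypothetical model outputs for three prospective customers.
customers = [
    {"id": "A", "p_purchase": 0.10, "profit_if_purchase": 40.0},
    {"id": "B", "p_purchase": 0.02, "profit_if_purchase": 60.0},
    {"id": "C", "p_purchase": 0.30, "profit_if_purchase": 5.0},
]
advertising_cost = 2.0  # assumed cost per solicitation

targets = [
    c["id"]
    for c in customers
    if should_solicit(c["p_purchase"], c["profit_if_purchase"], advertising_cost)
]
print(targets)
```

Customer B has the largest profit if a purchase happens and customer C the highest purchase probability, yet only customer A clears the cost threshold, which is why both stages are needed.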

Credit Scoring

Credit scoring involves using data mining tools to determine the credit worthiness of loan applicants

The task is determining the probability that a potential borrower will default on his or her obligations, given the personal characteristics of the borrower and the macroeconomic conditions of the economy at the time

Some Examples: Citibank and Credit Card Issuers reviewing applicants for credit cards; Banks considering loaning money for mortgages

Bond Ratings: Calculating Default Probabilities on Bonds

Given the financial characteristics of a bond issuer and the macroeconomic conditions at the time, what is the probability that the bond issuer will, at some time in the future, not be able to service the obligations of the bond?

Bond rating services like Moody’s, Standard and Poor’s, and Fitch build probability of default models and use them to give bonds their credit ratings (AAA, AA, …, BBB, etc.). The lower the probability of default, the higher the bond rating and vice versa. In turn, these ratings give rise to differential interest rates paid by the bond issuers. (See Town and Gown PPT for example.)


Fraud Detection

Of interest to IRS, Credit Card Companies, and Auditors

Given a history of transactions, a record of “typical” income tax reports or income or balance sheets, which transactions/reports appear to be “outliers”?

Basic Tool: Statistical Outlier Analysis. Roughly speaking: “What is three or more standard deviations from the norm?”
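
The three-standard-deviation rule of thumb can be sketched as follows. The "norm" is estimated from a history of typical transactions and new transactions are then scored against it; the amounts and the function name are hypothetical.

```python
import statistics


def flag_outliers(history, new_values, threshold=3.0):
    """Return new values lying at least `threshold` standard deviations from the historical mean."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return [v for v in new_values if abs(v - mean) / sd >= threshold]


# Hypothetical history of typical transaction amounts for one card.
history = [42.0, 38.5, 51.0, 45.2, 40.1, 47.3, 39.9, 44.0, 43.6, 48.4]
# Hypothetical recent transactions to screen.
recent = [46.0, 52.5, 310.0]

flagged = flag_outliers(history, recent)
print(flagged)
```

Estimating the mean and standard deviation from the history alone, rather than from history and new values together, keeps one extreme transaction from inflating the standard deviation and masking itself.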

Customer Retention

What factors determine the loyalty displayed by a customer?

When is a customer likely to “jump ship”?

Would loyalty programs be useful?

Basic Tool: Duration Modeling. This method determines what factors extend or limit the durations of customers with companies.

Purpose: To identify potential “fragile” customers and then “incentivize” them so that they will remain loyal

Result: Higher profits

Facets of a Data Mining Job

1. Development of Problem Statement and Consultation with Domain Experts

2. Data Acquisition

3. Data Preparation and Cleaning

4. Data Visualization and Summarization

5. Type of Task? Supervised Learning (Prediction, Classification), or Unsupervised Learning

6. Evaluation of Models (Data Partitioning and Cross Validation)

7. Scoring of New Data

8. Continual Review of Model Usefulness


Franchise Locations and Performance

What location factors affect the eventual profitability and success of franchises?

Even within a set of franchises, should the product mix be the same for all franchises or should franchises be treated differently?

Can franchisees be put into “Clusters” and treated differently so as to maximize the profits of the entire franchise operation?

Customer Segmentation

Suppose you are a giant publisher of magazines of various types. How do your subscribers differ across your portfolio of magazines?

When soliciting advertising for your magazines, how do you match your potential advertisers with your magazines so that the advertisers receive the maximum benefit for their advertising expenditures?

Is there a niche market (customer segment) that none of your magazines (or those of your competitors) is currently serving? Is this niche market substantial enough to warrant introducing a new magazine?

Also, retailers often like to be able to distinguish between customers with low versus high elasticities of demand for their products so that they will know to whom to offer discounts in order to increase their revenues and profits.

Basic Tool: Cluster Analysis
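
The slide names cluster analysis only in general terms. As one common instance, the sketch below runs a bare-bones k-means procedure on a handful of hypothetical subscribers described by (age, income in thousands). The data, the starting centers, and the choice of k-means itself are assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-averaging centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it unchanged if the cluster is empty.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical subscribers: (age, household income in thousands).
subscribers = [(22, 30), (25, 35), (24, 32), (58, 90), (61, 95), (60, 88)]
# Assumed starting centers: the first and last subscriber.
centers, clusters = kmeans(subscribers, centers=[(22, 30), (60, 88)])

for center, members in zip(centers, clusters):
    print(center, members)
```

With variables on very different scales, the inputs would normally be standardized before clustering so that no single variable dominates the distance.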


Affinity Analysis

Given that a customer purchases a given set of items, what is the probability that they will purchase another set of items? That is, what does the customer’s final market basket look like, given a partially-filled one?

Purpose: Arrange the store shelves of a retail store so as to make it most convenient for customers to purchase related goods and minimize the time of search and shopping. We want the customer to be able to shop quickly but at the same time buy a lot!

On book seller web pages, once you have indicated an interest in purchasing a given book, several related books are often brought to your attention by “advertisements” in the margins of the page you are currently on. Affinity analysis is helpful in generating “associated” sales on retail web pages. This increases the profits of the web retailer.

Major Tool: Association Rules. The Apriori Algorithm.
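
The sketch below is not the full Apriori algorithm. It computes only the two quantities that association rules are built on: the support of an itemset (the share of baskets containing it) and the confidence of a rule (the estimated probability of the consequent given the antecedent). The baskets and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent items given the antecedent items.

    Assumes the antecedent appears in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

rule_support = support(baskets, {"diapers", "beer"})         # 3 of 5 baskets
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})  # 3 of the 4 diaper baskets
print(round(rule_support, 2), round(rule_confidence, 2))
```

The confidence figure answers the question posed above: given a partially filled basket containing the antecedent items, how often does the final basket also contain the consequent items?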


Link Analysis

Explores Associations between groups (individuals, organizations, web sites, nation-states and the like)

Uses: To improve webpage design, to facilitate criminal investigations, and to benefit medical research in epidemiology and pharmacology, among other uses

Text Mining

To Understand Textual Content

For Finding Interesting Regularities in Text

Help Classify Documents by Type and Content

Useful for Medical Science Search Engines seeking most current research on particular maladies seen in patients

Beneficial in Building Spam Filters

Help Examine Evolution of Opinion vis-à-vis Blogs

Other Fields Where Data Mining is Used

Clinical Science and Providing Baseline Guidance for Clinical Treatment

Political Science (Modeling Voting Patterns, Election Outcomes and Appeal and Supreme Court Decisions)

Statistical Genetics: Relating Genetic characteristics with medical outcomes

Real Estate Assessment Models: County Assessors using predictive models to gauge the current value of houses for the purpose of assessing real estate taxes

College Admissions Practices: Which students should be admitted and how much financial aid is needed to ensure that the chosen student will matriculate?

Typical Data Mining Course Outline

G. Shmueli, N. R. Patel and P. C. Bruce. Data Mining for Business Intelligence (2007).

Figure 1.2: Data mining from a process perspective

Data Preparation & Exploration
• Sampling
• Cleaning
• Summaries
• Visualization
• Partitioning
• Dimension reduction

Prediction
• MLR
• K-Nearest Neighbor
• Regression Trees
• Neural Nets

Classification
• K-Nearest Neighbor
• Naïve Bayes
• Logistic Regression
• Classification Trees
• Neural Nets
• Discriminant Analysis

Segmentation/Clustering

Affinity Analysis/Association Rules

Model Evaluation & Selection

Deriving Insight

Available Software Packages

XLMINER (Frontline Systems)

SAS Enterprise Miner (SAS Product)

SPSS Modeler (IBM Product)

R (Open Source)

Data Mining Certificates are available for SAS EM and SPSS Modeler

The Shortage of Trained Personnel for Doing Data Mining

“Big data: The next frontier for innovation, competition, and productivity” McKinsey Global Institute, May 2011

140,000 to 190,000 more deep analytical talent positions over the next decade

1.5 Million more data-savvy managers to take advantage of insights offered by Data Mining



What is SMU doing about this shortage?

Department of Economics: MS in Applied Economics and Predictive Analytics (Starting Fall of 2013)

Department of Statistics: MS in Statistics and Data Analytics (Started Fall of 2012)

Cox School of Business: MS in Business Analytics (Starting Fall of 2013)

The Super Woman of Predictive Analytics

The Skill Set of Super Woman

Analytics: SAS/SPSS/Statistics

Reporting: Cognos and Dashboards

Data Management: Oracle and SQL