Abstract
Regression is a statistical technique used regularly in data
mining. It can model a company’s growth patterns and predict
outcomes, such as the company’s success, from the products it
chooses to sell. In this project, regression was used to build
models that predict the retail price of laptops sold in London
electronics stores. The models were built using the XLMiner data
mining software and a database that contained 7,956 records. The
research found that multiple linear regression was more effective
than simple linear regression in predicting the retail price of a
laptop.
Introduction
Regression analysis establishes a relationship between a
dependent or outcome variable and a set of predictors.
As a data mining technique, regression is a form of supervised
learning, in which the database is partitioned into training and
validation data. The techniques used in this research were simple
linear regression and multiple linear regression. Some
distinctions between the use of regression in statistics versus data
mining are listed in Table 1.
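The partitioning step can be sketched as follows. This is a minimal illustration, not the procedure XLMiner applies: the 60/40 split, the fixed seed, and the function name `partition` are assumptions, and the records are stand-in indices rather than the actual laptop data.

```python
import random

def partition(records, train_fraction=0.6, seed=0):
    """Shuffle the records and split them into training and validation sets.

    train_fraction and seed are illustrative assumptions, not values
    taken from the poster.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 7,956 is the record count reported for the laptop database;
# the record contents here are just placeholder indices.
training, validation = partition(range(7956))
print(len(training), len(validation))
```

The model is fitted on the training portion only, and the validation portion is held back to check how well the fitted model predicts records it has not seen.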
Simple Linear Regression
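The strength and direction of a linear relationship between two
variables can be measured by the correlation coefficient, r. The
following formula can be used to calculate r:
$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}$$
The values of r range from negative one to positive one.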
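As a small sketch of this calculation, the function below computes r from mean-centered sums, which is algebraically equivalent to the usual computational formula. The function name `pearson_r` is an assumption. The data are the HD sizes and actual prices of the four validation records shown in the results tables, so the value obtained is for those four records only and will not match the r = 0.485 reported for the full database.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Four validation records from the poster's tables (not the full data set).
hd_size = [40, 80, 40, 120]
price = [400.00, 455.00, 395.00, 585.00]

r = pearson_r(hd_size, price)
print(f"r = {r:.3f}")
```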
The formula for simple linear regression is as follows:
ŷ = ax + b,
where ŷ is the outcome variable, and x is the predictor. The slope
‘a’ and the y-intercept ‘b’ can be calculated using the method of
least squares.
Method of Least Squares
The method of least squares is a minimization technique. In
regression, it is used to find the line that minimizes the sum of
the squared distances between the predicted y-values and the
actual y-values. The result is the regression line (the line of
“best” fit).
Derivation of the Method of Least Squares
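Given points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the coefficient and
constant for the regression line can be derived from:
$$F(a,b) = \sum_{i=1}^{n} (a x_i + b - y_i)^2$$
First, minimize F(a,b) by calculating the partial derivatives with
respect to a and b. Then set the derivatives equal to zero.
$$\frac{\partial F}{\partial a} = 2\sum_{i=1}^{n} (a x_i + b - y_i)\,x_i = 0, \qquad \frac{\partial F}{\partial b} = 2\sum_{i=1}^{n} (a x_i + b - y_i) = 0$$
Divide each equation by 2, then solve the system of partial derivative
equations.
$$a = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad b = \frac{\sum y_i - a\sum x_i}{n}$$
It can be shown by the second partial derivative test that (a,b) is a minimum
point of the function.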
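The closed-form solution of the two partial derivative equations can be sketched in a few lines. This is an illustration under stated assumptions: the function name `least_squares_fit` is hypothetical, and the data are only the four validation records from the results tables, so the fitted slope and intercept differ from the model reported for the full database.

```python
def least_squares_fit(xs, ys):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # Solution of the system: a*sum_xx + b*sum_x = sum_xy
    #                         a*sum_x  + b*n     = sum_y
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b = (sum_y - a * sum_x) / n
    return a, b

# Four validation records from the poster's tables (not the full data set).
hd_size = [40, 80, 40, 120]
price = [400.00, 455.00, 395.00, 585.00]

a, b = least_squares_fit(hd_size, price)
print(f"price = {a:.4f} * hd_size + {b:.4f}")
```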
Multiple Linear Regression
The coefficient of determination ($r^2$) is the proportion of the total variation
in the outcome variable that is explained by the independent variables.
Interpreted correctly, the $r^2$ value is useful in determining
whether or not the fitted model is strong enough to be used for
estimation and/or prediction.
The coefficient of determination ranges from 0 to 1.
The $r^2$ value can be calculated using the following equation:
$$r^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}$$
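A short sketch of this calculation follows. The ratio of explained to total variation only behaves as a proportion between 0 and 1 when the predictions come from a least-squares line with an intercept fitted to the same data, so the example first fits such a line. The function name `r_squared` is an assumption, and the data are the four validation records from the results tables, not the full database.

```python
def r_squared(ys, y_hats):
    """Explained variation divided by total variation."""
    y_bar = sum(ys) / len(ys)
    explained = sum((y_hat - y_bar) ** 2 for y_hat in y_hats)
    total = sum((y - y_bar) ** 2 for y in ys)
    return explained / total

# Four validation records from the poster's tables (not the full data set).
xs = [40, 80, 40, 120]
ys = [400.00, 455.00, 395.00, 585.00]

# Least-squares slope and intercept for these four points.
n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_xx = sum(x * x for x in xs)
a = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
b = (sum_y - a * sum_x) / n

y_hats = [a * x + b for x in xs]
r2 = r_squared(ys, y_hats)
print(f"r^2 = {r2:.3f}")
```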
By extension of the Method of Least Squares, we can develop the
multiple regression model. The coefficients and constant terms in the
multiple regression formula can be derived in the same manner in which
we derived the simple linear regression equation.
The multiple regression formula is as follows:
$$\hat{y} = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_3 + \cdots + a_n x_n,$$
where ŷ is the outcome variable and the $x_i$ are the predictors.
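The extension of least squares to several predictors can be sketched by building and solving the normal equations. This is a minimal pure-Python illustration, not XLMiner's implementation: the function names are hypothetical, and the data are synthetic points generated exactly from y = 1 + 2x1 + 3x2 so that the recovered coefficients can be checked by eye.

```python
def solve_linear_system(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [list(row) + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def fit_multiple_regression(X, ys):
    """Return [a0, a1, ..., an] minimizing the sum of squared errors."""
    rows = [[1.0] + list(x) for x in X]  # leading 1 carries the constant a0
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(XtX, Xty)

# Synthetic data generated from y = 1 + 2*x1 + 3*x2 (illustration only).
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1, 3, 4, 6, 8]

coefficients = fit_multiple_regression(X, ys)
print([round(c, 6) for c in coefficients])
```

Setting the partial derivative with respect to each coefficient to zero gives one equation per coefficient, which is the system the code assembles in `XtX` and `Xty`.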
Database
Data Name: Laptop Sales January 2008
Number of Records: 7,956
Number of Attributes in the Dataset: 17
Attribute used for Simple Linear Regression: HD size (GB)
Attributes used for Multiple Regression: bundled applications [$x_1$],
HD size (GB) [$x_2$], processor speed (GHz) [$x_3$], integrated wireless [$x_4$],
battery life (hours) [$x_5$], RAM (GB) [$x_6$]
Procedure
The attribute with the strongest correlation coefficient r was used to
construct a simple linear regression model. Since HD size had the
strongest r (0.485), HD size was the predictor variable. Excel was used to
construct the regression model.
The attributes with the strongest r were used to construct a multiple
linear regression model: HD size, battery life, integrated wireless,
bundled applications, processor speed, and RAM. XLMiner, a data mining
tool that builds multiple regression models, was used to construct the
regression model.
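The selection step described above can be sketched by computing r between each candidate attribute and the retail price and ranking the attributes by the magnitude of r. This is an illustration only: the function name `pearson_r` is an assumption, and the values are the four validation records from the results tables rather than the full database, so the r values will not match those the poster reports. Bundled applications is left out because it is constant across these four records, which makes r undefined.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Actual retail prices and attribute values of the four validation records.
price = [400.00, 455.00, 395.00, 585.00]
attributes = {
    "HD size (GB)": [40, 80, 40, 120],
    "Processor speed (GHz)": [1.5, 2, 2, 2],
    "Integrated wireless": [0, 1, 0, 0],
    "RAM (GB)": [2, 1, 1, 2],
    "Battery life (hours)": [4, 5, 5, 6],
}

# Rank attributes by the strength (absolute value) of their correlation.
ranked = sorted(
    ((name, pearson_r(values, price)) for name, values in attributes.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranked:
    print(f"{name}: r = {r:.3f}")
```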
Data Mining: Predicting Laptop Retail Price Using Regression
By: Britney Robinson and Joi Officer
Advisor: Dr. Fred Bowers
Table 1
Statistics                                          Data Mining (supervised learning)
The data is a sample from a population.             The data is taken from a large database (e.g. 1 million records).
The regression model is constructed from a sample.  The regression model is constructed from a portion of the data (training data).
Results
The simple linear regression model is as follows:
$$\hat{y} = 0.299x_1 + 442.795, \quad \text{with } r = 0.485,$$
where $x_1$ here denotes HD size (GB).
The multiple linear regression model is as follows:
$$\hat{y} = 49.931x_1 + 0.399x_2 + 49.383x_3 + 18.922x_4 + 51.063x_5 + 49.085x_6 - 34.769, \quad \text{with } r^2 = 0.93$$
Several records were used to validate the models. Tables 2 and 3 show four of the
records used to validate both models.
Table 3
HD size (GB)  Predicted Retail Price ($)  Actual Retail Price ($)
40            454.75                      400.00
80            466.71                      455.00
40            454.75                      395.00
120           478.67                      585.00
Table 2
HD size (GB)  Processor Speed (GHz)  Integrated Wireless  Bundled Applications  RAM (GB)  Battery Life (Hours)  Predicted Retail Sale Price ($)  Actual Retail Sale Price ($)
40            1.5                    0                    1                     2         4                     407.62                           400.00
80            2                      1                    1                     1         5                     469.21                           455.00
40            2                      0                    1                     1         5                     434.31                           395.00
120           2                      0                    1                     2         6                     566.42                           585.00
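The validation can be reproduced approximately by applying the two reported models to the four tabulated records. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the predictor order follows the x1 to x6 labels in the Database section, and the constant 34.769 in the multiple model is taken as negative because that sign reproduces the tabulated predictions. Because the published coefficients are rounded, the computed predictions can differ from the tabulated ones by a few cents.

```python
def predict_simple(hd_size):
    """Simple model: price from HD size (GB) only."""
    return 0.299 * hd_size + 442.795

def predict_multiple(bundled, hd_size, cpu_ghz, wireless, battery_h, ram_gb):
    """Multiple model with predictors in the order x1..x6."""
    return (49.931 * bundled + 0.399 * hd_size + 49.383 * cpu_ghz
            + 18.922 * wireless + 51.063 * battery_h + 49.085 * ram_gb
            - 34.769)  # sign of the constant is an assumption (see lead-in)

# The four validation records, in the column order of the multiple-model table:
# (HD size, processor GHz, wireless, bundled apps, RAM, battery hours, actual price)
records = [
    (40, 1.5, 0, 1, 2, 4, 400.00),
    (80, 2.0, 1, 1, 1, 5, 455.00),
    (40, 2.0, 0, 1, 1, 5, 395.00),
    (120, 2.0, 0, 1, 2, 6, 585.00),
]

simple_errors = []
multiple_errors = []
for hd, cpu, wifi, apps, ram, battery, actual in records:
    simple_errors.append(abs(predict_simple(hd) - actual))
    multiple_errors.append(
        abs(predict_multiple(apps, hd, cpu, wifi, battery, ram) - actual))

mae_simple = sum(simple_errors) / len(simple_errors)
mae_multiple = sum(multiple_errors) / len(multiple_errors)
print(f"mean absolute error, simple model:   {mae_simple:.2f}")
print(f"mean absolute error, multiple model: {mae_multiple:.2f}")
```

On these four records the multiple model's mean absolute error is noticeably smaller than the simple model's, although four records are too few to support a general claim on their own.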
Conclusion
Based on the research, the multiple linear regression model was more effective in
predicting the retail price of a laptop than the simple linear regression model:
it achieved r² = 0.93, whereas the simple model’s r = 0.485 corresponds to
r² ≈ 0.24. A better regression model can therefore be constructed using several
attributes to predict laptop retail prices instead of one attribute.
Since this research was based on data from laptop sales that occurred in January 2008,
this work could be expanded by including data from February 2008 to December
2008.
Acknowledgements
We would like to thank the following persons for their support throughout our
research:
Dr. Sidbury, Dr. Lee, Dr. Lawrence, Mr. Duffie, and Dr. Bowers.
Support for this work has been provided by a grant from the Advancing Spelman’s
Participation in Informatics Research and Education Program and the National
Science Foundation Award # HRD-0714553.
Table 2 depicts the validation of the multiple linear regression model using records 1, 4, 5, and 79.
Table 3 depicts the validation of the simple linear regression model using records 1, 4, 5, and 79.
Table 1 lists some distinctions between the use of regression in statistics and in data mining.