Data Mining: Predicting Laptop Retail Price Using Regression


Nov 20, 2013


Abstract


Regression is an inherently statistical technique that is used regularly in
data mining. It can model a company’s growth patterns and help predict how
successful the company will be based on the products it chooses to sell. In
this project, regression was used to build models that predict the retail
price of laptops sold in London electronics stores. The models were built
using the XLMiner data mining software and a database containing 7,956
records. The research found that multiple linear regression was more
effective than simple linear regression at predicting the retail price of a
laptop.


Introduction


Regression analysis establishes a relationship between a dependent (outcome)
variable and a set of predictors. As a data mining technique, regression is
a form of supervised learning, in which the database is partitioned into
training and validation data. The techniques used in this research were
simple linear regression and multiple linear regression. Some distinctions
between the use of regression in statistics versus data mining are listed in
Table 1.
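As a rough illustration of the partitioning step, the sketch below randomly splits a set of records into training and validation data. The 60/40 proportion, the fixed seed, and the function name `partition` are assumptions made for this example; the poster does not state how its split was made.

```python
import random


def partition(records, train_fraction=0.6, seed=0):
    """Randomly split records into (training, validation) lists."""
    shuffled = list(records)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in record IDs; the laptop database has 7,956 records.
training, validation = partition(range(7956))
print(len(training), len(validation))
```

The model is fitted on the training portion only, and the validation portion is held back to check how well the fitted model predicts records it has not seen.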







Simple Linear Regression



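The strength and direction of a linear relationship between two variables
can be measured by the correlation coefficient, r. The following formula can
be used to calculate r:

$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right]}}$$

The value of r ranges from -1 to +1.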
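A minimal sketch of this calculation, using the standard computational form of the Pearson correlation coefficient. The function name `pearson_r` and the sample values are made up for illustration and are not taken from the laptop database.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Made-up illustrative data: hard drive size (GB) and price ($).
hd_size = [40, 80, 120, 160]
price = [400, 470, 520, 610]
print(round(pearson_r(hd_size, price), 3))
```

A value near +1 indicates a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little or no linear relationship.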




The formula for simple linear regression is as follows:

ŷ = ax + b,

where ŷ is the outcome variable and x is the predictor. The slope, a, and
the y-intercept, b, can be calculated using the method of least squares.


Method of Least Squares


The method of least squares is a minimization technique. In regression, it
is used to find the line that minimizes the sum of the squared vertical
distances between the predicted y-values and the actual y-values. The result
is the regression line (the line of “best” fit).


Derivation of the Method of Least Squares


Given points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the coefficient
and constant for the regression line can be derived from the sum of squared
errors:

$$F(a, b) = \sum_{i=1}^{n} \left(y_i - (a x_i + b)\right)^2$$




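First, minimize $F(a, b)$ by calculating the partial derivatives with
respect to a and b. Then set the derivatives equal to zero:

$$\frac{\partial F}{\partial a} = -2\sum_{i=1}^{n} x_i\left(y_i - a x_i - b\right) = 0, \qquad \frac{\partial F}{\partial b} = -2\sum_{i=1}^{n} \left(y_i - a x_i - b\right) = 0$$

Divide each equation by 2, then solve the resulting system of equations:

$$a\sum x_i^2 + b\sum x_i = \sum x_i y_i, \qquad a\sum x_i + nb = \sum y_i$$

$$a = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad b = \frac{\sum y_i - a\sum x_i}{n}$$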
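The closed-form solution of that system can be sketched in a few lines. The function name `least_squares_fit` and the sample points are assumptions for illustration; the points were chosen to lie exactly on y = 2x + 1 so the result is easy to check by hand.

```python
def least_squares_fit(xs, ys):
    """Return slope a and intercept b minimizing sum((y - (a*x + b))**2)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope and intercept from the two equations obtained by setting
    # both partial derivatives to zero.
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - a * sum_x) / n
    return a, b


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # made-up points lying exactly on y = 2x + 1
a, b = least_squares_fit(xs, ys)
print(a, b)  # 2.0 1.0
```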









It can be shown by the second partial derivative test that (a, b) is a
minimum point of the function.
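The second partial derivative test can be written out as follows, assuming F(a, b) is the sum of squared errors, that is, the sum over all points of (y_i - a x_i - b)^2, and writing x̄ for the mean of the x-values.

```latex
\[
F_{aa} = 2\sum_{i=1}^{n} x_i^2, \qquad
F_{bb} = 2n, \qquad
F_{ab} = 2\sum_{i=1}^{n} x_i
\]
\[
D = F_{aa}F_{bb} - F_{ab}^2
  = 4\left[n\sum_{i=1}^{n} x_i^2 - \Bigl(\sum_{i=1}^{n} x_i\Bigr)^2\right]
  = 4n\sum_{i=1}^{n} (x_i - \bar{x})^2
\]
```

Since D > 0 whenever the x-values are not all equal, and F_aa > 0, the critical point (a, b) is a minimum.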


Multiple Linear Regression


The coefficient of determination ($r^2$) is the proportion of the total
variation in the outcome variable that is explained by the independent
variables. Interpreted correctly, the $r^2$ value is useful in determining
whether the model is strong enough to be used for estimation and/or
prediction.


The range of the coefficient of determination is between 0 and 1.






The $r^2$ value can be calculated using the following equation:

$$r^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}$$

By extension of the method of least squares, we can develop the multiple
regression model. The coefficients and constant term in the multiple
regression formula can be derived in the same manner in which we derived the
simple linear regression equation.


The multiple regression formula is as follows:

$$\hat{y} = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_3 + \cdots + a_n x_n,$$

where $\hat{y}$ is the outcome variable and the $x_i$ are the predictors.
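One way to carry out this extension is to solve the normal equations, (XᵀX)a = Xᵀy, for the coefficient vector. The sketch below does so in plain Python with Gaussian elimination. The helper names and the toy data are assumptions for illustration, and this is not necessarily how XLMiner computes its coefficients.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix, so row operations update both sides together.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution, from the last unknown to the first.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_multiple_regression(rows, ys):
    """Return [a0, a1, ..., an] from the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 gives a0
    size = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(size)]
           for i in range(size)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(size)]
    return solve_linear_system(xtx, xty)


# Made-up data generated from y = 1 + 2*x1 + 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1, 3, 4, 6, 8]
coefficients = fit_multiple_regression(rows, ys)
print([round(c, 6) for c in coefficients])
```

Because the toy data lie exactly on a plane, the recovered coefficients should match the generating values 1, 2, and 3 up to floating-point rounding.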


Database


Data Name: Laptop Sales January 2008

Number of Records: 7,956

Number of Attributes in the Dataset: 17

Attribute used for Simple Linear Regression: HD size (GB)

Attributes used for Multiple Regression: bundled applications [$x_1$],
HD size (GB) [$x_2$], processor speed (GHz) [$x_3$], integrated wireless
[$x_4$], battery life (hours) [$x_5$], RAM (GB) [$x_6$]


Procedure


The attribute with the strongest correlation coefficient, r, was used to
construct the simple linear regression model. Since HD size had the
strongest r (0.485), HD size was the predictor variable. Excel was used to
construct this regression model.

The attributes with the strongest r values were used to construct the
multiple linear regression model: HD size, battery life, integrated
wireless, bundled applications, processor speed, and RAM. XLMiner, a data
mining tool that builds multiple regression models, was used to construct
this regression model.






By: Britney Robinson and Joi Officer

Advisor: Dr. Fred Bowers


Table 1. Regression in statistics versus data mining

| Statistics | Data Mining (supervised learning) |
|---|---|
| The data is a sample from a population. | The data is taken from a large database (e.g. 1 million records). |
| The regression model is constructed from a sample. | The regression model is constructed from a portion of the data (training data). |


Results


The simple linear regression model is as follows:


$\hat{y} = 0.299x_1 + 442.795$, with $r = 0.485$ (here $x_1$ denotes HD size in GB)
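As a quick check of how this model is applied, the sketch below evaluates it for the hard drive sizes that appear in the validation records. The slope and constant are the values reported above; the function name `predict_simple` is an assumption for illustration.

```python
def predict_simple(hd_size_gb):
    """Predicted retail price ($) from the reported simple regression model."""
    return 0.299 * hd_size_gb + 442.795


for hd_size in (40, 80, 120):
    print(hd_size, round(predict_simple(hd_size), 2))
```

Each additional gigabyte of hard drive capacity adds about $0.30 to the predicted price, so the predictions vary only slightly across the range of drive sizes.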


The multiple linear regression model is as follows:


$\hat{y} = 49.931x_1 + 0.399x_2 + 49.383x_3 + 18.922x_4 + 51.063x_5 + 49.085x_6 - 34.769$,

with $r^2 = 0.93$
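The sketch below applies this model to one laptop, using the coefficient-to-attribute mapping given in the Database section. The dictionary keys and the function name `predict_multiple` are assumptions for illustration; the attribute values are those of the first validation record shown in the poster.

```python
# Reported coefficients, keyed by hypothetical attribute names.
COEFFICIENTS = {
    "bundled_applications": 49.931,  # x1
    "hd_size_gb": 0.399,             # x2
    "processor_speed_ghz": 49.383,   # x3
    "integrated_wireless": 18.922,   # x4
    "battery_life_hours": 51.063,    # x5
    "ram_gb": 49.085,                # x6
}
CONSTANT = -34.769


def predict_multiple(laptop):
    """Predicted retail price ($) from the reported multiple regression model."""
    return CONSTANT + sum(
        COEFFICIENTS[name] * laptop[name] for name in COEFFICIENTS
    )


laptop = {
    "bundled_applications": 1,
    "hd_size_gb": 40,
    "processor_speed_ghz": 1.5,
    "integrated_wireless": 0,
    "battery_life_hours": 4,
    "ram_gb": 2,
}
print(round(predict_multiple(laptop), 2))  # 407.62
```

Because the published coefficients are rounded to three decimal places, predictions recomputed this way can differ by a few cents from the values the software reported.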


Several records were used to validate the models. Tables 2 and 3 show four
of the records used to validate both models.

Table 3 (simple linear regression model)

| HD size (GB) | Predicted Retail Price ($) | Actual Retail Price ($) |
|---|---|---|
| 40 | 454.75 | 400.00 |
| 80 | 466.71 | 455.00 |
| 40 | 454.75 | 395.00 |
| 120 | 478.67 | 585.00 |


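Table 2 (multiple linear regression model)

| HD size (GB) | Processor Speed (GHz) | Integrated Wireless | Bundled Applications | RAM (GB) | Battery Life (Hours) | Predicted Retail Sale Price ($) | Actual Retail Sale Price ($) |
|---|---|---|---|---|---|---|---|
| 40 | 1.5 | 0 | 1 | 2 | 4 | 407.62 | 400.00 |
| 80 | 2 | 1 | 1 | 1 | 5 | 469.21 | 455.00 |
| 40 | 2 | 0 | 1 | 1 | 5 | 434.31 | 395.00 |
| 120 | 2 | 0 | 1 | 2 | 6 | 566.42 | 585.00 |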
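One simple way to compare the two models on these four validation records is the mean absolute error between predicted and actual prices. The choice of metric and the function name are assumptions for illustration, not the poster's stated method, and four records are far too few to characterize either model in general.

```python
# Actual and predicted prices ($) for the four validation records shown.
actual = [400.00, 455.00, 395.00, 585.00]
simple_predicted = [454.75, 466.71, 454.75, 478.67]
multiple_predicted = [407.62, 469.21, 434.31, 566.42]


def mean_absolute_error(actual_values, predicted_values):
    """Average absolute difference between actual and predicted values."""
    differences = [abs(a - p) for a, p in zip(actual_values, predicted_values)]
    return sum(differences) / len(differences)


simple_mae = mean_absolute_error(actual, simple_predicted)
multiple_mae = mean_absolute_error(actual, multiple_predicted)
print(f"simple model MAE:   {simple_mae:.2f}")
print(f"multiple model MAE: {multiple_mae:.2f}")
```

On these records the multiple regression predictions are, on average, considerably closer to the actual prices than the simple regression predictions.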





Conclusion


Based on the research, the multiple linear regression model was more
effective at predicting the retail price of a laptop than the simple linear
regression model. A better regression model can be constructed using several
attributes to predict laptop retail prices instead of a single attribute.

Since this research was based on data from laptop sales that occurred in
January 2008, this work could be expanded by including data from February
2008 through December 2008.


Acknowledgements


We would like to thank the following persons for their support throughout
our research: Dr. Sidbury, Dr. Lee, Dr. Lawrence, Mr. Duffie, and Dr. Bowers.

Support for this work has been provided by a grant from the Advancing
Spelman’s Participation in Informatics Research and Education Program and
the National Science Foundation Award # HRD-0714553.

Table 2 depicts the validation of the multiple linear regression model using records 1, 4, 5, and 79, respectively.

Table 3 depicts the validation of the simple linear regression model using records 1, 4, 5, and 79, respectively.
