Using Data-Driven Prediction Methods in a Hedonic Regression Problem

MARCOS ÁLVAREZ-DÍAZ¹, MANUEL GONZÁLEZ GÓMEZ

Department of Applied Economy, University of Vigo
Lagoas-Marcosende, Vigo, Spain

and

ALBERTO ÁLVAREZ

ISME-DSEA Department of Electrical Engineering, University of Pisa
Via Diotisalve 2, 56100 Pisa, Italy

(PRELIMINARY VERSION)

Abstract

Traditional studies of hedonic prices apply simple functional forms such as linear structures or structures that can be linearized by transformation. The literature now recognizes the importance of introducing non-linearity to improve the explanatory capacity of the models. In this work we apply data-driven methods to carry out the hedonic regression. These methods do not impose any a priori assumption about the functional form. We use the nearest neighbours technique as a nonparametric method, and neural networks and genetic algorithms as semi-parametric methods. Neural networks have already been applied to the hedonic regression problem but, to the authors' knowledge, this is the first time that a genetic algorithm is employed. Our empirical results demonstrate the usefulness of applying both nonparametric and semi-parametric data-driven models in the estimation of hedonic price functions: they can improve on the traditional parametric models in terms of out-of-sample R².




¹ Corresponding author: Departamento de Economía Aplicada, Universidad de Vigo, Lagoas-Marcosende s/n, 36200 Vigo, Spain. Fax: 986812401; e-mail: mad@uvigo.es.

I. Introduction

Hedonic price theory tries to determine how the individual characteristics of a commodity affect its price. The hedonic perspective has been applied to many goods such as automobiles, personal computers, televisions, irrigated land and, above all, housing. A linear relation between the price of the commodity and its characteristics is the most widely used approach in this sort of study. The principal reason given is that linear models are easy to estimate and interpret. Nevertheless, the literature also recognizes the importance of non-linearities in the hedonic price function for increasing its explanatory capacity (Rasmussen and Zuehlke (1990)). Many works address this issue with techniques that, by means of transformations, allow flexible parametric functional forms (for example, Box-Cox transformations). However, these flexible forms impose a structure that introduces forced and unnecessary non-linearity. The results are “over-parameterized” models that lose out-of-sample performance. To avoid this problem, we can use data-driven methods that yield a model without imposing any a priori assumption about the functional form (nearest neighbours, neural networks and genetic algorithms).



The objective of this paper is to determine whether nonparametric and semi-parametric data-driven models can improve forecast accuracy with respect to parametric models. Our empirical study centres on the real estate market in the city of Vigo (Spain). To carry out our research, we collected information about rental prices and about the houses, such as their structural attributes and neighbourhood conditions. The work is structured as follows. In Section II we briefly present the prediction methods employed (linear regressions as the parametric method, nearest neighbours as the nonparametric method, and neural networks and genetic algorithms as semi-parametric methods). In Section III we describe the data and show our empirical results on out-of-sample predictions. Finally, Section IV presents the conclusions.



II. Methodology

a) Parametric Techniques

The simplest approach to the hedonic regression problem is to postulate that the functional form is linear. To achieve more flexibility, it is very common to apply some non-linear transformation to the data. Most studies of hedonic regression employ simple functional relations such as the linear, semi-logarithmic, double-log or quadratic semi-log forms, among others. These forms are justified by the use of traditional estimation methods such as ordinary least squares, by their success in previous studies and in empirical tests, and by the ease with which statistical inference and hypothesis testing can be carried out.
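For reference, the four specifications named above are commonly written as follows. The notation is introduced here for illustration and is not taken from the original: P is the price, x_j the j-th characteristic, the betas and gammas are coefficients and epsilon is a disturbance term. In the double-log form, dummy variables would normally enter without logarithms.

```latex
\begin{aligned}
\text{Linear:} \quad & P = \beta_0 + \sum_{j} \beta_j x_j + \varepsilon \\
\text{Semi-log:} \quad & \ln P = \beta_0 + \sum_{j} \beta_j x_j + \varepsilon \\
\text{Double-log:} \quad & \ln P = \beta_0 + \sum_{j} \beta_j \ln x_j + \varepsilon \\
\text{Quadratic semi-log:} \quad & \ln P = \beta_0 + \sum_{j} \beta_j x_j
    + \sum_{j} \sum_{k \ge j} \gamma_{jk}\, x_j x_k + \varepsilon
\end{aligned}
```

All four are linear in the parameters, which is what allows estimation by ordinary least squares.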



The procedure employed in this work is to select the functional form, linear in parameters, that achieves the highest accuracy in terms of R² while all the variables of the model are statistically significant. Variables are selected by the backward method. We specify a functional form that is linear in parameters (linear, semi-log, double-log or quadratic semi-log) and start from all the available independent variables. We estimate the model by ordinary least squares and delete the least significant variable. We then re-estimate the model without the deleted variable, and we continue this process until all the surviving variables are statistically significant. We employ a significance level of 5%.
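The backward method described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the data are synthetic, the variable names are placeholders, and p-values use a normal approximation to the t distribution (an assumption made here to keep the example dependent on NumPy only).

```python
import math

import numpy as np


def ols_p_values(design, y):
    """Fit OLS and return two-sided p-values for every coefficient."""
    n_obs, n_par = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (n_obs - n_par)
    std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(design.T @ design)))
    t_stats = beta / std_err
    # Assumption: normal approximation to the t distribution (avoids SciPy).
    return np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stats])


def backward_elimination(X, y, names, level=0.05):
    """Drop the least significant regressor until all pass the chosen level."""
    keep = list(range(X.shape[1]))
    while keep:
        design = np.column_stack([np.ones(len(y)), X[:, keep]])
        p_slopes = ols_p_values(design, y)[1:]  # the intercept is always kept
        worst = int(np.argmax(p_slopes))
        if p_slopes[worst] <= level:
            break
        del keep[worst]
    return [names[i] for i in keep]


# Synthetic illustration: the response depends on m2 and ind, not on noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(85, 3))
y = 2.0 + 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=85)
selected = backward_elimination(X, y, ["m2", "ind", "noise"])
print(selected)
```

With a sample as small as the one used in this paper, exact t-distribution p-values would be preferable to the normal approximation used in the sketch.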




b) Nonparametric Approach


In this subsection we briefly explain a generalization of the nearest neighbour method called local linear regression. The method is based on the idea that observations with similar characteristics should have similar outcomes. Suppose that we have a sample of observations for which we know the inputs and their respective outputs. We want to predict the unknown output value ($y_0$) for a new input vector ($x_0$). We calculate the Euclidean distance between $x_0$ and the sample input vectors. We then select the K closest points and their respective outputs and use them to perform a linear regression. Once the parameters are estimated, we can infer the predicted value $\hat{y}_0$.



Of great interest is the choice of the number of nearest neighbours (K). Consistency of local linear regression demands that the number of nearest neighbours goes to infinity as the sample size increases, but at a slower rate. The literature offers rules that set K as a function of the sample size T and a constant for which a conventional value is usually assumed. We prefer to adopt an empirical perspective and try many values of K. We select the value that achieves the highest out-of-sample performance.
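The local regression and the empirical choice of K can be sketched as follows. This is an illustration under stated assumptions: the data are synthetic, the function names are hypothetical, and accuracy is measured as 1 - SSE/SST on the held-out points, which is one common definition of out-of-sample R² and may differ from the one the authors used.

```python
import numpy as np


def local_linear_predict(X_train, y_train, x_new, k):
    """Fit OLS on the k training points closest to x_new and predict there."""
    dist = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(dist)[:k]
    design = np.column_stack([np.ones(len(nearest)), X_train[nearest]])
    beta, *_ = np.linalg.lstsq(design, y_train[nearest], rcond=None)
    return float(beta[0] + x_new @ beta[1:])


def r2(y_true, y_pred):
    """Assumed accuracy measure: 1 - SSE / SST on the held-out points."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def select_k(X_train, y_train, X_val, y_val, candidates):
    """Return the candidate K with the highest validation R2, plus all scores."""
    scores = {}
    for k in candidates:
        preds = np.array([local_linear_predict(X_train, y_train, x, k) for x in X_val])
        scores[k] = r2(y_val, preds)
    return max(scores, key=scores.get), scores


# Synthetic, non-linear data: 85 training and 25 validation observations.
rng = np.random.default_rng(1)
X = rng.uniform(-2.0, 2.0, size=(110, 2))
y = np.sin(2.0 * X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=110)
candidates = [5, 10, 20, 30, 50, 85]
best_k, scores = select_k(X[:85], y[:85], X[85:], y[85:], candidates)
print(best_k, scores)

# Sanity check on exactly linear data: the local fit reproduces the plane.
y_lin = 1.0 + 2.0 * X[:, 0] - X[:, 1]
check = local_linear_predict(X, y_lin, np.array([0.3, -0.2]), k=10)
```

Setting K equal to the training sample size makes every local fit identical to the global linear regression, so the largest candidate serves as the parametric benchmark.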











c) Semi-parametric Approach

c.1) Neural Networks


Neural networks (NN) are a class of semi-parametric models inspired by studies of how the brain and nervous system work. They have been employed to solve a wide range of economic problems such as financial time series forecasting and bankruptcy prediction, among others. Some works have already applied NNs successfully to the specific case of hedonic regression (Curry, Morgan and Silver (2001)). A good introduction to NNs can be found in Smith (1995), and economic applications in Gately (1996) and Deboeck (1994). NNs are composed of interconnected elements, called neurons, linked to one another through weights and grouped in layers. The first layer is called the input layer and the last is the output layer. The middle layers are called hidden layers. Each neuron in the input layer brings into the network the value of one independent variable and propagates it to the neurons of the next layer. In turn, each neuron of the next layer forms a weighted linear combination of the input signals it receives, processes this weighted information through a transfer function and sends an output signal. The signals from all neurons are propagated across the NN in the same way up to the final layer, where the output of the NN is produced. The difference between the output of the NN and the known value of the dependent variable is then calculated. The NN tries to minimize this error by modifying the weights of the links. This process continues iteratively to find the optimal weight values, and it finishes when a predetermined error level is achieved or, failing that, after a predetermined number of iterations.


The construction of a good NN for a particular application is not a trivial task. To avoid a lack of generalization, we must choose an appropriate architecture (for example, the number of hidden layers, the number of units in each layer, the connections between units and the transfer functions). A common practice when building a NN is to select the architecture by “trial and error”, searching for the highest performance. In this work we use the simplest and most widely employed NN in economics: a feed-forward network trained by back-propagation. In its statistical expression, this NN can be written as





$$Y_i = F\left(\beta_0 + \sum_{j=1}^{q} \beta_j \, G\left(\gamma_{0j} + \sum_{k=1}^{n} \gamma_{kj} X_{ki}\right)\right) + \varepsilon_i$$

where $Y_i$ is the dependent variable, $X_i = (X_{1i}, \dots, X_{ni})$ the input vector, the parameters $\beta$ and $\gamma$ are the weights to be adjusted, n is the number of inputs and q the number of hidden units, $F$ and $G$ are the transfer functions and $\varepsilon_i$ a disturbance term. It is known and accepted that a three-layer feed-forward NN with a linear transfer function in the output unit ($F$) and a logistic transfer function in the hidden-layer neurons ($G$) is able to approximate any non-linear function to an arbitrary degree of accuracy (Qi (1999)). We employ this architecture, and the number of neurons in the hidden layer is determined by “trial and error”, searching for the highest value of the out-of-sample R².
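A network of this kind, with one hidden layer of logistic units and a linear output unit, can be sketched as below. The weight names, the learning rate, the number of epochs and the synthetic data are all assumptions made for this illustration. Training uses plain full-batch gradient descent on the mean squared error, which is one simple form of back-propagation and not necessarily the variant used in the paper.

```python
import numpy as np


def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X, w_hid0, w_hid, w_out0, w_out):
    """Linear output unit on top of q logistic hidden units."""
    hidden = logistic(w_hid0 + X @ w_hid)  # shape (T, q)
    return w_out0 + hidden @ w_out, hidden


def train(X, y, q, epochs=2000, rate=0.1, seed=0):
    """Full-batch gradient descent on the mean squared error."""
    rng = np.random.default_rng(seed)
    w_hid0 = rng.normal(scale=0.5, size=q)
    w_hid = rng.normal(scale=0.5, size=(X.shape[1], q))
    w_out0, w_out = 0.0, rng.normal(scale=0.5, size=q)
    losses = []
    for _ in range(epochs):
        pred, hidden = forward(X, w_hid0, w_hid, w_out0, w_out)
        err = pred - y
        losses.append(float(np.mean(err ** 2)))
        # Back-propagate the error from the output unit to the hidden layer.
        delta = np.outer(err, w_out) * hidden * (1.0 - hidden)
        w_out0 -= rate * err.mean()
        w_out = w_out - rate * hidden.T @ err / len(y)
        w_hid0 = w_hid0 - rate * delta.mean(axis=0)
        w_hid = w_hid - rate * X.T @ delta / len(y)
    return (w_hid0, w_hid, w_out0, w_out), losses


# Synthetic illustration with two inputs and three hidden units.
rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(85, 2))
y = X[:, 0] ** 2 + 0.5 * X[:, 1]
weights, losses = train(X, y, q=3)
print(losses[0], losses[-1])

# Hand-checkable case: zero hidden weights give hidden outputs of 0.5 each.
known, _ = forward(np.zeros((1, 2)), np.zeros(3), np.zeros((2, 3)), 1.0, np.full(3, 2.0))
```

Repeating the training for several values of q and keeping the one with the best validation score corresponds to the trial-and-error search over hidden units.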



One important question is how to select the NN inputs, that is, which independent variables the NN will use. Medeiros and Teräsvirta (2001) suggest carrying out the variable selection by linearizing the model and applying a linear variable selection method. In our case, we employ the backward method to select the relevant variables.










c.2) Genetic Algorithms



Genetic Algorithms (GA) are a functional search procedure based on the Darwinian theories of natural selection and survival. The procedure was developed by Holland (1975) and popularized by Goldberg (1989) and Koza (1992). In general, its application to economic problems is very scarce and, so far, it has not been applied to the hedonic regression problem (at least, to the authors' knowledge).



GAs present advantages with respect to the neural network and nearest neighbour methods. First of all, the procedure yields an explicit mathematical equation that we can analyse. Moreover, unlike neural networks, GAs are more flexible because they do not require a complex architecture to be specified in advance.



We use a specific GA called DARWIN (Álvarez et al. (2001)). DARWIN carries out an optimization process that finds an optimal functional form from an evolving initial population of alternative equations. The algorithm simulates on a computer the process of selection and survival observed in nature. Briefly, DARWIN works as follows. First, a set of candidate equations (the initial population) for representing the relation between the variables is randomly generated. These equations are initially of the form $((A \otimes B) \otimes (C \otimes D))$, where the arguments A, B, C and D are explanatory variables or real-number constants (the coefficients in the equations), and the symbol $\otimes$ stands for one of the four basic arithmetic operators ($+$, $-$, $\times$, $\div$). Other mathematical operators are conceivable, but increasing the number of available operators complicates the functional optimization process. Each equation of the initial population is evaluated and ranked according to its R². The equations with the highest values of R² are selected to exchange parts of their character strings with one another (reproduction and crossover), while the individuals that fit the data less well are discarded. As a result of this crossover, offspring more complicated than their parents are generated. The total number of characters in an equation is bounded above to avoid generating offspring of excessive length. Finally, a small percentage of the most basic elements of the equations, single operators and variables, are mutated at random. The process is repeated a large number of times to improve the fitness of the evolving population. At the end of the evolutionary process, DARWIN returns the equation that it considers optimal for representing the true functional relation between the variables.
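The evolutionary loop described above can be illustrated with a generic sketch. This is not DARWIN: it is a small genetic-programming example written for this purpose, with equations stored as nested tuples rather than character strings, an assumed starting form ((A op B) op (C op D)), a node count standing in for the bound on equation length, protected division, and arbitrary population size, survival share and mutation rate. The data are synthetic.

```python
import math
import operator
import random

# Protected division is an assumption made here so that x / 0 cannot crash.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "/": lambda a, b: a / b if abs(b) > 1e-9 else 1.0}


def leaf(rng, n_vars):
    """A variable ("x", index) or a real constant ("c", value)."""
    if rng.random() < 0.5:
        return ("x", rng.randrange(n_vars))
    return ("c", rng.uniform(-5.0, 5.0))


def initial_equation(rng, n_vars):
    """Random equation of the form ((A op B) op (C op D))."""
    ops = list(OPS)
    return (rng.choice(ops),
            (rng.choice(ops), leaf(rng, n_vars), leaf(rng, n_vars)),
            (rng.choice(ops), leaf(rng, n_vars), leaf(rng, n_vars)))


def evaluate(eq, row):
    if eq[0] == "x":
        return row[eq[1]]
    if eq[0] == "c":
        return eq[1]
    return OPS[eq[0]](evaluate(eq[1], row), evaluate(eq[2], row))


def nodes(eq, path=()):
    """List (path, subtree) pairs in pre-order, parents before children."""
    found = [(path, eq)]
    if eq[0] in OPS:
        found += nodes(eq[1], path + (1,)) + nodes(eq[2], path + (2,))
    return found


def graft(eq, path, new):
    """Copy of eq with the subtree at path replaced by new."""
    if not path:
        return new
    parts = list(eq)
    parts[path[0]] = graft(eq[path[0]], path[1:], new)
    return tuple(parts)


def r_squared(eq, X, y):
    mean_y = sum(y) / len(y)
    ss_tot = sum((v - mean_y) * (v - mean_y) for v in y)
    ss_res = 0.0
    for v, row in zip(y, X):
        err = v - evaluate(eq, row)
        ss_res += err * err
    score = 1.0 - ss_res / ss_tot
    return score if math.isfinite(score) else -math.inf


def evolve(X, y, generations=20, pop_size=100, max_nodes=31, rate=0.02, seed=0):
    rng = random.Random(seed)
    n_vars = len(X[0])

    def fitness(eq):
        return r_squared(eq, X, y)

    pop = [initial_equation(rng, n_vars) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 4]  # the less fitted equations are discarded
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.choice(parents), rng.choice(parents)
            child = graft(a, rng.choice(nodes(a))[0], rng.choice(nodes(b))[1])
            if len(nodes(child)) > max_nodes:  # upper bound on equation length
                child = a
            for path, sub in nodes(child):  # mutate single operators and leaves
                if rng.random() < rate:
                    if sub[0] in OPS:
                        child = graft(child, path, (rng.choice(list(OPS)),) + sub[1:])
                    else:
                        child = graft(child, path, leaf(rng, n_vars))
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness), history


# Synthetic data: y = x0^2 + 3 * x1, which the four operators can express.
data_rng = random.Random(1)
X = [[data_rng.uniform(0.0, 2.0), data_rng.uniform(0.0, 2.0)] for _ in range(60)]
y = [row[0] * row[0] + 3.0 * row[1] for row in X]
best, history = evolve(X, y)
print(best, history[-1])
```

Because the fittest quarter of each generation is carried over unchanged, the best R² in the population never decreases from one generation to the next.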



III. Data and Results




The sample consists of 110 observations obtained through interviews that we conducted with the estate agency in the city of Vigo (Spain) from March to May 1998. For each house, we collected information about its rental price and characteristics such as structural attributes and neighbourhood conditions. The housing characteristics are presented in Table 1. Our aim is to show that both nonparametric and semi-parametric models can improve forecast accuracy with respect to parametric models. The same data are used for all techniques. The general characteristics of the estimation and forecasting are as follows. The models were estimated on observations 1 to 85 (training set). The remaining observations were used to test the models and to obtain the out-of-sample predictions (validation set). As mentioned before, the variable selection technique employed was the backward method. Finally, the measure used to compare the forecasting accuracy of the models is the out-of-sample R².
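The evaluation protocol can be made concrete with a short sketch. The split follows the text (first 85 observations for estimation, the remaining 25 for validation). The R² formula is an assumption, since the text does not state which definition was used: here it is 1 - SSE/SST with the total sum of squares taken around the validation mean. The numbers are made up for the example.

```python
def split_sample(rows, n_train=85):
    """First n_train observations for estimation, the rest for validation."""
    return rows[:n_train], rows[n_train:]


def out_of_sample_r2(actual, predicted):
    """Assumed definition: 1 - SSE / SST, with SST around the validation mean."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


observations = list(range(110))  # stand-ins for the 110 houses
training, validation = split_sample(observations)
print(len(training), len(validation))

# Toy check: one prediction is off by 2, so SSE = 4 and SST = 20.
score = out_of_sample_r2([2.0, 4.0, 6.0, 8.0], [2.0, 4.0, 6.0, 6.0])
print(score)
```

Under this definition a model that always predicts the validation mean scores 0, and a model can score below 0 if it predicts worse than that mean.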





Table 1. Description of the Independent Variables

VARIABLE   DESCRIPTION
---------  ------------------------------------------------------------------
Rp         Rental price in pesetas. For the NN this variable was normalized
           and, for the GA, it was divided by 1000.
M2         Square metres.
Cond       Dichotomous variable that takes the value 1 if the house is
           rated by the agent as excellent to occupy.
Ind        Index built as the sum of 5 structural characteristics: existence
           of lumber room, grocery store, central heating, elevator and
           whether the kitchen is furnished.
Actecon    Variable that captures the economic activity of the street where
           the house is located. It is calculated as the ratio between the
           number of businesses in the street and the number of houses.
Npg        Number of garage places.
Ncb        Number of bathrooms.



Table 2 shows the results, classified in three groups. The first presents the parametric models: linear, semi-log, double-log and quadratic semi-log. The last of these takes into account the squared and cross-product effects between variables and was already employed successfully by Rasmussen and Zuehlke (1990). The best parametric model, the quadratic semi-log, produces an out-of-sample R² of 0.7689. However, the sign of the variable IndCond is not consistent with a priori expectations. For the nonparametric case, Graphic 1 shows the sensitivity of the R² to the number of nearest neighbours. The highest accuracy is achieved for K=30, with R²=0.8575. The graphic also lets us verify the existence of important non-linearity in the hedonic regression: accuracy gets worse when more neighbours are considered. Therefore, the local regression achieves higher accuracy than the regression that uses all the points (this case corresponds to the parametric linear regression).


Graphic 1. Local Regression. K Nearest Neighbours Determination



Among the semi-parametric methods, we start with the neural network model. The number of hidden neurons finally selected was 3. As Table 2 shows, the NN obtains a better result than the parametric methods and a slight improvement over the local regression. The GA, on the other hand, is more accurate than the parametric models but less accurate than the nonparametric method and the NN. However, unlike these methods, the GA yields an explicit non-linear expression that represents the relation between the variables. Two aspects of this expression deserve emphasis. First, it is formed by a non-linear component (affecting the variables M2 and Cond) and a linear component (variables Actecon and Ind). Second, the effects of the variables on the rental price are those expected a priori.



Table 2. Out-of-Sample Accuracy and Comparison between Models

HEDONIC REGRESSION METHOD   MODEL                                      R² OUT-OF-SAMPLE
--------------------------  -----------------------------------------  ----------------
PARAMETRIC METHODS          Linear regression                          0.7232
                            Semi-log                                   0.7375
                            Double-log                                 0.6996
                            Quadratic semi-log                         0.7689
NONPARAMETRIC METHOD        Local regression (the number of nearest    0.8575
                            neighbours considered was 30)
SEMI-PARAMETRIC METHODS     Neural network (feed-forward               0.8621
                            back-propagation with 3 layers and
                            1 neuron in the hidden layer)
                            Genetic algorithm                          0.8220




In summary, we have shown that data-driven methods such as the neural network and the genetic algorithm (semi-parametric methods) and nearest neighbours (nonparametric method) capture substantial non-linearity that models based on linear transformations cannot fully capture, as reflected in out-of-sample performance.



IV. Conclusions

The empirical results presented in this paper demonstrate the usefulness of applying nonparametric and semi-parametric data-driven models in the estimation of hedonic price functions. In all cases, the data-driven models outperform the parametric models in terms of out-of-sample R². Despite this improvement, one problem with nearest neighbours and neural networks is that the results are hard to interpret: they do not offer an explicit equation in which we can analyse the effect of each independent variable on the rental price. The GA, on the other hand, yields an analytical expression that is easy to interpret and highly accurate (less so than the other data-driven techniques, but better than the parametric methods). Its drawback is that we cannot carry out statistical inference and hypothesis testing.











References


Álvarez A., Orfila A. and Tintoré J. (2001) “DARWIN - an Evolutionary Program for Nonlinear Modeling of Chaotic Time Series”, Computer Physics Communications, in press.

Curry B., Morgan P. and Silver M. (2001) “Hedonic regressions: mis-specification and neural networks”, Applied Economics, 33, 659-671.

Deboeck G. (1994) “Trading on the edge: Neural, genetic and fuzzy systems for chaotic financial markets”, eds. Guido Deboeck, John Wiley & Sons.

Gately E. (1996) “Neural networks for financial forecasting”, eds. P. J. Kaufman, John Wiley & Sons.

Goldberg D. E. (1989) “Genetic Algorithms in Search, Optimization and Machine Learning”, Addison-Wesley, Reading, MA.

Holland J. H. (1975) “Adaptation in Natural and Artificial Systems”, The University of Michigan Press, Ann Arbor.

Koza J. R. (1992) “Genetic programming: On the programming of computers by means of natural selection”, The MIT Press, Cambridge.

Medeiros M. and Teräsvirta T. (2001) “Statistical methods for modelling neural networks”, paper in progress.

Qi M. (1999) “Nonlinear predictability of stock returns using financial and economic variables”, Journal of Business & Economic Statistics, 17, 4, 419-429.

Rasmussen and Zuehlke (1990) “On the choice of functional form for hedonic price functions”, Applied Economics, 22, 431-438.

Smith M. (1995) “Neural networks for statistical modelling”, Van Nostrand Reinhold, New York.