Using Data-Driven Prediction Methods in a Hedonic Regression Problem
MARCOS ÁLVAREZ-DÍAZ¹, MANUEL GONZÁLEZ GÓMEZ
Department of Applied Economy, University of Vigo
Lagoas-Marcosende, Vigo, Spain
and
ALBERTO ÁLVAREZ
ISME-DSEA Department of Electrical Engineering, University of Pisa
Via Diotisalve 2, 56100 Pisa, Italy
(PRELIMINARY VERSION)
Abstract
Traditional studies of hedonic prices apply simple functional forms such as linear structures or structures that can be linearized by transformation. The literature now recognizes the importance of introducing non-linearity to improve the explanatory capacity of the models. In this work we apply data-driven methods to carry out the hedonic regression. These methods do not impose any a priori assumption about the functional form. We use the nearest neighbours technique as a nonparametric method and neural networks and genetic algorithms as semi-parametric methods. Neural networks have already been applied to the hedonic regression problem but, to the authors' knowledge, this is the first time that a genetic algorithm is employed. Our empirical results demonstrate the usefulness of applying both nonparametric and semi-parametric data-driven models in the estimation of hedonic price functions: they can improve on the traditional parametric models in terms of out-of-sample R².
¹ Corresponding author: Departamento de Economía Aplicada, Universidad de Vigo, Lagoas-Marcosende s/n, 36200 Vigo, Spain. Fax: 986812401; e-mail: mad@uvigo.es.
I. Introduction
Hedonic price theory tries to determine how the individual characteristics of a commodity affect its price. The hedonic perspective has been applied to many goods such as automobiles, personal computers, televisions, irrigated land and, above all, housing. A linear relation between the price of the commodity and its characteristics is the most widely used approach in this sort of study. The principal reason argued is that linear models are easy to estimate and interpret. Nevertheless, the literature also recognizes the importance of non-linearities in the hedonic price function for increasing its explanatory capacity (Rasmussen and Zuehlke (1990)). Many works address this issue with techniques that, by means of transformations, allow flexible parametric functional forms (for example, Box-Cox transformations). However, these flexible forms impose a structure that introduces forced and unnecessary non-linearity. The results are "over-parameterized" models that lose out-of-sample performance. To avoid this problem, we can use data-driven methods that yield a model without imposing any a priori assumption about the functional form (nearest neighbours, neural networks and genetic algorithms).
The objective of this paper is to determine whether nonparametric and semi-parametric data-driven models can improve forecast accuracy with respect to parametric models. Our empirical study is centered on the real estate market in the city of Vigo (Spain). To carry out our research, we collected information about rental prices and about the houses, such as their structural attributes and neighbourhood conditions. The work is structured as follows. In section II, we briefly present the different prediction methods that we have employed (linear regressions as the parametric method, nearest neighbours as the nonparametric method and neural networks and genetic algorithms as semi-parametric methods). In section III, we define the data and show our empirical results on out-of-sample predictions. Finally, section IV presents the conclusions.
II. Methodology
a) Parametric Techniques
The simplest approach to the hedonic regression problem is to postulate that the functional form is simply linear. To achieve more flexibility, it is very usual to apply some non-linear transformation to the data. Most studies of hedonic regression employ simple functional relations such as linear, semi-logarithmic, double-log or quadratic semi-log, among others. The justification for such forms rests on the use of traditional estimation methods such as ordinary least squares, on their success in previous studies, on empirical tests and on the ease of carrying out statistical inference and hypothesis testing.
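As a reading aid, the four parametric forms named above can be written out as follows. The notation is an assumption made here for illustration, not the authors' own: $P$ is the price, $x_k$ the $k$-th characteristic, $\beta$ and $\gamma$ the coefficients and $\varepsilon$ the error term. The quadratic semi-log form is shown in its usual textbook version with squared and cross-product terms.

```latex
\begin{align*}
\text{Linear:} \quad & P = \beta_0 + \sum_{k} \beta_k x_k + \varepsilon \\
\text{Semi-log:} \quad & \ln P = \beta_0 + \sum_{k} \beta_k x_k + \varepsilon \\
\text{Double-log:} \quad & \ln P = \beta_0 + \sum_{k} \beta_k \ln x_k + \varepsilon \\
\text{Quadratic semi-log:} \quad & \ln P = \beta_0 + \sum_{k} \beta_k x_k
    + \tfrac{1}{2} \sum_{k} \sum_{l} \gamma_{kl}\, x_k x_l + \varepsilon
\end{align*}
```

All four are linear in the parameters, which is what allows estimation by ordinary least squares.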
The procedure employed in this work is to select the linear-in-parameters functional form that achieves the highest accuracy in terms of R² while all the variables of the model are statistically significant. Variable selection is carried out by the backward method. That is, we specify a functional form that is linear in parameters (linear, semi-log, double-log or quadratic semi-log) and consider all the available independent variables. We estimate the model by ordinary least squares and delete the least significant variable. We then re-estimate the model without the deleted variable, and we continue this process until all the surviving variables are statistically significant. We employ a significance level of 5%.
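The backward method described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, NumPy is assumed to be available, and significance at 5% is approximated by the normal critical value |t| >= 1.96 rather than the exact Student-t quantile.

```python
import numpy as np

def ols_t_stats(X, y):
    """OLS coefficients and t-statistics; X must already contain a constant column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)              # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)         # coefficient covariance matrix
    return beta, beta / np.sqrt(np.diag(cov))

def backward_elimination(X, y, names, t_crit=1.96):
    """Drop the least significant regressor until every survivor has |t| >= t_crit.

    t_crit=1.96 is an assumed approximation to the 5% two-sided critical value.
    """
    keep = list(range(X.shape[1]))
    while keep:
        Xc = np.column_stack([np.ones(len(y)), X[:, keep]])
        _, t = ols_t_stats(Xc, y)
        t_abs = np.abs(t[1:])                     # the intercept is never dropped
        worst = int(np.argmin(t_abs))
        if t_abs[worst] >= t_crit:
            break                                 # all survivors are significant
        keep.pop(worst)
    return [names[i] for i in keep]
```

For a semi-log or double-log specification, the same routine would be called on the transformed data (for example with `np.log(y)` as the dependent variable).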
b) Nonparametric Approach
In this subsection we briefly explain a generalization of the nearest neighbour method called local linear regression. The method is based on the idea that observations with similar characteristics should have similar results. Suppose that we have a sample of observations for which we know the inputs and their respective outputs. We want to predict the unknown output value for a new input vector. We calculate the Euclidean distance between the new input vector and each of the sample input vectors. We then select the K closest points and their respective outputs to perform a linear regression. Once the parameters are estimated, we can infer the predicted value.
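The procedure just described can be sketched in a few lines. The function name and arguments are hypothetical and NumPy is assumed to be available; this is an illustration of the idea, not the authors' implementation.

```python
import numpy as np

def local_linear_predict(X_train, y_train, x_new, k):
    """Predict the output at x_new from an OLS fit on its k nearest neighbours."""
    k = min(k, len(X_train))
    dist = np.linalg.norm(X_train - x_new, axis=1)    # Euclidean distances
    idx = np.argsort(dist)[:k]                         # indices of the k closest points
    A = np.column_stack([np.ones(k), X_train[idx]])    # constant plus inputs
    beta, *_ = np.linalg.lstsq(A, y_train[idx], rcond=None)
    return float(np.concatenate(([1.0], x_new)) @ beta)
```

When `k` equals the full sample size, every observation enters the fit and the prediction coincides with that of a global linear regression.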
Of great interest is the choice of the number of nearest neighbours (K). Consistency of local linear regression demands that the number of nearest neighbours goes to infinity as the sample size increases, but at a slower rate. The literature offers rules that set K as a function of the sample size T and a parameter for which a conventional value is usually assumed. We prefer to adopt an empirical perspective and try many values of K, selecting the value that achieves the highest out-of-sample performance.
c) Semi-parametric Approach
c.1) Neural Networks
Neural networks (NNs) are a class of semi-parametric models inspired by studies of how the brain and nervous system work. They have been employed to solve a huge range of economic problems such as financial time series forecasting and bankruptcy prediction, among others. Some works have already applied NNs successfully to the specific case of hedonic regression (Curry, Morgan and Silver (2001)). A good introduction to NNs can be found in Smith (1995), and economic applications in Gately (1996) and Deboeck (1994). NNs are composed of interconnected elements, called neurons, linked to each other through weights and grouped in layers. The first layer is called the input layer and the last is the output layer. The middle layers are called hidden layers. Each neuron in the input layer brings into the network the value of one independent variable and propagates it towards the neurons of the next layer. In turn, each neuron of the next layer forms a weighted linear combination of the input signals it receives, processes this weighted information through a transfer function and sends an output signal. The signals from all neurons are propagated across the NN in the same way up to the final layer, where the output of the NN is produced. The difference between the output of the NN and the known value of the dependent variable is calculated. The NN tries to minimize this error by modifying the weights of the links. This process continues iteratively to find the optimal values of the weights, and it finishes when a predetermined error level is achieved or, failing that, after a predetermined number of iterations.
The construction of a good NN for a particular application is not a trivial task. To avoid a lack of generalization, we must choose an appropriate architecture (for example, number of hidden layers, number of units in each layer, connections between units and transfer functions). A common practice is to select the architecture by a process of trial and error, searching for the highest performance. In this work, we use the simplest and most widely employed NN in economics: a feed-forward network trained by back-propagation. In its statistical expression, this NN relates the dependent variable Y_i to the input vector X_i through the weights to be adjusted, the transfer functions and a disturbance term, where n is the number of inputs and q the number of hidden units. It is known and accepted that a three-layer feed-forward NN with a linear transfer function in the output unit and a logistic transfer function in the hidden layer neurons is able to approximate any non-linear function to an arbitrary degree of accuracy (Qi (1999)). We employ this architecture, and the number of neurons in the hidden layer is determined by trial and error, searching for the highest value of the out-of-sample R².
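To make the architecture concrete, here is a minimal sketch of the forward pass of such a three-layer network, with logistic hidden units and a linear output unit. One common way of writing it, in notation assumed here rather than taken from the authors, is $Y_i = \beta_0 + \sum_{j=1}^{q} \beta_j F(\gamma_{j0} + \sum_{k=1}^{n} \gamma_{jk} X_{ki}) + \varepsilon_i$ with $F(z) = 1/(1+e^{-z})$. The function and argument names are hypothetical, and training by back-propagation is not shown.

```python
import math

def logistic(z):
    """Logistic transfer function used in the hidden layer."""
    return 1.0 / (1.0 + math.exp(-z))

def nn_output(x, hidden_weights, hidden_bias, out_weights, out_bias):
    """Forward pass: n inputs -> q logistic hidden units -> one linear output.

    hidden_weights is a list of q weight vectors (one per hidden unit, each of
    length n); hidden_bias and out_weights are lists of length q.
    """
    hidden = [
        logistic(b + sum(w * xi for w, xi in zip(ws, x)))
        for ws, b in zip(hidden_weights, hidden_bias)
    ]
    return out_bias + sum(v * h for v, h in zip(out_weights, hidden))
```

With all hidden weights and biases set to zero, every hidden unit outputs 0.5, so the network output reduces to the output bias plus half the sum of the output weights, which is a convenient sanity check.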
One important question is how to select the NN inputs, that is, which independent variables the NN will employ as inputs. Medeiros and Teräsvirta (2001) suggest carrying out the variable selection by linearizing the model and applying a linear variable selection method. In our case, we employ the backward method to select the relevant variables.
c.2) Genetic Algorithms
Genetic Algorithms (GAs) are a functional search procedure based on the Darwinian theories of natural selection and survival. This procedure was developed by Holland (1975) and popularized by Goldberg (1989) and Koza (1992). In general, its application to economic problems is very scarce and, so far, it has not been applied to the hedonic regression problem (at least, to the authors' knowledge).
GAs present advantages with respect to the neural network and nearest neighbour methods. First of all, the procedure yields an explicit mathematical equation that we can analyze. Moreover, unlike neural networks, GAs are more flexible because they do not require the prior specification of a complex architecture.
We use a specific GA called DARWIN (Álvarez et al. (2001)). DARWIN carries out an optimization process that finds an optimal functional form by evolving an initial population of alternative equations. The algorithm simulates on a computer the process of selection and survival observed in nature. Briefly, DARWIN works as follows. First, a set of candidate equations (the initial population) for representing the relation between variables is randomly generated. These equations are initially of the form
$(A \otimes B) \otimes (C \otimes D)$, where the arguments A, B, C and D are the explanatory variables or real-number constants (the coefficients in the equations), and the symbol $\otimes$ stands for one of the four basic arithmetic operators (+, −, ×, ÷).
Other mathematical operators are conceivable, but increasing the number of available operators complicates the functional optimization process. Each equation of the initial population is evaluated and ranked according to its R². The equations with the highest values of R² are selected to exchange parts of their character strings with each other (reproduction and crossover), while the individuals less fitted to the data are discarded. As a result of this crossover, offspring more complicated than the parents are generated. The total number of characters in the equations is bounded from above to avoid the generation of offspring of excessive length. Finally, a small percentage of the most basic elements of the equations, single operators and variables, are mutated at random. The process is repeated a large number of times to improve the fitness of the evolving population. At the end of the evolutionary process, DARWIN returns the equation that it considers optimal to represent the true functional relation between the variables.
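The evolutionary loop described above (random initial equations built from four arguments and arithmetic operators, ranking by R², crossover among the fittest, a bound on equation length, random mutation) can be illustrated with the following toy sketch. It is not the DARWIN program: the tree representation, the protected division, the population settings and all function names are assumptions made here for illustration, and equation length is bounded by node count rather than by number of characters.

```python
import math
import random

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    # Protected division is an assumption of this sketch, to avoid dividing by zero.
    "/": lambda a, b: a / b if abs(b) > 1e-9 else 1.0,
}

def random_leaf(n_vars, rng):
    """A leaf is either an explanatory variable or a real-number constant."""
    if rng.random() < 0.5:
        return ("x", rng.randrange(n_vars))
    return ("c", rng.uniform(-10.0, 10.0))

def random_equation(n_vars, rng):
    """Initial candidate: two operator pairs joined by a third operator."""
    def pair():
        return (rng.choice(list(OPS)), random_leaf(n_vars, rng), random_leaf(n_vars, rng))
    return (rng.choice(list(OPS)), pair(), pair())

def evaluate(eq, x):
    if eq[0] == "x":
        return x[eq[1]]
    if eq[0] == "c":
        return eq[1]
    return OPS[eq[0]](evaluate(eq[1], x), evaluate(eq[2], x))

def size(eq):
    return 1 if eq[0] not in OPS else 1 + size(eq[1]) + size(eq[2])

def subtrees(eq):
    yield eq
    if eq[0] in OPS:
        yield from subtrees(eq[1])
        yield from subtrees(eq[2])

def r_squared(eq, X, y):
    """Fitness: 1 - SSE/SST, or -inf when the equation blows up numerically."""
    mean_y = sum(y) / len(y)
    errs = [yi - evaluate(eq, xi) for xi, yi in zip(X, y)]
    sse = sum(e * e for e in errs)
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst if math.isfinite(sse) else float("-inf")

def crossover(a, b, rng):
    """Replace a randomly chosen part of a with a random subtree of b."""
    if a[0] not in OPS or rng.random() < 0.3:
        return rng.choice(list(subtrees(b)))
    if rng.random() < 0.5:
        return (a[0], crossover(a[1], b, rng), a[2])
    return (a[0], a[1], crossover(a[2], b, rng))

def mutate(eq, n_vars, rng, rate=0.05):
    """Randomly change a small share of single operators, variables and constants."""
    if eq[0] not in OPS:
        return random_leaf(n_vars, rng) if rng.random() < rate else eq
    op = rng.choice(list(OPS)) if rng.random() < rate else eq[0]
    return (op, mutate(eq[1], n_vars, rng, rate), mutate(eq[2], n_vars, rng, rate))

def evolve(X, y, pop_size=200, generations=30, max_size=31, seed=0):
    rng = random.Random(seed)
    n_vars = len(X[0])
    pop = [random_equation(n_vars, rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda eq: r_squared(eq, X, y), reverse=True)
        parents = pop[: pop_size // 4]            # the fittest quarter survives
        children = []
        while len(parents) + len(children) < pop_size:
            child = crossover(rng.choice(parents), rng.choice(parents), rng)
            child = mutate(child, n_vars, rng)
            if size(child) <= max_size:           # bound the equation length
                children.append(child)
        pop = parents + children
    return max(pop, key=lambda eq: r_squared(eq, X, y))
```

Because the fittest equations are carried over unchanged into the next generation, the best R² in the population can never decrease from one generation to the next in this sketch.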
III. Data and Results
The sample consists of 110 observations obtained through interviews that we carried out with the estate agencies in the city of Vigo (Spain) from March to May 1998. For each house, we collected information about its rental price and characteristics such as structural attributes and neighbourhood conditions. The housing characteristics are presented in Table 1. Our focus is to show that both nonparametric and semi-parametric models can improve forecast accuracy with respect to parametric models. The same data are used for all techniques. The general characteristics of the estimation and forecasting are as follows. The models were estimated on observations 1 to 85 (training set). The remaining observations were used to test the models and to obtain the out-of-sample predictions (validation set). As mentioned before, the variable selection technique employed was the backward method. Finally, the measure used to compare the forecasting accuracy of the models is the out-of-sample R².
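The evaluation protocol can be sketched as follows. The paper does not spell out its formula for the out-of-sample R², so this sketch assumes the common definition 1 - SSE/SST computed on the validation set; the function names are hypothetical.

```python
def split_train_validation(rows, n_train=85):
    """First n_train observations for estimation, the remainder for validation."""
    return rows[:n_train], rows[n_train:]

def out_of_sample_r2(y_true, y_pred):
    """Assumed definition: 1 - SSE/SST, with SST taken around the validation mean."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - sse / sst
```

With 110 observations, this split leaves 25 observations in the validation set. Under this definition a model that always predicts the validation mean scores 0 and a perfect model scores 1.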
Table 1. Description of the Independent Variables
VARIABLE   DESCRIPTION
Rp         Rental price in pesetas. For the NN this variable was normalized and, for the GA, it was divided by 1000.
M2         Square metres
Cond       Dichotomous variable that takes the value 1 if the house is classified by the agent as in excellent condition to occupy.
Ind        Index built as the sum of 5 structural characteristics: existence of lumber-room, grocery-store, central heating, elevator and whether the kitchen is furnished.
Actecon    Variable that captures the economic activity of the street where the house is located. It is calculated as the ratio between the number of businesses in the street and the number of houses.
Npg        Number of garage places
Ncb        Number of bathrooms
As Table 2 shows, the results are classified in three divisions. The first presents the parametric models: linear, semi-log, double-log and quadratic semi-log. This last model takes into account the squared and cross-product effects between variables and was already employed successfully by Rasmussen and Zuehlke (1990). The best parametric result is an out-of-sample R² of 0.7689, obtained by the quadratic semi-log model. However, the sign of the variable IndCond is not consistent with a priori expectations. For the nonparametric case, Graphic 1 shows the sensitivity of the R² to the number of nearest neighbours. The highest accuracy is achieved for K=30, with R²=0.8575. We can also use the graphic to verify the existence of important non-linearity in the hedonic regression. The accuracy gets worse when more neighbours are considered. Therefore, the local regression achieves higher accuracy than when all the points are considered (this case represents the parametric linear regression).
Graphic 1. Local Regression. K Nearest Neighbours Determination
Among the semi-parametric methods, we start by analysing the Neural Network model. The number of hidden neurons finally selected was 3. As we can observe, the NN achieves a better result than the parametric methods and a slight improvement over the local regression. On the other hand, the GA presents an accuracy better than the parametric models but worse than the nonparametric method and the NN. However, in contrast to these methods, the GA yields an explicit non-linear expression that represents the relation between the variables. Two important aspects can be emphasized. First, the expression is made up of a non-linear component (affecting the variables M2 and Cond) and a linear component (variables Actecon and Ind). Second, the effects of the variables on the rental price are those expected a priori.
Table 2. Out-of-Sample Accuracy and Comparison between Models
HEDONIC REGRESSION METHOD   MODEL                 R² OUT-OF-SAMPLE
PARAMETRIC METHODS          Linear Regression     0.7232
                            Semi-log              0.7375
                            Double-log            0.6996
                            Quadratic semi-log    0.7689
NONPARAMETRIC METHOD        Local Regression      0.8575
                            (the number of nearest neighbours considered was 30)
SEMI-PARAMETRIC METHODS     Neural Network        0.8621
                            (feed-forward back-propagation with 3 layers and 1 neuron in the hidden layer)
                            Genetic Algorithm     0.8220
In summary, we have been able to show that data-driven methods such as the Neural Network and the Genetic Algorithm (semi-parametric methods) and nearest neighbours (nonparametric method) capture substantial non-linearity that models based on linear transformations cannot fully capture, as reflected in out-of-sample performance.
IV. Conclusions
The empirical results presented in this paper demonstrate the usefulness of applying nonparametric and semi-parametric data-driven models in the estimation of hedonic price functions. In all cases, the data-driven models outperform the parametric models in terms of out-of-sample R². Despite this improvement, one problem with nearest neighbours and neural networks is the loss of interpretability. They do not offer an explicit equation in which we can analyse the effect of each independent variable on the rental price. On the other hand, the GA yields an analytical expression that is easy to interpret and highly accurate (less than the other data-driven techniques but better than the parametric methods). The problem is that we cannot carry out statistical inference and hypothesis testing.
References
Álvarez A., A. Orfila and J. Tintoré (2001) “DARWIN: an Evolutionary Program for Nonlinear Modeling of Chaotic Time Series”, Computer Physics Communications, in press.
Curry B., Morgan P. and Silver M. (2001) “Hedonic regressions: mis-specification and neural networks”, Applied Economics, 33, 659-671.
Deboeck G. (1994) “Trading on the Edge: Neural, Genetic and Fuzzy Systems for Chaotic Financial Markets”, ed. Guido Deboeck, John Wiley & Sons.
Gately E. (1996) “Neural Networks for Financial Forecasting”, ed. P. J. Kaufman, John Wiley & Sons.
Goldberg D. E. (1989) “Genetic Algorithms in Search, Optimization and Machine Learning”, Reading, MA: Addison-Wesley.
Holland J. H. (1975) “Adaptation in Natural and Artificial Systems”, Ann Arbor: The University of Michigan Press.
Koza (1992) “Genetic Programming: On the Programming of Computers by Means of Natural Selection”, The MIT Press, Cambridge.
Medeiros M. and Teräsvirta T. (2001) “Statistical methods for modelling neural networks”, paper in process.
Qi M. (1999) “Nonlinear predictability of stock returns using financial and economic variables”, Journal of Business & Economic Statistics, 17, 4, 419-429.
Rasmussen and Zuehlke (1990) “On the choice of functional form for hedonic price functions”, Applied Economics, 22, 431-438.
Smith M. (1995) “Neural Networks for Statistical Modelling”, Van Nostrand Reinhold, New York.