Selecting examples for regression
Supervisor: João Mendes Moreira (jmoreira@fe.up.pt)
Co-supervisor: Carlos Soares (csoares@fep.up.pt)
Data mining (DM) is becoming an increasingly important technology for businesses
[Ghani & Soares, 2006; Moreira et al., 2005]. One important task addressed with DM
techniques is prediction. The DM approach to prediction consists of using inductive learning
methods, which analyze available data to generate a model. This model is then used to make
predictions for new examples. For instance, the examples could represent bus trips from a
public transportation company, and the prediction target could be the duration of the trip
[Moreira et al., 2005]. Different methods can be used for prediction tasks, including neural
networks, support vector machines and decision trees.
The data used to train the model is referred to as the training set. The success of a DM
approach to prediction depends on how suitable the algorithm is for the training set and how
representative the training set is of the new examples. Different pre-processing tasks can be
used to address these issues [Reinartz, 2002; Blum & Langley, 1997], namely: example (or
instance) selection [Liu & Motoda, 2001], feature selection [Guyon & Elisseeff, 2003], and
domain values definition. The goal of example selection is to identify the data from the
training set that are expected to yield the best possible model for a particular learning
algorithm. There are some successful approaches to example selection, such as the one for the
linear kernel of support vector machines [Moreira et al., 2006]. A possible approach to
example selection that has not been sufficiently explored is metalearning
[Brazdil et al., 2009; Crammer et al., 2008]. This approach consists of learning about
learning algorithms (hence the prefix "meta"), i.e., using inductive learning methods on the
results of previous prediction problems to choose the experimental setup that is most
suitable for a new prediction problem. In this particular case, the experimental setup to be
selected consists of both the subset of data used for training and the learning algorithm.
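The idea of selecting an experimental setup (a training subset together with a learning
algorithm) can be illustrated with a minimal sketch. It is not the method of
[Moreira et al., 2006] nor a metalearning system: it simply scores every candidate setup on a
held-out validation set and keeps the best one, whereas metalearning would learn this choice
from results on previous problems. The toy trip data, the two subsets, the two learners and
all function names are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical toy data: (hour_of_day, trip_duration_in_minutes).
trips = [(8, 42.0), (9, 40.0), (8, 44.0), (14, 30.0), (15, 31.0), (14, 29.0)]
validation = [(15, 30.0), (14, 32.0)]  # new off-peak trips to be predicted


def mean_learner(train):
    """Constant model: always predicts the mean training duration."""
    avg = mean([duration for _, duration in train])
    return lambda hour: avg


def nearest_hour_learner(train):
    """1-nearest-neighbour model on the hour of day."""
    return lambda hour: min(train, key=lambda t: abs(t[0] - hour))[1]


def mean_absolute_error(model, data):
    return mean([abs(model(hour) - duration) for hour, duration in data])


# Candidate training subsets (example selection) and candidate learners.
subsets = {
    "all trips": trips,
    "off-peak only": [t for t in trips if 10 <= t[0] < 17],
}
learners = {"mean": mean_learner, "1-NN": nearest_hour_learner}


def select_setup(subsets, learners, validation):
    """Score every (subset, learner) pair and return the best one."""
    scores = {}
    for subset_name, subset in subsets.items():
        for learner_name, learner in learners.items():
            model = learner(subset)
            scores[(subset_name, learner_name)] = mean_absolute_error(model, validation)
    best = min(scores, key=scores.get)
    return best, scores


best_setup, scores = select_setup(subsets, learners, validation)
print(best_setup, scores[best_setup])
```

On this toy data the constant model trained only on off-peak trips has the lowest validation
error, which shows that discarding unrepresentative examples can matter as much as the choice
of learner.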
The goal of this project is to combine and extend previous work on example selection
[Moreira et al., 2006] and metalearning [Brazdil et al., 2009] for the problem of selecting
the best training data for a prediction problem. The work will be empirically tested on
several datasets, including an application of bus travel time prediction from the public
transportation company of Porto (STCP).
[Blum & Langley, 1997] Blum, A. L., P. Langley (1997). “Selection of relevant features and
examples in machine learning.” Artificial Intelligence 97(1-2): 245-271.
[Brazdil et al., 2009] Brazdil, P., C. Giraud-Carrier, C. Soares, R. Vilalta (2009).
“Metalearning: applications to Data Mining.” Springer.
[Crammer et al., 2008] Crammer, K., M. Kearns, J. Wortman (2008). “Learning from Multiple
Sources.” Journal of Machine Learning Research 9: 1757-1774.
[Ghani & Soares, 2006] Ghani, R., C. Soares (2006). “Data mining for business applications:
KDD-2006 workshop.” ACM SIGKDD Explorations Newsletter.
[Guyon & Elisseeff, 2003] Guyon, I., A. Elisseeff (2003). “An introduction to variable and
feature selection.” Journal of Machine Learning Research 3: 1157-1182.
[Liu & Motoda, 2001] Liu, H., H. Motoda, Eds. (2001). “Instance selection and construction
for data mining.” Kluwer Academic Publishers.
[Moreira et al., 2005] Moreira, J. M., A. M. Jorge, J. F. Sousa, C. Soares (2005). “A Data
Mining approach for trip time prediction in mass transit companies.” Workshop on Data
Mining for Business at ECML/PKDD 2005, Porto, Portugal, 63-66.
[Moreira et al., 2006] Moreira, J. M., A. M. Jorge, C. Soares, J. F. Sousa (2006). “Improving
SVM-linear predictions using CART for example selection.” International Symposium on
Methodologies for Intelligent Systems, Springer, LNAI 4203: 632-641.
[Reinartz, 2002] Reinartz, T. (2002). “A unifying view on instance selection.” Data Mining
and Knowledge Discovery 6(2): 191-210.

Development of regression algorithms for censored data
Supervisor: João Mendes Moreira (jmoreira@fe.up.pt)
Prediction methods are an important technology for businesses. Regression refers to the
prediction of numeric variables, while classification refers to the prediction of categorical
variables. In certain areas of business that use regression methods, the numeric variable is
bounded. An example is survival analysis, a branch of statistics which deals with death in
biological organisms and failure in mechanical systems. Another example is the analysis of
performance (in this case, a real value bounded between 0 and 1) using exogenous variables in
Data Envelopment Analysis (DEA), a state-of-the-art benchmarking method. Many other problems
exist where data is left-censored, right-censored or interval-censored. A common approach to
this kind of problem is Tobit regression, a parametric statistical method. However, the
assumptions of this model (homogeneous variance and independence of the errors) limit the
range of problems where it can be applied.
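To make the three kinds of censoring concrete, the following minimal sketch evaluates the
Gaussian log-likelihood that underlies Tobit-type models for a one-variable linear predictor.
It assumes independent errors with constant variance, i.e., exactly the assumptions discussed
above. The function names, the (x, lower, upper) encoding of observations and the toy values
are assumptions made for illustration; no estimation routine is included.

```python
import math


def norm_logpdf(z):
    """Log-density of the standard normal distribution at z."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def norm_cdf(z):
    """Cumulative distribution function of the standard normal at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def censored_loglik(observations, b0, b1, sigma):
    """Log-likelihood of a linear model with i.i.d. Gaussian errors.

    Each observation is (x, lower, upper), bounding the latent response:
      lower == upper -> exact (non-censored) value
      lower is None  -> left-censored (only an upper bound is known)
      upper is None  -> right-censored (only a lower bound is known)
      otherwise      -> interval-censored
    """
    total = 0.0
    for x, lower, upper in observations:
        mu = b0 + b1 * x
        if lower is not None and lower == upper:
            # Exact value: contributes the density.
            total += norm_logpdf((lower - mu) / sigma) - math.log(sigma)
        else:
            # Censored value: contributes the probability mass of the interval.
            p_upper = 1.0 if upper is None else norm_cdf((upper - mu) / sigma)
            p_lower = 0.0 if lower is None else norm_cdf((lower - mu) / sigma)
            total += math.log(p_upper - p_lower)
    return total


# Hypothetical performance scores bounded between 0 and 1.
observations = [
    (1.0, 0.7, 0.7),    # exact value
    (2.0, 1.0, None),   # right-censored at the upper bound 1
    (0.0, None, 0.0),   # left-censored at the lower bound 0
    (1.5, 0.4, 0.6),    # interval-censored
]
print(censored_loglik(observations, b0=0.1, b1=0.4, sigma=0.3))
```

Maximizing this function over b0, b1 and sigma would give the parametric fit; an inductive
learning algorithm adapted to censored data would need a comparable way of using the partial
information carried by the censored examples.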
In the last few years, new inductive learning algorithms have been developed for the
regression problem. Support vector regression [Smola & Scholkopf, 2004], random forests
[Breiman, 2001] and random decision trees [Fan et al., 2006] are examples of such methods.
Inductive learning methods do not make any assumptions about the data. However, they
typically assume non-censored data. For this reason, there is ongoing research on adapting
existing inductive learning algorithms to problems with censored output data. This is the
case of random survival forests [Ishwaran et al., 2008] for right-censored survival data.
This proposal focuses on inductive learning approaches for interval-censored data. The work
will be empirically tested on several datasets, including an application on the evaluation of
bus line performance from the public transportation company of Porto (STCP).
[Breiman, 2001] Breiman, L. (2001). “Random forests.” Machine Learning 45: 5-32.
[Fan et al., 2006] Fan, W., J. McCloskey, et al. (2006). “A general framework for accurate
and fast regression by data summarization in random decision trees.” The 12th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining.
[Ishwaran et al., 2008] Ishwaran, H., et al. (2008). “Random survival forests.” The Annals of
Applied Statistics 2(3): 841-860.
[Smola & Scholkopf, 2004] Smola, A. J. and B. Scholkopf (2004). “A tutorial on support
vector regression.” Statistics and Computing 14: 199-222.

Towards an off-the-shelf method for heterogeneous ensembles
Supervisor: João Mendes Moreira (jmoreira@fe.up.pt)
In the early nineties, the use of multiple models (also named ensembles) to accomplish the
prediction task gained relevance due to works using homogeneous ensembles (i.e., ensembles
using the same induction algorithm [Hansen & Salamon, 1990]) and heterogeneous ensembles
(ensembles using diverse induction algorithms [Perrone & Cooper, 1993]). Although the use of
ensembles was not new at that time, it was from then on that ensemble learning became a major
research line for different research communities, such as those on neural networks, machine
learning, artificial intelligence, pattern recognition and computational statistics, among
others.
Bagging [Breiman, 1996], boosting [Freund & Schapire, 1996], random forests [Breiman, 2001]
and random decision trees [Fan et al., 2006] are some of the ensemble methods for prediction
that have obtained good results. The first three are nowadays important benchmarks for
prediction methods.
All the methods mentioned above use homogeneous ensembles. However, some studies show that
heterogeneous ensembles can improve on the results of homogeneous ones
[Wichard et al., 2003; Moreira, 2008]. The main difficulty of this approach is that the best
tuning and parameter set is problem dependent, which significantly reduces its use as an
off-the-shelf method, i.e., its use by non-experts. Nevertheless, this is a promising area of
research that has not yet been sufficiently explored.
The goal of this project is to develop off-the-shelf heterogeneous ensembles, enabling their
use by a broader community. The work will be empirically tested on several datasets,
including an application of bus travel time prediction from the public transportation company
of Porto (STCP).
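As an illustration of what a heterogeneous ensemble is, the sketch below combines three
different induction algorithms for one-variable regression by averaging their predictions.
The choice of base learners (a constant model, 1-nearest neighbour and least-squares linear
regression), the unweighted average and all function names are assumptions made for
illustration; they are not the methods of the cited works, and no tuning is performed.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Constant model: always predicts the mean of the training targets."""
    avg = mean(ys)
    return lambda x: avg


def fit_nearest_neighbour(xs, ys):
    """1-nearest-neighbour model: returns the target of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def fit_linear(xs, ys):
    """Ordinary least-squares line fitted in closed form."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def fit_heterogeneous_ensemble(xs, ys, fitters):
    """Train one model per induction algorithm and average their predictions."""
    models = [fit(xs, ys) for fit in fitters]
    return lambda x: mean([model(x) for model in models])


# Hypothetical toy data following y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

ensemble = fit_heterogeneous_ensemble(
    xs, ys, [fit_mean, fit_nearest_neighbour, fit_linear]
)
print(ensemble(3.0))
```

Even in this small example the result depends on which base learners are included and how
they are combined, which is the tuning burden that an off-the-shelf method would have to
remove.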
[Breiman, 1996] Breiman, L. (1996). “Bagging predictors.” Machine Learning 24: 123-140.
[Breiman, 2001] Breiman, L. (2001). “Random forests.” Machine Learning 45: 5-32.
[Fan et al., 2006] Fan, W., J. McCloskey, et al. (2006). “A general framework for accurate
and fast regression by data summarization in random decision trees.” The 12th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining.
[Freund & Schapire, 1996] Freund, Y. and R. Schapire (1996). “Experiments with a new
boosting algorithm.” International Conference on Machine Learning, 148-156.
[Hansen & Salamon, 1990] Hansen, L. K. and P. Salamon (1990). “Neural network ensembles.”
IEEE Transactions on Pattern Analysis and Machine Intelligence 12(10): 993-1001.
[Moreira, 2008] Moreira, J. M. (2008). “Travel time prediction for the planning of mass
transit companies: a machine learning approach.” Ph.D. thesis, Faculty of Engineering,
University of Porto.
[Perrone & Cooper, 1993] Perrone, M. P. and L. N. Cooper (1993). “When networks disagree:
ensemble methods for hybrid neural networks.” Neural Networks for Speech and Image
Processing. R. J. Mammone (Ed.), Chapman-Hall.
[Wichard et al., 2003] Wichard, J., C. Merkwirth, et al. (2003). “Building ensembles with
heterogeneous models.” Course of the International School on Neural Nets, Salerno, Italy.

A decision support system for timetable adjustments
Supervisor: João Mendes Moreira (jmoreira@fe.up.pt)
Co-supervisor: Jorge Freire de Sousa (jfsousa@fe.up.pt)
In recent years, public transportation companies have made important investments in order to
have real-time data about their service. One of those investments was the implementation of
Automatic Vehicle Location (using GPS) in order to know where each bus is at each moment.
With this kind of system, companies are able to store both actual and planned data. With the
actual data, it is now possible to improve operational planning. An important operational
planning task at this kind of company is the definition of timetables. Until recently,
timetables were defined using only the average travel times.
With the existence of actual data, it is possible to take into account the variability of
travel times or the level of vehicle occupancy, for instance. This problem is usually seen as
a single-objective optimization problem [Carey, 1998; Zhao et al., 2006], namely, the
minimization of the passengers' waiting time. In [Moreira, 2008] it is shown that two
objectives must be considered: the maximization of passengers' satisfaction (not necessarily
the same as the minimization of passengers' waiting time, at least for some lines) and the
minimization of operational costs. In fact, rather than a method that solves the partial
problem in a deterministic way (as in [Zhao et al., 2006]), schedulers need a tool that gives
them insights into the best solution, at least while there are no answers to questions such
as "how does passengers' waiting time compare with the operational cost of an additional
bus?" or "what is the impact of reducing slack times on operational costs?". The
multi-objective nature of the problem justifies its study and the use of decision support
systems.
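The effect of variability on passengers' waiting time can be illustrated with a standard
result for high-frequency lines. If passengers arrive uniformly at random and board the first
bus that arrives, the expected waiting time is E[H^2] / (2 E[H]), where H is the headway
between consecutive buses. The sketch below is not taken from the cited works; the function
name and the headway values are assumptions made for illustration.

```python
def expected_wait(headways):
    """Expected waiting time for passengers arriving uniformly at random.

    headways: observed times between consecutive buses, in minutes.
    Assumes every passenger boards the first bus that arrives.
    """
    total = sum(headways)
    if total <= 0:
        raise ValueError("headways must contain positive values")
    return sum(h * h for h in headways) / (2.0 * total)


# Two hypothetical services with the same average headway of 10 minutes.
regular = [10.0, 10.0, 10.0, 10.0]
irregular = [5.0, 15.0, 5.0, 15.0]

print(expected_wait(regular))
print(expected_wait(irregular))
```

Both services run the same number of buses, so their operational cost is similar, yet the
irregular one makes passengers wait longer on average. This is one reason why a timetable
built only from average travel times can miss part of the problem.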
The expected areas of study are: (1) evaluation measures for the degree of achievement of
the two objectives of the problem, depending on the bus line type [Strathman et al., 1998];
(2) analytical solutions for the partial problem of minimizing passengers' waiting time for
the different bus line types (the one presented in [Zhao et al., 2006] is a good starting
point for the case of high-frequency lines); (3) detection of systematic delays using data
mining approaches (the one presented in [Duarte, 2008] is a good starting point); (4) design
and development of decision support systems for timetable adjustments.
[Carey, 1998] Carey, M. (1998). “Optimizing scheduled times, allowing for behavioural
response.” Transportation Research Part B, 32(5): 329-342.
[Duarte, 2008] Duarte, E. (2008). “Técnicas de Mineração de Dados para suporte à decisão
no Planeamento de Horários em Empresas de Transportes Públicos.” M.Sc. thesis,
University of Minho.
[Moreira, 2008] Moreira, J. M. (2008). “Travel time prediction for the planning of mass
transit companies: a machine learning approach.” Ph.D. thesis, University of Porto.
[Strathman et al., 1998] Strathman, J. G., K. J. Dueker, T. Kimpel, R. Gerhart, K. Turner,
P. Taylor, S. Callas, D. Griffin, J. Hopper (1998). “Automated bus dispatching, operations
control and service reliability: analysis of Tri-Met baseline service date.” Technical report,
University of Washington, U.S.A.
[Zhao et al., 2006] Zhao, J., M. Dessouky, S. Bukkapatnam (2006). “Optimal slack time for
schedule-based transit operations.” Transportation Science, 40(4): 529-539.