Algorithms for Learning

Algorithms for Learning
Regression Trees and Ensembles
on Evolving Data Streams
Elena Ikonomovska
Doctoral Dissertation
Jozef Stefan International Postgraduate School
Ljubljana, Slovenia, October 2012
Evaluation Board:
Asst. Prof. Dr. Bernard Zenko, Chairman, Jozef Stefan Institute, Ljubljana, Slovenia
Asst. Prof. Dr. Zoran Bosnic, Member, Faculty of Computer and Information Science, University of Ljubljana, Slovenia
Dr. Albert Bifet, Member, Yahoo Research, Barcelona, Spain
Elena Ikonomovska
ALGORITHMS FOR LEARNING
REGRESSION TREES AND ENSEMBLES
ON EVOLVING DATA STREAMS
Doctoral Dissertation
ALGORITMI ZA UČENJE REGRESIJSKIH
DREVES IN ANSAMBLOV IZ
SPREMENLJIVIH PODATKOVNIH TOKOV
Doktorska disertacija
Supervisor: Prof. Dr. Saso Dzeroski
Co-Supervisor: Prof. Dr. João Gama
Ljubljana, Slovenia, October 2012
To my mother Slavica
Contents
Abstract IX
Povzetek XI
Abbreviations XIII
1 Introduction 1
1.1 Context.......................................1
1.2 Goals........................................4
1.3 Methodology....................................4
1.4 Contributions....................................5
1.5 Organization of the Thesis.............................5
2 Learning from Data Streams 7
2.1 Overview......................................7
2.2 Supervised Learning and the Regression Task..................8
2.2.1 The Task of Regression..........................8
2.2.2 Learning as Search.............................9
2.3 Learning under a Sampling Strategy.......................10
2.3.1 Probably Approximately Correct Learning................10
2.3.2 Sequential Inductive Learning.......................12
2.4 The Online Learning Protocol...........................14
2.4.1 The Perceptron and the Winnow Algorithms..............15
2.4.2 Predicting from Experts Advice......................16
2.5 Learning under Non-stationary Distributions..................17
2.5.1 Tracking the Best Expert.........................18
2.5.2 Tracking Dierences over Sliding Windows...............18
2.5.3 Monitoring the Learning Process.....................19
2.6 Methods for Adaptation..............................21
3 Decision Trees,Regression Trees and Variants 23
3.1 The Tree Induction Task..............................23
3.2 The History of Decision Tree Learning......................25
3.2.1 Using Statistical Tests...........................26
3.2.2 Improving Computational Complexity:Incremental Learning.....27
3.3 Issues in Learning Decision and Regression Trees................28
3.3.1 Stopping Decisions.............................28
3.3.2 Selection Decisions.............................29
3.4 Model Trees.....................................30
3.5 Decision and Regression Trees with Options...................32
3.6 Multi Target Decision and Regression Trees...................34
3.6.1 Covariance-Aware Methods........................34
3.6.2 Covariance-Agnostic Methods.......................35
4 Ensembles of Decision and Regression Trees 39
4.1 The Intuition Behind Learning Ensembles....................39
4.2 Bias,Variance and Covariance..........................40
4.3 Methods for Generating Ensembles........................42
4.3.1 Diversifying the Set of Accessible Hypotheses..............44
4.3.1.1 Diversification of the Training Data..............44
4.3.1.2 Diversification of the Input Space...............45
4.3.1.3 Diversification in the Output Space..............45
4.3.2 Diversification of the Traversal Strategy.................46
4.4 Ensembles of Classifiers for Concept Drift Detection..............47
5 Experimental Evaluation of Online Learning Algorithms 49
5.1 Criteria for Online Evaluation...........................49
5.2 Evaluation Metrics.................................50
5.2.1 Error Metrics................................50
5.2.2 Metrics for Model's Complexity......................52
5.2.3 Metrics for Change Detection.......................52
5.3 Evaluation Approaches...............................53
5.3.1 Holdout Evaluation............................53
5.3.2 Prequential Evaluation...........................53
5.4 Online Bias-Variance Analysis...........................54
5.5 Comparative Assessment..............................55
5.6 Datasets.......................................57
5.6.1 Artificial Datasets.............................57
5.6.1.1 Concept Drift..........................58
5.6.1.2 Multiple Targets.........................59
5.6.2 Real-World Datasets............................61
5.6.2.1 Protein 3D Structure Prediction................61
5.6.2.2 City Traffic Congestion Prediction...............62
5.6.2.3 Flight Arrival Delay Prediction.................62
5.6.2.4 Datasets with Multiple Targets.................62
6 Learning Model Trees from Time-Changing Data Streams 65
6.1 Online Sequential Hypothesis Testing for Learning Model Trees........65
6.2 Probabilistic Sampling Strategies in Machine Learning.............67
6.3 Hoeffding-based Regression and Model Trees..................70
6.4 Processing of Numerical Attributes........................72
6.5 Incremental Linear Model Trees..........................76
6.6 Drift Detection Methods in FIMT-DD......................79
6.6.1 The Page-Hinkley Test...........................79
6.6.2 An improved Page-Hinkley test......................80
6.7 Strategies for Adaptation.............................81
6.8 Empirical Evaluation of Online and Batch Learning of Regression and Model
Trees Induction Algorithms............................82
6.8.1 Predictive Accuracy and Quality of Models...............83
6.8.2 Memory and Time Requirements.....................85
6.8.3 Bias-Variance Analysis...........................87
6.8.4 Sensitivity Analysis............................87
6.9 Empirical Evaluation of Learning under Concept Drift.............89
6.9.1 Change Detection.............................89
6.9.2 Adaptation to Change...........................90
6.9.3 Results on Real-World data........................94
6.10 Summary......................................95
7 Online Option Trees for Regression 97
7.1 Capping Options for Hoeffding Trees.......................97
7.2 Options for Speeding-up Hoeffding-based Regression
Trees.........................................98
7.2.1 Ambiguity-based Splitting Criterion...................99
7.2.2 Limiting the Number of Options.....................101
7.3 Methods for Aggregating Multiple Predictions..................102
7.4 Experimental Evaluation of Online Option Trees for Regression........103
7.4.1 Predictive Accuracy and Quality of Models...............103
7.4.2 Bias-Variance Analysis...........................105
7.4.3 Analysis of Memory and Time Requirements..............107
7.5 Summary......................................111
8 Ensembles of Regression Trees for Any-Time Prediction 113
8.1 Methods for Online Sampling...........................113
8.1.1 Online Bagging...............................114
8.1.1.1 Online Bagging for Concept Drift Management........114
8.1.1.2 Online Bagging for RandomForest...............115
8.1.2 Online Boosting..............................116
8.2 Stacked Generalization with Restricted Hoeffding Trees............117
8.3 Online RandomForest for Any-time Regression.................118
8.4 Experimental Evaluation of Ensembles of Regression
Trees for Any-Time Prediction..........................120
8.4.1 Predictive Accuracy and Quality of Models...............121
8.4.2 Analysis of Memory and Time Requirements..............125
8.4.3 Bias-Variance Analysis...........................126
8.4.4 Sensitivity Analysis............................127
8.4.5 Responsiveness to Concept Drift.....................128
8.5 The Diversity Dilemma..............................129
8.6 Summary......................................132
9 Online Predictive Clustering Trees for Multi-Target Regression 135
9.1 Online Multi-Target Classication........................135
9.2 Online Learning of Multi-Target Model Trees..................137
9.2.1 Extensions to the Algorithm FIMT-DD.................138
9.2.2 Split Selection Criterion..........................138
9.2.3 Linear Models for Multi-Target Attributes................143
9.3 Experimental Evaluation..............................144
9.4 Further Extensions.................................147
9.5 Summary......................................148
10 Conclusions 149
10.1 Original Contributions...............................150
10.2 Further Work....................................153
11 Acknowledgements 155
12 References 157
Index of Figures 173
Index of Tables 177
List of Algorithms 181
Appendices
A Additional Experimental Results 184
A.1 Additional Results for Section 6.8.........................184
A.2 Additional Learning Curves for Section 7.4...................188
A.3 Additional Error Bars for Section 8.4.......................191
A.4 Additional Learning Curves for Section 8.4...................194
A.5 Additional Results for Section 8.4.........................198
A.6 Additional Learning Curves for Section 8.4...................200
B Bibliography 205
B.1 Publications Related to the Thesis........................205
B.1.1 Original Scientific Articles.........................205
B.1.2 Published Scientific Conference Contributions..............205
B.2 Publications not Related to the Thesis......................206
B.2.1 Published Scientific Conference Contributions..............206
B.3 Articles Pending for Publication Related to the Thesis.............206
C Biography 207
Abstract
In this thesis we address the problem of learning various types of decision trees from
time-changing data streams. In particular, we study online machine learning algorithms for
learning regression trees, linear model trees, option trees for regression, multi-target model
trees, and ensembles of model trees from data streams. These are the most representative
and widely used models in the category of interpretable predictive models.
A data stream is an inherently unbounded sequence of data elements (numbers, coordinates,
multi-dimensional points, tuples, or objects of an arbitrary type). It is characterized by
high inbound rates and non-stationary data distributions. Real-world scenarios where
processing data streams is a necessity arise from various management systems deployed
on top of sensor networks that monitor the performance of smart power grids, city traffic
congestion, or scientific studies of environmental changes.
Because this type of data cannot easily be stored or transported to a central database
without overwhelming the communication infrastructure, data processing and analysis has
to be done in-situ and in real-time, using a constant amount of memory. To enable in-situ
real-time learning, it is crucial to perform an incremental computation of unbiased estimates
for various types of statistical measures. This requires methods that enable us to collect an
appropriate sample from the incoming data stream and compute the necessary statistics
and estimates for the evaluation functions on-the-fly.
We approached the problem of obtaining unbiased estimates on-the-fly by treating the
evaluation functions as random variables. This enabled the application of existing probability
bounds, among which the best results were achieved when using the Hoeffding bound.
The algorithms proposed in this thesis therefore use the Hoeffding probability bound for
bounding the probability of error when approximating the sample mean of a sequence of
random variables. This approach gives us the statistical machinery for scaling up various
machine learning tasks.
With our research we address three main sub-problems that arise in learning tree-based
models from time-changing data streams. The first one is concerned with the non-stationarity
of concepts and the need for an informed adaptation of the decision tree. We
propose online change detection mechanisms integrated within the incrementally learned
model. The second sub-problem is related to the myopia of decision tree learning algorithms
while searching the space of possible models. We address this problem through a study and a
comparative assessment of online option trees for regression and ensembles of model trees.
We advocate the introduction of options for improving the performance, stability and quality
of standard tree-based models. The third sub-problem is related to the applicability of the
proposed approach to the multi-target prediction task. This thesis proposes an extension of
the predictive clustering framework to the online domain by incorporating Hoeffding-bound
probabilistic estimates. The conducted study opened many interesting directions for further
work.
The algorithms proposed in this thesis are empirically evaluated on several stationary and
non-stationary datasets for single and multi-target regression problems. The incremental
algorithms were shown to perform favorably in comparison to existing batch learning
algorithms, while having lower variability in their predictions due to variations in the
training data. Our change detection and adaptation methods were shown to successfully
track changes in real-time and enable appropriate adaptations of the model. We have further
shown that option trees improve the accuracy of standard regression trees more than ensemble
learning methods, without harming their robustness. Lastly, the comparative assessment of
single-target and multi-target model trees has shown that multi-target regression trees offer
comparable performance to a collection of single-target model trees, while having lower
complexity and better interpretability.
Povzetek
In this dissertation we address the problem of learning various types of decision trees from
data streams that change over time. We focus primarily on the study of online machine
learning algorithms for learning regression trees, linear model trees, option trees for
regression, multi-target model trees, and ensembles of model trees from time-changing data
streams. These are the most representative and widely used classes of models from the
group of interpretable predictive models.
A data stream is an unbounded sequence of data items (numbers, coordinates, multi-dimensional
points, tuples, or objects of an arbitrary type). It is characterized by a high rate
of incoming data whose distributions are not stationary. Practical real-world scenarios that
require the processing of data streams include various systems for managing sensor networks,
intended for monitoring the performance of smart power grids, tracking traffic congestion
in cities, or supporting scientific studies of climate change.
Since such data cannot simply be stored or transferred to a central database without
overloading the communication infrastructure, it must be processed and analyzed on-the-fly
and in-place, using a constant amount of memory. When learning from data streams, the
most important task is the incremental computation of unbiased estimates of various
statistical measures. For this purpose we need methods that enable the implicit collection
of appropriate samples from the incoming data stream and the on-the-fly computation of
the necessary statistics.
In this dissertation we approached the problem of computing an unbiased estimate of an
evaluation function by treating it as a random variable. This enabled us to apply existing
probability bounds, among which the best results were achieved with the Hoeffding bound.
The algorithms proposed in the dissertation use the Hoeffding probability bound to bound
the probability of error when approximating the mean of a sample from a sequence of random
variables. This approach gives us a statistical mechanism for efficiently solving the various
machine learning tasks addressed in the dissertation.
Z nasim raziskovalnim delom se posvecamo resevanju treh glavnih podproblemov,ki jih
srecamo pri ucenju drevesnih modelov iz casovno spremenljivih podatkovnih tokov.Prvi
podproblem zadeva nestacionarnost konceptov in potrebo po informiranem in smiselnem
prilagajanju odlocitvenega drevesa.V disertaciji predlagamo mehanizem za sprotno za-
znavanje sprememb,ki je vkljucen v inkrementalno nauceni model.Drugi podproblem je
kratkovidnost algoritmov za ucenje odlocitvenih dreves pri njihovem preiskovanju prostora
moznih modelov.Tega problema se lotimo s studijo in primerjalnim vrednotenjem sprotnih
opcijskih dreves za regresijo in ansamblov modelnih dreves.Predlagamo uporabo opcij za
izboljsanje zmogljivosti,stabilnosti in kvalitete obicajnih drevesnih modelov.Tretji problem
je povezan z uporabnostjo predlaganega pristopa v nalogah vec-ciljnega napovedovanja.V
disertaciji predlagamo razsiritev napovednega razvrscanja v smeri sprotnega ucenja proble-
mih z vkljucitvijo verjetnostnih priblizkov,ki so omejeni s Hoedingovo mejo.Opravljene
studije so odprle mnogo zanimivih smeri za nadaljnje delo.
The algorithms proposed in the dissertation are empirically evaluated on several stationary
and non-stationary datasets for single- and multi-target regression problems. The incremental
algorithms proved to be better than existing batch learning algorithms, while also exhibiting
smaller fluctuations in their predictions under variations in the training data. Our methods
for detecting and adapting to changes proved successful at discovering changes in real time
and enabled appropriate adaptations of the models. We have also shown that option trees
improve the accuracy of standard regression trees more than tree ensembles do; they can
improve the ability to model a given problem without losing robustness. Finally, the
comparative evaluation of single-target and multi-target model trees has shown that
multi-target regression trees offer performance comparable to that of a collection of many
single-target trees, while being simpler and easier to interpret.
Abbreviations
PAC = Probably Approximately Correct
VC = Vapnik-Chervonenkis
SPRT = Sequential Probability Ratio Test
SPC = Statistical Process Control
WSS = Within Sum of Squares
TSS = Total Sum of Squares
AID = Automatic Interaction Detector
MAID = Multivariate Automatic Interaction Detector
THAID = THeta Automatic Interaction Detector
CHAID = CHi-squared Automatic Interaction Detection
QUEST = Quick Unbiased Efficient Statistical Tree
LDA = Linear Discriminant Analysis
QDA = Quadratic Discriminant Analysis
SSE = Sum of Square Errors
EM = Expectation-Maximization
RSS = Residual Sum of Squares
TDDT = Top-Down Decision Tree
MTRT = Multi-target Regression Tree
WSSD = Within Sum of Squared Distances
TSSD = Total Sum of Squared Distances
RBF = Radial Basis Function
RSM = Random Sampling Method
MSE = Mean Squared Error
RE = Relative (mean squared) Error
RRSE = Root Relative (mean) Squared Error
MAE = Mean Absolute Error
RMAE = Relative Mean Absolute Error
CC = Correlation Coefficient
PSP = Protein Structure Prediction
PSSM = Position-Specific Scoring Matrices
IMTI = Incremental Multi Target Induction
RSS = Residual Sums of Squares
RLS = Recursive Least Squares
ANN = Artificial Neural Network
PH = Page-Hinkley
ST = Single Target
MT = Multiple Target
1 Introduction
First will what is necessary, then love
what you will.
Tim O'Reilly
Machine Learning is the study of computer algorithms that are able to learn automatically
through experience. It has become one of the most active and prolific areas of computer
science research, in large part because of its wide-spread applicability to problems as diverse
as natural language processing, speech recognition, spam detection, document search,
computer vision, gene discovery, medical diagnosis, and robotics. Machine learning algorithms
are data driven, in the sense that the success of learning relies heavily on the scope and
the amount of data provided to the learning algorithm. With the growing popularity of the
Internet and social networking sites (e.g., Facebook), new sources of data on the preferences,
behavior, and beliefs of massive populations of users have emerged. Ubiquitous measuring
elements hidden in literally every device that we use provide the opportunity to automatically
gather large amounts of data. Due to these factors, the field of machine learning has
developed and matured substantially, providing means to analyze different types of data
and intelligently assemble this experience to produce valuable information that can be used
to improve the quality of our lives.
1.1 Context
Predictive modeling. The broader context of the research presented in this thesis is the
general predictive modeling task of machine learning, that is, the induction of models for
predicting nominal (classification) or numerical (regression) target values. A model can
serve as an explanatory tool to distinguish between objects of different classes, in which case
it falls in the category of descriptive models. When a model is primarily induced to predict
the class label of unknown records, it belongs to the category of predictive models.
The predictive modeling task produces a mapping from the input space, represented by
a set of descriptive attributes of various types, to the space of target attributes, represented
by a set of class values or the space of real numbers. A classification model can thus
be treated as a black box that automatically assigns a class label when presented with
the attribute set of an unknown record. With the more recent developments in the field of
machine learning, the predictive modeling task has been extended to address more complex
target spaces with a predefined structure of arbitrary type. The format of the output can
be a vector of numerical values, a hierarchical structure of labels, or even a graph of objects.
The focus of this thesis is on tree-based models for predicting the value of one or several
numerical attributes, called targets (multi-target prediction). The term that we will use
in this thesis when referring to the general category of tree-based models is decision trees.
The simplest types of tree-based models are classification and regression trees. While
classification trees are used to model concepts represented with symbolic categories, regression
trees are typically used to model functions defined over the space of some or all of the input
attributes.
Classification and regression tree learning algorithms are among the most widely used
and most popular methods for predictive modeling. A decision tree is a concise data structure
that is easily interpretable and provides meaningful descriptions of the dependencies
between the input attributes and the target. Various studies and reports on their applicability
have shown that regression trees are able to provide accurate predictions, if applied
to adequate types of problems, that is, problems which can be represented with a set of
disjunctive expressions. They can handle both numeric and nominal types of attributes,
and are quite robust to irrelevant attributes.
Decision trees in general can give answers to questions of the type:
1. Which are the most discriminative (for the task of classification) combinations of
attribute values with respect to the target?
2. What is the average value (for the task of regression) of a given target for all the
examples for which a given set of conditions on the input attributes is true?
Decision trees have several advantages over other existing models such as support vector
machines, Gaussian models, and artificial neural networks. First of all, the algorithms for
learning decision trees are distribution-free, that is, they make no special prior assumptions
about the distribution that governs the data. Second, they do not require tuning of parameters
or heavy, tedious training, as is the case for support vector machines. Finally, the most
important advantage of decision trees is the fact that they are easily interpretable by a
human user. Every decision tree can be represented with a set of rules that describe the
dependencies between the input attributes and the target attribute.
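To make the rule view concrete, here is a small illustrative sketch (in Python, with invented attribute names, thresholds, and leaf values that are not taken from the thesis): a regression tree written as the equivalent set of nested if-then rules, where each leaf returns the mean target value of the training examples that reach it.

    # A toy regression tree written as rules; attributes, thresholds and leaf
    # values are hypothetical and serve only to illustrate the representation.
    def predict_price(size_m2: float, distance_km: float) -> float:
        if size_m2 <= 50.0:                # rule 1: small apartments
            return 120000.0                # mean target of the matching examples
        elif distance_km <= 5.0:           # rule 2: large and close to the center
            return 310000.0
        else:                              # rule 3: large and far from the center
            return 210000.0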
Within the category of decision trees fall structured output prediction trees (Blockeel
et al., 1998; Vens et al., 2008), which are able to provide predictions of a more complex
type. Among the most popular types of models that fall in this sub-category are multi-label
and multi-target classification and regression trees (Blockeel et al., 1998). In this
thesis, we have studied only multi-target regression trees, which predict the values of multiple
numerical targets. However, most of our ideas can also be extended to the case of multi-label
classification trees.
A classification or regression tree is a single model induced to be consistent with the
available training data. The process of tree induction is typically characterized by a
limited lookahead, instability, and high sensitivity to the choice of training data. Due to
the fact that the final result is a single model, classification and regression trees are unable
to inform the potential users about how many alternative models are consistent with the
given training data. This issue has been addressed with the development of option trees,
which represent multiple models compressed into a single interpretable decision tree. Better
exploration of the space of possible models can be achieved by learning ensembles of models
(homogeneous or heterogeneous). In this thesis, we have considered both option trees for
regression and ensembles of regression trees, which have complementary characteristics, i.e.,
advantages and disadvantages.
Mining data streams. Continuous streams of measurements are typically found in
finance, environmental or industrial monitoring, network management, and many other
domains (Muthukrishnan, 2005). Their main characteristic is the continuous and possibly
unbounded arrival of data items at high rates. The opportunity to continuously gather
information from myriads of sources, however, proves to be both a blessing and a burden.
The continuous arrival of data demands algorithms that are able to process new data
instances in constant time in the order of their arrival. The temporal dimension, on the
other hand, implies possible changes in the concept or the functional dependencies being
modeled, which, in the field of machine learning, is known as concept drift (Kolter and
Maloof, 2005; Widmer and Kubat, 1996). Thus, algorithms for learning from data streams
need to be able to detect the appearance of concept drift and adapt their model accordingly.
Among the most interesting research areas of machine learning is online learning, which
deals with machine learning algorithms able to induce models from continuous data feeds
(data streams). Online algorithms are algorithms that process their input piece-by-piece in
a serial fashion, i.e., in the order in which the input is fed to the algorithm, without having
the entire input available from the start. Every piece of input is used by the online algorithm
to update and improve the current model. Given their ability to return a valid solution to a
problem, even if interrupted at any time before their ending, online algorithms are regarded
as any-time algorithms. However, the algorithm is expected to find better and better solutions
the longer it runs. An adaptive learning algorithm is an algorithm which is able to adapt its
inference of models when the observed evidence, conditional probabilities, and the structure of
the dependencies change or evolve over time. Because of these features, online algorithms
are the method of choice if emerging data must be processed in a real-time manner without
storing it completely. In addition, these algorithms can be used for processing data stored
on large external memory devices, because they are able to induce a model or a hypothesis
using only a single pass over the data, orders of magnitude faster than traditional
batch learning algorithms.
Learning decision trees from data streams. This thesis is concerned with algorithms
for learning decision trees from data streams. Learning decision trees from data
streams is a challenging problem, due to the fact that all tree learning algorithms perform a
simple-to-complex, hill-climbing search of a complete hypothesis space for the first tree that
fits the training examples. Among the most successful algorithms for learning classification
trees from data streams are Hoeffding trees (Domingos and Hulten, 2000), which are able
to induce a decision model in an incremental manner by incorporating new information at
the time of its arrival.
Hoeffding trees provide theoretical guarantees for convergence to a hypothetical model
learned by a batch algorithm by using the Hoeffding probability bound. Given a predefined
range for the values of the random variables, the Hoeffding probability bound (Hoeffding,
1963) can be used to obtain tight confidence intervals for the true average of the sequence
of random variables. The probability bound enables one to state, with some predetermined
confidence, that the sample average of N random i.i.d. variables with values in a constrained
range is within distance ε of the true mean. The value of ε monotonically decreases with
the number of observations N; in other words, by observing more and more values, the
sampled mean approaches the true mean.
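As a concrete illustration, in its commonly used form the bound states that, with probability at least 1 − δ, the sample mean of N i.i.d. observations confined to a range of width R lies within ε = sqrt(R² ln(1/δ) / (2N)) of the true mean. The minimal sketch below (in Python, with R = 1 and δ = 0.05 chosen arbitrarily) simply evaluates this expression to show how ε shrinks as N grows.

    import math

    def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
        # epsilon = sqrt(R^2 * ln(1/delta) / (2 * n))
        return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

    # epsilon decreases monotonically with the number of observations N
    for n in (100, 1000, 10000, 100000):
        print(n, round(hoeffding_epsilon(1.0, 0.05, n), 4))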
The problem of concept drift and change detection, one of the essential issues in learning
from data streams, has also received proper attention from the research community. Several
change detection methods which function either as wrappers (Gama and Castillo, 2006) or
are incorporated within the machine learning algorithm (Bifet and Gavalda, 2007) have been
proposed. Hoeffding-based algorithms extended to the non-stationary online learning setup
have also been studied by multiple authors (Gama et al., 2004b, 2003; Hulten et al., 2001).
Hoeffding-based option trees (Pfahringer et al., 2007) and their variants (Adaptive Hoeffding
Option Trees) (Bifet et al., 2009b) have been studied in the context of improving
the accuracy of online classification. Various types of ensembles of Hoeffding trees for online
classification, including bagging and boosting (Bifet et al., 2009b), random forests (Abdulsalam
et al., 2007, 2008; Bifet et al., 2010; Li et al., 2010), and stacked generalization
with restricted Hoeffding trees (Bifet et al., 2012), among others, have been proposed and
studied. Finally, Read et al. (2012) recently proposed an algorithm for learning multi-label
Hoeffding trees, which attacks the multi-label problem through online learning of multi-label
(classification) trees.
The work in this thesis falls in the more specific context of algorithms for learning
Hoeffding-based regression trees, algorithms for change detection, and extensions of these
concepts to more advanced and complex types of models, i.e., option trees, tree ensembles,
and multi-target trees for regression.
1.2 Goals
The main goal of our research was to study the various aspects of the problem of learning
decision trees from data streams that evolve over time, that is, in a learning environment in
which the underlying distribution that generates the data might change over time. Decision
trees for regression had not yet been studied in the context of data streams, despite the fact
that many interesting real-world applications require various regression tasks to be solved
in an online manner.
Within this study, we aimed to develop various tree-based methods for regression on
time-changing data streams. Our goal was to follow the main developments within the line
of algorithms for online learning of classification trees from data streams, that is, to include
change detection mechanisms inside the tree learning algorithms, introduce options in the
trees, and extend the developed methods to online learning of ensemble models for regression
and trees for structured output prediction.
1.3 Methodology
Our approach is to follow the well-established line of algorithms for online learning of decision
trees, represented by the Hoeffding tree learning algorithm (Domingos and Hulten, 2000).
Hoeffding trees take the viewpoint of statistical learning by making use of probabilistic
estimates within the inductive inference process. The application of the Hoeffding bound
provides statistical support for every inductive decision (e.g., the selection of a test to put
in an internal node of the tree), which results in a more stable and robust sequence of
inductive decisions. Our goal was to apply the same ideas in the context of online learning
of regression trees.
Our methodology examines the applicability of the Hoeffding bound to the split selection
procedure of an online algorithm for learning regression trees. Due to the specifics of the
Hoeffding bound, extending the same ideas to the regression domain is not straightforward.
Namely, there exists no evaluation function for the regression domain whose values can
be bounded within a pre-specified range.
To address the issue of change detection, we studied methods for tracking changes and
real-time adaptation of the current model. Our methodology is to introduce change detection
mechanisms within the learning algorithm and enable local error monitoring. The advantage
of localizing the concept drift is that it gives us the possibility to determine the set of
conditions under which the current model remains valid, and, more importantly, the set of
disjunctive expressions that have become incorrect due to the changes in the functional
dependencies.
The online learning task carried out by Hoeffding-based algorithms, although statistically
stable, is still susceptible to the typical problems of greedy search through the space of
possible trees. In that context, we study the applicability of options and their effect on the
any-time performance of the online learning algorithm. The goal is not only to improve the
exploration of the search space, but also to enable more efficient resolution of ambiguous
situations which typically slow down the convergence of the learning process.
We further study the possibility of combining multiple predictions, which promises
increased accuracy, along with reduced variability and sensitivity to the choice of the
training data. A natural step in this direction is to study the relation between option trees
and ensemble learning methods, which offer possibilities to leverage the expertise of multiple
regression trees. We approach it with a comparative assessment of online option trees for
regression and online ensembles of regression trees. We evaluate the proposed algorithms
on real-world and synthetic data sets, using methodology appropriate for approaches for
learning from evolving data streams.
1.4 Contributions
The research presented in this thesis addresses the general problem of automated and adaptive
any-time regression analysis using different regression tree approaches and tree-based
ensembles from streaming data. The main contributions of the thesis are summarized as
follows:
• We have designed and implemented an online algorithm for learning model trees with
change detection and adaptation mechanisms embedded within the algorithm. To the
best of our knowledge, this is the first approach that studies a complete system for
learning from non-stationary distributions for the task of online regression. We have
performed an extensive empirical evaluation of the proposed change detection and
adaptation methods on several simulated scenarios of concept drift, as well as on the
task of predicting flight delays from a large dataset of departure and arrival records
collected within a period of twenty years.
• We have designed and implemented an online option tree learning algorithm that
enabled us to study the idea of introducing options within the proposed online learning
algorithm and their overall effect on the learning process. To the best of our knowledge,
this is the first algorithm for learning option trees in the online setup without capping
options to the existing nodes. We have further performed a corresponding empirical
evaluation and a comparison of the novel online option tree learning algorithm with
the baseline regression and model tree learning algorithms.
• We have designed and implemented two methods for learning tree-based ensembles for
regression. These two methods were developed to study the advantages of combining
multiple predictions for online regression and to evaluate the merit of using options in
the context of methods for learning ensembles. We have performed a corresponding
empirical evaluation and a comparison with the online option tree learning algorithm
on existing real-world benchmark datasets.
• We have designed and implemented a novel online algorithm for learning multi-target
regression and model trees. To the best of our knowledge, this is the first algorithm
designed to address the problem of online prediction of multiple numerical targets for
regression analysis. We have performed a corresponding empirical evaluation and a
comparison with an independent modeling approach. We have also included a batch
algorithm for learning multi-target regression trees in the comparative assessment of
the quality of the models induced with the online learning algorithm.
To date and to the best of our knowledge, there is no other work that implements and
empirically evaluates online methods for tree-based regression, including model trees with
drift detection, option trees for regression, online ensemble methods for regression, and online
multi-target model trees. With the work presented in this thesis, we lay the foundations
for research in online tree-based regression, leaving much room for future improvements,
extensions and comparisons of methods.
1.5 Organization of the Thesis
This introductory chapter presents the general perspective on the topic under study and
provides the motivation for our research. It specifies the goals set at the beginning of the
thesis research and presents its main original contributions. In the following, we give a
chapter-level outline of this thesis, describing the organization of the chapters which present
the above mentioned contributions.
Chapter 2 gives the broader context of the thesis work within the area of online learning
from the viewpoint of several areas, including statistical quality control, decision theory, and
computational learning theory. It gives some background on the online learning protocol
and a brief overview of the research related to the problem of supervised learning from
time-changing data streams.
Chapter 3 gives the necessary background on the basic methodology for learning decision
trees, regression trees and their variants, including model trees, option trees and multi-target
decision trees. It provides a description of the tree induction task through a short
presentation of the history of learning decision trees, followed by a more elaborate discussion
of the main issues.
Chapter 4 provides a description of several basic ensemble learning methods, with a
focus on ensembles of homogeneous models, such as regression or model trees. It provides
the basic intuition behind the idea of learning ensembles, which is related to one of the main
contributions of this thesis that stems from the exploration of options in tree induction.
In Chapter 5, we present the quality measures used and the specifics of the experimental
evaluation designed to assess the performance of our online algorithms. This chapter defines
the main criteria for evaluation, along with two general evaluation models designed specially
for the online learning setup. In this chapter, we also give a description of the methods used
to perform a statistical comparative assessment and an online bias-variance decomposition.
In addition, we describe the various real-world and simulated problems which were used in
the experimental evaluation of the different learning algorithms.
Chapter 6 presents the first major contribution of the thesis. It describes an online change
detection and adaptation mechanism embedded within an online algorithm for learning
model trees. It starts with a discussion of the related work within the online sequential
hypothesis testing framework and the existing probabilistic sampling strategies in machine
learning. The main parts of the algorithm are further presented in more detail, each in a
separate section. Finally, we give an extensive empirical evaluation addressing the various
aspects of the online learning procedure.
Chapter 7 presents the second major contribution of the thesis, an algorithm for online
learning of option trees for regression. The chapter covers the related work on learning
Hoeffding option trees for classification and presents the main parts of the algorithm, each
in a separate section. The last section contains an empirical evaluation of the proposed
algorithm on the same benchmark regression problems that were used in the evaluation
section of Chapter 5.
Chapter 8 provides an extensive overview of existing methods for learning ensembles of
classifiers for the online prediction task. This gives the appropriate context for the two newly
designed ensemble learning methods for online regression, which are based on extensions of
the algorithm described previously in Chapter 6. This chapter also presents an extensive
experimental comparison of the ensemble learning methods with the online option tree
learning algorithm introduced in Chapter 7.
Chapter 9 presents our final contribution, with which we have addressed a slightly
different aspect of the online prediction task, i.e., the increase in the complexity of the space
of the target variables. The chapter starts with a short overview of the existing related work
in the context of the online multi-target prediction task. Next, it describes an algorithm for
learning multi-target model trees from data streams through a detailed elaboration of the
main procedures. An experimental evaluation is provided to support our theoretical results,
which continues into a discussion of some interesting directions for further extensions.
Finally, Chapter 10 presents our conclusions. It presents a summary of the thesis, our
original contributions, and several directions for further work.
2 Learning from Data Streams
When a distinguished but elderly
scientist states that something is
possible, he is almost certainly right.
When he states that something is
impossible, he is very probably wrong.
The First Clarke's Law, by Arthur C.
Clarke, in "Hazards of Prophecy: The
Failure of Imagination"
Learning from abundant data has been mainly motivated by the explosive growth of
information collected and stored electronically. It represents a major departure from the
traditional inductive inference paradigm, in which the main bottleneck is the lack of training
data. With the abundance of data, however, the main question which has been frequently
addressed in the different research communities so far is: "What is the minimum amount of
data that can be used without compromising the results of learning?". In this chapter, we
discuss various aspects of the problem of supervised learning from data streams, and strive
to provide a unified view from the perspectives of statistical quality control, decision theory,
and computational learning theory.
This chapter is organized as follows. We start with a high-level overview of the requirements
for online learning. Next, we discuss the supervised learning and regression tasks. We
propose to study the process of learning as a search in the solution space, and show more
specifically how each move in this space can be chosen by using a sub-sample of the training
data. In that context, we discuss various sampling strategies for determining the amount of
necessary training data. This is related to the concept of Probably Approximately Correct
learning (PAC learning) and sequential inductive learning. We also present some of the most
representative online learning algorithms. In a final note, we address the non-stationarity
of the learning process and discuss several approaches for resolving the issues raised in this
context.
2.1 Overview
Learning from data streams is an instance of the online learning paradigm. It differs from
the batch learning process mainly by its ability to incorporate new information into the
existing model, without having to re-learn it from scratch. Batch learning is a finite process
that starts with a data collection phase and ends with a model (or a set of models), typically
after the data has been maximally explored. The induced model represents a stationary
distribution, a concept, or a function which is not expected to change in the near future.
The online learning process, on the other hand, is not finite. It starts with the arrival of
some training instances and lasts as long as there is new data available for learning. As
such, it is a dynamic process that has to encapsulate the collection of data, the learning and
the validation phase in a single continuous cycle.
Research in online learning dates back to the second half of the previous century, when the
Perceptron algorithm was introduced by Rosenblatt (1958). However, the online machine
learning community has been mainly preoccupied with finding theoretical guarantees
for the learning performance of online algorithms, while neglecting some more practical
issues. The process of learning itself is a very difficult task. Its success depends mainly on the
type of problems being considered, and on the quality of the available data that will be used
in the inference process. Real-world problems are typically very complex and demand a diverse
set of data that covers various aspects of the problem, as well as sophisticated mechanisms
for coping with noise and contradictory information. As a result, it becomes almost
impossible to derive theoretical guarantees on the performance of online learning algorithms
for practical real-world problems. This has been the main reason for the decreased popularity
of online machine learning.
The stream data mining community, on the other hand, has approached the online
learning problem from a more practical perspective. Stream data mining algorithms are
typically designed such that they fulfill a list of requirements in order to ensure efficient online
learning. Learning from data streams not only has to incrementally induce a model of good
quality, but this has to be done efficiently, while taking into account the possibility that the
conditional dependencies can change over time. Hulten et al. (2001) have identified several
desirable properties a learning algorithm has to possess in order to efficiently induce up-to-date
models from high-volume, open-ended data streams. An online streaming algorithm
has to possess the following features:
• It should be able to build a decision model using a single pass over the data;
• It should have a small (if possible constant) processing time per example;
• It should use a fixed amount of memory, irrespective of the data stream size;
• It should be able to incorporate new information into the existing model;
• It should have the ability to deal with concept drift; and
• It should have a high speed of convergence.
We tried to take into account all of these requirements when designing our online adaptive
algorithms for learning regression trees and their variants, as well as for learning ensembles
of regression and model trees.
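As an illustration only, and not one of the algorithms developed in this thesis, the following minimal Python sketch shows the shape such an online learner takes: it keeps a constant amount of state, updates it in constant time per example, can return a prediction at any time, and exposes a reset hook that a change detector could trigger upon drift.

    # A toy online regressor that predicts the running mean of the target.
    # Its state is O(1) regardless of how many stream examples have been seen.
    class OnlineMeanRegressor:
        def __init__(self) -> None:
            self.n = 0
            self.mean = 0.0

        def predict(self, x) -> float:
            return self.mean                   # any-time prediction

        def update(self, x, y: float) -> None: # constant time per example
            self.n += 1
            self.mean += (y - self.mean) / self.n

        def reset(self) -> None:               # adaptation hook, e.g., on detected drift
            self.n, self.mean = 0, 0.0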
2.2 Supervised Learning and the Regression Task
Informally speaking, the inductive inference task aims to construct or evaluate propositions
that are abstractions of observations of individual instances of members of the same class.
Machine learning, in particular, studies automated methods for inducing general functions
from specific examples sampled from an unknown data distribution. In its most simple form,
the inductive learning task ignores prior knowledge, assumes a deterministic, observable
environment, and assumes that examples are given to the learning agent (Mitchell, 1997).
The learning task is in general categorized as either supervised or unsupervised learning.
We will consider only the supervised learning task, more specifically supervised learning of
various forms of tree-structured models for regression.
2.2.1 The Task of Regression
Before stating the basic definitions, we will define some terminology that will be used
throughout this thesis. Suppose we have a set of objects, each described with many
attributes (features or properties). The attributes are independent observable variables,
numerical or nominal. Each object can be assigned a single real-valued number, i.e., a value
of the dependent (target) variable, which is a function of the independent variables. Thus,
the input data for a learning task is a collection of records. Each record, also known as an
instance or an example, is characterized by a tuple (x, y), where x is the attribute set and y
is the target attribute, designated as the class label. If the class label is a discrete attribute,
then the learning task is classification. If the class label is a continuous attribute, then the
learning task is regression. In other words, the task of regression is to determine the value
of the dependent continuous variable, given the values of the independent variables (the
attribute set).
When learning the target function, the learner L is presented with a set of training
examples, each consisting of an input vector x from X, along with its target function value
y = f(x). The function to be learned represents a mapping from the attribute space X to the
space of real values Y, i.e., f: X → R. We assume that the training examples are generated at
random according to some probability distribution D. In general, D can be any distribution
and is not known to the learner. Given a set of training examples of the target function f,
the problem faced by the learner is to hypothesize, or estimate, f. We use the symbol H
to denote the set of all possible hypotheses that the learner may consider when trying to
find the true identity of the target function. In our case, H is determined by the set of all
possible regression trees (or variants thereof, such as option trees) over the instance space
X. After observing a set of training examples of the target function f, L must output some
hypothesis h from H, which is its estimate of f. A fair evaluation of the success of L assesses
the performance of h over a set of new instances drawn randomly from X according to D,
the same probability distribution used to generate the training data.
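As a toy illustration of this protocol (the target function, the distribution, and the deliberately crude learner below are invented for the example and are not part of the thesis), the learner observes pairs (x, f(x)) drawn according to D, outputs a hypothesis h, and is then judged on fresh examples drawn from the same D:

    import random

    def f(x: float) -> float:              # the unknown target function (invented here)
        return 3.0 * x + 1.0

    def draw_example() -> tuple:
        x = random.gauss(0.0, 1.0)         # D: a fixed distribution over the instance space X
        return x, f(x)

    train = [draw_example() for _ in range(200)]

    # A deliberately crude learner: h always predicts the mean target value seen in training.
    mean_y = sum(y for _, y in train) / len(train)
    h = lambda x: mean_y

    # Fair evaluation: error of h on new examples drawn from the same distribution D.
    test = [draw_example() for _ in range(1000)]
    mse = sum((h(x) - y) ** 2 for x, y in test) / len(test)
    print("MSE of h on fresh examples from D:", round(mse, 3))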
The basic assumption of inductive learning is that any hypothesis found to approximate
the target function well over a sufficiently large set of training examples will also approximate
the target function well over unobserved testing examples. The rationale behind this
assumption is that the only information available about f is its value over the set of training
examples. Therefore, inductive learning algorithms can at best guarantee that the output
hypothesis fits the target function over the training data. However, this fundamental
assumption of inductive learning needs to be re-examined under the learning setup of changing
data streams, where a target function f might be valid over the set of training examples only
for a fixed amount of time. In this new setup, which places the aforementioned assumption
under a magnifying glass, the inductive learning task has to incorporate machinery for dealing
with non-stationary functional dependencies and a possibly infinite set of training examples.
2.2.2 Learning as Search
The problem of learning a target function has typically been viewed as a problem of search
through a large space of hypotheses, implicitly defined by the representation of hypotheses.
According to the previous definition, the goal of this search is to find the hypothesis that
best fits the training examples. By viewing the learning problem as a problem of search,
it is natural to approach the problem of designing a learning algorithm through examining
different strategies for searching the hypothesis space. We are, in particular, interested in
algorithms that can perform an efficient search of a very large (or infinite) hypothesis space
through a sequence of decisions, each informed by a statistically significant amount of
evidence, through the application of probability bounds or variants of statistical tests. In that
context, we address algorithms that are able to make use of the general-to-specific ordering
of the set of all possible hypotheses. By taking advantage of the general-to-specific
ordering, the learning algorithm can search the hypothesis space without explicitly enumerating
every hypothesis. In the context of regression trees, each conjunction of a test to the
previously inferred conditions represents a refinement of the current hypothesis.
The inductive bias of a learner is given by the choice of hypothesis space and the set
of assumptions that the learner makes while searching the hypothesis space. As stated by
Mitchell (1997), there is a clear futility in bias-free learning: a learner that makes no prior
assumptions regarding the identity of the target concept has no rational basis for classifying
any unseen instances. Thus, learning algorithms make various assumptions, ranging from
"the hypothesis space H includes the target concept" to "more specific hypotheses are
preferred over more general hypotheses". Without an inductive bias, a learner cannot make
inductive leaps to classify unseen examples. One of the advantages of studying the inductive
bias of a learner is that it provides a means of characterizing its policy for generalizing
beyond the observed data. A second advantage is that it allows comparison of different
learning algorithms according to the strength of their inductive bias.
Decision and regression trees are learned from a finite set of examples based on the
available attributes. The inductive bias of decision and regression tree learning algorithms
will be discussed in more detail in the following chapter. We note here that there exists
a clear difference between the induction task using a finite set of training examples and
the induction task using an infinite set of training examples. In the first case, the learning
algorithm uses all training examples at each step in the search to make statistically based
decisions regarding how to refine its current hypothesis. Several interesting questions arise
when the set of instances X is not finite. Since only a portion of all instances is available when
making each decision, one might expect that earlier decisions are doomed to be improperly
informed, due to the lack of information that only becomes available with the arrival of new
training instances. In what follows, we will try to clarify some of those questions from the
perspective of statistical decision theory.
2.3 Learning under a Sampling Strategy
Let us start with a short discussion of the practical issues which arise when applying a
machine learning algorithm for knowledge discovery. One of the most common issues is the
fact that the data may contain noise. The existence of noise increases the difficulty of the
learning task. To go even further, the concept one is trying to learn may not be drawn from
a pre-specified class of known concepts. Also, the attributes may be insufficient to describe
the target function or concept.
The problem of learnability and complexity of learning has been studied in the field of
computational learning theory, where a standard definition of sufficiency regarding the quality
of the learned hypothesis is used. Computational learning theory is in general concerned
with two types of questions: "How many training examples are sufficient to successfully learn
the target function?" and "How many mistakes will the learner make before succeeding?".
2.3.1 Probably Approximately Correct Learning
The denition of"success"depends largely on the context,the particular setting,or the
learning model we have in mind.There are several attributes of the learning problem that
determine whether it is possible to give quantitative answers to the above questions.These
include the complexity of the hypothesis space considered by the learner,the accuracy to
which the target function must be approximated,the probability that the learner will output
a successful hypothesis,or the manner in which the training examples are presented to the
learner.
The probably approximately correct (PAC) learning model proposed by Valiant (1984) provides the means to analyze the sample and computational complexity of learning problems for which the hypothesis space H is finite. In particular, learnability is defined in terms of how closely the target concept can be approximated (under the assumed set of hypotheses H) from a reasonable number of randomly drawn training examples with a reasonable amount of computation. Trying to characterize learnability by demanding an error rate of error_D(h) = 0 when applying h to future instances drawn according to the probability
distribution D is unrealistic, for two reasons. First, since we are not able to provide to the learner all of the training examples from the instance space X, there may be multiple hypotheses which are consistent with the provided set of training examples, and the learner cannot deterministically pick the one that corresponds to the target function. Second, given that the training examples are drawn at random from the unknown distribution D, there will always be some nonzero probability that the chosen sequence of training examples is misleading.
According to the PAC model, to be able to eventually learn something, we must weaken our demands on the learner in two ways. First, we must give up on the zero-error requirement, and settle for an approximation defined by a constant error bound ε that can be made arbitrarily small. Second, we will not require that the learner succeed in achieving this approximation for every possible sequence of randomly drawn training examples, but we will require that its probability of failure be bounded by some constant δ that can be made arbitrarily small. In other words, we will require only that the learner probably learns a hypothesis that is approximately correct. The definition of a PAC-learnable concept class is given as follows:
Consider a concept class C dened over a set of instances X of length n and
a learner L using hypothesis space H.C is PAC-learnable by L using H if
for all c 2C,distributions D over X,e such that 0 <e <1=2,and d such that
0 <d <1=2,learner L will with probability at least (1d) output a hypothesis
h 2 H such that error
D
(h)  e,in time that is polynomial in 1=e,1=d,n,and
size(c).
The denition takes into account our demands on the output hypothesis:low error (e)
high probability (1d),as well as the complexity of the underlying instance space n and
the concept class C.Here,n is the size of instances in X (e.g.,the number of independent
variables).
However,the above denition of PAC learnability implicitly assumes that the learner's
hypothesis space H contains a hypothesis with arbitrarily small error e for every target
concept in C.In many practical real world problems it is very dicult to determine C in
advance.For that reason,the framework of agnostic learning (Haussler,1992;Kearns et al.,
1994) weakens the demands even further,asking for the learner to output the hypothesis
from H that has the minimum error over the training examples.This type of learning is
called agnostic because the learner makes no assumption that the target concept or function
is representable in H;that is,it doesn't know if C H.Under this less restrictive setup,the
learner is assured with probability (1d) to output a hypothesis within error e of the best
possible hypothesis in H,after observing m randomly drawn training examples,provided
m ≥ (1/(2ε²)) (ln(1/δ) + ln|H|).   (1)
As we can see, the number of examples required to reach the goal of close approximation depends on the complexity of the hypothesis space H, which in the case of decision and regression trees and other similar types of models can be infinite. For the case of infinite hypothesis spaces, a different measure of the complexity of H is used, called the Vapnik-Chervonenkis dimension of H (VC dimension, or VC(H) for short). However, the bounds derived are applicable only to some rather simple learning problems for which it is possible to determine VC(H). For example, it can be shown that the VC dimension of linear decision surfaces in an r-dimensional space (i.e., the VC dimension of a perceptron with r inputs) is r + 1; the same can be done for some other well-defined classes of more complex models, such as neural networks with predefined units and structure. Nevertheless, the above considerations have led to several important ideas which have influenced some recent, more practical, solutions.
An example is the application of general Hoeffding bounds (Hoeffding, 1963), also known as additive Chernoff bounds, in estimating how badly a single chosen hypothesis deviates from the best one in H. The Hoeffding bound applies to experiments involving a number of distinct Bernoulli trials, such as m independent flips of a coin with some probability of turning up heads. The event of a coin turning up heads can be associated with the event of a misclassification. Thus, a sequence of m independent coin flips is analogous to a sequence of m independently drawn instances. Generally speaking, the Hoeffding bound characterizes the deviation between the true probability of some event and its observed frequency over m independent trials. In that sense, it can be used to estimate the deviation between the true probability of misclassification of a learner and its observed error over a sequence of m independently drawn instances.
The Hoeffding inequality gives a bound on the probability that an arbitrarily chosen single hypothesis h has a training error, measured over a set D′ containing m examples randomly drawn from the distribution D, that deviates from the true error by more than ε:
Pr[error_D′(h) > error_D(h) + ε] ≤ e^(−2mε²)
To ensure that the best hypothesis found by L has an error bounded by ε, we must bound the probability that the error of any hypothesis in H will deviate from its true value by more than ε as follows:
Pr[∃h ∈ H : error_D′(h) > error_D(h) + ε] ≤ |H| e^(−2mε²)
If we assign a value of δ to this probability and ask how many examples are necessary for the inequality to hold, we get:
m ≥ (1/(2ε²)) (ln|H| + ln(1/δ))   (2)
The number of examples depends logarithmically on the inverse of the desired probability 1/δ, and grows with the square of 1/ε. Although this is just one example of how the Hoeffding bound can be applied, it illustrates the type of approximate answer which can be obtained in the scenario of learning from data streams. It is important to note at this point that this application of the Hoeffding bound ensures that a hypothesis (from the finite space H) with the desired accuracy will be found with high probability. However, for a large number of practical problems, for which the hypothesis space H is infinite, similar bounds cannot be derived even if we use VC(H) instead of |H|. Instead, several approaches have been proposed that relax the demands for these difficult cases even further, while assuring that each inductive decision will satisfy a desired level of quality.
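As an illustration, bound (2) above (and equivalently bound (1)) is straightforward to evaluate numerically. The short Python sketch below is our own minimal illustration; the function name and the example values of ε, δ and |H| are chosen purely for demonstration.

import math

def sample_complexity(epsilon, delta, hypothesis_space_size):
    """Number of examples m satisfying bound (2):
    m >= 1/(2*epsilon^2) * (ln|H| + ln(1/delta))."""
    m = (math.log(hypothesis_space_size) + math.log(1.0 / delta)) / (2.0 * epsilon ** 2)
    return math.ceil(m)

# Example: |H| = 2^20 hypotheses, epsilon = 0.05, delta = 0.01
print(sample_complexity(0.05, 0.01, 2 ** 20))  # roughly 3,700 examples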
2.3.2 Sequential Inductive Learning
Interpreting the inductive inference process as a search through the hypothesis space H enables the use of several interesting ideas from the field of statistical decision theory. We are interested in algorithms that are able to make use of the general-to-specific ordering of the hypotheses in H, and thus perform a move in the search space by refining an existing more general hypothesis into a new, more specific one. The choice of the next move requires the examination of a set of refinements, from which the best one will be chosen.
Most learning algorithms use some statistical procedure for evaluating the merit of each refinement, which in the field of statistics has been studied as the correlated selection problem (Gratch, 1994). In selection problems, one is basically interested in comparing a finite set of hypotheses in terms of their expected performance over a distribution of instances and selecting the hypothesis with the highest expected performance. The expected performance of a hypothesis is typically defined in terms of the decision-theoretic notion of expected utility.
In machine learning, the commonly used utility functions are defined with respect to the target concept c or the target function f which the learner is trying to estimate. For classification tasks, the true error of a hypothesis h with respect to a target concept c and a distribution D is defined as the probability that h will misclassify an instance drawn at random according to D:
error_D(h) ≡ Pr_{x∈D}[c(x) ≠ h(x)]
where the notation Pr_{x∈D} indicates that the probability is taken over the instance distribution D and not over the actual set of training examples. This is necessary because we need to estimate the performance of the hypothesis when applied to future instances drawn independently from D.
Obviously, the error or the utility of the hypothesis depends strongly on the unknown probability distribution D. For example, if D happens to assign very low probability to instances for which h and c disagree, the error might be much smaller compared to the case of a uniform probability distribution that assigns the same probability to every instance in X. The error of h with respect to c or f is not directly observable to the learner. Thus, L can only observe the performance of h over the training examples and must choose its output hypothesis on this basis alone.
When data are abundant, evaluating a set of hypotheses seems trivial, unless one takes into account the computational complexity of the task. Under this constraint, one has to provide an answer to the question: "How likely is it that the estimated advantage of one hypothesis over another will remain valid if more training examples were used?".
As noted in the previous section, bounding the probability of failure in finding a hypothesis which is within an ε bound of the best one in H depends on the complexity of the assumed set of hypotheses and the learning setup. However, the theoretical implications are (most of the time) not useful in practice. Classical statistical approaches typically assume a specific probability distribution and bound the probability of an incorrect assertion by using the initial assumptions. For example, most techniques assume that the utility of hypotheses is normally distributed, which is not an unreasonable assumption when the conditions for applying the central limit theorem hold. Other approaches relax the assumptions, e.g., assume that the selected hypothesis has the highest expected utility with some pre-specified confidence. An even less restrictive assumption is that the selected hypothesis is close to the best with some confidence. The last assumption leads to a class of selection problems known in the field of statistics as indifference-zone selection (Bechhofer, 1954).
Unfortunately, given a single reasonable selection assumption, there is no single optimal method for ensuring it. Rather, there exists a variety of techniques, each with its own set of trade-offs. In order to support efficient and practical learning, a sequential decision-theoretic approach can be used that relaxes the requirements for successful learning by moving from the goal of "converging to a successful hypothesis" to the goal of "successfully converging to the closest possible hypothesis to the best one". In this context, there are some interesting cases of learning by computing the sample size required to bound the expected loss in each step of the induction process.
A notable example that has served as an inspiration is the work by Musick et al. (1993), in which decision-theoretic subsampling was proposed for the induction of decision trees on large databases. The main idea is to choose a smaller sample, from a very large training set, over which a tree of the desired quality can be learned. In short, the method tries to determine what sequence of subsamples from a large dataset will be the most economical way to choose the best attribute, to within a specified expected error. The proposed sampling strategy takes into account the expected quality of the learned tree, the cost of sampling, and a utility measure specifying what the user is willing to pay for trees of different quality, and calculates the expected required sample size. A generalization of this method has been proposed by Gratch (1994), in which the so-called one-shot induction is replaced with
sequential induction, where the data are sampled a little at a time throughout the decision process.
In the sequential induction scenario, the learning process is defined as an inductive decision process consisting of two types of inductive decisions: stopping decisions and selection decisions. The statistical machinery used to determine a sufficient number of examples for performing the selection decisions is based on an open, unbalanced sequential strategy for solving correlated selection problems. The attribute selection problem is, in this case, addressed through a method of multiple comparisons, which consists of simultaneously performing a number of pairwise statistical comparisons between the refinements drawn from the set of possible refinements of an existing hypothesis. Let the size of this set be k. This reduces the problem to estimating the sign of the expected difference in value between two refinements, with an error of no more than ε. Here, ε is an indifference parameter that captures the intuition that, if the difference is sufficiently small, we do not care if the technique determines its sign incorrectly. Stopping decisions are resolved using an estimate of the probability that an example will reach a particular node. The sequential algorithm should not partition a node if this probability is less than some threshold parameter γ. This decision should, however, be reached with a probability of success (1 − δ).
The technique used to determine the number of training examples necessary for achieving a successful indifference-zone selection takes into account the variance in the utility of each attribute. If the utility of a splitting test varies greatly across the distribution of examples, more data are needed to estimate its performance to a given level of accuracy. The statistical procedure used is known as the sequential probability ratio test (SPRT); cf. Wald (1945). SPRT is based on estimating the likelihood of the data being generated according to some specified distribution at two different values of the unknown mean, θ and −θ. In this case, the assumption is that the observed differences are generated according to a normal distribution with mean ε, and a variance estimated by the current sample variance. A hypothesis is the overall best if there is a statistically significant positive difference in its comparison with the k − 1 remaining hypotheses. Therefore, a refinement is selected only when enough statistical evidence has been observed from the sequence of training examples.
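To make the sequential test concrete, the following sketch illustrates the idea in Python. It is a simplified illustration under the normality assumption stated above, not Gratch's exact procedure; the function and parameter names are ours, and the variance is assumed to be supplied by the caller rather than re-estimated online.

import math

def sprt_sign_of_mean(differences, epsilon, delta, variance):
    """Sequential probability ratio test deciding whether the mean of the observed
    pairwise differences is +epsilon or -epsilon (normal model, known variance).
    Returns +1, -1, or None if the evidence is still insufficient."""
    upper = math.log((1.0 - delta) / delta)   # accept mean = +epsilon
    lower = math.log(delta / (1.0 - delta))   # accept mean = -epsilon
    llr = 0.0
    for x in differences:
        llr += 2.0 * epsilon * x / variance   # per-observation log-likelihood ratio
        if llr >= upper:
            return +1
        if llr <= lower:
            return -1
    return None  # undecided: keep sampling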
With the combination of these two techniques, the induction process is designed as a sequence of probably approximately correct inductive decisions, instead of probably approximately correct learning. As a result, the inductive process will not guarantee that the learner outputs a hypothesis which is close enough to the best one in H with a probability of success (1 − δ). What will be guaranteed is that, given the learning setup, each inductive decision of the learner will have a probability of failure bounded by δ in estimating the advantage of the selected refinement over the rest, with an absolute error of at most ε. A similar technique for PAC refinement selection has been proposed by Domingos and Hulten (2000); it employs the Hoeffding inequality in order to bound the probability of failure. The approach is closely related to the algorithms proposed in this thesis and will be discussed in more detail in Chapter 6.
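The Hoeffding-based selection idea can be sketched in a few lines. The code below is our own simplified illustration of the general principle rather than the exact procedure of Domingos and Hulten (2000); the range R of the merit measure, the observed merit scores, and the presence of at least two candidate refinements are assumptions supplied by the caller.

import math

def hoeffding_bound(value_range, delta, m):
    """With probability 1 - delta, the observed mean of m i.i.d. observations
    taking values in a range of size value_range is within this bound of the true mean."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * m))

def select_refinement(merits, value_range, delta, m):
    """merits: dict mapping each candidate refinement (at least two) to its observed
    merit, estimated from the m examples seen so far. Returns the chosen refinement,
    or None if the evidence is still insufficient."""
    ranked = sorted(merits.items(), key=lambda kv: kv[1], reverse=True)
    best, second = ranked[0], ranked[1]
    if best[1] - second[1] > hoeffding_bound(value_range, delta, m):
        return best[0]   # advantage is significant with probability 1 - delta
    return None          # keep accumulating examples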
2.4 The Online Learning Protocol
The learning scenario assumed while designing the algorithms and the experiments presented in this thesis follows the online learning protocol (Blum and Burch, 2000). Let us consider a sequence of input elements a_1, a_2, ..., a_j, ..., which arrive continuously and endlessly, each drawn independently from some unknown distribution D. In this setting, the following online learning protocol is repeated indefinitely:
1. The algorithm receives an unlabeled example.
2. The algorithm predicts a class (for classification) or a numerical value (for regression) for this example.
3. The algorithm is then given the correct answer (the label of the unlabeled example).
An execution of steps (1) to (3) is called a trial. We will call whatever is used to perform step (2) the algorithm's "current hypothesis". New examples are classified automatically as they become available, and can be used for training as soon as their class assignments are confirmed or corrected. For example, a robot learning to complete a particular task might obtain the outcome of its action (correct or wrong) each time it attempts to perform it.
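In code, the protocol is a simple loop over trials. The sketch below is purely schematic; the model interface with predict and update methods is an assumption of ours, not something prescribed by the protocol itself.

def online_learning(stream, model):
    """Run the online protocol: predict first, then learn from the revealed label.
    'stream' yields (x, y) pairs; 'model' is assumed to expose predict(x) and update(x, y)."""
    mistakes, trials = 0, 0
    for x, y in stream:           # step 1: receive an unlabeled example
        y_hat = model.predict(x)  # step 2: commit to a prediction
        mistakes += int(y_hat != y)
        model.update(x, y)        # step 3: the correct answer is revealed
        trials += 1
    return mistakes / max(trials, 1)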
An important detail in this learning protocol is that the learner has to make a prediction after every testing and training example it receives. This is typical for the mistake bound model of learning, in which the learner is evaluated by the total number of mistakes it makes before it converges to the correct hypothesis. The main question considered in this model is "What is the number of mistakes in prediction that the learner will make before it learns the target concept?". This question asks for an estimate of the predictive accuracy of the learner at any time during the course of learning, which is significant in practical settings where learning must be done while the system is in use, rather than during an off-line training phase. If an algorithm has the property that, for any target concept c ∈ C, it makes at most poly(p, size(c)) mistakes on any sequence of examples, and its running time per trial is poly(p, size(c)) as well, then it is said that the algorithm learns the class C in the mistake bound model. Here, p denotes the dimensionality of the problem, that is, the number of predictor variables {x_1, ..., x_p}.
2.4.1 The Perceptron and the Winnow Algorithms
Examples of simple algorithms that perform surprisingly well in practice under the mistake bound model are the Perceptron (Rosenblatt, 1958) and the Winnow (Littlestone, 1988) algorithms, which both perform online learning of a linear threshold function. The Perceptron algorithm is one of the oldest online machine learning algorithms for learning a linear threshold function. For a sequence S of labeled examples which is assumed to be consistent with a linear threshold function w* · x > 0, where w* is a unit-length vector, it can be proven that the number of mistakes on S made by the Perceptron algorithm is at most (1/γ)², where
γ = min_{x∈S} (w* · x) / ||x||.
The parameter"g"is often called the margin of w

and denotes the closest the Percep-
tron algorithm can get in approximating the true linear threshold function w

 x >0.The
Perceptron algorithm is given with the following simple sequence of rules:
1. Initialize the iteration with t = 1.
2. Start with an all-zeros weight vector w_1 = 0, and assume that all examples are normalized to have Euclidean length 1.
3. Given example x, predict positive iff w_t · x > 0.
4. On a mistake, update the weights as follows:
   • If the mistake was on a positive example: w_{t+1} ← w_t + x.
   • If the mistake was on a negative example: w_{t+1} ← w_t − x.
5. t ← t + 1.
6. Go to 3.
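A direct transcription of these rules into Python might look as follows (a minimal sketch; examples are assumed to arrive as (x, y) pairs with y in {+1, -1} and x normalized to unit Euclidean length, as required in step 2).

import numpy as np

def perceptron(examples, n_features):
    """Online Perceptron for a homogeneous linear threshold function.
    'examples' is an iterable of (x, y) pairs with y in {+1, -1}."""
    w = np.zeros(n_features)                        # step 2: all-zeros weight vector
    mistakes = 0
    for x, y in examples:
        x = np.asarray(x, dtype=float)
        prediction = 1 if np.dot(w, x) > 0 else -1  # step 3: predict positive iff w.x > 0
        if prediction != y:                         # step 4: update only on a mistake
            w = w + y * x                           # +x on a false negative, -x on a false positive
            mistakes += 1
    return w, mistakes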
In other words, if we make a mistake on a positive example, the weights are updated to move the prediction closer to the positive side of the hyperplane; similarly, if we make a mistake on a negative example, the weights are decreased to move the prediction closer to the desired value. The success of applying the Perceptron algorithm naturally depends on the data. If the data are well linearly separated, we can expect that γ ≥ 1/n, where n is the size of the sequence of examples S. In the worst case, γ can be exponentially small in n, which means that the number of mistakes made over the total sequence will be large. However, a nice property of the mistake bound is that it is independent of the number of features in the input feature space and depends purely on a geometric quantity. Thus, if the data are linearly separable by a large margin, the Perceptron is the right algorithm to use. If the data do not have a linear separator, one can apply the kernel trick by mapping the data to a higher-dimensional space, in the hope that they might be linearly separable there.
The Winnow algorithm similarly learns monotone disjunctions (e.g., h = x_1 ∨ x_2 ∨ ... ∨ x_p) in the mistake bound model and makes only O(r log p) mistakes, where r is the number of variables that actually appear in the target disjunction. This algorithm can also be used to track a target concept that changes over time. It is highly efficient when the number of relevant predictive attributes r is much smaller than the total number of variables p.
The Winnow algorithm maintains a set of weights w_1, ..., w_p, one for each variable. The algorithm, in its most simple form, proceeds as follows:
1. Initialize the weights w_1, ..., w_p to 1.
2. Given an example x = {x_1, x_2, ..., x_p}, output 1 if
   w_1 x_1 + w_2 x_2 + ... + w_p x_p ≥ p
   and output 0 otherwise.
3. If the algorithm makes a mistake:
   (a) If it predicts negative on a positive example, then for each x_i equal to 1, double the value of w_i.
   (b) If it predicts positive on a negative example, then for each x_i equal to 1, cut the value of w_i in half.
4. Go to 2.
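The same steps can be transcribed into Python as follows (again a minimal sketch; Boolean inputs are assumed to be encoded as 0/1 vectors of length p, and the threshold is p, as in step 2).

def winnow(examples, p):
    """Basic Winnow for monotone disjunctions over p Boolean variables.
    'examples' is an iterable of (x, y) pairs with x a 0/1 list of length p and y in {0, 1}."""
    w = [1.0] * p                                   # step 1: all weights start at 1
    mistakes = 0
    for x, y in examples:
        total = sum(wi * xi for wi, xi in zip(w, x))
        prediction = 1 if total >= p else 0         # step 2: weighted sum against threshold p
        if prediction != y:                         # step 3: update only on a mistake
            mistakes += 1
            for i in range(p):
                if x[i] == 1:
                    w[i] = w[i] * 2 if y == 1 else w[i] / 2   # (a) promotion or (b) demotion
    return w, mistakes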
The Winnow algorithm does not guarantee successful convergence to the exact target concept, since the target concept may not be linearly separable. However, its performance can still be bounded, even when not all examples are consistent with some target disjunction, if one is only able to count the number of attribute errors in the data with respect to c. Having realized that a concept may not be learnable, a more practical question to ask is "How badly does the algorithm perform, in terms of predictive accuracy, with respect to the best hypothesis that can be learned on the given sequence of examples?". The following section presents in more detail a very popular, practically relevant algorithm for online learning.
2.4.2 Predicting from Experts Advice
The algorithm presented in this section tackles the problem of "predicting from expert advice". While this problem is simpler than the general problem of online learning, it has even greater practical relevance. A learning algorithm is given the task of predicting one of two possible outcomes, given the advice of p "experts". Each expert predicts "yes" or "no", and the learning algorithm must use this information to make its own prediction. After making the
prediction, the algorithm is told the correct outcome. Thus, given a continuous stream of examples fed to the experts, a final prediction has to be produced after every example.
The very simple algorithm called the Weighted Majority Algorithm (Littlestone and Warmuth, 1994) solves this basic problem by maintaining a list of weights w_1, w_2, ..., w_p, one for each expert, which are updated every time the correct outcome is received, such that each mistaken expert is penalized by multiplying its weight by 1/2. The algorithm predicts with a weighted majority vote of the expert opinions. As such, it does not eliminate a hypothesis that is found to be inconsistent with some training example, but rather reduces its weight. This enables it to accommodate inconsistent training data. The Weighted Majority Algorithm has another very interesting property: the number of mistakes it makes is never more than 2.42(m + log p), where m is the number of mistakes made by the best expert so far.
of mistakes made by the best expert so far.There are two important observations that
we can make based on the above described problem and algorithm.First,an ensemble of
experts which forms its prediction as a linear combination of the experts predictions should
be considered in the rst place if the user has a reason to believe that there is a single best
expert over the whole sequence of examples that is unknown.Since no assumptions are
made on the quality of the predictions or the relation between the expert prediction and
the true outcome,the natural goal is to perform nearly as well as the best expert so far.
Second,the target distribution is assumed to be stationary,and hence the best expert will
remain best over the whole sequence.
These assumptions may not be valid in practice. However, the Weighted Majority Algorithm has served as the basis for extensive research on relative loss bounds for online algorithms, where the additional loss of the algorithm on the whole sequence of examples over the loss of the best expert is bounded. An interesting generalization of these relative loss bounds, given by Herbster and Warmuth (1998), allows the sequence to be partitioned into segments, with the goal of bounding the additional loss of the algorithm over the sum of the losses of the best experts for each segment. This models situations in which the concepts change and different experts are best for different segments of the sequence of examples. The experts may be viewed as oracles external to the algorithm, and thus may represent the predictions of a neural net, a decision tree, a physical sensor, or perhaps even of a human expert. Although the algorithms do not produce the best partition, their predictions are close to those of the best partition. In particular, when the number of segments is k + 1 and the sequence is of length l, the additional loss of their algorithm over the best partition is bounded by O(k log p + k log(l/k)). This work is relevant in the context of online regression since it applies to four loss functions: the square loss, the relative entropy, the Hellinger distance (loss), and the absolute loss.
2.5 Learning under Non-stationary Distributions
Given an innite stream of instances,the challenge of every online learning algorithm is to
maintain an accurate hypothesis at any time.In order for a learner to be able to infer a
model,which would be a satisfactory approximation of the target concept c or function f,it
is necessary to assume a sequence of training examples generated by an unknown stationary
data distribution D.However,it is highly unlikely that the distribution will remain as
is indenitely.For that reason,throughout this work,we assume a setup in which the
distribution underlying the data changes with time.
Our learning setup is thus represented by a stream of sequences S_1, S_2, ..., S_i, ..., each of which is a sequence of instances a^i_1, a^i_2, ... drawn from the corresponding stationary distribution D_i. We expect that D_i will be replaced with the next, significantly different stationary distribution D_{i+1} after an unknown amount of time or number of instances. Besides changes in the distribution underlying the instances in X, we must take into account the possibility of changes in the target function. For example, in the simple
case of learning monotone disjunctions we can imagine that, from time to time, variables are added to or removed from the target function f = x_i ∨ x_j ∨ x_k. In general, we have to expect any kind of change in the shape of the target function or the target concept. Therefore, given a sequence of target functions f_1, f_2, ..., f_i, ..., the task of the learning algorithm is to take into account the changes in the distribution or in the target function, and adapt its current hypothesis accordingly.
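For experimentation, such a stream can be simulated by concatenating examples generated under a sequence of stationary target functions. The sketch below is purely illustrative; the Boolean attribute encoding, the segment length, and the example target functions are our own assumptions.

import random

def drifting_stream(concepts, segment_length, n_attributes=10):
    """Yield (x, y) pairs from a sequence of stationary concepts.
    'concepts' is a list of labeling functions f_i; the active concept changes every
    'segment_length' examples, producing abrupt concept drift."""
    while True:
        for f in concepts:
            for _ in range(segment_length):
                x = [random.randint(0, 1) for _ in range(n_attributes)]
                yield x, f(x)

# Example: drift from f1 = x1 OR x2 to f2 = x3 OR x4 after every 1000 examples
f1 = lambda x: int(x[0] or x[1])
f2 = lambda x: int(x[2] or x[3])
stream = drifting_stream([f1, f2], 1000)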
2.5.1 Tracking the Best Expert
In the eld of computational learning theory,a notable example of an online algorithm for
learning drifting concepts is the extension of the learning algorithms proposed by Herbster
and Warmuth (1998) in the context of tracking the best linear predictor (Herbster and
Warmuth,2001).The important dierence between this work and previous works is that
the predictor u
t
at each time point t is now allowed to change with time,and the total
online loss of the algorithm is compared to the sum of the losses of u
t
at each time point
plus the total cost for shifting to successive predictors.In other words,for a sequence S
of examples of length l a schedule of predictors hu
1
;u
2
;:::;u
l
i is dened.The total loss of
the online algorithm is thus bounded by the loss of the schedule of predictors on S and the
amount of shifting that occurs in the schedule.These types of bounds are called shifting
bounds.In order to obtain a shifting bound,it is normal to constrain the hypothesis of the
algorithm to a suitably chosen convex region.The new shifting bounds build on previous
work by the same authors,where the loss of the algorithmwas compared to the best shifting
disjunction (Auer and Warmuth,1998).The work on shifting experts has been applied to
predicting disk idle times (Helmbold et al.,2000) and load balancing problems (Blum and
Burch,2000).
While linear combinations of"experts"have been shown suitable for online learning,a
missing piece seems to be that the proposed algorithms assume that each expert (predictor)
at the end of a time point or a trial (receiving a training example,predicting and receiving
the correct output) is unrelated to the expert at the previous trial.Thus,there is some
information loss as compared to the setup where the experts are online algorithms,able to
update their hypothesis at the end of every trial.
2.5.2 Tracking Dierences over Sliding Windows
Due to the assumption that the learned models are relevant only over a window of the most recent data instances, there is a host of approaches based on some form of an adaptive