Real Time Data Mining
Saed Sayad
2010
Devoted to:
a dedicated mother, Khadijeh Razavi Sayad
a larger than life father, Mehdi Sayad
my beautiful wife, Sara Pourmolkara
and my wonderful children, Sahar and Salar
Foreword

Data now appear in very large quantities and in real time, but conventional data mining methods can only be applied to relatively small, accumulated data batches. This book shows how, by using a method termed the “Real Time Learning Machine” (RTLM), these methods can be readily upgraded to accept data as they are generated and to deal flexibly with changes in how they are to be processed. This real time data mining is the future of predictive modelling. The main purpose of the book is to enable you, the reader, to intelligently apply real time data mining to your own data. Thus, in addition to detailing the mathematics involved and providing simple numerical examples, a computer program is made freely available to allow you to learn the method using your data.

This book summarizes twenty years of method development by Dr. Saed Sayad. During that time he worked to apply data mining in areas ranging from bioinformatics through chemical engineering to financial engineering, in industry, in government and at university. This included the period since 1996 when Saed has worked with me to co-supervise graduate students in the Department of Chemical Engineering and Applied Chemistry at the University of Toronto. Portions of the method have previously been presented at data mining conferences, published in scientific publications and used in Ph.D. thesis research as well as in graduate courses at the University of Toronto. This book integrates all of the previous descriptions and updates the nomenclature. It is intended as a practical instruction manual for data miners and students of data mining.
I have often witnessed the initial reaction to the numerous claims made for this very elegant method: incredulity. There have been many far more complex attempts in the published literature to create practical real time data mining methods. None of these complex attempts has even come close to matching the capabilities of the RTLM. This book is uniquely important to the field of data mining: its combination of mathematical rigor, numerical examples, software implementation and documented references finally and comprehensively communicates Saed’s invention of the RTLM. Most importantly, it does accomplish his purpose in authoring the book: it very effectively enables you to apply the method with confidence to your data. So, I sincerely hope that you will suspend your disbelief for a moment about the claims for the method, read this book, experiment with the software using your own data and then appreciate just how powerful this apparently “too simple” real time data mining method is.

Stephen T. Balke Ph.D., P.Eng.
Professor Emeritus
University of Toronto
Toronto, Ontario, Canada
December 1, 2010
Two Kinds of Intelligence

There are two kinds of intelligence: one acquired,
as a child in school memorizes facts and concepts
from books and from what the teacher says,
collecting information from the traditional sciences
as well as from the new sciences.
With such intelligence you rise in the world.
You get ranked ahead or behind others
in regard to your competence in retaining
information. You stroll with this intelligence
in and out of fields of knowledge, getting always more
marks on your preserving tablets.
There is another kind of tablet, one
already completed and preserved inside you.
A spring overflowing its springbox. A freshness
in the center of the chest.
This other intelligence
does not turn yellow or stagnate. It's fluid,
and it doesn't move from outside to inside
through conduits of plumbing-learning.
This second knowing is a fountainhead
from within you, moving out.

From the translations of Rumi by Coleman Barks
Table of Contents

1. Introduction: Defining Real Time Data Mining
2. The Real Time Learning Machine (RTLM)
   2.1. Basic Elements Table
   2.2. Attribute Types
      2.2.1. Missing Values
3. Real Time Data Exploration
   3.1. Univariate Statistical Analysis
      3.1.1. Count
      3.1.2. Mean
      3.1.3. Variance
      3.1.4. Standard Deviation
      3.1.5. Coefficient of Variation
      3.1.6. Skewness
      3.1.7. Kurtosis
      3.1.8. Median
      3.1.9. Mode
      3.1.10. Minimum and Maximum
   3.2. Bivariate Statistical Analysis
      3.2.1. Covariance
      3.2.2. Linear Correlation Coefficient
      3.2.3. Conditional Univariate Statistics
      3.2.4. Z test
      3.2.5. T test
      3.2.6. F test
      3.2.7. Analysis of Variance
      3.2.8. Z test - Proportions
      3.2.9. Chi2 test
4. Real Time Classification
   4.1. Naïve Bayesian
   4.2. Linear Discriminant Analysis
      4.2.1. Quadratic Discriminant Analysis
   4.3. Linear Support Vector Machines
5. Real Time Regression
   5.1. Simple Linear Regression
   5.2. Multiple Linear Regression
   5.3. Principal Components Analysis
   5.4. Principal Components Regression
   5.5. Linear Support Vector Regression
6. Real Time Sequence Analysis
   6.1. Markov Chains
   6.2. Hidden Markov Models
7. Real Time Parallel Processing
References
Appendix A
Appendix B
1.0 Introduction

Data mining is about explaining the past and predicting the future by exploring and analyzing data. Data mining is a multi-disciplinary field which combines statistics, machine learning, artificial intelligence and database technology. Although data mining algorithms are widely used in extremely diverse situations, in practice one or more major limitations almost invariably appear and significantly constrain successful data mining applications. Frequently, these problems are associated with large increases in the rate of generation of data, the quantity of data and the number of attributes (variables) to be processed. Increasingly, the data situation is now beyond the capabilities of conventional data mining methods.

The term “Real Time” is used to describe how well a data mining algorithm can accommodate an ever increasing data load instantaneously. However, such real time problems are usually closely coupled with the fact that conventional data mining algorithms operate in a batch mode where having all of the relevant data at once is a requirement. Thus, here Real Time Data Mining is defined as having all of the following characteristics, independent of the amount of data involved:
1. Incremental learning (Learn): immediately updating a model with each new observation without the necessity of pooling new data with old data.
2. Decremental learning (Forget): immediately updating a model by excluding observations identified as adversely affecting model performance, without forming a new dataset omitting this data and returning to the model formulation step.
3. Attribute addition (Grow): adding a new attribute (variable) on the fly, without the necessity of pooling new data with old data.
4. Attribute deletion (Shrink): immediately discontinuing use of an attribute identified as adversely affecting model performance.
5. Scenario testing: rapid formulation and testing of multiple and diverse models to optimize prediction.
6. Real Time operation: instantaneous data exploration, modeling and model evaluation.
7. In-Line operation: processing that can be carried out in situ (e.g., in a mobile device, in a satellite, etc.).
8. Distributed processing: separately processing distributed data or segments of large data (that may be located in diverse geographic locations) and re-combining the results to obtain a single model.
9. Parallel processing: carrying out parallel processing extremely rapidly from multiple conventional processing units (multi-threads, multi-processors or a specialized chip).
Conventional data mining is upgraded to real time data mining through the use of a method termed the Real Time Learning Machine or RTLM. The use of the RTLM with conventional data mining methods enables “Real Time Data Mining”. The future of predictive modeling belongs to real time data mining, and the main motivation in authoring this book is to help you to understand the method and to implement it for your application. The book provides previously published [1-6] and unpublished details on the implementation of real time data mining. Each section is followed by a simple numerical example illustrating the mathematics. Finally, recognizing that the best way to learn and to appreciate real time data mining is to apply it to your own data, a software program that easily enables you to accomplish this is provided and is free for non-commercial applications.
The book begins by showing equations enabling real time data exploration prior to the development of useful models. These “Real Time Equations (RTE)” appear similar to the usual ones seen in many textbooks. However, closer examination will reveal a slightly different notation than the conventional one. This notation is necessary to explain how a “Real Time Equation” differs from a conventional one. Then, it details how a “Basic Elements Table (BET)” is constructed from a dataset and used to achieve scalability and real time capabilities in a data mining algorithm. Finally, each of the following methods is examined in turn and the real time equations necessary for utilization of the Basic Elements Table are provided: Naïve Bayesian, linear discriminant analysis, linear support vector machines, multiple linear regression, principal component analysis and regression, linear support vector regression, Markov chains and hidden Markov models.
2.0 The Real Time Learning Machine

In the previous section, real time data mining algorithms were defined as having nine characteristics, independent of the amount of data involved. There it was mentioned that conventional data mining methods are not real time methods. For example, while learning in nature is incremental, on-line and in real time, as real time algorithms should be, most learning algorithms in data mining operate in a batch mode where having all the relevant data at once is a requirement. In this section we present a widely applicable novel architecture for upgrading conventional data mining methods to real time methods. This architecture is termed the “Real Time Learning Machine” (RTLM). This new architecture adds real time analytical power to the following widely used conventional learning algorithms:

- Naïve Bayesian
- Linear Discriminant Analysis
- Single and Multiple Linear Regression
- Principal Component Analysis and Regression
- Linear Support Vector Machines and Regression
- Markov Chains
- Hidden Markov Models

Conventionally, data mining algorithms interact directly with the whole dataset and must somehow accommodate the impact of new data, changes in attributes (variables), etc. An important feature of the RTLM is that, as shown in Figure 2.1, the modeling process is split into four separate components: Learner, Explorer, Modeler and Predictor. The data is summarized in the Basic Elements Table (BET), which is a relatively small table.

Figure 2.1 Real Time Learning Machine (RTLM)

The tasks assigned to each of the four real time components are as follows:

- Learner: updates (incrementally or decrementally) the Basic Elements Table utilizing the data in real time.
- Explorer: performs univariate and bivariate statistical data analysis using the Basic Elements Table in real time.
- Modeler: constructs models using the Basic Elements Table in real time.
- Predictor: uses the models for prediction in real time.
The RTLM is not constrained by the amount of data involved and is a mathematically rigorous method that makes parallel data processing readily achievable. As shown in Figures 2.2 and 2.3, a dataset of any size can be divided into smaller parts and each part can be processed separately (multi-threads, multi-processors or a specialized chip). The results can then be joined together to obtain the same model as if we had processed the whole dataset at once.
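The claim that partial results can be joined to reproduce the whole-dataset result can be checked with a minimal sketch. It assumes, for illustration only, that each part keeps just three basic elements for a single attribute (count, sum and sum of squares). The class name `Sums` and its methods are hypothetical and are not taken from the RTLM software.

```python
from dataclasses import dataclass


@dataclass
class Sums:
    """Hypothetical summary of one attribute: count, sum, sum of squares."""
    n: int = 0
    sx: float = 0.0
    sxx: float = 0.0

    def learn(self, x: float) -> None:
        # Add one observation to the summary.
        self.n += 1
        self.sx += x
        self.sxx += x * x

    def merge(self, other: "Sums") -> "Sums":
        # Joining two parts is element-wise addition of their summaries.
        return Sums(self.n + other.n, self.sx + other.sx, self.sxx + other.sxx)


def summarize(values):
    s = Sums()
    for v in values:
        s.learn(v)
    return s


data = [98, 75, 90, 78, 65]
whole = summarize(data)                                   # one pass over everything
merged = summarize(data[:2]).merge(summarize(data[2:]))   # two parts, then joined
```

Because every element is a plain sum, the order and number of parts do not matter; any statistic derived from the merged summary matches the one derived from the whole data.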
Figure 2.2 Parallel Real Time Learning Machine

Figure 2.3 Parallel Multi-layer Real Time Learning Machine
2.1 The Basic Elements Table

The Basic Elements Table building block includes two attributes, $X_i$ and $X_j$, and one or more basic elements $B_{ij}$:

Figure 2.4 Building block of the Basic Elements Table.

where $B_{ij}$ can consist of one or more of the following basic elements:

- $N_{ij}$: total number of joint occurrences of the two attributes
- $\sum X_i$ and $\sum X_j$: sum of data
- $\sum X_i X_j$: sum of multiplication
- $\sum X_i^2$ and $\sum X_j^2$: sum of squared data
- $\sum (X_i X_j)^2$: sum of squared multiplication

All seven basic elements above can be updated in real time (incrementally or decrementally) using the following general real time equation.

General Real Time Equation

$$B_{ij} := B_{ij} \pm B_{ij}^{new}$$

where (+) represents an incremental and (-) a decremental change of the basic elements.

The above seven basic elements are not the only ones; more elements could be included in this list, such as $\sum X^3$, $\sum X^4$ and more.

The number of attributes can also be updated in real time (incrementally or decrementally), simply by adding or removing the corresponding rows and columns and the related basic elements in the BET.
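As a rough illustration of the general real time equation, the sketch below keeps the seven basic elements for one attribute pair and applies the same update with a plus sign to learn an observation and a minus sign to forget it. The class `PairElements`, its field names and the `sign` parameter are assumptions made for this example, not the book's software interface.

```python
class PairElements:
    """Hypothetical holder of the seven basic elements for a pair (x_i, x_j)."""

    FIELDS = ("n", "sum_i", "sum_j", "sum_ij", "sum_ii", "sum_jj", "sum_ij_sq")

    def __init__(self):
        for name in self.FIELDS:
            setattr(self, name, 0)

    def update(self, xi, xj, sign=+1):
        """Apply B := B +/- B_new for one observation (+1 learn, -1 forget)."""
        new = (1, xi, xj, xi * xj, xi * xi, xj * xj, (xi * xj) ** 2)
        for name, value in zip(self.FIELDS, new):
            setattr(self, name, getattr(self, name) + sign * value)

    def as_tuple(self):
        return tuple(getattr(self, name) for name in self.FIELDS)


learned = PairElements()
for xi, xj in [(98, 88), (75, 70), (90, 96)]:
    learned.update(xi, xj)            # incremental learning
learned.update(90, 96, sign=-1)       # decremental learning: forget one row

reference = PairElements()            # built only from the rows that remain
for xi, xj in [(98, 88), (75, 70)]:
    reference.update(xi, xj)
```

After forgetting the third row, `learned` holds exactly the same elements as `reference`, which never saw that row.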
2.2 Attribute Types

There are only two types of attributes in the BET: Numeric and Categorical (Binary). Numerical attributes can also be discretized (binning). Categorical attributes and the discretized versions of numerical attributes must be encoded into binary (0, 1). The following example shows how to transform a categorical attribute to its binary counterparts.

Temperature  Humidity  Play
98           88        no
75           70        yes
90           96        no
78           ?         yes
65           60        yes

Table 2.1 Original dataset.

Temperature  Humidity  Play.yes  Play.no
98           88        0         1
75           70        1         0
90           96        0         1
78           ?         1         0
65           60        1         0

Table 2.2 Categorical attribute (Play) is transformed to two binary attributes.

Temperature  Temp.hot  Temp.moderate  Temp.mild  Humidity  Play.yes  Play.no
98           1         0              0          88        0         1
75           0         1              0          70        1         0
90           1         0              0          96        0         1
78           0         1              0          ?         1         0
65           0         0              1          60        1         0

Table 2.3 Final transformed dataset with three new binary attributes which are created by discretizing Temperature.

Note: Technically speaking, for a categorical attribute with k categories we only need to create k - 1 binary attributes, but we do not suggest this for the RTLM implementation.
2.2.1 Missing Values

Simply put, all non-numeric values in the dataset are considered missing values. For example, the “?” in the above dataset will be ignored by the RTLM Learner. There is no need for any missing values policy here, because the RTLM can build a new model on the fly by excluding attributes with missing values.
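Basic Elements Table - Example

The Basic Elements Table for the above sample dataset with four basic elements is shown below. The RTLM Learner updates the Basic Elements Table with any new incoming data. Each entry holds $N_{ij}$, $\sum X_i$, $\sum X_j$ and $\sum X_i X_j$ for one pair of attributes; only the upper triangle of the table is listed because the table is symmetric.

X_i            X_j            N_ij  ΣX_i  ΣX_j  ΣX_iX_j
Temperature    Temperature    5     406   406   33638
Temperature    Temp.hot       5     406   2     188
Temperature    Temp.moderate  5     406   2     153
Temperature    Temp.mild      5     406   1     65
Temperature    Humidity       4     328   314   26414
Temperature    Play.yes       5     406   3     218
Temperature    Play.no        5     406   2     188
Temp.hot       Temp.hot       5     2     2     2
Temp.hot       Temp.moderate  5     2     2     0
Temp.hot       Temp.mild      5     2     1     0
Temp.hot       Humidity       4     2     314   184
Temp.hot       Play.yes       5     2     3     0
Temp.hot       Play.no        5     2     2     2
Temp.moderate  Temp.moderate  5     2     2     2
Temp.moderate  Temp.mild      5     2     1     0
Temp.moderate  Humidity       4     1     314   70
Temp.moderate  Play.yes       5     2     3     2
Temp.moderate  Play.no        5     2     2     0
Temp.mild      Temp.mild      5     1     1     1
Temp.mild      Humidity       4     1     314   60
Temp.mild      Play.yes       5     1     3     1
Temp.mild      Play.no        5     1     2     0
Humidity       Humidity       4     314   314   25460
Humidity       Play.yes       4     314   2     130
Humidity       Play.no        4     314   2     184
Play.yes       Play.yes       5     3     3     3
Play.yes       Play.no        5     3     2     0
Play.no        Play.no        5     2     2     2

Table 2.4 Basic Elements Table for the sample dataset.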
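The entries of Table 2.4 can be reproduced with a short sketch. It assumes the transformed dataset of Table 2.3 is already available as rows of numbers, represents the “?” as `None`, and skips a row for any pair in which either value is missing. The function name `build_bet` and the dictionary layout are illustrative choices, not the RTLM Learner's actual data structure.

```python
from itertools import combinations_with_replacement

NAMES = ["Temperature", "Temp.hot", "Temp.moderate", "Temp.mild",
         "Humidity", "Play.yes", "Play.no"]
ROWS = [
    (98, 1, 0, 0, 88, 0, 1),
    (75, 0, 1, 0, 70, 1, 0),
    (90, 1, 0, 0, 96, 0, 1),
    (78, 0, 1, 0, None, 1, 0),   # "?" in Table 2.3 is stored as None
    (65, 0, 0, 1, 60, 1, 0),
]


def build_bet(rows, names):
    """Return {(name_i, name_j): (N_ij, sum X_i, sum X_j, sum X_i*X_j)}."""
    bet = {}
    for i, j in combinations_with_replacement(range(len(names)), 2):
        n = sum_i = sum_j = sum_ij = 0
        for row in rows:
            xi, xj = row[i], row[j]
            if xi is None or xj is None:
                continue             # a missing value is ignored for this pair
            n += 1
            sum_i += xi
            sum_j += xj
            sum_ij += xi * xj
        bet[(names[i], names[j])] = (n, sum_i, sum_j, sum_ij)
    return bet


bet = build_bet(ROWS, NAMES)
```

Note how the pair (Temp.moderate, Humidity) has a count of 4 and a sum of 1 for Temp.moderate: the row with the missing Humidity value contributes nothing to that pair.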
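Here, it is shown how to compute some of the statistics needed by many modeling algorithms using only the basic elements.

Numerical Variable

$$\text{Average}(X_i) = \frac{\sum X_i}{N_{ii}}$$

$$\text{Variance}(X_i) = \frac{\sum X_i^2 - \frac{(\sum X_i)^2}{N_{ii}}}{N_{ii}}$$

Categorical Variable

$$\text{Count}(X_i) = \sum X_i$$

$$\text{Probability}(X_i) = \frac{\sum X_i}{N_{ii}}$$

Numerical Variable and Numerical Variable

$$\text{Covariance}(X_i, X_j) = \frac{\sum X_i X_j - \frac{\sum X_i \sum X_j}{N_{ij}}}{N_{ij}}$$

Numerical Variable and Categorical Variable

$$\text{Average}(X_i \mid X_j = 1) = \frac{\sum X_i X_j}{\sum X_j}$$

Categorical Variable and Categorical Variable

$$\text{Probability}(X_i = 1 \mid X_j = 1) = \frac{\sum X_i X_j}{\sum X_j}$$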
3.0 Real Time Data Exploration

Data Exploration is about describing the data by means of statistical and visualization techniques. In this section the focus is on statistical methods, with emphasis on how specific statistical quantities can be calculated using “Real Time Equations (RTE)”. In subsequent sections it will be seen that the “Real Time Equations” enable upgrading of conventional data mining techniques to their real time counterparts.

Real Time Data Exploration will be discussed in the following two categories:

1. Real Time Univariate Data Exploration
2. Real Time Bivariate Data Exploration
3.1 Real Time Univariate Data Exploration

Univariate data analysis explores attributes (variables) one by one using statistical analysis. Attributes are either numerical or categorical (encoded to binary). Numerical attributes can be transformed into categorical counterparts by discretization or binning. An example is “Age” with three categories (bins): 20-39, 40-59 and 60-79. Equal Width and Equal Frequency are two popular binning methods. Moreover, binning may improve the accuracy of predictive models by reducing noise or non-linearity, and it allows easy identification of outliers, invalid values and missing values.
3.1.1 Count

The total count of k subsets of attribute X can be computed in real time. This means a dataset can be divided into k subsets and the count of the whole data will be equal to the total count of all its subsets.

Real Time Equation 1: Count

$$N := N \pm N^{new} \tag{1}$$

The ± notation preceding $N^{new}$ in the above equation means that the number of data in a subset is a positive quantity if those data are being added to the BET (incremental learning) or a negative quantity if those data are being subtracted from the BET (decremental learning).
3.1.2 Mean (Average)

The mean or average is a point estimate of a set of data. As normally written, the average is not a real time equation because averages with different N cannot be added or subtracted incrementally.

$$\bar{X} = \frac{\sum X}{N}$$

However, using the same notation as in the first real time equation, the summation can be written in a real time form as follows:

Real Time Equation 2: Sum of data

$$\sum X := \sum X \pm \sum X^{new} \tag{2}$$

Real Time Equation 3: Mean

$$\bar{X} = \frac{\sum X \pm \sum X^{new}}{N \pm N^{new}} \tag{3}$$

where:
$N$: Count (1)
$\sum X$: Sum of data (2)

Now the “Mean” can be written as a real time quantity because the whole data is not required each time it is calculated. Only the values of the subset sums are required, along with the count in each subset.
3.1.3 Variance

The variance is a measure of data dispersion or variability. A low variance indicates that the data tend to be very close to the mean, whereas a high variance indicates that the data are spread out over a large range of values. Similarly, the variance equation is not real time by itself.

$$S^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}$$

However, the sums involved can be written as sums over subsets rather than over the whole data.

Real Time Equation 4: Sum of Squared data

$$\sum X^2 := \sum X^2 \pm \sum (X^{new})^2 \tag{4}$$

Now we can write real time equations for the variance and the standard deviation using the basic elements.

Real Time Equation 5: Variance

$$S^2 = \frac{\left(\sum X^2 \pm \sum (X^{new})^2\right) - \frac{\left(\sum X \pm \sum X^{new}\right)^2}{(N \pm N^{new})}}{(N \pm N^{new})} \tag{5}$$

Note: If the number of data is less than 30, we should replace N in the final denominator with N - 1.
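A small sketch of the variance computed from the three basic elements $N$, $\sum X$ and $\sum X^2$ follows, including a decremental step. The `sample` flag is an assumption introduced here to switch the final divisor from N to N - 1 as the note describes; the function name and the data (the Temperature column of the sample dataset) are used only for illustration.

```python
def variance(n, sum_x, sum_xx, sample=False):
    """Variance from the basic elements N, sum(X) and sum(X^2)."""
    denom = n - 1 if sample else n     # N - 1 for small samples, per the note
    return (sum_xx - sum_x ** 2 / n) / denom


data = [98, 75, 90, 78, 65]
n, sum_x, sum_xx = len(data), sum(data), sum(x * x for x in data)
var_all = variance(n, sum_x, sum_xx)

# Forget the observation 65 by subtracting its contribution from each element;
# the remaining data are never revisited.
n, sum_x, sum_xx = n - 1, sum_x - 65, sum_xx - 65 ** 2
var_after_forget = variance(n, sum_x, sum_xx)
```

One practical caveat, stated as general numerical knowledge rather than something the text claims: subtracting two large sums can lose precision when the variance is tiny relative to the mean, so double precision (or exact integer sums where possible) is advisable.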
3.1.4 Standard Deviation

The standard deviation, like the variance, is a measure of variability or dispersion. A low standard deviation indicates that the data tend to be very close to the mean, whereas a high standard deviation indicates that the data are spread out over a large range of values.

Real Time Equation 6: Standard Deviation

$$S = \sqrt{S^2} = \sqrt{\frac{\left(\sum X^2 \pm \sum (X^{new})^2\right) - \frac{\left(\sum X \pm \sum X^{new}\right)^2}{(N \pm N^{new})}}{(N \pm N^{new})}} \tag{6}$$

3.1.5 Coefficient of Variation

The coefficient of variation is a standardized measure of the dispersion or variability in data. CV is independent of the units of measurement.

Real Time Equation 7: Coefficient of Variation

$$CV = \frac{S}{\bar{X}} \tag{7}$$

where:
$\bar{X}$: Average (3)
$S$: Standard Deviation (6)
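3.1.6 Skewness

Skewness is a measure of symmetry or asymmetry in the distribution of data. The skewness equation is not real time by itself, but its components are.

$$Skew = \frac{N}{(N-1)(N-2)} \sum \left(\frac{X - \bar{X}}{S}\right)^3$$

First we need to expand the aggregate part of the equation:

$$\sum \left(\frac{X - \bar{X}}{S}\right)^3 = \frac{1}{S^3}\sum \left(X^3 - 3X^2\bar{X} + 3X\bar{X}^2 - \bar{X}^3\right) = \frac{1}{S^3}\left(\sum X^3 - 3\bar{X}\sum X^2 + 3\bar{X}^2\sum X - N\bar{X}^3\right)$$

where:
$N$: Count (1)
$\sum X$: Sum of data (2)
$\bar{X}$: Average (3)
$\sum X^2$: Sum of Squared data (4)
$S$: Standard Deviation (6)

Real Time Equation 8: Sum of data to the power of 3

$$\sum X^3 := \sum X^3 \pm \sum (X^{new})^3 \tag{8}$$

Real Time Equation 9: Skewness

$$Skew = \frac{N}{(N-1)(N-2)} \cdot \frac{\sum X^3 - 3\bar{X}\sum X^2 + 3\bar{X}^2\sum X - N\bar{X}^3}{S^3} \tag{9}$$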
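3.1.7 Kurtosis

Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. Like skewness, the standard equation for kurtosis is not a real time equation but can be transformed to be real time.

$$Kurt = \left[\frac{N(N+1)}{(N-1)(N-2)(N-3)} \sum \left(\frac{X - \bar{X}}{S}\right)^4\right] - \frac{3(N-1)^2}{(N-2)(N-3)}$$

First we need to expand the aggregate part of the equation:

$$\sum \left(\frac{X - \bar{X}}{S}\right)^4 = \frac{1}{S^4}\sum \left(X^4 - 4X^3\bar{X} + 6X^2\bar{X}^2 - 4X\bar{X}^3 + \bar{X}^4\right) = \frac{1}{S^4}\left(\sum X^4 - 4\bar{X}\sum X^3 + 6\bar{X}^2\sum X^2 - 4\bar{X}^3\sum X + N\bar{X}^4\right)$$

where:
$N$: Count (1)
$\sum X$: Sum of data (2)
$\bar{X}$: Average (3)
$\sum X^2$: Sum of Squared data (4)
$S$: Standard Deviation (6)
$\sum X^3$: Sum of data to the power of 3 (8)

Real Time Equation 10: Sum of data to the power of 4

$$\sum X^4 := \sum X^4 \pm \sum (X^{new})^4 \tag{10}$$

Real Time Equation 11: Kurtosis

$$Kurt = \left[\frac{N(N+1)}{(N-1)(N-2)(N-3)} \cdot \frac{\sum X^4 - 4\bar{X}\sum X^3 + 6\bar{X}^2\sum X^2 - 4\bar{X}^3\sum X + N\bar{X}^4}{S^4}\right] - \frac{3(N-1)^2}{(N-2)(N-3)} \tag{11}$$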
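To make the expansion concrete, the sketch below computes skewness and kurtosis from the count and the four power sums only. It assumes the bias-adjusted forms (the $N/((N-1)(N-2))$ and $N(N+1)/((N-1)(N-2)(N-3))$ prefactors) together with a sample standard deviation that uses an N - 1 divisor; that pairing is a common convention adopted here as an assumption. The function name is hypothetical.

```python
import math


def skewness_kurtosis(n, s1, s2, s3, s4):
    """Skewness and kurtosis from N and the power sums sum(X^k), k = 1..4."""
    mean = s1 / n
    # Assumption: S is the sample standard deviation (N - 1 divisor).
    std = math.sqrt((s2 - s1 ** 2 / n) / (n - 1))
    # Expanded central sums, written only in terms of the power sums.
    m3 = s3 - 3 * mean * s2 + 3 * mean ** 2 * s1 - n * mean ** 3
    m4 = (s4 - 4 * mean * s3 + 6 * mean ** 2 * s2
          - 4 * mean ** 3 * s1 + n * mean ** 4)
    skew = n / ((n - 1) * (n - 2)) * m3 / std ** 3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * m4 / std ** 4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return skew, kurt


data = [1, 2, 3, 4, 5]
power_sums = [sum(x ** k for x in data) for k in range(1, 5)]
skew, kurt = skewness_kurtosis(len(data), *power_sums)
```

For this symmetric toy data the skewness is zero, and the kurtosis is negative because the values are spread evenly rather than peaked. The formulas require at least four observations and a non-zero standard deviation.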
3.1.8 Median

The median is the middle data point, with an equal number of data points lying below and above it. The median equation is not real time and cannot be directly transformed to real time. However, by using a discretized (binned) version of the attribute we can often obtain a good estimate of the median.

Real Time Equation 12: Median

$$Median = L_1 + \left[\frac{\frac{N+1}{2} - F_{j-1}}{N_j}\right] \times h \tag{12}$$

where:
- Figure out which bin contains the median by using the $(N+1)/2$ formula.
- $N_j$ is the count for the median bin.
- $N$ is the total count for all bins.
- $F_{j-1}$ is the cumulative count of the bins preceding the median bin.
- $h$ is the range (width) of each bin.
- $L_1$ is the lower limit value of the median bin.
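The binned estimate can be sketched as follows. The code assumes equal-width bins, locates the median bin with the $(N+1)/2$ rank mentioned above, and interpolates linearly inside that bin using the cumulative count of the preceding bins. Some textbooks interpolate with $N/2$ instead, so the exact rank used here is an assumption. The bin edges follow the “Age” example (20-39, 40-59, 60-79); the counts are made up for illustration.

```python
def binned_median(lower_edges, width, counts):
    """Estimate the median from per-bin counts of equal-width bins."""
    n = sum(counts)
    position = (n + 1) / 2          # rank of the median observation
    cumulative = 0                  # count in the bins before the current one
    for lower, count in zip(lower_edges, counts):
        if count > 0 and cumulative + count >= position:
            # Linear interpolation inside the median bin.
            return lower + (position - cumulative) / count * width
        cumulative += count
    raise ValueError("no observations")


# Hypothetical counts for the three Age bins starting at 20, 40 and 60.
estimate = binned_median([20, 40, 60], 20, [10, 20, 10])
```

Because the bin counts are themselves sums, they can be updated incrementally or decrementally, which is what makes this estimate usable in real time.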
3.1.9 Mode

The mode, like the median, cannot be transformed to real time. However, as with the median, a good estimate of the mode can be obtained from the discretized (binned) version of a numerical attribute. When numerical attributes are discretized into bins, the mode is defined as the bin where most observations lie.
3.1.10 Minimum and Maximum

The minimum and maximum can be updated in real time incrementally but not decrementally. This means that if we lose an existing maximum or minimum value, we would need to consider all historical data to replace it. One practical option is to use the lower bound (minimum) and upper bound (maximum) of the discretized version of a numerical attribute.
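The asymmetry between learning and forgetting an extreme value can be shown with a minimal sketch. The class `MaxTracker` and its `stale` flag are hypothetical names chosen for this example; the same reasoning applies to the minimum.

```python
class MaxTracker:
    """Hypothetical running maximum that supports learning but not forgetting."""

    def __init__(self):
        self.maximum = None
        self.stale = False

    def learn(self, x):
        # Incremental update: one comparison, no historical data needed.
        if self.maximum is None or x > self.maximum:
            self.maximum = x

    def forget(self, x):
        # Removing the current maximum leaves no way to know the runner-up
        # without revisiting historical data, so the value is flagged as stale.
        # (Conservative: a duplicate of the maximum may still be present.)
        if x == self.maximum:
            self.stale = True


tracker = MaxTracker()
for x in [98, 75, 90]:
    tracker.learn(x)
tracker.forget(75)                      # not the maximum: nothing changes
valid_after_small_forget = not tracker.stale
tracker.forget(98)                      # the maximum itself: now unknown
```

Once the flag is set, either the historical data must be rescanned or the upper bound of the highest non-empty bin can be used as an approximation, as suggested above.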
Summary of Real Time Univariate Data Analysis

All the above univariate real time statistical equations are based on only five basic elements. To calculate any of the univariate quantities, we only need to save and update the following five elements in the Basic Elements Table (BET).

Real Time Equation 1: Count

$$N := N \pm N^{new}$$

Real Time Equation 2: Sum of data

$$\sum X := \sum X \pm \sum X^{new}$$

Real Time Equation 4: Sum of Squared data

$$\sum X^2 := \sum X^2 \pm \sum (X^{new})^2$$

Real Time Equation 8: Sum of data to the power of 3

$$\sum X^3 := \sum X^3 \pm \sum (X^{new})^3$$

Real Time Equation 10: Sum of data to the power of 4

$$\sum X^4 := \sum X^4 \pm \sum (X^{new})^4$$
Real Time Univariate Data Analysis - Examples

The following univariate real time statistical quantities are based on the Iris dataset found in Appendix A. To calculate any of the univariate quantities, we only need to use the elements of the Basic Elements Table (BET) generated from the Iris dataset. All the BET elements are updateable in real time.

Figure 3.1 Univariate analysis on a numerical attribute (sepal_length): count, mean, variance, standard deviation, coefficient of variation, skewness and kurtosis, each computed from the basic elements $N$, $\sum X$, $\sum X^2$, $\sum X^3$ and $\sum X^4$.

Figure 3.2 Univariate analysis on a categorical (binary) attribute (Iris_setosa): count, mean, variance and standard deviation, each computed from the basic elements $N$, $\sum X$ and $\sum X^2$.
3.2 Real Time Bivariate Data Analysis

Bivariate data analysis is the simultaneous analysis of two attributes (variables). It explores the concept of a relationship between two attributes: whether there is an association and the strength of this association, or whether there are differences between two attributes and the significance of these differences.
3.2.1 Covariance

Covariance measures the extent to which two numerical attributes vary together. That is, it is a measure of the linear relationship between two attributes.

Real Time Equation 13: Covariance

$$Covar(X_i, X_j) = \frac{\sum X_i X_j - \frac{\sum X_i \sum X_j}{N}}{N} \tag{13}$$

where:
$N$: Count (1)
$\sum X_i$, $\sum X_j$: Sum of data (2)

Real Time Equation 14: Sum of Multiplications

$$\sum X_i X_j := \sum X_i X_j \pm \sum X_i^{new} X_j^{new} \tag{14}$$

The following real time equation is also very useful.

Real Time Equation 15: Sum of Squared Multiplication

$$\sum (X_i X_j)^2 := \sum (X_i X_j)^2 \pm \sum (X_i^{new} X_j^{new})^2 \tag{15}$$
3.2.2 Linear Correlation Coefficient

Linear correlation quantifies the strength of a linear relationship between two attributes. When there is no correlation between two attributes, there is no tendency for the values of one quantity to increase or decrease with the values of the second quantity. The linear correlation coefficient measures the strength of a linear relationship and is always between -1 and 1, where -1 means perfect negative linear correlation, +1 means perfect positive linear correlation and zero means no linear correlation.

Real Time Equation 16: Linear Correlation Coefficient

$$R = \frac{Covar(X_i, X_j)}{\sqrt{S_i^2 \, S_j^2}} \tag{16}$$

where:
$Covar(X_i, X_j)$: Covariance (13)
$S_i^2$, $S_j^2$: Variance (5)
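The covariance and the linear correlation coefficient can be computed from five sums and a count, as the following sketch shows. It uses Temperature and Humidity from the sample dataset of Section 2.2, restricted to the four rows in which both values are present; the function names are illustrative.

```python
import math


def covariance(n, sx, sy, sxy):
    """Covariance from N, sum(X), sum(Y) and sum(X*Y)."""
    return (sxy - sx * sy / n) / n


def variance(n, sx, sxx):
    """Variance from N, sum(X) and sum(X^2)."""
    return (sxx - sx ** 2 / n) / n


def correlation(n, sx, sy, sxy, sxx, syy):
    """Linear correlation coefficient from the basic elements."""
    return covariance(n, sx, sy, sxy) / math.sqrt(
        variance(n, sx, sxx) * variance(n, sy, syy))


# Temperature (X) and Humidity (Y) over the four complete rows.
x = [98, 75, 90, 65]
y = [88, 70, 96, 60]
n, sx, sy = len(x), sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx, syy = sum(a * a for a in x), sum(b * b for b in y)

cov = covariance(n, sx, sy, sxy)
r = correlation(n, sx, sy, sxy, sxx, syy)
```

The count, the two sums and the sum of products used here are the values stored for the (Temperature, Humidity) pair in the Basic Elements Table of the sample dataset. The correlation is unchanged if the N - 1 divisor is used consistently, since the divisor cancels.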
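3.2.3 Conditional Univariate Statistics

The following equations define univariate statistics for an attribute $X_i$ given a binary attribute $X_j$ when $X_j = 1$. Many of the bivariate statistics rely on these conditional univariate statistics. The subscript $i|j$ denotes “$X_i$ given $X_j = 1$”.

Real Time Equation 17: Conditional Count

$$N_{i|j} = \sum X_j \tag{17}$$

Real Time Equation 18: Conditional Sum of data

$$\sum X_{i|j} = \sum X_i X_j \tag{18}$$

Real Time Equation 19: Conditional Sum of Squared data

$$\sum X_{i|j}^2 = \sum (X_i X_j)^2 \tag{19}$$

Real Time Equation 20: Conditional Mean

$$\bar{X}_{i|j} = \frac{\sum X_{i|j}}{N_{i|j}} = \frac{\sum X_i X_j}{\sum X_j} \tag{20}$$

Real Time Equation 21: Conditional Variance

$$S_{i|j}^2 = \frac{\sum X_{i|j}^2 - \frac{(\sum X_{i|j})^2}{N_{i|j}}}{N_{i|j}} = \frac{\sum (X_i X_j)^2 - \frac{(\sum X_i X_j)^2}{\sum X_j}}{\sum X_j} \tag{21}$$

Real Time Equation 22: Conditional Standard Deviation

$$S_{i|j} = \sqrt{S_{i|j}^2} = \sqrt{\frac{\sum X_{i|j}^2 - \frac{(\sum X_{i|j})^2}{N_{i|j}}}{N_{i|j}}} = \sqrt{\frac{\sum (X_i X_j)^2 - \frac{(\sum X_i X_j)^2}{\sum X_j}}{\sum X_j}} \tag{22}$$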
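Complement Conditional Univariate Statistics

For real time predictive modeling we also need to define conditional univariate statistics for an attribute $X_i$ given a binary attribute $X_j$ when $X_j = 0$. The subscript $i|\bar{j}$ denotes “$X_i$ given $X_j = 0$”.

Real Time Equation 23: Complement Conditional Count

$$N_{i|\bar{j}} = N - \sum X_j \tag{23}$$

Real Time Equation 24: Complement Conditional Sum of data

$$\sum X_{i|\bar{j}} = \sum X_i - \sum X_i X_j \tag{24}$$

Real Time Equation 25: Complement Conditional Sum of Squared data

$$\sum X_{i|\bar{j}}^2 = \sum X_i^2 - \sum (X_i X_j)^2 \tag{25}$$

Real Time Equation 26: Complement Conditional Mean

$$\bar{X}_{i|\bar{j}} = \frac{\sum X_{i|\bar{j}}}{N_{i|\bar{j}}} = \frac{\sum X_i - \sum X_i X_j}{N - \sum X_j} \tag{26}$$

Real Time Equation 27: Complement Conditional Variance

$$S_{i|\bar{j}}^2 = \frac{\sum X_{i|\bar{j}}^2 - \frac{(\sum X_{i|\bar{j}})^2}{N_{i|\bar{j}}}}{N_{i|\bar{j}}} = \frac{\left(\sum X_i^2 - \sum (X_i X_j)^2\right) - \frac{\left(\sum X_i - \sum X_i X_j\right)^2}{\left(N - \sum X_j\right)}}{\left(N - \sum X_j\right)} \tag{27}$$

Real Time Equation 28: Complement Conditional Standard Deviation

$$S_{i|\bar{j}} = \sqrt{S_{i|\bar{j}}^2} = \sqrt{\frac{\left(\sum X_i^2 - \sum (X_i X_j)^2\right) - \frac{\left(\sum X_i - \sum X_i X_j\right)^2}{\left(N - \sum X_j\right)}}{\left(N - \sum X_j\right)}} \tag{28}$$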
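The conditional and complement conditional statistics can be sketched with a single helper that takes only sums. The example uses Temperature as the numerical attribute and Play.yes as the binary attribute from the sample dataset of Section 2.2; the function name and argument order are assumptions made for illustration.

```python
def conditional_stats(n, sx, sxx, sy, sxy, sxy_sq):
    """Mean and variance of X given Y = 1 and given Y = 0 (Y binary).

    n, sx, sxx : count, sum and sum of squares of X
    sy         : sum of Y, i.e. the number of rows with Y = 1
    sxy        : sum of X*Y, i.e. the sum of X over rows with Y = 1
    sxy_sq     : sum of (X*Y)^2, i.e. the sum of X^2 over rows with Y = 1
    """
    def mean_var(count, total, total_sq):
        mean = total / count
        return mean, (total_sq - total ** 2 / count) / count

    given_one = mean_var(sy, sxy, sxy_sq)
    given_zero = mean_var(n - sy, sx - sxy, sxx - sxy_sq)
    return given_one, given_zero


temperature = [98, 75, 90, 78, 65]
play_yes = [0, 1, 0, 1, 1]

n = len(temperature)
sx = sum(temperature)
sxx = sum(x * x for x in temperature)
sy = sum(play_yes)
sxy = sum(x * y for x, y in zip(temperature, play_yes))
sxy_sq = sum((x * y) ** 2 for x, y in zip(temperature, play_yes))

(mean_yes, var_yes), (mean_no, var_no) = conditional_stats(
    n, sx, sxx, sy, sxy, sxy_sq)
```

Because the binary attribute only takes the values 0 and 1, multiplying by it acts as a row filter, which is why the sum of squared multiplication is enough to recover the conditional sum of squares.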
3.2.4 Z test

The Z test assesses whether the difference between the averages of two attributes is statistically significant. This analysis is appropriate for comparing the average of a numerical attribute with a known average, or two conditional averages of a numerical attribute given two binary attributes (two categories of the same categorical attribute).

Real Time Equation 29: Z test - one group

$$Z = \frac{\bar{X} - \mu_0}{S / \sqrt{N}} \tag{29}$$

where:
$\bar{X}$: Mean or Conditional Mean (3 or 20)
$S$: Standard Deviation or Conditional Standard Deviation (6 or 22)
$N$: Count or Conditional Count (1 or 17)
$\mu_0$: known average

The probability of Z (using the normal distribution) defines the significance of the difference between the two averages.

Real Time Equation 30: Z test - two groups

$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_1^2}{N_1} + \frac{S_2^2}{N_2}}} \tag{30}$$

where:
$\bar{X}_1$, $\bar{X}_2$: Conditional Mean (20)
$S_1^2$, $S_2^2$: Conditional Variance (21)
$N_1$, $N_2$: Conditional Count (17)
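A two-group Z test needs only the conditional count, mean and variance of each group, as this sketch illustrates. The summary numbers passed in are made up for the example, and the two-sided p-value is computed with the standard library's `statistics.NormalDist`; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def z_test_two_groups(n1, mean1, var1, n2, mean2, var2):
    """Z statistic and two-sided p-value from per-group summaries."""
    z = (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


# Hypothetical summaries: two groups of 50 observations each.
z, p = z_test_two_groups(50, 10.0, 4.0, 50, 9.0, 4.0)
```

Since the inputs are quantities derived from the Basic Elements Table, the test can be re-evaluated after every update without touching the raw data.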
3.2.5 T test

The T test, like the Z test, assesses whether the averages of two numerical attributes are statistically different from each other when the number of data points is less than 30. The T test is appropriate for comparing the average of a numerical attribute with a known average, or two conditional averages of a numerical attribute given two binary attributes (two categories of the same categorical attribute).

Real Time Equation 31: T test - one group

$$t = \frac{\bar{X} - \mu_0}{S / \sqrt{N}} \tag{31}$$

where:
$\bar{X}$: Mean or Conditional Mean (3 or 20)
$S$: Standard Deviation or Conditional Standard Deviation (6 or 22)
$N$: Count or Conditional Count (1 or 17)
$\mu_0$: known average

The probability of t (using the t distribution with N - 1 degrees of freedom) defines whether the difference between the two averages is statistically significant.

Real Time Equation 32: T test - two groups

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S^2 \left(\frac{1}{N_1} + \frac{1}{N_2}\right)}} \tag{32}$$

$$S^2 = \frac{(N_1 - 1) S_1^2 + (N_2 - 1) S_2^2}{N_1 + N_2 - 2}$$

where:
$\bar{X}_1$, $\bar{X}_2$: Conditional Mean (20)
$S_1^2$, $S_2^2$: Conditional Variance (21)
$N_1$, $N_2$: Conditional Count (17)
3.2.6 F test

The F test is used to compare the variances of two attributes. The F test can be used for comparing the variance of a numerical attribute with a known variance, or two conditional variances of a numerical attribute given two binary attributes (two categories of the same categorical attribute).

Real Time Equation 33: F test - one group

$$\chi^2 = \frac{(N - 1) S^2}{\sigma_0^2} \tag{33}$$

where:
$N$: Count or Conditional Count (1 or 17)
$S^2$: Variance or Conditional Variance (5 or 21)
$\sigma_0^2$: known variance
$\chi^2$: has a Chi2 distribution with N - 1 degrees of freedom

Real Time Equation 34: F test - two groups

$$F = \frac{S_1^2}{S_2^2} \tag{34}$$

where:
$S_1^2$, $S_2^2$: Conditional Variance (21)
$F$: has an F distribution with $N_1 - 1$ and $N_2 - 1$ degrees of freedom
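3.2.7 Analysis of Variance (ANOVA)

ANOVA assesses whether the averages of more than two groups are statistically different from each other, under the assumption that the corresponding populations are normally distributed. ANOVA is useful for comparing the averages of two or more numerical attributes, or two or more conditional averages of a numerical attribute given two or more binary attributes (two or more categories of the same categorical attribute).

Source of Variation  Sum of Squares  Degree of Freedom  Mean Squares       F              Probability
Between Groups       SS_B            df_B               MS_B = SS_B/df_B   F = MS_B/MS_W  P(F)
Within Groups        SS_W            df_W               MS_W = SS_W/df_W
Total                SS_T            df_T

Figure 3.3 Analysis of Variance and its components.

In the equations below the index $k$ runs over the groups, and $N_k$, $\sum X_k$ and $\sum X_k^2$ are the conditional count, sum and sum of squares of the numerical attribute within group $k$.

Real Time Equation 35: Sum of Squares Between Groups

$$SS_B = \sum_k \frac{\left(\sum X_k\right)^2}{N_k} - \frac{\left(\sum_k \sum X_k\right)^2}{\sum_k N_k} \tag{35}$$

where:
$N_k$: Conditional Count (17)
$\sum X_k$: Conditional Sum of data (18)

Real Time Equation 36: Sum of Squares Within Groups

$$SS_W = \sum_k \sum X_k^2 - \sum_k \frac{\left(\sum X_k\right)^2}{N_k} \tag{36}$$

where:
$N_k$: Conditional Count (17)
$\sum X_k$: Conditional Sum of data (18)
$\sum X_k^2$: Conditional Sum of Squared data (19)

Real Time Equation 37: Sum of Squares Total

$$SS_T = SS_B + SS_W = \sum_k \sum X_k^2 - \frac{\left(\sum_k \sum X_k\right)^2}{\sum_k N_k} \tag{37}$$

$F$: has an F distribution with $df_B$ and $df_W$ degrees of freedom.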
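A one-way ANOVA can be assembled from the per-group count, sum and sum of squares alone, as in the sketch below. It assumes the usual degrees of freedom, k - 1 between groups and N - k within groups, where k is the number of groups and N the total count. The function name and the toy samples are illustrative.

```python
def anova_from_sums(groups):
    """One-way ANOVA from (count, sum, sum of squares) per group.

    Returns (SS_between, SS_within, F). Assumes k - 1 and N - k
    degrees of freedom for the between and within mean squares.
    """
    k = len(groups)
    n_total = sum(n for n, _, _ in groups)
    sum_total = sum(s for _, s, _ in groups)
    per_group = sum(s ** 2 / n for n, s, _ in groups)   # sum of (sum X_k)^2 / N_k
    ss_between = per_group - sum_total ** 2 / n_total
    ss_within = sum(ss for _, _, ss in groups) - per_group
    f = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    return ss_between, ss_within, f


# Toy samples, reduced to their summaries before the test is run.
samples = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
groups = [(len(g), sum(g), sum(x * x for x in g)) for g in samples]
ss_between, ss_within, f = anova_from_sums(groups)
```

The raw samples appear only to build the summaries; in a real time setting those summaries would come directly from the conditional basic elements of each group.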
3.2.8 Z test - Proportions

The Z test can also be used to compare proportions. It can be used to compare a proportion from one categorical attribute with a known proportion, or to compare two proportions originating from two binary attributes (two categories of the same categorical attribute).

Real Time Equation 38: Z test - one group

$$Z = \frac{\frac{\sum X}{N} - P_0}{\sqrt{\frac{P_0 (1 - P_0)}{N}}} \tag{38}$$

where:
$N$: Count or Conditional Count (1 or 17)
$\sum X$: Sum of data or Conditional Sum of data (2 or 18)
$P_0$: known probability
$Z$: has a normal distribution

Real Time Equation 39: Z test - two groups

$$Z = \frac{\frac{\sum X_1}{N_1} - \frac{\sum X_2}{N_2}}{\sqrt{\hat{P} (1 - \hat{P}) \left(\frac{1}{N_1} + \frac{1}{N_2}\right)}} \tag{39}$$

$$\hat{P} = \frac{\sum X_1 + \sum X_2}{N_1 + N_2}$$

where:
$N_1$, $N_2$: Conditional Count (17)
$\sum X_1$, $\sum X_2$: Conditional Sum of data (18)
$Z$: has a normal distribution

The probability of Z (using the normal distribution) defines the significance of the difference between the two proportions.
3.2.9 Chi2 test (Test of Independence)

The Chi2 test can be used to determine the association between categorical (binary) attributes. It is based on the difference between the expected frequencies and the observed frequencies in one or more categories in the frequency table. The Chi2 distribution returns a probability for the computed Chi2 and the degree of freedom. A probability of zero shows complete dependency between two categorical attributes and a probability of one means that the two categorical attributes are completely independent.

Real Time Equation 40: Chi2 test (Test of Independence)

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - e_{ij})^2}{e_{ij}} \tag{40}$$

$$e_{ij} = \frac{\left(\sum_{j=1}^{c} n_{ij}\right) \left(\sum_{i=1}^{r} n_{ij}\right)}{n}$$

$$df = (r - 1)(c - 1)$$

where:
$n_{ij} = \sum X_i X_j$: observed frequency, the Conditional Sum of data (18)
$e_{ij}$: expected frequency from the subset
$n$: total of all observed frequencies
$df$: degree of freedom
$r$, $c$: number of rows and columns
$\chi^2$: has a Chi2 distribution with $(r - 1)(c - 1)$ degrees of freedom
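The test of independence can be sketched directly from a table of observed frequencies. In the BET each observed frequency is a sum of products of two binary attributes; here the table is simply given as nested lists with made-up counts. The function name is hypothetical, and the p-value lookup is left out because the Python standard library has no Chi2 distribution.

```python
def chi_square(observed):
    """Chi2 statistic and degrees of freedom for an r x c frequency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected frequency under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (count - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df


# Hypothetical 2 x 2 table of joint counts for two binary attributes.
chi2, df = chi_square([[10, 20], [20, 10]])
```

Every expected frequency must be non-zero, so attributes that never occur should be dropped from the table before the statistic is computed.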
42
Summary of
Real Time
Bivariate
Data
Analysis
All the above
r
eal
t
ime
bivariate statistical equations are b
ased on only 5 basic elements.
As in
the case of
the r
eal
t
ime
univariate statistical analysis, to calculate the
required
statistical
quantit
ies
we
only
need
to save and update
these five elements in the
basic elements table (BET)
.
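Real Time Equation 1: Count

$$N := N + N^{new}$$

Real Time Equation 2: Sum of data

$$\sum X := \sum X + \sum X^{new}$$

Real Time Equation 4: Sum of Squared data

$$\sum X^2 := \sum X^2 + \sum (X^2)^{new}$$

Real Time Equation 14: Sum of Multiplication

$$\sum X_i X_j := \sum X_i X_j + \sum (X_i X_j)^{new}$$

Real Time Equation 15: Sum of Squared Multiplication

$$\sum (X_i X_j)^2 := \sum (X_i X_j)^2 + \sum \left( (X_i X_j)^2 \right)^{new}$$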
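The five elements above are plain running totals, so a table that holds them can be updated one record at a time. The class below is a minimal sketch under simplifying assumptions: the class and attribute names are hypothetical, every record is complete (no missing values), and a single count is kept for the whole table.

```python
class BasicElementsTable:
    """Running totals for the five basic elements over m numeric attributes."""

    def __init__(self, n_attributes):
        m = n_attributes
        self.n = 0                                        # Count
        self.sum_x = [0.0] * m                            # Sum of data
        self.sum_x2 = [0.0] * m                           # Sum of squared data
        self.sum_xy = [[0.0] * m for _ in range(m)]       # Sum of multiplication
        self.sum_xy2 = [[0.0] * m for _ in range(m)]      # Sum of squared multiplication

    def update(self, record):
        """Add one new record (a sequence of m numbers) to every total."""
        self.n += 1
        for i, xi in enumerate(record):
            self.sum_x[i] += xi
            self.sum_x2[i] += xi * xi
            for j, xj in enumerate(record):
                product = xi * xj
                self.sum_xy[i][j] += product
                self.sum_xy2[i][j] += product * product

# Illustrative records only.
bet = BasicElementsTable(2)
bet.update([1.0, 2.0])
bet.update([3.0, 4.0])
print(bet.n, bet.sum_x, bet.sum_xy[0][1], bet.sum_xy2[0][1])
```

Storing the full matrix duplicates the symmetric entries. Keeping only one triangle would halve the memory at the cost of slightly more index handling.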
As a reminder, the following conditional basic elements are derived directly from the above five basic elements. These equations define univariate statistics for an attribute $X_j$ given a binary attribute $X_i$ when $X_i = 1$.
Real Time Equation 17: Conditional Count

$$N_{j|i} = \sum X_i$$

Real Time Equation 18: Conditional Sum of data

$$\sum X_{j|i} = \sum X_i X_j$$

Real Time Equation 19: Conditional Sum of Squared data

$$\sum X^2_{j|i} = \sum (X_i X_j)^2$$

Real Time Equation 23: Complement Conditional Count

$$\bar{N}_{j|i} = N - \sum X_i$$

Real Time Equation 24: Complement Conditional Sum of data

$$\sum \bar{X}_{j|i} = \sum X_j - \sum X_i X_j$$

Real Time Equation 25: Complement Conditional Sum of Squared data

$$\sum \bar{X}^2_{j|i} = \sum X_j^2 - \sum (X_i X_j)^2$$
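To make the derivation concrete, here is a small sketch that builds the conditional and complement conditional elements from the basic ones. It assumes that the conditioning attribute X_i is binary (0 or 1), so that its sum counts the records with X_i = 1. The function and key names are hypothetical.

```python
def conditional_elements(n, sum_xi, sum_xj, sum_xj2, sum_xixj, sum_xixj2):
    """Derive elements of X_j for X_i = 1 and for its complement X_i = 0.

    n         : Count
    sum_xi    : Sum of data of the binary attribute X_i
    sum_xj    : Sum of data of X_j
    sum_xj2   : Sum of squared data of X_j
    sum_xixj  : Sum of multiplication of X_i and X_j
    sum_xixj2 : Sum of squared multiplication of X_i and X_j
    """
    given = {
        "count": sum_xi,          # records with X_i = 1
        "sum": sum_xixj,          # sum of X_j over those records
        "sum_sq": sum_xixj2,      # sum of X_j squared over those records
    }
    complement = {
        "count": n - sum_xi,
        "sum": sum_xj - sum_xixj,
        "sum_sq": sum_xj2 - sum_xixj2,
    }
    return given, complement

# Illustrative data only: X_i = [1, 0, 1, 0], X_j = [2, 3, 4, 5].
given, complement = conditional_elements(4, 2, 14, 54, 6, 20)
print(given)
print(complement)
```

No pass over the raw data is needed: both groups come from subtracting totals that the BET already holds.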
Real Time Bivariate Data Analysis – Examples
The following bivariate real time statistical quantities are based on the Iris dataset in Appendix A. To calculate any bivariate quantity, we only need to use the elements of the Basic Elements Table (BET) generated from the Iris dataset. All the BET elements are updateable in real time.
Figure 3.4 Bivariate analysis on two numerical attributes: covariance and linear correlation of sepal_length (index 1) and petal_length (index 3), computed from the BET elements.
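A calculation of this kind can be sketched directly from the running totals. The function below is illustrative only: its name is hypothetical, and it assumes the covariance and variances are divided by N rather than N - 1 (the correlation is the same under either choice).

```python
import math

def covariance_and_correlation(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Covariance and linear correlation of two attributes from BET totals."""
    covariance = (sum_xy - sum_x * sum_y / n) / n
    var_x = (sum_x2 - sum_x ** 2 / n) / n
    var_y = (sum_y2 - sum_y ** 2 / n) / n
    correlation = covariance / math.sqrt(var_x * var_y)
    return covariance, correlation

# Illustrative totals only, for x = [1, 2, 3] and y = [2, 4, 6].
print(covariance_and_correlation(3, 6.0, 12.0, 14.0, 56.0, 28.0))
```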
Figure 3.5 Bivariate analysis on one numerical attribute and one categorical (binary) attribute: Z test on sepal_length (index 1) between the groups sepal_width_b1 (index 7) and sepal_width_b2 (index 8), computed from the BET elements.
Figure 3.6 Bivariate analysis on two categorical attributes: Chi² test between petal_width_b1 (index 11), petal_width_b2 (index 12) and Iris_setosa (index 13), Iris_versicolor (index 14), Iris_virginica (index 15), using 2 × 3 tables of observed and expected frequencies built from the BET elements.
4.0 Real Time Classification
Classification refers to the data mining task of attempting to build a predictive model when the target is categorical. The main goal of classification is to divide a dataset into mutually exclusive groups such that the members of each group are as close as possible to one another, and different groups are as far as possible from one another. There are many different classification algorithms (e.g., Naïve Bayesian, Decision Tree, Support Vector Machines, etc.). However, not all classification algorithms can have a real time version. Here we discuss three classification algorithms which can be built and updated in real time using the Basic Elements Table:
Naïve Bayesian
Linear Discriminant Analysis
Linear Support Vector Machines
4.1 Real Time Naïve Bayesian
The Naïve Bayesian classifier is based on Bayes' theorem with independence assumptions between attributes. A Naïve Bayesian model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. Despite its simplicity, the Naïve Bayesian classifier often does surprisingly well and is widely used because it often outperforms more sophisticated classification methods.
Algorithm – Real Time
Bayes' theorem provides a way of calculating the posterior probability, P(c|a), from the class (binary attribute) prior probability, P(c), the prior probability of the value of the attribute, P(a), and the likelihood, P(a|c). The Naïve Bayes classifier assumes that the effect of the value of an attribute (a) on a given class (c) is independent of the values of the other attributes. This assumption is called class conditional independence.
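Bayes' Rule:

$$P(c \mid a) = \frac{P(a \mid c)\, P(c)}{P(a)}$$

Class conditional independence:

$$P(c \mid a_1, a_2, \ldots, a_n) \propto P(a_1 \mid c) \times P(a_2 \mid c) \times \cdots \times P(a_n \mid c) \times P(c)$$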
In the real time version of Bayesian classifiers we calculate the likelihood and the prior probabilities from the Basic Elements Table (BET), which can be updated in real time.
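Real Time Equation 41: Likelihood

$$P(a \mid c) = \bar{X}_{a|c} = \frac{\sum X_a X_c}{\sum X_c}$$

Real Time Equation 42: Class Prior Probability

$$P(c) = \bar{X}_c = \frac{\sum X_c}{N}$$

Real Time Equation 43: Attribute-value Prior Probability

$$P(a) = \bar{X}_a = \frac{\sum X_a}{N}$$

where:
$\bar{X}_{a|c}$: Conditional Mean (20)
$\bar{X}_c$ and $\bar{X}_a$: Mean (3)
$\sum X_a X_c$: Sum of Multiplication (14)
$\sum X_a$, $\sum X_c$: Sum of data (2)
$N$: Count (1)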
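For a binary attribute value and a binary class, the three probabilities reduce to ratios of BET totals. The sketch below illustrates this with hypothetical names and made-up totals. It also combines them with Bayes' rule to give the posterior for a single attribute value.

```python
def naive_bayes_terms(n, sum_xa, sum_xc, sum_xaxc):
    """Likelihood, priors and single-attribute posterior from BET totals.

    n        : Count
    sum_xa   : Sum of data of the binary attribute value X_a
    sum_xc   : Sum of data of the binary class X_c
    sum_xaxc : Sum of multiplication of X_a and X_c
    """
    likelihood = sum_xaxc / sum_xc        # P(a|c)
    class_prior = sum_xc / n              # P(c)
    attribute_prior = sum_xa / n          # P(a)
    posterior = likelihood * class_prior / attribute_prior   # P(c|a)
    return likelihood, class_prior, attribute_prior, posterior

# Illustrative totals only: 10 records, 4 with a, 5 with c, 3 with both.
print(naive_bayes_terms(10, 4, 5, 3))
```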
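If the attribute (A) is numerical, the likelihood for its value (a) can be calculated from the normal distribution equation.

$$P(a \mid c) = \frac{1}{\sqrt{2 \pi \sigma^2_{A|c}}} \exp\left( -\frac{(a - \bar{X}_{A|c})^2}{2 \sigma^2_{A|c}} \right)$$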
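Numerical Attribute Mean:

$$\bar{X}_{A|c} = \frac{\sum X_A X_c}{\sum X_c}$$

Numerical Attribute Variance:

$$\sigma^2_{A|c} = \frac{\sum (X_A X_c)^2 - \dfrac{\left( \sum X_A X_c \right)^2}{\sum X_c}}{\sum X_c}$$

where:
$\bar{X}_{A|c}$: Conditional Mean (20)
$\sigma^2_{A|c}$: Conditional Variance (21)
$\sum (X_A X_c)^2$: Sum of Squared Multiplication (15)
$\sum X_A X_c$: Sum of Multiplication (14)
$\sum X_A$, $\sum X_c$: Sum of data (2)
$N$: Count (1)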
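The numerical case can be sketched the same way: the conditional mean and variance come from three BET totals and are then used in the normal density. The function name is hypothetical, and the sketch assumes the variance is divided by the class count rather than the class count minus one.

```python
import math

def gaussian_likelihood(a, sum_xc, sum_xaxc, sum_xaxc2):
    """Normal-density likelihood of value a for class c, from BET totals.

    sum_xc    : Sum of data of the binary class X_c (records in the class)
    sum_xaxc  : Sum of multiplication of X_A and X_c
    sum_xaxc2 : Sum of squared multiplication of X_A and X_c
    """
    mean = sum_xaxc / sum_xc
    variance = (sum_xaxc2 - sum_xaxc ** 2 / sum_xc) / sum_xc
    exponent = -((a - mean) ** 2) / (2.0 * variance)
    return math.exp(exponent) / math.sqrt(2.0 * math.pi * variance)

# Illustrative totals only: the class holds the two values 1 and 3.
print(gaussian_likelihood(2.0, 2, 4.0, 10.0))
```

A class whose values are all identical has zero variance, which this sketch does not guard against.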
In practice, there is no need to calculate P(a) because it is a constant value for all the classes and can be considered as a normalization factor.