Real Time Data Mining









Saed Sayad

2010


















Devoted to:

a dedicated mother, Khadijeh Razavi Sayad

a larger than life father, Mehdi Sayad

my beautiful wife, Sara Pourmolkara

and my wonderful children, Sahar and Salar




























Foreword

Data now appear in very large quantities and in real time, but conventional data mining methods can only be applied to relatively small, accumulated data batches. This book shows how, by using a method termed the “Real Time Learning Machine” (RTLM), these methods can be readily upgraded to accept data as they are generated and to deal flexibly with changes in how they are to be processed. This real time data mining is the future of predictive modelling. The main purpose of the book is to enable you, the reader, to intelligently apply real time data mining to your own data. Thus, in addition to detailing the mathematics involved and providing simple numerical examples, a computer program is made freely available to allow you to learn the method using your data.

This book summarizes twenty years of method development by Dr. Saed Sayad. During that time he worked to apply data mining in areas ranging from bioinformatics through chemical engineering to financial engineering, in industry, in government and at university. This included the period since 1996 when Saed has worked with me to co-supervise graduate students in the Department of Chemical Engineering and Applied Chemistry at the University of Toronto. Portions of the method have previously been presented at data mining conferences, published in scientific publications and used in Ph.D. thesis research as well as in graduate courses at the University of Toronto. This book integrates all of the previous descriptions and updates the nomenclature. It is intended as a practical instruction manual for data miners and students of data mining.

I have now so often witnessed the initial reaction to the numerous claims made for this very elegant method: incredulity. There have been so many much more complex attempts in the published literature to create practical real time data mining methods. None of these complex attempts has even come close to matching the capabilities of the RTLM. This book is uniquely important to the field of data mining: its combination of mathematical rigor, numerical examples, software implementation and documented references finally comprehensively communicates what is Saed’s invention of the RTLM. Most importantly, it does accomplish his purpose in authoring the book: it very effectively enables you to apply the method with confidence to your data. So, I sincerely hope that you will suspend your disbelief for a moment about the claims for the method, read this book, experiment with the software using your own data and then appreciate just how powerful this apparently “too simple” real time data mining method is.

Stephen T. Balke Ph.D., P.Eng.

Professor Emeritus

University of Toronto

Toronto, Ontario, Canada

December 1, 2010








Two Kinds of Intelligence


There are two kinds of intelligence: one acquired,

as a child in school memorizes facts and concepts

from books and from what the teacher

says,

collecting information from the traditional sciences

as well as from the new sciences.

With such intelligence you rise in the world.

You get ranked ahead or behind others

in regard to your competence in retaining

information. You stroll with this intelligence

in and out of fields of knowledge, getting always more

marks on your preserving tablets.

There is another kind of tablet, one

already completed and preserved inside you.

A spring overflowing its springbox. A freshness

in the center of the chest.

This other intelligence

does not turn yellow or stagnate. It's fluid,

and it doesn't move from outside to inside

through conduits of plumbing-learning.

This second knowing is a fountainhead

from within you, moving out.

From the translations of Rumi by Coleman Barks

Table of Contents

1. Introduction: Defining Real Time Data Mining
2. The Real Time Learning Machine (RTLM)
    2.1. Basic Elements Table
    2.2. Attribute Types
        2.2.1. Missing Values
3. Real Time Data Exploration
    3.1. Univariate Statistical Analysis
        3.1.1. Count
        3.1.2. Mean
        3.1.3. Variance
        3.1.4. Standard Deviation
        3.1.5. Coefficient of Variation
        3.1.6. Skewness
        3.1.7. Kurtosis
        3.1.8. Median
        3.1.9. Mode
        3.1.10. Minimum and Maximum
    3.2. Bivariate Statistical Analysis
        3.2.1. Covariance
        3.2.2. Linear Correlation Coefficient
        3.2.3. Conditional Univariate Statistics
        3.2.4. Z test
        3.2.5. T test
        3.2.6. F test
        3.2.7. Analysis Of Variance
        3.2.8. Z test - Proportions
        3.2.9. Chi² test
4. Real Time Classification
    4.1. Naïve Bayesian
    4.2. Linear Discriminant Analysis
        4.2.1. Quadratic Discriminant Analysis
    4.3. Linear Support Vector Machines
5. Real Time Regression
    5.1. Simple Linear Regression
    5.2. Multiple Linear Regression
    5.3. Principal Components Analysis
    5.4. Principal Components Regression
    5.5. Linear Support Vector Regression
6. Real Time Sequence Analysis
    6.1. Markov Chains
    6.2. Hidden Markov Models
7. Real Time Parallel Processing
References
Appendix A
Appendix B



1.0 Introduction


Data mining is about explaining the past and predicting the future by exploring and analyzing data. Data mining is a multi-disciplinary field which combines statistics, machine learning, artificial intelligence and database technology.

Although data mining algorithms are widely used in extremely diverse situations, in practice, one or more major limitations almost invariably appear and significantly constrain successful data mining applications. Frequently, these problems are associated with large increases in the rate of generation of data, the quantity of data and the number of attributes (variables) to be processed. Increasingly, the data situation is now beyond the capabilities of conventional data mining methods.

The term “Real Time” is used to describe how well a data mining algorithm can accommodate an ever increasing data load instantaneously. However, such real time problems are usually closely coupled with the fact that conventional data mining algorithms operate in a batch mode where having all of the relevant data at once is a requirement. Thus, here Real Time Data Mining is defined as having all of the following characteristics, independent of the amount of data involved:


1. Incremental learning (Learn): immediately updating a model with each new observation without the necessity of pooling new data with old data.

2. Decremental learning (Forget): immediately updating a model by excluding observations identified as adversely affecting model performance without forming a new dataset omitting this data and returning to the model formulation step.

3. Attribute addition (Grow): adding a new attribute (variable) on the fly, without the necessity of pooling new data with old data.

4. Attribute deletion (Shrink): immediately discontinuing use of an attribute identified as adversely affecting model performance.

5. Scenario testing: rapid formulation and testing of multiple and diverse models to optimize prediction.

6. Real Time operation: instantaneous data exploration, modeling and model evaluation.

7. In-Line operation: processing that can be carried out in-situ (e.g., in a mobile device, in a satellite, etc.).

8. Distributed processing: separately processing distributed data or segments of large data (that may be located in diverse geographic locations) and re-combining the results to obtain a single model.

9. Parallel processing: carrying out parallel processing extremely rapidly from multiple conventional processing units (multi-threads, multi-processors or a specialized chip).


Upgrading conventional data mining to real time data mining is achieved through the use of a method termed the Real Time Learning Machine, or RTLM. The use of the RTLM with conventional data mining methods enables “Real Time Data Mining”.

The future of predictive modeling belongs to real time data mining, and the main motivation in authoring this book is to help you to understand the method and to implement it for your application. The book provides previously published [1-6] and unpublished details on implementation of real time data mining. Each section is followed by a simple numerical example illustrating the mathematics. Finally, recognizing that the best way to learn and to appreciate real time data mining is to apply it to your own data, a software program that easily enables you to accomplish this is provided and is free for non-commercial applications.

The book begins by showing equations enabling real time data exploration prior to development of useful models. These “Real Time Equations (RTE)” appear similar to the usual ones seen in many textbooks. However, closer examination will reveal a slightly different notation than the conventional one. This notation is necessary to explain how a “Real Time Equation” differs from conventional ones. Then, it details how a “Basic Elements Table (BET)” is constructed from a dataset and used to achieve scalability and real time capabilities in a data mining algorithm. Finally, each of the following methods is examined in turn and the real time equations necessary for utilization of the Basic Elements Table are provided: Naïve Bayesian, linear discriminant analysis, linear support vector machines, multiple linear regression, principal component analysis and regression, linear support vector regression, Markov chains and hidden Markov models.



2.0 The Real Time Learning Machine

In the previous section, real time data mining algorithms were defined as having nine characteristics, independent of the amount of data involved. It was also mentioned that conventional data mining methods are not real time methods. For example, while learning in nature is incremental, on-line and in real time, as real time algorithms should be, most learning algorithms in data mining operate in a batch mode where having all the relevant data at once is a requirement. In this section we present a widely applicable novel architecture for upgrading conventional data mining methods to real time methods. This architecture is termed the “Real Time Learning Machine” (RTLM). This new architecture adds real time analytical power to the following widely used conventional learning algorithms:

- Naïve Bayesian
- Linear Discriminant Analysis
- Single and Multiple Linear Regression
- Principal Component Analysis and Regression
- Linear Support Vector Machines and Regression
- Markov Chains
- Hidden Markov Models

Conventionally, data mining algorithms interact directly with the whole dataset and must somehow accommodate the impact of new data, changes in attributes (variables), etc. An important feature of the RTLM is that, as shown in Figure 2.1, the modeling process is split into four separate components: Learner, Explorer, Modeler and Predictor. The data is summarized in the Basic Elements Table (BET), which is a relatively small table.


Figure 2.1 Real Time Learning Machine (RTLM)

The tasks assigned to each of the four real time components are as follows:

- Learner: updates (incrementally or decrementally) the Basic Elements Table utilizing the data in real time.
- Explorer: performs univariate and bivariate statistical data analysis using the Basic Elements Table in real time.
- Modeler: constructs models using the Basic Elements Table in real time.
- Predictor: uses the models for prediction in real time.

The RTLM is not constrained by the amount of data involved and is a mathematically rigorous method that makes parallel data processing readily achievable. As shown in Figures 2.2 and 2.3, a dataset of any size can be divided into smaller parts and each part can be processed separately (multi-threads, multi-processors or a specialized chip). The results can then be joined together to obtain the same model as if we had processed the whole dataset at once.

Figure 2.2 Parallel Real Time Learning Machine

Figure 2.3 Parallel Multi-layer Real Time Learning Machine



2.1 The Basic Elements Table

The Basic Elements Table building block includes two attributes, \(X_i\) and \(X_j\), and one or more basic elements \(B_{ij}\):

Figure 2.4 Building block of the Basic Elements Table.

where \(B_{ij}\) can consist of one or more of the following basic elements:

- \(N_{ij}\): total number of joint occurrences of two attributes
- \(\sum X_i\) and \(\sum X_j\): sum of data
- \(\sum X_i X_j\): sum of multiplication
- \(\sum X_i^2\) and \(\sum X_j^2\): sum of squared data
- \(\sum (X_i X_j)^2\): sum of squared multiplication

All of the above seven basic elements can be updated in real time (incrementally or decrementally), using the following general real time equation.

General Real Time Equation

\[ B_{ij} := B_{ij} \pm B_{ij}^{new} \]

where:

\[ B_{ij} = B_{ji} \]

(+) represents incremental and (-) decremental change of the basic elements.
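The incremental and decremental update of the basic elements can be sketched in Python as follows. This is a minimal illustration rather than the book's software: the class name `BasicElements`, its field names and the `update` method are hypothetical, and each observation is assumed to arrive as a complete numeric pair.

```python
from dataclasses import dataclass


@dataclass
class BasicElements:
    """Seven basic elements for one attribute pair (X_i, X_j)."""
    n: float = 0.0        # N_ij, joint count
    sum_i: float = 0.0    # sum of X_i
    sum_j: float = 0.0    # sum of X_j
    sum_ij: float = 0.0   # sum of X_i * X_j
    sum_i2: float = 0.0   # sum of X_i squared
    sum_j2: float = 0.0   # sum of X_j squared
    sum_ij2: float = 0.0  # sum of (X_i * X_j) squared

    def update(self, xi, xj, sign=+1):
        """Apply B := B + sign * B_new; sign=+1 learns, sign=-1 forgets."""
        self.n += sign
        self.sum_i += sign * xi
        self.sum_j += sign * xj
        self.sum_ij += sign * xi * xj
        self.sum_i2 += sign * xi ** 2
        self.sum_j2 += sign * xj ** 2
        self.sum_ij2 += sign * (xi * xj) ** 2


cell = BasicElements()
for xi, xj in [(98, 88), (75, 70), (90, 96)]:
    cell.update(xi, xj)           # incremental learning
cell.update(90, 96, sign=-1)      # decremental learning: forget one observation
```

Because every element is a plain sum, forgetting an observation is the same arithmetic as learning it with the opposite sign, and no pass over the earlier data is needed.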

The above seven basic elements are not the only ones; there are more elements which could be included in this list, such as \(\sum X_i^3\), \(\sum X_i^4\) and more.

The number of attributes can also be updated in real time (incrementally or decrementally), simply by adding corresponding rows and columns and the related basic elements to the BET.



2.2 Attribute Types

There are only two types of attributes in the BET: numeric and categorical (binary). Numerical attributes can also be discretized (binned). Categorical attributes and the discretized versions of numerical attributes must be encoded into binary (0, 1). The following example shows how to transform a categorical attribute to its binary counterparts.

| Temperature | Humidity | Play |
|-------------|----------|------|
| 98          | 88       | no   |
| 75          | 70       | yes  |
| 90          | 96       | no   |
| 78          | ?        | yes  |
| 65          | 60       | yes  |

Table 2.1 Original dataset.

| Temperature | Humidity | Play.yes | Play.no |
|-------------|----------|----------|---------|
| 98          | 88       | 0        | 1       |
| 75          | 70       | 1        | 0       |
| 90          | 96       | 0        | 1       |
| 78          | ?        | 1        | 0       |
| 65          | 60       | 1        | 0       |

Table 2.2 Categorical attribute (Play) is transformed to two binary attributes.



| Temperature | Temp.hot | Temp.moderate | Temp.mild | Humidity | Play.yes | Play.no |
|-------------|----------|---------------|-----------|----------|----------|---------|
| 98          | 1        | 0             | 0         | 88       | 0        | 1       |
| 75          | 0        | 1             | 0         | 70       | 1        | 0       |
| 90          | 1        | 0             | 0         | 96       | 0        | 1       |
| 78          | 0        | 1             | 0         | ?        | 1        | 0       |
| 65          | 0        | 0             | 1         | 60       | 1        | 0       |

Table 2.3 Final transformed dataset with three new binary attributes which are created by discretizing Temperature.

Note: Technically speaking, for a categorical attribute with k categories we only need to create k-1 binary attributes, but we do not suggest it for the RTLM implementation.


2.2.1 Missing Values

All non-numeric values in the dataset are simply considered missing values. For example, the “?” in the above dataset will be ignored by the RTLM Learner. There is no need for any missing values policy here, because the RTLM can build a new model on the fly by excluding attributes with missing values.








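Basic Elements Table - Example

The Basic Elements Table for the above sample dataset with four basic elements is shown below. The RTLM Learner updates the Basic Elements Table with any new incoming data. Each row lists one cell of the table, that is, the four basic elements \(N_{ij}\), \(\sum X_i\), \(\sum X_j\) and \(\sum X_i X_j\) for one pair of attributes.

| \(X_i\)       | \(X_j\)       | \(N_{ij}\) | \(\sum X_i\) | \(\sum X_j\) | \(\sum X_i X_j\) |
|---------------|---------------|------------|--------------|--------------|------------------|
| Temperature   | Temperature   | 5          | 406          | 406          | 33638            |
| Temperature   | Temp.hot      | 5          | 406          | 2            | 188              |
| Temperature   | Temp.moderate | 5          | 406          | 2            | 153              |
| Temperature   | Temp.mild     | 5          | 406          | 1            | 65               |
| Temperature   | Humidity      | 4          | 328          | 314          | 26414            |
| Temperature   | Play.yes      | 5          | 406          | 3            | 218              |
| Temperature   | Play.no       | 5          | 406          | 2            | 188              |
| Temp.hot      | Temp.hot      | 5          | 2            | 2            | 2                |
| Temp.hot      | Temp.moderate | 5          | 2            | 2            | 0                |
| Temp.hot      | Temp.mild     | 5          | 2            | 1            | 0                |
| Temp.hot      | Humidity      | 4          | 2            | 314          | 184              |
| Temp.hot      | Play.yes      | 5          | 2            | 3            | 0                |
| Temp.hot      | Play.no       | 5          | 2            | 2            | 2                |
| Temp.moderate | Temp.moderate | 5          | 2            | 2            | 2                |
| Temp.moderate | Temp.mild     | 5          | 2            | 1            | 0                |
| Temp.moderate | Humidity      | 4          | 1            | 314          | 70               |
| Temp.moderate | Play.yes      | 5          | 2            | 3            | 2                |
| Temp.moderate | Play.no       | 5          | 2            | 2            | 0                |
| Temp.mild     | Temp.mild     | 5          | 1            | 1            | 1                |
| Temp.mild     | Humidity      | 4          | 1            | 314          | 60               |
| Temp.mild     | Play.yes      | 5          | 1            | 3            | 1                |
| Temp.mild     | Play.no       | 5          | 1            | 2            | 0                |
| Humidity      | Humidity      | 4          | 314          | 314          | 25460            |
| Humidity      | Play.yes      | 4          | 314          | 2            | 130              |
| Humidity      | Play.no       | 4          | 314          | 2            | 184              |
| Play.yes      | Play.yes      | 5          | 3            | 3            | 3                |
| Play.yes      | Play.no       | 5          | 3            | 2            | 0                |
| Play.no       | Play.no       | 5          | 2            | 2            | 2                |

Table 2.4 Basic Elements Table for the sample dataset.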
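As a cross-check of Table 2.4, the following Python sketch rebuilds two of its cells from the five rows of the sample dataset. The function name `bet_cell` and the use of `None` to stand for the “?” entry are assumptions made for this illustration; a pair is skipped whenever either value is missing.

```python
# (Temperature, Humidity) rows of the sample dataset; None stands for "?".
rows = [(98, 88), (75, 70), (90, 96), (78, None), (65, 60)]


def bet_cell(pairs):
    """Return (N_ij, sum X_i, sum X_j, sum X_i*X_j) over complete pairs only."""
    n = sum_i = sum_j = sum_ij = 0
    for xi, xj in pairs:
        if xi is None or xj is None:
            continue  # missing values are ignored by the learner
        n += 1
        sum_i += xi
        sum_j += xj
        sum_ij += xi * xj
    return n, sum_i, sum_j, sum_ij


temp_humidity = bet_cell(rows)
temp_temp = bet_cell([(t, t) for t, _ in rows])
print(temp_humidity)  # Temperature x Humidity cell
print(temp_temp)      # Temperature x Temperature cell
```

The Temperature x Humidity cell counts only four joint observations, because the row with the missing Humidity value does not contribute to that pair.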

Here, it is shown how to compute some of the necessary statistics for many modeling algorithms using only the basic elements.


Numerical Variable



(

)


















(

)







(



)








(

)








Categorical Variable


(



)








(



)
















Numerical Variable and Numerical Variable



(



)
























(



)







Numerical Variable and Categorical Variable



(





)

















Categorical Variable and Categorical Variable


(







)

















3.0 Real Time Data Exploration

Data Exploration is about describing the data by means of statistical and visualization techniques. In this section the focus is on statistical methods, with emphasis on how specific statistical quantities can be calculated using “Real Time Equations (RTE)”. In subsequent sections it will be seen that the “Real Time Equations” enable upgrading of the conventional data mining techniques to their real time counterparts.

Real Time Data Exploration will be discussed in the following two categories:

1. Real Time Univariate Data Exploration
2. Real Time Bivariate Data Exploration



3.1 Real Time Univariate Data Exploration

Univariate data analysis explores attributes (variables) one by one using statistical analysis. Attributes are either numerical or categorical (encoded to binary). Numerical attributes can be transformed into categorical counterparts by discretization or binning. An example is “Age” with three categories (bins): 20-39, 40-59, and 60-79. Equal Width and Equal Frequency are two popular binning methods. Moreover, binning may improve accuracy of the predictive models by reducing the noise or non-linearity, and it allows easy identification of outliers, invalid and missing values.


3.1.1 Count

The total count of k subsets of attribute X can be computed in real time. This means a dataset can be divided into k subsets, and the count of the whole dataset will be equal to the sum of the counts of all its subsets.

Real Time Equation 1: Count

\[ N := N \pm N^{new} \qquad (1) \]

The ± notation preceding \(N^{new}\) in the above equation means that the number of data in a subset is a positive quantity if those data are being added to the BET (incremental learning) or a negative quantity if those data are being subtracted from the BET (decremental learning).


3.1.2 Mean (Average)

The mean or average is a point estimation of a set of data. As normally written, the average is not a real time equation because averages with different N cannot be added or subtracted incrementally.

\[ \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i \]

However, using the same notation as used in the first real time equation, we see that the summation can be written in a real time form as follows:

Real Time Equation 2: Sum of data

\[ \sum X := \sum X \pm \sum X^{new} \qquad (2) \]

Real Time Equation 3: Mean

\[ \bar{X} = \frac{\sum X}{N} \qquad (3) \]

where:

\(N\): Count (1)
\(\sum X\): Sum of data (2)

Now the “Mean” can be written as a real time quantity because the whole data is not required each time it is calculated. Only the values of the subset sums are required along with the count in each subset.


3.1.3 Variance

The variance is a measure of data dispersion or variability. A low variance indicates that the data tend to be very close to the mean, whereas a high variance indicates that the data are spread out over a large range of values. Similarly, the variance equation is not real time by itself.

\[ S^2 = \frac{\sum (X - \bar{X})^2}{N} \]

However, the sums involved can be written as sums over subsets rather than over the whole data.

Real Time Equation 4: Sum of Squared data

\[ \sum X^2 := \sum X^2 \pm \sum (X^2)^{new} \qquad (4) \]

Now, we can write real time equations for the variance and the standard deviation using the basic elements.

Real Time Equation 5: Variance

\[ S^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N} \qquad (5) \]

Note: If the number of data is less than 30 we should replace the outer divisor N with N-1.


3.1.4 Standard Deviation

Standard deviation, like variance, is a measure of variability or dispersion. A low standard deviation indicates that the data tend to be very close to the mean, whereas a high standard deviation indicates that the data are spread out over a large range of values.

Real Time Equation 6: Standard Deviation

\[ S = \sqrt{S^2} = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}} \qquad (6) \]



3.1.5 Coefficient of Variation

The coefficient of variation is a standardized measure of the dispersion or variability in data. CV is independent of the units of measurement.

Real Time Equation 7: Coefficient of Variation

\[ CV = \frac{S}{\bar{X}} \qquad (7) \]

where:

\(\bar{X}\): Average (3)
\(S\): Standard Deviation (6)
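The univariate quantities above can be obtained from three basic elements alone. The following Python sketch assumes the population form of the variance (divisor N) and uses a hypothetical function name, `univariate_stats`; the input values are the count, sum and sum of squares of Temperature from the sample dataset in Section 2.

```python
import math


def univariate_stats(n, sum_x, sum_x2):
    """Mean, variance, standard deviation and CV from N, sum X and sum X^2."""
    mean = sum_x / n
    variance = (sum_x2 - sum_x ** 2 / n) / n  # divisor N; use n - 1 for small samples
    std = math.sqrt(variance)
    cv = std / mean
    return mean, variance, std, cv


# Temperature: N = 5, sum X = 406, sum X^2 = 33638
mean, variance, std, cv = univariate_stats(5, 406, 33638)
print(mean, variance, std, cv)
```

None of these calls touches the raw observations, so the statistics can be refreshed as often as the basic elements are updated.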


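3.1.6 Skewness

Skewness is a measure of symmetry or asymmetry in the distribution of data. The skewness equation is not real time by itself but its components are.

\[ Skew = \frac{N}{(N-1)(N-2)} \sum \left( \frac{X - \bar{X}}{S} \right)^3 \]

First we need to expand the aggregate part of the equation:

\[ \sum (X - \bar{X})^3 = \sum \left( X^3 - 3X^2\bar{X} + 3X\bar{X}^2 - \bar{X}^3 \right) = \left( \sum X^3 - 3\bar{X}\sum X^2 + 3\bar{X}^2\sum X - N\bar{X}^3 \right) \]

where:

\(N\): Count (1)
\(\sum X\): Sum of data (2)
\(\bar{X}\): Average (3)
\(S\): Standard Deviation (6)

Real Time Equation 8: Sum of data to the power of 3

\[ \sum X^3 := \sum X^3 \pm \sum (X^3)^{new} \qquad (8) \]

Real Time Equation 9: Skewness

\[ Skew = \frac{N}{(N-1)(N-2)} \cdot \frac{\sum X^3 - 3\bar{X}\sum X^2 + 3\bar{X}^2\sum X - N\bar{X}^3}{S^3} \qquad (9) \]
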

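3.1.7 Kurtosis

Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. Like skewness, the standard equation for kurtosis is not a real time equation but can be transformed to be real time.

\[ Kurt = \left[ \frac{N(N+1)}{(N-1)(N-2)(N-3)} \sum \left( \frac{X - \bar{X}}{S} \right)^4 \right] - \frac{3(N-1)^2}{(N-2)(N-3)} \]

First we need to expand the aggregate part of the equation:

\[ \sum (X - \bar{X})^4 = \sum \left( X^4 - 4X^3\bar{X} + 6X^2\bar{X}^2 - 4X\bar{X}^3 + \bar{X}^4 \right) = \left( \sum X^4 - 4\bar{X}\sum X^3 + 6\bar{X}^2\sum X^2 - 4\bar{X}^3\sum X + N\bar{X}^4 \right) \]

where:

\(N\): Count (1)
\(\sum X\): Sum of data (2)
\(\bar{X}\): Average (3)
\(\sum X^2\): Sum of Squared data (4)
\(S\): Standard Deviation (6)
\(\sum X^3\): Sum of data to the power of 3 (8)

Real Time Equation 10: Sum of data to the power of 4

\[ \sum X^4 := \sum X^4 \pm \sum (X^4)^{new} \qquad (10) \]

Real Time Equation 11: Kurtosis

\[ Kurt = \left[ \frac{N(N+1)}{(N-1)(N-2)(N-3)} \cdot \frac{\sum X^4 - 4\bar{X}\sum X^3 + 6\bar{X}^2\sum X^2 - 4\bar{X}^3\sum X + N\bar{X}^4}{S^4} \right] - \frac{3(N-1)^2}{(N-2)(N-3)} \qquad (11) \]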
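The expansion of the third and fourth central sums can be illustrated with a short Python sketch that computes skewness and kurtosis from the five power sums only. The function names are hypothetical, and the sketch assumes that S is the sample standard deviation (divisor N - 1) and that the usual sample-adjusted forms of skewness and excess kurtosis are intended.

```python
import math


def power_sums(data):
    """Return N and the sums of X, X^2, X^3 and X^4 (each updatable by +/-)."""
    return (len(data),
            sum(data),
            sum(x ** 2 for x in data),
            sum(x ** 3 for x in data),
            sum(x ** 4 for x in data))


def skewness_kurtosis(n, s1, s2, s3, s4):
    """Skewness and kurtosis from the power sums, without revisiting the data."""
    mean = s1 / n
    # Assumption: S is the sample standard deviation (divisor N - 1).
    s = math.sqrt((s2 - s1 ** 2 / n) / (n - 1))
    # Expanded sums of (X - mean)^3 and (X - mean)^4.
    m3 = s3 - 3 * mean * s2 + 3 * mean ** 2 * s1 - n * mean ** 3
    m4 = (s4 - 4 * mean * s3 + 6 * mean ** 2 * s2
          - 4 * mean ** 3 * s1 + n * mean ** 4)
    skew = n / ((n - 1) * (n - 2)) * m3 / s ** 3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * m4 / s ** 4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return skew, kurt


temperatures = [98, 75, 90, 78, 65]
skew, kurt = skewness_kurtosis(*power_sums(temperatures))
print(skew, kurt)
```

For data with a large mean relative to its spread, the expanded sums subtract nearly equal numbers, so a production implementation may prefer exact or higher-precision arithmetic for the power sums.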


3.1.8 Median

The median is the middle data point, below and above which lie an equal number of data points. The median equation is not real time and cannot be directly transformed into a real time form. However, by using a discretized (binned) version of the attribute we can often obtain a good estimate of the median.

Real Time Equation 12: Median

\[ Median = L_1 + \left[ \frac{\frac{N+1}{2} - F_{j-1}}{N_j} \right] \times h \qquad (12) \]

- Figure out which bin contains the median by using the (N + 1)/2 formula. \(N_j\) is the count for the median bin. \(N\) is the total count for all bins.
- Find the cumulative count of the interval (\(F_{j-1}\)) preceding the median group.
- \(h\) is the range in each bin.
- \(L_1\) is the lower limit value in the median bin.
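The binned estimate of the median can be sketched in Python as below. This is an illustration under stated assumptions: it uses the common grouped-data interpolation, treats the quantity accumulated over the preceding bins as a cumulative count, and assumes equal-width bins; the function name `binned_median` and the bin counts are hypothetical, while the bin edges follow the Age example (20-39, 40-59, 60-79).

```python
def binned_median(lower_limits, width, counts):
    """Estimate the median from per-bin counts of equal-width bins."""
    n = sum(counts)
    target = (n + 1) / 2          # position of the median observation
    cumulative = 0                # count in the bins preceding the current one
    for lower, count in zip(lower_limits, counts):
        if count > 0 and cumulative + count >= target:
            # Interpolate linearly inside the median bin.
            return lower + (target - cumulative) / count * width
        cumulative += count
    raise ValueError("no observations in any bin")


# Hypothetical counts for the Age bins 20-39, 40-59 and 60-79.
estimate = binned_median([20, 40, 60], 20, [3, 5, 2])
print(estimate)
```

Only the per-bin counts are stored, and each of them can be updated incrementally or decrementally like any other count.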


3.1.9 Mode

The mode, like the median, cannot be transformed to real time. However, like the median, we can obtain a good estimate of the mode from the discretized (binned) version of a numerical attribute. When numerical attributes are discretized into bins, the mode is defined as the bin where most observations lie.



3.1.10 Minimum and Maximum

Minimum and Maximum can be updated in real time incrementally but not decrementally. This means that if we lose an existing maximum or minimum value we would need to consider all historical data to replace it. One practical option is to use the lower bound (minimum) and upper bound (maximum) of the discretized version of a numerical attribute.



Summary of Real Time Univariate Data Analysis

All the above univariate real time statistical equations are based on only five basic elements. To calculate any of the univariate quantities, we only need to save and update the following five elements in the Basic Elements Table (BET).

Real Time Equation 1: Count

\[ N := N \pm N^{new} \]

Real Time Equation 2: Sum of data

\[ \sum X := \sum X \pm \sum X^{new} \]

Real Time Equation 4: Sum of Squared data

\[ \sum X^2 := \sum X^2 \pm \sum (X^2)^{new} \]

Real Time Equation 8: Sum of data to the power of 3

\[ \sum X^3 := \sum X^3 \pm \sum (X^3)^{new} \]

Real Time Equation 10: Sum of data to the power of 4

\[ \sum X^4 := \sum X^4 \pm \sum (X^4)^{new} \]



Real Time Univariate Data Analysis - Examples

The following univariate real time statistical quantities are based on the Iris dataset found in Appendix A. To calculate any of the univariate quantities, we only need to use the elements of the Basic Elements Table (BET) generated from the Iris dataset. All the BET elements are updateable in real time.

Figure 3.1 Univariate analysis on a numerical attribute (sepal_length): count, mean, variance, standard deviation, coefficient of variation, skewness and kurtosis.

Figure 3.2 Univariate analysis on a categorical (binary) attribute (Iris_setosa): count, mean, variance and standard deviation.



3.2 Real Time Bivariate Data Analysis

Bivariate data analysis is the simultaneous analysis of two attributes (variables). It explores the concept of relationship between two attributes, whether there is an association and the strength of this association, or whether there are differences between two attributes and the significance of these differences.

3.2.1 Covariance

Covariance measures the extent to which two numerical attributes vary together. That is, it is a measure of the linear relationship between two attributes.

Real Time Equation 13: Covariance

\[ Cov(X, Y) = \frac{\sum XY - \frac{\sum X \sum Y}{N}}{N} \qquad (13) \]

where:

\(N\): Count (1)
\(\sum X\), \(\sum Y\): Sum of data (2)

Real Time Equation 14: Sum of Multiplications

\[ \sum XY := \sum XY \pm \sum (XY)^{new} \qquad (14) \]

The following real time equation is also very useful.

Real Time Equation 15: Sum of Squared Multiplication

\[ \sum (XY)^2 := \sum (XY)^2 \pm \sum \left( (XY)^2 \right)^{new} \qquad (15) \]



3.2.2 Linear Correlation Coefficient

Linear correlation quantifies the strength of a linear relationship between two attributes. When there is no correlation between two attributes, there is no tendency for the values of one quantity to increase or decrease with the values of the second quantity. The linear correlation coefficient measures the strength of a linear relationship and is always between -1 and 1, where -1 means perfect negative linear correlation, +1 means perfect positive linear correlation and zero means no linear correlation.

Real Time Equation 16: Linear Correlation Coefficient

\[ R = \frac{Cov(X, Y)}{\sqrt{S_X^2 \, S_Y^2}} \qquad (16) \]

where:

\(Cov(X, Y)\): Covariance (13)
\(S_X^2\), \(S_Y^2\): Variance (5)
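Covariance and the linear correlation coefficient can be computed from six basic elements of one attribute pair, as the following Python sketch shows. The function name `covariance_correlation` is hypothetical, the population form (divisor N) is assumed throughout, and the input pairs are the four complete (Temperature, Humidity) rows of the sample dataset in Section 2.

```python
import math


def covariance_correlation(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Covariance and linear correlation from the basic elements of a pair."""
    cov = (sum_xy - sum_x * sum_y / n) / n
    var_x = (sum_x2 - sum_x ** 2 / n) / n
    var_y = (sum_y2 - sum_y ** 2 / n) / n
    return cov, cov / math.sqrt(var_x * var_y)


# Complete (Temperature, Humidity) pairs; the row with "?" is excluded.
pairs = [(98, 88), (75, 70), (90, 96), (65, 60)]
elements = (
    len(pairs),
    sum(x for x, _ in pairs),
    sum(y for _, y in pairs),
    sum(x * y for x, y in pairs),
    sum(x * x for x, _ in pairs),
    sum(y * y for _, y in pairs),
)
cov, r = covariance_correlation(*elements)
print(cov, r)
```

In a running system the six elements would be read from the BET instead of being recomputed from the pairs; they are derived here only to keep the sketch self-contained.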


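3.2.3 Conditional Univariate Statistics

The following equations define univariate statistics for an attribute \(X\) given a binary attribute \(Y\) when \(Y = 1\). Many of the bivariate statistics rely on these conditional univariate statistics.

Real Time Equation 17: Conditional Count

\[ Count(X \mid Y = 1) = N_{X|Y} = \sum Y \qquad (17) \]

Real Time Equation 18: Conditional Sum of data

\[ Sum(X \mid Y = 1) = \sum X_{|Y} = \sum XY \qquad (18) \]

Real Time Equation 19: Conditional Sum of Squared data

\[ SumSquared(X \mid Y = 1) = \sum X^2_{|Y} = \sum (XY)^2 \qquad (19) \]

Real Time Equation 20: Conditional Mean

\[ Mean(X \mid Y = 1) = \bar{X}_{|Y} = \frac{\sum X_{|Y}}{N_{X|Y}} = \frac{\sum XY}{\sum Y} \qquad (20) \]

Real Time Equation 21: Conditional Variance

\[ Variance(X \mid Y = 1) = S^2_{X|Y} = \frac{\sum (XY)^2 - \frac{(\sum XY)^2}{\sum Y}}{\sum Y} \qquad (21) \]

Real Time Equation 22: Conditional Standard Deviation

\[ S_{X|Y} = \sqrt{S^2_{X|Y}} = \sqrt{\frac{\sum (XY)^2 - \frac{(\sum XY)^2}{\sum Y}}{\sum Y}} \qquad (22) \]

Complement Conditional Univariate Statistics

For real time predictive modeling we also need to define conditional univariate statistics for an attribute \(X\) given a binary attribute \(Y\) when \(Y = 0\).

Real Time Equation 23: Complement Conditional Count

\[ Count(X \mid Y = 0) = N_{X|\bar{Y}} = N - \sum Y \qquad (23) \]

Real Time Equation 24: Complement Conditional Sum of data

\[ Sum(X \mid Y = 0) = \sum X_{|\bar{Y}} = \sum X - \sum XY \qquad (24) \]

Real Time Equation 25: Complement Conditional Sum of Squared data

\[ SumSquared(X \mid Y = 0) = \sum X^2_{|\bar{Y}} = \sum X^2 - \sum (XY)^2 \qquad (25) \]

Real Time Equation 26: Complement Conditional Mean

\[ Mean(X \mid Y = 0) = \bar{X}_{|\bar{Y}} = \frac{\sum X_{|\bar{Y}}}{N_{X|\bar{Y}}} = \frac{\sum X - \sum XY}{N - \sum Y} \qquad (26) \]

Real Time Equation 27: Complement Conditional Variance

\[ Variance(X \mid Y = 0) = S^2_{X|\bar{Y}} = \frac{\sum X^2_{|\bar{Y}} - \frac{(\sum X_{|\bar{Y}})^2}{N_{X|\bar{Y}}}}{N_{X|\bar{Y}}} = \frac{\left( \sum X^2 - \sum (XY)^2 \right) - \frac{(\sum X - \sum XY)^2}{N - \sum Y}}{N - \sum Y} \qquad (27) \]

Real Time Equation 28: Complement Conditional Standard Deviation

\[ S_{X|\bar{Y}} = \sqrt{S^2_{X|\bar{Y}}} = \sqrt{\frac{\left( \sum X^2 - \sum (XY)^2 \right) - \frac{(\sum X - \sum XY)^2}{N - \sum Y}}{N - \sum Y}} \qquad (28) \]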
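Because the conditioning attribute is binary, the conditional statistics and their complements follow from six basic elements of the pair. The Python sketch below assumes the population form of the variance and uses hypothetical names (`conditional_stats`, the dictionary keys); the inputs are the Temperature and Play.yes values of the sample dataset in Section 2, with the sum of squared products computed from the three "yes" rows.

```python
def conditional_stats(n, sum_x, sum_x2, sum_y, sum_xy, sum_xy2):
    """Statistics of X given Y = 1 and given Y = 0 for a binary attribute Y."""
    def stats(count, total, total_sq):
        mean = total / count
        variance = (total_sq - total ** 2 / count) / count
        return {"count": count, "mean": mean, "variance": variance}

    given_one = stats(sum_y, sum_xy, sum_xy2)
    # Complement: subtract the Y = 1 part from the unconditional elements.
    given_zero = stats(n - sum_y, sum_x - sum_xy, sum_x2 - sum_xy2)
    return given_one, given_zero


# Temperature (X) and Play.yes (Y): N = 5, sum X = 406, sum X^2 = 33638,
# sum Y = 3, sum XY = 218, sum (XY)^2 = 75^2 + 78^2 + 65^2 = 15934.
play_yes, play_no = conditional_stats(5, 406, 33638, 3, 218, 15934)
print(play_yes)
print(play_no)
```

The complement statistics need no extra storage: they are differences between the unconditional elements and the elements of the Y = 1 part.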

3.2.4 Z test

The Z test assesses whether the difference between the averages of two attributes is statistically significant. This analysis is appropriate for comparing the average of a numerical attribute with a known average, or two conditional averages of a numerical attribute given two binary attributes (two categories of the same categorical attribute).

Real Time Equation 29: Z test - one group

\[ Z = \frac{\bar{X} - \mu_0}{S / \sqrt{N}} \qquad (29) \]

where:

\(\bar{X}\): Mean or Conditional Mean (3 or 20)
\(S\): Standard Deviation or Conditional Standard Deviation (6 or 22)
\(N\): Count or Conditional Count (1 or 17)
\(\mu_0\): known average

The probability of Z (using the normal distribution) defines the significance of the difference between the two averages.

Real Time Equation 30: Z test - two groups

\[ Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_1^2}{N_1} + \frac{S_2^2}{N_2}}} \qquad (30) \]

where:

\(\bar{X}_1\), \(\bar{X}_2\): Conditional Mean (20)
\(S_1^2\), \(S_2^2\): Conditional Variance (21)
\(N_1\), \(N_2\): Conditional Count (17)
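A two-group Z test needs only the conditional means, variances and counts. The following Python sketch is an illustration with hypothetical input values and a hypothetical function name, `z_test_two_groups`; the two-sided probability is obtained from the standard normal distribution through `math.erfc`.

```python
import math


def z_test_two_groups(mean1, var1, n1, mean2, var2, n2):
    """Return the Z statistic and its two-sided p-value."""
    z = (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)
    # P(|Z| > |z|) for a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical conditional statistics for two categories of one attribute.
z, p_value = z_test_two_groups(72.0, 25.0, 50, 70.0, 36.0, 60)
print(z, p_value)
```

Since all six inputs derive from basic elements, the test can be repeated after every update without revisiting the data.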


3.2.
5

T

t
est

The

T

test
like
Z

test
assess
es

whether the averages of two
numerical

attributes

are statistically
different from each other

when the number of data points is less than 30
.
T
test
is approp
riate for
comparing the average

of
a numerical attribute with a known average
or

two conditional
averages of a

numerical
attribute

given

two
binary attributes (two
categories

of the same
categorical attribute)
.


Real Time Equation
31
: T
t
est
-

one
group




̅













(

)

where:




̅
:
Mean or Conditional Mean

(3

or 20
)





:
Standard
Deviation

or Conditional Standard
Deviation (6

or 22
)




:
Count or Conditional Count (1 or 17)





: Known average


The probability
of
t

(using
t

distribution with
N
-
1

degree of freedom) defines if

the difference
between two averages is
statistically

significant.


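Real Time Equation 32: T test - two groups

$$ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S^2 \left( \dfrac{1}{N_1} + \dfrac{1}{N_2} \right)}} $$

$$ S^2 = \frac{(N_1 - 1) S_1^2 + (N_2 - 1) S_2^2}{N_1 + N_2 - 2} $$

where:
$\bar{X}_1$, $\bar{X}_2$: Conditional Mean (Real Time Equation 20)
$S_1^2$, $S_2^2$: Conditional Variance (Real Time Equation 21)
$N_1$, $N_2$: Conditional Count (Real Time Equation 17)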
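A short Python sketch of the two-group T test with a pooled variance is given below. The function name is hypothetical, and the returned degrees of freedom (N1 + N2 - 2) are the standard choice for a pooled-variance test rather than something stated in the text above.

```python
import math

def t_two_groups(n1, mean1, var1, n2, mean2, var2):
    """Pooled-variance t statistic and its degrees of freedom."""
    # Pooled variance weights each group variance by its degrees of freedom.
    pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    t = (mean1 - mean2) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2
```

The inputs are the conditional counts, means and variances, so nothing beyond the basic elements needs to be stored.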


3.2.6 F test

The F test is used to compare the variances of two attributes. It can be used for comparing the variance of a numerical attribute with a known variance, or for comparing two conditional variances of a numerical attribute given two binary attributes (two categories of the same categorical attribute).


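Real Time Equation 33: F test - one group

$$ \chi^2 = \frac{(N - 1) S^2}{\sigma_0^2} $$

where:
$N$: Count or Conditional Count (Real Time Equation 1 or 17)
$S^2$: Variance or Conditional Variance (Real Time Equation 5 or 21)
$\sigma_0^2$: known variance
$\chi^2$: has a $\chi^2$ distribution with N - 1 degrees of freedom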


Real Time Equation 34: F test - two groups

$$ F = \frac{S_1^2}{S_2^2} $$

where:
$S_1^2$, $S_2^2$: Conditional Variance (Real Time Equation 21)
$F$: has an F distribution with $N_1 - 1$ and $N_2 - 1$ degrees of freedom
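Both variance comparisons reduce to a ratio of quantities already held as basic elements. The sketch below uses hypothetical function names and returns each statistic together with its degrees of freedom; looking up the probability is left to a statistics library.

```python
def variance_vs_known(n, var, known_var):
    """Chi-square statistic for one variance against a known variance."""
    return (n - 1) * var / known_var, n - 1

def f_two_groups(n1, var1, n2, var2):
    """F statistic for two variances, with (numerator, denominator) df."""
    return var1 / var2, (n1 - 1, n2 - 1)
```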


3.2.7 Analysis of Variance (ANOVA)

ANOVA assesses whether the averages of more than two groups are statistically different from each other, under the assumption that the corresponding populations are normally distributed. ANOVA is useful for comparing the averages of two or more numerical attributes, or two or more conditional averages of a numerical attribute given two or more binary attributes (two or more categories of the same categorical attribute).

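Source of Variation | Sum of Squares | Degree of Freedom | Mean Squares        | F             | Probability
Between Groups      | SSB            | k - 1             | MSB = SSB / (k - 1) | F = MSB / MSW | P(F)
Within Groups       | SSW            | N - k             | MSW = SSW / (N - k) |               |
Total               | SST            | N - 1             |                     |               |

Here k is the number of groups and N is the total count.

Figure 3.3 Analysis of Variance and its components.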



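Real Time Equation 35: Sum of Squares Between Groups

$$ SS_B = \sum_{i=1}^{k} \frac{\left( \sum X_i \right)^2}{N_i} - \frac{\left( \sum_{i=1}^{k} \left( \sum X_i \right) \right)^2}{\sum_{i=1}^{k} N_i} $$

where:
$N_i$: Conditional Count of group i (Real Time Equation 17)
$\sum X_i$: Conditional Sum of data of group i (Real Time Equation 18)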


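Real Time Equation 36: Sum of Squares Within Groups

$$ SS_W = \sum_{i=1}^{k} \left( \sum X_i^2 \right) - \sum_{i=1}^{k} \frac{\left( \sum X_i \right)^2}{N_i} $$

where:
$N_i$: Conditional Count of group i (Real Time Equation 17)
$\sum X_i$: Conditional Sum of data of group i (Real Time Equation 18)
$\sum X_i^2$: Conditional Sum of Squared data of group i (Real Time Equation 19)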


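Real Time Equation 37: Sum of Squares Total

$$ SS_T = SS_B + SS_W $$

$F = MS_B / MS_W$: has an F distribution with $k - 1$ and $N - k$ degrees of freedom, where k is the number of groups and N is the total count.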
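The ANOVA quantities can be assembled from the per-group conditional count, sum and sum of squares alone. The following Python sketch assumes each group is supplied as a hypothetical (count, sum, sum of squares) tuple; the function name and the returned dictionary layout are illustrative choices, not part of the book's program.

```python
def anova_from_sums(groups):
    """groups: list of (count, sum, sum_of_squares), one tuple per group."""
    k = len(groups)
    n_total = sum(n for n, _, _ in groups)
    grand_sum = sum(s for _, s, _ in groups)
    # Sum of squares between groups
    ssb = sum(s ** 2 / n for n, s, _ in groups) - grand_sum ** 2 / n_total
    # Sum of squares within groups
    ssw = sum(ss - s ** 2 / n for n, s, ss in groups)
    msb = ssb / (k - 1)
    msw = ssw / (n_total - k)
    return {
        "SSB": ssb,
        "SSW": ssw,
        "SST": ssb + ssw,
        "F": msb / msw,
        "df": (k - 1, n_total - k),
    }
```

For example, the groups [1, 2, 3], [2, 3, 4] and [6, 7, 8] are passed as `[(3, 6, 14), (3, 9, 29), (3, 21, 149)]`.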


3.2.8 Z test - Proportions

The Z test can also be used to compare proportions. It can compare a proportion from one categorical attribute with a known proportion, or compare two proportions that originate from two binary attributes (two categories of the same categorical attribute).




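Real Time Equation 38: Z test - one group

$$ Z = \frac{\dfrac{\sum X}{N} - p_0}{\sqrt{\dfrac{p_0 (1 - p_0)}{N}}} $$

where:
$N$: Count or Conditional Count (Real Time Equation 1 or 17)
$\sum X$: Sum of data or Conditional Sum of data (Real Time Equation 2 or 18)
$p_0$: known probability
$Z$: has a normal distribution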


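Real Time Equation 39: Z test - two groups

$$ Z = \frac{\dfrac{\sum X_1}{N_1} - \dfrac{\sum X_2}{N_2}}{\sqrt{\hat{p} (1 - \hat{p}) \left( \dfrac{1}{N_1} + \dfrac{1}{N_2} \right)}} $$

$$ \hat{p} = \frac{\sum X_1 + \sum X_2}{N_1 + N_2} $$

where:
$N_1$, $N_2$: Conditional Count (Real Time Equation 17)
$\sum X_1$, $\sum X_2$: Conditional Sum of data (Real Time Equation 18)
$Z$: has a normal distribution

The probability of Z (using the normal distribution) defines the significance of the difference between the two proportions.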
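For a binary attribute the sum of data is simply the number of ones, so a proportion is the sum divided by the count. The sketch below illustrates the one-group and two-group proportion tests in Python; the function names are hypothetical.

```python
import math

def z_proportion_one_group(n, sum_x, p0):
    """Z statistic for an observed proportion against a known proportion p0."""
    p = sum_x / n
    return (p - p0) / math.sqrt(p0 * (1 - p0) / n)

def z_proportion_two_groups(n1, sum1, n2, sum2):
    """Z statistic for the difference between two proportions."""
    p1, p2 = sum1 / n1, sum2 / n2
    pooled = (sum1 + sum2) / (n1 + n2)  # pooled proportion estimate
    return (p1 - p2) / math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
```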






3.2.9 Chi² test (Test of Independence)

The Chi² test can be used to determine the association between categorical (binary) attributes. It is based on the difference between the expected frequencies and the observed frequencies in one or more categories of the frequency table. The Chi² distribution returns a probability for the computed Chi² value and the degrees of freedom. A probability of zero indicates complete dependency between the two categorical attributes, and a probability of one means that the two categorical attributes are completely independent.


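Real Time Equation 40: Chi² test (Test of Independence)

$$ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - e_{ij})^2}{e_{ij}} $$

$$ e_{ij} = \frac{\left( \sum_{k=1}^{c} n_{ik} \right) \left( \sum_{k=1}^{r} n_{kj} \right)}{n} $$

$$ df = (r - 1)(c - 1) $$

where:
$n_{ij}$: Conditional Sum of data (Real Time Equation 18)
$e_{ij}$: expected frequency from the subset
$n$: total of all observed frequencies
$df$: degrees of freedom
$r$, $c$: number of rows and columns
$\chi^2$: has a Chi² distribution with $(r - 1)(c - 1)$ degrees of freedom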
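The test of independence can be sketched in Python from an observed frequency table. This is an illustration under the assumption that every row and column total is greater than zero, so that no expected frequency is zero; the function name is hypothetical.

```python
def chi_square_independence(observed):
    """observed: list of rows of counts. Returns (chi2, degrees of freedom)."""
    rows, cols = len(observed), len(observed[0])
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(rows)) for j in range(cols)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(rows):
        for j in range(cols):
            # Expected frequency if the two attributes were independent
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2, (rows - 1) * (cols - 1)
```

A table whose rows are proportional to each other, such as `[[10, 20], [20, 40]]`, gives a statistic of zero, which corresponds to the fully independent case described above.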




Summary of Real Time Bivariate Data Analysis

All of the above real time bivariate statistical equations are based on only five basic elements. As in the case of real time univariate statistical analysis, to calculate the required statistical quantities we only need to save and update these five elements in the basic elements table (BET).


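Real Time Equation 1: Count

$$ N := N + 1 $$

Real Time Equation 2: Sum of data

$$ \sum X := \sum X + X $$

Real Time Equation 4: Sum of Squared data

$$ \sum X^2 := \sum X^2 + X^2 $$

Real Time Equation 14: Sum of Multiplication

$$ \sum X_i X_j := \sum X_i X_j + X_i X_j $$

Real Time Equation 15: Sum of Squared Multiplication

$$ \sum (X_i X_j)^2 := \sum (X_i X_j)^2 + (X_i X_j)^2 $$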


As a reminder, the following conditional basic elements are derived directly from the above five basic elements. These equations define univariate statistics for an attribute $X_i$ given a binary attribute $X_j$ when $X_j = 1$ (or, for the complement elements, when $X_j = 0$).


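Real Time Equation 17: Conditional Count

$$ N(X_i \mid X_j = 1) = N_{i|j} = \sum X_j $$

Real Time Equation 18: Conditional Sum of data

$$ \sum (X_i \mid X_j = 1) = \sum X_{i|j} = \sum X_i X_j $$

Real Time Equation 19: Conditional Sum of Squared data

$$ \sum (X_i^2 \mid X_j = 1) = \sum X_{i|j}^2 = \sum (X_i X_j)^2 $$

Real Time Equation 23: Complement Conditional Count

$$ N(X_i \mid X_j = 0) = \bar{N}_{i|j} = N - \sum X_j $$

Real Time Equation 24: Complement Conditional Sum of data

$$ \sum (X_i \mid X_j = 0) = \sum \bar{X}_{i|j} = \sum X_i - \sum X_i X_j $$

Real Time Equation 25: Complement Conditional Sum of Squared data

$$ \sum (X_i^2 \mid X_j = 0) = \sum \bar{X}_{i|j}^2 = \sum X_i^2 - \sum (X_i X_j)^2 $$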
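The update and derivation steps summarized above can be sketched as a small Python class. It is a minimal illustration for one pair of attributes, assuming x_i is numerical and x_j is a 0/1 indicator; the class and method names are hypothetical and are not taken from the author's program.

```python
class PairElements:
    """Running basic elements for an attribute pair (x_i numerical, x_j binary)."""

    def __init__(self):
        self.n = 0          # count
        self.sum_i = 0.0    # sum of x_i
        self.sum_j = 0      # sum of x_j (number of rows with x_j = 1)
        self.sum_i2 = 0.0   # sum of x_i squared
        self.sum_ij = 0.0   # sum of x_i * x_j
        self.sum_ij2 = 0.0  # sum of (x_i * x_j) squared

    def update(self, x_i, x_j):
        """Incorporate one new row without storing it."""
        self.n += 1
        self.sum_i += x_i
        self.sum_j += x_j
        self.sum_i2 += x_i ** 2
        self.sum_ij += x_i * x_j
        self.sum_ij2 += (x_i * x_j) ** 2

    def conditional(self):
        """Count, sum and sum of squares of x_i where x_j = 1."""
        return self.sum_j, self.sum_ij, self.sum_ij2

    def complement(self):
        """Count, sum and sum of squares of x_i where x_j = 0."""
        return (self.n - self.sum_j,
                self.sum_i - self.sum_ij,
                self.sum_i2 - self.sum_ij2)
```

Because x_j is binary, its sum of squares equals its sum, so it is not stored separately in this sketch.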




Real Time Bivariate Data Analysis - Examples

The following bivariate real time statistical quantities are based on the Iris dataset in Appendix A. To calculate any bivariate quantity, we only need to use the elements of the Basic Elements Table (BET) generated from the Iris dataset. All the BET elements are updateable in real time.


[Worked example for sepal_length (index 1) and petal_length (index 3): Covariance and Linear Correlation computed from the BET elements.]

Figure 3.4 Bivariate analysis on two numerical attributes.


[Worked example for sepal_length (index 1), sepal_width_b1 (index 7) and sepal_width_b2 (index 8): Z test computed from the BET elements.]

Figure 3.5 Bivariate analysis on one numerical attribute and one categorical (binary) attribute.


[Worked example for petal_width_b1 (index 11) and petal_width_b2 (index 12) against Iris_setosa (index 13), Iris_versicolor (index 14) and Iris_virginica (index 15): Chi² test with the observed and expected frequency tables, whose rows are petal_width_b1 and petal_width_b2 and whose columns are the three Iris classes.]

Figure 3.6 Bivariate analysis on two categorical attributes.


4.0 Real Time Classification


Classification refers to the data mining task of attempting to build a predictive model when the target is categorical. The main goal of classification is to divide a dataset into mutually exclusive groups such that the members of each group are as close as possible to one another, and different groups are as far as possible from one another. There are many different classification algorithms (e.g., Naïve Bayesian, Decision Tree, Support Vector Machines, etc.). However, not all classification algorithms can have a real time version. Here we discuss three classification algorithms which can be built and updated in real time using the Basic Elements Tables:

- Naïve Bayesian
- Linear Discriminant Analysis
- Linear Support Vector Machines









4.1 Real Time Naïve Bayesian

The Naïve Bayesian classifier is based on Bayes' theorem with independence assumptions between attributes. A Naïve Bayesian model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. Despite its simplicity, the Naïve Bayesian classifier often does surprisingly well and is widely used because it often outperforms more sophisticated classification methods.

Algorithm - Real Time

Bayes' theorem provides a way of calculating the posterior probability, P(c|a), from the class (binary attribute) prior probability, P(c), the prior probability of the attribute value, P(a), and the likelihood, P(a|c). The Naïve Bayes classifier assumes that the effect of the value of an attribute (a) on a given class (c) is independent of the values of the other attributes. This assumption is called class conditional independence.

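Bayes' Rule:

$$ P(c \mid a) = \frac{P(c) \, P(a \mid c)}{P(a)} $$

Class conditional independence:

$$ P(c \mid A) \propto P(a_1 \mid c) \times P(a_2 \mid c) \times P(a_3 \mid c) \times \cdots \times P(a_n \mid c) \times P(c) $$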


In the real time version of Bayesian classifiers we calculate the likelihood and the prior probabilities from the Basic Elements Table (BET), which can be updated in real time.


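Real Time Equation 41: Likelihood

$$ P(a \mid c) = \bar{X}_{a|c} = \frac{\sum X_a X_c}{\sum X_c} $$

Real Time Equation 42: Class Prior Probability

$$ P(c) = \bar{X}_c = \frac{\sum X_c}{N} $$

Real Time Equation 43: Attribute-value Prior Probability

$$ P(a) = \bar{X}_a = \frac{\sum X_a}{N} $$

where:
$\bar{X}_{a|c}$: Conditional Mean (Real Time Equation 20)
$\bar{X}_c$ and $\bar{X}_a$: Mean (Real Time Equation 3)
$\sum X_a X_c$: Sum of Multiplication (Real Time Equation 14)
$\sum X_a$, $\sum X_c$: Sum of data (Real Time Equation 2)
$N$: Count (Real Time Equation 1)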
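The likelihood and prior equations above can be combined into a posterior with a few lines of Python. The sketch assumes binary (0/1) indicator attributes and a hypothetical input layout in which each class maps to its sum of data and a list of sums of multiplication, one per observed attribute value; it normalizes across classes instead of computing P(a).

```python
def class_score(n, sum_c, sums_ac):
    """Unnormalized posterior: P(c) times the product of P(a_k | c)."""
    score = sum_c / n             # class prior probability
    for sum_ac in sums_ac:
        score *= sum_ac / sum_c   # likelihood of one attribute value
    return score

def posteriors(n, classes):
    """classes: {label: (sum_c, [sum_ac for each observed attribute value])}."""
    scores = {label: class_score(n, sum_c, sums_ac)
              for label, (sum_c, sums_ac) in classes.items()}
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}
```

With 10 rows, `posteriors(10, {"yes": (6, [3, 4]), "no": (4, [1, 2])})` weighs a prior of 0.6 against 0.4 and multiplies in the per-attribute likelihoods for each class.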


If the attribute (A) is numerical, the likelihood for its value (a) can be calculated from the normal distribution equation.

$$ P(a \mid c) = \frac{1}{\sqrt{2 \pi S^2}} \exp\left( -\frac{(a - \bar{X})^2}{2 S^2} \right) $$


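Numerical Attribute Mean:

$$ \bar{X} = \bar{X}_{A|c} = \frac{\sum X_A X_c}{\sum X_c} $$

Numerical Attribute Variance:

$$ S^2 = S_{A|c}^2 = \frac{\sum (X_A X_c)^2 - \dfrac{\left( \sum X_A X_c \right)^2}{\sum X_c}}{\sum X_c - 1} $$

where:
$\bar{X}_{A|c}$: Conditional Mean (Real Time Equation 20)
$S_{A|c}^2$: Conditional Variance (Real Time Equation 21)
$\sum (X_A X_c)^2$: Sum of Squared Multiplication (Real Time Equation 15)
$\sum X_A X_c$: Sum of Multiplication (Real Time Equation 14)
$\sum X_c$: Sum of data (Real Time Equation 2)
$N$: Count (Real Time Equation 1)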
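For a numerical attribute, the conditional mean and variance feed directly into the normal density. The Python sketch below is an illustration with a hypothetical function name; it assumes the conditional variance uses a denominator of the conditional count minus one, and that the class has at least two rows with non-identical values.

```python
import math

def gaussian_likelihood(a, sum_c, sum_ac, sum_ac2):
    """Normal-density likelihood of value a for class c.

    sum_c:   number of rows in class c (sum of the class indicator)
    sum_ac:  sum of the attribute over rows in class c
    sum_ac2: sum of the squared attribute over rows in class c
    """
    mean = sum_ac / sum_c
    var = (sum_ac2 - sum_ac ** 2 / sum_c) / (sum_c - 1)  # assumed denominator
    return math.exp(-(a - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```

For class values 4, 5 and 6 the inputs are `sum_c=3`, `sum_ac=15` and `sum_ac2=77`, giving a conditional mean of 5 and a conditional variance of 1.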

In practice, there is no need to calculate P(a) because it is a constant value for all the classes and can be considered as a normalization factor