CA660_DA_L1_2013_2014x - School of Computing - DCU


DATA ANALYSIS

Module Code: CA660

(Application Areas: Bio-, Business, Social, Environment etc.)

STRUCTURE of Investigation/DA

[Overview diagram] Level of Measurement and Distributional Assumptions (Probability, Estimation properties) determine the basis: size/type of data set and tools. The analysis then branches into Parametric and Non-Parametric methods, covering study techniques, lab. techniques, Estimation/H.T., H.T. for 1, 2 or many samples, E.D., Regn., C.T., replication, assays and counts.

Probability & Statistics Primer - overview

Note: Short overview only. Other statistical distributions in lectures.

Summary Statistics - Descriptive

In the analysis of practical sets of data, it is useful to define a small number of values that summarise the main features present. We derive (i) representative values, (ii) measures of spread and (iii) measures of skewness and other characteristics.

Representative Values

Sometimes called measures of location or measures of central tendency.

1. Random Value

Given a set of data S = {x_1, x_2, ..., x_n}, we select a random number k in the range 1 to n and return the value x_k. This method of generating a representative value is straightforward, but it suffers from the fact that extreme values can occur and successive values could vary considerably from one another.


2. Arithmetic Mean

For the set S above, the arithmetic mean (or just mean) is

  x̄ = (x_1 + x_2 + ... + x_n) / n.

If x_1 occurs f_1 times, x_2 occurs f_2 times and so on, we get the formula

  x̄ = (f_1 x_1 + f_2 x_2 + ... + f_n x_n) / (f_1 + f_2 + ... + f_n),

written x̄ = Σ f_i x_i / Σ f_i.

Example 1. The data are student marks in an examination. Find the average mark for the class.

Note 1: Marks are given as ranges, so care is needed in interpreting the ranges. All intervals must be of equal width and there must be no gaps in the classification. We interpret the range 0-19 to contain marks greater than 0 and less than or equal to 20; thus its mid-point is 10. The other intervals are interpreted accordingly.

  Mark Range   Mid-Point x_i   Number of Students f_i   f_i x_i
  0-19         10              2                        20
  20-39        30              6                        180
  40-59        50              12                       600
  60-79        70              25                       1750
  80-99        90              5                        450
  Sum          -               50                       3000

The arithmetic mean is x̄ = 3000 / 50 = 60 marks.


Note 2: Pivot. If weights of size f_i are suspended from a metre stick at the points x_i, then the average is the centre of gravity of the distribution. Consequently, it is very sensitive to outlying values.

Note 3: The population should be homogeneous for the average to be meaningful. For example, if the typical height of girls in a class is less than that of boys, then the average height of all students is indicative of neither the girls nor the boys.

3. The Mode

This is the value that occurs most frequently. By common agreement, it is calculated from the histogram using linear interpolation on the modal class. The various similar triangles in the diagram generate the common ratios. In our case, the mode is

  60 + (13 / 33)(20) = 67.8 marks.


4. The Median

The middle point of the distribution. If {x_1, x_2, ..., x_n} are the marks of students in a class, arranged in non-decreasing order, then the median is the mark of the (n + 1)/2-th student. We often use the ogive, or cumulative frequency diagram, to calculate it. In our case, the median is

  60 + (5.5 / 25)(20) = 64.4 marks.

[Figures: histogram of the mark distribution (frequencies 2, 6, 12, 25, 5) used to interpolate the mode, and the ogive/cumulative frequency diagram used to read the median at the 25.5th value.]
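As a minimal sketch in plain Python (variable names are illustrative, and the modal class is assumed to be interior), the grouped-data mean, mode and median interpolations above can be reproduced as follows:

```python
# Grouped marks data from Example 1: class lower bounds (width 20) and frequencies
lows = [0, 20, 40, 60, 80]
freqs = [2, 6, 12, 25, 5]
width = 20
n = sum(freqs)

# Arithmetic mean from mid-points: x-bar = sum(f_i x_i) / sum(f_i)
mids = [lo + width / 2 for lo in lows]            # 10, 30, 50, 70, 90
mean = sum(f * x for f, x in zip(freqs, mids)) / n

# Mode by linear interpolation on the modal class (largest frequency)
i = freqs.index(max(freqs))
d1 = freqs[i] - freqs[i - 1]                      # rise from the class below
d2 = freqs[i] - freqs[i + 1]                      # fall to the class above
mode = lows[i] + d1 / (d1 + d2) * width

# Median by interpolation on the class containing the (n+1)/2-th value
half = (n + 1) / 2
cum = 0
for i, f in enumerate(freqs):
    if cum + f >= half:
        median = lows[i] + (half - cum) / f * width
        break
    cum += f

print(mean, round(mode, 1), round(median, 1))     # 60.0 67.9 64.4
```

(The slides truncate the mode, 67.88, to 67.8.)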

Measures of Dispersion or Scattering

Example 2. The distribution shown has the same arithmetic mean as Example 1, but the values are more dispersed. This illustrates that an average value alone may not adequately describe statistical distributions.

  Marks x_j   Frequency f_j   f_j x_j
  10          6               60
  30          8               240
  50          6               300
  70          15              1050
  90          15              1350
  Sums        50              3000

To devise a formula that captures the degree to which a distribution is concentrated about the average, we consider the deviations of the values from the average.
3000

If the distribution is concentrated around the mean, then the deviations will be small, while if it is very scattered, then the deviations will be large. The average of the squares of the deviations is called the variance, and this is used as a measure of dispersion. The square root of the variance is the standard deviation; it has the same units of measurement as the original values and is the preferred measure of dispersion in many applications.

Variance & Standard Deviation

  σ² = VAR[X] = Average of the Squared Deviations
              = Σ f {Squared Deviations} / Σ f
              = Σ f_i (x_i - x̄)² / Σ f_i
              = Σ f_i x_i² / Σ f_i  -  x̄²,   called the product moment formula.

  σ = Standard Deviation = √Variance


Example 1:

  f    x    f x    f x²
  2    10   20     200
  6    30   180    5400
  12   50   600    30000
  25   70   1750   122500
  5    90   450    40500
  50        3000   198600

  VAR[X] = 198600 / 50 - (60)² = 372 marks²

Example 2:

  f    x    f x    f x²
  6    10   60     600
  8    30   240    7200
  6    50   300    15000
  15   70   1050   73500
  15   90   1350   121500
  50        3000   217800

  VAR[X] = 217800 / 50 - (60)² = 756 marks²
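A quick numerical check of the product moment formula for both examples (a sketch in plain Python; names are illustrative):

```python
def grouped_var(freqs, xs):
    """VAR[X] = sum(f x^2)/sum(f) - mean^2 (product moment formula)."""
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, xs)) / n
    mean_sq = sum(f * x * x for f, x in zip(freqs, xs)) / n
    return mean_sq - mean ** 2

xs = [10, 30, 50, 70, 90]
print(grouped_var([2, 6, 12, 25, 5], xs))    # 372.0 marks^2 (Example 1)
print(grouped_var([6, 8, 6, 15, 15], xs))    # 756.0 marks^2 (Example 2)
```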




Other Summary Statistics

Skewness

An important attribute of a statistical distribution is its degree of symmetry. A 'skew' means a tail, so distributions with a large tail of outlying values on the right-hand side are positively skewed, or skewed to the right. The notion of negative skewness is defined similarly. A simple formula for skewness is

  Skewness = (Mean - Mode) / Standard Deviation,

which for Example 1 is:

  Skewness = (60 - 67.8) / 19.287 = -0.4044.

Coefficient of Variation

This formula was devised to 'standardise' the arithmetic mean so that comparisons can be drawn between different distributions. It is not universally used.

  Coefficient of Variation = Standard Deviation / Mean.


Semi-Interquartile Range

The median is the mid or 0.5 point in a distribution. The quartiles Q_1, Q_2, Q_3 correspond to the 0.25, 0.50 and 0.75 points. An alternative measure of dispersion is thus

  Semi-Interquartile Range = (Q_3 - Q_1) / 2.


Geometric Mean

For data that grow geometrically, e.g. economic data with a high inflation effect, another mean is sometimes used. The G.M. is defined for a product of frequencies, where N = Σ f:

  G.M. = (x_1^f1 · x_2^f2 · ... · x_k^fk)^(1/N)

Regression

[Example 3.] As a motivating example, suppose we model sales data over time.

  SALES   3      5      4      5      6      7
  TIME    1990   1991   1992   1993   1994   1995

We want the straight line "Y = m X + c" that best approximates the data. "Best" in this case is the line which minimizes the sum of squares of vertical deviations of points from the line:

  SSQ = SS = Σ (Y_i - [m X_i + c])²

Setting the partial derivatives of SS w.r.t. m and c to zero leads to the "Normal Equations"

  Σ Y   = m Σ X  + n c
  Σ X Y = m Σ X² + c Σ X,   where n = # points

Let 1990 correspond to Year 0.


  X·X   X    X·Y   Y    Y·Y
  0     0    0     3    9
  1     1    5     5    25
  4     2    8     4    16
  9     3    15    5    25
  16    4    24    6    36
  25    5    35    7    49
  55    15   87    30   160

[Scatter diagram: Sales (Y) against Time (X), with the fitted line Y_i = m X_i + c and the vertical deviations Y_i - (m X_i + c).]

Example 3 - Working

The normal equations are:

  30 = 15 m + 6 c     =>   150 = 75 m + 30 c
  87 = 55 m + 15 c    =>   174 = 110 m + 30 c

  =>  24 = 35 m   =>   m = 24/35
  =>  30 = 15 (24/35) + 6 c   =>   c = 23/7

Thus the regression line of Y on X is

  Y = (24/35) X + (23/7)

and to plot the line we just need two points, so

  X = 0 => Y = 23/7   and   X = 5 => Y = (24/35) 5 + 23/7 = 47/7.
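The same working can be checked numerically; a minimal sketch in plain Python (no libraries assumed):

```python
# Sales data, with 1990 coded as X = 0
X = [0, 1, 2, 3, 4, 5]
Y = [3, 5, 4, 5, 6, 7]
n = len(X)

# Sums needed for the normal equations: sum(Y) = m*sum(X) + n*c, sum(XY) = m*sum(X^2) + c*sum(X)
Sx  = sum(X)
Sy  = sum(Y)
Sxy = sum(x * y for x, y in zip(X, Y))
Sxx = sum(x * x for x in X)

# Solve the 2x2 system for slope m and intercept c
m = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)
c = (Sy - m * Sx) / n

print(m, c)   # 0.6857... (= 24/35) and 3.2857... (= 23/7)
```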


It is easy to see that (X̄, Ȳ) satisfies the normal equations, so that the regression line of Y on X passes through the "Centre of Gravity" of the data. By expanding terms, we get

  Σ (Y_i - Ȳ)²  =  Σ (Y_i - [m X_i + c])²  +  Σ ([m X_i + c] - Ȳ)²

  Total Sum        Error Sum                  Regression Sum
  of Squares       of Squares                 of Squares
  SST            = SSE                      + SSR

Distinguish the independent and dependent variables (X and Y respectively).

Correlation

The coefficient of determination r² (which takes values in the range 0 to 1) is a measure of the proportion of the total variation that is associated with the regression process:

  r² = SSR / SST = 1 - SSE / SST.

The coefficient of correlation 'r' (values in the range -1 to +1) is a more common measure of the degree to which a mathematical relationship exists between X and Y. It can be calculated as:

  r = Σ (X - X̄)(Y - Ȳ) / √[ Σ (X - X̄)² · Σ (Y - Ȳ)² ]

    = ( n Σ X Y - Σ X Σ Y ) / √[ { n Σ X² - (Σ X)² } { n Σ Y² - (Σ Y)² } ]

Example. In our case,

  r = {6(87) - (15)(30)} / √[ {6(55) - (15)²} {6(160) - (30)²} ] = 0.907.








[Scatter diagrams illustrating r = -1, r = +1 and r = 0.]
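A numerical check of r for the sales data (a sketch in plain Python; math.sqrt is the only import):

```python
import math

X = [0, 1, 2, 3, 4, 5]
Y = [3, 5, 4, 5, 6, 7]
n = len(X)

# Product moment form: r = (n*sum(XY) - sum(X)sum(Y)) / sqrt((n*sum(X^2) - sum(X)^2)(n*sum(Y^2) - sum(Y)^2))
num = n * sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y)
den = math.sqrt((n * sum(x * x for x in X) - sum(X) ** 2)
                * (n * sum(y * y for y in Y) - sum(Y) ** 2))
print(num / den)   # 0.9071...
```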

Collinearity

For a correlation coefficient value > 0.9 or < -0.9, we would take this to mean that there is a mathematical relationship between the variables. This does not imply that a cause-and-effect relationship exists.


E.g. consider a country with a slowly changing population size, where a certain political party retains a relatively stable percentage of the poll in elections. Let

  X = Number of people that vote for the party in an election
  Y = Number of people that die of a given disease in a year
  Z = Population size.

Then the correlation coefficient between X and Y is ~1, indicating a mathematical relationship between them, because X is a function of Z and Y is a function of Z also. It would clearly be silly to suggest that the incidence of the disease is caused by the number of people that vote for the given political party. This is known as the problem of collinearity.


Spotting hidden dependencies is non-trivial. Statistical experimentation can only be used to disprove hypotheses, or to lend evidence to support the view that reputed relationships between variables may be valid. Thus, the fact of a high correlation coefficient between deaths due to heart failure in a given year and the number of cigarettes consumed twenty years earlier does not establish a cause-and-effect relationship, though it may be useful to guide research.

Overview of Probability Theory

In statistical theory, an experiment is any operation that can be replicated infinitely often and gives rise to a set of elementary outcomes, which are deemed to be equally likely. The sample space S of the experiment is the set of all possible outcomes of the experiment. Any subset E of the sample space is called an event. An event E occurs whenever any of its elements is an outcome of the experiment. The probability of occurrence of E is

  P{E} = (Number of elementary outcomes in E) / (Number of elementary outcomes in S)

The complement Ē of an event E is the set of all elements that belong to S but not to E. The union E_1 ∪ E_2 of two events is the set of all outcomes that belong to E_1 or to E_2 or to both. The intersection E_1 ∩ E_2 of two events is the set of all outcomes that belong to both E_1 and E_2.

Two events are mutually exclusive if the occurrence of either precludes the occurrence of the other, i.e. their intersection is the empty set. Two events are independent if the occurrence of either is unaffected by the occurrence or non-occurrence of the other event.


Theorem of Total Probability.

  P{E_1 ∪ E_2} = P{E_1} + P{E_2} - P{E_1 ∩ E_2}

Proof. With n = n_{0,0} + n_{1,0} + n_{0,2} + n_{1,2} outcomes partitioned as in the Venn diagram (n_{1,0} in E_1 only, n_{0,2} in E_2 only, n_{1,2} in both, n_{0,0} in neither):

  P{E_1 ∪ E_2} = (n_{1,0} + n_{1,2} + n_{0,2}) / n
               = (n_{1,0} + n_{1,2}) / n + (n_{1,2} + n_{0,2}) / n - n_{1,2} / n
               = P{E_1} + P{E_2} - P{E_1 ∩ E_2}

Corollary. If E_1 and E_2 are mutually exclusive, P{E_1 ∪ E_2} = P{E_1} + P{E_2}

- see Axioms and Addition Rule

[Venn diagram: events E_1, E_2 in sample space S with cell counts n_{1,0}, n_{1,2}, n_{0,2}, n_{0,0}.]


The probability P{E_1 | E_2} that E_1 occurs, given that E_2 has occurred (or must occur), is called the conditional probability of E_1. Note: the only possible outcomes of the experiment are now confined to E_2 and not to S.

Theorem of Compound Probability (Multiplication Rule).

  P{E_1 ∩ E_2} = P{E_1 | E_2} · P{E_2}.

Proof.

  P{E_1 ∩ E_2} = n_{1,2} / n
               = { n_{1,2} / (n_{1,2} + n_{0,2}) } · { (n_{1,2} + n_{0,2}) / n }

Corollary. If E_1 and E_2 are independent, P{E_1 ∩ E_2} = P{E_1} · P{E_2}. This is a special case of the Multiplication Rule.

Note: If E is itself compound, this expands further into the Chain Rule:

  P{E_7 ∩ E_8 ∩ E_9} = P{E_7 | E_8 ∩ E_9} · P{E_8 | E_9} · P{E_9}


Counting the possible outcomes of an event is crucial to calculating probabilities. A permutation of size r of n different items is an arrangement of r of the items, where the order of arrangement is important. If order is not important, the arrangement is called a combination.

Example. There are 5 · 4 = 20 permutations and 5 · 4 / (2 · 1) = 10 combinations of size 2 of A, B, C, D, E.

  Permutations:  AB, BA, AC, CA, AD, DA, AE, EA, BC, CB, BD, DB, BE, EB, CD, DC, CE, EC, DE, ED
  Combinations:  AB, AC, AD, AE, BC, BD, BE, CD, CE, DE


Standard reference books on probability theory give a comprehensive treatment of how these
ideas are used to calculate the probability of occurrence of the outcomes of games of chance.
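These counts are easy to verify with Python's standard itertools module (a small sketch):

```python
from itertools import combinations, permutations

items = "ABCDE"
perms = list(permutations(items, 2))    # ordered pairs
combs = list(combinations(items, 2))    # unordered pairs

print(len(perms), len(combs))           # 20 10
print(["".join(c) for c in combs])      # ['AB', 'AC', 'AD', 'AE', 'BC', 'BD', 'BE', 'CD', 'CE', 'DE']
```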



Bayes' Rule (Theorem): For a series of mutually exclusive and exhaustive events B_r, where the union B_1 ∪ B_2 ∪ B_3 ∪ ... ∪ B_r covers all possibilities for B, then:

  P{B_j | A} = P{A | B_j} P{B_j} / Σ_r P{A | B_r} P{B_r}

where the denominator is the total probability of A occurring.
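As an illustrative sketch in plain Python (the priors and likelihoods below are hypothetical numbers, chosen only to show the mechanics): two exhaustive, mutually exclusive events B_1, B_2 with priors 0.6 and 0.4, and likelihoods P{A | B_1} = 0.2, P{A | B_2} = 0.5.

```python
priors = [0.6, 0.4]          # P{B_r}, mutually exclusive and exhaustive (hypothetical)
likelihoods = [0.2, 0.5]     # P{A | B_r} (hypothetical)

# Denominator: total probability of A occurring
total_A = sum(p * l for p, l in zip(priors, likelihoods))    # 0.32

# Bayes' rule: P{B_r | A} = P{A | B_r} P{B_r} / P{A}
posteriors = [p * l / total_A for p, l in zip(priors, likelihoods)]
print(posteriors)            # [0.375, 0.625]
```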


Ex. Paternity indices: based on actual genotypes of mother, child, and alleged father. Before collection of any evidence, we have a prior probability of paternity P{C}. So, what is the situation after the genetic evidence 'E' is in?

From Bayes':

  P{man is father | E} / P{man not father | E}
      = [ P{E | man is father} / P{E | man not father} ] × [ P{man is father} / P{man not father} ]

Written in terms of the ratio of posterior probs. (= LHS), the paternity index (L say) and the ratio of prior probs. (RHS). Rearrange and substitute in the above to give the prob. of an alleged man with a particular genotype 'C' being the true father.

NB: L is a way of 'weighting' the genetic evidence; the issue is setting a prior.

Statistical Distributions - Characterisation

If a statistical experiment only gives rise to real numbers, the outcome of the experiment is called a random variable. If a random variable X takes values

  X_1, X_2, ..., X_n   with probabilities   p_1, p_2, ..., p_n

then the expected (average) value of X is defined to be

  E[X] = Σ p_j X_j

and its variance is

  VAR[X] = E[X²] - E[X]² = Σ p_j X_j² - E[X]².


Example. Let X be a random variable measuring the distance in kilometres travelled by children to a school, and suppose that the following data apply.

  Prob. p_j   Distance X_j   p_j X_j   p_j X_j²
  0.15        2.0            0.30      0.60
  0.40        4.0            1.60      6.40
  0.20        6.0            1.20      7.20
  0.15        8.0            1.20      9.60
  0.10        10.0           1.00      10.00
  1.00        -              5.30      33.80

Then the mean and variance are

  E[X] = 5.30 kilometres
  VAR[X] = 33.80 - 5.30² = 5.71 km²

Similar concepts apply to continuous distributions.
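A one-line check of the discrete mean and variance (a sketch in plain Python):

```python
p = [0.15, 0.40, 0.20, 0.15, 0.10]
x = [2.0, 4.0, 6.0, 8.0, 10.0]

ex  = sum(pj * xj for pj, xj in zip(p, x))         # E[X] = 5.30
ex2 = sum(pj * xj ** 2 for pj, xj in zip(p, x))    # E[X^2] = 33.80
print(ex, ex2 - ex ** 2)                           # 5.3 and approx. 5.71
```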










The distribution function is defined by

  F(t) = P{X ≤ t}

and its derivative is the frequency function

  f(t) = dF(t) / dt

so that

  F(t) = ∫ from -∞ to t of f(x) dx.

Sums and Differences of Random Variables

Define the covariance of two random variables to be

  COVAR[X, Y] = E[(X - E[X])(Y - E[Y])] = E[X Y] - E[X] E[Y].

If X and Y are independent, COVAR[X, Y] = 0.

Lemma

  E[X ± Y] = E[X] ± E[Y]
  VAR[X ± Y] = VAR[X] + VAR[Y] ± 2 COVAR[X, Y]
  E[k·X] = k·E[X],   VAR[k·X] = k²·VAR[X]   for a constant k.


Example. A company records the journey time X of a lorry from a depot to customers and the unloading times Y, as shown.

           X=1   X=2   X=3   X=4   Totals
  Y=1      7     5     4     4     20
  Y=2      2     6     8     3     19
  Y=3      1     2     5     3     11
  Totals   10    13    17    10    50

  E[X] = {1(10) + 2(13) + 3(17) + 4(10)} / 50 = 2.54
  E[X²] = {1²(10) + 2²(13) + 3²(17) + 4²(10)} / 50 = 7.5
  VAR[X] = 7.5 - (2.54)² = 1.0484

  E[Y] = {1(20) + 2(19) + 3(11)} / 50 = 1.82
  E[Y²] = {1²(20) + 2²(19) + 3²(11)} / 50 = 3.9
  VAR[Y] = 3.9 - (1.82)² = 0.5876

  E[X+Y] = {2(7)+3(5)+4(4)+5(4)+3(2)+4(6)+5(8)+6(3)+4(1)+5(2)+6(5)+7(3)} / 50 = 4.36
  E[(X+Y)²] = {2²(7)+3²(5)+4²(4)+5²(4)+3²(2)+4²(6)+5²(8)+6²(3)+4²(1)+5²(2)+6²(5)+7²(3)} / 50 = 21.04
  VAR[X+Y] = 21.04 - (4.36)² = 2.0304

  E[X Y] = {1(7)+2(5)+3(4)+4(4)+2(2)+4(6)+6(8)+8(3)+3(1)+6(2)+9(5)+12(3)} / 50 = 4.82
  COVAR[X, Y] = 4.82 - (2.54)(1.82) = 0.1972

VAR[X] + VAR[Y] + 2 COVAR[X, Y] = 1.0484 + 0.5876 + 2(0.1972) = 2.0304
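The whole example can be verified from the joint frequency table with a few lines of NumPy (a sketch; assumes numpy is available):

```python
import numpy as np

# Joint frequency table: rows Y = 1..3, columns X = 1..4
F = np.array([[7, 5, 4, 4],
              [2, 6, 8, 3],
              [1, 2, 5, 3]])
n = F.sum()                              # 50
X = np.arange(1, 5)                      # values 1..4, broadcast over columns
Y = np.arange(1, 4)[:, None]             # values 1..3, broadcast over rows

EX  = (F * X).sum() / n                  # 2.54
EY  = (F * Y).sum() / n                  # 1.82
EXY = (F * X * Y).sum() / n              # 4.82
cov = EXY - EX * EY                      # 0.1972

varX = (F * X**2).sum() / n - EX**2      # 1.0484
varY = (F * Y**2).sum() / n - EY**2      # 0.5876
print(varX + varY + 2 * cov)             # 2.0304 = VAR[X + Y]
```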


Standard Statistical Distributions

Most elementary statistical books provide a survey of commonly used statistical distributions. Importantly, we can characterise them by their expectation and variance (as for random variables) and by the parameters on which these are based (see lecture notes for those we refer to).

So, e.g. for a Binomial distribution, the parameters are p, the probability of 'success' in an individual trial, and n, the number of trials. The probability of success must remain constant; otherwise, another distribution applies.

Use of the correct distribution is core to statistical inference, i.e. estimating what is happening in the population on the basis of a (correctly drawn, probabilistic) sample. The sample is then representative of the population.

Fundamental to statistical inference is the Normal (or Gaussian) distribution, with parameters μ, the mean (or formally the expectation of the distribution), and σ (SD) or variance (σ²).

For small samples, or when σ² is not known but must be estimated from a sample, a slightly more conservative distribution, the Student's t or just 't' distribution, applies. This introduces the degrees of freedom concept.

Student's t Distribution

A random variable X has a t distribution with n degrees of freedom (t_n). The t distribution is symmetrical about the origin, with

  E[X] = 0     VAR[X] = n / (n - 2).

For small values of n, the t_n distribution is very flat. As n is increased, the density assumes a bell shape. For values of n ≥ 25, the t_n distribution is practically indistinguishable from the Standard Normal curve.

- If X and Y are independent random variables, where X has a standard normal distribution and Y has a χ²_n distribution, then

    X / √(Y / n)   has a t_n distribution.

- If x_1, x_2, ..., x_n is a random sample from a normal distribution with mean μ and variance σ², and if we define the estimated sample variance s² = 1/(n - 1) Σ (x_i - x̄)², then

    (x̄ - μ) / (s / √n)   has a t_{n-1} distribution

  (see calculators, tables etc.).

+ Many other standard distributions.
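To see the t distribution approach the standard normal as the degrees of freedom grow, one can compare two-sided 95% critical values (a sketch assuming SciPy is available):

```python
from scipy.stats import norm, t

# Two-sided 95% critical values of t_n shrink towards the N(0,1) value
for n in (5, 10, 25, 100):
    print(n, round(t.ppf(0.975, df=n), 3))   # 2.571, 2.228, 2.060, 1.984
print(round(norm.ppf(0.975), 3))             # 1.96 for N(0,1)
```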

Sampling Theory

To draw a random sample from a distribution, assign numbers 1, 2, ... to the elements of the distribution and use random number tables or a generated set to decide which elements are included in the sample. If the same element cannot be selected more than once, we say that the sample is drawn without replacement; otherwise, the sample is said to be drawn with replacement.

The usual convention in sampling is that lower case letters designate the sample characteristics, with capital letters used for the (finite) parent population and Greek letters for the infinite. Thus if the sample size is n, its elements are designated x_1, x_2, ..., x_n, its mean is x̄ and its modified variance is

  s² = Σ (x_i - x̄)² / (n - 1).

The corresponding parent population characteristics are N, X̄ and S² (or μ and σ² for the infinite case).

Suppose we repeatedly draw random samples of size n (with replacement) from a distribution with mean μ and variance σ². Let x̄_1, x̄_2, ... be the collection of sample means and let

  x̄_i' = (x̄_i - μ) / (σ / √n)    (i = 1, 2, ...)

The collection x̄_1', x̄_2', ... is called the sampling distribution of means (usually U or Z).

Central Limit Theorem. In the limit, as the sample size n tends to infinity, the sampling distribution of means has a Standard Normal distribution. This is the basis for Statistical Inference.
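A quick simulation of the CLT (a sketch; assumes NumPy): means of samples from a very non-normal (exponential) distribution, standardised as above, look increasingly normal.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 1.0, 1.0, 50, 10000   # exponential(1) has mean 1, SD 1

# 10000 standardised sample means: (x-bar - mu) / (sigma / sqrt(n))
means = rng.exponential(mu, size=(reps, n)).mean(axis=1)
z = (means - mu) / (sigma / np.sqrt(n))

# Close to N(0,1): mean near 0, SD near 1, about 95% within +/- 1.96
print(z.mean(), z.std(), np.mean(np.abs(z) < 1.96))
```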

Attribute and Proportionate Sampling

If the sample elements are a measurement of some characteristic, then we have attribute sampling. However, if all sample elements are 1 or 0 (success/failure, agree/do-not-agree), we have proportionate sampling. For proportionate sampling, the sample average x̄ and the sample proportion p are synonymous (just as the mean μ and proportion P are for the parent population). From our results on the Binomial distribution, the sample variance is p(1 - p), and the variance of the parent distribution is P(1 - P), in the proportionate case.

The 'sampling distribution of means' concept generalizes to give the sampling distribution of any statistic. We say that a sample characteristic is an unbiased estimator of the parent population characteristic if the expectation of the corresponding sampling distribution is equal to the parent characteristic.

Lemma. The sample average (proportion) is an unbiased estimator of the parent average (proportion):

  E[x̄] = μ;   so E[p] = P.

The quantity fpc = √[(N - n) / (N - 1)] is called the finite population correction. If the parent population is infinite, or we have sampling with replacement, the fpc = 1.

Lemma. E[s] = S · fpc for the estimated sample S.D. with fpc.

Confidence Intervals

From the statistical tables for a Standard Normal (Gaussian) distribution, we note that

  Area Under Density Function   From    To
  0.90                          -1.64   1.64
  0.95                          -1.96   1.96
  0.99                          -2.58   2.58

From the central limit theorem, if x̄ and s² are the mean and variance of a random sample of size n (with n greater than 25) drawn from a large parent population of size N, then the following statement about the unknown parent mean μ applies:

  Prob { -1.64 ≤ (x̄ - μ) / (s / √n) ≤ 1.64 } = 0.90

i.e.

  Prob { x̄ - 1.64 s / √n ≤ μ ≤ x̄ + 1.64 s / √n } = 0.90

The range x̄ ± 1.64 s / √n is called a 90% confidence interval for the parent mean μ.

Example [Attribute Sampling]
A random sample of size 25 has x̄ = 15 and s = 2. Then a 95% confidence interval for μ is

  15 ± 1.96 (2 / 5)   (i.e.) 14.22 to 15.78

Example [Proportionate Sampling]
A random sample of size n = 1000 has p = 0.40 and

  1.96 √[p(1 - p) / (n - 1)] = 0.03.

A 95% confidence interval for P is 0.40 ± 0.03 (i.e.) 0.37 to 0.43.
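Both intervals in one short sketch (plain Python; only the math module is used):

```python
import math

# Attribute sampling: n = 25, x-bar = 15, s = 2; 95% -> z = 1.96
n, xbar, s, z = 25, 15.0, 2.0, 1.96
half = z * s / math.sqrt(n)
print(xbar - half, xbar + half)          # 14.216 to 15.784

# Proportionate sampling: n = 1000, p = 0.40
n, p = 1000, 0.40
half = z * math.sqrt(p * (1 - p) / (n - 1))
print(p - half, p + half)                # approx. 0.37 to 0.43
```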




Small Sampling Theory

For reference purposes, it is useful to regard the expression

  x̄ ± 1.96 s / √n

as a "default formula" for a confidence interval, and to modify it for particular circumstances.

- If we are dealing with proportionate sampling, the sample proportion is the sample mean, and the standard error (s.e.) term s / √n simplifies as follows:

    x̄ -> p   and   s / √n -> √[p(1 - p) / (n - 1)].   (Also n - 1 -> n)

- A 90% confidence interval will bring about the swap 1.96 -> 1.64.

- For sample size n less than 25, the Normal distribution must be replaced by Student's t_{n-1} distribution.

- For sampling without replacement from a finite population, an fpc term must be used.

The width of the confidence interval band increases with the confidence level.

Example. A random sample of size n = 10, drawn from a large parent population, has mean x̄ = 12 and standard deviation s = 2. Then a 99% confidence interval for the parent mean is

  x̄ ± 3.25 s / √n   (i.e.)  12 ± 3.25 (2) / 3   (i.e.)  9.83 to 14.17

and a 95% confidence interval for the parent mean is

  x̄ ± 2.262 s / √n   (i.e.)  12 ± 2.262 (2) / 3   (i.e.)  10.492 to 13.508.

(Here 3.25 and 2.262 are the 99% and 95% critical values of t_9, and √10 is taken as approximately 3.)

Note: For n = 1000, 1.96 √[p(1 - p) / n] ≈ 0.03 for values of p between 0.3 and 0.7. This gives rise to the statement that public opinion polls have an "inherent error of 3%", and it simplifies calculations in the case of public opinion polls for large political parties.

Tests of Hypothesis

[Motivational Example]. It is claimed that the average grade of all 12 year olds in a country in a particular aptitude test is 60%. A random sample of n = 49 students gives a mean x̄ = 55% with standard deviation s = 2%. Is the sample finding consistent with the claim?

The original claim is regarded as a null hypothesis (H_0), which is tentatively accepted as TRUE:

  H_0 : μ = 60

If the null hypothesis is true, the test statistic

  TS = (x̄ - μ) / (s / √n)

is a Random Variable with a Normal (0, 1), i.e. Standardised Normal Z(0,1) (or U(0,1)), distribution. Thus

  (55 - 60) / (2 / √49) = -35 / 2 = -17.5

is a random value from Z(0, 1). But this lies outside the 95% confidence interval (it falls in the rejection region), so either

  (i) the null hypothesis is incorrect, or
  (ii) an event with a probability of at most 0.05 has occurred.

Consequently, we reject the null hypothesis, knowing that a probability of 0.05 exists that we are in error. Technically: we reject the null hypothesis at the 0.05 level of significance.

The alternative to rejecting H_0 is to declare the test to be inconclusive. This means that there is some tentative evidence to support the view that H_0 is approximately correct.

[Figure: Z(0,1) density with central area 0.95 and two-sided rejection regions beyond ±1.96.]
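The test statistic computation, as a sketch in plain Python:

```python
import math

# H0: mu = 60; sample: n = 49, x-bar = 55, s = 2
n, xbar, s, mu0 = 49, 55.0, 2.0, 60.0
ts = (xbar - mu0) / (s / math.sqrt(n))
print(ts)                        # -17.5
print(abs(ts) > 1.96)            # True -> reject H0 at the 0.05 level
```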


Modifications

Based on the properties of the Normal, Student 't' and other distributions, we can generalise these ideas. If the sample size n < 25, a t_{n-1} distribution should be used; the level of significance of the test may also be varied, or the test applied to a proportionate sampling environment.

Example. 40% of a random sample of 1000 people in a country indicate satisfaction with government policy. Test at the 0.01 level of significance if this is consistent with the claim that 45% of the people support government policy.

Here, H_0: P = 0.45, p = 0.40, n = 1000, so

  √[p(1 - p) / n] = 0.015
  test statistic = (0.40 - 0.45) / 0.015 = -3.33

The 99% critical value is 2.58, so H_0 is rejected at the 0.01 level of significance.

One-Tailed Tests

If the null hypothesis is of the form H_0: P ≥ 0.45, then arbitrarily large values of p are acceptable, so the rejection region for the test statistic lies in the left-hand tail only.

Example. 40% of a random sample of 1000 people in a country indicate satisfaction with government policy. Test at the 0.05 level of significance if this is consistent with the claim that at least 45% of the people support government policy.

Here the critical value is -1.64, so the null hypothesis H_0: P ≥ 0.45 is rejected at the 0.05 level of significance.

[Figure: N(0,1) density with area 0.95 and the rejection region in the left tail below -1.64.]


Suppose that x_1, x_2, ..., x_m is a random sample, with mean x̄ and standard deviation s_1, drawn from a distribution with mean μ_1, and that y_1, y_2, ..., y_n is a random sample, with mean ȳ and standard deviation s_2, drawn from a distribution with mean μ_2. Suppose that we wish to test the null hypothesis that both samples are drawn from the same parent population, (i.e.)

  H_0 : μ_1 = μ_2.

The pooled estimate of the parent variance is

  s*² = s_p² = { (m - 1) s_1² + (n - 1) s_2² } / (m + n - 2)

and the variance of (x̄ - ȳ), as the variance of the difference of two independent random variables, is

  s_diff² = s_p² / m + s_p² / n.

This allows us to construct the test statistic, which under H_0 has a t_{m+n-2} distribution.

Example. A random sample of size m = 25 has mean x̄ = 2.5 and standard deviation s_1 = 2, while a second sample of size n = 41 has mean ȳ = 2.8 and standard deviation s_2 = 1. Test at the 0.05 level of significance if the means of the parent populations are identical.

Here H_0: μ_1 = μ_2, x̄ - ȳ = -0.3 and

  s_p² = {24(4) + 40(1)} / 64 = 2.125

so the test statistic is

  -0.3 / √(2.125/25 + 2.125/41) = -0.81.

The 0.05 critical value for Z(0, 1) is ±1.96, so the test is inconclusive.
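The pooled two-sample computation, sketched in plain Python:

```python
import math

# Sample summaries: m, x-bar, s1 and n, y-bar, s2
m, xbar, s1 = 25, 2.5, 2.0
n, ybar, s2 = 41, 2.8, 1.0

# Pooled variance and standard error of the difference of means
sp2 = ((m - 1) * s1**2 + (n - 1) * s2**2) / (m + n - 2)   # 2.125
se = math.sqrt(sp2 / m + sp2 / n)

ts = (xbar - ybar) / se
print(ts)            # approx. -0.81: inside +/- 1.96, so inconclusive
```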


Testing Differences between Means - Paired Tests

If the sample values (x_i, y_i) are paired, such as the marks of students in two examinations, then let d_i = x_i - y_i be their differences, and treat these values as the elements of a sample to generate a test statistic for the hypothesis

  H_0 : μ_1 = μ_2.

The test statistic

  d̄ / (s_d / √n)

has a t_{n-1} distribution if H_0 is true.

Example. In a random sample of 100 students in a national examination, their examination mark in English is subtracted from their continuous assessment mark, giving a mean of 5 and a standard deviation of 2. Test at the 0.01 level of significance if the true mean mark for both components is the same.

Here n = 100, d̄ = 5 and s_d / √n = 2/10 = 0.2, so the test statistic is 5 / 0.2 = 25. The 0.01 critical value for a Z(0, 1) distribution is 2.58, so H_0 is rejected at the 0.01 level of significance.


Tests for the Variance.

For normally distributed random variables, given H_0: σ² = k, a constant, then

  (n - 1) s² / k

has a χ²_{n-1} distribution.

Example. A random sample of size 30 drawn from a normal distribution has variance s² = 5. Test at the 0.05 level of significance if this is consistent with H_0: σ² = 2.

Test statistic = (29) 5 / 2 = 72.5, while the 0.05 critical value for χ²_29 is 45.72, so H_0 is rejected at the 0.05 level of significance.

Chi-Square Test of Goodness of Fit

This can be used to test the hypothesis H_0 that a set of observations is consistent with a given probability distribution. We are given a set of categories with observed (O_j) and expected (E_j) numbers of observations (frequency) in each category. Under H_0, the

  Test Statistic   Σ (O_j - E_j)² / E_j

has a χ²_{n-1} distribution, with n the number of categories.

Example. A pseudo random number generator is used to generate 40 random numbers in the range 1-100. Test, at the 0.05 level of significance, if the results are consistent with the hypothesis that the outcomes are randomly distributed.

  Range             1-25   26-50   51-75   76-100   Total
  Observed Number   6      12      14      8        40
  Expected Number   10     10      10      10       40

Test statistic = (6-10)²/10 + (12-10)²/10 + (14-10)²/10 + (8-10)²/10 = 4.

The 0.05 critical value of χ²_3 = 7.81, so the test is inconclusive.
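The same test via SciPy (a sketch; scipy.stats.chisquare returns the statistic and the p-value):

```python
from scipy.stats import chisquare

observed = [6, 12, 14, 8]
expected = [10, 10, 10, 10]
stat, pvalue = chisquare(observed, expected)
print(stat, pvalue)   # 4.0, p approx. 0.26 > 0.05 -> inconclusive
```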


Chi-Square Contingency Test

To test that two random variables are statistically independent, a set of observations can be tabled, with m rows corresponding to categories for one random variable and n columns for the other. Under H_0, the expected number of observations for the cell in row i and column j is

  E_ij = (row i total × column j total) / (grand total).

Under H_0, the

  Test Statistic   Σ Σ (O_ij - E_ij)² / E_ij

has a χ²_{(m-1)(n-1)} distribution.

Chi-Square Contingency Test - Example

In the following table, the figures in brackets are the expected values.

  Results   Maths       History     Geography   Totals
  Honours   100 (50)    70 (67)     30 (83)     200
  Pass      130 (225)   320 (300)   450 (375)   900
  Fail      70 (25)     10 (33)     20 (42)     100
  Totals    300         400         500         1200

The test statistic is

  Σ Σ [(O_ij - E_ij)² / E_ij]
    = (100-50)²/50 + (70-67)²/67 + (30-83)²/83 + (130-225)²/225
      + (320-300)²/300 + (450-375)²/375 + (70-25)²/25 + (10-33)²/33 + (20-42)²/42
    = 248.976

The 0.05 critical value for χ² with (3-1)(3-1) = 4 degrees of freedom is 9.49, so H_0 is rejected at the 0.05 level of significance.
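SciPy computes the expected values, statistic and degrees of freedom directly (a sketch; chi2_contingency applies no continuity correction to tables larger than 2×2):

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[100,  70,  30],
                  [130, 320, 450],
                  [ 70,  10,  20]])
stat, pvalue, dof, expected = chi2_contingency(table)
print(stat, dof)     # approx. 249.3 with 4 degrees of freedom
                     # (the slide's 248.976 uses rounded expected values)
print(expected[0])   # first row of expected counts, cf. bracketed values above
```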


Note: In general, chi-squared tests tend to be very conservative vis-a-vis other tests of hypothesis, (i.e.) they tend to give inconclusive results.

The meaning of the term "degrees of freedom": in simplified terms, as the chi-square distribution is the sum of, say, k squares of independent random variables, it is defined in a k-dimensional space. When we impose a constraint of the type that the sums of observed and expected observations in a column are equal, or estimate a parameter of the parent distribution, we reduce the dimensionality of the space by 1. In the case of the chi-square contingency table, with m rows and n columns, the expected values in the final row and column are predetermined, so the number of degrees of freedom of the test statistic is (m - 1)(n - 1).

Analysis of Variance/Experimental Design - Many samples, Means and Variances

Analysis of Variance (AOV or ANOVA) was originally devised for agricultural statistics on crop yields etc. Typically a row and column format of small plots of a fixed size; the yield y_{i,j} within each plot was recorded.

One Way Classification

Model:   y_{i,j} = μ + α_i + ε_{i,j},    ε_{i,j} ~ N(0, σ²) in the limit

where
  μ = overall mean (as sample size large)
  α_i = effect of the i-th factor level
  ε_{i,j} = error term.

Hypothesis:   H_0 : α_1 = α_2 = ... = α_m

[Diagram: plots grouped into three factor levels, with yields y_{1,1} ... y_{1,5}, y_{2,1} ... y_{2,3}, y_{3,1} ... y_{3,3}.]







             Observations                           Totals              Means
  Factor 1   y_{1,1} y_{1,2} y_{1,3} ... y_{1,n1}   T_1 = Σ_j y_{1,j}   ȳ_1. = T_1 / n_1
  2          y_{2,1} y_{2,2} y_{2,3} ... y_{2,n2}   T_2 = Σ_j y_{2,j}   ȳ_2. = T_2 / n_2
  ...
  m          y_{m,1} y_{m,2} y_{m,3} ... y_{m,nm}   T_m = Σ_j y_{m,j}   ȳ_m. = T_m / n_m

Overall mean: ȳ = Σ y_{i,j} / n, where n = Σ n_i.

Decomposition (Partition) of Sums of Squares:

  Σ (y_{i,j} - ȳ)² = Σ n_i (ȳ_i. - ȳ)² + Σ (y_{i,j} - ȳ_i.)²

  Total Variation (Q) = Between Factors (Q_1) + Residual Variation (Q_E)

Under H_0:   Q/σ² ~ χ²_{n-1},   Q_1/σ² ~ χ²_{m-1},   Q_E/σ² ~ χ²_{n-m},   and

  [ Q_1 / (m - 1) ] / [ Q_E / (n - m) ]  ~  F_{m-1, n-m}

AOV Table:

  Variation   D.F.    Sums of Squares              Mean Squares           F
  Between     m - 1   Q_1 = Σ n_i (ȳ_i. - ȳ)²      MS_1 = Q_1/(m - 1)     MS_1 / MS_E
  Residual    n - m   Q_E = Σ (y_{i,j} - ȳ_i.)²    MS_E = Q_E/(n - m)
  Total       n - 1   Q = Σ (y_{i,j} - ȳ)²         Q/(n - 1)

Two-Way Classification

              Factor I                               Means
  Factor II   y_{1,1} y_{1,2} y_{1,3} ... y_{1,n}    ȳ_1.
  :           :                                      :
              y_{m,1} y_{m,2} y_{m,3} ... y_{m,n}    ȳ_m.
  Means       ȳ._1  ȳ._2  ȳ._3  ...  ȳ._n            ȳ.. (the overall mean, written simply ȳ)

Partition of SSQ:

  Σ (y_{i,j} - ȳ)² = n Σ (ȳ_i. - ȳ)² + m Σ (ȳ._j - ȳ)² + Σ (y_{i,j} - ȳ_i. - ȳ._j + ȳ)²

  Total Variation = Between Rows + Between Columns + Residual Variation

Model:

  y_{i,j} = μ + α_i + β_j + ε_{i,j}    with    ε_{i,j} ~ N(0, σ²)

  H_0 : all α_i are equal.    H_0 : all β_j are equal.

AOV Table:

  Variation         D.F.             Sums of Squares                          Mean Squares                 F
  Between Rows      m - 1            Q_1 = n Σ (ȳ_i. - ȳ)²                    MS_1 = Q_1/(m - 1)           MS_1 / MS_E
  Between Columns   n - 1            Q_2 = m Σ (ȳ._j - ȳ)²                    MS_2 = Q_2/(n - 1)           MS_2 / MS_E
  Residual          (m - 1)(n - 1)   Q_E = Σ (y_{i,j} - ȳ_i. - ȳ._j + ȳ)²     MS_E = Q_E/((m - 1)(n - 1))
  Total             mn - 1           Q = Σ (y_{i,j} - ȳ)²                     Q/(mn - 1)

Two-Way Example

  Data                 Factor I
                1      2      3      4      5      Totals   Means
  Factor II 1   20     18     21     23     20     102      20.4
            2   19     18     17     18     18     90       18.0
            3   23     21     22     23     20     109      21.8
            4   17     16     18     16     17     84       16.8
  Totals        79     73     78     80     75     385
  Means         19.75  18.25  19.50  20.00  18.75           19.25

  ANOVA outline
  Variation   d.f.   SSQ      MSQ     F
  Rows        3      76.95    25.65   18.86**
  Columns     4      8.50     2.13    1.57
  Residual    12     16.30
  Total       19     101.75

FYI, software such as R, SAS, SPSS, MATLAB is designed for analysing these data, e.g. SPSS as a spreadsheet with variables recorded in columns and individual observations in the rows. Thus the ANOVA data above would be written as a set of columns or rows, e.g.

  Var. value   20 18 21 23 20 19 18 17 18 18 23 21 22 23 20 17 16 18 16 17
  Factor 1     1  1  1  1  1  2  2  2  2  2  3  3  3  3  3  4  4  4  4  4
  Factor 2     1  2  3  4  5  1  2  3  4  5  1  2  3  4  5  1  2  3  4  5
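A from-scratch computation of this two-way table (a sketch; assumes NumPy), following the partition of sums of squares above:

```python
import numpy as np

# Yields: 4 rows (Factor II) x 5 columns (Factor I)
y = np.array([[20, 18, 21, 23, 20],
              [19, 18, 17, 18, 18],
              [23, 21, 22, 23, 20],
              [17, 16, 18, 16, 17]], dtype=float)
m, n = y.shape
g = y.mean()                                   # overall mean 19.25

row_means = y.mean(axis=1, keepdims=True)      # y-bar_i.
col_means = y.mean(axis=0, keepdims=True)      # y-bar_._j

Q1 = n * ((row_means - g) ** 2).sum()          # rows SSQ = 76.95
Q2 = m * ((col_means - g) ** 2).sum()          # columns SSQ = 8.50
QE = ((y - row_means - col_means + g) ** 2).sum()   # residual SSQ = 16.30

MS1, MS2, MSE = Q1 / (m - 1), Q2 / (n - 1), QE / ((m - 1) * (n - 1))
print(MS1 / MSE, MS2 / MSE)    # F approx. 18.9 and 1.57 (slide rounds to 18.86)
```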



DA APPLICATIONS CONTEXT e.g. BIO

- GENETICS: 5 branches; aim = 'Laws' of Chemistry, Physics, Maths. for Biology

- GENOMICS: Study of Genomes (the complete set of DNA carried by a Gamete) by integration of the 5 branches of Genetics with 'Informatics and Automated Systems'

- PURPOSE of GENOME RESEARCH: Info. on Structure, Function, Evolution of all Genomes, past and present

- Techniques of Genomics from molecular, quantitative, population genetics; Concepts and Terminology from Mendelian genetics and cytogenetics

CONTEXT: GENETICS - BRANCHES

- Classical Mendelian: Gene and Locus, Allele, Segregation, Gamete, Dominance, Mutation
- Cytogenetics: Cell, Chromosome, Meiosis and Mitosis, Crossover and Linkage
- Molecular: DNA sequencing, Gene Regulation and Transcription, Translation and Genetic Code Mutations
- Population: Allelic/Genotypic Frequencies, Equilibrium, Selection, Drift, Migration, Mutation
- Quantitative: Heritability/Additive, Non-additive Genetic Effects, Genetic by Environment Interaction, Plant and Animal Breeding

CONTEXT+: GENOMICS - LINKAGES

[Diagram: Mendelian, Cytogenetics, Molecular, Population and Quantitative genetics feed into GENOMICS via genetic markers, DNA sequences, linkage/physical maps, gene location and QTL mapping.]

GENOMICS - some KEY QUESTIONS

- HOW do Genes determine total phenotype?
- HOW MANY functional genes are necessary and sufficient in a given system?
- WHAT are the necessary Physical/Chemical aspects of gene structure?
- IS gene location in the Genome specific?
- WHAT DNA sequences/structures are needed for gene-specific functions?
- HOW MANY different functional genes are in the whole biosphere?
- WHAT MEASURES of essential DNA sameness exist in different species?

'DATA': STATISTICAL GENOMICS - Some UNUSUAL/SPECIAL FEATURES

- Size: databases very large, e.g. molecular marker and DNA/protein sequence data; unreconciled, legacy
- Mixtures of variables - discrete/continuous, e.g. combination of genotypes of genetic markers (D) and values of quantitative traits (C)
- Empirical distributions needed for some test statistics, e.g. QTL analysis, H.T. of locus order
- Intensive computation, e.g. linkage analysis, QTL and computationally greedy algorithms in locus ordering, derivation of empirical distributions, sequence matching etc.
- Likelihood analysis - linear models typically insufficient alone

DA APPLICATIONS CONTEXT e.g. BUSINESS/FINANCE

http://big.computing.dcu.ie/ ; http://sci-sym.dcu.ie

- Data-rich environments; under-utilisation of resources; RAW DATA into useful information and knowledge
- Similar underpinning ('Laws') based on analysis
- Purpose: informed decision-making
- Techniques: quantitative
- Concepts & Nature: pervasive, dynamic, 'health' subject to internal/external environments. Key elements - systems and people
- Forecasting/Prediction/Trigger

CONTEXT+: FACTORS

[Diagram: HEALTH of ENTERPRISE (governmental, corporate, educational, non-profit) driven by Supply Chain, Capital, Knowledge & Systems, Labour, Globalisation/technology and Adaptability.]

FRAMEWORK

- Status: huge array of information systems & product software.
- Challenges: include development, delivery, adoption and implementation of IT solutions into usable and effective systems that mimic/support organisational processes. 'KS alignment with work practice.' (Toffler & Drucker, 80's: organisations of the 20th Century -> knowledge-based. Greater autonomy, revised management structures.)
- Opportunities: KM popularity grew through the 90's and spawned ideas of 'KM models', 'KM strategy', concepts of 'organisational learning', 'knowledge/practice networks', 'knowledge discovery', 'intellectual capital'.
- Objectives: to plan, develop, implement, operate, optimise and cost information/communication systems, and interpret their use.
- Starting point: understanding ICT opportunities requires both a technological and an organisational perspective, plus an understanding of the benefits associated with data capture and analysis.

Data Mining & KM

The Knowledge Discovery Process
- Classification, e.g. clusters, trees
- Exploratory Data Analysis
- Models (including Bayesian Networks), graphical or other
- Frequent Pattern Mining and special groups/subgroups

Key Features:
- 'Learning models' from data can be an important part of building an intelligent decision support system.
- Sophistication of analyses: computationally expensive data mining methods, complexity of algorithms, interpretation and application of models.

Hot Topics in BI

- Business Process Management and Modelling
- Supply Chain Management and Logistics
- Innovation and ICT
- Analytical Information Systems, Databases and Data Warehousing
- Knowledge Management and Discovery
- Social Networks and Knowledge Communities
- Performance Indicators & Measurement systems / Information Quality
- Data Analytics, Integration and Interpretation
- Cost-benefit and Impact Analysis
- Reference Models and Modelling
- Process Simulation and Optimization
- Security and Privacy
- IT and IS Architectures/Management
- Info. Sys. development, Tools and Software Engineering

Example Questions

- What are the characteristics of internet purchases for a given age-group? How can this be used to develop further E-business?
- What are the key risk factors for profit/loss on a product, on the basis of historical data and demographic variables?
- Can we segment into/identify groups of similar customers on the basis of their characteristics and purchase behaviour?
- Which products are typically bought together in one transaction by customers?
- What are the financial projections, given market volatility and knock-on effects of a recent shock?
- What data should an in-house information system collect? What design principles are involved for a large database?
- What is involved in modelling and IT-supported optimisation of key business processes?