# Statistics And Application

Nov 8, 2013

Revealing Facts From Data

What Is Statistics

Statistics is a mathematical science pertaining to the collection, analysis,
interpretation, and presentation of data.

It is applicable to a wide variety of disciplines, from the physical and
social sciences to the humanities, as well as to government, medicine
and industry.

Statistics Is …

Almost every professional needs a
statistical tool.

Statistical skills enable you to intelligently
collect, analyze and interpret data relevant
to your decision-making.

Statistical concepts enable us to solve
problems in a diversity of contexts.

Statistical thinking enables you to add

Statistics is a science

It assists you in making decisions under uncertainty.
The decision-making process must be based on data,
not on personal opinion or belief.

It is already an accepted fact that "Statistical thinking
will one day be as necessary for efficient citizenship
as the ability to read and write." So, let us be ahead of
our time.

In the US, students learn statistics starting in middle
school.

Type Of Statistics

Descriptive statistics deals with the description problem: Can
the data be summarized in a useful way, either numerically or
graphically, to yield insight about the population in question?
Basic examples of numerical descriptors include the mean and
standard deviation. Graphical summarizations include various
kinds of charts and graphs.
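
As a small illustration of numerical and graphical description, the sketch below summarizes a made-up sample with Python's standard `statistics` module. The `scores` list is hypothetical data invented for this example, and the text histogram is only a rough stand-in for a real chart.

```python
import statistics

# Hypothetical sample: exam scores for ten students (made-up values).
scores = [62, 71, 75, 75, 80, 82, 85, 88, 90, 92]

mean = statistics.mean(scores)     # arithmetic average
stdev = statistics.stdev(scores)   # sample standard deviation (n - 1 denominator)

print(f"mean = {mean:.1f}, standard deviation = {stdev:.2f}")

# A crude text histogram as a stand-in for a graphical summary.
for low in range(60, 100, 10):
    count = sum(low <= s < low + 10 for s in scores)
    print(f"{low}-{low + 9}: {'#' * count}")
```

The mean and standard deviation compress the sample into two numbers, while the histogram shows where the scores cluster.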

Inferential statistics is used to model patterns in the data,
accounting for randomness and drawing inferences about the
larger population. These inferences may take the form of
answers to yes/no questions (hypothesis testing), estimates of
numerical characteristics (estimation), prediction of future
observations, descriptions of association (correlation), or
modeling of relationships (regression). Other modeling
techniques include ANOVA, time series, and data mining.
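
Two of the inferential tools named above, correlation and regression, can be sketched in a few lines. The function name `pearson_and_line` and the `hours`/`score` data are assumptions made up for this illustration; the formulas are the usual Pearson correlation coefficient and the ordinary least-squares line.

```python
from math import sqrt

def pearson_and_line(xs, ys):
    """Return (r, slope, intercept) for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)              # description of association
    slope = sxy / sxx                      # least-squares slope
    intercept = mean_y - slope * mean_x    # least-squares intercept
    return r, slope, intercept

# Hypothetical data: hours studied vs. exam score (made-up values).
hours = [1, 2, 3, 4, 5]
score = [52, 58, 61, 70, 74]

r, slope, intercept = pearson_and_line(hours, score)
print(f"r = {r:.3f}; fitted line: score = {intercept:.1f} + {slope:.1f} * hours")
```

The coefficient `r` describes how strongly the two variables move together; the fitted line models the relationship and can be used, with caution, to predict new observations.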

Type of Studies

There are two major types of causal statistical studies: experimental
studies and observational studies. In both types of studies, the effect
of differences in an independent variable (or variables) on the
behavior of the dependent variable is observed. The difference
between the two types is in how the study is actually conducted.
Each can be very effective.

An experimental study involves taking measurements of the system
under study, manipulating the system, and then taking additional
measurements using the same procedure to determine if the
manipulation may have modified the values of the measurements. In
contrast, an observational study does not involve experimental
manipulation. Instead data are gathered and correlations between
predictors and the response are investigated.

Type of Statistical Courses

Two types:

Greater statistics is everything related to
learning from data, from the first planning or
collection to the last presentation or report,
carried out with a deep respect for data and truth.

Lesser statistics is the body of statistical
methodology that has no interest in data or
truth and consists generally of arithmetic exercises. If a
certain assumption is needed to justify a
procedure, it will simply "assume the ... are
normally distributed", no matter how unlikely
that might be.

Statistical Models

Statistical models are currently used in
various fields of business and science.

The terminology differs from field to field.
For example, the fitting of models to data,
variously called calibration, history matching, or
data assimilation, is synonymous with
parameter estimation.

Data Analysis

Developments in statistical data analysis
fields to which statistical methods are fruitfully
applied.

Decision making process under uncertainty is
largely based on application of statistical data
analysis for probabilistic risk assessment of

Decision makers need to lead others to apply
statistical thinking in day to day activities and
secondly,

Decision makers need to apply the concept
for the purpose of continuous improvement.

Is Data Information?

The database in your office contains a wealth of information.

The decision technology group members tap a fraction of
it.

Employees waste time scouring multiple sources for a
database.

The decision-makers are frustrated because they cannot get
critical data exactly when they need it.

Therefore, too many decisions are based on guesswork,
not facts. Many opportunities are also missed, if they are
even noticed at all.

Data itself is not information, but might generate
information.

Knowledge

Knowledge is what we know well. Information is the
communication of knowledge.

In every knowledge exchange, the sender makes
common what is private; the sender does the informing, the
communicating.

Information can be classified into explicit and tacit forms.

Explicit information can be explained in structured
form, while tacit information is inconsistent and fuzzy to
explain.

Know that data are only crude information and not
knowledge by themselves.

Data → Knowledge (?)

Data is known to be crude information and not
knowledge by itself.

The sequence from data to knowledge is: from Data to
Information, from Information to Facts, and finally,
from Facts to Knowledge.

Data becomes information, when it becomes relevant to

Information becomes fact, when the data can support it.
Facts are what the data reveals.

However the decisive instrumental (i.e., applied)
knowledge is expressed together with some statistical
degree of confidence.

Fact → Knowledge

Fact becomes knowledge, when it is used in the
successful completion of a statistical process.

Statistical Analysis

As the exactness of a statistical model increases,
the level of improvement in decision-making
increases; this is the reason for using statistical data
analysis.

Statistical data analysis arose from the need to
place knowledge on a systematic evidence base.

Statistics is a study of the laws of probability, the
development of measures of data properties and
relationships, and so on.

Statistical Inference

Verifying the statistical hypothesis: determining whether
any statistical significance can be attached to the results
after due allowance is made for any random variation as
a source of error.

Intelligent and critical inferences cannot be made by
those who do not understand the purpose, the conditions,
and applicability of the various techniques for judging
significance.

Considering the uncertain environment, the chance that
"good decisions" are made increases with the availability
of "good information." The chance that "good
information" is available increases with the level of
structuring the process of Knowledge Management.

Knowledge Needs Wisdom

Wisdom is the power to put our time and
our knowledge to the proper use.

Wisdom is the accurate application of
accurate knowledge.

Wisdom is about knowing how technical
staff can be best used to meet the needs
of the decision-maker.

History Of Statistics

The word statistics ultimately derives from the modern Latin term
statisticum collegium ("council of state") and the Italian word
statista ("statesman" or "politician").

The birth of statistics occurred in the mid-17th century. A commoner
named John Graunt, a native of London, began reviewing a
weekly church publication issued by the local parish clerk that listed
the number of births, christenings, and deaths in each parish. These
so-called Bills of Mortality also listed the causes of death. Graunt,
who was a shopkeeper, organized these data in the forms we now call
descriptive statistics, which were published as Natural and Political
Observations Made upon the Bills of Mortality. Shortly thereafter, he
was elected a member of the Royal Society. Thus, statistics has had to
borrow some concepts from sociology, such as the concept of
"Population". It has been argued that since statistics usually involves
the study of human behavior, it cannot claim the precision of the
physical sciences.

Statistics is for Government

The original principal purpose of Statistik was data to be
used by governmental and (often centralized)
and localities continues, largely through national and
international statistical services.

Censuses

population
.

During the 20th century, the creation of precise
instruments for public health concerns (epidemiology,
biostatistics, etc.) and economic and social purposes
(unemployment rate, econometry, etc.) necessitated

History of Probability

Probability has a much longer history. Probability is derived from the
verb to probe, meaning to "find out" what is not too easily accessible
or understandable. The word "proof" has the same origin: it
provides the necessary details to understand what is claimed to be true.

Probability originated from the study of games of chance and
gambling during the sixteenth century. Probability theory was a
branch of mathematics studied by Blaise Pascal and Pierre de
Fermat in the seventeenth century.

Currently, in the 21st century, probabilistic modeling is used to control
the flow of traffic through a highway system, a telephone
interchange, or a computer processor; to find the genetic makeup of
individuals or populations; and in quality control, insurance, investment,
and other sectors of business and industry.

Stat Merge With Prob

Statistics eventually merged with the field of inverse probability,
referring to the estimation of a parameter from experimental data in
the experimental sciences (most notably astronomy).

Today the use of statistics has broadened far beyond the service of
a state or government, to include such areas as business, natural
and social sciences, and medicine, among others.

Statistics emerged in part from probability theory, which can be
dated to the correspondence of Pierre de Fermat and Blaise Pascal
(1654). Christiaan Huygens (1657) gave the earliest known scientific
treatment of the subject. Jakob Bernoulli's Ars Conjectandi
(posthumous, 1713) and Abraham de Moivre's Doctrine of Chances
(1718) treated the subject as a branch of mathematics.

Development in the 18th-19th Century

The theory of errors may be traced back to Roger Cotes's Opera
Miscellanea (posthumous, 1722), but a memoir prepared by Thomas
Simpson in 1755 (printed 1756) first applied the theory to the
discussion of errors of observation.

Daniel Bernoulli (1778) introduced the principle of the maximum
product of the probabilities of a system of concurrent errors.

The method of least squares, which was used to minimize errors in
data measurement, is due to Adrien-Marie Legendre (1805), who was
motivated by the problems of survey measurements and of reconciling
disparate physical measurements.

General theory in statistics: by Laplace (1810, 1812), Gauss (1823),
James Ivory (1825, 1826), Hagen (1837), Friedrich Bessel (1838),
W. F. Donkin (1844, 1856), and Morgan Crofton (1870). Other
contributors were Ellis (1844), De Morgan (1864), Glaisher (1872),
and Giovanni Schiaparelli (1875).

Statistics in the 20th Century

Karl Pearson (March 27, 1857 - April 27, 1936) was a major contributor to the early
development of statistics. Pearson's work was all-embracing in the wide application
and development of mathematical statistics, and encompassed the fields of
biology, epidemiology, anthropometry, medicine and social history. His main
contributions are:

Linear regression and correlation. The Pearson product-moment correlation
coefficient was the first important effect size to be introduced into statistics.

Classification of distributions, which forms the basis for a lot of modern statistical
theory; in particular, the exponential family of distributions underlies the theory of
generalized linear models.

Pearson's chi-square test.

Sir Ronald Aylmer Fisher, FRS (17 February 1890 - 29 July 1962)

Fisher invented the techniques of maximum likelihood and analysis of variance, and
originated the concepts of sufficiency, ancillarity, Fisher's linear discriminator
and Fisher information. His 1924 article "On a distribution yielding the error functions of
several well known statistics" presented Karl Pearson's chi-squared and Student's t in
the same framework as the normal distribution and his own analysis of variance
distribution z (more commonly used today in the form of the F distribution). These
contributions easily made him a major figure in 20th century statistics. He began the
field of non-parametric statistics; entropy as well as Fisher information were essential
for developing Bayesian analysis.

Statistics in the 20th Century

Gertrude Mary Cox (January 13, 1900 - 1978): experimental design.

Charles Edward Spearman (September 10, 1863 - September 7, 1945):
non-parametric analysis, rank correlation coefficient.

Chebyshev's inequality

Lyapunov's central limit theorem

John Wilder Tukey (June 16, 1915 - July 26, 2000): jackknife estimation,
exploratory data analysis and confirmatory data analysis.

George Bernard Dantzig (8 November 1914 - 13 May 2005): developed
the simplex method and furthered linear programming; advanced the fields
of decomposition theory, sensitivity analysis, complementary pivot
methods, large-scale optimization, nonlinear programming, and
programming under uncertainty.

Bayes' theorem

Sir David Roxbee Cox (born Birmingham, England, 1924):
pioneering and important contributions to numerous areas of statistics and
applied probability, of which the best known is perhaps the proportional
hazards model, which is widely used in the analysis of survival data.

Schools of Thought in Statistics

The Classical, attributed to Laplace

Relative Frequency, attributed to Fisher

Bayesian, attributed to Savage

What Type of Statistician Are You?

Classic Statistics

The problem with the Classical Approach is that what
constitutes an outcome is not objectively determined.
One person's simple event is another person's
compound event. One researcher may ask, of a newly
discovered planet, "what is the probability that life exists
on the new planet?" while another may ask "what is the
probability that carbon-based life exists on it?"

Bruno de Finetti, in the introduction to his two-volume
treatise on Bayesian ideas, clearly states that
"Probabilities Do not Exist". By this he means that
probabilities are not located in coins or dice; they are not
characteristics of things like mass, density, etc.

Relative Frequency Statistics

Considers probabilities as "objective" attributes
of things (or situations) which are really out
there (availability of data).

Uses only the data at hand to make
interpretations.

Even when substantial prior information is available,
frequentists do not use it, while Bayesians are
willing to assign probability distribution
function(s) to the population's parameter(s).

Bayesian approaches

Considers probability theory as an extension of deductive
logic (including dialogue logic, interrogative logic,
informal logic, and artificial intelligence) to handle
uncertainty.

The first principle is that the uniquely correct way to reason
is to start from your belief about the state of things (the prior)
and update it in the light of the evidence.

The laws of probability have the same status as the laws
of logic.

Bayesian approaches are explicitly "subjective" in the
sense that they deal with the plausibility which a rational
agent ought to attach to the propositions he/she
considers, "given his/her current state of knowledge and
experience."
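
The idea of starting from a prior and updating it in the light of evidence can be sketched with the standard Beta-binomial conjugate update. The helper names `update_beta` and `beta_mean`, the Beta(2, 2) prior, and the coin-toss counts are all assumptions chosen for illustration.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Conjugate update: Beta(a, b) prior plus binomial data gives a Beta posterior."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Hypothetical prior belief about a coin's probability of heads: Beta(2, 2),
# i.e. "probably near fair, but not certain".
prior_a, prior_b = 2, 2

# Hypothetical evidence: 7 heads and 3 tails in 10 tosses.
heads, tails = 7, 3

post_a, post_b = update_beta(prior_a, prior_b, heads, tails)

print(f"prior mean            = {beta_mean(prior_a, prior_b):.3f}")
print(f"posterior mean        = {beta_mean(post_a, post_b):.3f}")
print(f"data-only proportion  = {heads / (heads + tails):.3f}")
```

The posterior mean falls between the prior mean and the observed proportion, which is the sense in which the prior belief is revised by the data rather than replaced by it.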

Discussion

From a scientist's perspective, there are good grounds to reject
Bayesian reasoning. Bayesian reasoning deals not with objective but
with subjective probabilities. The result is that any reasoning using a
Bayesian approach cannot be checked, which makes it
worthless to science, like non-replicable experiments.

Bayesian perspectives often shed a helpful light on classical
procedures. It is necessary to go into a Bayesian framework to give
confidence intervals. This insight is helpful in drawing attention to
the point that another prior distribution would lead to a different
interval.

A Bayesian may cheat by basing the prior distribution on the data,
because priors must be personal for coherence to hold before the
study, which is more complex.

Objective Bayesian: There is a clear connection between probability
and logic: both appear to tell us how we should reason. But how,
exactly, are the two concepts related? Objective Bayesians offers

Steps Of The Analysis

1. Defining the problem: An exact definition of the problem
is imperative in order to obtain accurate data about it.

2. Collecting the data: Designing ways to collect data is an
important job in statistical data analysis. Population and
sample are very important aspects.

3. Analyzing the data: Exploratory methods are used to
discover what the data seem to be saying by using
simple arithmetic and easy-to-draw pictures to
summarize data. Confirmatory methods use ideas from
probability theory in the attempt to answer specific
questions.

4. Reporting the results

Type of Data, Levels of
Measurement & Errors

Qualitative and Quantitative

Discrete and Continuous

Nominal, Ordinal, Interval and Ratio

Types of error: Recording error, typing error,
transcription error (incorrect copying),
Inversion (e.g., 123.45 is typed as 123.54),
Repetition (when a number is repeated),
Deliberate error, Type Error, etc.

Data Collection: Experiments

An experiment is a set of actions and observations, performed for
solving a given problem, to test a hypothesis or research concerning
phenomena. It is an empirical approach acquiring deeper
knowledge

Design of experiments

In the "hard" sciences, design tends to focus on the elimination of extraneous effects;
in the "soft" sciences it focuses more on the problems of external validity, by using
statistical methods. Events also occur naturally from which scientific evidence can be
drawn, which is the basis for natural experiments.

Controlled experiments

To demonstrate a cause and effect hypothesis, an experiment must often show that,
for example, a phenomenon occurs after a certain treatment is given to a subject,
and that the phenomenon does not occur in the absence of the treatment.

A controlled experiment generally compares the results obtained from an
experimental sample against a control sample, which is practically identical to
the experimental sample except for the one aspect whose effect is being tested.
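
One simple way to compare an experimental sample with a control sample is a permutation test, sketched below. This is an illustrative choice of technique, not one named in the slides; the function `permutation_test` and the `treated`/`control` measurements are made up for the example.

```python
import random

def permutation_test(treated, control, n_resamples=10_000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = sum(treated) / len(treated) - sum(control) / len(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_resamples):
        # Reassign group labels at random, as if the treatment had no effect.
        rng.shuffle(pooled)
        diff = (sum(pooled[:n_treated]) / n_treated
                - sum(pooled[n_treated:]) / (len(pooled) - n_treated))
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_resamples

# Hypothetical measurements (made-up values).
treated = [24, 27, 29, 30, 31, 33]
control = [20, 21, 22, 24, 25, 26]

observed, p_value = permutation_test(treated, control)
print(f"difference in means = {observed:.1f}, approximate p-value = {p_value:.4f}")
```

A small p-value says that a difference this large would rarely arise from random labelling alone, which supports (but does not prove) an effect of the treatment.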

Data Collection: Experiments

Natural experiments or quasi-experiments

Natural experiments rely solely on observations of the variables of the system under
study, rather than manipulation of just one or a few variables as occurs in controlled
experiments. Much research in several important science disciplines, including
geology, paleontology, ecology, meteorology, and astronomy, relies on
quasi-experiments.

Observational studies

Observational studies are very much like controlled experiments except that they
lack probabilistic equivalency between groups. These types of experiments often
arise in the area of medicine where, for ethical reasons, it is not possible to create a
truly controlled group.

Field Experiments

Field experiments are so named in order to draw a contrast with laboratory
experiments. They are often used in the social sciences, economics, etc. Field
experiments suffer from the possibility of contamination: experimental conditions
can be controlled with more precision and certainty in the lab.

Data Analysis

Applied Statistics

Actuarial science

Applies mathematical and statistical methods to finance and insurance,
particularly to the assessment of risk. Actuaries are professionals who are
qualified in this field.

Actuarial science

Actuarial science is the discipline that applies mathematical and
statistical methods to assess risk in the insurance and finance
industries. Actuaries are professionals who are
qualified in this field through examinations and experience.

Actuarial science includes a number of interrelating subjects,
including probability and statistics, finance, and economics.
Historically, actuarial science used deterministic models in
the construction of tables and premiums. The science has
gone through revolutionary changes during the last 30 years
due to the proliferation of high speed computers and the
synergy of stochastic actuarial models with modern financial
theory (Frees 1990).

programs in actuarial science. In 2002, a Wall Street Journal
survey on the best jobs in the United States listed "actuary"
as the second best job (Lee 2002).

Where Do Actuaries Work and What Do They Do?

The insurance industry can't function without actuaries, and that's where most of them
work. They calculate the costs to assume risk: how much to charge policyholders for
life or health insurance premiums, or how much an insurance company can expect to
pay in claims when the next hurricane hits Florida.

Actuaries provide a financial evaluation of risk for their companies to be used for strategic
management decisions. Because their judgement is heavily relied upon, actuaries'
career paths often lead to upper management and executive positions.

When other businesses that do not have actuaries on staff need certain financial advice,
they hire actuarial consultants. A consultant can be self-employed in a one-person
practice or work for a nationwide consulting firm. Consultants help companies design
pension and benefit plans and evaluate assets and liabilities. By delving into the
financial complexities of corporations, they help companies calculate the cost of a
variety of business risks. Consulting actuaries rub elbows with chief financial officers,
operating and human resource executives, and often chief executive officers.

Actuaries work for the government too, helping manage such programs as the Social
Security system and Medicare. Since the government regulates the insurance
industry and administers laws on pensions and financial liabilities, it also needs
actuaries to determine whether companies are complying with the law.

Who else asks an actuary to assess risks and solve thorny statistical and financial
problems? You name it: banks and investment firms, large corporations, public
accounting firms, insurance rating bureaus, labor unions, and fraternal organizations.

Typical actuarial projects:

Analyzing insurance rates, such as for cars,
homes or life insurance.

Estimating the money to be set aside for claims
that have not yet been paid.

Participating in corporate planning, such as
mergers and acquisitions.

Calculating a fair price for a new insurance
product.

Forecasting the potential impact of catastrophes.

Analyzing investment programs.

VEE

Applied Statistical Methods

Courses that meet this requirement may be taught in the mathematics, statistics, or
economics department, or in the business school. In economics departments, this
course may be called Econometrics. The material could be covered in one course or
two. The mathematical sophistication of these courses will vary widely and all levels
are intended to be acceptable. Some analysis of real data should be included. Most
of the topics listed below should be covered:

Probability. 3 pts.

Statistical Inference. 3 pts.

Linear Regression Models. 3 pts.

Time Series Analysis. 3 pts.

Survival Analysis. 3 pts.

Elementary Stochastic Processes. 3 pts.

Simulation. 3 pts.

Introduction to the Mathematics of Finance. 3 pts.

Statistical Inference and Time-Series Modelling. 3 pts.

Stochastic Methods in Finance. 3 pts.

Stochastic Differential Equations and Applications. 3 pts.

3 pts.

Data Mining. 3 pts.

Statistical Methods in Finance. 3 pts.

Nonparametric Statistics. 3 pts.

Stochastic Processes and Applications. 3 pts.

Some Books

Generalized Linear Models for Insurance Data, by Piet
de Jong and Gillian Z. Heller

Stochastic Claims Reserving Methods in Insurance (The
Wiley Finance Series), by Mario V. Wüthrich and
Michael Merz

Actuarial Modelling of Claim Counts: Risk Classification,
Credibility and Bonus-Malus Systems, by Michel
Denuit, Xavier Marechal, Sandra Pitrebois and
Jean-Francois Walhin

Loss Models: From Data to Decisions (Wiley Series in
Probability and Statistics) (Hardcover), by Stuart A.
Klugman, Harry H. Panjer and Gordon E. Willmot

Biostatistics or Biometry

Biostatistics or biometry is the application of statistics to a wide range of topics in
biology.

Public health, including epidemiology, nutrition and environmental health

Design and analysis of clinical trials in medicine

Genomics, population genetics, and statistical genetics in populations in order to link
variation in genotype with a variation in phenotype

Ecology

Biological sequence analysis

Data Mining

Data mining, also called Knowledge Discovery in Databases (KDD), is the process of
automatically searching large volumes of data for patterns.

The nontrivial extraction of implicit, previously unknown, and
potentially useful information from data

Data mining involves the process of analyzing data

Data Mining is a fairly recent and contemporary topic in
computing.

Data Mining applies many older computational techniques
from statistics, machine learning and pattern recognition.

Intelligence

[Figure: layered pyramid. Side label: increasing potential to support decisions. Roles: End User, Analyst, Data Analyst, DBA. Layers, top to bottom: Decision Making; Data Presentation (Visualization Techniques); Data Mining (Information Discovery); Data Exploration (Statistical Summary, Querying, and Reporting); Data Preprocessing/Integration, Data Warehouses; Data Sources (Paper, Files, Web documents, Scientific experiments, Database Systems).]

Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, and Other Disciplines.]

Data Mining: On What Kinds of Data?

Database-oriented data sets and applications

Relational database, data warehouse, transactional database

Data streams and sensor data

Time-series data, temporal data, sequence data (incl. bio-sequences)

Structure data, graphs, social networks and multi-

Object-relational databases

Heterogeneous databases and legacy databases

Spatial data and spatiotemporal data

Multimedia database

Text databases

The World-Wide Web

Top-10 Most Popular DM Algorithms: 18 Identified Candidates (I)

Classification

#1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning.
Morgan Kaufmann, 1993.

#2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone.
Classification and Regression Trees. Wadsworth, 1984.

#3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R.
1996. Discriminant Adaptive Nearest Neighbor Classification.
TPAMI. 18(6)

#4. Naive Bayes: Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So
Stupid After All? Internat. Statist. Rev. 69, 385-398.

Statistical Learning

#5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning
Theory. Springer-Verlag.

#6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture
Models. J. Wiley, New York.

Association Analysis

#7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast
Algorithms for Mining Association Rules. In VLDB '94.

#8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent
patterns without candidate generation. In SIGMOD '00.

The 18 Identified Candidates (II)

#9. PageRank: Brin, S. and Page, L. 1998. The
anatomy of a large-scale hypertextual Web search
engine. In WWW-7, 1998.

#10. HITS: Kleinberg, J. M. 1998. Authoritative
sources in a hyperlinked environment. SODA, 1998.

Clustering

#11. K-Means: MacQueen, J. B., Some methods for
classification and analysis of multivariate
observations, in Proc. 5th Berkeley Symp.
Mathematical Statistics and Probability, 1967.

#12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny,
M. 1996. BIRCH: an efficient data clustering method
for very large databases. In SIGMOD '96.

Bagging and Boosting

#13. AdaBoost: Freund, Y. and Schapire, R. E. 1997.
A decision-theoretic generalization of on-line learning
and an application to boosting. J. Comput. Syst. Sci.
55, 1 (Aug. 1997), 119-139.

The 18 Identified Candidates (III)

Sequential Patterns

#14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential
Patterns: Generalizations and Performance Improvements. In
Proceedings of the 5th International Conference on Extending
Database Technology, 1996.

#15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q.
Chen, U. Dayal and M-C. Hsu. PrefixSpan: Mining Sequential
Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE
'01.

Integrated Mining

#16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating
classification and association rule mining. KDD-98.

Rough Sets

#17. Finding reduct: Zdzislaw Pawlak, Rough Sets: Theoretical
Norwell, MA, 1992

Graph Mining

#18. gSpan: Yan, X. and Han, J. 2002. gSpan: Graph-Based
Substructure Pattern Mining. In ICDM '02.
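
To give a feel for one of the listed candidates, here is a minimal one-dimensional sketch of the k-means idea (#11): assign each point to its nearest center, move each center to the mean of its points, and repeat until nothing changes. The function name `k_means_1d`, the sample points, and the starting centers are assumptions for illustration; this is not MacQueen's original formulation, and real implementations handle many dimensions and smarter initialization.

```python
def k_means_1d(points, centers, max_iter=100):
    """Plain k-means on numbers: alternate assignment and center update."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    return centers, clusters

# Hypothetical data with two visible groups, and arbitrary starting centers.
points = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]
centers, clusters = k_means_1d(points, centers=[0.0, 5.0])
print(centers, clusters)
```

The result depends on the starting centers, which is one reason practical libraries run the procedure several times from different initial guesses.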

Major Issues in Data Mining

Mining methodology

Mining different kinds of knowledge from diverse data types, e.g., bio,
stream, Web

Performance: efficiency, effectiveness, and scalability

Pattern evaluation: the interestingness problem

Incorporation of background knowledge

Handling noise and incomplete data

Parallel, distributed and incremental mining methods

Integration of the discovered knowledge with existing one: knowledge
fusion

User interaction

Data mining query languages and ad-hoc mining

Expression and visualization of data mining results

Interactive mining of knowledge at multiple levels of abstraction

Applications and social impacts

Domain-specific data mining & invisible data mining

Protection of data security, integrity, and privacy

Challenge Problems in Data Mining

Developing a Unifying Theory of Data Mining

Scaling Up for High Dimensional Data and High Speed
Data Streams

Mining Sequence Data and Time Series Data

Mining Complex Knowledge from Complex Data

Data Mining in a Network Setting

Distributed Data Mining and Mining Multi-agent Data

Data Mining for Biological and Environmental Problems

Data-Mining-Process Related Problems

Security, Privacy and Data Integrity

Dealing with Non-static, Unbalanced and Cost-sensitive
Data

Recommended Reference Books

S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data.
Morgan Kaufmann, 2002

R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000

T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003

-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge
Discovery and Data Mining. AAAI/MIT Press, 1996

U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and
Knowledge Discovery, Morgan Kaufmann, 2001

J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed.,
2006

D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001

T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, Springer-Verlag, 2001

B. Liu, Web Data Mining, Springer 2006.

T. M. Mitchell, Machine Learning, McGraw Hill, 1997

G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press,
1991

P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005

S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998

I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with
Java Implementations, Morgan Kaufmann, 2nd ed. 2005

Economic statistics

Economic statistics is a branch of applied statistics
focusing on the collection, processing, compilation and
dissemination of statistics concerning the economy of a
region, a country or a group of countries.

Economic statistics is also regarded as a subtopic of
official statistics, since most economic statistics
are produced by official organizations (e.g. statistical
institutes, supranational organizations, central banks,
ministries, etc.).

Economic statistics provide the empirical data needed in
economic research (econometrics) and they are the
basis for decision and economic policy making.

Econometrics

Econometrics is concerned with the tasks of developing
and applying quantitative or statistical methods to the
study and elucidation of economic principles.
Econometrics combines economic theory with statistics
to analyze and test economic relationships.

Theoretical econometrics considers questions about the
statistical properties of estimators and tests, while
applied econometrics is concerned with the application
of econometric methods to assess economic theories.
Although the first known use of the term "econometrics"
was by Pawel Ciompa in 1910, Ragnar Frisch is given
credit for coining the term in the sense that it is used
today.

Method in Econometrics

Although many econometric methods represent applications of
standard statistical models, there are some special features of
economic data that distinguish econometrics from other branches of
statistics.

Economic data are generally observational, rather than being
derived from controlled experiments. Because the individual units in
an economy interact with each other, the observed data tend to
reflect complex economic equilibrium conditions rather than simple
behavioral relationships based on preferences or technology.
Consequently, the field of econometrics has developed methods for
identification and estimation of simultaneous equation models.
These methods allow researchers to make causal inferences in the
absence of controlled experiments.

Early work in econometrics focused on time-series data, but now
econometrics also fully covers cross-sectional and panel data.

Data in Econometrics

Data is broadly classified according to the number of dimensions.

A data set containing observations on a single phenomenon observed over
multiple time periods is called time series. In time series data, both the
values and the ordering of the data points have meaning.

A data set containing observations on multiple phenomena observed at a
single point in time is called cross-sectional. In cross-sectional data sets,
the values of the data points have meaning, but the ordering of the data
points does not.

A data set containing observations on multiple phenomena observed over
multiple time periods is called panel data. Alternatively, the second
dimension of data may be some entity other than time. For example, when
there is a sample of groups, such as siblings or families, and several
observations from every group, the data is panel data. Whereas time series
and cross-sectional data are both one-dimensional, panel data sets are
two-dimensional.

Data sets with more than two dimensions are typically called
multi-dimensional panel data.

Program

Research Area: Theoretical econometrics, including time series analysis,
nonparametric and semi-parametric estimation, panel data analysis, and
financial econometrics; applied econometrics, including applied labor
economics and empirical finance.

Courses

Probability and Statistics

Time Series Models

Micro Econometrics

Panel Data Econometrics

Financial Econometrics

Nonparametric and semi-parametric econometrics

Data Analysis in Academic Research (using SAS)

Statistics and Data Analysis for Economics

Nonlinear Models

Some Research at the U. of Chicago

Proposal: "Selective Publicity and Stock Prices" By: David Solomon

Proposal: "Activating Self-Control: Isolated vs. Interrelated Temptations" By: Kristian Myrseth

Proposal: "Buyer's Remorse: When Evaluation is Based on Simulation Before You Chose
but Deliberation After" By: Yan Zhang

Proposal: "Brokerage, Second-Hand Brokerage and Difficult Working Relationships: The
Role of the Informal Organization on Speaking Up about Difficult Relationships and
Being Deemed Uncooperative by Co-Workers" By: Jennifer Hitler

Proposal: "Resource Space Dynamics in the Evolution of Industries: Formation,
Expansion and Contraction of the Resource Space and its Effects on the Survival of
Organizations" By: Aleksios Gotsopoulos

Defense: "An Examination of Status Dynamics in the U.S. Venture Capital Industry" By: Young-Kyu Kim

Defense: "Group Dynamics and Contact: A Natural Experiment" By: Arjun Chakravarti

Defense: "Essays in Corporate Governance" By: Ashwini Agrawal

Some Research at the U. of Chicago

Defense: "Essays on Consumer Finance" By: Brian Melzer

Defense: "Male Incarceration and Teen Fertility" By: Amee Kamdar

Defense: "Essays on Economic Fundamentals in Asset Pricing" By: Jie (Jennie) Bai

Defense: "Asset-Intensity and the Cross-Section of Stock Returns" By: Raife Giovinazzo

Defense: "Essays on Household Behavior" By: Marlena Lee

Proposal: "How (Un)Accomplished Goal Actions Affect Goal Striving and Goal Setting" By: Minjung Koo

Defense: "Empirical Entry Games with Complementarities: An Application to the Shopping Center
Industry" By: Maria Ana Vitorino

Defense: "Betas, Characteristics, and the Cross-Section of Hedge Fund Returns" By: Mark Klebanov

Defense: "Expropriation Risk and Technology" By: Marcus Opp

Defense: "Essays in Corporate Finance and Real Estate" By: Itzhak Ben-David

Proposal: "Group Dynamics and Interpersonal Contact: A Natural Experiment" By: Arjun Chakravarti

Proposal: "Structural Estimation of a Moral Hazard Model: An Application to Industrial Selling" By: Renna Jiang

Proposal: "Status, Quality, and Earnings Announcements: An Analysis of the Effect of News of which-Quality
Correlation on the Stock of a Company" By: Daniela Lup

Defense: "Diversification and its Discontents: Idiosyncratic and Entrepreneurial Risk in the Quest for
Social Status" By: Nick Roussanov

Summary of Econometrics

It is a combination of mathematical economics, statistics,
economic statistics and economic theory.

Regression analysis is popular.

Time-series analysis and cross-sectional analysis are useful.

Panel analyses, which relate to multi-dimensional regression:

Fixed effect models: There are unique attributes of
individuals that are not the results of random variation
and that do not vary across time. This model is adequate
if we want to draw inferences only about the examined
individuals.

Random effect models: There are unique, time-constant
attributes of individuals that are the results of random
variation and do not correlate with the individual
regressors. This model is adequate if we want to draw
inferences about the whole population, not only the
examined sample.
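
As a rough illustration of the fixed effect idea, the sketch below computes the "within" estimator for a single regressor: each individual's observations are demeaned, which removes the time-constant individual attribute before the slope is estimated. The function name `within_estimator`, the dictionary layout and the toy numbers are assumptions made for this example; they are not taken from the slides.

```python
from statistics import mean

def within_estimator(panel):
    """Fixed-effects (within) slope for one regressor.

    `panel` maps each individual to a list of (x, y) observations over time.
    """
    num = 0.0
    den = 0.0
    for obs in panel.values():
        # Individual-specific means absorb the time-constant attribute.
        x_bar = mean(x for x, _ in obs)
        y_bar = mean(y for _, y in obs)
        for x, y in obs:
            num += (x - x_bar) * (y - y_bar)
            den += (x - x_bar) ** 2
    return num / den

# Hypothetical data: y = alpha_i + 2 * x with a different alpha_i per individual.
panel = {
    "A": [(1, 12), (2, 14), (3, 16)],  # alpha_A = 10
    "B": [(1, 3), (2, 5), (3, 7)],     # alpha_B = 1
}
slope = within_estimator(panel)
print(slope)
```

Because the individual intercepts are differenced out, the common slope is recovered even though the two individuals sit at very different levels.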

References

Arellano, Manuel, 2003. Panel Data Econometrics, Oxford
University Press.

Hsiao, Cheng, 2003. Analysis of Panel Data, Cambridge
University Press.

Davies, A. and Lahiri, K., 2000. "Re-examining the
Rational Expectations Hypothesis Using Panel Data on
Multi-Period Forecasts," Analysis of Panels and Limited
Dependent Variable Models, Cambridge University Press.

Davies, A. and Lahiri, K., 1995. "A New Framework for
Testing Rationality and Measuring Aggregate Shocks
Using Panel Data," Journal of Econometrics 68: 205-227.

Frees, E., 2004. Longitudinal and Panel Data,
Cambridge University Press.

Engineering Statistics

Design of experiments (DOE) uses statistical
techniques to test and construct models of
engineering components and systems.

Quality control and process control use statistics
as a tool to manage conformance to
specifications of manufacturing processes and
their products.

Time and methods engineering uses statistics to
study repetitive operations in manufacturing in
order to set standards and find optimum (in
some sense) manufacturing procedures.

Statistical Physics

Using methods of statistics in solving physical problems of a stochastic nature.

The term statistical physics encompasses probabilistic and statistical approaches to
classical mechanics and quantum mechanics. Hence it may also be called
statistical mechanics.

It works well in classical systems when the number of degrees of freedom
is so large that an exact solution is not possible, or not really useful.

Statistical mechanics can also describe work in non-linear dynamics,
chaos theory, thermal physics, fluid dynamics (particularly at low
Knudsen numbers), or plasma physics.

Demography

The study of human population dynamics. It
encompasses the study of the size, structure
and distribution of populations, and how
populations change over time due to births,
deaths, migration and ageing.

Methods include census returns and vital
statistics registers, or incorporate survey data
using indirect estimation techniques.

Psychological Statistics

The application of statistics to psychology.

Some of the more commonly used statistical
tests in psychology are:

Student's t-test, Chi-square, ANOVA,
ANCOVA, MANOVA, Regression analysis,
Correlation, Survival analysis, Clinical
trial, etc.

Social Statistics

Using statistical measurement systems to study
human behavior in a social environment.

Advanced statistical analyses have become popular
in the social sciences.

A new branch: quantitative social science at Harvard

Structural Equation Modeling and factor analysis

Multilevel models

Cluster analysis

Latent class models

Item response theory

Survey methodology and survey sampling

Chemometrics

Apply mathematical or statistical methods to chemical data.

Chemometrics is the science of relating measurements made
on a chemical system or process to the state of the system via
application of mathematical or statistical methods.

Chemometric research spans a wide area of different methods
which can be applied in chemistry. There are techniques for
collecting good data (optimization of experimental parameters,
design of experiments, calibration, signal processing) and for
getting information from these data (statistics, pattern
recognition, modeling, structure-property-relationship
estimations).

Chemometrics tries to build a bridge between the methods and
their application in chemistry.

Reliability Engineering

Reliability engineers perform a wide variety of
special management and engineering tasks to
ensure that sufficient attention is given to details
that will affect the reliability of a given system.

Reliability engineers rely heavily on statistics,
probability theory, and reliability theory. Many
engineering techniques are used in reliability
engineering, such as reliability prediction,
Weibull analysis, thermal management,
reliability testing and accelerated life testing.

Statistical Methods

A common goal for a statistical research
project is to investigate causality, and in
particular to draw a conclusion on the
effect of changes in the values of
predictors or independent variables on a
response or dependent variable.

Two major types of studies: Experimental
and observational studies

Well Known Techniques

Student's t-test: tests whether the means of two normally distributed
populations are equal

Chi-square: tests whether two distributions are the same

Analysis of variance (ANOVA): tests for differences in
means or effects

Mann-Whitney U: tests for a difference in medians between
two observed distributions

Regression analysis: models relationships between
random variables, determines the magnitude of the
relationships between variables, and can be used to
make predictions based on the models

Correlation: indicates the strength and direction
of a linear relationship between two random variables

Fisher's Least Significant Difference test: tests
differences of means in multiple comparisons

Pearson product-moment correlation coefficient:
a measure of how well a linear equation
describes the relation between two variables
X and Y measured on the same object or
organism

Spearman's rank correlation coefficient: a
non-parametric measure of correlation between two
variables
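
To make the difference between the two correlation coefficients concrete, here is a minimal sketch using only the Python standard library. The helper names (`pearson`, `ranks`, `spearman`) and the sample data are illustrative assumptions; Spearman's coefficient is computed here as Pearson's coefficient applied to average ranks.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotone in x, but not linear
r = pearson(x, y)
rho = spearman(x, y)
print(r, rho)
```

For this monotone but curved relationship the rank-based coefficient is exactly 1, while the Pearson coefficient is slightly below 1 because the points do not lie on a straight line.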

Simple Statistic Application

Compare two means

Compare two proportions

Compare two populations

Estimate mean or proportion

Find empirical distribution

Statistical Topics

Sampling Distribution

Sampling distribution is used to describe the distribution of
outcomes that one would observe from replication of a particular
sampling plan.

Know that to estimate means to esteem (to give value to).

Know that estimates computed from one sample will be different
from estimates that would be computed from another sample.

Understand that estimates are expected to differ from the
population characteristics (parameters) that we are trying to
estimate, but that the properties of sampling distributions allow us
to quantify, probabilistically, how they will differ.

Understand that different statistics have different sampling
distributions with distribution shape depending on (a) the specific
statistic, (b) the sample size, and (c) the parent distribution.

Understand the relationship between sample size and the
distribution of sample estimates.

Understand that the variability in a sampling distribution can be
reduced by increasing the

Research

Sequential sampling technique

Low response rate

Biased response

Outlier Removal

Outliers are a few observations that are not well fitted by the "best"
available model. When they occur, one must first investigate the
source of the data. If the accuracy or veracity of an observation is in
doubt, it should be removed and the model should be refitted. Robust
statistical techniques are needed to cope with any undetected
outliers; otherwise the result will be misleading.

Because of the potentially large variance, outliers could be the
outcome of sampling. It is perfectly correct to have such an
observation that legitimately belongs to the study group by definition,
for example in lognormally distributed data.

Be very careful and cautious: before declaring an observation "an
outlier," find out why and how such an observation occurred. It could
even be an error at the data entry stage.

First, construct the box plot of your data. Form the Q1, Q2, and Q3
points, which divide the sample into four equally sized groups (Q2
= median). Let IQR = Q3 - Q1. Outliers are defined as those points
outside the interval from Q1 - k*IQR to Q3 + k*IQR. In most cases one
sets k = 1.5 or 3.

An alternative definition treats as outliers the points outside the
interval from mean - k*s to mean + k*s, where s is the standard
deviation (k is 2, 2.5, or 3).
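
The two rules can be sketched as follows. This is only an illustration: the function names and the data are made up, and the quartiles come from `statistics.quantiles` with the "inclusive" method, which is one of several common quartile conventions (others give slightly different fences).

```python
from statistics import mean, quantiles, stdev

def iqr_outliers(data, k=1.5):
    """Points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in data if v < low or v > high]

def sd_outliers(data, k=2.0):
    """Points outside [mean - k*s, mean + k*s], s = sample standard deviation."""
    m, s = mean(data), stdev(data)
    return [v for v in data if abs(v - m) > k * s]

data = [2, 3, 4, 5, 6, 7, 8, 9, 10, 50]  # hypothetical sample with one extreme value
flagged_iqr = iqr_outliers(data)          # k = 1.5
flagged_sd2 = sd_outliers(data, k=2.0)
flagged_sd3 = sd_outliers(data, k=3.0)
print(flagged_iqr, flagged_sd2, flagged_sd3)
```

In this sample the extreme value inflates the standard deviation enough that the k = 3 version of the mean-based rule no longer flags it, which is one reason the quartile-based rule is often preferred for screening.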

Central Limit Theorem

The average of a sample of observations drawn from a
population with a distribution of any shape is approximately
normally distributed if certain conditions are met.

It is well known that whatever the parent population is, the
standardized variable will have a distribution with a mean 0
and standard deviation 1 under random sampling with a large
sample size.
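
A small simulation can make this tangible. The sketch below draws many samples of size 12 from a uniform distribution, which is far from normal in shape, and checks that the sample means cluster around 0.5 with a spread close to the theoretical value. The sample size, number of replications and seed are arbitrary choices for this illustration.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is reproducible

def sample_means(n, reps=5000):
    """Means of `reps` samples of size n drawn from Uniform(0, 1)."""
    return [mean(random.random() for _ in range(n)) for _ in range(reps)]

means = sample_means(n=12)
center = mean(means)   # theory: 0.5
spread = stdev(means)  # theory: sqrt(1/12) / sqrt(12) = 1/12

# Share of sample means within two theoretical standard errors of 0.5;
# for a normal distribution this share is roughly 95%.
within_2se = sum(abs(m - 0.5) <= 2 / 12 for m in means) / len(means)
print(center, spread, within_2se)
```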

The sample size needed for the approximation to be adequate
depends strongly on the shape of the parent distribution.
Symmetry is particularly important. For a symmetric and short
tail parent distribution, even if very different from the shape of
a normal distribution, an adequate approximation can be
obtained with small samples (e.g., 10 or 12 for the uniform
distribution). In some extreme cases (e.g. binomial with )
sample sizes far exceeding the typical guidelines (say, 30) are

P-values

The P-value, which directly depends on a given sample, attempts to provide a
measure of the strength of the results of a test, in contrast to a simple reject or do not
reject. If the null hypothesis is true and the chance of random variation is the only
reason for sample differences, then the P-value is a quantitative measure to feed into
the decision making process as evidence. The following table provides a reasonable
interpretation of P-values:

P < 0.01: very strong evidence against H0
0.01 ≤ P < 0.05: moderate evidence against H0
0.05 ≤ P < 0.10: suggestive evidence against H0
0.10 ≤ P: little or no real evidence against H0

This interpretation is widely accepted, and many scientific journals routinely publish
papers using this interpretation for the result of a test of hypothesis.

For a fixed sample size, when the number of realizations is decided in advance, the
distribution of p is uniform (assuming the null hypothesis). We would express this as
P(p ≤ x) = x. That means the criterion of p < 0.05 achieves a significance level (α) of 0.05.
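
The uniformity of p under the null hypothesis can be checked by simulation. The sketch below is an illustration under stated assumptions: it uses a two-sided z-test with known standard deviation (a simple test chosen for this example, not one named in the slides), draws every sample from N(0, 1) so that the null hypothesis is true, and counts how often p falls below 0.05.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(1)  # fixed seed so the sketch is reproducible
std_normal = NormalDist()

def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value for H0: mean == mu0, with known sigma."""
    z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
    return 2 * (1 - std_normal.cdf(abs(z)))

# Every sample comes from N(0, 1), so the null hypothesis holds.
p_values = [
    z_test_p_value([random.gauss(0, 1) for _ in range(20)])
    for _ in range(4000)
]
rejection_rate = sum(p < 0.05 for p in p_values) / len(p_values)
below_half = sum(p < 0.5 for p in p_values) / len(p_values)
print(rejection_rate, below_half)
```

The rejection rate should land near 0.05 and about half of the p-values should fall below 0.5, as expected when P(p ≤ x) = x.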

When a p-value is associated with a set of data, it is a measure of the probability that
the data could have arisen as a random sample from some population described by
the statistical (testing) model.

A p-value is a measure of how much evidence you have against the null hypothesis.
The smaller the p-value, the more evidence you have. One may combine the p-value
with the significance level to make a decision on a given test of hypothesis. In such a
case, if the p-value is less than some threshold (usually 0.05, sometimes a bit larger
like 0.1 or a bit smaller like 0.01) then you reject the null hypothesis.

Accuracy, Precision, Robustness,
and Data Quality

Accuracy is the degree of conformity of a measured/calculated quantity to its actual
(true) value.

Precision is the degree to which further measurements or calculations will show the
same or similar results.

Robustness is the resilience of the system, especially when under stress or when
confronted with invalid input.

Data are of high quality "if they are fit for their intended uses in operations,
decision making and planning."

An "accurate" estimate has small bias. A "precise" estimate has small variance.

The robustness of a procedure is the extent to which its properties do not depend on
those assumptions which you do not wish to make.

Distinguish between bias robustness and efficiency robustness.

Example: The sample mean is seen as a robust estimator because the CLT
guarantees a 0 bias for large samples regardless of the underlying distribution. This
estimator is bias robust, but it is clearly not efficiency robust, as its variance can
increase endlessly. That variance can even be infinite if the underlying distribution is
Cauchy or Pareto with a large scale parameter.

Bias Reduction Techniques

The most effective tools for bias reduction are the bootstrap and the
jackknife. The bootstrap uses resampling from a given
set of data to mimic the variability that produced the data in the first place,
has a rather more dependable theoretical basis, and can be a highly
effective procedure for estimation of error quantities in statistical problems.

The bootstrap creates a virtual population by duplicating the same sample
over and over, and then re-samples from the virtual population to form a
reference set. Then you compare your original sample with the reference
set to get the exact p-value. Very often, a certain structure is "assumed" so
that a residual is computed for each case. What is then re-sampled is
the set of residuals, which are then added to those assumed structures
before some statistic is evaluated. The purpose is often to estimate a P-level.

The jackknife re-computes the statistic by leaving one observation out each time.
Jackknifing does a bit of logical folding to provide estimators of coefficients
and error that will have reduced bias.
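
Both ideas can be sketched in a few lines. The jackknife function below applies the usual leave-one-out bias correction, n*theta - (n-1)*(average of the leave-one-out estimates); the bootstrap function resamples with replacement to estimate a standard error. Function names, data, replication count and seed are assumptions made for this illustration.

```python
import random
from statistics import mean, pvariance, stdev, variance

def jackknife_bias_corrected(data, estimator):
    """Leave-one-out (jackknife) bias-corrected version of `estimator`."""
    n = len(data)
    theta_hat = estimator(data)
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    return n * theta_hat - (n - 1) * mean(leave_one_out)

def bootstrap_se(data, estimator, reps=2000, seed=0):
    """Bootstrap standard error: resample with replacement, re-estimate."""
    rng = random.Random(seed)
    stats = [estimator(rng.choices(data, k=len(data))) for _ in range(reps)]
    return stdev(stats)

data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]  # hypothetical sample

# The divide-by-n variance is biased; its jackknife correction equals
# the usual divide-by-(n-1) sample variance.
corrected = jackknife_bias_corrected(data, pvariance)
se_mean = bootstrap_se(data, mean)
print(corrected, variance(data), se_mean)
```

The variance example is convenient because the result can be checked exactly: the jackknife-corrected plug-in variance coincides with the unbiased sample variance.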

Bias reduction techniques have wide applications in anthropology,
chemistry, climatology, clinical trials, cybernetics, and ecology, etc.

Effect Size

Effect size (ES) permits the effects of different
treatments to be compared, even when based on different samples
and different measuring instruments. The ES is the standardized mean
difference between the control group and the treatment group.

Glass's method: Suppose an experimental treatment group has a
mean score of Xe and a control group has a mean score of Xc and a
standard deviation of Sc; then the effect size is equal to (Xe - Xc)/Sc.

Hunter and Schmidt (1990) suggested using a pooled within-group
standard deviation because it has less sampling error than the
control group standard deviation under the condition of equal
sample size. In addition, Hunter and Schmidt corrected the effect
size for measurement error by dividing the effect size by the square
root of the reliability coefficient of the dependent variable.

Cohen's ES: (mean1 - mean2)/pooled SD
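
The two effect size formulas can be written out as below. This is a sketch with made-up scores; the function names are illustrative, and the pooled standard deviation uses the common (n - 1)-weighted form, which is one of several pooling conventions.

```python
from math import sqrt
from statistics import mean, stdev, variance

def glass_delta(treatment, control):
    """Glass's method: (Xe - Xc) / Sc, using the control group's SD."""
    return (mean(treatment) - mean(control)) / stdev(control)

def cohen_d(group1, group2):
    """Cohen's ES: (mean1 - mean2) / pooled SD."""
    n1, n2 = len(group1), len(group2)
    pooled_var = (
        (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    ) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)

treatment = [6, 7, 8, 9, 10]  # hypothetical treatment scores
control = [4, 5, 6, 7, 8]     # hypothetical control scores
delta = glass_delta(treatment, control)
d = cohen_d(treatment, control)
print(delta, d)
```

Here both groups have the same spread, so the two measures coincide; they differ when the treatment changes the variability as well as the mean.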

Nonparametric Technique

Parametric techniques are more useful the more one knows about the subject matter, since
knowledge about the data can be built into parametric models. Nonparametric methods, in
both senses of the term (distribution-free tests and flexible functional forms), are more useful when
less is known about the subject matter. A statistical technique is called nonparametric if
it satisfies at least one of the following criteria:

1. The data entering the analysis are enumerative, that is, count data representing the number of
observations in each category or cross-category.

2. The data are measured and/or analyzed using a nominal or ordinal scale of measurement.

3. The inference does not concern a parameter in the population distribution.

4. The probability distribution of the statistic upon which the analysis is based is very general, such as
continuous, discrete, or symmetric.

The statistics are:

Mann-Whitney Rank Test as a nonparametric alternative to Student's t-test when one does not
have normally distributed data.

Mann-Whitney: To be used with two independent groups (analogous to the independent groups t-test)

Wilcoxon: To be used with two related (i.e., matched or repeated) groups (analogous to the
related samples t-test)

Kruskal-Wallis: To be used with two or more independent groups (analogous to the single-factor
between-subjects ANOVA)

Friedman: To be used with two or more related groups (analogous to the single-factor
within-subjects ANOVA)
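
As an illustration of how a rank-based statistic works, the following sketch computes the Mann-Whitney U statistic by direct pair counting. Only the statistic is computed; converting it to a p-value (from tables or a normal approximation) is left out. The function name and the data are assumptions for this example.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x: number of (x, y) pairs in which the
    x value is larger, with ties counted as one half."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

x = [1.1, 2.3, 2.9]       # hypothetical group 1
y = [3.5, 4.0, 5.2, 2.5]  # hypothetical group 2
u_x = mann_whitney_u(x, y)
u_y = mann_whitney_u(y, x)
print(u_x, u_y)
```

The two U values always add up to the number of pairs, len(x) * len(y); a value of U close to 0 or to that maximum indicates that one group tends to have larger values than the other. Only the ordering of the observations matters, not their distribution.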

Least Squares Models

Many problems in analyzing data involve describing how variables
are related. The simplest of all models describing the relationship
between two variables is a linear, or straight-line, model. The
conventional method is that of least squares, which finds the line
minimizing the sum of squared vertical distances between the
observed points and the fitted line.

There is a simple connection between the numerical coefficients in
the regression equation and the slope and intercept of the regression
line.

A summary statistic such as the correlation coefficient does not tell
the whole story. A scatter plot is an essential complement to
examining the relationship between the two variables.

Model checking is an essential part of the process of statistical
modeling. After all, conclusions based on models that do not
properly describe an observed set of data will be invalid.

The impact of violating regression model assumptions (i.e.,
conditions), and possible solutions, can be assessed by analyzing the residuals.
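
A straight-line least squares fit and its residuals can be sketched as follows, using the closed-form expressions for the slope and intercept. The function name `fit_line` and the data points are illustrative assumptions.

```python
from statistics import mean

def fit_line(x, y):
    """Least squares slope and intercept for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical measurements
slope, intercept = fit_line(x, y)

# Residuals are the starting point for model checking.
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
print(slope, intercept, residuals)
```

When an intercept is included, the residuals sum to zero by construction, so it is their pattern against x or against the fitted values, rather than their total, that reveals assumption violations.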

Least Median of Squares Models

The standard least squares techniques for
estimation in linear models are not robust in the
sense that outliers or contaminated data can
strongly influence estimates.

A robust technique which protects against
contamination is least median of squares (LMS)
or least absolute deviation (LAD).

LMS estimation can be extended to generalized
linear models, giving rise to the least median of
deviance (LMD) estimator.

Multivariate Data Analysis

Multivariate analysis is a branch of statistics involving the consideration of objects on
each of which are observed the values of a number of variables. Multivariate
techniques are used across the whole range of fields of statistical application. The
techniques are:

Principal components analysis

Factor analysis

Cluster analysis

Discriminant analysis

Principal component analysis is used for exploring data to reduce the dimension.
Generally, PCA seeks to represent n correlated random variables by a reduced set of
uncorrelated variables, which are obtained by transformation of the original set onto
an appropriate subspace.

Two closely related techniques, principal component analysis and factor analysis, are
used to reduce the dimensionality of multivariate data. In these techniques
correlations and interactions among the variables are summarized in terms of a small
number of underlying factors. The methods rapidly identify key variables or groups of
variables that control the system under study.

Cluster analysis is an exploratory data analysis tool which aims at sorting different
objects into groups in a way that the degree of association between two objects is
maximal if they belong to the same group and minimal otherwise.

Discriminant function analysis is used to classify cases into the values of a categorical
dependent variable, usually a dichotomy.

Regression Analysis

Models the relationship between one or more response variables (Y) and the
predictors (X1,...,Xp). If there is more than one response variable, we speak of
multivariate regression.

Types of regression

Simple and multiple linear regression

Simple linear regression and multiple linear regression are related statistical methods for modeling
the relationship between two or more random variables using a linear equation. Linear regression
assumes the best estimate of the response is a linear function of some parameters (though not
necessarily linear in the predictors).

Nonlinear regression models

If the relationship between the variables being analyzed is not linear in the parameters, a number of
nonlinear regression techniques may be used to obtain a more accurate regression.

Other models

Although these three types are the most common, there also exist Poisson regression,
supervised learning, and unit-weighted regression.

Linear models

Predictor variables may be defined quantitatively or qualitatively (categorical). Categorical
predictors are sometimes called factors. Although the method of estimating the model is the
same for each case, different situations are sometimes known by different names for historical
reasons:

If the predictors are all quantitative, we speak of multiple regression.

If the predictors are all qualitative, one performs analysis of variance.

If some predictors are quantitative and some qualitative, one performs an
analysis of covariance.

General Linear Regression

The general linear model (GLM) is a statistical linear model. It may be written as

$Y = XB + U$

where Y is a matrix with series of multivariate measurements, X is a matrix that
might be a design matrix, B is a matrix containing parameters that are usually to be
estimated, and U is a matrix containing residuals (i.e., errors or noise). The residual is
usually assumed to follow a multivariate normal distribution, or another distribution
such as one in the exponential family.

The general linear model incorporates a number of different statistical models:
ANOVA, ANCOVA, MANOVA, MANCOVA, ordinary linear regression, t-test and
F-test. If there is only one column in Y (i.e., one dependent variable) then the model
can also be referred to as the multiple regression model (multiple linear regression).

For example, if the response variable can take only binary values (for example, a
Boolean or Yes/No variable), logistic regression is preferred. The outcome of this
type of regression is a function which describes how the probability of a given event
(e.g. probability of getting "yes") varies with the predictors.

Hypothesis tests with the general linear model can be made in two ways:
multivariate and mass-univariate.

Semiparametric and Non-parametric modeling

The Generalized Linear Model (GLM)

$Y = G(X_1 b_1 + \dots + X_p b_p) + e$

where G is called the link function. All these models lead to the
problem of estimating a multivariate regression. Parametric
regression estimation has the disadvantage, that by the parametric
"form" certain properties of the resulting estimate are already implied.

Nonparametric techniques allow diagnostics of the data without this
restriction, and the model structure is not specified a priori. However,
this requires large sample sizes and causes problems in graphical
visualization.

Semiparametric methods are a compromise between both: they
support a nonparametric modeling of certain features and profit from
the simplicity of parametric methods. Example: Cox Proportional
Hazard Model.

Survival analysis

It deals with "death" in biological organisms and failure in mechanical systems. Death or failure is
called an "event" in the survival analysis literature, and so models of death or failure are
generically termed time-to-event models.

Survival data arise in a literal form from trials concerning life-threatening conditions, but the
methodology can also be applied to other waiting times such as the duration of pain relief.

Censoring: Nearly every sample contains some cases that do not experience an event. If the
dependent variable is the time of the event, what do you do with these "censored" cases?

Survival analysis attempts to answer questions such as: what is the fraction of a population
which will survive past a certain time? Of those that survive, at what rate will they die or fail? Can
multiple causes of death or failure be taken into account? How do particular circumstances or
characteristics increase or decrease the odds of survival?

Time-dependent covariate: Many explanatory variables (like income or blood pressure) change
in value over time. How do you put such variables in a regression analysis?

Survival analysis is a group of statistical methods for analysis and interpretation of survival data.
Survival and hazard functions, together with methods for estimating parameters and testing
hypotheses, are the main part of analyses of survival data.

Main topics relevant to survival data analysis are: Survival and hazard functions, Types of
censoring, Estimation of survival and hazard functions: the Kaplan-Meier and life table estimators,
Simple life tables, Comparison of survival functions: the logrank and Mantel-Haenszel tests,
Wilcoxon test; The proportional hazards model: time independent and time dependent covariates,
Recurrent model, and Methods for determining sample sizes.
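
To show how censored cases enter the estimate, here is a minimal sketch of the Kaplan-Meier estimator: at each observed event time the survival probability is multiplied by (1 - deaths / number at risk), and censored cases leave the risk set without counting as events. The function name, the 0/1 event coding and the data are assumptions made for this example.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with an observed event.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Everyone whose time is at least t is still at risk just before t.
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
    return curve

times = [1, 2, 2, 3, 4]   # hypothetical follow-up times
events = [1, 0, 1, 1, 0]  # 0 marks a censored case
curve = kaplan_meier(times, events)
print(curve)
```

In this toy data set the case censored at time 2 still counts in the risk set at time 2 but not afterwards, which is how partial information from censored cases is used rather than discarded.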

Repeated Measures and
Longitudinal Data

Repeated measures and longitudinal data require special attention because they
involve correlated data that commonly arise when the primary sampling units are
measured repeatedly over time or under different conditions.

The experimental units are often subjects. Interest usually lies in between-subject
and within-subject effects. Between-subject effects are those whose values change
only from subject to subject and remain the same for all observations on a single
subject, for example, treatment and gender. Within-subject effects are those whose
values may differ from measurement to measurement.

Since measurements on the same experimental unit are likely to be correlated,
repeated measurements analysis must account for that correlation.

Normal theory models for split-plot experiments and repeated measures ANOVA can
be used to introduce the concept of correlated data.

PROC GLM, PROC GENMOD and PROC MIXED in the SAS system may be used.
Mixed linear models provide a general framework for modeling covariance structures, a
critical first step that influences parameter estimation and tests of hypotheses. The
primary objectives are to investigate trends over time and how they relate to treatment
groups or other covariates.

Techniques applicable to non-normal data, such as McNemar's test for binary data,
weighted least squares for categorical data, and generalized estimating equations
(GEE) are the main topics. The GEE method can be used to accommodate correlation
when the means at each time point are modeled using a generalized linear model.

Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models.
Biometrika 1986;73:13-22.

Information Theory

Information theory is a branch of probability and mathematical statistics that deals with communication
systems, data transmission, cryptography, signal-to-noise ratios, data compression, etc. Claude Shannon is
the father of information theory. His theory considered the transmission of information as a statistical
phenomenon and gave communications engineers a way to determine the capacity of a communication
channel in terms of the common currency of bits. Shannon defined a measure of entropy as:

$$H = -\sum_i p_i \log p_i,$$

that, when applied to an information source, could determine the capacity of the channel required
to transmit the source as encoded binary digits. The entropy is a measure of the amount of
uncertainty one has about which message will be chosen. It is defined as the average
self-information of a message i from that message space.
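
As a minimal sketch of the entropy formula above, the following Python function computes H in bits (base-2 logarithm). The function name and the example distributions are illustrative assumptions, not part of the original slides.

```python
import math

def entropy_bits(probs):
    """Shannon entropy H = -sum(p_i * log2(p_i)), measured in bits.

    Terms with p_i == 0 are skipped because they contribute nothing.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical message sources used only for illustration.
fair_coin = entropy_bits([0.5, 0.5])       # maximal uncertainty for two messages
biased_coin = entropy_bits([0.9, 0.1])     # less uncertainty than the fair coin
four_messages = entropy_bits([0.25] * 4)   # four equally likely messages

print(fair_coin, biased_coin, four_messages)
```

A source that almost always sends the same message has entropy close to zero, while equally likely messages give the largest entropy for a given number of messages.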

Entropy as defined by Shannon is closely related to entropy as defined by physicists in statistical
thermodynamics. This work was the inspiration for adopting the term entropy in information theory. Other
useful measures of information include mutual information which is a measure of the correlation between
two event sets. Mutual information is defined for two events X and Y as:

$$M(X, Y) = H(X) + H(Y) - H(X, Y)$$

where H(X, Y) is the joint entropy defined as:

$$H(X, Y) = -\sum_{i,j} p(x_i, y_j) \log p(x_i, y_j).$$

Mutual information is closely related to the log-likelihood ratio test for multinomial distribution, and to
Pearson's Chi-square test. The field of Information Science has since expanded to cover the full range of
techniques and abstract descriptions for the storage, retrieval and transmittal of information.
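
The link to the log-likelihood ratio test can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the standard form of mutual information, H(X) + H(Y) - H(X, Y), natural logarithms, and a hypothetical 2x2 table of counts. Under those assumptions the likelihood ratio statistic G for independence equals 2N times the mutual information of the observed proportions.

```python
import math

def entropy_nats(probs):
    """Shannon entropy with natural logarithms (nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mutual_information(counts):
    """Mutual information of a two-way table of counts: H(X) + H(Y) - H(X, Y)."""
    n = sum(sum(row) for row in counts)
    joint = [c / n for row in counts for c in row]
    p_rows = [sum(row) / n for row in counts]
    p_cols = [sum(col) / n for col in zip(*counts)]
    return entropy_nats(p_rows) + entropy_nats(p_cols) - entropy_nats(joint)

def g_statistic(counts):
    """Log-likelihood ratio statistic G = 2 * sum(O * ln(O / E)) for independence."""
    n = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    g = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            if observed > 0:
                expected = row_totals[i] * col_totals[j] / n
                g += observed * math.log(observed / expected)
    return 2.0 * g

table = [[30, 10], [15, 45]]          # hypothetical counts
n_total = sum(sum(row) for row in table)
mi = mutual_information(table)
g = g_statistic(table)
print(mi, g, 2 * n_total * mi)        # the last two values agree
```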

Applications: Coding theory, making and breaking cryptographic systems, intelligence work, Bayesian
analysis, gambling, investing, etc.

Incomplete Data

Methods dealing with analysis of data with missing values can be
classified into:

- Analysis of complete cases, including weighting,
- Imputation methods, and extensions to multiple imputation, and
- Methods that analyze the incomplete data directly without requiring
  a rectangular data set, such as maximum likelihood and Bayesian
  methods.

Multiple imputation (MI) is a general paradigm for the analysis of
incomplete data. Each missing datum is replaced by m > 1 simulated
values, producing m simulated versions of the complete data. Each
version is analyzed by standard complete-data methods, and the
results are combined using simple rules to produce inferential
statements that incorporate missing data uncertainty. The focus is
on the practice of MI for real statistical problems in modern
computing environments.
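
The slides do not name the combining rules. A common choice is Rubin's rules, and the sketch below assumes that is what is meant. The pooled estimate is the mean of the m estimates, and the total variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The function name and the numbers are hypothetical.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine m complete-data results with Rubin's rules (assumed here).

    estimates: point estimate from each imputed data set
    variances: squared standard error from each imputed data set
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = mean(estimates)          # average of the m estimates
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # sample variance across imputations (m - 1 divisor)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical results from m = 3 imputed data sets.
pooled_estimate, total_variance = pool_rubin([10.0, 12.0, 11.0], [4.0, 4.0, 4.0])
print(pooled_estimate, total_variance)
```

The between-imputation term is what carries the missing data uncertainty into the final standard error.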

Interactions

ANOVA programs generally produce all possible interactions, while regression
programs generally do not produce any interactions. So it is up to the user to construct
interaction terms by multiplying the predictors together.

If the standard errors are high, multicollinearity may be the cause. But it is not the
only factor that can cause large SEs for estimators of "slope" coefficients in regression
models. SEs are inversely proportional to the range of variability in the predictor
variable. To increase the precision of estimators, we should increase the range of the
input.
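
For the special case of simple linear regression with one predictor x and error standard deviation \sigma (an assumption made here for illustration), the standard result makes the dependence on the spread of the predictor explicit:

$$\mathrm{SE}(\hat{\beta}_1) = \frac{\sigma}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

Doubling the spread of the x values around their mean halves the standard error of the slope, with \sigma and n held fixed.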

Another cause of large SEs is a small number of "event" observations or a small
number of "non-event" observations.

Yet another cause of high standard errors is serial correlation, which arises when
using time-series data.

When X and W are category systems, the interaction describes a two-way analysis
of variance (ANOVA) model; when X and W are (quasi-)continuous variables, it
describes a multiple linear regression (MLR) model.

In ANOVA contexts, the existence of an interaction can be described as a difference
between differences.

In MLR contexts, an interaction implies a change in the slope (of the regression of Y
on X) from one value of W to another value of W.
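
Both descriptions can be sketched in a few lines of Python. The model form Y = b0 + b1*X + b2*W + b3*X*W, the coefficient values, and the cell means below are assumptions chosen for illustration; they do not come from the slides.

```python
def predict(x, w, b0=1.0, b1=2.0, b2=0.5, b3=1.5):
    """Assumed interaction model: Y = b0 + b1*X + b2*W + b3*X*W."""
    return b0 + b1 * x + b2 * w + b3 * x * w

def slope_on_x(w):
    """Change in predicted Y per unit change in X, at a fixed value of W."""
    return predict(1.0, w) - predict(0.0, w)

def difference_between_differences(cell_means):
    """For a 2x2 table of cell means, the interaction contrast:
    (effect of X at the second level of W) - (effect of X at the first level)."""
    (a, b), (c, d) = cell_means
    return (d - c) - (b - a)

slope_w0 = slope_on_x(0.0)   # equals b1
slope_w2 = slope_on_x(2.0)   # equals b1 + 2 * b3
contrast = difference_between_differences([[10.0, 14.0], [11.0, 20.0]])
print(slope_w0, slope_w2, contrast)
```

With b3 = 0 the slope on X would be the same at every W, and with parallel cell means the contrast would be zero; in both cases there is no interaction.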

Sufficient Statistic

An estimator based on a sufficient statistic uses all the information
about the parameter which is present in the raw data. For example, for
many common models the sum of your data is sufficient to estimate the
mean of the population. You do not have to know the data set itself.
This saves a lot ... Simply, send out the total, and the sample size.

A sufficient statistic t for a parameter θ is a function of the sample
data x1,...,xn, which contains all information in the sample about the
parameter θ. More formally, sufficiency is defined in terms of the
likelihood function for θ. For a sufficient statistic t, the likelihood
L(x1,...,xn | θ) can be written as g(t | θ) * k(x1,...,xn). Since the
second term does not depend on θ, t is said to be a sufficient
statistic for θ.

To illustrate, let the observations be independent Bernoulli trials with
the same probability of success. Suppose that there are n trials, and
that person A observes which observations are successes, and
person B only finds out the number of successes. Since the successes
occur at random points without replication, B and A will see the
same thing as far as the probability of success is concerned.
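
A small numerical check of this illustration, written as a sketch with hypothetical data: two sequences with the same number of successes have the same likelihood for every value of the success probability p, so the count alone carries all the information about p.

```python
import math

def bernoulli_likelihood(observations, p):
    """Likelihood of a 0/1 sequence of independent trials with success probability p."""
    likelihood = 1.0
    for x in observations:
        likelihood *= p if x == 1 else (1.0 - p)
    return likelihood

def likelihood_from_count(n, successes, p):
    """The same likelihood, computed only from n and the number of successes."""
    return p ** successes * (1.0 - p) ** (n - successes)

# Person A sees a full sequence; person B only learns n = 5 and 3 successes.
sequence_a = [1, 0, 1, 1, 0]
other_sequence = [0, 1, 1, 0, 1]   # different order, same number of successes

for p in (0.2, 0.5, 0.7):
    full = bernoulli_likelihood(sequence_a, p)
    reordered = bernoulli_likelihood(other_sequence, p)
    from_count = likelihood_from_count(5, 3, p)
    print(p, full, reordered, from_count)
```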

Tests

Significance tests are based on assumptions: The data have to be
random, out of a well defined basic population and one has to
assume that some variables follow a certain distribution. Power of a
test is the probability of correctly rejecting a false null hypothesis. It
is one minus the probability of making a Type II error. A Type I
error is rejecting a true null hypothesis; a Type II error is failing to
reject a false null hypothesis. For a fixed sample size, decreasing the
probability of making a Type I error will increase the probability of
making a Type II error.

Power and the True Difference between Population Means:
The distance between the two population means will affect the power of
our test.

Power as a Function of Sample Size and Variance:
Sample size has an indirect effect on power because it affects the
measure of variance we use in the test. When n is large we will have a
lower standard error than when n is small.
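
The effects of the true difference and of the sample size can be sketched for one simple case. The example assumes a one-sided two-sample z-test with a known common standard deviation sigma and n observations per group; these choices, the function name, and the numbers are illustrative assumptions rather than the test used in the slides.

```python
import math
from statistics import NormalDist

def z_test_power(delta, sigma, n_per_group, alpha=0.05):
    """Power of a one-sided two-sample z-test (assumed setting).

    delta: true difference between the two population means
    sigma: known common standard deviation
    n_per_group: sample size in each group
    """
    standard_normal = NormalDist()
    z_critical = standard_normal.inv_cdf(1.0 - alpha)
    standard_error = sigma * math.sqrt(2.0 / n_per_group)
    return 1.0 - standard_normal.cdf(z_critical - delta / standard_error)

power_small_n = z_test_power(delta=0.5, sigma=1.0, n_per_group=20)
power_large_n = z_test_power(delta=0.5, sigma=1.0, n_per_group=80)
power_big_gap = z_test_power(delta=1.0, sigma=1.0, n_per_group=20)
power_no_gap = z_test_power(delta=0.0, sigma=1.0, n_per_group=20)

print(power_small_n, power_large_n, power_big_gap, power_no_gap)
```

A larger n lowers the standard error and raises power; a larger true difference raises power at the same n; with no true difference the rejection probability equals alpha.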

Pilot Studies:
When the estimates needed for sample size calculation are not
available from an existing database, a pilot study is needed for
adequate estimation with a given precision.

ANOVA: Analysis of Variance

ANOVA tests the difference between two or more means. It does
this by examining the ratio of the variability between conditions
to the variability within each condition.

Say we give a drug that we believe will improve memory to a
group of people and give a placebo to another group of people.
We might measure memory performance by the number of
words recalled from a list we ask everyone to memorize. An
ANOVA test would compare the variability that we observe
between the two conditions to the variability observed within
each condition. Recall that we measure variability as the sum
of the squared differences of each score from the mean.

When the variability that we predict (between the two groups)
is much greater than the variability we don't predict (within
each group) then we will conclude that our treatments produce
different results.
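
The comparison can be sketched as a one-way ANOVA F ratio. The recall scores below are hypothetical numbers invented for the drug and placebo example, and the function name is an assumption.

```python
def one_way_anova_f(groups):
    """F ratio: mean square between groups divided by mean square within groups."""
    all_scores = [score for group in groups for score in group]
    grand_mean = sum(all_scores) / len(all_scores)
    group_means = [sum(group) / len(group) for group in groups]

    # Variability we predict: group means around the grand mean.
    ss_between = sum(len(group) * (m - grand_mean) ** 2
                     for group, m in zip(groups, group_means))
    # Variability we do not predict: scores around their own group mean.
    ss_within = sum((score - m) ** 2
                    for group, m in zip(groups, group_means)
                    for score in group)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical words recalled by each participant.
drug_scores = [12, 14, 13, 15, 16]
placebo_scores = [10, 11, 9, 12, 13]

f_ratio = one_way_anova_f([drug_scores, placebo_scores])
print(f_ratio)
```

A large F ratio means the between-group variability is large relative to the within-group variability; the ratio is then compared with an F distribution with (df_between, df_within) degrees of freedom.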

Data Mining and Knowledge Discovery

Data mining uses sophisticated statistical analysis and modeling techniques to uncover
patterns and relationships hidden in organizational databases.

The aim is to provide tools and techniques for processing structured information, from
databases to data warehouses to data mining, and on to knowledge discovery. The data
warehouse is critical.

It can extract even more value out of these huge repositories of information. The
continuing rapid growth of on-line data and the widespread use of databases
necessitate the development of techniques for extracting useful knowledge and for
facilitating database access.

The challenge of extracting knowledge from data is of common interest to several
fields, including statistics, databases, pattern recognition, machine learning, data
visualization, optimization, and high-performance computing.

The data mining process involves identifying an appropriate data set to "mine" or sift
through to discover data content relationships. Data mining tools include techniques
like case-based reasoning, cluster analysis, data visualization, fuzzy query and
analysis, and neural networks. Data mining sometimes resembles the traditional
scientific method of identifying a hypothesis and then testing it using an appropriate
data set.

It is reminiscent of what happens when data has been collected and no significant
results were found and hence an ad hoc, exploratory analysis is conducted to find a
significant relationship.

Data mining is the process of extracting knowledge from data. For clever marketers, that