Statistics And Application

sharpfartsAI and Robotics

Nov 8, 2013 (4 years and 8 months ago)



Revealing Facts From Data

What Is Statistics

Statistics is a mathematical science pertaining to the collection, analysis, interpretation, and presentation of data.

It is applicable to a wide variety of academic disciplines, from the physical and social sciences to the humanities, as well as to business, government, and industry.


Statistics Is …

Almost every professional needs a statistical tool.

Statistical skills enable you to intelligently collect, analyze, and interpret data relevant to your decision-making.

Statistical concepts enable us to solve problems in a diversity of contexts.

Statistical thinking enables you to add substance to your decisions.

Statistics is a science to assist you in making decisions under uncertainty. The decision-making process must be based on data, not on personal opinion or belief.

It is already an accepted fact that "Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write." So, let us be ahead of our time.

In the US, students learn statistics from middle school.

Types of Statistics

Descriptive statistics deals with the description problem: Can the data be summarized in a useful way, either numerically or graphically, to yield insight about the population in question? Basic examples of numerical descriptors include the mean and standard deviation. Graphical summarizations include various kinds of charts and graphs.
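
As a small illustration of numerical descriptors, the sketch below summarizes a made-up sample with Python's standard `statistics` module. The data and variable names are assumptions for illustration only; they do not come from the slides.

```python
import statistics

# Hypothetical sample: exam scores for ten students (illustrative data only).
scores = [72, 85, 90, 66, 78, 88, 95, 70, 81, 75]

mean_score = statistics.mean(scores)      # arithmetic mean
median_score = statistics.median(scores)  # middle value of the sorted data
stdev_score = statistics.stdev(scores)    # sample standard deviation (n - 1 denominator)

print(f"mean={mean_score:.2f}, median={median_score:.2f}, stdev={stdev_score:.2f}")
```

The sample standard deviation is used here because the scores are treated as a sample from a larger population; `statistics.pstdev` would be the choice if they were the whole population.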

Inferential statistics is used to model patterns in the data, accounting for randomness and drawing inferences about the larger population. These inferences may take the form of answers to yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), forecasting of future observations, descriptions of association (correlation), or modeling of relationships (regression). Other modeling techniques include ANOVA, time series, and data mining.
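
To make the hypothesis-testing idea concrete, here is a minimal sketch of an exact two-sided permutation test for a difference in means. The two groups, their values, and the function name `permutation_p_value` are hypothetical; this is one of many possible tests, not a method prescribed by the slides.

```python
from itertools import combinations
from statistics import mean

# Hypothetical measurements for two groups (illustrative data only).
control = [12.1, 11.8, 12.5, 12.0]
treated = [13.0, 13.4, 12.9, 13.6]


def permutation_p_value(a, b):
    """Exact two-sided permutation test for a difference in group means."""
    pooled = a + b
    observed = abs(mean(b) - mean(a))
    extreme = 0
    total = 0
    # Enumerate every way of relabelling len(a) of the pooled values as group a.
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = abs(mean(group_b) - mean(group_a))
        # Count relabellings at least as extreme as the observed split
        # (small tolerance guards against floating-point noise).
        if diff >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


p_value = permutation_p_value(control, treated)
print(f"p-value = {p_value:.4f}")
```

A small p-value says the observed difference would be unusual if group labels were assigned at random; it answers the yes/no question only relative to a chosen significance level.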

Types of Studies

There are two major types of causal statistical studies: experimental studies and observational studies. In both types of studies, the effect of differences in an independent variable (or variables) on the behavior of the dependent variable is observed. The difference between the two types is in how the study is actually conducted. Each can be very effective.

An experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine if the manipulation may have modified the values of the measurements. In contrast, an observational study does not involve experimental manipulation. Instead, data are gathered and correlations between predictors and the response are investigated.
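
For an observational study, the correlation between a predictor and a response can be computed directly from the gathered data. The sketch below implements the Pearson correlation coefficient by hand; the data and the name `pearson_r` are assumptions for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical observational data: hours studied (predictor) and exam score (response).
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]

r = pearson_r(hours, score)
print(f"r = {r:.3f}")
```

Because nothing was manipulated, a strong correlation here describes association only; it does not by itself establish that the predictor causes the response.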

Types of Statistical Courses

Two types:

Greater statistics is everything related to learning from data, from the first planning or collection to the last presentation or report, with a deep respect for data and truth.

Lesser statistics is the body of statistical methodology, which has no interest in data or truth and generally consists of arithmetic exercises. If a certain assumption is needed to justify a procedure, its practitioners will simply "assume the ... are normally distributed," no matter how unlikely that might be.

Statistical Models

Statistical models are currently used in
various fields of business and science.

The terminology differs from field to field. For example, the fitting of models to data, called calibration, history matching, and data assimilation, are all synonymous with parameter estimation.


Data Analysis

Developments in statistical data analysis often parallel or follow advancements in other fields to which statistical methods are fruitfully applied.

The decision-making process under uncertainty is largely based on the application of statistical data analysis for probabilistic risk assessment of your decision.

Decision makers need to lead others to apply statistical thinking in day-to-day activities.

Decision makers need to apply the concept for the purpose of continuous improvement.

Is Data Information?

The database in your office contains a wealth of information.

The decision technology group members tap only a fraction of it.

Employees waste time scouring multiple sources for a

The decision makers are frustrated because they cannot get business-critical data exactly when they need it.

Therefore, too many decisions are based on guesswork,
not facts. Many opportunities are also missed, if they are
even noticed at all.

Data itself is not information, but might generate information.


Knowledge is what we know well. Information is the
communication of knowledge.

In every knowledge exchange, the sender makes common what is private and does the informing.

Information can be classified as explicit or tacit.

The explicit information can be explained in structured form, while tacit information is inconsistent and fuzzy to explain.
Know that data are only crude information and not
knowledge by themselves.


Knowledge (?)

Data is known to be crude information and not
knowledge by itself.

The sequence from data to knowledge is:
from Data to
Information, from Information to Facts, and finally,
from Facts to Knowledge

Data becomes information when it becomes relevant to your decision problem.

Information becomes fact when the data can support it. Facts are what the data reveals.

However, the decisive instrumental (i.e., applied) knowledge is expressed together with some statistical degree of confidence.



Fact becomes knowledge, when it is used in the
successful completion of a statistical process.

Statistical Analysis

As the exactness of a statistical model increases, the level of improvement in decision-making increases: this is the reason for using statistical data analysis.

Statistical data analysis arose from the need to
place knowledge on a systematic evidence base.

Statistics is a study of the laws of probability, the
development of measures of data properties and
relationships, and so on.

Statistical Inference

Verify the statistical hypothesis: Determining whether any statistical significance can be attached to the results after due allowance is made for any random variation as a source of error.

Intelligent and critical inferences cannot be made by those who do not understand the purpose, the conditions, and the applicability of the various techniques for judging significance.
Considering the uncertain environment, the chance that
"good decisions" are made increases with the availability
of "good information." The chance that "good
information" is available increases with the level of
structuring the process of Knowledge Management.

Knowledge Needs Wisdom

Wisdom is the power to put our time and
our knowledge to the proper use.

Wisdom is the accurate application of
accurate knowledge.

Wisdom is about knowing how technical staff can be best used to meet the needs of the decision-maker.

History Of Statistics

The word statistics ultimately derives from the modern Latin term statisticum collegium ("council of state") and the Italian word statista ("statesman" or "politician").

The birth of statistics occurred in the mid-17th century. A commoner named John Graunt, who was a native of London, began reviewing a weekly church publication issued by the local parish clerk that listed the number of births, christenings, and deaths in each parish. These so-called Bills of Mortality also listed the causes of death. Graunt, who was a shopkeeper, organized this data in the forms we call descriptive statistics, which was published as Natural and Political Observations Made upon the Bills of Mortality. Shortly thereafter, he was elected as a member of the Royal Society. Thus, statistics has to borrow some concepts from sociology, such as the concept of "Population". It has been argued that since statistics usually involves the study of human behavior, it cannot claim the precision of the physical sciences.

Statistics is for Government

The original principal purpose of Statistik was data to be used by governmental and (often centralized) administrative bodies. The collection of data about states and localities continues, largely through national and international statistical services. In particular, censuses provide regular information about the population.

During the 20th century, the creation of precise instruments for public health concerns (epidemiology, biostatistics, etc.) and economic and social purposes (unemployment rate, econometry, etc.) necessitated substantial advances in statistical practices.

History of Probability

Probability has a much longer history. Probability is derived from the verb to probe, meaning to "find out" what is not too easily accessible or understandable. The word "proof" has the same origin: it provides the necessary details to understand what is claimed to be true.

Probability originated from the study of games of chance and gambling during the sixteenth century. Probability theory was a branch of mathematics studied by Blaise Pascal and Pierre de Fermat in the seventeenth century.

Currently, in the 21st century, probabilistic modeling is used to control the flow of traffic through a highway system, a telephone interchange, or a computer processor; to find the genetic makeup of individuals or populations; and in quality control, insurance, investment, and other sectors of business and industry.

Stat Merge With Prob

Statistics eventually merged with the field of inverse probability, referring to the estimation of a parameter from experimental data in the experimental sciences (most notably astronomy).

Today the use of statistics has broadened far beyond the service of a state or government, to include such areas as business, natural and social sciences, and medicine, among others.

Statistics emerged in part from probability theory, which can be dated to the correspondence of Pierre de Fermat and Blaise Pascal. Christiaan Huygens (1657) gave the earliest known scientific treatment of the subject. Jakob Bernoulli's Ars Conjectandi (posthumous, 1713) and Abraham de Moivre's Doctrine of Chances (1718) treated the subject as a branch of mathematics.

Development in the 18th and 19th Centuries

The theory of errors may be traced back to Roger Cotes (posthumous, 1722), but a memoir prepared by Thomas Simpson in 1755 (printed 1756) first applied the theory to the discussion of errors of observation.

Daniel Bernoulli (1778) introduced the principle of the maximum product of the probabilities of a system of concurrent errors.

The method of least squares, which was used to minimize errors in measurement, is due to Robert Adrain (1808), Carl Gauss (1809), and Adrien-Marie Legendre (1805), motivated by the problems of survey measurements and of reconciling disparate physical measurements.

General theory in statistics: by Laplace (1810, 1812), Gauss (1823), James Ivory (1825, 1826), Hagen (1837), Friedrich Bessel, W. F. Donkin (1844, 1856), and Morgan Crofton (1870). Other contributors were Ellis (1844), De Morgan, and Giovanni Schiaparelli.


Statistics in the 20th Century

Karl Pearson (March 27, 1857 to April 27, 1936) was a major contributor to the early development of statistics. Pearson's work was all-embracing in the wide application and development of mathematical statistics, and encompassed the fields of biology, epidemiology, anthropometry, medicine, and social history. His main contributions are:

Linear regression

Pearson product-moment correlation coefficient: the first important effect size to be introduced into statistics

Classification of distributions: forms the basis for a lot of modern statistical theory; in particular, the exponential family of distributions underlies the theory of linear models

Pearson's chi-square test

Sir Ronald Aylmer Fisher (17 February 1890 to 29 July 1962)

Fisher invented the techniques of maximum likelihood and analysis of variance, and originated the concepts of Fisher's linear discriminator and Fisher information. His 1924 article "On a distribution yielding the error functions of several well known statistics" presented Karl Pearson's chi-squared and Student's t in the same framework as the normal distribution and his own analysis of variance distribution z (more commonly used today in the form of the F distribution). These contributions easily made him a major figure in 20th century statistics. He began the field of parametric statistics; entropy as well as Fisher information were essential for developing Bayesian analysis.

Statistics in the 20th Century

Gertrude Mary Cox (born January 13, 1900): experimental design

Charles Edward Spearman (September 10, 1863 to September 17, 1945): parametric analysis, rank correlation coefficient

Chebyshev's inequality

Lyapunov's central limit theorem

John Wilder Tukey (June 16, 1915 to July 26, 2000): jackknife estimation, exploratory data analysis, confirmatory data analysis

George Bernard Dantzig (8 November 1914 to 13 May 2005): developed the simplex method and furthered linear programming; advanced the fields of decomposition theory, sensitivity analysis, complementary pivot methods, large-scale optimization, nonlinear programming, and programming under uncertainty.

Bayes' theorem

David Roxbee Cox (born in Birmingham, England) has made pioneering and important contributions to numerous areas of statistics and applied probability, of which the best known is perhaps the proportional hazards model, which is widely used in the analysis of survival data.

Schools of Thought in Statistics

The Classical, attributed to Laplace

Relative Frequency, attributed to Fisher

Bayesian, attributed to Savage

What Type of Statistician Are You?

Classic Statistics

The problem with the Classical Approach is that what constitutes an outcome is not objectively determined. One person's simple event is another person's compound event. One researcher may ask, of a newly discovered planet, "what is the probability that life exists on the new planet?" while another may ask "what is the probability that carbon-based life exists on it?"

Bruno de Finetti, in the introduction to his two-volume treatise on Bayesian ideas, clearly states that "Probabilities Do Not Exist". By this he means that probabilities are not located in coins or dice; they are not characteristics of things like mass, density, etc.

Relative Frequency Statistics

Consider probabilities as "objective" attributes of things (or situations) which are really out there (availability of data).

Use the data we have only to make

Even if substantial prior information is available, frequentists do not use it, while Bayesians are willing to assign probability distribution function(s) to the population's parameter(s).

Bayesian approaches

Consider probability theory as an extension of deductive logic (including dialogue logic, interrogative logic, informal logic, and artificial intelligence) to handle uncertainty.

Claim to deduce from first principles the uniquely correct way of representing your beliefs about the state of things (the prior) and updating them in the light of the evidence.

The laws of probability have the same status as the laws of logic.

Bayesian approaches are explicitly "subjective" in the sense that they deal with the plausibility which a rational agent ought to attach to the propositions he/she considers, "given his/her current state of knowledge and experience."
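
The idea of starting from a prior belief and updating it in the light of evidence can be sketched with a conjugate Beta-Binomial model. This is only one convenient illustration, assuming a coin-like success/failure experiment; the prior, the counts, and the function names are hypothetical.

```python
def update_beta(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binomial data.

    With a Beta(alpha, beta) prior on a success probability, observing the
    given counts yields a Beta(alpha + successes, beta + failures) posterior.
    """
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Assumed prior: Beta(1, 1), i.e. uniform belief about the success probability.
prior = (1.0, 1.0)

# Hypothetical evidence: 7 successes and 3 failures.
posterior = update_beta(*prior, successes=7, failures=3)

print("prior mean:    ", beta_mean(*prior))
print("posterior mean:", beta_mean(*posterior))
```

A different prior fed with the same data would give a different posterior, which is exactly the sense in which the approach is called "subjective".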


From a scientist's perspective, there are good grounds to reject Bayesian reasoning. Bayesian reasoning deals not with objective but with subjective probabilities. The result is that any reasoning using a Bayesian approach cannot be checked, something that makes it worthless to science, like non-replicable experiments.

Bayesian perspectives often shed a helpful light on classical procedures. It is necessary to go into a Bayesian framework to give confidence intervals a probabilistic interpretation. This insight is helpful in drawing attention to the point that another prior distribution would lead to a different interval.

A Bayesian may cheat by basing the prior distribution on the data, because priors must be personal for coherence to hold before the study, which is more complex.

Objective Bayesian: There is a clear connection between probability and logic: both appear to tell us how we should reason. But how, exactly, are the two concepts related? Objective Bayesianism offers one answer to this question.

Steps Of The Analysis

Defining the problem: An exact definition of the problem is imperative in order to obtain accurate data about it.

Collecting the data: Designing ways to collect data is an important job in statistical data analysis. Population and sample are very important aspects.

Analyzing the data: Exploratory methods are used to discover what the data seems to be saying by using simple arithmetic and easy-to-draw pictures to summarize data. Confirmatory methods use ideas from probability theory in the attempt to answer specific questions.

Reporting the results

Types of Data, Levels of Measurement & Errors

Qualitative and Quantitative

Discrete and Continuous

Nominal, Ordinal, Interval and Ratio

Types of error: Recording error, typing error,
transcription error (incorrect copying),
Inversion (e.g., 123.45 is typed as 123.54),
Repetition (when a number is repeated),
Deliberate error, Type Error, etc.

Data Collection: Experiments


An experiment is a set of actions and observations, performed for solving a given problem, to test a hypothesis. It is an empirical approach to acquiring deeper knowledge about the physical world.

Design of experiments

In the "hard" sciences, design of experiments tends to focus on the elimination of extraneous effects; in the "soft" sciences it focuses more on the problems of external validity, by using statistical methods. Events also occur naturally from which scientific evidence can be drawn, which is the basis for natural experiments.

Controlled experiments

To demonstrate a cause and effect hypothesis, an experiment must often show that, for example, a phenomenon occurs after a certain treatment is given to a subject, and that the phenomenon does not occur in the absence of the treatment.

A controlled experiment generally compares the results obtained from an experimental sample against a control sample, which is practically identical to the experimental sample except for the one aspect whose effect is being tested.

Data Collection: Experiments

Natural experiments or quasi-experiments

Natural experiments rely solely on observations of the variables of the system under study, rather than manipulation of just one or a few variables as occurs in controlled experiments. Much research in several important disciplines relies on quasi-experiments.

Observational studies

Observational studies are very much like controlled experiments except that they lack probabilistic equivalency between groups. These types of experiments often arise in the area of medicine where, for ethical reasons, it is not possible to create a truly controlled group.

Field Experiments

Named in order to draw a contrast with laboratory experiments. Often used in the social sciences, economics, etc. Field experiments suffer from the possibility of contamination: experimental conditions can be controlled with more precision and certainty in the lab.

Data Analysis

It will follow different approaches!

Applied Statistics

Actuarial science

Actuarial science is the discipline that applies mathematical and statistical methods to assess risk in the insurance and finance industries. Actuaries are professionals who are qualified in this field through examinations and experience.

Actuarial science includes a number of interrelating subjects. Historically, actuarial science used deterministic models in the construction of tables and premiums. The science has gone through revolutionary changes during the last 30 years due to the proliferation of high-speed computers and the synergy of actuarial models with modern financial theory (Frees 1990).

Many universities have undergraduate and graduate degree programs in actuarial science. In 2002, a Wall Street Journal survey on the best jobs in the United States listed "actuary" as the second best job (Lee 2002).

Where Do Actuaries Work and What Do They Do?

The insurance industry can't function without actuaries, and that's where most of them work. They calculate the costs to assume risk: how much to charge policyholders for life or health insurance premiums, or how much an insurance company can expect to pay in claims when the next hurricane hits Florida.
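
As a rough illustration of "how much to charge policyholders", the sketch below uses a textbook-style simplification: expected claim cost per policy is claim frequency times average claim size, and the charged premium grosses that up for expenses. The formula choice, the numbers, and the function names are assumptions for illustration, not a description of how any particular insurer prices.

```python
def pure_premium(claim_frequency, average_severity):
    """Expected claim cost per policy: expected claims per year times average claim size."""
    return claim_frequency * average_severity


def gross_premium(pure, expense_loading):
    """Premium charged so that a fraction `expense_loading` of it covers expenses and margin."""
    if not 0.0 <= expense_loading < 1.0:
        raise ValueError("expense_loading must be in [0, 1)")
    return pure / (1.0 - expense_loading)


# Hypothetical portfolio assumptions (illustrative numbers only).
frequency = 0.05   # expected claims per policy per year
severity = 8000.0  # average cost of one claim
loading = 0.20     # share of the premium reserved for expenses and margin

expected_cost = pure_premium(frequency, severity)
premium = gross_premium(expected_cost, loading)
print(f"expected claim cost: {expected_cost:.2f}, premium: {premium:.2f}")
```

Real pricing work also models the variability of claims, not just their expected value, which is where the statistical methods discussed in this deck come in.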

Actuaries provide a financial evaluation of risk for their companies to be used for strategic
management decisions. Because their judgement is heavily relied upon, actuaries'
career paths often lead to upper management and executive positions.

When other businesses that do not have actuaries on staff need certain financial advice, they hire actuarial consultants. A consultant can be self-employed in a one-person practice or work for a nationwide consulting firm. Consultants help companies design
pension and benefit plans and evaluate assets and liabilities. By delving into the
financial complexities of corporations, they help companies calculate the cost of a
variety of business risks. Consulting actuaries rub elbows with chief financial officers,
operating and human resource executives, and often chief executive officers.

Actuaries work for the government too, helping manage such programs as the Social
Security system and Medicare. Since the government regulates the insurance
industry and administers laws on pensions and financial liabilities, it also needs
actuaries to determine whether companies are complying with the law.

Who else asks an actuary to assess risks and solve thorny statistical and financial problems? You name it: banks and investment firms, large corporations, public accounting firms, insurance rating bureaus, labor unions, and fraternal organizations.

Typical actuarial projects:

Analyzing insurance rates, such as for cars, homes or life insurance.

Estimating the money to be set aside for claims that have not yet been paid.

Participating in corporate planning, such as mergers and acquisitions.

Calculating a fair price for a new insurance product.
Forecasting the potential impact of catastrophes.

Analyzing investment programs.


Applied Statistical Methods

Courses that meet this requirement may be taught in the mathematics, statistics, or
economics department, or in the business school. In economics departments, this
course may be called Econometrics. The material could be covered in one course or
two. The mathematical sophistication of these courses will vary widely and all levels
are intended to be acceptable. Some analysis of real data should be included. Most
of the topics listed below should be covered:


Statistical Inference. 3 pts.

Linear Regression Models. 3 pts.

Time Series Analysis. 3 pts.

Survival Analysis. 3 pts.

Elementary Stochastic Processes. 3 pts.

Introduction to the Mathematics of Finance. 3 pts.

Statistical Inference and Time-Series Modelling. 3 pts.

Stochastic Methods in Finance. 3 pts.

Stochastic Differential Equations and Applications. 3 pts.

Advanced Data Analysis. 3 pts.

Data Mining. 3 pts.

Statistical Methods in Finance. 3 pts.

Nonparametric Statistics. 3 pts.

Stochastic Processes and Applications. 3 pts.

Some Books

Generalized Linear Models for Insurance Data, by Piet de Jong and Gillian Z. Heller

Stochastic Claims Reserving Methods in Insurance (The Wiley Finance Series), by Mario V. Wüthrich and Michael Merz

Actuarial Modelling of Claim Counts: Risk Classification, Credibility and Bonus-Malus Systems, by Michel Denuit, Xavier Marechal, Sandra Pitrebois and Jean-Francois Walhin

Loss Models: From Data to Decisions (Wiley Series in Probability and Statistics) (Hardcover), by Stuart A. Klugman, Harry H. Panjer and Gordon E. Willmot





Biostatistics

Biostatistics is the application of statistics to a wide range of topics in biology, including:

Public health, including environmental health

Design and analysis of clinical trials

Population genetics and statistical genetics in populations, in order to link variation in genotype with a variation in phenotype

Sequence analysis


Data Mining

Data mining, also known as Knowledge Discovery in Databases (KDD), is the process of automatically searching large volumes of data for patterns.

The nontrivial extraction of implicit, previously unknown, and potentially useful information from data.

Data mining involves the process of analyzing data.

Data mining is a fairly recent and contemporary topic in computing.

Data mining applies many older computational techniques from statistics, machine learning, and pattern recognition.

Data Mining and Business

Increasing potential to support business decisions (top layer: end user; bottom layer: data sources):

Data Presentation: Visualization Techniques

Data Mining: Information Discovery

Data Exploration: Statistical Summary, Querying, and Reporting

Data Preprocessing/Integration, Data Warehouses

Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

Data Mining: Confluence of Multiple Disciplines












Data Mining: On What Kinds of Data?

Database-oriented data sets and applications

Relational database, data warehouse, transactional database

Advanced data sets and advanced applications

Data streams and sensor data

Time-series data, temporal data, sequence data (incl. bio-sequences)

Structure data, graphs, social networks and multi-linked data

Object-relational databases

Heterogeneous databases and legacy databases

Spatial data and spatiotemporal data

Multimedia database

Text databases

The World-Wide Web

10 Most Popular DM Algorithms: 18 Identified Candidates (I)

Classification

#1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

#2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984.

#3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. TPAMI. 18(6)

#4. Naive Bayes: Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385

Statistical Learning

#5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer

#6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.

Association Analysis

#7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.

#8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00.

The 18 Identified Candidates (II)

Link Mining

#9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In WWW-7, 1998.

#10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked environment. SODA, 1998.

Clustering

#11. K-Means: MacQueen, J. B., Some methods for classification and analysis of multivariate observations, in Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, 1967.

#12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96.

Bagging and Boosting

#13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119

The 18 Identified Candidates (III)

Sequential Patterns

#14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology, 1996.

#15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE

Integrated Mining

#16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and association rule mining. KDD

Rough Sets

#17. Finding reduct: Zdzislaw Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Norwell, MA, 1992

Graph Mining

#18. gSpan: Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In ICDM '02.

Major Issues in Data Mining

Mining methodology

Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web

Performance: efficiency, effectiveness, and scalability

Pattern evaluation: the interestingness problem

Incorporation of background knowledge

Handling noise and incomplete data

Parallel, distributed and incremental mining methods

Integration of the discovered knowledge with existing one: knowledge fusion

User interaction

Data mining query languages and ad-hoc mining

Expression and visualization of data mining results

Interactive mining of knowledge at multiple levels of abstraction

Applications and social impacts

Domain-specific data mining & invisible data mining

Protection of data security, integrity, and privacy

Challenge Problems in Data Mining

Developing a Unifying Theory of Data Mining

Scaling Up for High Dimensional Data and High Speed Data Streams

Mining Sequence Data and Time Series Data

Mining Complex Knowledge from Complex Data

Data Mining in a Network Setting

Distributed Data Mining and Mining Multi-agent Data

Data Mining for Biological and Environmental Problems

Data Mining Process Related Problems

Security, Privacy and Data Integrity

Dealing with Non-static, Unbalanced and Cost-sensitive Data

Recommended Reference

S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000

T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003

U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996

U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001

J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed.

D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001

B. Liu. Web Data Mining. Springer, 2006

T. M. Mitchell. Machine Learning. McGraw Hill, 1997

G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press

P.-N. Tan, M. Steinbach and V. Kumar. Introduction to Data Mining. Wiley, 2005

S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998

I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2nd ed., 2005

Economic statistics

Economic statistics is a branch of applied statistics
focusing on the collection, processing, compilation and
dissemination of statistics concerning the economy of a
region, a country or a group of countries.

Economic statistics is also regarded as a subtopic of official statistics, since most economic statistics are produced by official organizations (e.g. statistical institutes, supranational organizations, central banks, ministries, etc.).

Economic statistics provide the empirical data needed in
economic research (econometrics) and they are the
basis for decision and economic policy making.


Econometrics

Econometrics is concerned with the tasks of developing and applying quantitative or statistical methods to the study and elucidation of economic principles. Econometrics combines economic theory with statistics to analyze and test economic relationships.

Theoretical econometrics considers questions about the statistical properties of estimators and tests, while applied econometrics is concerned with the application of econometric methods to assess economic theories. Although the first known use of the term "econometrics" was by Pawel Ciompa in 1910, Ragnar Frisch is given credit for coining the term in the sense that it is used today.
Method in Econometrics

Although many econometric methods represent applications of
standard statistical models, there are some special features of
economic data that distinguish econometrics from other branches of

Economic data are generally observational, rather than being
derived from controlled experiments. Because the individual units in
an economy interact with each other, the observed data tend to
reflect complex economic equilibrium conditions rather than simple
behavioral relationships based on preferences or technology.
Consequently, the field of econometrics has developed methods for
identification and estimation of simultaneous equation models.
These methods allow researchers to make causal inferences in the
absence of controlled experiments.

Early work in econometrics focused on time-series data, but econometrics
now also fully covers cross-sectional and panel data.

Data in Econometrics

Data is broadly classified according to the number of dimensions.

A data set containing observations on a single phenomenon observed over
multiple time periods is called a time series. In time series data, both the
values and the ordering of the data points have meaning.

A data set containing observations on multiple phenomena observed at a
single point in time is called cross-sectional. In cross-sectional data sets,
the values of the data points have meaning, but the ordering of the data
points does not.

A data set containing observations on multiple phenomena observed over
multiple time periods is called panel data. Alternatively, the second
dimension of data may be some entity other than time. For example, when
there is a sample of groups, such as siblings or families, and several
observations from every group, the data is panel data. Whereas time series
and cross-sectional data are both one-dimensional, panel data sets are
two-dimensional.

Data sets with more than two dimensions are typically called
multi-dimensional panel data.


Research Area: Theoretical econometrics, including time series analysis,
nonparametric and semi-parametric estimation, panel data analysis, and
financial econometrics; applied econometrics, including applied labor
economics and empirical finance.


Probability and Statistics

Advanced Econometrics

Time Series Models

Micro Econometrics

Panel Data Econometrics

Financial Econometrics

Nonparametric and semi-parametric econometrics

Lecture on Advanced Econometrics

Data Analysis in Academic Research (using SAS)

Statistics and Data Analysis for Economics

Nonlinear Models

Some Research at the U. of Chicago

Proposal: "Selective Publicity and Stock Prices" By: David Solomon

Proposal: "Activating Self-Control: Isolated vs. Interrelated Temptations" By: Kristian

Proposal: "Buyer's Remorse: When Evaluation is Based on Simulation Before You Chose
but Deliberation After" By: Yan Zhang

Proposal: "Brokerage, Second-Hand Brokerage and Difficult Working Relationships: The
Role of the Informal Organization on Speaking Up about Difficult Relationships and
Being Deemed Uncooperative by Co-Workers" By: Jennifer Hitler

Proposal: "Resource Space Dynamics in the Evolution of Industries: Formation,
Expansion and Contraction of the Resource Space and its Effects on the Survival of
Organizations" By: Aleksios Gotsopoulos

Defense: "An Examination of Status Dynamics in the U.S. Venture Capital Industry" By:
Kyu Kim

Defense: "Group Dynamics and Contact: A Natural Experiment" By: Arjun Chakravarti

Defense: "Essays in Corporate Governance" By: Ashwini Agrawal

Defense: "Essays on Consumer Finance" By: Brian Melzer

Defense: "Male Incarceration and Teen Fertility" By: Amee Kamdar

Defense: "Essays on Economic Fundamentals in Asset Pricing" By: Jie (Jennie) Bai

Defense: "Asset-Intensity and the Cross-Section of Stock Returns" By: Raife Giovinazzo

Defense: "Essays on Household Behavior" By: Marlena Lee

Proposal: "How (Un)Accomplished Goal Actions Affect Goal Striving and Goal Setting" By: Minjung

Defense: "Empirical Entry Games with Complementarities: An Application to the Shopping Center
Industry" By: Maria Ana Vitorino

Defense: "Betas, Characteristics, and the Cross-Section of Hedge Fund Returns" By: Mark Klebanov

Defense: "Expropriation Risk and Technology" By: Marcus Opp

Defense: "Essays in Corporate Finance and Real Estate" By: Itzhak Ben

Proposal: "Group Dynamics and Interpersonal Contact: A Natural Experiment" By: Arjun Chakravarti

Proposal: "Structural Estimation of a Moral Hazard Model: An Application to Industrial Selling"
By: Renna Jiang

Proposal: "Status, Quality, and Earnings Announcements: An Analysis of the Effect of News of which
Confirms or Contradicts the Status-Quality Correlation on the Stock of a Company" By: Daniela

Defense: "Diversification and its Discontents: Idiosyncratic and Entrepreneurial Risk in the Quest for
Social Status" By: Nick Roussanov

Summary of Econometrics

It is a combination of mathematical economics, economic statistics and
economic theory.

Regression analysis is popular.

Time-series analysis

Cross-sectional analysis

Panel analyses, which relate to multi-dimensional panel data

Fixed effect models: There are unique attributes of
individuals that are not the results of random variation
and that do not vary across time. This model is adequate if
we want to draw inferences only about the examined individuals.

Random effect models: There are unique, time-constant
attributes of individuals that are the results of random
variation and do not correlate with the individual
regressors. This model is adequate if we want to draw
inferences about the whole population, not only the
examined sample.
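
The fixed effect idea can be illustrated with a minimal sketch. It assumes a made-up panel in which each individual has its own time-constant intercept that is correlated with the regressor; the names `ols_slope`, `within_slope`, `alphas` and `panel` are hypothetical and not from the slides. Subtracting each individual's own mean (the "within" transformation) removes the time-constant attribute, so the common slope is recovered, while a pooled regression that ignores the individual attributes is distorted.

```python
from collections import defaultdict


def ols_slope(pairs):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx


def within_slope(records):
    """Slope after demeaning x and y inside each unit (within transformation)."""
    groups = defaultdict(list)
    for unit, x, y in records:
        groups[unit].append((x, y))
    demeaned = []
    for obs in groups.values():
        mx = sum(x for x, _ in obs) / len(obs)
        my = sum(y for _, y in obs) / len(obs)
        # Demeaning wipes out the time-constant attribute of the unit.
        demeaned.extend((x - mx, y - my) for x, y in obs)
    return ols_slope(demeaned)


# Hypothetical panel: y = alpha_unit + 2.0 * x, observed over 5 periods.
# x is shifted by alpha / 5 so that it is correlated with the unit effect.
alphas = {"A": 10.0, "B": -5.0, "C": 0.5}
panel = [(u, t + a / 5, a + 2.0 * (t + a / 5))
         for u, a in alphas.items() for t in range(1, 6)]

pooled = ols_slope([(x, y) for _, x, y in panel])  # ignores unit effects
within = within_slope(panel)                       # fixed effect estimate
print(f"pooled slope: {pooled:.3f}, within slope: {within:.3f}")
```

In this noise-free example the within slope equals the true value of 2, while the pooled slope absorbs part of the individual effects.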



Engineering Statistics

Design of experiments (DOE) uses statistical
techniques to test and construct models of
engineering components and systems.

Quality control and process control use statistics
as a tool to manage conformance to
specifications of manufacturing processes and
their products.

Time and methods engineering uses statistics to
study repetitive operations in manufacturing in
order to set standards and find optimum (in
some sense) manufacturing procedures.

Statistical Physics

Uses methods of statistics in solving physical problems.

The term statistical physics covers statistical approaches to
classical mechanics. Hence it might also be called
statistical mechanics.

It works well in classical systems when the number of
degrees of freedom is so large that exact solution is not
possible, or not really useful.

Statistical mechanics can also describe work in chaos theory,
thermal physics, fluid dynamics (particularly at high
Knudsen numbers), or plasma physics.


Demography

The study of human population dynamics. It
encompasses the study of the size, structure
and distribution of populations, and how
populations change over time (for example, due to births).

Methods include census returns and registers, or incorporate
survey data using indirect estimation techniques.

Psychological Statistics

The application of statistics to psychology.

Some of the more commonly used statistical
tests in psychology are:

Student's t-test, chi-square, ANOVA,
ANCOVA, MANOVA, regression analysis,
correlation, survival analysis, clinical
trial, etc.

Social Statistics


Uses statistical measurement systems to study human
behavior in a social environment.

Advanced statistical analyses have become popular
in the social sciences.

A new branch: quantitative social science at Harvard

Structural Equation Modeling

factor analysis

Multilevel models

Cluster analysis

Latent class models

Item response theory

Survey methodology

survey sampling


Chemometrics

Applies mathematical or statistical methods to chemical data.

Chemometrics is the science of relating measurements made
on a chemical system or process to the state of the system via
application of mathematical or statistical methods.

Chemometric research spans a wide area of different methods
which can be applied in chemistry. There are techniques for
collecting good data (optimization of experimental parameters,
design of experiments, signal processing) and for
getting information from these data.
Chemometrics tries to build a bridge between the methods and
their application in chemistry.

Reliability Engineering

Reliability engineers perform a wide variety of
special management and engineering tasks to
ensure that sufficient attention is given to details
that will affect the reliability of a given system.

Reliability engineers rely heavily on statistics,
probability theory, and reliability theory. Many
engineering techniques are used in reliability
engineering, such as reliability prediction,
analysis, thermal management,
reliability testing and accelerated life testing.

Statistical Methods

A common goal for a statistical research
project is to investigate causality, and in
particular to draw a conclusion on the
effect of changes in the values of
predictors or independent variables on a
response or dependent variable.

Two major types of studies: experimental
and observational studies

Well Known Techniques

Student's t-test: tests whether the means of two normally distributed
populations are equal

Chi-square test: tests whether two distributions are the same

Analysis of variance (ANOVA): tests differences of means or effects

Mann-Whitney U: tests for a difference between two observed distributions

Regression analysis: models relationships between random variables,
determines the magnitude of the relationships between variables, and can be
used to make predictions based on the models

Correlation: indicates the strength and direction of a linear relationship
between two random variables

Fisher's Least Significant Difference test: tests differences of means in
multiple comparisons

Pearson product-moment correlation coefficient: a measure of how well a
linear equation describes the relation between two variables measured on the
same object

Spearman's rank correlation coefficient: a nonparametric measure of
correlation between two variables
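
To make the difference between the two correlation coefficients concrete, here is a small sketch using only the Python standard library. The helper names (`pearson`, `ranks`, `spearman`) and the example data are illustrative assumptions. Spearman's coefficient is computed as the Pearson coefficient of the ranks, with tied values sharing an average rank.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up data: y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]
print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

Because the relationship is monotonic but curved, the rank-based coefficient reaches 1 while the Pearson coefficient stays below 1.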

Simple Statistic Application

Compare two means

Compare two proportions

Compare two populations

Estimate mean or proportion

Find empirical distribution

Statistical Topics

Sampling Distribution

Sampling distribution is used to describe the distribution of
outcomes that one would observe from replication of a particular
sampling plan.

Know that to estimate means to esteem (to give value to).

Know that estimates computed from one sample will be different
from estimates that would be computed from another sample.

Understand that estimates are expected to differ from the
population characteristics (parameters) that we are trying to
estimate, but that the properties of sampling distributions allow us
to quantify, probabilistically, how they will differ.

Understand that different statistics have different sampling
distributions with distribution shape depending on (a) the specific
statistic, (b) the sample size, and (c) the parent distribution.

Understand the relationship between sample size and the
distribution of sample estimates.

Understand that the variability in a sampling distribution can be
reduced by increasing the sample size.


Sequential sampling technique

Low response rate

Biased response

Outlier Removal

Outliers are a few observations that are not well fitted by the "best"
available model. When they occur, one must first investigate the
source of the data. If there is no doubt about the accuracy or veracity of
the observation, then it should be removed and the model should be
refitted. Robust statistical techniques are needed to cope with any
undetected outliers; otherwise the result will be misleading.

Because of the potentially large variance, outliers could be the
outcome of sampling. It is perfectly correct to have such an
observation that legitimately belongs to the study group by definition,
for example with lognormally distributed data.

Be very careful and cautious: before declaring an observation "an
outlier," find out why and how such an observation occurred. It could
even be an error at the data entry stage.

First, construct the box plot of your data. Form the Q1, Q2, and Q3
points which divide the samples into four equally sized groups (Q2
= median). Let IQR = Q3 - Q1. Outliers are defined as those points
outside the values Q3 + k*IQR and Q1 - k*IQR. For most cases one
sets k = 1.5 or 3.

An alternative outlier definition is any point outside mean + k*sigma
and mean - k*sigma (k is 2, 2.5, or 3).
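
Both rules can be sketched in a few lines of Python. This is only an illustration: the function names and the data are made up, and the quartiles come from `statistics.quantiles`, whose default interpolation method is one of several common quartile conventions.

```python
import statistics


def iqr_outliers(data, k=1.5):
    """Points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]


def sd_outliers(data, k=2.5):
    """Points more than k sample standard deviations from the mean."""
    mean = statistics.fmean(data)
    s = statistics.stdev(data)
    return [x for x in data if abs(x - mean) > k * s]


# Made-up sample with one suspicious value.
sample = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print("IQR rule, k=1.5:", iqr_outliers(sample))
print("SD rule,  k=2.5:", sd_outliers(sample, k=2.5))
print("SD rule,  k=3:  ", sd_outliers(sample, k=3))
```

The last call returns nothing for this sample: the extreme value inflates the mean and the standard deviation it is judged against, which is one reason the quartile-based rule is often preferred for small samples.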

Central Limit Theorem

The average of a sample of observations drawn from some
population with any shape of distribution is approximately
distributed as a normal distribution if certain conditions are met.

It is well known that whatever the parent population is, the
standardized variable will have a distribution with a mean 0
and standard deviation 1 under random sampling with a large
sample size.

The sample size needed for the approximation to be adequate
depends strongly on the shape of the parent distribution.
Symmetry is particularly important. For a symmetric and
short-tailed parent distribution, even if very different from the shape of
a normal distribution, an adequate approximation can be
obtained with small samples (e.g., 10 or 12 for the uniform
distribution). In some extreme cases (e.g., binomial)
sample sizes far exceeding the typical guidelines (say, 30) are
needed for an adequate approximation.
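
A small simulation can illustrate the uniform case mentioned above. This is a sketch under stated assumptions: samples of size 12 from the uniform distribution on (0, 1), 20,000 replications, and a fixed seed; the function name `standardized_means` is hypothetical. If the normal approximation is adequate, the standardized sample means should have mean near 0, standard deviation near 1, and about 95% of them should fall within 1.96 of zero.

```python
import math
import random
import statistics


def standardized_means(draw, mu, sigma, n, reps, seed=0):
    """Simulate `reps` sample means of size n and standardize each one."""
    rng = random.Random(seed)
    results = []
    for _ in range(reps):
        sample_mean = sum(draw(rng) for _ in range(n)) / n
        results.append((sample_mean - mu) / (sigma / math.sqrt(n)))
    return results


# Uniform(0, 1) has mean 1/2 and standard deviation sqrt(1/12).
z = standardized_means(lambda rng: rng.random(),
                       mu=0.5, sigma=math.sqrt(1 / 12), n=12, reps=20000)

coverage = sum(abs(v) <= 1.96 for v in z) / len(z)
print(f"mean of z:      {statistics.fmean(z):.3f}")
print(f"std dev of z:   {statistics.pstdev(z):.3f}")
print(f"P(|z| <= 1.96): {coverage:.3f}")
```

Replacing the `draw` function with a strongly skewed distribution is an easy way to see how much larger the sample size must be before the same check passes.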


The P-value, which directly depends on a given sample, attempts to provide a
measure of the strength of the results of a test, in contrast to a simple reject or do not
reject. If the null hypothesis is true and the chance of random variation is the only
reason for sample differences, then the P-value is a quantitative measure to feed into
the decision making process as evidence. The following table provides a reasonable
interpretation of P-values:

P < 0.01: very strong evidence against H0
0.01 ≤ P < 0.05: moderate evidence against H0
0.05 ≤ P < 0.10: suggestive evidence against H0
0.10 ≤ P: little or no real evidence against H0

This interpretation is widely accepted, and many scientific journals routinely publish
papers using this interpretation for the result of test of hypothesis.

For the fixed-sample size, when the number of realizations is decided in advance, the
distribution of p is uniform (assuming the null hypothesis). We would express this as
P(p ≤ x) = x. That means the criterion of p < 0.05 achieves a significance level α of 0.05.

When a p-value is associated with a set of data, it is a measure of the probability that
the data could have arisen as a random sample from some population described by
the statistical (testing) model.

A p-value is a measure of how much evidence you have against the null hypothesis.
The smaller the p-value, the more evidence you have. One may combine the p-value
with the significance level to make a decision on a given test of hypothesis. In such a
case, if the p-value is less than some threshold (usually 0.05, sometimes a bit larger
like 0.1 or a bit smaller like 0.01) then you reject the null hypothesis.

Accuracy, Precision, Robustness,
and Data Quality


Accuracy is the degree of conformity of a measured/calculated quantity to its actual
(true) value.

Precision is the degree to which further measurements or calculations will show the
same or similar results.

Robustness is the resilience of the system, especially when under stress or when
confronted with invalid input.

Data are of high quality "if they are fit for their intended uses in



An "accurate" estimate has small bias. A "precise" estimate has both small bias and
small variance.
The robustness of a procedure is the extent to which its properties do not depend on
those assumptions which you do not wish to make.

Distinguish between bias robustness and efficiency robustness.

Example: The sample mean is seen as a robust estimator because it is unbiased
regardless of the underlying distribution, provided the population mean exists. This
estimator is bias robust, but it is clearly not efficiency robust, as its variance can
increase endlessly. That variance can even be infinite if the underlying distribution is
heavy-tailed, such as a Pareto with a small shape parameter; for a Cauchy
distribution the mean itself does not exist.

Bias Reduction Techniques

The most effective tools for bias reduction are the bootstrap and the
jackknife. The bootstrap uses resampling from a given
set of data to mimic the variability that produced the data in the first place;
it has a rather more dependable theoretical basis and can be a highly
effective procedure for estimation of error quantities in statistical problems.

The bootstrap creates a virtual population by duplicating the same sample
over and over, and then re-samples from the virtual population to form a
reference set. Then you compare your original sample with the reference
set to get the exact p-value. Very often, a certain structure is "assumed" so
that a residual is computed for each case. What is then re-sampled is
the set of residuals, which are then added to those assumed structures,
before some statistic is evaluated. The purpose is often to estimate a P-value.

The jackknife re-computes the statistic by leaving one observation out each time.
Jackknifing does a bit of logical folding to provide estimators of coefficients
and error that will have reduced bias.
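
Both resampling ideas can be sketched briefly in Python. The code assumes a made-up sample; the function names `bootstrap_se` and `jackknife_corrected` are illustrative. The bootstrap part resamples with replacement to estimate the standard error of the mean. The jackknife part applies the usual leave-one-out bias correction, n times the full-sample estimate minus (n - 1) times the average leave-one-out estimate, to the biased "divide by n" variance estimator.

```python
import random
import statistics


def bootstrap_se(data, stat, reps=2000, seed=0):
    """Standard error of `stat`, estimated from resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    replicates = [stat([rng.choice(data) for _ in range(n)]) for _ in range(reps)]
    return statistics.stdev(replicates)


def jackknife_corrected(data, stat):
    """Jackknife bias-corrected estimate: n*full - (n - 1)*mean(leave-one-out)."""
    n = len(data)
    full = stat(data)
    leave_one_out = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    return n * full - (n - 1) * statistics.fmean(leave_one_out)


# Made-up measurements.
data = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 6.1, 4.9, 3.3]

se_mean = bootstrap_se(data, statistics.fmean)
corrected_var = jackknife_corrected(data, statistics.pvariance)

print(f"bootstrap SE of the mean:      {se_mean:.3f}")
print(f"biased variance (divide by n): {statistics.pvariance(data):.3f}")
print(f"jackknife-corrected variance:  {corrected_var:.3f}")
print(f"unbiased variance (n - 1):     {statistics.variance(data):.3f}")
```

For this particular estimator the jackknife correction reproduces the familiar n - 1 variance, which makes it a convenient check that the correction is implemented properly.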

Bias reduction techniques have wide applications in anthropology,
chemistry, climatology, clinical trials, cybernetics, and ecology, etc.

Effect Size

Effect size (ES) permits the effects of different
treatments to be compared, even when based on different samples
and different measuring instruments. The ES is the standardized mean
difference between the treatment group and the control group.

Glass's method: Suppose an experimental treatment group has a
mean score of Xe and a control group has a mean score of Xc and a
standard deviation of Sc; then the effect size is equal to (Xe - Xc)/Sc.

Hunter and Schmidt (1990) suggested using a pooled within-group
standard deviation because it has less sampling error than the
control group standard deviation under the condition of equal
sample size. In addition, Hunter and Schmidt corrected the effect
size for measurement error by dividing the effect size by the square
root of the reliability coefficient of the dependent variable.

Cohen's ES: (mean1 - mean2)/pooled SD
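
The two effect size formulas above can be written out as a short sketch. The scores are invented, and the pooled standard deviation is taken to be the usual degrees-of-freedom weighted pooling of the two sample variances, which is an assumption about what "pooled SD" means here.

```python
import math
import statistics


def glass_delta(treatment, control):
    """Glass's effect size: (Xe - Xc) / Sc, using the control group SD."""
    diff = statistics.fmean(treatment) - statistics.fmean(control)
    return diff / statistics.stdev(control)


def cohens_d(group1, group2):
    """Cohen's effect size: (mean1 - mean2) / pooled SD."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * statistics.variance(group1)
                  + (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
    diff = statistics.fmean(group1) - statistics.fmean(group2)
    return diff / math.sqrt(pooled_var)


# Made-up scores: the treatment group is both higher and more spread out.
control = [4, 5, 6, 5, 5]
treatment = [5, 7, 9, 7, 7]

print(f"Glass's delta: {glass_delta(treatment, control):.3f}")
print(f"Cohen's d:     {cohens_d(treatment, control):.3f}")
```

The two measures differ here because the treatment group has the larger spread, so pooling raises the denominator relative to the control group SD alone.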

Nonparametric Technique

Parametric techniques are more useful the more one knows about the subject matter, since
knowledge about the data can be built into parametric models. Nonparametric methods, in
both senses of the term (distribution-free tests and flexible functional forms), are more useful when
less is known about the subject matter. A statistical technique is called nonparametric if
it satisfies at least one of the following criteria:

1. The data entering the analysis are enumerative, that is, count data representing the number of
observations in each category or cross-category.

2. The data are measured and/or analyzed using a nominal or ordinal scale of measurement.

3. The inference does not concern a parameter in the population distribution.

4. The probability distribution of the statistic upon which the analysis is based is very general, such as
continuous, discrete, or symmetric etc.

The statistics are:

Mann-Whitney rank test: a nonparametric alternative to Student's t-test when one does not
have normally distributed data.

Mann-Whitney: To be used with two independent groups (analogous to the independent groups t-test)

Wilcoxon: To be used with two related (i.e., matched or repeated) groups (analogous to the
related samples t-test)

Kruskal-Wallis: To be used with two or more independent groups (analogous to the single-factor
between-subjects ANOVA)

Friedman: To be used with two or more related groups (analogous to the single-factor
within-subjects ANOVA)
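
As an illustration of how a rank-based test avoids distributional assumptions, the following sketch computes the Mann-Whitney U statistic directly from its pairwise definition. It is only the statistic, not a full test: turning U into a p-value needs tables or a normal approximation, which are left out. The function name and the data are made up.

```python
def mann_whitney_u(a, b):
    """Return (U_a, U_b).

    U_a counts the pairs (x from a, y from b) with x > y; a tie counts 0.5.
    The two statistics always add up to len(a) * len(b).
    """
    u_a = 0.0
    for x in a:
        for y in b:
            if x > y:
                u_a += 1.0
            elif x == y:
                u_a += 0.5
    return u_a, len(a) * len(b) - u_a


# Made-up scores for two independent groups.
group_a = [1, 4, 6]
group_b = [2, 3, 5]
print(mann_whitney_u(group_a, group_b))

# Complete separation gives the most extreme possible values.
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))
```

Only the ordering of the observations enters the calculation, so any monotone transformation of the data leaves U unchanged.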

Least Squares Models

Many problems in analyzing data involve describing how variables
are related. The simplest of all models describing the relationship
between two variables is a linear, or straight-line, model. The
conventional method is that of least squares, which finds the line
minimizing the sum of squared vertical distances between the observed
points and the fitted line.

There is a simple connection between the numerical coefficients in
the regression equation and the slope and intercept of the regression
line.

A summary statistic such as a correlation coefficient does not tell
the whole story. A scatter plot is an essential complement to
examining the relationship between the two variables.

Model checking is an essential part of the process of statistical
modeling. After all, conclusions based on models that do not
properly describe an observed set of data will be invalid.

Analyzing the residuals reveals the impact of violations of regression
model assumptions (i.e., conditions) and suggests possible solutions.
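
A straight-line least squares fit and its residuals can be sketched as follows. The closed-form slope and intercept are the standard simple-regression formulas; the function name `fit_line` and the data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares line y = intercept + slope * x, plus the residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return slope, intercept, residuals


# Made-up measurements that lie close to a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, residuals = fit_line(xs, ys)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print("residuals:", [round(r, 3) for r in residuals])
```

With an intercept in the model the residuals always sum to zero, so model checking looks at their pattern (trend, changing spread, isolated large values) rather than their total.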

Least Median of Squares Models

least absolute deviation (LAD)

The standard least squares techniques for
estimation in linear models are not robust in the
sense that outliers or contaminated data can
strongly influence estimates.

A robust technique that protects against
contamination is least median of squares (LMS)
or least absolute deviation (LAD).

LMS estimation can be extended to generalized
linear models, giving rise to the least median of
deviance (LMD) estimator.

Multivariate Data Analysis

Multivariate analysis is a branch of statistics involving the consideration of objects on
each of which are observed the values of a number of variables. Multivariate
techniques are used across the whole range of fields of statistical application. The
techniques are:

Principal components analysis

Factor analysis

Cluster analysis

Discriminant analysis

Principal component analysis is used for exploring data to reduce the dimension.
Generally, PCA seeks to represent n correlated random variables by a reduced set of
uncorrelated variables, which are obtained by transformation of the original set onto
an appropriate subspace.

Two closely related techniques, principal component analysis and factor analysis, are
used to reduce the dimensionality of multivariate data. In these techniques
correlations and interactions among the variables are summarized in terms of a small
number of underlying factors. The methods rapidly identify key variables or groups of
variables that control the system under study.

Cluster analysis is an exploratory data analysis tool which aims at sorting different
objects into groups in a way that the degree of association between two objects is
maximal if they belong to the same group and minimal otherwise.

Discriminant function analysis is used to classify cases into the values of a categorical
dependent variable, usually a dichotomy.

Regression Analysis

Models the relationship between one or more response variables and the
predictors. If there is more than one response variable, we speak of
multivariate regression.

Types of regression

Simple and multiple linear regression

Simple linear regression and multiple linear regression are related
statistical methods for modeling the relationship between two or more
random variables using a linear equation. Linear regression assumes the
best estimate of the response is a linear function of some parameters
(though not necessarily linear in the predictors).

Nonlinear regression models

If the relationship between the variables being analyzed is not linear in parameters, a number of
nonlinear regression techniques may be used to obtain a more accurate regression.

Other models

Although these types are the most common, there also exist
Poisson regression and weighted regression.
Linear models

Predictor variables may be defined quantitatively or qualitatively (categorically). Categorical
predictors are sometimes called factors. Although the method of estimating the model is the
same for each case, different situations are sometimes known by different names for historical
reasons:

If the predictors are all quantitative, we speak of multiple regression.

If the predictors are all qualitative, one performs analysis of variance.

If some predictors are quantitative and some qualitative, one performs an
analysis of covariance.

General Linear Regression

The general linear model (GLM) is a statistical linear model. It may be written as

$$Y = XB + U$$

where $Y$ is a matrix with series of multivariate measurements, $X$ is a matrix that
might be a design matrix, $B$ is a matrix containing parameters that are usually to be
estimated and $U$ is a matrix containing residuals (i.e., errors or noise). The residual is
usually assumed to follow a multivariate normal distribution or other distribution,
such as a distribution in the exponential family.

The general linear model incorporates a number of different statistical models,
including ANOVA, ANCOVA, and ordinary linear regression. If there is only one
column in Y (i.e., one dependent variable) then the model can also be referred to
as the multiple regression model (multiple linear regression).

For example, if the response variable can take only binary values (for example, a
Boolean or Yes/No variable), logistic regression is preferred. The outcome of this
type of regression is a function which describes how the probability of a given event
(e.g. probability of getting "yes") varies with the predictors.

Hypothesis tests with the general linear model can be made in two ways:
multivariate and mass-univariate.


Semiparametric and Non-parametric modeling

The Generalized Linear Model (GLM)

$$Y = G(X_1\beta_1 + \dots + X_p\beta_p) + e$$

where G is called the link function. All these models lead to the
problem of estimating a multivariate regression. Parametric
regression estimation has the disadvantage, that by the parametric
"form" certain properties of the resulting estimate are already implied.

Nonparametric techniques allow diagnostics of the data without this
restriction, and the model structure is not specified a priori. However,
this requires large sample sizes and causes problems in graphical
visualization.
Semiparametric methods are a compromise between both: they
support a nonparametric modeling of certain features and profit from
the simplicity of parametric methods. Example: Cox Proportional
Hazard Model.

Survival analysis

It deals with "death" in biological organisms and failure in mechanical systems. Death or failure is
called an "event" in the survival analysis literature, and so models of death or failure are
generically termed event models.

Survival data arise in a literal form from trials concerning life-threatening conditions, but the
methodology can also be applied to other waiting times such as the duration of pain relief.

Censoring: Nearly every sample contains some cases that do not experience an event. If the
dependent variable is the time of the event, what do you do with these "censored" cases?

Survival analysis attempts to answer questions such as: what is the fraction of a population
which will survive past a certain time? Of those that survive, at what rate will they die or fail? Can
multiple causes of death or failure be taken into account? How do particular circumstances or
characteristics increase or decrease the odds of survival?

Time-dependent covariate: Many explanatory variables (like income or blood pressure) change
in value over time. How do you put such variables in a regression analysis?

Survival analysis is a group of statistical methods for analysis and interpretation of survival data.
Survival and hazard functions, and the methods of estimating parameters and testing hypotheses,
are the main part of analyses of survival data.

Main topics relevant to survival data analysis are: survival and hazard functions; types of
censoring; estimation of survival and hazard functions (the Kaplan-Meier and life table estimators);
simple life tables; comparison of survival functions (the logrank and Mantel-Haenszel tests,
Wilcoxon test); the proportional hazards model with time independent and time dependent covariates;
recurrent models; and methods for determining sample sizes.
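
The Kaplan-Meier estimator named above shows how censored cases are used rather than discarded. The following is a minimal sketch with invented follow-up data; the function name and input layout (parallel lists of times and event indicators) are assumptions made for illustration. At each observed event time the survival estimate is multiplied by one minus the fraction of subjects still at risk who experienced the event; censored subjects simply leave the risk set.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve.

    times[i]  : follow-up time of subject i
    events[i] : 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, estimated survival) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored cases both leave the risk set
    return curve


# Made-up data: the subject followed to time 2 is censored (no event seen).
times = [1, 2, 3, 4]
events = [1, 0, 1, 1]
for t, s in kaplan_meier(times, events):
    print(f"t = {t}: S(t) = {s:.3f}")
```

The censored subject contributes to the risk set at time 1 but not at time 3, which is why the second step divides by 2 rather than 3.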

Repeated Measures and
Longitudinal Data

Repeated measures and longitudinal data require special attention because they
involve correlated data that commonly arise when the primary sampling units are
measured repeatedly over time or under different conditions.

The experimental units are often subjects. Interest usually centers on between-subject
and within-subject effects. Between-subject effects are those whose values change
only from subject to subject and remain the same for all observations on a single
subject, for example, treatment and gender. Within-subject effects are those whose
values may differ from measurement to measurement.

Since measurements on the same experimental unit are likely to be correlated,
repeated measurements analysis must account for that correlation.

Normal theory models for split-plot experiments and repeated measures ANOVA can
be used to introduce the concept of correlated data.

PROC GLM, PROC GENMOD and PROC MIXED in the SAS system may be used.
Mixed linear models provide a general framework for modeling covariance structures, a
critical first step that influences parameter estimation and tests of hypotheses. The
primary objectives are to investigate trends over time and how they relate to treatment
groups or other covariates.

Techniques applicable to non-normal data, such as McNemar's test for binary data,
weighted least squares for categorical data, and generalized estimating equations
(GEE) are the main topics. The GEE method can be used to accommodate correlation
when the means at each time point are modeled using a generalized linear model.

Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models.
Biometrika 1986;73:13


Information Theory

Information theory is a branch of probability and mathematical statistics that deals with communication systems,
data transmission, cryptography, signal to noise ratios, data compression, etc. Claude Shannon is the
father of information theory. His theory considered the transmission of information as a statistical
phenomenon and gave communications engineers a way to determine the capacity of a communication
channel in terms of the common currency of bits. Shannon defined a measure of entropy as:

$$H = -\sum_i p_i \log p_i$$

that, when applied to an information source, could determine the capacity of the channel required
to transmit the source as encoded binary digits. The entropy is a measure of the amount of
uncertainty one has about which message will be chosen. It is defined as the average
information of a message from that message space.

Entropy as defined by Shannon is closely related to entropy as defined by physicists in statistical
thermodynamics. This work was the inspiration for adopting the term entropy in information theory. Other
useful measures of information include mutual information, which is a measure of the correlation between
two event sets. Mutual information is defined for two events X and Y as:

$$M(X, Y) = H(X) + H(Y) - H(X, Y)$$

where $H(X, Y)$ is the joint entropy defined as:

$$H(X, Y) = -\sum_i \sum_j p(x_i, y_j) \log p(x_i, y_j)$$

Mutual information is closely related to the log-likelihood ratio test for multinomial distribution, and to
Pearson's Chi-square test. The field of Information Science has since expanded to cover the full range of
techniques and abstract descriptions for the storage, retrieval and transmittal of information.
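
Entropy and mutual information are easy to compute for small discrete distributions. The sketch below assumes base-2 logarithms (so results are in bits) and uses the standard identity that mutual information equals H(X) + H(Y) minus the joint entropy H(X, Y); the function names and the example tables are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(joint):
    """Mutual information in bits from a joint table joint[i][j] = p(x_i, y_j)."""
    px = [sum(row) for row in joint]          # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]    # marginal distribution of Y
    h_xy = entropy([p for row in joint for p in row])
    return entropy(px) + entropy(py) - h_xy


print(entropy([0.5, 0.5]))   # a fair coin carries one bit
print(entropy([0.9, 0.1]))   # a biased coin carries less

independent = [[0.25, 0.25], [0.25, 0.25]]  # X tells nothing about Y
identical = [[0.5, 0.0], [0.0, 0.5]]        # X determines Y
print(mutual_information(independent))
print(mutual_information(identical))
```

The two tables mark the extremes: no shared information when the variables are independent, and the full entropy of either variable when one determines the other.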

Applications: coding theory, cryptographic systems, intelligence work, Bayesian
analysis, gambling, investing, etc.

Incomplete Data

Methods dealing with analysis of data with missing values can be
classified into:

Analysis of complete cases, including weighting,

Imputation methods, and extensions to multiple imputation, and

Methods that analyze the incomplete data directly without requiring
a rectangular data set, such as maximum likelihood and Bayesian methods.

Multiple imputation (MI) is a general paradigm for the analysis of
incomplete data. Each missing datum is replaced by m > 1 simulated
values, producing m simulated versions of the complete data. Each
version is analyzed by standard complete-data methods, and the
results are combined using simple rules to produce inferential
statements that incorporate missing data uncertainty. The focus is
on the practice of MI for real statistical problems in modern
computing environments.
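
The combining step can be sketched as follows, assuming the "simple rules" are the commonly used Rubin's rules (the slides do not name them). The pooled estimate is the average of the m completed-data estimates, and the total variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The function name and the numbers are made up.

```python
import statistics


def pool_rubin(estimates, variances):
    """Combine m completed-data analyses using Rubin's rules.

    estimates : point estimate from each imputed data set
    variances : squared standard error from each imputed data set
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)       # average sampling variance
    between = statistics.variance(estimates)   # spread caused by missing data
    total = within + (1 + 1 / m) * between
    return pooled, total


# Made-up results from m = 3 imputed data sets.
estimates = [10.0, 12.0, 11.0]
variances = [4.0, 4.0, 4.0]

pooled, total = pool_rubin(estimates, variances)
print(f"pooled estimate: {pooled:.2f}")
print(f"total variance:  {total:.3f} (standard error {total ** 0.5:.3f})")
```

The between-imputation term is what carries the missing data uncertainty: if all imputations gave the same estimate, the total variance would reduce to the ordinary complete-data variance.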


ANOVA programs generally produce all possible interactions, while regression
programs generally do not produce any interactions. So it's up to the user to construct
interaction terms to multiply together.

If the standard errors are high, the cause might be multicollinearity. But it is not the only
factor that can cause large SE's for estimators of "slope" coefficients in regression
models. SE's are inversely proportional to the range of variability in the predictor
variable. To increase the precision of estimators, we should increase the range of the
predictor variable.

Another cause of large SE's is a small number of "event" observations or a small
number of "non-event" observations.

There is also another cause of high standard errors: serial correlation, which arises when
using time-series data.

When X and W are category systems, the interaction model describes a two-way analysis
of variance (ANOVA) model; when X and W are (quasi-)continuous variables, it
describes a multiple linear regression (MLR) model.

In ANOVA contexts, the existence of an interaction can be described as a difference
between differences.

In MLR contexts, an interaction implies a change in the slope (of the regression of Y
on X) from one value of W to another value of W.
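
Both readings of an interaction can be sketched briefly. The regression form Y = b0 + b1*X + b2*W + b3*X*W used below is an assumption, since the model equation is not shown in the text; the coefficient values, cell means and function names are likewise made up for illustration.

```python
def interaction_column(x, w):
    """Build the interaction term by multiplying the two predictors element-wise."""
    return [xi * wi for xi, wi in zip(x, w)]

def slope_of_y_on_x(b1, b3, w):
    """Slope of Y on X at a given W, assuming Y = b0 + b1*X + b2*W + b3*X*W."""
    return b1 + b3 * w

def difference_between_differences(cell_means):
    """Interaction contrast for a 2x2 table of cell means [[m11, m12], [m21, m22]]."""
    (m11, m12), (m21, m22) = cell_means
    return (m11 - m12) - (m21 - m22)

# MLR view: with b3 != 0 the slope of Y on X changes as W changes.
xw = interaction_column([1, 2, 3], [0, 1, 2])
slope_at_w0 = slope_of_y_on_x(b1=2.0, b3=0.5, w=0)
slope_at_w4 = slope_of_y_on_x(b1=2.0, b3=0.5, w=4)

# ANOVA view: a nonzero difference between differences signals an interaction.
no_interaction = difference_between_differences([[10, 8], [6, 4]])
interaction = difference_between_differences([[10, 8], [6, 1]])
```

When b3 is zero the slope is the same at every W, and when the two row differences match the contrast is zero: in both cases there is no interaction.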

Sufficient Statistic

A sufficient estimator based on a statistic contains all the information
about the parameter which is present in the raw data. For example, the sum
of your data is sufficient to estimate the mean of the population. You do not have
to know the data set itself. This saves a lot ... Simply send out the
total and the sample size.

A sufficient statistic t for a parameter θ is a function of the sample
data x1,...,xn, which contains all information in the sample about the
parameter θ. More formally, sufficiency is defined in terms of the
likelihood function for θ. For a sufficient statistic t, the likelihood
L(x1,...,xn | θ) can be written as g(t | θ) * k(x1,...,xn). Since the
second term does not depend on θ, t is said to be a sufficient
statistic for θ.

To illustrate, let the observations be independent Bernoulli trials with
the same probability of success. Suppose that there are n trials, and
that person A observes which observations are successes, and
person B only finds out the number of successes. If B places these
successes at random points without replication, B and A will see the
same thing: the probability that B gets any given set of successes equals
the probability that A sees that set, whatever the true probability of success.
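
A small numerical check makes the Bernoulli case concrete. The sketch below, with made-up samples and hypothetical function names, shows that two different sequences with the same number of successes have the same likelihood for every success probability p, so the likelihood depends on the data only through the total.

```python
import math

def bernoulli_likelihood(data, p):
    """Likelihood of a full 0/1 sequence under independent Bernoulli(p) trials."""
    like = 1.0
    for x in data:
        like *= p if x == 1 else (1 - p)
    return like

def likelihood_from_total(total, n, p):
    """The same likelihood computed from the number of successes and n alone."""
    return p ** total * (1 - p) ** (n - total)

# Two made-up samples: different orderings, same number of successes (3 of 5).
sample_a = [1, 0, 1, 1, 0]
sample_b = [0, 1, 1, 0, 1]

agreement = all(
    math.isclose(bernoulli_likelihood(sample_a, p), bernoulli_likelihood(sample_b, p))
    and math.isclose(bernoulli_likelihood(sample_a, p), likelihood_from_total(3, 5, p))
    for p in (0.1, 0.3, 0.5, 0.9)
)
```

Here the factor k(x1,...,xn) in the factorization is simply 1, and g(t | p) is p^t (1 - p)^(n - t).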


Significance tests are based on assumptions: The data have to be
random, out of a well defined basic population, and one has to
assume that some variables follow a certain distribution. Power of a
test is the probability of correctly rejecting a false null hypothesis. It
is one minus the probability of making a Type II error (failing to
reject a false null hypothesis). A Type I error is rejecting a true null
hypothesis. For a fixed sample size, decreasing the probability of
making a Type I error will increase the probability of making a Type
II error.

Power and the True Difference between Population Means:

The distance between the two population means will affect the power of
our test: the larger the true difference, the greater the power.

Power as a Function of Sample Size and Variance:
Sample size
has an indirect effect on power because it affects the measure of
variance we use in the test. When n is large we will have a lower
standard error, and hence more power, than when n is small.
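
How power responds to the true difference and to the sample size can be sketched numerically. The example assumes a two-sided, two-sample z-test with a known common standard deviation and equal group sizes; this setting, the default significance level of 0.05 and the function name are assumptions made for illustration, not something specified in the text.

```python
from statistics import NormalDist

def two_sample_z_power(delta, sigma, n, alpha=0.05):
    """Power of a two-sided two-sample z-test with n observations per group."""
    std_normal = NormalDist()
    se = sigma * (2 / n) ** 0.5                 # standard error of the mean difference
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    shift = delta / se                          # true difference in standard-error units
    # Probability that the test statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Power grows with the true difference (fixed n) ...
power_small_delta = two_sample_z_power(delta=0.2, sigma=1.0, n=50)
power_large_delta = two_sample_z_power(delta=0.8, sigma=1.0, n=50)

# ... and with the sample size (fixed difference), via the smaller standard error.
power_small_n = two_sample_z_power(delta=0.5, sigma=1.0, n=10)
power_large_n = two_sample_z_power(delta=0.5, sigma=1.0, n=100)
```

With no true difference (delta = 0) the function returns the significance level itself, which is the Type I error rate.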

Pilot Studies:

When the estimates needed for the sample size
calculation are not available from an existing database, a pilot study is
needed for adequate estimation with a given precision.

ANOVA: Analysis of Variance

ANOVA tests the difference between two or more means. It does
this by examining the ratio of the variability between conditions
to the variability within each condition.

Say we give a drug that we believe will improve memory to a
group of people and give a placebo to another group of people.
We might measure memory performance by the number of
words recalled from a list we ask everyone to memorize. An
ANOVA test would compare the variability that we observe
between the two conditions to the variability observed within
each condition. Recall that we measure variability as the sum
of the squared differences of each score from the mean.

When the variability that we predict (between the two groups)
is much greater than the variability we don't predict (within
each group) then we will conclude that our treatments produce
different results.
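
The between/within comparison can be sketched for the memory example. The recall scores below are made up, and the function is a minimal one-way ANOVA F-ratio rather than a full test (it does not compute a p-value).

```python
from statistics import fmean

def one_way_anova_f(groups):
    """F ratio: between-group mean square divided by within-group mean square."""
    all_scores = [x for g in groups for x in g]
    grand_mean = fmean(all_scores)
    # Variability we predict: group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variability we do not predict: scores around their own group mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical numbers of words recalled in each condition.
drug = [12, 14, 11, 15, 13]
placebo = [9, 10, 8, 11, 7]

f_ratio = one_way_anova_f([drug, placebo])
```

For these made-up scores the between-group sum of squares is 40 and the within-group sum of squares is 20, giving F = 16 on 1 and 8 degrees of freedom; a large ratio like this is what leads to the conclusion that the treatments differ.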

Data Mining and Knowledge Discovery

It uses sophisticated statistical analysis and modeling techniques to uncover patterns
and relationships hidden in organizational databases.

It aims at tools and techniques to process structured information, from databases to data
warehouses to data mining and knowledge discovery. Data warehouse
applications have become business-critical.

Data mining can extract even more value out of these huge repositories of information. The
continuing rapid growth of on-line data and the widespread use of databases
necessitate the development of techniques for extracting useful knowledge and for
facilitating database access.

The challenge of extracting knowledge from data is of common interest to several
fields, including statistics, databases, pattern recognition, machine learning, data
visualization, optimization, and high-performance computing.

The data mining process involves identifying an appropriate data set to "mine" or sift
through to discover data content relationships. Data mining tools include techniques
like case-based reasoning, cluster analysis, data visualization, fuzzy query and
analysis, and neural networks. Data mining sometimes resembles the traditional
scientific method of identifying a hypothesis and then testing it using an appropriate
data set.

At other times it is reminiscent of what happens when data have been collected, no
significant results were found, and an ad hoc, exploratory analysis is then conducted to
find a significant relationship.

Data mining is the process of extracting knowledge from data. For clever marketers, that