Statistics And Application
Revealing Facts From Data
What Is Statistics
• Statistics is a mathematical science pertaining to the collection, analysis, interpretation, and presentation of data.
• It is applicable to a wide variety of academic disciplines, from the physical and social sciences to the humanities, as well as to business, government, medicine, and industry.
Statistics Is …
• Almost every professional needs statistical tools.
• Statistical skills enable you to intelligently collect, analyze, and interpret data relevant to your decision-making.
• Statistical concepts enable us to solve problems in a diversity of contexts.
• Statistical thinking enables you to add substance to your decisions.
Statistics is a science
• Statistics assists you in making decisions under uncertainty. The decision-making process must be based on data, not on personal opinion or belief.
• It is an accepted view that "Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write." So, let us be ahead of our time.
• In the US, students learn statistics beginning in middle school.
Type Of Statistics
• Descriptive statistics deals with the description problem: can the data be summarized in a useful way, either numerically or graphically, to yield insight about the population in question? Basic numerical descriptors include the mean and standard deviation. Graphical summarizations include various kinds of charts and graphs.
• Inferential statistics is used to model patterns in the data, accounting for randomness and drawing inferences about the larger population. These inferences may take the form of answers to yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), prediction of future observations, descriptions of association (correlation), or modeling of relationships (regression). Other modeling techniques include ANOVA, time series, and data mining.
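The two types above can be illustrated with a short sketch (a hypothetical example using Python's standard-library statistics module): the descriptive step summarizes the sample itself, while a simple inferential step uses the sample to say something about the larger population via an approximate confidence interval.

```python
import statistics

# A small sample of measurements (hypothetical data).
sample = [4.1, 5.0, 4.8, 5.3, 4.6, 5.1, 4.9, 5.2]

# Descriptive statistics: summarize the data at hand.
mean = statistics.mean(sample)
stdev = statistics.stdev(sample)  # sample standard deviation

# A simple inferential step: an approximate 95% confidence
# interval for the population mean (normal approximation).
half_width = 1.96 * stdev / len(sample) ** 0.5
interval = (mean - half_width, mean + half_width)

print(f"mean={mean:.3f}, stdev={stdev:.3f}")
print(f"approx. 95% CI for population mean: "
      f"({interval[0]:.3f}, {interval[1]:.3f})")
```

The interval, unlike the mean and standard deviation, is a statement about the unobserved population rather than the observed sample, which is exactly the descriptive/inferential divide.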
Type of Studies
• There are two major types of causal statistical studies: experimental studies and observational studies. In both, the effect of differences in an independent variable (or variables) on the behavior of the dependent variable is observed. The difference between the two types is in how the study is actually conducted. Each can be very effective.
• An experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine whether the manipulation has modified the values of the measurements. In contrast, an observational study does not involve experimental manipulation. Instead, data are gathered and correlations between predictors and the response are investigated.
Type of Statistical Courses
Two types:
• Greater statistics is everything related to learning from data, from the first planning or collection to the last presentation or report, grounded in a deep respect for data and truth.
• Lesser statistics is the body of statistical methodology detached from data or truth, reduced to arithmetic exercises. If a certain assumption is needed to justify a procedure, one simply "assumes the ... are normally distributed," no matter how unlikely that might be.
Statistical Models
• Statistical models are currently used in various fields of business and science.
• The terminology differs from field to field. For example, the fitting of models to data, variously called calibration, history matching, or data assimilation, is synonymous with parameter estimation.
Data Analysis
• Developments in statistical data analysis often parallel or follow advancements in other fields to which statistical methods are fruitfully applied.
• Decision-making under uncertainty is largely based on the application of statistical data analysis for probabilistic risk assessment of your decision.
(cont.)
• Decision makers need to lead others to apply statistical thinking in day-to-day activities, and
• they need to apply the concept for the purpose of continuous improvement.
Is Data Information?
• The database in your office contains a wealth of information.
• The decision technology group members tap only a fraction of it.
• Employees waste time scouring multiple sources for the data they need.
• The decision makers are frustrated because they cannot get business-critical data exactly when they need it.
• Therefore, too many decisions are based on guesswork, not facts. Many opportunities are also missed, if they are even noticed at all.
• Data itself is not information, but it can generate information.
Knowledge
• Knowledge is what we know well. Information is the communication of knowledge.
• In every knowledge exchange, the sender makes common what was private: the sender does the informing, the communicating.
• Information can be classified into explicit and tacit forms.
• Explicit information can be expressed in structured form, while tacit information is inconsistent and fuzzy to explain.
• Know that data are only crude information, not knowledge by themselves.
Data → Knowledge (?)
• Data is known to be crude information, not knowledge by itself.
• The sequence from data to knowledge is: from data to information, from information to facts, and finally, from facts to knowledge.
• Data becomes information when it becomes relevant to your decision problem.
• Information becomes fact when the data can support it. Facts are what the data reveals.
• However, decisive instrumental (i.e., applied) knowledge is expressed together with some statistical degree of confidence.
Fact → Knowledge
Fact becomes knowledge when it is used in the successful completion of a statistical process.
Statistical Analysis
• As the exactness of a statistical model increases, the level of improvement in decision-making increases: this is the reason for using statistical data analysis.
• Statistical data analysis arose from the need to place knowledge on a systematic evidence base.
• Statistics is a study of the laws of probability, the development of measures of data properties and relationships, and so on.
Statistical Inference
• Verify the statistical hypothesis: determine whether any statistical significance can be attached to a result after due allowance is made for random variation as a source of error.
• Intelligent and critical inferences cannot be made by those who do not understand the purpose, the conditions, and the applicability of the various techniques for judging significance.
• Considering the uncertain environment, the chance that "good decisions" are made increases with the availability of "good information." The chance that "good information" is available increases with the level of structuring of the process of Knowledge Management.
Knowledge Needs Wisdom
• Wisdom is the power to put our time and our knowledge to the proper use.
• Wisdom is the accurate application of accurate knowledge.
• Wisdom is about knowing how technical staff can best be used to meet the needs of the decision maker.
History Of Statistics
• The word statistics ultimately derives from the modern Latin term statisticum collegium ("council of state") and the Italian word statista ("statesman" or "politician").
• The birth of statistics occurred in the mid-17th century. A commoner named John Graunt, a native of London, began reviewing a weekly church publication issued by the local parish clerk that listed the number of births, christenings, and deaths in each parish. These so-called Bills of Mortality also listed the causes of death. Graunt, who was a shopkeeper, organized these data in the form we now call descriptive statistics, published as Natural and Political Observations Made upon the Bills of Mortality. Shortly thereafter, he was elected a member of the Royal Society. Statistics has since borrowed some concepts from sociology, such as the concept of "population." It has been argued that since statistics usually involves the study of human behavior, it cannot claim the precision of the physical sciences.
Statistics is for Government
• The original principal purpose of Statistik was data to be used by governmental and (often centralized) administrative bodies. The collection of data about states and localities continues, largely through national and international statistical services.
• Censuses provide regular information about the population.
• During the 20th century, the creation of precise instruments for public health concerns (epidemiology, biostatistics, etc.) and for economic and social purposes (the unemployment rate, econometrics, etc.) necessitated substantial advances in statistical practices.
History of Probability
• Probability has a much longer history. "Probability" is derived from the verb "to probe," meaning to find out what is not too easily accessible or understandable. The word "proof" has the same origin: it provides the necessary details to understand what is claimed to be true.
• Probability originated from the study of games of chance and gambling during the sixteenth century. Probability theory was developed as a branch of mathematics by Blaise Pascal and Pierre de Fermat in the seventeenth century.
• Today, in the 21st century, probabilistic modeling is used to control the flow of traffic through a highway system, a telephone interchange, or a computer processor; to find the genetic makeup of individuals or populations; and in quality control, insurance, investment, and other sectors of business and industry.
Stat Merge With Prob
• Statistics eventually merged with the field of inverse probability, referring to the estimation of a parameter from experimental data in the experimental sciences (most notably astronomy).
• Today the use of statistics has broadened far beyond the service of a state or government to include such areas as business, the natural and social sciences, and medicine, among others.
• Statistics emerged in part from probability theory, which can be dated to the correspondence of Pierre de Fermat and Blaise Pascal (1654). Christiaan Huygens (1657) gave the earliest known scientific treatment of the subject. Jakob Bernoulli's Ars Conjectandi (posthumous, 1713) and Abraham de Moivre's Doctrine of Chances (1718) treated the subject as a branch of mathematics.
Development in the 18th–19th Centuries
• The theory of errors may be traced back to Roger Cotes's Opera Miscellanea (posthumous, 1722), but a memoir prepared by Thomas Simpson in 1755 (printed 1756) first applied the theory to the discussion of errors of observation.
• Daniel Bernoulli (1778) introduced the principle of the maximum product of the probabilities of a system of concurrent errors.
• The method of least squares, used to minimize errors in data measurement, is due to Adrien-Marie Legendre (1805), Robert Adrain (1808), and Carl Gauss (1809), motivated by problems of survey measurements and the reconciliation of disparate physical measurements.
• General theory in statistics: Laplace (1810, 1812), Gauss (1823), James Ivory (1825, 1826), Hagen (1837), Friedrich Bessel (1838), W. F. Donkin (1844, 1856), and Morgan Crofton (1870). Other contributors were Ellis (1844), De Morgan (1864), Glaisher (1872), and Giovanni Schiaparelli (1875).
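The method of least squares mentioned above can be sketched in a few lines of Python (hypothetical data; the closed-form slope and intercept are the standard simple-linear-regression solution minimizing the sum of squared residuals):

```python
# Ordinary least squares for a line y = a + b*x, via the
# classical closed-form solution (hypothetical data points).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b = sum((x-mean_x)(y-mean_y)) / sum((x-mean_x)^2),
# intercept a = mean_y - b * mean_x.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
b = sxy / sxx
a = mean_y - b * mean_x

print(f"fitted line: y = {a:.3f} + {b:.3f}x")
```

This is the same minimization principle Legendre, Adrain, and Gauss applied to reconcile disparate survey measurements.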
Statistics in the 20th Century
Karl Pearson (March 27, 1857 – April 27, 1936) was a major contributor to the early development of statistics. Pearson's work was all-embracing in the wide application and development of mathematical statistics, and encompassed the fields of biology, epidemiology, anthropometry, medicine, and social history. His main contributions are:
• Linear regression and correlation. The Pearson product-moment correlation coefficient was the first important effect size to be introduced into statistics;
• Classification of distributions, which forms the basis for much of modern statistical theory; in particular, the exponential family of distributions underlies the theory of generalized linear models;
• Pearson's chi-square test.
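Pearson's product-moment correlation coefficient can be computed directly from its definition; a minimal sketch with hypothetical paired measurements:

```python
import math

# Pearson product-moment correlation coefficient:
# r = sum((x-mx)(y-my)) / sqrt(sum((x-mx)^2) * sum((y-my)^2))
def pearson_r(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired measurements (heights in cm, weights in kg).
heights = [160, 165, 170, 175, 180]
weights = [55, 60, 64, 71, 77]
print(f"r = {pearson_r(heights, weights):.3f}")
```

Values of r near +1 or -1 indicate a strong linear association; values near 0 indicate little linear association.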
Sir Ronald Aylmer Fisher, FRS (17 February 1890 – 29 July 1962)
Fisher invented the techniques of maximum likelihood and analysis of variance, and originated the concepts of sufficiency, ancillarity, Fisher's linear discriminator, and Fisher information. His 1924 article "On a distribution yielding the error functions of several well known statistics" presented Karl Pearson's chi-squared and Student's t in the same framework as the normal distribution and his own analysis-of-variance distribution z (more commonly used today in the form of the F distribution). These contributions easily made him a major figure in 20th-century statistics. He began the field of non-parametric statistics, and entropy as well as Fisher information were essential for developing Bayesian analysis.
Statistics in the 20th Century
• Gertrude Mary Cox (January 13, 1900 – 1978): experimental design.
• Charles Edward Spearman (September 10, 1863 – September 7, 1945): non-parametric analysis, the rank correlation coefficient.
• Chebyshev's inequality.
• Lyapunov's central limit theorem.
• John Wilder Tukey (June 16, 1915 – July 26, 2000): jackknife estimation, exploratory data analysis, and confirmatory data analysis.
• George Bernard Dantzig (8 November 1914 – 13 May 2005): developed the simplex method and furthered linear programming; advanced the fields of decomposition theory, sensitivity analysis, complementary pivot methods, large-scale optimization, nonlinear programming, and programming under uncertainty.
• Bayes' theorem.
• Sir David Roxbee Cox (born Birmingham, England, 1924) has made pioneering and important contributions to numerous areas of statistics and applied probability, of which the best known is perhaps the proportional hazards model, widely used in the analysis of survival data.
Schools of Thought in Statistics
• The Classical, attributed to Laplace;
• Relative Frequency, attributed to Fisher;
• Bayesian, attributed to Savage.
What Type of Statistician Are You?
Classic Statistics
• The problem with the Classical Approach is that what constitutes an outcome is not objectively determined. One person's simple event is another person's compound event. One researcher may ask, of a newly discovered planet, "What is the probability that life exists on the new planet?" while another may ask, "What is the probability that carbon-based life exists on it?"
• Bruno de Finetti, in the introduction to his two-volume treatise on Bayesian ideas, clearly states that "probabilities do not exist." By this he means that probabilities are not located in coins or dice; they are not characteristics of things, like mass, density, etc.
Relative Frequency Statistics
• Consider probabilities as "objective" attributes of things (or situations) which are really out there (the availability of data).
• Use only the data at hand to make interpretations.
• Even when substantial prior information is available, frequentists do not use it, while Bayesians are willing to assign probability distribution function(s) to the population's parameter(s).
Bayesian approaches
• Consider probability theory as an extension of deductive logic (including dialogue logic, interrogative logic, informal logic, and artificial intelligence) to handle uncertainty.
• The first principle is that the uniquely correct starting point is your belief about the state of things (the prior), which you update in the light of the evidence.
• The laws of probability have the same status as the laws of logic.
• Bayesian approaches are explicitly "subjective" in the sense that they deal with the plausibility which a rational agent ought to attach to the propositions he/she considers, "given his/her current state of knowledge and experience."
Discussion
• From a scientist's perspective, there are grounds to object to Bayesian reasoning: it deals not with objective but with subjective probabilities. The result is that reasoning using a Bayesian approach cannot be independently checked, which, like non-replicable experiments, limits its value to science.
• Bayesian perspectives often shed a helpful light on classical procedures. It is necessary to go into a Bayesian framework to give confidence intervals a direct probabilistic interpretation. This insight is helpful in drawing attention to the point that a different prior distribution would lead to a different interval.
• A Bayesian may cheat by basing the prior distribution on the data; for coherence to hold, priors must be personal and fixed before the study, which is more complex.
• Objective Bayesianism: there is a clear connection between probability and logic; both appear to tell us how we should reason. But how, exactly, are the two concepts related? Objective Bayesians offer one answer to this question.
Steps Of The Analysis
1. Defining the problem: An exact definition of the problem is imperative in order to obtain accurate data about it.
2. Collecting the data: Designing ways to collect data is an important job in statistical data analysis. Population and sample are vital aspects.
3. Analyzing the data: Exploratory methods are used to discover what the data seem to be saying, using simple arithmetic and easy-to-draw pictures to summarize the data. Confirmatory methods use ideas from probability theory in the attempt to answer specific questions.
4. Reporting the results.
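Step 3 can be illustrated in miniature (hypothetical data throughout): an exploratory pass summarizes each group with simple arithmetic, then a confirmatory pass computes a Welch two-sample t statistic to judge whether the observed difference is large relative to the noise.

```python
import statistics

# Hypothetical measurements from two groups.
group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
group_b = [12.8, 13.1, 12.9, 13.3, 12.7, 13.0]

# Exploratory: look at the data first with simple summaries.
for name, g in (("A", group_a), ("B", group_b)):
    print(f"group {name}: mean={statistics.mean(g):.2f}, "
          f"stdev={statistics.stdev(g):.2f}")

# Confirmatory: Welch's t statistic,
# t = (mean_a - mean_b) / sqrt(var_a/n_a + var_b/n_b).
va = statistics.variance(group_a) / len(group_a)
vb = statistics.variance(group_b) / len(group_b)
t = (statistics.mean(group_a) - statistics.mean(group_b)) / (va + vb) ** 0.5
print(f"t = {t:.2f}")  # compare against a t distribution for a p-value
```

A large |t| suggests the group difference is unlikely to be due to random variation alone; the final p-value step (step 4's reportable result) would use a t distribution.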
Type of Data, Levels of Measurement & Errors
• Qualitative and quantitative
• Discrete and continuous
• Nominal, ordinal, interval, and ratio
• Types of error: recording error, typing error, transcription error (incorrect copying), inversion (e.g., 123.45 is typed as 123.54), repetition (when a number is repeated), deliberate error, etc.
Data Collection: Experiments
• An experiment is a set of actions and observations, performed to solve a given problem or to test a hypothesis or research question concerning phenomena. It is an empirical approach to acquiring deeper knowledge about the physical world.
• Design of experiments
In the "hard" sciences, design tends to focus on the elimination of extraneous effects; in the "soft" sciences, it focuses more on the problems of external validity, by using statistical methods. Events also occur naturally from which scientific evidence can be drawn, which is the basis for natural experiments.
• Controlled experiments
To demonstrate a cause-and-effect hypothesis, an experiment must often show that, for example, a phenomenon occurs after a certain treatment is given to a subject, and that the phenomenon does not occur in the absence of the treatment. A controlled experiment generally compares the results obtained from an experimental sample against a control sample, which is practically identical to the experimental sample except for the one aspect whose effect is being tested.
Data Collection: Experiments
• Natural experiments or quasi-experiments
Natural experiments rely solely on observations of the variables of the system under study, rather than manipulation of just one or a few variables as occurs in controlled experiments. Much research in several important science disciplines, including geology, paleontology, ecology, meteorology, and astronomy, relies on quasi-experiments.
• Observational studies
Observational studies are very much like controlled experiments except that they lack probabilistic equivalency between groups. These types of studies often arise in medicine where, for ethical reasons, it is not possible to create a truly controlled group.
• Field experiments
Named in order to draw a contrast with laboratory experiments, field experiments are often used in the social sciences, economics, etc. They suffer from the possibility of contamination: experimental conditions can be controlled with more precision and certainty in the lab.
Data Analysis
• Data analysis follows different approaches in the different applied fields below.
Applied Statistics
Actuarial science
Applies mathematical and statistical methods to finance and insurance, particularly to the assessment of risk. Actuaries are professionals who are qualified in this field.
Actuarial science
• Actuarial science is the discipline that applies mathematical and statistical methods to assess risk in the insurance and finance industries. Actuaries are professionals who are qualified in this field through examinations and experience.
• Actuarial science includes a number of interrelated subjects, including probability and statistics, finance, and economics. Historically, actuarial science used deterministic models in the construction of tables and premiums. The science has gone through revolutionary changes during the last 30 years due to the proliferation of high-speed computers and the synergy of stochastic actuarial models with modern financial theory (Frees 1990).
• Many universities have undergraduate and graduate degree programs in actuarial science. In 2002, a Wall Street Journal survey on the best jobs in the United States listed "actuary" as the second-best job (Lee 2002).
Where Do Actuaries Work and What Do They Do?
The insurance industry can't function without actuaries, and that's where most of them work. They calculate the costs of assuming risk: how much to charge policyholders for life or health insurance premiums, or how much an insurance company can expect to pay in claims when the next hurricane hits Florida.
Actuaries provide a financial evaluation of risk for their companies to be used for strategic
management decisions. Because their judgement is heavily relied upon, actuaries'
career paths often lead to upper management and executive positions.
When other businesses that do not have actuaries on staff need certain financial advice, they hire actuarial consultants. A consultant can be self-employed in a one-person practice or work for a nationwide consulting firm. Consultants help companies design pension and benefit plans and evaluate assets and liabilities. By delving into the financial complexities of corporations, they help companies calculate the cost of a variety of business risks. Consulting actuaries rub elbows with chief financial officers, operating and human resource executives, and often chief executive officers.
Actuaries work for the government too, helping manage such programs as the Social
Security system and Medicare. Since the government regulates the insurance
industry and administers laws on pensions and financial liabilities, it also needs
actuaries to determine whether companies are complying with the law.
Who else asks an actuary to assess risks and solve thorny statistical and financial problems? You name it: banks and investment firms, large corporations, public accounting firms, insurance rating bureaus, labor unions, and fraternal organizations.
Typical actuarial projects:
• Analyzing insurance rates, such as for cars, homes, or life insurance.
• Estimating the money to be set aside for claims that have not yet been paid.
• Participating in corporate planning, such as mergers and acquisitions.
• Calculating a fair price for a new insurance product.
• Forecasting the potential impact of catastrophes.
• Analyzing investment programs.
VEE – Applied Statistical Methods
Courses that meet this requirement may be taught in the mathematics, statistics, or economics department, or in the business school. In economics departments, this course may be called Econometrics. The material could be covered in one course or two. The mathematical sophistication of these courses will vary widely, and all levels are intended to be acceptable. Some analysis of real data should be included. Most of the topics listed below should be covered:
• Probability. 3 pts.
• Statistical Inference. 3 pts.
• Linear Regression Models. 3 pts.
• Time Series Analysis. 3 pts.
• Survival Analysis. 3 pts.
• Elementary Stochastic Processes. 3 pts.
• Simulation. 3 pts.
• Introduction to the Mathematics of Finance. 3 pts.
• Statistical Inference and Time-Series Modelling. 3 pts.
• Stochastic Methods in Finance. 3 pts.
• Stochastic Differential Equations and Applications. 3 pts.
• Advanced Data Analysis. 3 pts.
• Data Mining. 3 pts.
• Statistical Methods in Finance. 3 pts.
• Nonparametric Statistics. 3 pts.
• Stochastic Processes and Applications. 3 pts.
Some Books
• Generalized Linear Models for Insurance Data, by Piet de Jong and Gillian Z. Heller
• Stochastic Claims Reserving Methods in Insurance (The Wiley Finance Series), by Mario V. Wüthrich and Michael Merz
• Actuarial Modelling of Claim Counts: Risk Classification, Credibility and Bonus-Malus Systems, by Michel Denuit, Xavier Marechal, Sandra Pitrebois and Jean-Francois Walhin
• Loss Models: From Data to Decisions (Wiley Series in Probability and Statistics), by Stuart A. Klugman, Harry H. Panjer and Gordon E. Willmot
Biostatistics or Biometry
• Biostatistics or biometry is the application of statistics to a wide range of topics in biology:
• Public health, including epidemiology, nutrition, and environmental health;
• Design and analysis of clinical trials in medicine;
• Genomics, population genetics, and statistical genetics, in order to link variation in genotype with variation in phenotype;
• Ecology;
• Biological sequence analysis.
Data Mining
• Data mining, or Knowledge Discovery in Databases (KDD), is the process of automatically searching large volumes of data for patterns.
• It is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.
• Data mining involves the process of analyzing data.
• Data mining is a fairly recent and contemporary topic in computing.
• Data mining applies many older computational techniques from statistics, machine learning, and pattern recognition.
Data Mining and Business Intelligence
Layers, from data sources up to decision making, with increasing potential to support business decisions (typical user in parentheses):
• Decision Making (End User)
• Data Presentation: visualization techniques (Business Analyst)
• Data Mining: information discovery (Data Analyst)
• Data Exploration: statistical summary, querying, and reporting
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: paper, files, Web documents, scientific experiments, database systems
Data Mining: Confluence of Multiple Disciplines
Data mining draws on database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines.
Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
– Relational databases, data warehouses, transactional databases
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structured data, graphs, social networks, and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia databases
– Text databases
– The World Wide Web
Top-10 Most Popular DM Algorithms:
18 Identified Candidates (I)
• Classification
– #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
– #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984.
– #3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. TPAMI. 18(6).
– #4. Naive Bayes: Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398.
• Statistical Learning
– #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag.
– #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.
• Association Analysis
– #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
– #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00.
The 18 Identified Candidates (II)
• Link Mining
– #9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In WWW-7, 1998.
– #10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked environment. SODA, 1998.
• Clustering
– #11. K-Means: MacQueen, J. B., Some methods for classification and analysis of multivariate observations, in Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, 1967.
– #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96.
• Bagging and Boosting
– #13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.
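To make one of the listed algorithms concrete, here is a minimal sketch of k-means (#11 above) on one-dimensional data: Lloyd's algorithm alternates an assignment step and a centroid-update step (toy data and starting centroids are hypothetical; real implementations handle multiple dimensions, convergence checks, and restarts).

```python
# Minimal 1-D k-means: alternate assignment and centroid update.
def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster's mean
        # (keep the old centroid if its cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two well-separated groups (hypothetical data).
data = [1.0, 1.2, 0.8, 9.8, 10.1, 10.4]
centers, clusters = kmeans_1d(data, centroids=[0.0, 5.0])
print(centers)
```

On this data the centroids converge to the two group means after a single pass; the same alternation underlies large-scale variants such as BIRCH (#12).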
The 18 Identified Candidates (III)
• Sequential Patterns
– #14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology, 1996.
– #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE '01.
• Integrated Mining
– #16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and association rule mining. KDD-98.
• Rough Sets
– #17. Finding reduct: Zdzislaw Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Norwell, MA, 1992.
• Graph Mining
– #18. gSpan: Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In ICDM '02.
Major Issues in Data Mining
• Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed, and incremental mining methods
– Integration of the discovered knowledge with existing knowledge: knowledge fusion
• User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
– Domain-specific data mining and invisible data mining
– Protection of data security, integrity, and privacy
Challenge Problems in Data Mining
• Developing a unifying theory of data mining
• Scaling up for high-dimensional data and high-speed data streams
• Mining sequence data and time series data
• Mining complex knowledge from complex data
• Data mining in a network setting
• Distributed data mining and mining multi-agent data
• Data mining for biological and environmental problems
• Data-mining-process related problems
• Security, privacy, and data integrity
• Dealing with non-static, unbalanced, and cost-sensitive data
Recommended Reference Books
• S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
• R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2000
• T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
• U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001
• J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed., 2006
• D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
• T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, 2001
• B. Liu, Web Data Mining, Springer, 2006
• T. M. Mitchell, Machine Learning, McGraw Hill, 1997
• G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
• P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
• S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
• I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed., 2005
Economic statistics
• Economic statistics is a branch of applied statistics focusing on the collection, processing, compilation and dissemination of statistics concerning the economy of a region, a country or a group of countries.
• Economic statistics is also referred to as a subtopic of official statistics, since most economic statistics are produced by official organizations (e.g. statistical institutes, supranational organizations, central banks, ministries, etc.).
• Economic statistics provide the empirical data needed in economic research (econometrics), and they are the basis for decision making and economic policy making.
Econometrics
Econometrics is concerned with developing and applying quantitative or statistical methods to the study and elucidation of economic principles. It combines economic theory with statistics to analyze and test economic relationships. Theoretical econometrics considers questions about the statistical properties of estimators and tests, while applied econometrics is concerned with applying econometric methods to assess economic theories. Although the first known use of the term "econometrics" was by Pawel Ciompa in 1910, Ragnar Frisch is credited with coining the term in the sense in which it is used today.
Method in Econometrics
• Although many econometric methods represent applications of standard statistical models, some special features of economic data distinguish econometrics from other branches of statistics.
• Economic data are generally observational rather than derived from controlled experiments. Because the individual units in an economy interact with each other, the observed data tend to reflect complex economic equilibrium conditions rather than simple behavioral relationships based on preferences or technology. Consequently, econometrics has developed methods for identification and estimation of simultaneous-equation models. These methods allow researchers to make causal inferences in the absence of controlled experiments.
• Early work in econometrics focused on time-series data, but econometrics now also fully covers cross-sectional and panel data.
Data in Econometrics
• Data are broadly classified according to the number of dimensions.
• A data set containing observations on a single phenomenon observed over multiple time periods is called time series. In time series data, both the values and the ordering of the data points have meaning.
• A data set containing observations on multiple phenomena observed at a single point in time is called cross-sectional. In cross-sectional data sets, the values of the data points have meaning, but the ordering of the data points does not.
• A data set containing observations on multiple phenomena observed over multiple time periods is called panel data. Alternatively, the second dimension may be some entity other than time. For example, when there is a sample of groups, such as siblings or families, with several observations from every group, the data are panel data. Whereas time series and cross-sectional data are both one-dimensional, panel data sets are two-dimensional.
• Data sets with more than two dimensions are typically called multi-dimensional panel data.
Program
• Research Area: Theoretical econometrics, including time series analysis, nonparametric and semi-parametric estimation, panel data analysis, and financial econometrics; applied econometrics, including applied labor economics and empirical finance.
• Courses:
Probability and Statistics
Advanced Econometrics
Time Series Models
Micro Econometrics
Panel Data Econometrics
Financial Econometrics
Nonparametric and Semi-parametric Econometrics
Lecture on Advanced Econometrics
Data Analysis in Academic Research (using SAS)
Statistics and Data Analysis for Economics
Nonlinear Models
Some Research at the U. of Chicago
Proposal: "Selective Publicity and Stock Prices" By: David Solomon
Proposal: "Activating Self-Control: Isolated vs. Interrelated Temptations" By: Kristian Myrseth
Proposal: "Buyer's Remorse: When Evaluation is Based on Simulation Before You Chose but Deliberation After" By: Yan Zhang
Proposal: "Brokerage, Second-Hand Brokerage and Difficult Working Relationships: The Role of the Informal Organization on Speaking Up about Difficult Relationships and Being Deemed Uncooperative by Co-Workers" By: Jennifer Hitler
Proposal: "Resource Space Dynamics in the Evolution of Industries: Formation, Expansion and Contraction of the Resource Space and its Effects on the Survival of Organizations" By: Aleksios Gotsopoulos
Defense: "An Examination of Status Dynamics in the U.S. Venture Capital Industry" By: Young-Kyu Kim
Defense: "Group Dynamics and Contact: A Natural Experiment" By: Arjun Chakravarti
Defense: "Essays in Corporate Governance" By: Ashwini Agrawal
Some Research at the U. of Chicago
Defense: "Essays on Consumer Finance" By: Brian Melzer
Defense: "Male Incarceration and Teen Fertility" By: Amee Kamdar
Defense: "Essays on Economic Fundamentals in Asset Pricing" By: Jie (Jennie) Bai
Defense: "Asset-Intensity and the Cross-Section of Stock Returns" By: Raife Giovinazzo
Defense: "Essays on Household Behavior" By: Marlena Lee
Proposal: "How (Un)Accomplished Goal Actions Affect Goal Striving and Goal Setting" By: Minjung Koo
Defense: "Empirical Entry Games with Complementarities: An Application to the Shopping Center Industry" By: Maria Ana Vitorino
Defense: "Betas, Characteristics, and the Cross-Section of Hedge Fund Returns" By: Mark Klebanov
Defense: "Expropriation Risk and Technology" By: Marcus Opp
Defense: "Essays in Corporate Finance and Real Estate" By: Itzhak Ben-David
Proposal: "Group Dynamics and Interpersonal Contact: A Natural Experiment" By: Arjun Chakravarti
Proposal: "Structural Estimation of a Moral Hazard Model: An Application to Industrial Selling" By: Renna Jiang
Proposal: "Status, Quality, and Earnings Announcements: An Analysis of the Effect of News which Confirms or Contradicts the Status-Quality Correlation on the Stock of a Company" By: Daniela Lup
Defense: "Diversification and its Discontents: Idiosyncratic and Entrepreneurial Risk in the Quest for Social Status" By: Nick Roussanov
Summary of Econometrics
• It is a combination of mathematical economics, statistics, economic statistics and economic theory.
• Regression analysis is popular.
• Time-series analysis and cross-sectional analysis are useful.
• Panel analyses relate to multi-dimensional regression.
• Fixed effect models: there are unique attributes of individuals that are not the results of random variation and that do not vary across time. Adequate if we want to draw inferences only about the examined individuals.
• Random effect models: there are unique, time-constant attributes of individuals that are the results of random variation and do not correlate with the individual regressors. This model is adequate if we want to draw inferences about the whole population, not only the examined sample.
References
• Arellano, Manuel, 2003. Panel Data Econometrics, Oxford University Press.
• Hsiao, Cheng, 2003. Analysis of Panel Data, Cambridge University Press.
• Davies, A. and Lahiri, K., 2000. "Re-examining the Rational Expectations Hypothesis Using Panel Data on Multi-Period Forecasts," Analysis of Panels and Limited Dependent Variable Models, Cambridge University Press.
• Davies, A. and Lahiri, K., 1995. "A New Framework for Testing Rationality and Measuring Aggregate Shocks Using Panel Data," Journal of Econometrics 68: 205-227.
• Frees, E., 2004. Longitudinal and Panel Data, Cambridge University Press.
Engineering Statistics
• Design of experiments (DOE) uses statistical techniques to test and construct models of engineering components and systems.
• Quality control and process control use statistics as tools to manage the conformance of manufacturing processes and their products to specifications.
• Time and methods engineering uses statistics to study repetitive operations in manufacturing in order to set standards and find optimum (in some sense) manufacturing procedures.
Statistical Physics
• Statistical physics uses methods of statistics to solve physical problems of a stochastic nature.
• The term statistical physics encompasses probabilistic and statistical approaches to classical mechanics and quantum mechanics; hence it is also called statistical mechanics.
• It works well in classical systems when the number of degrees of freedom is so large that an exact solution is not possible, or not really useful.
• Statistical mechanics can also describe work in non-linear dynamics, chaos theory, thermal physics, fluid dynamics (particularly at low Knudsen numbers), and plasma physics.
Demography
• The study of human population dynamics. It encompasses the study of the size, structure and distribution of populations, and how populations change over time due to births, deaths, migration and ageing.
• Methods include census returns and vital statistics registers, or incorporate survey data using indirect estimation techniques.
Psychological Statistics
The application of statistics to psychology. Some of the more commonly used statistical tests in psychology are: Student's t-test, chi-square, ANOVA, ANCOVA, MANOVA, regression analysis, correlation, survival analysis, clinical trials, etc.
Social Statistics
• Using statistical measurement systems to study human behavior in a social environment.
• Advanced statistical analyses have become popular across the social sciences.
• A new branch: quantitative social science at Harvard.
• Structural equation modeling and factor analysis
• Multilevel models
• Cluster analysis
• Latent class models
• Item response theory
• Survey methodology and survey sampling
Chemometrics
• Applies mathematical or statistical methods to chemical data.
• Chemometrics is the science of relating measurements made on a chemical system or process to the state of the system via application of mathematical or statistical methods.
• Chemometric research spans a wide range of methods which can be applied in chemistry. There are techniques for collecting good data (optimization of experimental parameters, design of experiments, calibration, signal processing) and for getting information from these data (statistics, pattern recognition, modeling, structure-property-relationship estimations).
• Chemometrics tries to build a bridge between the methods and their application in chemistry.
Reliability Engineering
• Reliability engineers perform a wide variety of special management and engineering tasks to ensure that sufficient attention is given to details that will affect the reliability of a given system.
• Reliability engineers rely heavily on statistics, probability theory, and reliability theory. Many engineering techniques are used in reliability engineering, such as reliability prediction, Weibull analysis, thermal management, reliability testing and accelerated life testing.
Statistical Methods
• A common goal for a statistical research project is to investigate causality, and in particular to draw a conclusion about the effect of changes in the values of predictors or independent variables on a response or dependent variable.
• Two major types of studies: experimental and observational studies.
Well-Known Techniques
• Student's t-test: tests whether the means of two normally distributed populations are equal.
• Chi-square test: tests whether two distributions are the same.
• Analysis of variance (ANOVA): tests differences of means or effects.
• Mann-Whitney U: tests the difference in medians between two observed distributions.
• Regression analysis: models relationships between random variables, determines the magnitude of the relationships between variables, and can be used to make predictions based on the models.
• Correlation: indicates the strength and direction of a linear relationship between two random variables.
• Fisher's least significant difference test: tests differences of means in multiple comparisons.
• Pearson product-moment correlation coefficient: a measure of how well a linear equation describes the relation between two variables X and Y measured on the same object or organism.
• Spearman's rank correlation coefficient: a non-parametric measure of correlation between two variables.
Simple Statistics Applications
• Compare two means
• Compare two proportions
• Compare two populations
• Estimate a mean or proportion
• Find an empirical distribution
Statistical Topics
Sampling Distribution
• The sampling distribution describes the distribution of outcomes that one would observe from replication of a particular sampling plan.
• Know that to estimate means to esteem (to give value to).
• Know that estimates computed from one sample will be different from estimates that would be computed from another sample.
• Understand that estimates are expected to differ from the population characteristics (parameters) that we are trying to estimate, but that the properties of sampling distributions allow us to quantify, probabilistically, how they will differ.
• Understand that different statistics have different sampling distributions, with distribution shape depending on (a) the specific statistic, (b) the sample size, and (c) the parent distribution.
• Understand the relationship between sample size and the distribution of sample estimates.
• Understand that the variability in a sampling distribution can be reduced by increasing the sample size.
Research
• Sequential sampling techniques
• Low response rates
• Biased responses
Outlier Removal
• Outliers are a few observations that are not well fitted by the "best" available model. When they occur, one must first investigate the source of the data. If there is no doubt about the accuracy or veracity of an observation, then it should be removed and the model refitted. Robust statistical techniques are needed to cope with any undetected outliers; otherwise the results will be misleading.
• Because of a potentially large variance, outliers can be a legitimate outcome of sampling. It is perfectly possible for such an observation to belong to the study group by definition: consider, say, lognormally distributed data.
• Be very careful and cautious: before declaring an observation "an outlier," find out why and how it occurred. It could even be an error at the data-entry stage.
• First, construct the boxplot of your data: form the Q1, Q2 and Q3 points, which divide the sample into four equally sized groups (Q2 = median), and let IQR = Q3 - Q1. Outliers are then defined as points outside the values Q3 + k*IQR and Q1 - k*IQR. In most cases one sets k = 1.5 or 3.
• An alternative definition declares as outliers the points outside mean ± k*s, where s is the sample standard deviation (k is 2, 2.5, or 3).
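As a minimal sketch (Python, with made-up sample data), the boxplot rule above can be applied directly; `statistics.quantiles` supplies the Q1, Q2 and Q3 cut points:

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (the boxplot rule)."""
    q1, q2, q3 = statistics.quantiles(data, n=4)  # three quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lo or x > hi]

sample = [12, 13, 14, 15, 15, 16, 17, 18, 95]   # invented data with one extreme value
print(iqr_outliers(sample))                     # the extreme value 95 is flagged
```

Setting k = 3 instead of 1.5 flags only the most extreme points, matching the "far outlier" convention.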
Central Limit Theorem
• The average of a sample of observations drawn from some population with any shape of distribution is approximately normally distributed if certain conditions are met.
• It is well known that, whatever the parent population, the standardized sample mean has a distribution with mean 0 and standard deviation 1 under random sampling with a large sample size.
• The sample size needed for the approximation to be adequate depends strongly on the shape of the parent distribution. Symmetry is particularly important. For a symmetric, short-tailed parent distribution, even one very different in shape from a normal distribution, an adequate approximation can be obtained with small samples (e.g., 10 or 12 for the uniform distribution). In some extreme cases (e.g., certain binomial distributions), sample sizes far exceeding the typical guideline (say, 30) are needed for an adequate approximation.
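This behavior is easy to check by simulation. The sketch below (Python, using the uniform parent mentioned above) draws many samples of size 12 and inspects the distribution of their means:

```python
import random
import statistics

random.seed(0)
# Draw many sample means from a uniform(0, 1) parent (symmetric, short-tailed):
# even n = 12 gives a close-to-normal sampling distribution.
n, reps = 12, 20000
means = [statistics.fmean(random.random() for _ in range(n)) for _ in range(reps)]

# The parent has mean 1/2 and variance 1/12, so the sample mean should have
# mean 1/2 and standard deviation sqrt((1/12)/n) = 1/12 ~ 0.083 here.
print(round(statistics.fmean(means), 3))   # close to 0.5
print(round(statistics.stdev(means), 3))   # close to 0.083
```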
P-values
• The P-value, which directly depends on a given sample, attempts to measure the strength of the results of a test, in contrast to a simple reject or do-not-reject decision. If the null hypothesis is true and chance variation is the only reason for sample differences, then the P-value is a quantitative measure to feed into the decision-making process as evidence.
• A reasonable interpretation of P-values: P < 0.01, very strong evidence against H0; 0.01 ≤ P < 0.05, moderate evidence against H0; 0.05 ≤ P < 0.10, suggestive evidence against H0; 0.10 ≤ P, little or no real evidence against H0.
• This interpretation is widely accepted, and many scientific journals routinely publish papers using it for the results of tests of hypotheses.
• For a fixed sample size, when the number of realizations is decided in advance, the distribution of p is uniform (assuming the null hypothesis). We would express this as P(p ≤ x) = x. That means the criterion p < 0.05 achieves a Type I error rate of 0.05.
• When a p-value is associated with a set of data, it is a measure of the probability that the data could have arisen as a random sample from some population described by the statistical (testing) model.
• A p-value is a measure of how much evidence you have against the null hypothesis: the smaller the p-value, the more evidence. One may combine the p-value with a significance level to make a decision on a given test of hypothesis: if the p-value is less than some threshold (usually 0.05, sometimes a bit larger like 0.1 or a bit smaller like 0.01), then you reject the null hypothesis.
Accuracy, Precision, Robustness, and Data Quality
• Accuracy is the degree of conformity of a measured/calculated quantity to its actual (true) value.
• Precision is the degree to which further measurements or calculations will show the same or similar results.
• Robustness is the resilience of the system, especially when under stress or when confronted with invalid input.
• Data are of high quality "if they are fit for their intended uses in operations, decision making and planning."
• An "accurate" estimate has small bias. A "precise" estimate has both small bias and small variance.
• The robustness of a procedure is the extent to which its properties do not depend on those assumptions which you do not wish to make.
• Distinguish between bias robustness and efficiency robustness.
• Example: the sample mean is seen as a robust estimator because the CLT guarantees zero bias for large samples regardless of the underlying distribution. This estimator is bias robust, but it is clearly not efficiency robust, as its variance can increase endlessly. That variance can even be infinite if the underlying distribution is Cauchy or Pareto with a large scale parameter.
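A tiny illustration of the robustness distinction (Python, with invented measurements): one contaminated observation drags the sample mean far from the bulk of the data, while the median barely moves:

```python
import statistics

clean = [9.8, 10.1, 10.0, 9.9, 10.2]
with_outlier = clean + [1000.0]   # one gross error contaminates the sample

# The mean is pulled far away from the bulk of the data; the median barely moves.
print(statistics.mean(clean), statistics.mean(with_outlier))      # 10.0 vs 175.0
print(statistics.median(clean), statistics.median(with_outlier))  # 10.0 vs ~10.05
```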
Bias Reduction Techniques
• The most effective tools for bias reduction are the bootstrap and the jackknife. The bootstrap uses resampling from a given set of data to mimic the variability that produced the data in the first place; it has a rather dependable theoretical basis and can be a highly effective procedure for estimating error quantities in statistical problems.
• The bootstrap creates a virtual population by duplicating the same sample over and over, and then re-samples from that virtual population to form a reference set; comparing the original sample with the reference set yields the p-value. Very often a certain structure is "assumed", so that a residual is computed for each case; what is then re-sampled is from the set of residuals, which are added back to the assumed structure before some statistic is evaluated. The purpose is often to estimate a P-level.
• The jackknife re-computes the statistic leaving one observation out each time. Jackknifing does a bit of logical folding to provide estimators of coefficients and error that have reduced bias.
• Bias reduction techniques have wide applications in anthropology, chemistry, climatology, clinical trials, cybernetics, ecology, etc.
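The leave-one-out idea can be sketched in a few lines (Python; the bias formula is the standard first-order jackknife correction). Applied to the plug-in variance, which divides by n, the correction recovers the unbiased divide-by-(n-1) estimate exactly:

```python
import statistics

def jackknife_bias_corrected(data, estimator):
    """Leave-one-out jackknife: estimate and remove first-order bias."""
    n = len(data)
    full = estimator(data)
    loo = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    bias = (n - 1) * (statistics.fmean(loo) - full)
    return full - bias

def biased_var(xs):
    """Plug-in variance: divides by n, so it underestimates the true variance."""
    m = statistics.fmean(xs)
    return statistics.fmean((x - m) ** 2 for x in xs)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # invented data
# The jackknife-corrected plug-in variance equals the unbiased sample variance.
print(jackknife_bias_corrected(data, biased_var), statistics.variance(data))
```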
Effect Size
Effect size (ES) permits the comparative effect of different treatments to be compared, even when based on different samples and different measuring instruments. The ES is a standardized mean difference between the control group and the treatment group.
• Glass's method: suppose an experimental treatment group has a mean score of Xe, and a control group has a mean score of Xc and a standard deviation of Sc; then the effect size is (Xe - Xc)/Sc.
• Hunter and Schmidt (1990) suggested using a pooled within-group standard deviation, because it has less sampling error than the control group standard deviation under the condition of equal sample sizes. In addition, Hunter and Schmidt corrected the effect size for measurement error by dividing the effect size by the square root of the reliability coefficient of the dependent variable.
• Cohen's ES: (mean1 - mean2)/pooled SD.
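The pooled-SD effect size can be computed directly; this Python sketch uses invented scores for a treatment and a control group:

```python
import math
import statistics

def cohens_d(treatment, control):
    """Standardized mean difference using the pooled within-group SD."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = statistics.variance(treatment), statistics.variance(control)
    pooled_sd = math.sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
    return (statistics.fmean(treatment) - statistics.fmean(control)) / pooled_sd

treated = [23.0, 25.0, 28.0, 30.0, 24.0]   # invented scores
control = [20.0, 22.0, 21.0, 23.0, 19.0]
print(round(cohens_d(treated, control), 2))  # 2.13: a very large effect
```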
Nonparametric Techniques
• Parametric techniques are more useful the more one knows about the subject matter, since knowledge about the data can be built into parametric models. Nonparametric methods, in both senses of the term (distribution-free tests and flexible functional forms), are more useful when less is known about the subject matter. A statistical technique is called nonparametric if it satisfies at least one of the following criteria:
1. The data entering the analysis are enumerative, that is, count data representing the number of observations in each category or cross-category.
2. The data are measured and/or analyzed using a nominal or ordinal scale of measurement.
3. The inference does not concern a parameter in the population distribution.
4. The probability distribution of the statistic upon which the analysis is based is very general, such as continuous, discrete, or symmetric.
The statistics are:
• Mann-Whitney rank test: a nonparametric alternative to Student's t-test when one does not have normally distributed data; to be used with two independent groups (analogous to the independent-groups t-test).
• Wilcoxon: to be used with two related (i.e., matched or repeated) groups (analogous to the related-samples t-test).
• Kruskal-Wallis: to be used with two or more independent groups (analogous to the single-factor between-subjects ANOVA).
• Friedman: to be used with two or more related groups (analogous to the single-factor within-subjects ANOVA).
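The Mann-Whitney U statistic itself is simple to compute; a Python sketch with made-up groups (turning U into a p-value still needs a table or a normal approximation):

```python
def mann_whitney_u(a, b):
    """U statistic: the number of (a_i, b_j) pairs with a_i > b_j,
    counting exact ties as 1/2."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

a = [1.2, 3.4, 5.6, 7.8]   # invented samples
b = [2.1, 4.3, 6.5]
print(mann_whitney_u(a, b))   # 6.0; note U(a,b) + U(b,a) = len(a) * len(b)
```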
Least Squares Models
• Many problems in analyzing data involve describing how variables are related. The simplest of all models describing the relationship between two variables is a linear, or straight-line, model. The conventional fitting method is least squares, which finds the line minimizing the sum of squared vertical distances between the observed points and the fitted line.
• There is a simple connection between the numerical coefficients in the regression equation and the slope and intercept of the regression line.
• A summary statistic like a correlation coefficient does not tell the whole story. A scatter plot is an essential complement when examining the relationship between two variables.
• Model checking is an essential part of the process of statistical modeling. After all, conclusions based on models that do not properly describe an observed set of data will be invalid.
• Analyze the residuals to assess the impact of violations of the regression model assumptions (i.e., conditions) and possible remedies.
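The least-squares line has a familiar closed form: slope = Sxy/Sxx and intercept = mean(y) - slope * mean(x). A Python sketch with invented points:

```python
def least_squares_line(xs, ys):
    """Slope and intercept minimizing the sum of squared vertical distances."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0, 5.0]          # invented data close to y = 2x
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
m, c = least_squares_line(xs, ys)
print(round(m, 2), round(c, 2))          # slope near 2, intercept near 0
```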
Least Median of Squares Models and Least Absolute Deviation (LAD)
• The standard least squares techniques for estimation in linear models are not robust, in the sense that outliers or contaminated data can strongly influence estimates.
• Robust techniques which protect against contamination are least median of squares (LMS) and least absolute deviation (LAD).
• An extension of LMS estimation to generalized linear models gives rise to the least median of deviance (LMD) estimator.
Multivariate Data Analysis
• Multivariate analysis is a branch of statistics involving the consideration of objects on each of which the values of a number of variables are observed. Multivariate techniques are used across the whole range of fields of statistical application. The techniques include:
Principal components analysis
Factor analysis
Cluster analysis
Discriminant analysis
• Principal component analysis (PCA) is used for exploring data to reduce dimension. Generally, PCA seeks to represent n correlated random variables by a reduced set of uncorrelated variables, which are obtained by transformation of the original set onto an appropriate subspace.
• Two closely related techniques, principal component analysis and factor analysis, are used to reduce the dimensionality of multivariate data. In these techniques, correlations and interactions among the variables are summarized in terms of a small number of underlying factors. The methods rapidly identify key variables or groups of variables that control the system under study.
• Cluster analysis is an exploratory data analysis tool which aims at sorting different objects into groups in such a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise.
• Discriminant function analysis is used to classify cases into the values of a categorical dependent variable, usually a dichotomy.
Regression Analysis
• Models the relationship between one or more response variables (Y) and the predictors (X1, ..., Xp). If there is more than one response variable, we speak of multivariate regression.
Types of regression
• Simple and multiple linear regression: related statistical methods for modeling the relationship between two or more random variables using a linear equation. Linear regression assumes the best estimate of the response is a linear function of some parameters (though not necessarily linear in the predictors).
• Nonlinear regression models: if the relationship between the variables being analyzed is not linear in the parameters, a number of nonlinear regression techniques may be used to obtain a more accurate regression.
• Other models: although these types are the most common, there also exist Poisson regression, supervised learning, and unit-weighted regression.
• Linear models: predictor variables may be defined quantitatively or qualitatively (i.e., categorical). Categorical predictors are sometimes called factors. Although the method of estimating the model is the same in each case, different situations are sometimes known by different names for historical reasons:
– If the predictors are all quantitative, we speak of multiple regression.
– If the predictors are all qualitative, one performs analysis of variance.
– If some predictors are quantitative and some qualitative, one performs an analysis of covariance.
General Linear Regression
• The general linear model (GLM) is a statistical linear model. It may be written as
Y = XB + U
where Y is a matrix with a series of multivariate measurements, X is a matrix that might be a design matrix, B is a matrix containing parameters that are usually to be estimated, and U is a matrix containing residuals (i.e., errors or noise). The residuals are usually assumed to follow a multivariate normal distribution or another distribution, such as one from the exponential family.
• The general linear model incorporates a number of different statistical models: ANOVA, ANCOVA, MANOVA, MANCOVA, ordinary linear regression, the t-test and the F-test. If there is only one column in Y (i.e., one dependent variable), then the model can also be referred to as the multiple regression model (multiple linear regression).
• For example, if the response variable can take only binary values (a Boolean or Yes/No variable), logistic regression is preferred. The outcome of this type of regression is a function which describes how the probability of a given event (e.g. the probability of getting "yes") varies with the predictors.
• Hypothesis tests with the general linear model can be made in two ways: multivariate and mass-univariate.
Semiparametric and Nonparametric Modeling
• The generalized linear model (GLM):
Y = G(X_1*b_1 + ... + X_p*b_p) + e
where G is called the link function. All these models lead to the problem of estimating a multivariate regression. Parametric regression estimation has the disadvantage that the parametric "form" already implies certain properties of the resulting estimate.
• Nonparametric techniques allow diagnostics of the data without this restriction, and the model structure is not specified a priori. However, this requires large sample sizes and causes problems in graphical visualization.
• Semiparametric methods are a compromise between the two: they support nonparametric modeling of certain features while profiting from the simplicity of parametric methods. Example: the Cox proportional hazards model.
Survival analysis
• It deals with "death" in biological organisms and failure in mechanical systems. Death or failure is called an "event" in the survival analysis literature, and so models of death or failure are generically termed time-to-event models.
• Survival data arise in a literal form from trials concerning life-threatening conditions, but the methodology can also be applied to other waiting times, such as the duration of pain relief.
• Censoring: nearly every sample contains some cases that do not experience an event. If the dependent variable is the time of the event, what do you do with these "censored" cases?
• Survival analysis attempts to answer questions such as: What fraction of a population will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the odds of survival?
• Time-dependent covariates: many explanatory variables (like income or blood pressure) change in value over time. How do you put such variables into a regression analysis?
• Survival analysis is a group of statistical methods for the analysis and interpretation of survival data. Survival and hazard functions, and the methods of estimating parameters and testing hypotheses, are the main parts of the analysis of survival data.
• Main topics relevant to survival data analysis: survival and hazard functions; types of censoring; estimation of survival and hazard functions (the Kaplan-Meier and life table estimators, simple life tables); comparison of survival functions (the logrank and Mantel-Haenszel tests, the Wilcoxon test); the proportional hazards model with time-independent and time-dependent covariates; recurrent models; and methods for determining sample sizes.
Repeated Measures and
Longitudinal Data
•
Repeated measures and longitudinal data require special attention because they
involve correlated data that commonly arise when the primary sampling units are
measured repeatedly over time or under different conditions.
•
The experimental units are often subjects. It is usually interested in between

subject
and within

subject effects. Between

subject effects are those whose values change
only from subject to subject and remain the same for all observations on a single
subject, for example, treatment and gender. Within

subject effects are those whose
values may differ from measurement to measurement.
•
Since measurements on the same experimental unit are likely to be correlated,
repeated measurements analysis must account for that correlation.
•
Normal theory models for split

plot experiments and repeated measures ANOVA can
be used to introduce the concept of correlated data.
•
PROC GLM, PROC GENMOD and PROC MIXED in the SAS system may be used.
Mixed linear models provide a general framework for modeling covariance structures, a
critical first step that influences parameter estimation and tests of hypotheses. The
primary objectives are to investigate trends over time and how they relate to treatment
groups or other covariates.
•
Techniques applicable to non-normal data, such as McNemar's test for binary data, weighted least squares for categorical data, and generalized estimating equations (GEE), are the main topics. The GEE method can be used to accommodate correlation when the means at each time point are modeled using a generalized linear model.
•
Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 1986;73:13–22.
Information Theory
•
Information theory is a branch of probability and mathematical statistics that deals with communication systems, data transmission, cryptography, signal-to-noise ratios, data compression, etc. Claude Shannon is the father of information theory. His theory treated the transmission of information as a statistical phenomenon and gave communications engineers a way to determine the capacity of a communication channel in terms of the common currency of bits. Shannon defined a measure of entropy as:
•
H = − ∑ pᵢ log pᵢ ,
•
that, when applied to an information source, could determine the capacity of the channel required to transmit the source as encoded binary digits. The entropy is a measure of the amount of uncertainty one has about which message will be chosen. It is defined as the average self-information of a message i from that message space.
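A quick numerical check of this formula (using base-2 logarithms, so entropy is measured in bits):

```python
from math import log2

def entropy(probs):
    """Shannon entropy H = -sum p_i log2 p_i, in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

# A fair coin carries 1 bit of uncertainty per toss;
# a uniform choice among 4 messages carries 2 bits.
print(entropy([0.5, 0.5]))        # 1.0
print(entropy([0.25] * 4))        # 2.0
```

Skipping terms with p = 0 implements the usual convention 0·log 0 = 0.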
•
Entropy as defined by Shannon is closely related to entropy as defined by physicists in statistical
thermodynamics. This work was the inspiration for adopting the term entropy in information theory. Other
useful measures of information include mutual information which is a measure of the correlation between
two event sets. Mutual information is defined for two events X and Y as:
M(X, Y) = H(X) + H(Y) − H(X, Y)
where H(X, Y) is the joint entropy defined as:
H(X, Y) = − ∑ p(xᵢ, yᵢ) log p(xᵢ, yᵢ),
•
Mutual information is closely related to the log-likelihood ratio test for the multinomial distribution, and to Pearson's chi-square test. The field of information science has since expanded to cover the full range of techniques and abstract descriptions for the storage, retrieval and transmittal of information.
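The identity M(X, Y) = H(X) + H(Y) − H(X, Y) can be checked directly on a small joint distribution (the probabilities below are made up for illustration):

```python
from math import log2

def H(ps):
    """Shannon entropy of a probability list, in bits."""
    return -sum(p * log2(p) for p in ps if p > 0)

# Hypothetical joint distribution p(x, y) over two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
py = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

# Mutual information: H(X) + H(Y) - H(X, Y); zero iff X and Y are independent.
mi = H(px.values()) + H(py.values()) - H(joint.values())
print(round(mi, 4))
```

Because X and Y here tend to agree (probability 0.8 on the diagonal), the mutual information is positive; for an independent pair it would be exactly 0.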
•
Applications: coding theory, making and breaking cryptographic systems, intelligence work, Bayesian analysis, gambling, investing, etc.
Incomplete Data
•
Methods dealing with analysis of data with missing values can be
classified into:
– Analysis of complete cases, including weighting adjustments,
– Imputation methods, and extensions to multiple imputation, and
– Methods that analyze the incomplete data directly without requiring a rectangular data set, such as maximum likelihood and Bayesian methods.
•
Multiple imputation (MI) is a general paradigm for the analysis of incomplete data. Each missing datum is replaced by m > 1 simulated values, producing m simulated versions of the complete data. Each version is analyzed by standard complete-data methods, and the results are combined using simple rules to produce inferential statements that incorporate missing data uncertainty. The focus is on the practice of MI for real statistical problems in modern computing environments.
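A toy sketch of this m-fold scheme, under a deliberately crude imputation model (every number below is hypothetical; real MI would draw from a model fitted to the observed covariates):

```python
import random
import statistics

# Hypothetical observed values and a count of missing ones.
random.seed(0)
observed = [4.1, 5.3, 3.8, 4.9, 5.0]
n_missing = 3
m = 20                                   # number of imputed data sets, m > 1

mu, sd = statistics.mean(observed), statistics.stdev(observed)
estimates = []
for _ in range(m):
    # Replace each missing datum with a draw from the imputation model.
    imputed = [random.gauss(mu, sd) for _ in range(n_missing)]
    # Analyze the completed data set with a standard complete-data method.
    estimates.append(statistics.mean(observed + imputed))

pooled = statistics.mean(estimates)       # pooled point estimate
between = statistics.variance(estimates)  # between-imputation variance
print(round(pooled, 2), round(between, 4))
```

The between-imputation variance is the extra term that lets the final inference reflect missing-data uncertainty, which single imputation ignores.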
Interactions
•
ANOVA programs generally produce all possible interactions, while regression programs generally do not produce any. It is therefore up to the user to construct interaction terms by multiplying the relevant predictors together.
•
If a standard error is high, multicollinearity may be the cause. But it is not the only factor that can produce large SEs for estimators of "slope" coefficients in regression models. SEs are inversely proportional to the range of variability in the predictor variable, so to increase the precision of the estimators we should increase the range of the input.
•
Another cause of large SEs is a small number of "event" observations or a small number of "non-event" observations.
•
There is also another cause of high standard errors, called serial correlation, which arises when using time-series data.
•
When X and W are categorical variables, the interaction describes a two-way analysis of variance (ANOVA) model; when X and W are (quasi-)continuous variables, the equation describes a multiple linear regression (MLR) model.
•
In ANOVA contexts, the existence of an interaction can be described as a difference
between differences.
•
In MLR contexts, an interaction implies a change in the slope (of the regression of Y
on X) from one value of W to another value of W.
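The "change in slope" reading of an MLR interaction can be made concrete with a hypothetical exact model, y = 1 + 2x + 3w + 1.5xw, where the x·w column is the user-constructed interaction term:

```python
# Hypothetical regression surface with an interaction term x*w.
def y(x, w):
    return 1 + 2 * x + 3 * w + 1.5 * x * w

# Slope of Y on X at each value of W (unit step in x):
slope_w0 = y(1, 0) - y(0, 0)   # slope when W = 0
slope_w1 = y(1, 1) - y(0, 1)   # slope when W = 1
print(slope_w0, slope_w1)      # 2.0 vs 3.5: the interaction shifts the slope
```

The difference between the two slopes (3.5 − 2 = 1.5) is exactly the interaction coefficient; with no interaction the two lines would be parallel.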
Sufficient Statistic
•
A sufficient statistic contains all the information about a parameter that is present in the raw data. For example, the sum of your data is sufficient to estimate the mean of the population: you do not have to know the data set itself. This saves a lot of effort; simply send out the total and the sample size.
•
A
sufficient statistic
t for a parameter θ is a function of the sample data x₁,...,xₙ that contains all information in the sample about the parameter θ. More formally, sufficiency is defined in terms of the likelihood function for θ. For a sufficient statistic t, the likelihood L(x₁,...,xₙ | θ) can be written as g(t | θ)·k(x₁,...,xₙ). Since the second factor does not depend on θ, t is said to be a sufficient statistic for θ.
•
To illustrate, let the observations be independent Bernoulli trials with the same probability of success. Suppose that there are n trials, that person A observes which observations are successes, and that person B only finds out the number of successes. If B places that many successes at random points among the n trials, without replication, B's reconstructed sample is statistically equivalent to A's: B and A learn the same thing about the success probability.
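A small simulation of this Bernoulli example (the success probability 0.3 is arbitrary): person A keeps the full sequence, person B keeps only the total, yet both arrive at the identical estimate of p.

```python
import random

# Person A observes every Bernoulli trial; person B only the count of successes.
random.seed(1)
n = 1000
trials = [1 if random.random() < 0.3 else 0 for _ in range(n)]  # A's data
total = sum(trials)                                             # all B knows

p_hat_A = sum(trials) / n   # estimate from the raw data
p_hat_B = total / n         # estimate from the sufficient statistic alone
print(p_hat_A == p_hat_B)   # True: the total (with n) loses nothing about p
```

The ordering of the successes carries no information about p, which is exactly what the factorization L = g(t | θ)·k(x₁,...,xₙ) expresses.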
Tests
•
Significance tests are based on assumptions: the data have to be random, drawn from a well-defined base population, and one has to assume that some variables follow a certain distribution. The power of a test is the probability of correctly rejecting a false null hypothesis; it is one minus the probability of making a Type II error. A Type I error is rejecting a null hypothesis that is in fact true; a Type II error is failing to reject a false null hypothesis. Decreasing the probability of making a Type I error will increase the probability of making a Type II error.
•
Power and the True Difference between Population Means:
The
distance between the two population means will affect the power of
our test.
•
Power as a Function of Sample Size and Variance:
Sample size
has an indirect effect on power because it affects the measure of
variance we used in the test. When n is large we will have a lower
standard error than when n is small.
•
Pilot Studies:
When the estimates needed for a sample size calculation are not available from existing databases, a pilot study is needed to obtain adequate estimates with a given precision.
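Both effects above can be seen in the textbook normal approximation to the power of a one-sided two-sample z-test (this generic formula is an illustration, not tied to any example in these slides):

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power_two_sample(delta, sigma, n):
    """Approximate power of a one-sided two-sample z-test at the 5% level,
    with n observations per group and common SD sigma."""
    z_alpha = 1.6449               # upper 5% point of the standard normal
    se = sigma * sqrt(2 / n)       # SE of the difference in sample means
    return norm_cdf(delta / se - z_alpha)

# Power rises with the true difference delta and with the sample size n.
print(power_two_sample(delta=0.5, sigma=1.0, n=64))
```

A pilot study supplies the sigma (and a plausible delta) that this calculation needs before the main study's n can be chosen.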
ANOVA: Analysis of Variance
•
ANOVA tests the difference between two or more means. It does this by examining the ratio of the variability between conditions to the variability within each condition.
•
Say we give a drug that we believe will improve memory to a
group of people and give a placebo to another group of people.
We might measure memory performance by the number of
words recalled from a list we ask everyone to memorize. An
ANOVA test would compare the variability that we observe
between the two conditions to the variability observed within
each condition. Recall that we measure variability as the sum of the squared differences of each score from the mean.
•
When the variability that we predict (between the two groups)
is much greater than the variability we don't predict (within
each group) then we will conclude that our treatments produce
different results.
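The drug-versus-placebo comparison above can be sketched as a one-way ANOVA F ratio (the recall scores below are hypothetical):

```python
# One-way ANOVA F ratio on hypothetical word-recall scores.
drug    = [12, 14, 11, 13, 15]
placebo = [8, 10, 9, 7, 11]
groups  = [drug, placebo]

grand = sum(sum(g) for g in groups) / sum(len(g) for g in groups)

# Between-group variability (what the treatment predicts) ...
ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
# ... versus within-group variability (what it does not).
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

df_between = len(groups) - 1
df_within = sum(len(g) for g in groups) - len(groups)
F = (ss_between / df_between) / (ss_within / df_within)
print(F)   # 16.0: between-group variability dominates within-group
```

A large F, compared against the F distribution with (1, 8) degrees of freedom, is the evidence that the two conditions produce different mean recall.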
Data Mining and Knowledge
Discovery
•
It uses sophisticated statistical analysis and modeling techniques to uncover patterns
and relationships hidden in organizational databases.
•
It aims at tools and techniques to process structured information, from databases to data warehouses to data mining, and on to knowledge discovery. Data warehouse applications have become business-critical.
•
It can compress even more value out of these huge repositories of information. The continuing rapid growth of on-line data and the widespread use of databases necessitate the development of techniques for extracting useful knowledge and for facilitating database access.
•
The challenge of extracting knowledge from data is of common interest to several fields, including statistics, databases, pattern recognition, machine learning, data visualization, optimization, and high-performance computing.
•
The data mining process involves identifying an appropriate data set to "mine" or sift through to discover data content relationships. Data mining tools include techniques like case-based reasoning, cluster analysis, data visualization, fuzzy query and analysis, and neural networks. Data mining sometimes resembles the traditional scientific method of identifying a hypothesis and then testing it using an appropriate data set.
•
It is reminiscent of what happens when data has been collected and no significant
results were found and hence an ad hoc, exploratory analysis is conducted to find a
significant relationship.
•
Data mining is the process of extracting knowledge from data. For clever marketers, that