Statistics for the Physical Sciences
STAT 229
Chapter 1
Statistics: The Art and Science of
Learning from Data
Fall 2008 STAT 229
2
Homework 1
• Problems 1.1 to 1.36 (even numbered)
• Complete the survey on pages 22-23
1.1 Overview
• Statistics is the art and science of learning from data. It is a collection of methods for
  – Planning experiments (Design)
  – Obtaining data (data are collected observations, such as measurements and survey responses)
  – Organizing data
  – Summarizing data (Description)
  – Analyzing data
  – Interpreting results, and
  – Making decisions and predictions (Inference)
• Statistics is a branch of Mathematics
• Statistics was invented for studying randomness (a lack of order, purpose, cause, or predictability, per Wikipedia), without which the world would be of no interest.
• Examples of random phenomena:
  – Phelps won 8 gold medals
  – A 6-sided die is rolled and lands on 4
  – It’s going to rain tomorrow
• Randomness, Fuzziness and Uncertainty
• Randomness creates uncertainty. On the other hand, randomness can be used. When estimating the proportion of adults in the USA who smoke, we can survey 1000 adults and use the survey responses as our data. How is randomness used? Why use it?
1.2 We Learn about Population Using Samples
• In the previous example, all US adults form a population while the 1000 surveyed adults form a sample.
• In general, a population is the complete collection of all items to be studied. These items can be human subjects, animals, machines, even scores.
• A sample is a sub-collection of items selected from a population.
More about Samples
• A sample should represent the underlying population. Therefore, sample data must be collected in an appropriate way, such as through a process of random selection.
• How large should a sample be?
• What are the appropriate ways to generate a sample?
• Methods for summarizing sample data are referred to as descriptive statistics, while methods for making decisions or predictions about a population based on sample data are called inferential statistics.
Parameter and Statistic
• A parameter is a numeric summary of the population.
• A statistic is a numeric summary of a sample taken from the population.
• Problem: Number of Good Friends
  One year the General Social Survey asked, “About how many good friends do you have?” Of the 819 people who responded, 6% reported having only one good friend. Identify
  (a) the sample,
  (b) the population, and
  (c) the parameter or statistic.
• Try Problem 1.3 on page 8 of the textbook. Go to the General Social Survey website (http://sda.berkeley.edu/GSS). By entering HEAVEN as the “row variable” name, find the percentages of people who said “yes, definitely,” “yes, probably,” “no, probably not,” and “no, definitely not” when asked whether they believed in heaven.
1.3 What Role Do Computers Play in Statistics?
• Save (large) data files
• Create databases
• Do analysis with software: SAS, Minitab, SPSS, R, S-PLUS, C, Matlab, Excel, ...
• Simulation: use of computers to mimic reality.
Simulation of Coin Tossing in Microsoft Excel
NOTES:
1. Pseudo-random numbers are numbers generated by a computer algorithm to simulate real random numbers.
2. Excel has an Analysis ToolPak with which one can do statistical analysis, including simulation.
Tasks:
When a balanced coin is tossed 20 times, we have a sequence of 20 Heads or Tails. Let 1 denote Heads and 0 denote Tails. Then a sample is a sequence of 1s and 0s. The empirical probability, or sample proportion, of tossing Heads (1) is computed as the number of 1s divided by the total number of tosses. The coin-tossing process can be simulated using a Bernoulli distribution with proportion p = 0.5.
1. Simulate 5 random samples, each consisting of 10 pseudo-random numbers from a Bernoulli(0.5) distribution. Repeat the process with 5 samples of 1000 pseudo-random numbers each.
2. Compute the sample proportion for each of the 10 samples.
Simulation
Follow this: Tools > Data Analysis > Random Number Generation > Bernoulli
More questions:
1. Where does randomness play a role?
2. Is the amount of variability from sample to sample of size 10 bigger than the amount of variability from sample to sample of size 1000?
3. Comment on the effect of sample size.
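The same simulation can be sketched in R instead of Excel. This is only an illustration of the tasks above: the seed value and the helper name `sample_proportions` are assumptions made for this sketch, not part of the course material.

```r
# Simulate coin tosses: 1 = Heads, 0 = Tails, each with probability p = 0.5.
set.seed(229)  # arbitrary seed (an assumption) so the run is reproducible

# Return the sample proportion of Heads for each of n_samples samples.
# sample_proportions is a hypothetical helper name used only in this sketch.
sample_proportions <- function(n_tosses, n_samples = 5, p = 0.5) {
  replicate(n_samples, mean(rbinom(n_tosses, size = 1, prob = p)))
}

props_10   <- sample_proportions(10)    # 5 samples of 10 tosses each
props_1000 <- sample_proportions(1000)  # 5 samples of 1000 tosses each

print(props_10)
print(props_1000)

# Spread of the proportions from sample to sample for each sample size.
print(sd(props_10))
print(sd(props_1000))
```

Comparing the two printed standard deviations gives a first answer to question 2: proportions based on 1000 tosses typically cluster much more tightly around 0.5 than proportions based on 10 tosses.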
If You Are Using Excel 2007…
• Excel 2007 no longer has a Tools menu.
• To use the Analysis ToolPak, go to the Office button at the upper left corner, click Excel Options, then click Add-ins and highlight Analysis ToolPak. Click the Go button to open the Add-ins window. Check the Analysis ToolPak box and click OK.
• Now go to the Data menu, click Data Analysis and choose Random Number Generation.
Statistics for the Physical Sciences
STAT 229
Chapter 2
Exploring Data with Graphs and
Numerical Summaries
Homework #2
• 2-1 (p29): Problems 2.2, 2.4, 2.6, 2.8
• 2-2 (p44): Problems 2.10, 2.12, 2.14, 2.16, 2.22
• 2-3 (p55): Problems 2.30, 2.32, 2.34, 2.36, 2.42, 2.44
• 2-4 (p64): Problems 2.48, 2.52, 2.56, 2.58, 2.60
• 2-5 (p73): Problems 2.64, 2.66, 2.68, 2.72, 2.74, 2.78, 2.80, 2.82
• 2-6 (p80): Problem 2.84
2.1 What Are the Types of Data?
• A characteristic observed for the subjects in a study is called a variable.
• Examples of variables: major, GPA, religious affiliation, smoking status, ...
• Variables can be quantitative (numerical) or qualitative (categorical).
• A variable is quantitative if its numerical values represent different magnitudes of the variable, such as weight or GPA. A variable is categorical if its value represents a category, such as major or letter grade.
• Quantitative variables can be discrete or continuous.
• A discrete variable is usually a count, such as the number of car accidents last year, while a continuous variable is a measurement, such as distance.
• The reason we care whether a variable is quantitative, categorical, discrete, or continuous is that the method used to analyze a data set depends on the type of variable the data represent.
Key Features of a Variable
• A quantitative variable usually takes different values in a study. Studying the spread (variability) of such a variable is one of the most important tasks in statistics. Another feature of a quantitative variable is the center of all its possible values.
• For a categorical variable, a key feature to describe is the relative number of items (percentage) in the various categories.
Frequency Tables
• For a categorical variable, counting how often each possible value is taken by the variable is a critical first step in descriptive statistics. The results are summarized in a frequency table.
• The following table shows the frequency of shark attacks in various regions for 1990-2006.

Region      Frequency  Proportion  Percentage
Florida           365       0.785        78.5
Hawaii             60       0.129        12.9
California         40       0.086         8.6
Total             465       1.000       100.0

Frequency of shark attacks in various regions for 1990-2006

Questions: What is the variable? Is it categorical?
The mode of categorical data is the category with the highest frequency. Find the mode of the data.
Frequency Tables (cont’d)
• In the table above, the proportions and percentages are also called relative frequencies. A table like this is called a frequency table.
• A frequency table is a listing of possible values for a variable, together with the number of observations for each value.
• For a quantitative variable, a frequency table is constructed by first categorizing the data into a set of adjacent intervals, then finding the frequencies for each interval.
No. Hours   Frequency  Percent
0-1               232     25.6
2-3               403     44.5
4-5               181     20.0
6-7                45      5.0
8 or more          44      4.9
Total             905    100.0

Frequency Table for Daily TV Watching
• Example
Construct a frequency table for quiz scores for twenty students: 5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6

Score    Frequency  Proportion  Percentage
[0,2]            1        0.05           5
(2,4]            1        0.05           5
(4,6]            7        0.35          35
(6,8]            8        0.40          40
(8,10]           3        0.15          15
Total           20        1.00         100
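The quiz-score table can be reproduced in R. This is a minimal sketch using base R's `cut()` and `table()`; the variable names are chosen for this illustration only.

```r
scores <- c(5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6)

# Intervals [0,2], (2,4], (4,6], (6,8], (8,10]:
# right-closed, with the lowest endpoint 0 included in the first interval.
bins <- cut(scores, breaks = c(0, 2, 4, 6, 8, 10), include.lowest = TRUE)

freq <- table(bins)  # count the observations falling in each interval
freq_table <- data.frame(
  Score      = names(freq),
  Frequency  = as.vector(freq),
  Proportion = as.vector(freq) / length(scores),
  Percentage = 100 * as.vector(freq) / length(scores)
)
print(freq_table)
```

The interval endpoints matter: a score of 6 is counted in (4,6], not in (6,8], because each interval includes its right endpoint.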
2.2 How Can We Describe Data Using Graphical Summaries?

Group   Seats  Percent (%)
EUL        39          5.3
PES       200         27.3
EFA        42          5.7
EDD        15          2.0
ELDR       67          9.2
EPP       276         37.7
UEN        27          3.7
Other      66          9.0
Total     732        100.0

Preliminary results of the election for the European Parliament in 2004
Pie Charts and Bar Graphs for Categorical Variables
• Pie chart: A circle having a “slice of a pie” for each category. The size of a slice corresponds to the percentage of observations in the category.
• Bar graph: Displays a vertical bar for each category. The height of the bar is the percentage of observations in the category.
[Figure: pie chart of Seats by group: EUL 5%, PES 27%, EFA 6%, EDD 2%, ELDR 9%, EPP 38%, UEN 4%, Other 9%]
Example: Use the shark attack data from this source link to construct a pie chart of interest.
[Figure: Bar Graph for European Parliament in 2004; vertical axis: Seats (0 to 300); bars for EUL, PES, EFA, EDD, ELDR, EPP, UEN, Other]
Pareto Chart
[Figure: Pareto chart; vertical axis: Percentage (0 to 40); horizontal axis: Group, in the order EPP, PES, ELDR, Other, EFA, EUL, UEN, EDD]
Pareto Chart: Bar Graph with Categories Ordered by Their Frequency from the Tallest Bar to the Shortest
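The charts for the European Parliament data can be drawn in R. The sketch below uses the seat counts from the table above and base R's `barplot()`; axis labels follow the charts described here, and the object names are illustrative.

```r
# Seats won by each group (European Parliament, 2004, preliminary results).
seats <- c(EUL = 39, PES = 200, EFA = 42, EDD = 15,
           ELDR = 67, EPP = 276, UEN = 27, Other = 66)

percent <- 100 * seats / sum(seats)

# Pareto chart: the same bars, ordered from tallest to shortest.
pareto <- sort(percent, decreasing = TRUE)
print(round(pareto, 1))

barplot(pareto, xlab = "Group", ylab = "Percentage")
# pie(seats) would draw the corresponding pie chart.
```

Sorting before plotting is the only difference between an ordinary bar graph and a Pareto chart.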
Graphs for Quantitative Variables
• Dot plot: Shows a dot for each observation, placed just above the value on the number line for that observation.
• Stem-and-Leaf Plot: Similar to a dot plot. Each observation is represented by a stem and a leaf.
• Histogram: A graph that uses bars to portray the frequencies or relative frequencies.
Example: Dot plot
Quiz scores for twenty students: 5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6
[Figure: dot plot of the quiz scores on a number line from 1 to 10]
Graphs for Quantitative Variables
Example: Stem-and-Leaf Plot
Test scores for 12 students: 80, 45, 100, 76, 84, 87, 96, 62, 75, 74, 87, 76
Step 1: Sort the test scores: 45, 62, 74, 75, 76, 76, 80, 84, 87, 87, 96, 100
Step 2: Place the scores in the corresponding stems and leaves (usually the last digit will be the leaf).

Stem  Leaves
   4  5
   5
   6  2
   7  4 5 6 6
   8  0 4 7 7
   9  6
  10  0
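The two steps can be sketched in R by splitting each score into a stem (the tens) and a leaf (the last digit). This hand-rolled version is for illustration; base R also provides `stem()`, whose default display may group the stems differently.

```r
scores <- c(80, 45, 100, 76, 84, 87, 96, 62, 75, 74, 87, 76)

# Step 1: sort the scores.
sorted <- sort(scores)

# Step 2: stem = score divided by 10 (integer part), leaf = last digit.
# Listing the levels 4:10 keeps the empty stem 5 in the display.
leaves <- split(sorted %% 10, factor(sorted %/% 10, levels = 4:10))

for (stem in names(leaves)) {
  cat(sprintf("%2s | %s\n", stem, paste(leaves[[stem]], collapse = " ")))
}
```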
Graphs for Quantitative Variables
Histogram
Step 1: Divide the range of data into intervals of equal width.
Step 2: Count the frequency and construct a frequency table (or relative frequency table).
Step 3: Label the endpoints of the intervals on the x-axis. Draw a bar over each interval with height equal to its frequency (or relative frequency), values of which are marked on the y-axis.
Graphs for Quantitative Variables
Example: Histogram
Quiz scores for twenty students: 5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6

Score    Freq
[0,2]    1
(2,4]    1
(4,6]    7
(6,8]    8
(8,10]   3
(10,12]  0

[Figure: histogram of the quiz scores; x-axis: Score (0 to 10), y-axis: Frequency (0 to 9)]
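The bar heights depend on which endpoint each interval includes. As a hedged check in R, the sketch below counts the quiz scores under both conventions with base R's `hist()`; the frequencies 1, 1, 7, 8, 3 correspond to intervals that include their right endpoint.

```r
scores <- c(5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6)

# Right-closed intervals (0,2], (2,4], ..., (8,10]. By default hist() also
# counts a value equal to the lowest break (include.lowest = TRUE).
h_right <- hist(scores, breaks = seq(0, 10, by = 2), plot = FALSE)
print(h_right$counts)

# Left-closed intervals [0,2), [2,4), ..., [10,12], for comparison.
h_left <- hist(scores, breaks = seq(0, 12, by = 2), right = FALSE, plot = FALSE)
print(h_left$counts)

# Draw the histogram with the right-closed intervals.
hist(scores, breaks = seq(0, 10, by = 2),
     xlab = "Score", ylab = "Frequency", main = "Quiz scores")
```

With left-closed intervals the scores equal to 6, 8, and 10 move up one bin, so the two sets of counts differ even though the data are the same.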
Graphs for Quantitative Variables
The Shape of a Distribution
• When looking at a graph of quantitative data (dot plot, stem-and-leaf plot, or histogram), look for
  – the overall pattern: Do the data cluster together?
  – the outliers
  – modes: unimodal, bimodal, …
  – skew: skewed to the left or right
  – the underlying smooth curve
[Figure: example distributions labeled Unimodal, Bimodal, and Multimodal, and a distribution with outliers]
[Figure: “Histogram of x” and “Histogram of y”, each showing Frequency on a horizontal scale from -10 to 10]
These Two Histograms Show Differences in Spread
Time plots
• Time series: a data set collected over time.
• Time plot: a graph displaying time-series data.
• Look for patterns over time.
Time plots: Example
Gasoline price
2.3 How can we describe the center of quantitative data?
• Measures of center: mean and median
  – Mean: the sum of the observations divided by the number of observations.
  – Median: the midpoint of the observations.
Mean Formula
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$
Example: Travel times to work
How long does it take to get from home to work? Here are the travel times in minutes of 15 workers in North Carolina, chosen at random by the Census Bureau:
30 20 10 40 25 20 10 60 15 40 5 30 12 10 10
Find the mean travel time.
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{30 + 20 + \cdots + 10}{15} = \frac{337}{15} \approx 22.5$ minutes
How to Determine the Median
Step 1: Sort your data from the smallest to the largest.
Step 2: If n, the number of data points, is odd, the median is the middle value; if n is even, the median is the average of the middle two values.
Example
Find the median for the travel times
30 20 10 40 25 20 10 60 15 40 5 30 12 10 10
Arrange the data in order:
5 10 10 10 10 12 15 20 20 25 30 30 40 40 60
Since n = 15 is odd, Median = 20, the middle (8th) value.
Example
Find the median for the scores
60 80 87 73 95 92
Arrange the data in order: 60 73 80 87 92 95
Since n = 6 is even, Median = (80 + 87)/2 = 83.5, the average of the two middle values.
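Both measures of center can be checked in R with the built-in `mean()` and `median()` functions. A minimal sketch using the two data sets from the examples above:

```r
travel <- c(30, 20, 10, 40, 25, 20, 10, 60, 15, 40, 5, 30, 12, 10, 10)
scores <- c(60, 80, 87, 73, 95, 92)

print(sum(travel))     # total of the 15 travel times
print(mean(travel))    # sum divided by n
print(median(travel))  # n = 15 is odd: the middle (8th) sorted value
print(median(scores))  # n = 6 is even: average of the two middle values
```

`median()` sorts the data internally, so the values do not need to be ordered first.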
Properties of the mean and the median
• The mean is the balance point of the data.
• In a symmetric distribution, the mean and median are the same.
• In a skewed distribution, the mean is usually farther out in the long tail than the median.
  – Skewed to the right, mean > median
  – Skewed to the left, mean < median
• The mean is less resistant to outliers than the median.
Mean, Median, and Mode
• The mean is the balance point.
• The median is the midpoint.
• The mode is the value that occurs most frequently.
Mean and Median: Applications
• City data
  – St Cloud, MN
  – New Orleans, LA
2.4 How can we describe the spread of quantitative data?
• Measures of spread:
  – The Range
  – The Standard Deviation
  – The Interquartile Range (Sec. 2.5)
Measuring spread: The Range
• Range = largest value - smallest value
• Example: Find the range of the quiz scores:
  2, 5, 0, 7, 9, 1, 7, 6, 10, 9, 3, 9, 9, 7, 0, 6, 9, 10, 8, 1, 4, 6, 8, 9, 4, 2, 9, 0, 5, 7
  Range = largest value - smallest value = 10 - 0 = 10
The Range
• Simple to compute
• Easy to understand
But
• Uses only extreme values
• Affected severely by outliers
Measuring Spread: Variance and Standard Deviation
• The standard deviation and variance measure spread by looking at how far the observations are from their mean.
• The variance $s^2$ of a set of observations is an average of the squared deviations from the mean:
  $s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1}\sum (x_i - \bar{x})^2$
• The standard deviation $s$ is the square root of the variance:
  $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
• Example (Calculating the standard deviation s)
Metabolic rates of 7 men who took part in a study of dieting. The units are calories per 24 hours.
1792 1666 1362 1614 1460 1867 1439
Find the mean first:
$\bar{x} = \frac{1792 + \cdots + 1439}{7} = \frac{11200}{7} = 1600$ calories
The standard deviation: Example

Observations $x_i$   Deviations $x_i - \bar{x}$   Squared deviations $(x_i - \bar{x})^2$
1792                   192                          36864
1666                    66                           4356
1362                  -238                          56644
1614                    14                            196
1460                  -140                          19600
1867                   267                          71289
1439                  -161                          25921
                     sum = 0                      sum = 214870

The variance: $s^2 = \frac{214870}{6} = 35811.67$
The standard deviation: $s = \sqrt{35811.67} = 189.24$ calories
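The calculation can be retraced step by step in R and compared with the built-in `sd()` function. A minimal sketch with the seven metabolic rates; the variable names are illustrative.

```r
rates <- c(1792, 1666, 1362, 1614, 1460, 1867, 1439)

x_bar <- mean(rates)        # mean of the observations
dev   <- rates - x_bar      # deviations from the mean (they sum to 0)
s2    <- sum(dev^2) / (length(rates) - 1)  # variance: divide by n - 1
s     <- sqrt(s2)           # standard deviation

print(x_bar)
print(sum(dev))
print(sum(dev^2))
print(round(s2, 2))
print(round(s, 2))
print(sd(rates))            # built-in function, same n - 1 definition
```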
Properties of the Standard Deviation
• The greater the spread, the larger the s.
• s ≥ 0.
• s = 0 when all the observations take the same value.
• s can be influenced by outliers.
Interpreting the Magnitude of s: The Empirical Rule
If a distribution of data is bell shaped, then approximately:
68% of the observations fall within 1 standard deviation of the mean, that is, between $\bar{x} - s$ and $\bar{x} + s$.
95% of the observations fall within 2 standard deviations of the mean, that is, between $\bar{x} - 2s$ and $\bar{x} + 2s$.
99.7% of the observations fall within 3 standard deviations of the mean, that is, between $\bar{x} - 3s$ and $\bar{x} + 3s$.
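The three percentages come from the normal curve. As a hedged illustration, and assuming that "bell shaped" is modeled by a normal distribution (an assumption of this sketch), R's `pnorm()` gives the exact areas within 1, 2, and 3 standard deviations of the mean:

```r
k <- 1:3

# Area under the standard normal curve between -k and +k.
within_k <- pnorm(k) - pnorm(-k)

print(round(100 * within_k, 2))  # percentages within 1, 2, and 3 standard deviations
```

Real data that are only roughly bell shaped will match these percentages approximately, not exactly.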
Sample Statistics and Population Parameters
• Population: The collection of all individuals or items under consideration.
• Sample: That part of the population from which we actually collect information.
• We use a sample to draw conclusions about the entire population.
• Parameter: Numerical summary of the population.
• Statistic: Numerical summary of a sample.
• Notation:
  Population Mean: $\mu$
  Population Standard Deviation: $\sigma$
  Sample Mean: $\bar{x}$
  Sample Standard Deviation: $s$
2.5 How Can Measures of Position Describe Spread?
• Measures of position:
  – Quartiles
  – Percentiles
• Percentiles:
  – pth percentile: a value such that p percent of the observations fall below or at that value.
• Quartiles:
  – First quartile Q1, the same as the 25th percentile (p = 25)
  – Second quartile Q2, the same as the 50th percentile (p = 50)
  – Third quartile Q3, the same as the 75th percentile (p = 75)
• To calculate the quartiles:
  1. Arrange the observations in increasing order.
  2. The second quartile Q2 is the median M (= 50th percentile).
  3. The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the location of the overall median (= 25th percentile).
  4. The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the location of the overall median (= 75th percentile).
Calculating Quartiles: Example
• Example 2.17 Travel times to work. Find Q1 and Q3.
Arrange the data in order:
5 10 10 10 10 12 15 20 20 25 30 30 40 40 60
The overall median is Q2 = 20 (the 8th of the 15 ordered values).
The observations to the left of the location of the overall median are:
5 10 10 10 10 12 15
so Q1 = 10.
The observations to the right of the location of the overall median are:
20 25 30 30 40 40 60
so Q3 = 30.
Quartiles: Example
• Example 2.5 Travel times to work. Find Q1 and Q3.
Travel times in minutes of 20 randomly chosen New York workers:
10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
Arrange the data in order:
5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85
The overall median = (20 + 25)/2 = 22.5 minutes
The observations to the left of the location of the overall median are:
5 10 10 15 15 15 15 20 20 20
so Q1 = (15 + 15)/2 = 15 minutes.
The observations to the right of the location of the overall median are:
25 30 30 40 40 45 60 60 65 85
so Q3 = (40 + 45)/2 = 42.5 minutes.
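The median-of-each-half rule can be written as a short R function. This is a sketch of the procedure described above; the function name `quartiles` is hypothetical. R's built-in `quantile()` uses a different interpolation rule by default, so it can return slightly different quartiles for the same data.

```r
# Quartiles by the rule above: Q1 and Q3 are the medians of the observations
# to the left and to the right of the location of the overall median.
# quartiles is a hypothetical helper name used only in this sketch.
quartiles <- function(x) {
  x <- sort(x)
  n <- length(x)
  half <- floor(n / 2)           # size of each half (middle value excluded when n is odd)
  lower <- x[seq_len(half)]
  upper <- x[(n - half + 1):n]
  c(Q1 = median(lower), Q2 = median(x), Q3 = median(upper))
}

nc <- c(30, 20, 10, 40, 25, 20, 10, 60, 15, 40, 5, 30, 12, 10, 10)
ny <- c(10, 30, 5, 25, 40, 20, 10, 15, 30, 20, 15, 20, 85, 15, 65, 15, 60, 60, 40, 45)

print(quartiles(nc))  # North Carolina travel times (n = 15)
print(quartiles(ny))  # New York travel times (n = 20)
```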
Another Measure of Spread: The Interquartile Range
• The Interquartile Range (IQR)
  The Interquartile Range = Q3 - Q1
• Example (Travel times to work) Find the IQR.
  5 10 10 10 10 12 15 20 20 25 30 30 40 40 60
  With Q1 = 10 and Q3 = 30, IQR = 30 - 10 = 20.
Detecting Potential Outliers: The 1.5*IQR Criterion
• The 1.5*IQR Criterion for Identifying Potential Outliers: An observation is a potential outlier if it falls more than 1.5*IQR below the first quartile or more than 1.5*IQR above the third quartile.
Detecting Potential Outliers: Example
• Example 2.18 Travel times to work (in minutes). Detect potential outliers.
5 10 10 10 10 12 15 20 20 25 30 30 40 40 80
The Five-Number Summary and the Box Plot
• The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation:
  Minimum  Q1  Median  Q3  Maximum
• Example 2.19 The five-number summary of travel times to work.
5 10 10 10 10 12 15 20 20 25 30 30 40 40 80
The Box Plot
Constructing a box plot
• A box goes from Q1 to Q3.
• A line is drawn inside the box at the median.
• A line goes from the lower end of the box to the smallest observation that is not a potential outlier. A separate line goes from the upper end of the box to the largest observation that is not a potential outlier. These lines are called whiskers.
• The potential outliers are shown separately.
Example (Constructing a boxplot)
Travel times to work.
5 10 10 10 10 12 15 20 20 25 30 30 40 40 80
Steps:
1. Find Q1, Q2, and Q3.
2. Find IQR.
3. Determine two fences:
   lower fence = Q1 - 1.5*IQR
   upper fence = Q3 + 1.5*IQR
4. Identify potential outliers.
5. Determine whiskers: one from Q1 to the smallest observation within the fences, and the other from Q3 to the largest within the fences.
6. Draw the boxplot.
[Figure: box plot of the travel times on a scale from 20 to 80, annotated with Q1 = 10, Q2 = 20, Q3 = 30, the smallest and largest observations within the fences, and the outlier]
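The steps for this data set can be sketched in R. The quartiles are computed with the median-of-each-half rule from Section 2.5 (written out here so the block is self-contained); the variable names are illustrative.

```r
times <- c(5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 80)

# Step 1: quartiles (medians of the halves on each side of the overall median).
x <- sort(times)
n <- length(x)
half <- floor(n / 2)
q1 <- median(x[1:half])
q2 <- median(x)
q3 <- median(x[(n - half + 1):n])

# Step 2: interquartile range.
iqr <- q3 - q1

# Step 3: fences.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Step 4: potential outliers lie outside the fences.
outliers <- x[x < lower_fence | x > upper_fence]

# Step 5: whiskers end at the extreme observations inside the fences.
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

print(c(Q1 = q1, Q2 = q2, Q3 = q3, IQR = iqr))
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
print(whiskers)
```

Here the lower fence is negative, so no travel time can fall below it; only the upper fence flags an observation.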
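Example (Boxplot)
(Text Page 67) Sodium values for 20 breakfast cereals:
0 70 125 125 140 150 170 170 180 200 200 210 210 220 220 230 250 260 290 290
R code:
# Sodium values for the 20 breakfast cereals
x <- c(0, 70, 125, 125, 140, 150, 170, 170, 180, 200,
       200, 210, 210, 220, 220, 230, 250, 260, 290, 290)
# Horizontal box plot; col = 3 picks the third color of the current palette
boxplot(x, col = 3, horizontal = TRUE)
[Figure: horizontal box plot of the sodium values on a scale from 0 to 300]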
Interpretation of Box Plots
• IQR measures the sample variability (or spread).
• A box plot indicates skew. The side with the larger part of the box and the longer whisker usually has skew in that direction.
Interpretation of Box Plots
In terms of symmetry, median, spread, …
Side-by-Side Box Plots
• Help to compare groups (in terms of symmetry, median, spread, …).
• Example: (College student heights) See the “Heights” data on the text CD.
R code (copy and paste into R):
# Read the heights data; heights.csv is assumed to be in the working directory
heights <- read.table("heights.csv", sep = ",", header = TRUE)
# One box plot of HEIGHT for each GENDER group
boxplot(HEIGHT ~ GENDER, data = heights, col = 3:4)
[Figure: side-by-side box plots for GENDER groups 0 and 1 on a scale from 55 to 90]
Box plots comparing heights
The z-Score
• The z-score for an observation is the number of standard deviations that it falls from the mean, and in which direction:
  $z = \frac{\text{observation} - \text{mean}}{\text{standard deviation}}$
• An observation in a bell-shaped distribution is regarded as a potential outlier if it falls more than three standard deviations from the mean; that is, z > 3 or z < -3. (Recall the empirical rule: 99.7% of values are within 3 standard deviations of the mean.)
The z-Score: Example
Example 2.20
The height of a group of young women has mean $\bar{x} = 64$ inches and standard deviation $s = 2.7$ inches. The histogram of height is bell shaped.
• A woman 70 inches tall has standardized height $z = \frac{70 - 64}{2.7} = 2.22$, or 2.22 standard deviations above the mean.
• A woman 60 inches tall has standardized height $z = \frac{60 - 64}{2.7} = -1.48$, or 1.48 standard deviations less than the mean height.
• A woman 73 inches tall has standardized height $z = \frac{73 - 64}{2.7} = 3.33$, or 3.33 standard deviations above the mean height, and that woman's height is a potential outlier.
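The three z-scores can be checked in R. A minimal sketch, with `z_score` as a hypothetical helper name; the mean of 64 inches and standard deviation of 2.7 inches are the values from the example.

```r
# z-score: number of standard deviations an observation falls from the mean.
# z_score is a hypothetical helper name used only in this sketch.
z_score <- function(x, mean, sd) (x - mean) / sd

heights <- c(70, 60, 73)
z <- z_score(heights, mean = 64, sd = 2.7)

print(round(z, 2))
print(heights[abs(z) > 3])  # potential outliers by the |z| > 3 rule
```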
2.6 How Can Graphical Summaries Be Misused?
• Self reading
Statistics for the Physical Sciences
STAT 229
Chapter 3
Association: Contingency,
Correlation, regression
Homework #3
• 3-1: Problems 3.2, 3.4, 3.6, 3.8, 3.10
• 3-2: Problems 3.12, 3.14, 3.16, 3.18, 3.22
• 3-3: Problems 3.26, 3.30, 3.36, 3.38, 3.40
• 3-4: Problems 3.48, 3.50, 3.52, 3.54, 3.56, 3.58, 3.60
Response Variables and Explanatory Variables
• In this chapter, we discuss statistical methods for data on two variables.
• Sometimes, one of the two variables may be termed the response variable and the other the explanatory variable.
• The response variable is the outcome variable on which comparisons are made.
• The explanatory variable defines the groups to be compared with respect to values on the response variable.
Is Smoking Actually Beneficial to Your Health?
• This is Example 1 on page 93 of the text. 1314 women were asked whether they were smokers. They were followed over a period of 20 years.

          Survival Status
Smoker    Dead   Alive   Total
Yes        139     443     582
No         230     502     732
Total      369     945    1314

It’s natural to treat the variable “Survival Status” as a response variable and “Smoker” as an explanatory variable.
Associations
• The main purpose of a data analysis with two variables is to investigate whether there is an association and to describe the nature of that association.
• An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.
• Is the variable “Survival Status” associated with the variable “Smoker”? Does smoking lead to cancer?
Other Examples of Association
  – Smoking and BMI
  – Smoking and lung cancer
  – Irrigation and plant growth
  – Traffic and air pollution
  – Gender and height
3.1 Explore the Association between Two Categorical Variables
• A contingency table is used to explore the association between two categorical variables:
  – Rows list the categories of one variable.
  – Columns list the categories of the other variable.
  – Each cell in the table holds the number of observations (frequency) in the sample with certain outcomes on the two variables.
• Cross-tabulation: The process of finding the frequencies for the cells of a contingency table.
• The previous table is an example of a contingency table.
Construct Contingency Tables From Raw Data
• Excel Data: Two Variables
  – Cancer Treatment: treatments given to the cancer patients (Surgery and Radiation therapy).
  – Cancer Controlled: whether cancer has been controlled (Yes and No).
Contingency table (Example)

                     Cancer Controlled
Treatment            Yes    No    Total
Surgery               21     2       23
Radiation Therapy     15     3       18
Total                 36     5       41

Questions:
(1) What proportion of the patients who had surgery had their cancer controlled?
(2) What proportion of all cancer patients had their cancer controlled?
Answer:
(1) 21 / 23 = 91% of the patients who had surgery had their cancer controlled.
(2) 36 / 41 = 88% of all cancer patients had their cancer controlled.
Conditional Proportions
A conditional proportion is the proportion of one variable at a given level of the other variable.
Marginal proportion
A marginal proportion is the proportion of a row or column variable.
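Both kinds of proportion can be computed from the cancer-treatment contingency table in R. A minimal sketch using base R's `prop.table()`; the object names are illustrative.

```r
# Contingency table: rows are treatments, columns are cancer-controlled status.
tab <- matrix(c(21, 2,
                15, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(Treatment  = c("Surgery", "Radiation"),
                              Controlled = c("Yes", "No")))

# Conditional proportions: within each treatment (each row sums to 1).
cond <- prop.table(tab, margin = 1)
print(round(cond, 3))

# Marginal proportions of the column variable: all patients combined.
marg <- colSums(tab) / sum(tab)
print(round(marg, 3))
```

The entry for Surgery and Yes in `cond` is the answer to question (1), and the Yes entry of `marg` is the answer to question (2).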
Side-by-side bars
• Display conditional proportions.
• Useful for making comparisons.
Side-by-side bars: Example
[Figure: “Cancer condition for two cancer treatments”; side-by-side bars of Proportion (0 to 1) by Cancer controlled (Yes, No) for Surgery and Radiation Therapy]
• The proportion of patients who had their cancer controlled is slightly higher for the patients who had surgery than for those who had radiation therapy.
Is There an Association?

Food type      Pesticide Present   Pesticide Not Present
Organic        29 (0.23)           98 (0.77)
Conventional   19485 (0.73)        7086 (0.27)

Pesticide Status for Organic vs. Conventional Food
[Figure: side-by-side bars of proportion (0 to 0.9) for Pesticide Present and Pesticide Not Present, comparing Organic and Conventional food]
Examples
• Ex 3.8 page 101
• Ex 3.3 page 100
3.2 How Can We Explore the Association between Two Quantitative Variables?
• An association can be studied between
  – two categorical variables
  – two quantitative variables
  – a categorical variable and a quantitative variable.
• In this section, we explore the association between two quantitative variables.
• That is, we will study how a response variable tends to change as the value of an explanatory variable changes.
Scatterplots
• A scatterplot is a graphical display of the relationship between two quantitative variables. It portrays two variables simultaneously:
  – horizontal axis: the explanatory variable
  – vertical axis: the response variable
  – point in the display: the observation corresponding to a subject.
[Figure: example scatterplot of y (5 to 20) against x (2 to 8)]
Example: Worldwide Use of Internet
• See the data (text, page 103).
• Data dictionary
  GDP: Gross domestic product, per capita, in thousands of US dollars
  CO2: Carbon dioxide emissions, per capita, in tons
  Cellular: Percentage of adults who are cellular-phone subscribers
  Fertility: Mean number of children per adult woman
• Questions to explore
  (1) Describe the center and spread of the data distribution.
  (2) Portray the relationship with a scatterplot for Internet use and GDP.
  (3) What do you learn about the association by inspecting the scatterplot?
[Figure: distribution of INTERNET (0 to 50); Mean: 21.14, Standard deviation: 18.47]
[Figure: distribution of GDP (0 to 35); Mean: 16.00, Standard deviation: 10.60]
[Figure: scatterplot of INTERNET (vertical axis, 0 to 50) against GDP (horizontal axis, 0 to 35)]
Interpreting Scatterplots
• You can describe the overall pattern of a scatterplot by the trend, direction, and strength of the relationship between the two variables
  – Trend: linear, curved, clusters, no pattern
  – Direction: positive, negative, no direction
  – Strength: how closely the points fit the trend
• Also look for outliers from the overall trend
Positive Association
• Two quantitative variables x and y are
  – Positively associated when
    • High values of x tend to occur with high values of y
    • Low values of x tend to occur with low values of y
  – Negatively associated when high values of one variable tend to pair with low values of the other variable
Would you expect a positive association, a negative association or no association between the age of the car and the mileage on the odometer?
a) Positive association
b) Negative association
c) No association
Moving Graphics
http://www.gapminder.org
Linear Correlation, r
• Measures the strength and direction of the linear association between x and y
  – A positive r value indicates a positive association
  – A negative r value indicates a negative association
  – An r value close to +1 or -1 indicates a strong linear association
  – An r value close to 0 indicates a weak association
$r = \frac{1}{n - 1}\sum\left(\frac{x - \bar{x}}{s_x}\right)\left(\frac{y - \bar{y}}{s_y}\right)$
Properties of Correlation
• Always falls between -1 and +1
• Sign of correlation denotes direction
  – (-) indicates negative linear association
  – (+) indicates positive linear association
• Correlation is a unitless measure: it does not depend on the variables’ units
• Two variables have the same correlation no matter which is treated as the response variable
• Correlation is sensitive to outliers
• Correlation only measures the strength of a linear relationship
Calculating the Correlation Coefficient

Country          Per Capita GDP (x)   Life Expectancy (y)
Austria                21.4                 77.48
Belgium                23.2                 77.53
Finland                20.0                 77.32
France                 22.7                 78.63
Germany                20.8                 77.17
Ireland                18.6                 76.39
Italy                  21.5                 78.51
Netherlands            22.0                 78.15
Switzerland            23.8                 78.99
United Kingdom         21.2                 77.37

Per Capita Gross Domestic Product and Average Life Expectancy for Countries in Western Europe
Calculating the Correlation Coefficient
$r = \frac{1}{n - 1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) = \frac{1}{10 - 1}(7.285) = 0.809$

x       y        $(x_i - \bar{x})/s_x$   $(y_i - \bar{y})/s_y$   product
21.4    77.48         -0.078                 -0.345              0.027
23.2    77.53          1.097                 -0.282             -0.309
20.0    77.32         -0.992                 -0.546              0.542
22.7    78.63          0.770                  1.102              0.849
20.8    77.17         -0.470                 -0.735              0.345
18.6    76.39         -1.906                 -1.716              3.271
21.5    78.51         -0.013                  0.951             -0.012
22.0    78.15          0.313                  0.498              0.156
23.8    78.99          1.489                  1.555              2.315
21.2    77.37         -0.209                 -0.483              0.101

$\bar{x} = 21.52$, $\bar{y} = 77.754$, sum of products = 7.285
$s_x = 1.532$, $s_y = 0.795$
The standardized values $(x_i - \bar{x})/s_x$ and $(y_i - \bar{y})/s_y$ are called z-scores.
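The z-score calculation of r can be retraced in R and compared with the built-in `cor()` function. A minimal sketch using the ten Western European countries from the table above; the variable names are illustrative.

```r
gdp  <- c(21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2)
life <- c(77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37)

# z-scores for each variable.
zx <- (gdp - mean(gdp)) / sd(gdp)
zy <- (life - mean(life)) / sd(life)

# Correlation: sum of the products of z-scores, divided by n - 1.
r <- sum(zx * zy) / (length(gdp) - 1)

print(round(r, 3))
print(cor(gdp, life))  # built-in function, same quantity
```

Swapping the two arguments of `cor()` gives the same value, which illustrates that the correlation does not depend on which variable is treated as the response.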
Divide a Scatterplot into Quadrants
[Figure: scatterplot of INTERNET (0 to 50) against GDP (0 to 35), divided into quadrants I to IV by the lines $\bar{x} = 16$ and $\bar{y} = 21.1$]
• In quadrant I, both z-scores are positive.
• In quadrant II, z-scores of INTERNET are positive, while z-scores of GDP are negative.
• In quadrant III, both z-scores are negative.
• In quadrant IV, z-scores of GDP are positive, while z-scores of INTERNET are negative.
3.3 How Can We Predict the Outcome of a Variable?
• When a scatterplot indicates a relationship between two variables, we can start fitting a curve to the data.
• The procedure of fitting a curve to the data, along with inferences about parameters of interest and prediction of the response value, is called regression analysis.
Regression Analysis
• The first step of a regression analysis is to identify the response and explanatory variables
  – We use y to denote the response variable
  – We use x to denote the explanatory variable
Regression Line
• A regression line is a straight line that describes how the response variable (y) changes as the explanatory variable (x) changes
• A regression line predicts the value of the response variable (y) for a given level of the explanatory variable (x)
• The y-intercept of the regression line is denoted by a
• The slope of the regression line is denoted by b
Example: How Can Anthropologists Predict Height Using Human Remains?
• Regression Equation: $\hat{y} = 61.4 + 2.4x$
• $\hat{y}$ is the predicted height and $x$ is the length of a femur (thighbone), measured in centimeters
• Use the regression equation to predict the height of a person whose femur length was 50 centimeters:
  $\hat{y} = 61.4 + 2.4(50) = 181.4$
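The prediction can be written as a one-line R function. A minimal sketch; `predict_height` is a hypothetical name, and the coefficients 61.4 and 2.4 are the ones given in the regression equation above.

```r
# Predicted height (cm) from femur length (cm), using y_hat = 61.4 + 2.4 * x.
# predict_height is a hypothetical helper name used only in this sketch.
predict_height <- function(femur_cm) 61.4 + 2.4 * femur_cm

print(predict_height(50))                       # femur length of 50 cm
print(predict_height(51) - predict_height(50))  # change for a 1 cm longer femur
```

The second line shows the slope at work: each additional centimeter of femur length adds 2.4 cm to the predicted height.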
Interpreting the y-Intercept
• y-Intercept:
  – The predicted value for y when x = 0
  – Helps in plotting the line
  – May not have any interpretative value if no observations had x values near 0
Interpreting the Slope
• Slope: measures the change in the predicted response (y) for a 1-unit increase in the explanatory variable (x)
• Example: A 1 cm increase in femur length results in a 2.4 cm increase in predicted height
Slope Values: Positive, Negative, Equal to 0
Regression Line
• At a given value of x, the equation $\hat{y} = a + bx$:
  – Predicts a single value of the response variable
  – But… we should not expect all subjects at that value of x to have the same value of y
    • Variability occurs in the y values!
Residuals
• Measure the size of the prediction errors, the vertical distance between the point and the regression line
• Each observation has a residual
• Calculation for each residual: $y - \hat{y}$
• A large residual indicates an unusual observation
“Least Squares Method” Yields the
Regression Line
•
Residual sum of squares: $\sum (\text{residuals})^2 = \sum (y - \hat{y})^2$
•
The least squares regression line is the line that
minimizes the vertical distance between the points
and their predictions, i.e., it minimizes the residual
sum of squares
•
Note: the sum of the residuals about the regression
line will always be zero
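A small Python sketch can illustrate both claims: the least squares line has a smaller residual sum of squares than nearby lines, and its residuals sum to zero. The five data points are hypothetical toy values, not from the course data. The slope is computed as Sxy / Sxx, which is algebraically the same as r times s_y / s_x.

```python
from statistics import mean

# Hypothetical toy data, made up for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)

b = sxy / sxx          # least squares slope
a = y_bar - b * x_bar  # least squares y-intercept

def rss_for(a_try: float, b_try: float) -> float:
    """Residual sum of squares for the line y-hat = a_try + b_try * x."""
    return sum((yi - (a_try + b_try * xi)) ** 2 for xi, yi in zip(x, y))

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
rss = rss_for(a, b)

print("sum of residuals:", round(sum(residuals), 10))
print("RSS at least squares line:", rss)
print("RSS after nudging the slope:", rss_for(a, b + 0.1))
```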
Regression Formulas for y-Intercept and
Slope
•
Slope: $b = r \left( \frac{s_y}{s_x} \right)$
•
Y-Intercept: $a = \bar{y} - b\,\bar{x}$
Regression line always passes through $(\bar{x}, \bar{y})$
Calculating the slope and y-intercept for the
regression line
Given: $\bar{x} = 0.275$, $\bar{y} = 4.979$, $s_x = 0.0091$, $s_y = 0.368$, $r = 0.653$.
Find a and b.
Slope: $b = r \left( \frac{s_y}{s_x} \right) = 0.653 \left( \frac{0.368}{0.0091} \right) = 26.4$
y-intercept: $a = \bar{y} - b\,\bar{x} = 4.979 - 26.4(0.275) = -2.28$
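The same arithmetic as a minimal Python sketch, using the summary statistics given in this example (the variable names are illustrative):

```python
# Summary statistics from the example
x_bar, y_bar = 0.275, 4.979
s_x, s_y = 0.0091, 0.368
r = 0.653

b = r * s_y / s_x      # slope: b = r (s_y / s_x)
a = y_bar - b * x_bar  # y-intercept: a = y-bar - b x-bar

print("slope b =", round(b, 1))
print("y-intercept a =", round(a, 2))
```

Carrying the unrounded slope into the intercept gives the same two-decimal answer as using the rounded value 26.4.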
Internet Usage and
Gross Domestic Product (GDP)
Country           INTERNET     GDP
Algeria               0.65    6.09
Argentina            10.08   11.32
Australia            37.14   25.37
Austria              38.7    26.73
Belgium              31.04   25.52
Brazil                4.66    7.36
Canada               46.66   27.13
Chile                20.14    9.19
China                 2.57    4.02
Denmark              42.95   29
Egypt                 0.93    3.52
Finland              43.03   24.43
France               26.38   23.99
Germany              37.36   25.35
Greece               13.21   17.44
India                 0.68    2.84
Iran                  1.56    6
Ireland              23.31   32.41
Israel               27.66   19.79
Japan                38.42   25.13
Malaysia             27.31    8.75
Mexico                3.62    8.43
Netherlands          49.05   27.19
New Zealand          46.12   19.16
Nigeria               0.1     0.85
Norway               46.38   29.62
Pakistan              0.34    1.89
Philippines           2.56    3.84
Russia                2.93    7.1
Saudi Arabia          1.34   13.33
South Africa          6.49   11.29
Spain                18.27   20.15
Sweden               51.63   24.18
Switzerland          30.7    28.1
Turkey                6.04    5.89
United Kingdom       32.96   24.16
United States        50.15   34.32
Vietnam               1.24    2.07
Yemen                 0.09    0.79
Using TI-83
•
Enter x data into L1
•
Enter y data into L2
1.
STAT CALC menu
2.
Choose 8: LinReg(a+bx)
3.
1st number = x variable
4.
2nd number = y variable
5.
Enter
Cereal: Sodium and Sugar
The Slope and the Correlation
•
Correlation:
–
Describes the strength of the linear association between 2
variables
–
Does not change when the units of measurement change
–
Does not depend upon which variable is the response and
which is the explanatory
•
Slope:
–
Numerical value depends on the units used to measure the
variables
–
Does not tell us whether the association is strong or weak
–
The two variables must be identified as response and
explanatory variables
–
The regression equation can be used to predict values of the
response variable for given values of the explanatory
variable
The Squared Correlation
•
When a strong linear association exists, the
regression equation predictions tend to be much better
than the predictions using only $\bar{y}$.
•
We measure the proportional reduction in error and
call it $r^2$, which measures the proportion of the
variation in the y-values that is accounted for by the
linear relationship of y with x.
•
A correlation of 0.9 means that $0.9^2 = 0.81 = 81\%$
of the variation in the y-values can be explained by the
explanatory variable, x
3.4 What Are Some Cautions in
Analyzing Association?
•
Be cautious of
–
Extrapolation
–
Influential outliers
–
Interpretation of correlation or association
–
Lurking variables
–
Confounding
Extrapolation
•
Extrapolation:
Using a regression line to
predict y-values for x-values outside the
observed range of the data
–
It’s riskier as we move farther from the range
of the given x-values
–
There is no guarantee that the relationship
given by the regression equation holds
outside the range of sampled x-values
Outliers and Influential Points
•
A
regression
outlier
is an observation/point
that lies far away from the trend that the rest of
the data follows
•
An observation is
influential
if
–
Its
x
value is relatively low or high compared
to the remainder of the data, and
–
The observation is a regression outlier.
•
An influential observation tends to pull the
regression line toward that data point and
away from the rest of the data.
Impact of removing an Influential
data point
Interpretation of Correlation and
Association
•
Correlation does not imply causation
.
•
In general, it’s also true that
association
does not imply causation
. This warning
holds whether we analyze associations
between qualitative variables or between
quantitative variables.
•
Create a scatterplot for “Crime rate”
against “Education” in the “FL crime”
data
on the text CD.
[Figure: Scatterplot of Crime against Education. Axis labels: Education, Crime rate. Fitted line: y = 0.1467x + 61.802, R^2 = 0.218]
Lurking Variables
•
A
lurking variable
is a variable, usually
unobserved, that influences the
association between the variables of
primary interest.
STAT 319 Biometrics Fall 2008
Example
: A reporter studied the causes of a fire to
a house and established a high positive correlation
between the damages (in dollars) and the number
of firefighters at the scene. Which of the following
could be a
lurking variable
that is responsible for
the association?
(a) Firefighter
(b) Weather
(c) Size of the house
(d) Size of the blaze
Example
: An economist noticed that nations with
more TV sets have higher life expectancies. He
established a high positive correlation between
length of life and number of TV sets. Find the lurking
variable, if there is one.
(a) TV sets brands
(b) Popcorn
(c) Wealth of the nation
(d) Sofa
(e) No lurking variable
Simpson’s Paradox
•
Simpson’s Paradox
refers to the phenomenon
that the direction of an association between
two variables can change after we include a
third variable and analyze the data at separate
levels of that variable. (Book)
•
Simpson's paradox (or the Yule-Simpson effect) is a statistical paradox wherein the
successes of groups seem reversed when the
groups are combined. (Wiki)
Is Smoking Actually Beneficial to Your Health?
•
This is Example 1 on page 93 of text. 1314 women were asked whether
they were smokers. They were followed over a period of 20 years.
Smoker    Survival Status      Total
          Dead      Alive
Yes        139        443        582
No         230        502        732
Total      369        945       1314
The data indicate that smoking could apparently be
beneficial to your health. Could a lurking variable be
responsible for the association?
•
There was also age information about the 1314
women involved in the study. These women can
be stratified into 4 different age groups, creating 4
contingency tables.
Age group      18-34           35-54           55-64           65+
Smoker       Dead  Alive     Dead  Alive     Dead  Alive     Dead  Alive
Yes             5    174       41    198       51     64       42      7
No              6    213       19    180       40     81      165     28
Question
: For each age group, find conditional proportions of
deaths for smokers and nonsmokers.
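One way to work this question is sketched below in Python. The (dead, alive) counts are the ones in the age-stratified table; the dictionary layout and the helper name `death_rate` are illustrative choices. The sketch also recomputes the combined proportions so the two views can be compared.

```python
# (dead, alive) counts by age group, taken from the stratified table
counts = {
    "18-34": {"smoker": (5, 174), "nonsmoker": (6, 213)},
    "35-54": {"smoker": (41, 198), "nonsmoker": (19, 180)},
    "55-64": {"smoker": (51, 64), "nonsmoker": (40, 81)},
    "65+": {"smoker": (42, 7), "nonsmoker": (165, 28)},
}

def death_rate(dead: int, alive: int) -> float:
    """Conditional proportion of deaths within one group."""
    return dead / (dead + alive)

# Conditional proportions of deaths within each age group
by_age = {
    age: {group: death_rate(*cell) for group, cell in groups.items()}
    for age, groups in counts.items()
}

# Proportions of deaths when the age groups are combined
overall = {}
for group in ("smoker", "nonsmoker"):
    dead = sum(counts[age][group][0] for age in counts)
    alive = sum(counts[age][group][1] for age in counts)
    overall[group] = death_rate(dead, alive)

for age, rates in by_age.items():
    print(age, {g: round(p, 3) for g, p in rates.items()})
print("combined", {g: round(p, 3) for g, p in overall.items()})
```

Within every age group the proportion of deaths is higher for smokers, yet the combined proportion is lower for smokers (139/582 versus 230/732). The direction of the association reverses once age is taken into account.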
More Simpson Paradoxes
•
http://en.wikipedia.org/wiki/Simpson's_paradox
Simpson's paradox for continuous data: a positive
trend appears for two separate groups (blue and red),
a negative trend (black, dashed) appears when the
data are combined.
Confounding
•
When two explanatory variables are both
associated with a response variable but
are also associated with each other, there
is said to be
confounding
.
•
Age is a confounding variable in the study
of the association between smoking and
survival status.
Difference between a Confounding
Variable and a Lurking Variable
•
A confounding variable is already included in the
study. It is associated both with the response
variable and the explanatory variable.
•
A lurking variable is not measured in the study. It has
the
potential
for confounding.
•
The effect of an explanatory variable can be
analyzed by
adjusting for
confounding variables.
•
Ignoring lurking variables can lead to misleading
conclusions (e.g., age in the smoking-survival association).
Chapter 4:
Gathering Data
Section 4.1
Should We Experiment or Should
We Merely Observe?
Statistics for the Physical Sciences (STAT 229-02)
Homework #4
•
4-1: Problems 4.2, 4.4, 4.6, 4.8, 4.10
•
4-2: Problems 4.14, 4.18, 4.20, 4.22, 4.28, 4.30
•
4-3: Problems 4.34, 4.36, 4.38, 4.40, 4.42
•
4-4: Problems 4.44, 4.46, 4.48, 4.50, 4.52, 4.54
Learning Objectives:
1.
Population versus Sample
2.
Types of Studies: Experimental and
Observational
3.
Comparing Experimental and
Observational Studies
Learning Objective 1:
Population and Sample
•
Population: all the subjects of interest
–
We use statistics to learn about the population, the
entire group of interest
•
Sample: subset of the population
–
Data is collected for the sample because we cannot
typically measure all subjects in the population
Learning Objective 2:
Type of Study: Observational Study
•
In an
observational study,
the
researcher observes values of the
response variable and explanatory
variables for the sampled subjects,
without anything being done to the
subjects (such as imposing a
treatment)
Learning Objective 2:
Observational Study
–
Sample Survey
•
A sample survey selects a sample of
people from a population and interviews
them to collect data.
•
A sample survey is a type of observational
study.
•
A census is a survey that attempts to
count the number of people in the
population and to measure certain
characteristics about them
Learning Objective 2:
Type of Study: Experiment
•
A researcher conducts an
experiment
by assigning subjects to certain
experimental conditions and then
observing outcomes on the response
variable
•
The experimental conditions, which
correspond to assigned values of the
explanatory variable, are called
treatments
Learning Objective 2:
Example
•
Headline: “Student Drug Testing Not Effective in
Reducing Drug Use”
•
Facts about the study:
–
76,000 students nationwide
–
Schools selected for the study included schools
that tested for drugs and schools that did not test
for drugs
–
Each student filled out a questionnaire asking
about his/her drug use
Learning Objective 2:
Example
•
Conclusion: Drug use was similar in schools that
tested for drugs and schools that did not test for
drugs
Learning Objective 2:
Example
This study was an observational study.
In order for it to be an experiment, the
researcher would have had to assign
each school to use or not use drug
testing rather than leaving this decision
to the school.
Learning Objective 3:
Comparing Experiments and Observational Studies
•
An experiment reduces the potential for
lurking variables
to affect the result.
Thus, an experiment gives the
researcher more control over outside
influences.
•
Only an experiment can establish
cause and effect. Observational
studies cannot.
•
Experiments are not always possible
due to ethical reasons, time
considerations and other factors.
Chapter 4
Gathering Data
Section 4.2
What are Good Ways and Poor Ways to
Sample?
Learning Objectives:
1.
Sampling Frame & Sampling Design
2.
Simple Random Sample (SRS)
3.
Random number table
4.
Margin of Error
5.
Convenience Samples
6.
Types of Bias in Sample Surveys
7.
Key Parts of a Sample Survey
Learning Objective 1:
Sampling Frame & Sampling Design
•
The sampling frame is the list of subjects
in the population from which the sample is
taken; ideally, it lists the entire population
of interest
•
The sampling design determines how the
sample is selected. Ideally, it should give
each subject an equal chance of being
selected to be in the sample
Learning Objective 2:
Simple
Random Sampling, SRS
•
Random Sampling is the best way of
obtaining a sample that is
representative of the population
•
A
simple random sample
of ‘n’
subjects from a population is one in
which each possible sample of that
size has the same chance of being
selected
Learning Objective 2:
SRS Example
•
Two club officers are to be chosen for a New Orleans trip
•
There are 5 officers: President, Vice-President,
Secretary, Treasurer and Activity Coordinator
•
The 10 possible samples are:
(P,V) (P,S) (P,T) (P,A) (V,S)
(V,T) (V,A) (S,T) (S,A) (T,A)
•
For a SRS, each of the ten possible samples has an
equal chance of being selected. Thus, each sample has
a 1 in 10 chance of being selected and each officer has
a 4 in 10 chance of being selected.
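The counts in this example can be verified by enumerating the samples in Python. The one-letter labels follow the slide; the variable names are illustrative.

```python
from itertools import combinations

officers = ["P", "V", "S", "T", "A"]  # President, Vice-President, Secretary, Treasurer, Activity Coordinator

# Every possible sample of 2 officers; order within a sample does not matter
samples = list(combinations(officers, 2))

# Number of samples that include each officer
appearances = {o: sum(1 for s in samples if o in s) for o in officers}

print(len(samples), "possible samples:", samples)
print("samples containing each officer:", appearances)
```

With 10 equally likely samples and each officer appearing in 4 of them, each officer's chance of selection is 4 in 10.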
Learning Objective 3:
SRS: Table of Random Numbers
Table E on pg. A6 of text
Table of Random Numbers
Learning Objective 3:
Using Random Numbers to select a SRS
•
To select a simple random sample
–
Number the subjects in the sampling frame
using numbers of the same length (number of
digits)
–
Select numbers of that length from a table of
random numbers or using a random number
generator
–
Include in the sample those subjects having
numbers equal to the random numbers
selected
Learning Objective 3:
Choosing a simple random sample
We need to select a random sample of 5 from a class of 20 students.
1)
List and number all members of the population, which is the class of 20.
2)
The number 20 is two digits long.
3)
Parse the list of random digits into numbers that are two digits long. Here
we choose to start with line 2, for no particular reason.
22 36 84 65 73 25 59 58 53 93 30 99 58 91 98 27 98 25 34 02
1 Alison
2 Amy
3 Brigitte
4 Darwin
5 Emily
6 Fernando
7 George
8 Harry
9 Henry
10 John
11 Kate
12 Max
13 Moe
14 Nancy
15 Ned
16 Paul
17 Ramon
18 Rupert
19 Tom
20 Victoria
•
Remember that 1 is 01, 2 is 02, etc.
•
If you were to hit 09 again before getting five people,
don’t sample Henry twice; you just keep going.
4)
Choose a random sample of size 5 by reading through the
list of two-digit random numbers, starting with line 2 and on.
5)
The first five random numbers matching numbers assigned
to people make the SRS.
Line 2: 22 36 84 65 73 25 59 58 53 93 30 99 58 91 98 27 98 25 34 02
The first individual selected is Amy, number 02. That’s it
from line 2. Move to line 3.
Line 3: 24 13 04 83 60 22 52 79 72 65 76 39 36 48 09 15 17 92 48 30
Then Moe (13), Darwin (04), Henry (09), and Ned (15).
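The table-reading procedure can be sketched in Python. The roster and the two lines of two-digit numbers are copied from this example; the function name `select_srs` and its parameters are illustrative.

```python
# Class roster from the example, numbered 01 to 20 in this order
names = ["Alison", "Amy", "Brigitte", "Darwin", "Emily", "Fernando", "George",
         "Harry", "Henry", "John", "Kate", "Max", "Moe", "Nancy", "Ned",
         "Paul", "Ramon", "Rupert", "Tom", "Victoria"]

# Lines 2 and 3 of the random number table, parsed into two-digit numbers
line2 = "22 36 84 65 73 25 59 58 53 93 30 99 58 91 98 27 98 25 34 02"
line3 = "24 13 04 83 60 22 52 79 72 65 76 39 36 48 09 15 17 92 48 30"

def select_srs(two_digit_numbers, population_size, sample_size):
    """Keep the first numbers that match a subject, skipping repeats."""
    chosen = []
    for number in two_digit_numbers:
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen

numbers = [int(token) for token in (line2 + " " + line3).split()]
chosen = select_srs(numbers, population_size=len(names), sample_size=5)
sample = [names[k - 1] for k in chosen]

print(chosen)
print(sample)
```

In practice a random number generator replaces the printed table; the table lines are used here only so the result can be compared with the hand-worked selection.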
Learning Objective 4:
Margin of Error
•
Sample surveys are commonly used to
estimate population percentages
•
These estimates include a
margin of
error
which tells us how well the sample
estimate predicts the population
percentage
•
When a SRS of n subjects is used, the margin
of error is approximately $\frac{1}{\sqrt{n}} \times 100\%$
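As a quick illustration of the approximation 1/sqrt(n) x 100%, the Python sketch below evaluates it for a few sample sizes. The sample sizes are arbitrary examples and the function name is illustrative.

```python
from math import sqrt

def approx_margin_of_error(n: int) -> float:
    """Approximate margin of error, in percentage points, for a SRS of size n."""
    return 100 / sqrt(n)

for n in (100, 1000, 2500):
    print(n, round(approx_margin_of_error(n), 1))
```

A sample of about 1000 subjects gives a margin of error of roughly 3 percentage points. Halving the margin of error requires about four times as many subjects.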
Learning Objective 4:
Example: Margin of Error
•
A survey result states: “The
margin of error
is plus
or minus 3 percentage points”
•
This means: “It is very likely that the reported
sample percentage is no more than 3 percentage points lower or
higher than the population percentage”
•
Click
here
to see a Gallup example. Read the
“Survey Methods” part and justify the margin of
error in the survey.
Learning Objective 5:
Convenience Samples: Poor Ways to Sample
•
Convenience Sample:
a type of survey
sample that is easy to obtain
–
Unlikely to be representative of the
population
–
Often severe biases result from such a
sample
–
Results apply ONLY to the observed
subjects; that is, they are descriptive.
Learning Objective 5:
Convenience Samples: Poor Ways to Sample
•
Volunteer Sample:
most common form
of convenience sample
–
Subjects volunteer for the sample
–
Volunteers do not tend to be
representative of the entire population
Learning Objective 6:
Types of Bias in Sample Surveys
Bias
: Tendency to systematically favor certain parts of the
population over others
–
Sampling Bias
: Occurs when using biased samples, which
are based on sampling methods such as
using nonrandom
samples
or having undercoverage
–
Nonresponse bias
: occurs when some sampled subjects
cannot be reached or refuse to participate or
fail to answer
some questions
–
Response bias
: occurs when the subject gives an
incorrect
response
or the question is misleading
A Large Sample Does Not Guarantee An Unbiased Sample!
Learning Objective 7:
Key Parts of a Sample Survey
•
Identify
the population of all subjects of interest
•
Construct
a sampling frame which attempts to list all
subjects in the population
•
Use
a random sampling design to select
n
subjects
from the sampling frame
•
Be
cautious of sampling bias due to nonrandom
samples
We can make inferences about the population of interest
when sample surveys that use random sampling are
employed.
Chapter 4
Gathering Data
Section 4.3
What Are Good Ways and Poor Ways to Experiment?
Learning Objectives:
1.
Identify the elements of an experiment
2.
Experiments
3.
3 Components of a good experiment
4.
Blinding the Study
5.
Define Statistical Significance
6.
Generalizing Results of the Study
Learning Objective 1:
Elements of an Experiment
•
Experimental units
: the subjects of an experiment; the
entities that we measure
in an experiment
•
Treatment
: A specific experimental condition imposed on
the subjects of the study; the treatments correspond to
assigned values of the explanatory variable
•
Explanatory variable
: Defines the groups to be compared
with respect to values on the response variable
•
Response variable
: The outcome measured on the
subjects to reveal the effect of the treatment(s).
Learning Objective 2:
Experiments
•
An experiment deliberately imposes treatments on the
experimental units in order to observe their responses.
•
The
goal
of an experiment is
to compare the effect of
the treatment on the response
.
•
An experiment is randomized when the
subjects are randomly assigned to the treatments;
randomization helps to eliminate the effects of lurking
variables
Learning Objective 3:
3 Components of a Good Experiment
•
Control/Comparison group: allows the
researcher to analyze the effectiveness of
the primary treatment
•
Randomization: eliminates possible
researcher bias, balances the comparison
groups on known as well as on lurking
variables (so that the observed difference
among subjects is attributed to treatments)
•
Replication: allows us to attribute
observed effects to the treatments rather
than ordinary variability
Learning Objective 3:
Principle 1: Control or Comparison Group
•
A placebo is a dummy treatment, e.g., a sugar pill. Many
subjects respond favorably to any treatment, even a
placebo.
•
A control group typically receives a placebo. A control
group allows us to analyze the effectiveness of the
primary treatment.
–
A control group need not receive a placebo. Clinical trials
often compare a new treatment for a medical condition, not
with a placebo, but with a treatment that is already on the
market.
Learning Objective 3:
Principle 1: Control or Comparison Group
•
Experiments should
compare
treatments rather
than attempt to assess the effect of a single
treatment in isolation
–
Is the treatment group better, worse, or no different
than the control group?
•
Example: 400 volunteers are asked to quit
smoking and each start taking an antidepressant.
In 1 year, how many have relapsed? Without a
control group (individuals who are not on the
antidepressant), it is not possible to gauge the
effectiveness of the antidepressant.
Learning Objective 3:
Placebo effect
•
Placebo effect (power of suggestion) : The
“placebo effect” is an improvement in
health due not to any treatment but only to
the patient’s belief that he or she will
improve.
Learning Objective 3:
Principle 2: Randomization
•
To have confidence in our results we should
randomly
assign subjects to the treatments. In doing so, we
–
Eliminate bias that may result from the researcher
assigning the subjects
–
Balance the groups on variables known to affect the
response
–
Balance the groups on lurking variables that may be
unknown to the researcher
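Random assignment can be sketched in a few lines of Python. The 400 subject IDs and the two treatment labels are hypothetical, loosely modeled on the quit-smoking example; the function name `randomize` and the seed are illustrative. Shuffling and then dealing subjects out in turn is one common way to get equal-sized groups.

```python
import random

def randomize(subjects, treatments, seed=None):
    """Shuffle the subjects, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(subject)
    return groups

# Hypothetical subject IDs 1..400 and two treatment labels
assignment = randomize(range(1, 401), ["antidepressant", "placebo"], seed=229)

for treatment, members in assignment.items():
    print(treatment, len(members), "subjects, first few:", members[:5])
```

Because the researcher never chooses who goes where, neither known nor lurking variables can systematically favor one group.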
Learning Objective 3:
Principle 3: Replication
•
Replication is the process of assigning
several experimental units to each
treatment
–
The difference due to ordinary variation is
smaller with larger samples
–
We have more confidence that the sample
results reflect a true difference due to
treatments when the sample size is large
–
Since it is always possible that the observed
effects were due to chance alone, replicating
the experiment also builds confidence in our
conclusions
Learning Objective 4:
Blinding the Experiment
•
Ideally, subjects are unaware, or
blind
, to the
treatment they are receiving
•
If an experiment is conducted in such a way that
neither the subjects nor the investigators working with
them know which treatment each subject is receiving,
then the experiment is double-blinded
•
A double-blinded experiment controls response bias
from the respondent and experimenter
Learning Objective 5:
Define Statistical Significance
•
If an experiment (or other study) finds a difference in
two (or more) groups, is this difference really
important?
•
If the observed difference is larger than what would
be expected just by chance, then it is labeled
statistically significant.
•
Rather than relying solely on the label of statistical
significance, also look at the actual results to
determine if they are practically significant.
Learning Objective 6:
Generalizing Results
•
Recall that the goal of experimentation is
to analyze the association between the
treatment and the response for the
population
, not just the sample
•
However, care should be taken to
generalize the results of a study only to
the population that is represented by the
study.
Chapter 4
Gathering Data
Section 4.4
What are Other Ways to Conduct Experimental and
Observational Studies
Learning Objectives
1.
Sample Surveys: Other Random Sampling Designs
2.
Types of Observational Studies: Prospective and
Retrospective
3.
Multifactor Experiment
4.
Matched pairs design
5.
Randomized block design
Learning Objective 1:
Sample Surveys: Random Sampling
Designs
•
It is not always possible to conduct an
experiment , so it is necessary to have
well designed, informative studies that are
not experimental, e.g., sample surveys
that use randomization
–
Simple Random Sampling
–
Cluster Sampling
–
Stratified Random Sampling
Learning Objective 1:
Sample Surveys: Cluster Random Sample
Steps
–
Divide the population into a large number of
clusters
, such as city blocks
–
Select a simple random sample of the clusters
–
Use the subjects in those clusters as the
sample
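The three steps can be sketched in Python. The population of 50 city blocks with 8 households each is hypothetical, and the function name `cluster_sample` and the seed are illustrative.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Take a SRS of clusters, then use every subject in the chosen clusters."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    subjects = [s for c in chosen for s in clusters[c]]
    return chosen, subjects

# Hypothetical population: 50 city blocks, 8 households per block
clusters = {
    f"block-{b:02d}": [f"block-{b:02d}/house-{h}" for h in range(1, 9)]
    for b in range(1, 51)
}

chosen_blocks, sample = cluster_sample(clusters, n_clusters=5, seed=229)
print(chosen_blocks)
print(len(sample), "households in the sample")
```

Only a list of clusters is needed up front, which is why this design is useful when no reliable sampling frame of individual subjects exists.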
Learning Objective 1:
Sample Surveys: Cluster Random
Sample
•
Preferable when
–
A reliable sampling frame is unavailable
–
The cost of selecting a SRS is excessive
•
Disadvantage
–
Usually need a larger sample size than with a
SRS in order to achieve a particular margin of
error
Learning Objective 1:
Sample Surveys: Stratified Random Sample
Steps
–
Divide the population into separate groups,
called strata
–
Select a simple random sample from each
stratum
–
Combine the samples from all strata to form
the complete sample
Learning Objective 1:
Sample Surveys: Stratified Random Sample
•
Advantage is that you can include in your
sample enough subjects in each stratum
you want to evaluate
•
Disadvantage is that you must have a
sampling frame and know the stratum into
which each subject belongs
Learning Objective 1:
Stratified Random Sample - Example
Suppose a university has the following student
demographics:
Undergraduate   Graduate   First Professional   Special
55%             20%        5%                   20%
In order to ensure proper coverage of each
demographic, a stratified random sample of 100
students could be chosen as follows: select a SRS
of 55 undergraduates, a SRS of 20 graduates, a
SRS of 5 first professional students, and a SRS of
20 special students; combine these 100 students.
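The allocation in this example can be sketched in Python. The stratum shares come from the slide; the sampling frame of 10,000 made-up student IDs, the function name `stratified_sample`, and the seed are hypothetical.

```python
import random

# Stratum shares from the example
shares = {"Undergraduate": 0.55, "Graduate": 0.20,
          "First Professional": 0.05, "Special": 0.20}

# Hypothetical sampling frame: 10,000 student IDs, prefixed by stratum
frame = {
    "Undergraduate": [f"U{i}" for i in range(5500)],
    "Graduate": [f"G{i}" for i in range(2000)],
    "First Professional": [f"F{i}" for i in range(500)],
    "Special": [f"S{i}" for i in range(2000)],
}

def stratified_sample(frame, shares, n, seed=None):
    """Draw a SRS from each stratum, sized by its share, and combine them."""
    rng = random.Random(seed)
    sample = []
    for stratum, share in shares.items():
        k = round(share * n)  # subjects to draw from this stratum
        sample.extend(rng.sample(frame[stratum], k))
    return sample

sample = stratified_sample(frame, shares, n=100, seed=229)
sizes = {prefix: sum(1 for s in sample if s.startswith(prefix)) for prefix in "UGFS"}
print(sizes)
```

Note that the frame must record each student's stratum, which is the disadvantage mentioned for this design.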
Learning Objective 1:
Comparing Random Sampling Methods
Learning Objective 2:
Types of Observational Studies
An observational study can yield useful information when an
experiment is not practical.
•
Types of observational studies:
–
Sample Survey: attempts to take a cross section of a
population at the current time
–
Retrospective
study: looks into the past
–
Prospective
study: follows its subjects into the future
•
Causation can never be definitively established with an
observational study, but well designed studies can provide
supporting evidence for the researcher’s beliefs
Learning Objective 2:
Retrospective Case-Control Study
•
A case-control study is a retrospective
observational study in which subjects who
have a response outcome of interest (the
cases) and subjects who have the other
response outcome (the controls) are
compared on an explanatory variable
Learning Objective 2:
Example: Case-Control Study
•
Response outcome of interest: Lung cancer
–
The cases have lung cancer
–
The controls do not have lung cancer
•
The two groups were compared on the explanatory
variable smoker/nonsmoker
                    Lung Cancer
Smoker           Cases    Controls
yes                688         650
no                  21          59
Total              709         709
Prob(smoker)       97%         92%
Learning Objective 2:
Example: Prospective Study
Nurses’ Health Study:
–
Began in 1976 with 121,700 female nurses aged 30 to 55;
questionnaires are filled out every two years
–
Purpose was to explore the relationships among diet,
hormonal factors, smoking habits and exercise habits and
the risk of coronary heart disease, pulmonary disease and
stroke
–
Nurses are followed into the future to determine whether
they eventually develop an outcome such as lung cancer
and whether certain explanatory variables are associated
with it
Learning Objective 3:
Multifactor Experiments
•
A Multifactor experiment uses a single experiment to
analyze the effects of two or more explanatory variables
on the response
•
Categorical explanatory variables in an experiment are
often called
factors
•
We are often able to learn more from a multifactor
experiment than from separate one-factor experiments
since the response may vary for different factor
combinations
Learning Objective 3:
Example: Multifactor experiment
Examine the effectiveness of
both Zyban and nicotine
patches on quitting smoking
•
Two factor experiment
•
4 treatments
Learning Objective 3:
Example: Multifactor experiment
•
subjects: a certain number of undergraduate
students
•
all subjects viewed a 40-minute television program
that included ads for a digital camera
•
some subjects saw a 30-second commercial; others
saw a 90-second version
•
same commercial was shown either 1, 3, or 5 times
during the program
•
there were two factors: length of the commercial (2
values), and number of repetitions (3 values)
Learning Objective 3:
Example: Multifactor experiment
•
the 6 combinations of one value of
each factor form six treatments
                        Factor B: Repetitions
Factor A: Length     1 time    3 times    5 times
30 seconds              1         2          3
90 seconds              4         5          6
subjects assigned to Treatment 3 see a 30-second ad
five times during the program
after viewing, all subjects answered questions about:
recall of the ad, their attitude toward the commercial,
and their intention to purchase the product
–
these were the response variables.
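The six treatments are the cross product of the two factors, which can be listed with a short Python sketch. The factor values come from this example; numbering the combinations row by row, as in the treatment table, is an assumption about how the slide numbers them.

```python
from itertools import product

lengths = [30, 90]       # Factor A: commercial length in seconds
repetitions = [1, 3, 5]  # Factor B: number of times the ad is shown

# Number the combinations 1..6, row by row
treatments = {
    number: combo
    for number, combo in enumerate(product(lengths, repetitions), start=1)
}

for number, (length, reps) in treatments.items():
    print(f"Treatment {number}: {length}-second ad shown {reps} time(s)")
```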
Learning Objective 4:
Matched Pairs Design
In a matched pairs design, the subjects receiving the two
treatments are somehow matched (same person,
husband/wife, two plots in the same field, etc.)
–
In a
crossover design
, the same individual is used for the
two treatments
•
Randomly
–
assign the two treatments to the two matched subjects, or
–
randomize the order of applying the treatments in a
crossover design
•
The number of replicates equals the number of pairs
•
Helps to reduce effects of lurking variables
Learning Objective 5:
Randomized Block Design
•
A
block
is a set of experimental units that
are matched with respect to one or more
characteristics
•
A
Randomized Block Design, RBD,
is
when the random assignment of
experimental units to treatments is carried
out separately within each block
Learning Objective 5:
Example: Randomized Block Design
Block = gender; 3 treatments = 3 types of therapy
The men (as well as the women) are randomly assigned to the
3 treatments; differences can be compared with respect to
gender as well as therapy type
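Randomizing separately within each block can be sketched in Python. The nine men and nine women, the therapy labels, the function name, and the seed are all hypothetical; only the structure (block = gender, 3 treatments) follows the example.

```python
import random
from collections import Counter

def randomized_block_design(blocks, treatments, seed=None):
    """Within each block, shuffle the units and deal them out to the treatments."""
    rng = random.Random(seed)
    assignment = {}
    for block, units in blocks.items():
        shuffled = list(units)
        rng.shuffle(shuffled)
        for i, unit in enumerate(shuffled):
            assignment[unit] = (block, treatments[i % len(treatments)])
    return assignment

# Hypothetical subjects: 9 men and 9 women; 3 hypothetical therapy labels
blocks = {
    "men": [f"M{i}" for i in range(1, 10)],
    "women": [f"W{i}" for i in range(1, 10)],
}
therapies = ["therapy A", "therapy B", "therapy C"]

assignment = randomized_block_design(blocks, therapies, seed=229)
cell_counts = Counter(assignment.values())
print(cell_counts)
```

Each therapy receives the same number of men and the same number of women, so gender cannot be mixed up with the therapy effect.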
Learning Objective 5:
Randomized Block Design
•
RBD eliminates variability in the response
due to the blocking variable; allows for
better comparisons to be made among the
treatments of interest
•
A matched pairs design is a special case
of a RBD with two observations in each
block
Chapter 5
Probability in our Daily Lives
Section 5