# Statistics for the Physical Sciences

29 Nov 2013

STAT 229

Chapter 1

Statistics: The Art and Science of Learning from Data

Fall 2008 STAT 229

Homework 1

Problems 1.1 to 1.36 (even numbered)

Complete the survey on pages 22-23

1.1 Overview

Statistics is the art and science of learning from data. It is a collection of methods for

Planning experiments (Design)

Obtaining data (data are collected observations, such as measurements and survey responses)

Organizing data

Summarizing data (Description)

Analyzing data

Interpreting results, and

Making decisions and predictions (Inference)

Statistics is a branch of mathematics.

Statistics was invented for studying randomness (a lack of order, purpose, cause, or predictability, per Wikipedia), without which the world would be of no interest.

Examples of random phenomena:

Phelps won 8 gold medals

A 6-sided die is rolled and lands on 4

It’s going to rain tomorrow

Randomness, Fuzziness and Uncertainty

Randomness creates uncertainty. On the other hand, randomness can be put to use. When estimating the proportion of adults in the USA who smoke, we can survey 1000 adults and use the survey responses as our data. How is randomness used here? Why use it?

1.2 We Learn about Population Using Samples

In the previous example, all US adults form a population, while the 1000 adults surveyed form a sample.

In general, a population is the complete collection of all items to be studied. These items can be human subjects, animals, machines, even scores.

A sample is a sub-collection of items selected from a population.

A sample should represent the underlying population. Therefore, sample data must be collected in an appropriate way, such as through a process of random selection.

How large should a sample be?

What are the appropriate ways to generate a sample?

Methods for summarizing sample data are referred to as descriptive statistics, while methods for making decisions or predictions about a population based on sample data are called inferential statistics.

Parameter and Statistic

A parameter is a numeric summary of the
population

A statistic is a numeric summary of a
sample taken from the population

Problem: Number of Good Friends

One year the General Social Survey
asked, “About how many good friends do
you have?” Of the 819 people who
responded, 6% reported having only one
good friend. Identify

(a) the sample

(b) the population, and

(c) the parameter or statistic

Try Problem 1.3 on page 8 of the textbook.

Go to the General Social Survey website

http://sda.berkeley.edu/GSS

By entering HEAVEN as the “row variable”
name, find the percentages of people who
said “yes, definitely,” “yes, probably,” “no,
probably not,” and “no, definitely not” when
asked whether they believed in heaven.

1.3 What Role Do Computers Play in Statistics?

Save (large) data files

Create databases

Do analysis with software: SAS, Minitab, SPSS, R, S-PLUS, C, MATLAB, Excel, ...

Simulation: use of computers to mimic reality.

Simulation of Coin Tossing in Microsoft Excel

NOTES:

1. Pseudo-random numbers are numbers generated by a computer algorithm to simulate real random numbers.

2. Excel has an Analysis ToolPak with which one can do statistical analysis, including simulation.

When a balanced coin is tossed 20 times, we get a sequence of 20 Heads or Tails. Let 1 denote Heads and 0 denote Tails. Then a sample is a sequence of 1s and 0s. The empirical probability, or sample proportion, of tossing Heads (1) is computed as the number of 1s divided by the total number of tosses. The coin-tossing process can be simulated using a Bernoulli distribution with proportion p = 0.5.

1. Simulate 5 random samples, each consisting of 10 pseudo-random numbers from a Bernoulli(0.5) distribution. Repeat the process using 1000 pseudo-random numbers per sample.

2. Compute the sample proportion for each of the 10 samples.

Simulation: Tools > Data Analysis > Random Number Generation > Bernoulli
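The same exercise can be sketched outside Excel. The following is a minimal Python sketch, not the method used in the course; the function name `simulate_proportions` and the seeds are assumptions chosen only to make the run reproducible.

```python
import random

def simulate_proportions(n_samples, sample_size, p=0.5, seed=229):
    """Return the sample proportion of Heads (1s) for each simulated sample."""
    rng = random.Random(seed)  # fixed seed (arbitrary) so the run is reproducible
    proportions = []
    for _ in range(n_samples):
        # One Bernoulli(p) draw per toss: 1 = Heads, 0 = Tails
        tosses = [1 if rng.random() < p else 0 for _ in range(sample_size)]
        proportions.append(sum(tosses) / sample_size)
    return proportions

# 5 samples of size 10, then 5 samples of size 1000 (10 samples in total)
small = simulate_proportions(n_samples=5, sample_size=10)
large = simulate_proportions(n_samples=5, sample_size=1000, seed=230)

print("n = 10:  ", small)
print("n = 1000:", large)
# Spread of the sample proportions (largest minus smallest) for each sample size
print("spread, n = 10:  ", max(small) - min(small))
print("spread, n = 1000:", max(large) - min(large))
```

Comparing the two printed spreads is one way to approach the questions about sample-to-sample variability and the effect of sample size.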

More questions:

1. Where does randomness play a role?

2. Is the amount of variability from sample to sample of size 10 bigger than the amount of variability from sample to sample of size 1000?

3. Comment on the effect of sample size.

If You Are Using Excel 2007…

Excel 2007 no longer has a Tools menu.

To use the Analysis ToolPak, go to the Office button at the upper left corner, click Excel Options, then click Add-Ins and highlight Analysis ToolPak. Click the Go button to open the Add-Ins window. Check the box Analysis ToolPak and click OK.

Now go to the Data menu, click Data Analysis and choose Random Number Generation.

Chapter 2

Exploring Data with Graphs and Numerical Summaries

Homework #2

2-1 (p29): Problems 2.2, 2.4, 2.6, 2.8

2-2 (p44): Problems 2.10, 2.12, 2.14, 2.16, 2.22

2-3 (p55): Problems 2.30, 2.32, 2.34, 2.36, 2.42, 2.44

2-4 (p64): Problems 2.48, 2.52, 2.56, 2.58, 2.60

2-5 (p73): Problems 2.64, 2.66, 2.68, 2.72, 2.74, 2.78, 2.80, 2.82

2-6 (p80): Problems 2.84

2.1 What Are the Types of Data?

A characteristic observed for the subjects in a study is called a variable.

Examples of variables: major, GPA, religious affiliation, smoking status, ...

Variables can be quantitative (numerical) or qualitative (categorical).

A variable is quantitative if its numerical values represent different magnitudes of the variable, such as weight or GPA. A variable is categorical if its value represents a category, such as major or letter grade.

Quantitative variables can be discrete or continuous.

A discrete variable is usually a count, such as the number of car accidents last year, while a continuous variable is a measurement, such as distance.

The reason we care whether a variable is quantitative, categorical, discrete, or continuous is that the method used to analyze a data set depends on the type of variable the data represent.

Key Features of a Variable

A quantitative variable usually takes different values in a study. Studying the variability of such a variable is one of the most important tasks in statistics. Another feature of a quantitative variable is the center of all its possible values.

For a categorical variable, a key feature to describe is the relative number of items (percentage) in the various categories.

Frequency Tables

For a categorical variable, counting how often each possible value is taken by the variable is a critical first step in descriptive statistics. The results are summarized in a frequency table.

The following table shows the frequency of shark attacks in various regions for 1990-2006.

| Region | Frequency | Proportion | Percentage |
|---|---|---|---|
| Florida | 365 | 0.785 | 78.5 |
| Hawaii | 60 | 0.129 | 12.9 |
| California | 40 | 0.086 | 8.6 |
| Total | 465 | 1.000 | 100 |

Frequency of shark attacks in various regions for 1990-2006

Questions: What is the variable? Is it categorical?

The mode of categorical data is the category with the highest frequency. Find the mode of the data.

Frequency Tables (cont’d)

In the table above, the proportions and percentages are also called relative frequencies. A table like this is called a frequency table.

A frequency table is a listing of possible values for a variable, together with the number of observations for each value.

For a quantitative variable, a frequency table is constructed by first categorizing the data into a set of adjacent intervals, then finding the frequencies for each interval.

| No. Hours | Frequency | Percent |
|---|---|---|
| 0-1 | 232 | 25.6 |
| 2-3 | 403 | 44.5 |
| 4-5 | 181 | 20.0 |
| 6-7 | 45 | 5.0 |
| 8 or more | 44 | 4.9 |
| Total | 905 | 100.0 |

Frequency Table for Daily TV Watching

Example

Construct a frequency table for quiz scores for twenty students: 5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6

| Score | Frequency | Proportion | Percentage |
|---|---|---|---|
| [0,2] | 1 | 0.05 | 5 |
| (2,4] | 1 | 0.05 | 5 |
| (4,6] | 7 | 0.35 | 35 |
| (6,8] | 8 | 0.40 | 40 |
| (8,10] | 3 | 0.15 | 15 |
| Total | 20 | 1.00 | 100 |
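As a check on the table above, here is a minimal Python sketch that bins the twenty quiz scores into the same intervals and computes the frequency, proportion, and percentage for each. The helper name `interval_label` is an assumption made for this illustration.

```python
from collections import Counter

scores = [5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6]

# Intervals as in the table: [0,2], (2,4], (4,6], (6,8], (8,10]
upper_bounds = [2, 4, 6, 8, 10]
labels = ["[0,2]", "(2,4]", "(4,6]", "(6,8]", "(8,10]"]

def interval_label(score):
    """Return the label of the first interval whose upper bound covers the score."""
    for upper, label in zip(upper_bounds, labels):
        if score <= upper:
            return label
    raise ValueError(f"score {score} is outside [0, 10]")

counts = Counter(interval_label(s) for s in scores)
n = len(scores)

# One row per interval: (label, frequency, proportion)
table = [(label, counts[label], counts[label] / n) for label in labels]
for label, freq, prop in table:
    print(f"{label:>7}  {freq:>2}  {prop:.2f}  {100 * prop:.0f}%")
```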

2.2 How Can We Describe Data Using Graphical Summaries?

| Group | Seats | Percent (%) |
|---|---|---|
| EUL | 39 | 5.3 |
| PES | 200 | 27.3 |
| EFA | 42 | 5.7 |
| EDD | 15 | 2.0 |
| ELDR | 67 | 9.2 |
| EPP | 276 | 37.7 |
| UEN | 27 | 3.7 |
| Other | 66 | 9.0 |
| Total | 732 | 100.0 |

Preliminary results of the election for the European Parliament in 2004

Pie Charts and Bar Graphs for Categorical Variables

Pie chart: A circle having a “slice of a pie” for each category. The size of a slice corresponds to the percentage of observations in the category.

Bar graph: Displays a vertical bar for each category. The height of the bar is the percentage of observations in the category.

[Figure: pie chart of Seats by group: EUL 5%, PES 27%, EFA 6%, EDD 2%, ELDR 9%, EPP 38%, UEN 4%, Other 9%]
Example: Use the shark attack data from the frequency table above to construct a pie chart.

Bar Graph for European Parliament in 2004

[Figure: bar graph of Seats (vertical axis, 0 to 300) for the groups EUL, PES, EFA, EDD, ELDR, EPP, UEN, Other]

Pareto Chart

[Figure: Pareto chart of Percentage (vertical axis, 0 to 40) by Group, in the order EPP, PES, ELDR, Other, EFA, EUL, UEN, EDD]

Pareto Chart: Bar graph with categories ordered by their frequency, from the tallest bar to the shortest

Graphs for Quantitative Variables

Dot plot: Shows a dot for each observation, placed just above the value on the number line for that observation.

Stem-and-leaf plot: Similar to a dot plot. Each observation is represented by a stem and a leaf.

Histogram: A graph that uses bars to portray the frequencies or relative frequencies.

Example: Dot plot

Quiz scores for twenty students: 5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6

[Figure: dot plot of the quiz scores on a number line from 1 to 10]

Example: Stem-and-Leaf Plot

Test scores for 12 students: 80, 45, 100, 76, 84, 87, 96, 62, 75, 74, 87, 76

Step 1: Sort the test scores: 45, 62, 74, 75, 76, 76, 80, 84, 87, 87, 96, 100

Step 2: Place the scores in the corresponding stems and leaves (usually the last digit will be the leaf).

| Stem | Leaves |
|---|---|
| 4 | 5 |
| 5 | |
| 6 | 2 |
| 7 | 4 5 6 6 |
| 8 | 0 4 7 7 |
| 9 | 6 |
| 10 | 0 |

Histogram

Step 1: Divide the range of data into intervals of equal width.

Step 2: Count the frequency and construct a frequency table (or relative frequency table).

Step 3: Label the endpoints of the intervals on the x-axis. Draw a bar over each interval with height equal to its frequency (or relative frequency), values of which are marked on the y-axis.

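Example: Histogram

Quiz scores for twenty students: 5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6

| Score | Freq |
|---|---|
| [0,2] | 1 |
| (2,4] | 1 |
| (4,6] | 7 |
| (6,8] | 8 |
| (8,10] | 3 |
| (10,12] | 0 |

The counts correspond to intervals that include their right endpoint, as in the earlier frequency table for these scores.

[Figure: histogram of the quiz scores; horizontal axis Score, vertical axis Frequency]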

The Shape of a Distribution

When looking at a graph of quantitative data (dot plot, stem-and-leaf plot, or histogram), look for

the overall pattern: Do the data cluster together?

the outliers

modes: unimodal, bimodal, …

skew: skewed to the left or right

the underlying smooth curve

[Figure: example distributions labeled Unimodal, Bimodal, and Multimodal, with outliers marked]

[Figure: "Histogram of x" and "Histogram of y", each showing Frequency over the range -10 to 10]

These Two Histograms Show Differences in Spread

Time plots

Time series: a data set collected over time.

Time plot: a graph displaying time-series data.

Look for patterns over time.

Time plots: Example

Gasoline price

2.3 How can we describe the center of quantitative data?

Measures of center: mean and median

Mean: the sum of the observations divided by the number of observations.

Median: the midpoint of the observations.

Mean Formula

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$

Example: Travel times to work

How long does it take to get from home to work? Here are the travel times in minutes of 15 workers in North Carolina, chosen at random by the Census Bureau:

30 20 10 40 25 20 10 60 15 40 5 30 12 10 10

Find the mean travel time.

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{30 + 20 + \cdots + 10}{15} = \frac{337}{15} \approx 22.5 \text{ minutes}$$
How to Determine the Median

Step 1: Sort your data from the smallest to the largest.

Step 2: If n, the number of data points, is odd, the median is the middle value; if n is even, the median is the average of the middle two values.

Example

Find the median for the travel times

30 20 10 40 25 20 10 60 15 40 5 30 12 10 10

Arrange the data in order:

5 10 10 10 10 12 15 20 20 25 30 30 40 40 60

Since n = 15 is odd, Median = 20, the middle (8th) value.

Example

Find the median for the scores

60 80 87 73 95 92

Arrange the data in order: 60 73 80 87 92 95

Since n = 6 is even, Median = (80 + 87)/2 = 83.5, the average of the two middle values.
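The mean and median computations above can be reproduced with Python's standard `statistics` module. This is only a quick check of the arithmetic; the variable names are chosen for this illustration.

```python
from statistics import mean, median

travel_times = [30, 20, 10, 40, 25, 20, 10, 60, 15, 40, 5, 30, 12, 10, 10]
scores = [60, 80, 87, 73, 95, 92]

# Mean: sum of the observations divided by the number of observations
print(sum(travel_times), len(travel_times))   # 337 observations-sum over n = 15
print(round(mean(travel_times), 2))

# Median with n odd: the middle value of the sorted data
print(median(travel_times))

# Median with n even: the average of the two middle values
print(median(scores))
```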

Properties of the mean and the median

The mean is the balance point of the data.

In a symmetric distribution, the mean and median are the same.

In a skewed distribution, the mean is usually farther out in the long tail than the median.

Skewed to the right, mean > median

Skewed to the left, mean < median

The mean is less resistant to outliers than the median.

Mean, Median, and Mode

The mean is the balance point.

The median is the midpoint.

The mode is the value that occurs most frequently.

Mean and Median: Applications

City data

St Cloud, MN

New Orleans, LA

2.4 How can we describe the spread of quantitative data?

The Range

The Standard Deviation

The Interquartile Range (Sec. 2.5)

Measuring spread: The Range

Range = largest value - smallest value

Example: Find the range of the quiz scores: 2, 5, 0, 7, 9, 1, 7, 6, 10, 9, 3, 9, 9, 7, 0, 6, 9, 10, 8, 1, 4, 6, 8, 9, 4, 2, 9, 0, 5, 7

Range = largest value - smallest value = 10 - 0 = 10

The Range

Simple to compute

Easy to understand

But

Uses only extreme values

Affected severely by outliers

Measuring spread: Variance and Standard Deviation

The standard deviation and variance measure spread by looking at how far the observations are from their mean.

The variance $s^2$ of a set of observations is an average of the squares of the deviations from the mean.

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1}\sum (x_i - \bar{x})^2$$

The standard deviation $s$ is the square root of the variance

$$s = \sqrt{\frac{1}{n - 1}\sum (x_i - \bar{x})^2}$$

Example (Calculating the standard deviation s)

Metabolic rates of 7 men who took part in a study of dieting. The units are calories per 24 hours.

1792 1666 1362 1614 1460 1867 1439

Find the mean first:

$$\bar{x} = \frac{1792 + \cdots + 1439}{7} = \frac{11200}{7} = 1600 \text{ calories}$$
The standard deviation: Example (cont’d)

| Observations $x_i$ | Deviations $x_i - \bar{x}$ | Squared deviations $(x_i - \bar{x})^2$ |
|---|---|---|
| 1792 | 192 | 36864 |
| 1666 | 66 | 4356 |
| 1362 | -238 | 56644 |
| 1614 | 14 | 196 |
| 1460 | -140 | 19600 |
| 1867 | 267 | 71289 |
| 1439 | -161 | 25921 |
| | sum = 0 | sum = 214870 |

The variance:

$$s^2 = \frac{214870}{6} = 35811.67$$

The standard deviation:

$$s = \sqrt{35811.67} = 189.24 \text{ calories}$$
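The table of deviations can be reproduced step by step. The sketch below follows the definition of the sample variance with divisor n - 1; the variable names are assumptions made for this illustration.

```python
import math

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]  # calories per 24 hours
n = len(rates)

x_bar = sum(rates) / n                      # mean
deviations = [x - x_bar for x in rates]     # x_i - x_bar
squared = [d ** 2 for d in deviations]      # (x_i - x_bar)^2

variance = sum(squared) / (n - 1)           # divide by n - 1, not n
sd = math.sqrt(variance)

print(x_bar, sum(deviations), sum(squared))
print(round(variance, 2), round(sd, 2))
```

The deviations always sum to zero, which is a useful check on the arithmetic before squaring.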

Properties of the Standard Deviation

The greater the spread, the larger the s.

s ≥ 0.

s = 0 when all the observations take the same value.

s can be influenced by outliers.

Interpreting the Magnitude of s: The Empirical Rule

If a distribution of data is bell shaped, then approximately:

68% of the observations fall within 1 standard deviation of the mean, that is, between $\bar{x} - s$ and $\bar{x} + s$.

95% of the observations fall within 2 standard deviations of the mean, that is, between $\bar{x} - 2s$ and $\bar{x} + 2s$.

99.7% of the observations fall within 3 standard deviations of the mean, that is, between $\bar{x} - 3s$ and $\bar{x} + 3s$.
Sample Statistics and Population Parameters

Population: The collection of all individuals or items under consideration.

Sample: That part of the population from which we actually collect information.

We use a sample to draw conclusions about the entire population.

Parameter: Numerical summary of the population.

Statistic: Numerical summary of a sample.

Notations:

Population Mean: $\mu$

Population Standard Deviation: $\sigma$

Sample Mean: $\bar{x}$

Sample Standard Deviation: $s$

2.5 How Can Measures of Position

Measures of position:

Quartiles

Percentiles

Percentiles:

The pth percentile is a value such that p percent of the observations fall below or at that value.

Quartiles:

First quartile, the same as the 25th percentile (p = 25)

Second quartile, the same as the 50th percentile (p = 50)

Third quartile, the same as the 75th percentile (p = 75)

Calculating Quartiles

To calculate the quartiles:

1. Arrange the observations in increasing order.

2. The second quartile $Q_2$ is the median M (= 50th percentile).

3. The first quartile $Q_1$ is the median of the observations whose position in the ordered list is to the left of the location of the overall median (= 25th percentile).

4. The third quartile $Q_3$ is the median of the observations whose position in the ordered list is to the right of the location of the overall median (= 75th percentile).

Example 2.17 Travel times to work. Find $Q_1$ and $Q_3$.

Arrange the data in order:

5 10 10 10 10 12 15 20 20 25 30 30 40 40 60

The overall median is 20.

The observations to the left of the location of the overall median are: 5 10 10 10 10 12 15

$Q_1$ = 10

The observations to the right of the location of the overall median are: 20 25 30 30 40 40 60

$Q_3$ = 30

Quartiles: Example

Example 2.5 Travel times to work. Find $Q_1$ and $Q_3$.

Travel times in minutes of 20 randomly chosen New York workers: 10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45

Arrange the data in order:

5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85

The overall median = (20 + 25)/2 = 22.5 minutes

The observations to the left of the location of the overall median are: 5 10 10 15 15 15 15 20 20 20

$Q_1$ = 15 minutes

The observations to the right of the location of the overall median are: 25 30 30 40 40 45 60 60 65 85

$Q_3$ = 42.5 minutes
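The quartile rule used in these examples can be sketched in Python as follows. The function name `quartiles` is an assumption for this illustration, and note that statistical software often uses slightly different quartile conventions, so its output may not match this rule exactly.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3): Q1 and Q3 are the medians of the observations to the
    left and to the right of the location of the overall median."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                               # left of the median's location
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # skip the middle value if n is odd
    return median(lower), median(ordered), median(upper)

north_carolina = [30, 20, 10, 40, 25, 20, 10, 60, 15, 40, 5, 30, 12, 10, 10]
new_york = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
            15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

print(quartiles(north_carolina))
print(quartiles(new_york))
```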

Another Measure of Spread: The Interquartile Range

The Interquartile Range (IQR)

The Interquartile Range = $Q_3 - Q_1$

Example (Travel times to work): Find IQR.

5 10 10 10 10 12 15 20 20 25 30 30 40 40 60

From Example 2.17, $Q_1$ = 10 and $Q_3$ = 30, so IQR = 30 - 10 = 20 minutes.
Detecting Potential Outliers: The 1.5*IQR Criterion

The 1.5*IQR Criterion for Identifying Potential Outliers:

An observation is a potential outlier if it falls more than 1.5*IQR below the first quartile or more than 1.5*IQR above the third quartile.

Detecting Potential Outliers: Example

Example 2.18 Travel times to work (in minutes). Detect potential outliers.

5 10 10 10 10 12 15 20 20 25 30 30 40 40 80

The Five-Number Summary and the Box Plot

The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation:

Minimum, $Q_1$, Median, $Q_3$, Maximum

Example 2.19 The five-number summary of travel times to work.

5 10 10 10 10 12 15 20 20 25 30 30 40 40 80
The Box Plot

Constructing a box plot

A box goes from $Q_1$ to $Q_3$.

A line is drawn inside the box at the median.

A line goes from the lower end of the box to the smallest observation that is not a potential outlier. A separate line goes from the upper end of the box to the largest observation that is not a potential outlier. These lines are called whiskers.

The potential outliers are shown separately.
Example (Constructing a boxplot)

Travel times to work.

5 10 10 10 10 12 15 20 20 25 30 30 40 40 80

Steps:

1. Find Q1, Q2, and Q3.

2. Find IQR.

3. Determine two fences:

lower fence = Q1 - 1.5*IQR

upper fence = Q3 + 1.5*IQR

4. Identify potential outliers.

5. Determine whiskers: one from Q1 to the smallest observation within fences, and the other from Q3 to the largest within fences.

6. Draw the boxplot.

[Figure: boxplot of the travel times on a scale from 20 to 80, marking Q1 = 10, Q2 = 20, Q3 = 30, the smallest and largest observations within the fences, and the outlier]
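Steps 1 to 5 can be carried out numerically before drawing anything. The following Python sketch applies them to the travel times in this example, using the quartile rule from Section 2.5; the helper and variable names are assumptions for this illustration.

```python
from statistics import median

times = [5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 80]

def quartiles(data):
    """Q1 and Q3 are the medians of the values left and right of the median's location."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

q1, q2, q3 = quartiles(times)                 # step 1
iqr = q3 - q1                                 # step 2
lower_fence = q1 - 1.5 * iqr                  # step 3
upper_fence = q3 + 1.5 * iqr
# Step 4: observations beyond either fence are potential outliers
outliers = [t for t in times if t < lower_fence or t > upper_fence]
# Step 5: whiskers end at the most extreme observations inside the fences
inside = [t for t in times if lower_fence <= t <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("five-number summary:", min(times), q1, q2, q3, max(times))
print("fences:", lower_fence, upper_fence)
print("potential outliers:", outliers)
print("whiskers end at:", whisker_low, whisker_high)
```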

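(Text Page 67)

Example (Boxplot): Sodium values for 20 breakfast cereals:

0 70 125 125 140 150 170 170 180 200 200 210 210 220 220 230 250 260 290 290

R code:

    # Sodium values for the 20 cereals
    x <- c(0, 70, 125, 125, 140, 150, 170, 170, 180, 200,
           200, 210, 210, 220, 220, 230, 250, 260, 290, 290)
    # Horizontal boxplot; col = 3 picks the third color of the current palette
    boxplot(x, col = 3, horizontal = TRUE)

[Figure: horizontal boxplot of the sodium values on a scale from 0 to 300]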
Interpretation of Box Plots

IQR measures the sample variability.

A box plot indicates skew. The side with the larger part of the box and the longer whisker usually has skew in that direction.

Interpret box plots in terms of symmetry, median, spread, …

Side-by-Side Box Plots

Help to compare groups.

Example: (College student heights) See the “Heights” data on the text CD.

R code (copy and paste to R):

    # Side-by-side boxplots of HEIGHT for each GENDER group;
    # assumes the "Heights" data is already loaded as the data frame heights
    boxplot(HEIGHT ~ GENDER, data = heights, col = 3:4)

[Figure: box plots comparing heights (scale 55 to 90) for GENDER = 0 and GENDER = 1]
The z-Score

The z-score for an observation is the number of standard deviations that it falls from the mean, and in which direction.

$$z = \frac{\text{observation} - \text{mean}}{\text{standard deviation}}$$

An observation in a bell-shaped distribution is regarded as a potential outlier if it falls more than three standard deviations from the mean; that is, z > 3 or z < -3. (Recall the empirical rule: 99.7% of values are within 3 standard deviations of the mean.)

The z-Score: Example

Example 2.20 The height of a group of young women has mean $\bar{x}$ = 64 inches and standard deviation s = 2.7 inches. The histogram of height is bell-shaped.

-- A woman 70 inches tall has standardized height $z = \frac{70 - 64}{2.7} = 2.22$, or 2.22 standard deviations above the mean.

-- A woman 60 inches tall has standardized height $z = \frac{60 - 64}{2.7} = -1.48$, or 1.48 standard deviations below the mean height.

-- A woman 73 inches tall has standardized height $z = \frac{73 - 64}{2.7} = 3.33$, or 3.33 standard deviations above the mean height, and that woman's height is a potential outlier.
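The three standardized heights can be checked with a few lines of Python. The function name `z_score` is an assumption for this sketch; the cutoff of 3 is the rule stated above for bell-shaped distributions.

```python
def z_score(observation, mean, sd):
    """Number of standard deviations the observation falls from the mean."""
    return (observation - mean) / sd

mean_height, sd_height = 64, 2.7  # inches, from Example 2.20

for height in (70, 60, 73):
    z = z_score(height, mean_height, sd_height)
    # Flag observations more than 3 standard deviations from the mean
    flag = "potential outlier" if abs(z) > 3 else "not flagged"
    print(f"{height} inches: z = {z:.2f} ({flag})")
```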

2.6 How Can Graphical Summaries Be Misused?

Chapter 3

Association: Contingency, Correlation, and Regression

Homework #3

3-1: Problems 3.2, 3.4, 3.6, 3.8, 3.10

3-2: Problems 3.12, 3.14, 3.16, 3.18, 3.22

3-3: Problems 3.26, 3.30, 3.36, 3.38, 3.40

3-4: Problems 3.48, 3.50, 3.52, 3.54, 3.56, 3.58, 3.60

Response Variables and Explanatory Variables

In this chapter, we discuss statistical methods for data on two variables.

Sometimes, one of the two variables may be termed the response variable and the other the explanatory variable.

The response variable is the outcome variable on which comparisons are made.

The explanatory variable defines the groups to be compared with respect to values on the response variable.

Is Smoking Actually Beneficial to Your Health?

This is Example 1 on page 93 of the text. 1314 women were asked whether they were smokers. They were followed over a period of 20 years.

| Smoker | Survival Status: Dead | Survival Status: Alive | Total |
|---|---|---|---|
| Yes | 139 | 443 | 582 |
| No | 230 | 502 | 732 |
| Total | 369 | 945 | 1314 |

It’s natural to treat the variable “Survival Status” as a response variable and “Smoker” as an explanatory variable.

Associations

The main purpose of a data analysis with two variables is to investigate whether there is an association and to describe the nature of that association.

An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.

Is the variable “Survival Status” associated
with the variable “Smoker”? Does smoking

Other Examples of Association

Smoking and BMI

Smoking and lung cancer

Irrigation and plant growth

Traffic and air pollution

Gender and height

3.1 Explore the Association between Two Categorical Variables

A contingency table is used to explore the association between two categorical variables:

Rows list the categories of one variable.

Columns list the categories of the other variable.

Each cell in the table holds the number of observations (frequency) in the sample with certain outcomes on the two variables.

Cross-tabulation: The process of finding the frequencies for the cells of a contingency table.

The previous table is an example of a contingency table.

Construct Contingency Tables From Raw Data

Excel Data: Two Variables

Cancer Treatment: treatments given to the cancer patients (Surgery and Radiation therapy).

Cancer Controlled: whether cancer has been controlled (Yes and No).

Contingency table (Example)

| Treatment | Cancer Controlled: Yes | Cancer Controlled: No | Total |
|---|---|---|---|
| Surgery | 21 | 2 | 23 |
| Therapy | 15 | 3 | 18 |
| Total | 36 | 5 | 41 |

Questions:

(1) What proportion of the patients who had surgery had their cancer controlled?

(2) What proportion of all cancer patients had their cancer controlled?

(1) 21 / 23 = 91% of the patients who had surgery had their cancer controlled.

(2) 36 / 41 = 88% of all cancer patients had cancer controlled.
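The two answers above are a proportion within one row and a proportion over the whole table. A minimal Python sketch of both, using a nested dictionary as an assumed representation of the contingency table:

```python
# Rows: treatment; columns: whether the cancer was controlled
table = {
    "Surgery": {"Yes": 21, "No": 2},
    "Radiation therapy": {"Yes": 15, "No": 3},
}

# Proportion controlled within each treatment group (row total as denominator)
by_treatment = {
    treatment: row["Yes"] / sum(row.values())
    for treatment, row in table.items()
}

# Proportion controlled among all patients (grand total as denominator)
grand_total = sum(sum(row.values()) for row in table.values())
overall_yes = sum(row["Yes"] for row in table.values()) / grand_total

for treatment, prop in by_treatment.items():
    print(f"{treatment}: {prop:.0%} controlled")
print(f"All patients: {overall_yes:.0%} controlled")
```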

Conditional Proportions

A conditional proportion is the proportion of one variable at a given level of the other variable.

Marginal proportion

A marginal proportion is the proportion of a row or column variable.

Side-by-side bars

Display conditional proportions.

Useful for making comparisons.

Side-by-side bars: Example

[Figure: "Cancer condition for two cancer treatments", side-by-side bars of Proportion (0 to 1) by Cancer controlled (Yes, No) for each treatment]

The proportion of patients who had their cancer controlled is slightly higher for the patients who had surgery than for those who had radiation therapy.

Is There an Association?

| Food type | Pesticide Present | Pesticide Not Present |
|---|---|---|
| Organic | 29 (0.23) | 98 (0.77) |
| Conventional | 19485 (0.73) | 7086 (0.27) |

[Figure: "Pesticide Status for Organic vs. Conventional Food", side-by-side bars of the proportions (0 to 0.9) with Pesticide Present and Pesticide Not Present for Organic and Conventional food]

Examples

Ex 3.8 page 101

Ex 3.3 page 100

3.2 How Can We Explore the Association between Two Quantitative Variables?

An association can be studied between

two categorical variables

two quantitative variables

a categorical variable and a quantitative variable.

In this section, we explore the association between two quantitative variables.

That is, we will study how a response variable tends to change as the value of an explanatory variable changes.

Scatterplots

A scatterplot is a graphical display of the relationship between two quantitative variables. It portrays two variables simultaneously:

horizontal axis: the explanatory variable

vertical axis: the response variable

point in the display: the observation corresponding to a subject

[Figure: example scatterplot of y against x]
Example: Worldwide Use of Internet

See the data (text, page 103).

Data dictionary:

GDP: Gross domestic product, per capita, in thousands of US dollars

CO2: Carbon dioxide emissions, per capita, in tons

Cellular: Percentage of adults who are cellular-phone subscribers

Fertility: Mean number of children per adult woman

Questions to explore

(1) Describe the center and spread of the data distribution.

(2) Portray the relationship with a scatterplot for Internet use and GDP.

(3) What do you learn about the association by inspecting the scatterplot?

[Figure: histograms of INTERNET (mean 21.14, standard deviation 18.47) and GDP (mean 16.00, standard deviation 10.60)]

[Figure: scatterplot of INTERNET against GDP]
Interpreting Scatterplots

You can describe the overall pattern of a scatterplot by the trend, direction, and strength of the relationship between the two variables

Trend: linear, curved, clusters, no pattern

Direction: positive, negative, no direction

Strength: how closely the points fit the trend

Also look for outliers from the overall trend

Positive Association

Two quantitative variables x and y are

Positively associated when

High values of x tend to occur with high values of y

Low values of x tend to occur with low values of y

Negatively associated when high values of one variable tend to pair with low values of the other variable

Would you expect a positive association, a negative association or no association between the age of the car and the mileage on the odometer?

a) Positive association

b) Negative association

c) No association

Moving Graphics

http://www.gapminder.org

Linear Correlation, r

Measures the strength and direction of the linear association between x and y

A positive r value indicates a positive association

A negative r value indicates a negative association

An r value close to +1 or -1 indicates a strong linear association

An r value close to 0 indicates a weak association

$$r = \frac{1}{n - 1}\sum\left(\frac{x - \bar{x}}{s_x}\right)\left(\frac{y - \bar{y}}{s_y}\right)$$

Properties of Correlation

Always falls between -1 and +1

Sign of correlation denotes direction

(-) indicates negative linear association

(+) indicates positive linear association

Correlation is unitless: it does not depend on the variables’ units

Two variables have the same correlation no matter which is treated as the response variable

Correlation is sensitive to outliers

Correlation only measures the strength of a linear relationship

Calculating the Correlation Coefficient

| Country | Per Capita GDP (x) | Life Expectancy (y) |
|---|---|---|
| Austria | 21.4 | 77.48 |
| Belgium | 23.2 | 77.53 |
| Finland | 20.0 | 77.32 |
| France | 22.7 | 78.63 |
| Germany | 20.8 | 77.17 |
| Ireland | 18.6 | 76.39 |
| Italy | 21.5 | 78.51 |
| Netherlands | 22.0 | 78.15 |
| Switzerland | 23.8 | 78.99 |
| United Kingdom | 21.2 | 77.37 |

Per Capita Gross Domestic Product and Average Life Expectancy for Countries in Western Europe

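Calculating the Correlation Coefficient

| $x$ | $y$ | $(x_i - \bar{x})/s_x$ | $(y_i - \bar{y})/s_y$ | $\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$ |
|---|---|---|---|---|
| 21.4 | 77.48 | -0.078 | -0.345 | 0.027 |
| 23.2 | 77.53 | 1.097 | -0.282 | -0.309 |
| 20.0 | 77.32 | -0.992 | -0.546 | 0.542 |
| 22.7 | 78.63 | 0.770 | 1.102 | 0.849 |
| 20.8 | 77.17 | -0.470 | -0.735 | 0.345 |
| 18.6 | 76.39 | -1.906 | -1.716 | 3.271 |
| 21.5 | 78.51 | -0.013 | 0.951 | -0.012 |
| 22.0 | 78.15 | 0.313 | 0.498 | 0.156 |
| 23.8 | 78.99 | 1.489 | 1.555 | 2.315 |
| 21.2 | 77.37 | -0.209 | -0.483 | 0.101 |
| $\bar{x}$ = 21.52 | $\bar{y}$ = 77.754 | | | sum = 7.285 |
| $s_x$ = 1.532 | $s_y$ = 0.795 | | | |

The standardized values in the third and fourth columns are called z-scores.

$$r = \frac{1}{n - 1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) = \frac{1}{10 - 1}(7.285) = 0.809$$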
Divide a Scatterplot into Quadrants

[Figure: scatterplot of INTERNET against GDP, divided into quadrants I to IV by the lines $\bar{x}$ = 16 and $\bar{y}$ = 21.1]

In quadrant I, both z-scores are positive;

In quadrant II, z-scores of INTERNET are positive, while z-scores of GDP are negative;

In quadrant III, both z-scores are negative;

In quadrant IV, z-scores of GDP are positive, while z-scores of INTERNET are negative.

3.3 How Can We Predict the Outcome of a Variable?

When a scatterplot indicates a relationship between two variables, we can start fitting a curve to the data.

The procedure of fitting a curve to the data, along with inferences about parameters of interest and prediction of the response value, is called regression analysis.

Regression Analysis

The first step of a regression analysis is to identify the response and explanatory variables

We use y to denote the response variable

We use x to denote the explanatory variable

Regression Line

A regression line is a straight line that describes how the response variable (y) changes as the explanatory variable (x) changes

A regression line predicts the value of the response variable (y) for a given level of the explanatory variable (x)

The y-intercept of the regression line is denoted by a

The slope of the regression line is denoted by b

Example: How Can Anthropologists Predict Height Using Human Remains?

Regression Equation: $\hat{y} = 61.4 + 2.4x$

$\hat{y}$ is the predicted height and $x$ is the length of a femur (thighbone), measured in centimeters

Use the regression equation to predict the height of a person whose femur length was 50 centimeters

$$\hat{y} = 61.4 + 2.4(50) = 181.4$$
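The prediction is a direct evaluation of the regression equation. A small Python sketch, with the function name `predict_height` assumed for this illustration:

```python
def predict_height(femur_cm, intercept=61.4, slope=2.4):
    """Predicted height (cm) from femur length (cm): y-hat = a + b * x."""
    return intercept + slope * femur_cm

print(round(predict_height(50), 1))

# The slope is the change in predicted height per 1 cm increase in femur length
print(round(predict_height(51) - predict_height(50), 1))
```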
Interpreting the y-Intercept

y-Intercept:

The predicted value for y when x = 0

Helps in plotting the line

May not have any interpretative value if no observations had x values near 0

Interpreting the Slope

Slope: measures the change in the predicted response (y) for a 1 unit increase in the explanatory variable (x)

Example: A 1 cm increase in femur length results in a 2.4 cm increase in predicted height

Slope Values:

Positive, Negative, Equal to 0

Regression Line

At a given value of x, the equation $\hat{y} = a + bx$:

Predicts a single value of the response variable

But… we should not expect all subjects at that value of x to have the same value of y

Variability occurs in the y values!
Residuals

A residual measures the size of a prediction error: the vertical distance between the point and the regression line

Each observation has a residual

Calculation for each residual: $y - \hat{y}$

A large residual (in absolute value) indicates an unusual observation

“Least Squares Method” Yields the
Regression Line

Residual sum of squares: $\sum (\text{residuals})^2 = \sum (y - \hat{y})^2$

The least squares regression line is the line that minimizes the residual sum of squares, i.e., the sum of the squared vertical distances between the points and their predictions

Note: the sum of the residuals about the least squares regression line will always be zero
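
The two properties of the least squares line (it minimizes the residual sum of squares, and its residuals sum to zero) can be checked numerically. The sketch below is illustrative only: the data are hypothetical, and the helper names `least_squares_line`, `residuals`, and `rss` are not from the slides.

```python
def least_squares_line(x, y):
    """Return (a, b) for y-hat = a + b*x that minimizes the residual sum of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def residuals(x, y, a, b):
    """Residual y - y-hat for each observation."""
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def rss(x, y, a, b):
    """Residual sum of squares for the line y-hat = a + b*x."""
    return sum(e ** 2 for e in residuals(x, y, a, b))


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(x, y)
print(abs(sum(residuals(x, y, a, b))) < 1e-9)   # residuals sum to (numerically) zero
print(rss(x, y, a, b) < rss(x, y, a + 0.1, b))  # shifting the line increases the RSS
print(rss(x, y, a, b) < rss(x, y, a, b + 0.1))  # so does tilting it
```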
 
 
Regression Formulas for y-Intercept and Slope

Slope: $b = r\left(\dfrac{s_y}{s_x}\right)$

y-Intercept: $a = \bar{y} - b(\bar{x})$

The regression line always passes through $(\bar{x}, \bar{y})$

Calculating the slope and y-intercept for the regression line

Given: $\bar{x} = 0.275$, $\bar{y} = 4.979$, $s_x = 0.0091$, $s_y = 0.368$, $r = 0.653$. Find a and b.

Slope: $b = r\left(\dfrac{s_y}{s_x}\right) = 0.653\left(\dfrac{0.368}{0.0091}\right) = 26.4$

y-Intercept: $a = \bar{y} - b(\bar{x}) = 4.979 - 26.4(0.275) = -2.28$
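
The same arithmetic can be scripted from the summary statistics. This is a sketch under the stated givens (mean of x 0.275, mean of y 4.979, standard deviations 0.0091 and 0.368, r = 0.653); the function name `slope_and_intercept` is hypothetical.

```python
def slope_and_intercept(x_bar, y_bar, s_x, s_y, r):
    """Slope b = r * (s_y / s_x) and intercept a = y_bar - b * x_bar."""
    b = r * (s_y / s_x)
    a = y_bar - b * x_bar
    return a, b


a, b = slope_and_intercept(x_bar=0.275, y_bar=4.979, s_x=0.0091, s_y=0.368, r=0.653)
print(round(b, 1), round(a, 2))  # 26.4 -2.28
```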

Internet Usage and Gross Domestic Product (GDP)

Country          INTERNET    GDP     Country          INTERNET    GDP
Algeria              0.65   6.09     Japan               38.42  25.13
Argentina           10.08  11.32     Malaysia            27.31   8.75
Australia           37.14  25.37     Mexico               3.62   8.43
Austria             38.7   26.73     Netherlands         49.05  27.19
Belgium             31.04  25.52     New Zealand         46.12  19.16
Brazil               4.66   7.36     Nigeria              0.1    0.85
(name missing)      46.66  27.13     Norway              46.38  29.62
Chile               20.14   9.19     Pakistan             0.34   1.89
China                2.57   4.02     Philippines          2.56   3.84
Denmark             42.95  29        Russia               2.93   7.1
Egypt                0.93   3.52     Saudi Arabia         1.34  13.33
Finland             43.03  24.43     South Africa         6.49  11.29
France              26.38  23.99     Spain               18.27  20.15
Germany             37.36  25.35     Sweden              51.63  24.18
Greece              13.21  17.44     Switzerland         30.7   28.1
India                0.68   2.84     Turkey               6.04   5.89
Iran                 1.56   6        United Kingdom      32.96  24.16
Ireland             23.31  32.41     United States       50.15  34.32
Israel              27.66  19.79     Vietnam              1.24   2.07
                                     Yemen                0.09   0.79

Using TI-83

Enter x data into L1

Enter y data into L2

1.

2. Choose 8: LinReg(a+bx)

3. 1st number = x variable

4. 2nd number = y variable

5. Enter
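
For readers without a calculator, a rough software equivalent of fitting y-hat = a + bx is sketched below. It is not the calculator's algorithm, only the formulas from this section; the function name `lin_reg` is hypothetical, only six rows of the INTERNET/GDP table are used to keep the example short, and treating GDP as x and INTERNET as y is an assumption.

```python
from statistics import mean, stdev


def lin_reg(x, y):
    """Return (a, b, r) for the least squares line y-hat = a + b*x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    # r: sum of products of z-scores, divided by n - 1
    r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)) / (n - 1)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b, r


# Six rows from the table: Algeria, Argentina, Australia, Austria, Belgium, Brazil
internet = [0.65, 10.08, 37.14, 38.70, 31.04, 4.66]
gdp = [6.09, 11.32, 25.37, 26.73, 25.52, 7.36]

# Assumption: GDP plays the role of x (L1) and INTERNET the role of y (L2)
a, b, r = lin_reg(gdp, internet)
print(f"a = {a:.2f}, b = {b:.2f}, r = {r:.3f}")
```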

Cereal: Sodium and Sugar

The Slope and the Correlation

Correlation:

Describes the strength of the linear association between 2
variables

Does not change when the units of measurement change

Does not depend upon which variable is the response and
which is the explanatory

Slope:

Numerical value depends on the units used to measure the
variables

Does not tell us whether the association is strong or weak

The two variables must be identified as response and
explanatory variables

The regression equation can be used to predict values of the
response variable for given values of the explanatory
variable

The Squared Correlation

When a strong linear association exists, the regression equation predictions tend to be much better than the predictions using only $\bar{y}$.

We measure the proportional reduction in error and call it $r^2$, which measures the proportion of the variation in the y-values that is accounted for by the linear relationship of y with x.

A correlation of 0.9 means that $r^2 = 0.9^2 = 0.81 = 81\%$

81% of the variation in the y-values can be explained by the explanatory variable, x

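The reading of the squared correlation as a proportional reduction in error can be verified numerically: compare the squared error of predicting every y by the mean of y with the squared error of the regression line. The sketch below uses hypothetical data, and the function name `error_reduction_and_r` is not from the slides.

```python
def error_reduction_and_r(x, y):
    """Return (proportional reduction in error, correlation r) for paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    error_using_mean = syy  # squared error when every prediction is y-bar
    error_using_line = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    reduction = (error_using_mean - error_using_line) / error_using_mean
    r = sxy / (sxx * syy) ** 0.5
    return reduction, r


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.5, 3.1, 2.9, 5.2, 4.8, 7.0]

reduction, r = error_reduction_and_r(x, y)
print(abs(reduction - r ** 2) < 1e-9)  # the reduction in error equals r squared
```
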
3.4 What Are Some Cautions in
Analyzing Association?

Be cautious of

Extrapolation

Influential outliers

Interpretation of correlation or association

Lurking variables

Confounding

Extrapolation

Extrapolation: Using a regression line to predict y-values for x-values outside the observed range of the data

It’s riskier as we move farther from the range of the given x-values

There is no guarantee that the relationship given by the regression equation holds outside the range of sampled x-values

Outliers and Influential Points

A regression outlier is an observation/point that lies far away from the trend that the rest of the data follows

An observation is influential if

Its x value is relatively low or high compared to the remainder of the data, and

The observation is a regression outlier.

An influential observation tends to pull the
regression line toward that data point and
away from the rest of the data.
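
The pull of an influential observation can be seen by fitting the line with and without it. The sketch below uses hypothetical data: five points close to the line y = x, plus one point whose x value is far above the rest and whose y value falls well below the trend. The helper name `fit_line` is not from the slides.

```python
def fit_line(points):
    """Least squares (a, b) for y-hat = a + b*x from a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Hypothetical data, for illustration only
base = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1)]
influential = (15, 2.0)  # high x value and far from the trend

a_without, b_without = fit_line(base)
a_with, b_with = fit_line(base + [influential])
print(round(b_without, 2))  # slope without the influential point
print(round(b_with, 2))     # slope with it: pulled toward the single point
```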

Impact of removing an Influential
data point

Interpretation of Correlation and
Association

Correlation does not imply causation.

In general, it’s also true that association does not imply causation. This warning holds whether we analyze associations between qualitative variables or between quantitative variables.

Create a scatterplot for “Crime rate” against “Education” in the “FL crime” data on the text CD.

[Figure: Scatterplot of Crime against Education, with axes labeled Education and Crime rate; fitted line y = 0.1467x + 61.802, R² = 0.218]

Lurking Variables

A lurking variable is a variable, usually unobserved, that influences the association between the variables of primary interest.

STAT 319 Biometrics Fall 2008

Example: A reporter studied house fires and established a high positive correlation between the damage (in dollars) and the number of firefighters at the scene. Which of the following could be a lurking variable that is responsible for the association?

(a) Firefighter

(b) Weather

(c) Size of the house

(d) Size of the blaze

Example: An economist noticed that nations with more TV sets have higher life expectancies. He established a high positive correlation between length of life and number of TV sets. Find the lurking variable, if there is one.

(a) TV sets brands

(b) Popcorn

(c) Wealth of the nation

(d) Sofa

(e) No confounding variable

Simpson's paradox refers to the phenomenon that the direction of an association between two variables can change after we include a third variable and analyze the data at separate levels of that variable. (Book)

Simpson's paradox (or the Yule-Simpson effect) is a statistical paradox wherein the successes of groups seem reversed when the groups are combined. (Wiki)

Is Smoking Actually Beneficial to Your Health?

This is Example 1 on page 93 of text. 1314 women were asked whether
they were smokers. They were followed over a period of 20 years.

              Survival Status
Smoker      Dead    Alive    Total
Yes          139      443      582
No           230      502      732
Total        369      945     1314

The data indicate that smoking could apparently be
beneficial to your health. Could a lurking variable be
responsible for the association?

There was also age information about the 1314 women involved in the study. These women can be stratified into 4 different age groups, creating 4 contingency tables.

                            Age group
Smoker      18-34          35-54          55-64          65+
          Dead  Alive    Dead  Alive    Dead  Alive    Dead  Alive
Yes          5    174      41    198      51     64      42      7
No           6    213      19    180      40     81     165     28

Question: For each age group, find conditional proportions of deaths for smokers and nonsmokers.
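
One way to organize the calculation is sketched below. The counts are read from the age-group table as (dead, alive) pairs; that pairing is an assumption, chosen because it reproduces the row totals of the overall table (139 and 443 for smokers, 230 and 502 for nonsmokers). The names `death_proportion` and `overall` are hypothetical.

```python
# (dead, alive) counts by age group; the pairing is an assumption (see lead-in)
smokers = {"18-34": (5, 174), "35-54": (41, 198), "55-64": (51, 64), "65+": (42, 7)}
nonsmokers = {"18-34": (6, 213), "35-54": (19, 180), "55-64": (40, 81), "65+": (165, 28)}


def death_proportion(dead, alive):
    """Conditional proportion of deaths within one group."""
    return dead / (dead + alive)


def overall(groups):
    """Proportion of deaths after combining all age groups."""
    dead = sum(d for d, _ in groups.values())
    alive = sum(a for _, a in groups.values())
    return death_proportion(dead, alive)


for age in smokers:
    print(age,
          round(death_proportion(*smokers[age]), 3),
          round(death_proportion(*nonsmokers[age]), 3))
print("all ages", round(overall(smokers), 3), round(overall(nonsmokers), 3))
```

Within every age group the proportion of deaths is higher for smokers, while the combined table shows the opposite, which is the reversal described as Simpson's paradox.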

http://en.wikipedia.org/wiki/Simpson's_par

Simpson's paradox for continuous data: a positive
trend appears for two separate groups (blue and red),
a negative trend (black, dashed) appears when the
data are combined.

Confounding

When two explanatory variables are both associated with a response variable but are also associated with each other, there is said to be confounding.

Age is a confounding variable in the study
of the association between smoking and
survival status.

Difference between a Confounding
Variable and a Lurking Variable

A confounding variable is already included in the
study. It is associated both with the response
variable and the explanatory variable.

A lurking variable is not measured in the study. It has the potential for confounding.

The effect of an explanatory variable can be
analyzed by
confounding variables.

Ignoring lurking variables results in misleading conclusions (for example, age in the smoking-survival association).

Chapter 4:

Gathering Data

Section 4.1

Should We Experiment or Should
We Merely Observe?

Statistics for the Physical Sciences (STAT 229-02)

Homework #4

4-1: Problems 4.2, 4.4, 4.6, 4.8, 4.10

4-2: Problems 4.14, 4.18, 4.20, 4.22, 4.28, 4.30

4-3: Problems 4.34, 4.36, 4.38, 4.40, 4.42

4-4: Problems 4.44, 4.46, 4.48, 4.50, 4.52, 4.54

Learning Objectives:

1. Population versus Sample

2. Types of Studies: Experimental and Observational

3. Comparing Experimental and Observational Studies

Learning Objective 1:

Population and Sample

Population: all the subjects of interest

We use statistics to learn about the population, the entire group of interest

Sample: subset of the population

Data is collected for the sample because we cannot typically measure all subjects in the population

Learning Objective 2:

Type of Study: Observational Study

In an observational study, the researcher observes values of the response variable and explanatory variables for the sampled subjects, without anything being done to the subjects (such as imposing a treatment)

Learning Objective 2:

Observational Study

Sample Survey

A sample survey selects a sample of
people from a population and interviews
them to collect data.

A sample survey is a type of observational
study.

A census is a survey that attempts to
count the number of people in the
population and to measure certain

Learning Objective 2:

Type of Study: Experiment

A researcher conducts an
experiment
by assigning subjects to certain
experimental conditions and then
observing outcomes on the response
variable

The experimental conditions, which
correspond to assigned values of the
explanatory variable, are called
treatments

Learning Objective 2:

Example

Headline: “Student Drug Testing Not Effective in
Reducing Drug Use”

Facts about the study:

76,000 students nationwide

Schools selected for the study included schools
that tested for drugs and schools that did not test
for drugs

Each student filled out a questionnaire asking
about his/her drug use

Learning Objective 2:

Example

Conclusion: Drug use was similar in schools that
tested for drugs and schools that did not test for
drugs

Learning Objective 2:

Example

This study was an observational study.

In order for it to be an experiment, the researcher would have had to assign each school to use or not use drug testing rather than leaving this decision to the school.

Learning Objective 3:

Comparing Experiments and Observational Studies

An experiment reduces the potential for lurking variables to affect the result. Thus, an experiment gives the researcher more control over outside influences.

Only an experiment can establish cause and effect. Observational studies cannot.

Experiments are not always possible
due to ethical reasons, time
considerations and other factors.

Chapter 4

Gathering Data

Section 4.2

What are Good Ways and Poor Ways to
Sample?

Learning Objectives:

1. Sampling Frame & Sampling Design

2. Simple Random Sample (SRS)

3. Random number table

4. Margin of Error

5. Convenience Samples

6. Types of Bias in Sample Surveys

7. Key Parts of a Sample Survey

Learning Objective 1:

Sampling Frame & Sampling Design

The sampling frame is the list of subjects in the population from which the sample is taken; ideally, it lists the entire population of interest

The sampling design determines how the
sample is selected. Ideally, it should give
each subject an equal chance of being
selected to be in the sample

Learning Objective 2:

Simple Random Sampling, SRS

Random Sampling is the best way of obtaining a sample that is representative of the population

A simple random sample of ‘n’ subjects from a population is one in which each possible sample of that size has the same chance of being selected

Learning Objective 2:

SRS Example

Two club officers are to be chosen for a New Orleans trip

There are 5 officers: President, Vice-President, Secretary, Treasurer and Activity Coordinator

The 10 possible samples are:

(P,V) (P,S) (P,T) (P,A) (V,S)

(V,T) (V,A) (S,T) (S,A) (T,A)

For a SRS, each of the ten possible samples has an
equal chance of being selected. Thus, each sample has
a 1 in 10 chance of being selected and each officer has
a 4 in 10 chance of being selected.
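
The counts in this example can be checked by listing every possible sample of two officers. This is a small sketch using the single-letter labels from the slide; the variable names are illustrative.

```python
from itertools import combinations

officers = ["P", "V", "S", "T", "A"]

# Every possible sample of size 2 (order does not matter)
samples = list(combinations(officers, 2))
print(len(samples))  # 10

# Number of samples that include each officer
counts = {officer: sum(officer in s for s in samples) for officer in officers}
print(counts)  # each officer appears in 4 of the 10 samples
```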

Learning Objective 3:

SRS: Table of Random Numbers

Table E on pg. A6 of text

Table of Random Numbers

Learning Objective 3:

Using Random Numbers to select a SRS

To select a simple random sample

Number the subjects in the sampling frame
using numbers of the same length (number of
digits)

Select numbers of that length from a table of
random numbers or using a random number
generator

Include in the sample those subjects having
numbers equal to the random numbers
selected
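
With a random number generator in place of the table, the three steps collapse into a single call. The sketch below assumes a hypothetical sampling frame of 20 numbered subjects; the function name `simple_random_sample` and the seed value are illustrative, and the seed is fixed only so that the run is repeatable.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct subjects so that every possible sample of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)


# Hypothetical sampling frame: subjects numbered 01 to 20 (same number of digits each)
frame = [f"subject_{i:02d}" for i in range(1, 21)]

sample = simple_random_sample(frame, 5, seed=229)
print(sample)
```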

Learning Objective 3:

Choosing a simple random sample

We need to select a random sample of 5 from a class of 20 students.

1) List and number all members of the population, which is the class of 20.

2) The number 20 is two digits long.

3) Parse the list of random digits into numbers that are two digits long. Here we choose to start with line 2, for no particular reason.

22 36 84 65 73 25 59 58 53 93 30 99 58 91 98 27 98 25 34 02

1 Alison

2 Amy

3 Brigitte

4 Darwin

5 Emily

6 Fernando

7 George

8 Harry

9 Henry

10 John

11 Kate

12 Max

13 Moe

14 Nancy

15 Ned

16 Paul

17 Ramon

18 Rupert

19 Tom

20 Victoria

Remember that 1 is 01, 2 is 02, etc.

If you were to hit 09 again before getting five people, don’t sample Henry twice; you just keep going.

4) Choose a random sample of size 5 by reading through the list of two-digit random numbers, starting with line 2 and continuing on.

5) The first five random numbers matching numbers assigned to people make the SRS.

Line 2: 22 36 84 65 73 25 59 58 53 93 30 99 58 91 98 27 98 25 34 02

The first individual selected is Amy, number 02. That’s it from line 2. Move to line 3.

Line 3: 24 13 04 83 60 22 52 79 72 65 76 39 36 48 09 15 17 92 48 30

Then Moe (13), Darwin (04), Henry (09), and Ned (15)
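
The table-reading procedure can be written out step by step. This sketch parses the two lines of random digits shown in the example into two-digit numbers, keeps those between 01 and 20, skips repeats, and stops at five. The function name `select_from_digits` is hypothetical.

```python
def select_from_digits(lines, population_size, n):
    """Read fixed-width labels from lines of random digits until n distinct subjects are chosen."""
    width = len(str(population_size))  # 20 -> labels are two digits long
    chosen = []
    for line in lines:
        digits = line.replace(" ", "")
        for i in range(0, len(digits) - width + 1, width):
            label = int(digits[i:i + width])
            # Keep labels that belong to a subject and have not been drawn already
            if 1 <= label <= population_size and label not in chosen:
                chosen.append(label)
                if len(chosen) == n:
                    return chosen
    return chosen


names = ["Alison", "Amy", "Brigitte", "Darwin", "Emily", "Fernando", "George",
         "Harry", "Henry", "John", "Kate", "Max", "Moe", "Nancy", "Ned",
         "Paul", "Ramon", "Rupert", "Tom", "Victoria"]

line2 = "22 36 84 65 73 25 59 58 53 93 30 99 58 91 98 27 98 25 34 02"
line3 = "24 13 04 83 60 22 52 79 72 65 76 39 36 48 09 15 17 92 48 30"

picked = select_from_digits([line2, line3], population_size=20, n=5)
print(picked)                          # [2, 13, 4, 9, 15]
print([names[k - 1] for k in picked])  # ['Amy', 'Moe', 'Darwin', 'Henry', 'Ned']
```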

Learning Objective 4:

Margin of Error

Sample surveys are commonly used to
estimate population percentages

These estimates include a margin of error which tells us how well the sample estimate predicts the population percentage

When a SRS of n subjects is used, the margin of error is approximately $\dfrac{1}{\sqrt{n}} \times 100\%$
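
As a quick check of the approximation, the margin of error can be computed for a few sample sizes. This is a sketch of the rule of thumb 1/sqrt(n) × 100% only; the function name `approx_margin_of_error_pct` is hypothetical.

```python
import math


def approx_margin_of_error_pct(n):
    """Approximate margin of error, in percentage points, for a SRS of n subjects."""
    return 100 / math.sqrt(n)


for n in (100, 1000, 10000):
    print(n, round(approx_margin_of_error_pct(n), 1))
# 100 10.0
# 1000 3.2
# 10000 1.0
```

A sample of about 1000 subjects gives a margin of error of roughly 3 percentage points, and quadrupling the sample size only halves it.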

Learning Objective 4:

Example: Margin of Error

A survey result states: “The margin of error is plus or minus 3 percentage points”

This means: “It is very likely that the reported sample percentage is no more than 3 percentage points lower or 3 percentage points higher than the population percentage”

Look at a Gallup example: read the “Survey Methods” part and justify the margin of error in the survey.

Learning Objective 5:

Convenience Samples: Poor Ways to Sample

Convenience Sample: a type of survey sample that is easy to obtain

Unlikely to be representative of the
population

Often severe biases result from such a
sample

Results apply ONLY to the observed
subjects; that is, they are descriptive.

Learning Objective 5:

Convenience Samples: Poor Ways to Sample

Volunteer Sample: most common form of convenience sample

Subjects volunteer for the sample

Volunteers do not tend to be
representative of the entire population

Learning Objective 6:

Types of Bias in Sample Surveys

Bias: Tendency to systematically favor certain parts of the population over others

Sampling bias: Occurs when using biased samples, which result from sampling methods such as using nonrandom samples or having undercoverage

Nonresponse bias: Occurs when some sampled subjects cannot be reached or refuse to participate or fail to answer some questions

Response bias: Occurs when the subject gives an incorrect response or the question is misleading

A Large Sample Does Not Guarantee An Unbiased Sample!

Learning Objective 7:

Key Parts of a Sample Survey

Identify the population of all subjects of interest

Construct a sampling frame which attempts to list all subjects in the population

Use a random sampling design to select n subjects from the sampling frame

Be cautious of sampling bias due to nonrandom samples

We can make inferences about the population of interest
when sample surveys that use random sampling are
employed.

Chapter 4

Gathering Data

Section 4.3

What Are Good Ways and Poor Ways to Experiment?

Learning Objectives:

1. Identify the elements of an experiment

2. Experiments

3. 3 Components of a good experiment

4. Blinding the Study

5. Define Statistical Significance

6. Generalizing Results of the Study

Learning Objective 1:

Elements of an Experiment

Experimental units: the subjects of an experiment; the entities that we measure in an experiment

Treatment: A specific experimental condition imposed on the subjects of the study; the treatments correspond to assigned values of the explanatory variable

Explanatory variable: Defines the groups to be compared with respect to values on the response variable

Response variable: The outcome measured on the subjects to reveal the effect of the treatment(s).

Learning Objective 2:

Experiments

An experiment deliberately imposes treatments on the
experimental units in order to observe their responses.

The goal of an experiment is to compare the effect of the treatment on the response.

In a randomized experiment, the subjects are randomly assigned to the treatments; randomization helps to eliminate the effects of lurking variables

Learning Objective 3:

3 Components of a Good Experiment

Control/Comparison group: allows the
researcher to analyze the effectiveness of
the primary treatment

Randomization: eliminates possible
researcher bias, balances the comparison
groups on known as well as on lurking
variables (so that the observed difference
among subjects is attributed to treatments)

Replication: allows us to attribute
observed effects to the treatments rather
than ordinary variability

Learning Objective 3:

Principle 1: Control or Comparison Group

A placebo is a dummy treatment, e.g., a sugar pill. Many subjects respond favorably to any treatment, even a placebo.

A control group typically receives a placebo. A control
group allows us to analyze the effectiveness of the
primary treatment.

A control group need not receive a placebo. Clinical trials
often compare a new treatment for a medical condition, not
with a placebo, but with a treatment that is already on the
market.

Learning Objective 3:

Principle 1: Control or Comparison Group

Experiments should compare treatments rather than attempt to assess the effect of a single treatment in isolation

Is the treatment group better, worse, or no different
than the control group?

Example: 400 volunteers are asked to quit
smoking and each start taking an antidepressant.
In 1 year, how many have relapsed? Without a
control group (individuals who are not on the
antidepressant), it is not possible to gauge the
effectiveness of the antidepressant.

Learning Objective 3:

Placebo effect

Placebo effect (power of suggestion): The “placebo effect” is an improvement in health due not to any treatment but only to the patient’s belief that he or she will improve.

Learning Objective 3:

Principle 2: Randomization

To have confidence in our results we should randomly assign subjects to the treatments. In doing so, we

Eliminate bias that may result from the researcher
assigning the subjects

Balance the groups on variables known to affect the
response

Balance the groups on lurking variables that may be
unknown to the researcher
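
Random assignment is easy to carry out in software: shuffle the subjects, then deal them out to the treatments. The sketch below assumes 20 hypothetical subject IDs and two groups; the function name `randomly_assign`, the group labels, and the seed are illustrative.

```python
import random


def randomly_assign(subjects, treatments, seed=None):
    """Shuffle the subjects, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(subject)
    return groups


# Hypothetical subjects and treatments, for illustration only
subjects = [f"S{i:02d}" for i in range(1, 21)]
groups = randomly_assign(subjects, ["treatment", "control"], seed=7)
print({name: len(members) for name, members in groups.items()})  # 10 subjects per group
```

Because the researcher does not choose who goes where, neither known nor unknown characteristics of the subjects can systematically favor one group.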

Learning Objective 3:

Principle 3: Replication

Replication is the process of assigning
several experimental units to each
treatment

The difference due to ordinary variation is
smaller with larger samples

We have more confidence that the sample
results reflect a true difference due to
treatments when the sample size is large

Since it is always possible that the observed
effects were due to chance alone, replicating
the experiment also builds confidence in our
conclusions

Learning Objective 4:

Blinding the Experiment

Ideally, subjects are unaware, or blind, to the treatment they are receiving

If an experiment is conducted in such a way that neither the subjects nor the investigators working with them know which treatment each subject is receiving, then the experiment is double-blinded

A double-blinded experiment controls response bias from the respondent and experimenter

Learning Objective 5:

Define Statistical Significance

If an experiment (or other study) finds a difference in two (or more) groups, is this difference really important?

If the observed difference is larger than what would be expected just by chance, then it is labeled statistically significant.

Rather than relying solely on the label of statistical significance, also look at the actual results to determine if they are practically significant.

Learning Objective 6:

Generalizing Results

Recall that the goal of experimentation is to analyze the association between the treatment and the response for the population, not just the sample

However, care should be taken to
generalize the results of a study only to
the population that is represented by the
study.

Chapter 4

Gathering Data

Section 4.4

What are Other Ways to Conduct Experimental and Observational Studies?

Learning Objectives

1. Sample Surveys: Other Random Sampling Designs

2. Types of Observational Studies: Prospective and Retrospective

3. Multifactor Experiment

4. Matched pairs design

5. Randomized block design

Learning Objective 1:

Sample Surveys: Random Sampling
Designs

It is not always possible to conduct an experiment, so it is necessary to have well-designed, informative studies that are not experimental, e.g., sample surveys that use randomization

Simple Random Sampling

Cluster Sampling

Stratified Random Sampling

Learning Objective 1:

Sample Surveys: Cluster Random Sample

Steps

Divide the population into a large number of clusters, such as city blocks

Select a simple random sample of the clusters

Use the subjects in those clusters as the
sample

Learning Objective 1:

Sample Surveys: Cluster Random
Sample

Preferable when

A reliable sampling frame is unavailable

The cost of selecting a SRS is excessive

Usually need a larger sample size than with a
SRS in order to achieve a particular margin of
error

Learning Objective 1:

Sample Surveys: Stratified Random Sample

Steps

Divide the population into separate groups, called strata

Select a simple random sample from each stratum

Combine the samples from all strata to form the complete sample

Learning Objective 1:

Sample Surveys: Stratified Random Sample

Advantage is that you can include in your
sample enough subjects in each stratum
you want to evaluate

Disadvantage is that you must have a
sampling frame and know the stratum into
which each subject belongs

Learning Objective 1:

Stratified Random Sample - Example

Suppose a university has the following student demographics:

Undergraduate 55%, Graduate 20%, First professional 5%, Special 20%

In order to ensure proper coverage of each demographic, a stratified random sample of 100 students could be chosen as follows: select a SRS of 55 undergraduates, a SRS of 20 graduates, a SRS of 5 first professional students, and a SRS of 20 special students; combine these 100 students.
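
The allocation in this example can be sketched in code. The university below is hypothetical (2,000 students split 55/20/5/20), the student IDs are made up, and the function name `stratified_sample` is illustrative; the only part taken from the example is the allocation of 55, 20, 5, and 20 students.

```python
import random


def stratified_sample(strata, allocation, seed=None):
    """Draw a SRS of the allocated size from each stratum and combine the results."""
    rng = random.Random(seed)
    sample = []
    for name, members in strata.items():
        sample.extend(rng.sample(members, allocation[name]))
    return sample


# Hypothetical sampling frame, already divided into strata
strata = {
    "undergraduate": [f"U{i}" for i in range(1100)],
    "graduate": [f"G{i}" for i in range(400)],
    "first professional": [f"P{i}" for i in range(100)],
    "special": [f"S{i}" for i in range(400)],
}
allocation = {"undergraduate": 55, "graduate": 20, "first professional": 5, "special": 20}

sample = stratified_sample(strata, allocation, seed=1)
print(len(sample))  # 100
```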

Learning Objective 1:

Comparing Random Sampling Methods

Learning Objective 2:

Types of Observational Studies

An observational study can yield useful information when an
experiment is not practical.

Types of observational studies:

Sample Survey: attempts to take a cross section of a
population at the current time

Retrospective study: looks into the past

Prospective study: follows its subjects into the future

Causation can never be definitively established with an
observational study, but well designed studies can provide
supporting evidence for the researcher’s beliefs

Learning Objective 2:

Retrospective Case-Control Study

A case-control study is a retrospective observational study in which subjects who have a response outcome of interest (the cases) and subjects who have the other response outcome (the controls) are compared on an explanatory variable

Learning Objective 2:

Example: Case-Control Study

Response outcome of interest: Lung cancer

The cases have lung cancer

The controls did not have lung cancer

The two groups were compared on the explanatory variable smoker/nonsmoker

                    Lung Cancer
Smoker           Cases    Controls
Yes                688         650
No                  21          59
Total              709         709
Prob(smoker)       97%         92%
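
The two percentages in the last row follow directly from the counts. A minimal sketch, with hypothetical variable and function names:

```python
cases = {"smoker": 688, "nonsmoker": 21}      # subjects with lung cancer
controls = {"smoker": 650, "nonsmoker": 59}   # subjects without lung cancer


def proportion_smokers(group):
    """Proportion of smokers within one group of the case-control study."""
    return group["smoker"] / (group["smoker"] + group["nonsmoker"])


print(round(100 * proportion_smokers(cases)))     # 97
print(round(100 * proportion_smokers(controls)))  # 92
```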

Learning Objective 2:

Example: Prospective Study

Nurses’ Health Study:

Began in 1976 with 121,700 female nurses aged 30 to 55;
questionnaires are filled out every two years

Purpose was to explore the relationships among diet,
hormonal factors, smoking habits and exercise habits and
the risk of coronary heart disease, pulmonary disease and
stroke

Nurses are followed into the future to determine whether
they eventually develop an outcome such as lung cancer
and whether certain explanatory variables are associated
with it

Learning Objective 3:

Multifactor Experiments

A Multifactor experiment uses a single experiment to
analyze the effects of two or more explanatory variables
on the response

Categorical explanatory variables in an experiment are often called factors

We are often able to learn more from a multifactor experiment than from separate one-factor experiments since the response may vary for different factor combinations

Learning Objective 3:

Example: Multifactor experiment

Examine the effectiveness of
both Zyban and nicotine
patches on quitting smoking

Two factor experiment

4 treatments

Learning Objective 3:

Example: Multifactor experiment

subjects: a certain number of undergraduate students

all subjects viewed a 40-minute television program that included ads for a digital camera

some subjects saw a 30-second commercial; others saw a 90-second version

same commercial was shown either 1, 3, or 5 times during the program

there were two factors: length of the commercial (2 values), and number of repetitions (3 values)

Learning Objective 3:

Example: Multifactor experiment

the 6 combinations of one value of each factor form six treatments

                       Factor B: Repetitions
Factor A: Length     1 time    3 times    5 times
30 seconds              1          2          3
90 seconds              4          5          6

subjects assigned to Treatment 3 see a 30-second commercial five times during the program

after viewing, all subjects answered questions about: recall of the ad, their attitude toward the commercial, and their intention to purchase the product

these were the response variables.
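
The treatments of a multifactor experiment are the combinations of one value from each factor, which can be listed mechanically. The sketch below numbers them in the same order as the table (length first, then repetitions); the variable names are illustrative.

```python
from itertools import product

lengths_seconds = [30, 90]   # Factor A: length of the commercial
repetitions = [1, 3, 5]      # Factor B: number of times it is shown

# Treatment number -> (length in seconds, number of repetitions)
treatments = dict(enumerate(product(lengths_seconds, repetitions), start=1))

print(len(treatments))  # 6
print(treatments[3])    # (30, 5): a 30-second commercial shown five times
```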

Learning Objective 4:

Matched Pairs Design

In a matched pairs design, the subjects receiving the two
treatments are somehow matched (same person,
husband/wife, two plots in the same field, etc.)

In a crossover design, the same individual is used for the two treatments

Randomly assign the two treatments to the two matched subjects, or randomize the order of applying the treatments in a crossover design

The number of replicates equals the number of pairs

Helps to reduce effects of lurking variables

Learning Objective 5:

Randomized Block Design

A block is a set of experimental units that are matched with respect to one or more characteristics

A Randomized Block Design, RBD, is when the random assignment of experimental units to treatments is carried out separately within each block

Learning Objective 5:

Example: Randomized Block Design

Block = gender; 3 treatments = 3 types of therapy

The men (as well as the women) are randomly assigned to the 3 treatments; differences can be compared with respect to gender as well as therapy type
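
The defining feature of this design, randomizing separately inside each block, can be sketched as follows. The subject IDs, block sizes (9 men and 9 women), therapy labels, seed, and the function name `randomized_block_assignment` are all hypothetical.

```python
import random


def randomized_block_assignment(blocks, treatments, seed=None):
    """Randomly assign units to treatments separately within each block."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, units in blocks.items():
        shuffled = list(units)
        rng.shuffle(shuffled)  # the randomization is done inside the block
        for i, unit in enumerate(shuffled):
            assignment[unit] = (block_name, treatments[i % len(treatments)])
    return assignment


# Hypothetical blocks and treatments, for illustration only
blocks = {
    "men": [f"M{i}" for i in range(1, 10)],
    "women": [f"W{i}" for i in range(1, 10)],
}
treatments = ["therapy A", "therapy B", "therapy C"]

assignment = randomized_block_assignment(blocks, treatments, seed=4)
for unit in sorted(assignment):
    print(unit, assignment[unit])
```

Each therapy ends up with three men and three women, so a comparison of therapies is not distorted by an uneven gender mix.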

Learning Objective 5:

Randomized Block Design

RBD eliminates variability in the response
due to the blocking variable; allows for
better comparisons to be made among the
treatments of interest

A matched pairs design is a special case
of a RBD with two observations in each
block

Chapter 5

Probability in our Daily Lives

Section 5