Statistics for the Physical Sciences

Nov 29, 2013 (3 years and 8 months ago)

STAT 229

Chapter 1


Statistics: The Art and Science of Learning from Data

Fall 2008 STAT 229

Homework 1


Problems 1.1 to 1.36 (even numbered)


Complete the survey on pages 22-23

1.1 Overview


Statistics is the art and science of learning from data. It is a collection of methods for

Planning experiments (Design)

Obtaining data (data are collected observations, such as measurements and survey responses)

Organizing data

Summarizing data (Description)

Analyzing data

Interpreting results, and

Making decisions and predictions (Inference)



Statistics is a branch of Mathematics ->



Statistics was invented for studying randomness: a lack of order, purpose, cause, or predictability (Wikipedia), without which the world would be of no interest.

Examples of random phenomena:

Phelps won 8 gold medals

A 6-sided die is rolled and lands on 4

It's going to rain tomorrow

Randomness, Fuzziness and Uncertainty

Randomness creates uncertainty. On the other hand, randomness can be put to use. When estimating the proportion of adults in the USA who smoke, we can survey 1000 adults and use the survey responses as our data. How is randomness used here? Why use it?

1.2 We Learn about Population Using Samples

In the previous example, all US adults form a population, while the 1000 surveyed adults form a sample.

In general, a population is the complete collection of all items to be studied. These items can be human subjects, animals, machines, even scores.

A sample is a sub-collection of items selected from a population.

More about Samples


A sample should represent the underlying population. Therefore, sample data must be collected in an appropriate way, such as through a process of random selection.

How large should a sample be?

What are the appropriate ways to generate a sample?

Methods for summarizing sample data are referred to as descriptive statistics, while methods for making decisions or predictions about a population based on sample data are called inferential statistics.

Parameter and Statistic


A parameter is a numeric summary of the
population


A statistic is a numeric summary of a
sample taken from the population



Problem: Number of Good Friends


One year the General Social Survey
asked, “About how many good friends do
you have?” Of the 819 people who
responded, 6% reported having only one
good friend. Identify


(a) the sample


(b) the population, and


(c) the parameter or statistic


Try Problem 1.3 on page 8 of the textbook.


Go to the General Social Survey website


http://sda.berkeley.edu/GSS


By entering HEAVEN as the “row variable”
name, find the percentages of people who
said “yes, definitely,” “yes, probably,” “no,
probably not,” and “no, definitely not” when
asked whether they believed in heaven.

1.3 What Role Do Computers Play in Statistics?

Save (large) data files

Create databases

Do analysis with software: SAS, Minitab, SPSS, R, S-PLUS, C, Matlab, Excel, ...

Simulation: use of computers to mimic reality.



Simulation of Coin Tossing in Microsoft Excel

NOTES:

1. Pseudo-random numbers are numbers generated by a computer algorithm to simulate real random numbers.

2. Excel has an Analysis ToolPak with which one can do statistical analysis, including simulation.

Tasks:

When a balanced coin is tossed 20 times, we get a sequence of 20 Heads or Tails. Let 1 denote Heads and 0 denote Tails. Then a sample is a sequence of 1s and 0s. The empirical probability, or sample proportion, of tossing Heads (1) is computed as the number of 1s divided by the total number of tosses. The coin-tossing process can be simulated using a Bernoulli distribution with proportion p = 0.5.

1. Simulate 5 random samples, each consisting of 10 pseudo-random numbers from a Bernoulli(0.5) distribution. Repeat the process using 1000 pseudo-random numbers.

2. Compute the sample proportion for each of the 10 samples.
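
The same two tasks can also be sketched in R rather than Excel. This is only an illustration: the helper name sample_proportions and the seed are assumptions made for this sketch, and rbinom with size = 1 is used here as the Bernoulli generator.

```r
# Simulate fair-coin tosses: 1 = Heads, 0 = Tails.
set.seed(229)  # arbitrary seed, chosen only so the run is repeatable

# Hypothetical helper: sample proportion of Heads for each of
# n_samples samples, each made of n Bernoulli(p) pseudo-random numbers.
sample_proportions <- function(n_samples, n, p = 0.5) {
  replicate(n_samples, mean(rbinom(n, size = 1, prob = p)))
}

props_10   <- sample_proportions(5, 10)    # Task 1: five samples of size 10
props_1000 <- sample_proportions(5, 1000)  # Task 1 repeated with size 1000

# Task 2: the ten sample proportions
print(props_10)
print(props_1000)

# Sample-to-sample variability of the proportions for each sample size
print(c(n10 = sd(props_10), n1000 = sd(props_1000)))
```

Comparing the two standard deviations printed at the end is one way to approach the question of how sample size affects variability.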

Simulation

Follow this:

Tools > Data Analysis > Random Number Generation > Bernoulli

More questions:

1. Where does randomness play a role?

2. Is the amount of variability from sample to sample of size 10 bigger than the amount of variability from sample to sample of size 1000?

3. Comment on the effect of sample size.

If You Are Using Excel 2007…

Excel 2007 no longer has a Tools menu.

To use the Analysis ToolPak, go to the Office button at the upper left corner, click Excel Options, then click Add-ins and highlight Analysis ToolPak. Click the Go button to open the Add-ins window. Check the Analysis ToolPak box and click OK.

Now go to the Data menu, click Data Analysis, and choose Random Number Generation.

Statistics for the Physical Sciences

STAT 229

Chapter 2


Exploring Data with Graphs and Numerical Summaries

Homework #2

2-1 (p29): Problems 2.2, 2.4, 2.6, 2.8

2-2 (p44): Problems 2.10, 2.12, 2.14, 2.16, 2.22

2-3 (p55): Problems 2.30, 2.32, 2.34, 2.36, 2.42, 2.44

2-4 (p64): Problems 2.48, 2.52, 2.56, 2.58, 2.60

2-5 (p73): Problems 2.64, 2.66, 2.68, 2.72, 2.74, 2.78, 2.80, 2.82

2-6 (p80): Problems 2.84


2.1 What Are the Types of Data?

A characteristic observed for the subjects in a study is called a variable.

Examples of variables: major, GPA, religious affiliation, smoking status, ...

Variables can be quantitative (numerical) or qualitative (categorical).

A variable is quantitative if its numerical values represent different magnitudes of the variable, such as weight or GPA. A variable is categorical if its value represents a category, such as major or letter grade.

Quantitative variables can be discrete or continuous.

A discrete variable is usually a count, such as the number of car accidents last year, while a continuous variable is a measurement, such as distance.

The reason we care whether a variable is quantitative, categorical, discrete, or continuous is that the method used to analyze a data set depends on the type of variable the data represent.


Key Features of a Variable

A quantitative variable usually takes different values in a study. Studying the spread (variability) of such a variable is one of the most important tasks in statistics. Another feature of a quantitative variable is the center of all its possible values.

For a categorical variable, a key feature to describe is the relative number of items (percentage) in the various categories.

Frequency Tables

For a categorical variable, counting how often each possible value is taken by the variable is a critical first step in descriptive statistics. The results are summarized in a frequency table.

The following table shows the frequency of shark attacks in various regions for 1990-2006.

Region        Frequency   Proportion   Percentage
Florida       365         0.785        78.5
Hawaii        60          0.129        12.9
California    40          0.086        8.6
Total         465         1.000        100

Frequency of shark attacks in various regions for 1990-2006

Questions: What is the variable? Is it categorical?

The mode of categorical data is the category with the highest frequency. Find the mode of the data.

Frequency Tables (cont’d)

In the table above, the proportions and percentages are also called relative frequencies. A table like this is called a frequency table.

A frequency table is a listing of possible values for a variable, together with the number of observations for each value.

For a quantitative variable, a frequency table is constructed by first categorizing the data into a set of adjacent intervals, then finding the frequencies for each interval.

No. Hours     Frequency   Percent
0-1           232         25.6
2-3           403         44.5
4-5           181         20.0
6-7           45          5.0
8 or more     44          4.9
Total         905         100.0

Frequency Table for Daily TV Watching


Example

Construct a frequency table for quiz scores for twenty students: 5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6

Score     Frequency   Proportion   Percentage
[0,2]     1           0.05         5
(2,4]     1           0.05         5
(4,6]     7           0.35         35
(6,8]     8           0.40         40
(8,10]    3           0.15         15
Total     20          1.00         100
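
As a cross-check, the quiz-score table can be rebuilt in R. This is a minimal sketch: the variable names are illustrative, and cut() with include.lowest = TRUE is used so that the first interval is closed on both ends, matching [0,2], (2,4], and so on.

```r
# Quiz scores for the twenty students
scores <- c(5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6)

# Categorize the scores into adjacent intervals [0,2], (2,4], ..., (8,10]
bins <- cut(scores, breaks = c(0, 2, 4, 6, 8, 10), include.lowest = TRUE)

# Count the observations in each interval
freq <- table(bins)

# Assemble frequency, proportion, and percentage columns
freq_table <- data.frame(
  Score      = names(freq),
  Frequency  = as.vector(freq),
  Proportion = as.vector(freq) / length(scores),
  Percentage = 100 * as.vector(freq) / length(scores)
)
print(freq_table)
```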

2.2 How Can We Describe Data Using Graphical Summaries?

Group     Seats   Percent (%)
EUL       39      5.3
PES       200     27.3
EFA       42      5.7
EDD       15      2
ELDR      67      9.2
EPP       276     37.7
UEN       27      3.7
Other     66      9
Total     732     100.0

Preliminary results of the election for the European Parliament in 2004


Pie Charts and Bar Graphs for Categorical Variables

Pie chart: A circle having a “slice of a pie” for each category. The size of a slice corresponds to the percentage of observations in the category.

Bar graph: Displays a vertical bar for each category. The height of the bar is the percentage of observations in the category.

[Pie chart: Seats by group. EUL 5%, PES 27%, EFA 6%, EDD 2%, ELDR 9%, EPP 38%, UEN 4%, Other 9%]

Example: Use the shark attack data from this source link to construct a pie chart of interest.

Bar Graph for European Parliament in 2004

[Bar graph: Seats (0 to 300) for each group: EUL, PES, EFA, EDD, ELDR, EPP, UEN, Other]

Pareto Chart

[Pareto chart: Percentage (0 to 40) by Group, in the order EPP, PES, ELDR, Other, EFA, EUL, UEN, EDD]

Pareto chart: a bar graph with categories ordered by their frequency, from the tallest bar to the shortest.


Graphs for Quantitative Variables

Dot plots: Show a dot for each observation, placed just above the value on the number line for that observation.

Stem-and-Leaf Plots: Similar to dot plots. Each observation is represented by a stem and a leaf.

Histogram: A graph that uses bars to portray the frequencies or relative frequencies.

Example: Dot plot

Quiz scores for twenty students: 5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6

[Dot plot of the quiz scores on a number line from 1 to 10]

Graphs for Quantitative Variables

Example: Stem-and-Leaf Plot

Test scores for 12 students: 80, 45, 100, 76, 84, 87, 96, 62, 75, 74, 87, 76

Step 1: Sort the test scores: 45, 62, 74, 75, 76, 76, 80, 84, 87, 87, 96, 100

Step 2: Place the scores in the corresponding stems and leaves (usually the last digit is the leaf).

Stem | Leaves
   4 | 5
   5 |
   6 | 2
   7 | 4 5 6 6
   8 | 0 4 7 7
   9 | 6
  10 | 0

Graphs for Quantitative Variables

Histogram

Step 1: Divide the range of data into intervals of equal width.

Step 2: Count the frequency and construct a frequency table (or relative frequency table).

Step 3: Label the endpoints of the intervals on the x-axis. Draw a bar over each interval with height equal to its frequency (or relative frequency), values of which are marked on the y-axis.

Graphs for Quantitative Variables

Example: Histogram

Quiz scores for twenty students: 5, 7, 8, 3, 7, 7, 1, 9, 6, 8, 5, 6, 7, 10, 7, 9, 6, 8, 6, 6

Score     Freq
[0,2]     1
(2,4]     1
(4,6]     7
(6,8]     8
(8,10]    3

[Histogram: Frequency (0 to 9) against Score (0 to 10)]
Graphs for Quantitative Variables

The Shape of a Distribution

When looking at a graph of quantitative data (dot plot, stem-and-leaf plot, or histogram), look for

the overall pattern: Do the data cluster together?

the outliers

modes: unimodal, bimodal, ...

skew: skewed to the left or right

the underlying smooth curve

[Figures: unimodal, bimodal, and multimodal distributions; a distribution with outliers]

[Figures: Histogram of x and Histogram of y, each showing Frequency against values from -10 to 10]

These Two Histograms Show Differences in Spread

Time plots

Time series: a data set collected over time.

Time plot: a graph displaying time-series data.

Look for patterns over time.

Time plots: Example

Gasoline price

2.3 How can we describe the center of quantitative data?

Measures of center: mean and median

Mean: the sum of the observations divided by the number of observations.

Median: the midpoint of the observations.

Mean Formula

$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum x_i$

Example

Travel times to work

How long does it take to get from home to work? Here are the travel times in minutes in North Carolina, chosen at random by the Census Bureau:

30 20 10 40 25 20 10 60 15 40 5 30 12 10 10

Find the mean travel time.

$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{30 + 20 + \cdots + 10}{15} = \frac{337}{15} = 22.5$ minutes
How to Determine the Median

Step 1: Sort your data from the smallest to the largest.

Step 2: If n, the number of data points, is odd, the median is the middle value; if n is even, the median is the average of the middle two values.

Example

Find the median for the travel times

30 20 10 40 25 20 10 60 15 40 5 30 12 10 10

Arrange the data in order:

5 10 10 10 10 12 15 20 20 25 30 30 40 40 60

Since n = 15 is odd, Median = 20, the middle value.

Example

Find the median for the scores

60 80 87 73 95 92

Arrange the data in order: 60 73 80 87 92 95

Since n = 6 is even, Median = (80 + 87)/2 = 83.5, the average of the two middle values.
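
The mean and median calculations above can be checked with R's built-in mean() and median(). The variable names travel and exam are chosen here only for illustration.

```r
# North Carolina travel times to work (minutes)
travel <- c(30, 20, 10, 40, 25, 20, 10, 60, 15, 40, 5, 30, 12, 10, 10)

# Six exam scores from the second median example
exam <- c(60, 80, 87, 73, 95, 92)

mean_travel   <- mean(travel)    # 337 / 15, about 22.5 minutes
median_travel <- median(travel)  # n = 15 is odd: the middle (8th) sorted value
median_exam   <- median(exam)    # n = 6 is even: average of the 3rd and 4th sorted values

print(c(mean = mean_travel, median = median_travel))
print(median_exam)
```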



Properties of the mean and the median

The mean is the balance point of the data.

In a symmetric distribution, the mean and median are the same.

In a skewed distribution, the mean is usually farther out in the long tail than the median.

Skewed to the right: mean > median

Skewed to the left: mean < median

The mean is less resistant to outliers than the median.

Mean, Median, and Mode

The mean is the balance point.

The median is the midpoint.

The mode is the value that occurs most frequently.

Mean and Median: Applications


City data


St Cloud, MN


New Orleans, LA

2.4 How can we describe the spread of quantitative data?

Measures of spread:

The Range

The Standard Deviation

The Interquartile Range (Sec 2.5)

Measuring spread: The Range

Range = largest value - smallest value

Example: Find the range of the quiz scores: 2, 5, 0, 7, 9, 1, 7, 6, 10, 9, 3, 9, 9, 7, 0, 6, 9, 10, 8, 1, 4, 6, 8, 9, 4, 2, 9, 0, 5, 7

Range = largest value - smallest value = 10 - 0 = 10

The Range

Simple to compute

Easy to understand

But

Uses only extreme values

Affected severely by outliers

Measuring Spread: Variance and Standard Deviation

The standard deviation and variance measure spread by looking at how far the observations are from their mean.

The variance $s^2$ of a set of observations is an average of the squares of the deviations from the mean.

$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$

The standard deviation $s$ is the square root of the variance.

$s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$




Example (Calculating the standard deviation s)

Metabolic rates of 7 men who took part in a study of dieting. The units are calories per 24 hours.

1792 1666 1362 1614 1460 1867 1439

Find the mean first:

$\bar{x} = \frac{1792 + \cdots + 1439}{7} = \frac{11200}{7} = 1600$ calories
The standard deviation: Example (cont’d)

Observations $x_i$   Deviations $x_i - \bar{x}$   Squared deviations $(x_i - \bar{x})^2$
1792                 192                          36864
1666                 66                           4356
1362                 -238                         56644
1614                 14                           196
1460                 -140                         19600
1867                 267                          71289
1439                 -161                         25921
                     sum = 0                      sum = 214870

The variance

$s^2 = \frac{214870}{6} = 35811.67$

The standard deviation

$s = \sqrt{35811.67} = 189.24$ calories
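
The table can be reproduced step by step in R. This sketch follows the definition with divisor n - 1 and then compares the result with the built-in var() and sd(); the variable names are illustrative.

```r
# Metabolic rates (calories per 24 hours) of the 7 men
rates <- c(1792, 1666, 1362, 1614, 1460, 1867, 1439)
n <- length(rates)

x_bar      <- mean(rates)      # 11200 / 7 = 1600
deviations <- rates - x_bar    # these sum to 0
sq_dev     <- deviations^2     # squared deviations

s2 <- sum(sq_dev) / (n - 1)    # variance, dividing by n - 1
s  <- sqrt(s2)                 # standard deviation

print(data.frame(rates, deviations, sq_dev))
print(c(variance = s2, sd = s))

# The built-in functions use the same n - 1 divisor
print(c(var(rates), sd(rates)))
```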

Properties of the Standard Deviation

The greater the spread, the larger the s.

s ≥ 0.

s = 0 when all the observations take the same value.

s can be influenced by outliers.

Interpreting the Magnitude of s: The Empirical Rule

If a distribution of data is bell shaped, then approximately:

68% of the observations fall within 1 standard deviation of the mean, that is, between $\bar{x} - s$ and $\bar{x} + s$.

95% of the observations fall within 2 standard deviations of the mean, that is, between $\bar{x} - 2s$ and $\bar{x} + 2s$.

99.7% of the observations fall within 3 standard deviations of the mean, that is, between $\bar{x} - 3s$ and $\bar{x} + 3s$.
Sample Statistics and Population Parameters

Population: The collection of all individuals or items under consideration.

Sample: That part of the population from which we actually collect information.

We use a sample to draw conclusions about the entire population.

Parameter: Numerical summary of the population.

Statistic: Numerical summary of a sample.

Notations:

Population Mean $\mu$

Population Standard Deviation $\sigma$

Sample Mean $\bar{x}$

Sample Standard Deviation $s$

2.5 How Can Measures of Position Describe Spread?

Measures of position:

Quartiles

Percentiles

Percentiles:

pth percentile: a value such that p percent of the observations fall below or at that value.

Quartiles

First quartile $Q_1$, the same as the 25th percentile (p = 25)

Second quartile $Q_2$, the same as the 50th percentile (p = 50)

Third quartile $Q_3$, the same as the 75th percentile (p = 75)

Calculating Quartiles

To calculate the quartiles:

1. Arrange the observations in increasing order.

2. The second quartile $Q_2$ is the median M (= 50th percentile).

3. The first quartile $Q_1$ is the median of the observations whose position in the ordered list is to the left of the location of the overall median (= 25th percentile).

4. The third quartile $Q_3$ is the median of the observations whose position in the ordered list is to the right of the location of the overall median (= 75th percentile).

Quartiles: Example

Example 2.17 Travel times to work. Find $Q_1$ and $Q_3$.

Arrange the data in order:

5 10 10 10 10 12 15 20 20 25 30 30 40 40 60

The overall median is the middle value, 20.

The observations to the left of the location of the overall median are: 5 10 10 10 10 12 15

$Q_1$ = 10

The observations to the right of the location of the overall median are: 20 25 30 30 40 40 60

$Q_3$ = 30



Quartiles: Example

Example 2.5 Travel times to work. Find $Q_1$ and $Q_3$.

Travel times in minutes of 20 randomly chosen New York workers: 10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45

Arrange the data in order:

5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85

The overall median = (20 + 25)/2 = 22.5 minutes

The observations to the left of the location of the overall median are: 5 10 10 15 15 15 15 20 20 20

$Q_1$ = (15 + 15)/2 = 15 minutes

The observations to the right of the location of the overall median are: 25 30 30 40 40 45 60 60 65 85

$Q_3$ = (40 + 45)/2 = 42.5 minutes
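
The four-step procedure can be written as a small R function. This is a sketch of the median-of-halves method described in these notes; the function name textbook_quartiles is hypothetical. R's built-in quantile() interpolates by default and may give slightly different quartiles for some data sets, so it is not used here.

```r
# Hypothetical helper implementing the median-of-halves method
textbook_quartiles <- function(x) {
  x <- sort(x)                      # Step 1: arrange in increasing order
  n <- length(x)
  half <- floor(n / 2)              # size of each half (middle value excluded when n is odd)
  lower <- x[1:half]                # observations left of the median's location
  upper <- x[(n - half + 1):n]      # observations right of the median's location
  c(Q1 = median(lower), Q2 = median(x), Q3 = median(upper))
}

# North Carolina travel times (n = 15)
nc <- c(30, 20, 10, 40, 25, 20, 10, 60, 15, 40, 5, 30, 12, 10, 10)

# New York travel times (n = 20)
ny <- c(10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
        15, 20, 85, 15, 65, 15, 60, 60, 40, 45)

q_nc <- textbook_quartiles(nc)
q_ny <- textbook_quartiles(ny)
print(q_nc)
print(q_ny)
```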

Another Measure of Spread: The Interquartile Range

The Interquartile Range (IQR)

The Interquartile Range = $Q_3 - Q_1$

Example (Travel times to work): Find the IQR.

5 10 10 10 10 12 15 20 20 25 30 30 40 40 60

Here $Q_1 = 10$ and $Q_3 = 30$, so IQR = 30 - 10 = 20.
Detecting Potential Outliers: The 1.5*IQR Criterion

The 1.5*IQR Criterion for Identifying Potential Outliers

An observation is a potential outlier if it falls more than 1.5*IQR below the first quartile or more than 1.5*IQR above the third quartile.

Detecting Potential Outliers: Example

Example 2.18 Travel times to work (in minutes). Detect potential outliers.

5 10 10 10 10 12 15 20 20 25 30 30 40 40 80
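
A short R sketch of the 1.5*IQR criterion applied to these 15 travel times. The quartiles are taken as the medians of the lower and upper halves, as in the notes; the variable names are illustrative.

```r
# Travel times to work (minutes), already sorted
travel <- c(5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 80)

# Quartiles by the median-of-halves method (n = 15, the middle value is position 8)
q1 <- median(travel[1:7])    # lower half
q3 <- median(travel[9:15])   # upper half
iqr <- q3 - q1

# Fences: observations beyond these are potential outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- travel[travel < lower_fence | travel > upper_fence]

print(c(Q1 = q1, Q3 = q3, IQR = iqr))
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```

With these data the only observation outside the fences is the largest travel time.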


The Five-Number Summary and the Box Plot

The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation.

Minimum   $Q_1$   Median   $Q_3$   Maximum

Example 2.19 The five-number summary of travel times to work.

5 10 10 10 10 12 15 20 20 25 30 30 40 40 80
The Box Plot

Constructing a box plot

A box goes from $Q_1$ to $Q_3$.

A line is drawn inside the box at the median.

A line goes from the lower end of the box to the smallest observation that is not a potential outlier. A separate line goes from the upper end of the box to the largest observation that is not a potential outlier. These lines are called whiskers.

The potential outliers are shown separately.
Example (Constructing a boxplot)

Travel times to work.

5 10 10 10 10 12 15 20 20 25 30 30 40 40 80

Steps:

1. Find Q1, Q2, and Q3.

2. Find the IQR.

3. Determine two fences:
   lower fence = Q1 - 1.5*IQR
   upper fence = Q3 + 1.5*IQR

4. Identify potential outliers.

5. Determine the whiskers: one from Q1 to the smallest observation within the fences, and the other from Q3 to the largest within the fences.

6. Draw the boxplot.

[Box plot on an axis from 20 to 80, marking Q1 = 10, Q2 = 20, Q3 = 30, the smallest and largest observations within the fences, and the outlier]

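Example (Boxplot)

(Text Page 67)

Sodium values for 20 breakfast cereals:

0 70 125 125 140 150 170 170 180 200 200 210 210 220 220 230 250 260 290 290

R code:

# sodium values for the 20 cereals
x <- c(0, 70, 125, 125, 140, 150, 170, 170, 180, 200,
       200, 210, 210, 220, 220, 230, 250, 260, 290, 290)

# horizontal box plot, filled with color 3 of the current palette
boxplot(x, col = 3, horizontal = TRUE)

[Horizontal box plot of the sodium values on an axis from 0 to 300]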
Interpretation of Box Plots

IQR measures the sample variability (or spread).

A box plot indicates skew. The side with the larger part of the box and the longer whisker usually has skew in that direction.

Interpret box plots in terms of symmetry, median, spread, ...

Side-by-Side Box Plots

Help to compare groups (in terms of symmetry, median, spread, ...).

Example: (College student heights) See the “Heights” data on the text CD.

R code (copy and paste to R):

# read the heights data; assumes heights.csv is in the working directory
heights <- read.table("heights.csv", sep = ",", header = TRUE)

# one box plot of HEIGHT for each level of GENDER
boxplot(HEIGHT ~ GENDER, data = heights, col = 3:4)

[Side-by-side box plots comparing heights for GENDER = 0 and GENDER = 1, on an axis from 55 to 90]
The z-Score

The z-score for an observation is the number of standard deviations that it falls from the mean, and in which direction.

$z = \frac{\text{observation} - \text{mean}}{\text{standard deviation}}$

An observation in a bell-shaped distribution is regarded as a potential outlier if it falls more than three standard deviations from the mean; that is, z > 3 or z < -3. (Recall the empirical rule: 99.7% of values are within 3 standard deviations of the mean.)

The z-Score: Example

Example 2.20

The heights of a group of young women have mean $\bar{x} = 64$ inches and standard deviation $s = 2.7$ inches. The histogram of heights is bell-shaped.

-- A woman 70 inches tall has standardized height $z = \frac{70 - 64}{2.7} = 2.22$, or 2.22 standard deviations above the mean.

-- A woman 60 inches tall has standardized height $z = \frac{60 - 64}{2.7} = -1.48$, or 1.48 standard deviations less than the mean height.

-- A woman 73 inches tall has standardized height $z = \frac{73 - 64}{2.7} = 3.33$, or 3.33 standard deviations above the mean height, and that woman's height is a potential outlier.
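
The three standardized heights can be checked in R. The helper z_score is a hypothetical name introduced for this sketch; it simply applies (observation - mean) / standard deviation.

```r
# Hypothetical helper: number of standard deviations from the mean
z_score <- function(observation, mean, sd) {
  (observation - mean) / sd
}

heights <- c(70, 60, 73)                  # heights in inches
z <- z_score(heights, mean = 64, sd = 2.7)

print(round(z, 2))

# Potential outliers under the three-standard-deviation rule
potential_outliers <- heights[abs(z) > 3]
print(potential_outliers)
```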



2.6 How Can Graphical Summaries Be Misused?


Self reading

Statistics for the Physical Sciences

STAT 229

Chapter 3

Association: Contingency, Correlation, and Regression

Homework #3

3-1: Problems 3.2, 3.4, 3.6, 3.8, 3.10

3-2: Problems 3.12, 3.14, 3.16, 3.18, 3.22

3-3: Problems 3.26, 3.30, 3.36, 3.38, 3.40

3-4: Problems 3.48, 3.50, 3.52, 3.54, 3.56, 3.58, 3.60

Response Variables and Explanatory Variables

In this chapter, we discuss statistical methods for data on two variables.

Sometimes, one of the two variables may be termed the response variable and the other the explanatory variable.

The response variable is the outcome variable on which comparisons are made.

The explanatory variable defines the groups to be compared with respect to values on the response variable.


Is Smoking Actually Beneficial to Your Health?

This is Example 1 on page 93 of the text. 1314 women were asked whether they were smokers. They were followed over a period of 20 years.

              Survival Status
Smoker        Dead      Alive     Total
Yes           139       443       582
No            230       502       732
Total         369       945       1314

It’s natural to treat the variable “Survival Status” as a response variable and “Smoker” as an explanatory variable.

Associations

The main purpose of a data analysis with two variables is to investigate whether there is an association and to describe the nature of that association.

An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.

Is the variable “Survival Status” associated with the variable “Smoker”? Does smoking lead to cancer?

Other Examples of Association



Smoking and BMI


Smoking and lung cancer


Irrigation and plant growth


Traffic and air pollution


Gender and height


3.1 Explore the Association between Two Categorical Variables

A contingency table is used to explore the association between two categorical variables:

Rows list the categories of one variable.

Columns list the categories of the other variable.

Each cell in the table holds the number of observations (frequency) in the sample with certain outcomes on the two variables.

Cross-tabulation: The process of finding the frequencies for the cells of a contingency table.

The previous table is an example of a contingency table.

Construct Contingency Tables From Raw Data

Excel Data: Two Variables

Cancer Treatment: treatments given to the cancer patients (Surgery and Radiation therapy).

Cancer Controlled: whether cancer has been controlled (Yes and No).

Contingency table (Example)

                      Cancer Controlled
Treatment             Yes       No        Total
Surgery               21        2         23
Radiation Therapy     15        3         18
Total                 36        5         41

Questions:

(1) What proportion of the patients who had surgery had their cancer controlled?

(2) What proportion of all cancer patients had their cancer controlled?

Answer

(1) 21 / 23 = 91% of the patients who had surgery had their cancer controlled.

(2) 36 / 41 = 88% of all cancer patients had their cancer controlled.

Conditional Proportions

A conditional proportion is the proportion of one variable at a given level of the other variable.

Marginal proportion

A marginal proportion is the proportion of a row or column variable.
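
Both kinds of proportion can be computed in R for the cancer treatment table (Surgery: 21 controlled, 2 not; Radiation therapy: 15 controlled, 3 not). This is a minimal sketch; the object names are illustrative, and prop.table() with margin = 1 is used to divide each row by its row total.

```r
# Contingency table: rows = treatment, columns = cancer controlled
tab <- matrix(c(21, 2,
                15, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(Treatment  = c("Surgery", "Radiation therapy"),
                              Controlled = c("Yes", "No")))

# Conditional proportions of "Controlled" given each treatment (rows sum to 1)
cond <- prop.table(tab, margin = 1)

# Marginal proportions of "Controlled" over all patients
marg <- colSums(tab) / sum(tab)

print(round(cond, 3))
print(round(marg, 3))
```

The Surgery row of cond gives the 21/23 answer to question (1), and the "Yes" entry of marg gives the 36/41 answer to question (2).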


Side-by-side bars

Display conditional proportions.

Useful for making comparisons.

Side-by-side bars: Example

[Side-by-side bar graph: Cancer condition for two cancer treatments. Proportion (0 to 1) of Cancer controlled (Yes, No) for Surgery and Radiation Therapy]

The proportion of patients who had their cancer controlled is slightly higher for the patients who had surgery than for those who had radiation therapy.

Is There an Association?

Food type       Pesticide Present   Pesticide Not Present
Organic         29 (0.23)           98 (0.77)
Conventional    19485 (0.73)        7086 (0.27)

[Side-by-side bar graph: Pesticide Status for Organic vs. Conventional Food. Proportion (0 to 0.9) for Pesticide Present and Pesticide Not Present]

Examples

Ex 3.8 page 101

Ex 3.3 page 100


3.2 How Can We Explore the Association between Two Quantitative Variables?

An association can be studied between

two categorical variables

two quantitative variables

a categorical variable and a quantitative variable.

In this section, we explore the association between two quantitative variables.

That is, we will study how a response variable tends to change as the value of an explanatory variable changes.

Scatterplots

A scatterplot is a graphical display of the relationship between two quantitative variables. It portrays two variables simultaneously:

horizontal axis: the explanatory variable

vertical axis: the response variable

point in the display: the observation corresponding to a subject.

[Scatterplot of y (5 to 20) against x (2 to 8)]
Example: Worldwide Use of Internet

See the data (text, page 103).

Data dictionary:

GDP: Gross domestic product, per capita, in thousands of US dollars

CO2: Carbon dioxide emissions, per capita, in tons

Cellular: Percentage of adults who are cellular-phone subscribers

Fertility: Mean number of children per adult woman

Questions to explore

(1) Describe the center and spread of the data distribution.

(2) Portray the relationship with a scatterplot for Internet use and GDP.

(3) What do you learn about the association by inspecting the scatterplot?

[Histogram of INTERNET (0 to 50). Mean: 21.14, Standard deviation: 18.47]

[Histogram of GDP (0 to 35). Mean: 16.00, Standard deviation: 10.60]

[Scatterplot of INTERNET (0 to 50) against GDP (0 to 35)]
Interpreting Scatterplots

You can describe the overall pattern of a scatterplot by the trend, direction, and strength of the relationship between the two variables.

Trend: linear, curved, clusters, no pattern

Direction: positive, negative, no direction

Strength: how closely the points fit the trend

Also look for outliers from the overall trend.

Positive Association

Two quantitative variables x and y are

Positively associated when

High values of x tend to occur with high values of y

Low values of x tend to occur with low values of y

Negatively associated when high values of one variable tend to pair with low values of the other variable

Would you expect a positive association, a negative association, or no association between the age of a car and the mileage on its odometer?

a) Positive association

b) Negative association

c) No association



Moving Graphics



http://www.gapminder.org

Linear Correlation, r

Measures the strength and direction of the linear association between x and y

A positive r value indicates a positive association

A negative r value indicates a negative association

An r value close to +1 or -1 indicates a strong linear association

An r value close to 0 indicates a weak association

$r = \frac{1}{n - 1} \sum \left(\frac{x - \bar{x}}{s_x}\right) \left(\frac{y - \bar{y}}{s_y}\right)$
Properties of Correlation

Always falls between -1 and +1

Sign of correlation denotes direction

(-) indicates negative linear association

(+) indicates positive linear association

Correlation is unitless: it does not depend on the variables’ units

Two variables have the same correlation no matter which is treated as the response variable

Correlation is sensitive to outliers

Correlation only measures the strength of a linear relationship


Calculating the Correlation Coefficient

Country           Per Capita GDP (x)   Life Expectancy (y)
Austria           21.4                 77.48
Belgium           23.2                 77.53
Finland           20.0                 77.32
France            22.7                 78.63
Germany           20.8                 77.17
Ireland           18.6                 76.39
Italy             21.5                 78.51
Netherlands       22.0                 78.15
Switzerland       23.8                 78.99
United Kingdom    21.2                 77.37

Per Capita Gross Domestic Product and Average Life Expectancy for Countries in Western Europe

Calculating the Correlation Coefficient

x        y        $(x_i - \bar{x})/s_x$   $(y_i - \bar{y})/s_y$   product
21.4     77.48    -0.078                  -0.345                  0.027
23.2     77.53    1.097                   -0.282                  -0.309
20.0     77.32    -0.992                  -0.546                  0.542
22.7     78.63    0.770                   1.102                   0.849
20.8     77.17    -0.470                  -0.735                  0.345
18.6     76.39    -1.906                  -1.716                  3.271
21.5     78.51    -0.013                  0.951                   -0.012
22.0     78.15    0.313                   0.498                   0.156
23.8     78.99    1.489                   1.555                   2.315
21.2     77.37    -0.209                  -0.483                  0.101

$\bar{x} = 21.52$, $\bar{y} = 77.754$, $s_x = 1.532$, $s_y = 0.795$, sum of products = 7.285

The quantities $(x_i - \bar{x})/s_x$ and $(y_i - \bar{y})/s_y$ are called z-scores.

$r = \frac{1}{n - 1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right) = \frac{1}{10 - 1} (7.285) = 0.809$
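
The z-score calculation of r can be reproduced in R from the ten GDP and life expectancy values. This is a sketch with illustrative variable names; the last step compares the hand formula against the built-in cor().

```r
# Per capita GDP (x) and life expectancy (y) for the ten countries
gdp  <- c(21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2)
life <- c(77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37)
n <- length(gdp)

# z-scores of each variable (sd() uses the n - 1 divisor)
zx <- (gdp - mean(gdp)) / sd(gdp)
zy <- (life - mean(life)) / sd(life)

# Correlation as the sum of z-score products divided by n - 1
r <- sum(zx * zy) / (n - 1)

print(round(data.frame(gdp, life, zx, zy, product = zx * zy), 3))
print(round(r, 3))

# The built-in function should agree with the hand formula
print(all.equal(r, cor(gdp, life)))
```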

Divide a Scatterplot into Quadrants

[Scatterplot of INTERNET (0 to 50) against GDP (0 to 35), divided into quadrants I, II, III, and IV by the lines $\bar{x} = 16$ and $\bar{y} = 21.1$]

In quadrant I, both z-scores are positive;

In quadrant II, z-scores of INTERNET are positive, while z-scores of GDP are negative;

In quadrant III, both z-scores are negative;

In quadrant IV, z-scores of GDP are positive, while z-scores of INTERNET are negative.


3.3 How Can We Predict the Outcome of a Variable?

When a scatterplot indicates a relationship between two variables, we can start fitting a curve to the data.

The procedure of fitting a curve to the data, along with inferences about parameters of interest and prediction of the response value, is called regression analysis.

Regression Analysis

The first step of a regression analysis is to identify the response and explanatory variables.

We use y to denote the response variable.

We use x to denote the explanatory variable.

Regression Line

A regression line is a straight line that describes how the response variable (y) changes as the explanatory variable (x) changes.

A regression line predicts the value of the response variable (y) for a given level of the explanatory variable (x).

The y-intercept of the regression line is denoted by a.

The slope of the regression line is denoted by b.


Example: How Can Anthropologists
Predict Height Using Human Remains?


Regression Equation: $\hat{y} = 61.4 + 2.4x$

$\hat{y}$ is the predicted height and $x$ is the length of a femur (thighbone), measured in centimeters

Use the regression equation to predict
the height of a person whose femur
length was 50 centimeters



$\hat{y} = 61.4 + 2.4(50) = 181.4$

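
As a quick check of the arithmetic, the femur equation can be wrapped in a small function. This is only a sketch; the function name `predict_height_cm` is made up for illustration, and the coefficients are the ones quoted on the slide.

```python
def predict_height_cm(femur_length_cm: float) -> float:
    """Predicted height from the slide's equation y-hat = 61.4 + 2.4x."""
    intercept = 61.4  # a, the y-intercept from the slide
    slope = 2.4       # b, the slope from the slide
    return intercept + slope * femur_length_cm

# Femur length of 50 cm, as in the slide's example.
print(round(predict_height_cm(50), 1))
```
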
Interpreting the y-Intercept


y-Intercept:


The predicted value for y when x = 0



Helps in plotting the line



May not have any interpretative value if no
observations had x values near 0


Interpreting the Slope


Slope: measures the change in the predicted value of the response variable (y) for a 1 unit increase in the explanatory variable (x)



Example: A 1 cm increase in femur length corresponds to a 2.4 cm increase in predicted height


Slope Values:

Positive, Negative, Equal to 0

Regression Line


At a given value of x, the equation $\hat{y} = a + bx$:


Predicts a single value of the response variable


But… we should not expect all subjects at that value of x to have the same value of y


Variability occurs in the y values!

Residuals


Measures the size of the prediction errors, the
vertical distance between the point and the
regression line



Each observation has a residual



Calculation for each residual: $y - \hat{y}$


A large residual indicates an unusual observation

“Least Squares Method” Yields the
Regression Line


Residual sum of squares: $\sum (\text{residuals})^2 = \sum (y - \hat{y})^2$


The least squares regression line is the line that minimizes the sum of the squared vertical distances between the points and their predictions, i.e., it minimizes the residual sum of squares


Note: the sum of the residuals about the regression line will always be zero

Regression Formulas for y-Intercept and Slope


Slope: $b = r \left( \frac{s_y}{s_x} \right)$


Y-Intercept: $a = \bar{y} - b(\bar{x})$


Regression line always passes through $(\bar{x}, \bar{y})$
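
The formulas and the least squares properties can be checked numerically. The sketch below is illustrative only: it borrows the ten (x, y) pairs from the z-score table earlier in these notes, and the helper `rss` is a made-up name.

```python
from math import sqrt

# Illustrative data: the ten (x, y) pairs from the z-score table.
x = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
y = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)

# Slide formulas: b = r * (s_y / s_x) and a = y_bar - b * x_bar.
b = r * (s_y / s_x)
a = y_bar - b * x_bar

def rss(intercept, slope):
    """Residual sum of squares for the line y-hat = intercept + slope * x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"sum of residuals = {abs(sum(residuals)):.6f}")
print(f"line at x_bar = {a + b * x_bar:.3f}, y_bar = {y_bar:.3f}")
print(f"RSS at (a, b) = {rss(a, b):.4f}, RSS with slope + 0.05 = {rss(a, b + 0.05):.4f}")
```

Nudging either coefficient away from the fitted values increases the residual sum of squares, which is what "least squares" means.
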

Calculating the slope and y intercept for the regression line

Given: $\bar{x} = 0.275$, $\bar{y} = 4.979$, $s_x = 0.0091$, $s_y = 0.368$, $r = 0.653$. Find a and b.

$b = r \left( \frac{s_y}{s_x} \right) = 0.653 \left( \frac{0.368}{0.0091} \right) = 26.4$

$a = \bar{y} - b(\bar{x}) = 4.979 - 26.4(0.275) = -2.28$

Slope = 26.4

y intercept = -2.28
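
The same calculation in Python, as a sketch; the function name `slope_and_intercept` is an illustrative choice, and the inputs are the summary statistics given on the slide.

```python
def slope_and_intercept(x_bar, y_bar, s_x, s_y, r):
    """Apply b = r * (s_y / s_x) and a = y_bar - b * x_bar."""
    b = r * (s_y / s_x)
    a = y_bar - b * x_bar
    return a, b

a, b = slope_and_intercept(x_bar=0.275, y_bar=4.979, s_x=0.0091, s_y=0.368, r=0.653)
print(f"b = {b:.1f}, a = {a:.2f}")  # b = 26.4, a = -2.28
```
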

Internet Usage and Gross Domestic Product (GDP)

Country           INTERNET     GDP
Algeria               0.65    6.09
Argentina            10.08   11.32
Australia            37.14   25.37
Austria              38.7    26.73
Belgium              31.04   25.52
Brazil                4.66    7.36
Canada               46.66   27.13
Chile                20.14    9.19
China                 2.57    4.02
Denmark              42.95   29
Egypt                 0.93    3.52
Finland              43.03   24.43
France               26.38   23.99
Germany              37.36   25.35
Greece               13.21   17.44
India                 0.68    2.84
Iran                  1.56    6
Ireland              23.31   32.41
Israel               27.66   19.79
Japan                38.42   25.13
Malaysia             27.31    8.75
Mexico                3.62    8.43
Netherlands          49.05   27.19
New Zealand          46.12   19.16
Nigeria               0.1     0.85
Norway               46.38   29.62
Pakistan              0.34    1.89
Philippines           2.56    3.84
Russia                2.93    7.1
Saudi Arabia          1.34   13.33
South Africa          6.49   11.29
Spain                18.27   20.15
Sweden               51.63   24.18
Switzerland          30.7    28.1
Turkey                6.04    5.89
United Kingdom       32.96   24.16
United States        50.15   34.32
Vietnam               1.24    2.07
Yemen                 0.09    0.79

Using TI-83

Enter x data into L1

Enter y data into L2

1. STAT CALC menu

2. Choose 8: LinReg(a+bx)

3. 1st number = x variable

4. 2nd number = y variable

5. Enter

Cereal: Sodium and Sugar

The Slope and the Correlation


Correlation:


Describes the strength of the linear association between 2
variables


Does not change when the units of measurement change


Does not depend upon which variable is the response and
which is the explanatory


Slope:


Numerical value depends on the units used to measure the
variables


Does not tell us whether the association is strong or weak


The two variables must be identified as response and
explanatory variables


The regression equation can be used to predict values of the
response variable for given values of the explanatory
variable


The Squared Correlation


When a strong linear association exists, the regression equation predictions tend to be much better than the predictions using only $\bar{y}$.


We measure the proportional reduction in error and call it $r^2$, which measures the proportion of the variation in the y-values that is accounted for by the linear relationship of y with x.


A correlation of 0.9 means that $0.9^2 = 0.81 = 81\%$:


81% of the variation in the y-values can be explained by the explanatory variable, x

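
The link between the proportional reduction in error and the squared correlation can be checked numerically. The sketch below is illustrative: it borrows the ten (x, y) pairs from the z-score table earlier in these notes, and it computes the slope as Sxy/Sxx, which is algebraically the same as $r (s_y / s_x)$.

```python
from math import sqrt

# Illustrative data: the ten (x, y) pairs from the z-score table.
x = [21.4, 23.2, 20.0, 22.7, 20.8, 18.6, 21.5, 22.0, 23.8, 21.2]
y = [77.48, 77.53, 77.32, 78.63, 77.17, 76.39, 78.51, 78.15, 78.99, 77.37]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / sxx            # least squares slope
a = y_bar - b * x_bar    # least squares intercept
r = sxy / sqrt(sxx * syy)

# Prediction error using only y_bar versus using the regression line.
error_using_mean = syy
error_using_line = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
reduction = (error_using_mean - error_using_line) / error_using_mean

print(f"r^2 = {r ** 2:.3f}, proportional reduction in error = {reduction:.3f}")
```

The two printed numbers agree, which is why $r^2$ is read as the share of the variation in y accounted for by the line.
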
3.4 What Are Some Cautions in
Analyzing Association?



Be cautious of


Extrapolation


Influential outliers


Interpretation of correlation or association


Lurking variables


Confounding

Extrapolation


Extrapolation: Using a regression line to predict y-values for x-values outside the observed range of the data


It’s riskier as we move farther from the range of the given x-values


There is no guarantee that the relationship given by the regression equation holds outside the range of sampled x-values


Outliers and Influential Points


A regression outlier is an observation/point that lies far away from the trend that the rest of the data follows


An observation is influential if


Its x value is relatively low or high compared to the remainder of the data, and


The observation is a regression outlier.


An influential observation tends to pull the
regression line toward that data point and
away from the rest of the data.


Impact of removing an Influential
data point

Interpretation of Correlation and
Association


Correlation does not imply causation.


In general, it’s also true that association does not imply causation. This warning holds whether we analyze associations between qualitative variables or between quantitative variables.


Create a scatterplot for “Crime rate” against “Education” in the “FL crime” data on the text CD.

[Figure: Scatterplot of Crime against Education, with fitted line y = 0.1467x + 61.802 and R² = 0.218; axes labeled Education and Crime rate]

Lurking Variables


A lurking variable is a variable, usually unobserved, that influences the association between the variables of primary interest.

STAT 319 Biometrics Fall 2008


Example: A reporter studied the causes of a house fire and established a high positive correlation between the damages (in dollars) and the number of firefighters at the scene. Which of the following could be a lurking variable that is responsible for the association?


(a) Firefighter

(b) Weather

(c) Size of the house

(d) Size of the blaze




Example: An economist noticed that nations with more TV sets have higher life expectancies. He established a high positive correlation between length of life and number of TV sets. Find the lurking variable, if there is one.

(a) TV sets brands

(b) Popcorn

(c) Wealth of the nation

(d) Sofa

(e) No confounding variable


Simpson’s Paradox



Simpson’s Paradox refers to the phenomenon that the direction of an association between two variables can change after we include a third variable and analyze the data at separate levels of that variable. (Book)


Simpson's paradox (or the Yule-Simpson effect) is a statistical paradox wherein the successes of groups seem reversed when the groups are combined. (Wiki)

Is Smoking Actually Beneficial to Your Health?


This is Example 1 on page 93 of text. 1314 women were asked whether
they were smokers. They were followed over a period of 20 years.



            Survival Status
Smoker      Dead     Alive     Total
Yes          139       443       582
No           230       502       732
Total        369       945      1314

The data indicate that smoking could apparently be
beneficial to your health. Could a lurking variable be
responsible for the association?


There was also age information about the 1314 women involved in the study. These women can be stratified into 4 different age groups, creating 4 contingency tables.

                              Age group
             18-34          35-54          55-64           65+
Smoker    Dead  Alive    Dead  Alive    Dead  Alive    Dead  Alive
Yes          5    174      41    198      51     64      42      7
No           6    213      19    180      40     81     165     28

Question: For each age group, find conditional proportions of deaths for smokers and nonsmokers.
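
One way to organize the calculation is sketched below in Python. The counts are copied from the two tables above; the dictionary layout and the names `death_proportion` and `pooled_death_proportion` are illustrative choices.

```python
# (dead, alive) counts by age group, copied from the stratified table.
counts = {
    "18-34": {"smoker": (5, 174), "nonsmoker": (6, 213)},
    "35-54": {"smoker": (41, 198), "nonsmoker": (19, 180)},
    "55-64": {"smoker": (51, 64), "nonsmoker": (40, 81)},
    "65+": {"smoker": (42, 7), "nonsmoker": (165, 28)},
}

def death_proportion(dead, alive):
    """Conditional proportion of deaths within one group."""
    return dead / (dead + alive)

def pooled_death_proportion(status):
    """Proportion of deaths after combining all age groups."""
    dead = sum(rows[status][0] for rows in counts.values())
    alive = sum(rows[status][1] for rows in counts.values())
    return death_proportion(dead, alive)

for age_group, rows in counts.items():
    smoker = death_proportion(*rows["smoker"])
    nonsmoker = death_proportion(*rows["nonsmoker"])
    print(f"{age_group:>6}: smokers {smoker:.3f}, nonsmokers {nonsmoker:.3f}")

print(f"  all: smokers {pooled_death_proportion('smoker'):.3f}, "
      f"nonsmokers {pooled_death_proportion('nonsmoker'):.3f}")
```

Within every age group the smokers' death proportion is at least as high as the nonsmokers', yet the combined table shows the opposite direction, which is the reversal that Simpson's paradox describes.
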

More Simpson Paradoxes


http://en.wikipedia.org/wiki/Simpson's_paradox




Simpson's paradox for continuous data: a positive
trend appears for two separate groups (blue and red),
a negative trend (black, dashed) appears when the
data are combined.

Confounding


When two explanatory variables are both associated with a response variable but are also associated with each other, there is said to be confounding.


Age is a confounding variable in the study
of the association between smoking and
survival status.

Difference between a Confounding
Variable and a Lurking Variable


A confounding variable is already included in the
study. It is associated both with the response
variable and the explanatory variable.


A lurking variable is not measured in the study. It has the potential for confounding.


The effect of an explanatory variable can be analyzed by adjusting for confounding variables.


Ignoring lurking variables results in misleading conclusions. (age in smoking-survival association).

Chapter 4:

Gathering Data

Section 4.1

Should We Experiment or Should
We Merely Observe?

Statistics for the Physical Sciences (STAT 229-02)

Homework #4

4-1: Problems 4.2, 4.4, 4.6, 4.8, 4.10

4-2: Problems 4.14, 4.18, 4.20, 4.22, 4.28, 4.30

4-3: Problems 4.34, 4.36, 4.38, 4.40, 4.42

4-4: Problems 4.44, 4.46, 4.48, 4.50, 4.52, 4.54


Learning Objectives:

1. Population versus Sample

2. Types of Studies: Experimental and Observational

3. Comparing Experimental and Observational Studies




Learning Objective 1:

Population and Sample


Population: all the subjects of interest


We use statistics to learn about the population, the entire group of interest


Sample: subset of the population


Data is collected for the sample because we cannot typically measure all subjects in the population

Learning Objective 2:

Type of Study: Observational Study


In an observational study, the researcher observes values of the response variable and explanatory variables for the sampled subjects, without anything being done to the subjects (such as imposing a treatment)

Learning Objective 2:

Observational Study


Sample Survey


A sample survey selects a sample of
people from a population and interviews
them to collect data.


A sample survey is a type of observational
study.


A census is a survey that attempts to
count the number of people in the
population and to measure certain
characteristics about them



Learning Objective 2:

Type of Study: Experiment


A researcher conducts an experiment by assigning subjects to certain experimental conditions and then observing outcomes on the response variable


The experimental conditions, which correspond to assigned values of the explanatory variable, are called treatments

Learning Objective 2:

Example


Headline: “Student Drug Testing Not Effective in
Reducing Drug Use”



Facts about the study:



76,000 students nationwide


Schools selected for the study included schools
that tested for drugs and schools that did not test
for drugs


Each student filled out a questionnaire asking
about his/her drug use


Learning Objective 2:

Example









Conclusion: Drug use was similar in schools that
tested for drugs and schools that did not test for
drugs



Learning Objective 2:

Example

This study was an observational study.

In order for it to be an experiment, the researcher would have had to assign each school to use or not use drug testing rather than leaving this decision to the school.

Learning Objective 3:

Comparing Experiments and Observational Studies


An experiment reduces the potential for lurking variables to affect the result. Thus, an experiment gives the researcher more control over outside influences.


Only an experiment can establish cause and effect. Observational studies cannot.


Experiments are not always possible due to ethical reasons, time considerations and other factors.

Chapter 4

Gathering Data

Section 4.2

What are Good Ways and Poor Ways to
Sample?

Learning Objectives:

1.
Sampling Frame & Sampling Design

2.
Simple Random Sample (SRS)

3.
Random number table

4.
Margin of Error

5.
Convenience Samples

6.
Types of Bias in Sample Surveys

7.
Key Parts of a Sample Survey


Learning Objective 1:

Sampling Frame & Sampling Design


The sampling frame is the list of subjects in the population from which the sample is taken; ideally, it lists the entire population of interest


The sampling design determines how the sample is selected. Ideally, it should give each subject an equal chance of being selected to be in the sample

Learning Objective 2:

Simple Random Sampling, SRS


Random Sampling is the best way of obtaining a sample that is representative of the population


A simple random sample of ‘n’ subjects from a population is one in which each possible sample of that size has the same chance of being selected

Learning Objective 2:

SRS Example


Two club officers are to be chosen for a New Orleans trip



There are 5 officers: President, Vice-President, Secretary, Treasurer and Activity Coordinator


The 10 possible samples are:


(P,V) (P,S) (P,T) (P,A) (V,S)


(V,T) (V,A) (S,T) (S,A) (T,A)


For a SRS, each of the ten possible samples has an
equal chance of being selected. Thus, each sample has
a 1 in 10 chance of being selected and each officer has
a 4 in 10 chance of being selected.
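
The count of ten samples, and the 4 in 10 chance for each officer, can be verified by listing every pair. A minimal Python sketch using the standard library's `itertools.combinations`; the one-letter labels follow the slide.

```python
from itertools import combinations

officers = ["P", "V", "S", "T", "A"]

# Every possible sample of size 2, ignoring order.
samples = list(combinations(officers, 2))
print(len(samples), samples)

# In how many of the equally likely samples does each officer appear?
for officer in officers:
    appearances = sum(officer in sample for sample in samples)
    print(f"{officer}: {appearances} in {len(samples)}")
```
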


Learning Objective 3:

SRS: Table of Random Numbers


Table E on pg. A6 of text

Table of Random Numbers

Learning Objective 3:

Using Random Numbers to select a SRS


To select a simple random sample


Number the subjects in the sampling frame
using numbers of the same length (number of
digits)


Select numbers of that length from a table of
random numbers or using a random number
generator


Include in the sample those subjects having
numbers equal to the random numbers
selected

Learning Objective 3:

Choosing a simple random sample

We need to select a random sample of 5 from a class of 20 students.

1) List and number all members of the population, which is the class of 20.

2) The number 20 is two digits long.

3) Parse the list of random digits into numbers that are two digits long. Here we choose to start with line 2, for no particular reason.

22 36 84 65 73 25 59 58 53 93 30 99 58 91 98 27 98 25 34 02

1 Alison

2 Amy

3 Brigitte

4 Darwin

5 Emily

6 Fernando

7 George

8 Harry

9 Henry

10 John

11 Kate

12 Max

13 Moe

14 Nancy

15 Ned

16 Paul

17 Ramon

18 Rupert

19 Tom

20 Victoria



4) Choose a random sample of size 5 by reading through the list of two-digit random numbers, starting with line 2 and on.

5) The first five random numbers matching numbers assigned to people make the SRS.

Remember that 1 is 01, 2 is 02, etc.

If you were to hit 09 again before getting five people, don’t sample Henry twice; you just keep going.

Line 2: 22 36 84 65 73 25 59 58 53 93 30 99 58 91 98 27 98 25 34 02

The first individual selected is Amy, number 02. That’s it from line 2. Move to line 3.

Line 3: 24 13 04 83 60 22 52 79 72 65 76 39 36 48 09 15 17 92 48 30

Then Moe (13), Darwin (04), Henry (09), and Ned (15).
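
The table lookup can be mimicked in a few lines of Python. This is a sketch of the procedure described on the slide, using the roster and the two lines of random digits shown there; the function name `select_srs` is illustrative.

```python
roster = [
    "Alison", "Amy", "Brigitte", "Darwin", "Emily", "Fernando", "George",
    "Harry", "Henry", "John", "Kate", "Max", "Moe", "Nancy", "Ned",
    "Paul", "Ramon", "Rupert", "Tom", "Victoria",
]
# Two-digit labels 01..20, as on the slide.
labels = {f"{i:02d}": name for i, name in enumerate(roster, start=1)}

line_2 = "22 36 84 65 73 25 59 58 53 93 30 99 58 91 98 27 98 25 34 02"
line_3 = "24 13 04 83 60 22 52 79 72 65 76 39 36 48 09 15 17 92 48 30"

def select_srs(digit_pairs, labels, sample_size):
    """Read pairs in order; keep labels that match a subject, skipping repeats."""
    chosen = []
    for pair in digit_pairs:
        name = labels.get(pair)
        if name is not None and name not in chosen:
            chosen.append(name)
        if len(chosen) == sample_size:
            break
    return chosen

sample = select_srs((line_2 + " " + line_3).split(), labels, 5)
print(sample)  # ['Amy', 'Moe', 'Darwin', 'Henry', 'Ned']
```

In practice a random number generator replaces the printed table; in Python, `random.sample(roster, 5)` draws a simple random sample directly.
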


Learning Objective 4:

Margin of Error


Sample surveys are commonly used to
estimate population percentages


These estimates include a margin of error which tells us how well the sample estimate predicts the population percentage


When a SRS of n subjects is used, the margin of error is approximately $\frac{1}{\sqrt{n}} \times 100\%$
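
A short Python sketch of this approximation; the function name `approx_margin_of_error_pct` and the sample sizes tried are illustrative choices.

```python
from math import sqrt

def approx_margin_of_error_pct(n: int) -> float:
    """Approximate margin of error, in percentage points, for an SRS of size n."""
    return 100 / sqrt(n)

for n in (100, 1000, 2500):
    print(n, round(approx_margin_of_error_pct(n), 1))
```

A sample of about 1000 gives roughly 3 percentage points, and quadrupling the sample size only halves the margin of error.
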


Learning Objective 4:

Example: Margin of Error


A survey result states: “The margin of error is plus or minus 3 percentage points”


This means: “It is very likely that the reported sample percentage is no more than 3% lower or 3% higher than the population percentage”


Click here to see a Gallup example. Read the “Survey Methods” part and justify the margin of error in the survey.

Learning Objective 5:

Convenience Samples: Poor Ways to Sample


Convenience Sample: a type of survey sample that is easy to obtain



Unlikely to be representative of the
population


Often severe biases result from such a
sample


Results apply ONLY to the observed
subjects; that is, they are descriptive.


Learning Objective 5:

Convenience Samples: Poor Ways to Sample


Volunteer Sample: most common form of convenience sample


Subjects volunteer for the sample


Volunteers do not tend to be representative of the entire population

Learning Objective 6:

Types of Bias in Sample Surveys

Bias: Tendency to systematically favor certain parts of the population over others


Sampling bias: occurs when the sampling method produces an unrepresentative sample, for example by using nonrandom samples or having undercoverage


Nonresponse bias: occurs when some sampled subjects cannot be reached or refuse to participate or fail to answer some questions


Response bias: occurs when the subject gives an incorrect response or the question is misleading


A Large Sample Does Not Guarantee An Unbiased Sample!

Learning Objective 7:

Key Parts of a Sample Survey


Identify the population of all subjects of interest


Construct a sampling frame which attempts to list all subjects in the population


Use a random sampling design to select n subjects from the sampling frame


Be cautious of sampling bias due to nonrandom samples


We can make inferences about the population of interest when sample surveys that use random sampling are employed.

Chapter 4

Gathering Data

Section 4.3

What Are Good Ways and Poor Ways to Experiment?

Learning Objectives:

1.
Identify the elements of an experiment

2.
Experiments

3.
3 Components of a good experiment

4.
Blinding the Study

5.
Define Statistical Significance

6.
Generalizing Results of the Study




Learning Objective 1:

Elements of an Experiment


Experimental units: the subjects of an experiment; the entities that we measure in an experiment


Treatment: A specific experimental condition imposed on the subjects of the study; the treatments correspond to assigned values of the explanatory variable


Explanatory variable: Defines the groups to be compared with respect to values on the response variable


Response variable: The outcome measured on the subjects to reveal the effect of the treatment(s).

Learning Objective 2:

Experiments


An experiment deliberately imposes treatments on the
experimental units in order to observe their responses.


The goal of an experiment is to compare the effect of the treatment on the response.


In a randomized experiment, the subjects are randomly assigned to the treatments; randomization helps to eliminate the effects of lurking variables

Learning Objective 3:

3 Components of a Good Experiment


Control/Comparison group: allows the
researcher to analyze the effectiveness of
the primary treatment


Randomization: eliminates possible
researcher bias, balances the comparison
groups on known as well as on lurking
variables (so that the observed difference
among subjects is attributed to treatments)


Replication: allows us to attribute
observed effects to the treatments rather
than ordinary variability


Learning Objective 3:

Principle 1: Control or Comparison Group


A placebo is a dummy treatment, e.g., a sugar pill. Many subjects respond favorably to any treatment, even a placebo.


A control group typically receives a placebo. A control
group allows us to analyze the effectiveness of the
primary treatment.


A control group need not receive a placebo. Clinical trials
often compare a new treatment for a medical condition, not
with a placebo, but with a treatment that is already on the
market.


Learning Objective 3:

Principle 1: Control or Comparison Group


Experiments should compare treatments rather than attempt to assess the effect of a single treatment in isolation


Is the treatment group better, worse, or no different
than the control group?


Example: 400 volunteers are asked to quit
smoking and each start taking an antidepressant.
In 1 year, how many have relapsed? Without a
control group (individuals who are not on the
antidepressant), it is not possible to gauge the
effectiveness of the antidepressant.


Learning Objective 3:

Placebo effect


Placebo effect (power of suggestion) : The
“placebo effect” is an improvement in
health due not to any treatment but only to
the patient’s belief that he or she will
improve.



Learning Objective 3:

Principle 2: Randomization


To have confidence in our results we should randomly assign subjects to the treatments. In doing so, we


Eliminate bias that may result from the researcher
assigning the subjects


Balance the groups on variables known to affect the
response


Balance the groups on lurking variables that may be
unknown to the researcher


Learning Objective 3:

Principle 3: Replication



Replication is the process of assigning
several experimental units to each
treatment


The difference due to ordinary variation is
smaller with larger samples


We have more confidence that the sample
results reflect a true difference due to
treatments when the sample size is large


Since it is always possible that the observed
effects were due to chance alone, replicating
the experiment also builds confidence in our
conclusions


Learning Objective 4:

Blinding the Experiment


Ideally, subjects are unaware, or blind, to the treatment they are receiving


If an experiment is conducted in such a way that neither the subjects nor the investigators working with them know which treatment each subject is receiving, then the experiment is double-blinded


A double-blinded experiment controls response bias from the respondent and experimenter


Learning Objective 5:

Define Statistical Significance


If an experiment (or other study) finds a difference in two (or more) groups, is this difference really important?


If the observed difference is larger than what would be expected just by chance, then it is labeled statistically significant.


Rather than relying solely on the label of statistical significance, also look at the actual results to determine if they are practically significant.

Learning Objective 6:

Generalizing Results


Recall that the goal of experimentation is to analyze the association between the treatment and the response for the population, not just the sample


However, care should be taken to generalize the results of a study only to the population that is represented by the study.

Chapter 4

Gathering Data

Section 4.4

What are Other Ways to Conduct Experimental and
Observational Studies

Learning Objectives

1.
Sample Surveys: Other Random Sampling Designs

2.
Types of Observational Studies: Prospective and
Retrospective

3.
Multifactor Experiment

4.
Matched pairs design

5.
Randomized block design






Learning Objective 1:

Sample Surveys: Random Sampling
Designs


It is not always possible to conduct an
experiment, so it is necessary to have
well designed, informative studies that are
not experimental, e.g., sample surveys
that use randomization


Simple Random Sampling


Cluster Sampling


Stratified Random Sampling


Learning Objective 1:

Sample Surveys: Cluster Random Sample

Steps


Divide the population into a large number of clusters, such as city blocks


Select a simple random sample of the clusters


Use the subjects in those clusters as the sample

Learning Objective 1:

Sample Surveys: Cluster Random
Sample


Preferable when


A reliable sampling frame is unavailable


The cost of selecting a SRS is excessive


Disadvantage


Usually need a larger sample size than with a
SRS in order to achieve a particular margin of
error




Learning Objective 1:

Sample Surveys: Stratified Random Sample

Steps


Divide the population into separate groups, called strata


Select a simple random sample from each stratum


Combine the samples from all strata to form the complete sample

Learning Objective 1:

Sample Surveys: Stratified Random Sample


Advantage is that you can include in your
sample enough subjects in each stratum
you want to evaluate


Disadvantage is that you must have a
sampling frame and know the stratum into
which each subject belongs


Learning Objective 1:

Stratified Random Sample - Example

Suppose a university has the following student demographics:

Undergraduate   Graduate   First Professional   Special
     55%           20%             5%             20%

In order to ensure proper coverage of each demographic, a stratified random sample of 100 students could be chosen as follows: select a SRS of 55 undergraduates, a SRS of 20 graduates, a SRS of 5 first professional students, and a SRS of 20 special students; combine these 100 students.
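
The allocation and the per-stratum draws can be sketched in Python. The percentages come from the slide; the sampling frame of 2,000 made-up student IDs, the function names, and the fixed seed are assumptions for illustration only.

```python
import random

# Stratum percentages from the slide.
strata_pct = {"Undergraduate": 55, "Graduate": 20, "First Professional": 5, "Special": 20}

def allocate(total_sample_size, strata_pct):
    """Proportional allocation; assumes the percentages divide the total evenly."""
    return {name: total_sample_size * pct // 100 for name, pct in strata_pct.items()}

def stratified_sample(frame_by_stratum, allocation, rng):
    """Draw a simple random sample within each stratum, then combine."""
    sample = []
    for name, size in allocation.items():
        sample.extend(rng.sample(frame_by_stratum[name], size))
    return sample

# Hypothetical sampling frame: 2,000 made-up IDs split by the same percentages.
frame_by_stratum = {
    name: [f"{name}-{i}" for i in range(20 * pct)]
    for name, pct in strata_pct.items()
}

allocation = allocate(100, strata_pct)
sample = stratified_sample(frame_by_stratum, allocation, random.Random(0))
print(allocation, len(sample))
```

Note that the sketch needs to know each subject's stratum in advance, which is the disadvantage of stratified sampling mentioned earlier.
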


Learning Objective 1:

Comparing Random Sampling Methods


Learning Objective 2:

Types of Observational Studies

An observational study can yield useful information when an
experiment is not practical.


Types of observational studies:


Sample Survey: attempts to take a cross section of a
population at the current time


Retrospective study: looks into the past


Prospective study: follows its subjects into the future


Causation can never be definitively established with an observational study, but well designed studies can provide supporting evidence for the researcher’s beliefs

Learning Objective 2:

Retrospective Case
-
Control Study


A case-control study is a retrospective observational study in which subjects who have a response outcome of interest (the cases) and subjects who have the other response outcome (the controls) are compared on an explanatory variable

Learning Objective 2:

Example: Case-Control Study


Response outcome of interest: Lung cancer


The cases have lung cancer


The controls did not have lung cancer


The two groups were compared on the explanatory variable smoker/nonsmoker

                    Lung Cancer
Smoker          Cases     Controls
yes               688          650
no                 21           59
Total             709          709
Prob(smoker)      97%          92%

Learning Objective 2:

Example: Prospective Study

Nurses’ Health Study:


Began in 1976 with 121,700 female nurses aged 30 to 55;
questionnaires are filled out every two years


Purpose was to explore the relationships among diet,
hormonal factors, smoking habits and exercise habits and
the risk of coronary heart disease, pulmonary disease and
stroke


Nurses are followed into the future to determine whether
they eventually develop an outcome such as lung cancer
and whether certain explanatory variables are associated
with it


Learning Objective 3:

Multifactor Experiments


A Multifactor experiment uses a single experiment to
analyze the effects of two or more explanatory variables
on the response


Categorical explanatory variables in an experiment are often called factors


We are often able to learn more from a multifactor experiment than from separate one-factor experiments since the response may vary for different factor combinations

Learning Objective 3:

Example: Multifactor experiment

Examine the effectiveness of
both Zyban and nicotine
patches on quitting smoking


Two factor experiment


4 treatments


Learning Objective 3:

Example: Multifactor experiment


subjects: a certain number of undergraduate students


all subjects viewed a 40-minute television program that included ads for a digital camera


some subjects saw a 30-second commercial; others saw a 90-second version


same commercial was shown either 1, 3, or 5 times during the program


there were two factors: length of the commercial (2 values), and number of repetitions (3 values)

Learning Objective 3:

Example: Multifactor experiment


the 6 combinations of one value of each factor form six treatments

                              Factor B: Repetitions
                          1 time    3 times    5 times
Factor A:   30 seconds       1         2          3
Length      90 seconds       4         5          6

subjects assigned to Treatment 3 see a 30-second ad five times during the program


after viewing, all subjects answered questions about: recall of the ad, their attitude toward the commercial, and their intention to purchase the product


these were the response variables.
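
The six treatments are simply all combinations of the two factors' values. A small Python sketch with `itertools.product`; the numbering follows the table's layout (rows are lengths, columns are repetitions), and the variable names are illustrative.

```python
from itertools import product

lengths_seconds = (30, 90)   # Factor A: length of the commercial
repetitions = (1, 3, 5)      # Factor B: number of repetitions

# Number the combinations row by row, as in the treatment table.
treatments = {
    number: combo
    for number, combo in enumerate(product(lengths_seconds, repetitions), start=1)
}

for number, (length, reps) in treatments.items():
    print(f"Treatment {number}: {length}-second ad shown {reps} time(s)")
```
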


Learning Objective 4:

Matched Pairs Design


In a matched pairs design, the subjects receiving the two
treatments are somehow matched (same person,
husband/wife, two plots in the same field, etc.)


In a crossover design, the same individual is used for the two treatments


Randomly


assign the two treatments to the two matched subjects, or


randomize the order of applying the treatments in a crossover design


The number of replicates equals the number of pairs


Helps to reduce effects of lurking variables

Learning Objective 5:

Randomized Block Design


A block is a set of experimental units that are matched with respect to one or more characteristics


A Randomized Block Design, RBD, is when the random assignment of experimental units to treatments is carried out separately within each block

Learning Objective 5:

Example: Randomized Block Design


Block = gender; 3 treatments = 3 types of therapy


The men (as well as the women) are randomly assigned to the 3 treatments; differences can be compared with respect to gender as well as therapy type
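
Randomizing separately within each block can be sketched as follows. Everything specific here is hypothetical: the twelve subject labels, the placeholder therapy names, the function name, and the fixed seed are for illustration and do not come from the slides.

```python
import random
from collections import Counter

def randomized_block_assignment(blocks, treatments, rng):
    """Shuffle the subjects within each block, then deal treatments out in rotation."""
    assignment = {}
    for block_name, subjects in blocks.items():
        shuffled = list(subjects)
        rng.shuffle(shuffled)
        for position, subject in enumerate(shuffled):
            assignment[subject] = (block_name, treatments[position % len(treatments)])
    return assignment

# Hypothetical subjects: 6 men and 6 women.
blocks = {
    "men": [f"M{i}" for i in range(1, 7)],
    "women": [f"W{i}" for i in range(1, 7)],
}
treatments = ["therapy A", "therapy B", "therapy C"]  # placeholder names

assignment = randomized_block_assignment(blocks, treatments, random.Random(1))
group_sizes = Counter(assignment.values())
for (block_name, treatment), size in sorted(group_sizes.items()):
    print(block_name, treatment, size)
```

Because the shuffle happens inside each block, every therapy ends up with the same number of men and the same number of women, so gender cannot be mixed up with the treatment effect.
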


Learning Objective 5:

Randomized Block Design



RBD eliminates variability in the response
due to the blocking variable; allows for
better comparisons to be made among the
treatments of interest


A matched pairs design is a special case
of a RBD with two observations in each
block


Chapter 5

Probability in our Daily Lives

Section 5