Transcript Slide 1

Turning data into knowledge to
solve real world problems
Christopher R. Bilder, Ph.D.
Department of Statistics
University of Nebraska-Lincoln
www.chrisbilder.com
1 of 31
15 years ago…
▪ The year is 1990
▪ Music – U2
▪ George Bush is president
▪ TV – The Simpsons
▪ Millard South
– Senior year
– Big hair
– In the middle of winning state basketball titles in 3 out of 4 years (1988, 1989, 1991)
▪ What am I going to major in at college?
– Calculus I
– No AP Statistics!
2 of 31
15 years ago…
▪ UNO (1990 – 1994)
– Math undergraduate major – What can you do with a degree?
– Planned to be an actuary
– Hypothesis testing in a statistics course (junior year)
• Use for decision making!
• Scientifically prove a hypothesis or statement
▪ Kansas State University for graduate school (1994 – 2000)
– Statistics graduate major in Department of Statistics
– Master of Science (MS) and Doctor of Philosophy (PhD)
▪ Oklahoma State University faculty (2000 – 2003)
– Department of Statistics
▪ UNL faculty (2003 – now)
– NEW Department of Statistics
3 of 31
Purpose
▪ Tell you a little about the statistical science
▪ Turning data into knowledge to solve real world problems
– 3 actual examples
▪ AP statistics exam
▪ Website (www.chrisbilder.com/statistics) for more information
4 of 31
Grocery store prices
▪ Undergraduate teaching example for a course like AP Statistics
▪ How could you determine which grocery store, Super Wal-Mart or Baker’s, has lower average prices?
– Paired (dependent) two-sample hypothesis test for the mean price difference, Wal-Mart - Baker’s
– Sample the same items at each store
5 of 31
Grocery store prices
▪ Undergraduate teaching example for a course like AP Statistics
▪ How could you determine which grocery store, Dillon’s or Food-4-Less in Manhattan, KS, has lower average prices?
– Paired (dependent) two-sample hypothesis test for the mean price difference, Dillon’s - Food-4-Less
– Sample the same items at each store
▪ Only cereals from Fall 1998
6 of 31
Grocery store prices
▪ Sample:
No.  Item                                            Dillon's  Food-4-Less  Difference
1    Malt-o-meal - Tootie Fruities, 15oz             $1.99     $1.84         $0.15
2    Malt-o-meal - Golden Puffs, 18oz                $1.99     $1.84         $0.15
3    Quaker Oats - Life Cereal: Original, 21oz       $3.69     $3.49         $0.20
4    Cheerios, 20oz                                  $4.59     $4.24         $0.35
5    Cheerios, 15oz                                  $3.79     $3.50         $0.29
6    Wheaties, 18oz                                  $3.89     $3.60         $0.29
7    Kellogg’s Funpack, 8 9/16oz                     $2.89     $2.67         $0.22
8    Kellogg’s Variety Pack, 9 5/8oz                 $3.49     $3.14         $0.35
9    Kellogg’s Frosted Mini-Wheats Bite Size, 19oz   $3.49     $2.50         $0.99
10   Kellogg’s Frosted Mini-Wheats, 16oz             $2.50     $2.73        -$0.23
11   Kellogg’s Frosted Flakes, 15oz                  $3.19     $2.92         $0.27
12   Our Family Frosted Flakes, 20oz                 $2.50     $1.90         $0.60
13   Kellogg’s Crispix, 12oz                         $3.49     $3.20         $0.29
14   Our Family - Raisin Bran, 20oz                  $2.50     $1.92         $0.58
15   Kellogg’s Smart Start, 13.3oz                   $3.49     $3.24         $0.25
16   Grape Nuts, 24oz                                $3.00     $2.85         $0.15
17   Frosted Alpha Bits, 15oz                        $3.00     $2.87         $0.13
7 of 31
Grocery store prices
▪ Do you think there are mean differences?
[Figure: dot plot of the price differences, Dillon's - Food-4-Less (-$0.2 to $1.0), labeled by item number, with a box plot marking the 25%, 50%, and 75% quantiles; shown next to the price table from the previous slide]
8 of 31
Grocery store prices
▪ Paired two-sample hypothesis test
– $H_0: \mu_{\text{Dillon's}} - \mu_{\text{Food-4-Less}} = 0$
  $H_a: \mu_{\text{Dillon's}} - \mu_{\text{Food-4-Less}} \neq 0$
– t = 4.77, p-value = 0.0002,
  95% C.I.: $0.1644 < \mu_{\text{Dillon's}} - \mu_{\text{Food-4-Less}} < 0.4274$
– Reject equal mean prices
▪ If price were the only consideration, which store should one shop at?
▪ Assumptions
– Normal populations
– The sample was taken in 1998; what about now?
– Finite populations
9 of 31
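The test statistic and confidence interval on the previous slide can be reproduced from the 17 cereal prices. The following is a minimal sketch using only the Python standard library; the variable names are illustrative, and the critical value 2.1199 is the 0.975 quantile of a t distribution with 16 degrees of freedom taken from a t table rather than computed.

```python
import math
import statistics

# Cereal prices from the sample slide (Fall 1998), in item order 1..17
dillons = [1.99, 1.99, 3.69, 4.59, 3.79, 3.89, 2.89, 3.49, 3.49,
           2.50, 3.19, 2.50, 3.49, 2.50, 3.49, 3.00, 3.00]
food4less = [1.84, 1.84, 3.49, 4.24, 3.50, 3.60, 2.67, 3.14, 2.50,
             2.73, 2.92, 1.90, 3.20, 1.92, 3.24, 2.85, 2.87]

# Paired design: analyze the within-item differences, Dillon's - Food-4-Less
diffs = [d - f for d, f in zip(dillons, food4less)]
n = len(diffs)

mean_diff = statistics.mean(diffs)
std_err = statistics.stdev(diffs) / math.sqrt(n)  # sample sd / sqrt(n)

# t statistic for H0: mean difference = 0
t_stat = mean_diff / std_err

# Assumption: critical value read from a t table (0.975 quantile, df = n - 1 = 16)
T_CRIT = 2.1199
ci_low = mean_diff - T_CRIT * std_err
ci_high = mean_diff + T_CRIT * std_err

print(f"t = {t_stat:.2f}")
print(f"95% C.I.: {ci_low:.4f} < mean difference < {ci_high:.4f}")
```

The p-value is left out because the standard library has no t distribution; a statistics package would be needed for that step.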
Placekicking
▪ The use of the statistical science in sports
▪ Find a model to estimate the probability of success for placekicks (field goals, PATs) in the NFL
▪ Video
– January 7, 1996
– Playoff game
– Indianapolis Colts 10, Kansas City Chiefs 7
– Lin Elliott of KC will attempt a 42-yard field goal to tie the game and send it into overtime
– Field goal video
10 of 31
Placekicking
▪ What factors affect the probability of success for NFL placekicks?
– Distance
– Pressure – How do you quantitatively measure?
– Wind
– Grass vs. artificial turf
– Dome vs. outdoor stadium
▪ Collect sample of >1,700 placekicks during the 1995 NFL season
▪ Find the best logistic regression model of the form
\[ p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}} \]
where p is the probability of success
$x_i$ for i=1,…,k are independent variables
$\beta_i$ measures the effect of $x_i$ on p for i=1,…,k
$e \approx 2.718$; ln(e) = 1
11 of 31
Placekicking
▪ The $\beta_i$’s are parameters which are estimated using “iteratively reweighted least squares”
▪ Estimated model
\[ \hat{p} = \frac{e^{4.4984 - 0.3306\,\text{change} - 0.0807\,\text{distance} + 1.2592\,\text{PAT} + 2.8778\,\text{wind} - 0.0907\,\text{distance}\times\text{wind}}}{1 + e^{4.4984 - 0.3306\,\text{change} - 0.0807\,\text{distance} + 1.2592\,\text{PAT} + 2.8778\,\text{wind} - 0.0907\,\text{distance}\times\text{wind}}} \]
– Change: lead change = 1, non-lead change = 0
– Distance: distance in yards
– PAT: point after touchdown = 1, field goal = 0
– Wind: windy (speed > 15 MPH) = 1, non-windy = 0
▪ What is the estimated probability of success for Elliott’s field goal?
– Conditions: Change = 1, Distance = 42, PAT = 0, Wind = 0
– Estimated probability of success: $\hat{p} = 0.6850$
– 90% confidence interval for probability of success: 0.6298 < p < 0.7402
12 of 31
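As a rough check of the estimate for Elliott's kick, the fitted model can be evaluated directly. This is a minimal sketch, not the author's code: the function and dictionary names are made up for illustration, and the signs on the PAT, wind, and distance-by-wind terms are an assumption, since the transcript does not preserve them (they do not affect this example because PAT = 0 and Wind = 0).

```python
import math

# Coefficients as they appear on the slide.
# Assumption: signs of the "pat", "wind", and "distance_wind" terms are not
# legible in the transcript; the values below are a plausible reading only.
COEF = {
    "intercept": 4.4984,
    "change": -0.3306,
    "distance": -0.0807,
    "pat": 1.2592,
    "wind": 2.8778,
    "distance_wind": -0.0907,
}


def estimated_success_probability(change, distance, pat, wind):
    """Logistic model: p = exp(eta) / (1 + exp(eta))."""
    eta = (COEF["intercept"]
           + COEF["change"] * change
           + COEF["distance"] * distance
           + COEF["pat"] * pat
           + COEF["wind"] * wind
           + COEF["distance_wind"] * distance * wind)
    return math.exp(eta) / (1.0 + math.exp(eta))


# Elliott's attempt: lead-change kick, 42 yards, field goal, not windy
elliott = estimated_success_probability(change=1, distance=42, pat=0, wind=0)
print(round(elliott, 3))  # 0.685
```

The small gap between this value and the slide's 0.6850 comes from using coefficients rounded to four decimals.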
Estimated probability of success for a field goal (PAT=0)
[Figure: estimated probability of success (0.0 to 1.0) versus distance in yards (20 to 60), one curve for each combination of Change (0 or 1) and Wind (0 or 1); the estimate 0.685 at 42 yards is marked]
13 of 31
HCV prevalence
▪ Hepatitis C (HCV)
– Viral infection that can cause cirrhosis and cancer of the liver
▪ Questions:
– How can people be tested in a cost-effective and timely manner?
• Blood bank setting
– What is the probability a person has HCV?
• What proportion of people in a population is infected with HCV?
• Prevalence in a population
▪ Individual testing
– Each blood sample is tested individually
– Problems:
• Costly
• Time
[Figure: individual blood samples, each labeled "+ or -"]
14 of 31
HCV prevalence
▪ Group testing
– Pool the blood samples together to form n groups of size s
[Figure: pooled samples for Group 1, Group 2, …, Group n, each labeled "+ or -"]
– If the GROUP sample is negative, then none of the s people has the disease
– If the GROUP sample is positive, then at least ONE of the s people has the disease
• May want to determine who in the group has the disease
– Strategy works well when prevalence of a disease is small
15 of 31
HCV prevalence
▪ Notation
– p = probability an INDIVIDUAL is HCV positive (prevalence)
– $\theta$ = probability a GROUP is HCV positive
– s = group size
– n = number of groups
– T = random variable denoting the number of positive GROUPS
• T has a binomial distribution with “n trials” and “$\theta$ as the probability of success”
\[ f(t) = \binom{n}{t} \theta^{t} (1 - \theta)^{n - t} \quad \text{for } t = 0, 1, 2, \ldots, n \]
16 of 31
HCV prevalence
▪ How can we estimate p?
– We observe information about the groups, not individuals!
– Estimate $\theta$ with $\hat{\theta} = T/n$ = # of positive groups / # of groups
– $\theta$ = P(group is positive)
= P(at least one individual is positive)
= 1 – P(no individuals are positive)   (using complement rule)
= 1 – P(all individuals are negative)
= $1 - (1 - p)^s$   (since p = P(individual is positive) and there are s independent individuals per group)
– $p = 1 - (1 - \theta)^{1/s}$
– Then $\hat{p} = 1 - (1 - \hat{\theta})^{1/s} = 1 - (1 - T/n)^{1/s}$
17 of 31
HCV prevalence
▪ Estimation of HCV prevalence in Xuzhou City, China
– Data from Liu et al. (Transfusion, 1997)
– 1,875 blood donors screened for HCV
• There were 42 positives
– In order to test the usefulness of group testing, blood samples were also pooled
• n = 375 groups
• s = 5 individuals per group
• t = 37 positive groups
– Estimates of p, probability individual is positive
• Using individual data: 42/1875 = 0.0224
• Using group data: $\hat{p} = 0.0206$
– Which is easier and more cost effective?
• 1875 tests using individual testing
• 375 tests using group testing
18 of 31
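The two prevalence estimates for the Xuzhou City data follow directly from the group-testing formula on the earlier slide. The sketch below is illustrative only; the function name is made up, and it assumes every group has the same size s.

```python
def prevalence_from_groups(positive_groups, n_groups, group_size):
    """Group-testing estimate: p_hat = 1 - (1 - T/n)^(1/s)."""
    theta_hat = positive_groups / n_groups  # estimated P(group is positive)
    return 1.0 - (1.0 - theta_hat) ** (1.0 / group_size)


# Pooled data: n = 375 groups of s = 5, with t = 37 positive groups
p_hat_group = prevalence_from_groups(positive_groups=37, n_groups=375, group_size=5)

# Individual data: 42 positives among 1,875 donors
p_hat_individual = 42 / 1875

print(round(p_hat_group, 4))       # 0.0206
print(round(p_hat_individual, 4))  # 0.0224
```

The group estimate uses 375 tests instead of 1,875 and lands close to the individual-testing estimate.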
HCV prevalence
▪ New research – MS/PhD research
– What factors could affect p?
– Include independent variables to help model p
\[ p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}} \]
– Problem: Do not have the individual outcomes
– After a group tests positive, how can you find which individuals have the disease?
• Use model to help decide whom to retest if a group tests positive
– Multiple diseases
• HCV
• HIV
• Other diseases
• Simultaneously model
19 of 31
HCV prevalence
▪ Multiple vector transfer designs
– Swallow (Phytopathology, 1985)
– Want to estimate the probability an insect vector transfers a pathogen (virus, bacteria, etc.) to a plant
[Figures: brown planthopper; whitebacked planthopper]
20 of 31
HCV prevalence
▪ Multiple vector transfer designs (continued)
y = 0 if plant is negative, 1 if plant is positive
[Figure: greenhouse with enclosed test plants, each labeled y=0 or y=1; a planthopper on each plant either transmits or does not transmit the virus]
T = number of plants with disease
21 of 31
Why statistics?
▪ Statistics is used in many diverse areas!
– Statistics is the “science of science”
– Florence Nightingale quote:
“the most important science in the whole world: for upon it depends the practical application of every other science and of every art: the one science essential to all political and social administration, all education, all organization based on experience, for it only gives results of our experience.”
▪ Take statistics courses in college!
– Of course, I want you to consider coming to UNL!
– Statistics is mainly a graduate discipline, so there is no undergraduate major at UNL
– Undergraduate minor in statistics can be useful for many majors
– Most statisticians have an undergraduate degree (Bachelor of Science) in math
22 of 31
Why statistics?
▪ Where do statisticians work?
– Pharmaceutical and medical research – Pfizer, Merck, medical centers
– Marketing – Target, Hallmark
– Government research labs – INEEL, Los Alamos, Sandia, Argonne
– Agriculture – Pioneer Hi-Bred
– Consulting firms – Quintiles
– In Nebraska – ConAgra, Gallup, First National Bank, MDS Pharma, Experian, UNMC and Creighton medical centers, various universities, Pfizer, Acton International, Nebraska state agencies, Union Pacific
▪ Everyone that I have known has had a job offer before they graduated!
▪ How many statisticians are there?
– 20,000
23 of 31
Why statistics?
▪ Salaries
– Non-academic starting (2003 American Statistical Association survey)
                      Percentile
Degree  Sample size   25th    50th   75th
MS      102           45.5K   50K    59K
PhD     99            60K     65K    75K
Survey response rate was 23.5%; see salary surveys at the American Statistical Association’s website
Background needed
▪ Strong in mathematics and using computers
– Majority of statisticians have Bachelor’s degrees in mathematics
• Good with calculus
• Applied math courses
• Take at least one statistics course
• Comfortable with using software packages
– To actually be a “statistician”, usually need to go to graduate school to get an MS or PhD in statistics
• Financial support
• Graduate Teaching Assistantship
24 of 31
Why statistics?
▪ What courses to take next in college?
– AP statistics is equivalent to a one-semester introductory statistics course without calculus
• UNL: STAT 218 (Introduction to Statistics)
• UNO: MATH/STAT 3000 (Statistical Methods I); Business Administration 2130 (Principles of Business Statistics)
– Theory – 2 semester sequence using calculus I-III
• UNL: STAT 462 (Distribution Theory) and STAT 463 (Statistical Inference)
• UNO: MATH 4740 and 4750 (Intro. to Probability and Statistics I and II)
– Applications
• UNL: STAT 450 (Introduction to Regression Analysis) or STAT 412 (Introduction to Experimental Design)
• UNO: MATH/STAT 3010 (Statistical Methods II); Business Administration 3140 (Business Statistical Applications)
25 of 31
Why statistics?
▪ Other recommended UNL classes (undergraduate)
– MATH 340 Numerical Analysis
– MATH 314 Applied Linear Algebra
– MATH 325 Elementary Analysis and MATH 425 Mathematical Analysis
• Helpful if you go on for a PhD
– Computer science programming courses
▪ Other recommended UNO classes (undergraduate)
– MATH 3300 Numerical Methods
– MATH 4050 Linear Algebra
– MATH 4760 Topics in Modeling
– MATH 4230 and 4240 Mathematical Analysis I and II
• Helpful if you go on for a PhD
– Computer science programming courses
26 of 31
AP Statistics
▪ Grading done in Lincoln!
– State fairgrounds
– Grade the free response section of about 66,000 student exams (2004)
– 250 AP statistics high school teachers and college professors
– June 13 to June 19, 2005
– 8:30AM – 4:45PM EVERY DAY
27 of 31
AP Statistics
▪ I graded in 2002
– About 900 problems graded!
– 16 graders in a room split into two groups
– Each group has a leader
• Answer questions
• CHECKS some of your grading!
– Paid $1,450
• Stay in dorms
• Free meals and snacks
▪ Grading is not fun
– Evening activities
– Discussions on how to teach introductory statistics better
▪ The grading rubric
– An outline of how to grade a problem that must be followed!
– These are put together before graders arrive by examining a sample set of tests
28 of 31
AP Statistics
▪ Question #6 in 2002
– 4 parts – (a), (b), (c), (d)
– Each part is graded as
• E = Essentially correct
• P = Partially correct
• I = Incomplete
– Graders are given a “conversion” table to show how to convert the scores into a numerical score
• 4 = Complete response
• 3 = Substantial response
• 2 = Developing response
• 1 = Minimal response
• 0 = No credit
– 1 point given to an E, 0.5 points given to a P, 0 points given to an I
• Round up if (a) or (c) has the correct interpretation
– Example given at end of PowerPoint file
29 of 31
For more information…
▪ E-mail me at [email protected]
▪ Website: www.chrisbilder.com/statistics
– This PowerPoint presentation (including example question)
– Links to
• Introductory information about being a statistician
• Jobs (including internships)
• Salary information
• List of all Departments of Statistics
• Professional societies
• Websites for courses that I and others teach
• Newspaper and magazine articles about statistical applications
30 of 31
Turning data into knowledge to
solve real world problems
Christopher R. Bilder, Ph.D.
Department of Statistics
University of Nebraska-Lincoln
www.chrisbilder.com
31 of 31
Statistics at UNL
[Figure: campus map showing the Department of Statistics near 33rd St.]
32 of 31
AP Statistics
33 of 31
AP Statistics
34 of 31
AP Statistics
May actually
be an E?
35 of 31
AP Statistics
36 of 31
AP Statistics
37 of 31
Estimated probability of success for a field goal (PAT=0)
[Figure: estimated probability of success (0.0 to 1.0) versus distance in yards (20 to 60), showing the estimated probability and 90% confidence interval for the lowest and the highest number of risk factors]
38 of 31