Transcript Slide 1

Turning data into knowledge to
solve real world problems
Christopher R. Bilder, Ph.D.
Department of Statistics
University of Nebraska-Lincoln
www.chrisbilder.com
11 years ago…
 The year is 1993
 Pearl Jam records its second CD, Vs.
– Daughter, Go, Elderly Woman Behind the Counter in a Small Town
– Almost 1 million CDs are sold in the first week
 Bill Clinton was inaugurated as the 42nd president
 Movies
– Jurassic Park
– Schindler's List
– Sleepless in Seattle
 Husker football
– Began 1993 by losing the Orange Bowl badly (again)
– 1993 season went undefeated
 Math 4750 - Introduction to Probability and Statistics II
– 4-5:15PM Tuesdays and Thursdays in DC 164
– Dr. Stephens
11 years ago…
 Actuarial Science!
– Planned to be an actuary when I started college
– Internship at National Indemnity Company at 32nd and Harney
– Passed 4 exams under old system
 Wanted to go on to graduate school
– Math?
– Actuarial Science?
 Hypothesis testing in Math 4750
– Use for decision making!
– Scientifically prove a hypothesis or statement
– Go to graduate school for statistics!
 1994: received BS in Mathematics with a pre-actuarial science minor from UNO
After UNO
 Went on to graduate school for statistics
– MS 1996 from Kansas State University
– PhD 2000 from Kansas State University
– Internships at INEEL in Idaho and a pharmaceutical company in Kansas City
– Consulted with students and professors in
• Institute of Social and Behavioral Research
• College of Agriculture
– Taught courses like Statistical Methods I and II (STAT 3000 and 3010)
 Assistant Professor at Oklahoma State University
– Department of Statistics
– 2000-2003
 Assistant Professor at UNL
– NEW Department of Statistics
– 2003-present
Purpose
 Tell you a little about statistics
– Statistics is mainly a graduate discipline
– Most statisticians have undergraduate degrees in math
 Turning data into knowledge to solve real world problems
– 3 actual examples that come from my teaching and research
 About statistics at UNL
 Website (www.chrisbilder.com/statistics) for more information
Grocery store prices
 Undergraduate teaching example for a course like STAT 3000
 How could you determine which grocery store, Super Wal-Mart or Albertson’s, has lower average prices?
– Paired or dependent two sample hypothesis test for Wal-Mart - Albertson’s
– Sample the same items at each store
Grocery store prices
 Undergraduate teaching example for a course like STAT 3000
 How could you determine which grocery store, Dillon’s or
Food-4-Less in Manhattan, KS, has lower average prices?
– Paired or dependent two sample hypothesis test for Dillon’s - Food-4-Less
– Sample the same items at each store
 Only cereals from Fall 1998
– Possible problems described later
Grocery store prices
 Sample:
    Item                                            Dillon's   Food-4-Less   Difference
 1  Malt-o-meal - Tootie Fruities, 15oz             $1.99      $1.84          $0.15
 2  Malt-o-meal - Golden Puffs, 18oz                $1.99      $1.84          $0.15
 3  Quaker Oats - Life Cereal: Original, 21oz       $3.69      $3.49          $0.20
 4  Cheerios, 20oz                                  $4.59      $4.24          $0.35
 5  Cheerios, 15oz                                  $3.79      $3.50          $0.29
 6  Wheaties, 18oz                                  $3.89      $3.60          $0.29
 7  Kellogg’s Funpack, 8 9/16oz                     $2.89      $2.67          $0.22
 8  Kellogg’s Variety Pack, 9 5/8oz                 $3.49      $3.14          $0.35
 9  Kellogg’s Frosted Mini-Wheats Bite Size, 19oz   $3.49      $2.50          $0.99
10  Kellogg’s Frosted Mini-Wheats, 16oz             $2.50      $2.73         -$0.23
11  Kellogg’s Frosted Flakes, 15oz                  $3.19      $2.92          $0.27
12  Our Family Frosted Flakes, 20oz                 $2.50      $1.90          $0.60
13  Kellogg’s Crispix, 12oz                         $3.49      $3.20          $0.29
14  Our Family - Raisin Bran, 20oz                  $2.50      $1.92          $0.58
15  Kellogg’s Smart Start, 13.3oz                   $3.49      $3.24          $0.25
16  Grape Nuts, 24oz                                $3.00      $2.85          $0.15
17  Frosted Alpha Bits, 15oz                        $3.00      $2.87          $0.13
Grocery store prices
 Do you think there are mean differences?
[Figure: plot of the 17 price differences, Dillon's - Food 4 Less (vertical axis from -$0.2 to $1.0), with each point labeled by item number, next to a box plot marking the 25%, 50%, and 75% quartiles. The sample table from the previous slide is repeated alongside.]
Grocery store prices
 Paired two sample hypothesis test
– $H_0: \mu_{\text{Dillon's}} - \mu_{\text{Food-4-Less}} = 0$
$H_a: \mu_{\text{Dillon's}} - \mu_{\text{Food-4-Less}} \neq 0$
– t = 4.77, p-value = 0.0002,
95% C.I.: $0.1644 < \mu_{\text{Dillon's}} - \mu_{\text{Food-4-Less}} < 0.4274$
– Reject equal mean prices
 If price was the only consideration, what store should one
shop at?
 Assumptions
– Prices and selection at these two stores are indicative of all stores
– Normal populations
– The sample was taken in 1998; what about now?
– Finite populations
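The paired test on this slide can be checked directly from the 17 cereal prices. The following is a minimal sketch using only the Python standard library; the function name `paired_t` and the hard-coded critical value 2.1199 (the 0.975 quantile of a t distribution with 16 degrees of freedom) are assumptions made for illustration, not part of the original talk.

```python
import math

# Cereal prices from the sample table (Fall 1998), items 1 through 17.
dillons = [1.99, 1.99, 3.69, 4.59, 3.79, 3.89, 2.89, 3.49, 3.49,
           2.50, 3.19, 2.50, 3.49, 2.50, 3.49, 3.00, 3.00]
food4less = [1.84, 1.84, 3.49, 4.24, 3.50, 3.60, 2.67, 3.14, 2.50,
             2.73, 2.92, 1.90, 3.20, 1.92, 3.24, 2.85, 2.87]


def paired_t(x, y, t_crit):
    """Return (mean difference, t statistic, CI lower, CI upper) for paired data."""
    d = [a - b for a, b in zip(x, y)]            # per-item differences
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((v - mean_d) ** 2 for v in d) / (n - 1)   # sample variance
    se = math.sqrt(var_d / n)                    # standard error of the mean difference
    return mean_d, mean_d / se, mean_d - t_crit * se, mean_d + t_crit * se


# Assumption: 0.975 quantile of the t distribution with 17 - 1 = 16 df.
T_CRIT_16DF = 2.1199

mean_d, t_stat, lower, upper = paired_t(dillons, food4less, T_CRIT_16DF)
print(f"mean difference = {mean_d:.4f}, t = {t_stat:.2f}")
print(f"95% C.I.: {lower:.4f} to {upper:.4f}")
```

The t statistic and interval should agree with the values quoted on the slide up to rounding. The p-value is left out because it needs the t distribution function, which the standard library does not provide.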
Placekicking
 MS report – applying statistics or investigating new
methodology
– 120 page book!
– Reduced version published in Chance in 1998
 Find a model to estimate the probability of success
for placekicks in the NFL
 Video
– January 7, 1996
– Playoff game
– Indianapolis Colts 10
Kansas City Chiefs 7
– Lin Elliott of KC will attempt
a 42 yard field goal to tie the
game and send it into
overtime
Placekicking
 What factors affect the probability of success for NFL
placekicks?
– Distance
– Pressure – How do you quantitatively measure?
– Wind
– Grass vs. artificial turf
– Dome vs. outdoor stadium
 Collected data on more than 1,700 placekicks during the 1995 NFL season
 Find the best logistic regression model of the form
$$p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}}$$
where p is the probability of success,
$x_i$ for i=1,…,k are independent variables, and
$\beta_i$ measures the effect of $x_i$ on p for i=1,…,k
Placekicking
 The $\beta_i$’s are parameters which are estimated through maximum likelihood estimation
 Estimated model
$$\hat{p} = \frac{e^{4.4984 - 0.3306\,\text{change} - 0.0807\,\text{distance} + 1.2592\,\text{PAT} + 2.8778\,\text{wind} - 0.0907\,\text{distance}\times\text{wind}}}{1 + e^{4.4984 - 0.3306\,\text{change} - 0.0807\,\text{distance} + 1.2592\,\text{PAT} + 2.8778\,\text{wind} - 0.0907\,\text{distance}\times\text{wind}}}$$
– Change: lead change = 1, non-lead change = 0
– Distance: distance in yards
– PAT: point after touchdown = 1, field goal = 0
– Wind: windy (speed > 15 MPH) = 1, non-windy = 0
 What is the estimated probability of success for Elliott’s
field goal?
– Conditions: Change = 1, Distance = 42, PAT = 0, Wind = 0
– Estimated probability of success: $\hat{p} = 0.6850$
– 90% confidence interval for probability of success:
0.6298 < p < 0.7402
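As a rough check of the point estimate for Elliott's kick, the sketch below plugs the stated conditions into the estimated logistic model. The signs of the coefficients are an assumption: they are not legible in the extracted slide text. Negative signs on change and distance reproduce the reported 0.685. The signs on PAT, wind, and the interaction do not affect this particular kick and are assumed. The dictionary `COEF` and the function `prob_success` are hypothetical names.

```python
import math

# Estimated coefficients; signs are assumed (see the note above).
COEF = {
    "intercept": 4.4984,
    "change": -0.3306,
    "distance": -0.0807,
    "pat": 1.2592,
    "wind": 2.8778,
    "distance_wind": -0.0907,
}


def prob_success(change, distance, pat, wind):
    """Estimated probability of a successful placekick under the logistic model."""
    eta = (COEF["intercept"]
           + COEF["change"] * change
           + COEF["distance"] * distance
           + COEF["pat"] * pat
           + COEF["wind"] * wind
           + COEF["distance_wind"] * distance * wind)
    # e^eta / (1 + e^eta), written in the numerically equivalent form
    return 1.0 / (1.0 + math.exp(-eta))


# Elliott's attempt: lead-change kick, 42 yards, field goal, not windy.
p_elliott = prob_success(change=1, distance=42, pat=0, wind=0)
print(round(p_elliott, 3))  # prints 0.685
```

The confidence interval is not reproduced here because it depends on the estimated covariance matrix of the coefficients, which the slides do not give.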
Estimated probability of success for a field goal (PAT=0)
[Figure: estimated probability of success (0.0 to 1.0) versus distance in yards (20 to 60), with one curve for each combination of Change=0 or 1 and Wind=0 or 1. The point at 42 yards on the Change=1, Wind=0 curve is marked at 0.685.]
Estimated probability of success for a field goal (PAT=0)
[Figure: estimated probability of success (0.0 to 1.0) versus distance in yards (20 to 60), showing the estimated probability and 90% confidence interval for the lowest number of risk factors and for the highest number of risk factors.]
Placekicking
 UNL Department of Statistics developing statistics in sports
specialty
– Dr. David Marx
• Works with the UNL athletic department
• January 10, 2004 Omaha World-Herald article about his work with the men’s basketball team (available at www.chrisbilder.com/statistics)
• His students this semester have worked with NASCAR, Lincoln SE
women’s high school soccer team, and Tendu, Inc. (baseball
software company).
– Myself
• Placekicking
• Modeling 64-team NCAA tournaments
HCV prevalence
 MS/PhD research – forwarding statistical theory and
methodology
 Hepatitis C (HCV)
– Viral infection that causes cirrhosis and cancer of the liver
– Since HCV is transmitted through contact with infectious blood,
screening blood donors is important to prevent further transmission
 Questions:
– How can blood be screened in a cost effective and timely manner?
– What proportion of people in a population is infected with HCV?
 Individual testing
– Each blood sample is tested individually
– Problems:
• Costly
• Time
[Figure: individual blood samples, each testing + or -]
HCV prevalence
 Group testing
– Pool the blood samples together to form n groups of size s
[Figure: pooled samples labeled Group 1, Group 2, …, Group n, each testing + or -]
– If the GROUP sample is negative, then all s people do not have the
disease
– If the GROUP sample is positive, then at least ONE of the s people has the disease
• May want to determine who in the group has the disease
– Strategy works well when prevalence of a disease is small
– Dorfman (1943) – first used to test members of the military for disease
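To illustrate why the strategy pays off only when prevalence is small, the sketch below computes the expected number of tests per person under two-stage retesting: one test per group, followed by s individual tests if the group is positive. It assumes a perfectly accurate test and independent individuals. The function name is hypothetical, and the prevalence values are chosen only for illustration.

```python
def expected_tests_per_person(p, s):
    """Expected tests per individual with group size s and individual prevalence p.

    One pooled test is shared by s people (1/s each); if the pool is positive,
    each of the s people is retested (s tests shared by s people = 1 each).
    """
    prob_group_positive = 1.0 - (1.0 - p) ** s
    return 1.0 / s + prob_group_positive


# Illustrative prevalence values (not from the talk), group size 5.
for p in (0.01, 0.02, 0.10, 0.30):
    print(p, round(expected_tests_per_person(p, s=5), 3))
```

A value below 1 means fewer tests than individual testing. At a prevalence of 0.02 the expected cost is about 0.3 tests per person, while at 0.30 it exceeds 1 and pooling no longer helps.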
HCV prevalence
 Notation
– Let p = probability an INDIVIDUAL is HCV positive
– Let $\theta$ = probability a GROUP is HCV positive
– Let s = group size
– Let n = number of groups
– Let T be a random variable denoting the number of positive GROUPS
• T has a binomial distribution with “n trials” and “$\theta$ as the probability of success”
• $f(t) = \binom{n}{t}\theta^{t}(1-\theta)^{n-t}$ for t=0,1,2,...,n
– Let Y be an UNOBSERVABLE random variable denoting the number of positive INDIVIDUALS in a group
• Y has a binomial distribution with “s trials” and “p as the probability of success”
• $g(y) = \binom{s}{y}p^{y}(1-p)^{s-y}$ for y=0,1,2,...,s
HCV prevalence
 How can we estimate p?
– We observe information about the groups, not individuals!
– Maximum likelihood estimate of $\theta$ is $\hat{\theta} = T/n$ = # positive groups / # of groups
– $\theta$ = P(group is positive)
= P(at least one individual is positive)
= 1 – P(no individuals are positive), using the complement rule
= $1 - (1-p)^{s}$, since p = P(individual is positive) and there are s individuals per group
– Solve for p: $p = 1 - (1-\theta)^{1/s}$
– Use invariance property of maximum likelihood estimates to find
$\hat{p}_{MLE} = 1 - (1-\hat{\theta})^{1/s} = 1 - (1 - T/n)^{1/s}$
• For fixed sample size, $\hat{p}$ is positively biased
• It is unbiased as $n \to \infty$
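The positive bias of the maximum likelihood estimator can be computed exactly by averaging the estimate over the binomial distribution of T. The sketch below does this with the standard library only. The name `theta` for the group probability, the function name `mle_bias`, and the example values p = 0.05, s = 10, n = 30 and 300 are assumptions chosen for illustration.

```python
import math


def mle_bias(p, n, s):
    """Exact bias E(p_hat) - p of p_hat = 1 - (1 - T/n)**(1/s), T ~ Binomial(n, theta)."""
    theta = 1.0 - (1.0 - p) ** s          # probability a group is positive
    expected = 0.0
    for t in range(n + 1):
        # Binomial probability of observing t positive groups
        prob_t = math.comb(n, t) * theta ** t * (1.0 - theta) ** (n - t)
        # Estimate of p that would be reported for this t
        p_hat = 1.0 - (1.0 - t / n) ** (1.0 / s)
        expected += prob_t * p_hat
    return expected - p


bias_small_n = mle_bias(p=0.05, n=30, s=10)
bias_large_n = mle_bias(p=0.05, n=300, s=10)
print(bias_small_n, bias_large_n)
```

Both values should come out positive, with the bias for n = 300 roughly a tenth of that for n = 30, in line with the two bullet points above.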
HCV prevalence
 Can we find a better estimator?
– Yes
 How do we measure “better”?
– Let $\hat{\theta}$ be an estimator of $\theta$
– Bias = $E(\hat{\theta}) - \theta$
• Which would you prefer for a bias: small or large?
• Given two estimators, which would you prefer?
– Estimator with smaller bias
– Estimator with larger bias
– Can compare to competing estimators through the “relative bias”
• Let $\hat{\theta}_1$ and $\hat{\theta}_2$ be two estimators of $\theta$
• RB = $\mathrm{Bias}(\hat{\theta}_1) / \mathrm{Bias}(\hat{\theta}_2)$
• If RB > 1, then $\mathrm{Bias}(\hat{\theta}_1) > \mathrm{Bias}(\hat{\theta}_2)$ and $\hat{\theta}_2$ would be “better”
HCV prevalence
 New estimators
– Proposed in Tebbs, Bilder, and Moser (Communications in Statistics,
2003) and Bilder and Tebbs (under review in Biometrical Journal)
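– Derived through “empirical Bayesian methods”
– $\hat{p}_{EB1} = 1 - \dfrac{\Gamma(n + \hat{\beta}/s + 1)\,\Gamma(n - t + \hat{\beta}/s + 1/s)}{\Gamma(n - t + \hat{\beta}/s)\,\Gamma(n + \hat{\beta}/s + 1 + 1/s)}$
where $\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} e^{-x}\,dx$
and $\hat{\beta}$ is found from maximizing $f_T(t \mid \beta) = \dfrac{\beta\,\Gamma(n + 1)\,\Gamma(n - t + \beta/s)}{s\,\Gamma(n - t + 1)\,\Gamma(n + \beta/s + 1)}$
– $\hat{p}_{EB2} = 1 - \left(1 - \dfrac{t + 1}{n + \hat{\beta}/s + 1}\right)^{1/s}$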
[Figure: four panels plotting relative bias and relative efficiency against p (0.00 to 0.10) for the estimators EB1, EB2, and EB3, with n=30 and s=10 in one pair of panels and n=80 and s=25 in the other. Here $RB = \mathrm{Bias}(\hat{p}_{MLE}) / \mathrm{Bias}(\hat{p}_{EB,i})$; when RB > 1, $\hat{p}_{EB,i}$ is better than $\hat{p}_{MLE}$.]
HCV prevalence
 Estimation of HCV prevalence in Xuzhou City, China.
– Data from Liu et al. (Transfusion, 1997)
– 1,875 blood donors screened for HCV at the Blood Transfusion Service
in Xuzhou City, China
• There were 42 positive blood donors found
– In order to test the usefulness of group testing, blood samples were
also pooled
• n = 375 groups
• s = 5 individuals per group
• t = 37 positive groups
– Point estimates of p, the individual probability of being HCV positive
• Using individual data: 42/1875 = 0.0224
• Using group data: $\hat{p}_{MLE} = 0.020562$, $\hat{p}_{EB1} = 0.020557$, $\hat{p}_{EB2} = 0.020534$
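The group-data estimates can be recomputed from n = 375, s = 5, and t = 37. The sketch below is written under stated assumptions rather than as the authors' implementation. It assumes the empirical Bayes formulas read $\hat{p}_{EB1} = 1 - \Gamma(n+\hat{\beta}/s+1)\Gamma(n-t+\hat{\beta}/s+1/s) / [\Gamma(n-t+\hat{\beta}/s)\Gamma(n+\hat{\beta}/s+1+1/s)]$ and $\hat{p}_{EB2} = 1 - (1 - (t+1)/(n+\hat{\beta}/s+1))^{1/s}$. It also assumes the hyperparameter (called beta here) maximizes a marginal likelihood proportional to $\beta\,\Gamma(n-t+\beta/s)/\Gamma(n+\beta/s+1)$. All function names are hypothetical.

```python
import math


def p_mle(t, n, s):
    """Maximum likelihood estimate of p from t positive groups out of n."""
    return 1.0 - (1.0 - t / n) ** (1.0 / s)


def beta_hat_over_s(t, n):
    """Maximizer b = beta/s of the assumed marginal likelihood (needs 0 < t < n).

    Setting the derivative of log f_T(t | beta) to zero gives
    sum_{k=0}^{t} b / (n - t + b + k) = 1, solved here by bisection.
    """
    def g(b):
        return 1.0 - sum(b / (n - t + b + k) for k in range(t + 1))

    lo, hi = 1e-9, 1.0
    while g(hi) > 0:          # g is decreasing in b; expand until the root is bracketed
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def p_eb1(t, n, s, b):
    """EB1 estimate, using log-gamma to avoid overflow in the gamma functions."""
    log_ratio = (math.lgamma(n + b + 1) + math.lgamma(n - t + b + 1 / s)
                 - math.lgamma(n - t + b) - math.lgamma(n + b + 1 + 1 / s))
    return 1.0 - math.exp(log_ratio)


def p_eb2(t, n, s, b):
    """EB2 estimate."""
    return 1.0 - (1.0 - (t + 1) / (n + b + 1)) ** (1.0 / s)


n, s, t = 375, 5, 37          # Xuzhou City group-testing data
b = beta_hat_over_s(t, n)     # estimate of beta / s
mle, eb1, eb2 = p_mle(t, n, s), p_eb1(t, n, s, b), p_eb2(t, n, s, b)
print(f"MLE = {mle:.6f}, EB1 = {eb1:.6f}, EB2 = {eb2:.6f}")
```

Under these assumptions the three values should agree with the point estimates listed above to about five decimal places, and all three sit a little below the individual-data estimate 42/1875 = 0.0224.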
HCV prevalence
 Multiple vector transfer designs
– Swallow (Phytopathology, 1985)
– Want to estimate the probability an insect vector transfers a pathogen (virus, bacteria, etc.) to a plant
[Figure: brown planthopper and whitebacked planthopper]
HCV prevalence
 Multiple vector transfer designs (continued)
– s insect vectors are transferred to a healthy plant
– The plant is the “group”
– Observe number of plants which contract the pathogen
– y = 0 if plant is negative, 1 if plant is positive
[Figure: greenhouse with enclosed test plants, each labeled y=0 or y=1; a planthopper either transmits or does not transmit the virus]
HCV prevalence
 New research
– Include independent variables to help model p in a logistic regression model,
$$p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}}$$
– Problem: Do not have the individual outcomes
– Helps decide whom to retest after a positive group result
– Multiple traits
• HCV
• HIV
• Other disease
• Simultaneously model
Why statistics?
 Statistics is used in many diverse areas!
– Statistics is the “science of science”
– Florence Nightingale quote:
the most important science in the whole world: for upon it depends the
practical application of every other science and of every art: the one
science essential to all political and social administration, all education,
all organization based on experience, for it only gives results of our
experience.
 I hope you are interested in taking more statistics courses
– UNO
– Graduate school in statistics or non-statistics programs
 Of course, I want you to consider coming to UNL!
Statistics at UNL
 Facts
– July 1, 2003 formed
– 11 faculty + 2 more in 2004
– No undergraduate major
– 40+ graduate students (most MS)
– Strong commitment from administration
– Hardin Hall on East Campus
Statistics at UNL
[Figure: map showing the location of the Department of Statistics, near 33rd St.]
Statistics at UNL
 Background of new students
– A few statistics courses – like UNO MATH 4740 and 4750
– Statistics is mainly a graduate discipline
– Majority have math degrees
 Recommendation for UNO classes
– Math 4740 and 4750 Intro. to Probability and Statistics I and II
– Math 3300 Numerical Methods
– Math 4760 Topics in Modeling
– Math 4050 Linear Algebra
– Math 4230 and 4240 Mathematical Analysis I and II
• Helpful if you plan to go on for a PhD
– Stat 3000 and 3010 Statistical Methods I and II
Statistics at UNL
 Recommendation for UNO classes (continued)
– Business administration course: 3140 Business Statistical Applications
– Computer science programming courses
– Information Systems & Quantitative Analysis Department courses
• 4150 Advanced Statistical Methods for IS&T
• 8160 Applied Distribution Free Statistics
• 8340 Applied Regression Analysis
• 9120 Applied Experimental Design and Analysis
• 9130 Applied Multivariate Analysis
Statistics at UNL
 Assistantships
– Work 16-20 hours a week
– Teaching - $13K per school year + tuition (MS students)
– Project Fulcrum grants - $30K per school year!
• 6 statistics students over the past 3 years have received a grant
– Research - variable depending on grants
• Statistics and non-statistics faculty grants
 What makes us unique?
– Consulting course and help desk
– STAT 971 – Statistical Modeling
– Statistics in sports and work with UNL athletic department
– Consulting - All departments in the College of Agriculture and Natural Resources
– Gallup
– Bioinformatics
Statistics at UNL
 Where do statistics graduates work?
– Pharmaceutical – Pfizer, Merck
– Marketing – Target, Hallmark
– Government research labs – INEEL, Los Alamos, Sandia, Argonne
– Agriculture - Pioneer Hi-Bred
– Consulting firms – Quintiles
– Everyone that I have known has had a job offer before they graduated!
 Salaries
– Non-academic starting (2003 American Statistical Association survey)
Degree   Sample size   25th percentile   50th percentile   75th percentile
MS       102           45.5K             50K               59K
PhD      99            60K               65K               75K
– Survey response rate was 23.5% by organizations surveyed
– See salary surveys at the American Statistical Association’s website
Statistics at UNL
 Applying for graduate school in statistics
– Send out applications before end of fall semester
– Apply to more than one school
– Visit schools in fall or early spring
– Assistantship offers usually first go out in March
 7th Annual UNL Regional Workshop in Mathematical Sciences
– Statistics, Mathematics, and Computer Science departments
– November 2004
– Friday afternoon & evening and Saturday morning
– Speakers introducing statistics and jobs in statistics
– FUNDING available!
Statistics at UNL
 For more information…
– E-mail me at [email protected] or [email protected]
• Advice
• Sit in on a class
– Website: www.chrisbilder.com/statistics
• This PowerPoint presentation
• Links to
– Introductory information about being a statistician
– Jobs (including internships)
– Salary information
– List of all Departments of Statistics
– Professional societies
– Websites for MS and PhD courses that I and others teach
– Newspaper and magazine articles about statistical applications