
Verification of Probability Forecasts
at Points
WMO QPF Verification Workshop
Prague, Czech Republic
14-16 May 2001
Barbara G. Brown
NCAR
Boulder, Colorado, U.S.A.
[email protected]
Why probability forecasts?
“…the widespread practice of ignoring
uncertainty when formulating and
communicating forecasts represents an extreme
form of inconsistency and generally results in
the largest possible reductions in quality and
value.”
--Murphy (1993)
Outline
1. Background and basics
– Types of events
– Types of forecasts
– Representation of probabilistic forecasts in the
verification framework
Outline continued
2. Verification approaches: focus on 2-category case
– Measures
– Graphical representations
– Using statistical models
– Signal detection theory
– Ensemble forecast verification
– Extensions to multi-category verification problem
– Comparing probabilistic and categorical forecasts
3. Connections to value
4. Summary, conclusions, issues
Background and basics
• Types of events:
– Two-category
– Multi-category
• Two-category events:
– Either event A happens or event B happens
– Examples:
Rain/No-rain
Hail/No-hail
Tornado/No-tornado
• Multi-category event
– Event A, B, C, ….or Z happens
– Example:
Precipitation categories
(< 1 mm, 1-5 mm, 5-10 mm, etc.)
Background and basics cont.
• Types of forecasts
– Completely confident
• Forecast probability is either 0 or 1
• Example: Rain/No rain
– Probabilistic
• Objective (deterministic, statistical, ensemble-based)
• Subjective
• Probability is stated explicitly
Background and basics cont.
• Representation of probabilistic forecasts in the
verification framework
x = 0 or 1
f = 0, …, 1.0
f may be limited to only certain values between
0 and 1
• Joint distribution:
p(f,x), where x = 0, 1
Ex: If there are 12 possible values of f, then p(f,x)
comprises 24 elements
Background and basics, cont.
• Factorizations: Conditional and marginal
probabilities
– Calibration-Refinement factorization:
• p(f,x) = p(x|f) p(f)
• p(x=0|f) = 1 – p(x=1|f) = 1 – E(x|f)
Only one number is needed to specify the distribution
p(x|f) for each f
• p(f) is the frequency of use of each forecast probability
– Likelihood-Base Rate factorization:
• p(f,x) = p(f|x) p(x)
• p(x) is the relative frequency of a Yes observation (e.g.,
the sample climatology of precipitation); p(x) = E(x)
Attributes [from Murphy and Winkler (1992)]
[Figure: attributes of probability forecasts, including sharpness]
Verification approaches: 2x2 case
Completely confident forecasts:

                      Forecast
Observation       Yes (f=1)    No (f=0)    Total
Yes (x=1)         YY           NY          YY+NY
No (x=0)          YN           NN          YN+NN
Total             YY+YN        NY+NN       YY+YN+NY+NN
Use the counts in this table to compute various
common statistics (e.g., POD, POFD, H-K, FAR,
CSI, Bias, etc.)
Verification measures for 2x2 (Yes/No)
completely confident forecasts
Statistic   Definition           Description
POD         YY/(YY+NY)           Probability of Detection of "Yes" observations; estimate of p(f=1|x=1); also called Hit Rate (HR) in SDT
POFD        NN/(YN+NN)           Probability of False Detection = probability of detection of "No" observations; estimate of p(f=0|x=0); 1-POFD = False Alarm Rate in SDT
FAR         YN/(YY+YN)           False Alarm Ratio; estimate of p(x=0|f=1)
CSI         YY/(YY+YN+NY)        Critical Success Index; also known as "Threat Score"
H-K         POD + POFD - 1       Hanssen-Kuipers Discrimination
Bias        (YY+YN)/(YY+NY)      A measure of over- or under-forecasting; estimate of p(f=1)/p(x=1)
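For concreteness, a minimal Python sketch (not from the original talk) mapping the table's counts to these measures; the argument names yy, yn, ny, nn follow the contingency table above:

def measures_2x2(yy, yn, ny, nn):
    # First letter = forecast (Y/N), second letter = observation, as in the 2x2 table.
    pod  = yy / (yy + ny)          # probability of detection, p(f=1|x=1)
    pofd = nn / (yn + nn)          # detection of "No" observations, p(f=0|x=0)
    far  = yn / (yy + yn)          # false alarm ratio, p(x=0|f=1)
    csi  = yy / (yy + yn + ny)     # critical success index / threat score
    hk   = pod + pofd - 1          # Hanssen-Kuipers discrimination
    bias = (yy + yn) / (yy + ny)   # p(f=1)/p(x=1)
    return {"POD": pod, "POFD": pofd, "FAR": far, "CSI": csi, "H-K": hk, "Bias": bias}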
Relationships among measures in the 2x2 case
Many of the measures in the 2x2 case are strongly
related in surprisingly complex ways.
For example:
1

pc
POD 
FAR  1 
, where pc  p(x  1 )

 (1  pc) (1  POFD) 
[Figure: the lines indicate different values of POD and POFD (where POD = POFD), labeled 0.10, 0.30, 0.50, 0.70, and 0.90. From Brown and Young (2000)]
CSI as a function of p(x=1) and POD=POFD
[Figure: CSI curves for POD = POFD values of 0.1, 0.3, 0.5, 0.7, and 0.9]
CSI as a function of FAR and POD
Measures for Probabilistic Forecasts
• Summary measures:
– Expectation
• Conditional:
E(f|x=0), E(f|x=1)
E(x|f)
• Marginal:
E(f)
E(x) = p(x=1)
– Variability
• Conditional:
Var(f|x=0), Var(f|x=1)
Var(x|f)
• Marginal:
Var(f)
Var(x) = E(x)[1-E(x)]
– Correlation
• Joint distribution
Summary measures for joint and marginal distributions [table from Murphy and Winkler (1992)]
Summary measures for conditional distributions [table from Murphy and Winkler (1992)]
Performance measures
• Brier score:
$$\mathrm{BS} = \frac{1}{n}\sum_{k=1}^{n}\left(f_k - x_k\right)^2$$
– Analogous to MSE; negative orientation;
– For perfect forecasts: BS=0
• Brier skill score:
$$\mathrm{BSS} = \frac{\mathrm{BS} - \mathrm{BS}_{\mathrm{ref}}}{0 - \mathrm{BS}_{\mathrm{ref}}} = 1 - \frac{\mathrm{BS}}{\mathrm{BS}_{\mathrm{ref}}}$$
– Analogous to MSE skill score
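A minimal sketch of both scores in Python (illustrative, assuming a constant climatological reference forecast):

def brier_score(f, x):
    # Mean squared error of probability forecasts f against binary outcomes x.
    return sum((fk - xk) ** 2 for fk, xk in zip(f, x)) / len(f)

def brier_skill_score(f, x, climo):
    # BSS relative to always forecasting the climatological probability.
    f_ref = [climo] * len(f)
    return 1 - brier_score(f, x) / brier_score(f_ref, x)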
[Example figure from Murphy and Winkler (1992)]
Brier score displays
From Shirey and Erickson,
http://www.nws.noaa.gov/tdl/synop/amspapers/masmrfpap.htm
Brier score displays
From http://www.nws.noaa.gov/tdl/synop/mrfpop/mainframes.htm
Decomposition of the Brier Score
Break Brier score into more elemental components:
$$\mathrm{BS} = \underbrace{\frac{1}{n}\sum_{i=1}^{I} N_i \left(f_i - \bar{x}_i\right)^2}_{\text{Reliability}} \;-\; \underbrace{\frac{1}{n}\sum_{i=1}^{I} N_i \left(\bar{x}_i - \bar{x}\right)^2}_{\text{Resolution}} \;+\; \underbrace{\bar{x}\left(1-\bar{x}\right)}_{\text{Uncertainty}}$$

where I = the number of distinct probability values and

$$\bar{x}_i = p(x=1 \mid f_i) = \frac{1}{N_i}\sum_{k \in N_i} x_k$$

Then, the Brier Skill Score can be re-formulated as

$$\mathrm{BSS} = \frac{\mathrm{RES} - \mathrm{REL}}{\mathrm{UNC}}$$
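A sketch of the decomposition in Python (illustrative; assumes the forecasts take a limited set of distinct values, as above):

from collections import defaultdict

def brier_decomposition(f, x):
    # Group cases by distinct forecast value f_i; N_i = size of each group.
    groups = defaultdict(list)
    for fk, xk in zip(f, x):
        groups[fk].append(xk)
    n = len(f)
    xbar = sum(x) / n
    rel = sum(len(g) * (fi - sum(g) / len(g)) ** 2 for fi, g in groups.items()) / n
    res = sum(len(g) * (sum(g) / len(g) - xbar) ** 2 for g in groups.values()) / n
    unc = xbar * (1 - xbar)
    return rel, res, unc   # BS = REL - RES + UNC; BSS = (RES - REL)/UNC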
Graphical representations of measures
• Reliability diagram
p(x=1|fi) vs. fi
• Sharpness diagram
p(f)
• Attributes diagram
– Reliability, Resolution, Skill/No-skill
• Discrimination diagram
p(f|x=0) and p(f|x=1)
Together, these diagrams provide a relatively
complete picture of the quality of a set of
probability forecasts
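A sketch of how the reliability and sharpness points could be computed from a verification sample (illustrative, not from the talk):

from collections import defaultdict

def reliability_points(f, x):
    # For each distinct forecast value f_i: observed frequency p(x=1|f_i)
    # (reliability diagram) and relative frequency of use p(f_i) (sharpness diagram).
    groups = defaultdict(list)
    for fk, xk in zip(f, x):
        groups[fk].append(xk)
    n = len(f)
    return sorted((fi, sum(g) / len(g), len(g) / n) for fi, g in groups.items())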
Reliability and Sharpness (from Wilks 1995)
[Figure: example reliability and sharpness diagrams, with panels labeled: climatology; minimal RES; good RES, at expense of REL; reliable forecasts of rare event; underforecasting; small sample size]
Reliability and Sharpness
(from Murphy and Winkler 1992)
[Figure: reliability and sharpness diagrams for St. Louis 12-24 h cool-season PoP forecasts, comparing subjective ("Sub") and model forecasts, with "no skill" and "no RES" reference lines]
Attributes diagram (from Wilks 1995)
Icing forecast examples
Use of statistical models to describe
verification features
• Exploratory study by Murphy and Wilks
(1998)
• Case study
– Use regression model to model reliability
– Use Beta distribution to model p(f) as measure of
sharpness
– Use multivariate diagram to display combinations of
characteristics
• Promising approach that is worthy of more
investigation
Fit Beta distribution to p(f)
2 parameters: p, q
Ideal: p < 1 and q < 1 (a U-shaped density, with forecasts concentrated near 0 and 1)
[Figure: fitted Beta densities for p(f) on the interval 0 to 1]
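A sketch of such a fit, assuming SciPy is available; the sample F is hypothetical and must lie strictly inside (0, 1) for the fit to be defined:

import numpy as np
from scipy import stats

F = np.array([0.05, 0.10, 0.15, 0.20, 0.70, 0.85, 0.90, 0.95])  # hypothetical p(f) sample
p, q, _, _ = stats.beta.fit(F, floc=0, fscale=1)  # pin the support to [0, 1]
print(p, q, p < 1 and q < 1)  # p < 1 and q < 1 -> U-shaped (sharp) forecast distribution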
Fit regression to reliability diagram [p(x|f) vs. f]
2 parameters: b0, b1
Murphy and Wilks (1998)
Summary Plot
Murphy and Wilks (1998)
Signal Detection Theory (SDT)
• Approach that has commonly been applied in medicine and
other fields
• Brought to meteorology by Ian Mason (1982)
• Evaluates the ability of forecasts to discriminate between
occurrence and non-occurrence of an event
• Summarizes characteristics of the Likelihood-Base Rate
decomposition of the framework
• Tests model performance relative to specific threshold
• Ignores calibration
• Allows comparison of categorical and probabilistic
forecasts
Mechanics of SDT
• Based on likelihood-base rate decomposition
p(f,x) = p(f|x) p(x)
• Basic elements:
– Hit rate (HR)
• HR = POD = YY / (YY+NY)
• Estimate of p(f=1|x=1)
– False Alarm Rate (FA)
• FA = 1 - POFD = YN / (YN + NN)
• Estimate of p(f=1|x=0)
• Relative Operating Characteristic curve
– Plot HR vs. FA
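A sketch of tracing the ROC from probability forecasts by thresholding (illustrative; assumes both Yes and No observations occur in the sample, so denominators are nonzero):

def roc_points(f, x, thresholds):
    # For each threshold t, convert probabilities to Yes/No forecasts and
    # compute (FA, HR) = (p(f=1|x=0), p(f=1|x=1)).
    pts = []
    for t in thresholds:
        yy = sum(1 for fk, xk in zip(f, x) if fk >= t and xk == 1)
        ny = sum(1 for fk, xk in zip(f, x) if fk <  t and xk == 1)
        yn = sum(1 for fk, xk in zip(f, x) if fk >= t and xk == 0)
        nn = sum(1 for fk, xk in zip(f, x) if fk <  t and xk == 0)
        hr = yy / (yy + ny)   # hit rate
        fa = yn / (yn + nn)   # false alarm rate
        pts.append((fa, hr))
    return pts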
ROC Examples: Mason(1982)
ROC Examples: Icing forecasts
ROC
• Area under the ROC is a measure of forecast
skill
– Values less than 0.5 indicate negative skill
• Measurement of the ROC area is often better if a normal
distribution model is used to model HR and FA
– The area can be underestimated if the curve is approximated by
straight line segments
– Harvey et al. (1992); Mason (1982); Wilson (2000)
Idealized ROC (Mason 1982)
[Figure: idealized ROC curves generated from normal distributions f(x=0) and f(x=1), for slope S = s0/s1 = 0.5, 1, and 2]
Comparison of Approaches
• Brier score
– Based on squared error
– Strictly proper scoring rule
– Calibration is an important
factor; lack of calibration
impacts scores
– Decompositions provide
insight into several
performance attributes
– Dependent on frequency of
occurrence of the event
• ROC
– Considers forecasts’ ability
to discriminate between Yes
and No events
– Calibration is not a factor
– Less dependent on
frequency of occurrence of
event
– Provides verification
information for individual
decision thresholds
Relative operating levels
• Analogous to the ROC, but from the Calibration-Refinement perspective (i.e., given the forecast)
• Curves based on:
– Correct Alarm Ratio: $\mathrm{YY}/(\mathrm{YY}+\mathrm{YN}) = 1 - \mathrm{FAR}$
– Miss Ratio: $\mathrm{NY}/(\mathrm{NY}+\mathrm{NN})$
• These statistics are estimates of two conditional
probabilities:
– Correct Alarm Ratio: p(x=1|f=1)
– Miss Ratio: p(x=1|f=0)
– For a system with no skill, p(x=1|f=1) = p(x=1|f=0) = p(x)
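For concreteness, a sketch of the two ratios from the 2x2 counts (names follow the contingency table earlier):

def rol_ratios(yy, yn, ny, nn):
    # Conditioning on the forecast rather than on the observation.
    correct_alarm_ratio = yy / (yy + yn)   # p(x=1|f=1) = 1 - FAR
    miss_ratio          = ny / (ny + nn)   # p(x=1|f=0)
    return correct_alarm_ratio, miss_ratio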
ROC Diagram
(Mason and Graham 1999)
ROL Diagram
(Mason and Graham 1999)
Verification of ensemble forecasts
• Output of ensemble forecasting systems can be
treated as
– A probability distribution
– A probability
– A categorical forecast
• Probabilistic forecasts from ensemble systems
can be verified using standard approaches for
probabilistic forecasts
• Common methods
– Brier score
– ROC
Example: Palmer et al. (2000)
Reliability
ECMWF ensemble
Multi-model ensemble
[Panels for event thresholds labeled "<0" and "<1"]
Example: Palmer et al. (2000)
ROC
ECMWF ensemble
Multi-model
ensemble
Verification of ensemble forecasts (cont.)
A number of methods have been developed specifically
for use with ensemble forecasts. For example:
• Rank histograms
– Rank position of observations relative to ensemble members
– Ideal: Uniform distribution
– Non-uniform histograms can occur for many reasons (Hamill 2001); see the tally sketch after this list
• Ensemble distribution approach
(Wilson et al. 1999)
– Fit distribution to ensemble
– Determine the probability that the fitted distribution assigns to the observation
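A sketch of the rank-histogram tally described above (illustrative; ties between the observation and members are ignored for simplicity):

def rank_histogram(ensembles, observations):
    # Count, over all cases, the rank of the observation among the m members;
    # the m+1 bins should be equally populated for a reliable ensemble.
    m = len(ensembles[0])
    counts = [0] * (m + 1)
    for members, obs in zip(ensembles, observations):
        rank = sum(1 for v in members if v < obs)
        counts[rank] += 1
    return counts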
Rank histograms:
Distribution approach (Wilson et al. 1999)
Extensions to multiple categories
• Examples:
– QPF with several thresholds/categories
• Approach 1: Evaluate each category on its own
– Compute Brier score, reliability, ROC, etc. for each
category separately
– Problems:
• Some categories will be very rare and have few Yes observations
• Throws away important information related to the ordering of
predictands and magnitude of error
Example: Brier skill score for several
categories
From http://www.nws.noaa.gov/tdl/synop/mrfpop/mainframes.htm
Extensions to multiple categories (cont.)
• Approach 2: Evaluate all categories
simultaneously
– Rank Probability Score (RPS)
– Analogous to Brier Score for multiple categories
$$\mathrm{RPS} = \frac{1}{K-1}\sum_{i=1}^{K}\left[\left(\sum_{n=1}^{i} P_n\right) - \left(\sum_{n=1}^{i} d_n\right)\right]^2$$

where $P_n$ is the forecast probability for category n and $d_n = 1$ if category n is observed (0 otherwise)
– Skill score:
$$\mathrm{RPSS} = 1 - \frac{\mathrm{RPS}_F}{\mathrm{RPS}_{\mathrm{ref}}}$$
– Decompositions analogous to BS, BSS
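A sketch of the RPS for a single forecast (illustrative):

def rps(P, d):
    # P: forecast probabilities over K ordered categories (sums to 1);
    # d: 0/1 indicator vector with a single 1 in the observed category.
    K = len(P)
    cum_p = cum_d = 0.0
    total = 0.0
    for Pn, dn in zip(P, d):
        cum_p += Pn
        cum_d += dn
        total += (cum_p - cum_d) ** 2
    return total / (K - 1)

# e.g., rps([0.2, 0.5, 0.3], [0, 1, 0]) -> 0.065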
Multiple categories: Examples of
alternative approaches
• Continuous ranked probability score
(Bouttier 1994; Brown 1974; Matheson and Winkler 1976;
Unger 1985)
and decompositions (Hersbach 2000)
– Analogous to RPS with an infinite number of classes
– Decompose into Reliability and Resolution/uncertainty components
• Multi-category reliability diagrams (Hamill 1997)
– Measures calibration in a cumulative sense
– Reduces impact of categories with few forecasts
• Other references:
– Bouttier 1994
– Brown 1974
– Matheson and Winkler 1976
– Unger 1985
Continuous RPS example (Hersbach 2000)
MCRD example (Hamill 1997)
Connections to value
• Cost-Loss ratio model:

                   Adverse weather?
Protect?           Yes      No
Yes                C        C
No                 L        0
• Optimal to protect whenever C < pL or p > C/L
where p is the probability of adverse weather
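A worked example of the decision rule, with illustrative numbers (not from the talk):

C, L = 10.0, 100.0            # cost of protecting; loss if unprotected and weather occurs
p = 0.25                      # forecast probability of adverse weather
protect = p > C / L           # True here: expected loss p*L = 25 exceeds cost C = 10
expected_expense = C if protect else p * L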
Wilks’ Value Score (Wilks 2001)
• VS is the percent improvement in value
between climatological and perfect information
as a function of C/L
• VS is impacted by (lack of) calibration
• VS can be generalized for particular/idealized
distributions of C/L
VS example: Wilks (2001)
Las Vegas, PoP April 1980 – March 1987
VS example: Icing forecasts
VS: Beta model example (Wilks 2001)
Richardson approach
• ROC context
• Calibration errors don’t impact the score
Miscellaneous issues
• Quantifying the uncertainty in verification
measures
– Issue: Spatial and temporal correlation
– A few approaches:
• Parametric methods
Ex: Seaman et al. (1996)
• Robust methods (confidence intervals for medians)
Ex: Brown et al. (1997)
Velleman and Hoaglin (1981)
• Bootstrap methods (see the resampling sketch after this list)
Ex: Hamill (1999)
Kane and Brown (2000)
• Treatment of observations as probabilistic?
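A sketch of a simple percentile-bootstrap interval (illustrative; not the specific procedures of Hamill 1999 or Kane and Brown 2000):

import random

def bootstrap_ci(cases, stat, n_boot=1000, alpha=0.05, seed=0):
    # cases: per-forecast (f, x) pairs; stat: function mapping a sample to a measure.
    # Plain resampling assumes independent cases; the spatial/temporal correlation
    # noted above would call for block resampling instead.
    rng = random.Random(seed)
    vals = sorted(stat([rng.choice(cases) for _ in cases]) for _ in range(n_boot))
    return vals[int(alpha / 2 * n_boot)], vals[int((1 - alpha / 2) * n_boot) - 1]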
Conclusions
• Basis for evaluating probability forecasts was
established many years ago (Brier, Murphy,
Epstein)
• A recent renewal of interest has led to new ideas
• Still more to do
– Develop and implement a cohesive set of meaningful
and useful methods
– Develop greater understanding of methods we have and
how they inter-relate
Verification of Probabilistic QPFs:
Selected References
Brown, B.G., G. Thompson, R.T. Bruintjes, R. Bullock and T. Kane, 1997:
Intercomparison of in-flight icing algorithms. Part II: Statistical
verification results. Weather and Forecasting, 12, 890-914.
Davis, C., and F. Carr, 2000: Summary of the 1998 workshop on mesoscale
model verification. Bulletin of the American Meteorological Society, 81,
809-819.
Hamill, T.M., 1997: Reliability diagrams for multicategory probabilistic
forecasts. Weather and Forecasting, 12, 736–741.
Hamill, T.M., 1999: Hypothesis tests for evaluating numerical precipitation
forecasts. Weather and Forecasting, 14, 155-167.
Hamill, T.M., 2001: Interpretation of rank histograms for verifying ensemble
forecasts. Monthly Weather Review, 129, 550-560.
References (cont.)
Harvey, L.O., Jr., K.R. Hammond, C.M. Lusk, and E.F. Mross, 1992: The
application of signal detection theory to weather forecasting behavior.
Monthly Weather Review, 120, 863-883.
Hersbach, H., 2000: Decomposition of the continuous ranked probability
score for ensemble prediction systems. Weather and Forecasting, 15, 559-570.
Hsu, W.-R., and A.H. Murphy, 1986: The attributes diagram: A geometrical
framework for assessing the quality of probability forecasts. International
Journal of Forecasting, 2, 285-293.
Kane, T.L., and B.G. Brown, 2000: Confidence intervals for some verification
measures – a survey of several methods. Preprints, 15th Conference on
Probability and Statistics in the Atmospheric Sciences, 8-11 May, Asheville,
NC, U.S.A., American Meteorological Society (Boston), 46-49.
References (cont.)
Mason, I., 1982: A model for assessment of weather forecasts. Australian
Meteorological Magazine, 30, 291-303.
Mason, I., 1989: Dependence of the critical success index on sample climate
and threshold probability. Australian Meteorological Magazine, 37, 75-81.
Mason, S., and N.E. Graham, 1999: Conditional probabilities, relative
operating characteristics, and relative operating levels. Weather and
Forecasting, 14, 713-725.
Murphy, A.H., 1993: What is a good forecast? An essay on the nature of
goodness in weather forecasting. Weather and Forecasting, 8, 281–293.
Murphy, A.H., and D.S. Wilks, 1998: A case study of the use of statistical
models in forecast verification: Precipitation probability forecasts.
Weather and Forecasting, 13, 795-810.
References (cont.)
Murphy, A.H., and R.L. Winkler, 1992: Diagnostic verification of probability
forecasts. International Journal of Forecasting, 7, 435-455.
Richardson, D.S., 2000: Skill and relative economic value of the ECMWF
ensemble prediction system. Quarterly Journal of the Royal Meteorological
Society, 126, 649-667.
Seaman, R., I. Mason, and F. Woodcock, 1996: Confidence intervals for some
performance measures of Yes-No forecasts. Australian Meteorological
Magazine, 45, 49-53.
Stanski, H., L.J. Wilson, and W.R. Burrows, 1989: Survey of common
verification methods in meteorology. WMO World Weather Watch Tech.
Rep. 8, 114 pp.
Velleman, P.F., and D.C. Hoaglin, 1981: Applications, Basics, and Computing
of Exploratory Data Analysis. Duxbury Press, 354 pp.
References (cont.)
Wilks, D.S., 1995: Statistical Methods in the Atmospheric Sciences,
Academic Press, San Diego, CA, 467 pp.
Wilks, D.S., 2001: A skill score based on economic value for
probability forecasts. Meteorological Applications, in press.
Wilson, L.J., W.R. Burrows, and A. Lanzinger, 1999: A strategy for
verification of weather element forecasts from an ensemble
prediction system. Monthly Weather Review, 127, 956-970.