EVALUATION OF WEATHER FORECASTS
SHORT COURSE ON PREDICTABILITY - III
Zoltan Toth
Global Systems Division
GSD/ESRL/OAR/NOAA
Acknowledgements:
Yuejian Zhu, Gopal Iyengar, Olivier Talagrand and a Large Group of
Colleagues and Collaborators
ELTE, Budapest, 26-27 May 2016
OUTLINE
• Purposes and types of evaluation
– Verification, diagnostics
• Attributes of forecast systems
– Statistical reliability & resolution
• Measures of reliability & resolution
• Metrics of ensemble performance
– Reliability
– More complex measures
FORM OF WEATHER FORECASTS
• Predictability of weather is limited due to
– Initial condition and model related errors &
– Chaotic nature of atmosphere
• Predictability varies from case to case due to
– Complex nonlinear interactions in the atmosphere
• Probabilistic forecasts capture case dependent variations in predictability
– Ensemble forecasting
• The only scientifically consistent form of forecasts
FORECASTING IN A CHAOTIC ENVIRONMENT –
PROBABILISTIC FORECASTING BASED ON A SINGLE FORECAST –
One integration with an NWP model, combined with past verification statistics
DETERMINISTIC APPROACH - PROBABILISTIC FORMAT
• Does not contain all forecast information
• Not best estimate for future evolution of system
• UNCERTAINTY CAPTURED IN TIME AVERAGE SENSE
• NO ESTIMATE OF CASE DEPENDENT VARIATIONS IN FCST UNCERTAINTY
PROBABILISTIC FORECAST APPROACHES
• Theoretically appealing approach – Liouville Eq
– Linear constraints & computational expense render it impractical
• Brute force Monte Carlo approach – Ensemble forecasting
– Dependable results with multiple integrations
– Case dependent probabilistic forecasts = predictability
• Conceptual challenges
– Estimate analysis error distribution
– Sample analysis error distribution
– Represent model related uncertainty
• Practical implementations
– Multiple analyses (EnKF, (L)ETKF, multiple 4dvar, etc)
– Breeding or Ensemble Transform
– Singular Vectors
• Probabilistic forecast products derived from ensembles
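To make the brute-force Monte Carlo idea concrete, here is a minimal sketch (illustrative only, not an operational implementation): a toy Lorenz-63 model is integrated from a set of randomly perturbed initial states, and the resulting ensemble is turned into a case-dependent event probability. The model choice, perturbation size, and event threshold are all assumptions made for the example.

```python
import numpy as np

def lorenz63(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Tendencies of the Lorenz (1963) system, a common toy model of chaos."""
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def integrate(state, nsteps, dt=0.01):
    """Simple fourth-order Runge-Kutta integration."""
    for _ in range(nsteps):
        k1 = lorenz63(state)
        k2 = lorenz63(state + 0.5 * dt * k1)
        k3 = lorenz63(state + 0.5 * dt * k2)
        k4 = lorenz63(state + dt * k3)
        state = state + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return state

rng = np.random.default_rng(0)
analysis = np.array([1.0, 2.0, 20.0])      # best estimate of the initial state (illustrative)
n_members, init_err = 20, 0.1              # assumed ensemble size and analysis error

# Monte Carlo sampling of the (assumed Gaussian) analysis error distribution
members = analysis + init_err * rng.standard_normal((n_members, 3))
forecasts = np.array([integrate(m, nsteps=500) for m in members])

# Case-dependent probabilistic forecast for an illustrative event, e.g. x > 0
event_prob = np.mean(forecasts[:, 0] > 0.0)
print(f"P(x > 0 at forecast time) = {event_prob:.2f}")
```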
NEEDS FOR SYSTEMATIC EVALUATION OF FCSTS
• Research exploration
– Learn about forecast systems - Random vs systematic errors, etc
• Research guidance
– Guide NWP development efforts
• Compare different systems
– E.g., with different types of output - single unperturbed vs. ensemble systems
• Evaluate value of / added by subcomponents of system
– E.g., improved DA, ensemble, product generation methods
• Management decisions
– Select best performing DA-Forecast system for operational use
• Assess user value
– Social, economic, environmental applications
USER REQUIREMENTS
• General characteristics of forecast users
– Each user affected in specific way by
• Various weather elements at
• Different points in time & space
• Requirements for optimal decisions related to
operations affected by weather
– Possible weather scenarios with covariances across
• Variables, space, and time
• Provision of weather information
– Specific information needed by each user can’t be foreseen
or provided
• Only commonly used, 'vanilla'-type probabilistic info can be distributed
– Ensemble data must be made accessible to users in
statistically reliable form
• All forecast info can be derived from this, including vanilla
probabilistic products
STATISTICAL ASSESSMENT
• Performance of forecast systems
– Sample of forecasts - not a single forecast
• Sample size affects
– The level of detail we can uncover
– Statistical significance of results
– Size of sample limited by available observational record
• Verification – Quality & utility
– Compare output from fcst system with proxy for truth
– Main subject of lecture
• Diagnostics – Other characteristics
– Depends only on forecast properties
• Unrelated to truth
VERIFICATION
• Measures of quality
– Environmental science issues - Focus of talk
• Measures of utility – Multidisciplinary issues
– Social & economic questions, beyond environmental sciences
– Socio-economic value of forecasts is ultimate measure
• Approximate measures can be constructed
• Improved quality => Enhanced utility
• How to improve utility if quality is fixed?
– Communicate all available user relevant information
• Offer probabilistic or other information on forecast uncertainty
– Engage in education, training
• Can we selectively improve certain aspects of forecasts?
– E.g., improve precip forecasts without improving circulation forecasts?
VERIFICATION BASICS
• Define predictand
– Exhaustive set of events, eg
• Continuous temperature
• Precipitation types (Categorical)
• Choose proxy for truth
– Observations or
– Observationally-based fine scale analysis
– Results subject to error in proxy for truth
• Collect sample of forecasts and verifying truth
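As a toy illustration of these steps, the sketch below defines precipitation type as an exhaustive set of categories, uses invented values standing in for the proxy for truth, and collects matched forecast-truth pairs into one verification sample; all thresholds and data are purely illustrative.

```python
import numpy as np

# Step 1: define the predictand as an exhaustive set of categories (illustrative thresholds, mm/6h)
bins = [0.0, 0.2, 5.0, np.inf]                 # dry / light / heavy
categories = ["dry", "light", "heavy"]

def to_category(precip_mm):
    """Map a continuous precipitation amount onto one of the exhaustive classes."""
    return categories[np.searchsorted(bins, precip_mm, side="right") - 1]

rng = np.random.default_rng(1)
n_cases = 1000
forecasts_mm = rng.gamma(shape=0.5, scale=2.0, size=n_cases)  # stand-in forecast values
truth_mm = rng.gamma(shape=0.5, scale=2.0, size=n_cases)      # stand-in proxy for truth (obs/analysis)

# Steps 2-3: collect the sample of matched forecast / verifying-truth pairs
sample = [(to_category(f), to_category(o)) for f, o in zip(forecasts_mm, truth_mm)]
print(sample[:5])
```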
VERIFICATION PREAMBLE
• What do we want / need to assess?
– Attributes of forecast systems
• What can we measure that reflects attributes?
– Quantitative metrics
• Comprehensive approach
– Cover all bases
– Understand relationships between metrics
EVALUATING QUALITY OF FORECAST SYSTEMS
• Formats of forecasts
– Categorical, single value / multiple value, probabilistic
• Forecast system attributes
– Performance specific characteristics
– Traditionally, defined separately for each forecast format
• General definition needed
– Need to compare forecasts
• From any system &
• Of any type / format
– Support systematic evaluation of
• End-to-end (provider-user) forecast process
– Statistical post-processing as integral part of system
FORECAST SYSTEM ATTRIBUTES
• Abstract concept (like length)
– Reliability and Resolution
• Both can be measured through different statistics
• Statistical properties
– Interpreted for large set of forecasts
• Describe behavior of forecast system, not a single forecast
• Assumptions
– Forecasts
• Can be of any format
– Single value, ensemble, categorical, probabilistic, etc
• Take a finite number of different “classes” Fa
– Observations
• Can also be grouped into finite number of “classes” like Oa
STATISTICAL RELIABILITY – TEMPORAL AGGREGATE
STATISTICAL CONSISTENCY OF FORECASTS WITH OBSERVATIONS
BACKGROUND:
• Consider particular forecast class – Fa
• Consider frequency distribution of observations that follow forecasts Fa - fdoa
DEFINITION:
• If forecast Fa has the exact same form as fdoa, for all forecast classes,
the forecast system is statistically consistent with observations =>
The forecast system is perfectly reliable
MEASURES OF RELIABILITY:
• Based on different ways of comparing Fa and fdoa
EXAMPLES: [figures: control forecast, ensemble]
STATISTICAL RESOLUTION – TEMPORAL EVOLUTION
ABILITY TO DISTINGUISH, AHEAD OF TIME, AMONG DIFFERENT OUTCOMES
BACKGROUND:
• Assume observed events are classified into finite number of classes, like Oa
DEFINITION:
• If all observed classes (Oa, Ob,…) are preceded by
– Distinctly different forecasts (Fa, Fb,…)
– The forecast system “resolves” the problem =>
The forecast system has perfect resolution
MEASURES OF RESOLUTION:
• Based on degree of separation of fdo’s that follow various forecast classes
• Measured by difference between fdo’s & climate distribution
• Measures differ by how differences between distributions are quantified
EXAMPLES: [figure: forecasts vs. observations]
CHARACTERISTICS OF RELIABILITY & RESOLUTION
• Reliability
– Related to form of forecast, not forecast content
• Fidelity of forecast – reproduce nature (when resolution is also perfect, forecast looks like nature)
– Not related to time sequence of forecast/observed systems
– How to improve?
• Make model more realistic
– Also expected to improve resolution
• Statistical bias correction: can be statistically imposed at one time level
– If both natural & forecast systems are stationary in time &
– If there is a large enough set of observed-forecast pairs
– Link with verification:
» Replace forecast with corresponding fdo
• Resolution
– Related to inherent predictive value of forecast system
– Not related to form of forecasts
• Statistical consistency at one time level (reliability) is irrelevant
– How to improve?
• Enhanced knowledge about time sequence of events
– More realistic numerical model should help
» May also improve reliability
CHARACTERISTICS OF FORECAST SYSTEM ATTRIBUTES
RELIABILITY AND RESOLUTION ARE
• General forecast attributes
– Valid for any forecast format (single, categorical, probabilistic, etc)
• Independent attributes
– For example
• Climate pdf forecast is perfectly reliable, yet has no resolution
• Reversed rain / no-rain forecast can have perfect resolution and no reliability
– To separate them, they must be measured according to the general definition
• If measured according to traditional, narrower definitions, reliability & resolution can be mixed
• Function of forecast quality
– There is no other relevant forecast attribute
• Perfect reliability and perfect resolution = perfect forecast system
– A "deterministic" forecast system that is always correct
• Both needed for utility of forecast systems
– Need both reliability and resolution, especially if no observed/forecast pairs are available (e.g., extreme forecasts)
FORMAT OF FORECASTS – PROBABILISTIC FORMAT
• Do we have a choice?
– When forecasts are imperfect
• Only probabilistic format can be reliable/consistent with nature
• Abstract concept - Dimensionless
– Related to forecast system attributes
• Space of probability – dimensionless pdf or similar format
– For environmental variables (not those variables themselves)
• Steps
1. Define event
• Function of concrete variables, features, etc
– E.g., “temperature above freezing”; “thunderstorm”
2. Determine probability of event occurring in future
– Based on knowledge of initial state and evolution of system
OPERATIONAL PROB/ENSEMBLE FORECAST VERIFICATION
• Requirements
– Use same general dimensionless probabilistic measures for verifying
• Any event
• Against either
– Observations or
– Numerical analysis
• Measures used at NCEP
– Probabilistic forecast measures – ensemble interpreted probabilistically
• Reliability
– Component of BSS, RPSS, CRPSS
– Attributes & Talagrand diagrams
• Resolution
– Component of BSS, RPSS, CRPSS
– ROC, attributes diagram, potential economic value
– Special ensemble verification procedures
• Designed to assess performance of finite set of forecasts
– Most likely member statistics, PECA
• Missing components include
– General event definition - Spatial/temporal/cross variable considerations
– Routine testing of statistical significance
– Other “spatial” and/or “diagnostic” measures?
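One of the missing components listed above, routine testing of statistical significance, can be illustrated with a generic paired bootstrap of the score difference between two systems verified on the same cases. This is a sketch, not the NCEP procedure; for serially correlated daily scores a block bootstrap would be more appropriate, and the score arrays below are placeholders.

```python
import numpy as np

def paired_bootstrap(score_a, score_b, n_resamples=10000, seed=0):
    """Bootstrap the mean score difference between two systems verified on the same cases.

    Returns the mean difference and a 95% confidence interval; if the interval
    excludes zero, the difference is significant at roughly the 5% level.
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(score_a, float) - np.asarray(score_b, float)
    n = len(diff)
    idx = rng.integers(0, n, size=(n_resamples, n))   # resample cases with replacement
    means = diff[idx].mean(axis=1)
    return diff.mean(), np.percentile(means, [2.5, 97.5])

# Placeholder per-case scores (e.g., daily Brier scores) for two systems
rng = np.random.default_rng(1)
sys_a = rng.normal(0.10, 0.03, 365)
sys_b = rng.normal(0.11, 0.03, 365)
mean_diff, (lo, hi) = paired_bootstrap(sys_a, sys_b)
print(f"mean score difference = {mean_diff:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```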
FORECAST PERFORMANCE MEASURES
COMMON CHARACTERISTIC:
Function of both forecast and observed values
MEASURES OF RELIABILITY:
DESCRIPTION:
Statistically compares any sample of
forecasts with sample of
corresponding observations
GOAL:
To assess similarity of samples (e.g.,
whether 1st and 2nd moments match)
EXAMPLES:
Reliability component of
Brier Score
Ranked Probability Score
Analysis Rank Histogram
Spread vs. Ens. Mean error
Etc.
MEASURES OF RESOLUTION:
DESCRIPTION:
Compares the distribution of
observations that follows different
classes of forecasts with the climate
distribution (as reference)
GOAL:
To assess how well the observations
are separated when grouped by
different classes of preceding fcsts
EXAMPLES:
Resolution component of
Brier Score
Ranked Probability Score
Information content
Relative Operating Characteristics
Relative Economic Value
Etc.
COMBINED (REL+RES) MEASURES: Brier, Cont. Ranked Prob. Scores, rmse, PAC,…
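To make the reliability and resolution components concrete, the sketch below computes the Brier score for binary event probabilities together with its standard decomposition into reliability, resolution and uncertainty terms (Murphy-type decomposition over forecast probability classes), plus the Brier skill score relative to climatology. The data and bin choice are illustrative, and the decomposition is exact only when all forecasts within a class share the same probability.

```python
import numpy as np

def brier_decomposition(p_fcst, obs, n_bins=10):
    """Brier score and its decomposition: BS ≈ reliability - resolution + uncertainty.

    p_fcst : array of forecast probabilities in [0, 1]
    obs    : array of 0/1 event occurrences
    """
    p_fcst, obs = np.asarray(p_fcst, float), np.asarray(obs, float)
    bs = np.mean((p_fcst - obs) ** 2)
    o_bar = obs.mean()                              # climatological frequency

    # Group forecasts into probability classes (bins)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    which = np.clip(np.digitize(p_fcst, edges) - 1, 0, n_bins - 1)

    reliability = resolution = 0.0
    for k in range(n_bins):
        mask = which == k
        if not mask.any():
            continue
        n_k = mask.sum()
        p_k = p_fcst[mask].mean()                   # mean forecast probability in class k
        o_k = obs[mask].mean()                      # observed frequency following class k
        reliability += n_k * (p_k - o_k) ** 2       # penalizes forecast/observed mismatch
        resolution += n_k * (o_k - o_bar) ** 2      # rewards separation from climatology
    n = len(obs)
    return bs, reliability / n, resolution / n, o_bar * (1.0 - o_bar)

# Illustrative synthetic sample of probability forecasts and binary outcomes
rng = np.random.default_rng(0)
true_prob = rng.uniform(0, 1, 5000)
obs = (rng.uniform(0, 1, 5000) < true_prob).astype(int)
p_fcst = np.clip(true_prob + 0.05 * rng.standard_normal(5000), 0, 1)

bs, rel, res, unc = brier_decomposition(p_fcst, obs)
bss = 1.0 - bs / unc                                # skill relative to climatology
print(f"BS={bs:.3f}  REL={rel:.3f}  RES={res:.3f}  UNC={unc:.3f}  BSS={bss:.3f}")
```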
EXAMPLE – PROBABILISTIC FORECASTS
RELIABILITY:
Forecast probabilities for a given event match the observed frequencies of that event (given that probability forecast)
RESOLUTION:
Many forecasts fall into classes corresponding to high or low observed frequency of the given event (occurrence and non-occurrence of the event is well resolved by the forecast system)
PROBABILISTIC FORECAST PERFORMANCE MEASURES
TO ASSESS TWO MAIN ATTRIBUTES OF PROBABILISTIC FORECASTS:
RELIABILITY AND RESOLUTION
Univariate measures:
Statistics accumulated point by point in space
Multivariate measures: Spatial covariance is considered
EXAMPLE:
BRIER SKILL SCORE (BSS)
COMBINED MEASURE OF RELIABILITY AND RESOLUTION
BRIER SKILL SCORE (BSS)
COMBINED MEASURE OF RELIABILITY AND RESOLUTION
METHOD:
Compares pdf against analysis
• Resolution (random error)
• Reliability (systematic error)
EVALUATION:
• BSS – higher is better
• Resolution – higher is better
• Reliability – lower is better
RESULTS:
• Resolution dominates initially
• Reliability becomes important later
• ECMWF best throughout
– Good analysis/model?
• NCEP good days 1-2
– Good initial perturbations?
– No model perturbations – hurts later?
• CANADIAN good days 8-10
– Model diversity helps?
May-June-July 2002 average Brier skill score for the EC-EPS (grey lines with full circles), the MSC-EPS (black lines with open circles) and the NCEP-EPS (black lines with crosses). Bottom: resolution (dotted) and reliability (solid) contributions to the Brier skill score. Values refer to the 500 hPa geopotential height over the northern hemisphere latitudinal band 20º-80ºN, and have been computed considering 10 equally-climatologically-likely intervals (from Buizza, Houtekamer, Toth et al, 2004).
BRIER SKILL SCORE
COMBINED MEASURE OF RELIABILITY AND RESOLUTION
RANKED PROBABILITY SCORE
COMBINED MEASURE OF RELIABILITY AND RESOLUTION
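A minimal sketch of the discrete ranked probability score for a single forecast over ordered, mutually exclusive categories; the probabilities and observed category are invented, and some definitions additionally divide by the number of categories minus one.

```python
import numpy as np

def rps(prob_fcst, obs_category):
    """Ranked probability score for one forecast over ordered categories.

    prob_fcst    : forecast probabilities per category (should sum to 1)
    obs_category : index of the category that was observed
    """
    prob_fcst = np.asarray(prob_fcst, float)
    obs = np.zeros_like(prob_fcst)
    obs[obs_category] = 1.0
    # RPS compares the cumulative forecast and observed distributions
    return np.sum((np.cumsum(prob_fcst) - np.cumsum(obs)) ** 2)

# Illustrative 3-category precipitation forecast: dry / light / heavy, "light" observed
print(rps([0.2, 0.5, 0.3], obs_category=1))
```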
CONTINUOUS RANKED PROBABILITY SCORE (CRPS)

CRPS = ∫ [F(x) − H(x − x_o)]² dx   (integrated over the full range of x)

CRP Skill Score: CRPSS = (CRPS_c − CRPS_f) / CRPS_c

where F(x) is the forecast cumulative distribution built from the ordered ensemble members (p01, p02, …, p10), x_o is the observation (truth), CRPS_f and CRPS_c are the CRPS of the forecast and of the climatological reference, and H is the Heaviside function:

H(x − x_o) = 0 for x < x_o, 1 for x ≥ x_o

[Figure: forecast CDF (0% to 100%) built from the 10 ordered ensemble members p01, p02, …, p10, with the observation x_o marked as a step function.]
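Following the definition above, this sketch evaluates the CRPS numerically for one ensemble forecast: the empirical CDF F(x) is built from the ordered members, and [F(x) − H(x − x_o)]² is integrated on a fine grid. Member values, the observation, and the climatological reference below are invented; closed-form expressions for ensemble CRPS exist, but the brute-force integral keeps the link to the formula explicit.

```python
import numpy as np

def crps_ensemble(members, obs, n_grid=2000):
    """Numerical CRPS: integrate [F(x) - H(x - obs)]^2 over x.

    members : ensemble forecast values (define the empirical CDF F)
    obs     : verifying observation/analysis x_o
    """
    members = np.sort(np.asarray(members, float))
    lo = min(members.min(), obs) - 1.0
    hi = max(members.max(), obs) + 1.0
    x = np.linspace(lo, hi, n_grid)
    F = np.searchsorted(members, x, side="right") / len(members)  # empirical forecast CDF
    H = (x >= obs).astype(float)                                  # Heaviside step at the observation
    return np.sum((F - H) ** 2) * (x[1] - x[0])                   # simple rectangle-rule integral

# 10 illustrative ensemble members (p01..p10) and an observed value
ens = [1.2, 0.8, 2.1, 1.5, 0.9, 1.7, 2.4, 1.1, 1.3, 1.9]
x_o = 1.6
crps_f = crps_ensemble(ens, x_o)
# Climatological reference: a large sample of draws from an assumed climate distribution
crps_c = crps_ensemble(np.random.default_rng(0).normal(1.5, 1.0, 500), x_o)
print(f"CRPS_f = {crps_f:.3f}   CRPSS = {(crps_c - crps_f) / crps_c:.3f}")
```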
INFORMATION CONTENT
MEASURE OF RESOLUTION
RELATIVE OPERATING CHARACTERISTICS
MEASURE OF RESOLUTION
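A minimal sketch of how an ROC curve can be built for a probabilistic forecast of a binary event: sweep a probability threshold, compute the hit rate and false alarm rate at each threshold, and estimate the area under the curve. Data are synthetic.

```python
import numpy as np

def roc_points(p_fcst, obs, thresholds=np.linspace(0.0, 1.0, 11)):
    """Hit rate (POD) and false alarm rate (POFD) for a set of probability thresholds."""
    p_fcst, obs = np.asarray(p_fcst), np.asarray(obs).astype(bool)
    hit_rates, false_alarm_rates = [], []
    for t in thresholds:
        warn = p_fcst >= t                        # issue a "yes" forecast above the threshold
        hits = np.sum(warn & obs)
        misses = np.sum(~warn & obs)
        false_alarms = np.sum(warn & ~obs)
        correct_negatives = np.sum(~warn & ~obs)
        hit_rates.append(hits / (hits + misses))
        false_alarm_rates.append(false_alarms / (false_alarms + correct_negatives))
    return np.array(false_alarm_rates), np.array(hit_rates)

# Synthetic probability forecasts and binary outcomes
rng = np.random.default_rng(0)
true_prob = rng.uniform(0, 1, 5000)
obs = rng.uniform(0, 1, 5000) < true_prob
p_fcst = np.clip(true_prob + 0.1 * rng.standard_normal(5000), 0, 1)

pofd, pod = roc_points(p_fcst, obs)
order = np.argsort(pofd)                          # sort points for the area estimate
x, y = pofd[order], pod[order]
auc = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)) # trapezoidal area under the ROC curve
print(f"ROC area ≈ {auc:.3f}  (0.5 = no resolution, 1.0 = perfect)")
```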
ECONOMIC VALUE OF FORECASTS
MEASURE OF RESOLUTION
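A minimal sketch of relative (potential) economic value in the simple static cost-loss model, where a user with cost/loss ratio C/L protects whenever the event is forecast. The hit rate, false alarm rate, and base rate below are invented numbers; in practice they would come from an ROC-type contingency analysis at each probability threshold.

```python
def relative_value(hit_rate, false_alarm_rate, base_rate, cost_loss_ratio):
    """Relative economic value in the cost-loss model, expenses per unit loss.

    V = (E_climate - E_forecast) / (E_climate - E_perfect)
    """
    a, s, H, F = cost_loss_ratio, base_rate, hit_rate, false_alarm_rate
    e_climate = min(a, s)                         # best of "always protect" / "never protect"
    e_perfect = s * a                             # protect exactly when the event occurs
    e_forecast = F * a * (1 - s) + H * a * s + (1 - H) * s
    return (e_climate - e_forecast) / (e_climate - e_perfect)

# Illustrative numbers: value depends strongly on the user's cost/loss ratio
for alpha in (0.05, 0.2, 0.5):
    v = relative_value(hit_rate=0.8, false_alarm_rate=0.1, base_rate=0.2, cost_loss_ratio=alpha)
    print(f"C/L = {alpha:.2f}  relative value = {v:.3f}")
```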
ENSEMBLE MEAN ERROR VS. ENSEMBLE SPREAD
MEASURE OF RELIABILITY
Statistical consistency between the ensemble and the verifying analysis means that the verifying analysis should be statistically indistinguishable from the ensemble members => the ensemble mean error (distance between the ensemble mean and the analysis) should be equal to the ensemble spread (distance between the ensemble mean and the ensemble members).
In case of a statistically consistent ensemble, ensemble spread = ensemble mean error, and they are both a MEASURE OF RESOLUTION. In the presence of bias, both rms error and PAC will be a combined measure of reliability and resolution.
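A minimal sketch of the spread-error consistency check described above: over a synthetic sample, the RMS ensemble-mean error is compared with the RMS ensemble spread (the small (N+1)/N sample-size correction is ignored here).

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_members = 1000, 20

# Synthetic, statistically consistent setup: the verifying analysis and the
# ensemble members are all draws from the same case-dependent distribution
center = rng.normal(0.0, 1.0, n_cases)                       # unknown center of the forecast pdf
truth = center + rng.normal(0.0, 1.0, n_cases)               # verifying analysis
ensemble = center[:, None] + rng.normal(0.0, 1.0, (n_cases, n_members))

ens_mean = ensemble.mean(axis=1)
rms_error = np.sqrt(np.mean((ens_mean - truth) ** 2))        # ensemble-mean error vs. analysis
rms_spread = np.sqrt(np.mean(ensemble.var(axis=1, ddof=1)))  # spread about the ensemble mean

print(f"RMS ensemble-mean error = {rms_error:.3f}")
print(f"RMS ensemble spread     = {rms_spread:.3f}")
```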
ANALYSIS RANK HISTOGRAM (TALAGRAND DIAGRAM)
MEASURE OF RELIABILITY
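A minimal sketch of the analysis rank histogram: for each case, count how many ensemble members fall below the verifying value; a statistically consistent ensemble gives a flat histogram over the N+1 bins, while an under-dispersive ensemble gives a U shape. Data are synthetic.

```python
import numpy as np

def rank_histogram(ensemble, truth):
    """Counts of the verifying value's rank among the ensemble members (N+1 bins)."""
    ens = np.asarray(ensemble)
    ranks = np.sum(ens < np.asarray(truth)[:, None], axis=1)   # rank of truth within each ensemble
    return np.bincount(ranks, minlength=ens.shape[1] + 1)

rng = np.random.default_rng(0)
n_cases, n_members = 10000, 10
center = rng.normal(0.0, 1.0, n_cases)
truth = center + rng.normal(0.0, 1.0, n_cases)

# Consistent ensemble: members drawn from the same distribution as the truth
ensemble_ok = center[:, None] + rng.normal(0.0, 1.0, (n_cases, n_members))
# Under-dispersive ensemble: too little spread around the same center
ensemble_under = center[:, None] + 0.5 * rng.normal(0.0, 1.0, (n_cases, n_members))

print("consistent      :", rank_histogram(ensemble_ok, truth))     # roughly flat
print("under-dispersive:", rank_histogram(ensemble_under, truth))  # U-shaped
```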
PERTURBATION VS. ERROR CORRELATION ANALYSIS (PECA)
MULTIVARIATE COMBINED MEASURE OF RELIABILITY & RESOLUTION
METHOD: Compute correlation between ensemble perturbations and the error in the control forecast for
– Individual members
– Optimal combination of members
– Each ensemble
– Various areas, all lead times
EVALUATION: Large correlation indicates the ensemble captures the error in the control forecast
– Caveat – errors defined by analysis
RESULTS:
– Canadian best on large scales
• Benefit of model diversity?
– ECMWF gains most from combinations
• Benefit of orthogonalization?
– NCEP best on small scales, short term
• Benefit of breeding (best estimate of initial error)?
– PECA increases with lead time
• Lyapunov convergence
• Nonlinear saturation
– Higher values on small scales
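A simplified sketch of the PECA idea (not the operational code): correlate each ensemble perturbation with the control forecast error over a flattened field, and also correlate the least-squares optimal combination of perturbations with that error. The synthetic perturbations below are constructed to partly project onto the error.

```python
import numpy as np

def correlation(a, b):
    """Centered pattern correlation between two flattened fields."""
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

rng = np.random.default_rng(0)
n_grid, n_members = 500, 10

control_error = rng.normal(0.0, 1.0, n_grid)                  # control forecast minus analysis
# Illustrative perturbations that partly project onto the control error
perts = 0.5 * control_error + rng.normal(0.0, 1.0, (n_members, n_grid))

# PECA for individual members
peca_members = [correlation(p, control_error) for p in perts]

# Optimal linear combination of perturbations (least-squares fit to the control error)
coef, *_ = np.linalg.lstsq(perts.T, control_error, rcond=None)
peca_combined = correlation(perts.T @ coef, control_error)

print("individual PECA:        ", np.round(peca_members, 2))
print("optimal combination PECA:", round(peca_combined, 2))
```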
SAMPLING ANALYSIS ERRORS
• Multiple analysis cycles
– Random & realistic sample of both growing & non-growing patterns
• Bred Vectors (BV)
– Random sample of growing subspace only
• Ensemble Transform (ET)
– Random & orthogonalized sample of growing subspace
– Minor increase in space spanned
• Singular Vectors
– Directed sampling of fastest finite time apparent growth
• From unknown distribution (when general norms used)
– Hessian norm samples errors ASSUMED by DA
(Wei et al 2006)
DA-FORECAST CYCLES & ANALYSIS ERRORS
• Analysis = weighted mean of First Guess (FG) forecast & obs
• FG error dominated by growing errors
– Initial errors rotate into fast growing nonlinear patterns
– Very few fast growing directions supported by instabilities of the flow
• Observations - random errors
– Very large subspace
• Analysis error - Combination of
FG & obs errors
[Figures: Patil et al 2002 – 10-20% unstable area, characterized by 2-3 growing perturbations in 1,100 x 1,100 km boxes; Toth & Kalnay 1993]
ENSEMBLE METRICS
• Mean / spread
• Talagrand
– Regular
– Time-lag for jump
• Temporal consistency of perts
• Craig's / Mozheng's stat
• PECA
• Higher moment statistics
– Joint probabilities, correlations, covariances
• Diagnostics
– DOF (degrees of freedom)
FORECAST EVALUATION - SUMMARY
• Forecast system attributes
– Statistical reliability (calibration)
– Statistical resolution (skill or information content)
• Array of metrics for
– Either attributes
– Joint evaluation of both attributes
• Different metrics for assessing
– (Continuous) probabilistic forecasts
– Scenario-based (ensemble) forecasts
• Complex forecast systems need multitude of metrics
– Different metrics
– Different parameters including
• Joint probabilities, covariance, etc
BACKGROUND
http://wwwt.emc.ncep.noaa.gov/gmb/ens/ens_info.html
Toth, Z., O. Talagrand, and Y. Zhu, 2005: The Attributes of Forecast Systems: A Framework for the Evaluation and Calibration of Weather Forecasts. In: Predictability Seminars, 9-13 September 2002, Ed.: T. Palmer, ECMWF, pp. 584-595.
Toth, Z., O. Talagrand, G. Candille, and Y. Zhu, 2003: Probability and ensemble forecasts. In: Environmental Forecast Verification: A practitioner's guide in atmospheric science. Ed.: I. T. Jolliffe and D. B. Stephenson. Wiley, p. 137-164.
REFERENCES
• Value of forecasts; decomposition of scores - A.
OUTLINE / SUMMARY
• SCIENCE OF FORECASTING
– GOAL OF SCIENCE: Forecasting
– VERIFICATION: Model development, user feedback
• GENERATION OF PROBABILISTIC FORECASTS
– SINGLE FORECASTS: Statistical rendition of pdf
– ENSEMBLE FORECASTS: NWP-based, case-dependent pdf
• ATTRIBUTES OF FORECAST SYSTEMS
– RELIABILITY: Forecasts look like nature statistically
– RESOLUTION: Forecasts indicate actual future developments
• VERIFICATION OF PROBABILISTIC & ENSEMBLE FORECASTS
– UNIFIED PROBABILISTIC MEASURES: Dimensionless
– ENSEMBLE MEASURES: Evaluate finite sample
• STATISTICAL POSTPROCESSING OF FORECASTS
– STATISTICAL RELIABILITY: Make it perfect
– STATISTICAL RESOLUTION: Keep it unchanged
WHAT DO WE NEED FOR POSTPROCESSING TO WORK?
• LARGE SET OF FCST – OBS PAIRS
• Consistency is defined over a large sample – the same is needed for post-processing
• The larger the sample, the more detailed the corrections that can be made
• BOTH FCST AND REAL SYSTEMS MUST BE STATIONARY IN TIME
• Otherwise can make things worse
• Subjective forecasts difficult to calibrate
HOW DO WE MEASURE STATISTICAL INCONSISTENCY?
• MEASURES OF STATISTICAL RELIABILITY
• Time mean error
• Analysis rank histogram (Talagrand diagram)
• Reliability component of Brier and similar scores
• Reliability diagram
SOURCES OF STATISTICAL INCONSISTENCY
• TOO FEW FORECAST MEMBERS
• Single forecast – inconsistent by definition, unless perfect
• MOS fcst hedged toward climatology as fcst skill is lost
• Small ensemble – sampling error due to limited ensemble size
(Houtekamer 1994?)
• MODEL ERROR (BIAS)
• Deficiencies due to various problems in NWP models
• Effect is exacerbated with increasing lead time
• SYSTEMATIC ERRORS (BIAS) IN ANALYSIS
• Induced by observations
• Effect dies out with increasing lead time
• Model related
• Bias manifests itself even in initial conditions
• ENSEMBLE FORMATION (IMPROPER SPREAD)
• Inappropriate initial spread
• Lack of representation of model related uncertainty in ensemble
• I.e., use of a simplified model that is not able to account for model related uncertainty
HOW TO IMPROVE STATISTICAL CONSISTENCY?
• MITIGATE SOURCES OF INCONSISTENCY
• TOO FEW MEMBERS
• Run large ensemble
• MODEL ERRORS
• Make models more realistic
• INSUFFICIENT ENSEMBLE SPREAD
• Enhance models so they can represent model related forecast
uncertainty
• OTHERWISE =>
• STATISTICALLY ADJUST FCST TO REDUCE INCONSISTENCY
• Not the preferred way of doing it
• What we learn can feed back into development to mitigate the problem at its source
• Can have LARGE impact on (inexperienced) users
• Two separate issues
• Bias correct against NWP analysis (see the sketch below)
• Reduce lead time dependent model behavior
• Downscale NWP analysis
• Connect with observed variables that are unresolved by NWP models
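A minimal sketch of the first issue, lead-time-dependent bias correction against the NWP analysis: estimate the mean forecast-minus-analysis error per lead time from a training sample and subtract it from new forecasts. Operational schemes typically use adaptive (e.g., decaying-average) estimates; the static version and the data below are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_leads = 300, 5

# Synthetic training sample of forecast-minus-analysis errors, with a bias
# that grows with lead time (illustrative of lead-time-dependent model drift)
true_bias = np.array([0.2, 0.5, 0.9, 1.3, 1.6])
train_errors = true_bias + rng.normal(0.0, 1.0, (n_train, n_leads))

# Estimate the mean systematic error per lead time from the training period
bias_estimate = train_errors.mean(axis=0)

# Apply the correction to a new forecast (one value per lead time)
new_forecast = np.array([10.0, 10.5, 11.0, 11.5, 12.0])
corrected = new_forecast - bias_estimate
print("estimated bias per lead time:", np.round(bias_estimate, 2))
print("corrected forecast:         ", np.round(corrected, 2))
```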
SCIENCE OF FORECASTING
• Ultimate goal of science
– Forecasting – Prominent role of Meteorology
• Weather forecasting constantly in public’s eye
• Approach
– Observe what is relevant and available
• Analyze data
– Create general knowledge about nature
• Generalization & abstraction – Laws, relationships
– Build model of reality
• Conceptual, Analogs, Quantitative/numerical, incl. physical processes
– Predict what’s not observable in
• Space – eg, data assimilation
• Time - eg, future weather
• Variables / processes
– Verify (ie, compare with observations)
• Determine to what extent model represents reality
• Assess if predictions have any utility
– Improve general knowledge and model
USER REQUIREMENTS
• General characteristics of forecast users
– Each user affected in specific way by
• Various weather elements at
• Different points in time &
• Space
• Requirements for optimal decisions related to operations affected by
weather
– Possible weather scenarios with covariances across
• Variables, space, and time
• Provision of weather information
– Specific information needed by each user can’t be foreseen or provided
• Only commonly used, 'vanilla'-type probabilistic info can be distributed
– Ensemble data must be made accessible to users in statistically reliable form
• All forecast info can be derived from this, including vanilla probabilistic products
• IT infrastructure requirements
– Staging ground for ensemble data (disc)
– Sophisticated data access / interrogation tools
• Subset ensemble data, derive required parameters
– Technical achievements - NOMADS, AWIPS2
– Telecommunication (bandwidth)
USER REQUIREMENTS:
PROBABILISTIC FORECAST INFORMATION IS CRITICAL
PREDICTIONS IN TIME
• Method
– Use model of nature for projection in time
– Start model with estimate of state of nature at “initial” time
• Sources of errors
– Discrepancy between model and nature
• Added at every time step
– Discrepancy between estimated and actual state of nature
• Initial error
• Chaotic systems
– Common type of dynamical systems
• Characterized by at least one perturbation pattern that amplifies
– All errors project onto amplifying directions
• Any initial and/or model error
– Predictability limited
• Ed Lorenz’ legacy
• Verification quantifies situation