
Space-Time Scan Statistics for Early
Warning Systems
Martin Kulldorff
Department of Ambulatory Care and Prevention
Harvard University Medical School
and Harvard Pilgrim Health Care
Content
• Background on Disease Surveillance
• Purely Spatial Scan Statistics: Brain Cancer
in the United States
• Early Warning System using a Space-Time
Permutation Scan Statistic: Syndromic
Surveillance in New York City
• Various Extensions
Collaborators
Harvard Medical School: Ken Kleinman,
Richard Platt, Katherine Yih
New York City Dep Health: Jessica Hartman,
Rick Heffernan, Farzad Mostashari
University of Connecticut: David Gregorio,
Zixing Fang
Universidade Federal de Minas Gerais: Renato
Assunção, Luiz Duczmal
Importance of Early Disease
Outbreak Detection
• Eliminate health hazards
• Warn about risk factors
• Earlier diagnosis of new cases
• Quarantine cases
• Scientific research concerning treatments,
vaccines, etc.
• Early detection is especially critical for
infectious diseases
Disease Surveillance
Data Sources
• Disease Registries
• Reportable Diseases
• Electronic Health
Records
• Health Insurance
Claims Data
• Vital Statistics
(Mortality)
Types of Data
• Diagnosed Diseases
• Symptoms (Syndromic
Surveillance)
• Lab Test Results
• Pharmaceutical Drug
Sales
Disease Surveillance
Frequency of Analyses
• Daily
• Weekly
• Monthly
• Yearly
Purely Temporal Methods
Farrington CP, Andrews NJ, Beale AD, Catchpole MA
(1996) A statistical algorithm for the early detection of
outbreaks of infectious disease. J R Stat Soc A Stat
Soc 159: 547–563.
Hutwagner LC, Maloney EK, Bean NH, Slutsker L,
Martin SM (1997) Using laboratory-based
surveillance data for prevention: An algorithm for
detecting salmonella outbreaks. Emerg Infect Dis 3:
395–400.
Nobre FF, Stroup DF (1994) A monitoring system to
detect changes in public health surveillance data. Int
J Epidemiol 23: 408–418.
Reis B, Mandl K (2003) Time series modeling for
syndromic surveillance. BMC Med Inform Decis Mak
3: 2.
Three Important Issues
• An outbreak may start locally.
• Purely temporal methods can be used
simultaneously for multiple geographical
areas, but that leads to multiple testing.
• Disease outbreaks may not conform to the
pre-specified geographical areas.
Why Use a Scan Statistic?
With disease outbreaks:
• We do not know where they will occur.
• We do not know their geographical size.
• We do not know when they will occur.
• We do not know how rapidly they will
emerge.
One-Dimensional Scan Statistic
The Spatial Scan Statistic
Create a regular or irregular grid of centroids
covering the whole study region.
Create an infinite number of circles around each
centroid, with the radius anywhere from zero up
to a maximum so that at most 50 percent of the
population is included.
For each circle:
– Obtain actual and expected number of cases inside and
outside the circle.
– Calculate likelihood function.
Compare Circles:
– Pick circle with highest likelihood function as Most Likely
Cluster.
Inference:
– Generate random replicas of the data set under the null
hypothesis of no clusters (Monte Carlo sampling).
– Compare most likely clusters in real and random data sets
(Likelihood ratio test).
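The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SaTScan implementation: circles are centred on the data locations themselves, expected counts are simply proportional to population (no covariate adjustment), only high-rate clusters are scanned, and locations at identical distances are added one at a time. The function names `log_lr`, `most_likely_cluster` and `monte_carlo_p` are hypothetical.

```python
import numpy as np

def log_lr(c, mu, C):
    """Poisson log likelihood ratio for one circle (high-rate clusters only)."""
    if c <= mu:
        return 0.0
    outside = (C - c) * np.log((C - c) / (C - mu)) if c < C else 0.0
    return c * np.log(c / mu) + outside

def most_likely_cluster(xy, cases, pop, max_frac=0.5):
    """Scan circles around every location; return (best log LR, member indices)."""
    C, P = cases.sum(), pop.sum()
    dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    best, best_members = 0.0, np.array([], dtype=int)
    for i in range(len(xy)):
        order = np.argsort(dist[i])            # nearest neighbours first
        cum_pop = np.cumsum(pop[order])
        cum_cases = np.cumsum(cases[order])
        for k in range(len(order)):
            if cum_pop[k] > max_frac * P:      # at most 50% of the population
                break
            llr = log_lr(cum_cases[k], C * cum_pop[k] / P, C)
            if llr > best:
                best, best_members = llr, order[: k + 1]
    return best, best_members

def monte_carlo_p(xy, cases, pop, n_rep=99, seed=0):
    """Compare the observed maximum with maxima from null replicas."""
    rng = np.random.default_rng(seed)
    observed, members = most_likely_cluster(xy, cases, pop)
    C = int(cases.sum())
    null = [
        most_likely_cluster(xy, rng.multinomial(C, pop / pop.sum()), pop)[0]
        for _ in range(n_rep)
    ]
    rank = 1 + sum(t >= observed for t in null)
    return observed, members, rank / (n_rep + 1)
```

Because the maximum over all circles is recomputed in every random replica, the resulting p-value already accounts for the many circles evaluated.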
Poisson Likelihood Function
$$\left(\frac{c}{\mu}\right)^{c} \left(\frac{C-c}{C-\mu}\right)^{C-c}$$
c = cases in circle
μ = expected cases in circle
C = total cases
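For large counts the expression is usually evaluated on the log scale to avoid overflow. Taking logarithms of the likelihood function above, with c, μ and C as defined on this slide, gives the following (a routine rewriting, not an additional result from the talk):

```latex
\log\left[\left(\frac{c}{\mu}\right)^{c}\left(\frac{C-c}{C-\mu}\right)^{C-c}\right]
  = c \,\log\frac{c}{\mu} + (C-c)\,\log\frac{C-c}{C-\mu}
```

The value is zero when c = μ and grows as the observed count departs from the expected count.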
Spatial Scan Statistic: Properties
– Adjusts for inhomogeneous population density.
– Simultaneously tests for clusters of any size and
any location, by using circular windows with
continuously variable radius.
– Accounts for multiple testing.
– Possibility to include confounding variables, such
as age, sex or socio-economic variables.
– Aggregated or non-aggregated data (states,
counties, census tracts, block groups, households,
individuals).
U.S. Brain Cancer Mortality
1986-1995
                      deaths    rate* (95% CI)
Children (age <20):   5,062     0.75 (0.66-0.83)
Adults (age 20+):     106,710   6.0 (5.8-6.2)
Adult Women:          48,650    4.9 (4.7-5.0)
Adult Men:            58,060    7.2 (7.0-7.5)
* annual deaths / 100,000
Brain Cancer
Known risk factors:
• High dose ionizing radiation
• Selected congenital and genetic disorders
Explains only a small percent of cases.
Potential risk factors:
N-nitroso compounds?, phenols?, pesticides?,
polycyclic aromatic hydrocarbons?, organic
solvents?
Adjustments
All subsequent analyses were adjusted for:
• Age
• Gender
• Ethnicity (African-American, White, Other)
Brain Cancer Mortality, Children 1986-1995
[Map: SMR by county. Legend: 2.07-42.82 (highest 10%); 1.20-2.06; 0.83-1.19; 0.50-0.82; zero cases (1867 counties). Scale bar in miles.]
Spatial Scan Statistic, Children
[Map: clusters numbered 1-7. Risk factor color key: High Risk, Not Significant. Scale bar in miles.]
Children: Seven Most Likely Clusters
Cluster            Obs   Exp   RR    p=
1. Carolinas       86    51    1.7   0.24
2. California      16    4.9   3.3   0.74
3. Michigan        318   250   1.3   0.74
4. S Carolina      24    10    2.5   0.79
5. Kentucky-Tenn   127   88    1.4   0.79
6. Wisconsin       10    2.4   4.1   0.98
7. Nebraska        12    3.6   3.3   0.99
Conclusions: Children
No statistically significant clusters detected.
Any part of the pattern seen on the original
map may be due to chance.
What About Adults?
Brain Cancer Mortality, Adults 1986-1995
[Map: SMR by county. Legend: 9.46-24.44 (highest 10%); 8.05-9.45; 7.27-8.04; 6.72-7.26; 6.17-6.71; 5.68-6.16; 5.19-5.67; 4.51-5.18; 3.40-4.50; zero cases (312 counties). Scale bar in miles.]
Spatial Scan Statistic: Adults
[Map: clusters numbered 1-13.]
Spatial Scan Statistic, Women
[Map: clusters numbered 1-13. Risk factor color key: Low Risk, p < 0.05; High Risk, p < 0.05; Low Risk, Not Significant; High Risk, Not Significant. Scale bar in miles.]
Women: Most Likely Clusters
Cluster                 Obs    Exp    RR     p=
1. Arkansas et al.      2830   2328   1.22   0.0001
2. Carolinas            1783   1518   1.17   0.0001
3. Oklahoma et al.      1709   1496   1.14   0.003
4. Minnesota et al.     2616   2369   1.10   0.01
10. N.J. / N.Y.         1809   2300   0.79   0.0001
11. S Texas             127    214    0.59   0.0001
12. New Mexico et al.   849    1049   0.81   0.0001
Spatial Scan Statistic: Men
[Map: clusters numbered 1-15. Risk factor color key: Low Risk, p < 0.05; High Risk, Not Significant; High Risk, p < 0.05. Scale bar in miles.]
Men: Most Likely Clusters
Cluster                   Obs    Exp    RR     p=
1. Kentucky et al.        3295   2860   1.15   0.0001
2. Carolinas              1925   1658   1.16   0.0001
3. Arkansas et al.        1143   964    1.19   0.001
4. Washington et al.      1664   1455   1.14   0.003
5. Michigan               1251   1074   1.17   0.005
11. N.J. / N.Y.           2084   2615   0.80   0.0001
12. S Texas               157    262    0.60   0.0001
13. New Mexico et al.     1418   1680   0.84   0.0001
14. Upstate N.Y. et al.   1642   1895   0.87   0.0001
Conclusions: Adults
It is possible to pinpoint specific areas with higher
and lower rates that are statistically significant, and
unlikely to be due to chance.
The exact borders of detected clusters are
uncertain.
Similar patterns for men and women.
Conclusion: General
The spatial scan statistic can be useful as an
addition to disease maps, in order to determine
if the observed patterns are likely due to
chance or not.
A complement rather than a replacement for
regular disease maps.
Space-Time Scan Statistic
Use a cylindrical window, with the
circular base representing space and the
height representing time.
We will only consider cylinders that
reach the present time.
For each cylinder:
– Obtain actual and expected number of cases inside and
outside the cylinder.
– Calculate likelihood function.
Compare Cylinders:
– Pick cylinder with highest likelihood function as Most Likely
Cluster.
Inference:
– Generate random replicas of the data set under the null
hypothesis of no clusters (Monte Carlo sampling).
– Compare most likely clusters in real and random data sets
(Likelihood ratio test).
Space-Time Permutation
Scan Statistic
1. For each cylinder, calculate the expected
number of cases conditioning on the marginals
$$\mu_{st} = \frac{\left(\sum_{s} c_{st}\right)\left(\sum_{t} c_{st}\right)}{C}$$
where $c_{st}$ = # cases at time t in location s
and $C$ = total number of cases
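The expected counts in step 1 are the outer product of the location totals and the time totals, divided by the grand total. A minimal sketch, assuming the cases have already been tabulated into a location-by-day array (the names `expected_from_marginals` and `cylinder_expected` are hypothetical, and the expected count of a cylinder is taken here to be the sum over its location-day cells):

```python
import numpy as np

def expected_from_marginals(counts):
    """counts[s, t] = cases in location s on day t; returns mu[s, t]."""
    counts = np.asarray(counts, dtype=float)
    C = counts.sum()                          # total number of cases
    location_totals = counts.sum(axis=1)      # sum over t for each s
    time_totals = counts.sum(axis=0)          # sum over s for each t
    return np.outer(location_totals, time_totals) / C

def cylinder_expected(mu, locations, days):
    """Expected cases in a cylinder: sum of mu over its locations and days."""
    return mu[np.ix_(locations, days)].sum()

# Tiny made-up table: 2 locations, 2 days.
mu = expected_from_marginals([[2, 0], [1, 1]])
```

Because only the case counts themselves enter the calculation, no population-at-risk data are needed.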
Space-Time Permutation
Scan Statistic
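2. For each cylinder, calculate
$$T_{st} = \begin{cases} \left(\dfrac{c_{st}}{\mu_{st}}\right)^{c_{st}} \left(\dfrac{C - c_{st}}{C - \mu_{st}}\right)^{C - c_{st}} & \text{if } c_{st} > \mu_{st} \\ 1 & \text{otherwise} \end{cases}$$
3. Test statistic $T = \max_{st} T_{st}$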
Space-Time Permutation
Scan Statistic
4. Generate random replicas of the data set
conditioned on the marginals, by permuting
the pairs of spatial locations and times.
5. Compare test statistic in real and random
data sets using Monte Carlo hypothesis
testing (Dwass, 1957):
p = rank(Treal) / (1+#replicas)
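Steps 4 and 5 can be sketched as follows. This is an illustration under assumptions, not the published implementation: each case is given as a (location index, day index) pair, candidate spatial zones are supplied by the caller as lists of location indices, cylinders cover the last 1 to `max_days` days, and the statistic is kept on the log scale. Shuffling the day labels among cases leaves both the location totals and the day totals unchanged. The names `max_log_T` and `permutation_p_value` are hypothetical.

```python
import numpy as np

def max_log_T(counts, zones, max_days):
    """Largest log T_st over cylinders (zone x last k days, k = 1..max_days)."""
    C = counts.sum()
    mu = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / C
    best = 0.0                                   # log(1): no excess anywhere
    for zone in zones:
        rows = np.asarray(zone)
        for k in range(1, max_days + 1):
            c = counts[rows, -k:].sum()
            m = mu[rows, -k:].sum()
            if c > m:
                rest = (C - c) * np.log((C - c) / (C - m)) if c < C else 0.0
                best = max(best, c * np.log(c / m) + rest)
    return best

def permutation_p_value(loc, day, n_loc, n_day, zones, max_days,
                        n_rep=999, seed=0):
    """Monte Carlo p-value: rank of the real statistic among permuted replicas."""
    rng = np.random.default_rng(seed)

    def table(d):
        t = np.zeros((n_loc, n_day))
        np.add.at(t, (loc, d), 1)                # count cases per (location, day)
        return t

    t_real = max_log_T(table(day), zones, max_days)
    t_null = [max_log_T(table(rng.permutation(day)), zones, max_days)
              for _ in range(n_rep)]
    rank = 1 + sum(t >= t_real for t in t_null)
    return t_real, rank / (1 + n_rep)
```

With 999 replicas the smallest attainable p-value is 1/1000, so the number of replicas limits how long a recurrence interval can be reported.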
Space-Time Permutation
Scan Statistic: Properties
– Adjusts for purely geographical clusters.
– Adjusts for purely temporal clusters.
– Simultaneously tests for outbreaks of any
size at any location, by using cylindrical
windows with variable radius and height.
– Accounts for multiple testing.
– Aggregated or non-aggregated data
(counties, zip-code areas, census tracts,
individuals, etc).
Let’s Try It!
• Historic data, Nov 15, 2001 – Nov 14, 2002
• Diarrhea, all age groups
• Use last 30 days of data.
• Temporal window size: 1-7 days
• Spatial window size: 0-5 kilometers
• Residential zip code and hospital coordinates
Results: Hospital Analyses
    Date     #days  #hosp  #cases  #exp   RR    p=       recurrence interval
A   Nov 21   6      1      101     73.6   1.4   0.0008   1 / 3.4 years
B   Jan 11   1      1      10      2.3    4.4   0.0007   1 / 3.9 years
C   Feb 26   4      2      97      66.9   1.4   0.0018   1 / 1.5 years
D   Mar 31   2      1      38      19.2   2.0   0.0017   1 / 1.6 years
E   Nov 1    6      3      122     86.6   1.4   0.0017   1 / 1.6 years
F   Nov 2    7      3      135     98.3   1.4   0.0008   1 / 3.4 years
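The slides do not state how the recurrence interval is obtained. The tabulated values are consistent with one analysis per day and an interval of 1/p days, so the sketch below rests on that assumption; the function name is hypothetical.

```python
def recurrence_interval_days(p_value, analyses_per_day=1.0):
    """Expected days between signals of this strength if there are no outbreaks.

    Assumes analyses are run at a fixed rate (one per day by default), so a
    signal with p-value p is expected by chance once every 1/p analyses.
    """
    return 1.0 / (p_value * analyses_per_day)

# Signal A in the hospital table: p = 0.0008, reported as 1 / 3.4 years.
years_a = recurrence_interval_days(0.0008) / 365
```

Expressing a signal as a recurrence interval makes the false-alarm rate of a daily system easier to communicate than a raw p-value.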
Results: Residential Analyses
    Date    #days  #zips  #cases  #exp   RR    p=       recurrence interval
G   Feb 9   2      15     63      34.7   1.8   0.0005   1 / 5.5 years
H   Mar 7   2      8      63      37.3   1.7   0.0027   1 / 1.0 years
[Figure: # of visits (y-axis, 0-200) by month (x-axis, Nov 2001 to Nov 2002). Series: Citywide; Areas with residential signals; Areas with hospital signals. Signals A-H are marked on the plot.]
Real-Time Daily Analyses
• Starting November 1, 2003.
• Respiratory, Fever/Flu, Diarrhea, (+Vomiting)
• Hospital (and Residential) Analyses
• Spatial window size: 0-5 kilometers
• Temporal window size: 1-7 days
Real-Time Results, Nov 24, 2003:
Hospital Analysis
Syndrome      #days  #hosp  #cases  #exp    RR    p=      recurrence interval
Respiratory   2      3      80      57.4    1.4   0.13    every 8 days
Fever/Flu     3      1      24      14.8    1.6   0.68    every day
Diarrhea      2      4      18      8.2     2.2   0.04    every 26 days
Real-Time Results, Nov 25, 2003:
Hospital Analysis
Syndrome      #days  #hosp  #cases  #exp    RR    p=      recurrence interval
Respiratory   7      1      45      30.4    1.5   0.46    every 2 days
Fever/Flu     1      5      50      31.5    1.6   0.04    every 23 days
Diarrhea      3      4      22      11.5    1.9   0.17    every 6 days
Real-Time Results, Nov 26, 2003:
Hospital Analysis
Syndrome      #days  #hosp  #cases  #exp    RR    p=      recurrence interval
Respiratory   5      2      233     199.4   1.1   0.63    every 2 days
Fever/Flu     7      7      299     252.1   1.2   0.05    every 22 days
Diarrhea      4      4      23      12.6    1.8   0.22    every 5 days
Real-Time Results, Nov 27, 2003:
Hospital Analysis
Syndrome      #days  #hosp  #cases  #exp    RR    p=      recurrence interval
Respiratory   1      4      41      26.9    1.5   0.45    every 2 days
Fever/Flu     6      4      181     142.9   1.3   0.03    every 36 days
Diarrhea      5      3      29      14.1    1.7   0.50    every 2 days
Real-Time Results, Nov 28, 2003:
Hospital Analysis
Syndrome      #days  #hosp  #cases  #exp    RR    p=      recurrence interval
Respiratory   2      4      98      78.8    1.2   0.82    every day
Fever/Flu     7      5      228     178.0   1.3   0.001   every 1000 days
Diarrhea      6      3      29      17.5    1.5   0.26    every 4 days
Real-Time Results, Nov 29, 2003:
Hospital Analysis
Syndrome      #days  #hosp  #cases  #exp    RR    p=      recurrence interval
Respiratory   7      2      146     123.6   1.2   0.95    every day
Fever/Flu     7      4      253     195.7   1.3   0.001   every 1000 days
Diarrhea      7      4      44      29.4    1.5   0.21    every 5 days
Real-Time Results, Nov 30, 2003:
Hospital Analysis
Syndrome      #days  #hosp  #cases  #exp    RR    p=      recurrence interval
Respiratory   1      1      19      10.7    1.8   0.69    every day
Fever/Flu     6      9      429     364.1   1.2   0.002   every 500 days
Diarrhea      1      5      12      4.4     2.7   0.06    every 17 days
Summary
Four strong diarrhea signals:
• Two were early signals for city-wide outbreaks likely
due to norovirus.
• One was an early signal for a city-wide outbreak among
children, likely due to rotavirus.
• One small outbreak of unknown etiology.
Three medium strength diarrhea signals:
• All during the rotavirus outbreak, possibly due to a shift
in the geographical epicenter
One real-time fever/flu signal, coinciding with the start of
the flu season.
Different Data Streams
For example:
• Nurses Hotline Calls
• Regular Physician Visits
• Emergency Department Visits
• Ambulance Dispatches
• Pharmaceutical Drug Sales
• Lab Test Results
Multiple Data Streams
For each cylinder, add the Poisson log
likelihoods:
$$T_{st} = \log T_{st}^{[1]} + \log T_{st}^{[2]} + \log T_{st}^{[3]}$$
Test statistic $T = \max_{st} T_{st}$
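The combination rule on this slide amounts to summing one log likelihood ratio per data stream for the same cylinder. A minimal sketch, assuming each stream contributes zero when it shows no excess (mirroring the "= 1, otherwise" rule of the single-stream statistic); the function names are hypothetical and the stream totals in the usage lines are made-up illustration values:

```python
import math

def stream_log_T(c, mu, C):
    """Log T for one data stream in one cylinder; 0 when there is no excess."""
    if c <= mu:
        return 0.0
    rest = (C - c) * math.log((C - c) / (C - mu)) if c < C else 0.0
    return c * math.log(c / mu) + rest

def combined_log_T(streams):
    """Sum the per-stream log T values for a single cylinder.

    streams: iterable of (observed, expected, stream total) tuples.
    """
    return sum(stream_log_T(c, mu, C) for c, mu, C in streams)

# Three streams for one cylinder; only the first shows an excess.
score = combined_log_T([(5, 0.04, 7000), (2, 3.0, 3000), (1, 4.0, 8000)])
```

A cylinder can therefore rank highly either because one stream shows a strong excess or because several streams each show a modest one.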
Syndromic Surveillance in Boston:
Upper and Lower GI
• Harvard Pilgrim Health Care HMO members
cared for by Harvard Vanguard Medical
Associates
• Historical Data from Jan 1 to Dec 31, 2002
• Mimicking Surveillance from Sept 1 to Dec 31,
2002
Three Data Streams
• Telephone Calls ( ~ 20 / day)
• Urgent Care Visits ( ~ 9 / day)
• Regular Physician Visits ( ~ 22 / day)
Multiple contacts by the same person removed.
Strongest Signal: October 18
Recurrence Interval
Multiple Data Streams: < 1 / 1000 days
Single Data Streams:
Tele: < 1 / 1000 days
Urgent: ~ every day
Regular: ~ every day
October 18 Signal
• Friday
• Number of Cases: 5 (all tele)
• Expected Cases: 0.04
• Location: Zip Code 01740
• Time Length: One Day
• Diagnosis: Pinworm Infestation (all 5)
• Same Family: Mother, Father, 3 Kids
Limitations
• Space-time clusters may occur for other reasons
than disease outbreaks
• Automated detection systems do not replace
the observant eyes of physicians and other health
workers.
• Epidemiological investigations by public health
departments are needed to confirm or dismiss the
signals.
Scan Statistics for
Irregular Shaped Clusters
Duczmal, Assunção. A simulated annealing strategy for the
detection of arbitrarily shaped spatial clusters. Computational
Statistics and Data Analysis, 2004.
Patil, Taillie. Upper level set scan statistic for detecting arbitrarily
shaped hotspots. Environmental and Ecological Statistics, 2004.
Iyengar. Space-time clusters with flexible shapes. Morbidity and
Mortality Weekly Report, 2005.
Tango, Takahashi. A flexibly shaped spatial scan statistic for
detecting clusters. Int J Health Geographics, 2005.
Assunção, Costa, Tavares, Ferreira. Fast detection of arbitrarily
shaped disease clusters. Statistics in Medicine, 2006.
Probability Models
• Poisson model (e.g. incidence, mortality)
• Bernoulli model (e.g. case-control data)
• Normal model (e.g. weight, blood lead levels)
• Exponential model (e.g. survival data)
• Ordinal model (e.g. early, medium and late
stage cancer)
• Space-time permutation model (when only
case data is available)
Application Areas
• Chronic Diseases
• Infectious Diseases
• Health Services
• Accidents
• Brain Imaging
• Toxicology
• Veterinary Medicine
• Psychology
• Demography
• Criminology
• History
• Archeology
• Ecology
Examples of Applications
Beato Filho, Assunção, Silva, Marinho, Reis, Almeida. Homicide
clusters and drug traffic in Belo Horizonte, Minas Gerais, Brazil
from 1995 to 1999. Cadernos de Saúde Pública, 2001.
Pellegrini. Analise espaço-temporal da leptospirose no municipio do
Rio de Janeiro. Fiocruz, 2002.
Andrade, Silva, Martelli, Oliveira, Morais Neto, Siqueira Junior,
Melo, Di Fabio. Population-based surveillance of pediatric
pneumonia: use of spatial analysis in an urban area of Central
Brazil. Cadernos de Saúde Pública, 2004.
Ceccato. Homicide in São Paulo, Brazil: Assessing spatial-temporal
and weather variations. J Environmental Psychology, 2005.
Simões, Mendes, Marques, Pereira, Bagagli. Spatial clusters of
paracoccidioidomycosis in southeastern Brazil. Revista do
Instituto de Medicina Tropical de São Paulo, 2005.
SaTScan Software
Free. Download from www.satscan.org
Registered users in 116 countries:
1. USA
2. Canada
3. United Kingdom
4. Brazil
5. Italy
...
100s. Albania, Bhutan, Burma, Fiji, Grenada, Guinea,
Iraq, Macao, Madagascar, Malawi, Malta, etc
Future Topics
• Irregular shaped clusters
• Non-Euclidean neighbor definitions
• Multivariate data
• Multiple locations per observation
• Computational speed
Acknowledgement
Research funded by:
Alfred P Sloan Foundation
Centers for Disease Control and Prevention
Massachusetts Department of Health
National Cancer Institute
National Institute of Child Health and Development
National Institute of General Medical Sciences:
Modeling Infectious Disease Agent Study (MIDAS)
References
Kulldorff. A spatial scan statistic. Communications in Statistics,
Theory and Methods. 26:1481-1496, 1997.
Fang, Kulldorff, Gregorio. Brain cancer in the United States
1986-1995, A Geographical Analysis. Neuro-Oncology, 6:179-187, 2004.
Kulldorff, Heffernan, Hartman, Assunção, Mostashari. A space-time
permutation scan statistic for disease outbreak detection. PLoS
Medicine, 2(3):e59, 2005.
Kulldorff, Mostashari, Duczmal, Yih, Kleinman, Platt. Multivariate
spatial scan statistics for disease surveillance. Statistics in
Medicine, 2006, in press.
Kulldorff and IMS Inc. SaTScan v.7.0: Software for the spatial and
space-time scan statistics, 2004. Free: http://www.satscan.org/