What’s Strange About Recent Events (WSARE v3.0)

Bayesian Network Anomaly Pattern
Detection for Disease Outbreaks
Weng-Keen Wong (Carnegie Mellon University)
Andrew Moore (Carnegie Mellon University)
Gregory Cooper (University of Pittsburgh)
Michael Wagner (University of Pittsburgh)
The Problem
Suppose we have real-time access to Emergency
Department data from hospitals around a city
(with patient confidentiality preserved)
Primary Key  Date    Time   Hospital  ICD9  Prodrome     Gender  Age  Home Location  Work Location  Many more…
100          6/1/03  9:12   1         781   Fever        M       20s  NE             ?              …
101          6/1/03  10:45  1         787   Diarrhea     F       40s  NE             NE             …
102          6/1/03  11:03  1         786   Respiratory  F       60s  NE             N              …
:            :       :      :         :     :            :       :    :              :              :
From this data, can we detect if a disease outbreak is
happening? How early can we detect it?
The question we’re really asking: what’s strange about
recent events?
Traditional Approaches
What about using traditional anomaly detection?
• Typically assumes data is generated by a model
• Finds individual data points that have low probability with respect to this model
• These outliers have rare attributes or combinations of attributes
• Need to identify anomalous patterns, not isolated data points
Traditional Approaches
What about monitoring aggregate daily counts of certain attributes?
[Figure: Number of ED Visits per Day; x-axis: Day Number, y-axis: Number of ED Visits]
• We’ve now turned multivariate data into univariate data
• Lots of algorithms have been developed for monitoring univariate data:
– Time series algorithms
– Regression techniques
– Statistical Quality Control methods
• Need to know a priori which attributes to form daily aggregates for!
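The aggregate-count approach can be sketched in Python as follows. This is only an illustration under stated assumptions: the record layout (dictionaries with a "Date" key), the function names `daily_counts` and `exceeds_control_limit`, and the mean-plus-k-standard-deviations rule are all hypothetical stand-ins for the Statistical Quality Control methods mentioned above, not something taken from the talk.

```python
from collections import Counter
from statistics import mean, stdev

def daily_counts(records, attribute, value):
    """Count records per day where record[attribute] == value.

    Days with no matching record are simply absent from the result.
    """
    counts = Counter()
    for record in records:
        if record[attribute] == value:
            counts[record["Date"]] += 1
    return counts

def exceeds_control_limit(history, today, k=3.0):
    """Return True if today's count is above mean + k * stdev of past counts."""
    return today > mean(history) + k * stdev(history)

# Hypothetical history of daily Fever counts, followed by two candidate values for today.
history = [10, 12, 11, 9, 10, 13, 11]
print(exceeds_control_limit(history, 20))  # well above the historical range
print(exceeds_control_limit(history, 12))  # within the historical range
```

Note that the attribute and value to count ("Prodrome" = "Fever", say) must be chosen in advance, which is exactly the limitation the last bullet points out.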
Traditional Approaches
What if we don’t know what attributes to
monitor?
What if we want to exploit the spatial,
temporal and/or demographic characteristics
of the epidemic to detect the outbreak as
early as possible?
One Possible Approach
Recent records (from today)
Primary Key  Date     Time   Gender  Age     …
100          8/24/03  9:12   M       Child   …
101          8/24/03  10:45  M       Senior  …
:            :        :      :       :       :
Baseline records (from 7 days ago)
Primary Key  Date     Time   Gender  Age     …
2164         8/17/03  13:05  F       Senior  …
2165         8/17/03  13:57  F       Senior  …
:            :        :      :       :       :
Idea: Can use association rules to find patterns in today’s records that weren’t there in past data
Primary Key  Date     Time   …  Source
100          8/24/03  9:12   …  Recent
101          8/24/03  10:45  …  Recent
:            :        :      :  :
2164         8/17/03  13:05  …  Baseline
2165         8/17/03  13:57  …  Baseline
:            :        :      :  :
Find which rules predict unusually high proportions in recent records when compared to the baseline, e.g.
52/200 records from “recent” have Gender = Male AND Age = Senior
90/180 records from “baseline” have Gender = Male AND Age = Senior
Which rules do we report?
• Search over all rules with at most 2 components
• For each rule, form a 2x2 contingency table, e.g.
                     CountRecent  CountBaseline
Home Location = NW   48           45
Home Location ≠ NW   86           220
• Perform Fisher’s Exact Test to get a p-value for
each rule (call this the score)
• Report the rule with the lowest score
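The scoring step for a single rule can be sketched in Python using only the standard library. The function name `fisher_exact_two_sided` is hypothetical, and the choice of a two-sided test is an assumption, since the slides do not say which alternative hypothesis is used. The counts are the ones from the Home Location = NW example.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    The margins are held fixed, so the top-left cell follows a
    hypergeometric distribution under the null hypothesis.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Probability of a table whose top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum over every table that is no more likely than the observed one
    # (small tolerance guards against floating-point ties).
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)

# Rows: rule matches / does not match. Columns: recent / baseline.
score = fisher_exact_two_sided(48, 45, 86, 220)
print(score)
```

Running this for every candidate rule and keeping the smallest value gives the best-scoring rule described above.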
Problem #1: Multiple Hypothesis Testing
• Can’t interpret the rule scores as p-values
• Suppose we reject the null hypothesis when score < α, where α = 0.05
• For a single hypothesis test, the probability of making a false discovery = α
• Suppose we do 1000 tests, one for each possible rule
• Probability(false discovery) could be as bad as:
1 – (1 – 0.05)^1000 >> 0.05
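The worst-case figure above can be checked numerically. The sketch below assumes the tests are independent, which is the case the formula describes; the function name `family_wise_error_rate` is hypothetical.

```python
def family_wise_error_rate(alpha, num_tests):
    """Probability of at least one false discovery across independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests

# The error rate climbs quickly as the number of tested rules grows.
for n in (1, 10, 100, 1000):
    print(n, family_wise_error_rate(0.05, n))
```

With 1000 tests at α = 0.05 the result is indistinguishable from 1 in floating point, so the best rule score on its own says little about significance.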
Solution: Randomization Test
Case  Original date  Shuffled date
C2    Aug 16, 2003   Aug 16, 2003
C3    Aug 17, 2003   Aug 17, 2003
C4    Aug 17, 2003   Aug 24, 2003
C5    Aug 17, 2003   Aug 17, 2003
C6    Aug 17, 2003   Aug 24, 2003
C7    Aug 17, 2003   Aug 17, 2003
C8    Aug 21, 2003   Aug 21, 2003
C9    Aug 21, 2003   Aug 21, 2003
C10   Aug 22, 2003   Aug 22, 2003
C11   Aug 22, 2003   Aug 22, 2003
C12   Aug 23, 2003   Aug 23, 2003
C13   Aug 23, 2003   Aug 23, 2003
C14   Aug 24, 2003   Aug 17, 2003
C15   Aug 24, 2003   Aug 17, 2003
• Take the recent cases and the baseline cases. Shuffle the date field to produce a randomized dataset called DBRand
• Find the rule with the best score on DBRand.
Randomization Test
Repeat the procedure on the
previous slide for 1000
iterations. Determine how
many scores from the 1000
iterations are better than the
original score.
If the original score were here, it would
place in the top 1% of the 1000 scores from
the randomization test. We would be
impressed and an alert should be raised.
Corrected p-value of the rule is:
# better scores / # iterations
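A minimal Python sketch of the randomization test follows. It assumes that shuffling the date field is equivalent to shuffling the recent/baseline label of each record, and that lower scores are better. The names `randomization_p_value` and `toy_best_score` are hypothetical, and the toy score (a negated difference in proportions for one fixed rule) is a stand-in for the Fisher's Exact Test search over all rules.

```python
import random

def randomization_p_value(labels, best_score, iterations=1000, seed=0):
    """Corrected p-value: fraction of shuffled datasets with a better best score.

    labels: list of "recent" / "baseline", one entry per record.
    best_score: function mapping a label list to the best (lowest) rule score.
    """
    rng = random.Random(seed)
    original = best_score(labels)
    shuffled = list(labels)
    better = 0
    for _ in range(iterations):
        rng.shuffle(shuffled)  # plays the role of building DBRand
        if best_score(shuffled) < original:
            better += 1
    return better / iterations

# Toy data: 20 recent records all match a rule, 80 baseline records do not.
matches = [True] * 20 + [False] * 80
labels = ["recent"] * 20 + ["baseline"] * 80

def toy_best_score(current_labels):
    """Negated (recent proportion - baseline proportion) for the one toy rule."""
    n_recent = current_labels.count("recent")
    n_baseline = len(current_labels) - n_recent
    recent_hits = sum(m for m, lab in zip(matches, current_labels) if lab == "recent")
    baseline_hits = sum(matches) - recent_hits
    return -(recent_hits / n_recent - baseline_hits / n_baseline)

p_value = randomization_p_value(labels, toy_best_score)
print(p_value)
```

In the toy data no shuffle can beat the original labeling, so the corrected p-value is 0 and an alert would be raised.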
Problem #2: A Changing Baseline
From: Goldenberg, A., Shmueli, G., Caruana,
R. A., and Fienberg, S. E. (2002). Early
statistical detection of anthrax outbreaks
by tracking over-the-counter medication
sales. Proceedings of the National
Academy of Sciences (pp. 5237-5249)
• Baseline is affected by temporal trends in health care data, e.g.:
– Seasonal effects in temperature and weather
– Day of Week effects
– Holidays
• Choosing the wrong baseline distribution can affect the detection time and false positive rate
Solution: Bayesian Network
1. Learn Bayesian Network from All Historical Data using Optimal Reinsertion [Moore and Wong 2003]
2. Generate Baseline given Today’s Environment
Environmental Attributes
Divide the data into two types of attributes:
• Environmental attributes: attributes that cause trends in the data, e.g. day of week, season, weather, flu levels
• Response attributes: all other non-environmental attributes
Environmental Attributes
When learning the Bayesian network structure, do not allow
environmental attributes to have parents.
Why?
• We are not interested in predicting their distributions
• Instead, we use them to predict the distributions of the response
attributes
Side Benefit: We can speed up the structure search by avoiding
DAGs that assign parents to the environmental attributes
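The structural restriction can be sketched in Python as a filter on candidate edges. This is an illustration only: the function names and the response attributes "Gender" and "Age" used in the example are assumptions, and the actual structure search (Optimal Reinsertion) is not shown.

```python
# Environmental attributes named on the slides.
ENVIRONMENTAL = {"Season", "Day of Week", "Weather", "Flu Level"}

def edge_allowed(parent, child, environmental=ENVIRONMENTAL):
    """An edge parent -> child is allowed unless it gives an environmental attribute a parent."""
    return child not in environmental

def candidate_edges(attributes, environmental=ENVIRONMENTAL):
    """All directed edges a structure search may consider under the restriction."""
    return [
        (parent, child)
        for parent in attributes
        for child in attributes
        if parent != child and edge_allowed(parent, child, environmental)
    ]

attributes = sorted(ENVIRONMENTAL) + ["Gender", "Age"]
edges = candidate_edges(attributes)
print(len(edges), "of", len(attributes) * (len(attributes) - 1), "possible edges remain")
```

With four environmental and two response attributes, only 10 of the 30 possible directed edges remain, which illustrates why the restriction shrinks the search space.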
[Figure: network with environmental attribute nodes Season, Day of Week, Weather, Flu Level]
Generate Baseline Given Today’s Environment
Suppose we know the following for today:
Season  Day of Week  Weather  Flu Level
Winter  Monday       Snow     High
We fill in these values for the environmental attributes in the learned Bayesian network:
Season = Winter, Day of Week = Monday, Weather = Snow, Flu Level = High
We sample 10000 records from the Bayesian network and make this data set the baseline
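Baseline generation can be sketched in Python with a tiny hand-specified network. Everything here is hypothetical: the single response attribute, the conditional probability table `CPT_RESPIRATORY`, and its numbers are made up for illustration, whereas the real system learns the network from historical data. Because environmental attributes have no parents, fixing their values and then sampling the response attributes forward yields records conditioned on today's environment.

```python
import random

# Hypothetical CPT: P(Respiratory Syndrome = True | Season, Flu Level)
CPT_RESPIRATORY = {
    ("Winter", "High"): 0.30,
    ("Winter", "Low"): 0.10,
    ("Summer", "High"): 0.15,
    ("Summer", "Low"): 0.05,
}

def sample_baseline(environment, n=10000, seed=0):
    """Sample n baseline records with the environmental attributes held fixed."""
    rng = random.Random(seed)
    p_true = CPT_RESPIRATORY[(environment["Season"], environment["Flu Level"])]
    records = []
    for _ in range(n):
        record = dict(environment)  # environmental values are copied, not sampled
        record["Respiratory Syndrome"] = rng.random() < p_true
        records.append(record)
    return records

baseline = sample_baseline({"Season": "Winter", "Flu Level": "High"})
rate = sum(r["Respiratory Syndrome"] for r in baseline) / len(baseline)
print(len(baseline), rate)
```

The sampled baseline then takes the place of the records from 7 days ago when building the contingency table for each rule.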
What’s Strange About Recent Events (WSARE)
1. Obtain Recent and Baseline datasets
2. Search for rule with best score
3. Determine p-value of best scoring rule
4. If p-value is less than threshold, signal alert
Simulator
Results on Simulation
Results on Actual ED Data from 2001
1. Sat 2001-02-13: SCORE = -0.00000004 PVALUE = 0.00000000
14.80% ( 74/500) of today's cases have Viral Syndrome = True and Encephalitic Syndrome = False
7.42% (742/10000) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False
2. Sat 2001-03-13: SCORE = -0.00000464 PVALUE = 0.00000000
12.42% ( 58/467) of today's cases have Respiratory Syndrome = True
6.53% (653/10000) of baseline have Respiratory Syndrome = True
3. Wed 2001-06-30: SCORE = -0.00000013 PVALUE = 0.00000000
1.44% ( 9/625) of today's cases have 100 <= Age < 110
0.08% ( 8/10000) of baseline have 100 <= Age < 110
4. Sun 2001-08-08: SCORE = -0.00000007 PVALUE = 0.00000000
83.80% (481/574) of today's cases have Unknown Syndrome = False
74.29% (7430/10001) of baseline have Unknown Syndrome = False
5. Thu 2001-12-02: SCORE = -0.00000087 PVALUE = 0.00000000
14.71% ( 70/476) of today's cases have Viral Syndrome = True and Encephalitic Syndrome = False
7.89% (789/9999) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False
6. Thu 2001-12-09: SCORE = -0.00000000 PVALUE = 0.00000000
8.58% ( 38/443) of today's cases have Hospital ID = 1 and Viral Syndrome = True
2.40% (240/10000) of baseline have Hospital ID = 1 and Viral Syndrome = True
Related Work
• Deviations between models induced by two datasets
[Ganti, Gehrke and Ramakrishnan]
• Emerging Patterns [Dong and Li]
• Mining Surprising Patterns using Temporal Description
Length [Chakrabarti, Sarawagi and Dom]
• Contrast sets [Bay and Pazzani]
• Association Rules and Data Mining in Hospital Infection Control and Public Health Surveillance [Brossette et al.]
• Spatial Scan Statistic [Kulldorff]
Conclusion
• One approach to biosurveillance: one algorithm monitoring millions of signals derived from multivariate data instead of hundreds of univariate detectors
• WSARE is best used as a general purpose safety
net in combination with other detectors
• Careful evaluation of statistical significance
• Modeling historical data with Bayesian Networks
to allow conditioning on unique features of today
Software: http://www.autonlab.org/