Data Mining for the Early Detection of Disease Outbreaks
Weng-Keen Wong, School of EECS, Oregon State University
Email: [email protected]
Joint work with the RODS Lab (University of Pittsburgh) and the AUTON Lab (Carnegie Mellon University)
Introduction
The threat of a deadly disease outbreak is very real. There are two scenarios of concern:
1. Naturally occurring outbreaks, e.g. SARS, Asian bird flu.
2. Outbreaks due to bioterrorist attacks, e.g. anthrax, smallpox.
Before public health can respond, we first need to be able to detect that an outbreak is occurring. The earlier we detect the outbreak, the more we can reduce morbidity and mortality.
Many cities throughout the US have established syndromic surveillance systems to monitor the health of the community. Syndromic surveillance systems collect and analyze health-related data that precede diagnosis.
Examples of Prediagnosis Data
Over-the-counter medication sales
School/Work absenteeism
Veterinarian data
Emergency Department records
Telephone triage calls
Lab test requests
911 Calls
Data being monitored is HIPAA compliant with personal identifying information removed.
The Syndromic Surveillance Pipeline
1. Identify useful data sources
2. Collect data
3. Analyze data
Computer Science comes in here in the form of data mining: find anomalies that correspond to disease outbreaks.
Challenges
1. Finding anomalies in rich multivariate data that includes spatial, temporal, demographic and symptomatic information.
2. Finding anomalies that are truly indicative of a disease outbreak of interest.
3. Combining information from multiple data sources, e.g. Emergency Department data and over-the-counter medication sales.
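As a minimal illustration of the "Analyze data" step of the pipeline, the sketch below flags days whose count is far above a moving baseline. This is a generic univariate z-score detector, not one of the algorithms presented on this poster. The function name `flag_anomalous_days`, the window length, the threshold, and the daily counts are all made up for illustration.

```python
from statistics import mean, stdev

def flag_anomalous_days(counts, window=7, threshold=3.0):
    """Return indices of days whose count is unusually high.

    A day is flagged when its count exceeds the mean of the previous
    `window` days by more than `threshold` sample standard deviations.
    """
    flagged = []
    for day in range(window, len(counts)):
        history = counts[day - window:day]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any increase as a deviation.
            if counts[day] > mu:
                flagged.append(day)
            continue
        if (counts[day] - mu) / sigma > threshold:
            flagged.append(day)
    return flagged

# Hypothetical daily counts of respiratory ED visits (not real data).
daily_counts = [20, 22, 19, 21, 23, 20, 22, 21, 24, 45]
print(flag_anomalous_days(daily_counts))  # prints [9]
```

The detector looks at a single count series only, so it does not address the multivariate, specificity, or multi-source challenges listed above.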
The “What’s Strange About Recent Events” (WSARE) Algorithm
Recent ED records
Primary Key   Date       Time    Gender   Age     …
100           10/29/05   9:12    M        20-30   …
101           10/29/05   10:45   F        50-60   …
:             :          :       :        :       :
Baseline (from a model that takes temporal fluctuations and other factors into account)
Find which rules predict unusually high proportions in recent records when compared to the baseline, e.g.
90/180 records from Recent have Gender = Male AND Home Location = NW
50/200 records from Baseline have Gender = Male AND Home Location = NW
The Population-wide Anomaly Detection and Assessment (PANDA) Algorithm
Models every individual in the population in order to improve detection of an airborne release of inhalational anthrax.
[Figure: PANDA model diagram. Global nodes: Anthrax Release, Time Of Release, Location of Release. Nodes repeated for each person: Gender, Age Decile, Home Zip, Anthrax Infection, Respiratory from Anthrax, Respiratory CC From Other, Respiratory CC, Respiratory CC When Admitted, ED Admit from Anthrax, ED Admit from Other, ED Admission, Other ED Disease. Example values shown: Gender = Female, Male; Age Decile = 20-30, 50-60; Home Zip = 15146, 15213; other values: False, Yesterday, Unknown, never.]
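To make the idea of modeling every individual concrete, here is a toy Bayesian calculation. The node names in the PANDA figure (Anthrax Release, Location of Release, Home Zip, ED Admission) suggest a probabilistic model with one sub-model per person. The sketch collapses each person to a single observation (visited the ED with a respiratory chief complaint or not) and multiplies per-person likelihoods under "release in zip z" versus "no release". Every probability, the function names, and the population sizes are hypothetical placeholders, not PANDA's actual structure or parameters. Only the two zip codes are taken from the figure.

```python
from math import exp, log

# All probabilities are hypothetical placeholders, not PANDA's parameters.
PRIOR_RELEASE = 0.001          # prior probability of any release
P_RESP_ED_BACKGROUND = 0.002   # respiratory ED visit with no exposure
P_RESP_ED_EXPOSED = 0.05       # respiratory ED visit if living in the release zip

def log_likelihood(population, release_zip):
    """Log-probability of everyone's observed ED status.

    population: list of (home_zip, visited_ed_with_respiratory_cc) pairs.
    release_zip: zip of the hypothesised release, or None for no release.
    """
    total = 0.0
    for home_zip, visited in population:
        p = P_RESP_ED_EXPOSED if home_zip == release_zip else P_RESP_ED_BACKGROUND
        total += log(p if visited else 1.0 - p)
    return total

def posterior_release(population, candidate_zips):
    """Posterior probability that a release occurred in some candidate zip."""
    log_no = log(1.0 - PRIOR_RELEASE) + log_likelihood(population, None)
    prior_each = PRIOR_RELEASE / len(candidate_zips)
    log_yes = [log(prior_each) + log_likelihood(population, z)
               for z in candidate_zips]
    m = max([log_no] + log_yes)  # subtract the max for numerical stability
    weight_no = exp(log_no - m)
    weight_yes = sum(exp(v - m) for v in log_yes)
    return weight_yes / (weight_yes + weight_no)

def make_population(visits_by_zip, people_per_zip=1000):
    """Build (home_zip, visited) pairs with the given ED visits per zip."""
    population = []
    for home_zip, visits in visits_by_zip.items():
        population += [(home_zip, True)] * visits
        population += [(home_zip, False)] * (people_per_zip - visits)
    return population

zips = ["15146", "15213"]
quiet = make_population({"15146": 2, "15213": 2})
surge = make_population({"15146": 2, "15213": 40})
print(posterior_release(quiet, zips))
print(posterior_release(surge, zips))
```

With these made-up numbers, a day with background-level visits leaves the posterior below the prior, while a cluster of visits from one zip pushes it close to 1.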
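Returning to the WSARE comparison of recent and baseline proportions, the sketch below scores a single rule using the counts quoted in the WSARE example (90/180 recent, 50/200 baseline). The poster does not say how a rule is scored. As an assumption, this sketch uses a one-sided Fisher's exact test, computed from the hypergeometric tail. The helper names `count_matches` and `fisher_one_sided`, the field names, and the synthetic records are hypothetical.

```python
from math import comb

def count_matches(records, rule):
    """Count records whose fields equal every (field, value) pair in rule."""
    return sum(
        all(record.get(field) == value for field, value in rule.items())
        for record in records
    )

def fisher_one_sided(recent_match, recent_total, base_match, base_total):
    """One-sided Fisher's exact test p-value.

    Probability of seeing at least `recent_match` matching records among the
    recent ones if recent and baseline shared one matching proportion.
    """
    total = recent_total + base_total
    matches = recent_match + base_match
    denom = comb(total, recent_total)
    upper = min(matches, recent_total)
    tail = sum(
        comb(matches, k) * comb(total - matches, recent_total - k)
        for k in range(recent_match, upper + 1)
    )
    return tail / denom

# Synthetic records built only to reproduce the example counts.
rule = {"gender": "M", "home_location": "NW"}
hit = {"gender": "M", "home_location": "NW"}
miss = {"gender": "F", "home_location": "SE"}
recent = [hit] * 90 + [miss] * 90
baseline = [hit] * 50 + [miss] * 150

p_value = fisher_one_sided(
    count_matches(recent, rule), len(recent),
    count_matches(baseline, rule), len(baseline),
)
print(f"recent 90/180 vs baseline 50/200: p = {p_value:.2e}")
```

This scores one rule in isolation. A search over many candidate rules would also need to account for multiple comparisons, which the sketch does not attempt.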