Selected Problems in Epidemiology

Nina H. Fefferman, Ph.D.
Co-Director Tufts Univ. InForMID
Data mining in public health is not new, but it has become more complicated
A small historical example : Cholera, John Snow, 1854
During the height of the miasma theory of disease
1) There was a Cholera outbreak in London
2) John Snow became ‘irrationally’ convinced that Cholera came from
contaminated drinking water
So Snow went to the London Registrar-General
He looked at where those who died from Cholera got their water and when
"The experiment … was on the grandest scale. No fewer than 300,000 people of both
sexes, of every age and occupation, and of every rank and station, from gentlefolks
down to the very poor, were divided into two groups without their choice, and, in
most cases, without their knowledge; one group being supplied with water containing
the sewage of London, and, amongst it, whatever might have come from the cholera
patients, the other group having water quite free from such impurity."
John Snow, On the Mode of Communication of Cholera, Second Edition, 1855
Snow’s findings:
Water supply                      Number of Houses   Deaths from Cholera   Deaths in Each 10,000 Houses
Southwark and Vauxhall Company              40,046                 1,263                            315
Lambeth Company                             26,107                    98                             37
Rest of London                             256,423                 1,422                             59
Before 1852, your chances of getting cholera were not correlated with getting your
water from either water company
In the epidemic of 1853-54, your chances of getting cholera if your water was from
Southwark and Vauxhall were more than eight times greater than if you got
your water from Lambeth
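The "more than eight times" comparison can be checked from the counts in Snow's table. The following is a minimal sketch, assuming the house and death counts quoted above; the names `SUPPLY` and `deaths_per_10k_houses` are illustrative and not part of the talk. Rates recomputed from the raw counts can differ somewhat from the per-10,000 figures Snow printed (most visibly for the rest of London), so the code reports the recomputed values rather than reproducing his column.

```python
# Counts as given in Snow's table: (number of houses, deaths from cholera).
SUPPLY = {
    "Southwark and Vauxhall Company": (40_046, 1_263),
    "Lambeth Company": (26_107, 98),
    "Rest of London": (256_423, 1_422),
}


def deaths_per_10k_houses(houses, deaths):
    """Deaths per 10,000 houses for one water supply."""
    return deaths / houses * 10_000


rates = {name: deaths_per_10k_houses(h, d) for name, (h, d) in SUPPLY.items()}

# How many times higher the death rate was for Southwark and Vauxhall customers.
ratio = rates["Southwark and Vauxhall Company"] / rates["Lambeth Company"]

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} deaths per 10,000 houses")
print(f"Southwark and Vauxhall vs. Lambeth: {ratio:.1f} times the rate")
```

The ratio comes out between 8 and 9, which is consistent with the claim on the slide.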
And then it got really impressive:
Then cholera recurred in the Soho
district of London
About 600 people died from cholera in a
10-day period
Once again Snow took the operational
death-certificate data from the
Registrar-General
This time he plotted the data on a
clustering diagram, using a
stacked histogram technique
plotted on a map of Soho to
do the data mining
Lives saved due to real-time data mining
Based upon this map, Snow was able to convince the London Board of
Guardians to remove the pump handle from the public pump
located on Broad Street
The outbreak of cholera subsided with this operational change
It was later revealed that the Broad Street well was contaminated by an
underground cesspool located at 40 Broad Street which was
just three feet from the well
The Broad Street pump without a handle remains today as a tribute to
Snow
Modern problems: Happening on every scale imaginable:
Genetic –
We know what we’re looking at and what we’re looking for, just not
how to find it
Single Defined Population –
We know who we’re looking at and what we’re looking for, but not
how to find it
Undefined Population –
We don’t know who to look at, but we know what to look for
Undefined Everything –
We want to save lives, but don’t know what to do at all
Chromosome   Sequence Length (in base pairs)
1            245,203,898
2            243,315,028
3            199,411,731
4            191,610,523
5            180,967,295
6            170,740,541
7            158,431,299
8            145,908,738
9            134,505,819
10           135,480,874
11           134,978,784
12           133,464,434
13           114,151,656
14           105,311,216
15           100,114,055
16            89,995,999
17            81,691,216
18            77,753,510
19            63,790,860
20            63,644,868
21            46,976,537
22            49,476,972
X            152,634,166
Y             50,961,097
Normally one-tenth of a single percent of DNA (about 3 million bases) differs from
one person to the next
Luckily junk DNA makes up at least 50% of the human genome
But we still know of about 1.4 million locations where single nucleotide
polymorphisms (SNPs) occur in humans
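The "about 3 million bases" figure follows from the chromosome lengths listed above. A short sketch, assuming those lengths as transcribed; the variable names are illustrative only:

```python
# Sequence lengths in base pairs, copied from the chromosome table in the slides.
CHROMOSOME_LENGTHS = {
    "1": 245_203_898, "2": 243_315_028, "3": 199_411_731, "4": 191_610_523,
    "5": 180_967_295, "6": 170_740_541, "7": 158_431_299, "8": 145_908_738,
    "9": 134_505_819, "10": 135_480_874, "11": 134_978_784, "12": 133_464_434,
    "13": 114_151_656, "14": 105_311_216, "15": 100_114_055, "16": 89_995_999,
    "17": 81_691_216, "18": 77_753_510, "19": 63_790_860, "20": 63_644_868,
    "21": 46_976_537, "22": 49_476_972, "X": 152_634_166, "Y": 50_961_097,
}

total_bp = sum(CHROMOSOME_LENGTHS.values())

# "One-tenth of a single percent" of the genome differs between two people.
differing_bp = total_bp * 0.001

print(f"Total length: {total_bp:,} base pairs")
print(f"0.1% of that: about {differing_bp / 1e6:.1f} million bases")
```

The total is roughly three billion base pairs, so 0.1% is on the order of three million positions, which is the scale of the search space the slides describe.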
Genetic Epidemiology:
You have good reason to believe that a
disease has a genetic component
You have the sequenced genomes of
some afflicted people
The human genome is huge
A paper on something like this: Rodin et al. 2005 J Comput Biol. 12(1): 1–11.
Mining Genetic Epidemiology Data with Bayesian Networks Application to APOE Gene
Variation and Plasma Lipid Levels
So we need Data Mining
This type of examination is called a “large-scale genotype–phenotype
association study”
Classical statistical methods (e.g. multivariable regression, contingency table
analysis) are ill-suited for high-dimensional problems because they
are “single inference procedures”
We need “joint inference procedures”
Methods for combining results across multiple “single inference procedures”
are inefficient
In this type of case, Data-mining methods are hypothesis-generating and
classical statistical methods are hypothesis-testing
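One way to see why combining many single inference procedures is inefficient is to count what happens when each candidate location is tested separately. The sketch below uses the Bonferroni correction, chosen here only as a familiar illustration (the talk does not name a specific combination method), and assumes one independent test per known SNP location at a conventional 0.05 significance level. Both assumptions are for illustration and are not taken from the slides.

```python
def expected_false_positives(alpha, n_tests):
    """Expected number of false alarms if every test is run at level alpha
    and no association truly exists."""
    return alpha * n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


N_SNPS = 1_400_000  # known SNP locations, as quoted in the slides
ALPHA = 0.05        # conventional level, assumed for illustration

false_alarms = expected_false_positives(ALPHA, N_SNPS)
per_test = bonferroni_threshold(ALPHA, N_SNPS)

print(f"Uncorrected: about {false_alarms:,.0f} false positives expected")
print(f"Bonferroni: each test must reach p < {per_test:.2e}")
```

Without correction, tens of thousands of spurious hits are expected; with correction, each individual test must clear a threshold so strict that real but modest effects are easily missed. That trade-off is one motivation for joint inference.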
A single defined population:
We know who we’re looking at and what we’re looking for, but not how to find it
In an adverse reaction study for a new vaccine or drug
We know who to watch (those who receive the treatment)
We know what we’re looking for (“bad things that happen to them”)
How do we find “it”?
We also have to monitor people who don’t get the treatment and see what
happens to them
We wind up with a huge set of “all bad things that happen to lots of people”
This leads to a lot of problems:
A reference and paper on something like this: http://www.fda.gov/cder/aers/default.htm
or Niu et al. 2001 Vaccine. 19(32):4627-34.
Example problems in data mining for adverse events:
Health care providers report adverse reactions by patients to any drug
Unfortunately, many patients need to take several drugs at once, so all will be reported
with the same event
And there’s reporting bias - results don’t reflect the overall population (only the people who
needed the drug in the first place, but that’s probably the portion we’re worried about anyway)
Explicit example: Sudden Infant Death Syndrome (SIDS) and the Polio vaccine
You can easily find a statistical association between the two – Does this mean the polio
vaccine is dangerous?
Not necessarily – the polio vaccine is mainly given to infants, who are the only possible
victims of SIDS
Receiving the polio vaccine increases your likelihood of being an infant, which
significantly increases your chance of SIDS
We would need to test whether there is an association within infants
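The SIDS example is a case of confounding by age, and comparing within the infant stratum is the standard remedy. The sketch below uses HYPOTHETICAL counts invented purely to illustrate the mechanism (they are not real vaccine or SIDS data): the vaccine has no effect on risk within either age group, yet the pooled comparison shows a large association.

```python
from fractions import Fraction

# HYPOTHETICAL counts, invented for illustration only: (people, SIDS cases).
COUNTS = {
    ("infant", "vaccinated"): (90_000, 90),
    ("infant", "unvaccinated"): (10_000, 10),
    ("non-infant", "vaccinated"): (10_000, 0),
    ("non-infant", "unvaccinated"): (990_000, 0),
}


def risk(groups):
    """Cases divided by people, pooled over the given (age, status) groups."""
    people = sum(COUNTS[g][0] for g in groups)
    cases = sum(COUNTS[g][1] for g in groups)
    return Fraction(cases, people)


def risk_ratio(age_groups):
    """Risk in the vaccinated divided by risk in the unvaccinated."""
    vaccinated = risk([(age, "vaccinated") for age in age_groups])
    unvaccinated = risk([(age, "unvaccinated") for age in age_groups])
    return vaccinated / unvaccinated


crude = risk_ratio(["infant", "non-infant"])  # all ages pooled together
within_infants = risk_ratio(["infant"])       # restricted to infants

print(f"Crude risk ratio (all ages pooled): {float(crude):.1f}")
print(f"Risk ratio within infants only:     {float(within_infants):.1f}")
```

In this made-up population the pooled risk ratio is large, while the ratio within infants is exactly 1: the apparent association is produced entirely by who receives the vaccine.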
Undefined Population –
We don’t know who to look at, but we know what to look for
Example: Figuring out the source of a food-borne outbreak
(Good news: we know some diseases are caused by food-borne pathogens)
We can hypothesize that a certain activity is somehow related to the source
like the food at a party being contaminated
Unfortunately, there can be a lot of food at one large party
You might not know if the food at the party is actually the culprit
You need to ask if people at the party got sick
If they did, you need to know which particular food at the party is
contaminated
The normal process here is to call everyone at the party and conduct a
survey (see handout)
These surveys can generate a huge amount of data and there’s
no guarantee that the party was the source of the outbreak
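A common first pass over such survey data is to compare, for each food, the attack rate (fraction who became ill) among guests who ate it against the rate among those who did not. The sketch below is a minimal version of that comparison using HYPOTHETICAL survey responses; the field names `ate` and `ill` and the foods themselves are made up for illustration.

```python
# HYPOTHETICAL survey responses: what each guest ate and whether they became ill.
SURVEY = [
    {"ate": {"chicken", "salad"}, "ill": True},
    {"ate": {"chicken", "cake"}, "ill": True},
    {"ate": {"chicken"}, "ill": True},
    {"ate": {"salad", "cake"}, "ill": False},
    {"ate": {"salad"}, "ill": False},
    {"ate": {"cake"}, "ill": False},
    {"ate": {"chicken", "salad", "cake"}, "ill": True},
    {"ate": {"salad", "cake"}, "ill": False},
]


def attack_rate(responses):
    """Fraction of the given guests who became ill."""
    return sum(r["ill"] for r in responses) / len(responses)


def rank_foods(survey):
    """Rank foods by attack rate among eaters minus attack rate among non-eaters."""
    foods = sorted(set().union(*(r["ate"] for r in survey)))
    ranked = []
    for food in foods:
        ate = [r for r in survey if food in r["ate"]]
        did_not = [r for r in survey if food not in r["ate"]]
        if not ate or not did_not:
            continue  # no comparison possible if everyone (or no one) ate it
        ranked.append((food, attack_rate(ate), attack_rate(did_not)))
    ranked.sort(key=lambda row: row[1] - row[2], reverse=True)
    return ranked


ranking = rank_foods(SURVEY)
for food, rate_ate, rate_not in ranking:
    print(f"{food:8s} ill if eaten: {rate_ate:.2f}   ill if not eaten: {rate_not:.2f}")
```

A food with a high attack rate among those who ate it and a low rate among those who did not is the leading suspect. With many foods and few respondents, several items can look suspicious by chance, which is part of why these surveys grow so large.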
Horror scenario from a data perspective:
Food poisoning at the Republican National Convention
We wouldn’t know
• Which day
• Which location
• Which caterer
• How many people were made ill
How do you figure out what, how, and who in real time?
Part of the problem is to get the answer before more people become sick, so
you want to narrow the focus of your investigation as you go
– ask fewer people, ask fewer questions, all these surveys take time
Undefined Everything –
We want to save lives, but don’t know what to do at all
Cancer :
You’ll hear more about this later in the program from
Dmitriy Fradkin
Huge numbers of people diagnosed
Huge numbers of possible contributing risks –
environmental exposure to carcinogens
genetic predisposition
cancer-causing viruses
Huge numbers of confounding factors –
differences in diagnosis, treatment, outcome
co-morbidity
Let’s say we’re worried about the beginning of an outbreak
of H5N1 avian flu
It will probably start out looking like normal flu
How quickly we can figure out where it is will determine how quickly we can try
active intervention strategies
We don’t know where it will start:
International travel?
Near airports?
International bird migration patterns?
Along the coasts? Depending on time of year?
Once it’s here, we don’t really know how it will spread –
Maybe we want an early warning system for cities – is the disease present or
absent?
These are the types of Epidemiological problems we face,
what are the kinds of practical constraints we have to expect?
There are many data collectors:
Insurance companies, HMOs, public health agencies
Issues of data control –
Who controls the data?
Is each entity found only at a single site?
Do different sites contain different types of data?
How can we make sure the data isn’t redundant and therefore skewing our
information?
How can we make sure we get all the pertinent data at the same time?
Or at least how fast is fast enough to figure out what we need as
quickly as possible?
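On the redundancy question, the simplest safeguard is to collapse reports that describe the same case before counting. The sketch below assumes, hypothetically, that every collector supplies a shared patient identifier along with the disease and onset date; the records and field names are invented for illustration. Real data may lack such a shared key, in which case exact matching like this would not work.

```python
# HYPOTHETICAL reports: the same case can arrive from more than one collector.
REPORTS = [
    {"source": "HMO", "patient": "A17",
     "disease": "salmonellosis", "onset": "2005-06-01"},
    {"source": "public health agency", "patient": "A17",
     "disease": "salmonellosis", "onset": "2005-06-01"},
    {"source": "insurance company", "patient": "B42",
     "disease": "salmonellosis", "onset": "2005-06-03"},
]


def case_key(report):
    """Fields assumed to identify one case regardless of who reported it."""
    return (report["patient"], report["disease"], report["onset"])


def deduplicate(reports):
    """Keep the first report seen for each distinct case."""
    first_seen = {}
    for report in reports:
        first_seen.setdefault(case_key(report), report)
    return list(first_seen.values())


unique_cases = deduplicate(REPORTS)
print(f"{len(REPORTS)} reports received, {len(unique_cases)} distinct cases")
```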
And Privacy and Ethics:
Individual privacy concerns limit the willingness of the data custodians to
share it, even with government agencies such as the U.S. Centers for
Disease Control
In many cases, data is shared only after it has been “de-identified” according to
HIPAA regulations (for more information, see http://www.hipaa.org/) –
This removes a lot of useful information and doesn’t really do a whole lot to
protect privacy, but that’s another issue (see Fefferman et al. 2005 J. Public Health
Policy 26(4):430-449)
We need a whole different slew of data mining techniques to mine data “blind”
(when we don’t know what we’re seeing, what the numbers represent, how much they’ve
been aggregated to represent averages or what we’re looking for)
And other problems:
Sometimes we don’t know where the best source of data is –
We can monitor some cities more closely
We can monitor certain diseases (notifiable diseases)
Although this is constrained by having to verify by lab test
Sometimes our expectations of “normal” levels of disease set the wrong
benchmark for when we should start being concerned
Different diseases have different normal incidence, which means that an
increase of 10 cases per year of one disease is an outbreak, but it
would take an increase of 1000 in another to be ‘unusual’
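The point about disease-specific baselines can be made concrete by scaling the excess over baseline by the variation expected for a count of that size. The sketch below assumes yearly counts vary roughly like a Poisson variable (standard deviation equal to the square root of the baseline). That assumption, the threshold of 2, and both baseline values are HYPOTHETICAL choices for illustration, not figures from the talk.

```python
import math


def excess_z_score(observed, baseline):
    """Excess over baseline, in units of the Poisson standard deviation sqrt(baseline)."""
    return (observed - baseline) / math.sqrt(baseline)


def is_unusual(observed, baseline, threshold=2.0):
    """Flag a count whose scaled excess over baseline is above the threshold."""
    return excess_z_score(observed, baseline) > threshold


# HYPOTHETICAL yearly baselines for a rare and a common disease.
RARE_BASELINE = 16
COMMON_BASELINE = 40_000

for label, baseline, increase in [
    ("rare disease, +10 cases", RARE_BASELINE, 10),
    ("common disease, +10 cases", COMMON_BASELINE, 10),
    ("common disease, +1000 cases", COMMON_BASELINE, 1000),
]:
    z = excess_z_score(baseline + increase, baseline)
    flag = is_unusual(baseline + increase, baseline)
    print(f"{label}: z = {z:.2f}, unusual = {flag}")
```

Under these assumptions, ten extra cases stand out against the small baseline but vanish against the large one, where it takes an excess on the order of a thousand to be flagged.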
[Figure: BOTULISM, FOODBORNE. Number of reported cases, by year, United States, 1983-2003]
[Figure: ESCHERICHIA COLI, ENTEROHEMORRHAGIC O157:H7. Number of reported cases, United States and U.S. territories, 2003]
Sometimes we expect something intermediate
[Figure: SALMONELLOSIS. Incidence per 100,000 population, by year, United States, 1973-2003]
And sometimes we expect the numbers to be reasonably large
[Figure: ACQUIRED IMMUNODEFICIENCY SYNDROME (AIDS). Number of reported cases, by year, United States and U.S. territories, 1983-2003. Total number of AIDS cases includes all cases reported to CDC as of December 31, 2003. Total includes cases among residents in U.S. territories and 220 cases among persons with unknown state of residence.]
And sometimes our methods of surveillance themselves create issues
Sometimes our problems are prospective
Sometimes our problems are retrospective
In outbreak detection and biosurveillance, we want to find
“unusual disease incidence” early
In adverse reaction trials, we want to know overall effects,
we don’t particularly care about the time scales on
which they act
In “classic epidemiological investigations” we are looking
for the source of exposure to prevent further
infection
Advances in technology have caused a shift in our
data mining needs
It used to be that the bottleneck to appropriate analysis was
figuring out where to look for the data and collecting it
A pre-processing problem
Due to advances in reporting technology, we’re very close to
getting real-time reporting for mortality data and we’re
getting there for incidence data (for at least some diseases)
Now we have to figure out how to find meaningful results in the chaos
and clutter
Data mining techniques can be tailored to handle all of
these problems
We haven’t covered all of the problems, but as you
can see, we need better techniques and we need more
people working on the use of these techniques
Thanks for attending this workshop – we need you!