PPT Version of Presentation Slides

Discovering Patterns in Adverse Drug Reactions
Student: Ernst Joham
Supervisor: Associate Prof Jiuyong Li
Associate Supervisor: Dr. Jan Stanek
Outline
• Background
• Motivation
• Research questions
• Literature review
• Data Mining process
• Results
• Conclusion
Background
• What is data mining?
  Data mining is used to discover unexpected, interesting and
  valuable information in datasets.
• A high percentage of hospital admissions and prolonged
  hospitalisations are due to adverse drug reactions (ADRs).
• What can cause ADRs?
  • The dosage given to patients
  • More than one drug taken at the same time
  • Ingredients in drugs that can result in an adverse reaction
Background
• Problems with medical datasets
  • Medical data is more diverse and complex
  • Ethical and legal issues
  • Data quality
    • Missing values
    • Noise
  • Ownership
  • Lack of information
Motivation
• To have a successful outcome in discovering
patterns for medical datasets
• Finding the most suitable algorithms to handle noise
and missing values for medical datasets
• Improve the handling of complexity and diversity in medical datasets
Research Questions
• The aim of the research was to use data mining
methods in an attempt to produce relevant results
from real world medical data.
• The following research questions were answered:
(1) Is it possible to discover patterns in sparse datasets?
(2) What patterns can be identified through data mining for
ADRs?
Literature review (techniques)
• Decision tree, logistic programs, k-nearest
neighbour and Bayesian classifier techniques have
been applied to medical datasets (Lavrač 1999).
• Lee et al. (2000) state that techniques that easily
extract specific knowledge are the key to medical
decision making.
• A study on drug discovery showed that neural
networks performed better than logistic regression,
but decision trees performed better at identifying
active compounds (Obenshain 2004).
Literature review (process model)
• Medical data mining applications that are expected to
discover new knowledge should follow a five-stage
process model (Wang & Wang 2008):
  • Planning tasks
  • Developing data mining hypotheses
  • Preparing data
  • Selecting data mining tools
  • Evaluating data mining results
• Cios & Moore (2002) state that success requires
following the DMKD process model, which adds several
steps to the CRISP-DM model and has been applied to
several medical problem domains.
Literature review (problems with medical datasets)
• Brown & Kros (2003) focused on the impact of
missing data and how existing methods can help.
They categorise methods for dealing with missing
data into:
  • Use complete data only
  • Delete selected cases or variables
  • Data imputation
  • Model-based approaches
• Some researchers have focused on data cleansing
tools to help eliminate noise, but these can only achieve
a reasonable result (Zhu & Wu 2004).
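Two of the categories above, using complete data only and data imputation, can be sketched in a few lines. This is a minimal illustration in Python (the project itself used R); the records, field names and the choice of most-common-value imputation are assumptions made for the example, not details taken from Brown & Kros.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"route": "Oral", "recov": "Rec"},
    {"route": None, "recov": "Rec"},
    {"route": "Oral", "recov": None},
    {"route": "Topical", "recov": "NR"},
]

def complete_cases(rows):
    """Use complete data only: drop any record with a missing field."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_mode(rows, field):
    """Data imputation: fill missing values of `field` with its most common value."""
    observed = [r[field] for r in rows if r[field] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [{**r, field: mode if r[field] is None else r[field]} for r in rows]

kept = complete_cases(records)          # smaller dataset, no gaps
filled = impute_mode(records, "route")  # same size, gaps in "route" filled
```

Complete-case analysis shrinks the dataset, while imputation keeps every record at the cost of introducing values that were never observed.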
Literature review
• Attribute noise is more difficult to handle and
includes (Zhu & Wu 2004):
  • (1) Incorrect attribute values
  • (2) Missing or "don’t know" attribute values
  • (3) Incomplete attributes or "don’t care" values
Data Mining Processing
• The project followed the six-step CRISP-DM
data mining process
• Understand the main aim of the project
• Understand the dataset
ADRDATE    | Agedays | BRAND       | DRUG          | ID  | Prob | ROUTE   | Recov | Severity | URNO    | ATC
31/01/2007 |         | Lyclear     | Permethrin    | 707 | Cert | Topical | Rec   | Minor    | unknown | P03AC04
9/06/2003  | 14367   | Tegretol CR | Carbamazepine | 4   | Cert | Oral    | Rec   |          | ax6cx8z | N03AF01
11/06/2003 | 1 4173  | Zoloft      | Sertraline    | 5   | Unc  | Oral    |       |          | ax66486 | N06AB06
Data mining Process
Summary of missing values (total 1286 records)
Attribute | Missing values
ADRDATE   | 0
AGEDAYS   | 1
ROUTE     | 570
RECOV     | 344
ATC       | 191
Other counts (Unknown, NR, REC): 188, 82, 657
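A per-attribute summary of missing values like the one above can be produced with a short script. The following Python sketch is illustrative only (the project used R and Rattle); the set of tokens treated as missing and the three-row sample are assumptions, with column names borrowed from the slides.

```python
import csv
import io

# Values treated as missing; this set is an assumption for illustration.
MISSING_TOKENS = {"", "unknown", "na"}

def count_missing(csv_text):
    """Return {column: number of missing cells} for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = (row[name] or "").strip().lower()
            if value in MISSING_TOKENS:
                counts[name] += 1
    return counts

# Hypothetical three-row extract; the gaps are made up for the example.
sample = """ADRDATE,ROUTE,RECOV,ATC
31/01/2007,Topical,Rec,P03AC04
9/06/2003,Oral,,N03AF01
11/06/2003,,unknown,
"""
summary = count_missing(sample)
```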
Data Mining Process
• Data in .csv format
• R programming language
• Rattle tool for data mining
• Data preparation
  • Remove duplicates
  • Correct misspelled words
  • Correct meanings of values
  • Find missing ATC values (Anatomical Therapeutic
    Chemical)
  • Leave missing values for rest of dataset
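The data preparation steps above (correcting misspellings, removing duplicates, leaving other missing values untouched) might look roughly like this. It is a hedged Python sketch rather than the R code used in the project; the correction table, the helper names `normalise` and `prepare`, and the sample rows are all hypothetical.

```python
def normalise(value, corrections):
    """Trim whitespace and apply a spelling correction if one is listed."""
    cleaned = value.strip()
    return corrections.get(cleaned.lower(), cleaned)

def prepare(rows, corrections):
    """Correct misspellings, then drop exact duplicate records (first one kept)."""
    seen = set()
    result = []
    for row in rows:
        fixed = tuple(normalise(v, corrections) for v in row)
        if fixed not in seen:
            seen.add(fixed)
            result.append(fixed)
    return result

# Hypothetical correction table and rows (brand, route, recovery).
CORRECTIONS = {"orla": "Oral", "oral": "Oral"}
raw = [
    ("Zoloft", "orla", ""),         # misspelled route, recovery missing
    ("Zoloft", "Oral ", ""),        # duplicate once the route is corrected
    ("Lyclear", "Topical", "Rec"),
]
clean = prepare(raw, CORRECTIONS)
```

Corrections are applied before duplicate detection so that records differing only by a typo collapse into one; empty fields are passed through unchanged.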
Data mining Process
• Data transformation
  • Date the patient was admitted to hospital for the ADR
    (October-March = 1, April-September = 0)
  • Patient age, categorised into groups with an equal number
    of records (0-2 years old = 1, 2-5 years old = 2, 5-11 years
    old = 3, 11-16 years old = 4, above 16 years old = 5)
  • Route of administration of the medication that caused the
    ADR, either oral or intravenous (Oral = 1, Intravenous = 0)
  • Whether the patient recovered from the ADR (Recovered = 0,
    Not recovered = 1)
  • Whether the drug given to the patient is an antibiotic
    (Antibiotic = 1, Not antibiotic = 0)
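The encodings listed above can be written as small functions. This Python sketch is an illustration, not the project's R code. It assumes dates are in dd/mm/yyyy form (as in the sample records), that age is stored in days, and that each age boundary belongs to the older group, since the slide's ranges overlap at 2, 5, 11 and 16 years.

```python
from datetime import datetime

DAYS_PER_YEAR = 365.25  # assumption: age is recorded in days

def encode_adrdate(date_text):
    """October-March = 1, April-September = 0 (dates as dd/mm/yyyy)."""
    month = datetime.strptime(date_text, "%d/%m/%Y").month
    return 1 if month >= 10 or month <= 3 else 0

def encode_age(age_days):
    """Map age to categories 1-5; upper bounds are exclusive (an assumption)."""
    years = age_days / DAYS_PER_YEAR
    for category, upper in enumerate((2, 5, 11, 16), start=1):
        if years < upper:
            return category
    return 5

def encode_route(route):
    """Oral = 1, Intravenous = 0; any other route is left missing (None)."""
    return {"oral": 1, "intravenous": 0}.get(route.strip().lower())

def encode_recov(recovered):
    """Recovered = 0, Not recovered = 1."""
    return 0 if recovered else 1
```

For example, `encode_adrdate("31/01/2007")` gives 1 and `encode_age(14367)` gives 5, since 14367 days is roughly 39 years.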
Data Mining Processing
[Figure panels: AGE, ADRDATE, ROUTE, RECOV, ATC]
Data Mining Process
• Modelling phase
  • Logistic regression
  • Decision tree
  • Risk pattern algorithm
• Evaluation phase
• Deployment
Results
• Results for the logistic regression technique
Coefficients:
             Estimate   Std. Error  z value  Pr(>|z|)
(Intercept)  -1.901353  0.466304    -4.077   4.55e-05 ***
ADRDATE       0.136312  0.285722     0.477   0.633
AGEDAYS       0.002067  0.115482     0.018   0.986
ROUTE         0.059532  0.290016     0.205   0.837
ANTIBIOTICS  -0.181255  0.300150    -0.604   0.546
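The z values and p-values in the coefficient table follow directly from the estimates and standard errors: z is the estimate divided by its standard error, and the p-value is the two-sided tail probability of a standard normal distribution. The Python sketch below recomputes them as a check; the function name `wald_test` and the 0.05 threshold are choices made for this illustration.

```python
import math

# (estimate, standard error) pairs copied from the coefficient table.
coefficients = {
    "(Intercept)": (-1.901353, 0.466304),
    "ADRDATE": (0.136312, 0.285722),
    "AGEDAYS": (0.002067, 0.115482),
    "ROUTE": (0.059532, 0.290016),
    "ANTIBIOTICS": (-0.181255, 0.300150),
}

def wald_test(estimate, std_error):
    """Return (z, two-sided p-value) using a standard normal reference."""
    z = estimate / std_error
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p

results = {name: wald_test(est, se) for name, (est, se) in coefficients.items()}

# Predictors (excluding the intercept) with p < 0.05; the threshold is an assumption.
significant = [
    name for name, (_, p) in results.items()
    if name != "(Intercept)" and p < 0.05
]
```

With the values on the slide, `significant` is empty: none of the four predictors is statistically significant at the 5% level, and only the intercept carries the `***` marker.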
Results
• Decision Tree Result
1) root 1035 473 1 (0.4570048 0.5429952)
  2) AGE>=3.5 407 140 0 (0.6560197 0.3439803)
    4) ADRDATE< 0.5 203 61 0 (0.6995074 0.3004926) *
    5) ADRDATE>=0.5 204 79 0 (0.6127451 0.3872549)
      10) AGE>=4.5 100 35 0 (0.6500000 0.3500000)
        20) ROUTE>=0.5 79 27 0 (0.6582278 0.3417722) *
        21) ROUTE< 0.5 21 8 0 (0.6190476 0.3809524)
          42) RECOV=Yes 18 6 0 (0.6666667 0.3333333) *
          43) RECOV=NO 3 1 1 (0.3333333 0.6666667) *
Results
• Decision Tree Result (continued)
      11) AGE< 4.5 104 44 0 (0.5769231 0.4230769)
        22) ROUTE< 0.5 77 30 0 (0.6103896 0.3896104) *
        23) ROUTE>=0.5 27 13 1 (0.4814815 0.5185185) *
  3) AGE< 3.5 628 206 1 (0.3280255 0.6719745)
    6) ROUTE< 0.5 236 109 1 (0.4618644 0.5381356)
      12) RECOV=NO 24 6 0 (0.7500000 0.2500000)
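The tree listing above has the layout printed by R's rpart package: node number, split, n (records in the node), loss (records misclassified), yval (predicted class) and the class probabilities, with `*` marking a terminal node. The Python sketch below parses one such line and recovers the node's depth from its number, since the children of node k are numbered 2k and 2k+1. The regular expression and the `parse_node` helper are written for this illustration and assume a two-class tree.

```python
import re

# node) split n loss yval (p0 p1) [*]
NODE = re.compile(
    r"^\s*(\d+)\)\s+(.+?)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"\(([\d.]+)\s+([\d.]+)\)\s*(\*?)\s*$"
)

def parse_node(line):
    """Parse one printed node line into a dict; return None if it does not match."""
    m = NODE.match(line)
    if m is None:
        return None
    node, split, n, loss, yval, p0, p1, star = m.groups()
    node = int(node)
    return {
        "node": node,
        "depth": node.bit_length() - 1,  # root (node 1) has depth 0
        "split": split,
        "n": int(n),
        "loss": int(loss),
        "yval": int(yval),
        "probs": (float(p0), float(p1)),
        "leaf": star == "*",
    }

root = parse_node("1) root 1035 473 1 (0.4570048 0.5429952)")
leaf = parse_node("43) RECOV=NO 3 1 1 (0.3333333 0.6666667) *")

# loss / n is the share of records in the node that do not have the predicted class.
root_error = root["loss"] / root["n"]
```

For the root, `root_error` is 473/1035, which matches the first probability in the listing (0.4570048), because the root predicts class 1.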
Results
• Risk patterns for NO
1 | 3 | 3.0324 | 2.4852 | 26 | 9  | 7  | ADRDATE 1, AGEDAYS 3, ANTIBIOTICS 0
2 | 2 | 3.1002 | 2.5582 | 62 | 46 | 16 | AGEDAYS 3, ANTIBIOTICS 0
3 | 3 | 2.5663 | 2.1904 | 25 | 9  | 6  | ADRDATE 1, AGEDAYS 4, ROUTE 1
4 | 3 | 2.5375 | 2.1757 | 34 | 26 | 8  | AGEDAYS 4, ROUTE 1, ANTIBIOTICS 0
Pattern 1, where Risk Ratio = 2.48:
  Agedays = between 5 and 11 years old
  Adrdate = months between October and March
  Antibiotics = No
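As a reminder of what a risk ratio measures, the sketch below computes it, together with the odds ratio, from a 2x2 table of pattern versus outcome. These are the standard epidemiological definitions, not the risk pattern mining algorithm itself, and the counts are hypothetical: they are not taken from the ADR dataset.

```python
def risk_ratio(a, b, c, d):
    """Risk ratio for a 2x2 table.

    a: pattern present, outcome present   b: pattern present, outcome absent
    c: pattern absent,  outcome present   d: pattern absent,  outcome absent
    """
    risk_with_pattern = a / (a + b)
    risk_without_pattern = c / (c + d)
    return risk_with_pattern / risk_without_pattern

def odds_ratio(a, b, c, d):
    """Odds ratio for the same 2x2 table."""
    return (a * d) / (b * c)

# Hypothetical counts chosen only to illustrate the arithmetic.
a, b, c, d = 10, 30, 50, 910
rr = risk_ratio(a, b, c, d)      # (10/40) / (50/960)
odds = odds_ratio(a, b, c, d)    # (10*910) / (30*50)
```

A risk ratio of 2.48, as reported for Pattern 1, means the outcome is about two and a half times as likely for records matching the pattern as for records that do not match it.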
Conclusion
• Building a data mining process to answer the
problem posed.
• Use algorithms that work for medical applications
• Noise and missing values do pose a problem, but
reasonable results can still be achieved.
• More relevant patterns can be produced for medical
experts if maximum information is included in the
dataset.
References
• Brown, ML & Kros, JF 2003, 'Data mining and the impact of missing data', Industrial Management & Data
  Systems, vol. 103, pp. 611-621.
• Cios, K & Moore, GW 2002, 'Uniqueness of medical data mining', Artificial Intelligence in Medicine, vol. 26,
  no. 1-2, pp. 1-24.
• CRISP_DM 2000, Cross Industry Standard Process for Data Mining, viewed 27 August 2008,
  <http://www.crisp-dm.org/Partners/index.htm>.
• Lavrač, N 1999, 'Selected techniques for data mining in medicine', Artificial Intelligence in Medicine, vol. 16,
  no. 1, pp. 3-23.
• Lee, I-N, Liao, S-C & Embrechts, M 2000, 'Data mining techniques applied to medical information', Medical
  Informatics & the Internet in Medicine, vol. 25, no. 2, pp. 81-102.
• Li, J, Fu, AW-c, He, H, Chen, J, Jin, H, McAullay, D, Williams, G, Sparks, R & Kelman, C 2005, Mining risk
  patterns in medical data, ACM, Chicago, Illinois, USA.
• Obenshain, MK 2004, 'Application of Data Mining Techniques to Healthcare Data', Infection Control and
  Hospital Epidemiology, vol. 25, no. 8, pp. 690-695.
• Safety of Medicines 2002, A Guide to Detecting and Reporting Adverse Drug Reactions: Why Health
  Professionals Need to Take Action, WHO publications, viewed 15 April 2008,
  <http://whqlibdoc.who.int/hq/2002/WHO_EDM_QSM_2002.2.pdf>.
• Wang, H & Wang, S 2008, 'Medical knowledge acquisition through data mining', paper presented at the IT
  in Medicine and Education, 2008. ITME 2008. IEEE International Symposium on, Xiamen.
• Zhu, X, Khoshgoftaar, T, Davidson, I & Zhang, S 2007, 'Editorial: Special issue on mining low-quality data',
  Knowledge and Information Systems, vol. 11, no. 2, pp. 131-136.