Data Mining for Healthcare Documents
陳啟煌
Programming Division, Computer and Information Networking Center, National Taiwan University
2011.10.27
2015/7/18
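About Me
陳啟煌
Education
– NCTU Computer Science, NTU Computer Science, NTU Electrical Engineering
Experience
– 興匯 Financial Consulting Company, NTU Computer and Information Networking Center
Email: [email protected]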
Outlines
Introduction
Biomedical Semantic Similarity Measure
Semantic-driven Keyword Matching Extractor
Web-based Discharge Summary System
Healthcare Mining Project with Mongolia
Conclusions and Future Works
Clinical Mining
Clinical Database
Clinical Pathways
Introduction
In the IOM 2000 report: 44,000 to 98,000 unnecessary deaths per year
– Death rate equivalent to three jumbo jets crashing every two days
– Motor vehicle accidents: 43,458
– Breast cancer: 42,297
– AIDS: 16,516
Suggested Solutions
Development of IT infrastructures
– Computerized Physician Order Entry (CPOE)
• Order sets: make it easier to do the right thing
• Alerts / reminders
• Clinical guidelines
Restriction on working hours
Greater staff-to-patient ratios
Motivation
Clinical Pathway
– A way of treating a patient with a standardized procedure in order to
• Enhance the efficiency
• Increase the quality
• Lower the costs
• Shorten the length of stay in hospital
Usually represented in a script book and/or flow chart diagram
Order Sets System Evolution
Paper Order Sets
– Predefined orders written on paper
Electronic Order Sets
– Just a UI to create and look up order sets
Knowledge-based Order Sets
– Machine learning
– Interactive UI for the user
How to Create Order Sets
Committee
– Traditional method, time-consuming
Feedback system
– Interaction with users, suggestions
Data mining
– Find patterns from existing clinical data
Raw Data
Introduction
Motivation
Free-Text Reports
– Discharge summaries
– Radiology reports
– Pathology reports
– The treatments they contain can be extracted and learned from to gain knowledge
Motivation
Biomedical semantically similar terms exist in medical reports.
– "congestive heart failure", "cardiac decompensation", and "volume overload"
Approaches
Biomedical Semantic Similarity Measure
– Calculate semantic similarity between terms
A Powerful Extractor
– To view, verify, and extract data items from reports
Structuralized
– Providing a highly interactive editor
• Auto-complete
• Model essay
• User phrases
Biomedical Semantic Similarity Measure
Introduction(1/4)
Ontology-based techniques
– Ontology tree
• Single ontology
• Cross ontology
– Path length, edge counting
Corpus-based techniques
– Context vector measure, latent semantic analysis (LSA)
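As a rough illustration of the path-length / edge-counting idea named above, the sketch below counts the edges between two terms in a tiny is-a tree. The tree contents and the 1 / (1 + d) conversion to a similarity score are assumptions made up for this example; they are not taken from the slides or from any real ontology.

```python
# Toy is-a hierarchy (hypothetical): each term maps to its parent, the root maps to None.
PARENT = {
    "disease": None,
    "heart disease": "disease",
    "lung disease": "disease",
    "myocardial infarction": "heart disease",
    "myocardial ischemia": "heart disease",
    "pneumonia": "lung disease",
}


def ancestors(term):
    """Return the chain from the term up to the root, the term itself first."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain


def path_length(a, b):
    """Count the edges on the path between a and b through their lowest common ancestor."""
    up_a = ancestors(a)
    up_b = ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    raise ValueError("terms do not share a root")


def path_similarity(a, b):
    """Map a path length to a score in (0, 1]; shorter paths give higher scores."""
    return 1.0 / (1.0 + path_length(a, b))
```

Sibling terms such as "myocardial infarction" and "myocardial ischemia" are two edges apart here, while terms in different branches are farther apart and score lower.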
Introduction(2/4)
The Web Corpus
– The Web provides unprecedented access to information and interacts with people's daily lives.
– The idea of using the Web as a corpus for NLP research is becoming popular.
Introduction(3/4)
How can each document on the Web be analyzed directly?
Introduction(4/4)
Web search engines
– Efficient interface
– Numerous documents and a high growth rate
– Google: page count
Background and Related Work
Ontology-based techniques
– Single ontology
• Edge counting
• Information content
• Feature-based
• Hybrid
– Cross ontology
• Hliaoutakis et al.
Methodologies
Sample Construction
Feature Definitions
Feature Selection Strategy
Machine Learning Model
– Support Vector Machine Model
Sample Construction(1/3)
Sample Construction(2/3)
Sample Construction(3/3)
In our study, we collected
– 1500 synonymous term pairs
– 1500 non-synonymous term pairs
Feature Definitions(1/4)
Features
– Co-occurrence (formulas not preserved in this transcript)
Feature Definitions(2/4)
Features
– Co-occurrence (formula not preserved in this transcript)
– Semantic distance (formula not preserved in this transcript)
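The feature-ranking table later in the deck names WebJaccard, WebOverlap, WebDice, WebPMI, and NGD. The sketch below uses the definitions commonly given for these page-count measures in the literature; whether the slides used exactly these forms is an assumption. Here h_p, h_q, and h_pq stand for the page counts of P, Q, and the conjunctive query "P AND Q", and n is a hypothetical estimate of the number of indexed pages.

```python
import math


def web_jaccard(h_p, h_q, h_pq):
    """Jaccard coefficient on page counts."""
    if h_pq == 0:
        return 0.0
    return h_pq / (h_p + h_q - h_pq)


def web_overlap(h_p, h_q, h_pq):
    """Overlap (Simpson) coefficient on page counts."""
    if h_pq == 0:
        return 0.0
    return h_pq / min(h_p, h_q)


def web_dice(h_p, h_q, h_pq):
    """Dice coefficient on page counts."""
    if h_pq == 0:
        return 0.0
    return 2 * h_pq / (h_p + h_q)


def web_pmi(h_p, h_q, h_pq, n):
    """Pointwise mutual information, with probabilities estimated as page count / n."""
    if h_pq == 0:
        return 0.0
    return math.log2((h_pq / n) / ((h_p / n) * (h_q / n)))


def ngd(h_p, h_q, h_pq, n):
    """Normalized Google Distance; smaller values mean more closely related terms."""
    if h_pq == 0:
        return float("inf")
    log_p, log_q, log_pq = math.log(h_p), math.log(h_q), math.log(h_pq)
    return (max(log_p, log_q) - log_pq) / (math.log(n) - min(log_p, log_q))
```

The first four measures grow with relatedness, while NGD is a distance, so a classifier that consumes all five has to learn the opposite sign for it.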
Feature Definitions(3/4)
"Apoptosis known as programmed cell death"
The phrase "known as" indicates a synonymous relationship between "apoptosis" and "programmed cell death".
"Apoptosis known as programmed cell death"
– Google page count: 141
"Isoflavone known as Cyclooxygenase"
– Google page count: 0
Feature Definitions(4/4)
Features
– Lexico-syntactic pattern
• P known as Q
• of P (Q)
• P (Q)
• and P (Q
• , P (Q
• against P (Q
• prevalence of P Q
• patients with P Q
• P/Q
• P, Q
H(P known as Q) / H(P ∩ Q)
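One way to read the ratio H(P known as Q) / H(P ∩ Q) is as the page count of the exact pattern phrase divided by the page count of the two terms together. The sketch below follows that reading, which is an assumption. The page_count callable is hypothetical (no real search API is used), and the stub counts are made up except for the value 141 quoted on the previous slide.

```python
# The ten lexico-syntactic patterns listed on the slide, as format templates.
PATTERNS = [
    "{p} known as {q}",
    "of {p} ({q})",
    "{p} ({q})",
    "and {p} ({q}",
    ", {p} ({q}",
    "against {p} ({q}",
    "prevalence of {p} {q}",
    "patients with {p} {q}",
    "{p}/{q}",
    "{p}, {q}",
]


def pattern_feature(pattern, p, q, page_count):
    """Return H(pattern filled with P and Q) / H(P and Q), or 0.0 if the terms never co-occur."""
    both = page_count(f'"{p}" "{q}"')
    if both == 0:
        return 0.0
    phrase = '"' + pattern.format(p=p, q=q) + '"'
    return page_count(phrase) / both


# Hypothetical stub standing in for a search engine; only 141 comes from the slides.
FAKE_COUNTS = {
    '"apoptosis known as programmed cell death"': 141,
    '"apoptosis" "programmed cell death"': 10000,  # made-up value
}


def fake_page_count(query):
    return FAKE_COUNTS.get(query, 0)


feature_vector = [
    pattern_feature(pat, "apoptosis", "programmed cell death", fake_page_count)
    for pat in PATTERNS
]
```

Dividing by the co-occurrence count keeps the feature comparable between frequent and rare term pairs.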
Feature Selection Strategy
Rank the features according to their ability to express synonymy, measured by F-score.
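The F-score formula itself did not survive in this transcript. A widely used definition for SVM feature selection divides the between-class separation of a feature by the sum of its within-class sample variances; the sketch below assumes that definition, which may differ from the one on the original slide. The inputs are the values of one feature over the positive (synonymous) and negative (non-synonymous) samples.

```python
from statistics import mean, variance


def f_score(positive, negative):
    """F-score of one feature: between-class separation over within-class variance.

    positive, negative: feature values for the two classes (at least two values each).
    """
    mean_all = mean(list(positive) + list(negative))
    mean_pos = mean(positive)
    mean_neg = mean(negative)
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    within = variance(positive) + variance(negative)  # sample variances (n - 1)
    return between / within


def rank_features(columns_pos, columns_neg):
    """Return feature names sorted by descending F-score.

    columns_pos, columns_neg: dicts mapping a feature name to its list of values per class.
    """
    scores = {name: f_score(columns_pos[name], columns_neg[name]) for name in columns_pos}
    return sorted(scores, key=scores.get, reverse=True)
```

A larger value means the feature separates the two classes better, so the top-ranked features are the first candidates to feed into the classifier.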
Support Vector Machine Model(1/2)
Support Vector Machine Model(2/2)
LIBSVM 2.89
– C-SVC
• Linear
• Polynomial degree=2
• Polynomial degree=3
• RBF
– nu-SVC
• Linear
• Polynomial degree=2
• Polynomial degree=3
• RBF
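The eight model configurations above can be written down as LIBSVM svm-train invocations. The sketch below only builds the argument lists. The options -s (0 = C-SVC, 1 = nu-SVC), -t (0 = linear, 1 = polynomial, 2 = RBF), and -d (polynomial degree) are standard svm-train options; the file names are hypothetical, and any other parameters used in the study are not stated in the slides.

```python
from itertools import product

SVM_TYPES = {"C-SVC": "0", "nu-SVC": "1"}  # value for the -s option
KERNELS = {
    "Linear": ["-t", "0"],
    "Polynomial degree=2": ["-t", "1", "-d", "2"],
    "Polynomial degree=3": ["-t", "1", "-d", "3"],
    "RBF": ["-t", "2"],
}


def svm_train_commands(train_file, model_prefix):
    """Return (label, argv) pairs, one per SVM type and kernel combination."""
    commands = []
    combos = product(SVM_TYPES.items(), KERNELS.items())
    for index, ((svm_name, s_value), (kernel_name, kernel_args)) in enumerate(combos):
        model_file = f"{model_prefix}.{index}.model"
        argv = ["svm-train", "-s", s_value, *kernel_args, train_file, model_file]
        commands.append((f"{svm_name} ({kernel_name})", argv))
    return commands


# Hypothetical file names, for illustration only.
for label, argv in svm_train_commands("term_pairs.train", "synonymy"):
    print(label, " ".join(argv))
```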
Datasets(1/5)
Table 1: Dataset 1 of 36 medical term pairs
Concept 1 | Concept 2 | Human
Anemia | Appendicitis | 0.031
Dementia | Atopic Dermatitis | 0.062
Bacterial Pneumonia | Malaria | 0.156
Osteoporosis | Patent Ductus Arteriosus | 0.156
Amino Acid Sequence | Anti-Bacterial Agents | 0.156
Acquired Immunodeficiency Syndrome | Congenital Heart Defects | 0.062
Otitis Media | Infantile Colic | 0.156
Meningitis | Tricuspid Atresia | 0.031
Sinusitis | Mental Retardation | 0.031
Hypertension | Kidney Failure | 0.5
Hyperlipidemia | Hyperkalemia | 0.156
Hypothyroidism | Hyperthyroidism | 0.406
Sarcoidosis | Tuberculosis | 0.406
Vaccines | Immunity | 0.593
Asthma | Pneumonia | 0.375
Datasets(2/5)
Table 1: Dataset 1 of 36 medical term pairs (continued)
Concept 1 | Concept 2 | Human
Diabetic Nephropathy | Diabetes Mellitus | 0.5
Lactose Intolerance | Irritable Bowel Syndrome | 0.468
Urinary Tract Infection | Pyelonephritis | 0.656
Neonatal Jaundice | Sepsis | 0.187
Sickle Cell Anemia | Iron Deficiency Anemia | 0.437
Psychology | Cognitive Science | 0.593
Adenovirus | Rotavirus | 0.437
Migraine | Headache | 0.718
Myocardial Ischemia | Myocardial Infarction | 0.75
Hepatitis B | Hepatitis C | 0.562
Carcinoma | Neoplasm | 0.75
Pulmonary Valve Stenosis | Aortic Valve Stenosis | 0.531
Failure To Thrive | Malnutrition | 0.625
Breast Feeding | Lactation | 0.843
Antibiotics | Antibacterial Agents | 0.937
Datasets(3/5)
Table 1: Dataset 1 of 36 medical term pairs (continued)
Concept 1 | Concept 2 | Human
Seizures | Convulsions | 0.843
Pain | Ache | 0.875
Malnutrition | Nutritional Deficiency | 0.875
Measles | Rubeola | 0.906
Chicken Pox | Varicella | 0.968
Down Syndrome | Trisomy 21 | 0.875
Datasets(4/5)
Table 2: Dataset 2 of 30 medical term pairs
Concept 1 | Concept 2 | Physician | Expert
Renal Failure | Kidney Failure | 4 | 4
Heart | Myocardium | 3.3 | 3
Stroke | Infarct | 3 | 2.8
Abortion | Miscarriage | 3 | 3.3
Delusion | Schizophrenia | 3 | 2.2
Congestive Heart Failure | Pulmonary Edema | 3 | 1.4
Metastasis | Adenocarcinoma | 2.7 | 1.8
Calcification | Stenosis | 2.7 | 2
Diarrhea | Stomach Cramps | 2.3 | 1.3
Mitral Stenosis | Atrial Fibrillation | 2.3 | 1.3
Chronic Obstructive Pulmonary Disease | Lung Infiltrates | 2.3 | 1.9
Rheumatoid Arthritis | Lupus | 2 | 1.1
Brain Tumor | Intracranial Hemorrhage | 2 | 1.3
Carpal Tunnel Syndrome | Osteoarthritis | 2 | 1
Diabetes Mellitus | Hypertension | 2 | 1.1
Datasets(5/5)
Table 2: Dataset 2 of 30 medical term pairs (continued)
Concept 1 | Concept 2 | Physician | Expert
Acne | Syringe | 1.7 | 1.2
Antibiotic | Allergy | 1.7 | 1
Cortisone | Total Knee Replacement | 1.7 | 1.2
Pulmonary Embolus | Myocardial Infarction | 1.7 | 1.4
Pulmonary Fibrosis | Lung Cancer | 1.3 | 1
Cholangiocarcinoma | Colonoscopy | 1.3 | 1
Lymphoid Hyperplasia | Laryngeal Cancer | 1 | 1
Multiple Sclerosis | Psychosis | 1 | 1
Appendicitis | Osteoporosis | 1 | 1
Rectal Polyp | Aorta | 1 | 1
Xerostomia | Alcoholic Cirrhosis | 1 | 1
Peptic Ulcer Disease | Myopia | 1 | 1
Depression | Cellulitis | 1 | 1
Varicose Vein | Entire Knee Meniscus | 1 | 1
Hyperlipidemia | Metastasis | 1 | 1
Experiment Results
Rank | Feature | F(i)
1 | NGD | 0.2751
2 | WebPMI | 0.237
3 | , X (Y | 0.1648
4 | X/Y | 0.1632
5 | X(Y) | 0.1606
6 | X, Y | 0.1585
7 | WebOverlap | 0.1173
8 | WebDice | 0.0555
9 | WebJaccard | 0.0347
10 | of X (Y) | 0.0185
11 | and X (Y | 0.0093
12 | against X (Y | 0.0027
13 | patients with X Y | 0.0017
14 | X known as Y | 0.0014
15 | prevalence of X Y | 0.0011
Experiment Results
Figure 3.4(a): Correlation vs. number of features and training samples using C-SVC with linear kernel
Experiment Results
Figure 3.4(b): Correlation vs. number of features and training samples using C-SVC with polynomial degree=2 kernel
Experiment Results
Figure 3.4(c): Correlation vs. number of features and training samples using C-SVC with polynomial degree=3 kernel
Experiment Results
Figure 3.4(d): Correlation vs. number of features and training samples using C-SVC with RBF kernel
Experiment Results
Figure 3.5(a): Correlation vs. number of features and training samples using nu-SVC with linear kernel
Experiment Results
Figure 3.5(b): Correlation vs. number of features and training samples using nu-SVC with polynomial degree=2 kernel
Experiment Results
Figure 3.5(c): Correlation vs. number of features and training samples using nu-SVC with polynomial degree=3 kernel
Experiment Results
Figure 3.5(d): Correlation vs. number of features and training samples using nu-SVC with RBF kernel
Experiment Results
Model | Maximum correlation | Number of samples | Number of features
C-SVC(Linear) | 0.758 | 1500 | 9
C-SVC(Poly=2) | 0.776 | 1200 | 7
C-SVC(Poly=3) | 0.759 | 300 | 13
C-SVC(RBF) | 0.612 | 1100 | 10
nu-SVC(Linear) | 0.798 | 900 | 7
nu-SVC(Poly=2) | 0.766 | 300 | 11
nu-SVC(Poly=3) | 0.736 | 300 | 12
nu-SVC(RBF) | 0.743 | 100 | 11
Experiment Results
Table 5: Correlation of different models with Dataset 1 and with the physician scores and expert scores of Dataset 2
Model | Dataset 1 | Dataset 2 (Phy) | Dataset 2 (Exp)
C-SVC(Linear) | 0.758 | 0.689 | 0.482
C-SVC(Poly=2) | 0.776 | 0.698 | 0.479
C-SVC(Poly=3) | 0.759 | 0.649 | 0.395
C-SVC(RBF) | 0.612 | 0.388 | 0.171
nu-SVC(Linear) | 0.798 | 0.705 | 0.496
nu-SVC(Poly=2) | 0.766 | 0.671 | 0.424
nu-SVC(Poly=3) | 0.736 | 0.641 | 0.384
nu-SVC(RBF) | 0.743 | 0.632 | 0.373
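The slides do not say which correlation coefficient is reported. Assuming the Pearson coefficient, the sketch below shows how a list of scores would be compared against human ratings. The two sample lists are the physician and expert columns of the first five rows of Table 2 and stand in for model output, which is not available in this transcript.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 values each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# First five rows of Table 2: physician scores vs. expert scores.
physician = [4, 3.3, 3, 3, 3]
expert = [4, 3, 2.8, 3.3, 2.2]
print(round(pearson(physician, expert), 3))
```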
Result comparison
Table 3.4: Result comparison for Dataset 1
Measure | Dataset 1 (rank)
SemDist | 0.726 (2)
Path length | 0.422 (5)
Leacock & Chodorow | 0.600 (3)
Wu & Palmer | 0.498 (4)
Proposed | 0.798 (1)
Result comparison
Table 3.5: Result comparison for Dataset 2
Measure | Dataset 2 (Physician) | Dataset 2 (Expert)
Path length | 0.512 (4) | 0.731 (2)
Leacock & Chodorow | 0.358 (7) | 0.497 (5)
Lin | 0.522 (3) | 0.565 (4)
Resnik | 0.534 (2) | 0.61 (3)
Jiang & Conrath | 0.506 (5) | 0.741 (1)
Vector (All sect, 1M notes) | 0.436 (6) | 0.497 (5)
Proposed | 0.705 (1) | 0.496 (6)
Semantic-driven Keyword Matching Extractor
Introduction
For Structuralized Clinical Data
– Data can be directly exported for further analysis and mining
For Non-structuralized Clinical Data
– Data need further processing to extract the relevant information
Background and Related works
Marking concepts and related semantics
– Cancer Text Information Extraction System (caTIES)
Extracting data items and filling the outcomes into a predefined template
– IBM Watson Research Center & Mayo Clinic
Providing a verification user interface
– Commercial natural language processing (NLP) engines
Architecture
Components: case-oriented template schema, clinical data warehouse, textual clinical reports, matching metadata, keyword selection interface, information matching modules, textual documents viewer, extraction verification editor
Flow labels: retrieve keyword list, select keyword, retrieve matching profile, send matching profile, apply match pattern on textual reports, review and verify matched information, store structuralized data
Methodology
The default common keyword lists for each type of textual document
The personal keyword lists
– Matching the keyword and the keywords with related semantics
– Mapping the corresponding matching rules using the retrieved matching pattern and applying the matching rules to the textual reports
– Date: 2009/01/01, 12/01
– Size: "4.9 x 1 x 1.8" (length x width x height)
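The date and size examples above suggest pattern-based matching rules. A minimal sketch of two such rules as regular expressions follows; the patterns, function names, and sample sentence are assumptions for illustration, not the rules used by the actual extractor.

```python
import re

# Dates written as yyyy/mm/dd or mm/dd, as in the two examples on the slide.
DATE_RULE = re.compile(r"\b(?:\d{4}/)?\d{1,2}/\d{1,2}\b")

# Three numbers separated by "x", read as length x width x height.
SIZE_RULE = re.compile(
    r"(\d+(?:\.\d+)?)\s*x\s*(\d+(?:\.\d+)?)\s*x\s*(\d+(?:\.\d+)?)"
)


def extract_dates(text):
    """Return every date-like string found in the text, in order of appearance."""
    return DATE_RULE.findall(text)


def extract_sizes(text):
    """Return every (length, width, height) triple found in the text as floats."""
    return [tuple(float(v) for v in m.groups()) for m in SIZE_RULE.finditer(text)]


# Made-up report sentence for illustration.
report = "Tumor measuring 4.9 x 1 x 1.8 cm, resected on 2009/01/01, follow-up on 12/01."
print(extract_dates(report))
print(extract_sizes(report))
```

In a verification editor, matches like these would be shown to the user for review before being stored as structured fields.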
Result
Result
Discharge Summary System
Background
Old discharge summary system (Dis32)
– Client/server architecture
– Client applications must be installed and upgraded
Web discharge summary system
– Service-oriented architecture
– Online since 2009.10
Motivation
Discharge summary user interface
– Chief Complaint, Brief History
– Free-text fields
– How to generate a list of suggested phrases
Motivation
Auto-Complete
Language Modeling
We want to compute P(w1, w2, w3, w4, w5, …, wn), the probability of a sequence.
Alternatively, we want to compute P(w5 | w1, w2, w3, w4), the probability of a word given some previous words.
The model that computes P(W) or P(wn | w1, w2, …, wn-1) is called the language model.
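To make the conditional probability concrete, the sketch below estimates P(word | previous word) from bigram counts and uses it to propose next words, which is the basic idea behind phrase suggestion. It is a simplified maximum-likelihood bigram model without smoothing; the three training phrases are made up, and the real system's model order and training data are not given in the slides.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probability(counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0


def suggest(counts, prev, k=3):
    """Return up to k most frequent continuations of prev."""
    return [word for word, _ in counts[prev].most_common(k)]


# Made-up chief-complaint phrases for illustration.
corpus = [
    "chest pain for two days",
    "chest pain and dyspnea",
    "chest tightness for one week",
]
model = train_bigrams(corpus)
print(suggest(model, "chest"))
```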
SRILM
SRILM
– The SRI Language Modeling Toolkit
– SRILM is a toolkit for building and applying statistical language models (LMs)
– http://www.speech.sri.com/projects/srilm/
SRILM
Three Main Functionalities
– Generate the n-gram count file from the corpus
– Train the language model from the n-gram count file
– Calculate the test data perplexity using the trained language model
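The three steps above correspond to the SRILM command-line tools ngram-count and ngram. The sketch below only assembles the argument lists for a trigram model. The options -order, -text, -write, -read, -lm, and -ppl are standard SRILM options, while the file names and the choice of order 3 are assumptions for illustration.

```python
def srilm_commands(corpus, counts, model, test, order=3):
    """Return the three command lines: count n-grams, train the LM, compute perplexity."""
    n = str(order)
    return [
        # 1. Generate the n-gram count file from the corpus.
        ["ngram-count", "-order", n, "-text", corpus, "-write", counts],
        # 2. Train the language model from the n-gram count file.
        ["ngram-count", "-order", n, "-read", counts, "-lm", model],
        # 3. Calculate the test data perplexity using the trained model.
        ["ngram", "-order", n, "-lm", model, "-ppl", test],
    ]


# Hypothetical file names.
for argv in srilm_commands("complaints.txt", "complaints.count", "complaints.lm", "test.txt"):
    print(" ".join(argv))
```

Each list can be passed to subprocess.run on a machine where SRILM is installed.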
Implementation
N-gram Count File
– Chief Complaint, Brief History
Static
– Phrase lists
Dynamic
– AJAX + AutoComplete toolkit
Discharge notes
Results
Average time spent by 7 intern participants, in seconds (hh:mm:ss)
System Name | Time Spent
Client-server system | 652 seconds (00:10:52)
Web-based system | 372 seconds (00:06:12)
Healthcare Mining Project with Mongolia
Background
Taiwan-Mongolia
– National Science Council
– Mongolian Ministry of Education, Culture and Sciences
NTU-MUST
– Mongolian University of Science and Technology
3-Year Project
– 2009/8/1 to 2012/7/31
Motivation
Reduce cost
– Length of stay in hospital
– Early detection of disease
Improve quality and patient safety
– SOP, Clinical Pathways
Motivation
Clinical Pathway
– A way of treating a patient with a standardized procedure in order to
• Enhance the efficiency
• Increase the quality
• Lower the costs
• Shorten the length of stay in hospital
Usually represented in a script book and/or flow chart diagram
Project Goal
Build a data mining framework for
– Early detection of disease
• Find the sequential patterns between different diseases
– Standardized therapeutic procedure
• Discover clinical pathways and clinical guidelines
Mining Clinical Pathway
Clinical Database
Clinical Pathways
Clinical Data
The clinical data include
– Patient information
– Diagnosis
– Sequences of physicians' orders taken at different points in time
Clinical Sequence Mining System Diagram
Diagram labels: Data Preparation, Data Pre-Processing, Mining Model, Clinical Pathway Creation System, Historical Diagnosis Database, Orders Sequence, Knowledge Base, Alert and Reminding System
Data Preparation
Inpatient Department (IPD) raw data
– From 2007/1/1 to 2007/5/26
Discharge notes
– With admission/discharge diagnosis and chief complaint; 22,000 records
Diagnosis records in IPD
– With ICD-9 code
Related orders in IPD
Data Preparation
Chief complaint
– For scheduled chemotherapy
– Total
• 791 cases
• 33,771 physician orders
Data Pre-processing
Select relevant data according to the order type attribute
– Drop non-meaningful orders such as nursing care and administrative routine orders
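A minimal sketch of this filtering step is shown below. The record layout, the field names, and the type codes "NUR" and "ADM" are hypothetical placeholders; the slides do not say which real order type codes were dropped.

```python
# Hypothetical codes for order types considered non-meaningful for pathway mining.
EXCLUDED_TYPES = {"NUR", "ADM"}


def drop_excluded(orders, excluded=EXCLUDED_TYPES):
    """Keep only the orders whose type code is not in the excluded set."""
    return [order for order in orders if order["ordertypecode"] not in excluded]


# Made-up records for illustration.
orders = [
    {"case": 1, "ordertypecode": "LAB", "order": "CBC & platelet"},
    {"case": 1, "ordertypecode": "NUR", "order": "vital signs q8h"},
    {"case": 1, "ordertypecode": "ADM", "order": "bed assignment"},
    {"case": 2, "ordertypecode": "LAB", "order": "Sodium, Na"},
]
relevant = drop_excluded(orders)
print(len(relevant), "of", len(orders), "orders kept")
```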
Order Type Statistics
ordertypecode | cnt | ordercnt
R | 10309 | 368
T | 6180 | 135
A | 6063 | 20
L | 5569 | 175
M | 4026 | 84
D | 814 | 25
X | 360 | 41
B | 168 | 5
O | 106 | 58
J | 47 | 6
E | 40 | 14
P | 12 | 4
I | 11 | 3
N | 6 | 2
Mining Model
Sequence Clustering Algorithm
Mining Tool
– Microsoft SQL Server 2005
– Sequence Clustering Model
– Visualize Data Analysis
Parameter
– Support
– Confidence
Sequence Clustering Mining
The Sequence Clustering algorithm finds clusters of cases that contain similar paths in a sequence.
Sequence Clustering Sample
CustomID | Sequence Data
1 | (30) (60 90)
2 | (10 20) (30) (40 60 70)
3 | (30 50 70)
4 | (30) (40 70) (90)
5 | (90)
Sequential patterns: (30) (90); (30) (40 70)
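The two patterns in the sample can be checked by counting how many customers' sequences contain each of them. The sketch below does this with a simple subsequence test on the five sample sequences. It illustrates support counting for sequential patterns only; it is not the Microsoft Sequence Clustering algorithm, and no minimum-support threshold is stated in the slides.

```python
# The five sample sequences; each sequence is an ordered list of itemsets.
SEQUENCES = {
    1: [{30}, {60, 90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}


def contains(sequence, pattern):
    """True if the pattern's itemsets occur, in order, as subsets of the sequence's itemsets."""
    position = 0
    for itemset in sequence:
        if position < len(pattern) and pattern[position] <= itemset:
            position += 1
    return position == len(pattern)


def support(pattern, sequences=SEQUENCES):
    """Number of sequences that contain the pattern."""
    return sum(contains(sequence, pattern) for sequence in sequences.values())


print(support([{30}, {90}]))
print(support([{30}, {40, 70}]))
```

Under the mapping used in this deck, a customer corresponds to a patient and each itemset to the orders placed together at one point in time.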
Mapping
Customer | Patient
Item | Order
Shopping Cart | Concurrent Orders
Result
Sequence Sample
08011CZP CBC & platelet
08013CZP WBC differential count
09025CZP AST (GOT), liver function index
09026CZP ALT (GPT), liver function index
09038CZP Albumin (Blood)
09002CZP (Blood) UN
09015CZP (Blood) Creatinine
09029CZP Bilirubin, total
09021CZP Sodium, Na
09022CZP Potassium, K
The SAGE Guideline Model
Standards-Based Sharable Active Guideline Environment
– Developed by
• Stanford Medical Informatics, IDX Systems Corporation, Apelon Inc., Intermountain Health Care, Mayo Clinic, and University of Nebraska Medical Center
The Protégé
Activity Graphs
Aspirin Therapy for diabetic patients
Cooperation Architecture
Diagram labels: VM Images; Hospital in Taiwan (VM-DB, VM-Web); Hospital in Mongolia (VM-DB, VM-Web); Model Feedback
Cloud Architecture
Diagram labels: Health Mining Server; Hospital in Mongolia; Hospital in Taiwan; Hospital in Canada
Conclusions
A measure that uses page counts to calculate the semantic similarity between two given concepts.
A semantic-driven keyword matching extractor helps extract data items from reports.
Conclusions
A highly interactive free-text editor with an auto-complete feature speeds up the composition of discharge summaries.
A data mining framework is proposed.
Future Works
Find out why corpus-based methods correlate more closely with physicians' scores than with experts' scores
Structuralize the healthcare documents
Prove the robustness of the data mining models
– Variation analysis across hospitals/regions
– Taiwan and Mongolia
– Canada, Taiwan, and Mongolia
Q&A