Data, Databases, and Discovery

Andy Novobilski, PhD
UT Chattanooga Computer Science
Nuts and Bolts
Research Methods Symposium
UT College of Medicine Chattanooga
September 29, 2006
An Introduction to Knowledge Discovery
• Data Collection
• Data Validation
• Preprocessing of Data
• Mining the Data
• Comparing Methods
Data Collection …
• Paper or Electronic?
– Fingernet
• Continuous or Discrete?
• And the Understatement of the Year …
– Health Insurance Portability and Accountability Act of 1996 (HIPAA)
– The HIPAA website, http://www.hipaa.org/, links to the government’s website, http://aspe.hhs.gov/admnsimp/, which states “Administrative Simplification in the Health Care Industry”
… And Raw Storage …
• Alphanumeric Data
– Excel Worksheets
– Comma/Tab Delimited Text Files
– XML: The Extensible Markup Language
• http://www.xml.com/
• Binary Data
– Images
• GIF, BMP, EPS
– Streaming Data
• HL7 - http://www.hl7.org/ (http://en.wikipedia.org/wiki/HL7)
• DICOM - http://medical.nema.org/
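As a minimal sketch of working with comma- or tab-delimited text, the following Python uses the standard csv module. The column names and rows are hypothetical and only stand in for the kind of alphanumeric data listed above.

```python
import csv
import io

# Hypothetical comma-delimited export; in practice this would be a file on disk.
RAW_TEXT = """Id,Gender,Age,Temp
001,M,55,98.3
002,F,47,99.1
"""

def read_delimited(text, delimiter=","):
    """Parse delimited text into a list of dictionaries keyed by column name."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]

rows = read_delimited(RAW_TEXT)
for row in rows:
    print(row["Id"], row["Gender"], row["Age"], row["Temp"])
```

Every value arrives as a string, so numeric columns still need converting and checking before analysis.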
… Stored in a Relational Manner
• Relational Databases
– Inexpensive
• MS Access
– Expensive
• MS SQL Server, Oracle, Sybase, …
– Free (sort of … open source)
• MySQL, PostgreSQL
• Licensing Varies by Usage
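A small sketch of relational storage, using Python's built-in sqlite3 module as a stand-in. SQLite is an assumption here, not one of the products named above, and the table and column names are hypothetical.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient (id TEXT PRIMARY KEY, gender TEXT, age INTEGER)"
)
conn.execute(
    "CREATE TABLE visit (patient_id TEXT REFERENCES patient(id), temp REAL)"
)
conn.executemany(
    "INSERT INTO patient VALUES (?, ?, ?)",
    [("001", "M", 55), ("002", "F", 47)],
)
conn.executemany(
    "INSERT INTO visit VALUES (?, ?)",
    [("001", 98.3), ("001", 99.0), ("002", 98.6)],
)

# Join the two tables and count visits per patient.
visit_counts = conn.execute(
    """
    SELECT p.id, p.gender, COUNT(v.temp)
    FROM patient AS p JOIN visit AS v ON v.patient_id = p.id
    GROUP BY p.id, p.gender
    ORDER BY p.id
    """
).fetchall()
print(visit_counts)
```

Keeping patients and visits in separate tables linked by an id is what "stored in a relational manner" amounts to in practice.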
Data Validation
Id   Gender  Age  Months Pregnant  Temp  Smoker
001  M       55   0                98.3  Yes
002  M       55   9                9.82  .
• Patient 002 is a …
– Pregnant Male (hit the 9 instead of 0)
– With Ice Water in His Veins (misplaced decimal)
– Who Might or Might Not Smoke (missing data)
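The three problems with Patient 002 can be caught by simple rule checks. The sketch below is one possible way to do that in Python; the field names, the plausible temperature range, and the list of required fields are assumptions made for illustration.

```python
# Hypothetical plausible range for body temperature in degrees Fahrenheit.
TEMP_MIN, TEMP_MAX = 90.0, 110.0
REQUIRED_FIELDS = ("gender", "age", "months_pregnant", "temp", "smoker")

def validate(record):
    """Return a list of human-readable problems found in one patient record."""
    problems = []
    # Cross-field check: a male patient should not have months pregnant > 0.
    if record.get("gender") == "M" and (record.get("months_pregnant") or 0) > 0:
        problems.append("male patient with months pregnant > 0")
    # Range check: catches a misplaced decimal such as 9.82 for 98.2.
    temp = record.get("temp")
    if temp is not None and not (TEMP_MIN <= temp <= TEMP_MAX):
        problems.append("temperature outside plausible range")
    # Completeness check: missing values are stored as None here.
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            problems.append(f"missing value for {field}")
    return problems

patient_001 = {"id": "001", "gender": "M", "age": 55,
               "months_pregnant": 0, "temp": 98.3, "smoker": "Yes"}
patient_002 = {"id": "002", "gender": "M", "age": 55,
               "months_pregnant": 9, "temp": 9.82, "smoker": None}

for patient in (patient_001, patient_002):
    print(patient["id"], validate(patient))
```

Flagging records for review, rather than silently correcting them, leaves the decision about the true value with someone who can check the source.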
Preprocessing the Data
• Clean-up
– Out of Scope vs. Out of Family
• Feature Extraction
– Data Aggregation
• Feature Transformation
– Normalization
– Principal Component Analysis
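As an illustration of the normalization step only, here is a short Python sketch of two common rescalings, min-max and z-score. The age values are hypothetical, and the slides do not say which form of normalization is intended.

```python
from statistics import mean, pstdev

def min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center values on the mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

ages = [30, 40, 50, 60, 70]  # hypothetical feature column
print(min_max(ages))
print([round(z, 3) for z in z_score(ages)])
```

Either rescaling keeps a feature with a large numeric range, such as age, from dominating one with a small range when distances between cases are computed.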
Turning Data into Information
• Data Mining …
– Clustering
– Decision Trees
– Neural Networks
– Bayesian Networks
Clustering K-Means
[Figure: data points labeled Y or N, grouped into clusters]
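A minimal sketch of K-Means in Python, alternating an assignment step and a centroid update step. The two-dimensional points and the starting centroids are hypothetical, and a fixed number of iterations is used instead of a convergence test to keep the example short.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate nearest-centroid assignment with centroid recomputation."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments

# Hypothetical two-feature cases forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = k_means(points, centroids=[points[0], points[3]])
print(centers)
print(labels)
```

The result depends on the starting centroids, so in practice the algorithm is usually run from several starting points.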
Decision Trees
• Division of Data Based on Information Gain
• White Box
[Figure: example decision tree splitting on Gender (M/F), Smoker, and Age, with leaves labeled Y or N]
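To make "division of data based on information gain" concrete, here is a small Python sketch that computes entropy and the gain from splitting on one attribute. The four cases and their attribute names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical cases: here the outcome follows smoking status exactly.
cases = [
    {"gender": "M", "smoker": "Y", "outcome": "Y"},
    {"gender": "M", "smoker": "N", "outcome": "N"},
    {"gender": "F", "smoker": "Y", "outcome": "Y"},
    {"gender": "F", "smoker": "N", "outcome": "N"},
]
for attribute in ("gender", "smoker"):
    print(attribute, information_gain(cases, attribute, "outcome"))
```

A tree builder would split first on the attribute with the highest gain, then repeat the calculation within each branch.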
Neural Networks
• Functional Approximation to Data
– Black Box: Case Data in, Forecast out
– Most Common is Feed Forward, Back Propagation
• Considerations in Training the Network
– Many Types of Neural Networks
– Difficulties with Discrete Data
– Missing Data Requires Careful Consideration
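A minimal sketch of a feed-forward network trained with back propagation, written in plain Python. The network size, learning rate, number of passes, and the four training cases (two binary inputs, outcome is 1 if either input is 1) are all assumptions chosen to keep the example small.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyNetwork:
    """Inputs -> one layer of sigmoid hidden units -> one sigmoid output."""

    def __init__(self, n_inputs=2, n_hidden=3, seed=0):
        rng = random.Random(seed)
        # Each weight list ends with a bias weight.
        self.hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)]
                       for _ in range(n_hidden)]
        self.output = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(self, x):
        """Feed the case data forward; return hidden activations and forecast."""
        h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0])))
             for row in self.hidden]
        y = sigmoid(sum(w * v for w, v in zip(self.output, h + [1.0])))
        return h, y

    def train_step(self, x, target, rate=0.5):
        """One back propagation update for a single case (squared error)."""
        h, y = self.forward(x)
        delta_out = (y - target) * y * (1.0 - y)
        # Hidden deltas are computed before the output weights change.
        delta_hidden = [delta_out * self.output[j] * h[j] * (1.0 - h[j])
                        for j in range(len(h))]
        for j, value in enumerate(h + [1.0]):
            self.output[j] -= rate * delta_out * value
        for j, row in enumerate(self.hidden):
            for i, value in enumerate(x + [1.0]):
                row[i] -= rate * delta_hidden[j] * value

    def loss(self, cases):
        """Mean squared error of the forecasts over a set of cases."""
        return sum((self.forward(x)[1] - t) ** 2 for x, t in cases) / len(cases)

# Hypothetical case data: two binary inputs and a binary outcome.
CASES = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
         ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

net = TinyNetwork()
initial_loss = net.loss(CASES)
for _ in range(2000):
    for x, t in CASES:
        net.train_step(x, t)
final_loss = net.loss(CASES)
print(initial_loss, final_loss)
```

The learned weights are what make this a black box: the forecast can be checked against known cases, but the weights themselves do not explain it.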
Bayesian Networks
• Belief Networks
– White Box
• Causal Orientation
• Beliefs are Updated Based on New Information
• Nodes Can Serve as Both Evidence and Query Points
• Handles Missing Data Gracefully
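The points above can be illustrated with a two-node network, Smoker → Ill, queried by enumerating the joint distribution. This is a sketch in Python; the node names and all probabilities are hypothetical and are not taken from the study cited in these slides.

```python
from itertools import product

# Hypothetical probabilities for a two-node network: Smoker -> Ill.
P_SMOKER = 0.3
P_ILL_GIVEN = {True: 0.2, False: 0.05}  # P(ill | smoker status)

def joint(smoker, ill):
    """Joint probability of one complete assignment of both nodes."""
    p_s = P_SMOKER if smoker else 1.0 - P_SMOKER
    p_i = P_ILL_GIVEN[smoker] if ill else 1.0 - P_ILL_GIVEN[smoker]
    return p_s * p_i

def query(target, evidence=None):
    """P(target is True | evidence), summing over anything unobserved."""
    evidence = evidence or {}
    numerator = denominator = 0.0
    for smoker, ill in product([True, False], repeat=2):
        world = {"smoker": smoker, "ill": ill}
        if any(world[name] != value for name, value in evidence.items()):
            continue  # inconsistent with what has been observed
        p = joint(smoker, ill)
        denominator += p
        if world[target]:
            numerator += p
    return numerator / denominator

print(query("ill"))                       # smoking status missing
print(query("ill", {"smoker": True}))     # smoker as evidence
print(query("smoker", {"ill": True}))     # smoker as query
```

The same node is used as evidence in one call and as the query in another, and a missing value is handled by summing over both of its possible states.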
An Example
Novobilski, Andrew, F. Fesmire, D. Sonnemaker. "Mining Bayesian Networks to Forecast Adverse Outcomes Related to Acute Coronary Syndrome." The 17th International FLAIRS Conference, 2004.
Comparing Models – The ROC Curve
• The Receiver Operating Characteristic (ROC) Curve
– Plots the Percentage of True Positives against the Percentage of False Positives as the Cutoff Value is varied from everyone classified as ill to everyone classified as healthy.
– The Area Under the Curve provides a consistent measure of model fitness that varies between 0 and 100.
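A short Python sketch of how the curve can be built from model scores by sweeping the cutoff. The scores and labels are hypothetical, and the 0 to 100 measure is taken here to be the area under the curve expressed as a percentage, which is an assumption about what the slide intends.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per cutoff."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Cutoffs run from above every score (no one called ill)
    # down to the lowest score (everyone called ill).
    cutoffs = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for cutoff in cutoffs:
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC curve, as a fraction between 0 and 1."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical model scores (higher means more likely ill); 1 = ill, 0 = healthy.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points)
print(round(100 * auc, 1))
```

Because the area does not depend on any single cutoff, it can be used to compare classifiers that produce scores on different scales.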
An Illustration
[Figure: Healthy and Ill groups separated by a Cutoff Value]
Comparing Multiple Classifiers
In Summary …
• A Process to Consider …
– Collect, Validate, Preprocess, Mine, Compare
• Excellent Software is Available
– Both Commercial and Open Source
• Sample Data Is Available
Thank You!
• Questions and/or Comments are Welcome …
Dr. Andy Novobilski
UT Chattanooga Computer Science
615 McCallie Ave., Dept. 2302
Chattanooga, TN 37403
(423) 425-4202
http://www.utc.edu/Faculty/Andy-Novobilski