Transcript Slide 1
In the name of God
Data Mining and Its Applications in Medicine
Student number: 85233510
Student name: Babak Razzaghi
Supervisor: Dr. Towhidkhah (seminar for the course "Applications of Information Technology in Medicine")
WHY DATA MINING?
Necessity is the mother of invention
Huge amounts of data
Electronic records of our decisions
Choices in the supermarket
Financial records
Our comings and goings
We swipe our way through the world – every swipe is a record in a database
Data rich – but information poor
Lying hidden in all this data is information!
WHAT IS DATA MINING?
Extracting or “mining” knowledge from large amounts of data
Data-driven discovery and modeling of hidden patterns in large volumes of data
Extraction of implicit, previously unknown, unexpected, and potentially very useful information from data
DATA VISUALIZATION
Data mining
Large database
Data visualization
Ways of seeing patterns in large data sets
Uses the efficiency of human pattern recognition
TERMINOLOGY
Gold Mining
Knowledge mining from databases
Knowledge extraction
Data/pattern analysis
Knowledge Discovery in Databases (KDD)
Knowledge Discovery Process
[Figure: knowledge discovery pipeline]
Raw Data -> Integration -> Data Warehouse -> Target Data -> Transformed Data -> Patterns and Rules -> Interpretation & Evaluation -> Knowledge -> Understanding
DATA MINING CENTRAL QUEST
Find true patterns and avoid overfitting (false patterns due to randomness)
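A minimal sketch of how a false pattern can arise from randomness alone. Everything here is an illustrative assumption rather than part of the original slides: the function name `best_random_feature`, the sample sizes, and the use of purely random binary features and labels. The code picks whichever noise feature happens to match the labels best on a small training set, then scores that same feature on held-out rows.

```python
import random

def best_random_feature(n_train=20, n_test=1000, n_features=500, seed=0):
    """Pick the random binary feature that best matches random labels on the
    training rows, then score that same feature on held-out rows."""
    rng = random.Random(seed)
    n = n_train + n_test
    labels = [rng.randint(0, 1) for _ in range(n)]
    # Every feature is pure noise: it carries no information about the labels.
    features = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n_features)]

    def accuracy(feature, start, stop):
        hits = sum(feature[i] == labels[i] for i in range(start, stop))
        return hits / (stop - start)

    # "Mining": keep the feature that looks best on the training rows only.
    best = max(features, key=lambda f: accuracy(f, 0, n_train))
    return accuracy(best, 0, n_train), accuracy(best, n_train, n)

train_acc, test_acc = best_random_feature()
print(f"training accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

With many candidate features and few training rows, the selected feature tends to look predictive on the training rows while staying near chance on the held-out rows, which is why a held-out evaluation is a common guard against overfitting.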
MAJOR DATA MINING TASKS
Classification: predicting an item class
Clustering: finding clusters in data
Associations: e.g. A & B & C occur frequently
Visualization: to facilitate human discovery
Summarization: describing a group
Estimation: predicting a continuous value
Deviation Detection: finding changes
Link Analysis: finding relationships
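As a rough illustration of the association task ("A & B & C occur frequently"), the sketch below counts how often each group of items appears together across transactions. The function name `frequent_itemsets`, the `min_support` parameter, and the baskets are hypothetical and invented for this example. It is a brute-force count, not an optimized algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, size, min_support):
    """Return itemsets of the given size that appear in at least
    min_support (a fraction between 0 and 1) of the transactions."""
    counts = Counter()
    for basket in transactions:
        # Sort so that each itemset has a single canonical tuple form.
        for itemset in combinations(sorted(set(basket)), size):
            counts[itemset] += 1
    n = len(transactions)
    return {items: c / n for items, c in counts.items() if c / n >= min_support}

# Hypothetical supermarket baskets, invented for illustration.
baskets = [
    ["A", "B", "C"],
    ["A", "B", "C", "D"],
    ["A", "B"],
    ["B", "C", "D"],
    ["A", "B", "C"],
]
print(frequent_itemsets(baskets, size=3, min_support=0.5))
```

Here A, B and C appear together in three of the five baskets, so they are the only triple that reaches the 50% support threshold.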
DATA MINING CHALLENGES
Computationally expensive to investigate all possibilities
Dealing with noise/missing information and errors in data
Choosing appropriate attributes/input representation
Finding the minimal attribute space
Finding adequate evaluation function(s)
Extracting meaningful information
Avoiding overfitting
DATA MINING SOFTWARE
Insightful Miner
Angoss Knowledge ACCESS
ARMiner
Eudaptics Viscovery
Goal TV
MDR
Viscovery SOMine
SPSS
DATA MINING APPLICATIONS
Science: Chemistry, Physics
Bioscience
Financial Industry - banks, businesses, e-commerce
Sequence-based analysis
Protein structure and function prediction
Protein family classification
Microarray gene expression
Stock and investment analysis
Pharmaceutical companies
Health care
Sports and Entertainment
Clinical Data Mining processes
Digital format for all pertinent data
Create structure
Obtain coded information
Natural language understanding
Create a widely accessible repository
CLASSIFICATION EXAMPLE FOR MEDICAL DIAGNOSIS AND PROGNOSIS: HEART DISEASE
Minimum systolic blood pressure over a 24-hour period following admission to the hospital
  <= 91: Class 2: Early death
  > 91: Age of patient
    <= 62.5: Class 1: Survivors
    > 62.5: Was there sinus tachycardia?
      YES: Class 2: Early death
      NO: Class 1: Survivors
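The heart-disease classification tree above can be read as a chain of if/else rules. The sketch below is one way to write it down. The function name `predict_outcome` and its parameter names are hypothetical. The thresholds (91 and 62.5) and class labels come from the slide. The mapping of the final YES/NO branches to classes is an assumption, because the flattened slide text does not make it explicit.

```python
def predict_outcome(min_systolic_bp, age, sinus_tachycardia):
    """Walk the three-question tree from the slide and return a class label.

    min_systolic_bp: minimum systolic blood pressure over the 24 hours
                     following admission
    age: age of the patient in years
    sinus_tachycardia: True if sinus tachycardia was present
    """
    if min_systolic_bp <= 91:
        return "Class 2: Early death"
    if age <= 62.5:
        return "Class 1: Survivors"
    # Assumed branch mapping: tachycardia present -> early death.
    if sinus_tachycardia:
        return "Class 2: Early death"
    return "Class 1: Survivors"

# Hypothetical patients, invented for illustration.
print(predict_outcome(min_systolic_bp=85, age=50, sinus_tachycardia=False))
print(predict_outcome(min_systolic_bp=110, age=70, sinus_tachycardia=False))
```

Each patient is classified with at most three questions, which is part of what makes such a tree easy to explain to clinicians.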
GENOME, DNA & GENE EXPRESSION
An organism’s genome is the “program” for making the organism, encoded in DNA
Human DNA has about 30,000-35,000 genes
A gene is a segment of DNA that specifies how to make a protein
Cells are different because of differential gene expression
About 40% of human genes are expressed at one time
Microarray devices measure gene expression
MICROARRAY RAW IMAGE
[Figure: scanner output, with an enlarged section of the raw image]
Raw data:
Gene             Value
D26528_at        193
D26561_cds1_at   -70
D26561_cds2_at   144
D26561_cds3_at   33
D26579_at        318
D26598_at        1764
D26599_at        1537
D26600_at        1204
D28114_at        707
MICROARRAY POTENTIAL APPLICATIONS
New and better molecular diagnostics
New molecular targets for therapy
  few new drugs, large pipeline, …
Outcome depends on genetic signature
  best treatment?
Fundamental Biological Discovery
  finding and refining biological pathways
Personalized medicine ?!
MICROARRAY DATA MINING CHALLENGES
Avoiding false positives, due to
too few records (samples), usually < 100
too many columns (genes), usually > 1,000
Model needs to be robust in presence of noise
For reliability, need large gene sets; for diagnostics or drug targets, need small gene sets
Estimate class probability
Model needs to be explainable to biologists
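A rough worked example of why many columns and few records invite false positives. The significance level of 0.05 and the function names are assumptions made for illustration. The gene count of 1,000 is the lower bound quoted above. If every gene is tested separately and none is truly related to the class, some will still pass the test by chance. The Bonferroni correction is one standard (and conservative) way to tighten the per-gene threshold.

```python
def expected_false_positives(n_genes, alpha):
    """Expected number of genes passing a per-gene significance level alpha
    when no gene is truly associated with the class."""
    return n_genes * alpha

def bonferroni_threshold(n_genes, alpha):
    """Per-gene threshold that keeps the chance of any false positive
    at or below alpha (Bonferroni correction)."""
    return alpha / n_genes

n_genes = 1000   # lower bound on the column count quoted above
alpha = 0.05     # assumed significance level, not from the slides

print(expected_false_positives(n_genes, alpha))
print(bonferroni_threshold(n_genes, alpha))
```

Under these assumptions about 50 genes would look significant purely by chance, which is more than a small diagnostic gene set would contain in the first place.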
INITIAL QUERY PAGE
CLUSTERS MATCHING QUERY RESULTS
DISPLAY OF CLUSTER
DATA MINING SOFTWARE GUIDE
CONCLUSION
Discover useful relationships in data
Discover information otherwise overlooked
Provide intelligence to improve various phases
Intellectual property
Competitive advantages:
Getting more out of your data
Finding other relevant information faster
Exploratory, hypothesis-generating analyses
Increase productivity – reduce the time and money spent
Thank You All