Data mining definition



Data Mining for the NHS
Information Authority
Brief review
by Evandro Leite Jr
Is it maths, management or
computer science?
Data mining definition:
- Analysis of large volumes of data to extract
important trends and higher-level information.
"We are drowning in data, but starving for knowledge!"
(J. Naisbitt)
Data mining has become a Computer Science subject in the
last 10 years, but it will always rest on mathematics as
its foundation.
Some quick definitions
• Variables:
Continuous: its measured values are real numbers (e.g. 73.827, 23).
Categorical: takes values in a finite set with no natural ordering (e.g.
black, red, green).
Ordered: takes values in a finite set that has a natural ordering (e.g. age in
years, an interval of integers, dates such as 01/09/2004).
Dependent variable (or set of classes): the aspect of the data to be studied.
Independent variables (or set of attributes): the variables that are manipulated to
explain the dependent variable.
• Types of problems:
Regression-type -> dependent variable is continuous.
Ex: house selling price (the £ value is a real number).
Classification-type -> dependent variable is categorical.
Ex: who will graduate (yes, no); yes and no are categories.
DECISION TREES SOLVE BOTH CLASSIFICATION AND REGRESSION
PROBLEMS.
The focus of the project
There are many mathematical and computing tools that can
be applied to data mining.
Association Rules, Regression, Classification and Clustering.
For now the focus has been given to:
• Classification using Classification Trees.
• Regression using Regression Trees.
Classification Trees
CAN BE USED ONLY IF THE DEPENDENT VARIABLE IS CATEGORICAL.
Ex: explain why patients die after admission to the ICU.
Outcome (Survived/Died)
Aim: to understand complex datasets by splitting them into
subsets with lower entropy. The key is how to choose the
best attribute to split the data.
Classification Trees
How to choose the best attribute to split?
Gini impurity
Used by the CART algorithm (Classification and Regression Trees). Suppose y
takes on values in {1, 2, …, m}, and let f(i, j) = frequency of value j in node i.
That is f(i, j) is the proportion of records assigned to node i for which y = j.
I_G(i) = 1 − Σ_{j=1}^{m} f(i, j)²
Entropy
Used by the C4.5 and C5.0 algorithms. This measure is based on the concept
of entropy used in information theory.
I_E(i) = − Σ_{j=1}^{m} f(i, j) log f(i, j)
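The two impurity measures above can be computed directly from the class frequencies f(i, j) at a node. A minimal sketch in Python (the function names are illustrative):

```python
import math

def gini(freqs):
    """Gini impurity I_G(i) = 1 - sum_j f(i, j)^2 over the class frequencies at a node."""
    return 1.0 - sum(f * f for f in freqs)

def entropy(freqs):
    """Entropy I_E(i) = -sum_j f(i, j) * log2 f(i, j); zero-frequency terms contribute 0."""
    return -sum(f * math.log2(f) for f in freqs if f > 0)

# A maximally mixed two-class node has the highest impurity under both measures.
print(gini([0.5, 0.5]))     # 0.5
print(entropy([0.5, 0.5]))  # 1.0
```

A pure node (all records in one class, frequencies like [1.0, 0.0]) scores zero under both measures, which is why splits are chosen to move nodes toward purity.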
Gaps in knowledge
• Are there other algorithms and functions to be
found?
• What are the best functions and algorithms for
each dataset?
• The way to measure the goodness of a tree is
known. However, finding the best-sized tree is an
NP-complete problem. How can that be improved?
• How can the best of neural networks, support
vector machines, relation rules, decision trees
etc. be combined to create a meta-learner and a
meta-meta-learner?
A software which can implement
multiple algorithms
• The software will be able to run the
different algorithms for the same dataset.
– Trees generated by the different algorithms will
be created and compared. The user will be able
to compare them visually, or to pick the one
with the lowest misclassification rate or model
complexity.
– Depending on the nature of the problem
(classification or regression), a specific
algorithm can be much more efficient.
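The pick-the-lowest-misclassification-rate step can be sketched as below. The two model names are hypothetical stand-ins for trees produced by different algorithms (e.g. CART and C4.5); each is modelled here as a plain predict function:

```python
def misclassification_rate(predict, records, labels):
    """Fraction of records whose predicted class differs from the true class."""
    wrong = sum(1 for rec, y in zip(records, labels) if predict(rec) != y)
    return wrong / len(labels)

def pick_best(models, records, labels):
    """Return the (name, rate) pair with the lowest misclassification rate."""
    rates = {name: misclassification_rate(p, records, labels)
             for name, p in models.items()}
    best = min(rates, key=rates.get)
    return best, rates[best]

# Toy data and hypothetical models: one tree splits on "windy", one always says Play.
records = [{"windy": False}, {"windy": True}, {"windy": False}]
labels = ["Play", "Don't Play", "Don't Play"]
models = {
    "cart_tree": lambda r: "Don't Play" if r["windy"] else "Play",
    "c45_tree": lambda r: "Play",
}
print(pick_best(models, records, labels))  # the windy-based tree errs on 1 of 3
```

In the real software the lambdas would be replaced by the fitted trees, and model complexity could be scored alongside the error rate.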
Last presentation’s play golf dataset
Independent variables: OUTLOOK, TEMPERATURE, HUMIDITY, WINDY. Dependent variable: PLAY.

OUTLOOK   TEMPERATURE  HUMIDITY  WINDY  PLAY
sunny          85         85     FALSE  Don't Play
sunny          80         90     TRUE   Don't Play
overcast       83         78     FALSE  Play
rain           70         96     FALSE  Play
rain           68         80     FALSE  Play
rain           65         70     TRUE   Don't Play
overcast       64         65     TRUE   Play
sunny          72         95     FALSE  Don't Play
sunny          69         70     FALSE  Play
rain           75         80     FALSE  Play
sunny          75         70     TRUE   Play
overcast       72         90     TRUE   Play
overcast       81         75     FALSE  Play
rain           71         80     TRUE   Don't Play
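Choosing the best attribute to split this dataset can be sketched with the entropy-based information gain used by C4.5. Only the two categorical attributes, OUTLOOK and WINDY, are considered here for simplicity:

```python
import math
from collections import Counter

# The play-golf records above as (OUTLOOK, WINDY, PLAY) tuples.
data = [
    ("sunny", "FALSE", "Don't Play"), ("sunny", "TRUE", "Don't Play"),
    ("overcast", "FALSE", "Play"), ("rain", "FALSE", "Play"),
    ("rain", "FALSE", "Play"), ("rain", "TRUE", "Don't Play"),
    ("overcast", "TRUE", "Play"), ("sunny", "FALSE", "Don't Play"),
    ("sunny", "FALSE", "Play"), ("rain", "FALSE", "Play"),
    ("sunny", "TRUE", "Play"), ("overcast", "TRUE", "Play"),
    ("overcast", "FALSE", "Play"), ("rain", "TRUE", "Don't Play"),
]
ATTRS = {"OUTLOOK": 0, "WINDY": 1}

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index):
    """Entropy of the parent node minus the weighted entropy of the child subsets."""
    parent = entropy([r[-1] for r in rows])
    children = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        children += len(subset) / len(rows) * entropy(subset)
    return parent - children

best = max(ATTRS, key=lambda a: information_gain(data, ATTRS[a]))
print(best)  # OUTLOOK
```

OUTLOOK wins because its "overcast" branch is already pure (all Play), while splitting on WINDY still leaves both classes heavily mixed in each branch.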
Comparison between decision tree algorithms
AnswerTree solution using the famous CART algorithm
(SPSS analytical software)
Comparison between decision tree algorithms
Spartacus Data Mining tools using the C4.5 algorithm
Southampton University
End of the introductory part
Part 1:
The meta- and meta-meta-learners
• The meta-learner
– The user will choose the dataset and the
variables.
– A series of trial runs, using combinations
of different methods, will be the input to a
neural network (the meta-learner).
[Diagram: the dataset is fed to two classifiers, C1 (C&RT) and C2 (QUEST). Each produces a set of rules together with measurements of data quality, CPU time and memory utilisation. These outputs feed a neural network, the meta-learner, whose outputs are: optimal data quality, simpler rules, total CPU time, memory utilisation, and Σ_c memory(c) / CPU(c) over the total time.]
The meta-meta-learners
[Diagram: the dataset feeds three meta-learners (Meta-Learner 1, Meta-Learner 2, Meta-Learner 3), built from combinations of methods such as C&RT, neural networks, linear discriminant, relation rules, C4.5 and STR-Tree. The best meta-learner is selected by a user-defined function, which could be:
Best meta-learner = DataQuality * A + SimplerRules * B - Memory * C - Time * D
or by another neural network (probably not necessary).]
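The user-defined selection function can be sketched as follows. The weights A-D, the metric names and the candidate values are purely illustrative:

```python
def score(metrics, weights):
    """Higher is better: quality and simplicity add; memory and time subtract."""
    return (metrics["data_quality"] * weights["A"]
            + metrics["simpler_rules"] * weights["B"]
            - metrics["memory"] * weights["C"]
            - metrics["time"] * weights["D"])

# Illustrative user weights and normalised metrics for two candidate meta-learners.
weights = {"A": 2.0, "B": 1.0, "C": 0.5, "D": 0.5}
candidates = {
    "ML1": {"data_quality": 0.9, "simpler_rules": 0.4, "memory": 0.6, "time": 0.8},
    "ML2": {"data_quality": 0.7, "simpler_rules": 0.9, "memory": 0.2, "time": 0.3},
}
best = max(candidates, key=lambda m: score(candidates[m], weights))
print(best)  # ML2
```

Here ML2 wins despite slightly lower data quality, because it produces simpler rules and costs less memory and time under these weights.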
The meta-meta-learner's user input
and output
Input:
Dataset name? NHS
Dependent variables? LOS, OUTCOME, STROKE
How much do you care about:
Data quality (0-99)
Parsimonious models (0-99)
Time to process (0-99)
Memory utilisation (0-99)
Output:
The best meta-learner for you is a combination of:
C4.5, ANN and relation rules.
These are the best rules:
1- IF HEART ATTACK and AGE > 90
then OUTCOME = DEATH (error 3%)
2- Everybody that has STROKE also
has HIGH BLOOD PRESSURE
3- AGE * 2.3 + APACHE 2 * 0.4 =
LOS (error 25%)
A software which can implement
multiple algorithms
Once the best meta-learner is found for a given
situation, dataset and dependent variable, the
user can define this meta-learner as the one
to be executed in similar situations.
– Ex: to find out the patients' LOS in the
ICU datasets, ML3 (C&RT) will be used.
However, to find out the outcome of the patient
(died or survived), ML103 (C4.5, relation
rules) will be used.
Nice things about decision trees
Regression Trees
CAN BE USED ONLY IF THE DEPENDENT VARIABLE IS CONTINUOUS.
Ex: time a patient stays in the hospital (LOS in days).
Aim: to reduce the entropy of a dataset by splitting it into
subsets with less entropy. The key is how to choose the
best attribute to split the data.
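For a continuous dependent variable, CART-style regression trees typically measure node impurity by variance rather than entropy; a split is good if it reduces the weighted variance of the children. A minimal sketch, with hypothetical length-of-stay (LOS) data and an age threshold as the candidate split:

```python
def variance(values):
    """Mean squared deviation from the mean: the regression analogue of impurity."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(rows, split):
    """Parent variance minus the weighted variance of the two child subsets.
    rows: (attribute_value, target) pairs; split: predicate on the attribute."""
    left = [y for x, y in rows if split(x)]
    right = [y for x, y in rows if not split(x)]
    parent = variance([y for _, y in rows])
    n = len(rows)
    return parent - (len(left) / n * variance(left)
                     + len(right) / n * variance(right))

# Hypothetical (age, LOS in days) records; split on age < 60.
rows = [(35, 2.0), (42, 3.0), (70, 9.0), (81, 11.0)]
gain = variance_reduction(rows, lambda age: age < 60)
print(gain)  # 14.0625
```

A large reduction, as here, means the split separates short stays from long stays well; the best attribute is the one whose split maximises this reduction.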