Semantic Web - University of Huddersfield

AI Week 15
Machine Learning:
Data Mining :
Association Rule Mining,
Associative Classification,
Applications
Lee McCluskey, room 3/10
Email [email protected]
http://scom.hud.ac.uk/scomtlm/cha2555/
Last Week
Data Mining as inducing rule classifiers from classified training examples.
Association Rule Mining (ARM)
This is an “unsupervised learning activity”: briefly, looking for strong associations between features in data.
Definitions: A transactional database is a set of “transactions”, e.g. the details of individual sales.
A transaction can be thought of as an “item-set” where each item is an attribute-value pair, e.g.
{height = 6, temp = 20, weather = warm}
As a special case we could have nominal item-sets, e.g.
{bread, cheese, milk}
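A minimal sketch of these definitions in Python. The use of frozenset and of (attribute, value) tuples is a representation choice made for this sketch, not something taken from the slides, and the second and third transactions in the database are hypothetical.

```python
# Attribute-value transaction: each item is an (attribute, value) pair.
transaction = frozenset({("height", 6), ("temp", 20), ("weather", "warm")})

# Nominal special case: each item is just a name.
basket = frozenset({"bread", "cheese", "milk"})

# A transactional database is then a collection of such item-sets
# (the last two transactions are made up for illustration).
database = [basket, frozenset({"bread", "milk"}), frozenset({"cheese"})]

# An item-set X is "contained in" a transaction when it is a subset of it.
query = frozenset({"bread", "milk"})
contains_query = [query <= t for t in database]
print(contains_query)  # [True, True, False]
```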
Artform Research
Group
Association Rule Mining (ARM): Important Definitions
An association rule is an expression
X => Y
where X and Y are item-sets (conventionally with no items in common).
The support of an association rule is defined as the proportion of transactions in the database that contain X U Y, i.e. every item of X and every item of Y.
The confidence of an association rule is defined as the probability that a transaction contains Y given that it contains X, that is
confidence(X => Y) = (no. of transactions containing X U Y) / (no. of transactions containing X)
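The two definitions can be sketched directly in Python. The function names and the toy database below are assumptions made for illustration; exact fractions are used so the ratios are easy to read.

```python
from fractions import Fraction

def support(database, itemset):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in database if itemset <= t)
    return Fraction(hits, len(database))

def confidence(database, x, y):
    """support(X U Y) / support(X); returns None if X never occurs."""
    sx = support(database, x)
    if sx == 0:
        return None
    return support(database, frozenset(x) | frozenset(y)) / sx

# Hypothetical toy database of four transactions.
db = [frozenset(t) for t in (
    {"bread", "cheese", "milk"},
    {"bread", "milk"},
    {"cheese"},
    {"bread"},
)]

print(support(db, {"bread", "milk"}))       # 1/2
print(confidence(db, {"bread"}, {"milk"}))  # 2/3
```

Note that "contains X U Y" means the transaction contains all items of X and all items of Y, which is why the code tests the union as a subset of each transaction.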
Aims of ARM
Given a transactional database D, the association rule
problem is to find all rules that have supports and
confidences greater than certain user-specified thresholds,
denoted by minimum support (MinSupp) and minimum
confidence (MinConf), respectively.
The aim is the discovery of the most significant associations between the items in a transactional data set. This process primarily involves the discovery of so-called frequent item-sets, i.e. item-sets whose support in the transactional data set is above MinSupp. Rules built from the frequent item-sets are then kept if their confidence is above MinConf.
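As a rough illustration of the problem statement, the sketch below enumerates every candidate item-set by brute force, keeps the frequent ones, and then splits each into rules. This is not an efficient algorithm (practical miners such as Apriori prune the search), the name mine_rules is hypothetical, and the thresholds are treated as inclusive (>=), which is an assumption of this sketch.

```python
from itertools import combinations

def mine_rules(database, min_supp, min_conf):
    """Brute-force ARM: find frequent item-sets, then rules X => Y from them."""
    n = len(database)
    items = sorted(set().union(*database))

    # Step 1: frequent item-sets (support >= min_supp).
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            c = frozenset(candidate)
            supp = sum(1 for t in database if c <= t) / n
            if supp >= min_supp:
                frequent[c] = supp

    # Step 2: split each frequent item-set into X => Y and test confidence.
    rules = []
    for itemset, supp in frequent.items():
        for size in range(1, len(itemset)):
            for x in combinations(sorted(itemset), size):
                x = frozenset(x)
                # Every subset of a frequent item-set is itself frequent,
                # so frequent[x] is always present.
                conf = supp / frequent[x]
                if conf >= min_conf:
                    rules.append((x, itemset - x, supp, conf))
    return rules

# Hypothetical toy database.
db = [frozenset(t) for t in (
    {"bread", "cheese", "milk"},
    {"bread", "milk"},
    {"cheese"},
    {"bread"},
)]

found = mine_rules(db, min_supp=0.5, min_conf=0.6)
for x, y, supp, conf in found:
    print(sorted(x), "=>", sorted(y), supp, round(conf, 2))
```

With these thresholds only {bread, milk} is a frequent item-set of size two, so the rules found are bread => milk and milk => bread.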
Example
A trader deals in the following currencies in a series of 8 transactions…
1   Sterling, Yen, Dollar, Euro
2   Dollar, Euro, Rand, Sterling, Ruble
3   Pesos, Euro, Ruble, Rupee, Yen
4   Rupee, Sterling, Ruble, Euro, Dollar
5   Sterling, Dinars, Rand, Yen
6   Pesos, Kroner, Sterling, Dollar
7   Ruble, Rupee, Kroner, Sterling, Pesos
8   Dollar, Euro, Sterling
What is the SUPPORT and CONFIDENCE of the following rules?
{Ruble } → {Rupee}
{Sterling, Euro} → {Ruble}
{Sterling, Euro} → {Ruble, Pesos}
Find an association rule from the set of transactions that has
- at least 2 items in its antecedent,
- better support and better confidence than both rules above.
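One way to check answers to the first part of the exercise is to count directly over the 8 transactions. The sketch below is an illustrative helper rather than part of the original slides; the function names are assumptions, and the final "find a better rule" question is left open.

```python
# The 8 currency transactions from the example.
transactions = {
    1: {"Sterling", "Yen", "Dollar", "Euro"},
    2: {"Dollar", "Euro", "Rand", "Sterling", "Ruble"},
    3: {"Pesos", "Euro", "Ruble", "Rupee", "Yen"},
    4: {"Rupee", "Sterling", "Ruble", "Euro", "Dollar"},
    5: {"Sterling", "Dinars", "Rand", "Yen"},
    6: {"Pesos", "Kroner", "Sterling", "Dollar"},
    7: {"Ruble", "Rupee", "Kroner", "Sterling", "Pesos"},
    8: {"Dollar", "Euro", "Sterling"},
}

def count(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions.values() if itemset <= t)

def support(x, y):
    return count(x | y) / len(transactions)

def confidence(x, y):
    return count(x | y) / count(x)

rules = [
    ({"Ruble"}, {"Rupee"}),
    ({"Sterling", "Euro"}, {"Ruble"}),
    ({"Sterling", "Euro"}, {"Ruble", "Pesos"}),
]
for x, y in rules:
    print(sorted(x), "=>", sorted(y),
          "support =", support(x, y), "confidence =", confidence(x, y))
# {Ruble} => {Rupee}:                 support 0.375, confidence 0.75
# {Sterling, Euro} => {Ruble}:        support 0.25,  confidence 0.5
# {Sterling, Euro} => {Ruble, Pesos}: support 0.0,   confidence 0.0
```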
Example
X => Y: Ruble => Rupee
[Venn diagram of the 8 transactions: X = transactions containing Ruble; X U Y = transactions containing both Ruble and Rupee]
Outside X: {Sterling, Yen, Dollar, Euro}, {Pesos, Kroner, Sterling, Dollar}, {Dollar, Euro, Sterling}, {Sterling, Dinars, Rand, Yen}
In X only: {Dollar, Euro, Rand, Sterling, Ruble}
In X U Y: {Pesos, Euro, Ruble, Rupee, Yen}, {Ruble, Rupee, Kroner, Sterling, Pesos}, {Rupee, Sterling, Ruble, Euro, Dollar}
Associative Classification
If we fuse ARM and classification rule mining we get
“Associative Classification” – use the association technique,
but learning about particular items or item sets.
Associative Classification is a branch in data mining that
combines classification and association rule mining. In
other words, it utilises association rule discovery methods in
classification data sets.
Typically:
• Find Association Rules using ARM
• Sift out the “Class Association Rules” – ones that have the
class of interest on their Right Hand Sides
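The two steps above can be sketched as follows, assuming rules have already been mined and are held as (antecedent, consequent, support, confidence) tuples. The rule values, the "class=" naming convention, and the classify helper (which predicts with the highest-confidence matching rule, one common choice among several) are all assumptions of this sketch.

```python
# Mined rules as (antecedent, consequent, support, confidence).
# All values here are hypothetical.
rules = [
    (frozenset({"weather=warm"}), frozenset({"class=play"}), 0.40, 0.80),
    (frozenset({"weather=warm"}), frozenset({"temp=high"}), 0.35, 0.70),
    (frozenset({"temp=high", "weather=warm"}), frozenset({"class=play"}), 0.30, 0.86),
    (frozenset({"class=play"}), frozenset({"weather=warm"}), 0.40, 0.67),
]

def is_class_item(item):
    """Assumed convention: class labels are items named 'class=...'."""
    return item.startswith("class=")

def class_association_rules(rules):
    """Keep rules whose right-hand side is exactly one class item."""
    return [r for r in rules
            if len(r[1]) == 1 and all(is_class_item(i) for i in r[1])]

def classify(example, cars, default=None):
    """Predict with the highest-confidence rule whose antecedent matches."""
    matching = [r for r in cars if r[0] <= example]
    if not matching:
        return default
    best = max(matching, key=lambda r: r[3])
    return next(iter(best[1]))

cars = class_association_rules(rules)
print(len(cars))                                       # 2
print(classify({"weather=warm", "temp=high"}, cars))   # class=play
print(classify({"temp=low"}, cars))                    # None
```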
Validation in Rule Discovery
• Multi-stage Data Mining “pipelines” are fraught with various kinds of errors / bias
• The integrity of the data at each stage of the DM process and the reliability of the results are particularly important.
• DM usually uses “cross validation”, where the data is split into a training set and a testing set, and the results of the data miner applied to the training set are checked against the testing set. Not really applicable to rule discovery.
Key idea: Look for trends/associations in the data that
are output from the process and that represent known
associations in the application domain.
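The key idea can be sketched as a simple check: list the associations that domain experts already know, and see which of them the mining process recovers. The function name, the exact-match criterion, and all rule contents below are hypothetical; a real check might accept partial matches.

```python
def recovered(known, mined):
    """Return the known (X, Y) associations that appear among the mined rules."""
    mined_pairs = {(frozenset(x), frozenset(y)) for x, y in mined}
    return [(x, y) for x, y in known
            if (frozenset(x), frozenset(y)) in mined_pairs]

# Hypothetical domain knowledge and hypothetical miner output.
known = [
    ({"long_duration"}, {"retinopathy"}),
    ({"roadworks"}, {"slow_traffic"}),
]
mined = [
    ({"long_duration"}, {"retinopathy"}),
    ({"age_60s"}, {"drugX"}),
]

hits = recovered(known, mined)
print(f"{len(hits)} of {len(known)} known associations recovered")
```

A low recovery rate would suggest a problem somewhere in the pipeline (cleaning, discretisation, thresholds) before any newly discovered rules are trusted.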
DM Application 1: Discovering trends from patient data in the area of Diabetic Retinopathy
Diabetic Retinopathy: Basically damage to the eyes caused by Diabetes, sometimes leading to blindness
HUGE problem as diabetes is on the increase. If you are a long term diabetic then you are very likely to suffer some retina damage
Clinics keep large amounts of data on patients who are treated in various ways, over long periods of time.
Diabetic Retinopathy Application
Data of 20,000 patients over 18 years
Much data cleaning and inference precedes mining –
replacing missing values, noise, anomalies etc
Focus is on a smaller number of patients with a yearly
screening (each one a timestamp) over a period of 4+ years
Attribute Examples (there are several hundred)
Age_at_Exam ,
Present_Treatment,
calculated_age_at_diagnosis,
Retinopathy_in_R_Eye (RE_RET),
Retinopathy_in_L_Eye (RE_RET),
calculated_diabetes_type,
calculated_diabetes_duration
Trend Mining
Item-sets that have an
increasing support over a
series of time-stamped
instances (events) are called
“emerging patterns”
The changing support for sets
of items during each event
can indicate trends in the
data. For example, the
presence of a particular
treatment over a period of
time may lead to the
alleviation of a symptom.
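A minimal sketch of this idea: compute the support of an item-set within each time-stamped event and test whether the series rises. Here "increasing" is interpreted as strictly increasing from each event to the next, which is an assumption of this sketch (other definitions use growth-rate thresholds), and the patient records are hypothetical.

```python
def support(transactions, itemset):
    """Proportion of transactions containing every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def support_series(events, itemset):
    """Support of `itemset` in each time-stamped event, in time order."""
    return [support(events[ts], itemset) for ts in sorted(events)]

def is_emerging(series):
    """True if support strictly increases from each event to the next."""
    return len(series) > 1 and all(a < b for a, b in zip(series, series[1:]))

# Hypothetical records: screening year -> one item-set per patient.
events = {
    1: [{"drugX"}, {"drugX", "ret=low"}, {"ret=high"}, {"drugX", "ret=high"}],
    2: [{"drugX", "ret=low"}, {"drugX", "ret=low"}, {"ret=high"}, {"drugX", "ret=high"}],
    3: [{"drugX", "ret=low"}, {"drugX", "ret=low"}, {"ret=low"}, {"drugX", "ret=low"}],
}

series = support_series(events, {"drugX", "ret=low"})
print(series)               # [0.25, 0.5, 0.75]
print(is_emerging(series))  # True
```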
Diabetic Retinopathy Application
Aim - to find trends in the data e.g. (fictitious
example):
calculated_diabetes_duration > Y &
Age_at_Exam in [60,70] &
Present_Treatment = drugX &
calculated_age_at_diagnosis in [50,60] =>
Retinopathy_in_R_Eye (RE_RET) = low &
Retinopathy_in_L_Eye (RE_RET) = low
Increasing trend ..
“people who have had diabetes for a certain length of time, whose age is in their 60s, who were diagnosed in their 50s, and who have been taking drugX, often have low DR levels”
Increasing trend adds support for the association.
DM Application 2: Road Traffic Control
Example in Road Traffic Control
Data ..
Numeric Data Record from individual CARS
(date, time, position, actual speed, expected speed)
Textual Data of INCIDENTS
(date, time start, time cleared, position, severity, road type,
area, incident category, cause, road-effect, traffic-effect, reporter ..)
Data Sources ..
ANPR, Mobile Phones, Road (Vehicle) Sensors,
Environment Sensors
Applications in Road Traffic Control
• associations between variations in speeds with near-future incidents
• effect of a particular type of incident (eg roadworks) on average speeds on nearby trunk roads
• looking for predictors in "heavy/slow traffic" incidents: look for associations with speed variations or accidents on roads downstream from the incident position (hence causing the incident)
• looking for associations between speeds around a bypass and a later "heavy traffic" incident within the town bypassed
Conclusions
Data Mining is a powerful set of techniques to help discover hidden knowledge.
It can be supervised or unsupervised.
• Association Rule Mining
• Associative Classification
are important classes of technique used in DM