data mining - Department of Computer Systems Engineering

Data Warehousing Lecture XVI
Dr. Javed Ali Baloch
Outline
• Hybrid OLAP (HOLAP) or Desktop OLAP
(DOLAP)
• The HOLAP Architecture
• HOLAP Development Issues
• Data Design & Preparation
Hybrid OLAP (HOLAP) or Desktop OLAP (DOLAP)
• HOLAP is meant to provide portability to users of
OLAP.
• HOLAP provides limited analysis capability, either
directly against RDBMS products, or by using an
intermediate MOLAP server.
• HOLAP tools deliver selected data directly from the
DBMS or via a MOLAP server to the desktop (or local
server) in the form of a data cube, where it is stored,
analyzed and maintained locally.
HOLAP Development Issues
• The architecture results in significant data
redundancy and may cause problems for
networks that support many users.
• The ability of each user to build a custom data cube
may cause a lack of data consistency among users.
Data Design & Preparation
• The DW feeds data to the OLAP system.
• In the MOLAP model, multidimensional databases
store the data fed from the DW in the form of multidimensional cubes.
• In the ROLAP model, data is pushed into the OLAP
system with cubes created dynamically on the fly.
• Thus, the sequence of the flow of data is from the
operational source systems to the DW & from there
to the OLAP systems.
• Why not build the OLAP system on the top of
the operational source systems?
– An OLAP system needs transformed & integrated
data.
– An OLAP system needs extensive historical data.
– An OLAP system requires data in multidimensional representation.
– Different departments require data from different
operational systems.
• The techniques for preparing OLAP data for a particular
department, e.g. marketing:
– Define Subset: Select the subset of detailed data
the marketing department is interested in.
– Summarize: Aggregate the data in the way the
marketing department needs.
– De-normalize: Combine relational tables in exactly
the way the marketing department needs.
– Calculate & Derive
– Index: Choose those attributes that are
appropriate for marketing to build indexes.
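The preparation techniques above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a prescribed method: the product and sale tables, their columns, the sample rows and the choice of the 'North' region as the marketing subset are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical warehouse tables and rows, invented for this sketch.
conn.executescript("""
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE sale (
        sale_id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES product(product_id),
        region TEXT,
        sale_month TEXT,
        units INTEGER,
        unit_price REAL
    );
    INSERT INTO product VALUES (1, 'Snacks'), (2, 'Dairy');
    INSERT INTO sale VALUES
        (1, 1, 'North', '2024-01', 10, 2.0),
        (2, 1, 'North', '2024-01', 5, 2.0),
        (3, 2, 'North', '2024-02', 4, 1.5),
        (4, 2, 'South', '2024-01', 7, 1.5);
""")

# Define subset:      WHERE keeps only the region marketing cares about.
# De-normalize:       JOIN folds the product table into the sales rows.
# Calculate & derive: revenue is computed from units and unit_price.
# Summarize:          GROUP BY aggregates to category and month.
conn.execute("""
    CREATE TABLE marketing_sales AS
    SELECT p.category,
           s.sale_month,
           SUM(s.units) AS total_units,
           SUM(s.units * s.unit_price) AS revenue
    FROM sale AS s
    JOIN product AS p ON p.product_id = s.product_id
    WHERE s.region = 'North'
    GROUP BY p.category, s.sale_month
""")

# Index: build an index on the attribute marketing filters by most.
conn.execute("CREATE INDEX idx_marketing_category ON marketing_sales (category)")

rows = conn.execute(
    "SELECT category, sale_month, total_units, revenue "
    "FROM marketing_sales ORDER BY category, sale_month"
).fetchall()
for row in rows:
    print(row)
```

In practice each technique would usually be a separate, scheduled step; they are combined into one statement here only to keep the sketch short.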
Data Warehousing Lecture XVII
Dr. Javed Ali Baloch
Outline
• Data Mining
• Decision support progress to Data Mining
• Data Mining Defined
• The Knowledge Discovery Process
DATA MINING
• Data Mining is used in just about every area of
business from sales and marketing to new product
development to inventory management and human
resources.
• In today’s world, an organization generates more
information in a week than most people can read in a
lifetime. It is humanly impossible to decipher and
interpret all that data to find useful information.
• Data Mining enables companies to find answers and
discover patterns in their customer data.
Decision support progress to Data Mining
[Figure: progression of decision support, by type of system, data used and level of support]
• Early file-based systems: basic accounting data; no decision support
• Database systems: operational systems data; primitive decision support
• Data warehouse: data for decision support; true decision support
• OLAP systems: data for multidimensional analysis; complex analysis & calculations
• Data mining applications: selected and extracted data; knowledge discovery
Data Mining Defined
• Data mining is the efficient discovery of valuable,
non-obvious information from a large collection of data.
• Data Mining centers around the automated discovery
of new facts and relationships in data.
• With traditional query tools, you search for known
information. Data mining tools enable you to
uncover hidden information.
• The assumption is that more useful knowledge lies
hidden beneath the surface.
The Knowledge Discovery Process
• Data Mining discovers knowledge or
information that you never knew was present
in your data.
• The uncovered hidden knowledge manifests
itself as relationships or patterns.
Data Warehousing Lecture XVIII
Dr. Javed Ali Baloch
Outline
• Relationships
• Patterns
• Knowledge Discovery Phases
Relationships
• Suppose on the way home you visit the nearby
supermarket to pick up bread, milk, and a few other “things”.
What other things? You are not sure.
• While you fetch the milk container, you happen to see a pack
of assorted cheeses close by. Yes, you want that.
• You pause and watch the next five customers also reach for the
cheese pack. Coincidence?
• Now on to the bread shelf. As you get your bread, a bag of
potato chips catches your eye. Why not get that bag of potato
chips? Now the customer behind you also wants bread &
chips. Coincidence? Not necessarily.
Relationships
• It is possible that this supermarket is part of a
national chain that uses data mining.
• The data mining tools have discovered the
relationship between bread and chips and between
milk and cheese packs.
• So the items must have been deliberately placed in
close proximity.
• Data Mining discovers the relationships of this type.
• The relationships may be between two or more
different objects along with the time dimension.
• Discovery of relationships is a key result of data
mining.
Patterns
• Pattern discovery is another outcome of data mining
operations.
• Consider a credit card company trying to discover the
pattern of usage that usually warrants an increase in
credit limit or a card upgrade.
• They would then know which of their customers to
lure with a card upgrade, and when.
• The data mining tools mine the usage patterns of
thousands of card-holders and discover the potential
pattern of usage that will produce results in a marketing
campaign.
Knowledge Discovery Phases
• Step 1: Define Business Objectives: Determine
whether you really need a data mining solution.
State your objectives. Are you looking to improve
your direct marketing campaigns? Do you want to
detect fraud in credit usage? etc
• Step 2: Prepare Data: consists of data selection,
preprocessing of data and data transformation.
Include appropriate metadata.
• Step 3: Perform Data Mining: the knowledge
discovery engine applies the selected algorithm to
the prepared data. The output from this step is a set
of relationships or patterns.
• Step 4: Evaluate Results: In this step, you examine all
the resulting patterns. Apply a filtering mechanism &
select only the promising patterns to be presented &
applied.
• Step 5: Present Discoveries: may be in the form of
visual navigation, charts, graphs or free-form text.
Presentation may also include storing interesting
discoveries in a knowledge base for repeated use.
• Step 6: Incorporate Usage of Discoveries: Assemble
the results in the best way so that they can be
exploited to improve the business.
Data Warehousing Lecture 19
Dr. Javed Ali Baloch
Outline
• OLAP Versus Data Mining
• Data Mining & the Data Warehouse
• Major Data Mining Techniques
OLAP Versus Data Mining
• In an OLAP analysis session,
the analyst has some prior
knowledge of what he or
she is looking for.
• OLAP helps the user to
analyze the past & gain
insights.
• In OLAP, the analyst
drives the process while
using OLAP tools.
• In data mining, the analyst
has no prior knowledge of
what results are likely to
be.
• Data Mining helps the
user predict the future.
• In data mining, the analyst
prepares the data and
“sits back” while the tools
drive the process.
OLAP Versus Data Mining
Features compared for OLAP and Data Mining:
• Motivation for information request
– OLAP: What is happening in the enterprise?
– Data Mining: Predict the future based on why this is happening.
• Data granularity
– OLAP: Summary data.
– Data Mining: Detailed transaction-level data.
• Number of business dimensions
– OLAP: Limited number of dimensions.
– Data Mining: Large number of dimensions.
• Number of dimension attributes
– OLAP: Small number of attributes.
– Data Mining: Many dimension attributes.
• Sizes of datasets for the dimensions
– OLAP: Not large for each dimension.
– Data Mining: Usually very large for each dimension.
• Analysis approach
– OLAP: User-driven interactive analysis.
– Data Mining: Data-driven automatic knowledge discovery.
• Analysis techniques
– OLAP: Multidimensional, drill-down, and slice & dice.
– Data Mining: Prepare data, launch mining tool & sit back.
• State of the technology
– OLAP: Mature & widely used.
– Data Mining: Still emerging.
Data Mining & the Data Warehouse
• Data Mining algorithms need large amounts of data,
more so at the detailed level. Most DW contain data
at the lowest level of granularity.
• Data Mining flourishes on integrated & cleansed
data. If your ETL functions were carried out properly,
your DW contains such data, very suitable for data
mining.
• The infrastructure of DW is already robust, with
parallel processing technology & powerful relational
database systems. Because such scalable hardware is
already in place, no new investment is needed to
support data mining.
Major Data Mining Techniques
• Data mining covers a broad range of techniques including
– Cluster Detection
– Decision Trees
– Memory-Based Reasoning
– Link Analysis
– Neural Networks
– Genetic Algorithms etc.
• Various data mining techniques are applicable to each type
of function.
• These techniques consist of the specific algorithms that can
be used for each function.
Data Warehousing Lecture 20
Dr. Javed Ali Baloch
Outline
• Cluster Detection
• A Clustering Example
• Clusters with two variables
• Forming Clusters
• Centroids and cluster boundaries
Cluster Detection
• Clustering means forming groups.
• Clustering helps you take specific & proper action
for the individual pieces that make up the cluster.
• The cluster detection algorithm searches for groups
or clusters of data elements that are similar to one
another.
• You expect similar customers or similar products to
behave in the same way. Then you can take a cluster
& do something useful with it.
A Clustering Example
• Consider the example of a specialty store owner in a
resort community who wants to cater to the
neighborhood by stocking the right type of products.
• The store owner has data about the age group &
income level of each of the people who frequently
visit the store.
• Using these two variables the store owner can put the
customers into 4 clusters, i.e. wealthy retirees staying
in resorts, middle-aged weekend golfers, wealthy
young people with club membership and low-income
clients who happen to stay in the community.
Clusters with two variables
Forming Clusters
• Suppose you want to market to the customers & you
are prepared to run marketing campaigns for 15
different groups.
• Fifteen initial records (called “seeds”) are chosen as
the first set of centroids based on the best guesses.
• Each seed represents one set of values for all the
dimension variables chosen for the customer record.
• In the next step, the algorithm assigns each customer
record in the database to a cluster based on the seed
to which it is closest.
Forming Clusters
• Closeness is based on the nearness of the values of
the set of all dimension variables in a record to the
values in the seed record.
• The first set of 15 clusters is now formed.
• Then the algorithm calculates the centroid or mean
for each of the first set of 15 clusters.
• The next iteration then starts. Each customer record
is rematched with the new set of centroids and
cluster boundaries are redrawn.
• After a few iterations the final clusters emerge.
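The seed, assign and recompute loop described above can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of any particular tool: the customer records (age, income in thousands), the two seeds and the use of Euclidean distance for "closeness" are all invented for the example, and the function names are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def assign(records, centroids):
    """Put each record into the cluster whose centroid is closest."""
    clusters = [[] for _ in centroids]
    for rec in records:
        nearest = min(range(len(centroids)), key=lambda i: dist(rec, centroids[i]))
        clusters[nearest].append(rec)
    return clusters


def mean(points):
    """Centroid of a cluster: the mean of each dimension variable."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def form_clusters(records, seeds, max_iterations=20):
    centroids = list(seeds)  # the seeds are the first set of centroids
    for _ in range(max_iterations):
        clusters = assign(records, centroids)
        # An empty cluster keeps its previous centroid.
        new_centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # boundaries stopped moving
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical customer records: (age, income in thousands).
records = [(25, 30), (27, 32), (30, 35), (62, 90), (65, 95), (68, 88)]
seeds = [(25, 30), (62, 90)]  # two seeds instead of fifteen, for brevity

centroids, clusters = form_clusters(records, seeds)
print(centroids)
print(clusters)
```

With real data the variables would normally be scaled to comparable ranges first, otherwise the variable with the largest numeric range dominates the distance.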
Centroids and cluster boundaries
Task
• Design 4 clusters for the students of 10CSE
selecting their final-year projects, and list the
cluster formation rules.
Data Warehousing Lecture 21
Dr. Javed Ali Baloch
Outline
• Decision Trees
• Decision Tree Modeling
• Decision Tree Example
• Task
Decision Trees
• This technique applies to classification and
prediction.
• The major attraction of decision trees is their
simplicity. By following the tree, you can decipher
the rules and understand why a record is classified in
a certain way.
• Decision trees represent rules. You can use these
rules to retrieve records falling into a certain
category.
Decision Trees
• It is a rooted tree in which each internal node
corresponds to a decision, with a subtree at these
nodes for each possible outcome of the decision.
• Decision trees can be used to model problems in
which a series of decisions leads to a solution.
• The possible solutions of the problem correspond to
the paths from the root to the leaves of the decision
tree.
Decision Trees
• A decision tree represents a series of questions. Each
question determines what follow-up question is best
to be asked next.
• Good questions produce a short series.
• Trees are drawn with the root at the top and the
leaves at the bottom, an unnatural convention.
• The question at the root must be the one that best
differentiates among the target classes.
• A database record enters the tree at the root node.
The record works its way down until it reaches a leaf.
The leaf node determines the classification of the
record.
Decision Tree model
• A Decision Tree Model is a computational model
consisting of three parts:
– Decision Tree
– Algorithm to create the tree
– Algorithm that applies the tree to data
• Creation of the tree is the most difficult part.
• Processing is basically a search similar to that in a
binary search tree (although DT may not be binary).
Decision Tree Example
• Data
height  hair   eyes   class
short   blond  blue   A
tall    blond  brown  B
tall    red    blue   A
short   dark   blue   B
tall    dark   blue   B
tall    blond  blue   A
tall    dark   brown  B
short   blond  brown  B
Step 1: split on hair
• hair = dark: short, blue = B; tall, blue = B; tall, brown = B
• hair = red: tall, blue = A
• hair = blond: short, blue = A; tall, brown = B; tall, blue = A; short, brown = B
Completely classifies dark-haired and red-haired people.
Does not completely classify blond-haired people.
More work is required.
Step 2: split the blond-haired cases on eyes
• eyes = blue: short = A; tall = A
• eyes = brown: tall = B; short = B
Decision tree is complete because
1. All 8 cases appear at nodes
2. At each node, all cases are in
the same class (A or B)
Final tree
• hair = dark: B
• hair = red: A
• hair = blond: test eyes
– eyes = blue: A
– eyes = brown: B
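The example above can be reproduced with a short sketch that builds the tree from the eight cases listed at the tree nodes and then applies it to a record. The slides do not name a tree-building algorithm; information gain (as in ID3) is used here as one common way to pick the question that best differentiates the classes, and that choice is an assumption. All function names are hypothetical.

```python
from collections import Counter
from math import log2

# The eight cases from the example: (height, hair, eyes, class).
CASES = [
    ("short", "blond", "blue", "A"),
    ("tall", "blond", "brown", "B"),
    ("tall", "red", "blue", "A"),
    ("short", "dark", "blue", "B"),
    ("tall", "dark", "blue", "B"),
    ("tall", "blond", "blue", "A"),
    ("tall", "dark", "brown", "B"),
    ("short", "blond", "brown", "B"),
]
ATTRS = {"height": 0, "hair": 1, "eyes": 2}


def entropy(rows):
    """Class entropy of a set of cases (0 when all share one class)."""
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())


def gain(rows, attr):
    """Information gain from splitting the cases on one attribute."""
    groups = {}
    for r in rows:
        groups.setdefault(r[ATTRS[attr]], []).append(r)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(rows) - remainder


def build(rows, attrs):
    """Return a class label (leaf) or (attribute, {value: subtree})."""
    classes = {r[-1] for r in rows}
    if len(classes) == 1:
        return classes.pop()
    if not attrs:  # no questions left: fall back to the majority class
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a))
    rest = [a for a in attrs if a != best]
    idx = ATTRS[best]
    return (best, {
        value: build([r for r in rows if r[idx] == value], rest)
        for value in sorted({r[idx] for r in rows})
    })


def classify(tree, record):
    """Enter at the root and follow the answers down to a leaf."""
    while not isinstance(tree, str):
        attr, branches = tree
        tree = branches[record[attr]]
    return tree


tree = build(CASES, ["height", "hair", "eyes"])
print(tree)
print(classify(tree, {"height": "tall", "hair": "blond", "eyes": "brown"}))
```

On these eight cases the sketch selects hair at the root and eyes for the blond branch, matching the final tree shown above; height is never needed.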
Task
Design a decision tree for a customer planning to
purchase a car, make sure you use the different
deciding factors on which he would make the
decision, the example should show a proper
classification.
Data Warehousing Lecture 22
Dr. Javed Ali Baloch
Outline
• Memory based reasoning (MBR)
• MBR applications
• MBR Challenges
Memory Based Reasoning
• Would you rather go to an experienced doctor or to a
novice? Of course, the answer is obvious.
• Why? Because the experienced doctor treats you and
cures you based on his or her experience. The doctor
knows what worked in the past in several cases when
the symptoms were similar to yours.
• We are all good at making decisions on the basis of
our experiences.
• We depend on the similarities of the current
situation to what we know from past experience.
• The same principles apply to the memory-based
reasoning (MBR) algorithm.
• Our ability to reason from experience depends on
our ability to recognize appropriate examples from
the past…
– Traffic patterns/routes
– Movies
– Food
• We identify similar example(s) and apply what we
know/learned to the current situation.
• These similar examples in MBR are referred to as
neighbors.
• MBR uses known instances of a model to predict
unknown instances.
• This data mining technique maintains a dataset of
known records.
• When a new record arrives for evaluation, the
algorithm finds neighbors similar to the new record,
then uses the characteristics of the neighbors for
prediction and classification.
• When a new record arrives at the data mining tool, first
the tool calculates the “distance” between this record
and the records in the training dataset.
• The results determine which data records in the training
dataset qualify to be considered as neighbors to the
incoming data record.
• Next, the algorithm uses a combination function to
combine the results of the various distance functions to
obtain the final answer.
• The distance function and the combination function are
key components of the memory-based reasoning
technique.
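The two key components named above can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the training records (age, income in thousands, and a known campaign response), Euclidean distance as the distance function, a majority vote as the combination function, and k = 3 neighbors.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

# Hypothetical dataset of known records: ((age, income in thousands), response).
TRAINING = [
    ((25, 30), "no"),
    ((28, 35), "no"),
    ((30, 40), "no"),
    ((45, 80), "yes"),
    ((47, 78), "yes"),
    ((50, 85), "yes"),
]


def distance(a, b):
    """Distance function: how far apart two records are."""
    return dist(a, b)


def neighbors(record, training, k):
    """The k known records closest to the incoming record."""
    return sorted(training, key=lambda item: distance(record, item[0]))[:k]


def combine(nearest):
    """Combination function: majority vote among the neighbors."""
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def predict(record, training=TRAINING, k=3):
    return combine(neighbors(record, training, k))


print(predict((46, 79)))
print(predict((27, 33)))
```

Changing the distance function, the combination function or k changes the prediction, which is why choosing them is listed among the MBR challenges.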
MBR Challenges
• Choosing appropriate historical data for use in
training
• Choosing the most efficient way to represent the
training data
• Choosing the distance function, combination
function, and the number of neighbors
MBR Applications
• Fraud detection
• Customer response prediction
• Medical treatments
• Classifying responses – MBR can process free-text
responses and assign codes
Data Warehousing Lecture 23
Dr. Javed Ali Baloch
Outline
• Link Analysis
– Associations Discovery
– Sequential Pattern Discovery
– Similar Time Sequence Discovery
Link Analysis
• The link analysis technique mines relationships and discovers
knowledge.
• For example, if you look at the supermarket sale transactions
for one day, why are skim milk and brown bread found in the
same transaction about 80% of the time?
• Is there a strong relationship between the two products in the
supermarket basket? If so, can these two products be
promoted together?
• Are there more such combinations? How can we find such links
or affinities?
• Link analysis techniques have 3 types of applications
1. Associations discovery
2. Sequential pattern discovery
3. Similar time sequence discovery
Associations Discovery
• Associations are affinities between items.
• Association discovery algorithms find combinations
where the presence of one item suggests the
presence of another.
• When you apply these algorithms to the shopping
transactions at a supermarket, they will uncover
affinities among products that are likely to be
purchased together.
• Association rules represent such affinities.
Associations Discovery



• The figure represents an association rule and the annotated parts
of the rule.
• The two parts, the support factor and the confidence factor,
indicate the strength of the association.
• Rules with high support and confidence factor values are more
valid, relevant, and useful.
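As a worked illustration of the two factors, the sketch below uses their usual definitions: support is the fraction of all transactions that contain both sides of the rule, and confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side. The five baskets are invented for the example and the function names are hypothetical.

```python
# Hypothetical supermarket baskets, one set of items per transaction.
transactions = [
    {"bread", "milk", "chips"},
    {"bread", "chips"},
    {"milk", "cheese"},
    {"bread", "milk", "cheese"},
    {"bread", "chips", "cheese"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(antecedent, consequent, transactions):
    """Share of all transactions containing both sides of the rule."""
    return count(antecedent | consequent, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Share of transactions with the antecedent that also hold the consequent."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)


# Rule: bread => chips
s = support({"bread"}, {"chips"}, transactions)
c = confidence({"bread"}, {"chips"}, transactions)
print(f"support = {s:.2f}, confidence = {c:.2f}")
```

Here bread and chips appear together in 3 of 5 baskets (support 0.60), and 3 of the 4 baskets with bread also contain chips (confidence 0.75).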
Sequential Pattern Discovery
• These algorithms discover patterns where one set of
items follows another specific set.
• Time plays a role in these patterns.
• Suppose you want the algorithm to discover the buying
sequence of products.
• The sale transactions form the dataset for the data
mining operation.
• The data elements in the sale transaction may consist of
date and time of transaction, products bought during the
transaction, and the identification of the customer who
bought the items.
• A sample set of these transactions and the results of
applying the algorithm are shown in Figure.
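A minimal sketch of the idea, assuming sale transactions carry exactly the data elements listed above (customer, date, products). It counts, for each ordered pair of products, the share of customers who bought the first product in an earlier transaction than the second. The transactions are invented, the function name is hypothetical, and real sequential pattern algorithms handle longer sequences and time windows.

```python
from collections import defaultdict
from itertools import product as cartesian

# Hypothetical sale transactions: (customer id, ISO date, products bought).
SALES = [
    ("C1", "2024-01-05", {"laptop"}),
    ("C1", "2024-01-20", {"mouse", "laptop bag"}),
    ("C2", "2024-02-02", {"laptop"}),
    ("C2", "2024-02-15", {"mouse"}),
    ("C3", "2024-03-01", {"mouse"}),
    ("C3", "2024-03-09", {"laptop"}),
]


def sequence_support(sales):
    """Share of customers who bought item A in an earlier visit than item B."""
    by_customer = defaultdict(list)
    for customer, date, items in sales:
        by_customer[customer].append((date, items))

    counts = defaultdict(int)
    for visits in by_customer.values():
        # ISO dates sort chronologically as strings; this sketch assumes
        # each customer has at most one transaction per date.
        visits.sort(key=lambda v: v[0])
        seen = set()
        for i, (_, earlier) in enumerate(visits):
            for _, later in visits[i + 1:]:
                seen.update(cartesian(earlier, later))
        for pair in seen:  # count each customer once per pair
            counts[pair] += 1

    n = len(by_customer)
    return {pair: c / n for pair, c in counts.items()}


result = sequence_support(SALES)
for (first, then), share in sorted(result.items()):
    print(f"{first} -> {then}: {share:.2f}")
```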
Similar Time Sequence Discovery
• This technique finds a sequence of events
and then comes up with other similar sequences of
events.
• For example, in retail department stores, this data
mining technique comes up with a second
department that has a sales stream similar to the
first.
• Finding similar sequential price movements of stock
is another application of this technique.
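One simple way to illustrate "similar" sales streams is to compare them with the Pearson correlation coefficient. The slides do not name a similarity measure, so this choice is an assumption, as are the department names, the monthly figures and the function names.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: +1 means the two streams move together."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical monthly sales streams per department.
SALES = {
    "toys": [10, 12, 15, 30, 45, 20],
    "electronics": [100, 118, 152, 290, 460, 210],
    "garden": [40, 42, 38, 15, 10, 35],
}


def most_similar(target, streams):
    """Name of the other department whose stream best matches the target."""
    others = {name: s for name, s in streams.items() if name != target}
    return max(others, key=lambda name: pearson(streams[target], others[name]))


print(most_similar("toys", SALES))
```

Correlation ignores differences in scale, so a small department can still match a large one when their sales rise and fall together.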
Data Warehousing Lecture 24
Dr. Javed Ali Baloch
Outline
• Artificial Intelligence for Data Mining
• Neural Networks
• Neural Network Characteristics
• Anatomy of a Neural Network
• Neural Network Model
• How a Neural Network Works?
• Advantages and Disadvantages
Artificial Intelligence for Data Mining
• Neural networks are useful for data mining and
decision-support applications.
• People are good at generalizing from experience.
• Computers excel at following explicit instructions
over and over.
• Neural networks bridge this gap by modeling, on a
computer, the neural behavior of human brains.
Neural Networks
• Neural networks mimic the human brain by learning
from a training dataset and applying the learning to
generalize patterns for classification and prediction.
• These algorithms are effective when the data is
shapeless and lacks any apparent pattern.
• The basic unit of an artificial neural network is
modeled by looking at the neurons in the brain.
• This unit is known as a node and is one of the two
main structures of the neural network model.
• The other structure is the link that corresponds to
the connection between neurons in the brain.
Neural Network Characteristics
• Neural networks are useful for pattern recognition or
data classification, through a learning process.
• Neural networks simulate biological systems, where
learning involves adjustments to the synaptic
connections between neurons
Anatomy of a Neural Network
• Neural networks map a set of input nodes
to a set of output nodes.
• The number of inputs/outputs is variable.
• The network itself is composed of an
arbitrary number of nodes with an arbitrary
topology.
[Figure: Input 0, Input 1, ..., Input n feed the Neural Network, which produces Output 0, Output 1, ..., Output m]
Neural Network Model
How a Neural Network Works?
• The neural network receives values of the variables or
predictors at the input nodes.
• If there are 15 different predictors, then there are 15 input
nodes.
• Weights may be applied to the predictors to condition them
properly.
• There may be several inner layers operating on the predictors
and they move from node to node until the discovered result
is presented at the output node.
• The inner layers are also known as hidden layers because as
the input dataset is running through many iterations, the
inner layers rehash the predictors over and over again.
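The flow from input nodes through a hidden layer to the output node can be sketched as a single forward pass. This shows only how values move through the network, not how it learns: the weights below are invented and untrained, the sigmoid activation is an assumption, and three predictors are used instead of fifteen to keep the example short.

```python
from math import exp


def sigmoid(x):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-x))


def layer(inputs, weights, biases):
    """One layer: each node takes a weighted sum of its inputs plus a bias."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(predictors, hidden_w, hidden_b, output_w, output_b):
    """Input nodes -> one hidden layer -> output node(s)."""
    hidden = layer(predictors, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)


# Hypothetical, untrained weights: 3 input nodes, 2 hidden nodes, 1 output node.
HIDDEN_W = [[0.5, -0.4, 0.1], [-0.3, 0.8, 0.2]]
HIDDEN_B = [0.0, 0.1]
OUTPUT_W = [[1.2, -0.7]]
OUTPUT_B = [0.05]

outputs = forward([1.0, 0.5, -1.0], HIDDEN_W, HIDDEN_B, OUTPUT_W, OUTPUT_B)
print(outputs)
```

Training would repeat this pass over the dataset many times and adjust the weights after each iteration, which is the repeated reworking of the predictors described above.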
Advantages and Disadvantages
• Advantages
– Adapt to unknown situations
– Robustness: fault tolerance due to network redundancy
– Autonomous learning and generalization
• Disadvantages
– Not exact
– Large complexity of the network structure
Data Warehousing Lecture 25
Dr. Javed Ali Baloch
Link Analysis Example
Data Warehousing Lecture 26
Dr. Javed Ali Baloch
The Wellmeadows Hospital Case
Study
Introduction
• The Wellmeadows Case Study describes a
small hospital located in Edinburgh.
• The Wellmeadows Hospital, which specializes
in health care for the elderly, requires a
database comprised of data recorded,
maintained, and accessed by the hospital.
• The objective is to create this database with as
much functionality and as little redundancy as
possible.
Introduction
• Successful projects begin with requirements gathering.
• However, in this case, Wellmeadows performed their
own requirements gathering procedures.
• These requirements are summarized on the basis of the
different entities.
• Merely reading the material that should be contained in
a database is not enough; every sentence has to be
analyzed and noted before continuing.
Identify Entity Types
• The Wellmeadows Case Study identifies
fifteen distinct entity types.
Wards
• There are 17 wards with a total of 240 beds. Each
ward has a unique ward number & name (for example,
Orthopaedic), a location (for example, E Block) and a
telephone extension.
Staff
• The Staff entity is by far the most complex, consisting of
multiple personnel of different rank (for example, senior and
junior doctors, consultants, physiotherapists).
• The three main positions are: the Medical Director, the
Personnel Officer and the Charge Nurse.
• The Medical Director has overall responsibility of
management for the hospital.
• The Personnel Officer is responsible for ensuring that the
appropriate number and type of staff are associated with the
correct ward or out-patient clinic.
• The Charge Nurse is responsible for overseeing the day-to-day
operation of the ward/clinic. This includes allocating a budget
and tracking resources such as beds and supplies.
Staff Form
Wellmeadows Hospital Staff Form
Staff Number: S011
• Personal Details: First Name, Last Name, Address, Sex, DOB, Tel. No., NIN
• Position, Allocated to Ward, Current Salary, Salary Scale, Hours/Week,
Permanent or Temporary, Paid Weekly or Monthly
• Qualification(s): Type, Date, Institution
• Work Experience: Position, Start Date, Finish Date, Organization
Note: Please enter additional qualifications/work experience overleaf
Staff Qualifications & Work Experience
• Each member of staff can have more than one
qualification and more than one period of work experience.
• There exists a one-to-many relationship
between staff and their qualifications & work
experience.
• Therefore we require two separate tables,
StaffQualification and StaffWorkexperience, for
staff qualifications and work experience
respectively.
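The one-to-many design can be sketched with Python's built-in sqlite3 module. Only the table names StaffQualification and StaffWorkexperience come from the case study; the column names, the sample rows and staff number S012 are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column names and sample rows are hypothetical.
conn.executescript("""
    CREATE TABLE Staff (
        staff_no   TEXT PRIMARY KEY,
        first_name TEXT,
        last_name  TEXT
    );
    -- One staff member, many qualifications.
    CREATE TABLE StaffQualification (
        staff_no    TEXT REFERENCES Staff(staff_no),
        qual_type   TEXT,
        qual_date   TEXT,
        institution TEXT
    );
    -- One staff member, many periods of work experience.
    CREATE TABLE StaffWorkexperience (
        staff_no     TEXT REFERENCES Staff(staff_no),
        position     TEXT,
        start_date   TEXT,
        finish_date  TEXT,
        organization TEXT
    );
    INSERT INTO Staff VALUES ('S011', 'Example', 'Nurse'), ('S012', 'Sample', 'Doctor');
    INSERT INTO StaffQualification VALUES
        ('S011', 'BSc Nursing', '2010-06-30', 'Example University'),
        ('S011', 'Diploma in Geriatric Care', '2013-09-15', 'Example College'),
        ('S012', 'MBBS', '2008-07-01', 'Example University');
    INSERT INTO StaffWorkexperience VALUES
        ('S011', 'Staff Nurse', '2010-08-01', '2014-12-31', 'Example Clinic');
""")

# Search for staff who hold a particular qualification.
rows = conn.execute("""
    SELECT s.staff_no, s.last_name
    FROM Staff AS s
    JOIN StaffQualification AS q ON q.staff_no = s.staff_no
    WHERE q.qual_type = ?
    ORDER BY s.staff_no
""", ("BSc Nursing",)).fetchall()
print(rows)
```

Keeping qualifications and work experience in their own tables avoids repeating the staff details for every qualification, in line with the goal of as little redundancy as possible.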
Patients
• Each patient has a unique patient number and
a record of their personal information.
Patient Registration Form
Wellmeadows Hospital Patient Registration Form
Patient Number: P01234
• Personal Details: First Name, Last Name, Address, Sex, Tel. No., DOB,
Marital Status, Date Registered
• Next-of-Kin Details: Full Name, Relationship, Address, Tel. No.
• Local Doctor Details: Full Name, Clinic No., Address, Tel. No.
Patient Appointments
• Each referred patient is given an appointment, which is
recorded and has a unique appointment number.
• The details of each patient’s appointment are recorded
and include the name and staff number of the consultant
undertaking the examination, the date and time of the
appointment, and the examination room.
• As a result of the examination, the patient is either
recommended to attend the out-patient clinic or is
placed on a waiting list until a bed can be found in an
appropriate ward.
Out-Patients
• Each out-patient has a unique patient number
and a record of their personal information.
In-Patients
• Each in-patient has a unique patient number
and a record of their personal information.
In-Patient Form
Wellmeadows Hospital Patient Allocation
Week Beginning __________ Page _______
• Personal Details: Ward Number, Ward Name, Location, Charge Nurse,
Staff Number, Tel Extn.
• Columns for each patient: Patient No., Name, On Waiting List,
Expected Stay (Days), Date Placed, Date Leave, Actual Leave, Bed No.
Patient Medication
• Whenever a patient is prescribed medication,
the details are recorded.
Patient Medication Form
Surgical & Non-Surgical Supplies
• The Wellmeadows Hospital maintains a
central stock of surgical (e.g. syringes, sterile
dressings) and non-surgical (e.g. plastic bags,
aprons) supplies.
Pharmaceutical Supplies
• For each pharmaceutical supply (e.g.
antibiotics, painkillers) there is a detailed
recording.
Ward Requisitions
• Forms used to order supplies held by the
hospital.
Requisition Form
Wellmeadows Hospital Central Store Requisition Form
Requisition Number: ___________
• Ward Number, Ward Name
• Requisitioned By, Requisition Date
• Received By: __________________
• Date Received: __________________
Suppliers
• Each supplier of surgical/non-surgical and
pharmaceutical supplies has a unique number
and details of the transaction.
Wellmeadows Hospital OLTP Model
Possible Reports from OLTP System
• Search for staff who have particular
qualifications or previous work experience.
• Produce a report listing the details of staff
allocated to each ward.
• Produce a report listing the details of patients
referred to a particular ward.
• Produce a report listing the details of patients
currently located in a particular ward.
Possible Reports from OLTP System
• Produce a report listing the details of patients
currently on the waiting list for a particular
ward.
• Produce a report listing the details of
medication given to a particular patient.
• Produce a report listing the details of supplies
provided to a specific ward.