The Data Warehouse

Transcript: The Data Warehouse

Lecture 2
Themes in this session
• Knowledge discovery in databases
• Data mining
• Multidimensional analysis and OLAP
Knowledge discovery in databases
What is Knowledge?
• Data
– symbols representing properties of events and their
environments
• Information
– is contained in descriptions and provides the answers to a number
of basic questions
• Knowledge
– basic know-how that facilitates action
• Understanding
– achieved through diagnosis and prescription
• Wisdom
– judgement of what is efficient and effective
Characteristics of discovered knowledge
• non-trivial
• valid
• novel
• potentially useful
• understandable
• An aggregated measure is “interestingness”
– validity
– novelty
– usefulness
– simplicity
A more formal definition of knowledge
• Pattern
– A pattern is an expression E in a language L describing facts in
a subset F_E of F. E is called a pattern if it is simpler than the
enumeration of all the facts in F_E
• Knowledge
– A pattern E ∈ L is called knowledge if for some user-specified
threshold i ∈ M_I, I(E, F, C, N, U, S) > i
– where C = validity, N = novelty, U = usefulness, S = simplicity
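The definition above can be made concrete with a small sketch. The weighted-sum form of I and the weights and threshold below are illustrative assumptions, not part of the definition:

```python
# Hypothetical sketch of an interestingness function I(E, F, C, N, U, S):
# combine validity (C), novelty (N), usefulness (U) and simplicity (S)
# into a single score and compare it to a user-specified threshold i.
# The weighted sum and the weights are assumptions made for illustration.

def interestingness(validity, novelty, usefulness, simplicity,
                    weights=(0.4, 0.2, 0.3, 0.1)):
    """Aggregate the four component measures (each in [0, 1]) into one score."""
    scores = (validity, novelty, usefulness, simplicity)
    return sum(w * s for w, s in zip(weights, scores))

def is_knowledge(pattern_scores, threshold=0.5):
    """A pattern counts as knowledge when I(E, ...) exceeds the threshold i."""
    return interestingness(*pattern_scores) > threshold

print(is_knowledge((0.9, 0.6, 0.7, 0.8)))  # True: score 0.77 > 0.5
```

A valid but unsurprising and hard-to-read pattern would score low on N and S and be rejected even at the same threshold.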
What is KDD?
• Knowledge Discovery in Databases involves the
extraction of implicit, previously unknown and
potentially useful information from data.
• KDD is a process
– involves the extraction, organisation and presentation of
discovered information
• KDD is effected by a human-centred system
– is in itself a knowledge-intensive task consisting of complex
interactions between a human and a (large) database.
Overview of the analyst’s tasks
[Diagram: the analyst's Goals formulate Queries against the DB; the
resulting Dataset is subjected to Analyses that generate Output; the
Output enriches the analyst's Insight, which in turn gains new Goals]
Characteristics of the KDD
process
• highly iterative
• protracted over time
• numerous sub-tasks
• highly complex
• numerous input systems
A description of the KDD process
• Task discovery
• Goal formulation
• Data discovery
• Data cleaning
• Model development
• Data analysis
• Output generation
Goal formulation
Based on a means-ends chain extending into the workings
of the organisation
• Formulate a goal for improving the operations of the
business
• Decide what one needs to know in order to fulfil this
goal and perform the business activity in a better
manner
• On the basis of what one needs to know formulate
goals for how to discover this information by using the
KDD process
• Revise all of the goals above as needed on the basis of
iterative discovery
Data discovery
• Try to understand the domain in order to
determine which entities are relevant to the
discovery process
• Check the coverage and content of the data
– sift through the source data to see what is available
– sift through the source data to see what is not available
• Determine the quality of the data
• Determine the structure of the data
Task discovery
• Find means stipulated by the ends contained in the
knowledge discovery goals
• Find out what the real requirements on the tasks
and the performance of these tasks are
• Refine the requirements and choice of tasks until
you’re sure you’re setting about answering the
correct questions
Data cleaning
• Ensure the quality of the data that will be used in
the KDD process
• Eliminate data quality problems in the data such
as…
– inconsistencies due to differences between various
data sources
– missing data
– different forms of data representation
– data incompatibility
Model development
Involves activities concerned with forming a basic hypothesis
which can satisfy the knowledge discovery goals
• Select the parameters for the model
– formulate measures that can be used to quantify achievement of
the goal (outcome variable or dependent variable)
– select a set of independent variables which are deemed to have
relevance to the outcome variables
• Segment the data
– find possible relevant subsets in the population
• Choose an analysis model which fits the problem domain
NOTE: This whole phase demands background knowledge of the domain
Data analysis
Involves activities aimed at determining the rules/reasons
governing the behaviour of those entities focused on by
the knowledge discovery goal
• specify the chosen model
– use some form of formal expression
• fit the model to the data
– perform initial adjustments to some of the parameters
• evaluate the model
– check the soundness of the model against the data
• refine the model
– modify the model on the basis of its discrepancies with the
evidence presented by the data
Output generation
• Reports of findings in the analysis
• Action suggestions on the basis of the findings
• Models for use in similar analysis scenarios
• Monitoring mechanisms which observe the variables covered in the
analysis and “trigger” notifications when certain conditions are
noted in the data.
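A monitoring mechanism of the kind described can be sketched in a few lines. The sales variable, the threshold of 100 units, and the notification-by-list are invented for illustration:

```python
# Hypothetical sketch of a monitoring mechanism: observe a variable
# covered in the analysis and "trigger" a notification when a certain
# condition is noted in the data.

def make_monitor(condition, notify):
    """Return an observer that checks each new value against the condition."""
    def observe(value):
        if condition(value):
            notify(value)
    return observe

alerts = []
# Illustrative condition: flag any day with sales below 100 units.
monitor = make_monitor(lambda sales: sales < 100, alerts.append)

for daily_sales in [250, 180, 95, 300, 80]:
    monitor(daily_sales)

print(alerts)  # [95, 80]
```

In practice `notify` would send a message to an analyst rather than append to a list; the separation of condition from notification is the point of the sketch.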
Developing KDD applications
Purpose: an application to answer a key business
question
• a labour intensive initial discovery of knowledge by
someone who understands the domain as well as the
specific data analysis techniques needed
• encoding of the discovered knowledge within a specific
problem solving architecture
• application of the knowledge in the context of a real
world task by a well understood class of end-users
• Installation of analysis, monitoring, and reporting
mechanisms as a base for continual evaluation of data
Data mining
What is data mining?
Rather formal definition:
• Data mining involves fitting models to, and
observing patterns from, observed data through
the application of specific algorithms.
Less formally:
• Data analysis in order to explain an aspect of a
complex reality by expressing it as an
understandable simplification
Goals for data mining
• Prediction
– involves using some variables or fields in the database
to predict unknown or future values of other
variables of interest
• Description
– focuses on finding human interpretable patterns
describing the data
Rationale for data mining
• Dramatic increase in the amount of data available
(the data explosion)
• Increasing competition in the world’s market
• The low relative value of easily discovered
information
• Increasing cleverness
• Emergence of new enabling technology
Enabling factors for data mining
• Increased data storage ability
• Increased data gathering ability
• Increased processing power
• The introduction of new computationally intensive methods of
machine learning
Background to data mining
• Inductive learning
– supervised learning
– unsupervised learning
• Statistics
• Machine learning
– Differences between DM and ML
• DM finds understandable knowledge, ML improves the
performance of an agent
• DM is concerned with large, real-world databases, ML with
smaller data sets
• ML is a broader field, not only learning by example
Data mining algorithms
Specific mix of three components:
• The model
– function
– representational form
– parameters from the data
• The model evaluation (preference) criterion
– preference of one set of models or set of parameters over
another
– based on goodness-of-fit function
• The search method
– a method for finding particular models and parameters
– Given: data, family of models, preference criterion
Primary operations in data mining
A number of basic operations can be used for
prediction and description:
– Classification
– Regression
– Clustering
– Summarisation
– Dependency modelling
– Change and deviation detection
Classification
• Learning a function that maps (classifies) a data item
into one of several predefined classes
• In supervised learning it is the user that defines the
classes.
• The classification is applied in the form of one or more
attributes that denote the class of the data item.
• These classifying attributes are known as predicted
attributes. A combination of values for the predicted
attributes defines a class
• Other attributes of the data item are known as
predicting attributes
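A minimal sketch of the idea: learn a function mapping a data item's predicting attributes to a predefined class. A 1-nearest-neighbour rule stands in for whatever algorithm a real tool would use, and the training data is invented:

```python
# Classification sketch: map a data item (predicting attributes) to one
# of several predefined classes (the predicted attribute), here via a
# 1-nearest-neighbour rule over invented training data.
import math

# Training items: (predicting attributes, class label)
training = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"),
    ((8.0, 8.0), "high"), ((9.0, 7.5), "high"),
]

def classify(item):
    """Assign the class of the nearest training item (Euclidean distance)."""
    nearest = min(training, key=lambda t: math.dist(item, t[0]))
    return nearest[1]

print(classify((2.0, 1.0)))  # "low"
print(classify((7.0, 9.0)))  # "high"
```

In supervised learning, the "low"/"high" labels are exactly the user-defined classes the slide refers to.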
Regression
• A common statistical technique for modelling the
relationship between two or more variables
• Learning a function which maps a data item to a real-valued
prediction variable
• Simple linear regression uses the straight-line model
Y = β0 + β1X + ε, where Y is the prediction variable
(dependent variable) and X is the predictive variable
(independent variable)
• Multiple regression involves more than two variables and
uses the model Y = β0 + β1X1 + β2X2 + … + βnXn + ε, where
Y is the prediction variable and X1 … Xn are the
predictive variables
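The simple linear model can be fitted by ordinary least squares in a few lines; the data points below are invented (roughly Y = 2X plus noise):

```python
# Simple linear regression sketch: estimate b0 and b1 in the
# straight-line model Y = b0 + b1*X by ordinary least squares.

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # b1 = covariance(X, Y) / variance(X); b0 puts the line through the means
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x
    return b0, b1

xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.1, 5.9, 8.0, 10.1]  # invented data, roughly Y = 2X
b0, b1 = fit_line(xs, ys)
print(b0, b1)  # intercept near 0, slope near 2
```

Multiple regression generalises the same least-squares idea to several predictive variables X1 … Xn.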
Clustering
• A common descriptive task for determining a finite
set of categories or clusters to describe the data
• Categories may be mutually exclusive and
exhaustive, or consist of richer representations
such as hierarchical or overlapping categories
• A cluster is a group of objects grouped together
because of their similarity or proximity. Data units
in a cluster are both homogeneous and differ
significantly from other groups
• Correlations and functions of distance between
elements are used in defining the clusters
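As a sketch, a tiny one-dimensional k-means loop shows how distance between elements defines the clusters; the points, starting centroids, and k = 2 are illustrative assumptions:

```python
# Clustering sketch: group 1-D data items by proximity using a minimal
# k-means loop (invented points, k = 2, fixed iteration count).

def kmeans_1d(points, centroids, iterations=10):
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.8]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 10.0])
print(sorted(clusters[0]), sorted(clusters[1]))
```

The two resulting groups are internally homogeneous and clearly separated from each other, which is the defining property of a cluster above.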
Summarisation
• Methods for finding a compact description for a
subset of data
• Often relies on statistical methods such as the
calculation of means and standard deviations
• Often applied in interactive exploratory data
analysis and automated report generation.
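Python's standard `statistics` module provides these measures directly; the sales figures below are invented:

```python
# Summarisation sketch: a compact statistical description (count, mean,
# standard deviation) of a subset of data, using the stdlib statistics module.
import statistics

quarterly_sales = [1245, 34534, 45543, 34533]  # invented figures

summary = {
    "count": len(quarterly_sales),
    "mean": statistics.mean(quarterly_sales),
    "stdev": statistics.stdev(quarterly_sales),  # sample standard deviation
}
print(summary)
```

Three numbers stand in for the whole subset, which is what makes the description "compact".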
Dependency modelling
• Consists of finding a model which describes
significant dependencies between variables
• There are two levels of dependency in dependency
models:
• The structural level specifies which variables are
locally dependent on each other
• The quantitative level specifies the strengths of
the dependencies using some numerical scale
• Often in the form: x% of all records containing
items A and B also contain items D and E
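The quantitative level of such a rule can be computed as a confidence measure over a set of transactions; the transaction data and item names below are invented:

```python
# Dependency-modelling sketch: for a rule "x% of records containing
# items A and B also contain item D", compute the confidence over a
# set of invented transactions.

transactions = [
    {"A", "B", "D"}, {"A", "B", "D", "E"},
    {"A", "B"}, {"A", "C"}, {"B", "D"},
]

def confidence(antecedent, consequent):
    """Share of records containing the antecedent that also contain the consequent."""
    containing = [t for t in transactions if antecedent <= t]
    matching = [t for t in containing if consequent <= t]
    return len(matching) / len(containing)

# 2 of the 3 records with both A and B also contain D
print(confidence({"A", "B"}, {"D"}))
```

The structural level corresponds to choosing which item sets to pair as antecedent and consequent; the confidence value is the strength on a numerical scale.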
Change and deviation detection
• Focuses on discovering the most significant changes
in the data from previously measured or normative
values
• Often used on a long time series of records in
order to discover trends
• Often used to discover sequential patterns
occurring over extended time periods
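A simple form of deviation detection compares new values against a normative value derived from a baseline period; the series and the three-standard-deviations rule here are illustrative assumptions:

```python
# Change/deviation detection sketch: flag values that deviate
# significantly from a normative value (the mean of an earlier
# baseline period), using an illustrative 3-sigma rule.
import statistics

baseline = [100, 102, 98, 101, 99, 100, 103, 97]  # invented history
mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)

new_values = [101, 99, 130, 100]
deviations = [v for v in new_values if abs(v - mean) > 3 * stdev]
print(deviations)  # [130]
```

Over a long time series the same comparison, applied against a moving baseline, surfaces trends and sequential patterns rather than single outliers.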
Problems and issues in data
mining
• Limited information
• Noise and missing values
• Uncertainty
• Size of databases
• Irrelevance of certain fields
• Updates to databases
Multidimensional analysis and OLAP
OLAP vs OLTP
• OLTP servers handle mission-critical production data
accessed through simple queries
• usually handles queries of an automated nature
• OLTP applications consist of a large number of relatively
simple transactions.
• Most often contains data organised on the basis of logical
relations between normalised tables
• OLAP servers handle management-critical data
accessed through an iterative analytical investigation
• usually handles queries of an ad-hoc nature
• supports more complex and demanding transactions
• contains logically organised data in multiple dimensions
What is OLAP?
Definition: The dynamic synthesis, analysis and
consolidation of large volumes of multidimensional
data.
• Flexible information synthesis
• Multiple data dimensions/consolidation paths
• Dynamic data analysis
Codd’s four data models for data
analysis
• Categorical data models
• Exegetical data models
• Contemplative data models
• Formulaic data models
Dimensionality revisited
[Diagram: the focal event Sales analysed along the dimensions Region,
Year/Quarter, and Product group/Product type]
OLAP Tool evaluation criteria (1-6)
• Multidimensional conceptual view
• Transparency
• Accessibility
• Consistent reporting performance
• Client-Server architecture
• Generic dimensionality
OLAP Tool evaluation criteria (7-12)
• Dynamic Sparse Matrix handling
• Multi-user support
• Unrestricted cross-dimensional analysis
• Intuitive data manipulation
• Flexible reporting
• Unlimited dimensions and aggregation levels
Functionality of OLAP tools
• Drill-down
• Drill-up
• Roll-up or consolidation
• “Slicing and dicing” by pivoting
• Drill-through
• Drill-across
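Two of these operations can be sketched over a tiny fact table: roll-up (consolidating sales across the product-group dimension) and a slice (fixing one dimension value). The facts echo the answer set on the next slide; the dimension names are illustrative:

```python
# OLAP operations sketch over a small fact table of the focal event:
# roll-up aggregates away a dimension; a slice fixes one dimension value.
from collections import defaultdict

# Each fact: (region, product_group, quarter, sales)
facts = [
    ("ABC", "Group A", "Q1-1997", 1245),
    ("XYZ", "Group A", "Q1-1997", 34534),
    ("ABC", "Group B", "Q1-1997", 45543),
    ("XYZ", "Group B", "Q1-1997", 34533),
]

# Roll-up / consolidation: sum sales per region, aggregating away product group
by_region = defaultdict(int)
for region, group, quarter, sales in facts:
    by_region[region] += sales
print(dict(by_region))  # {'ABC': 46788, 'XYZ': 69067}

# Slice: fix the product-group dimension at "Group A"
group_a = [f for f in facts if f[1] == "Group A"]
print(group_a)
```

Drill-down is the inverse of the roll-up: reintroducing the product-group dimension recovers the original facts.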
An OLAP “answer set”
The Product Group and Region columns are headers acting as join
constraints; the quarter column header is the application constraint;
the cells of the answer set represent the focal event.

Product Group   Region   First Quarter - 1997
Group A         ABC       1245
Group A         XYZ      34534
Group B         ABC      45543
Group B         XYZ      34533
Different forms of OLAP
• True OLAP
• ROLAP (relational OLAP)
• MOLAP (multidimensional OLAP)