
Historical Perspective
The Relational Model
revolutionized transaction processing systems
DBMS gave access to the data stored
OLTP systems are good at putting data into databases
The data explosion
Increase in use of electronic data gathering devices e.g. point-of-sale, remote sensing devices etc.
Data storage became easier and cheaper with increasing computing power
Problems
DBMS gave access to the data stored but no analysis of data
Analysis required to unearth the hidden relationships within the data i.e. for decision support
Size of databases has increased (e.g. VLDBs); automated analysis techniques are needed because databases have grown beyond manual extraction
Obstacles
the typical scientific user knew nothing of commercial business applications
business database programmers knew nothing of massively parallel principles
solution was for database software producers to create easy-to-use tools and form strategic relationships with hardware manufacturers
What is data mining? "The non-trivial extraction of implicit, previously unknown, and potentially useful information from data"
William J Frawley, Gregory Piatetsky-Shapiro and Christopher J Matheus
Data mining is the analysis of data and the use of software techniques for finding patterns and regularities in sets of data.
The computer is responsible for finding the patterns by identifying the underlying rules and features in the data.
It is possible to `strike gold' in unexpected places, as the data mining software extracts patterns that were not previously
discernible or were so obvious that no one had noticed them before.
Mining analogy:
large volumes of data are sifted in an attempt to find something worthwhile
in a mining operation large amounts of low grade materials are sifted through in order to find something of value.
Books:
• Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001, ISBN 1-55860-489-8.
• Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations,
Morgan Kaufmann, 1999, ISBN 1-55860-552-5.
Data Mining vs. DBMS
DBMS - queries based on the data held e.g.
• last month's sales for each product
• sales grouped by customer age etc.
• list of customers who lapsed their policy
Data Mining - infer knowledge from the data held to answer queries e.g.
• what characteristics do customers share who lapsed their policies and how do they differ from those who
renewed their policies?
• why is the Cleveland division so profitable?
Characteristics of a data mining system
Large quantities of data
• volume of data so great it has to be analyzed by automated techniques e.g. POS, satellite information,
credit card transactions etc.
Noisy, incomplete data
• imprecise data is characteristic of all data collection
• databases - usually contaminated by errors, cannot assume that the data they contain is entirely correct
e.g. some attributes rely on subjective or measurement judgments
Complex data structure - conventional statistical analysis not possible
Heterogeneous data stored in legacy systems
Who needs data mining?
Who(ever) has information fastest and uses it wins
Don Keough, former president of Coca-Cola
Data Mining Applications
Medicine - drug side effects, hospital cost analysis, genetic sequence analysis, prediction etc.
Finance - stock market prediction, credit assessment, fraud detection etc.
Marketing/sales - product analysis, buying patterns, sales prediction, target mailing, identifying `unusual behavior' etc.
Knowledge Acquisition
Expert systems are models of real world processes
Much of the information is available straight from the process e.g.
in production systems, data is collected for monitoring the system
knowledge can be extracted using data mining tools
experts can verify the knowledge
Engineering - automotive diagnostic expert systems, fault detection etc.
Data Mining Goals
Classification
DM system learns from examples or the data how to partition or classify the data i.e. it formulates classification rules
Example - customer database in a bank
Question - Is a new customer applying for a loan a good investment or not?
Typical rule formulated:
if STATUS = married and INCOME > 10000 and HOUSE_OWNER = yes
then INVESTMENT_TYPE = good
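The rule above can be read as a simple predicate over one customer record. A minimal sketch in Python, assuming a record is a dictionary whose keys mirror the attribute names in the rule; the function name classify_investment and the "unknown" fallback are illustrative and not part of the original material.

```python
def classify_investment(customer):
    """Apply the single example rule; cases the rule does not cover are 'unknown'."""
    if (customer["STATUS"] == "married"
            and customer["INCOME"] > 10000
            and customer["HOUSE_OWNER"] == "yes"):
        return "good"
    return "unknown"  # assumption: the slide gives no rule for other cases


# Hypothetical loan applicant
applicant = {"STATUS": "married", "INCOME": 25000, "HOUSE_OWNER": "yes"}
print(classify_investment(applicant))  # good
```

A real DM system would learn many such rules from examples rather than have them written by hand; the sketch only shows how one learned rule is applied to a new customer.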
Association
Rules that associate one attribute of a relation to another
Set oriented approaches are the most efficient means of discovering such rules
Example - supermarket database
72% of all the records that contain items A and B also contain item C
the specific percentage of occurrences, 72%, is the confidence factor of the rule
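The confidence factor is the fraction of records containing the rule's left-hand side (A and B) that also contain its right-hand side (C). A minimal sketch in Python, assuming each record is a set of item names; the function name confidence and the toy baskets are made up for illustration, so the result is not the 72% quoted above.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return 0.0  # rule never applies, so report zero confidence
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    return len(with_both) / len(with_antecedent)


# Hypothetical supermarket baskets
baskets = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C", "D"},
]
conf = confidence(baskets, {"A", "B"}, {"C"})
print(f"{conf:.0%}")  # 75%: four baskets contain A and B, three of those contain C
```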
Sequence/Temporal
Sequential pattern functions analyze collections of related records and detect frequently occurring patterns over a period of time
Difference between sequence rules and other rules is the temporal factor
Example - retailer's database
Can be used to discover the set of purchases that frequently precedes the purchase of a microwave oven
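The microwave oven example can be sketched by counting which items customers bought before their first purchase of the target item. This is a deliberately simplified illustration in Python, not a full sequential pattern algorithm: it counts single items rather than sets of purchases, and the function name items_preceding, the customer ids, and the purchase histories are all hypothetical.

```python
from collections import Counter


def items_preceding(histories, target):
    """Count, once per customer, the items bought before the first purchase of `target`."""
    counts = Counter()
    for purchases in histories.values():
        ordered = sorted(purchases)  # ISO date strings sort chronologically
        items = [item for _, item in ordered]
        if target not in items:
            continue  # this customer never bought the target item
        first = items.index(target)
        counts.update(set(items[:first]))
    return counts


# Hypothetical purchase histories: customer id -> list of (date, item)
histories = {
    "c1": [("2001-01-05", "fridge"), ("2001-02-10", "cookware"), ("2001-03-01", "microwave oven")],
    "c2": [("2001-04-02", "cookware"), ("2001-04-20", "microwave oven"), ("2001-05-01", "toaster")],
    "c3": [("2001-06-11", "toaster"), ("2001-07-15", "fridge")],
}
preceding = items_preceding(histories, "microwave oven")
print(preceding.most_common(1))  # [('cookware', 2)]
```

The temporal factor shows up in the sort by date: the same baskets without dates could only yield association rules, not sequence rules.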
Data Mining and Machine Learning
Data Mining (DM) or Knowledge Discovery in Databases (KDD) is about finding understandable knowledge
Machine Learning (ML) is concerned with improving performance of an agent
training a neural network to balance a pole is part of ML, but not of KDD
Efficiency and scalability of the algorithm are more important in DM or KDD
DM is concerned with very large, real-world databases
ML typically looks at smaller data sets
ML has laboratory type examples for the training set
DM deals with `real world' data. Real world data tend to have problems such as:
missing values
dynamic data
noise
Statistical Data Analysis
Ill-suited for Nominal and Structured Data Types
Completely data driven - incorporation of domain knowledge not possible
Interpretation of results is difficult and daunting
Requires expert user guidance
Stages of the Data Mining Process
Data pre-processing
• heterogeneity resolution
• data cleansing
• data warehousing
Applying Data Mining Tools: extraction of patterns from the pre-processed data
Interpretation and evaluation: the user bias can direct DM tools to areas of interest
• attributes of interest in databases
• goal of discovery
• domain knowledge
• prior knowledge or belief about the domain
Techniques
Machine Learning methods
Statistics: can be used in several data mining stages
• data cleansing i.e. the removal of erroneous or irrelevant data
• EDA, exploratory data analysis e.g. frequency counts, histograms etc.
• data selection - sampling facilities reduce the scale of computation
• attribute re-definition
• data analysis - measures of association and relationships between attributes, interestingness of rules, classification etc.
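Two of the statistical steps listed above, frequency counts for EDA and sampling for data selection, can be sketched with the Python standard library. The records, the attribute name age_band, and the fixed random seed are assumptions chosen only to make the example small and repeatable.

```python
import random
from collections import Counter

# Hypothetical customer records with one nominal attribute
records = [{"age_band": band} for band in
           ["18-30", "31-50", "31-50", "51+", "31-50", "18-30"]]

# EDA: frequency count per attribute value, shown as a crude text histogram
age_counts = Counter(r["age_band"] for r in records)
for band, count in sorted(age_counts.items()):
    print(f"{band:>6} {'#' * count}")

# Data selection: a random sample (without replacement) reduces the scale of computation
rng = random.Random(42)  # fixed seed so the sketch is repeatable
sample = rng.sample(records, k=3)
print(len(sample), "of", len(records), "records sampled")
```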
Visualization: enhances EDA, makes patterns more visible
Clustering (Cluster Analysis)
• Clustering and segmentation basically partition the database so that the members of each partition or group are similar
according to some criterion or metric
• Clustering according to similarity is a concept which appears in many disciplines e.g. in chemistry the clustering of
molecules
• Data mining applications make use of clustering according to similarity e.g. to segment a client/customer base
• It provides sub-groups of a population for further analysis or action - very important when dealing with very large
databases
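One common way to partition records by similarity is k-means. The original does not name a specific algorithm, so the choice of k-means, the Euclidean distance metric, the function name kmeans, and the customer figures below are all assumptions made for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points; returns (centroids, cluster label per point)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2 + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its old centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels


# Hypothetical customers: (annual spend in thousands, visits per month)
customers = [(1.0, 2.0), (1.5, 1.0), (2.0, 2.5), (9.0, 10.0), (10.0, 9.0), (11.0, 11.0)]
centroids, labels = kmeans(customers, centroids=[customers[0], customers[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each label identifies a segment of the customer base that can then be analyzed or acted on separately.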
Knowledge Representation Methods
Neural Networks
• a trained neural network can be thought of as an "expert" in the category of information it has been
given to analyze
• provides projections given new situations of interest and answers "what if" questions
• problems include:
• the resulting network is viewed as a black box
• no explanation of the results is given i.e. difficult for the user to interpret the results
• difficult to incorporate user intervention
• slow to train due to their iterative nature
Decision trees
• used to represent knowledge
• built using a training set of data and can then be used to classify new objects
• problems are:
• opaque structure - difficult to understand
• missing data can cause performance problems
• they become cumbersome for large data sets
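Building a tree from a training set and then classifying a new object can be illustrated with the smallest possible tree, a single split (often called a decision stump). This is a sketch under strong simplifications: one level only, numeric attributes only, and split quality measured by training errors. The names train_stump and classify and the loan data are hypothetical.

```python
def train_stump(rows, labels):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [lab for r, lab in zip(rows, labels) if r[f] <= threshold]
            right = [lab for r, lab in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue  # a split must put objects on both sides
            left_label = max(set(left), key=left.count)     # majority class on each side
            right_label = max(set(right), key=right.count)
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:
        raise ValueError("training set has no attribute with two distinct values")
    return best[1:]


def classify(stump, row):
    """Classify a new object with the learned split."""
    f, threshold, left_label, right_label = stump
    return left_label if row[f] <= threshold else right_label


# Hypothetical training set: (income, age) -> investment type
rows = [(8000, 25), (12000, 40), (15000, 35), (6000, 50), (20000, 30)]
labels = ["bad", "good", "good", "bad", "good"]
stump = train_stump(rows, labels)
print(stump)                         # (0, 8000, 'bad', 'good')
print(classify(stump, (9000, 45)))   # good
```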
Rules
• probably the most common form of representation
• tend to be simple and intuitive
• unstructured and less rigid
• problems are:
• difficult to maintain
• inadequate to represent many types of knowledge
• Example format: if X then Y
Related Technologies: Data Warehousing
Definition
A data warehouse can be defined as any centralized data repository which can be queried for business
benefit. Warehousing makes it possible to:
• extract archived operational data
• overcome inconsistencies between different legacy data formats
• integrate data throughout an enterprise, regardless of location, format, or communication
requirements
• incorporate additional or expert information
Characteristics of a data warehouse
• subject-oriented - data organized by subject instead of application e.g.
• an insurance company would organize their data by customer, premium, and claim, instead
of by different products (auto, life, etc.)
• contains only the information necessary for decision support processing
• integrated - encoding of data is often inconsistent e.g. gender might be coded as "m" and "f" or
0 and 1 but when data are moved from the operational environment into the data warehouse they
assume a consistent coding convention
• time-variant - the data warehouse is a place for storing data that are five to ten years old, or older
• these data are used for comparisons, trends, and forecasting
• these data are not updated
• non-volatile
• data are not updated or changed in any way once they enter the data warehouse
• data are only loaded and accessed
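The "integrated" characteristic above can be sketched as a per-source mapping applied while data move from the operational systems into the warehouse. The source system names, the target codes "M" and "F", and the reading of 0 and 1 are assumptions; the original only says that the codings differ and must be made consistent.

```python
# Per-source mapping to one warehouse convention
SOURCE_GENDER_MAPS = {
    "policy_system": {"m": "M", "f": "F"},
    "claims_system": {0: "M", 1: "F"},  # assumption: 0 = male, 1 = female
}


def to_warehouse_row(source, row):
    """Return a copy of `row` with gender recoded to the warehouse convention."""
    mapping = SOURCE_GENDER_MAPS[source]
    return {**row, "gender": mapping[row["gender"]]}


# Hypothetical operational records tagged with their source system
operational = [
    ("policy_system", {"customer": "C001", "gender": "f"}),
    ("claims_system", {"customer": "C002", "gender": 0}),
]
warehouse = [to_warehouse_row(src, row) for src, row in operational]
print(warehouse)
# [{'customer': 'C001', 'gender': 'F'}, {'customer': 'C002', 'gender': 'M'}]
```

An unknown code raises a KeyError here, which is one simple way to surface dirty source data instead of loading it silently.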
Data warehousing Processes
• insulate data - i.e. the current operational information
• preserves the security and integrity of mission-critical OLTP applications
• gives access to the broadest possible base of data
• retrieve data - from a variety of heterogeneous operational databases
• data is transformed and delivered to the data warehouse/store based on a selected model
(or mapping definition)
• metadata - information describing the model and definition of the source data elements
• data cleansing - removal of certain aspects of operational data, such as low-level transaction
information, which slows down query times
• transfer - processed data transferred to the data warehouse, a large database on a high performance box
Criteria for a data warehouse
Load Performance
require incremental loading of new data on a periodic basis
must not artificially constrain the volume of data
Load Processing
data conversions, filtering, reformatting, integrity checks, physical storage, indexing, and metadata update
Data Quality Management
ensure local consistency, global consistency, and referential integrity despite "dirty" sources and massive
database size
Query Performance
must not be slowed or inhibited by the performance of the data warehouse RDBMS
Terabyte Scalability
Data warehouse sizes are growing at astonishing rates so RDBMS must not have any architectural limitations.
It must support modular and parallel management.
Mass User Scalability
Access to warehouse data must not be limited to the elite few; the warehouse has to support hundreds, even thousands,
of concurrent users while maintaining acceptable query performance.
Networked Data Warehouse
Data warehouses rarely exist in isolation; users must be able to look at and work with multiple warehouses
from a single client workstation
Warehouse Administration
large scale and time-cyclic nature of the data warehouse demands administrative ease and flexibility
The RDBMS Must Integrate Dimensional Analysis
dimensional support must be inherent in the warehouse RDBMS to provide the highest performance
for relational OLAP tools
Advanced Query Functionality
End users require advanced analytic calculations, sequential and comparative analysis, and consistent access
to detailed and summarized data
Data warehousing vs. OLTP
OLTP systems are designed to maximize transaction capacity, but they:
cannot be repositories of facts and historical data for business analysis
cannot quickly answer ad hoc queries
rapid retrieval is almost impossible
data is inconsistent and changing, duplicate entries exist, entries can be missing
OLTP offers large amounts of raw data which is not easily understood
Typical OLTP query is a simple aggregation e.g.
what is the current account balance for this customer?
Data warehouses are interested in query processing as opposed to transaction processing
Typical business analysis query e.g.
which product line sells best in middle-America and how does this correlate to demographic data?
OLAP (On-line Analytical processing)
Problem is how to process larger and larger databases
OLAP involves many data items (many thousands or even millions) which are involved in complex relationships
Fast response is crucial in OLAP
Difference between OLAP and OLTP
OLTP servers handle mission-critical production data accessed through simple queries
OLAP servers handle management-critical data accessed through an iterative analytical investigation
OLAP operations
Consolidation - involves the aggregation of data i.e. simple roll-ups or complex expressions involving inter-related data
e.g. sales offices can be rolled-up to districts and districts rolled-up to regions
Drill-Down - can go in the reverse direction i.e. automatically display detail data which comprises consolidated data
"Slicing and Dicing" - ability to look at the database from different viewpoints e.g.
one slice of the sales database might show all sales by product type within regions;
another slice might show all sales by sales channel within each product type
often performed along a time axis in order to analyze trends and find patterns
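The three OLAP operations above can be sketched over a small in-memory list of sales records. A real OLAP server works on far larger, pre-aggregated structures; the record layout, the function names roll_up, drill_down, and slice_by, and the figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales records: office -> district -> region hierarchy
sales = [
    {"office": "O1", "district": "D1", "region": "R1", "product": "P1", "channel": "store", "amount": 100},
    {"office": "O2", "district": "D1", "region": "R1", "product": "P2", "channel": "web", "amount": 50},
    {"office": "O3", "district": "D2", "region": "R1", "product": "P1", "channel": "web", "amount": 70},
    {"office": "O4", "district": "D3", "region": "R2", "product": "P1", "channel": "store", "amount": 30},
]


def roll_up(rows, level):
    """Consolidation: total sales aggregated at one level of the hierarchy."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[level]] += row["amount"]
    return dict(totals)


def drill_down(rows, level, value, detail_level):
    """Drill-down: the detail totals that make up one consolidated figure."""
    return roll_up([r for r in rows if r[level] == value], detail_level)


def slice_by(rows, outer, inner):
    """Slice: totals viewed by one dimension within another."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row[outer], row[inner])] += row["amount"]
    return dict(totals)


print(roll_up(sales, "region"))                       # {'R1': 220, 'R2': 30}
print(drill_down(sales, "region", "R1", "district"))  # {'D1': 150, 'D2': 70}
print(slice_by(sales, "region", "product"))           # sales by product type within regions
print(slice_by(sales, "product", "channel"))          # sales by channel within product type
```

Adding a date field to each record and slicing on it would give the time-axis view used for trend analysis.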