OLAM and Data Mining: Concepts and Techniques

Introduction
• Data explosion problem:
– Automated data collection tools and mature database
technology lead to tremendous amounts of data stored
in databases, data warehouses and other information
repositories
• We are drowning in data, but starving for
knowledge!
• Data warehousing and data mining:
– On-line analytical processing (OLAP): query-driven data
analysis
– Data mining: the efficient discovery of interesting knowledge
(rules, regularities, patterns, constraints) from data in large
databases
Evolution of Database Technology
• 1960s:
– Data collection, database creation, IMS and network
DBMS
• 1970s:
– Relational data model, relational DBMS
• 1980s:
– RDBMS, advanced data models (extended-relational,
OO, deductive, etc.) and application-oriented DBMS
(spatial, scientific, engineering, etc.)
• 1990s:
– Data mining and data warehousing, multimedia
databases, and Web technology
What is data mining?
• Data mining: the efficient discovery of previously
unknown patterns, relationships, and rules in large
databases and data warehouses
• Goal: help the human analyst to understand the
data
• SQL query:
– How many bottles of wine did we sell in the 1st quarter
of 1999 in Poland vs. Austria?
What is data mining?
• Data mining query:
– How do the buyers of wine in Poland and Austria
differ?
– What else do the buyers of wine in Austria buy along
with wine?
– How can the buyers of wine be characterized?
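The contrast between the two kinds of question can be sketched in a few lines. This is only an illustration: the `sales` table, its columns, and every row in it are hypothetical, and Python's built-in `sqlite3` module stands in for a real warehouse. The first query answers a fixed, fully specified question; the second scans every product that shares a basket with wine, which is closer in spirit to the open-ended mining questions.

```python
import sqlite3

# Hypothetical rows: (basket_id, country, year, quarter, product, quantity).
ROWS = [
    (1, "Poland", 1999, 1, "wine", 2),
    (1, "Poland", 1999, 1, "cheese", 1),
    (2, "Austria", 1999, 1, "wine", 3),
    (2, "Austria", 1999, 1, "chocolate", 1),
    (3, "Austria", 1999, 1, "wine", 1),
    (3, "Austria", 1999, 1, "chocolate", 2),
    (3, "Austria", 1999, 1, "bread", 1),
    (4, "Austria", 1999, 2, "wine", 5),
    (5, "Poland", 1999, 1, "beer", 6),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (basket_id INTEGER, country TEXT, year INTEGER, "
    "quarter INTEGER, product TEXT, quantity INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)", ROWS)

# SQL-style question: the analyst already knows exactly what to ask.
wine_totals = dict(conn.execute(
    "SELECT country, SUM(quantity) FROM sales "
    "WHERE product = 'wine' AND year = 1999 AND quarter = 1 "
    "AND country IN ('Poland', 'Austria') GROUP BY country"
).fetchall())

# Mining-style question: what else is bought along with wine in Austria?
# Every co-occurring product is a candidate; none is named in advance.
co_purchases = conn.execute(
    "SELECT other.product, COUNT(DISTINCT other.basket_id) AS baskets "
    "FROM sales AS wine JOIN sales AS other "
    "ON wine.basket_id = other.basket_id "
    "WHERE wine.product = 'wine' AND wine.country = 'Austria' "
    "AND other.product <> 'wine' "
    "GROUP BY other.product ORDER BY baskets DESC, other.product"
).fetchall()

print(wine_totals)
print(co_purchases)
```

A real mining system would go further than this co-occurrence count, for example by ranking candidate patterns by interestingness rather than listing them all.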
What is data mining?
• Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
information from data in large databases
• Alternative names and their “inside stories”:
– Knowledge discovery in databases (KDD: SIGKDD),
knowledge extraction, data archeology, data dredging,
information harvesting, business intelligence, etc.
– Data mining: a misnomer?
• What is not data mining?
– Expert systems or small statistical programs
– OLAP
Data Mining: A KDD Process
• Steps of a KDD Process:
– Learning the application domain:
• relevant prior knowledge and goals of the application
– Creating a target data set: data selection
– Data cleaning and preprocessing (may take 60% of the effort!)
– Data reduction and projection:
• finding useful features, dimensionality/variable reduction, invariant
representation
– Choosing functions of data mining
• summarization, classification, regression, association, clustering.
– Choosing the mining algorithm(s)
– Data mining: search for patterns of interest
– Interpretation: analysis of results.
• visualization, transformation, removing redundant patterns, etc.
– Use of discovered knowledge
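The steps above can be read as a pipeline in which each stage feeds the next. The sketch below is a toy illustration under stated assumptions: the records, the function names (`select`, `clean`, `reduce_features`, `mine`, `interpret`), and the age bands are all hypothetical, and the "mining" step is a plain frequency count standing in for a real algorithm.

```python
from collections import Counter

# Hypothetical raw records: (customer_id, age, city, purchased_product).
RAW = [
    (1, 34, "Vienna", "wine"),
    (2, None, "Vienna", "wine"),   # missing age
    (3, 51, "Warsaw", "beer"),
    (4, 29, "Vienna", "wine"),
    (5, 47, "Warsaw", "wine"),
    (5, 47, "Warsaw", "wine"),     # duplicate record
]

def select(records):
    """Create the target data set: keep only wine purchases."""
    return [r for r in records if r[3] == "wine"]

def clean(records):
    """Data cleaning: drop records with missing values and exact duplicates."""
    seen, out = set(), []
    for r in records:
        if None in r or r in seen:
            continue
        seen.add(r)
        out.append(r)
    return out

def reduce_features(records):
    """Data reduction: keep only an age band and the city."""
    return [("under 40" if age < 40 else "40 and over", city)
            for _, age, city, _ in records]

def mine(records):
    """Mining function (summarization): count each (age band, city) pair."""
    return Counter(records)

def interpret(counts, min_count=2):
    """Interpretation: keep only patterns seen at least min_count times."""
    return {pattern: n for pattern, n in counts.items() if n >= min_count}

patterns = interpret(mine(reduce_features(clean(select(RAW)))))
print(patterns)
```

In practice the stages are iterated rather than run once: a weak result at the interpretation step often sends the analyst back to selection or cleaning.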
Data Mining and Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions increases from bottom to top. Layers, top to bottom:]
• Making decisions
• Data presentation: visualization
• Data mining: information discovery
• Data exploration: statistical analysis, reporting
• Data warehouses/data marts: OLAP, MDA
• Data sources: paper, files, database systems, OLTP, WWW
Roles shown beside the pyramid, top to bottom: end user, business analyst, data analyst, DBA
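An OLAM Architecture
[Figure: layered OLAM architecture. Components, top to bottom:]
• User GUI API (mining query in, mining result out)
• OLAM engine and OLAP engine
• Data cube API
• MDDB and metadata
• Database API, with filtering and integration
• Databases and data warehouse, with data cleaning, data integration, and filtering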
Data Mining: Confluence of Multiple Disciplines
• Database systems, data warehouse and OLAP
• Statistics
• Machine learning
• Visualization
• Information science
• High performance computing
• Other disciplines:
– Neural networks, mathematical modeling, information
retrieval, pattern recognition, etc.
Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB systems and information repositories
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW
Data Mining Functionality
Data mining methods may be classified into six
basic classes:
• Associations
– Finding rules like “if the customer buys mustard,
sausage, and beer, then the probability that he/she buys
chips is 50%”
• Classifications
– Classify data based on the values of the decision
attribute, e.g. classify patients based on their “state”
• Clustering
– Group data to form new classes, e.g. cluster customers
based on their behavior to find common patterns
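The "probability" in an association rule such as the mustard, sausage, and beer example is usually called the rule's confidence: among the baskets that contain the left-hand side, the fraction that also contain the right-hand side. A minimal sketch follows, assuming a hypothetical list of baskets; the helper names `support` and `confidence` are chosen for this illustration and do not come from any particular library.

```python
# Hypothetical market baskets; each set holds the items of one transaction.
BASKETS = [
    {"mustard", "sausage", "beer", "chips"},
    {"mustard", "sausage", "beer"},
    {"mustard", "sausage", "beer", "chips", "bread"},
    {"mustard", "sausage", "beer", "bread"},
    {"beer", "chips"},
    {"bread", "milk"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item of the itemset."""
    return sum(itemset <= basket for basket in baskets)

def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Among baskets containing the antecedent, the fraction that also
    contain the consequent."""
    covered = count(antecedent, baskets)
    if covered == 0:
        raise ValueError("antecedent never occurs in the baskets")
    return count(antecedent | consequent, baskets) / covered

rule_lhs = {"mustard", "sausage", "beer"}
rule_rhs = {"chips"}
print(support(rule_lhs | rule_rhs, BASKETS))   # share of baskets with all four items
print(confidence(rule_lhs, rule_rhs, BASKETS)) # confidence of the rule
```

With these made-up baskets, four contain the left-hand side and two of those also contain chips, so the rule holds with 50% confidence. Association mining algorithms search for all rules whose support and confidence exceed user-chosen thresholds instead of testing a single rule.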
Data Mining Functionality
• Sequential patterns
– Finding rules like “if the customer buys a TV and, a few
days later, a camera, then the probability that he/she
will buy a video within 1 month is 50%”
• Time-Series similarities
– Finding similar sequences (or subsequences) in time series (e.g. stock analysis)
• Outlier detection
– Finding anomalies/exceptions/deviations in data
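As one illustration of the last class, a very simple outlier detector flags values that lie far from the mean, measured in standard deviations (a z-score test). This is a sketch under stated assumptions, not a method prescribed by the slides: the daily sales figures are hypothetical, the threshold of 2.0 is an arbitrary choice, and the function name `z_score_outliers` is made up for this example.

```python
from statistics import mean, pstdev

# Hypothetical daily sales counts; the last value is deliberately unusual.
DAILY_SALES = [102, 98, 101, 97, 103, 99, 100, 180]

def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing can deviate
    return [v for v in values if abs(v - mu) / sigma > threshold]

print(z_score_outliers(DAILY_SALES))
```

The z-score test is easily distorted because the outlier itself inflates the mean and the standard deviation; robust variants based on the median are a common alternative.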