Data Mining: Concepts & Techniques - Yue

Data Mining:
Concepts & Techniques
Motivation:
Necessity is the Mother of Invention
• Data explosion problem
– Automated data collection tools and mature database
technology lead to tremendous amounts of data stored
in databases, data warehouses and other information
repositories
• We are drowning in data, but starving for
knowledge!
• Solution: Data warehousing and data mining
– Data warehousing and on-line analytical processing
– Extraction of interesting knowledge (rules, regularities,
patterns, constraints) from data in large databases
Evolution of Database Technology
What Is Data Mining?
• Data mining (knowledge discovery in databases):
– Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns from
data in large databases
• Alternative names and their “inside stories”:
– Data mining: a misnomer?
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
• What is not data mining?
– (Deductive) query processing.
– Expert systems or small ML/statistical programs
Data Mining: A KDD Process
• Data mining: the core of the knowledge discovery process
Steps of a KDD Process
• Learning the application domain:
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation:
– Find useful features, dimensionality/variable reduction, invariant
representation.
• Choosing functions of data mining
– summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
Knowledge Discovery Process
• The whole process of extracting implicit,
previously unknown and potentially useful
knowledge from a large database
– It includes data selection, cleaning,
enrichment, coding, data mining, and
reporting
– Data mining is the key stage of the knowledge
discovery process
• The process of finding the desired information in a
large database
Knowledge Discovery Process
• Example: the database of a magazine publisher
which sells five types of magazines – on cars,
houses, sports, music and comics
– Data mining:
• Find interesting categorical properties
– Questions:
• What is the profile of a reader of a car magazine?
• Is there any correlation between an interest in cars and an
interest in comics?
• The knowledge discovery process consists of six
stages
Data Selection
• Select the information about people who
have subscribed to a magazine
Cleaning
• Pollution: typing errors, clients moving from one
place to another without notifying a change of address,
people giving incorrect information about
themselves
– Pattern recognition algorithms can help detect such errors
Cleaning
• Lack of domain consistency
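A minimal sketch of one possible domain-consistency check, assuming that a birth date outside a plausible range is a placeholder and should be treated as unknown. The records, field names, and date bounds are hypothetical and are not taken from the publisher's database.

```python
from datetime import date

# Hypothetical client records; names and values are illustrative only.
records = [
    {"client_id": 1, "name": "Johnson", "birth_date": date(1961, 4, 13)},
    {"client_id": 2, "name": "Clinton", "birth_date": date(1901, 1, 1)},  # suspicious placeholder
    {"client_id": 3, "name": "King", "birth_date": None},                 # missing value
]

def clean_birth_date(value, earliest=date(1910, 1, 1), latest=date(2000, 1, 1)):
    """Return the date if it lies inside the assumed plausible domain, else None."""
    if value is None or not (earliest <= value <= latest):
        return None
    return value

# Build cleaned copies; out-of-domain dates become None (unknown).
cleaned = [{**r, "birth_date": clean_birth_date(r["birth_date"])} for r in records]
```

Marking an implausible value as unknown, rather than deleting the record, keeps the row available for the later selection step.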
Enrichment
• Need extra information about the clients
consisting of date of birth, income, amount
of credit, and whether or not an individual
owns a car or a house
Enrichment
• The new information needs to be easily
joined to the existing client records
– Extract more knowledge
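The join described above can be sketched as follows, assuming both sources share a client identifier. The key name `client_id` and all sample values are hypothetical.

```python
# Existing client records (hypothetical).
clients = [
    {"client_id": 1, "name": "Johnson"},
    {"client_id": 2, "name": "Clinton"},
]

# Purchased or externally collected extra information, keyed by client id (hypothetical).
extra = {
    1: {"birth_date": "1961-04-13", "income": 18500, "credit": 17800,
        "car_owner": True, "house_owner": False},
}

def enrich(clients, extra):
    """Left join: every client is kept; extra fields are added when available."""
    enriched = []
    for client in clients:
        additions = extra.get(client["client_id"], {})
        enriched.append({**client, **additions})
    return enriched

enriched = enrich(clients, extra)
```

Clients without a match keep only their original fields, so the coding stage can later decide whether they carry enough information to be of value.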
Coding
• We select only those records that have
enough information to be of value (row)
• Project the fields in which we are interested
(column)
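A small sketch of the row selection and column projection above. Which fields count as "enough information" and which fields are of interest are assumptions made for illustration.

```python
# Hypothetical enriched records.
rows = [
    {"client_id": 1, "name": "Johnson", "age": 38, "income": 18500, "magazine": "car"},
    {"client_id": 2, "name": "Clinton", "age": None, "income": None, "magazine": "music"},
    {"client_id": 3, "name": "King", "age": 25, "income": 36000, "magazine": "comics"},
]

REQUIRED = ("age", "income")                       # assumed minimum information per record
KEEP = ("client_id", "age", "income", "magazine")  # assumed fields of interest

def select_and_project(rows, required=REQUIRED, keep=KEEP):
    """Keep rows with all required fields present, then keep only the chosen columns."""
    selected = [r for r in rows if all(r.get(f) is not None for f in required)]
    return [{f: r[f] for f in keep} for r in selected]

result = select_and_project(rows)
```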
Coding
• Code the information which is too detailed
– Address to region
– Birth date to age
– Divide income by 1000
– Divide credit by 1000
– Convert cars yes-no to 1-0
– Convert purchase date to month numbers
starting from 1990
• The way in which we code the information will
determine the type of patterns we find
• Coding has to be performed repeatedly in order to get
the best results
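The coding steps listed above could look like the sketch below. The city-to-region lookup, the field names, and the reference date are hypothetical, and month numbering is assumed to start with January 1990 as month 1.

```python
from datetime import date

# Hypothetical lookup from city to region code.
REGION_OF_CITY = {"Springfield": 1, "Riverton": 2}

def month_number(d, base_year=1990):
    """Month count starting from January of base_year (assumed to be month 1)."""
    return (d.year - base_year) * 12 + d.month

def age_on(birth, today):
    """Age in whole years on the reference date."""
    not_yet_had_birthday = (today.month, today.day) < (birth.month, birth.day)
    return today.year - birth.year - not_yet_had_birthday

def code_record(rec, today):
    """Replace overly detailed fields with coarser codes."""
    return {
        "client_id": rec["client_id"],
        "region": REGION_OF_CITY.get(rec["city"]),          # address to region
        "age": age_on(rec["birth_date"], today),            # birth date to age
        "income_k": rec["income"] / 1000,                   # income in thousands
        "credit_k": rec["credit"] / 1000,                   # credit in thousands
        "car_owner": 1 if rec["car_owner"] == "yes" else 0, # yes-no to 1-0
        "purchase_month": month_number(rec["purchase_date"]),
        "magazine": rec["magazine"],
    }

raw = {
    "client_id": 1, "city": "Springfield", "birth_date": date(1961, 4, 13),
    "income": 18500, "credit": 17800, "car_owner": "no",
    "purchase_date": date(1994, 4, 15), "magazine": "car",
}
coded = code_record(raw, today=date(1999, 1, 1))
```

Changing a coding choice, for example bucketing ages into ranges instead of exact years, changes which patterns a mining algorithm can find, which is why this step is usually repeated.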
Coding
• We are interested in the relationships
between readers of different magazines
– Perform flattening operation
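One way to read the flattening operation is to turn one row per subscription into one row per client, with a 0/1 column for each of the five magazine types. The sketch below assumes that reading; the record layout is hypothetical.

```python
# The five magazine types sold by the publisher in the example.
MAGAZINES = ("car", "house", "sports", "music", "comics")

# Hypothetical subscription records: one row per (client, magazine).
subscriptions = [
    {"client_id": 1, "magazine": "car"},
    {"client_id": 1, "magazine": "comics"},
    {"client_id": 2, "magazine": "music"},
]

def flatten(subscriptions, magazines=MAGAZINES):
    """Return one row per client with a 0/1 flag for every magazine type."""
    flat = {}
    for sub in subscriptions:
        row = flat.setdefault(sub["client_id"], {m: 0 for m in magazines})
        row[sub["magazine"]] = 1
    return flat

flat = flatten(subscriptions)
```

With all magazine flags on a single row, relationships between readers of different magazines become simple comparisons between columns.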
Data mining
• We may find the following rules
– A customer with credit > 13000 and aged between 22
and 31 who has subscribed to a comics magazine at time T will
very likely subscribe to a car magazine five years later
– The number of house magazines sold to customers with
credit between 12000 and 31000 living in region 4 is
increasing
– A customer with credit between 5000 and 10000 who
reads a comics magazine will very likely become a
customer with credit between 12000 and 31000 who
reads a sports and a house magazine after 12 years
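A rule of this kind can be checked against the coded data by measuring how often the conclusion holds among the records that satisfy the condition, a measure commonly called the rule's confidence. The sketch below is a simplification that ignores the time dimension ("five years later"); the sample rows are made up for illustration and are not results from the example database.

```python
def rule_confidence(rows, antecedent, consequent):
    """Fraction of rows matching the antecedent that also match the consequent."""
    matching = [r for r in rows if antecedent(r)]
    if not matching:
        return None  # the rule never applies in this data
    return sum(1 for r in matching if consequent(r)) / len(matching)

# Hypothetical flattened and coded rows.
rows = [
    {"credit": 15000, "age": 25, "comics": 1, "car": 1},
    {"credit": 14000, "age": 30, "comics": 1, "car": 1},
    {"credit": 20000, "age": 28, "comics": 1, "car": 0},
    {"credit": 9000,  "age": 24, "comics": 1, "car": 0},
    {"credit": 16000, "age": 45, "comics": 0, "car": 1},
]

def comics_reader_with_credit(r):
    """Condition of the first rule: credit > 13000, age 22 to 31, reads comics."""
    return r["credit"] > 13000 and 22 <= r["age"] <= 31 and r["comics"] == 1

def car_reader(r):
    """Conclusion of the first rule: subscribes to a car magazine."""
    return r["car"] == 1

confidence = rule_confidence(rows, comics_reader_with_credit, car_reader)
```

A phrase such as "very likely" in a discovered rule corresponds to a high value of this fraction; the threshold that counts as high is a choice left to the analyst.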
Knowledge Discovery Process
Business-Question-Driven Process
Data Mining and Business
Intelligence
[Figure: pyramid of layers; potential to support business decisions increases toward the top. Layers from top to bottom:]
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
[Roles shown alongside the layers, from top to bottom: End User, Business Analyst, Data Analyst, DBA]
Architecture of a Typical Data
Mining System
Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
– Object-oriented and object-relational databases
– Spatial databases
– Time-series data and temporal data
– Text databases and multimedia databases
– Heterogeneous databases
– WWW