Data Mining
Lecture 2
Course Syllabus
• Course topics:
• Introduction (Week1-Week2)
– What is Data Mining?
– Data Collection and Data Management Fundamentals
– The Essentials of Learning
– The Emerging Needs for Different Data Analysis Perspectives
• Data Management and Data Collection Techniques for Data Mining Applications (Week3-Week4)
– Data Warehouses: Gathering Raw Data from Relational Databases and Transforming It into Information
– Information Extraction and Data Processing Techniques
– Data Marts: The Need for Building Highly Specialized Data Stores for Data Mining Applications
Week 2- Data vs. Knowledge
• Data (operational):
– raw
– atomic
– (mostly!) operational
• Information (analytic):
– processed
– re-organized
– grouped
• Knowledge
– patterns, models, findings ‘behind’ Information
• Wisdom
– perfect orchestration of Knowledge
(Figure: progression Data → Information → Knowledge → Wisdom)
“Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?”
T. S. Eliot
Week 2- Evolution of Database and Information Systems
• 1960s: (focus on efficient data collection)
– Data collection, database creation, IMS and network DBMS
• 1970s: (focus on structured data collection)
– Relational data model, relational DBMS implementation
• 1980s: (focus on information extraction)
– RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s – 2000s: (focus on knowledge extraction and modeling)
– Data Mining, Data Warehousing, Multidimensional Databases
Week 2- Data Collection and Data Management Fundamentals – What is Data Warehouse
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision making process”
William H. Inmon
Subject-oriented: A data warehouse is organized around major subjects,
such as customer, supplier, product, and sales. Rather than concentrating
on the day-to-day operations and transaction processing of an organization,
a data warehouse focuses on the modeling and analysis of data for
decision makers.
Integrated: A data warehouse is usually constructed by integrating multiple
heterogeneous sources, such as relational databases, flat files, and on-line
transaction records. Data cleaning and data integration techniques are applied
to ensure consistency in naming conventions, encoding structures, attribute
measures, and so on.
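A minimal sketch of what "ensuring consistency" can look like in practice. The two sources, their field names, the gender codes, and the unit conversion are all made-up assumptions for illustration; a real warehouse load would use an ETL tool rather than hand-written functions.

```python
# Hypothetical records from two operational sources that describe the same
# kind of entity with different field names, encodings, and units.
crm_rows = [{"cust_id": 1, "sex": "F", "height_cm": 170.0}]
billing_rows = [{"CustomerNo": "2", "gender": "male", "height_in": 70.0}]

# One agreed encoding for the warehouse (assumed convention: "F" / "M").
GENDER_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}


def from_crm(row):
    """Map a CRM row onto the warehouse naming and encoding conventions."""
    return {
        "customer_id": int(row["cust_id"]),
        "gender": GENDER_CODES[row["sex"].strip().lower()],
        "height_cm": round(float(row["height_cm"]), 1),
    }


def from_billing(row):
    """Map a billing row onto the same conventions (inches become cm)."""
    return {
        "customer_id": int(row["CustomerNo"]),
        "gender": GENDER_CODES[row["gender"].strip().lower()],
        "height_cm": round(float(row["height_in"]) * 2.54, 1),
    }


def integrate(crm, billing):
    """Combine both sources into one consistently named and encoded list."""
    rows = [from_crm(r) for r in crm] + [from_billing(r) for r in billing]
    return sorted(rows, key=lambda r: r["customer_id"])


warehouse_rows = integrate(crm_rows, billing_rows)
```

After integration every row carries the same attribute names (`customer_id`, `gender`, `height_cm`), the same gender codes, and the same unit, regardless of which source it came from.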
Time-variant: Data are stored to provide information from a historical perspective
(e.g., the past 5–10 years). Every key structure in the data warehouse contains, either
implicitly or explicitly, an element of time.
Nonvolatile: A data warehouse is always a physically separate store of data transformed
from the application data found in the operational environment. Due to this separation,
a data warehouse does not require transaction processing, recovery, and concurrency
control mechanisms. It usually requires only two operations in data accessing:
initial loading of data and access of data.
Week 2- Data Collection and Data Management Fundamentals – What is Data Warehouse
• data cleaning
• data integration
• data consolidation
Week 2- Data Collection and Data Management Fundamentals – What is OLAP
• object oriented methodology comes in
• entities (cubes)
• attributes (dimensions)
Week 2- Data Collection and Data Management Fundamentals – What is OLAP
• Multi Dimensional Database Modeling
– star schema
– snowflake schema
– fact constellation schema
• fact vs dimension
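To make "fact vs dimension" concrete, here is a small sketch of a star schema held in plain Python structures. All table names, keys, and values are invented for illustration. The fact table stores numeric measures plus foreign keys; each dimension table stores the descriptive attributes those keys point to. A snowflake schema would further normalize the dimension tables, and a fact constellation would have several fact tables sharing dimensions.

```python
# Dimension tables: descriptive attributes, looked up by a key.
dim_time = {
    1: {"quarter": "Q1", "year": 2004},
    2: {"quarter": "Q2", "year": 2004},
}
dim_item = {
    10: {"name": "TV", "type": "home entertainment"},
    11: {"name": "Phone", "type": "phone"},
}
dim_location = {
    100: {"city": "Vancouver", "country": "Canada"},
    101: {"city": "Toronto", "country": "Canada"},
}

# Fact table: one row per (time, item, location) with numeric measures.
fact_sales = [
    {"time_key": 1, "item_key": 10, "location_key": 100, "units_sold": 5, "dollars_sold": 2500.0},
    {"time_key": 1, "item_key": 11, "location_key": 100, "units_sold": 8, "dollars_sold": 1600.0},
    {"time_key": 2, "item_key": 10, "location_key": 101, "units_sold": 3, "dollars_sold": 1500.0},
    {"time_key": 2, "item_key": 11, "location_key": 100, "units_sold": 4, "dollars_sold": 800.0},
]


def total_by(fact_rows, dim, key_name, attribute, measure):
    """Join the fact table to one dimension and sum a measure per attribute value."""
    totals = {}
    for row in fact_rows:
        label = dim[row[key_name]][attribute]
        totals[label] = totals.get(label, 0) + row[measure]
    return totals


units_by_type = total_by(fact_sales, dim_item, "item_key", "type", "units_sold")
dollars_by_city = total_by(fact_sales, dim_location, "location_key", "city", "dollars_sold")
```

Every analytical question here has the same shape: follow a key from the fact table into one dimension, group by one of its attributes, and aggregate a measure.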
Week 2- Data Collection and Data Management Fundamentals – OLAP Operations
• roll-up
• drill-down
• slice
• dice
• pivot (rotation)
taken from the Text Book
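The operations listed above can be sketched on a tiny cube stored as a dictionary from (city, quarter, item) to units sold. The cube contents, the city-to-country hierarchy, and the function names are assumptions made for this illustration, not an OLAP engine's API. Drill-down is the inverse of roll-up: it needs the more detailed cells, which in this sketch simply means going back to the original `cube`.

```python
from collections import defaultdict

DIMS = ("city", "quarter", "item")

# Hypothetical cube: (city, quarter, item) -> units sold.
cube = {
    ("Vancouver", "Q1", "TV"): 10,
    ("Vancouver", "Q2", "TV"): 12,
    ("Vancouver", "Q1", "Phone"): 7,
    ("Toronto", "Q1", "TV"): 5,
    ("Toronto", "Q2", "Phone"): 9,
}

# Assumed concept hierarchy for the city dimension.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada"}


def roll_up(cells, dim, mapping):
    """Climb a concept hierarchy on one dimension and re-aggregate."""
    i = DIMS.index(dim)
    out = defaultdict(int)
    for key, value in cells.items():
        out[key[:i] + (mapping[key[i]],) + key[i + 1:]] += value
    return dict(out)


def slice_(cells, dim, value):
    """Fix one dimension to a single value; that dimension is dropped."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cells.items() if k[i] == value}


def dice(cells, allowed):
    """Keep cells whose values fall in the allowed set on every listed dimension."""
    return {
        k: v
        for k, v in cells.items()
        if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())
    }


def pivot(cells, order):
    """Reorder the axes of the cube; the values are unchanged."""
    idx = [DIMS.index(d) for d in order]
    return {tuple(k[i] for i in idx): v for k, v in cells.items()}


by_country = roll_up(cube, "city", CITY_TO_COUNTRY)
q1_only = slice_(cube, "quarter", "Q1")
vancouver_tv = dice(cube, {"city": {"Vancouver"}, "item": {"TV"}})
rotated = pivot(cube, ("item", "city", "quarter"))
```

Note that slice returns a cube with one fewer dimension, while dice keeps all three dimensions and only narrows their ranges.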
Week 2- Data Collection and Data Management Fundamentals – What is Data Mart?
• data warehouse
– collects information about subjects that span the entire organization; its scope is enterprise-wide.
– which modeling schema? The fact constellation schema is commonly used, since it can model multiple, interrelated subjects.
• data mart
– a departmental subset of the data warehouse that focuses on selected subjects; its scope is department-wide.
– which modeling schema? The star or snowflake schema is commonly used, since both are geared toward modeling single subjects.
Week2-OLAP vs Data Mining
• On-Line Analytical Processing (OLAP)
– provides the ability to pose statistical and summary queries interactively (traditional On-Line Transaction Processing (OLTP) databases may take minutes or even hours to answer these queries)
• Advantages relative to data mining
– Can obtain a wider variety of results
– Generally faster to obtain results
• Disadvantages relative to data mining
– User must “ask the right question”
– Generally used to determine high-level statistical summaries, rather than specific relationships among instances
Week2-Reporting vs Data Mining
Reporting
• Last month's sales for each service type
• Sales per service grouped by customer sex or age bracket
• List of customers who lapsed their policy
Data Mining
• What characteristics do customers who lapse their policy have in common, and how do they differ from customers who renew their policy?
• Which motor insurance policy holders would be potential customers for my House Content Insurance policy?
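A toy contrast between the two, on made-up policy records (the fields and values are assumptions for illustration). The report answers a question whose shape is fixed in advance; the mining step, here reduced to a very crude search over attributes, looks for the characteristic that best separates lapsed customers from renewing ones.

```python
from collections import Counter

# Hypothetical policy records.
customers = [
    {"id": 1, "age_bracket": "18-30", "pays_monthly": True, "lapsed": True},
    {"id": 2, "age_bracket": "18-30", "pays_monthly": True, "lapsed": True},
    {"id": 3, "age_bracket": "31-50", "pays_monthly": False, "lapsed": False},
    {"id": 4, "age_bracket": "31-50", "pays_monthly": True, "lapsed": False},
    {"id": 5, "age_bracket": "51+", "pays_monthly": False, "lapsed": False},
    {"id": 6, "age_bracket": "18-30", "pays_monthly": False, "lapsed": True},
]

# Reporting: list of customers who lapsed their policy.
lapsed_ids = [c["id"] for c in customers if c["lapsed"]]


def lapse_rate_by(rows, attribute):
    """Fraction of lapsed customers for each value of one attribute."""
    totals, lapses = Counter(), Counter()
    for r in rows:
        totals[r[attribute]] += 1
        lapses[r[attribute]] += r["lapsed"]  # True counts as 1
    return {value: lapses[value] / totals[value] for value in totals}


def most_separating_attribute(rows, attributes):
    """Pick the attribute whose values differ most in lapse rate."""
    def spread(attribute):
        rates = lapse_rate_by(rows, attribute).values()
        return max(rates) - min(rates)
    return max(attributes, key=spread)


# Mining (crudely): which characteristic do lapsing customers share?
best = most_separating_attribute(customers, ["age_bracket", "pays_monthly"])
```

Real data mining tools use far more robust methods (decision trees, association rules, and so on), but the difference in kind is the same: the report lists what happened, the mining step searches for a pattern behind it.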
Week2- Data to Knowledge Pyramid
(Figure: pyramid; potential to support business decisions increases toward the top. Layers from top to bottom:)
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles alongside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Week 2- Data Mining Perspective to Knowledge Discovery
(Figure: the knowledge discovery process as a chain of steps)
Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge
adapted from:
U. Fayyad, et al. (1996), “From Data Mining to Knowledge Discovery: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
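The steps named in the figure (Selection, Preprocessing, Data Mining, Interpretation/Evaluation) can be sketched as a chain of functions. The basket data, the `test_row` flag, and the choice of "frequent item pairs" as the mining step are assumptions made for illustration; the figure itself does not prescribe any particular algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw data: shopping baskets, one of them a test entry.
raw = [
    {"customer": "a", "basket": ["bread", "milk"], "test_row": False},
    {"customer": "b", "basket": ["bread", "milk", "eggs"], "test_row": False},
    {"customer": "c", "basket": ["Milk ", "bread"], "test_row": False},
    {"customer": "d", "basket": ["eggs"], "test_row": False},
    {"customer": "x", "basket": ["bread"], "test_row": True},
]


def select(rows):
    """Selection: Data -> Target Data (keep real baskets only)."""
    return [r["basket"] for r in rows if not r["test_row"]]


def preprocess(baskets):
    """Preprocessing: Target Data -> Preprocessed Data (normalize item names)."""
    return [frozenset(item.strip().lower() for item in b) for b in baskets]


def mine(baskets):
    """Data Mining: Preprocessed Data -> Patterns (support of each item pair)."""
    counts = Counter()
    for b in baskets:
        for pair in combinations(sorted(b), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


def evaluate(patterns, min_support):
    """Interpretation/Evaluation: Patterns -> Knowledge (keep the strong ones)."""
    return {p: s for p, s in patterns.items() if s >= min_support}


knowledge = evaluate(mine(preprocess(select(raw))), min_support=0.5)
```

Only the pair (bread, milk) appears in at least half of the selected baskets, so it is the only pattern that survives evaluation.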
Week2- Data Mining Process Flow
(Figure: data mining process flow. Labeled components:)
• Goals for Learning, Plan for Learning, Generate and Test Hypotheses
• Discover Knowledge, Determine Knowledge Relevancy, Evolve Knowledge/Data
• Visualization and Human Computer Interaction
• Discovery Algorithms, Knowledge Base, Background Knowledge, Database(s)
“In order to discover anything, you must be looking for something”
Laws of Serendipity
Week2- Simplified view of Data Mining Process Flow
(Figure: layered architecture, from top to bottom)
• Graphical user interface
• Pattern evaluation
• Data mining engine
• Database or data warehouse server
• Data cleaning & data integration; Filtering
• Databases; Data Warehouse
Alongside the layers: Knowledge-base
Week 2- Extended Perspective on Data Mining Process Flow
(Figure: four-layer architecture; a mining query enters at the top and the mining result is returned there)
• Layer4 User Interface
– User GUI API
• Layer3 OLAP/OLAM: OLAM Engine, OLAP Engine
– Data Cube API
• Layer2 MDDB: MDDB, Meta Data
– Database API; Filtering & Integration; Filtering
• Layer1 Data Repository: Databases, Data Warehouse
– Data cleaning, Data integration
Week 2- Essentials of Learning
Learning?
• can we formalize it?
• is it just a chemical activation?
• is it memorization?
• is it the continuous connecting/disconnecting of nodes in a dynamically changing brain network topology?
Week 2- Essentials of Learning
The Artificial Intelligence View:
• Learning is central to human knowledge and intelligence, and essential for building intelligent machines.
• Years of effort in AI have shown that trying to build intelligent computers by programming all the rules cannot be done; automatic learning is crucial. For example, we humans are not born with the ability to understand language (we learn it), and it makes sense to try to have computers learn language instead of trying to program it all in.
Week 2- Essentials of Learning
The Software Engineering View:
• Machine Learning allows us to program computers by example,
which can be easier than writing code the traditional way.
The Stats View:
• Machine Learning is the marriage of computer science and statistics
• Computational techniques are applied to statistical problems. Machine
Learning has been applied to a vast number of problems in many
contexts, beyond the typical statistics problems. Machine Learning is
often designed with different considerations than statistics (e.g., speed
is often more important than accuracy).
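A small illustration of "programming by example". The spam scenario, the feature (number of links), and the labeled examples are invented for this sketch, and the learner is a one-threshold rule (a decision stump), chosen only because it is the simplest thing that learns from data.

```python
# Traditional way: the programmer writes the rule by hand.
def is_spam_handwritten(num_links):
    return num_links > 3


# By example: the rule (here a single threshold) is chosen from labeled data.
def learn_threshold(examples):
    """Return the threshold t with the fewest errors when predicting x > t."""
    candidates = sorted({x for x, _ in examples})

    def errors(t):
        return sum((x > t) != label for x, label in examples)

    return min(candidates, key=errors)


# Hypothetical labeled examples: (number of links, is it spam?).
examples = [(0, False), (1, False), (2, False), (5, True), (7, True), (9, True)]
threshold = learn_threshold(examples)


def is_spam_learned(num_links):
    return num_links > threshold
```

Nobody typed the cut-off into `is_spam_learned`; it came out of the examples. Changing the examples changes the program's behavior without touching the code.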
Week 2-End
• Please check the web site for Learning
Theory and its Essentials:
http://www.infed.org/biblio/b-learn.htm
• read
– Course Text Book Chapter 3