PowerPoint Presentation - Yue

Data Mining
Books:
1. Data Mining, 1996
   Pieter Adriaans and Dolf Zantinge
   Addison-Wesley
2. Discovering Data Mining: From Concept to Implementation, 1997
   Cabena et al.
   Prentice Hall
3. Data Mining: Concepts and Techniques, 2000
   Jiawei Han and Micheline Kamber
   Morgan Kaufmann

Proceedings
1. Proceedings of the International Conference on Data Mining (ICDM)
2. Proceedings of the International Conference on Data Engineering (ICDE)
3. Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
4. Proceedings of the International Conference on Very Large Data Bases (VLDB)
5. Proceedings of ACM SIGMOD International Conference on Management of Data
6. Proceedings of the International Conference on Database Systems for Advanced Applications (DASFAA)
7. Proceedings of the International Conference on Database and Expert Systems Applications (DEXA)
8. Proceedings of the International Conference on Data Warehousing and Knowledge Discovery (DaWaK)
9. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)
10. European Conference on Principles of Data Mining and Knowledge Discovery (PKDD)

Journals
1. IEEE Transactions on Knowledge and Data Engineering (TKDE)
2. Journal of Intelligent Information Systems
3. Data Mining and Knowledge Discovery
4. ACM SIGMOD Record
5. The International Journal on Very Large Data Bases
6. Knowledge and Information Systems
7. Data & Knowledge Engineering
8. International Journal of Cooperative Information Systems
Outline
■ Introduction
■ Knowledge Discovery in Databases (KDD)
■ Data Mining and Query Tools
■ Basic Data Mining Techniques
■ Data Mining and Data Warehouse
■ Association Rules
■ A short story
• The Library of Babel (infinite)
  - Books must be somewhere in the library
  - People wander around this library until they die
  - The library contains an infinite amount of data but no information
• Today’s environment
  - Too much data but too little information
■ Challenge
• Find the required information in huge amounts of data
• The amount of data is growing → it is increasingly difficult to find meaningful information
■ Knowledge Discovery in Databases (KDD)
• The whole process of extracting implicit, previously unknown, and potentially useful knowledge, as a production factor, from large data sets
• Includes data selection, cleaning, coding, data mining, and reporting
■ Data Mining
• The key stage of Knowledge Discovery in Databases (KDD)
• The process of finding the desired information in large databases
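
The KDD stages listed above (selection, cleaning, coding, data mining, reporting) can be sketched as a chain of small functions. This is only an illustrative sketch: the records, field names, age bands, and the "most common product per age band" mining step are all hypothetical and stand in for whatever a real project would use.

```python
from collections import Counter

# Hypothetical raw customer records; field names and values are illustrative only.
RAW = [
    {"id": 1, "age": "34", "region": "north", "product": "magazine"},
    {"id": 2, "age": "", "region": "north", "product": "magazine"},  # missing age
    {"id": 3, "age": "51", "region": "south", "product": "book"},
    {"id": 3, "age": "51", "region": "south", "product": "book"},    # duplicate
    {"id": 4, "age": "29", "region": "north", "product": "magazine"},
    {"id": 5, "age": "62", "region": "south", "product": "book"},
]

def select(records, fields):
    """Data selection: keep only the fields relevant to the question."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Cleaning: drop records with missing values and exact duplicates."""
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if "" in r.values() or key in seen:
            continue
        seen.add(key)
        out.append(r)
    return out

def code(records):
    """Coding: replace raw ages with coarse age bands (assumed cut-off: 40)."""
    return [{**r, "age": "under40" if int(r["age"]) < 40 else "40plus"} for r in records]

def mine(records):
    """Data mining (toy version): most common product per age band."""
    counts = {}
    for r in records:
        counts.setdefault(r["age"], Counter())[r["product"]] += 1
    return {band: c.most_common(1)[0][0] for band, c in counts.items()}

def report(patterns):
    """Reporting: turn the discovered patterns into readable lines."""
    return [f"{band}: mostly buys {product}" for band, product in sorted(patterns.items())]

cleaned = clean(select(RAW, ["id", "age", "product"]))
patterns = mine(code(cleaned))
lines = report(patterns)
```

Each stage takes and returns plain lists of dictionaries, so a stage can be swapped out (for example, a different coding scheme) without touching the others.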
■ KDD is not a new technique but rather a multidisciplinary field of research
■ AI, machine learning (1950)
■ It is extremely difficult to create a computer with intelligence close to that of human beings
• Lack of creativity and self-learning
■ 1960: research on learning stops
• Neural networks fail (XOR)
■ 1980 ~: neural networks change architecture, new machine learning algorithms (decision trees, genetic algorithms, etc.), powerful computers, focus on simple and practical problems
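
The XOR failure mentioned above can be illustrated with a small search. This is a sketch, not a proof: it tries every integer weight and bias in a small, arbitrarily chosen range for a single threshold unit and checks whether any setting reproduces a given truth table. The function names and the search range are assumptions made for illustration.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [0, 0, 0, 1]
XOR = [0, 1, 1, 0]

def unit(w1, w2, b, x1, x2):
    """Single threshold unit: outputs 1 when the weighted sum exceeds zero."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

def find_weights(targets, search=range(-3, 4)):
    """Return the first (w1, w2, b) in the grid that reproduces targets, else None."""
    for w1, w2, b in product(search, repeat=3):
        if all(unit(w1, w2, b, x1, x2) == t for (x1, x2), t in zip(INPUTS, targets)):
            return (w1, w2, b)
    return None

and_weights = find_weights(AND)  # some setting exists, e.g. w1=1, w2=1, b=-1
xor_weights = find_weights(XOR)  # None: no setting in the grid works
```

The search finds weights for AND but none for XOR. Widening the range does not help, because the XOR outputs are not linearly separable, so no single threshold unit can produce them with any real-valued weights.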
■ Why learning
• Even simple problems, such as timetable planning, are extremely hard to solve with a computer but easily solved by an experienced human
■ Using expert systems to solve problems
• Even for simple systems, a great many rules exist, and it is difficult to find the right ones
• Relevant experts must be interviewed many times and their answers integrated to obtain the expert knowledge
  - Knowledge acquisition: using learning algorithms to generate rules automatically
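
As one possible illustration of generating rules automatically, the sketch below uses a 1R-style learner: for each attribute it builds "value → majority class" rules and keeps the attribute whose rules make the fewest errors. The slides do not name a specific algorithm, so the choice of 1R, the training examples, and the attribute names are all assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical training examples: (attributes, class label).
EXAMPLES = [
    ({"age": "young", "income": "low"}, "no"),
    ({"age": "young", "income": "high"}, "no"),
    ({"age": "old", "income": "low"}, "yes"),
    ({"age": "old", "income": "high"}, "yes"),
    ({"age": "old", "income": "low"}, "yes"),
    ({"age": "young", "income": "high"}, "yes"),
]

def one_r(examples):
    """Return (attribute, rules, errors) for the single best attribute."""
    best = None
    for attr in examples[0][0]:
        # Count class labels for every value of this attribute.
        by_value = defaultdict(Counter)
        for attrs, label in examples:
            by_value[attrs[attr]][label] += 1
        # One rule per value: predict the majority class.
        rules = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
        errors = sum(label != rules[attrs[attr]] for attrs, label in examples)
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

attribute, rules, errors = one_r(EXAMPLES)
```

On this toy data the learner selects `age` with the rules young → no and old → yes, misclassifying one example. No expert interview is involved; the rules come from the data alone.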
■ Why the interest in data mining
• In the 1980s, organizations began to build databases. By now these contain gigabytes of data with much ‘hidden’ information that cannot easily be traced using SQL
  - SQL is just a query language: it retrieves data under constraints that you already know
• With the growing use of networks, it will become increasingly easy to connect databases
  - Discover more information
• Machine learning techniques have improved
  - Easier to find interesting information
• Client/server environment
  - Electronic commerce
■ Data mining tools and query tools
• Suppose a large database contains millions of records that describe customers’ purchases
  - Who bought which product on what date?
  - What is the average turnover in July?
  - What is an optimal segmentation of clients?
  - What are the most important trends in customer behavior?
• If you know exactly what you are looking for, use a query tool
• If you know only vaguely what you are looking for, use a data mining tool
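
The difference between the two kinds of question can be sketched in a few lines. The purchase records below are hypothetical, "turnover" is taken here to mean the purchase amount, and the tiny one-dimensional k-means is just one possible segmentation method chosen for illustration, not a method named in the slides.

```python
from statistics import mean

# Hypothetical purchase records: (customer, product, ISO date, amount).
PURCHASES = [
    ("ann", "book", "2000-07-03", 20.0),
    ("bob", "game", "2000-07-15", 60.0),
    ("ann", "pen", "2000-08-01", 5.0),
    ("cat", "bike", "2000-07-20", 400.0),
    ("dan", "book", "2000-06-11", 25.0),
    ("cat", "helmet", "2000-08-09", 100.0),
]

# Query tool: the question is exact, so a filter plus an aggregate answers it.
july_amounts = [amount for _, _, date, amount in PURCHASES if date[5:7] == "07"]
average_july = mean(july_amounts)

# Data mining tool: "an optimal segmentation" cannot be written as a filter.
# Here customers are grouped by total spend with a small 1-D k-means.
def total_spend(purchases):
    totals = {}
    for customer, _, _, amount in purchases:
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals

def k_means_1d(values, centers, rounds=10):
    """Assign each value to its nearest center, then move centers to cluster means."""
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

totals = total_spend(PURCHASES)
centers, clusters = k_means_1d(list(totals.values()), centers=[0.0, 100.0])
```

The query part needs the analyst to name the month and the measure in advance. The segmentation part only needs the data and a starting guess; the grouping (three low spenders, one high spender on this toy data) is discovered rather than specified.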
■ Data mining in electronic commerce
• The success of KDD comes primarily from marketing
• Prediction
  - A customer buying baby clothes today may buy computer games in ten years and, fifteen years later, a motorcycle
• Suppose a company keeps data about which products its customers bought
  - Mail to everyone → only 3% to 4% show interest
  - Analyze user behavior and cluster customers according to their interests → can save 50% of mailing costs
■ The problems of data mining
• Lack of long-term vision
  - What do we want to get from the database in the future?
• Not all files are up to date
  - Example: the price of computers
• Struggle between departments
• Poor cooperation between users and the EDP department
• Legal and privacy restrictions
• Data models need to be transformed for different data mining techniques
• Timing problems: integrating data from different sources
• Interpretation problems