From information management to the manage…
Knowledge Mining
(data mining)
Ch. Papatheodorou
Laboratory on Digital Libraries &
Electronic Publishing
Department of Archives and Library Science,
Ionian University
Data Mining
Mining knowledge from very large collections of data
Knowledge: rules, behaviour patterns, and associations between objects (non-obvious, latent, previously unknown, and useful)
Object: consists of a set of features
It is not:
(Deductive) query processing.
Expert systems, small machine learning /statistical
programs
Why Data Mining?
Potential Applications
Database analysis and decision support
Market analysis and management:
target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation
Risk analysis and management:
forecasting, customer retention, improved underwriting, quality control, competitive analysis
Fraud detection and management
Other Applications
Text mining (newsgroups, email, documents) and Web analysis.
Intelligent query answering
Market Analysis and Management (1)
Where are the data sources for analysis?
Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies
Target marketing
Find clusters of model customers who share the same
characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
Conversion of a single to a joint bank account: marriage, etc.
Cross-market analysis
Associations/correlations between product sales
Prediction based on the association information
Market Analysis and Management (2)
Customer profiling
data mining can tell you what types of customers
buy what products (clustering or classification)
Identifying customer requirements
identifying the best products for different customers
use prediction to find what factors will attract new
customers
Provides summary information
various multidimensional summary reports
statistical summary information (data central
tendency and variation)
Corporate Analysis and Risk Management
Finance planning and asset evaluation:
cash flow analysis and prediction
contingent claim analysis to evaluate assets
cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
Resource planning:
summarize and compare the resources and spending
Competition:
monitor competitors and market directions
group customers into classes and a class-based
pricing procedure
set pricing strategy in a highly competitive market
Steps of a KDD Process
Learning the application domain:
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of effort!)
Data reduction and transformation:
find useful features, dimensionality/variable reduction, invariant representation
Choosing functions of data mining:
summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation:
visualization, transformation, removing redundant patterns, etc.
Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
[Figure: KDD process flow. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Data pre-processing
Data preparation is a big issue for data mining
Data preparation includes
Data cleaning and data integration
Data reduction and feature selection
Discretization
Many methods have been developed, but this is still
an active area of research
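As a small illustration of two of the preparation steps listed above, the sketch below drops missing values (cleaning) and then applies equal-width binning (one simple form of discretization). The function name `equal_width_bins`, the sample values, and the choice of three bins are assumptions made for this example only; the slides do not prescribe a particular method.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so a single bin is enough.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical raw attribute with one missing entry.
raw_years = [3, 7, None, 2, 7, 6, 3]

# Cleaning: here we simply discard missing values.
clean_years = [v for v in raw_years if v is not None]

# Discretization: replace each number by the index of its interval.
bins = equal_width_bins(clean_years, 3)
print(bins)
```

Equal-width binning is only one option; equal-frequency binning or entropy-based splits may suit skewed data better.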
Clustering
Partition the data set into clusters; one can then
store only the cluster representations
Clustering can be hierarchical, with the clusters
stored in multi-dimensional index tree structures
There are many choices of clustering
definitions and clustering algorithms
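One of the many possible choices is k-means, sketched below for 2-D points. This is a minimal illustration, not the method the slides commit to: the function name `kmeans`, the seeding with the first k points, the fixed iteration count, and the sample points are all assumptions. The returned centroids play the role of the "cluster representation" that can be stored instead of the raw data.

```python
def kmeans(points, k, iterations=10):
    """Tiny k-means for 2-D points; returns (centroids, labels)."""
    centroids = list(points[:k])  # assumption: seed with the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels


# Hypothetical data: two visually separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

The result depends on the seeding; production code would normally use several random restarts or a library implementation.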
Classification
Classification is an extensively studied problem (mainly
in statistics, machine learning & neural networks)
Classification is probably one of the most widely used
data mining techniques with a lot of extensions
Scalability is still an important issue for database
applications: thus combining classification with
database techniques should be a promising topic
Research directions: classification of non-relational
data, e.g., text, spatial, multimedia, etc.
Classification process
Model construction: describing a set of predetermined
classes
Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
The set of tuples used for model construction: training set
The model is represented as classification rules, decision trees,
or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of each test sample is compared with the
classified result from the model
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
The test set must be independent of the training set; otherwise
over-fitting inflates the accuracy estimate
Classification Process (1): Model Construction
Training Data -> Classification Algorithms -> Classifier (Model)

Training Data:
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (Model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classification Process (2): Use the Model in Prediction
The classifier is applied to the testing data and then to unseen data.

Testing Data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data: (Jeff, Professor, 4)
Tenured?
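The model-usage step can be traced by hand with a few lines of code. The sketch below encodes the rule from the model-construction slide (rank is professor OR years > 6) and applies it to the four testing rows and to the unseen tuple for Jeff. The function name `predict_tenured` and the tuple layout are assumptions made for the example.

```python
def predict_tenured(rank, years):
    """Rule from the slides: tenured if rank is Professor or years > 6."""
    return "yes" if rank == "Professor" or years > 6 else "no"


# Testing data from the slide: (name, rank, years, known label).
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Compare each prediction with the known label to estimate accuracy.
correct = sum(
    predict_tenured(rank, years) == label for _, rank, years, label in test_set
)
accuracy = correct / len(test_set)
print(f"accuracy on the testing data: {accuracy:.0%}")

# Classify the unseen tuple (Jeff, Professor, 4).
jeff = predict_tenured("Professor", 4)
print(f"Jeff tenured? {jeff}")
```

Under this rule three of the four test rows are classified correctly (Merlisa is the miss, since 7 years predicts "yes" while her label is "no"), and Jeff is predicted as tenured.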
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the
aim is to establish the existence of classes or
clusters in the data
Document category modelling
Example:
Filtering spam email.
Task: classify incoming email as spam
or legitimate (2 document categories).
Simple blacklist and keyword-based
methods have failed.
More intelligent, adaptive approaches
are needed (e.g. naive Bayesian
category modeling).
Document category modelling
Step 1 (linguistic pre-processing): Tokenization,
removal of stopwords, stemming/lemmatization.
Step 2 (vector representation): bag-of-words or
n-gram modeling (n=2,3).
Step 3 (feature selection): information gain
evaluation.
Step 4 (machine learning): Bayesian modeling,
using word/n-gram frequency.
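The four steps can be sketched end to end with a very small multinomial naive Bayes filter. This is an illustration under loud assumptions: the stopword list and the four training messages are invented for the example, stemming (part of Step 1) and information-gain feature selection (Step 3) are omitted for brevity, only unigrams are used, and the names `tokenize`, `train`, and `classify` are hypothetical.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "to", "is", "at", "for", "you", "your"}  # toy list


def tokenize(text):
    """Step 1: lowercase, split on whitespace, drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def train(examples):
    """Steps 2 and 4: bag-of-words counts per class and class frequencies."""
    word_counts = {}          # class -> Counter of word frequencies
    class_counts = Counter()  # class -> number of training documents
    for text, label in examples:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def classify(text, model):
    """Return the class with the highest log posterior (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in tokenize(text):
            score += math.log(
                (word_counts[label][w] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training messages, for illustration only.
examples = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("project report is attached", "legitimate"),
]
model = train(examples)
print(classify("claim your free money", model))
print(classify("monday project meeting", model))
```

Because the model is rebuilt from word frequencies, retraining on new mail lets it adapt in a way a fixed blacklist or keyword list cannot.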
What Is Association Mining?
Association rule mining:
Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Applications:
Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Example.
Rule form: "Body => Head [support, confidence]".
buys(x, "diapers") => buys(x, "beers") [0.5%, 60%]
Association Rule: Basic Concepts
Given: (1) database of transactions, (2) each transaction is
a list of items (purchased by a customer in a visit)
Find: all rules that correlate the presence of one set of
items with that of another set of items
E.g., 98% of people who purchase tires and auto accessories also
get automotive services done
Applications
* => Maintenance Agreement (what should the store do to boost
Maintenance Agreement sales?)
Home Electronics => * (what other products should the store stock
up on?)
Rule Measures: Support and Confidence
[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]
Find all the rules X & Y => Z with
minimum confidence and support
support, s: probability that a transaction
contains {X & Y & Z}
confidence, c: conditional probability
that a transaction having {X & Y} also
contains Z
Find the rules with support and confidence equal to or greater than a
given threshold
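Written as counts over a transaction database, the two measures above take the following form. The symbol D for the set of all transactions is an assumption introduced here; X, Y, and Z are the items from the rule X & Y => Z.

```latex
\[
s = \mathrm{support}(X \wedge Y \Rightarrow Z)
  = \frac{\bigl|\{\, t \in D : \{X, Y, Z\} \subseteq t \,\}\bigr|}{|D|}
\]
\[
c = \mathrm{confidence}(X \wedge Y \Rightarrow Z)
  = \frac{\bigl|\{\, t \in D : \{X, Y, Z\} \subseteq t \,\}\bigr|}
         {\bigl|\{\, t \in D : \{X, Y\} \subseteq t \,\}\bigr|}
  = \frac{\mathrm{support}(\{X, Y, Z\})}{\mathrm{support}(\{X, Y\})}
\]
```

Support measures how often the whole rule applies, while confidence measures how reliable the rule is once its body is present.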
Mining Association Rules: An Example
Min. support 50%
Min. confidence 50%

Transaction ID   Items Bought
2000             A,B,C
1000             A,C
4000             A,D
5000             B,E,F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

For rule A => C:
support = support({A,C}) = 50%
confidence = support({A,C}) / support({A}) = 66.7%
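The figures in this example can be checked with a short brute-force computation over the four transactions. The helper names `support` and `confidence` are assumptions, and the enumeration of itemsets up to size two is a naive sketch rather than an efficient mining algorithm such as Apriori.

```python
from itertools import combinations

# The four transactions from the example (ID -> items bought).
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(body, head):
    """Support of body and head together, divided by support of the body."""
    return support(set(body) | set(head)) / support(body)


# Brute force: keep every itemset of size 1 or 2 with support >= 50%.
items = sorted(set().union(*transactions.values()))
frequent = {
    frozenset(c): support(c)
    for size in (1, 2)
    for c in combinations(items, size)
    if support(c) >= 0.5
}

for itemset, s in sorted(frequent.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), f"{s:.0%}")
print(f"support(A => C) = {support({'A', 'C'}):.0%}")
print(f"confidence(A => C) = {confidence({'A'}, {'C'}):.1%}")
```

The same helpers give a confidence of 100% for the reverse rule C => A, which shows that confidence is not symmetric.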