From information management to manag…

Knowledge Mining
(data mining)
Χ. Παπαθεοδώρου
Laboratory on Digital Libraries and
Electronic Publishing
Department of Archival and Library Science,
Ionian University

Data Mining
Extraction of knowledge from very large collections of data
• Knowledge: rules, behavior patterns, and associations between objects (non-obvious, latent, previously unknown, and useful)
    - Object: consists of a set of features
• It is not:
    - (Deductive) query processing
    - Expert systems, small machine learning/statistical programs

Why Data Mining? Potential Applications
• Database analysis and decision support
    - Market analysis and management: target marketing, customer relation management, market basket analysis, cross selling, market segmentation
    - Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
    - Fraud detection and management
• Other applications
    - Text mining (newsgroups, email, documents) and Web analysis
    - Intelligent query answering

Market Analysis and Management (1)
• Where are the data sources for analysis?
    - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing
    - Find clusters of model customers who share the same characteristics: interests, income level, spending habits, etc.
• Determine customer purchasing patterns over time
    - Conversion of a single to a joint bank account: marriage, etc.
• Cross-market analysis
    - Associations/co-relations between product sales
    - Prediction based on the association information

Market Analysis and Management (2)
• Customer profiling
    - data mining can tell you what types of customers buy what products (clustering or classification)
• Identifying customer requirements
    - identifying the best products for different customers
    - use prediction to find what factors will attract new customers
• Provides summary information
    - various multidimensional summary reports
    - statistical summary information (data central tendency and variation)

Corporate Analysis and Risk Management
• Finance planning and asset evaluation
    - cash flow analysis and prediction
    - contingent claim analysis to evaluate assets
    - cross-sectional and time series analysis (financial ratio, trend analysis, etc.)
• Resource planning
    - summarize and compare the resources and spending
• Competition
    - monitor competitors and market directions
    - group customers into classes and a class-based pricing procedure
    - set pricing strategy in a highly competitive market

Steps of a KDD Process
• Learning the application domain
    - relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of effort!)
• Data reduction and transformation
    - Find useful features, dimensionality/variable reduction, invariant representation
• Choosing functions of data mining
    - summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
    - visualization, transformation, removing redundant patterns, etc.

Data Mining: A KDD Process
• Data mining: the core of the knowledge discovery process.
[Figure: the KDD process, from Databases through Data Cleaning and Data Integration, Data Warehouse, Selection, Task-relevant Data and Data Mining to Pattern Evaluation]

Data pre-processing
• Data preparation is a big issue for data mining
• Data preparation includes
    - Data cleaning and data integration
    - Data reduction and feature selection
    - Discretization
• Many methods have been developed, but this is still an active area of research

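To make the discretization step concrete, here is a minimal sketch of equal-width binning in Python. The slides do not prescribe a particular discretization method, so the function name `equal_width_bins`, the choice of three bins and the sample values are assumptions made only for illustration.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0..n_bins-1 (equal-width binning)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical: everything falls into the first bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical numeric attribute, e.g. years of service of six employees.
years = [3, 7, 2, 7, 6, 3]
bins = equal_width_bins(years, 3)
print(bins)  # [0, 2, 0, 2, 2, 0]
```

Equal-width binning is only one option; equal-frequency binning or supervised (entropy-based) splitting may suit skewed data better.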
[Figure: Data pre-processing]

Clustering
• Partition the data set into clusters; one can then store only the cluster representation
• Clustering can be hierarchical, with the clusters stored in multi-dimensional index tree structures
• There are many choices of clustering definitions and clustering algorithms

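As one possible illustration of partitioning a data set and keeping only a cluster representation, the following sketch implements plain k-means in Python. The slides do not name a specific algorithm, so k-means, the helper names, the sample points and the initial centroids are all assumptions chosen for the example; the centroids returned at the end play the role of the stored cluster representation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans(points, centroids, max_iter=100):
    """Plain k-means starting from the given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # converged: the assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D data with two visually obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

The result depends on the initial centroids and on the number of clusters, which is one reason the choice of clustering definition and algorithm matters.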
[Figure: Cluster Analysis]

Classification
• Classification is an extensively studied problem (mainly in statistics, machine learning and neural networks)
• Classification is probably one of the most widely used data mining techniques, with many extensions
• Scalability is still an important issue for database applications; combining classification with database techniques is therefore a promising topic
• Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.

Classification process
• Model construction: describing a set of predetermined classes
    - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
    - The set of tuples used for model construction: training set
    - The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: for classifying future or unknown objects
    - Estimate accuracy of the model
        - The known label of each test sample is compared with the classified result from the model
        - Accuracy rate is the percentage of test set samples that are correctly classified by the model
        - The test set must be independent of the training set; otherwise the accuracy estimate is overly optimistic and over-fitting goes undetected

Classification Process (1): Model Construction
Training Data -> Classification Algorithms -> Classifier (Model)

Training Data:
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (Model):
IF rank = 'professor' OR years > 6
THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction
Testing Data -> Classifier; Unseen Data -> Classifier -> Tenured?

Testing Data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data: (Jeff, Professor, 4) -> Tenured?

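The two classification steps can be traced in a few lines of Python. This is a sketch only: the rule and both data sets are taken from the slides, while the function names `predict_tenured` and `accuracy` and the tuple layout are assumptions made for the example.

```python
def predict_tenured(rank, years):
    """Rule from the slides: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"


def accuracy(samples):
    """Share of samples whose known label matches the classifier's output."""
    correct = sum(predict_tenured(rank, years) == label
                  for _, rank, years, label in samples)
    return correct / len(samples)


# (name, rank, years, known label), copied from the training and testing tables.
training_set = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

print(f"training accuracy: {accuracy(training_set):.0%}")  # 100%
print(f"test accuracy: {accuracy(test_set):.0%}")          # 75%
print("Jeff tenured?", predict_tenured("Professor", 4))     # yes
```

The rule fits every training sample but misclassifies Merlisa (Associate Prof, 7 years, not tenured) in the test set, which is why accuracy is measured on data that was not used to build the model.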
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
    - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
    - New data is classified based on the training set
• Unsupervised learning (clustering)
    - The class labels of the training data are unknown
    - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Document category modelling
• Example: filtering spam email.
• Task: classify incoming email as spam or legitimate (2 document categories).
• Simple blacklist and keyword-based methods have failed.
• More intelligent, adaptive approaches are needed (e.g. naive Bayesian category modeling).

Document category modelling
• Step 1 (linguistic pre-processing): tokenization, removal of stopwords, stemming/lemmatization.
• Step 2 (vector representation): bag-of-words or n-gram modeling (n=2,3).
• Step 3 (feature selection): information gain evaluation.
• Step 4 (machine learning): Bayesian modeling, using word/n-gram frequency.

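The steps above can be sketched as a very small naive Bayes spam filter in Python. This is an illustration under simplifying assumptions, not the method used by the author: it covers tokenization, stopword removal, a bag-of-words representation and multinomial naive Bayes with Laplace smoothing, and it leaves out stemming, n-grams and information-gain feature selection. The stopword list, the training messages and all function names are invented for the example.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "to", "is", "for", "and", "you", "at"}  # tiny illustrative list


def tokenize(text):
    """Step 1: lowercase, split on whitespace, strip punctuation, drop stopwords."""
    tokens = (t.strip(".,!?:;\"'()") for t in text.lower().split())
    return [t for t in tokens if t and t not in STOPWORDS]


def train(labelled_docs):
    """Steps 2 and 4: bag-of-words frequencies and document counts per category."""
    word_counts = {}        # category -> Counter of word frequencies
    doc_counts = Counter()  # category -> number of training documents
    for text, category in labelled_docs:
        doc_counts[category] += 1
        word_counts.setdefault(category, Counter()).update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log posterior (Laplace smoothing)."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category, counts in word_counts.items():
        score = math.log(doc_counts[category] / total_docs)  # log prior
        denominator = sum(counts.values()) + len(vocabulary)
        for word in tokenize(text):
            if word in vocabulary:  # ignore words never seen in training
                score += math.log((counts[word] + 1) / denominator)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training messages for the two document categories.
training = [
    ("Win a free prize now", "spam"),
    ("Cheap pills, buy now", "spam"),
    ("Free offer: win money", "spam"),
    ("Meeting agenda for Monday", "legitimate"),
    ("Lecture notes and slides attached", "legitimate"),
    ("Project meeting at noon", "legitimate"),
]
model = train(training)
print(classify("Win free money now!", model))             # spam
print(classify("Slides for the project meeting", model))  # legitimate
```

Because the model is just a set of word frequencies, it can be retrained as new labelled messages arrive, which is what makes this kind of filter adaptive compared with a fixed blacklist.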
What Is Association Mining?
• Association rule mining:
    - Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Applications:
    - Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
• Examples:
    - Rule form: "Body => Head [support, confidence]".
    - buys(x, "diapers") => buys(x, "beers") [0.5%, 60%]

Association Rule: Basic Concepts
• Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
    - E.g., 98% of people who purchase tires and auto accessories also get automotive services done
• Applications
    - * => Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
    - Home Electronics => * (What other products should the store stock up on?)

Rule Measures: Support and Confidence
• Find all the rules X & Y => Z with minimum confidence and support
    - support, s: probability that a transaction contains {X & Y & Z}
    - confidence, c: conditional probability that a transaction having {X & Y} also contains Z
• Find the rules with support and confidence equal to or greater than a given threshold
[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]

Mining Association Rules: An Example

Transaction ID   Items Bought
2000             A,B,C
1000             A,C
4000             A,D
5000             B,E,F

Min. support 50%
Min. confidence 50%

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

For rule A => C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.7%

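The numbers in this example can be reproduced with a short Python sketch. The four transactions and the 50% support threshold come from the slide; the function names are assumptions, and the brute-force enumeration of one- and two-item sets is used only because the data is tiny (it is not an implementation of Apriori or any other mining algorithm).

```python
from itertools import combinations

# Transactions from the example: transaction ID -> items bought.
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}


def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(body, head):
    """Conditional probability of head given body: support(body + head) / support(body)."""
    return support(set(body) | set(head)) / support(body)


MIN_SUPPORT = 0.5
all_items = sorted(set().union(*transactions.values()))

# Enumerate every 1-item and 2-item candidate and keep the frequent ones.
frequent = {}
for size in (1, 2):
    for candidate in combinations(all_items, size):
        s = support(candidate)
        if s >= MIN_SUPPORT:
            frequent[candidate] = s

print(frequent)
# {('A',): 0.75, ('B',): 0.5, ('C',): 0.5, ('A', 'C'): 0.5}
print(f"A => C: support {support({'A', 'C'}):.1%}, "
      f"confidence {confidence({'A'}, {'C'}):.1%}")
# A => C: support 50.0%, confidence 66.7%
```

The same functions give a confidence of 100% for the reverse rule C => A, since every transaction containing C also contains A.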