A Data Structure for Data Mining - CACS


Data Mining
Ryan Benton
Center for Advanced Computer Studies
University of Louisiana at Lafayette
Lafayette, La., USA
January 13, 2011
Important Note
This presentation was obtained from Dr. Vijay Raghavan on January 12, 2011.
Dr. Raghavan is a member of the Center for Advanced Computer Studies,
University of Louisiana at Lafayette, Lafayette, La., USA.
The slides are being used with his permission.
Some modifications have been made.
CONTENTS
The Motivation
Knowledge Discovery in Databases (KDD)
Data Mining
 Related Fields
 Research Issues
 Tasks
Association Mining Problem
Classification Mining Problem
Conclusions
THE MOTIVATION
“We are drowning in information,
but starving for knowledge.”
John Naisbitt
KNOWLEDGE DISCOVERY IN DATABASES: Definition
A hot buzzword for a class of database
applications that look for patterns or
relationships in data that are:
Hidden,
 Previously unknown and
 Potentially useful

KDD: Definition
Extract (discover):
interesting and
 previously unknown

knowledge from very large real world databases.
KDD: Definition
More formally:
Valid,
 Novel, Potentially useful or Desired
 Ultimately understandable.

KDD- PROCESS
Note: the order of these steps, and what is included in
each step, may vary depending on who you talk to.
And maybe
even this
KDD vs. DATA MINING
Synonyms (?)
KDD
More than just finding patterns
 Mining, dredging and fishing

KDD- Related Fields
Data Warehousing
On-Line Analytical Processing (OLAP)
Database Marketing
Exploratory Data Analysis (EDA)
Data Warehousing
A data warehouse is a subject-oriented,
integrated, time-variant and nonvolatile
collection of data in support of
management’s decision making process.
OLAP and Data Warehousing
[Figure: OLAP and data warehousing architecture. Labeled components: WEB, External Source, File Server, GeoReference Data, Data Warehouse, Meta Data, Relational Data Marts, MDD Data Marts, Query/Reporting Tools, OLAP Tools, GIS Tools, User.]
Data Mining: Related Areas
[Figure: diagram relating Data Mining to the following areas]
Database Management Systems
Visualization
Machine Learning
Artificial Intelligence
Statistics
Expert Systems
Pattern Recognition
Other Areas:
1. Neural Networks
2. Evolutionary Methods
3. Information Retrieval
4. Etc.
Callouts in the figure:
The ‘Parent’ of Data Mining
Sometimes considered a subarea of AI (appears twice)
Database versus Data Mining
Query


DB: Well Defined & SQL
DM: Poorly Defined & Various Languages
Data


DB: Operational (and generally relational)
DM: Not Operational.
Output


DB: Precise, subset of the database.
DM: Varies.
Examples
Database


Find all people with last name Raghavan.
Identify all customers who have spent more than
10,000 dollars.
Data Mining




Find those who have poor credit
Find all those who like the same cars
Find all items that are often (frequently) purchased with
milk.
Predict the value of the housing market.
Statistics
Simple descriptive models
Traditionally:

A model created from a sample of the data is generalized to the
entire dataset.
Exploratory Data Analysis:
 Data can actually drive the creation of the
model
 Opposite of traditional statistical view.
Presupposes a distribution
Machine Learning
Machine Learning: area of AI that examines how to write
programs that can learn.
Types of models



Classification
Prediction (Regression)
Clustering
Types of Learning:


Supervised
Unsupervised
Traditionally


Small Datasets
‘Complete’ Data



This has changed since about the mid-1990s
Examples of incomplete data exist from earlier
The field isn’t static (ideas flow between fields)
Data Mining: Research Issues
Ultra large data
Noisy data
Null values
Incomplete data

Note, somewhat related to NULL values
Redundant data
Dynamic aspects of data
Data Mining: Tasks
Association
Classification
Clustering
Estimation
Data Visualization
Deviation Analysis
etc.
Note: generally, the implication is that you are seeking
relationships and/or can manipulate the data
interactively.
Data Mining Models and Tasks
ASSOCIATION MINING
PROBLEM
Deriving association rules from data:
Given a set of items $I = \{i_1, i_2, \ldots, i_n\}$ and a
set of transactions $S = \{s_1, s_2, \ldots, s_m\}$, each
transaction $s_i \in S$ such that $s_i \subseteq I$,
an association rule is defined as $X \Rightarrow Y$,
where $X \subseteq I$, $Y \subseteq I$ and $X \cap Y = \emptyset$;
it describes the existence of a relationship
between the two itemsets $X$ and $Y$.
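As a minimal sketch of the definition above, items and transactions can be represented with Python sets. The item names, the sample transactions, and the helper `is_valid_rule` are hypothetical and are introduced here only for illustration.

```python
# Hypothetical item universe I and transaction list S (illustrative data only).
I = {"milk", "bread", "butter", "eggs"}
S = [
    {"milk", "bread"},
    {"milk", "butter", "eggs"},
    {"bread", "butter"},
]


def is_valid_rule(X, Y, items):
    """Check the conditions on a rule X => Y: both itemsets are
    subsets of the item set, and they share no items."""
    X, Y = set(X), set(Y)
    return X <= items and Y <= items and X.isdisjoint(Y)


# Every transaction must itself be a subset of I.
assert all(s <= I for s in S)

print(is_valid_rule({"milk"}, {"bread"}, I))         # True: disjoint itemsets
print(is_valid_rule({"milk"}, {"milk", "eggs"}, I))  # False: itemsets overlap
```

This only checks that a candidate rule is well formed; it says nothing yet about how strong the relationship is.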
Measurements
Measures to define the strength
of the relationship between two
itemsets X and Y
Measure of Confidence
$\mathrm{Confidence}(X \Rightarrow Y) = \dfrac{P(X, Y)}{P(X)}$
The percentage of transactions that contain Y
among those transactions containing X.
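The confidence measure can be estimated directly from transaction counts. The following is a small sketch, not a definitive implementation: the function name `confidence` and the sample baskets are assumptions made for illustration, and the probabilities are estimated as simple relative frequencies.

```python
def confidence(X, Y, transactions):
    """Estimate Confidence(X => Y) = P(X, Y) / P(X) from counts."""
    X, Y = set(X), set(Y)
    # Transactions that contain every item of X.
    with_x = [t for t in transactions if X <= t]
    if not with_x:
        raise ValueError("no transaction contains X; confidence is undefined")
    # Of those, the transactions that also contain every item of Y.
    with_x_and_y = [t for t in with_x if Y <= t]
    return len(with_x_and_y) / len(with_x)


# Hypothetical market baskets (illustrative data only).
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter"},
]

# Three baskets contain milk, and two of those also contain bread.
print(confidence({"milk"}, {"bread"}, baskets))
```

Dividing the two counts is equivalent to dividing the two probabilities, because the total number of transactions cancels out.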
Applications of Associations
I = Products, S = Baskets
I = Cited Articles, S = Technical Articles
I = Incoming Links, S = Web pages
I = Keywords, S = Documents
I = Term papers, S = Sentences
Classification Mining Problem
Pattern Recognition and Machine Learning
communities
Generally aimed at models of the data.
Often includes both


Categorization
Prediction (Regression)
Supervised.
Clustering Mining Problem
Assumption: data naturally falls into
groups.

Overlapping or Non-Overlapping
What are the groups?

And what data falls within each group.
Unsupervised.
Measures
Error

Categorization


Number of Bad Assignments / Total Assignments
Prediction

Mean Squared Error
In truth, a number of measures have been
proposed.
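The two error measures named above can be sketched in a few lines. The function names and the sample labels and values below are hypothetical, chosen only to illustrate the arithmetic.

```python
def error_rate(predicted, actual):
    """Categorization error: bad assignments divided by total assignments."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    bad = sum(1 for p, a in zip(predicted, actual) if p != a)
    return bad / len(actual)


def mean_squared_error(predicted, actual):
    """Prediction error: average of the squared differences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical categorization results: one of four assignments is wrong.
labels_true = ["good", "poor", "good", "poor"]
labels_pred = ["good", "good", "good", "poor"]
print(error_rate(labels_pred, labels_true))  # 0.25

# Hypothetical numeric predictions: squared errors are 0, 0 and 4.
values_true = [1.0, 2.0, 3.0]
values_pred = [1.0, 2.0, 5.0]
print(mean_squared_error(values_pred, values_true))
```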
Note about ‘Data’
Various types:
Text
 Strings
 Numeric
 Sound
 Image
 Relations
 Etc.

CONCLUSIONS
KDD has interesting problems
It is an inter-disciplinary field
No matter your expertise, you can find an
interesting niche
Many high-demand applications


Customer Relationship Management
Suggestive Sales


Search


Amazon
eBay
Stock Prediction

Well, would be nice.