Data Mining: Introduction

Introduction to Data Mining
(CS 422)
Fall 2010
© Goharian & Grossman 2003
Course Outline



Introduction
Data Pre-processing
Data Mining Algorithms
» Recommenders
» Classifiers
– Naïve Bayes
– Decision Trees
– Support Vector Machines
» Clustering
– K-Means
– LDA
» Other
– Association Rules
Introduction

Mining useful facts from a large amount of data.
Examples:
– Product A sells a lot when we sell
– People who take a loan are more likely to default if they have
the following characteristics
– A person committing credit card fraud is likely to do X
– A person who likes these movies will probably like movies
about X
Different Data Sources
Relational Database
Data Warehouse
Flat Files
Web
Object-Oriented Database
Multimedia

Data Warehouse




Many enterprises consolidate data from their different
homogeneous and heterogeneous data repositories into one
common data source called a Data Warehouse (DW).
A Data Warehouse contains current and historical data to be
used for planning and forecasting in Decision Support
Systems (DSS).
Traditional databases are operational databases that hold
day-to-day data.
Star schema, snowflake schema, and galaxy schema are
modeling schemes used in a DW.
Data Warehouse (Cont’d)





To improve performance in a DW, techniques such as
summarization and denormalization are used.
A DW is usually, but not always, accessed through On-Line
Analytical Processing (OLAP).
SQL gives a precise answer to a user query.
OLAP gives a multi-dimensional view of the data and is an
extension of some of the aggregate functions in SQL.
OLAP operations are Slice, Dice, Roll-up, and Drill-down.
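
The four OLAP operations named above can be sketched on a tiny in-memory fact table. This is only an illustration in plain Python, not how an OLAP server is implemented; the table contents, the dimension names, and the helper names (`aggregate`, `slice_`, `dice`) are all made up for the example.

```python
from collections import defaultdict

# Hypothetical fact table: one row per sale, with four dimensions and one measure.
SALES = [
    {"year": 2009, "quarter": "Q1", "region": "East", "product": "A", "amount": 100},
    {"year": 2009, "quarter": "Q2", "region": "East", "product": "B", "amount": 150},
    {"year": 2009, "quarter": "Q1", "region": "West", "product": "A", "amount": 80},
    {"year": 2010, "quarter": "Q1", "region": "East", "product": "A", "amount": 120},
    {"year": 2010, "quarter": "Q2", "region": "West", "product": "B", "amount": 200},
]

def aggregate(rows, dims):
    """Sum the 'amount' measure, grouped by the given dimension names."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["amount"]
    return dict(totals)

def slice_(rows, dim, value):
    """Slice: fix a single dimension to one value."""
    return [r for r in rows if r[dim] == value]

def dice(rows, **allowed):
    """Dice: keep rows whose dimensions fall inside the allowed value sets."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]

# Roll-up: aggregate to a coarser level (year only).
by_year = aggregate(SALES, ["year"])
# Drill-down: return to a finer level (year and quarter).
by_year_quarter = aggregate(SALES, ["year", "quarter"])
# Slice on region, then dice on region, product, and year together.
east_only = slice_(SALES, "region", "East")
east_a_2009 = dice(SALES, region={"East"}, product={"A"}, year={2009})

print(by_year)
print(by_year_quarter)
```

Each operation corresponds to a SQL query with a different WHERE or GROUP BY clause, which is the sense in which OLAP extends SQL aggregate functions.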
OLAP vs. Data Mining

OLAP is a data summarization/aggregation tool
that facilitates data analysis for the user by
providing a multi-dimensional view of the data.

A data mining tool provides automated discovery
of knowledge and gives more in-depth knowledge
about the data, including hidden information.

OLAM (On-Line Analytical Mining, also called OLAP
mining) is the integration of OLAP with data mining.
OLAM Architecture
[Figure: OLAM architecture. Layers from top to bottom: graphical user interface; OLAP and Data Mining/OLAM; data warehouse; several source databases (DB).]
DM vs Statistics and ML

Data Mining (DM), Statistics, and Machine Learning (ML)
have a lot in common:
» Finding patterns
» Building models to make predictions

Statistics relies largely on probabilistically sound
mathematical models.
ML is extremely useful, but tends not to be concerned with
large data sets. If a process runs out of memory, it's not
treated as an ML problem.
DM wants accuracy but is designed for very large data sets.
Scalability

Statistical approaches deal with small data sets.
» They assume that all data must be cleaned and reduced.

Machine learning approaches deal with small data sets.
» The goal is to make a machine learn.
» Applications such as chess playing rather than
applications that deal with market analysis.

Real-life data to be mined can be huge, so scalable
algorithms are needed.
DM Algorithms

Supervised (Classification / Categorization)
» Bayesian
» Neural Network
» Decision Tree
» Others: Genetic Algorithms, Fuzzy Set, K-Nearest Neighbor

Unsupervised
» Association Rules
» Clustering

Collaborative Filtering
Supervised vs. Unsupervised

Supervised algorithms
» Learning by example:
– Use training data which has correct answers (class label
attribute)
– Create a model by running the algorithm on the training data
– Identify a class label for the incoming new data

Unsupervised algorithms
» Do not use training data.
» Classes may not be known in advance.
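
The three supervised steps (labeled training data, model building, labeling new data) can be sketched as follows. The classifier here is a nearest-centroid rule, chosen only because it is short; it is not one of the algorithms listed in these slides. The loan records and feature meanings are invented for illustration.

```python
import math

# Hypothetical training data: ((income, debt) in $1000s, class label).
TRAINING = [
    ((60.0, 5.0), "repay"),
    ((80.0, 10.0), "repay"),
    ((30.0, 25.0), "default"),
    ((20.0, 15.0), "default"),
]

def train(examples):
    """Build the model: one centroid (mean feature vector) per class label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc)
            for label, acc in sums.items()}

def classify(model, features):
    """Assign the label whose centroid is closest to the new record."""
    return min(model, key=lambda label: math.dist(model[label], features))

model = train(TRAINING)               # create the model from the training data
print(classify(model, (70.0, 8.0)))   # label for incoming new data
print(classify(model, (25.0, 20.0)))
```

An unsupervised algorithm would receive the same feature vectors without the label column and would have to discover the groups itself.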
Supervised Algorithms
[Figure: (1) Training Data is fed to (2) the Classifier Algorithm, which builds the Model; (3) Test Data is then run through the Model to produce (4) the Classification.]
Collaborative Filtering

Use users' recommendations, such as
» Ratings
» Clicks
» Purchases
to provide recommendations to other users.
All users “collaborate” in order to “filter” the
search for the best item.
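
One common way to realize this idea is user-based collaborative filtering: find users whose ratings resemble the target user's, then rank unseen items by those users' ratings. The sketch below assumes cosine similarity over co-rated items and a similarity-weighted average; the slides do not prescribe either choice, and the users, movies, and ratings are invented.

```python
import math

# Hypothetical 1-5 ratings that users gave to movies.
RATINGS = {
    "ann": {"Alien": 5, "Blade Runner": 4, "Casablanca": 1},
    "bob": {"Alien": 4, "Blade Runner": 5, "Dune": 5, "Emma": 2},
    "cat": {"Alien": 1, "Casablanca": 5, "Dune": 1, "Emma": 5},
}

def similarity(a, b):
    """Cosine similarity over the items both users rated (0.0 if none)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def recommend(user, ratings):
    """Rank items the user has not rated by similarity-weighted average rating."""
    weighted, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = similarity(ratings[user], their)
        for item, rating in their.items():
            if item not in ratings[user]:
                weighted[item] = weighted.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    scores = {i: weighted[i] / weights[i] for i in weighted if weights[i] > 0}
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("ann", RATINGS))
```

Here "ann" agrees closely with "bob" and only weakly with "cat", so bob's favorite unseen movie is ranked ahead of cat's.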

Introduction to Evaluation
(Evaluating Supervised Algorithms)

Goal: We want to measure the effectiveness of a
classification (supervised) algorithm.
A naive approach:
» Take the training data set
» Build the model
» Test using the same training data set

This usually leads to a very optimistic result that
says little about the real accuracy of the
model.
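
A small sketch of why testing on the training data is optimistic. It assumes a 1-nearest-neighbor classifier, picked because it memorizes its training records and so makes the effect obvious; the records and labels are invented.

```python
import math

def nearest_label(train, point):
    """1-nearest-neighbor: return the label of the closest training record."""
    return min(train, key=lambda rec: math.dist(rec[0], point))[1]

def classification_rate(train, test):
    """Fraction of test records whose predicted label matches the true label."""
    correct = sum(1 for x, y in test if nearest_label(train, x) == y)
    return correct / len(test)

# Hypothetical labeled records: (features, class label).
TRAIN = [((1.0, 1.0), "a"), ((2.0, 1.5), "a"), ((4.0, 4.0), "b"), ((2.5, 2.0), "b")]
UNSEEN = [((2.1, 1.6), "a"), ((3.8, 4.2), "b"), ((1.2, 0.9), "a"), ((2.4, 1.9), "a")]

rate_on_training = classification_rate(TRAIN, TRAIN)   # every record finds itself
rate_on_unseen = classification_rate(TRAIN, UNSEEN)    # records the model never saw

print(rate_on_training, rate_on_unseen)
```

On this toy data the rate measured on the training set is 1.0, while the rate on the unseen records is 0.75.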
Cross-Validation (Cont’d)
Another approach is to take the training data set,
cut it in half, and use one half for training and the other half for
testing.
This can lead to errors in estimating the real
classification rate because the half we hold out for
testing may be very different from the half we used
for training.
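
The sensitivity of a half-and-half split can be sketched by repeating it with different random halves. This assumes the data is shuffled before it is cut (the slide does not say how the halves are chosen) and again uses a 1-nearest-neighbor classifier on an invented one-feature data set.

```python
import math
import random

def nearest_label(train, point):
    """1-nearest-neighbor: return the label of the closest training record."""
    return min(train, key=lambda rec: math.dist(rec[0], point))[1]

def holdout_rate(records, seed):
    """Shuffle, train on one half, test on the other; return the classification rate."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seed is only there for repeatability
    half = len(shuffled) // 2
    train, test = shuffled[:half], shuffled[half:]
    correct = sum(1 for x, y in test if nearest_label(train, x) == y)
    return correct / len(test)

# Hypothetical data set: a single feature, class "low" below 10 and "high" otherwise.
RECORDS = [((float(v),), "low" if v < 10 else "high") for v in range(20)]

# The same data and the same algorithm, but five different halves.
rates = [holdout_rate(RECORDS, seed) for seed in range(5)]
print(rates)
```

Any spread among the printed rates comes only from which records landed in the test half, which is the estimation error described above.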

10-fold Cross Validation

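Split the data set into ten equal parts. In the first run,
use the first 90 percent of the data for training and test
on the final 10 percent. In the next run, hold out a
different 10 percent for testing and train on the remaining
90 percent, and so on, until every part has been used for
testing once.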
[Figure: Run 1, Run 2, …, Run 10. In each run a different tenth of the data is marked Test and the rest is marked Train.]
10-fold Cross Validation
Each run results in a particular classification
rate.
Ex: If we classified 50/100 of the test records
correctly, our classification rate for that run is 50%.
Choose the model that generated the highest
classification rate. The final classification rate for
the model is the average of the ten classification
rates.
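
A minimal sketch of 10-fold cross-validation that produces the ten per-run classification rates and their average. The 1-nearest-neighbor classifier and the 100-record data set are assumptions made for illustration, and the folds are taken in order rather than starting from the final ten percent, which does not change the result.

```python
import math

def nearest_label(train, point):
    """1-nearest-neighbor: return the label of the closest training record."""
    return min(train, key=lambda rec: math.dist(rec[0], point))[1]

def cross_validate(records, folds=10):
    """Return one classification rate per run; each fold is the test set once."""
    size = len(records) // folds   # leftover records, if any, stay in training
    rates = []
    for run in range(folds):
        test = records[run * size:(run + 1) * size]
        train = records[:run * size] + records[(run + 1) * size:]
        correct = sum(1 for x, y in test if nearest_label(train, x) == y)
        rates.append(correct / len(test))
    return rates

# Hypothetical data set: one feature, class "low" below 50 and "high" otherwise.
RECORDS = [((float(v),), "low" if v < 50 else "high") for v in range(100)]

rates = cross_validate(RECORDS)
final_rate = sum(rates) / len(rates)   # average of the ten classification rates
print(rates)
print(final_rate)
```

On this toy data the two runs whose test fold touches the class boundary score 0.5 and the other eight score 1.0, so the averaged rate is 0.9. Real data sets are usually shuffled first so that each fold contains a mix of classes.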

DM Open Source

Weka
» Pros
– Weka has numerous algorithms, and plenty of data mining classes use it.
– It has a great tool for running experiments: try lots of algorithms and see which
one works.
» Cons
– Not typically used for large production systems. It lacks a large-scale
distributed implementation.

Mahout
» Runs on top of Hadoop, which is where it got its name (a mahout is an
elephant driver, and Hadoop's logo is an elephant). It's a
highly scalable system. Hadoop is meant for widespread
distributed processing. More on it later.
» It's real: it is used by real production systems.
Open Source



A reasonable goal of this course is to make it so you can
meaningfully contribute to open source.
This will ensure you understand the concepts being taught
and it also will help you as you progress.
If you do that, the conversation about you will change
from:
» “Oh, I did a cool data mining project in school and got an A”

to:
» “I designed and implemented a new, highly scalable algorithm
that is now downloaded and used by 200 major systems. People
have gone through it and found a few bugs, but most were
relatively small, and I work to improve it from time to time”.
Summary
Data Mining algorithms are used to discover
information that we did not know.
There are various data sources, types, formats, and
applications for Data Mining.
Scalability differentiates DM from statistics and
Machine Learning.
Mahout is a robust, open source framework for
implementing data mining algorithms.
