
Data Mining
“Application of Information and Communication Technology to
Production and Dissemination of Official statistics”
10 May – 11 July 2006
M Q Hasan
Lecturer/ Statistician
UN Statistical Institute for Asia and the Pacific
Chiba, Japan
Email : [email protected]
1
Objectives
• Understanding data mining
• Basis for future planning and development
2
Contents

What is data mining

Evolution of data mining

Technology and techniques involved

Software packages

References

Exercises
3
What is “data mining” :

“The nontrivial extraction of implicit, previously unknown, and potentially useful information from data”

“The science of extracting useful information from large data sets or databases”.
Wikipedia, the free encyclopaedia
4
What is “data mining” :

Also termed “data discovery”

Process of analyzing data to identify patterns or relationships

Extraction of patterns or information from stored information
5
What is “data mining” ….
• Prediction of future events and behaviors, estimating values, etc.
– Accuracy.
• Confidence level.
6
What is “data mining” ….
• Process of data mining
– the initial exploration of available data
– model building or pattern identification with validation
– the application of the model to new data in order to generate predictions
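
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are invented toy data, the "model" is an ordinary least-squares line, and the function names (explore, fit_line, mean_abs_error, predict) are hypothetical, not part of any package named in these slides.

```python
# Hypothetical toy data: (advertising spend, unit sales) pairs, invented for illustration.
records = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8),
           (5.0, 10.1), (6.0, 12.2), (7.0, 13.8), (8.0, 16.1)]

def explore(data):
    """Step 1: initial exploration, here just simple summary statistics."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    return {"n": len(data), "mean_x": sum(xs) / len(xs), "mean_y": sum(ys) / len(ys)}

def fit_line(data):
    """Step 2a: model building, ordinary least squares for y = intercept + slope * x."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mean_abs_error(model, data):
    """Step 2b: validation, average absolute error on records not used for fitting."""
    intercept, slope = model
    return sum(abs(y - (intercept + slope * x)) for x, y in data) / len(data)

def predict(model, x):
    """Step 3: application of the fitted model to new data."""
    intercept, slope = model
    return intercept + slope * x

summary = explore(records)
train, holdout = records[::2], records[1::2]  # alternate records: fit on one half, validate on the other
model = fit_line(train)
error = mean_abs_error(model, holdout)
forecast = predict(model, 9.0)                # a new, previously unseen input
print(summary, model, error, forecast)
```

Holding back part of the known data for validation is what lets the error figure say something about new data rather than about the records the model has already seen.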
7
What is “data mining” ….
• Requirements
–Data
–Concepts
–Instances
–Parameters
8
What is NOT data mining :
• Data warehousing
• SQL / ad hoc queries / reporting
• Software agents
• Online analytical processing (OLAP)
• Data visualization

9
Why DM now ? …

Development and refinement of three technologies over the years.
– Massive data collection and storage facilities.
  • Databases of terabyte order.
  • Includes publicly available data
– Powerful multiprocessor computers.
  • Parallel processing technology, distributed technology, speed.
– Data mining algorithms.
  • Statistical, data modeling, etc.
10
Evolutionary step: Data Collection (1960s)
– Business question: “What was my total revenue in the last five years?”
– Enabling technologies: computers, tapes, disks
– Characteristics: retrospective, static data delivery

Evolutionary step: Data Access (1980s)
– Business question: “What were unit sales in New England last March?”
– Enabling technologies: RDBMS, SQL, ODBC
– Characteristics: retrospective, dynamic data delivery at record level

Evolutionary step: Data Warehousing & Decision Support (1990s)
– Business question: “What were unit sales in New England last March? Drill down to Boston.”
– Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
– Characteristics: retrospective, dynamic data delivery at multiple levels

Evolutionary step: Data Mining (Emerged)
– Business question: “What’s likely to happen to Boston unit sales next month? Why?”
– Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
– Characteristics: prospective, proactive information delivery
11
Tools

Case based reasoning.
• Case-based reasoning tools provide a means to find records
similar to a specified record or records. These tools let the
user specify the "similarity" of retrieved records.

Data visualization.
• Data visualization tools let the user easily and quickly view
graphical displays of information from different
perspectives.
12
1+1=1
Is it possible ?
13
• Let a = b
• Then a² = ab
• Then 2a² = a² + ab
• Then 2a² – 2ab = a² – ab
• Then 2(a² – ab) = 1(a² – ab)
• Then (1 + 1)(a² – ab) = 1(a² – ab)
• Canceling (a² – ab) from both sides:
1 + 1 = 1
Where is the FALLACY?
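
One way to look for the flaw is to plug a number into the steps. The short Python check below is only a sketch; the value 3 is an arbitrary assumption and any number gives the same picture, because the derivation starts from a = b.

```python
a = b = 3  # arbitrary illustrative value; the derivation only requires a == b

factor = a**2 - a*b        # the quantity that the last step "cancels"
left = (1 + 1) * factor    # left-hand side before cancelling
right = 1 * factor         # right-hand side before cancelling

print(factor)              # the cancelled quantity
print(left == right)       # the equation holds before cancelling

try:
    left / factor          # cancelling means dividing both sides by factor
except ZeroDivisionError:
    cancelled = False      # the division is not defined
else:
    cancelled = True
print(cancelled)
```

Every line up to the cancellation is a true statement of the form 0 = 0. The cancellation itself divides by a² – ab, which is zero whenever a = b, so that step is not valid.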
14
In data mining, think from all sides.
Avoid the FALLACIES.
15
Thinking Hat techniques

White hat:
With this thinking hat you focus on the data available. Look
at the information you have, and see what you can learn
from it. Look for gaps in your knowledge, and either try
to fill them or take account of them.
This is where you analyse past trends, and try to
extrapolate from historical data.
16
Thinking Hat techniques

Red hat:
'Wearing' the red hat, you look at problems using
intuition, gut reaction, and emotion. Also try to
think how other people will react emotionally. Try
to understand the responses of people who do
not fully know your reasoning.
17
Thinking Hat techniques
Black hat: using black hat thinking.
• Look at all the bad points of the decision.
• Look at it cautiously and defensively.
• Try to see why it might not work.
• Helps to make plans 'tougher' and more resilient.
• Helps you to spot fatal flaws and risks.
• Helps because successful people sometimes get so used to thinking positively that they cannot see problems in advance.
18
Thinking Hat techniques
Yellow hat: using yellow hat thinking.

Helps “think positively.”

Helps you to see all the benefits of the
decision and the value in it.

Helps you to keep going when everything
looks gloomy and difficult.
19
Thinking Hat techniques
Green hat: the green hat stands for
creativity.

This is time to develop creative solutions
to a problem.

Little criticism of ideas.

A whole range of creativity tools can help.
20
Thinking Hat techniques
Blue hat: the blue hat stands for process control.
This is the hat worn by people chairing meetings.
When running into difficulties because ideas are
running dry, they may direct activity into green
hat thinking. When contingency plans are
needed, they will ask for black hat thinking, etc.
21
Some DM terms :

Instances

Attributes

Objects

Class

Relationships

Rule induction
22
Machine learning
23
Some DM techniques :

Decision Trees

Neural Networks

Genetic Algorithms

Nearest neighbor methods

Rule induction
24
Some DM techniques

Decision trees
– Tree shaped structure with branches
– 2 main types:
  • Classification trees label records and assign them to the proper class
  • Regression trees estimate the value of a target variable
– Various algorithms
  • Chi-square automatic interaction detection (CHAID)
  • Classification and regression trees (CART)
  • Etc.
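
A classification tree can be sketched in plain Python as follows. This is a minimal illustration in the spirit of CART (binary splits chosen by Gini impurity), not the CART or CHAID implementation of any product; the records, the class labels and all function names are invented for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means all one class)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature index, threshold) of the binary split with the lowest weighted Gini."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best  # None when no split improves on the current node

def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow the tree recursively; a leaf stores the majority class."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return {
        "feature": f, "threshold": t,
        "left": build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth),
    }

def classify(tree, row):
    """Follow the branches until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree

# Hypothetical records: (age, household size) with an invented yes/no class.
rows = [(22, 1), (25, 2), (47, 4), (52, 3), (46, 1), (56, 2), (23, 4), (30, 1)]
labels = ["no", "no", "yes", "yes", "yes", "yes", "no", "no"]
tree = build_tree(rows, labels)
print(tree)
print(classify(tree, (50, 2)), classify(tree, (24, 3)))
```

A regression tree would follow the same structure, with the impurity measure replaced by the variance of the target variable and each leaf storing a mean instead of a majority class.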

25
Some DM techniques

Neural Networks
– Learn through training
– Resemble biological neural networks in structure
– Can produce very good predictions
– Not easy to use and to understand
– Cannot deal with missing data
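
"Learning through training" can be illustrated with a single artificial neuron (a perceptron), which is far simpler than the multilayer networks used in practice. The sketch below is an assumption-laden toy: the task (logical AND), the learning rate of 1 and the function names are chosen only for illustration.

```python
def step(z):
    """Threshold activation: the neuron fires (1) when its weighted input is non-negative."""
    return 1 if z >= 0 else 0

def train_perceptron(samples, epochs=20, rate=1):
    """Adjust weights after every example in proportion to the prediction error."""
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            output = step(weights[0] * x1 + weights[1] * x2 + bias)
            error = target - output          # 0 when the prediction is already correct
            weights[0] += rate * error * x1
            weights[1] += rate * error * x2
            bias += rate * error
    return weights, bias

def predict(weights, bias, x1, x2):
    return step(weights[0] * x1 + weights[1] * x2 + bias)

# Training examples for logical AND: inputs and the known correct output.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
print(weights, bias)
print([predict(weights, bias, x1, x2) for (x1, x2), _ in and_samples])
```

The learned weights are just numbers with no obvious meaning, which hints at why the slide describes neural networks as hard to understand even when they predict well.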
26
Some DM techniques

Genetic Algorithms
– Optimization techniques
  • Genetic combinations
  • Natural selection
  • Concepts of evolution
  • Etc.
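
The ideas listed above (combination, selection, evolution) can be sketched with a toy genetic algorithm. The problem here, maximizing the number of 1s in a bit string, and every parameter value are assumptions made only for illustration.

```python
import random

def fitness(bits):
    """Toy objective: the more 1s in the string, the fitter the individual."""
    return sum(bits)

def crossover(a, b, rng):
    """Genetic combination: the child takes the head of one parent and the tail of the other."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(bits, rng, rate=0.05):
    """Flip each bit with a small probability."""
    return [1 - bit if rng.random() < rate else bit for bit in bits]

def evolve(length=20, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = [max(fitness(p) for p in population)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]   # natural selection: the fitter half reproduces
        children = [population[0]]              # elitism: the best individual survives unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            children.append(mutate(crossover(mother, father, rng), rng))
        population = children
        history.append(max(fitness(p) for p in population))
    return max(population, key=fitness), history

best, history = evolve()
print(fitness(best), history[0], history[-1])
```

Because the best individual is always carried over, the best fitness in `history` can never decrease from one generation to the next.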
27
Some DM techniques

Nearest neighbor methods
– k-nearest neighbor technique
– Classifies each record based on a combination of the classes of the k most similar records
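
A minimal sketch of the k-nearest neighbor idea, assuming numeric attributes, Euclidean distance and a simple majority vote; the historical records and the function name knn_classify are invented for the example.

```python
from collections import Counter
import math

def knn_classify(history, query, k=3):
    """Classify `query` by majority vote among its k nearest historical records."""
    nearest = sorted(history, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical historical records: (attribute vector, known class).
history = [((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.5), "low"),
           ((6.0, 6.5), "high"), ((7.0, 6.0), "high"), ((6.5, 7.5), "high")]

print(knn_classify(history, (1.8, 1.2)))
print(knn_classify(history, (6.2, 6.8)))
```

No model is built in advance: all the work happens at prediction time, when the new record is compared with the stored ones.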
28
Some DM techniques

Rule induction
– Extraction of if-then-else rules from data based on statistical significance
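
A rough sketch of rule extraction is given below. Note the assumption: instead of a formal significance test, it keeps a rule only when it is backed by enough records (support) and is right often enough (confidence). The records, the thresholds and the function name induce_rules are all invented for illustration.

```python
from collections import Counter

def induce_rules(records, target, min_support=2, min_confidence=0.75):
    """Return 'if attribute = value then target = class' rules that pass both thresholds."""
    pair_counts = Counter()   # (attribute, value) -> number of records
    rule_counts = Counter()   # (attribute, value, class) -> number of records
    for rec in records:
        for attr, value in rec.items():
            if attr == target:
                continue
            pair_counts[(attr, value)] += 1
            rule_counts[(attr, value, rec[target])] += 1
    rules = []
    for (attr, value, cls), hits in rule_counts.items():
        confidence = hits / pair_counts[(attr, value)]
        if hits >= min_support and confidence >= min_confidence:
            rules.append((attr, value, cls, hits, confidence))
    return sorted(rules)

# Hypothetical survey records, invented for illustration.
records = [
    {"employment": "employed", "region": "urban", "income": "high"},
    {"employment": "employed", "region": "rural", "income": "high"},
    {"employment": "employed", "region": "urban", "income": "high"},
    {"employment": "unemployed", "region": "urban", "income": "low"},
    {"employment": "unemployed", "region": "rural", "income": "low"},
    {"employment": "employed", "region": "rural", "income": "low"},
]

rules = induce_rules(records, target="income")
for attr, value, cls, hits, confidence in rules:
    print(f"if {attr} = {value} then income = {cls} (support {hits}, confidence {confidence:.2f})")
```

With real data the thresholds would be replaced or complemented by a proper statistical test, since a rule seen in only a handful of records can easily arise by chance.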
29
How does DM work?

Modeling
– Predicting the FUTURE!
– Build once, apply/use many times
30
How does DM work?

Testing the validity of the model
– Known cases with known data
31
Data Mining Software

Numap7, freeware for fast development,
validation, and application of regression-type
networks, including the multilayer
perceptron, functional link net, piecewise
linear network, self-organizing map and k-means.
– http://www-ee.uta.edu/eeweb/ip/Software/Software.htm
32
Data Mining Software

Tiberius, MLP Neural Network for
classification and regression problems.
– http://www.philbrierley.com/
33
Data Mining Software

Eurostat-funded research projects
– SODAS – Symbolic Official Data Analysis System => ASSO
– KESO – Knowledge Extraction for Statistical Offices
– Spin! – Spatial mining for data of public interest
34
Data Mining Software

SAS data mining tools
– Enterprise Miner and Text Miner
– Applications relevant to national statistical offices
– Build a model of the real world based on various data
– Use the model to produce patterns
– Reveal trends
– Explain known outcomes
– Predict future outcomes
– Forecast resource demands
– Identify factors to secure a desired effect
– Produce new knowledge to better inform decision makers before they act
– Predict new opportunities
35
Data Mining Software


SAS data mining process: a framework for data mining:
sample, explore, modify, model, assess
Integrated models and algorithms:
– Decision trees
– Neural networks
– Regression
– Memory-based reasoning
– Bagging and boosting ensembles
– Two-stage models
– Clustering
– Time series
– Associations
36
Data Mining Software

SPSS Clementine
– Data mining workbench
– Applications relevant to national statistical offices
  • Find useful relationships in large data sets
  • Develop predictive models
  • Improve decision making
– Modeling
  • Prediction and classification: neural networks, decision trees and rule induction, linear regression, logistic regression, multinomial logistic regression
  • Clustering and segmentation: Kohonen network, K-means, and TwoStep
  • Association detection: GRI, Apriori, and sequence
  • Data reduction: factor analysis and principal components analysis
  • Meta-modeling – combination of models
37
Data Mining Software
Open source data mining
– www.cs.waikato.ac.nz/ml/weka - Weka (Waikato Environment for Knowledge Analysis)
– Data mining software in Java
– Collection of machine learning algorithms for data mining tasks:
  • Data pre-processing
  • Classification
  • Regression
  • Clustering
  • Association rules
  • Visualization
– Platforms: Linux, Windows and Macintosh
– Apply directly to a dataset or call from Java code
– Online documentation:
  • Tutorial
  • User guide
  • API documentation
38
References :
Statistical Data Mining Tutorials
– http://www-2.cs.cmu.edu/~awm/tutorials/
Data Mining Glossary
– http://www.twocrows.com/glossary.htm
Mind tools - Decision Tree Analysis
– http://www.mindtools.com/dectree.html
Welcome to TheDataMine
– http://www.the-data-mine.com/
An Introduction to Data Mining - Discovering hidden value in your data
warehouse
– http://www.thearling.com/text/dmwhite/dmwhite.htm
An Introduction to Data Mining
– http://www.thearling.com/dmintro/dmintro.pdf
Data Mining for Official Statistics, Phan Tuan Pham (UNSD)
– SIAP ICT, Chiba, 7 – 9 June 2004
Wikipedia, the free encyclopaedia
– http://en.wikipedia.org/wiki/Data_mining
39