Data Mining
CS157B Section 2
Larry Varela
What is Data Mining?
Data Mining is "The science of extracting
useful information from large data sets or
databases“. -- http://en.wikipedia.org/wiki/Data_mining
 Data mining is the process of analyzing
data from different perspectives and
summarizing it into useful information
within a particular context.

History of Data Mining

 Although data mining is a relatively new term, the technology has been around for more than 20 years.
 Companies have used powerful computers to sift through volumes of supermarket scanner data and analyze market research reports for years.
 Recent innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost.
Data Mining History cont…

 Data mining was derived from three previously defined disciplines:
 Classical statistics - embraces concepts such as regression analysis, standard distribution, standard deviation, standard variance, discriminant analysis, cluster analysis, and confidence intervals, all of which are used to study data and data relationships.
 Artificial intelligence - attempts to apply human-thought-like processing to statistical problems. AI concepts have been adopted by some high-end commercial products, such as query optimization modules for Relational Database Management Systems (RDBMS).
 Machine learning - attempts to let software learn about the data it studies, such that future decisions are based on the quality of the studied data.
What is it used for?

 Data Mining enables businesses to automatically explore and understand their data while identifying patterns, relationships, and dependencies that impact business outcomes. (Descriptive application)
 Business outcomes include revenue growth, profit improvement, cost containment, and risk management.
 Data Mining enables the uncovering and identification of relationships expressed as business rules or predictive models.
 These outputs can then be communicated in traditional reporting formats to guide business planning and strategies.
 In addition, these outputs can also be expressed as programming code that can be deployed into business software to generate predictions of future outcomes. (Predictive application)
Common Types of Relationships

 Classes: Stored data is used to locate information in predetermined groups. For example, a coffee chain could mine customer purchase data to determine when customers arrive and what they typically purchase. This information could be used to increase traffic by offering daily specials.
 Clusters: Data items can be grouped according to logical relationships. For example, data can be mined to identify technology market segments or recent consumer purchasing trends.
 Associations: Data can be mined to identify associations between items purchased or queried. For example, the beer-diaper example Dr. Lee mentioned during last class is an example of associative mining. (A small sketch of this idea follows the list.)
 Sequential patterns: Data is mined to anticipate or predict behavior patterns and trends. For example, a Corvette dealer could predict the likelihood of power-folding convertible tops being purchased based on recent increased purchases of convertible-style vehicles.
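To make the associations idea concrete, here is a minimal sketch of how support and confidence could be computed for a single rule such as {beer} -> {diapers}. The transactions, item names, and rule below are invented for illustration and are not from the slides.

```python
# Toy illustration of association mining: support and confidence for the
# hypothetical rule {beer} -> {diapers}. Transactions below are made up.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"beer", "diapers", "milk"},
]

antecedent = {"beer"}
consequent = {"diapers"}

n = len(transactions)
both = sum(1 for t in transactions if (antecedent | consequent) <= t)
antecedent_count = sum(1 for t in transactions if antecedent <= t)

support = both / n                       # fraction of all baskets with both items
confidence = both / antecedent_count     # P(diapers | beer)

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
# support = 0.60, confidence = 0.75
```

A rule is usually considered interesting only when both measures clear user-chosen thresholds.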
How does data mining work?

Data mining consists of five major elements:
 Extract, transform, and load transaction data onto the data warehouse system.
 Store and manage the data in a multidimensional database system.
 Provide data access to business analysts and/or information technology professionals.
 Analyze the data using application software.
 Present the data in a readable format.
-- info quoted from http://www.anderson.ucla.edu
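As a toy illustration of the five elements (not part of the quoted source), the sketch below walks through them in miniature, using Python's standard sqlite3 module as a stand-in for the warehouse; the table, columns, and rows are all invented.

```python
# A miniature, invented walk-through of the five elements, using sqlite3
# as a stand-in "warehouse". Table, columns, and rows are illustrative only.
import sqlite3

# 1. Extract, transform, and load transaction data.
raw = [("2006-03-01", "coffee", "2.50"), ("2006-03-01", "muffin", "1.75"),
       ("2006-03-02", "coffee", "2.50")]
rows = [(day, item, float(price)) for day, item, price in raw]  # transform

# 2. Store and manage the data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (day TEXT, item TEXT, price REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# 3. Provide access (plain SQL here) and 4. analyze: revenue per item.
query = "SELECT item, SUM(price) AS revenue FROM sales GROUP BY item"

# 5. Present the data in a readable format.
for item, revenue in db.execute(query):
    print(f"{item:>8}: ${revenue:.2f}")
```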
Data Mining Techniques

 Classical techniques: statistics, neighborhoods, and clustering
 Next-generation techniques: trees, networks, and rules
Trees

 Within a decision tree, each branch is a classification question and the leaves of the tree are partitions of the dataset with their classification.
 Decision trees can be viewed as segmentations of the original dataset, where each segment is one of the leaves of the tree.
 The decision tree technology can be used for exploration of datasets and/or business problems. This is often done by looking at the predictors and values that are chosen for each split of the tree. Often these predictors provide usable insights or propose questions that need to be answered.
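One way to picture "each branch is a classification question and each leaf is a partition" is to write a tiny tree out by hand. The sketch below uses invented attributes (a hypothetical loan-approval rule) purely to show the structure.

```python
# A hand-written decision tree as nested dicts: each interior node asks a
# classification question, each leaf labels one partition of the data.
# The attributes and labels (a loan-approval toy rule) are invented.
tree = {
    "question": lambda record: record["income"] > 50_000,
    "yes": {"label": "approve"},
    "no": {
        "question": lambda record: record["has_collateral"],
        "yes": {"label": "approve"},
        "no": {"label": "reject"},
    },
}

def classify(node, record):
    """Follow branches until reaching a leaf, i.e. a segment of the dataset."""
    while "label" not in node:
        node = node["yes"] if node["question"](record) else node["no"]
    return node["label"]

print(classify(tree, {"income": 30_000, "has_collateral": False}))  # reject
```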
Types of Decision Trees

 Classification tree analysis is the term used when the predicted outcome is the class to which the data belongs.
 Regression tree analysis is the term used when the predicted outcome can be considered a real number (e.g., the price of a house, or a patient's length of stay in a hospital).
 CART analysis is a term used to refer to both of the above procedures. The name CART is an acronym for Classification And Regression Trees and was first introduced by Breiman et al. [BFOS84].
-- info quoted from http://en.wikipedia.org/wiki/Decision_tree
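Assuming scikit-learn is available, the difference between the two tree types can be sketched as follows; the features, labels, and prices are made up for illustration.

```python
# Sketch of the classification / regression distinction with scikit-learn
# (assumed installed). Features, labels, and prices are invented.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[1200, 2], [1500, 3], [2200, 4], [900, 1]]   # e.g. square feet, bedrooms

# Classification tree: the predicted outcome is a class label.
clf = DecisionTreeClassifier().fit(X, ["cheap", "cheap", "expensive", "cheap"])
print(clf.predict([[2000, 3]]))      # a class label, e.g. ['expensive']

# Regression tree: the predicted outcome is a real number (e.g. a price).
reg = DecisionTreeRegressor().fit(X, [150_000, 180_000, 320_000, 110_000])
print(reg.predict([[2000, 3]]))      # a number
```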
Decision Tree Example

Angelo is the manager of a children's zoo. Recently Angelo has been experiencing customer attendance problems. Some days lots of visitors arrive wanting to tour the park while the staff is overworked, yet on other days no visitors arrive and the zoo staff has too much unproductive free time. Angelo's objective is to optimize staff availability by trying to predict when people will visit the park. To accomplish this, Angelo needs to understand why people decide to visit on particular days. He assumes that weather must be an important underlying factor, so he decides to use the weather forecast for the upcoming week. Angelo records the following:

 Weather outlook (sunny, overcast, or rainy)
 Temperature
 Percent humidity
 Whether or not it was windy
 Zoo attendance on that particular day
Decision Tree Example

INDEPENDENT VARIABLES: OUTLOOK, TEMP, HUMIDITY, WINDY
DEPENDENT VARIABLE: VISITOR ATTENDANCE

OUTLOOK    TEMP  HUMIDITY  WINDY  VISITOR ATTENDANCE
sunny       85      85     FALSE  no visits
sunny       80      90     TRUE   no visits
overcast    83      78     FALSE  visits
rain        70      96     FALSE  visits
rain        68      80     FALSE  visits
rain        65      70     TRUE   no visits
overcast    64      65     TRUE   visits
sunny       72      95     FALSE  no visits
sunny       69      70     FALSE  visits
rain        75      80     FALSE  visits
sunny       75      70     TRUE   visits
overcast    72      90     TRUE   visits
overcast    81      75     FALSE  visits
rain        71      80     TRUE   no visits
Decision Tree Example cont…

Angelo then applies a decision tree model to solve his problem. The resulting tree:

All days (Visits = 9, No Visits = 5) -- split on OUTLOOK?
  sunny (Visits = 2, No Visits = 3) -- split on HUMIDITY?
    <= 70: Visits = 2, No Visits = 0
    > 70:  Visits = 0, No Visits = 3
  overcast: Visits = 4, No Visits = 0
  rain (Visits = 3, No Visits = 2) -- split on WINDY?
    TRUE:  Visits = 0, No Visits = 2
    FALSE: Visits = 3, No Visits = 0
Decision Tree Example cont…

 The decision tree created is a model of the data that encodes the distribution of the class label in terms of the predictor attributes.
 The top node represents all the data. The classification tree algorithm finds that the best way to explain the dependent variable, VISIT, is by using the variable OUTLOOK.
 Angelo's first conclusion: if the OUTLOOK is OVERCAST, people always visit the zoo, and there exist some crazy people who visit the zoo even in the rain.
 He then divided the sunny group into two and realized that people don't like to visit the zoo if the humidity is higher than seventy percent.
 Finally, he divided the rain category into two and found that visitors will also not visit the zoo if it is windy.
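As a cross-check (not part of the original slides), Angelo's fourteen records can be fed to scikit-learn's decision tree learner; the hand-rolled one-hot encoding below is one of several reasonable choices, and the learned splits may differ in detail from the hand-drawn tree above.

```python
# Cross-check with scikit-learn (assumed installed): fit a classification
# tree to Angelo's 14 records. The one-hot / boolean encoding is my own.
from sklearn.tree import DecisionTreeClassifier, export_text

records = [
    ("sunny", 85, 85, False, "no visits"), ("sunny", 80, 90, True, "no visits"),
    ("overcast", 83, 78, False, "visits"), ("rain", 70, 96, False, "visits"),
    ("rain", 68, 80, False, "visits"),     ("rain", 65, 70, True, "no visits"),
    ("overcast", 64, 65, True, "visits"),  ("sunny", 72, 95, False, "no visits"),
    ("sunny", 69, 70, False, "visits"),    ("rain", 75, 80, False, "visits"),
    ("sunny", 75, 70, True, "visits"),     ("overcast", 72, 90, True, "visits"),
    ("overcast", 81, 75, False, "visits"), ("rain", 71, 80, True, "no visits"),
]

feature_names = ["sunny", "overcast", "rain", "temp", "humidity", "windy"]
X = [[int(o == "sunny"), int(o == "overcast"), int(o == "rain"), t, h, int(w)]
     for o, t, h, w, _ in records]
y = [label for *_, label in records]

model = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(model, feature_names=feature_names))  # text view of the splits
```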
Decision Tree Example Conclusion

 Angelo dismisses most of the staff on days that are sunny and humid or rainy and windy, because almost no one is going to visit the zoo on those days. On days when a lot of people are expected to visit, he hires extra staff.
 The conclusion is that the decision tree helped Angelo turn a complex data representation into a much simpler structure.
Decision Tree Advantages

 Decision trees are simple to understand and interpret.
 Data preparation for a decision tree is basic or unnecessary.
 They are able to handle both numerical and categorical data, whereas other techniques are usually specialized in analyzing datasets that have only one type of variable.
 It is possible to validate a model using statistical tests.
 They are robust and perform well with large data sets in a short time.
Data Mining Pitfalls

 Sometimes data mining may impose patterns on data where none exist. This imposition of irrelevant correlation is termed data dredging or data fishing.
 Large data sets invariably happen to contain some exciting relationships peculiar to that data, so any conclusions reached from such patterns are likely to be highly suspect, as the sketch below illustrates.
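To see the data-dredging pitfall in action, the sketch below (standard library only; statistics.correlation requires Python 3.10+) scans fifty columns of pure random noise for the most "exciting" correlation; whatever it reports is spurious by construction.

```python
# Data dredging in miniature: scan many columns of pure noise and report
# the most "exciting" correlation found. Any pattern here is spurious by
# construction. Requires Python 3.10+ for statistics.correlation.
import random
from statistics import correlation

random.seed(0)
columns = {f"col{i}": [random.random() for _ in range(30)] for i in range(50)}

best = max(
    ((a, b, correlation(columns[a], columns[b]))
     for a in columns for b in columns if a < b),
    key=lambda pair: abs(pair[2]),
)
print(f"most correlated pair: {best[0]} vs {best[1]}, r = {best[2]:.2f}")
```

With enough columns, a fairly large r value will almost always turn up, which is exactly why conclusions mined this way are suspect.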
References

 Wikipedia.org (2006) Data mining. Retrieved on 3/20/2006 from http://en.wikipedia.org/wiki/Data_mining
 Wikipedia.org (2006) Decision tree. Retrieved on 3/20/2006 from http://en.wikipedia.org/wiki/Decision_tree
 Bill Palace (1996) What is Data Mining? Retrieved on 3/20/2006 from http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
 Data-Mining-Software.com (2006) Data Mining History. Retrieved on 3/20/2006 from http://www.data-mining-software.com/data_mining_history.htm
 Alex Berson, Stephen Smith, and Kurt Thearling (1999) An Overview of Data Mining Techniques. Retrieved on 3/20/2006 from http://www.thearling.com/text/dmtechniques/dmtechniques.htm
 [BFOS84] L. Breiman, J. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees. Wadsworth, 1984.