Data Mining - SFU computer science


Supporting Decision Making
A Framework for IS Management
Introduction (2)

Most computer systems support decision making
because all software programs involve
automating decision steps that people would
take

Decision making is a process that involves a
variety of activities, most of which handle
information

A wide variety of computer-based tools and
approaches can be used to confront the problem
at hand and work through its solution
Introduction (3)

Computer technologies that support decision
making






Decision support systems (DSSs)
Data mining
Executive information systems (EISs)
Expert systems (ESs)
Agent-based modeling
Multidisciplinary foundations for DS technologies

Database research, artificial intelligence, statistical
inference, human-computer interaction, simulation
methods, software engineering etc.
Case Example---A Problem-Solving Scenario

Using an EIS to discover a sales shortfall in one
region

Investigate several possible causes





Economic conditions
Competitive analysis
Written sales reports
A data mining analysis
Result: no clear problems revealed
Decision Support Systems---History

Two contributing areas of research in the 1950s-1960s






Organizational decision making at CMU
Interactive computer systems at MIT
Middle 1970s: single user and model-oriented
DSS
Middle and late 1980s: EIS, GDSS, ODSS
1990s: Data warehousing and OLAP
Late 1990s-2000s


Data mining
Web-based analytical applications
What is a DSS?

A DSS aims to use IT to relieve humans of some
decision making or help us make more informed
decisions


Systems that support, not replace, managers in their
decision-making activities
DSSs are defined as:





Computer-based systems
That help decision makers
Confront ill-structured problems
Through direct interaction
With data and analysis models
DSS Architecture (1)
DSS Architecture (2)

The Dialog Component
Linking the user to the system

The Data Component
Data sources --- use all the important data sources
within and outside the organization in the form of
summarized data (DW & DM)

The Model Component
Models provide the analysis capabilities for a DSS
Using a mathematical representation of the problem,
algorithmic processes are employed to generate information to
support decision making
A Taxonomy of DSS

Using the mode of assistance as the criterion





A model-driven DSS
A communication-driven DSS
A data-driven DSS or data-oriented DSS
A document-driven DSS
A knowledge-driven DSS
Executive Information System (1)

The emphasis of EIS is on graphical displays
and easy-to-use user interfaces

EIS can be viewed as a DSS that:



Provides access to summary performance data
Uses graphics to display and visualize the data in
an easy-to-use fashion, and
Has a minimum of analysis for modeling beyond
the capability to "drill down" in summary data to
examine components
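The "drill down" capability can be sketched in a few lines. This is a minimal illustration, not how any particular EIS product works: the `sales` records, the region and product names, and the helpers `summarize` and `drill_down` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sales records: (region, product, amount).
sales = [
    ("West", "Widgets", 120),
    ("West", "Gadgets", 80),
    ("East", "Widgets", 60),
    ("East", "Gadgets", 30),
]


def summarize(records, key_index):
    """Total the amount (last field) grouped by the field at key_index."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key_index]] += record[-1]
    return dict(totals)


def drill_down(records, region):
    """Break one region's summary figure into its per-product components."""
    in_region = [r for r in records if r[0] == region]
    return summarize(in_region, 1)


# Summary view an executive would see first.
by_region = summarize(sales, 0)
# Drilling into the weaker region to examine its components.
east_detail = drill_down(sales, "East")
```

Here `by_region` is the summary performance data, and `east_detail` is what the executive sees after drilling into one region.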
Executive Information System (2)

EISs aim to provide both internal and external
information relevant to meeting the strategic
goals of the organization


Gauge company performance
Scan the environment

EIS and data warehousing technologies are
converging in the marketplace

The term EIS has lost popularity in favor of
Business Intelligence
Data Mining: Motivations

The explosive growth of data: from TB to PB

Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society

Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: remote sensing, bioinformatics, …
Society and everyone: news, digital cameras, YouTube

We are drowning in data, but starving for
knowledge!

“Necessity is the mother of invention”—Data mining—
Automated analysis of massive data sets
What Is Data Mining?

Data mining (knowledge discovery from data)
Extraction of interesting patterns or knowledge from huge amounts
of data

Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.

Watch out: Is everything “data mining”?
Simple search and query processing
(Deductive) expert systems
Knowledge Discovery (KDD) Process

Data mining: core of the knowledge discovery process
[Figure: KDD process, from bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Pattern Evaluation]
Architecture: A Typical Data Mining System
[Figure: layered architecture, from top to bottom: Graphical User Interface; Pattern Evaluation; Data Mining Engine; Database or Data Warehouse Server; data cleaning, integration, and selection; data sources (Database, Data Warehouse, World-Wide Web, Other Info Repositories); plus a Knowledge Base]
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, and Other Disciplines]
Why Not Traditional Data Analysis?

Tremendous amount of data
Algorithms must be highly scalable to handle TB of data

High-dimensionality of data
Micro-array may have tens of thousands of dimensions

High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structure data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations

New and sophisticated applications
Multi-Dimensional View of Data Mining (1)

Data to be mined


Relational, data warehouse, transactional, stream,
object-oriented/relational, active, spatial, time-series,
text, multi-media, heterogeneous, legacy, WWW
Knowledge to be mined


Characterization, discrimination, association,
classification, clustering, trend/deviation, outlier
analysis, etc.
Multiple/integrated functions and mining at
multiple levels
Multi-Dimensional View of Data Mining (2)

Techniques utilized


Database-oriented, data warehouse (OLAP),
machine learning, statistics, visualization, etc.
Applications adapted

Retail, telecommunication, banking, fraud
analysis, bio-data mining, stock market analysis,
text mining, Web mining, etc.
Data Mining Functionalities (1)

Multidimensional concept description:
characterization and discrimination
Generalize, summarize, and contrast data characteristics,
e.g., dry vs. wet regions

Frequent patterns, association, correlation vs.
causality
Diaper → Beer [0.5%, 75%]

Classification and prediction
Construct models (functions) that describe and distinguish
classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars based
on (gas mileage)
Predict some unknown or missing numerical values
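In the association rule Diaper → Beer [0.5%, 75%] above, the bracketed pair is conventionally read as [support, confidence]. The sketch below shows how those two measures are computed. The `transactions` list is hypothetical toy data chosen to give round numbers (it does not reproduce the 0.5% support on the slide), and the helper names are illustrative.

```python
# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "beer", "bread"},
    {"diaper", "milk"},
    {"milk", "bread"},
    {"beer", "bread"},
    {"bread"},
    {"milk"},
]


def support(baskets, itemset):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in baskets if itemset <= set(basket))
    return hits / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent: support(A and B) / support(A)."""
    ante = support(baskets, antecedent)
    if ante == 0:
        return 0.0
    both = support(baskets, set(antecedent) | set(consequent))
    return both / ante


rule_support = support(transactions, {"diaper", "beer"})
rule_confidence = confidence(transactions, {"diaper"}, {"beer"})
```

With this toy data the rule diaper → beer has support 3/8 and confidence 3/4. High confidence alone says nothing about causality, which is the "correlation vs. causality" caution on the slide.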
Data Mining Functionalities (2)

Cluster analysis
Class label is unknown: Group data to form new classes,
e.g., cluster houses to find distribution patterns
Maximizing intra-class similarity & minimizing inter-class
similarity

Outlier analysis
Outlier: Data object that does not comply with the general
behavior of the data
Noise or exception? Useful in fraud detection, rare events
analysis

Trend and evolution analysis
Trend and deviation: e.g., regression analysis
Periodicity analysis
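As one concrete instance of cluster analysis, the sketch below runs a one-dimensional k-means on hypothetical house prices. K-means is only one of many clustering methods and is a choice made here for illustration; the data, the starting centers, and the function name `kmeans_1d` are assumptions.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Group numbers around centers: assign each value to its nearest
    center, move each center to the mean of its group, and repeat until
    the centers stop moving."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        new_centers = [
            sum(group) / len(group) if group else centers[i]
            for i, group in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical house prices (in thousands); no class labels are given.
house_prices = [100, 110, 120, 400, 410, 420]
centers, clusters = kmeans_1d(house_prices, centers=[100, 110])
```

Even from a poor starting guess, the two groups found are the low-priced and high-priced houses: values within a group are close together (high intra-class similarity) and the groups are far apart (low inter-class similarity).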
Major Issues in Data Mining (1)

Mining methodology







Mining different kinds of knowledge from diverse data
types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing
knowledge: knowledge fusion
Major Issues in Data Mining (2)

User interaction




Data mining query languages and ad-hoc mining
Expression and visualization of data mining
results
Interactive mining of knowledge at multiple levels
of abstraction
Applications and social impacts


Domain-specific data mining & invisible data
mining
Protection of data security, integrity, and privacy
Artificial Intelligence (1)

AI is a group of technologies that attempts to mimic
our senses and emulate certain aspects of human
behavior such as reasoning and communication

1956, a conference at Dartmouth College
John McCarthy, Marvin Minsky, Allen Newell and Herbert Simon
(MIT, CMU and Stanford)
1965, H. A. Simon: "machines will be capable, within
twenty years, of doing any work a man can do"
1967, Marvin Minsky: "Within a generation ... the problem
of creating 'artificial intelligence' will substantially be solved"
Heavily funded by DARPA
Artificial Intelligence (2)

Early AI researchers had failed to recognize the difficulty
of some of the problems they faced:
The lack of raw computing power
The intractable combinatorial explosion of their algorithms
The difficulty of representing commonsense knowledge and
doing commonsense reasoning
The incredible difficulty of perception and motion
The failings of logic
First AI Winter

In 1974, DARPA cut off all undirected, exploratory
research in AI
Artificial Intelligence (3)

In the early 80s, the field was revived by the
commercial success of expert systems



By 1985 the market for AI had reached more than
a billion dollars.
Minsky and others warned the community that
enthusiasm for AI had spiraled out of control and
that disappointment was sure to follow
Second AI Winter

The collapse of the Lisp Machine market in 1987
Artificial Intelligence (4)


In the 90s AI achieved its greatest successes
Artificial intelligence was adopted throughout
the technology industry, providing the heavy
lifting for




Data mining
Logistics
Medical diagnosis
…
Expert System

An expert system is an automated type of
analysis or problem-solving model that deals
with a problem the way an "expert" does

The process involves consulting a base of
knowledge or expertise to reason out an answer
based on the characteristics of the problem
Architecture of an ES
[Figure: components User, User Interface, Inference Engine, and Knowledge Base; flows labeled "Description of a problem" and "Advice and explanation"]
Knowledge Representation

In AI, the primary aim of knowledge
representation is to store knowledge so that
programs can process it and achieve the
verisimilitude of human intelligence


The representation theory has its origin in cognitive
science
Knowledge can be represented in a number of
ways



Case-based reasoning
Artificial neural networks
Stored as rules
Case-based Reasoning (1)

Case-based reasoning


The process of solving new problems based on
the solutions of similar past problems
A case consists of a problem, its solution, and,
typically, annotations about how the solution was
derived
Case-based Reasoning (2)

Case-based reasoning as a four-step process




Retrieve: given a target problem, retrieve cases
from memory that are relevant to solving it
Reuse: map the solution from the previous case to
the target problem
Revise: test the new solution and, if necessary, revise
it
Retain: After the solution has been successfully
adapted to the target problem, store the resulting
experience as a new case in memory
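The four steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the `Case` structure, the attribute-matching `similarity` measure, the help-desk cases, and the `works` callback are all hypothetical, and the Revise step is reduced to accepting or rejecting the reused solution rather than actually adapting it.

```python
from dataclasses import dataclass


@dataclass
class Case:
    problem: dict      # attribute -> value description of the problem
    solution: str
    notes: str = ""    # annotations about how the solution was derived


def similarity(a, b):
    """Fraction of attributes on which two problem descriptions agree."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    return sum(1 for k in keys if a.get(k) == b.get(k)) / len(keys)


def retrieve(case_base, target):
    """Retrieve: pick the stored case most similar to the target problem."""
    return max(case_base, key=lambda c: similarity(c.problem, target))


def reuse(case, target):
    """Reuse: map the old solution onto the target (here, copied as-is)."""
    return case.solution


def revise(solution, target, works):
    """Revise: test the solution; this sketch only accepts or rejects it."""
    return solution if works(target, solution) else None


def retain(case_base, target, solution):
    """Retain: store the successful experience as a new case."""
    case_base.append(Case(dict(target), solution, "retained after reuse"))


def solve(case_base, target, works):
    proposed = reuse(retrieve(case_base, target), target)
    confirmed = revise(proposed, target, works)
    if confirmed is not None:
        retain(case_base, target, confirmed)
    return confirmed


# Hypothetical help-desk case base.
case_base = [
    Case({"symptom": "no_power", "device": "laptop"}, "check power supply"),
    Case({"symptom": "no_network", "device": "laptop"}, "restart router"),
]
target = {"symptom": "no_power", "device": "desktop"}
result = solve(case_base, target, works=lambda problem, solution: True)
```

After the call, the case base holds three cases, so the next desktop power problem can be matched directly.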
Supervised vs. Unsupervised Learning

Supervised learning



Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
New data is classified based on the training set
Unsupervised learning


The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim
is to establish the existence of classes or clusters in
the data
Artificial Neural Network (1)

An interconnected group of artificial neurons



Using a mathematical or computational model for
information processing based on a connectionist
approach to computation.
An adaptive system that changes its structure based on
external or internal information that flows through the
network.
ANNs can be used to model complex relationships
between inputs and outputs or to find patterns in
data

Non-linear statistical data modeling or decision making
tools
Artificial Neural Network (2)
Training set:
(1) high salary, owns a house, has a dog,
[profitable customer]
(2) less than 3 years on job, prior bankruptcy,
owns a dog, [deadbeat]
......
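To make the training-set idea concrete, the sketch below trains a single artificial neuron (a perceptron), the simplest building block of an ANN, on labeled customers. It is an illustration only: the binary feature encoding is an assumption, the first two rows encode the two examples on the slide, and the remaining two rows are hypothetical additions, not data from the slide.

```python
# Assumed binary encoding of the attributes mentioned on the slide.
FEATURES = ["high_salary", "owns_house", "short_tenure", "prior_bankruptcy", "has_dog"]

# Label 1 = profitable customer, 0 = deadbeat.
TRAINING = [
    ([1, 1, 0, 0, 1], 1),  # slide example (1)
    ([0, 0, 1, 1, 1], 0),  # slide example (2)
    ([1, 0, 0, 0, 0], 1),  # hypothetical row, added for illustration
    ([0, 1, 1, 1, 0], 0),  # hypothetical row, added for illustration
]


def predict(weights, bias, x):
    """Fire (1) if the weighted sum of the inputs exceeds zero."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train(rows, epochs=20):
    """Perceptron learning rule: nudge the weights toward each
    misclassified example until a full pass makes no mistakes."""
    weights = [0] * len(rows[0][0])
    bias = 0
    for _ in range(epochs):
        mistakes = 0
        for x, label in rows:
            error = label - predict(weights, bias, x)
            if error != 0:
                mistakes += 1
                weights = [w + error * xi for w, xi in zip(weights, x)]
                bias += error
        if mistakes == 0:
            break
    return weights, bias


weights, bias = train(TRAINING)
```

On this toy data the learned weight for `has_dog` ends at zero, which fits the slide's examples: both the profitable customer and the deadbeat own a dog, so that attribute carries no signal. A real ANN would stack many such units and use far more training data.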
Rule-based Systems (1)

Knowledge stored as rules



The most commonly used form of rules is the if-then statement
e.g. IF some condition THEN some action
A rule-based inference model: decision tree



Each internal node (non-leaf node) denotes a test
on an attribute
Each branch represents an outcome of the test
Each leaf node holds a class label
Rule-based Systems (2)
Training dataset for decision tree (buys_computer)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Rule-based Systems (3)
Decision tree (buys_computer)

age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
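The buys_computer tree can be read as nested if-then rules: each internal node is a test, each branch an outcome, each leaf a class label. The sketch below writes the tree that way and checks it against the 14 training rows from the preceding slide. The function name and the tuple layout are illustrative, and the age range is written in ASCII as "31..40".

```python
def buys_computer(age, student, credit_rating):
    """Walk the decision tree from the root test (age) down to a leaf."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        return "yes"
    if age == ">40":
        return "yes" if credit_rating == "fair" else "no"
    raise ValueError(f"unknown age group: {age!r}")


# Training rows: (age, income, student, credit_rating, buys_computer).
# Income is listed for completeness; this tree never tests it.
TRAINING = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

# Rows where the tree's answer disagrees with the recorded label.
misclassified = [
    row for row in TRAINING
    if buys_computer(row[0], row[2], row[3]) != row[4]
]
```

The tree reproduces the label of every training row, so `misclassified` is empty. That says nothing about accuracy on new customers, which would need separate test data.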
Agent-based Modeling

Simulate the behavior that emerges from the
decisions of a large number of distinct
individuals


Computer generated agents, each making
decisions typical of the decisions an individual
would make in the real world
Trying to understand the mysteries of why
businesses, markets, consumers, and other
complex systems behave as they do
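A minimal sketch of the idea: each agent follows one simple local rule, and the overall pattern emerges from many agents applying it. The setup here is hypothetical (consumers arranged in a ring, each adopting a product once a given fraction of its two neighbors has adopted) and is not drawn from the slides.

```python
def step(adopted, threshold):
    """One round: every non-adopter looks at its two ring neighbors and
    adopts if the adopting fraction reaches the threshold."""
    n = len(adopted)
    nxt = list(adopted)
    for i in range(n):
        if adopted[i]:
            continue
        neighbors = [adopted[(i - 1) % n], adopted[(i + 1) % n]]
        if sum(neighbors) / len(neighbors) >= threshold:
            nxt[i] = True
    return nxt


def simulate(adopted, threshold, max_steps=100):
    """Run rounds until nothing changes; return the state after each round."""
    history = [list(adopted)]
    for _ in range(max_steps):
        nxt = step(history[-1], threshold)
        if nxt == history[-1]:
            break
        history.append(nxt)
    return history


# Ten hypothetical consumers, one early adopter.
start = [True] + [False] * 9

# One adopting neighbor is enough: adoption spreads around the ring.
easy = [sum(state) for state in simulate(start, threshold=0.5)]
# Both neighbors must adopt first: the product never takes off.
hard = [sum(state) for state in simulate(start, threshold=1.0)]
```

The same population and the same starting point give a full cascade or no spread at all depending on one individual-level parameter, which is the kind of market behavior agent-based models are used to explore.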
Toward the Real-Time Enterprise

The essence of the phrase real-time
enterprise is that organizations can know how
they are doing at the moment

Digitization and automation of some crucial
enterprise activities traditionally completed by
people


Esp. information analysis
Better sense-and-response
Real-time Reporting

Real-time reporting is occurring on a whole host
of fronts including:

Enterprise nervous systems
A network that connects people, applications and devices
To coordinate company operations

Straight-through processing
To reduce distortion in supply chains

Real-time CRM
To automate decision making relating to customers

Communicating objects
To gain real-time data about the physical world
E.g. radio frequency identification device (RFID)
The Dark Side of Real Time

Object-to-object communication could
compromise privacy


Knowing the exact location of a company truck
every minute of the day is an invasion of the driver's
privacy
In the era of speed, a situation can become
very bad very fast

E.g. "circuit breaker" to stop deep dives on the NYSE