Data Mining - SFU computer science
Supporting Decision Making
A Framework for IS Management
Introduction (2)
Most computer systems support decision making
because all software programs involve
automating decision steps that people would
take
Decision making is a process that involves a
variety of activities, most of which handle
information
A wide variety of computer-based tools and
approaches can be used to confront the problem
at hand and work through its solution
Introduction (3)
Computer technologies that support decision
making
Decision support systems (DSSs)
Data mining
Executive information systems (EISs)
Expert systems (ESs)
Agent-based modeling
Multidisciplinary foundations for DS technologies
Database research, artificial intelligence, statistical
inference, human-computer interaction, simulation
methods, software engineering etc.
Case Example---A Problem-Solving Scenario
Using an EIS to discover a sales shortfall in one
region
Investigate several possible causes
Economic conditions
Competitive analysis
Written sales reports
A data mining analysis
Result: no clear problems revealed
Decision Support Systems---History
Two contributing areas of research in the 1950s-1960s
Organizational decision making at CMU
Interactive computer systems at MIT
Middle 1970s: single user and model-oriented
DSS
Middle and late 1980s: EIS, GDSS, ODSS
1990s: Data warehousing and OLAP
Late 1990s-2000s
Data mining
Web-based analytical applications
What is a DSS?
A DSS aims to use IT to relieve humans of some
decision making or help us make more informed
decisions
Systems that support, not replace, managers in their
decision-making activities
DSSs are defined as:
Computer-based systems
That help decision makers
Confront ill-structured problems
Through direct interaction
With data and analysis models
DSS Architecture (1)
DSS Architecture (2)
The Dialog Component
Linking the user to the system
The Data Component
Data sources --- use all the important data sources
within and outside the organization in the form of
summarized data (DW & DM)
The Model Component
Models provide the analysis capabilities for a DSS
Using a mathematical representation of the problem,
algorithmic processes are employed to generate information to
support decision making
A Taxonomy of DSS
Using the mode of assistance as the criterion
A model-driven DSS
A communication-driven DSS
A data-driven DSS or data-oriented DSS
A document-driven DSS
A knowledge-driven DSS
Executive Information System (1)
The emphasis of EIS is on graphical displays
and easy-to-use user interfaces
EIS can be viewed as a DSS that:
Provides access to summary performance data
Uses graphics to display and visualize the data in
an easy-to-use fashion, and
Has a minimum of analysis for modeling beyond
the capability to "drill down" in summary data to
examine components
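The "drill down" capability can be illustrated with a minimal sketch. The sales records, region names, and product names below are hypothetical, chosen to echo the sales-shortfall scenario; a real EIS would read such figures from a data warehouse rather than from an in-memory list.

```python
from collections import defaultdict

# Hypothetical sales records: (region, product, amount).
SALES = [
    ("East", "laptops", 120),
    ("East", "phones", 80),
    ("West", "laptops", 40),
    ("West", "phones", 30),
]

def summarize(records, key_index):
    """Total the sales amount grouped by the field at key_index."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key_index]] += record[2]
    return dict(totals)

def drill_down(records, region):
    """Break one region's summary figure into its product components."""
    regional = [r for r in records if r[0] == region]
    return summarize(regional, key_index=1)

by_region = summarize(SALES, key_index=0)   # top-level summary view
west_detail = drill_down(SALES, "West")     # components behind one figure
print(by_region)
print(west_detail)
```

The summary view shows which region is weak; drilling down shows which components make up that region's figure.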
Executive Information System (2)
EISs aim to provide both internal and external
information relevant to meeting the strategic
goals of the organization
Gauge company performance
Scan the environment
EIS and data warehousing technologies are
converging in the marketplace
The term EIS has lost popularity in favor of
Business Intelligence
Data Mining: Motivations
The explosive growth of data: from TB to PB
Data collection and data availability
Major sources of abundant data
Automated data collection tools, database systems, Web,
computerized society
Business: Web, e-commerce, transactions, stocks, …
Science: remote sensing, bioinformatics, …
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for
knowledge!
“Necessity is the mother of invention”—Data mining—
Automated analysis of massive data sets
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting patterns or knowledge from huge amounts
of data
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”?
Simple search and query processing
(Deductive) expert systems
Knowledge Discovery (KDD) Process
Data mining is the core of the knowledge discovery process
[Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Architecture: A Typical Data Mining System
[Figure: layers from top to bottom: Graphical User Interface; Pattern Evaluation; Data Mining Engine; Database or Data Warehouse Server; data cleaning, integration, and selection; data sources (Database, Data Warehouse, World-Wide Web, Other Info Repositories); with a Knowledge Base alongside]
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, and Other Disciplines]
Why Not Traditional Data Analysis?
Tremendous amount of data
Algorithms must be highly scalable to handle TB of data
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Multi-Dimensional View of Data Mining (1)
Data to be mined
Relational, data warehouse, transactional, stream,
object-oriented/relational, active, spatial, time-series,
text, multi-media, heterogeneous, legacy, WWW
Knowledge to be mined
Characterization, discrimination, association,
classification, clustering, trend/deviation, outlier
analysis, etc.
Multiple/integrated functions and mining at
multiple levels
Multi-Dimensional View of Data Mining (2)
Techniques utilized
Database-oriented, data warehouse (OLAP),
machine learning, statistics, visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud
analysis, bio-data mining, stock market analysis,
text mining, Web mining, etc.
Data Mining Functionalities (1)
Multidimensional concept description:
characterization and discrimination
Generalize, summarize, and contrast data characteristics,
e.g., dry vs. wet regions
Frequent patterns, association, correlation vs.
causality
Diaper → Beer [0.5%, 75%]
Classification and prediction
Construct models (functions) that describe and distinguish
classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars based
on (gas mileage)
Predict some unknown or missing numerical values
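The bracketed pair attached to the diaper/beer rule above is conventionally read as [support, confidence]. A minimal sketch of how the two measures are computed follows; the transactions are hypothetical and do not reproduce the 0.5% and 75% figures from the slide.

```python
# Hypothetical market-basket transactions (each is a set of items).
TRANSACTIONS = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(transactions, itemset):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that
    also contain consequent."""
    both = support(transactions, antecedent | consequent)
    return both / support(transactions, antecedent)

rule_support = support(TRANSACTIONS, {"diaper", "beer"})
rule_confidence = confidence(TRANSACTIONS, {"diaper"}, {"beer"})
print(f"Diaper -> Beer [support={rule_support:.0%}, "
      f"confidence={rule_confidence:.0%}]")
```

Support measures how often the whole rule applies; confidence measures how often the right-hand side holds when the left-hand side does. Neither establishes causality, which is the point of the "correlation vs. causality" caution.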
Data Mining Functionalities (2)
Cluster analysis
Class label is unknown: group data to form new classes,
e.g., cluster houses to find distribution patterns
Maximizing intra-class similarity and minimizing inter-class
similarity
Outlier analysis
Outlier: a data object that does not comply with the general
behavior of the data
Noise or exception? Useful in fraud detection and rare-event
analysis
Trend and evolution analysis
Trend and deviation: e.g., regression analysis
Periodicity analysis
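One simple way to flag an object that "does not comply with the general behavior of the data" is a z-score test. The slides do not name a specific method, so the z-score approach, the threshold of 2.0, and the transaction amounts below are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return the values lying more than threshold standard deviations
    from the mean (threshold is an illustrative choice)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily transaction amounts; one is far from the rest.
amounts = [20, 22, 19, 21, 23, 20, 250]
outliers = find_outliers(amounts)
print(outliers)
```

Whether a flagged value is noise to discard or an exception worth investigating (as in fraud detection) remains a judgment the analyst has to make.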
Major Issues in Data Mining (1)
Mining methodology
Mining different kinds of knowledge from diverse data
types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing
knowledge: knowledge fusion
Major Issues in Data Mining (2)
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining
results
Interactive mining of knowledge at multiple levels
of abstraction
Applications and social impacts
Domain-specific data mining & invisible data
mining
Protection of data security, integrity, and privacy
Artificial Intelligence (1)
AI is a group of technologies that attempts to mimic
our senses and emulate certain aspects of human
behavior such as reasoning and communication
1956, a conference at Dartmouth College
John McCarthy, Marvin Minsky, Allen Newell and Herbert Simon
(MIT, CMU and Stanford)
1965, H. A. Simon: "machines will be capable, within
twenty years, of doing any work a man can do"
1967, Marvin Minsky: "Within a generation ... the problem
of creating 'artificial intelligence' will substantially be solved"
Heavily funded by DARPA
Artificial Intelligence (2)
Early AI researchers had failed to recognize the difficulty
of some of the problems they faced:
The lack of raw computing power
The intractable combinatorial explosion of their algorithms
The difficulty of representing commonsense knowledge and
doing commonsense reasoning
The incredible difficulty of perception and motion
The failings of logic
First AI Winter
In 1974, DARPA cut off all undirected, exploratory
research in AI
Artificial Intelligence (3)
In the early 80s, the field was revived by the
commercial success of expert systems
By 1985 the market for AI had reached more than
a billion dollars.
Minsky and others warned the community that
enthusiasm for AI had spiraled out of control and
that disappointment was sure to follow
Second AI Winter
The collapse of the Lisp Machine market in 1987
Artificial Intelligence (4)
In the 90s AI achieved its greatest successes
Artificial intelligence was adopted throughout
the technology industry, providing the heavy
lifting for
Data mining
Logistics
Medical diagnosis
…
Expert System
An expert system is an automated type of
analysis or problem-solving model that deals
with a problem the way an "expert" does
The process involves consulting a base of
knowledge or expertise to reason out an answer
based on the characteristics of the problem
Architecture of an ES
[Figure: User <-> User Interface <-> Inference Engine <-> Knowledge Base; the user supplies a description of a problem and receives advice and explanation]
Knowledge Representation
In AI, the primary aim of knowledge
representation is to store knowledge so that
programs can process it and achieve the
verisimilitude of human intelligence
The representation theory has its origin in cognitive
science
Knowledge can be represented in a number of
ways
Case-based reasoning
Artificial neural networks
Stored as rules
Case-based Reasoning (1)
Case-based reasoning
The process of solving new problems based on
the solutions of similar past problems
A case consists of a problem, its solution, and,
typically, annotations about how the solution was
derived
Case-based Reasoning (2)
Case-based reasoning as a four-step process
Retrieve: given a target problem, retrieve cases
from memory that are relevant to solving it
Reuse: map the solution from the previous case to
the target problem
Revise: test the new solution and, if necessary, revise
it
Retain: After the solution has been successfully
adapted to the target problem, store the resulting
experience as a new case in memory
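The four steps above can be sketched as a small loop. The help-desk cases, the feature-matching similarity measure, and the fallback solution are all hypothetical; real systems use richer case descriptions and adaptation rules.

```python
# A case pairs a problem description (feature dict) with its solution.
# These help-desk cases are hypothetical.
case_memory = [
    ({"device": "printer", "symptom": "no power"}, "check power cable"),
    ({"device": "printer", "symptom": "paper jam"}, "clear paper tray"),
    ({"device": "laptop", "symptom": "no power"}, "replace charger"),
]

def similarity(stored, target):
    """Count the features on which two problem descriptions agree."""
    return sum(1 for k in stored if k in target and stored[k] == target[k])

def retrieve(memory, problem):
    """Retrieve: pick the stored case most similar to the target problem."""
    return max(memory, key=lambda case: similarity(case[0], problem))

def solve(memory, problem, works):
    """Run one retrieve/reuse/revise/retain cycle.

    works is a caller-supplied test: works(problem, solution) -> bool.
    """
    _, old_solution = retrieve(memory, problem)   # Retrieve
    proposed = old_solution                       # Reuse the old solution
    if not works(problem, proposed):              # Revise if the test fails
        proposed = "escalate to human expert"
    memory.append((problem, proposed))            # Retain as a new case
    return proposed

new_problem = {"device": "laptop", "symptom": "no power", "age": "old"}
answer = solve(case_memory, new_problem, works=lambda p, s: True)
print(answer)
```

Because each solved problem is stored as a new case, the memory grows with experience, which is what distinguishes this approach from a fixed rule set.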
Supervised vs. Unsupervised Learning
Supervised learning
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
New data is classified based on the training set
Unsupervised learning
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the
aim is to establish the existence of classes or
clusters in the data
Artificial Neural Network (1)
An interconnected group of artificial neurons
Using a mathematical or computational model for
information processing based on a connectionist
approach to computation.
An adaptive system that changes its structure based on
external or internal information that flows through the
network.
ANNs can be used to model complex relationships
between inputs and outputs or to find patterns in
data
Non-linear statistical data modeling or decision making
tools
Artificial Neural Network (2)
Training set:
(1) high salary, owns a house, has a dog,
[profitable customer]
(2) less than 3 years on job, prior bankruptcy,
owns a dog, [deadbeat]
......
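As a minimal sketch of learning from such a training set, the code below trains a single artificial neuron (a perceptron) rather than a full network. Only the two examples listed above are used, since the rest of the training set is elided; the binary feature names are a hypothetical encoding of those examples.

```python
# Binary input features (hypothetical encoding of the two examples).
FEATURES = ["high_salary", "owns_house", "owns_dog",
            "short_tenure", "prior_bankruptcy"]

# +1 = profitable customer, -1 = deadbeat.
TRAINING_SET = [
    ([1, 1, 1, 0, 0], +1),
    ([0, 0, 1, 1, 1], -1),
]

def predict(weights, bias, x):
    """Output +1 if the weighted sum of the inputs is positive, else -1."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else -1

def train(examples, epochs=10):
    """Perceptron rule: shift the weights toward each misclassified example."""
    weights = [0] * len(FEATURES)
    bias = 0
    for _ in range(epochs):
        for x, label in examples:
            if predict(weights, bias, x) != label:
                weights = [w + label * xi for w, xi in zip(weights, x)]
                bias += label
    return weights, bias

weights, bias = train(TRAINING_SET)
print(dict(zip(FEATURES, weights)), bias)
```

In this sketch the weight learned for owning a dog ends at zero, because that feature appears in both examples and so cannot help separate the two classes.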
Rule-based Systems (1)
Knowledge stored as rules
The most commonly used form of rules is the if-then statement
e.g. IF some condition THEN some action
A rule-based inference model: decision tree
Each internal node (non-leaf node) denotes a test
on an attribute
Each branch represents an outcome of the test
Each leaf node holds a class label
Rule-based Systems (2)
Training dataset for decision tree (class attribute: buys_computer)
age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31…40    high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31…40    low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31…40    medium   no       excellent      yes
31…40    high     yes      fair           yes
>40      medium   no       excellent      no
Rule-based Systems (3)
Decision tree for buys_computer
age?
  <=30: student?
    no: buys_computer = no
    yes: buys_computer = yes
  31…40: buys_computer = yes
  >40: credit rating?
    excellent: buys_computer = no
    fair: buys_computer = yes
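Read as nested if-then rules, the buys_computer tree can be checked against the 14 training rows. In the sketch below, the outcomes of the credit-rating branch (excellent gives no, fair gives yes) are inferred from the training data; the function and variable names are illustrative.

```python
# Training rows: (age, income, student, credit_rating, buys_computer).
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def buys_computer(age, student, credit_rating):
    """Walk the tree: test age first, then student or credit rating."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        return "yes"
    # Remaining branch: age > 40.
    return "yes" if credit_rating == "fair" else "no"

mismatches = [row for row in ROWS
              if buys_computer(row[0], row[2], row[3]) != row[4]]
print(len(mismatches))
```

The income attribute never appears in the rules: the tree classifies every training row without testing it.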
Agent-based Modeling
Simulate the behavior that emerges from the
decisions of a large number of distinct
individuals
Computer generated agents, each making
decisions typical of the decisions an individual
would make in the real world
Trying to understand the mysteries of why
businesses, markets, consumers, and other
complex systems behave as they do
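A minimal sketch of the idea follows, assuming a deliberately simple and hypothetical decision rule: consumers sit in a ring, and each one adopts a product once either neighbour has adopted it. Real agent-based models use far richer agents and decision rules.

```python
def step(adopted):
    """One round: each agent decides using only its own state and
    the states of its two neighbours on the ring."""
    n = len(adopted)
    return [
        adopted[i] or adopted[(i - 1) % n] or adopted[(i + 1) % n]
        for i in range(n)
    ]

def simulate(n_agents=10, rounds=5):
    """Return the number of adopters after each round, starting with
    a single adopter."""
    adopted = [False] * n_agents
    adopted[0] = True
    history = [sum(adopted)]
    for _ in range(rounds):
        adopted = step(adopted)
        history.append(sum(adopted))
    return history

history = simulate()
print(history)
```

No agent is told to spread the product through the whole market; the market-wide pattern emerges from many local decisions, which is the behavior such models are built to study.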
Toward the Real-Time Enterprise
The essence of the phrase real-time
enterprise is that organizations can know how
they are doing at the moment
Digitization and automation of some crucial
enterprise activities traditionally completed by
people
Esp. information analysis
Better sense-and-response
Real-time Reporting
Real-time reporting is occurring on a whole host
of fronts including:
Enterprise nervous systems
A network that connects people, applications and devices
To coordinate company operations
Straight-through processing
To reduce distortion in supply chains
Real-time CRM
To automate decision making relating to customers
Communicating objects
To gain real-time data about the physical world
E.g. radio frequency identification (RFID) devices
The Dark Side of Real Time
Object-to-object communication could
compromise privacy
Knowing the exact location of a company truck
every minute of the day is an invasion of the driver's
privacy
In the era of speed, a situation can become
very bad very fast
E.g. "circuit breakers" to stop deep dives on the NYSE