Transcript Document

Database Access and Techniques,
most notably Data Mining.
DSS
Humans:
Decision
Making
Process
Tools:
Computers and
IT. VB, VBA,
Excel, InterDev,
Etc.
DSS
Data:
Facts pertinent
to the decision
at hand.
Algorithms:
Math/Flow Chart
stuff that helps the
tools help the humans
make decisions.
Warehouses & Marts
• Data Warehouse
– a database designed to support decision-making in an organization.
It is batch-updated and structured for fast online queries and
exploration. Data warehouses may aggregate enormous amounts
of data from many different operational systems.
• Data Mart
– a database focused on addressing the concerns of a specific
problem or business unit (e.g. Marketing, Engineering). Size
doesn’t define data marts, but they tend to be smaller than data
warehouses.
Data Warehouses & Data Marts
3rd party data
Data Mart
(Marketing)
TPS
& other
operational
systems
Data
Warehouse
= operational clients
= query, OLAP, mining, etc.
Data Mart
(Engineering)
Data Warehousing
• Physical separation of operational and decision
support environments
• Purpose: to establish a data repository making
operational data accessible
• Transforms operational data to relational form
• Only data needed for decision support come from
the TPS
• Data are transformed and integrated into a
consistent structure
• Data warehousing (or information warehousing): a
solution to the data access problem
• End users perform ad hoc query, reporting analysis
and visualization
Data Warehousing Benefits
• Increase in knowledge worker productivity
• Supports all decision makers’ data
requirements
• Provide ready access to critical data
• Insulates operation databases from ad hoc
processing
• Provides high-level summary information
• Provides drill down capabilities
Yields
–
–
–
–
–
Improved business knowledge
Competitive advantage
Enhances customer service and satisfaction
Facilitates decision making
Help streamline business processes
DW Suitability
For organizations where
• Data are in different systems
• Information-based approach to management in
use
• Large, diverse customer base
• Same data have different representations in
different systems
• Highly technical, messy data formats
Transform Data from TPS to
Warehouse
• Consolidate data
– e.g. from multiple TPS around the country/world
• “Scrub” the data
– keep definitions consistent (e.g. translate part
numbers/product names if they differ per country)
• Calculate fields (decrease processor load)
• Summarize fields (decrease processor load)
• De-normalize data (ease of use)
OLAP: Data Access and Mining,
Querying and Analysis
Online Analytical processing (OLAP)
– DSS and EIS computing done by end-users in
online systems
– Versus online transaction processing (OLTP)
OLAP Activities
–
–
–
–
Generating queries
Requesting ad hoc reports
Conducting statistical analyses
Building multimedia applications
OLAP uses the data warehouse
and a set of tools, usually with
multidimensional capabilities
• Query tools
• Spreadsheets
• Data mining tools
• Data visualization tools
Using SQL for Querying
• SQL (Structured Query Language)
Data language
English-like, nonprocedural, very user friendly
language
Free format
Example:
SELECT
FROM
WHERE
Name, Salary
Employees
Salary >2000
Sample of SQL Statements
Natural Language
SQL
List of all purchases of L.B.
University since January of 1996, in
terms of products, prices, and
quantities
SELECT PRODUCTS
PURCH PRICE
QUANTITY FROM
PURCHASE-HIST
WHERE CUST-NAME EQ
L.B. UNIVERSITY AND
PURCH-DATE GE 01/01/96
List the price of cotton shirts,
SELECT PRICE,
medium size, with short sleeves and AMOUNT- AVAIL FROM
white color
PRODUCT WHERE
PROD-NAME EQ
COTTON SHIRT AND
SIZE EQ MEDIUM AND
STYLE EQ SHORT
SLEEVES AND COLOR
EQ WHITE
Query Tools & OLAP
• Query Tools
– user-lead discovery. Can return individual records or summaries.
Requests are formulated in advance (e.g. “show me all delinquent
accounts in the northeast region during Q1”).
• OLAP - Online Analytical Processing
– user-lead discovery. Data is explored via “drill down” into the data
by selecting variables to summarize on. Results are usually
reported in a cross-tab report or graph (e.g. “show me a tabular
breakdown of sales by business unit, product type, and year”).
OLAP
• Online Analytical Processing. (example of
cross-tab results presented below)
1. business unit
2. product type
3. year
D a ta
S o u rc e s
Bu sin ess
Com m u n ica t ion
Qu er yin g
In t er n a l
Da t a
Sou r ces
E xt er n a l
Da t a
Sou r ces
Da t a
Acqu isit ion ,
E xt r a ct ion ,
Deliver y
Tr a n sfor m a t ion
Repor t
Gen er a t ion
Da t a
Wa r eh ou se
Spr ea dsh eet
F or eca st in g
An a lysis
Modelin g
Da t a
P r esen t a t ion
and
Visu a liza t ion
Mu lt im edia
E IS,
Ot h er s
On lin e An a lyt ica l P r ocessin g
Data Mining
• automated information discovery process,
uncovers important patterns in existing data
– can use neural networks or other approaches.
Requires ‘clean’, reliable, consistent data.
Historical data must reflect the current
environment.
• e.g. “What are the characteristics that identify
when we are likely to lose a customer?”
Data Mining
For
•
•
•
•
•
•
•
Knowledge discovery in databases
Knowledge extraction
Data archeology
Data exploration
Data pattern processing
Data dredging
Information harvesting
Data Mining Examples
•
•
•
•
•
•
Market Segmentation - e.g. Dayton Hudson
Direct Marketing - e.g. Chase
Market basket analysis - e.g. Wal-Mart
Customer Churn - e.g. Fleet Bank
Fraud Detection - e.g. Bank of America
Cost Reduction Prospecting - e.g. Merk
Medco.
Major Data Mining
Characteristics and Objectives
• Data are often buried deep
• Client/server architecture
• Sophisticated new tools--including advanced
visualization tools--help to remove the information “ore”
• Massaging and synchronizing data
• Usefulness of “soft” data
• End-user minor is empowered by “data drills” and other
power query tools with little or no programming skills
• Often involves finding unexpected results
• Tools are easily combined with spreadsheets etc.
• Parallel processing for data mining
Data Mining Application Areas
•
•
•
•
•
•
•
•
•
•
•
•
•
Marketing
Banking
Retailing and sales
Manufacturing and production
Brokerage and securities trading
Insurance
Computer hardware and software
Government and defense
Airlines
Health care
Broadcasting
Law Enforcement
… almost everywhere!!!
Stupid Data-Miner Tricks
• Ad-Hoc Theories
– when an oddity jumps out of the data, it’s tempting to
develop a theory for it. Sometimes findings are just
statistical flukes.
• Using Too Many Variables
– the more factors considered, the more likely a relationship
will be found - valid or not.
• Not Taking No for an Answer
– it’s OK to stop looking if you can’t find anything. There are
no silver bullets.
• Limited or incorrect interpretation