Transcript Slide 1
Chapter 1: Introduction to Data Mining,
Warehousing, and Visualization
Modern Data Warehousing, Mining,
and Visualization: Core Concepts
by George M. Marakas
© 2003, Prentice-Hall
Chapter 1 - 1
1-1: The Modern Data Warehouse
A data warehouse is a copy of transaction
data specifically structured for querying,
analysis and reporting
Note that the data warehouse contains a
copy of the transactions. These are not
updated or changed later by the transaction
system.
Also note that this data is specially structured,
and may have been transformed when it was
placed in the warehouse
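As an illustration of "copied and transformed on the way in", here is a minimal sketch using Python's built-in sqlite3 module. The table name sales_fact, the column layout and the sample rows are assumptions made for the example; they do not come from the text.

```python
import sqlite3

def load_warehouse(source_rows):
    """Copy transaction rows into a query-oriented table, transforming them at load time."""
    dw = sqlite3.connect(":memory:")
    dw.execute(
        "CREATE TABLE sales_fact "
        "(order_id INTEGER, sale_date TEXT, region TEXT, amount_usd REAL)"
    )
    for order_id, sale_date, region, amount_cents in source_rows:
        # Transformation happens once, here: normalize the region code
        # and convert cents to dollars.
        dw.execute(
            "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
            (order_id, sale_date, region.strip().upper(), amount_cents / 100.0),
        )
    dw.commit()
    return dw

# Hypothetical rows as they might arrive from a transaction system.
transactions = [
    (1, "2003-01-05", " east", 1250),
    (2, "2003-01-05", "West ", 900),
    (3, "2003-01-06", "EAST", 2000),
]

dw = load_warehouse(transactions)
totals = dw.execute(
    "SELECT region, SUM(amount_usd) FROM sales_fact GROUP BY region ORDER BY region"
).fetchall()
print(totals)  # [('EAST', 32.5), ('WEST', 9.0)]
```

The warehouse copy is only ever queried after loading; the transaction system never updates it.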
Chapter 1 - 2
1-2: Data Warehouse Roles and Structures
The DW has the following primary functions:
It is a direct reflection of the business rules of
the enterprise.
It is the collection point for strategic information.
It is the historical store of strategic information.
It is the source of information later delivered to
data marts.
It is the source of stable data regardless of how
the business processes may change.
Chapter 1 - 3
Position of the Data Warehouse Within the Organization
Chapter 1 - 4
Data Marts
A data mart is a smaller, more focused data
warehouse. It reflects the business rules of a
specific business unit.
The data mart does not need to cleanse its
data because that was done when it went into
the warehouse.
It is a set of tables for direct access by users.
These tables are designed for aggregation.
It typically is not a source for traditional
statistical analysis.
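A data mart table designed for aggregation might be built roughly as follows. This is a sketch under assumptions: the table names dw_sales and mart_summary, the columns and the sample rows are all hypothetical, and the warehouse rows are taken to be already cleansed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dw_sales (sale_date TEXT, department TEXT, product TEXT, amount REAL)"
)
# Hypothetical, already-cleansed warehouse rows.
conn.executemany(
    "INSERT INTO dw_sales VALUES (?, ?, ?, ?)",
    [
        ("2003-01-05", "marketing", "widget", 10.0),
        ("2003-01-05", "marketing", "widget", 15.0),
        ("2003-01-06", "marketing", "gadget", 20.0),
        ("2003-01-06", "finance", "widget", 5.0),
    ],
)

def build_department_mart(conn, department):
    """Rebuild a summarized table holding one business unit's view of the warehouse."""
    conn.execute("DROP TABLE IF EXISTS mart_summary")
    conn.execute(
        "CREATE TABLE mart_summary (product TEXT, n_sales INTEGER, total_amount REAL)"
    )
    # No cleansing step here: the mart only filters and aggregates warehouse data.
    conn.execute(
        "INSERT INTO mart_summary "
        "SELECT product, COUNT(*), SUM(amount) FROM dw_sales "
        "WHERE department = ? GROUP BY product",
        (department,),
    )
    return conn.execute("SELECT * FROM mart_summary ORDER BY product").fetchall()

marketing_mart = build_department_mart(conn, "marketing")
print(marketing_mart)  # [('gadget', 1, 20.0), ('widget', 2, 25.0)]
```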
Chapter 1 - 5
Position of the Data Mart Within the Organization
Data Delivery
Data Mart
Decision
Support
Information
Data Mart
Decision
Support
Information
Data Mart
Decision
Support
Information
Chapter 1 - 6
1-3: What Can a Data Warehouse Do?
Some of the benefits of a DW are:
Immediate information delivery
Data integration from across and even outside
the organization
Future vision from historical trends
Tools for looking at data in new ways
Freedom from IS department resource
limitations (you don’t need programmers to use
a data warehouse)
Chapter 1 - 7
Examples of Common DW Applications
Sales Analysis
Determine real-time product sales to make vital pricing and distribution decisions.
Analyze historical product sales to determine success or failure attributes.
Evaluate successful products and determine key success factors.
Use corporate data to understand the margin as well as the revenue implications of a decision.
Rapidly identify preferred customer segments based on revenue and margin.
Quickly isolate past preferred customers who no longer buy.
Identify daily what product is in the manufacturing and distribution pipeline.
Instantly determine which salespeople are performing, on both a revenue and margin basis, and which are
behind.
Financial Analysis
Compare actual to budgets on an annual, monthly and month-to-date basis.
Review past cash flow trends and forecast future needs.
Identify and analyze key expense generators.
Instantly generate a current set of key financial ratios and indicators.
Receive near-real-time, interactive financial statements.
Human Resource Analysis
Evaluate trends in benefit program use.
Identify the wage and benefits costs to determine company-wide variation.
Review compliance levels for EEOC and other regulated activities.
Other Areas
Warehouses have also been applied to areas such as: logistics, inventory, purchasing, detailed transaction
analysis and load balancing.
Chapter 1 - 8
What Does All This Mean?
On a daily basis, organizations turn to their
data warehouses to answer a limitless variety
of questions.
Nothing is free, however, and these benefits
do come with a cost.
The value of a data warehouse is a result of
the new and changed business processes it
enables.
There are limitations, though. A DW cannot
correct problems with the data, although it
may help to clearly identify them.
Chapter 1 - 9
Comparison of Typical DW Costs and Benefits
Costs
Hardware, software, development personnel and consultant costs.
Operational costs like ongoing systems maintenance.
Benefits
Added Revenue
Will the new (business objective) process generate new customers (what is the
estimated value?)
Will the new (business objective) process increase the buying propensity of
existing customers (by how much?)
Is the new process necessary to ensure that the competition doesn't offer a
demanded service that you can't match?
Reduced costs
What costs of current systems will be eliminated?
Is the new process intended to make some operation more efficient? If so, how
and what is the dollar value?
Chapter 1 - 10
1-4: The Cost of Warehousing Data
Expenditures can be categorized as one-time
initial costs or as recurring, ongoing costs.
The initial costs can further be identified as
for hardware or software.
Expenditures can also be categorized as
capital costs (associated with acquisition of
the warehouse) or as operational costs
(associated with running and maintaining the
warehouse)
Chapter 1 - 11
Expenditures Associated with Building a DW
Recurring Costs
  Capital
    Hardware maintenance
    Software maintenance
    Terminal analysis
    Middleware
  Operational
    Ongoing refreshment
    Integration transformation
    Data model maintenance
    Record identification maintenance
    Metadata infrastructure maintenance
    Archival of data
    Data aging within the DW
One-Time Costs
  Hardware
    Disk
    CPU
    Network
    Terminal analysis
  Software
    DBMS
    Terminal analysis
    Middleware
    Network
    Log utility
    Processing
    Metadata infrastructure
  Integration/transformation processing specification
  Metadata infrastructure population
  System of record definition
  Data dictionary language definition
  Network transfer definition
  CASE/Repository interface
  Initial data warehouse population
  Data model definition
  Database design definition
Chapter 1 - 12
Costs Are Highly Variable
A company that spends less money on its
data warehouse is often happier with it.
The main justification for the development
expense is that a DW reduces the cost of
accessing the information owned by the
organization.
Since information has to be retrieved just
once (when it is placed in the warehouse),
DW users see a lower cost on each report
generated.
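The per-report argument can be made concrete with a small worked example. This is a simplified sketch: the cost units are hypothetical, the function names are invented for illustration, and the model ignores ongoing refresh costs and assumes the same retrieval cost in both setups.

```python
def multidatabase_cost(n_reports, extract_cost, retrieval_cost):
    """Without a warehouse, every report pays for extraction/transformation and retrieval."""
    return n_reports * (extract_cost + retrieval_cost)

def warehouse_cost(n_reports, extract_cost, retrieval_cost):
    """With a warehouse, extraction/transformation is paid once, when the data is loaded."""
    return extract_cost + n_reports * retrieval_cost

# Hypothetical cost units: 40 to extract and transform, 5 to retrieve.
for n in (1, 10):
    print(n, multidatabase_cost(n, 40, 5), warehouse_cost(n, 40, 5))
# 1 45 45
# 10 450 90
```

With a single report the two approaches cost the same in this model; the gap opens as the number of reports grows.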
Chapter 1 - 13
Typical Multidatabase Report and Screen Generation
Data download and transformation contribute to
retrieval costs for every report or screen
generated.
[Figure: reports and screens drawn from Source Systems A, B, C and D]
Chapter 1 - 14
Typical DW Report and Screen Generation
Data upload and transformation costs occur
just once. Retrieval costs are lower.
[Figure: Source Systems A, B, C and D; Organizational Data Warehouse]
Chapter 1 - 15
Farmers and Explorers
Every corporation has two types of DW users.
Farmers know what they want before they set
out to find it. They submit small queries and
retrieve small nuggets of information.
Explorers are quite unpredictable. They often
submit large queries. Sometimes they find
nothing, sometimes they find priceless
nuggets.
Cost justification for the DW is usually done
on the basis of the results obtained by
farmers since explorers are unpredictable.
Chapter 1 - 16
1-5: Data Marts and the Data Warehouse
Legacy systems feed data to the warehouse.
The warehouse feeds specialized information
to departments.
[Figure: Legacy Systems; Operational Data Stores; Organizational Data Warehouse; Finance, Sales, Marketing and Accounting Data Marts]
Chapter 1 - 17
The Data Mart is More Specialized
The data mart serves the needs of one
business unit, not the organization.

Organizational Data Warehouse        Data Marts
Corporate                            Departmentalized
Highly granular data                 Summarized, aggregated data
Normalized design                    Star join design
Robust historical data               Limited historical data
Large data volume                    Limited data volume
Data Model driven data               Requirements driven data
Versatile                            Focused on departmental needs
General purpose DBMS technologies    Multi-dimensional DBMS technologies

[Figure: Organizational Data Warehouse; Finance, Marketing, Accounting and Sales Data Marts]
Chapter 1 - 18
1-6: Foundations of Data Mining
Data mining is the process of using raw data
to infer important business relationships.
Despite a consensus on the value of data
mining, a great deal of confusion exists about
what it is.
It is a collection of powerful techniques
intended for analyzing large datasets.
There is no single data mining approach, but
rather a set of techniques that can be used in
combination with each other.
Chapter 1 - 19
1-7: The Roots of Data Mining
The approach has roots in practice dating
back over 30 years.
In the early 1960s, data mining was called
statistical analysis, and the pioneers were
statistical software companies such as SAS
and SPSS.
By the 1980s, the traditional techniques had
been augmented by new methods such as
fuzzy logic, heuristics and neural networks.
Chapter 1 - 20
A General Approach
Although all data mining endeavors are unique,
they possess a common set of process
steps:
1. Infrastructure preparation – choice of
hardware platform, the database system and
one or more mining tools
2. Exploration – looking at summary data,
sampling and applying intuition
3. Analysis – each discovered pattern is
analyzed for significance and trends
Chapter 1 - 21
A General Approach (continued)
4. Interpretation – Once patterns have been
discovered and analyzed, the next step is to
interpret them. Considerations include
business cycles, seasonality and the
population the pattern applies to.
5. Exploitation – this is both a business and a
technical activity. One way to exploit a
pattern is to use it for prediction. Others are
to package, price or advertise the product in
a different way.
Chapter 1 - 22
1-8: The Approach to Data Exploration and Data Mining
The basis for all data mining activities is
correlation.
[Figure: three plots of A against B, labeled A Perfect Correlation, A Strong Correlation and A Weak Correlation]
Chapter 1 - 23
The Spectrum of Correlation
[Scale: 1 = Perfect Correlation, .5 = Moderate Correlation, 0 = No Correlation]
In general, the magnitude of a correlation
coefficient is a number between 0 and 1 that
shows the strength of a relationship.
Some types of correlation are signed (±) to
also show the direction of the relationship,
so they range from -1 to +1.
Even a weak correlation can be interesting,
however, if it shows a trend over time.
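As an illustration of a signed coefficient, here is a minimal sketch of Pearson's r, one common correlation measure. The slides do not name a specific measure, so the choice of Pearson's r, the function name pearson_r and the sample values are assumptions.

```python
import math

def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)

a = [1, 2, 3, 4]
print(pearson_r(a, [2, 4, 6, 8]))   # perfect positive relationship, r = 1
print(pearson_r(a, [8, 6, 4, 2]))   # perfect negative relationship, r = -1
print(pearson_r(a, [2, 1, 4, 3]))   # weaker positive relationship
```

The sign gives the direction of the relationship and the magnitude gives its strength, matching the 0-to-1 scale above.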
Chapter 1 - 24
Methods to Determine Correlation
The method used depends on the type of
elements being correlated.
Data element vs. data element
Data element vs. unit of time
Data element vs. data element groups
Data element vs. geography
Data element vs. external trends
Data element vs. demographics
Chapter 1 - 25
The Data Warehouse and Data Mining
Data mining does not require the use of a
warehouse, but it may be the best foundation
for mining.
If multiple analyses are run in sequence, the
data need to be held constant (as in a DW).
In an operational database, data change
often.
Also important is that the data in the DW is
integrated and stable
Chapter 1 - 26
Volumes of Data – The Biggest Challenge
The largest challenge a data miner may face
is the sheer volume of data in the warehouse.
It is quite important, then, that summary data
also be available to get the analysis started.
A major problem is that this sheer volume
may mask the important relationships the
analyst is interested in.
The ability to overcome the volume and
visualize the data becomes quite important.
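One way to get an analysis started on a large table is to look at summary data and a small sample before touching the full detail. The sketch below uses only the Python standard library; the function names, the (region, amount) row layout and the sample rows are hypothetical.

```python
import random
import statistics
from collections import defaultdict

def summarize(rows):
    """Group (category, value) pairs and return count, mean and max per category."""
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)
    return {
        category: {
            "count": len(values),
            "mean": statistics.mean(values),
            "max": max(values),
        }
        for category, values in groups.items()
    }

def sample_rows(rows, k, seed=0):
    """Draw a reproducible random sample of at most k rows."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))

# Hypothetical detail rows: (region, sale amount).
detail = [("east", 10.0), ("east", 20.0), ("west", 5.0), ("west", 7.0), ("west", 9.0)]
summary = summarize(detail)
print(summary["east"])  # {'count': 2, 'mean': 15.0, 'max': 20.0}
print(sample_rows(detail, 2))
```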
Chapter 1 - 27
1-9: Foundations of Data Visualization
One of the earliest known examples of data
visualization was in London during the 1854
cholera epidemic. A map (next slide) helped
to identify the source of the disease.
Modern visualization techniques grew from
the twin technologies of computer graphics
and high performance computing in the
1970s and 1980s.
One computer scientist who saw this trend
arising was Douglas Engelbart in the 1950s.
Chapter 1 - 28
Dr. John Snow used a map to show that the
source of cholera was a water pump, thus
proving the disease was waterborne.
[Figure: map marking the Broad Street Pump]
Chapter 1 - 29
Opportunity and Timing
Alternative input devices (light pen, sketch pad
and mouse) began to appear in the 1960s.
In the 1970s, flight simulators became much
more realistic when graphics replaced film.
In the same decade, special effects computers
became entrenched in the entertainment
industry.
In the 1980s, visualization grew more dynamic
with applications like the animation of Los
Angeles smog patterns.
Chapter 1 - 30
One of today’s
more useful
types of
visualization is
in simulators
(both in games
and in practice).
This is the only
way most of us
will ever fly a
Boeing 747.
Chapter 1 - 31
It is now both
cheaper and
safer to train
commercial
pilots on
simulators.
With good
software, pilots
can be placed in
situations they
may not ever
see – until too
late – in the
cockpit.
Chapter 1 - 32
A Sequence of Frames Animating LA Smog
Day 1 Swirling Winds – Light Smog Particles
Day 2 Offshore Winds – Moderate Smog Particles
Day 3 Head-on View of Smog Particles and Streamlines
Chapter 1 - 33
Number Crunching With a Difference
In the 1990s, rapid advances in chip
technology, both at the CPU and the graphics
processor, put data visualization everywhere.
Imagine trying to understand DNA sequences
from just the numbers!
On the next slide, a Mapuccino display helps
us see where the results from a text search
come from.
Chapter 1 - 34
Chapter 1 - 35