data warehouse fundamentals
CHAPTER 7
Databases and Data
Warehouses
Opening Case:
It Takes a Village to Write
an Encyclopedia
McGraw-Hill-Ryerson
©2011 The McGraw-Hill Companies, All Rights Reserved
Chapter Seven Overview
• SECTION 7.1 – DATABASES
– Organizational Data and Information
– Storing Transactional Information
– Relational Database Fundamentals
– Relational Database Advantages
– Database Management Systems
– Integrating Data Among Multiple Databases
• SECTION 7.2 – DATA WAREHOUSING
– History of Data Warehousing
– Data Warehouse Fundamentals
– Business Intelligence
– Operational, Tactical, and Strategic BI
– Data Mining
– Business Benefits of BI
Copyright © 2011 McGraw-Hill Ryerson Limited
LEARNING OUTCOMES
1. Understand the defining value characteristics of both
transactional data and analytical information, and the
need for organizations to have data and information
that are timely and of high quality.
2. Describe relational database fundamentals and
advantages.
3. Understand how users interact with a database
management system, the advantage of data-driven
Web sites, and the primary methods of integrating
data and information across multiple databases in
organizations.
LEARNING OUTCOMES
4. Describe data warehouse fundamentals and
advantages.
5. Understand business intelligence, data mining, and
the relationship between business intelligence and
data warehousing.
SECTION 7.1
DATABASES
ORGANIZATIONAL DATA AND
INFORMATION
• Data are raw facts that describe the
characteristics of an event.
• Information is data converted into a
meaningful and useful context.
ORGANIZATIONAL DATA AND
INFORMATION
• Information granularity – refers to the
extent of detail within the information (fine
and detailed or coarse and abstract)
– Levels
– Formats
– Granularities
ORGANIZATIONAL DATA AND
INFORMATION
The Value of Transactional Data
and Analytical Information
• Transactional data encompasses all of the data
contained within a single business process or unit
of work, and its primary purpose is to support the
performing of daily operational tasks.
• Analytical information encompasses all
organizational information, and its primary
purpose is to support the performing of higher-level analysis tasks.
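The contrast between the two can be sketched in a few lines of Python. The sales records, field names, and the revenue_by_region helper below are hypothetical and only illustrate how individual transactions roll up into analytical information.

```python
from collections import defaultdict

# Hypothetical transactional data: one record per sale (a single unit of work).
sales = [
    {"order_id": 1, "product": "cola", "region": "east", "amount": 12.0},
    {"order_id": 2, "product": "cola", "region": "west", "amount": 8.0},
    {"order_id": 3, "product": "water", "region": "east", "amount": 5.0},
]


def revenue_by_region(records):
    """Summarize individual transactions into analytical information."""
    totals = defaultdict(float)
    for record in records:
        totals[record["region"]] += record["amount"]
    return dict(totals)


# Analytical information: totals that support higher-level analysis.
summary = revenue_by_region(sales)
```

Each row in sales supports a daily operational task, while summary is the kind of aggregate a manager might use for analysis.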
The Value of Transactional Data
and Analytical Information
The Value of Timely Data and
Information
• Real-time is immediate
• Real-time data
• Real-time information
• Real-time system
The Value of Quality Data and
Information
The Value of Quality Data and
Information
Low-quality information example
The Value of Quality Data and
Information
• The four primary sources of low-quality
information include:
1. Online customers intentionally enter inaccurate
information to protect their privacy
2. Data or information from different systems have
different entry standards and formats
3. Call centre operators enter abbreviated or erroneous
information by accident or to save time
4. Third party and external information contains
inconsistencies, inaccuracies, and errors
Understanding the Costs of
Poor Information
• Potential business effects resulting from
low quality information include:
– Inability to accurately track customers
– Difficulty identifying valuable customers
– Inability to identify selling opportunities
– Marketing to nonexistent customers
– Difficulty tracking revenue due to inaccurate
invoices
– Inability to build strong customer relationships
Understanding the Benefits of
Good Information
• High-quality information can significantly
improve the chances of making a good
decision
• Good decisions can directly impact an
organization's bottom line
RELATIONAL DATABASE
FUNDAMENTALS
• Information is everywhere in an
organization
• Information is stored in databases
– Database – maintains information about
various types of objects (inventory), events
(transactions), people (employees), and places
(warehouses)
RELATIONAL DATABASE
FUNDAMENTALS
• Database models include:
– Hierarchical database model – information is
organized into a tree-like structure (using
parent/child relationships) in such a way that it
cannot have too many relationships
– Network database model – a flexible way of
representing objects and their relationships
– Relational database model – stores
information in the form of logically related two-dimensional tables
Entities, Entity Classes, and
Attributes
• Entity – a person, place, thing, transaction, or
event about which information is stored
– The rows in each table contain the entities
– In Figure 7.5 CUSTOMER includes Dave’s Sub Shop
and Pizza Palace entities
• Entity class (table) – a collection of similar
entities
– In Figure 7.5 CUSTOMER, ORDER, ORDER LINE,
DISTRIBUTOR, and PRODUCT entity classes
Entities, Entity Classes, and
Attributes
• Attributes (fields, columns) – characteristics
or properties of an entity class
– The columns in each table contain the attributes
– In Figure 7.5 attributes for CUSTOMER include:
• Customer ID
• Customer Name
• Contact Name
• Phone
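As a minimal sketch, the CUSTOMER entity class can be modelled as a table using Python's built-in sqlite3 module. The column names follow the attributes listed above; the contact names and phone numbers are invented placeholders, not values from Figure 7.5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Entity class (table): CUSTOMER. Attributes (columns) follow the slide.
conn.execute(
    """CREATE TABLE customer (
           customer_id   INTEGER,
           customer_name TEXT,
           contact_name  TEXT,
           phone         TEXT
       )"""
)

# Each row is one entity. Contact names and phone numbers are made up.
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [
        (1, "Dave's Sub Shop", "Contact A", "555-0100"),
        (2, "Pizza Palace", "Contact B", "555-0101"),
    ],
)

# PRAGMA table_info returns one row per column; index 1 is the column name.
attributes = [row[1] for row in conn.execute("PRAGMA table_info(customer)")]
entities = [
    row[0]
    for row in conn.execute(
        "SELECT customer_name FROM customer ORDER BY customer_id"
    )
]
```

Here attributes lists the columns of the table and entities lists the rows stored in it.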
Entities, Entity
Classes, and
Attributes
Potential relational
database for Coca-Cola
Keys and Relationships
• Primary keys and foreign keys identify the
various entity classes (tables) in the
database
– Primary key – a field (or group of fields) that
uniquely identifies a given entity in a table
– Foreign key – a primary key of one table that
appears as an attribute in another table and
acts to provide a logical relationship between
the two tables
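A short sqlite3 sketch of how the two kinds of key work together. The table layout and the try_insert_order helper are assumptions made for illustration; note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# customer_id is the primary key of customer ...
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT)"
)
# ... and appears in orders as a foreign key linking the two tables.
conn.execute(
    """CREATE TABLE orders (
           order_id    INTEGER PRIMARY KEY,
           customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
       )"""
)
conn.execute("INSERT INTO customer VALUES (1, 'Pizza Palace')")
conn.execute("INSERT INTO orders VALUES (100, 1)")


def try_insert_order(order_id, customer_id):
    """Return True if the order is accepted, False if a key rule rejects it."""
    try:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, customer_id))
        return True
    except sqlite3.IntegrityError:
        return False


# The foreign key lets the two tables be joined back together.
joined = conn.execute(
    "SELECT o.order_id, c.customer_name "
    "FROM orders o JOIN customer c ON c.customer_id = o.customer_id"
).fetchall()
```

An order that reuses an existing order_id, or that names a customer_id with no matching customer row, is rejected by the database itself.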
RELATIONAL DATABASE
ADVANTAGES
• Database advantages from a business
perspective include
– Increased flexibility
– Increased scalability and performance
– Reduced redundancy
– Increased integrity (quality)
– Increased security
Increased Flexibility
• A well-designed database should:
– Handle changes quickly and easily
– Provide users with different views
– Have only one physical view
• Physical view – deals with the physical storage of
information on a storage device
– Have multiple logical views
• Logical view – focuses on how users logically
access information
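One way to picture one physical view with multiple logical views is a single stored table exposed through several SQL views. The employee table, its rows, and the view names below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One physical table holds the stored information.
conn.execute("CREATE TABLE employee (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("Ana", "Sales", 50000.0), ("Ben", "Sales", 70000.0), ("Cy", "IT", 65000.0)],
)

# Logical view 1: a staff directory that hides salaries.
conn.execute("CREATE VIEW staff_directory AS SELECT name, department FROM employee")

# Logical view 2: payroll totals per department.
conn.execute(
    "CREATE VIEW payroll_summary AS "
    "SELECT department, SUM(salary) AS total_salary "
    "FROM employee GROUP BY department"
)

directory = conn.execute("SELECT * FROM staff_directory ORDER BY name").fetchall()
payroll = dict(conn.execute("SELECT department, total_salary FROM payroll_summary"))
```

Both views read the same stored rows, so a change to the employee table shows up in each of them without any copying.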
Increased Scalability and
Performance
• A database must scale to meet increased
demand, while maintaining acceptable
performance levels
– Scalability – refers to how well a system can
adapt to increased demands
– Performance – measures how quickly a
system performs a certain process or
transaction
Reduced Redundancy
• Databases reduce information
redundancy
– Redundancy – the duplication of information
or storing the same information in multiple
places
• Inconsistency is one of the primary
problems with redundant information
Increased Integrity (Quality)
• Information integrity – measures the quality of
information
• Integrity constraint – rules that help ensure the
quality of information
– Relational integrity constraint – rule that enforces
basic and fundamental information-based constraints
– Business-critical integrity constraint – rule that
enforces business rules vital to an organization’s
success and often requires more insight and knowledge
than relational integrity constraints
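Both kinds of constraint can be sketched as column rules in sqlite3. The order_line table is hypothetical, and the 30 percent discount cap is an invented business rule used only to stand in for a business-critical constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE order_line (
           -- basic, relational-style rules: values must exist and make sense
           order_id     INTEGER NOT NULL,
           quantity     INTEGER NOT NULL CHECK (quantity > 0),
           -- hypothetical business rule: discounts may not exceed 30 percent
           discount_pct REAL NOT NULL CHECK (discount_pct BETWEEN 0 AND 30)
       )"""
)


def insert_line(order_id, quantity, discount_pct):
    """Return True if the row satisfies every constraint, False otherwise."""
    try:
        conn.execute(
            "INSERT INTO order_line VALUES (?, ?, ?)",
            (order_id, quantity, discount_pct),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

Calling insert_line(1, 5, 10.0) succeeds, while a zero quantity, a missing order_id, or a 45 percent discount is refused before it can lower the quality of the stored information.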
Increased Security
• Information is an organizational asset and must
be protected
• Databases offer several security features
including:
– Password – provides authentication of the user
– Access level – determines who has access to the
different types of information
– Access control – determines types of user access,
such as read-only access
DATABASE MANAGEMENT
SYSTEMS
• Database management systems (DBMS) –
software through which users and application
programs interact with a database
Data-Driven Web Sites
• Data-driven Web site – an interactive Web site kept
updated and relevant to the needs of its customers.
Data-Driven Web Site Advantages
Querying Data-Driven Web Sites
INTEGRATING DATA AMONG
MULTIPLE DATABASES
• Integration – allows separate systems to
communicate directly with each other
– Forward integration – takes information
entered into a given system and sends it
automatically to all downstream systems and
processes
– Backward integration – takes information
entered into a given system and sends it
automatically to all upstream systems and
processes
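The two directions can be sketched with a simple ordered list of systems. The system names, their order, and both helper functions are assumptions for illustration; real integrations would send records over interfaces between the systems rather than append to in-memory lists.

```python
# Hypothetical chain of systems, ordered from upstream to downstream.
SYSTEMS = ["sales", "order_entry", "fulfillment", "billing"]


def integration_targets(source, direction):
    """Return the systems that should receive a record entered in `source`."""
    position = SYSTEMS.index(source)
    if direction == "forward":
        return SYSTEMS[position + 1:]   # everything downstream of the source
    if direction == "backward":
        return SYSTEMS[:position]       # everything upstream of the source
    raise ValueError(f"unknown direction: {direction}")


def propagate(record, source, direction, stores):
    """Copy `record` into the store of every target system."""
    for system in integration_targets(source, direction):
        stores.setdefault(system, []).append(record)
    return stores
```

A record entered in order_entry flows to fulfillment and billing under forward integration, and back to sales under backward integration.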
INTEGRATING DATA
AMONG MULTIPLE DATABASES
• Forward and backward integration
INTEGRATING DATA
AMONG MULTIPLE DATABASES
• Building a central repository specifically
for integrated information
OPENING CASE QUESTIONS
It Takes a Village to Write an Encyclopedia
1. Determine if an entry in Wikipedia is an example of
transactional information or analytical information.
2. What is the impact to Wikipedia if the information
contained in its database is of low quality?
3. Review the five common characteristics of high-quality
information and rank them in order of importance.
4. How is Wikipedia resolving the problem of poor
information?
5. Identify the different types of entities that might be stored
in Wikipedia’s database.
6. Why is database technology so important to Wikipedia’s
business model?
SECTION 7.2
DATA WAREHOUSING
HISTORY OF DATA
WAREHOUSING
• Data warehouses extend the transformation of
data into information
• In the 1990s executives became less
concerned with the day-to-day business
operations and more concerned with overall
business functions
• The data warehouse provided the ability to
support decision making without disrupting the
day-to-day operations
DATA WAREHOUSE
FUNDAMENTALS
• Data warehouse – a logical collection of
information – gathered from many different
operational databases – that supports business
analysis activities and decision-making tasks
• The primary purpose of a data warehouse is to
aggregate information throughout an
organization into a single repository for
decision-making purposes
DATA WAREHOUSE
FUNDAMENTALS
• Extraction, transformation, and loading
(ETL) – a process that extracts information from
internal and external databases, transforms the
information using a common set of enterprise
definitions, and loads the information into a data
warehouse
• Data mart – contains a subset of data
warehouse information
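The ETL steps and a data mart can be sketched as three small functions. The two source extracts, their field names, and the region codes are hypothetical; the "common set of enterprise definitions" is represented here by one agreed record layout (customer, revenue, region).

```python
import string

# Hypothetical source extracts with different field names and formats.
crm_rows = [{"cust": "Pizza Palace", "sales": "1,200.50", "region": "E"}]
erp_rows = [{"customer_name": "dave's sub shop", "revenue": 800.0, "region": "West"}]

# Assumed enterprise definition of region names.
REGION_NAMES = {"E": "East", "W": "West", "East": "East", "West": "West"}


def extract():
    """Pull rows from each source system."""
    return crm_rows, erp_rows


def transform(crm, erp):
    """Map both sources onto one common record layout."""
    unified = []
    for row in crm:
        unified.append({
            "customer": string.capwords(row["cust"]),
            "revenue": float(row["sales"].replace(",", "")),
            "region": REGION_NAMES[row["region"]],
        })
    for row in erp:
        unified.append({
            "customer": string.capwords(row["customer_name"]),
            "revenue": float(row["revenue"]),
            "region": REGION_NAMES[row["region"]],
        })
    return unified


def load(rows, warehouse):
    """Append the transformed rows to the warehouse."""
    warehouse.extend(rows)
    return warehouse


warehouse = load(transform(*extract()), [])

# A data mart holds a subset of the warehouse, here one region only.
east_mart = [row for row in warehouse if row["region"] == "East"]
```

After loading, every row in warehouse uses the same field names and formats regardless of which system it came from.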
DATA WAREHOUSE
FUNDAMENTALS
Multidimensional Analysis
• Databases contain information in a series
of two-dimensional tables
• In a data warehouse and data mart,
information is multidimensional; it contains
layers of columns and rows
– Dimension – a particular attribute of
information
Multidimensional Analysis
• Cube – common term for the representation
of multidimensional information
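A cube can be approximated in plain Python as a dictionary keyed by one value per dimension. The three dimensions, the fact rows, and the helper names are assumptions for illustration; dedicated analysis tools store and query cubes far more efficiently.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, region, quarter, units sold).
facts = [
    ("cola", "east", "Q1", 10),
    ("cola", "west", "Q1", 7),
    ("cola", "east", "Q2", 4),
    ("water", "east", "Q1", 3),
]

DIMENSIONS = ("product", "region", "quarter")


def build_cube(rows):
    """Store units sold in cells addressed by one value per dimension."""
    cube = defaultdict(int)
    for product, region, quarter, units in rows:
        cube[(product, region, quarter)] += units
    return cube


def slice_total(cube, **fixed):
    """Sum every cell whose coordinates match the fixed dimension values."""
    total = 0
    for coords, units in cube.items():
        cell = dict(zip(DIMENSIONS, coords))
        if all(cell[name] == value for name, value in fixed.items()):
            total += units
    return total


cube = build_cube(facts)
```

Fixing one dimension, as in slice_total(cube, product="cola"), gives a slice of the cube; fixing none returns the grand total.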
Information Cleansing or
Scrubbing
• An organization must maintain high-quality data in the data warehouse
• Information cleansing or scrubbing – a
process that weeds out and fixes or
discards inconsistent, incorrect, or
incomplete information
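A minimal cleansing pass might standardize each record, set aside incomplete ones, and discard duplicates. The contact records are invented, and the rule that a valid phone number has exactly ten digits is an assumption chosen for this sketch.

```python
import re

# Hypothetical raw contact records from operational systems.
raw_contacts = [
    {"name": "  pizza   palace ", "phone": "(555) 010-0101"},
    {"name": "Pizza Palace", "phone": "555.010.0101"},
    {"name": "Dave's Sub Shop", "phone": ""},
]


def cleanse(records):
    """Standardize records, set aside incomplete ones, and drop duplicates."""
    cleaned, rejected, seen = [], [], set()
    for record in records:
        # Fix inconsistent spacing and capitalization in the name.
        name = " ".join(record["name"].split()).lower()
        # Keep digits only so differently formatted numbers compare equal.
        phone = re.sub(r"\D", "", record["phone"])
        if not name or len(phone) != 10:
            rejected.append(record)      # incomplete or incorrect: set aside
            continue
        if (name, phone) in seen:        # same customer entered twice: discard
            continue
        seen.add((name, phone))
        cleaned.append({"name": name, "phone": phone})
    return cleaned, rejected


cleaned, rejected = cleanse(raw_contacts)
```

The first two records turn out to be the same customer once standardized, so only one survives, and the record with no phone number is held back for correction.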
Information Cleansing or
Scrubbing
• Contact information in an operational system
Information Cleansing or
Scrubbing
• Standardizing Customer name from Operational Systems
Information Cleansing or
Scrubbing
Information Cleansing or
Scrubbing
• Accurate and complete information
BUSINESS INTELLIGENCE
• Business intelligence – information that
people use to support their decision-making efforts
BUSINESS INTELLIGENCE
• BI information analysis
BUSINESS INTELLIGENCE
• How BI can answer tough customer questions
OPERATIONAL, TACTICAL, AND
STRATEGIC BI
OPERATIONAL, TACTICAL, AND
STRATEGIC BI
• The three forms of BI must work towards a common goal
BI’s Operational Value
• The latency between a “business event” and an “action taken”
DATA MINING
• Data mining – the process of analyzing data to
extract information not offered by the raw data
alone
• To perform data mining users need data-mining
tools
– Data-mining tool – uses a variety of techniques to
find patterns and relationships in large volumes of
information and infers rules that predict future
behaviour and guide decision making
DATA MINING
• Common forms of data-mining analysis
capabilities include:
– Cluster analysis
– Association detection
– Statistical analysis
Cluster Analysis
• Cluster analysis – a technique used to divide
an information set into mutually exclusive
groups such that the members of each group
are as close together as possible to one
another and the different groups are as far
apart as possible
• CRM systems depend on cluster analysis to
segment customer information and identify
behavioural traits
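One common clustering technique is k-means, sketched here in one dimension. The slide does not name a specific algorithm, so k-means is a stand-in; the spend figures and starting centers are hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Tiny one-dimensional k-means.

    Repeatedly assign each value to its nearest center, then move each
    center to the mean of the values assigned to it.
    """
    centers = list(centers)
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for value in values:
            nearest = min(
                range(len(centers)), key=lambda i: abs(value - centers[i])
            )
            groups[nearest].append(value)
        # An empty group keeps its previous center.
        centers = [
            sum(group) / len(group) if group else center
            for group, center in zip(groups, centers)
        ]
    return groups, centers


# Hypothetical annual spend per customer.
spend = [20, 25, 30, 200, 210, 220]
segments, centers = k_means_1d(spend, centers=[0, 100])
```

The result splits customers into a low-spend and a high-spend segment, the kind of grouping a CRM system could use to tailor offers.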
Cluster Analysis
Association Detection
• Association detection – reveals the
degree to which variables are related
and the nature and frequency of these
relationships in the information
– Market basket analysis – analyzes such
items as Web sites and checkout scanner
information to detect customers’ buying
behaviour and predict future behaviour by
identifying affinities among customers’
choices of products and services
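Market basket analysis is often summarized with two measures, support and confidence, sketched below. The baskets and function names are hypothetical; the measures are standard, but the slide itself does not prescribe them.

```python
from collections import Counter
from itertools import combinations

# Hypothetical checkout baskets.
baskets = [
    {"chips", "salsa", "cola"},
    {"chips", "salsa"},
    {"chips", "cola"},
    {"bread", "milk"},
]


def pair_support(transactions):
    """Fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in transactions:
        # Sort so each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: count / len(transactions) for pair, count in counts.items()}


def confidence(transactions, antecedent, consequent):
    """Of the baskets containing `antecedent`, the share that also contain `consequent`."""
    with_antecedent = [b for b in transactions if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


support = pair_support(baskets)
```

In this data every basket with salsa also contains chips, an affinity a retailer might act on by placing the two products together.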
Statistical Analysis
• Statistical analysis – performs such
functions as information correlations,
distributions, calculations, and variance
analysis
– Forecast – predictions made on the basis
of time-series information
– Time-series information – time-stamped
information collected at a particular
frequency
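A very simple forecast from time-series information is a moving average, sketched below. The monthly sales figures and the three-month window are assumptions; real forecasting would usually account for trend and seasonality as well.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("not enough observations for the chosen window")
    recent = series[-window:]
    return sum(recent) / window


# Hypothetical time-series information: sales recorded once per month.
monthly_sales = [100, 110, 120, 130, 140, 150]
next_month = moving_average_forecast(monthly_sales)
```

Because it averages recent months, this forecast lags behind a steadily rising series, which is one reason it is only a starting point.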
BUSINESS BENEFITS OF BI
Categories of BI benefits:
1. Direct quantifiable benefits
2. Indirect quantifiable benefits
3. Unpredictable benefits
4. Intangible benefits
OPENING CASE QUESTIONS
It Takes a Village to Write an Encyclopedia
7. How could Wikipedia use a data warehouse to
improve its business operations?
8. Why must Wikipedia cleanse or scrub the
information in its data warehouse?
9. How could a company use information from
Wikipedia to gain business intelligence?
10. Choose one of the three common forms of
data-mining analysis and explain how
Wikipedia could use it to gain BI.
11. How can Wikipedia use tactical, operational
and strategic BI?
CLOSING CASE ONE
Scouting for Quality
1. Explain the importance of high-quality
information for Scouts Canada.
2. Review the five common characteristics of high
quality information and rank them in order of
importance for Scouts Canada.
3. How could data warehouses and data marts be
used to help Scouts Canada improve the
efficiency and effectiveness of its operations?
CLOSING CASE ONE
Scouting for Quality
4. What kinds of data marts might Scouts
Canada want to build to help it analyze its
operational performance?
5. Do the managers at Scouts Canada actually
have all of the information they require to
make an accurate decision? Explain the
statement “it is never possible to have all of
the information required to make the best
decision possible.”
CLOSING CASE TWO
Google
1. How did the Web site RateMyProfessor.com solve its
problem of low-quality information?
2. Review the five common characteristics of high-quality
information and rank them in order of importance to
Google’s business.
3. What would be the ramifications of Google’s business
if the search information it presented to its customers
was of low quality?
4. Describe the different types of databases. Why
should Google use a relational database?
5. Identify the different types of entities, entity classes,
attributes, keys, and relationships that might be stored
in Google’s AdWords relational database.
CLOSING CASE TWO
Google
6. How could Google use a data warehouse to
improve its business operations?
7. Why would Google need to scrub and cleanse
the information in its data warehouse?
8. Identify a data mart that Google’s marketing
and sales department might use to track and
analyze its AdWords revenue.
CLOSING CASE THREE
Harrah’s
1. Identify the effects poor information might have on
Harrah’s service-oriented business strategy.
2. How does Harrah’s use database technologies to
implement its service-oriented strategy?
3. Harrah’s was one of the first casino companies to find
value in offering rewards to customers who visit
multiple Harrah’s locations. Describe the effects on
the company if it did not build any integrations among
the databases located at each of its casinos. How
could Harrah’s use distributed databases or a data
warehouse to synchronize customer information?
CLOSING CASE THREE
Harrah’s
4. Estimate the potential impact to Harrah’s business if
there is a security breach in its customer information.
5. Identify three different types of data marts Harrah’s
might want to build to help it analyze its operational
performance.
CLOSING CASE THREE
Harrah’s
6. What might occur if Harrah’s fails to clean or scrub its
information before loading it into its data warehouse?
7. Describe cluster analysis, association detection, and
statistical analysis and explain how Harrah’s could
use each one to gain insights into its business.