Databases and Data Warehouses

CHAPTER 6
DATABASES
AND DATA
WAREHOUSES
McGraw-Hill/Irwin
©2008 The McGraw-Hill Companies, All Rights Reserved
6-2
UNDERSTANDING INFORMATION
• Information is everywhere in an organization
– Data are raw facts that describe the characteristics of an
event
• Sales event – date, item number, item description, quantity
ordered, customer name, shipping details
– Information is data converted into a meaningful and
useful context
• Sales event – best/worst selling item, best/worst customer
• Employees must be able to obtain and analyze the
many different levels, formats, and granularities of
organizational information to make decisions
• Successfully collecting, compiling, sorting, and
analyzing information can provide tremendous
insight into how an organization is performing
6-3
UNDERSTANDING INFORMATION
• Information granularity – refers to the
extent of detail within the information (fine
and detailed or coarse and abstract)
– Levels
– Formats
– Granularities
6-4
UNDERSTANDING INFORMATION
Information Types, Range, Examples
• Information Levels
– Individual: individual knowledge, goals and strategies
– Department: departmental goals, revenues, expenses, processes and strategies
– Enterprise: enterprise-wide revenues, expenses, processes and strategies
• Information Formats
– Document: letters, memos, faxes, e-mails, reports, marketing materials
– Presentation: product, strategy, process, financial, customer and competitor presentations
– Spreadsheet: sales, marketing, industry, financial, competitor, customer, and order spreadsheets
– Database: customer, employee, sales, order, supplier and manufacturer databases
• Information Granularities
– Detail (Fine): reports for each salesperson, product and part
– Summary: reports for all sales personnel, all products and all parts
– Aggregate (Coarse): reports across departments, organizations and companies
6-5
Information Quality
• Business decisions are only as good as the quality of the
information used to make the decisions
• Characteristics of high quality information include:
– Accuracy Are all the values correct? Is the name spelled correctly?
Is the dollar amount recorded properly?
– Completeness Are any of the values missing? Is the address
complete including street, city, state, and zip code?
– Consistency Is aggregate or summary information in agreement
with detailed information?
• Do all total fields equal the true total of the individual fields?
– Uniqueness Is each transaction, entity, and event represented only
once in the information?
• Are there any duplicate customers?
– Timeliness Is the information current with respect to the business
requirements? Is information updated weekly, daily, or hourly?
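Some of the quality questions above can be checked mechanically. The following is a minimal sketch in Python, assuming a hypothetical list of customer records; the field names, the sample values, and the helper functions are invented for illustration and are not taken from the chapter.

```python
# Hypothetical customer records; field names and values are invented for illustration.
customers = [
    {"id": 113, "first": "", "last": "Smith", "zip": "80210"},
    {"id": 114, "first": "Jeff", "last": "Jones", "zip": ""},
    {"id": 115, "first": "Roberta", "last": "Cross", "zip": "60611"},
    {"id": 115, "first": "Roberta", "last": "Cross", "zip": "60611"},
]


def incomplete(records):
    """Completeness: return the ids of records that have any blank field."""
    return [r["id"] for r in records if any(v == "" for v in r.values())]


def duplicates(records):
    """Uniqueness: return the ids that appear more than once."""
    seen, dupes = set(), set()
    for r in records:
        if r["id"] in seen:
            dupes.add(r["id"])
        seen.add(r["id"])
    return sorted(dupes)


def totals_consistent(line_amounts, reported_total):
    """Consistency: does the summary total equal the sum of the detail lines?"""
    return sum(line_amounts) == reported_total
```

Accuracy and timeliness are harder to test this way, because they need an outside reference (the true value, or the business requirement for how current the information must be).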
6-6
Information Quality
• Low quality information example
6-7
Information Quality
• Issue 1: Without a first name it would be impossible to correlate this customer
with customers in other databases (Sales, Marketing, Billing, Customer Service)
to gain a complete customer view (CRM)
• Issue 2: Without a complete street address there is no possible way to
communicate with this customer via mail or deliveries. An order might be sitting
in a warehouse waiting for the complete address before shipping. The company
has spent time and money processing an order that might never be completed
• Issue 3: If this is the same customer, the company will waste money sending
out two sets of promotions and advertisements to the same customer. It might
also send two identical orders and have to incur the expense of one order being
returned
• Issue 4: This is a good example of where cleaning data is difficult because this
may or may not be an error. There are many times when a phone and a fax
have the same number. Since the phone number is also in the e-mail address
field, chances are that the number is inaccurate
• Issue 5: The business would have no way of communicating with this customer
via e-mail
• Issue 6: The company could determine the area code based on the customer’s
address. This takes time, which costs the company money. This is a good
reason to ensure that information is entered correctly the first time. All incorrect
information needs to be fixed, which costs time and money
6-8
Understanding the Costs of
Poor Information
• The four primary sources of low quality
information include:
1. Online customers intentionally enter inaccurate
information to protect their privacy
2. Information from different systems has
different entry standards and formats
3. Call center operators enter abbreviated or
erroneous information by accident or to save
time
4. Third party and external information contains
inconsistencies, inaccuracies, and errors
6-9
Understanding the Costs of
Poor Information
• Potential business effects resulting from
low quality information include:
– Inability to accurately track customers
– Difficulty identifying valuable customers
– Inability to identify selling opportunities
– Marketing to nonexistent customers
– Difficulty tracking revenue due to inaccurate
invoices
– Inability to build strong customer relationships
6-10
Understanding the Costs of
Poor Information
• Poor information could cause the SCM
system to order too much inventory from a
supplier based on inaccurate orders
• Poor information could cause a CRM
system to send an expensive promotional
item (such as a fruit basket) to the wrong
address of one of its best customers
• What occurs when you have the inability to
build strong customer relationships?
– Decreased seller power
6-11
Understanding the Benefits of
Good Information
• High quality information can significantly
improve the chances of making a good
decision
• Good decisions can directly impact an
organization's bottom line
6-12
DATABASE FUNDAMENTALS
• Information is everywhere in an
organization
• Almost every business decision is based
on information
• Information is stored in databases
– Database – maintains information about
various types of objects (inventory), events
(transactions), people (employees), and
places (warehouses)
6-13
DATABASE FUNDAMENTALS
• Database models include:
– Hierarchical database model – information is
organized into a tree-like structure (using
parent/child relationships) in which each child has
only one parent, so it cannot represent many
cross-relationships
– Network database model – a flexible way of
representing objects and their relationships
– Relational database model – stores information
in the form of logically related two-dimensional
tables
6-14
DATABASE ADVANTAGES
• Database advantages from a business
perspective include
– Increased flexibility
– Increased scalability and performance
– Reduced information redundancy
– Increased information integrity (quality)
– Increased information security
• Spreadsheet limitations
– Limited number of rows and columns (Excel 2003 and earlier: 65,536 rows by
256 columns); once you need more than 65,536 rows you
have outgrown your spreadsheet
– Only one user can access the spreadsheet at a time
– Users can view all information in the spreadsheet
– Users can change all information in the spreadsheet
6-15
Increased Flexibility
• A well-designed database should:
– Handle changes quickly and easily
– Provide users with different views
– Have only one physical view
• Physical view – deals with the physical storage of
information on a storage device
– Have multiple logical views
• Logical view – focuses on how users logically
access information
6-16
Increased Scalability and
Performance
• A database must scale to meet increased
demand, while maintaining acceptable
performance levels
– Scalability – refers to how well a system can
adapt to increased demands
– Performance – measures how quickly a
system performs a certain process or
transaction
6-17
Reduced Redundancy
• Databases reduce information redundancy
by recording each piece of information in
only one place
– Redundancy – the duplication of information
or storing the same information in multiple
places; can lead to low quality information
• Inconsistency is one of the primary
problems with redundant information
6-18
Increased Integrity (Quality)
• Information integrity – measures the quality of
information
• Integrity constraint – rules that help ensure the
quality of information
– Relational integrity constraint – rule that enforces
basic and fundamental information-based constraints
• Users cannot create an order for a nonexistent customer
• An order cannot be shipped without an address
– Business-critical integrity constraint – rule that
enforces business rules vital to an organization’s success
and often requires more insight and knowledge than
relational integrity constraints
• Product returns are not accepted for fresh product 15 days after
purchase
• A discount maximum of 20 percent
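The constraints above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table and column names are hypothetical, the "no shipping without an address" rule is simplified to a NOT NULL column, and the 20 percent discount rule is expressed as a CHECK constraint, which is one possible way to enforce it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id     INTEGER PRIMARY KEY,
    -- relational constraint: the customer must already exist
    customer_id  INTEGER NOT NULL REFERENCES customer(customer_id),
    -- relational constraint (simplified): an order needs an address
    ship_address TEXT NOT NULL,
    -- business rule: discount maximum of 20 percent
    discount_pct REAL NOT NULL DEFAULT 0
                 CHECK (discount_pct BETWEEN 0 AND 20)
);
""")
conn.execute("INSERT INTO customer VALUES (1, 'Dave''s Sub Shop')")


def try_insert(row):
    """Return True if the order row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO orders VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False
```

With this setup, `try_insert((101, 99, "12 Main St", 0.0))` is rejected because customer 99 does not exist, and `try_insert((103, 1, "12 Main St", 25.0))` is rejected because the discount exceeds 20 percent.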
6-19
Increased Security
• Information is an organizational asset and must
be protected
• Databases offer several security features
including:
– Password – provides authentication of the user
– Access level – determines who has access to the
different types of information
– Access control – determines types of user access,
such as read-only access
6-20
Increased Security
• Why would you want to define access level
security?
– Access levels will typically mimic the hierarchical
structure of the organization and protect
organizational information from being viewed
and manipulated by individuals who should not
have access to the sensitive or confidential
information
• Low level employees typically have the lowest levels
of access
• High level employees typically have access to all
types of database information
6-21
Increased Security
– For example: You would not want analysts
viewing all salary information for the entire
company - in general:
• Analysts can usually only view their own salary
• Managers have higher access and can view the
salaries of all their team members, but cannot view
other managers’ salaries
• Directors can view all of their managers’ and
analysts’ salaries, but not other directors’ salaries
• The CFO and CEO can view every employee’s
salary
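The salary example can be sketched as a small access check. This is a hedged illustration only: the reporting structure, the role names, and the function `can_view_salary` are hypothetical, and the sketch assumes every employee may view his or her own salary.

```python
# Hypothetical reporting structure: employee -> direct superior (None at the top).
reports_to = {
    "ceo": None,
    "cfo": "ceo",
    "director_a": "cfo",
    "director_b": "cfo",
    "manager_a": "director_a",
    "manager_b": "director_a",
    "analyst_a": "manager_a",
    "analyst_b": "manager_b",
}
FULL_ACCESS = {"ceo", "cfo"}  # can view every employee's salary


def can_view_salary(viewer, target):
    """True if viewer may see target's salary under the rules in the example."""
    if viewer in FULL_ACCESS or viewer == target:
        return True
    # Otherwise the viewer must be somewhere above the target in the hierarchy.
    boss = reports_to.get(target)
    while boss is not None:
        if boss == viewer:
            return True
        boss = reports_to.get(boss)
    return False
```

For instance, `can_view_salary("manager_a", "analyst_a")` is allowed, while `can_view_salary("manager_a", "manager_b")` and `can_view_salary("director_a", "director_b")` are not.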
6-22
RELATIONAL DATABASE
FUNDAMENTALS
• Entity – a person, place, thing, transaction, or
event about which information is stored
– The rows in each table contain the entities
– In Figure 6.5 CUSTOMER includes Dave’s Sub Shop
and Pizza Palace entities
• Entity class (table) – a collection of similar
entities
– In Figure 6.5 CUSTOMER, ORDER, ORDER LINE,
DISTRIBUTOR, and PRODUCT entity classes
6-23
RELATIONAL DATABASE
FUNDAMENTALS
• Attributes (fields, columns) – characteristics or
properties of an entity class
– The columns in each table contain the attributes
– In Figure 6.5 attributes for CUSTOMER include:
• Customer ID
• Customer Name
• Contact Name
• Phone
– Possible other attributes:
• Address
• Fax
• E-mail
• Cell phone
6-24
RELATIONAL DATABASE
FUNDAMENTALS
• Primary keys and foreign keys identify the various
entity classes (tables) in the database
– Primary key – a field (or group of fields) that uniquely
identifies a given entity in a table
– Foreign key – a primary key of one table that appears as an
attribute in another table and acts to provide a logical
relationship between the two tables
– Example
• Hawkins Shipping in the DISTRIBUTOR table has a primary key
called Distributor ID – DEN8001
• Hawkins Shipping (Distributor ID DEN8001) is responsible for
delivering orders 34561 and 34562
• Therefore, Distributor ID in the ORDER table creates a logical
relationship (who shipped what order) between ORDER and
DISTRIBUTOR
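The Hawkins Shipping example can be sketched with Python's built-in sqlite3 module. The column names below are assumptions (Figure 6.5 is not reproduced in this transcript), and the order table is named `orders` because ORDER is a reserved word in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE distributor (
    distributor_id TEXT PRIMARY KEY,   -- primary key
    name           TEXT NOT NULL
);
CREATE TABLE orders (
    order_id       INTEGER PRIMARY KEY,
    -- foreign key: the primary key of DISTRIBUTOR appearing in ORDER
    distributor_id TEXT REFERENCES distributor(distributor_id)
);
INSERT INTO distributor VALUES ('DEN8001', 'Hawkins Shipping');
INSERT INTO orders VALUES (34561, 'DEN8001'), (34562, 'DEN8001');
""")

# The shared Distributor ID answers "who shipped what order".
rows = conn.execute("""
    SELECT o.order_id, d.name
    FROM orders AS o
    JOIN distributor AS d ON d.distributor_id = o.distributor_id
    ORDER BY o.order_id
""").fetchall()
```

The join pairs each order with the distributor whose primary key matches the order's foreign key.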
6-25
Potential relational
database for Coca-Cola
6-26
RELATIONAL DATABASE
FUNDAMENTALS
• How many orders have been placed for T’s Fun Zone?
– Ans: 1 (Order 34563)
• How many orders have been placed for Pizza Palace?
– Ans: None
• How many items are included in Dave’s Sub Shop’s two
orders?
– Ans: Order 34561 has 3 items and order 34562 has one item for a
total of 4 items in both orders.
• Who is responsible for distributing Dave’s Sub Shop’s
orders?
– Ans: Hawkins Shipping
• Which products are included in Order 34562?
– Ans: 300 Vanilla Coke
6-27
DATABASE MANAGEMENT SYSTEMS
• Database management systems (DBMS) –
software through which users and application
programs interact with a database
6-28
DATABASE MANAGEMENT SYSTEMS
• Direct interaction
– The user interacts directly with the DBMS
– The DBMS obtains the information from the
database
• Indirect interaction
– User interacts with an application (e.g., payroll
application, manufacturing application, sales
application)
– The application interacts with the DBMS
– The DBMS obtains the information from the
database
6-29
DATABASE MANAGEMENT SYSTEMS
• Four components of a DBMS
6-30
Data Definition Component
• Data definition component – creates and
maintains the data dictionary and the structure
of the database
• The data definition component includes the data
dictionary
– Data dictionary – a file that stores definitions of
information types, identifies the primary and foreign
keys, and maintains the relationships among the
tables
– The data dictionary is an important part of the DBMS
because users can consult the dictionary to
determine the different types of database information
6-31
Data Definition Component
• Data dictionary essentially defines the logical properties of
the information that the database contains
Business integrity constraint
Relational integrity constraint
6-32
Data Manipulation Component
• Data manipulation component – allows users to
create, read, update, and delete information in a
database
• A DBMS contains several data manipulation tools:
– View – allows users to see, change, sort, and query the
database content
– Report generator – users can define report formats
– Query-by-example (QBE) – users can graphically
design the answers to specific questions
– Structured query language (SQL) – query language
6-33
Data Manipulation Component
• Sample report using Microsoft Access Report Generator
6-34
Data Manipulation Component
• Sample report using Access Query-By-Example (QBE) tool
6-35
Data Manipulation Component
• Results from the query in previous QBE
6-36
Data Manipulation Component
• SQL version of the QBE Query in Figure 6.10
6-37
Application Generation and Data
Administration Components
• Application generation component – includes
tools for creating visually appealing and easy-to-use applications
• Data administration component – provides
tools for managing the overall database
environment by providing facilities for backup,
recovery, security, and performance
• IT specialists primarily use these components
6-38
INTEGRATING DATA AMONG MULTIPLE
DATABASES
• Integration – allows separate systems to
communicate directly with each other
– Forward integration – takes information
entered into a given system and sends it
automatically to all downstream systems and
processes
– Backward integration – takes information
entered into a given system and sends it
automatically to all upstream systems and
processes
6-39
INTEGRATING DATA AMONG MULTIPLE
DATABASES
• One of the biggest benefits of integration is that organizations
only have to enter information into the systems once and it is
automatically sent to all of the other systems throughout the
organization
• This feature alone creates huge advantages for organizations
because it reduces information redundancy and ensures
accuracy and completeness
• Without integrations an organization would have to enter
information into every single system that requires the information
from marketing and sales to billing and customer service
– Entering the same customer information into multiple systems is
redundant, and the chance of making a mistake in one of the systems is
high
– For example, customer information would have to be manually entered
into the marketing, sales, ordering, inventory, billing, and shipping
databases. (Each of these systems is separate and would have its
own database if the company doesn’t have a complete ERP
installed.)
6-40
INTEGRATING DATA
AMONG MULTIPLE DATABASES
• Forward and backward integration
6-41
INTEGRATING DATA
AMONG MULTIPLE DATABASES
• Sales enters the information when it is negotiating the sale
(looking for opportunities)
• The information is then passed to the order entry system
when the order is actually placed
• The order fulfillment system picks the products from the
warehouse, packs the products, labels boxes, etc
• Once the order is filled and shipped, the customer is billed
• What would happen if users could enter order information
directly into the billing system?
– The systems would quickly become out-of-sync. There might be bills
for nonexistent orders, or orders that do not have any bills (if
someone deleted a bill)
– For this reason organizations typically place a business-critical
integrity constraint on integrated systems: with a forward integration,
the information must be entered in the sales system; it cannot be
entered directly into the billing system
6-42
INTEGRATING DATA
AMONG MULTIPLE DATABASES
• Integrations are expensive to build and maintain and difficult
to implement
• For these reasons many organizations only build forward
integrations and use business-critical integrity constraints to
ensure all information is always entered only at the start of
the integration (one source of record)
• Why would an organization want to build both forward and
backward integrations?
– This allows users to enter information at any point in the business
process and the information is automatically sent upstream and
downstream to all other systems
– For example, if order fulfillment determined that they could not fulfill
an order (the product had been discontinued), they could simply
enter this information into the database and it would be sent
automatically upstream to the sales representative who could contact
the customer and downstream to billing to remove the item from the
bill
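Forward and backward integration can be sketched as copying a record along an ordered chain of systems. This is a simplified illustration, not how any particular integration product works: the system names follow the sales example in this chapter, while the `propagate` function and the in-memory "stores" are hypothetical.

```python
# Order of systems in the business process, from upstream to downstream.
SYSTEMS = ["sales", "order_entry", "order_fulfillment", "billing"]


def propagate(entered_in, record, stores, forward=True, backward=False):
    """Save the record where it was entered and copy it along the integrations.

    Returns the list of systems that received the record, in process order.
    """
    i = SYSTEMS.index(entered_in)
    targets = [entered_in]
    if forward:
        targets += SYSTEMS[i + 1:]   # all downstream systems
    if backward:
        targets += SYSTEMS[:i]       # all upstream systems
    for name in targets:
        stores[name].append(record)
    return sorted(targets, key=SYSTEMS.index)
```

With forward integration only, a record entered in order fulfillment reaches billing but never reaches sales, which is why such organizations require entry at the start of the chain. With `backward=True` the same record reaches every system.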
6-43
INTEGRATING DATA
AMONG MULTIPLE DATABASES
• Building a central repository specifically
for integrated information
6-44
INTEGRATING DATA
AMONG MULTIPLE DATABASES
• Users can create, read, update, and delete
in the main customer repository, and it is
automatically sent to all of the other
databases
• Business-critical integrity constraints still
need to be built to ensure information is
only ever entered into the customer
repository, otherwise the information will
become out-of-sync
6-45
HISTORY OF DATA WAREHOUSING
• Bill Inmon is recognized as the "father of the
data warehouse" and co-creator of the
"Corporate Information Factory."
• Data warehouses extend the transformation of
data into information
• In the 1990s executives became less
concerned with the day-to-day business
operations and more concerned with overall
business functions
• The data warehouse provided the ability to
support decision making without disrupting the
day-to-day operations
6-46
DATA WAREHOUSE FUNDAMENTALS
• Data warehouse – a logical collection of information –
gathered from many different operational databases – that
supports business analysis activities and decision-making
tasks
• The primary purpose of a data warehouse is to aggregate
information throughout an organization into a single
repository for decision-making purposes
– A database stores information for a single application whereas a data
warehouse stores information from multiple databases, or multiple
applications, and external information such as industry information
• This enables cross-functional analysis, industry analysis, market
analysis, etc., all from a single repository
– Data warehouses support online analytical processing (OLAP)
6-47
DATA WAREHOUSE FUNDAMENTALS
• Extraction, transformation, and loading (ETL) – a
process that extracts information from internal and
external databases, transforms the information using a
common set of enterprise definitions, and loads the
information into a data warehouse
– ETL process also gathers data from the data warehouse and
passes it to the data marts
• Data mart – contains a subset of data warehouse
information
– A data warehouse has an enterprise-wide organizational focus,
while a data mart focuses on a subset of information for a given
business unit such as finance
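The ETL steps and the relationship between a warehouse and a data mart can be sketched as follows. The two source "databases", their field names, and the sample rows are hypothetical; real ETL tools work against live systems and much richer definitions.

```python
import string

# Hypothetical extracts from two operational systems with different formats.
sales_db = [{"cust": "DAVE'S SUB SHOP", "amount": "120.50", "unit": "sales"}]
billing_db = [{"customer_name": "dave's sub shop ", "total": 80.0, "unit": "finance"}]


def extract():
    """Extract: pull raw rows from each source, tagged with their origin."""
    return [("sales", r) for r in sales_db] + [("billing", r) for r in billing_db]


def transform(tagged_rows):
    """Transform: map every row onto one common set of field definitions."""
    out = []
    for source, r in tagged_rows:
        name = r["cust"] if source == "sales" else r["customer_name"]
        amount = r["amount"] if source == "sales" else r["total"]
        out.append({
            "customer": string.capwords(name),  # one spelling for each customer
            "amount": float(amount),            # one numeric type for amounts
            "unit": r["unit"],
        })
    return out


def load(rows, warehouse):
    """Load: append the standardized rows to the warehouse."""
    warehouse.extend(rows)
    return warehouse


warehouse = load(transform(extract()), [])

# A data mart holds a subset of the warehouse for one business unit.
finance_mart = [r for r in warehouse if r["unit"] == "finance"]
```

After the transform step both source rows refer to the customer under the same name, so the warehouse can combine them.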
6-48
DATA WAREHOUSE FUNDAMENTALS
6-49
Multidimensional Analysis
• Databases contain information in a series of
two-dimensional tables
• In a data warehouse and data mart, information
is multidimensional: it contains layers of
columns and rows
– Dimension – a particular attribute of information –
such as Products, Promotions, Stores, Category,
Region, Stock price, Date, Time, Weather
• The ability to look at information from different
dimensions can add tremendous business
insight
– By slicing-and-dicing the information a business can
uncover great unexpected insights
6-50
Multidimensional Analysis
• Cube – common term for the representation
of multidimensional information
6-51
Multidimensional Analysis
• Users can slice and dice the cube to drill down into
the information
– Cube A represents store information (the layers),
product information (the rows), and promotion
information (the columns)
– Cube B represents a slice of information
displaying promotion II for all products at all
stores
– Cube C represents a slice of information
displaying promotion III for product B at store 2
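The three cubes can be sketched with a dictionary keyed by (store, product, promotion). The labels and the sales figures in each cell are invented for illustration, and `slice_cube` is a hypothetical helper rather than a feature of any specific OLAP tool.

```python
from itertools import product

stores = ["Store 1", "Store 2", "Store 3"]
products = ["Product A", "Product B", "Product C"]
promotions = ["Promo I", "Promo II", "Promo III"]

# Cube A: one invented sales figure per (store, product, promotion) cell.
cube = {
    cell: 10 * (n + 1)
    for n, cell in enumerate(product(stores, products, promotions))
}


def slice_cube(cube, store=None, product_name=None, promotion=None):
    """Keep only the cells that match every dimension value that was given."""
    wanted = (store, product_name, promotion)
    return {
        cell: value
        for cell, value in cube.items()
        if all(w is None or w == c for w, c in zip(wanted, cell))
    }


# Cube B: promotion II for all products at all stores.
cube_b = slice_cube(cube, promotion="Promo II")
# Cube C: promotion III for product B at store 2.
cube_c = slice_cube(cube, store="Store 2", product_name="Product B", promotion="Promo III")
```

Fixing one dimension leaves a 3-by-3 slice of nine cells; fixing all three dimensions leaves a single cell.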
6-52
Multidimensional Analysis
• Data mining – the process of analyzing data to
extract information not offered by the raw data
alone
– Data mining can begin at a summary information level
(coarse granularity) and progress through increasing
levels of detail (drilling down), or the reverse (drilling up)
• To perform data mining users need data-mining
tools
– Data-mining tool – uses a variety of techniques to find
patterns and relationships in large volumes of
information and infers rules that predict future behavior
and guide decision making
• Data-mining tools include query tools, reporting tools,
multidimensional analysis tools, statistical tools, and intelligent
agents
6-53
Multidimensional Analysis
• What might an accountant discover by using
data-mining tools to drill down into
the details of all expenses and
revenues?
– Which employees are spending the most
money on long-distance phone calls
– Which customers are returning the most
products
6-54
Information Cleansing or Scrubbing
• An organization must maintain high-quality data in
the data warehouse
• What would happen if the information contained in
the data warehouse was only about 70 percent
accurate?
– Would you use this information to make business
decisions?
– Is it realistic to assume that an organization could get to
a 100% accuracy level on information contained in its
data warehouse?
• Information cleansing or scrubbing – a process
that weeds out and fixes or discards inconsistent,
incorrect, or incomplete information
6-55
Information Cleansing or Scrubbing
• Customer information exists in several operational
systems with different detail information
– Determining which contact information is accurate and correct for this
customer depends on the business process that is being executed
6-56
Information Cleansing or Scrubbing
• Standardizing Customer name from Operational Systems
6-57
Information Cleansing or Scrubbing
Typical events that occur during information cleansing
6-58
Information Cleansing or Scrubbing
• Accurate and complete information
6-59
Information Cleansing or Scrubbing
• Why do you think most businesses cannot achieve 100%
accurate and complete information?
– Some companies are willing to go as low as 20% complete just to
find business intelligence
– Few organizations will go below 50% accurate – the information is
useless if it is not accurate
• Achieving perfect information is almost impossible
– The more complete and accurate an organization wants to get its
information, the more it costs
– The tradeoff in pursuing perfect information lies in accuracy versus
completeness
– Accurate information means it is correct, while complete information
means there are no blanks
– Most organizations determine a percentage high enough to make
good decisions at a reasonable cost, such as 85% accurate and
65% complete
6-60
BUSINESS INTELLIGENCE
• Business intelligence – information that
people use to support their decision-making efforts
• Principal BI enablers include:
– Technology
– People
– Culture
6-61
BUSINESS INTELLIGENCE
• Technology
– Even the smallest company with BI software can do sophisticated analyses
today that were unavailable to the largest organizations a generation ago.
The largest companies today can create enterprise-wide BI systems that
compute and monitor metrics on virtually every variable important for
managing the company.
– Technology is the most significant enabler of business intelligence.
• People
– Understanding the role of people in BI allows organizations to
systematically create insight and turn these insights into actions.
Organizations can improve their decision making by having the right people
making the decisions. This usually means a manager who is in the field and
close to the customer rather than an analyst rich in data but poor in
experience.
– In recent years “business intelligence for the masses” has been an
important trend, and many organizations have made great strides in
providing sophisticated yet simple analytical tools and information to a much
larger user population than previously possible.
6-62
BUSINESS INTELLIGENCE
• Culture
– A key responsibility of executives is to shape and manage corporate
culture. The extent to which the BI attitude flourishes in an
organization depends in large part on the organization’s culture.
– Perhaps the most important step an organization can take to
encourage BI is to measure the performance of the organization
against a set of key indicators. The actions of publishing what the
organization thinks are the most important indicators, measuring
these indicators, and analyzing the results to guide improvement
display a strong commitment to BI throughout the organization
6-63
DATA MINING
• Data-mining software includes many forms of AI such
as neural networks and expert systems
6-64
DATA MINING
• Data-mining tools apply algorithms to
information sets to uncover inherent trends
and patterns in the information
• Analysts use this information to develop new
business strategies and business solutions
• Common forms of data-mining analysis
capabilities include:
– Cluster analysis
– Association detection
– Statistical analysis
6-65
Cluster Analysis
• Cluster analysis – a technique used to divide an
information set into mutually exclusive groups
such that the members of each group are as
close together as possible to one another and the
different groups are as far apart as possible
– Consumer goods by content, brand loyalty or similarity
– Product market typology for tailoring sales strategies
– Retail store layouts and sales performances
– Corporate decision strategies using social preferences
• CRM systems depend on cluster analysis to
segment customer information and identify
behavioral traits
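As an illustration of cluster analysis, the sketch below uses k-means, one common clustering algorithm; the slides do not name a specific algorithm, so this choice is an assumption. The customer spend figures and starting centers are invented.

```python
def kmeans_1d(values, centers, rounds=10):
    """Tiny k-means for one-dimensional data with caller-supplied starting centers."""
    centers = list(centers)
    for _ in range(rounds):
        groups = [[] for _ in centers]
        # Assign each value to the group whose center is nearest.
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups, centers


# Hypothetical annual spend per customer.
spend = [120, 150, 135, 980, 1020, 1005]
groups, centers = kmeans_1d(spend, centers=[0, 500])
```

Each customer lands in exactly one group (the groups are mutually exclusive), and the low spenders and high spenders end up in separate segments.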
6-66
Association Detection
• Association detection – reveals the degree to
which variables are related and the nature and
frequency of these relationships in the information
– Maytag uses association detection to ensure that each
generation of appliances is better than the previous
generation
– Maytag’s warranty analysis tool automatically detects
potential issues, provides quick and easy access to
reports, and performs multidimensional analysis on all
warranty information
– Market basket analysis – analyzes such items as
Web sites and checkout scanner information to detect
customers’ buying behavior and predict future behavior
by identifying affinities among customers’ choices of
products and services
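Market basket analysis is often summarized with two measures, support and confidence. The sketch below computes both for pairs of items; the baskets are invented and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical checkout baskets.
baskets = [
    {"chips", "salsa", "soda"},
    {"chips", "salsa"},
    {"chips", "soda"},
    {"bread", "milk"},
]

# How many baskets contain each item, and each pair of items.
item_counts = Counter(item for b in baskets for item in b)
pair_counts = Counter(
    pair for b in baskets for pair in combinations(sorted(b), 2)
)


def support(a, b):
    """Share of all baskets that contain both items."""
    return pair_counts[tuple(sorted((a, b)))] / len(baskets)


def confidence(a, b):
    """Of the baskets containing a, the share that also contain b."""
    return pair_counts[tuple(sorted((a, b)))] / item_counts[a]
```

Here chips and salsa appear together in half of the baskets, and every basket with salsa also has chips, which is the kind of affinity a retailer might act on.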
6-67
Statistical Analysis
• Statistical analysis – performs such
functions as information correlations,
distributions, calculations, and variance
analysis
– Forecast – predictions made on the basis
of time-series information
– Time-series information – time-stamped
information collected at a particular
frequency
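A forecast from time-series information can be sketched with a simple moving average. This is only one of many forecasting techniques and is chosen here for brevity; the monthly figures are invented.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("need at least `window` observations")
    recent = series[-window:]
    return sum(recent) / window


# Hypothetical time-series information: units sold, collected once per month.
monthly_units = [100, 110, 120, 130, 140, 150]
next_month = moving_average_forecast(monthly_units)
```

A larger window smooths out short-term swings but reacts more slowly to a real change in the trend.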
6-68
Statistical Analysis
• Kraft uses statistical analysis to assure consistent
flavor, color, aroma, texture, and appearance for all of
its lines of foods
• Kraft evaluates every manufacturing procedure, from
recipe instructions to cookie dough shapes and sizes to
ensure that the billions of Kraft products that reach
consumers each year taste great (and the same) with
every bite
• Nestle Italiana uses data mining and statistical analysis
to determine production forecasts for seasonal
confectionery products
• The company’s data-mining solution gathers, organizes,
and analyzes massive volumes of information to
produce powerful models that identify trends and predict
confectionery sales
6-69
Mining the Data Warehouse
• Ben & Jerry’s tracks the ingredients and
life of each pint in a data warehouse. If a
consumer calls in with a complaint, the
consumer affairs staff matches up the
pint with the supplier whose milk, eggs,
cherries, etc. did not meet the
organization’s near-obsession with
quality.
6-70
BI at Harrah’s
• The Total Rewards program allows Harrah’s to give every
single customer the appropriate amount of personal
attention, whether it’s leaving sweets in the hotel room or
offering free meals.
– Total Rewards works by providing each customer with an account
and a corresponding card that the player swipes each time he or
she plays a casino game. The program collects information, via a
database, on the amount of time the customers gamble, their total
winnings and losses, and their betting strategies.
• Customers earn points based on the amount of time they spend
gambling, which they can then exchange for comps such as free
dinners, hotel rooms, tickets to shows, and even cash.
– Without database integration among its hotels and casinos,
Harrah’s would be unable to determine what a customer’s true
value is to the company.
• For example, a customer who spends $500,000 at one casino
might be treated like royalty. This same customer could visit another
Harrah’s location, but since the information is not integrated, the new
location would have no idea that it had a high-rolling customer on
the premises and might not treat the customer accordingly.