Data Warehouse
Download
Report
Transcript Data Warehouse
Designing the Data Warehouse
and Data Mart
Methodologies and Techniques
Data Warehouse
• A data warehouse is the main repository of an organization's
historical data, its corporate memory. It contains the raw
material for management's decision support system.
• The critical factor leading to the use of a data warehouse is
that a data analyst can perform complex queries and analysis,
such as data mining, on the information without slowing down
the operational systems. (http://www.sqlpower.ca)
• Bill Inmon, an early and influential practitioner, has formally
defined a data warehouse in the following terms:
“Building the Data Warehouse” - by W. H. Inmon
Data Warehouse
• Subject-oriented
– The data in the database is organized so that all the data
elements relating to the same real-world event or object are
linked together;
• Time-variant
– The changes to the data in the database are tracked and
recorded so that reports can be produced showing changes
over time;
• Non-volatile
– Data in the database is never over-written or deleted - once
committed, the data is static, read-only, but retained for
future reporting; and
• Integrated
– The database contains data from most or all of an
organization's operational applications, and that this data is
made consistent.
Basic Principles
ETL: Extracting, Transforming (or Transporting) and Loading.
Life Cycle of the DW
First time load
Operational Databases
Warehouse Database
Refresh
Refresh
Purge or Archive
Refresh
Data Transfers into a Database
• First time system implementation
– From a manual system
•
•
•
•
Data warehousing projects
Database version upgrade
ERP projects
Migration
– From old to new system
Data Transfers between Systems
• Dynamic data (eg. sales orders)
– Interface required?
• Static data (eg. customers)
– Conversion required?
What Can go Wrong
• Data not available
– feature activated from implementation onwards
– Massive data entry
– Eg: different account structure
•
•
•
•
•
Data incomplete
Data inconsistent (eg: engineering vs accounts)
Wrong level of granularity
Data not clean
New system requires changes – new product codes
Data Cleaning must Address
• Different department record same info under
different codes
• Multiple records of same company (under
different names)
• Fields missing in input tables (eg: c/o)
• Different depts. Record different addresses for
same customer
• Use of different units for time periods
Labour Intensive Tasks
•
•
•
•
Data entry
Data checks
Working on solving conflicts
Allocating new codes
• Solution = introduce as much automation as possible
–
–
–
–
–
SQL / SQL loader (Oracle)
Custom conversion programmes to extract, modify and upload data
Filtering
Parsing (eg: excel)
Staging areas for conversion in progress
Data Utilities
• ORACLE is king of data handling
• Export: to transfer data between DBs
– Extract both table structure and data content into
dump file
• Import: corresponding facility
• SQL*loader automatic import from a variety
of file formats into DB files
– Needs a control file
Control files: using SQLloader
• Data transfers in and out of DB can be automated
using the loader
– Create a data file with the data(!)
– Create a control file to guide the operation
• Load creates two files
– Log file
– “bad transactions” file
• Also a discard file if control file has selection criteria
in it
Example 1 – The Supplier File
New supplier code to include city where firm is based
Assignation of category based on amounts purchased
OLD
Sup code
4 digits
Sup name
Sup address
City
Phone
Example 1 – The Supplier File
New supplier code to include city where firm is based
Assignation of category based on amounts purchased
OLD
Sup code
4 digits
Sup name
Sup address
City
Phone
NEW
Sup code
3 letters +
4 digits
Sup name
Sup address… Phone Cat
1,2,3 depending
on total purchases
last year
Example 2 – New Cost Accounting
Structure
Maintenance department expenditure:
1 account => separate accounts for different production activities
OLD
Intervention code
Desc. Date
Labour Parts Total
Example 2 – New Cost Accounting
Structure
Maintenance department expenditure:
1 account => separate accounts for different production activities
OLD
Intervention code
Desc. Date
Labour Parts Total
NEW
Intervention code
Desc. Date
labour Parts Total Account
Example 3: Merging Files
• Complete customer file based on Accounts and
Sales and Shipping
OLD (finance)
CustID name
address city
account number credit limit
balance
discount rates
rep_name
OLD (sales)
CustID* name
address city
sales_to_date
OLD (Shipping)
CustID**
name
address city
Preferred haulier
Example 4: Change of Business
Practices
• Payment by bank draft for international customers
• Automatic payment into account for national customers
• Payment direct into account for all customers
Data Staging Area
•
•
•
•
The construction site for the warehouse
Required by most scenarios
Connected to wide variety of sources
Clean / aggregate / compute / validate data
Operational
system
Extract
Data
staging
area
Transform
Transport
(Load)
Warehouse
Remote Staging Model
Data staging area within the warehouse environment
Warehouse environment
Oper. envt.
Operational
system
Extract,
transform,
transport
Data
staging
area
Transform
Transport
(Load)
Warehouse
Data staging area in its own environment, avoiding negative
impact on the warehouse environment
Staging envt.
Oper. envt.
Operational
system
Extract,
transform,
transport
Data
staging
area
Transform
Warehouse envt.
Transport
(Load)
Warehouse
Onsite Staging Model
Data staging area within the operational
environment,
possibly affecting the operational system
WH envt.
Operational environment
Operational
system
Extract
Data
staging
area
Transform
Transport
(Load)
Warehouse
Data Mart
• A subset of a data warehouse that supports the
requirements of a particular department or business
function.
• Characteristics include:
– Do not normally contain detailed operational data
unlike data warehouses.
– May contain certain levels of aggregation
Dependent Data Mart
Flat Files
Operational
Systems
Marketing
Marketing
Sales
Finance
Human Resources
Data
Warehouse
Sales
Finance
Data Marts
External Data
Independent Data Mart
Operational
Systems
Flat Files
Sales or Marketing
External Data
Reasons for Creating a Data Mart
• To give users more flexible access to the
data they need to analyse most often.
• To provide data in a form that matches the
specific needs of a group of users
• To improve end-user response time.
• Potential users of a data mart are clearly
defined and can be targeted for support
Why Create a Data Mart?
• To provide appropriately structured data as
dictated by the requirements of the end-user
access tools.
• Building a data mart is simpler (and much
quicker) compared with establishing a
corporate data warehouse.
• The cost of implementing data marts is far
less than that required to establish a data
warehouse.