- SCT Web Site
Download
Report
Transcript - SCT Web Site
ITEC 423 Data Warehousing and Data Mining
Lecture 3
Architecture
Architecture is the art and science of designing
buildings and other structures;
Architecture is as a system design decision that
is usually not easily changed.
There are many different architectural choices
available with different solutions for
Data transfer
Data Staging Area
Data storage
Information Delivery
A General Data Warehouse Architecture
A General Data Warehouse Architecture
with Staging Area
A General Data Warehouse Architecture
with Staging Area and Data Marts
Architectural Types
Architectural Types :Centralized Data Warehouse
Takes into account the enterprise-level
information requirements
Atomic level data at the lowest level of
granularity is stored
Some summarized data may be included
Queries and applications access the central
data warehouse.
No separate data marts
Architectural Types- Independent Data Marts
Evolves in companies where the organizational
units develop their own data marts for their own
specific purposes
Each data mart serves a particular organizational
unit
More than one version of the truth may be found
Data marts are independent of one another
Different data marts may have inconsistent data
definitions and standards
Such variances hinder analysis of data across
data marts.
Architectural Types-Federated
An existing legacy of an assortment of DSS in
the form of operational systems, extracted
datasets, primitive data marts, …
May not be possible to discard investment and
start from scratch
Practical solution is a federated architectural
type
data may be physically or logically integrated
through shared key fields, overall global
metadata , distributed queries, and such other
methods
No one overall data warehouse
Architectural Types- Data-Mart Bus
Conformed supermarts approach
Analyzing requirements for a specific business subject such as orders,
shipments, billings, insurance claims, car rentals ...
Build the first data mart (supermart) using business dimensions and
metrics
These business dimensions will be shared in the future data marts.
Conform dimensions among the various data marts
Result would be logically integrated supermarts that will provide an
enterprise view of the data
Data marts contain atomic data organized as a dimensional data model
Results from adopting an enhanced bottom-up approach to data
warehouse development
Architectural Types- Hub and Spoke
Similar to the centralized data warehouse architecture:
enterprise-wide data warehouse
Atomic data is stored in the centralized data warehouse
The centralized data warehouse feeds data to the dependent data
marts on the spokes
Dependent data marts may be developed for departmental analytical
needs, specialized queries, data mining ...
Dependent data mart may have normalized, denormalized,
summarized, or dimensional data structures based on individual
requirements
Most queries are directed to the dependent data marts
Centralized data warehouse may also be used for querying
Results from adopting a top-down approach to data warehouse
development.
Building Blocks of Data Warehouses
Production
Internal
Source
Data
Archived
External
Production Data
Problems
• variations in the data formats.
• data residing on different
hardware platforms
• data supported by different
DBMS/OS
Choose
appropriate data
from different
operational
systems based on
DW requirements
comes from the
various operational
systems of the
enterprise
• financial systems
• manufacturing systems
• the supply chain
Internal Data
Determine
strategies for
collecting
data
• Start with most significant
data
• Limit to the most important
portions
Increases
complexity of
transformation
and integration
process
Held by individuals
and departments in
private files
•
•
•
•
Spreadsheets
Documents
customer profiles
departmental databases
Archived Data
Backup data
old data of
the
operational
databases
are stored
in archived
files.
Decisions
related to
archiving
how often
Different methods
of archiving
Recent data archived
to a separate archival
database that may still
be online.
which
portions
Older data archived to
flat files on disk storage.
Oldest data archived to
tape cartridges or
microfilm may be kept
off-site.
Data warehouse
keeps historical
snapshots of data.
need historical data for
analysis over time.
Look into your
archived data sets.
Depending on your
data warehouse
requirements, you
have to include
sufficient historical
data.
External Data
External data is
used especially by
decision makers
statistics relating to their
industry produced by
external agencies and
national statistical offices.
market share data of
competitors.
standard values of financial
indicators for their business
to check on their
performance.
Production data
and archived data
give you a picture based on
what you are doing or have
done in the past.
Is not enough for
understanding industry
trends and compare
performance
Extraction
Transformation
Loading
Data
Staging
Component
Data Extraction
Deals with numerous data sources
• Source data may be from different source machines in diverse data
formats.
Tools are available on the market for data extraction.
• Outside tools suitable for certain data sources
• Develop in-house programs to do the data extraction.
After extraction where to keep the data for further preparation?
Perform the extraction function in the legacy system
Extract the source into a separate physical environment from which
moving the data into the data warehouse would be easier.
• extract the source data into a
• group of flat files
• a data -staging relational data base
• a combination of both.
Data Transformation
Data for a data warehouse comes
from many disparate sources
• Clean the data from each source:
misspellings, resolution, missing
data, duplicates
• Standardize data elements: data
types, lengths,
synonyms/homonyms
• Combine related information
• Purge useless data
• Choose appropriate keys
• Summarize if necessary
Data feed is not just
an initial load.
• Same (maybe
slightly adapted)
transformation
process will be
applied
periodically.
Data Loading
The initial load moves large
volumes of data
• very time consuming.
Periodically
Extract/Transform/Load
• Yearly
• Quarterly
• Monthly
• Daily
Data Storage Component
A separate repository
Large volumes of historical data for analysis
not for quick retrieval of individual pieces of
information
multidimensional databases store data aggregated
at different levels
Information Delivery Component
complex queries,
multidimensional
analysis, and
statistical analysis
Ad hoc
reports
casual user
need
prepackaged
information
novice user once in a while
need
prefabricated
reports and
preset queries
business
analyst
need ability to
do complex
analysis using
the information
in the data
warehouse
power user
need to be able to
navigate
throughout the
data warehouse,
pick up interesting
data, format his or
her own queries,
drill through the
data layers, and
create custom
reports and ad hoc
queries.
Information Delivery Component
Information Delivery Component
Information fed into executive information systems (EIS) is
meant for senior executives and high-level managers.
Some data warehouses also provide data to data mining
applications.
• knowledge discovery systems where the mining algorithms help to
discover trends and patterns from the data
In your data warehouse , you may include several
information delivery mechanisms.
• online queries and reports
• scheduled reports through e-mail or intranet
• information delivery over the Internet
Metadata Component
Similar to the data dictionary or the data catalog in a DBMS
Data about the data in the data warehouse.
key architectural component of the data warehouse.
Types of Metadata:
• Operational metadata
• Extraction and transformation metadata
• End-user metadata
Importance of Metadata
• connects all parts of the data warehouse .
• provides information about the contents and structures to the developers.
• makes the contents recognizable to the end users.
Management and Control Component
sits on top of all the other
components.
coordinates the services
and activities within the
data warehouse.
controls the data
transformation and the
data transfer into the
data warehouse storage.
moderates the
information delivery to
the users.
works with the database
management systems
and enables data to be
properly stored in the
repositories.
monitors the movement
of data into the staging
area and from there into
the data warehouse
storage itself.
interacts with the
metadata component to
perform the
management and control
functions.