Building Data Warehouse - UB Computer Science and Engineering
Building Data Warehouse
Zhenhao Qi
Department of Biochemistry
& Department of Computer Science and Engineering
State University of New York at Buffalo
March 23rd, 2000
Outline:
1. Migrating data from legacy systems: an iterative, incremental methodology.
2. Building high data quality into the data warehouse.
3. Optimal machine architectures for parallel query scalability.
The difficulties of migrating data from legacy systems:
1. The same data is presented differently in different systems.
2. The schema for a single database may not be consistent over time.
3. Data may simply be bad.
4. The data values are not represented in a form that is meaningful to end users.
5. Conversions and migrations in heterogeneous environments typically involve data from multiple incompatible DBMSs and hardware platforms.
6. The execution windows for data conversion programs must be coordinated carefully in order to provide the new applications with a consistent view of the data without impacting production systems.
The need for an iterative, incremental methodology
IT organizations have failed when they applied to large migration problems the same serial methodology they used for relatively discrete projects:
1. In a large organization, the complexity of the analysis and design
can involve person-years of effort without any demonstrable results.
2. The sheer complexity of both the data analysis and the design of
the target system has prevented effective progress.
3. The rate of change in operational systems has outstripped the
migration team’s ability to keep current.
Implications for metadata and the data warehouse
Three sources of change that the data warehouse team must anticipate:
• Those arising from the normal, regular changes to operational systems.
• Those that result from using an iterative, incremental methodology.
• Those that result from external business drivers such as acquisitions.
The key to dealing with change cost-effectively lies in metadata.
Four Dimensions of Metadata:
Another way of looking at the sources of technical challenge is with respect to the type of metadata required to minimize the impact of change. The need to adapt to change, error, and complexity recurs regularly over time, like waves on a beach:
• change
• error
• complexity
• time
Metadata that captures the current environment
• The record and data element definitions.
• Inter- and intra-database relationships.
• A definition of each interface program used to build or refresh the warehouse (a minimal record sketch follows this list), including:
a. which inter-database joins it uses;
b. the timing and direction of its execution;
c. any execution parameters;
d. dependencies on any other interface programs;
e. the use or production of other ancillary databases;
f. the name and location of the file that contains the source code for this data interface program;
g. the tool and session name if this interface program was automatically generated.
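The interface-program metadata above can be made concrete with a small sketch. This assumes a plain Python dataclass as the meta-model; every field name and example value here is illustrative, not the schema of any particular metadata repository.

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class InterfaceProgramMeta:
    """One metadata record describing a data interface (build/refresh) program."""
    name: str                                                      # program name
    inter_database_joins: List[str]                                # (a) inter-database joins it uses
    schedule: str                                                  # (b) timing of execution
    direction: str                                                 # (b) direction, e.g. "source->warehouse"
    parameters: Dict[str, str] = field(default_factory=dict)       # (c) execution parameters
    depends_on: List[str] = field(default_factory=list)            # (d) other interface programs
    ancillary_databases: List[str] = field(default_factory=list)   # (e) ancillary databases used or produced
    source_file: Optional[str] = None                              # (f) file holding the program's source code
    generator_tool: Optional[str] = None                           # (g) generating tool, if auto-generated
    generator_session: Optional[str] = None                        # (g) session name in that tool

# A hypothetical record for a nightly customer-dimension refresh:
load_customer_dim = InterfaceProgramMeta(
    name="load_customer_dim",
    inter_database_joins=["CRM.customer JOIN BILLING.account ON cust_id"],
    schedule="daily 02:00",
    direction="source->warehouse",
    parameters={"batch_size": "50000"},
    depends_on=["load_account_dim"],
    source_file="/etl/src/load_customer_dim.sql",
)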
Metadata required to reduce the cost of errors
The meta-model should allow the inclusion of the information discovered at the level of data element, record, database, and join. This information includes, but is not limited to:
• legal ranges and values;
• any exception logic that the data interface program should take if an illegal value is found (a rule sketch follows).
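A minimal sketch of such error-reduction metadata, assuming a simple Python representation; the field names, violation actions, and the age rule below are illustrative assumptions, not part of any specific meta-model.

from dataclasses import dataclass
from typing import Any, Optional, Set, Tuple

@dataclass
class ElementRule:
    """Error-reduction metadata for one data element."""
    element: str                                    # e.g. "CUSTOMER.AGE"
    legal_values: Optional[Set[Any]] = None         # enumerated legal values, if any
    legal_range: Optional[Tuple[Any, Any]] = None   # (low, high) legal range, if any
    on_violation: str = "reject_row"                # exception logic for the interface program

def is_legal(rule: ElementRule, value: Any) -> bool:
    """True if the value satisfies the rule's legal values/range."""
    if rule.legal_values is not None and value not in rule.legal_values:
        return False
    if rule.legal_range is not None:
        low, high = rule.legal_range
        if not (low <= value <= high):
            return False
    return True

# Example: age must lie in [0, 120]; otherwise route the row to a suspense file.
age_rule = ElementRule("CUSTOMER.AGE", legal_range=(0, 120), on_violation="route_to_suspense")
assert is_legal(age_rule, 37) and not is_legal(age_rule, 999)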
Metadata that can reduce the cost of complexity
Other types of metadata may be needed to reduce the complexity of specifying, maintaining, and executing the data interface programs.
Factoring in time
1. Information about each database that can affect execution time, such as:
a. Database size and volatility.
b. The time window during which each database can be accessed.
c. The mechanism that should be used for changed-data capture.
2. Versioning
a. Design the meta-model to anticipate change.
b. Choose tools that provide sufficient versioning support to facilitate impact analysis.
On the lack of an integrated development environment
[Diagram: operational systems (IMS, DB2, IDMS) feed raw data into data discovery and cleansing tools and a data conversion tool; corrected data flows on to the data warehouse DBMS servers and datamart servers; CASE tools, query tools, metadata repositories, and replication tools sit alongside as separate, unintegrated pieces.]
Developing an evaluation grid
The best strategy would be to:
• create a list of the types of change the organization is most likely to encounter;
• determine the types of metadata required to respond to each change cost-effectively.
From this analysis one should be able to determine a set of requirements regarding:
1. The number of systems and tools that must be interfaced.
2. The types of metadata required for the meta-model.
3. The best versioning strategy for performing impact analysis.
4. The desired set of functionality for automating this process.
The greatest cost benefit of high data quality
• Data quality assures previously unavailable competitive advantage and strategic capability:
1. Improved accuracy, timeliness, and confidence in decision making.
2. Improved customer service and retention.
3. Unprecedented sales and marketing opportunities.
4. Support for business reengineering initiatives.
• High data quality improves productivity:
1. Enables smart corporate-wide purchasing strategies.
2. Streamlines work processes.
• High data quality reduces costs:
1. Reduces physical inventory by identifying anomalies and redundancies within manufacturing parts, pharmaceutical prescription drugs, and so on.
2. Simplifies database management and reduces storage requirements for information systems.
3. Reduces mailing and production costs.
4. Reduces clerical staff.
5. Spares costly redesigns of data models.
• The absence of high data quality precludes effective use of new systems.
• High data quality avoids the compounding effect of data contamination.
What is high-quality data? High-quality data must (three of these checks are sketched in code below):
1. Be addressable.
2. Have domain integrity.
3. Be accurate.
4. Be properly integrated to attain entity integrity.
5. Adhere to business rules.
6. Satisfy business needs.
7. Have integrity.
8. Be consistent.
9. Contain only intentional redundancy.
10. Be complete.
11. Have correct cardinality.
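A minimal sketch of three of these criteria (completeness, entity integrity, and a simple reference/cardinality check), assuming rows are plain Python dicts; the table names and rules are hypothetical.

def complete(rows, required_columns):
    """Completeness: every required column is populated in every row."""
    return all(row.get(col) not in (None, "") for row in rows for col in required_columns)

def entity_integrity(rows, key):
    """Entity integrity: the primary key is present and unique."""
    keys = [row.get(key) for row in rows]
    return None not in keys and len(keys) == len(set(keys))

def correct_cardinality(orders, customers):
    """Each order must reference exactly one existing customer."""
    ids = {c["cust_id"] for c in customers}
    return all(o["cust_id"] in ids for o in orders)

customers = [{"cust_id": 1, "name": "Acme"}, {"cust_id": 2, "name": "Globex"}]
orders = [{"order_id": 10, "cust_id": 1}, {"order_id": 11, "cust_id": 2}]
print(complete(customers, ["cust_id", "name"]),
      entity_integrity(customers, "cust_id"),
      correct_cardinality(orders, customers))   # True True True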
Data reengineering: a four-phase process to
attain high data quality
[Diagram: sources (external files, legacy applications, historical extracts, the customer information system) pass through the four phases (1. data investigation; 2. data conditioning and standardization; 3. data integration; 4. data survivorship and formatting) into the targets (data warehouses, client/server applications, EISs).]
• Data investigation: parsing, lexical analysis, pattern investigation, and data typing (a pattern-investigation sketch follows).
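Pattern investigation can be illustrated with a short sketch that collapses each raw value into a character-class signature. This is a generic illustration of the technique, not the algorithm of any particular cleansing tool, and the sample phone values are invented.

from collections import Counter

def pattern_signature(value: str) -> str:
    """Collapse a raw field value into a pattern: letters -> 'A', digits -> '9',
    other characters kept literally."""
    return "".join("A" if ch.isalpha() else "9" if ch.isdigit() else ch
                   for ch in value.strip())

# Profiling a column of phone numbers reveals how many formats it actually hides:
values = ["716-645-3180", "(716) 645 3180", "N/A", "6456789"]
print(Counter(pattern_signature(v) for v in values))
# Counter({'999-999-9999': 1, '(999) 999 9999': 1, 'A/A': 1, '9999999': 1})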
• Data integration
The integration phase identifies and consolidates related records lacking common keys through statistical matching techniques:
• Flexible construction of search keys to optimize machine resources.
• Flexible definition of match fields to increase the data points available for statistical analysis.
• Variable weights and penalties for each data point to take an organization’s business rules into account and produce scores related to match probability (a weighted-scoring sketch follows).
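A toy sketch of weighted matching, assuming two already-parsed records and hand-chosen weights and penalties; real matching engines derive these scores probabilistically, so this only shows the scoring idea, and the field names and numbers are invented.

def field_score(a: str, b: str, weight: float, penalty: float) -> float:
    """Reward agreement on a field, penalize disagreement, ignore missing values."""
    if not a or not b:
        return 0.0
    return weight if a.strip().lower() == b.strip().lower() else -penalty

def match_score(rec_a: dict, rec_b: dict, rules: dict) -> float:
    """Sum per-field scores; a higher total means a more probable match."""
    return sum(field_score(rec_a.get(f, ""), rec_b.get(f, ""), w, p)
               for f, (w, p) in rules.items())

# Weights and penalties encode business rules: agreement on tax ID counts far
# more than agreement on city, and a tax-ID disagreement is heavily penalized.
rules = {"tax_id": (9.0, 7.0), "last_name": (4.0, 2.0), "city": (1.0, 0.5)}
a = {"tax_id": "123-45-6789", "last_name": "Qi", "city": "Buffalo"}
b = {"tax_id": "123-45-6789", "last_name": "QI ", "city": "Amherst"}
print(match_score(a, b, rules))   # 9.0 + 4.0 - 0.5 = 12.5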
Why are uniform data access times optimal for parallel query execution?
• Algorithmic parallelism is achieved using a paradigm similar to the division of labor.
• The material (data) must be evenly distributed among the personnel (CPUs), or the effect of parallelism is lost.
Symmetric multiprocessor (SMP)
• A classical SMP is a tightly coupled connection model where all components connected to a single bus are equidistant.
• Disadvantage: very short buses limit scalability.
[Diagram: CPUs, shared memory, and disks all attach to a shared symmetric system bus, so each component is one hop from every other.]
Loosely Coupled Architectures
• The 2-D mesh
It has a connection density of 4, in that each node is attached to
at most four of its neighbors.
[Diagram: a 2-D mesh of CPU/memory/disk nodes, each linked to its immediate neighbors, with auxiliary disks attached to some nodes.]
• Crossbar Switch
Borrowed from telephony, this technology creates a direct,
point-to-point connection between every node, with only one
hop through the switch to get from one node to any other node.
[Diagram: every node attaches to a switch element, giving a direct connection between each node and every other node; all nodes are one hop away from one another, and the connections are typically bi-directional and non-blocking.]
A query parse tree example
SELECT * FROM Table_a ORDER BY Column_2
1. Parallel full table scan of Table_a (SELECT *)
2. Parallel sort on Column_2 (ORDER BY)
3. Concatenation merge of the sorted runs into the result
(A toy sketch of this plan follows.)
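A toy sketch of that plan, assuming Table_a is already partitioned into three in-memory fragments standing in for disks; each worker performs the parallel scan and sort on Column_2, and the coordinator merges the sorted runs. This only illustrates the operator tree, not how a real parallel DBMS executes it.

from heapq import merge
from multiprocessing import Pool

# Three fragments of Table_a, one per "disk"; each row is (Column_1, Column_2).
PARTITIONS = [
    [(3, "c"), (1, "a")],
    [(5, "e"), (2, "b")],
    [(4, "d"), (6, "f")],
]

def scan_and_sort(rows):
    """One parallel operator: full table scan of a fragment, then sort on Column_2."""
    return sorted(rows, key=lambda row: row[1])

if __name__ == "__main__":
    with Pool(len(PARTITIONS)) as pool:
        sorted_runs = pool.map(scan_and_sort, PARTITIONS)       # parallel FTS + sort
    result = list(merge(*sorted_runs, key=lambda row: row[1]))  # merge of sorted runs
    print(result)   # rows ordered by Column_2: a, b, c, d, e, f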
The machine architecture is causing the data skew
[Diagram: on the 2-D mesh, full-table-scan operators FTS(1), FTS(2), and FTS(3) run on disk-attached nodes and feed the sort operators Sort(4), Sort(5), and Sort(6) on other nodes.]
• Sort(5) receives data from FTS(1) in two hops (there are 50 A's).
• Sort(5) receives data from FTS(2) in three hops (there are 100 A's).
• Sort(5) receives data from FTS(3) in one hop (there are 10 A's).
(The arithmetic is sketched below.)
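The skew can be made concrete with back-of-the-envelope arithmetic, assuming one transfer unit per row per hop (a simplification; real interconnects overlap transfers); the row and hop counts come from the mesh example above.

# (rows of 'A' data, hops to Sort(5)) for each scan operator in the mesh example
feeds_to_sort5 = {
    "FTS(1)": (50, 2),
    "FTS(2)": (100, 3),
    "FTS(3)": (10, 1),
}

for source, (rows, hops) in feeds_to_sort5.items():
    print(f"{source}: {rows} rows x {hops} hop(s) = {rows * hops} transfer units")
# FTS(1): 50 rows x 2 hop(s) = 100 transfer units
# FTS(2): 100 rows x 3 hop(s) = 300 transfer units
# FTS(3): 10 rows x 1 hop(s) = 10 transfer units
# Sort(5) cannot finish until its slowest, largest feed (FTS(2)) arrives, so
# unequal hop counts skew operator completion times even for balanced data.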
SMPs: No machine-architecture-induced data skew
[Diagram: on an SMP, FTS(1), FTS(2), FTS(3) and Sort(4), Sort(5), Sort(6) all run on CPUs attached to the shared system bus and exchange data through shared memory, so every feed is equally close.]
Crossbar switch connection models eliminate data skew
All nodes are directly connected to all other nodes via the crossbar switch. There is only one hop to and from any destination, guaranteeing uniform data access times.
[Diagram: CPU/memory nodes, each with its own disks, all attach to the crossbar switch.]
Refinements to the crossbar switch architecture
1. Share the disk drives using the switch, thus combining the virtues of shared-nothing and clustered architectures in one architecture.
2. Make the loosely coupled nodes symmetric multiprocessors.
[Diagram: SMP nodes and the shared disk drives all attach directly to the crossbar switch.]