Infrastructure Issues for a Data Warehouse
Infrastructure for Data Warehouses
Basics Of Data Access
[Diagram: bus structure connecting machine memory, cache, buffers, and the data store]
Basics Of Data Access: Storage
All of the data on a single disk shares one controller.
Striping data randomly across several disks reduces
contention for controller time.
Databases requiring 100% uptime use striping or
mirroring to facilitate backup and maintenance.
Backups can be written from one copy while
processing proceeds with the other one.
Striping, particularly in a RAID environment, permits
replacement of failed hardware without bringing
down the database.
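A minimal sketch in Python (hypothetical, not drawn from Acxiom systems) of the striping idea: blocks are spread across several disks so consecutive requests land on different controllers instead of queuing on one. Round-robin placement is used here for simplicity; real systems may stripe randomly or through RAID.

    # Assign logical blocks to disks round-robin; each disk has its own
    # controller, so a sequential read fans out across all of them.
    def stripe(blocks, num_disks):
        layout = {disk: [] for disk in range(num_disks)}
        for i, block in enumerate(blocks):
            layout[i % num_disks].append(block)
        return layout

    print(stripe(list(range(10)), 4))
    # {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6], 3: [3, 7]}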
Basics Of Data Access: Retrieval
The speed of processing a given retrieval is
primarily governed by the number of disk accesses
required to execute it.
Data is transferred to and from the disk in buffer-sized units. On large systems these buffers (blocks) can be set by the code; on PCs the buffer sizes (sectors) are fixed.
A block may contain several records. If all of the records in a block can be processed before another retrieval is needed, processing is faster.
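A back-of-the-envelope sketch in Python (the record and block sizes are made-up values) of why blocking matters: when many records fit in one block, one physical read services all of them.

    import math

    # Rough count of physical reads needed to scan a table sequentially.
    def disk_reads(num_records, record_bytes, block_bytes):
        records_per_block = max(1, block_bytes // record_bytes)
        return math.ceil(num_records / records_per_block)

    # 1,000,000 records of 200 bytes with 8 KB blocks:
    # 40 records per block, so about 25,000 reads instead of 1,000,000.
    print(disk_reads(1_000_000, 200, 8192))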
Basics Of Data Access: Busses
A bus transfers data from device to device. In single
systems the bus is internal. In distributed systems
the network acts as the bus.
Busses transfer data in units of a word. Normally a word is smaller than a buffer unit, so a transfer takes several bus cycles. (On networks, packets play the same role as words on a backplane bus.)
A bus can service only one device at a time, so multiple devices on the same bus can generate bus contention.
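A small illustration (hypothetical sizes) of why a buffer takes several bus cycles: the bus moves one word per cycle, so the cycle count is roughly the buffer size divided by the word size.

    import math

    # Cycles needed to move one buffer across a bus carrying one word per cycle.
    def bus_cycles(buffer_bytes, word_bytes):
        return math.ceil(buffer_bytes / word_bytes)

    # An 8 KB buffer over an 8-byte (64-bit) bus takes about 1,024 cycles.
    print(bus_cycles(8192, 8))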
Basics Of Data Access: Cache
A cache is a high-speed storage location that holds the most recently used data transferred between units in a system.
A cache speeds up processing by exploiting the data reuse (looping) typical of most programs, reducing the number of physical DASD accesses required.
Memory cache (as opposed to CPU cache) is
a location in main memory and can be set by
the system administrator.
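A minimal sketch of the caching idea in Python (an LRU policy is assumed here purely for illustration): looping reuse means most reads are served from the cache rather than from DASD.

    from collections import OrderedDict

    # Tiny LRU block cache: hits are served from memory, misses count as
    # physical DASD accesses.
    class LRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.blocks = OrderedDict()
            self.physical_reads = 0

        def read(self, block_id):
            if block_id in self.blocks:
                self.blocks.move_to_end(block_id)    # hit: refresh recency
            else:
                self.physical_reads += 1             # miss: go to DASD
                self.blocks[block_id] = True
                if len(self.blocks) > self.capacity:
                    self.blocks.popitem(last=False)  # evict least recently used

    cache = LRUCache(capacity=4)
    for block in [1, 2, 3, 1, 2, 3, 1, 2, 3]:        # looping reuse of 3 blocks
        cache.read(block)
    print(cache.physical_reads)                       # 3 physical reads instead of 9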
Program Characteristics
Transaction Systems
• Access few records at a time.
• Require records from random locations.
• Update and modify data frequently.
Data Warehouse Systems
• Access many records at a time.
• Require records in order.
• Update and modify data infrequently.
System Tuning
Transaction Systems
• Small buffers
• Large cache
• Fast busses
Data Warehouse Systems
• Large buffers
• Small cache
• Wide busses
Acxiom Overview
Acxiom creates and delivers Customer
and Information Management Solutions
that enable many of the largest, most
respected companies in the world to build
great relationships with their customers.
Acxiom achieves this by blending data,
technology and services to provide the
most advanced customer information
infrastructure available in the marketplace
today.
Data Warehouses
The characteristics of an Acxiom data warehouse generally are...
• Large multi-terabyte databases
• Large periodic sequential data loads
• Denormalized database schema
• Sequential reads/full table scans
• Little or no indices
• Little or no transaction logging
• Robust periodic backup solutions
• Performance measured using megabytes/gigabytes per second (MBPS, GBPS); see the scan-time sketch below
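As a rough illustration of why throughput (MBPS/GBPS) is the relevant metric, this Python sketch estimates the time to scan a table sequentially; the table size and throughput figures are made-up values, not Acxiom measurements.

    # Sequential scan time is dominated by raw throughput.
    def scan_minutes(table_terabytes, gbytes_per_second):
        seconds = (table_terabytes * 1024) / gbytes_per_second
        return seconds / 60

    # A 5 TB full table scan at a sustained 2 GB/s takes roughly 43 minutes.
    print(round(scan_minutes(5, 2.0), 1))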
Data Warehouses
The processing platform is generally a large global-class server or cluster of servers running UNIX.
The database is:
• A large vertical database, denormalized with few tables but very long, with sorted data and sometimes several billion rows.
• Data striped across the storage in a manner that prevents physical hot spots and takes advantage of the wide bandwidth.
The storage subsystem is very fast, with wide bandwidth and high levels of redundancy, which makes it possible to move large amounts of sequential data in a very short time.
Transactional Databases
The characteristics of an Acxiom transactional database generally are...
• Small, usually no larger than a few terabytes
• Random and simultaneous inserts, updates, deletes, and queries
• Random reads and writes
• Normalized database schema
• Transaction logging and archiving with incremental and periodic backup solutions
• Generally sub-second response required per transaction, taking concurrency into account
• Performance measured using transactions per second (TPS) and I/O latency; see the concurrency sketch below
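As a rough guide to how TPS, latency, and concurrency interact, the sketch below applies Little's Law (concurrency ≈ throughput × latency); the numbers are illustrative, not Acxiom figures.

    # Little's Law: transactions in flight ≈ TPS × average latency.
    def required_concurrency(tps, latency_seconds):
        return tps * latency_seconds

    # Sustaining 2,000 TPS at 50 ms per transaction means roughly
    # 100 transactions are in flight at any moment.
    print(required_concurrency(2000, 0.05))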
Transactional Databases
The processing platform is generally a medium/enterprise-class server.
The database is:
• A normalized database that utilizes lookup tables.
• Data stored randomly within a table but striped across the storage to prevent physical hot spots.
The storage subsystem is very fast, with low latency, nominal bandwidth, and high levels of redundancy, which makes it possible to move small amounts of selected data quickly.
Hybrid Databases
The characteristics of an Acxiom hybrid database generally are...
• Medium-sized, usually three to ten terabytes
• Random and simultaneous inserts, updates, deletes, and queries
• Random and sequential reads and writes
• Loosely normalized database schema
• Indices used sparingly
• Usually a batch maintenance process
• Transaction logging and archiving with incremental and periodic backup solutions
• Generally sub-second response required per transaction, taking concurrency into account
• Performance measured using TPS, I/O latency, and MBPS
Hybrid Databases
The processing platform is generally a medium-sized global-class server.
The database is:
• A large vertical database, loosely normalized with few tables but very long, with sorted data and sometimes more than a billion rows.
• Data striped across the storage in a manner that prevents physical hot spots and takes advantage of the wide bandwidth.
The storage subsystem is very fast, with wide bandwidth and high levels of redundancy, which makes it possible to move large amounts of random and sequential data in a very short time.
What’s New / Future Innovations
Grid or scale-out environments...
• Utilize low-cost commodity-based servers
• Low-cost/no-cost operating systems
• Many servers can work on one problem, with aggregate processing power exceeding that of one large server for less money (see the cost sketch below)
• Not locked into a single vendor or supplier
• When adding a new node, able to use current technology at a lower price
• Need to understand and factor in peripheral costs such as network, administration, data center, etc.
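A toy comparison of aggregate grid capacity versus one large server; all prices and throughput numbers below are invented purely for illustration.

    # Illustrative only: made-up prices and throughput figures.
    commodity_node = {"price": 10_000, "mbps": 400}
    large_server = {"price": 500_000, "mbps": 6_000}

    nodes = 20
    grid_mbps = nodes * commodity_node["mbps"]    # 8,000 MBPS aggregate
    grid_price = nodes * commodity_node["price"]  # 200,000 total

    # More aggregate throughput for less money, before factoring in the
    # peripheral costs (network, administration, data center) noted above.
    print(grid_mbps > large_server["mbps"], grid_price < large_server["price"])  # True True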
[Diagram: Parallel Grid and Clustered Grid configurations built from IBM pSeries servers, each node with its own OS and database]
Distributed Grid Database
• Shared-nothing environment: each partition has its own resources, allowing the database to scale out to as many as 999 partitions.
• Any partition can receive connections and distribute queries among the other nodes.
• Centralized management of the partitioned environment.
• Data is equally distributed across all partitions (see the partitioning sketch below).
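A minimal sketch in Python (hypothetical, not the actual partitioning scheme) of how hashing a key spreads rows evenly, so that data ends up equally distributed across partitions:

    import hashlib

    # Each key hashes to exactly one partition; a uniform hash keeps
    # the partitions balanced.
    def partition_for(key, num_partitions):
        digest = hashlib.md5(str(key).encode()).hexdigest()
        return int(digest, 16) % num_partitions

    counts = [0] * 8
    for customer_id in range(100_000):
        counts[partition_for(customer_id, 8)] += 1
    print(counts)   # roughly 12,500 rows in each of the 8 partitions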
Summary
Understand the process in which the database will be used and fashion a solution that meets the requirements and customer expectations.
Even though a DBA may only be responsible for the
database, many factors such as operating system and
hardware configuration affect the functionality of the
database and thus are a concern to the DBA. A DBA
must relate the database to its environment to achieve an
optimized solution.
A large multi-terabyte database is not a scary monster; it is the same as dealing with a smaller database, just with a few more zeros.