CS-220 Data Warehousing, Fall 2014

Data Warehousing
• High quality information is vital to good
decision making
• Data is different from information
• Organizations have a lot of data, may not
have a lot of information
• → information gap
Data Warehousing
• Databases, in an organization, may have
been developed in a fragmented manner
over time → the data must be reorganized,
filtered, etc. before mining it
• Most systems are designed for operational
processing, not informational processing
Data Warehousing
• Operational processing: the processing of
daily data
• Informational processing: analysis of
summarized data in order to support
decision making
Data Warehousing
• Ex 1: WalMart: what to stock, when, ..
• Ex 2: Airlines: how much to overbook a
flight by (forecast no-shows, ..)
• Ex 3: Amazon: people who bought this
book also bought these books
Data Warehousing
• A lot of transactions are done via the web
today
• → fully automated, lots of data, real time
→ opportunities to mine the data
In this chapter
• How data warehouses relate to operational
systems
• 3-tier architecture for data warehousing
• Extracting data from operational systems
and loading them into a data warehouse
• Big data / NoSQL
• Data mining
Data Warehousing: Basic Concepts
• A data warehouse is:
– subject oriented (organized around key topics)
– integrated (formats, names, ..)
– time variant (time dimension → trends, ..)
– non-updateable (loaded by operational systems,
not by end users)
Data Warehousing: Brief History
• Improvements in database technology
• Advances in computer hardware (cpu,
memory, storage, ..)
• End user computing
• Advances in middleware products
(networks, internet, ..)
Need for Data Warehousing
• Businesses require an integrated, company-wide view of high-quality information
• IS department must separate informational
from operational systems to improve
performance in managing company data
Need for a company-wide view
• Data scattered around company
– Different formats
– Different names/synonyms
– Free data vs structured data
– Inconsistent data (?)
– Missing data
Operational vs informational
systems
• Operational system: designed to manage
day to day operations; reservation systems,
sales transactions, ..
• Informational systems: designed to support
decision making; trends, data mining,
forecasting, planning, ..
Informational systems
• A data warehouse is centralized
• A data warehouse adds value to data by
improving its quality and consistency
• A data warehouse eliminates the
contention/bottleneck for resource use
Data Warehouse Architectures
• Independent data marts
• Dependent data marts
• Logical data marts
Independent data marts
• Figure 9-2: 4 steps from left to right
• Data is extracted from various sources
• Data is transformed and integrated
• Data warehouse = collection of data marts;
contains both summary and detailed data
• Tools for users
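The extract and transform steps above can be sketched in a few lines of Python. This is only an illustration, not the process shown in the figure: the two source systems, their field names, and the `STATE_CODES` lookup are all hypothetical. It shows how differently named and formatted source data can be brought into one common format before loading.

```python
# Hypothetical source records: two operational systems that describe the
# same kind of data with different field names and formats.
sales_system = [
    {"cust_id": "17", "state": "MD", "amount": "19.99"},
    {"cust_id": "42", "state": "CA", "amount": "5.00"},
]
web_system = [
    {"customerNumber": 17, "st": "Maryland", "total_cents": 2500},
    {"customerNumber": 99, "st": None, "total_cents": 1000},
]

# Hypothetical lookup used to reconcile state names with state codes.
STATE_CODES = {"Maryland": "MD", "California": "CA", "New York": "NY"}


def transform_sales(row):
    """Convert a sales-system row to the common (reconciled) format."""
    return {
        "customer_id": int(row["cust_id"]),
        "state": row["state"],
        "amount": float(row["amount"]),
    }


def transform_web(row):
    """Convert a web-system row to the common (reconciled) format."""
    return {
        "customer_id": row["customerNumber"],
        # Full state names become two-letter codes; missing data stays None.
        "state": STATE_CODES.get(row["st"]),
        "amount": row["total_cents"] / 100,
    }


def extract_transform_load():
    """Extract from both sources, transform, and load into one list."""
    warehouse = []
    warehouse.extend(transform_sales(r) for r in sales_system)
    warehouse.extend(transform_web(r) for r in web_system)
    return warehouse


warehouse = extract_transform_load()
for row in warehouse:
    print(row)
```

Note that the missing state in the last web record is loaded as `None` rather than guessed; a real process would need an explicit rule for missing data.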
Data Mart
• A data warehouse that is limited in scope,
whose data is generated from either a data
warehouse or other source data systems
• Independent data mart = data mart filled
with data extracted from the operational
environment, not from a data warehouse
Dependent data marts
• Independent data marts have some
limitations:
• Need separate process for each data mart
• Possible consistency issues
• Limited drill down capabilities
• Scaling costs can be high
Dependent data mart
• A data mart filled exclusively from an
enterprise data warehouse and its reconciled
data
• EDW = Enterprise Data Warehouse = a
centralized, integrated data warehouse that
is the entry point and single source of data
for decision support applications
Dependent data marts
• Figure 9-3: 4 steps from left to right
• Data is extracted from various sources
• Data is transformed and integrated
• EDW and data marts
• Tools for users, pulling data from EDW
and/or data marts
Logical data marts and Real time
Data Warehousing
• Figure 9-4: 3 steps from left to right
• Data is extracted from various sources
• Data is transformed and integrated and fed
real-time to data marts
• Tools for users, pulling data from EDW
and/or data marts
Real time Data Warehouse
• An Enterprise Data Warehouse that accepts
near-real-time feeds of transactional data
from the systems of record, analyzes
warehouse data, and in near real-time relays
business rules to the data warehouse
Real time Data Warehouse
• Enable real time data analysis and response
• Capture customer data at time of event
• Analyze customer behavior and possibly
predict customer response
• Develop rules for optimizing customer
interaction
• Take immediate action with customer
Real time Data Warehouse
• E-commerce, abandoned shopping cart can
trigger instant promotional email message
• Fraud detection in credit card transactions
(unusual pattern detected) → triggers call to
credit card owner
Three-Layer Data Architecture
• Figure 9-5:
• Operational data (Operational systems)
• Reconciled data (EDW)
• Derived data (Data marts)
• Note: each layer of data has metadata
associated with it (describes the data, ..)
Data warehouse data
• Status vs event data
• Figure 9-6
• Status data = data before and after event
• Event data = data of the event
Data warehouse data
• Transient vs periodic data
• Figure 9-7
• Transient data = can be overwritten → less
data needed, but less information
• Periodic data = never altered → more data
needed, but more information
Transient vs periodic data
• Likes of a post
• Do we just store the number of likes
(transient)? Can we prevent somebody from
voting twice? Probably not
• Do we store every like (periodic)? →
aggregate to get total
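A minimal sketch of the two choices for the likes example, assuming a made-up post id, user names, and days. The transient version keeps only a counter that is overwritten; the periodic version appends one record per like and derives the total (and more) by aggregation.

```python
from collections import Counter

# Transient: only the current total is stored; each like overwrites it.
transient_likes = {"post1": 0}


def like_transient(post_id):
    transient_likes[post_id] += 1


# Periodic: every like is appended as its own record and never altered.
periodic_likes = []


def like_periodic(post_id, user, day):
    periodic_likes.append({"post": post_id, "user": user, "day": day})


# Hypothetical events: "ann" likes the same post twice.
events = [("post1", "ann", "Mon"), ("post1", "bob", "Mon"), ("post1", "ann", "Tue")]
for post_id, user, day in events:
    like_transient(post_id)
    like_periodic(post_id, user, day)

# Aggregating the periodic records gives the total, plus information
# the counter alone cannot provide.
total = sum(1 for r in periodic_likes if r["post"] == "post1")
distinct_users = {r["user"] for r in periodic_likes if r["post"] == "post1"}
likes_per_day = Counter(r["day"] for r in periodic_likes)

print(transient_likes["post1"], total, len(distinct_users), dict(likes_per_day))
```

With only the counter, the repeated like from the same user cannot be detected; the periodic records make it visible, at the cost of storing one row per like.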
The Derived Data Layer
• Data layer associated with physical and
logical data marts (from figure 9.5)
• Users interact with this layer for their
decision support applications
Characteristics of Derived Data
• The source of derived data is the reconciled
data
• Typically, derived data is aggregated (per
topic, user group, ..) → it looks more like
information than data
The Star Schema
• A simple database design in which
dimensional data are separated from fact or
event data. A dimensional model is another
name for the star schema.
The Star Schema
• Figure 9.9: a star schema includes:
– Dimension Tables (descriptive data)
– Fact table (actual data)
• ER model looks like a star
The Star Schema
• Figure 9.10: example of a star schema:
– Dimension Tables: product, period, store
– Fact table: sales
• Primary key in fact table (sales) is a
combination of all primary keys in the
dimension tables
The Star Schema
• A row of the product table contains
information for a particular product
• A row of the period table contains
information about a unit of time
• A row of the store table contains
information about a store
The Star Schema
• A row of the sales table contains
information about the sales of a particular
product in a particular store during a
particular unit of time: units sold, dollars
sold, and dollars cost
• Example with actual data in figure 9.11
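The product / period / store / sales example can be tried out with Python's built-in `sqlite3` module. This is a sketch only: the table names and the three measures come from the slides, but the column names and the sample rows are assumptions, not the data from the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (descriptive data); column names are hypothetical.
CREATE TABLE product (product_code INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE period  (period_code  INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE store   (store_code   INTEGER PRIMARY KEY, city TEXT);

-- Fact table: its primary key combines the keys of all dimension tables.
CREATE TABLE sales (
    product_code INTEGER REFERENCES product(product_code),
    period_code  INTEGER REFERENCES period(period_code),
    store_code   INTEGER REFERENCES store(store_code),
    units_sold   INTEGER,
    dollars_sold REAL,
    dollars_cost REAL,
    PRIMARY KEY (product_code, period_code, store_code)
);

-- Made-up sample rows.
INSERT INTO product VALUES (1, 'Notebook'), (2, 'Pen');
INSERT INTO period  VALUES (1, 2014, 9), (2, 2014, 10);
INSERT INTO store   VALUES (1, 'Baltimore'), (2, 'Boston');
INSERT INTO sales VALUES
    (1, 1, 1, 10, 50.0, 30.0),
    (1, 2, 1,  5, 25.0, 15.0),
    (2, 1, 2, 20, 20.0,  8.0);
""")

# Typical star-schema query: join the fact table to a dimension table
# and aggregate the measures.
rows = conn.execute("""
    SELECT p.description, SUM(s.units_sold), SUM(s.dollars_sold)
    FROM sales s
    JOIN product p ON p.product_code = s.product_code
    GROUP BY p.description
    ORDER BY p.description
""").fetchall()
print(rows)
```

Joining `sales` to `period` or `store` instead gives totals per unit of time or per store in the same way.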
Fact table
• Can be big: example Sears
• Assumption: 1000 stores, 10000 products, 5
years of daily data
• → up to 1000 × 10000 × (5 × 365) = 18.25 billion rows
Grain of the Fact table
• = level of detail of the fact table =
intersection of primary keys of the
dimension tables
• The finer the grain, the bigger the table
• Tradeoff between size and information
quality/detail
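To see the size side of this tradeoff, the row count from the Sears example (1000 stores, 10000 products, 5 years) can be recomputed for different time grains. This is a rough upper bound that assumes every product sells in every store in every period; the function name is made up for illustration.

```python
STORES = 1000
PRODUCTS = 10_000
YEARS = 5


def max_fact_rows(periods_per_year):
    """Upper bound on fact table rows for a given time grain."""
    return STORES * PRODUCTS * YEARS * periods_per_year


daily = max_fact_rows(365)    # one row per product, store, and day
weekly = max_fact_rows(52)    # one row per product, store, and week
monthly = max_fact_rows(12)   # one row per product, store, and month

print(f"daily:   {daily:,}")
print(f"weekly:  {weekly:,}")
print(f"monthly: {monthly:,}")
```

A monthly grain needs about 30 times fewer rows than a daily grain, but questions about individual days can no longer be answered.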
Fact table Variations
• If we are interested in storing events, we
can have a “factless” fact table
• → it contains foreign key values only; no
extra data
• Example figure 9.14: an event took place at
that time, between these people, at that
facility, on that topic (course)
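A factless fact table along these lines can be sketched with `sqlite3`. The table name `attendance`, its column names, and the sample rows are assumptions rather than the contents of the figure, and the dimension tables are left out to keep the sketch short, so the key columns are plain integers here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Factless fact table: only key columns, no measures.
-- Each row records that an event took place.
CREATE TABLE attendance (
    time_key      INTEGER NOT NULL,
    student_id    INTEGER NOT NULL,
    instructor_id INTEGER NOT NULL,
    facility_id   INTEGER NOT NULL,
    course_id     INTEGER NOT NULL,
    PRIMARY KEY (time_key, student_id, instructor_id, facility_id, course_id)
);

-- Made-up sample rows.
INSERT INTO attendance VALUES
    (20140902, 1, 7, 3, 220),
    (20140902, 2, 7, 3, 220),
    (20140903, 1, 8, 4, 110);
""")

# With no measures to sum, analysis is done by counting rows.
attendance_per_course = conn.execute("""
    SELECT course_id, COUNT(*)
    FROM attendance
    GROUP BY course_id
    ORDER BY course_id
""").fetchall()
print(attendance_per_course)
```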
BIG DATA
• A database whose size strains the ability of
RDBMS to capture, manage, and process
data efficiently → scalability issues
• Large scale data gathering and analytics
(particularly in web applications) → big
data → challenging for traditional relational
database management systems
BIG DATA
• NewSQL → make SQL-based, relational
databases more scalable
• NoSQL → alternative databases to SQL
NoSQL databases
• NoSQL stands for Not Only SQL
• Less functionality than SQL databases
• Often higher performance (fast) and more
scalable than SQL databases
NoSQL
• Non relational
• No schema
• Scalable
• Hash tables and similar data structures
NoSQL
• No joins
• No constraints
• No complex transactions (do this but if that,
rollback the transaction)
NoSQL Data Structures
• Hash tables: Key, Value pairs
• Key, Document pairs (can retrieve the
document via its key, then query the
document)
• Graph
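The key, document idea can be imitated with a Python dict that maps keys to JSON documents. This is a toy stand-in, not the API of any particular NoSQL product; the `put` and `get` helpers and the document fields are made up for illustration.

```python
import json

# Toy document store: key -> JSON text. No schema is declared anywhere.
store = {}


def put(key, document):
    """Save a document (any JSON-serializable structure) under a key."""
    store[key] = json.dumps(document)


def get(key):
    """Retrieve the document stored under a key."""
    return json.loads(store[key])


put("post:1", {
    "author": "ann",
    "text": "Hello",
    "replies": [
        {"author": "bob", "text": "Hi"},
        {"author": "cy", "text": "Hey"},
    ],
})

# Step 1: retrieve the document via its key.
doc = get("post:1")
# Step 2: query inside the document.
reply_authors = [r["author"] for r in doc["replies"]]
print(reply_authors)
```

The replies live inside the document itself, so no join is needed to read a post together with its replies.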
Hash tables
• Key, Value pairs
• MD → Maryland, CA → California, NY →
New York, ..
• Jan → January, Feb → February, ..
• originalPostId → the whole thread for this
original post (this value contains a lot of
data)
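Python's built-in `dict` is a hash table, so the examples above can be written down directly. The post id and the thread contents below are made up; the point is that a value can be a short string or a large structure, and that lookup goes straight from key to value.

```python
# Small values: a code maps to a name.
states = {"MD": "Maryland", "CA": "California", "NY": "New York"}
months = {"Jan": "January", "Feb": "February"}

# Large value: one key maps to a whole thread (hypothetical content).
threads = {
    "post-1001": [
        "Original question",
        "First reply",
        "Second reply",
    ],
}

# Lookup by key; no scan of the other entries and no join is involved.
print(states["MD"])
print(months["Feb"])
print(len(threads["post-1001"]))

# A missing key has no value; dict.get returns None instead of raising.
print(states.get("TX"))
```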