Data management

Download Report

Transcript Data management

Lecture 6
Data management
Technology of information systems
Contents
• Different types of data
• Data processing
• Data warehouse systems
Technology of information systems
Different types of data (1)
• In an enterprise we distinguish the following
business processes:
– Primary processes: delivery of products and services
– Secondary processes: taking care of resources
– Tertiary processes: management over the other
processes, i.e. planning, control and coordination.
• Different terms:
– Primary processes: production processes
– Secondary processes: support processes
– Tertiary processes: management processes
– Operational processes: primary + secundary
• Each of these type of processes generates its
own data and has different information needs
Technology of information systems
Different types of data (2)
• Primary processes generate information about:
– Orders generated by
• A client, in case of Make-To-Order
• The (master) production plan in case of Make-To-Stock
– Execution of tasks for the processing of the orders, such as:
• When the task was performed
• Which durable resources were involved
• How much or which consumable resources were involved
• Primary processes need information about the clients, the
(master) production plan and the availablity of resources.
Note: For us “information” and ”data” are synonyms; others use data for basic
facts and information for transformed facts that have a meaning to people.
Technology of information systems
Different types of data (3)
• Secondary processes generate information
about:
– Status of resources:
•
•
•
•
Quality
Availablity
Maintenance
Purchase and removal
– Use of resources in primary business processes
• They need information about:
– Need for resources in the future
– Availability of resources in the market (external info)
Technology of information systems
Different types of data (4)
• Tertiary processes generate periodically or on request for
external stakeholders (e.g. share holders and government)
information about status and forecasts of, such as:
–
–
–
–
–
–
–
Financial resources
Turnover (earnings)
Costs (expenditures)
Profit
Efficiency
Energy consumption and pollution
Diversity
• For internal stakeholders (employees) they generate plans,
procedures and decisions.
• They need information about:
– The primary and secundary processes
Technology of information systems
Different types of data (5)
• Operational information (from primary and
secondary processes) is about actual facts
• The processing of operational information is called:
On Line Transaction Processing (OLTP)
• The management information is aggregated
historical facts.
• The processing of management information is
called:
On Line Analytical Processing (OLAP)
• Note that OLTP needs the actual data, while OLAP
can use “old data” (since aggregates change slowly)
Technology of information systems
Different types of data (6)
• Discovery of patterns in this data is called: Data
mining (process mining is a special subset)
• Management information is transformed into
– Forecasts
– Plans
– Decisions
• These transformations are performed by decision
support systems.
Technology of information systems
Different types of data (7)
Each kind of information has its own preferred data type.
This is not a “law”, but a “guideline”:
• Order-related information-------Document type
• Resources-related information----Relational type
• Management information--------Hyper-cube type
For each type we have specfic COTS:
• Document typeXML databases or dedicated document
managers
• Relational type relational database systems
• Hyper-cube type  data warehouse systems
Technology of information systems
Different types of data (8)
• For order-related data use the document type:
– Each case has its own document, which is normally a tree
structure.
– Only actual data, if the case is handled there is no use of the
information in the primary processes.
– An XML-database is a good candidate for this.
– Storing this type of data in a relational database means that all
case data is divided over several tables
– In the operations we seldom have to relate cases, we work on
one case at a time
– So retrieval of a case means a query over many tables and
taking out one or a few tuple(s) (tuple = record) and for updating
the case data has to be divided into small pieces and stored in
different tables.
– Disadvantage: redundancy, if a clients name changes, we have
to update all documents.
Technology of information systems
Different types of data (9)
• For resource-related data use the relational type:
– There are many relationships among resources and
between resources and orders.
– Resources (e.g. some stock item) have a status that can
be changed and they perform activities. Both are recorded
and they are related.
– For instance if one replenish order is made for a supplier,
the inventory of all resource types supplied by this supplier
are retrieved.
– Sometimes it is also case oriented data, e.g.
• Personnel files
• Maintenance of equipment
Technology of information systems
Different types of data (10)
• For management infomation use the hyper-cube
• Aggregation means that we are not interested in
simple facts but in statistics, i.e. quantification over
attributes of event types.
• So events have one or more attributes, some of
them are numerical. For instance a sales event
might have the following attributes:
– Sales date
– Sales person
– Client
– Product type
– Volume of the order
– Price of the order
Technology of information systems
Different types of data (11)
• Management is not interested in this fact but wants
to know for instance:
What is the total value (or volume) of the orders:
– Of a certain product type
– In a particular period
– In a particular region
• Dataware house terminology:
– Events with attributes are called facts
– The different attributes are called dimensions,
typically: time, geography and organization structure
– The attributes with numerical values, used as statistics, are
called measures
Technology of information systems
Different types of data (12)
• The most important statistics are:
– Frequencies: How often did an event of a special type occur?
Often a histogram is a good presentation, but a pie chart will also
do.
– Totals: What is the sum of all values associated with these
events?
– Averages: What is the average (mean) value of an event type?
– Variances: What is the spread of a value of an event type?
(spread is square root of variance)
– Coefficient of variation = spread divided by mean value.
Give some usefull examples for each of them
Technology of information systems
Data processing(1)
There is a natural direction of data processing:
1. Operational processes generate actual facts and the
are collected in data manager (either document
managers or relational data bases).
2. As soon as the transactions are closed, the data is
not of interest anymore and it should be archived in
a historical data manager.
3. Concurrently with this archiving some useful
information is collected in a data warehouse for
management infomation (this will be elaborated)
4. However a data warehouse can also become full, so
we need a strategy to refresh a data warehouse as
well.
Technology of information systems
Data processing(2)
There is a natural direction of data processing:
New facts
Operational Data Managers
Old facts
Extract, Transform, Load
Old facts
Archive
Extracted and transformed
update
Technology of information systems
Data warehouse
Data processing (3)
Use ideas from statistics:
•
•
•
•
Sequence of measurements: X1,X2,X3,….
Assume n>1
Mean estimator: Mn = (i=1…n Xi)/n
So the update is:
Mn+1 = (n.Mn + Xn+1)/(n + 1)
• Variance estimator: Sn = (i=1…n Xi2 – n.Mn2)/(n - 1)
• The update is:
Sn+1 =((n - 1)/n).Sn + (Mn - Xn+1)2/(n + 1)
• So we have to memorize only: n, Mn, Sn (M0=S0=0)
Technology of information systems
Data processing (4)
Updating statistics:
• Sequence of measurements: X1,X2,X3,….
• The moving average (history length k):
Mn= (i=n-k+1..n Xi)/k
• Weighted moving average:
Mn= (i=n-k+1..n ai.Xi)/k
• Usual case: exponential smoothing (0<a<1)
Mn= (i=n-k+1..n a(n-i).Xi)/k
so Xn-k+1 obtains weight ak-1 and X1 weight 1
Technology of information systems
Data warehousing (1)
Three-Tier Architecture
other
Metadata
sources
Operational
Data Bases
Extract
Transform
Load
Refresh
OLAP
Server
Monitor
&
Integrator
Analysis
Query/Reporting
Operational Data Store
Serve
Data
Warehouse
Data Mining
ROLAP
Server
Data Marts
Data Sources
Data Storage
OLAP Engine Front-End Tools
Technology of information systems
Data warehousing (2)
About the architecture:
• ETL-tools take care of:
– Extraction: take the relevant data out of the Operational
Data Bases (ODB)
– Transformation:
• Cleansing: resolving inconsistencies
• Conversion: format unification
• Consolidation: different sources have different levels of detail
– Loading: uploading in the data warehouse
• Often a data warehouse has a Operational Data Store
(ODS). This is not an ODB but a intermediate layer
between ODB and the data ware house. It contains
detailed information, including history.
• Data marts: specialized subsets of a data warehouse for
a specific application domain.
Technology of information systems
Data warehousing (3)
•Multidimensional view on facts (events with attributes)
•In each dimension we may have an aggregation
hierarchy, e.g. second, minute, hour, day, week,...
Value or volume
sales
location
client
product
Technology of information systems
Data warehousing (4)
Data Cubes
2Qtr
3Qtr
4Qtr
sum
Ireland
France
Germany
sum
Technology of information systems
Country
TV
PC
VCR
sum
1Qtr
Date
Data warehousing (5)
Hierarchy provides a lattice of cubes
all
product
date
country
date, country
product, date
product, country
product, date, country
Technology of information systems
Datawarehousing (6)
• Result of a query is:
– One cell
– Two dimensional slice (or cross table)
– Multi dimensional sub cube
Product: color
Date:month,
JuneAugust
2006
Blue
Red
Orange
Total
June
51
25
158
234
July
58
20
120
198
August
65
22
51
138
Total
174
67
329
570
Technology of information systems
Data warehousing (7)
Cross tabulation
Dimensions
June
Blue
Red
Orange
Total
51
25
158
234
August
Data Points/
58 st
20
120
1 level of aggregation
65
22
51
Total
174
July
67
329
Aggregated
w.r.t.ofY-dim
Technology
information systems
198
Aggregated
w.r.t. X-dim
138
570
Aggregated
w.r.t. X and Y
Data warehousing (8)
Operations with Data Cubes (1)
• Roll up, aggregation
– store  city
– cities  country
– product  product type
• Drill down city  store
– country  cities
– product type  product
• Drill through: go back to the original, individual
data records
Technology of information systems
Data warehousing (9)
Operations with Data Cubes (2)
• Pivoting: change the dimensions that are
“displayed”; select a cross-tab.
– look at the cross-table for product-date
– display cross-table for date-customer
• Slice and dice: select a part of the cube by
restricting one or more dimensions
– restrict analysis to “product type = television” and
“region = France”
Technology of information systems
Data warehousing (10)
Implementation
Three categories:
– ROLAP (Relational OLAP)
• OLAP supported on top of a relational database
(SQL has special OLAP features since 1999)
– MOLAP (Multi-Dimensional OLAP)
• Use of special multi-dimensional data structures,
special index techniques for sparse matrices
– HOLAP: (Hybrid)
• combination of previous two
Technology of information systems
Data warehousing (11)
ROLAP
• Cubes can easily be represented in relational
tables: special value “all”
Month
Jan
Jan
Jan
Feb
Prod.
p1
p2
p1
p1
all
Jan
Jan
all
p1
all
p1
all
all
all
Cust.
c1
c1
c2
c1
…
c1
c1
all
c1
…
all
Price
10
8
10
9
102
18
1 230
4 235
1 253 458
Technology of information systems
Datawarehousing (12)
Data models for ROLAP data warehousing
• Star schema
• Snowflake schema: further normalization of
dimensions
• Extension: more stars (i.e. more facts!) with
common dimensions.
Technology of information systems
Datawarehousing (13)
Time
Example of a Star Schema
Geography
Time code
Quarter code
Quarter name
Month code
Month name
Sales
Date
Account
Account code
KeyAccount
code
Geography code
Region code
Region manager
State code
City code
Geography
code
Time code
Account code
KeyAccount
name
Product code
Account name
Value
Account type
Volume
Account market
Technology of information systems
Product
Product code
Product name
Brand code
Brand name
Product type code
Product type name
Datawarehousing (14)
Example of a Snowflake Schema
Quarter
Geography
Time
Quarter code
Quarter name
Geography code
Region code
Region manager
State code
City code
Time code
Quarter code
Month
Month codeDate
Month code
Month name
Account
Account code
KeyAccount
code
Sales
Geography
code
Time code
Product type
Product type code
Product type name
Brand
Brand code
Brand name
Account code
KeyAccount
name
Product code
Account name
Value
Account type
Volume
Product code
Product name
Brand code
Product type code
Account market
Technology of information systems
Product