Data Analysis

Download Report

Transcript Data Analysis

Data Analysis
OLTP and OLAP
Data Warehouse
SQL for Data Analysis
Data Mining
Bogdan Shishedjiev Data Analysis
1
Data Processing
• Data processing types
– OLTP (On Line Transaction Processing)
– OLAP (On-line Analytical Processing )
• Database types
– Transactional
• Numerous users
• Dynamic (flickering)
• Always maintaining current state
• Critical (very loaded)
– Warehouses
• Aa few users (analyzers)
• Relatively stable
• Maintaining the data history (all states in the time)
• Not loaded
Bogdan Shishedjiev Data Analysis
2
Architecture
Bogdan Shishedjiev Data Analysis
3
Architecture
• Source compnents
– Filter – separate and assure the coherency of data to be
exported
– Export – do the transfer of data portions in precise
moments of time.
• Warehouse components
–
–
–
–
–
Loader – Initial loading and preparing the warehouse.
Refresh – loads the portions
достъп
data mining
Export – to other warehouses. This creates an hierarchy
of warehouses
Bogdan Shishedjiev Data Analysis
4
Relational Schemes
• Star
Bogdan Shishedjiev Data Analysis
5
Relational Schemes
• Snowflake
Bogdan Shishedjiev Data Analysis
6
Data Warehouse Design
•
Stages
– Choose the activity processes to model
– Choose the granularity of the activity procesus
– Choose the dimensions that can be applied to every
record of the fact table.
– Choose the facts that must be recorded in the fact
table
Bogdan Shishedjiev Data Analysis
7
Data Warehouse Design
• Fact types – the most valuable are numerical
continuous values
– Additive – they can be added along all dimensions
(Money amounts)
– Semi-additive – they can be added along some of
dimensions (Precipitations, Product quantities)
– Non-additive – they cannot be added along any
dimension (Wind speed, wind direction)
Bogdan Shishedjiev Data Analysis
8
Data Warehouse Design
• Recommendations
–
–
–
–
Use continuous additive numerical values
The fact table is highly normalized
Don’t normalize the dimensions. The gain is < 1%
Design thoroughly the dimension attributes. Most often
they are textual and discrete. They are used as headings
and constraint sources in the answers to users
Bogdan Shishedjiev Data Analysis
9
Dimension time
time_key
day_of_ week
day_no_in_month
day_no_overall
week_no_ln_year
week_no_overall
month
month_no_overall
quarter
fiscal_period
holiday_flag
weekday_flag
last_day_in_month_flag
season
event
Dimension shop
stoie_key
store_ name
store_number
store_street_address
store_city
store_county
store_state
store_zip
store__manager
store_phone
store_FАX
floor_plan_ type
photo_processing_type
finance_services_type
first_opened_date
last_remodel_date
store_surface
grocery_surface
frozen-surface ..and others
Example –
Hypermarket Chain
Facts - Sales
Time_key
product_key
Store_key
promotion
key
dollar_sales
units_sales
dollar_cost
customer-count
Bogdan Shishedjiev Data Analysis
Dimension product
product_key
SKU(stock keeping units )_description
SKU_number
package_size
brand subcategory
category
departement
package_ type
diet_type
weight
weight_unit_of_mesure
units_per_retail_case
units_per_shipping_case
cases_per_pallet
shelf_width
shelf_height
shelf_depth
..and others
Promotion
promotion_key
promotion_name
price__reduction_ type
ad_type
dlsplay_ type
coupon_type
ad_media_name
display_provider
promo_cost
promo_begin_date
promo_end-date
..and others
10
Hypermarket Chain
• Fact table
– Granularity – each sell of a product (SKU – Stock
Keeping Unit)
– Values – Total cost (additive), SKU quantity (semiadditive), price (non-additive), customer count (non
additive)
• Dimensions
–
–
–
–
Time
Product
Store
Promotion
Bogdan Shishedjiev Data Analysis
11
Hypermarket Chain
• Calculation of disk space needed
– Dimension time : 2 years x 365 days = 730 days
– Dimension shop : 300 shops, everyday records
– Dimension product : 30.000 products in each shop;
3000 are sold every day in each shop.
– Dimension promotion : An article can participate in
only one promotion in a shop during one day.
– Elementary fact records 300 x 730 x 3000 x 1 = 657
.106 records
– Key field number 4; Value field number 4 ; Total
number osf fields =8
– Fact table size - 657 .106 x 8 fields x 4B = 21 GB
Bogdan Shishedjiev Data Analysis
12
Data Operations for Data Analysis
• General form of a SQL statement
select D1.C1, ... Dn.Cn, Aggr1(F,Cl),…,
Aggrn(F,Cn)
from Fact as F, Dimension1 as D1,...
DimensionN as Dn
where join-condition (F, D1)
and...
and join-condition (F, Dn)
and selection-condition
group by D1.C1, ... Dn.Cn
order by D1.C1, ... Dn.C
Bogdan Shishedjiev Data Analysis
13
Data Operations for Data Analysis
• Example
select Time.Month, Product.Name, sum(Qty)
from Sale, Time, Product, Promotion
where Sale.TimeCode = Time.TimeCode
and Sale.ProductCode = Product.ProductCode
and Sale.PromoCode = Promotion.PromoCode
and (Product. Name = ' Pasta' or Product.Name = 'Oil')
and Time.Month between 'Feb' and 'Apr'
and Promotion.Name = 'SuperSaver'
group by Time.Month, Product.Name
order by Time.Month, Product.Name
pivot Time.Month
Feb
Mar
Apr
Oil
5K
5K
7K
Pasta
45K
50K
51K
Bogdan Shishedjiev Data Analysis
14
Data Cube
The cube is used to represent data along some measure of interest.
Although called a "cube", it can be 2-dimensional, 3-dimensional, or
higher-dimensional. Each dimension represents some attribute in the
database and the cells in the data cube represent the measure of
interest.
Bogdan Shishedjiev Data Analysis
15
Data Cube
• Data cube representation
Combination
Count
{P1, Calgary, Vance}
2
{P2, Calgary, Vance}
4
{P3, Calgary, Vance}
1
{P1, Toronto, Vance}
5
{P3, Toronto, Vance}
8
{P5, Toronto, Vance}
2
{P5, Montreal, Vance}
5
{P1, Vancouver, Bob}
3
{P3, Vancouver, Bob}
Combination
Count
{P3, Vancouver, Richard}
9
{P4, Vancouver, Richard}
2
{P5, Vancouver, Richard}
9
{P1, Calgary, Richard}
2
{P2, Calgary, Richard}
1
{P3, Calgary, Richard}
4
5
{P2, Calgary, Allison}
2
{P5, Vancouver, Bob}
1
{P3, Calgary, Allison}
1
{P1, Montreal, Bob}
3
{P1, Toronto, Allison}
2
{P3, Montreal, Bob}
8
{P2, Toronto, Allison}
3
{P4, Montreal, Bob}
7
{P3, Toronto, Allison}
6
{P5, Montreal, Bob}
3
{P4, Toronto, Allison}
2
{P2, Vancouver, Richard}
11
Bogdan Shishedjiev Data Analysis
16
Data Cube
• Totals - the value ANY or ALL or NULL
Bogdan Shishedjiev Data Analysis
17
Data Cube
• Drill down – adding a dimension for more detailed results
Time. Month
Product.Name
sum(Qty)
Feb
Pasta
Mar
Apr
TIme.Monih
Product.Name
Zone
sum(Qty)
48K
Feb
Pasta
North
18K
Pasta
50K
Feb
Pasta
Centre
18K
Pasta
51K
Feb
Pasta
South
12K
Mar
Pasta
North
18K
Mar
Pasta
Centre
18K
Mar
Pasta
South
14K
Apr
Pasta
North
18K
Apr
Pasta
Centre
17K
Apr
Pasta
South
16K
Bogdan Shishedjiev Data Analysis
18
Data Cube
• Roll-up - removing dimension
TIme.Monih
Product.Name
Zone
sum(Qty)
Feb
Pasta
North
18K
Feb
Pasta
Centre
18K
Feb
Pasta
South
12K
Mar
Pasta
North
18K
Mar
Pasta
Centre
18K
Mar
Pasta
South
14K
Apr
Pasta
North
18K
Apr
Pasta
Centre
17K
Apr
Pasta
South
16K
Product.Name
Zone
sum(Qty)
Pasta
North
54K
Pasta
Centre
53K
Pasta
South
42K
Bogdan Shishedjiev Data Analysis
19
Data Cube
•The whole data cube
TIme.Monih
Feb
Feb
Feb
Mar
Mar
Mar
Apr
Apr
Apr
ALL
ALL
ALL
Feb
Mar
Apr
ALL
ALL
Product.Name
Pasta
Pasta
Pasta
Pasta
Pasta
Pasta
Pasta
Pasta
Pasta
Pasta
Pasta
Pasta
Pasta
Pasta
Pasta
Pasta
ALL
Zone
North
Centre
South
North
Centre
South
North
Centre
South
North
Centre
South
ALL
ALL
ALL
ALL
ALL
Bogdan Shishedjiev Data Analysis
sum(Qty)
18K
18K
12K
18K
18K
14K
18K
17K
16K
54K
53K
42K
48K
50K
51K
149K
149K
20