Transcript Slides

Liem Tran
Robert Turan
Miguel Delgado
Outline
I. Introduction to Data Warehousing
   I. Definition of Data Warehouse
   II. Data Warehouse Development
   III. Benefits and Advantages
   IV. Database vs Data Warehouse
   V. Data Marts
II. Data Warehouse Structure
   I. Star & Snowflake Schema
   II. Fact Table & Dimension Table
   III. OLTP
   IV. OLAP
   V. ROLAP
   VI. MOLAP
   VII. HOLAP
III. Real-Life Implementation of Data Warehouse
   I. Data Warehouse Providers
   II. What Data Warehouse Do Big Companies Use?
   III. Job Opportunities
What is a data warehouse?
“A Data Warehouse (DW) is simply a
consolidation of data from a variety of
sources that is designed to support
strategic and tactical decision making. Its
main purpose is to provide a coherent
picture of the business at a point in time.”
– data-warehouses.net
What led to DW development?
• Complex logic was required to join multiple tables, making
it difficult to get the desired information.
• Inefficient queries had to be run after normal operating
hours so as not to disrupt other transactions.
• Increased number of databases within a single business.
• Higher number of databases = inconsistent and
fragmented data!
• Lack of useful information needed for decision making.
Benefits and advantages
• Queries can be performed without impacting transactional
and operational support systems.
• Data is organized to support complex questions.
• Faster and more efficient retrieval of information gives the
user a better experience.
• A single source of truth for the data.
• Data is persistent.
• Data can be integrated from multiple sources.
Database vs Data Warehouse
Database                             Data Warehouse
• Current data                       • Historical data
• Online Transactional               • Online Analytical
  Processing (OLTP)                    Processing (OLAP)
• Detailed information               • Summarized information
• Relational view                    • Multidimensional view
  (columns and rows)                   (layers of columns and rows)
Data Marts
• Subsets of a data warehouse.
• Based on star or snowflake schema.
• Designed to meet the needs of a specific group in an
organization.
• Easier access to data that is needed.
• Shorter implementation time.
• Typically less than 100 GB of storage
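
As a rough sketch of how a mart can be carved out (using the illustrative factSales and dimDate tables from the sample query later in these slides; CREATE TABLE ... AS syntax varies by DBMS):

-- Hypothetical example: a sales mart for the marketing group, holding
-- only the slice of warehouse data that group needs.
CREATE TABLE mart_marketing_sales AS
SELECT f.DateKey, f.ProductKey, f.SalesAmount
FROM factSales AS f
JOIN dimDate AS d ON d.DateKey = f.DateKey
WHERE d.Year >= 2009;   -- recent history only, keeping the mart small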
ETL
• Extract – Data is extracted from one or more data sources.
  • Validation rules are applied.
• Transform – Data is processed to ensure consistency.
  • E.g. date formatting
• Load – The formatted data is loaded into the data warehouse.
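
A minimal SQL sketch of the three steps (table and column names are assumptions for illustration; production pipelines typically use dedicated ETL tools):

-- Extract: raw rows land in a staging table exactly as the source sent them.
CREATE TABLE stg_sales (
    sale_date   VARCHAR(10),    -- e.g. '2010/01/15' from the source system
    product_key INT,
    store_key   INT,
    amount      VARCHAR(20)
);

-- Transform + Load: apply a validation rule and normalize the date format
-- while inserting into the warehouse fact table.
INSERT INTO factSales (DateKey, ProductKey, StoreKey, SalesAmount)
SELECT CAST(REPLACE(sale_date, '/', '') AS INT),   -- '2010/01/15' -> 20100115
       product_key,
       store_key,
       CAST(amount AS DECIMAL(10,2))
FROM stg_sales
WHERE amount IS NOT NULL;                          -- a simple validation rule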
Hardware Considerations
• Configure I/O for bandwidth, not capacity
  Disk drives may have enough space to store the data, but I/O can suffer.
• Stripe data files
  Files are distributed across multiple disks for the highest possible I/O bandwidth.
• Use redundancy
  Protects against hardware failure.
• Plan for growth
Schemas
Star vs Snowflake
Star Schema
• Simplest and most commonly used design
• Consists of a fact table that points to dimension tables
• Dimensions are not normalized
• PKs in the dimension tables make up the composite key in the fact table
• High performance due to the simple design and few joins
• Recommended for large or small data warehouses & data marts
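
A minimal star-schema sketch in SQL, consistent with the sample query later in these slides (all names and types are illustrative assumptions):

-- Dimension tables: denormalized, one PK each.
CREATE TABLE dimDate    (DateKey INT PRIMARY KEY, Year INT, CalendarYearWeek INT);
CREATE TABLE dimProduct (ProductKey INT PRIMARY KEY, ProductName VARCHAR(50), Category VARCHAR(50));
CREATE TABLE dimStore   (StoreKey INT PRIMARY KEY, StoreName VARCHAR(50), City VARCHAR(50));

-- Fact table: narrow, keyed by the dimension PKs, holding the measures.
CREATE TABLE factSales (
    DateKey     INT REFERENCES dimDate(DateKey),
    ProductKey  INT REFERENCES dimProduct(ProductKey),
    StoreKey    INT REFERENCES dimStore(StoreKey),
    SalesAmount DECIMAL(10,2),                     -- measurable attribute
    PRIMARY KEY (DateKey, ProductKey, StoreKey)    -- composite key of dimension PKs
);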
Snowflake Schema
• A more complex version of the star schema
• Dimension tables are normalized
• More complex queries due to more joins
• Longer query execution times
• If dimension tables hold a high volume of data, the snowflake design can save space
• Recommended for smaller data warehouses
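
Snowflaking the same illustrative design normalizes a dimension: the Category column above moves into its own table, saving space at the cost of one extra join per query.

-- Hypothetical snowflaked product dimension.
CREATE TABLE dimCategory (CategoryKey INT PRIMARY KEY, CategoryName VARCHAR(50));
CREATE TABLE dimProduct (
    ProductKey  INT PRIMARY KEY,
    ProductName VARCHAR(50),
    CategoryKey INT REFERENCES dimCategory(CategoryKey)   -- normalized out of dimProduct
);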
What exactly is a fact table?
• Contains keys for dimension tables and measurable attributes
• Table is usually narrow due to few attributes
• Table has many rows due to high volume of measurable data
• Measurable attributes are usually numeric values that provide data useful
for business and data analysts
• For example, the fact table for a retail store will contain key attributes for dimensions such as Time and City, and measurable fact attributes such as Sales and Price.
Dimension Tables
• Each table describes a specific dimension
• Tables contain many attributes
• Not too many rows, but they can grow large in some cases
• Rows usually contain textual values (descriptions of objects)
• Denormalized in Star schema
• Normalized in Snowflake schema
Sample Query
SELECT CalendarYearWeek, SUM(SalesAmount)
FROM factSales AS f
JOIN dimDate AS d ON d.DateKey = f.DateKey
WHERE Year = 2010
GROUP BY CalendarYearWeek;
Results for weekly sales in 2010
OLTP – Online Transactional Processing
• Purpose is to update data efficiently
• Uses simple queries for speed
• Uses little space
• Functions on a normalized database schema
• Tuning is important to maximize performance
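
For contrast with the analytical query earlier, an OLTP workload consists of short, simple transactions; a hypothetical transfer might look like this (the accounts table is an assumption):

-- A short OLTP transaction: touch a few rows and commit quickly.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
COMMIT;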
OLAP – Online Analytical Processing
• Main purpose is to analyze data
• Uses historical data
• Different methods (ROLAP, MOLAP, HOLAP)
• Software sits between the client and the DBMS
• The OLAP tool understands the database organization and contains special
functions to analyze it
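
As a sketch of the kind of SQL an OLAP tool might generate against the illustrative star schema above (ROLLUP is standard SQL):

-- Weekly totals, yearly subtotals, and a grand total in a single pass.
SELECT d.Year, d.CalendarYearWeek, SUM(f.SalesAmount) AS Sales
FROM factSales AS f
JOIN dimDate AS d ON d.DateKey = f.DateKey
GROUP BY ROLLUP (d.Year, d.CalendarYearWeek);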
ROLAP – Relational Online Analytical Processing
• Performs dynamic analysis of data stored in a relational database, rather
than on a multidimensional database
• Requires more processing time and disk space to perform some of the tasks
that multidimensional databases are designed for
• Supports larger user groups and larger amounts of data
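
A sketch of how ROLAP emulates a multidimensional view over plain relational tables (reusing the earlier illustrative schema):

-- CUBE aggregates over every combination of Year and City,
-- the relational equivalent of a two-dimensional cube.
SELECT d.Year, s.City, SUM(f.SalesAmount) AS Sales
FROM factSales AS f
JOIN dimDate AS d ON d.DateKey = f.DateKey
JOIN dimStore AS s ON s.StoreKey = f.StoreKey
GROUP BY CUBE (d.Year, s.City);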
MOLAP – Multidimensional Online
Analytical Processing
• A version of OLAP that indexes directly into a multidimensional database
• Processes data in a multidimensional array (all combinations of data are
reflected, each stored in a cell that can be accessed directly)
• Faster than other methods (due to indexing)
HOLAP – Hybrid Online Analytical Processing
• Combination of ROLAP and MOLAP
• Stores data in both relational and multidimensional databases
• Uses whichever database is better suited to the processing goal: the
relational database for data-heavy processing and the multidimensional
database for more speculative processing
Data Warehouse Providers
Teradata
• Founded in 1979
• Teradata Active Enterprise Data Warehouse is the platform that runs the Teradata Database, with added data
management tools and data mining software.
• The data warehouse differentiates between “hot and cold” data – meaning that the warehouse puts data that is not
often used in a slower storage section. As of October 2010, Teradata uses Xeon 5600 processors for the server
nodes.
• Teradata Database 13.10 was announced in 2010 as the company’s database software for storing and processing
data.
• Teradata Database 14 was sold as the upgrade to 13.10 in 2011 and runs multiple data warehouse workloads at the
same time. It includes column-store analyses.
• Teradata Integrated Analytics is a set of tools for data analysis that resides inside the data warehouse.
Oracle
• A multinational computer technology corporation, headquartered in Redwood Shores, California
• Sells database software and technology, cloud engineered systems, and enterprise software products, particularly its
own brands of database management systems
• In 2015 Oracle was the second-largest software maker by revenue, after Microsoft
• Oracle 12c Database is the industry standard for high-performance, scalable, optimized data warehousing.
• The company’s specialized platform for the data warehousing side is the Oracle Exadata Machine.
• There are an estimated 390,000 Oracle DBMS customers worldwide, and about 4,000 Exadata data warehousing
appliances have been sold.
Amazon
• An American electronic commerce and cloud computing company, founded on July 5, 1994, by Jeff Bezos and based in
Seattle, Washington
• AWS was officially launched in 2006
• The shift of data storage and warehousing to the cloud over the last several years has been momentous, and
Amazon has been a market leader in that paradigm.
• Amazon offers a whole ecosystem of data storage tools and resources that complement its cloud services platform.
• For example, there is Amazon Redshift, a fast, fully managed, petabyte-scale data warehouse cloud solution; AWS Data
Pipeline, a web service designed for transporting data between existing AWS data services; and Elastic MapReduce, which
provides an easily managed Hadoop solution on top of the AWS services platform.
• Big AWS customers: Airbnb, Adobe Systems, Johnson & Johnson, GE, BMW, Canon, NASA,…
• AWS pulls in about $11 billion in revenue each year for Amazon.
Cloudera
• Three engineers from Google, Yahoo and Facebook (Christophe Bisciglia, Amr Awadallah and Jeff Hammerbacher,
respectively) joined with a former Oracle executive (Mike Olson) to form Cloudera in 2008
• Cloudera has emerged in recent years as a major enterprise provider of Hadoop-based data storage and processing
solutions.
• Cloudera offers an Enterprise Data Hub (EDH) as its variant of an operational data store, or data warehouse.
• The EDH is Cloudera’s proprietary framework for the “information-driven enterprise” and focuses on “batch
processing, interactive SQL, enterprise search, and advanced analytics—together with the robust security, governance,
data protection, and management that enterprises require.”
• Cloudera’s data warehouse is based on CDH, which is Cloudera’s version of Apache Hadoop and the world’s largest
distribution of it.
• Cloudera customers: Mastercard, Deloitte,…
MarkLogic
• MarkLogic is a Silicon Valley-based private software firm founded in 2001 that offers an enterprise NoSQL database
platform.
• In 2013 MarkLogic released a new semantics platform which provides the capability of storing billions of RDF triples
that can be queried with SPARQL (a semantic query language for the RDF platform).
• Customers: Aetna, NBC, BBC, Boeing, U.S. Navy, and U.S. Army.
What Data Warehouse Do Big Companies Use?
Teradata Customers:
• Apple: operates a multi-petabyte Teradata system and was Teradata’s “fastest ever
customer to a petabyte.”
• Walmart: Teradata’s first-ever terabyte-scale database in 1992.
• eBay: Its primary data warehouse is over 9.2 petabytes; its “singularity system” that stores
web clicks and other “big” data is more than 40 petabytes.
What Data Warehouse Do Big Companies Use?
Teradata Customers:
• Harrah’s (Caesars Entertainment Casino): understands how much money particular gamblers
can afford to lose in a day before they won’t come back the next day.
• Disney: new bracelet tickets equipped with GPS and NFC that track everything visitors do
while inside Disney’s amusement parks.
• Continental Airlines: to keep its customers happy, began assessing customers by lifetime
value and making alternative arrangements for them as soon as the airline realized flights
would be delayed.
What Data Warehouse Do Big Companies Use?
Other Big Companies:
• Microsoft: uses its own Azure SQL Data Warehouse.
• Amazon: uses its own Amazon Redshift.
• IBM: uses its own IBM Data Warehouse.
• Facebook: uses Hive.
Data Warehouse Developer
• The Data Warehouse Developer is responsible for the successful delivery of business intelligence information to the
entire organization and is experienced in BI development and implementations, data architecture and data
warehousing.
• Additional responsibilities include:
A. Creating star schema data models, performing ETLs and validating results with business representatives
B. Supporting implemented BI solutions by: monitoring and tuning queries and data loads, addressing user questions
concerning data integrity, monitoring performance and communicating functional and technical issues.
Data Warehouse Analyst
• A data warehouse analyst collects, analyzes, mines and helps the business leverage the information stored in data
warehouses.
• Professionals in this role research and recommend technology solutions related to data storage, reporting, importing
and other business concerns; they also define the user interfaces for managing interaction with the data.
• A data warehouse analyst is often expected to collaborate with business intelligence analysts and developers to
translate data requirements into logical data models.
ETL Developer
• Write scripts to extract data from the source systems
• Analyze data
• Familiar with SQL, Java, XML and several data warehouse architecture techniques such as EDW, ODS, DM, ROLAP
and MOLAP.
• Help the production support team to resolve ETL batch failures
• Perform data analysis as per the project requirements
• Write scripts to execute ETL programs
• Create design documents / test cases.
• Create data models.
Thank You