A Business Intelligence Framework


An “Agile Framework” for Data Warehouse Development
Goals
• To describe, in general, the essential aspects of a rapid (agile)
data warehouse development environment
• We will specifically talk about (briefly):
• Microsoft Visual Studio (SSDT) Database Projects
• Microsoft SQL Server 2012 Integration Services (SSIS)
• SVN and AnkhSVN for Source Code Control
• tSQLt for unit testing
• Jenkins for continuous integration
• Agile for project methodology
• Leave you with some references for more information
FYI, this was not my idea
• This was the system brought in by the Lead Architect at AHN, and
was based on frameworks he used in the past
• Aaron Pugliese
• BI Data Architect at TeleTracking
• [email protected]
Who Am I?
• Steve Tirone
• [email protected] | @sttpgh | http://stephentirone.com
Quickly: What is Business Intelligence?
I like this definition:
“… the set of techniques and tools for the transformation of raw
data into meaningful and useful information for business analysis
purposes.”
-- from Wikipedia
http://en.wikipedia.org/wiki/Business_intelligence
How Do We Make BI Happen?
• We collect RAW DATA from various systems and STAGE that data in
a staging area
• Then we TRANSFORM that data into a form more suitable for
business analysis (the DATA WAREHOUSE)
• Typically a star schema design, which we often further transform into some type of
OLAP database for efficient reporting purposes (a minimal star-schema sketch follows this list)
• We are going to stop at the data warehouse for this presentation – stay
tuned, though, I hope to get into self-service BI using the tabular cube
model and DAX query language at the next meetup
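To make “star schema” concrete, here is a minimal sketch of one dimension and one fact
table (all names and columns are illustrative, not the warehouse discussed later):

-- One dimension table with a surrogate key and the source system's business key
CREATE TABLE dbo.dimCustomer
(
    CustomerKey  INT IDENTITY(1,1) PRIMARY KEY,   -- surrogate key used by the warehouse
    CustomerID   INT          NOT NULL,           -- business key from the source system
    CustomerName VARCHAR(100) NOT NULL
);

-- One fact table whose foreign keys point at the dimensions
CREATE TABLE dbo.factSales
(
    RecordID     INT   NOT NULL PRIMARY KEY,      -- source order detail identifier
    OrderDateKey INT   NOT NULL,                  -- key into a date dimension
    CustomerKey  INT   NOT NULL REFERENCES dbo.dimCustomer (CustomerKey),
    SalesAmount  MONEY NOT NULL
);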
First, we need a database
• We used a Visual Studio 2012 (SSDT) Database project to create a
Staging database which held all our staging tables
• Database projects are a complete representation of your database as SQL
scripts
• Publishing a database from SSDT automates all the work you used to do manually to
update a database, which is a great time saver
• In addition, you can create pre- and post-deployment scripts to customize the
deployment of your database (see the sketch after this list)
• Publish Profiles allow you to publish to Dev, Test, and Prod, and can also
be customized
• http://sqlblog.com/blogs/jamie_thomson/archive/2012/05/09/publish-profilefiles-in-sql-server-data-tools-ssdt.aspx
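A minimal post-deployment script sketch (the dimCustomer table and the unknown-member
row are illustrative assumptions, not the actual AHN scripts):

-- Script.PostDeployment.sql: runs after every publish of the database project
-- Seed the "unknown member" row that dimension key lookups fall back to
IF NOT EXISTS (SELECT 1 FROM dbo.dimCustomer WHERE CustomerKey = -1)
BEGIN
    SET IDENTITY_INSERT dbo.dimCustomer ON;
    INSERT INTO dbo.dimCustomer (CustomerKey, CustomerID, CustomerName)
    VALUES (-1, -1, 'UNKNOWN');
    SET IDENTITY_INSERT dbo.dimCustomer OFF;
END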
SSDT
Magic!
Use INTEGRATED Source Code Control!
• Visual Studio is often used with TFS (Team Foundation Server)
• But, we didn’t have that
• SVN was used for version control of the projects and solutions
• The AnkhSVN plugin is available to interface with SVN from within
Visual Studio
• Best Practice: use source code control integrated into your IDE, so
developers don’t need to make extra effort to save history of
changes
Let’s get some data into those Staging tables
• A Visual Studio 2012 Integration Services project was used to hold
the SSIS packages which pulled data from the sources to the
Staging database
• Here SQL Server 2012 provided some new benefits
• Project level parameterizable connection managers
• Environment scripts were used for Dev, Test, Prod
• Project Deployment (vs Package Deployment) – all or none
• SSIS packages were TEMPLATES, automatically incorporating
important features (logging), allowing the developer to focus on
the specific task at hand (a logging sketch follows)
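As a rough sketch of the kind of logging a template package can standardize (the
Audit.PackageExecutionLog table and its columns are assumptions, not the actual framework):

-- What the template's "start logging" Execute SQL Task might run at the top of every package
-- (assumes an Audit.PackageExecutionLog table with an identity key, PackageName, and StartTime)
DECLARE @PackageName SYSNAME = N'LoadFactSales';     -- supplied by an SSIS parameter in practice
INSERT INTO Audit.PackageExecutionLog (PackageName, StartTime)
VALUES (@PackageName, SYSDATETIME());
SELECT SCOPE_IDENTITY() AS PackageExecutionLogKey;   -- captured into an SSIS variable for later steps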
A Staging Package
A Fact Table Package
Step 1: set up a variable to store counts
DECLARE @Operations TABLE
(
Operation CHAR(1),
OrderDateKey INT,
OrderDate DATETIME,
ShipDateKey INT,
ShipDate DATETIME,
CustomerKey INT,
CustomerID INT,
RecordID INT
);
Step 2: Capture staging row count
----------------------------------------
-- Get row count of main staging table
-- (Include WHERE clause if necessary)
----------------------------------------
SELECT @SourceRowCount = COUNT(*)
FROM ETLFrameworkSampleStaging.dbo.stgSalesOrderDetail;
Step 3: MERGE to the fact table
MERGE dbo.factSales AS Target
USING (SELECT
ISNULL(da.DateKey, -1) AS OrderDateKey,
ISNULL(da2.DateKey, -1) AS ShipDateKey,
...
OUTPUT LEFT($action, 1),
INSERTED.OrderDateKey, Source.OrderDate,
INSERTED.ShipDateKey, Source.ShipDate,
INSERTED.ProductKey, Source.ProductID,
INSERTED.TerritoryKey, Source.TerritoryID,
INSERTED.CustomerKey, Source.CustomerID,
INSERTED.RecordId
INTO @Operations;
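Since the slide elides most of the statement, here is a self-contained sketch of the same
MERGE pattern cut down to a single dimension lookup (table, column, and join names are
illustrative):

-- Look up the dimension key (falling back to -1 for "unknown"), upsert the fact row,
-- and record what happened in @Operations
DECLARE @Operations TABLE (Operation CHAR(1), CustomerKey INT, CustomerID INT, RecordID INT);

MERGE dbo.factSales AS Target
USING (SELECT ISNULL(dc.CustomerKey, -1) AS CustomerKey,   -- -1 = unknown member
              s.CustomerID,
              s.RecordID,
              s.SalesAmount
       FROM ETLFrameworkSampleStaging.dbo.stgSalesOrderDetail AS s
       LEFT JOIN dbo.dimCustomer AS dc ON dc.CustomerID = s.CustomerID) AS Source
    ON Target.RecordID = Source.RecordID
WHEN MATCHED THEN
    UPDATE SET CustomerKey  = Source.CustomerKey,
               SalesAmount  = Source.SalesAmount
WHEN NOT MATCHED BY TARGET THEN
    INSERT (RecordID, OrderDateKey, CustomerKey, SalesAmount)
    VALUES (Source.RecordID, -1, Source.CustomerKey, Source.SalesAmount)  -- date lookup omitted here
OUTPUT LEFT($action, 1),                                    -- 'I' or 'U'
       INSERTED.CustomerKey, Source.CustomerID, INSERTED.RecordID
INTO @Operations (Operation, CustomerKey, CustomerID, RecordID);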
Step 4: Log Dimension Key Lookup Errors
--------------------------------------------------------------------------------
-- Log Dimension Key Lookup Errors
--------------------------------------------------------------------------------
INSERT INTO neuETLFramework.Audit.PackageErrorLog
(PackageExecutionLogKey, RecordID, BusinessKey, SurrogateKey)
SELECT @PackageExecutionLogKey AS PackageExecutionLogKey, RecordID,
OrderDate, OrderDateKey . . .
FROM @Operations AS t1
WHERE OrderDateKey = -1 OR ShipDateKey = -1 OR ProductKey = -1;
SET @LookupErrorRowCount = @@ROWCOUNT;
Step 5: return row counts
-------------------------------------------------
-- Return Row Count Information to Package
-------------------------------------------------
SELECT COUNT(CASE Operation WHEN 'U' THEN 1 END) AS UpdateRC,
COUNT(CASE Operation WHEN 'I' THEN 1 END) AS InsertRC,
@SourceRowCount AS SrcRC,
@LookupErrorRowCount AS LookupRC
FROM @Operations;
Where does that information go?
So here’s our VS solution
But wait, there’s more!
• Billy Mays, famous pitchman
• Pittsburgh native
• Was a student at Sto-Rox HS and also at WVU, where he was a
walk-on linebacker on the football team
• (all from Wikipedia)
Unit Testing
• How do we know our tables are accurate and correct?
• Use tSQLt to run unit tests on them
• tSQLt is a unit testing framework for Microsoft SQL Server that
allows you to write and execute unit tests in T-SQL
• We used AssertObjectExists to make sure we had all our tables,
and AssertResultSetsHaveSameMetadata to make sure each table’s
structure was what we expected it to be (a sketch follows this list)
• This was just the beginning, much more unit testing can be done
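A hedged sketch of such a structural test (the table and the expected column list are
illustrative; the testDWTables class is created once with tSQLt.NewTestClass):

CREATE PROCEDURE testDWTables.[test factSales exists and has the expected structure]
AS
BEGIN
    -- The table must exist...
    EXEC tSQLt.AssertObjectExists @ObjectName = 'dbo.factSales';

    -- ...and its columns must come back with the expected names and types
    EXEC tSQLt.AssertResultSetsHaveSameMetaData
        @expectedCommand = 'SELECT CAST(0 AS INT) AS RecordID, CAST(0 AS INT) AS OrderDateKey,
                            CAST(0 AS INT) AS CustomerKey, CAST(0 AS MONEY) AS SalesAmount',
        @actualCommand   = 'SELECT TOP 0 * FROM dbo.factSales';
END;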
How did we do this?
• Created a unit testing database
• Put all our tSQLt unit tests in there as stored procedures
• Called tSQLt.RunAll from a post-deployment script so the tests run on
every deployment of the database (sketch below)
• The developer generally deployed the database in order to run the tests,
but we will see in a moment how we automated this process
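A sketch of the relevant post-deployment line in the unit-testing database project (file
name per the usual SSDT convention):

-- Script.PostDeployment.sql in the unit-testing database project:
-- every publish automatically re-runs the whole tSQLt suite
EXEC tSQLt.RunAll;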
An example tSQLt test
ALTER PROCEDURE testDWTables.[test dimBeneficiary has unknown member]
AS
BEGIN
CREATE TABLE #expected (BeneficiaryKey int,
DESYNPUF_ID varchar(16), . . .
INSERT INTO #expected
( BeneficiaryKey, DESYNPUF_ID, . . .
VALUES ( -1, '-1', -1, -1, -1, -1, 'UNKNOWN', '');
SELECT BeneficiaryKey, DESYNPUF_ID, DateOfBirthKey, DateOfDeathKey, GenderKey, RaceKey, FakeName, FakeZIP5
INTO #actual
FROM DESynPUF_DataWarehouse.dbo.dimBeneficiary d
WHERE d.BeneficiaryKey = -1
EXEC tSQLt.AssertEqualsTable '#expected', '#actual';
END;
Running the Test
Here is our VS solution again
How did we automate unit testing?
• By using Jenkins
• Jenkins is a tool for managing builds
• What’s really cool is this allows for continuous integration
• Any time there is a check in to SVN, Jenkins does a full build of the
solution, including unit tests and deploying to the integration SQL server
• Broken builds are detected immediately, and hopefully fixed immediately
• Martin Fowler – Continuous Delivery
• http://youtu.be/aoMfbgF2D_4
(17 minutes)
Jenkins Example
In Summary
Some Best Practices for Data Warehouse Development:
• Templates implement a framework for rapid development,
consistent structure, and always conforming to best practices
• Unit testing can be easily integrated into the process, and will
serve to find problems early and help promote consistency
• Continuous integration is important so that bugs are caught and
fixed early, and the data warehouse code base is always in a state
ready to deploy to production
But, There’s One More Thing
Agile Project Management Methodology
• We’ve described the platform for rapid development of a data
warehouse, but not what to put into it, when, or why
• Enter Agile
• Agile (to me) is a methodology that emphasizes collaboration and iterative
development, where you get your product out to your customers quickly for
their feedback, and you use that to rapidly improve the product
• You need engaged stakeholders!
• All the development productivity in the world isn’t going to help if you
don’t have engaged stakeholders clearly communicating with your
development team
Back to Our Group Charter
• “to meet, network, and learn about strategies, best practices, and
tools”
References
• Agile Data Warehousing
• http://www.amazon.com/Agile-Data-Warehousing-World-ClassIntelligence/dp/0595471676
• tSQLt
• http://tSQLt.org
• SVN and AnkhSVN
• SVN -- https://subversion.apache.org/
• AnkhSVN -- https://ankhsvn.open.collab.net/
• Jenkins
• https://jenkins-ci.org/
• SQL Server Business Intelligence Technology Stack
• http://www.microsoft.com/en-us/server-cloud/products/sql-server/
Thank You
• Slides will be somewhere on the meetup site