Custom SQL - Tableau Community

Download Report

Transcript Custom SQL - Tableau Community

Ashley Ohmann
June 20, 2013
* What is Custom SQL?
* What can I do with it?
* Join conditions
* Unions and Self Joins
* Ranks
* Derived Tables
* Tableau queries each data source using SQL that’s
specific to the data type
* The differences querying different data source types are
in
* Syntax
* Functions
* Field length limitations
* Tableau allows the SQL used to query a data source to
be customized (relatively easily) in order to manipulate
the joins, filters, and field lengths and types to produce
a more accurate output
* Every data connection, whether it’s to one or multiple tables, has custom SQL
underpinning it.
* The syntax between different database types varies—for instance, in most flat file
and server-based data sources, table and field names are identified by brackets.
* With Hive, the bracket has been replaced by `
* The functions in a query must be functions that the database can understand, which
is a particular issue with SQL queries that aggregate
* For the most part, all databases use some version of SQL, but there are minor
differences
* For instance, flat Microsoft-based sources, like Access, Excel,
and flat files, use the conventions of Microsoft’s JET driver
* Another example: the ISNULL function, which is used to replace
nulls in SQL Server data sources and is a corollary to IFNULL in
Tableau, is not read by Teradata—one would use COALESCE
instead.
* Join conditions tell the database how the fields in different tables
are related to each other
* A “table” is a queried set of data, not necessarily all of the rows and
columns in a pre-defined table in a database
* Custom SQL is a good tool for using joins to transform a data
source
* Inner join: keeps records common to both tables; this join type
usually has the least number of null values, provided that the join
condition is the right one
* Left join: keeps all records from the first table and the matching
records from the second table
* Right join: keeps all records from the second table and the
matching records from the first table
* Outer/full join: keeps all records from both tables
* Self join: refers to the same table in the inner and outer query. A
union can be a self join.
* Union: stacks tables on top of each other.
* Unions are particularly useful for transforming a data source, either because it was
not set-up properly, or because one would like to create a visualization that needs
for it to be set-up in a certain way
* The Region Sales example data source presents the current and previous year’s sales
by Division and Region for two different battery types.
* The format is similar to that of a pivot table, and the problem with analyzing it is
that the same measure is duplicated according to a qualitative attribute.
* In order to analyze this data source and to create metrics based on the measures, it
needs to be restructured:
* We need to add a field that identifies the battery type as an attribute
* We need to put the sales for this year and last year, respective, in one column
* Connect to the Union Data.xls data source, and select the Region Sales sheet.
* Click the “Custom SQL” option under Step 2 of the data connection
* Click the expansion button to see the whole Custom SQL window
* Re-order the fields so that Division comes first, then Region
* Highlight and copy all of the text, then underneath it, type UNION on a new row,
then create another two carriage returns, and paste the code
* In the outer query, create a new field that identifies the battery type as an
attribute
* Alias the battery type’s sales this year with a generic field name, and do the
same for last year’s sales. Then, delete the fields that reference the other
battery type
* In the bottom query, paste the field that identifies the battery type attribute,
but change the value to “Energizer”
* Alias the sales fields for that battery type so that they have the same names as
the sales fields in the outer query
* Make sure that the commas are in the right places
* Click OK, and then give the data source a good name
* Duplicate the data source, and call it “Union with Calculated Fields”
* We’ll use the existing join conditions to add in some other calculations
* Custom SQL can be used to apply conditions to different groups of data based on
an attribute of the group
* It might be useful to add a quota, and in this case, the quote will be the previous
year’s sales multiplied by 1.15 for Energizer and 1.10 for Duracell
* Edit the Custom SQL to multiple the previous year’s sales by the factors listed
above, and call the new field [Quota]
* This data is useful for making a nice little bullet chart. Add a calculated field in
Tableau identifying whether or not the sales exceeded the quota, and put it on
the color shelf.
* One of the shortcomings of Tableau is that it doesn’t have a real ranking function
* RAWSQL can be used to rank, but for a large data set with ranking conditions on
several fields, it’s best to edit the SQL and include a ranking function
* It doesn’t look like an Excel data connection supports this, so one would need to
pre-rank the data in Excel
* For SQL Server or Oracle data connections, RANK and DENSE_RANK functions are
appropriate
*
*
Syntax: RANK ( ) OVER ( [ partition_by_clause ] order_by_clause )
Looks like: RANK() OVER (PARTITION BY i.LocationID ORDER BY i.Quantity DESC) AS Rank
* I have added the sample data to my SQL Server and written a ranking
query—see the next slide.
SELECT [Region Sales].[Division] AS [Division],
[Region Sales].[Region] AS [Region],
'E' as [Battery Type],
[Region Sales].[Energizer] as [TY Sales],
DENSE_RANK() OVER (ORDER BY [Region Sales].[Energizer] DESC)
AS TY_Rank
FROM [dbo].[Region Sales] [Region Sales]
Union
SELECT [Region Sales].[Division] AS [Division],
[Region Sales].[Region] AS [Region],
'D' as [Battery Type],
[Region Sales].[Duracell] AS [TY Sales],
DENSE_RANK() OVER (ORDER BY [Region Sales].[Duracell] DESC) AS
Rank
FROM [dbo].[Region Sales] [Region Sales]
* Derived tables in a query are very useful for transforming data as well as
aggregating at different dimension levels
* We’ll use a different version of our previous data set to create a derived table
* The YOY Sales sheet in our data source shows, for each Division, Region, Battery
Type, and Year, the Sales.
* We want to have this year’s sales and last year’s sales as separate columns,
though.
* Unfortunately, we can’t use a derived table with an Excel data connection, so the
example on the next slide is for code from a SQL Server.
SELECT [TY].[Battery Type] AS [Battery Type],
[TY].[Division] AS [Division],
[TY].[Region] AS [Region],
[TY].[Sales] AS [TY Sales],
[LY].[LY Sales] AS [LY Sales]
FROM [YOY Sales] [TY]
Left Join
(SELECT [YOY Sales].[Battery Type] AS [Battery Type],
[YOY Sales].[Division] AS [Division],
[YOY Sales].[Region] AS [Region],
[YOY Sales].[Sales] AS [LY Sales]
FROM [YOY Sales]
where [YOY Sales].[Year] = 2012)
as [LY]
on [TY].[Battery Type] = [LY].[Battery Type]
and [TY].[Division] = [LY].[Division]
and [LY].[Region] = [LY].[Region]
where [TY].[Year] = 2013