Download Report

Transcript SQL_Saturday_

Data, Statistics, and R
Integration with Business
Why you should care!
Dustin Elery
Senior Analytics
[email protected]
A little about me
What is this about?
Statistics 101 – Mathematic, not database
What is R?
General overview of R Studio use
Microsoft SQL Server R Services review
Why you should care
A Little About Me
• Business Intelligence – 5 years
• Passion for Statistics and Finance
• MBA Kelly School of Business
• Live in Indy
• Love consulting
• Constant challenge
• Varying technology
• Ability to play in the now
What is this about?
• A different approach to BI
• Similar reporting and analysis at many different clients
• I have always wanted to do some real data mining
• Passion for analysis and statistics drove me to want to look at reporting
• Cover a simple example later
• R chatter got me curious
• Many commonly used BI applications are integrating R
• When I dove in, I was blown away about what I did not know
• I am not an R expert or a Data Scientist
• I want this presentation to motivate other analyst to explore R by showing some basics
• Please feel free to speak up when I am wrong or missing something that is valuable to
Statistics 101
• Descriptive
• Summary Data – Min, Max, Quartiles, Median, Mode, Variance, Standard Deviation
• Inferential Analysis
• Exploratory – think visual, plots
• Inferential – learned in stats; null hypothesis, p-value, etc.
• Sample representing a population
• Predictive –regression, cluster
• Causal – clinical studies
• Mechanistic – mathematical model capturing all changes from one variable
What is R?
• R is a language and environment for statistical computing and graphics.
• R-project.org
• R is available as Free Software under the terms of the Free Software
Foundation’s GNU General Public License in source code form
• “Many users think of R as a statistics system. We prefer to think of it of an
environment within which statistical techniques are implemented.”
• The R environment
• R is an integrated suite of software facilities for data manipulation,
calculation and graphical display. It includes:
• data handling and storage facility
• operators for calculations on arrays
• a large, coherent, integrated collection of intermediate tools for
data analysis
• graphical facilities for data analysis and display either on-screen or
on hardcopy
• a well-developed, simple and effective programming language
which includes conditionals, loops, user-defined recursive functions
and input and output facilities.
R Studio -
• Integrated Development Environment
(IDE) – user friendly
• 4 Panes
• Source – scripting
• Console – interactive, type
commands, execute
• Environment – state of variables
• Misc. – files, visual, packages, help,
SQL Server R Service
• Providing ease of use Advanced Analytics in a familiar
• Development
• R IDE or MSFT R Client, execute on SQL Server, pass results back to
• Optimization
• ScaleR and RevoScaleR
• Leverage the power of SQL Server
• Deployment
• Stored procedures
• Management and Monitoring
• User control, storage
SQL Server R - Development
SQL Server R - Optimize
• Leverage your server instead of just your machine
• ScaleR
• The RevoScaleR package has some of the most popular R functions,
redesigned to provide parallelism and scale.
• The package also includes functions that further boost performance and
scale by pushing computations to the SQL Server computer
• multi-threaded, multi-core, multi-process computation.
SQL Server R - Deployment
Storing and running R code from SQL Server
• Use T-SQL interface
• Avoid extra data movement. You can use Transact-SQL to generate data from a predictive
model in production, or return plots generated by R and present them in an application such
as Reporting Services.
SQL Server R – Manage and Control
• Isolates R sessions
• Control users who can run R scripts and what databases can be
accessed by R code
• Control the amount of resources allocated to the R runtime
• Control and audit the data used by analysts
• Schedule jobs and author workflows containing R scripts using
stored procedures.
Common R Uses
• Data Munging – data cleaning
• Powerful, seemed easy to develop
• Data Exploration
• Correlation, trends, summary statistics - descriptive
• Visualization
• Fast plots, visuals for mining and business consumption, can handle very
large data sets
• Statistical Analysis Open source, free packages for just about
• https://cran.r-project.org/web/packages/
R Studio and SQL Server 2016
• R Studio in Action
• Airline Data
• Review of some R coding basics
• Run through loading and mining the data
• Apply descriptive, exploratory, and predictive analysis
• Take a simple descriptive statistic and apply it immediately.
• Standard Deviation
• Tableau
Demo – R and Tableau examples
• Data Source: Bureau of Transportation Statistics
• http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=OnTime
• CSV, pulled into SQL Server 2016, data cleaning
• Could be done in R, more proficient in SQL
• Data Munging
• Seemed very powerful, example in back up slides
• Open R, look at script – created connection, loaded data into a data frame, run through some
basic commands and results
• Review Tableau Standard Deviation Report example
Why you should care
• Improve Business Intelligence delivery
• Explore large data sets very quickly
• Higher level data mining, customization, and analysis
• Reduce forecasting risk with better modeling techniques and specialized
• Really need a data scientist for business decision predictions
• Stream line visualization and trend analysis
• Drive analysis to data that is deviating outside of what is statistically
• Begin exploring R’s integration into applications
• I am most looking forward to progressing to this next
• Can function as a stand alone BI tool if there are no other options
• Example - Shiny
Shiny Examples
From my coworker – Michael Huffer
Check it out: https://mdhufferbsu.shinyapps.io/NFL1/ & https://mdhufferbsu.shinyapps.io/WTIN/
Why you should care
• Become Inferential – next era of Business Intelligence
• Used by Fortune 500 companies, now available to anyone in a packaged
solution familiar to any existing SQL environments
• Empowering the Analyst – MSFT chart on self service BI
• Differentiate your approach, be that next value add
• Large businesses are using this
• Clients are asking about it
• It is fun, different, yet similar enough to pick up quickly
**** Start of Backup Demo Slides ****
R Studio – create connection, load data, view it
Descriptive Stats
Tableau Reporting Example
***** Start of back up slides *****
Some Helpful Links – really, there is an amazing amount of free content out there
• https://support.rstudio.com/hc/en-us/articles/201057987-Quick-list-of-useful-R-packages
• http://docs.ggplot2.org/current/
• https://www.quora.com/What-are-the-R-use-cases-in-company-settings
Resources Used
• R Studio
• R- project
• SQL Server 2016
• Tableau
• Alteryx
• Pluralsight
• DataCamp
• Find links, notes, scripts, etc. on what I used for this in back up slides
• Most common reports – trends
• Numerical (% growth)
• Close but not the same
• Visual (time series, line or bar charts)
• My first question – is that trend normal?
• Can I see the people, product, pick your
poison…that are trending outside of normal?
• Then drill into the trends
• Example in Demo
Statistics – Standard Deviation Reporting
• Best Results
• Pick a granularity, time, and variance that you
can manage towards
• A business process, product, area, etc. may need
more flexibility
• Longer time periods
• Adjustment to standard deviation (ability to range)
• Different levels or production/sales
• Ability to turn off dimensions where the variance
is expected and/or cannot be changed
• Ability to set Standard Deviation Level
Installing SQL Server 2016 with R Services
• Developer Version: https://www.microsoft.com/en-us/cloudplatform/sql-server-editions-developers
• During Installation
• Mixed Mode Authentication is required for creating R accounts
• R Services
• Microsoft walk through is good: https://msdn.microsoft.com/enus/library/mt696069.aspx
• When setting up the user accounts, their script didn’t work. I had to create
them in the Master and then the other DB’s directly.
• If looking to do more than read from R, set up the right level of access
Installing R Studio and some tips
• R Studio Download:
• General Tips:
• https://support.rstudio.com/hc/en-us/articles/200404846-Working-in-theConsole
• https://www.rstudio.com/resources/cheatsheets/
Demo Attempt in Power BI
• Limited functionality
• https://powerbi.microsoft.com/enus/documentation/powerbi-service-r-visuals/
• I didn’t want to demo given the basic
functionality, needed more time to examine
better visual techniques
• Very easy to run scripts and create visuals with
code, try this walk through with an easy data and
plot command from attached script
• https://powerbi.microsoft.com/enus/documentation/powerbi-desktop-r-visuals/
• Key is to take the time to learn advanced visual
packages and use them instead of the basic R
plotting functions
• Example: ggplot2
• Note the list of “limited” packages is fairly long
Scripts from my demo
• Posted with the slides