Using Normalized Status Change Events Data in Business

Making Good Use of Data at
Hand: Open Source Tools
Mark C. Cooke, Ph.D.
Tax Management Associates, Inc.
Overview
• Open Data concept – Data is produced for
various purposes but can be used to derive
novel insights; i.e. “Business Intelligence (BI)”
• Open Source tools exist for making good use
of existing data sets
– ETL (“Extract, Transform, Load”) + Analytics
• Knime and the R language are two of the most
powerful resources for leveraging data
2013
Open Source Tools for Data Analysis
Mark C Cooke - Tax Management Associates, Inc.
Open Data
• Open Data concept – governments collect,
through existing management systems, enormous
quantities of data that can be leveraged in
alternative and novel ways to find solutions.
• The goal is often to leverage the broader
community to develop solutions that
governments may not have previously conceived.
• Open Data and Business Intelligence should be
used by internal consumers as well.
Open Data
“Data Scientist”
Doing Data the Old Way
• Data is locked inside systems :-(
– Software systems are designed to wrap a Graphical
User Interface (GUI) around data.
– The GUI functionality, historically, has to be
programmed to produce reports, views, and analysis.
• The GUI is driven by the sole purpose of the
software. But the data has many purposes…
Open Data – Way Forward
• Making data talk across platforms: AS400, SQL,
XML, Excel, PDFs, text files, image files (.png,
.jpeg, etc.), shapefiles (ESRI), email archives,
web scraping, social media APIs, etc.
• Connecting data across multiple platforms
• Using data for novel insight
• Tools now exist for importing, cleaning,
standardizing, and analyzing data using complex
algorithms built into accessible packages
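As a minimal sketch of connecting data across platforms, the Python snippet below reads one hypothetical CSV export and one hypothetical SQL table and merges them on a shared key. The table name, column names, and values are assumptions made for illustration; they do not come from the presentation, and the same idea could equally be expressed in Knime or R.

```python
import csv
import io
import sqlite3

# Hypothetical sample data: a CSV export and a SQL table sharing an account id.
CSV_TEXT = """account_id,business_name
A1,Acme Tools
A2,Bay Bakery
"""

def read_csv_rows(text):
    """Extract: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def read_sql_rows(conn):
    """Extract: read rows from a SQL table into a list of dicts."""
    cur = conn.execute(
        "SELECT account_id, taxable_value FROM assessments ORDER BY account_id"
    )
    return [{"account_id": a, "taxable_value": v} for a, v in cur]

def join_on_account(names, values):
    """Transform: merge the two sources on their shared key (inner join)."""
    by_id = {row["account_id"]: row for row in names}
    return [
        {**by_id[v["account_id"]], "taxable_value": v["taxable_value"]}
        for v in values
        if v["account_id"] in by_id
    ]

# Build an in-memory database standing in for a production SQL system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assessments (account_id TEXT, taxable_value REAL)")
conn.executemany(
    "INSERT INTO assessments VALUES (?, ?)",
    [("A1", 12000.0), ("A2", 8500.0), ("A3", 400.0)],
)

merged = join_on_account(read_csv_rows(CSV_TEXT), read_sql_rows(conn))
for row in merged:
    print(row)
```

Account A3 exists only in the SQL table, so the inner join drops it; whether unmatched records should be kept is a design choice that depends on the question being asked.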
Open Data
• These systems are known as “data agnostic”:
“Database-agnostic is a term describing the capacity of
software to function with any vendor’s database
management system (DBMS). In information technology (IT),
agnostic refers to the ability of something – such as
software or hardware – to work with various systems,
rather than being customized for a single system.”
– http://searchdatamanagement.techtarget.com/definition/database-agnostic
Data Science
• What is the breadth of the tool base?
– Reading in data from various resources
– Transforming data to merge various resources, translate
data into a usable format or to add new data elements
– Analyzing data from basic logical and statistical
functions to higher level machine learning tools and
algorithms
“Machine learning, a branch of artificial intelligence, concerns the
construction and study of systems that can learn from data.”
http://en.wikipedia.org/wiki/Machine_learning
Data Science
• What is the output?
– “Business Intelligence” or actionable information
that drives business decisions through insight
– Creating new insights from existing data
– Visualizations - representation of that BI in ways to
make it consumable to a non-specialist audience
According to Friedman (2008), the “main goal of data visualization is
to communicate information clearly and effectively through
graphical means.”
http://en.wikipedia.org/wiki/Data_visualization
• Knime is a GUI-based data agnostic tool for ETL,
analytics, and visualization.
• Knime is an open source platform for the desktop
with commercial enterprise server layers
including collaboration tools and web-services
(web-portal).
• Knime supports other analytics languages,
including the R language for statistical computing
www.Knime.org
• The advantages of Knime:
– Rapid development environment
– Powerful processing that handles large datasets
on commodity hardware
• Allows analysis of the full data set (a 100% sample)
of up to millions of rows
– Workflows can be saved, shared, and duplicated
– Nodes run stepwise, allowing for quick revisions
– Nodes provide access to complex algorithms
What is Knime?
The Knime Workbench
Knime Nodes
• Nodes are the workers inside a
workflow
• Every node serves at least one
function
• Nodes can also be grouped into MetaNodes,
which are collections of nodes
performing common functions
• A collection of connected nodes is
called a “workflow”
• You can develop your own nodes in
Java using Knime's node development
support
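The node and workflow idea can be sketched outside Knime as a chain of small functions, each taking a table and returning a table. This is a conceptual illustration in Python, not Knime's API: the names row_filter, add_column, metanode, and run_workflow are hypothetical, as is the sample data.

```python
from functools import reduce

def row_filter(predicate):
    """Node factory: keep only rows for which predicate(row) is true."""
    return lambda table: [row for row in table if predicate(row)]

def add_column(name, fn):
    """Node factory: append a derived column computed from each row."""
    return lambda table: [{**row, name: fn(row)} for row in table]

def metanode(*nodes):
    """Bundle several nodes into one reusable node (loosely, a MetaNode)."""
    return lambda table: reduce(lambda data, node: node(data), nodes, table)

def run_workflow(table, nodes):
    """Run the nodes in order, feeding each node's output to the next."""
    return metanode(*nodes)(table)

# Hypothetical input table.
accounts = [
    {"id": 1, "value": 1000.0, "exempt": 250.0},
    {"id": 2, "value": 0.0, "exempt": 0.0},
    {"id": 3, "value": 4000.0, "exempt": 1000.0},
]

# A reusable cleaning step built from two simpler nodes.
clean_and_net = metanode(
    row_filter(lambda r: r["value"] > 0),
    add_column("net", lambda r: r["value"] - r["exempt"]),
)

result = run_workflow(accounts, [clean_and_net])
print(result)
```

Because each step is a separate function, a single step can be revised or swapped without touching the rest of the chain, which is the property the stepwise node model is meant to provide.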
Knime Nodes
• For example, the file reader node
is an intelligent file reader that
can determine the type of file
• However, it also allows for the
end user to adjust parameters
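A rough analogue of a reader that guesses a file's layout but still lets the user override it can be written with Python's standard csv module. This sketch only covers delimiter detection for delimited text, which is much narrower than what the Knime file reader does; the function name read_table and the sample data are hypothetical.

```python
import csv
import io

def read_table(text, delimiter=None, candidates=",;\t|"):
    """Parse delimited text; guess the delimiter unless the caller supplies one."""
    if delimiter is None:
        # csv.Sniffer inspects a sample of the text and guesses the dialect.
        sample = text[:1024]
        delimiter = csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

# Hypothetical semicolon-delimited input.
SAMPLE = "id;name;value\n1;Acme;10\n2;Bay;20\n"

guessed = read_table(SAMPLE)                 # delimiter detected automatically
forced = read_table(SAMPLE, delimiter=",")   # user override (wrong on purpose)

print(guessed[0])
print(forced[0])
```

With the deliberate wrong override, every line is read as a single column, which shows why an adjustable parameter matters when automatic detection cannot be trusted.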
Knime Nodes
• The Column Filter node
allows users to filter
columns from a table
(conveniently named…)
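For readers unfamiliar with the operation, column filtering amounts to the following. This is a plain Python sketch of the concept with a hypothetical function name and sample table, not the Knime node's implementation.

```python
def column_filter(table, include=None, exclude=None):
    """Return a copy of the table keeping (or dropping) the named columns."""
    if include is not None and exclude is not None:
        raise ValueError("pass either include or exclude, not both")
    if include is not None:
        wanted = set(include)
        def keep(col):
            return col in wanted
    else:
        dropped = set(exclude or ())
        def keep(col):
            return col not in dropped
    return [{c: v for c, v in row.items() if keep(c)} for row in table]

# Hypothetical table with one column we do not need downstream.
table = [
    {"account": "A1", "owner": "Acme", "notes": "n/a"},
    {"account": "A2", "owner": "Bay", "notes": "late"},
]

kept = column_filter(table, include=["account", "owner"])
dropped = column_filter(table, exclude=["notes"])
print(kept)
```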
Knime Nodes (sample)
Knime Integrates with R
• R integration is key to expanding
the data analysis and visualization
capabilities of Knime
• R supports data ingestion of
complex files (including ESRI)
• R supports complex data
manipulation and statistical
analysis
• R supports a wide variety of highly
customizable visualizations
So, what is R, exactly?
R Project for Statistical
Computing
www.r-project.org
• R is an open source scripting language that can be run
inside Knime or independently from a command line
environment
• Several GUIs for R exist, such as RStudio, whose makers also
provide training and extension packages
(www.rstudio.com)
• Community contributions make up the bulk of R packages,
which now total more than 4,700
R Project for Statistical
Computing
www.r-project.org
• The R base package (standard software) provides methods
for reading data, ETL, analysis and visualizations
• Community-provided packages build on this base according
to the interests of their authors
• Packages stretch across all imaginable data uses, including
advanced statistical analyses, machine learning and data
mining, and advanced graphical visualizations (including
sophisticated mapping)
Popular R Packages
A (very) brief overview of popular packages:
• plyr – for advanced data manipulation
• maps – for mapping datasets onto georeferenced outputs
• ggplot2 – for advanced data visualizations
• RCurl – for reading data from webpages and repositories
• tm – for text mining applications
• sna – for social network analysis
R Inside Knime
Basic Data Manipulation:
R Inside Knime
Basic Visual using Maps:
Knime + R + TPP
Case examples for working with TPP (tangible personal property):
• Look at distribution of TPP accounts across a county, state, or
region
• Map entities or create a heatmap (choropleth) of the
distribution of personal property values
• Compare personal property reporting across schedules and
industry sectors (e.g., M&E across manufacturing types)
• Compare like-kind entity reporting (franchises, big-box) for
consistency in values
• Compare personal property accounts with other data
resources (real property accounts, permits, etc.)
Brief Demonstration
Data:
Florida
67 Counties
More than 1.24 million personal property accounts
Goals:
1. Group all data by industry to illustrate the taxable value and
exempted value by type
2. Subset the data to include only a particular industry
3. Map the state-wide exempt value in a choropleth
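The three goals can be sketched in a few lines of Python. The records, field names, and dollar amounts below are hypothetical stand-ins, not the Florida data used in the demonstration, and the last step only prepares the per-county totals a choropleth would be drawn from; it does not draw the map.

```python
from collections import defaultdict

# Hypothetical records; field names and values are illustrative only.
ACCOUNTS = [
    {"county": "Alachua", "industry": "Retail", "taxable": 50000.0, "exempt": 25000.0},
    {"county": "Alachua", "industry": "Manufacturing", "taxable": 200000.0, "exempt": 25000.0},
    {"county": "Duval", "industry": "Retail", "taxable": 80000.0, "exempt": 10000.0},
    {"county": "Duval", "industry": "Retail", "taxable": 20000.0, "exempt": 20000.0},
]

def totals_by(records, key):
    """Sum taxable and exempt value for each distinct value of `key`."""
    totals = defaultdict(lambda: {"taxable": 0.0, "exempt": 0.0})
    for rec in records:
        totals[rec[key]]["taxable"] += rec["taxable"]
        totals[rec[key]]["exempt"] += rec["exempt"]
    return dict(totals)

# Goal 1: taxable and exempt value grouped by industry.
by_industry = totals_by(ACCOUNTS, "industry")

# Goal 2: subset the data to a single industry.
retail_only = [rec for rec in ACCOUNTS if rec["industry"] == "Retail"]

# Goal 3: exempt value per county, the input a choropleth would shade by.
exempt_by_county = {
    county: sums["exempt"] for county, sums in totals_by(ACCOUNTS, "county").items()
}

print(by_industry)
print(exempt_by_county)
```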
Questions?
Thank you for your time and attention. I am always happy to
discuss data, so please feel free to contact me using any of the
information below.
Mark C Cooke
[email protected]
704.847.1234 (office)
704.953.6349 (cell)
www.linkedin.com/in/markccooke