Transcript Document

Understanding Data Analytics and Data Mining
Introduction
An important aspect of the decision-making process is the
ability to transform seemingly unrelated data into useful
information that can influence a person’s decision.
Understanding what data is needed to make effective
decisions and where that data comes from is just one step
in the process: the next step is mining or analyzing that
data to draw useful conclusions that aid decision making.
The Understanding Data Analysis and Data Mining
presentation explores the general principles behind this
second step and helps the organization understand its
options for using data effectively in its business.
Distinguishing Analysis and Mining
The terms “data analysis” and “data mining” are
sometimes used interchangeably, but they are distinctly
different in practice.
In data analysis, a hypothesis is formed and the data is
analyzed to support or disprove the hypothesis.
In data mining, no hypothesis is formed initially;
instead, the data is analyzed to identify any interesting
patterns from which a hypothesis can be drawn.
Despite their differences, the techniques and methods for
both data analysis and data mining are similar.
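The difference can be illustrated with a minimal sketch. The sales records, the attribute names, and both function names below are hypothetical and chosen only for illustration. The first function checks one stated hypothesis (weekend sales are higher than weekday sales), while the second scans every attribute with no hypothesis in mind and reports where the largest gap appears.

```python
from statistics import mean

# Hypothetical sales records, invented for illustration.
records = [
    {"day": "weekend", "region": "north", "sales": 12},
    {"day": "weekend", "region": "south", "sales": 5},
    {"day": "weekday", "region": "north", "sales": 11},
    {"day": "weekday", "region": "south", "sales": 3},
]

def group_means(rows, attribute):
    """Average sales for each value of one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row["sales"])
    return {key: mean(values) for key, values in groups.items()}

def check_hypothesis(rows):
    """Data analysis: test one stated hypothesis (weekend mean > weekday mean)."""
    means = group_means(rows, "day")
    return means["weekend"] > means["weekday"]

def widest_gap(rows, attributes):
    """Data mining: scan all attributes and return the one whose
    group means differ the most, without a hypothesis up front."""
    gaps = {}
    for attribute in attributes:
        means = group_means(rows, attribute).values()
        gaps[attribute] = max(means) - min(means)
    return max(gaps, key=gaps.get)

print(check_hypothesis(records))
print(widest_gap(records, ["day", "region"]))
```

Both functions share the same grouping helper, which reflects the point that the underlying techniques are similar even though the starting point differs.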
Knowledge Discovery in Databases
The Knowledge Discovery in Databases process includes
the following steps:
– Selection
– Preprocessing
– Transformation
– Data Mining
– Interpretation/Evaluation
– Knowledge Presentation
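As a rough sketch of how these steps might chain together, the Python below runs a tiny set of hypothetical sales records through one function per step. The records, the function names, and the choice of a simple frequency count as the mining step are all assumptions made for illustration; the presentation does not prescribe an implementation.

```python
from collections import Counter

# Hypothetical raw records; None marks a missing value.
RAW = [
    {"store": "north", "product": "tea", "amount": 4.0},
    {"store": "north", "product": "tea", "amount": None},
    {"store": "south", "product": "coffee", "amount": 6.5},
    {"store": "north", "product": "coffee", "amount": 5.0},
    {"store": "south", "product": "tea", "amount": 3.5},
]

def select(records, store):
    """Selection: keep only the records relevant to the question."""
    return [r for r in records if r["store"] == store]

def preprocess(records):
    """Preprocessing: drop records with missing values."""
    return [r for r in records if r["amount"] is not None]

def transform(records):
    """Transformation: reduce each record to the attribute being mined."""
    return [r["product"] for r in records]

def mine(products):
    """Data mining: count how often each product occurs."""
    return Counter(products)

def evaluate(counts, min_count):
    """Interpretation/evaluation: keep patterns that meet a threshold."""
    return {p: n for p, n in counts.items() if n >= min_count}

def present(patterns):
    """Knowledge presentation: render the patterns as readable text."""
    return [f"{p}: {n} sale(s)" for p, n in sorted(patterns.items())]

def run_kdd(records, store, min_count=1):
    """Run the six steps in order for one store."""
    chosen = select(records, store)
    cleaned = preprocess(chosen)
    products = transform(cleaned)
    return present(evaluate(mine(products), min_count))

print(run_kdd(RAW, "north"))
```

Keeping each step as its own function makes it easy to swap in a different mining technique without touching selection or presentation.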
Defining Data
Data are a set of facts.
Facts are true or proven.
Data can come in a variety of types:
– Relational data
– Operational data
– Transactional data
Define Data Entry
A data entry is a single instance or record in a
database. Data entries are also called data objects.
A data entry establishes a relationship between
data elements, for example:
– person and address
– customers and purchases
– events and outcomes
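A data entry of the "customers and purchases" kind might be modeled as follows. This is only a sketch: the class name, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PurchaseEntry:
    """One data entry relating a customer to a purchase (hypothetical fields)."""
    customer_id: int
    product: str
    amount: float

# Three invented entries; each one ties a customer to a product.
entries = [
    PurchaseEntry(1, "tea", 4.0),
    PurchaseEntry(1, "coffee", 5.0),
    PurchaseEntry(2, "tea", 3.5),
]

def purchases_for(customer_id, records):
    """Follow the relationship from a customer to the products bought."""
    return [e.product for e in records if e.customer_id == customer_id]

print(purchases_for(1, entries))
```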
Define Dimensions
A dimension is a collection of facts about a
measurable situation.
Dimensions define the who, what, where,
when, and how of a particular focus on the
data.
Dimensions shape how data patterns are identified and
analyzed.
Dimensions – Cube Schema
The cube rendering is a product of online
analytical processing (OLAP) and is used to
show how the different dimensions of data
can be viewed.
Retail Example:
– 4 retail locations
– 10 products
– 12 months
– 2 age groups
[Figure: cube rendering; visible labels: Location, Product]
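To make the size of the retail example concrete, the sketch below builds a cube with one cell per combination of the four dimensions and then rolls it up along one of them. The member names (such as store_1 and the two age group labels) and the placeholder measure of one unit per cell are assumptions; only the dimension sizes come from the example.

```python
from itertools import product
from math import prod

# Dimension sizes follow the retail example; member names are hypothetical.
dimensions = {
    "location": [f"store_{i}" for i in range(1, 5)],
    "product": [f"product_{i}" for i in range(1, 11)],
    "month": list(range(1, 13)),
    "age_group": ["group_a", "group_b"],
}

def cell_count(dims):
    """Number of cells is the product of the dimension sizes."""
    return prod(len(members) for members in dims.values())

# Placeholder measure: pretend one unit was sold in every cell.
cube = {cell: 1 for cell in product(*dimensions.values())}

def roll_up(cells, dims, keep):
    """Sum the measure over every dimension not named in `keep`."""
    names = list(dims)
    positions = [names.index(name) for name in keep]
    totals = {}
    for cell, value in cells.items():
        key = tuple(cell[i] for i in positions)
        totals[key] = totals.get(key, 0) + value
    return totals

print(cell_count(dimensions))  # 4 * 10 * 12 * 2 = 960
print(roll_up(cube, dimensions, ["location"]))
```

Rolling up to a single dimension is one of the views an OLAP tool offers; keeping two names in `keep` would give a two-dimensional slice instead.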
Dimensions – Star Schema
Star schemas are used to design how data is
organized in data warehouses.
[Figure: star schema diagram; labels: Product, Location,
Orders, Time, Customer]
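A star schema with these five names could be sketched in SQLite as one central table of orders surrounded by four dimension tables. Treating Orders as the central fact table is an assumption based on the labels above, and the table names, column names, and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per product, location, time period, customer.
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_time     (time_id     INTEGER PRIMARY KEY, month INTEGER);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, age_group TEXT);

-- Fact table: each order points at one row in every dimension.
CREATE TABLE fact_orders (
    order_id    INTEGER PRIMARY KEY,
    product_id  INTEGER REFERENCES dim_product(product_id),
    location_id INTEGER REFERENCES dim_location(location_id),
    time_id     INTEGER REFERENCES dim_time(time_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    amount      REAL
);

INSERT INTO dim_product  VALUES (1, 'tea'), (2, 'coffee');
INSERT INTO dim_location VALUES (1, 'Springfield'), (2, 'Shelbyville');
INSERT INTO dim_time     VALUES (1, 1), (2, 2);
INSERT INTO dim_customer VALUES (1, 'group_a'), (2, 'group_b');
INSERT INTO fact_orders  VALUES
    (1, 1, 1, 1, 1, 4.0),
    (2, 2, 1, 1, 2, 5.0),
    (3, 1, 2, 2, 1, 3.5);
""")

def sales_by_product(connection):
    """Join the fact table to one dimension and total the amounts."""
    rows = connection.execute("""
        SELECT p.name, SUM(o.amount)
        FROM fact_orders AS o
        JOIN dim_product AS p ON p.product_id = o.product_id
        GROUP BY p.name
        ORDER BY p.name
    """).fetchall()
    return dict(rows)

print(sales_by_product(conn))
```

Any of the other three dimensions can be joined in the same way, which is what gives the schema its star shape.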
Online Analytical Processing
Online Analytical Processing is an approach
for analyzing multidimensional data from
multiple perspectives interactively.
The acronym for online analytical processing
is OLAP.
Defining Patterns
A pattern is an expression of data which can be modeled.
Data analysis and data mining focus on identifying,
understanding, and drawing conclusions about interesting
patterns.
An interesting pattern has the following characteristics:
– It can be understood easily by humans
– It can be recreated, meaning it has some level of
certainty to its validity
– It can be potentially used by the organization
– It is novel, innovative, and requires investigation
– For data analysis, it validates and confirms the
hypothesis
Queries
Queries are a mechanism for retrieving
information from a database: they consist of
questions.
Standard queries are predefined questions to
ask a database.
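A standard query can be pictured as a question whose wording is fixed in advance, with only a parameter supplied at run time. The sketch below uses Python's built-in sqlite3 module; the customers table, its columns, and the sample names are hypothetical.

```python
import sqlite3

# A standard query: the question is predefined, the city is a parameter.
CUSTOMERS_IN_CITY = "SELECT name FROM customers WHERE city = ? ORDER BY name"

def make_db():
    """Build a small in-memory database with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [("Ada", "Leeds"), ("Grace", "York"), ("Alan", "Leeds")],
    )
    return conn

def customers_in_city(conn, city):
    """Ask the predefined question for one city."""
    return [row[0] for row in conn.execute(CUSTOMERS_IN_CITY, (city,))]

db = make_db()
print(customers_in_city(db, "Leeds"))
```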
Data Mining Techniques
There are several techniques of note in data mining:
– Characterization and Discrimination
– Associations and Correlations
– Classification and Regression
– Clustering analysis
– Outlier analysis
Characterization and Discrimination
Characterization will describe the data in
summary or general terms.
Discrimination will describe the data, usually
by means of comparison.
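As a small illustration, characterization might produce summary figures for one group of data, while discrimination compares that group against another. The sales figures and function names below are hypothetical, and the difference in means is only one of many possible comparisons.

```python
from statistics import mean

# Hypothetical sales amounts for two regions.
sales = {
    "north": [4.0, 5.0, 6.0],
    "south": [3.0, 3.5, 2.5],
}

def characterize(values):
    """Characterization: describe one group in summary terms."""
    return {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }

def discriminate(target, contrast):
    """Discrimination: describe the target group by comparison,
    here as the difference between the two group means."""
    return mean(target) - mean(contrast)

print(characterize(sales["north"]))
print(discriminate(sales["north"], sales["south"]))
```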
Association and Correlation
Associations and correlations describe relationships
among data objects, such as items that frequently occur
together.
They are often used in frequent pattern mining.
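Two measures commonly used in frequent pattern mining are support (how often a set of items appears) and confidence (how often one item appears given another). The sketch below computes both over a handful of hypothetical shopping baskets; the basket contents and function names are assumptions for illustration.

```python
# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

print(support({"bread", "butter"}, baskets))
print(confidence({"bread"}, {"butter"}, baskets))
```

In this data, bread and butter appear together in 2 of 4 baskets, and 2 of the 3 baskets with bread also contain butter.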
Classification and Regression
Classification attempts to find a model that describes
and distinguishes predefined data classes, so that the
class of a new data object can be predicted.
Regression attempts to find a model of continuous
values, so that missing or unavailable numerical data
can be predicted.
These are predictive approaches and use methods such as
decision trees and neural networks.
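Both ideas can be sketched in a few lines. For classification, the example uses a decision stump, which is a decision tree with a single split; for regression, it fits a straight line by ordinary least squares, a simpler stand-in for the methods named above. The data, the 0/1 class labels, and the function names are hypothetical.

```python
def fit_stump(xs, labels):
    """Classification: choose the threshold on x that misclassifies the
    fewest training points. Labels are assumed to be 0 or 1."""
    best = None
    for threshold in sorted(set(xs)):
        for below, above in ((0, 1), (1, 0)):
            errors = sum(
                (below if x <= threshold else above) != y
                for x, y in zip(xs, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    _, threshold, below, above = best
    return lambda x: below if x <= threshold else above

def fit_line(xs, ys):
    """Regression: least-squares slope and intercept for y = slope*x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x

# Classification: predict a class label for an unseen value.
classify = fit_stump([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
print(classify(2), classify(11))

# Regression: predict a missing numerical value at x = 5.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope * 5 + intercept)
```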
Cluster Analysis
Data objects are analyzed without consulting class
labels; clustering can instead be used to generate class
labels for groups of similar objects.
Image from visibleearth.nasa.gov
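One common clustering method is k-means, sketched here in one dimension. The input values, the starting centers, and the fixed number of iterations are assumptions for illustration; note that no class labels are supplied, and the resulting groups act as generated labels.

```python
def k_means_1d(values, centers, iterations=10):
    """Repeatedly assign each value to its nearest center, then move
    each center to the mean of the values assigned to it."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # An empty group keeps its previous center.
        centers = [
            sum(g) / len(g) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, groups

centers, groups = k_means_1d([1, 2, 3, 10, 11, 12], centers=[0, 5])
print(centers)
print(groups)
```

The result depends on the starting centers; practical implementations usually try several starting points.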
Outlier Analysis
Outlier analysis looks at abnormalities in data: data
that does not behave as expected.
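A simple way to flag such values is a z-score test, which marks any value lying more than a chosen number of standard deviations from the mean. The readings, the threshold of 2.0, and the function name are assumptions for illustration; other definitions of "unexpected" are equally valid.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical readings with one value far from the rest.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
print(z_score_outliers(readings))
```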
Standards
The Cross Industry Standard Process for Data Mining
(CRISP-DM) was developed under the European Strategic
Program on Research in Information Technology (ESPRIT).
Sample, Explore, Modify, Model, and Assess (SEMMA) was
developed by SAS Institute Inc.
The Toolkit
The Toolkit is designed to enable an organization to
improve its capabilities in data warehousing and data
analysis, while remaining neutral between specific
technical solutions. The Toolkit consists of two parts:
an introduction to the concepts and terms used in these
areas, and usable templates to pursue and implement
specific technical solutions.
The goal of the Data Warehouse and Data Analysis Toolkit
is to define the contributing factors, major components,
and their relationships, while providing the basic tools
to take action based on the organization’s needs.
Moving Forward
The presentations found within the Toolkit provide
education about the different facets of Data
Warehousing and Data Analysis: they can be used for
self-edification or as the foundation for presenting a
case to different levels of the organization.
The process document, Developing Data Analysis
Capabilities, is intended to be a step-by-step guide to
creating a Data Analysis foundation in your
organization. Multiple templates have been created to
support the process and aid organizations in their
efforts to improve their Data Analysis capabilities.