DataArchPredAnalytics20161210x

Download Report

Transcript DataArchPredAnalytics20161210x

Data Modeling for
Predictive Analytics
SQL Saturday Providence – Dec 10
Beth Wolfset, Data Architect
Email: [email protected]
SQL Saturday 575 - Thanks to our Sponsors!
• Gold
• Silver
• Bronze
• Swag
• Blog
About BlueMetal (an event sponsor)
Modern technology, craftsman quality.
We’re an interactive design and technology architecture firm matching the most experienced
consultants in the industry to the most challenging business and technical problems facing our
clients. Founded August 2010 and as of October 2015 we are an Insight company.
6 | YEARS IN OPERATION
5 | LOCATIONS
6 | SERVICE AREAS
4 | INDUSTRY SPECIALIZATIONS
We are hiring Data Engineers in Boston and NYC
Data Is An Asset
“Whether you want it or not, the amount and
variety of data are expanding exponentially.
Embrace that trend and transition your
organizations to understand information as a
competency that needs the right people,
processes and platforms”
John Lewis, president & CEO, consumer group, NA, at Nielsen
“organizations integrating high-value,
diverse, new information types and
sources into a coherent information
management infrastructure will
outperform their industry peers
financially by more than 20%.”
Regina Casonato, et al, Gartner Research
“Information is the oil of the
21st century, and analytics is
the combustion engine.”
Peter Sondergaard, Gartner Research
Why Are We Here?
• What is Data Modeling
• What is the Predictive Analytics Process
• Database Types and how No SQL fits in
• What are the Types of Data Models and where they are
appropriate
• A Word on other Data Topics
•
Data Architecture and No SQL
•
Code First vs Model First
What is Data Modeling
Data modeling is a process used to define and analyze
data requirements needed to support the business
processes within the scope of corresponding information
systems in organizations. Therefore, the process of data
modeling involves professional data modelers working
closely with business stakeholders, as well as potential
users of the information system.
The Data Journey
Predictive Analytics Process
Identify
Business
Objectives
Monitor &
Analyze
Exploratory
Analysis
Implement
Statistical
Model
Develop
Statistical
Model
Test
Statistical
Model
Popular Types of Databases
Database Type
Example
Relational (SQL)
Database Type
Example
MySQL
Document
Lotus Notes
CouchBase
CouchDB
MongoDB
OrientDB
Raven
Terrastore
Graph & Resource Neo4J
Description
Flock
Framework (RDF) HyperGraph
Infinite Graph
Jena
Sesame
AllegoGraph
Search Engine
ElasticSearch
Splunk
Solr
MarkLogic
Sphinx
Key-Value
Riak
Redis
Column-Family
Cassandra
Amazon Simple DB
Oracle
Sybase
Object-Oriented
Berkely
Level
Memcached
Cache
Db4o
ObjectStore
Versant
Objectivity/DB
Hypertable
Hierarchical
DATA ARCHITECTURE AND NOT ONLY SQL
NO SQL – Not Only SQL
• Implies a non-relational database
• Does not imply there is no structure to the data
• Data may be stored in methods that do not require the structure to be
understood a priori
• Structure of data may be defined at query time
NO SQL does not mean NO Data Architect
• The Data Architect understands what data is available and the methods of
accessing it, regardless of the management system
• Data requires documentation
• Generally uses Schema-On-Read

Data Models
Model Types
•
•
•
•
•
•
Conceptual
Logical
Physical
Star Schemas
Summary Tables and Cubes
Others
Business Case
• To capture information for an Educational System. This will include
information about the Employees, Faculty, Students , Alumni,
Courses and Educational Organization
Data Modeling Tools
Tool
ERwin Data
Modeler
ER/Studio
Creator
Supported Database Platforms Supported
Supported data
OSs
models (conceptual,
logical, physical)
ERwin Inc.
Access, IBM DB2, Informix,
Windows
Conceptual, logical,
(formerly part Ingres, MySQL, Oracle,
physical
of CA
Progress, MS SQL Server,
Technologies) Sybase, Teradata
Embarcadero Access, IBM DB2, Informix,
Windows
Conceptual, logical,
(acquired by Hitachi HiRDB, Firebird,
physical, ETL
IDERA)
Interbase, MySQL, MS SQL
Server, Netezza, Oracle,
PostgreSQL, Sybase, Teradata,
Visual Foxpro and others via
ODBC/ANSI SQL
Enterprise
Architect
Sparx Systems IBM DB2, Firebird, InterBase,
Informix, Ingres, Access, MS
SQL Server, MySQL, SQLite,
Oracle, PostgreSQL, Sybase
Windows,
Linux, Mac
Conceptual, Logical
& Physical + MDA
Transform of Logical
to Physical
SQL Server
Management
Studio
Oracle SQL
Developer Data
Modeler
PowerDesigner
Microsoft
MS SQL Server
Windows
Physical
Oracle
Oracle, MS SQL Server, IBM DB2 Crossplatform
Logical, physical
Sybase
MS SQL Server, Oracle,
PostgreSQL, MySQL, IBM DB2,
Informix
Conceptual, logical,
physical
Windows
Supported
notations
Forward
Reverse
Engineering Engineering
Model/database
comparison and
synchronization
Update database
and/or update
model
Repository
IDEF1X, IE
(Crows feet),
and more
Yes
Yes
IDEF1X, IE
(Crows feet)
Yes
Yes
Update database
and/or update
model
ER/Studio Repository and Team
Server (formerly
Portal/CONNECT) for
collaboration
IDEF1X, UML
DDL,
Information
Engineering &
ERD
Yes
Yes
Update database
and/or update
model
Multi-user collaboration using
File, DBMS or Cloud Repository
(or transfer via XMI, CVS/TFS or
Difference Merge).
Yes
IDEF1X, IE
(Crows feet),
and more
IDEF1X, IE
(Crows feet),
and more
Yes
Yes
Yes
Yes
Workgroup edition provides
collaboration
Yes
Update database
and/or update
model
Update database
and/or update
model
Yes
Yes
Conceptual Model
A conceptual data model is a high level or coarse model which is abstract
in structure and content that is intended to represent a business area
SubType
SuperType
Transactional Data Relational Data
 Warehouse Data Non-Relational Data
 IoT Data
SuperType
SubType
Logical Model
A logical data model (LDM) is a type of data model showing aAttribute
representation of the organization's data, independent of any particular
Entity
technology
Primary / Surrogate Key
Business Key / Alternate Key
SuperType
Audit ColumnsDomains
SuperType
Foreign Key
 Transactional Data
 Warehouse Data
 IoT Data
SubType
Relational Data
 Non-Relational Data
SubType
{
Physical Model
"course": {
{
"courseId":
"ENG
301",
A physical data model (or database design) is a representation of a data design which
"person": {
"instructor": {
"name": {
takes
into account the capabilities and constraints of a given database management
"name": {
"first_name": "Severus",
system
"first_name": "Severus",
"last_name": "Snape"
What
"last_name": "Snape"
},
},
about reverse
"ssn": "999-999-9999",
"ssn": "999-999-9999"
engineering?
"active_status": true,
SuperType
},
"person_type": "External",
"students": [
"role": {
{
"faculty": {
"name": {
"department": "Engineering",
"first_name": "Harry",
"salary": "80000"
"last_name": "Potter"
},
},
"student": {
"ssn": "111-111-1111"
"fee": 0
},
}
{
},  Transactional Data
"name": {
"course": {
"first_name": "Luna",
 Warehouse
Data
"courseId":
"ENG 301"
"last_name": "Lovegood"
}  IoT Data
},
}
"ssn": "222-222-2222"
} Relational Data
}
SubType
X Non-Relational Data
]
}
}
Star Schema
A star schema is a representation of a dimensional data model which consists of facts and dimensions.
Transactional Data

Warehouse Data

IoT Data
X
Relational Data

Non-Relational Data
X
Summary Tables and Cubes
A summary or aggregation table will be created when metrics have been standardized and reports or
analytics want to use the information frequently or consistently.
Transactional Data

Warehouse Data

IoT Data

Relational Data

Non-Relational Data

Predictive Analytics Process
Identify
Business
Objectives
Monitor &
Analyze
Exploratory
Analysis
Implement
Statistical
Model
Develop
Statistical
Model
Test
Statistical
Model
Other Areas
•
•
•
•
•
•
Data Model Patterns
Master Data Management
Code First vs Model First
Data Flow Diagrams
Use Case Diagrams
Data Governance
Other
Areas?
Data Model Patterns
Logs
Taxonomies &
Reference Data
Events
Business Objects
Demographic Data
Auditing
Utilization
Master Data Management
Master Data
tCustomer
Customer
Config
Sales
Customer
Data
•
core data
that
is essential
operation of the business
Three
Ways
to Masterto
Data
•
• Mutually
Exclusiveset of identifiers and extended attributes that describes
consistent
and uniform
the core• entities
Vertically Fragmented
Master Data
Management
• Match
and Merge
•
•
•
Name: S. Snape
Master Plan
SSN: 123-45-6789
List
Degree: Engineering
a methodology
that identifies the most critical information within an
Which is the
organization—and
creates a single view of truth to power business
right
processes
customer
Name:
Prof. Severus
Snape
discipline
in which business
and
IT work
together toEngineering
ensure the
address?
SSN: 123-45-6789
Classes and
uniformity, accuracy, stewardship,
semantic consistency
Emp Id: 456
accountability of the enterprise’s
official
shared master data assets
Address: 9 Galen
St
Phone: 617-555-1212
Math
Degree: Engineering
may be technology
Masterenabled
Classes
Name: Prof. Snape Class List
Emp Id: 456
Phone: 617-555-1212
Name: Severus Snape
SSN: 123-45-6789
Address: 9 Galen St
Philosophy
ClassesPhone: 617-555-1212
Code First vs Model First
Code First
Model First
•
•
•
•
•
•
•
Build code-data structures in memory
Measured on speed to provide
functionality
OR/M maps structures to DB
Limits need to understand DB Access
•
Understand data and future growth
Use standards and templates
Measured on multiple uses across
applications
Consistency of model facilitates efforts
Data Abstraction Layer
Goals
• Efficient Storage
• Performant Retrieval
• Understand DB
Goals
• Ease of Development
• Understand code
Microservices
Stored
Procedures
Data Flow Diagram
A data flow diagram is a representation of
the movement of data through a system
 Transactional Data
 Warehouse Data
 IoT Data
Relational Data
Non-Relational Data
23
Use Case Diagram
A use case diagram is a representation
of a user's interaction with the system
 Transactional Data
 Warehouse Data
 IoT Data
Relational Data
Non-Relational Data
Thank you.
We appreciate your interest, and look forward to working with you in the future!
Beth Wolfset
[email protected]
Twitter: @beth_wolfset
www.bluemetal.com | (866) 252-0111