Real Time Analytic Processing


The Dynamic Duo of Data Warehousing and Real-Time Streams
Bharat Gera
Manager
Big Data
Information is at the Center
of a New Wave of Opportunity…
• 44x as much data and content over the coming decade
• 2009: 800,000 petabytes
• 2020: 35 zettabytes
• 80% of the world’s data is unstructured
… And Organizations Need Deeper Insights
• 1 in 3 business leaders frequently make decisions based on information they don’t trust, or don’t have
• 1 in 2 business leaders say they don’t have access to the information they need to do their jobs
• 83% of CIOs cited “Business intelligence and analytics” as part of their visionary plans to enhance competitiveness
• 60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions
New Types of Real-Time Data Are Emerging
Stock market
• Impact of weather on securities prices
• Analyze market data at ultra-low latencies
Telephony
• CDR processing
• Social analysis
• Churn prediction
• Geomapping
Law Enforcement, Defense & Cyber Security
• Real-time multimodal surveillance
• Situational awareness
• Cyber security detection
Transportation
• Intelligent traffic management
Fraud prevention
• Detecting multi-party fraud
• Real-time fraud prevention
Smart Grid & Energy
• Transactive control
• Phasor Monitoring Unit
e-Science
• Space weather prediction
• Detection of transient events
• Synchrotron atomic research
Health & Life Sciences
• Neonatal ICU monitoring
• Epidemic early warning system
• Remote healthcare monitoring
Natural Systems
• Wildfire management
• Water management
Other
• Manufacturing
• Text Analysis
• Who’s Talking to Whom?
• ERP for Commodities
• FPGA Acceleration
Why Real-Time Feeding of the Data Warehouse?
Traditional Benefits of a Real-Time Feed
Reduces the Time Taken to Populate the Warehouse
• Reduces batch windows
• Allows pre-processing
Opens the Opportunity for Operational Real-Time Queries
Less Disruption to, and Fewer Dependencies on, Source Systems
New Drivers
Real-Time / “Right-Time” Reporting
Real-Time Event Analysis
• Fraud
• Churn
Real-Time Analytical Applications
• Applications that use analytics in real-time business processes
• Replenishment planning, manufacturing, network management
Master Data Driven Businesses
Approaches for Real-Time Feed
Change Data Capture
• Assumes knowledge of the source application
• Can use IBM InfoSphere CDC or DB2 Q Replication
Application Integrated Feed
Enterprise Service Bus delivery
• Tap into the application integration layer for updates
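As a rough illustration of the change-data-capture approach listed above, the sketch below applies a stream of captured change events to a warehouse copy of a table, using an in-memory SQLite database as a stand-in. The event layout (`op`, `key`, `row`), the `customer` table, and the `apply_change` helper are assumptions made up for this example; they are not the formats or APIs of InfoSphere CDC or Q Replication.

```python
import sqlite3

def apply_change(conn, event):
    """Apply one hypothetical change event to the warehouse copy of the table."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        # Upsert: the warehouse row always reflects the latest captured image.
        conn.execute(
            "INSERT OR REPLACE INTO customer (id, name, plan) VALUES (?, ?, ?)",
            (key, event["row"]["name"], event["row"]["plan"]),
        )
    elif op == "delete":
        conn.execute("DELETE FROM customer WHERE id = ?", (key,))
    else:
        raise ValueError(f"unknown operation: {op}")

# Stand-in for the warehouse target table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, plan TEXT)")

# Invented change events, in the order the source system committed them.
changes = [
    {"op": "insert", "key": 1, "row": {"name": "Ann", "plan": "prepaid"}},
    {"op": "insert", "key": 2, "row": {"name": "Bob", "plan": "postpaid"}},
    {"op": "update", "key": 1, "row": {"name": "Ann", "plan": "postpaid"}},
    {"op": "delete", "key": 2},
]

for event in changes:
    apply_change(conn, event)
conn.commit()

rows = conn.execute("SELECT id, name, plan FROM customer ORDER BY id").fetchall()
print(rows)
```

Because each event is applied as it arrives, the warehouse table tracks the source continuously instead of waiting for a batch window.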
[Diagram labels: Source System CDC, Enterprise Service Bus, Data Integration, InfoSphere Streams, Warehouse, Smart Analytics, BigInsights, Business Intelligence]
InfoSphere Streams and Smart Grid Analytics
Netezza
InfoSphere Streams
To Open Your Business to the Next Level of Innovation, You Need
the Right Architecture for Innovation
Cognos
Applications
High volume semi-structured
data in file systems
(often highly distributed)
Direct reporting
& analytics and
dashboards
Mash ups
SOA Web
Service
Information
Integration is Key
ODS
InfoSphere
Streams
InfoSphere
Information
Server
Combine All Sources for consistent
Warehouse Feed
• Data Integration
• Data Quality
• Data Delivery
Traditional data sources
(ERP, CRM, databases, etc.)
Real-time streaming
Data (structured and
unstructured)
• Event detection and
capture of Real time
Data
Financial
Planning
Marts
Active
Warehouse
InfoSphere
BigInsights
Spreadsheets
InfoSphere Streams Delivers Real Time Analytic Processing
A Platform to Run In-Motion Analytics on BIG Data
ICU
Monitoring
Real time
delivery
Environment
Algo
Trading
Volume
Variety
Terabytes per second
Cyber
Security
Monitoring
Powerful
Analytics
Government /
Law enforcement
Telco churn
predict
Smart
Grid
Petabytes per day
All kinds of data
All kinds of analytics
Velocity Insights in microseconds
Millions of
events per
second
Microsecond
Latency
Traditional / Non-traditional
data sources
Sophisticated Analytics
Mining in Microseconds
(included with Streams)
Acoustic
(IBM Research)
(Open Source)
Text
(listen,
verb),
(radio,
noun)
Advanced
Mathematical
Models
Simple & Advanced Text
(included with Streams)
(IBM Research)
(Open Source UIMA)
Predictive
(IBM Research)
(IBM Research)
[Formula graphic: R(s_t, a_t); chart label: population]
Image & Video
GeoSpatial
(IBM Research)
(Open Source)
Statistics
(included with
Streams)
What is Stream Computing?
Continuous Ingestion
Continuous Analysis in Microseconds
Stream Processing Architecture
Service Infrastructure
Context
Business Intelligence
Weather
GPS Location, Transactions,
Personal Health Monitor etc.
Real-time Revenue Assurance
Scoring Engine
Real-time dashboards

Telco Data
Online Monitoring
Online Learning for
Fraud Detection
Data
warehouse
Support
data
Mediation
Data mining
Warehouse
CDR collection
CDR Filtering
CDR deduplication
Stream Processing
InfoSphere
BigInsights
Telecom Customer using Streams
Challenge
• Call Detail Record (CDR) processing within the warehouse was sub-optimal, unable to meet business requirements, and was expanding TCO
• Could not achieve real-time billing, which required handling billions of CDRs per day and de-duplication against 15 days’ worth of CDR data
• Unable to support future IT and business needs with real-time analytics
Solution
• InfoSphere Streams supports real-time mediation by handling 6 billion CDRs each day, with linear scalability for growth
• Delivered a platform for real-time analytics
• Offloading CDR processing to the Streams platform enhanced warehouse performance and improved TCO
• Single platform for mediation and real-time analytics reduces IT complexity
Business Benefits
• Real-time CDR processing enables real-time billing
• Provides a platform for real-time analytics to drive revenue, e.g. location-driven marketing campaigns
• Data processing time reduced from 12 hours to 1 second; hardware costs reduced to 1/8th
• Support for future growth without the need to re-architect: more data, more analysis
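The de-duplication requirement in this case study (checking each CDR against 15 days of earlier records) can be sketched as a sliding-window filter. This is a minimal illustration, not the customer's implementation: the `CdrDeduplicator` class, the choice of `(caller, callee, timestamp)` as the duplicate key, and the assumption that records arrive in roughly increasing timestamp order are all invented for the example.

```python
from collections import deque

WINDOW_SECONDS = 15 * 24 * 3600  # 15 days, as in the case study

class CdrDeduplicator:
    """Hypothetical sliding-window duplicate filter for call detail records."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window = window_seconds
        self.seen = set()     # keys currently inside the window
        self.order = deque()  # (timestamp, key) pairs in arrival order

    def is_new(self, cdr):
        key = (cdr["caller"], cdr["callee"], cdr["timestamp"])
        now = cdr["timestamp"]
        # Evict keys that have fallen out of the window (assumes the oldest
        # record sits at the head, i.e. roughly time-ordered arrival).
        while self.order and self.order[0][0] < now - self.window:
            _, old_key = self.order.popleft()
            self.seen.discard(old_key)
        if key in self.seen:
            return False
        self.seen.add(key)
        self.order.append((now, key))
        return True

DAY = 24 * 3600
cdrs = [
    {"caller": "A", "callee": "B", "timestamp": 0},
    {"caller": "A", "callee": "B", "timestamp": 0},         # re-sent duplicate
    {"caller": "A", "callee": "C", "timestamp": 100},
    {"caller": "A", "callee": "B", "timestamp": 16 * DAY},  # a later, distinct call
]

d = CdrDeduplicator()
flags = [d.is_new(c) for c in cdrs]
print(flags)
```

Memory use is bounded by the number of distinct CDRs seen inside the window, which is why the window length matters at billions of records per day.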
Real time Momentum
Enterprise Integration
• Trusted Information & Governance
– Companies need to govern what comes in, and the insights that come out
• Data management
– Insights from Big Data must be incorporated into the warehouse
[Diagram labels: Traditional Sources, New Sources, Big Data Platform, Enterprise Integration, Data Warehouse]
IBM InfoSphere Streams v2.0
Agile Development Environment
• Eclipse IDE
• Streams LiveGraph
• Streams Debugger
Scale-out Architecture
• Clustered runtime for near-limitless capacity
• RHEL v5.3 and above
• x86 multicore hardware
• InfiniBand support
• Ethernet support
Sophisticated Analytics with Toolkits & Adapters
Front Office 3.0
• Database Toolkit
• Mining Toolkit
• Financial Toolkit
• Standard Toolkit
• Internet Toolkit
• User defined toolkits
• Over 50 samples
… and is especially useful for mining.
Stream
Analytics
RTAP
Non-Traditional / Non-Relational Data Sources
In-Motion
Analytics
Ultra Low
Latency
Results
Traditional /
Relational Data
Sources
Database
Warehouse
Analytics OLAP / OLTP
At-Rest
Analytics
Results
Streams mining toolkit
Use when there’s value in immediate awareness of anomalies
Supports Predictive Model Markup Language (PMML)


PMML: Supported by many vendors, e.g. SAS, SPSS
Mining algorithms from InfoSphere Warehouse
Operator Name (Algorithm Type)   Algorithm                Supported PMML Versions
Classification                   Decision Tree            2.0 - 3.0
Classification                   Logistic Regression      2.0 - 3.2
Classification                   Naïve Bayes              2.0 - 3.2
Regression                       Linear Regression        2.0 - 3.0
Regression                       Polynomial Regression    2.0 - 3.0
Regression                       Transform Regression     2.0 - 3.0
Clustering                       Demographic Clustering   2.0 - 3.0
Clustering                       Kohonen Clustering       2.0 - 3.0
Associations                     Association Rules        2.0 - 3.2
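To make the idea of scoring a PMML model on data in motion concrete, here is a small sketch that reads a linear regression from a PMML 3.2 document and scores tuples one at a time as they arrive. It does not use the Streams Mining Toolkit operators; the model, its field names (`minutes`, `dropped_calls`, `score`), and the helper functions are invented for illustration, and the hand-rolled parser handles only `NumericPredictor` terms with the default exponent of 1.

```python
import xml.etree.ElementTree as ET

# Invented example model: score = 1.0 + 0.5 * minutes + 2.0 * dropped_calls
PMML_DOC = """<PMML version="3.2" xmlns="http://www.dmg.org/PMML-3_2">
  <Header/>
  <DataDictionary numberOfFields="3">
    <DataField name="minutes" optype="continuous" dataType="double"/>
    <DataField name="dropped_calls" optype="continuous" dataType="double"/>
    <DataField name="score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="minutes"/>
      <MiningField name="dropped_calls"/>
      <MiningField name="score" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="1.0">
      <NumericPredictor name="minutes" coefficient="0.5"/>
      <NumericPredictor name="dropped_calls" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

NS = {"p": "http://www.dmg.org/PMML-3_2"}

def load_linear_model(pmml_text):
    """Extract the intercept and coefficients from a simple RegressionModel."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefs = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("p:NumericPredictor", NS)
    }
    return intercept, coefs

def score_stream(tuples, model):
    """Yield one score per incoming tuple, without buffering the stream."""
    intercept, coefs = model
    for t in tuples:
        yield intercept + sum(c * t[name] for name, c in coefs.items())

model = load_linear_model(PMML_DOC)
incoming = iter([
    {"minutes": 10.0, "dropped_calls": 0.0},
    {"minutes": 4.0, "dropped_calls": 3.0},
])
scores = list(score_stream(incoming, model))
print(scores)
```

The model is built once, offline, and exchanged as PMML; only the cheap per-tuple arithmetic runs in the stream.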
Velocity of Data Access – Multilevel Storage Access Times
Fetching data to analyze is MUCH faster with in-memory systems
Storage level         Retrieval time   Slowdown vs. previous level
Processor L1 cache    1 ns
Processor L2 cache    10 ns            x 10
Records in memory     100 ns           x 10
Records on disk       10 ms            x 100 000 (bottleneck)
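The multipliers on this slide follow directly from the retrieval times. The short calculation below reproduces them and, under an invented workload of one independent fetch per record for one million records, shows what the gap means in wall-clock time. The workload size and the one-fetch-per-record model are assumptions for illustration only.

```python
# Retrieval times from the slide, expressed in nanoseconds.
LEVELS = [
    ("Processor L1 cache", 1),
    ("Processor L2 cache", 10),
    ("Records in memory", 100),
    ("Records on disk", 10_000_000),  # 10 ms
]

# Slowdown of each level relative to the level listed before it.
ratios = [LEVELS[i][1] // LEVELS[i - 1][1] for i in range(1, len(LEVELS))]

# Hypothetical workload: one independent fetch per record, one million records.
RECORDS = 1_000_000
total_seconds = {name: ns * RECORDS / 1e9 for name, ns in LEVELS}

print(ratios)
for name, seconds in total_seconds.items():
    print(f"{name:20s} {seconds:12.3f} s")
```

Under these assumptions the same million fetches take a tenth of a second from memory and close to three hours from disk.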
Streams Mining Models
Scoring models on data in motion
Streams
• 3x to 19x more records per second
• 4 cores vs 216 cores (about 50x less hardware)
Test Processing Graph: Source, PMML Scorer, Dummy
Processor: Intel Xeon, 4 cores @ 3 GHz, 8 GB RAM
Data: Benchmark data (270 fields, 100,000 samples)
Models: 20 models (different types) from Warehouse
Model Type       Streams Throughput (tuples/sec)
Classification   8,000 – 15,000
Regression       20,000 – 50,000
Clustering       8,000 – 35,000
Greenplum on an 18-node rack, each node with 12 cores
2,612 records per second
Greenplum performance info source: http://www.enterprisestrategygroup.com/2011/06/emc-greenplum-data-computing-appliance/
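The headline ratios on this slide can be checked from the figures given. The calculation below uses only numbers stated above; it shows that the core ratio is 54, which the slide rounds to 50x, and that the slowest and fastest Streams throughputs bracket the quoted 3x to 19x range.

```python
# Figures taken from the slide.
STREAMS_CORES = 4
GREENPLUM_NODES = 18
CORES_PER_NODE = 12
GREENPLUM_RECORDS_PER_SEC = 2_612
STREAMS_MIN_TUPLES_PER_SEC = 8_000   # low end of classification / clustering
STREAMS_MAX_TUPLES_PER_SEC = 50_000  # high end of regression

greenplum_cores = GREENPLUM_NODES * CORES_PER_NODE
core_ratio = greenplum_cores / STREAMS_CORES
low_speedup = STREAMS_MIN_TUPLES_PER_SEC / GREENPLUM_RECORDS_PER_SEC
high_speedup = STREAMS_MAX_TUPLES_PER_SEC / GREENPLUM_RECORDS_PER_SEC

print(f"Greenplum cores: {greenplum_cores}")
print(f"Core ratio: {core_ratio:.0f}x")
print(f"Throughput ratio: {low_speedup:.1f}x to {high_speedup:.1f}x")
```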
Churn Prediction Highlights
• Data Set
  – Anonymized call detail records (CDRs)
    • 300,000 customers, 40,000,000 CDRs (avg 2K bytes per CDR)
    • CDR: caller, callee, caller provider, callee provider, timestamp, type, etc.
  – Customer contract data
    • Events: join, leave, plan change, etc.
• Hardware Infrastructure
  – Five Blade Servers: Dual-Processor, Dual-Core Xeon 5160
  – Gigabit Ethernet Switch
• Performance Results
  – ~400,000 calls and customer events per second
  – 34 billion CDRs per day on 5 standard x86 blades
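The daily figure in the performance results is the per-second rate multiplied out over a day. The arithmetic below uses only the numbers on the slide; the per-blade rate assumes the load is spread evenly over the five blades, which the slide does not state.

```python
EVENTS_PER_SECOND = 400_000      # approximate rate from the slide
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
BLADES = 5

events_per_day = EVENTS_PER_SECOND * SECONDS_PER_DAY
# Assumes an even split across blades (not stated on the slide).
per_blade_per_second = EVENTS_PER_SECOND // BLADES

print(f"{events_per_day:,} events per day")
print(f"{per_blade_per_second:,} events per second per blade")
```

The result, about 34.6 billion, matches the slide's rounded "34 billion CDRs per day".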
Visual Representation
A New Paradigm: In-Motion analytics for High throughput and Ultra-low latencies
Continuous Ingestion
Continuous Queries / Analytics on data in motion
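A continuous query differs from a warehouse query in that it stays resident and emits results as each tuple passes through. The sketch below mimics that with Python generators: a sliding-window average followed by a threshold alert. It is an analogy only and is not written in the Streams programming language; the function names, window size, threshold, and readings are all invented.

```python
from collections import deque

def sliding_average(stream, size):
    """Emit the average of the last `size` values after every new value."""
    window = deque(maxlen=size)
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

def alerts(stream, threshold):
    """Emit (position, average) whenever the running average exceeds the threshold."""
    for i, avg in enumerate(stream):
        if avg > threshold:
            yield i, avg

# Invented sensor readings arriving one at a time.
readings = iter([10, 12, 11, 30, 35, 12])

# The "query" is a pipeline of operators; nothing is stored beyond the window.
result = list(alerts(sliding_average(readings, size=3), threshold=20))
alert_positions = [i for i, _ in result]
print(alert_positions)
```

Each operator holds only its small window of state, so latency depends on per-tuple work and not on the total volume of data seen.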
Putting it all together …
Discover
Model
Visualize
& Publish
IBM SPSS
Measure
Score
Streaming Data
Sources
IBM Cognos Real
Time Monitoring
RTAP Combined With OLAP for Smarter Decisions
direct operational adjustments
On site: RTAP: Real Time Analytic Processing
Low Latency Analytics
Alerts
Wired drill pipe and
surface data
Event detection,
pattern recognition
sensory and drilling data
On shore: OLAP: Online Analytic Processing
longer term
exploration
adjustments
Traditional /
Data Sources
Data repository,
(Historian)
traditional
data
analytics
Results
Reports
Streams for Cyber Security
1
Network/Internet Forensics
Data Collection
InfoSphere Streams
Backup System
TAP Probes
Smart Analytic System
ISP
5 Gbps per 12-core Streams node, depending on required processing, for deep packet inspection
3
Data
security
(Guardium)
System Management and Result Visualization
Visualizer
LDAP
TDS
Cognos BI
SNMP
Cognos RTM
System Monitoring
ITM
TDW
TSM
WAS
Customer Insight for Real-Time Advertising
Customers
Multichannel
Website
Capture:
Search keywords
Page content
Cookies
IP addresses
Device info
Actions within a
window of time
In-Motion Behavior
Analysis
Match with Global Id
Map keywords to
attributes and
classification hierarchy
Invoke behavior
models/scores
Transactions from this
customer
• Cardholder since YYYYMM
• Average transaction value
• Monthly transaction value
• Categories purchased
• Brands purchased
Descriptive
• Age
• Gender
• Family situation
• Zip code
Target
Advertising
Platform
(Campaign
Management)
Interactions
Predictive Models
Scoring, Segmentation,
Analysis, Association
Advertisers
• Web registration
• Web visits
• Customer service contacts
• Channel preference
Attitudes
• Satisfaction scores
• Shopper type
• Eco score
Transactions from
all customers
Chart provided by: Arvind Sathi, IBM – Lead Architect - Information Agenda Communications Sector
RTAP Combined With OLAP for Smarter Customer Care
“A moment’s insight is sometimes worth a life’s experience.”
Oliver Wendell Holmes
Cognos Real Time Monitor
SOA Web Service
Cognos
Mashups
Applications
Spreadsheets
Fin Planning
InfoSphere
Warehouse
Data Marts
DB2
Analytic
Models
CDRs
InfoSphere
Streams
• ERP, CRM and Other Data Sources
IBM
BigInsights
Thank You!
ibm.com/smartersystems
Simply put, IBM is making systems smarter.