Memory-first Analytics


HPC Meets Big Data in Financial Services
Topics
● The evolution of Big Data on Wall St
● Big Data 1.0 - Fragmentation 2.0
● Needs of the next gen analytics platform
● Cray HPC technologies relevant to analytics
● The Graph revolution
● Memory first analytics
● Cray’s next gen technologies
● Architectural simplification opportunities
CRAY CONFIDENTIAL – DO NOT DISTRIBUTE
The Evolution of Finance Technology
Industrializing sales and trading
Mobile and Internet
The Data Revolution
• Decisions driven by the business and app dev
• Tight budgets now
• A data centric view
• Unprecedented volume, velocity, variety
• Shift to real-time or intraday
• Holistic enterprise view necessary
Cray Inc. Proprietary – Not For Public Disclosure
Real-time use cases
• Risk
• Fraud
• Surveillance
• KYC
• Portfolio optimization
• On the wire analytics
• Robo-advice, Machine Learning
• NLP, sentiment analytics, social graph, etc.
Legacy Won’t Cut It
Organizational
The Trading Silos
• Own OMS and EMS
• Own position keeping
• Own risk management
The regulatory and business need
• Firm-wide risk and trading limits
• Regulatory pressure
Technology
Databases
• Too costly, scale issues, fragmented
• Equities structured, FIs schema-less
Storage
• Expensive to scale and underperformant
• Not parallel
Big Data 1.0 - Fragmentation 2.0
• Not enterprise class
• Batch oriented
• Weak persistence model for real-time or DR
• Application level clusters e.g. Cassandra
• Analytic pipeline fragmentation and sprawl
• Monitoring not holistic
• Network bandwidth challenged
• Data (and licenses) scaled with compute
• Departmental ambition and design
• Operational complexity and fragility
• And more…
The Mega Trend is creating Analytics Cluster Sprawl & Siloed Big Data Application and Data Clusters
Analytics Pipeline: Data Prep/ETL -> Stream Processing -> Data Mining -> Interactive Queries -> Actionable Insight
• Multiple steps of analytics processing
• Each with different computing characteristics
• Requiring a separate cluster for each step
High TCO
• Large datacenter footprint
• High management cost
• Significant data movement
Big Data and Analytics 2.0 Platform Needs
(All the Data, All the Tools. Start Anywhere Go Anywhere)
Performance Range
• Memory based streaming on-wire analytics
• Operational analytics
• Deep history and batch analytics
Tool Range
• All tools working well
• Spark, Hive, Cassandra, Storm, Mahout etc
• Graph (OLTP and OLAP style)
• User can build an analytic and datastore anywhere in the pipeline and evolve it anywhere
Persistence Layer
• Multi tier external storage for performance and capacity with integrated metadata – day 0 to year 7
• Single shared storage for analytic Hadoop and Risk Grid
• Data and compute scaled independently
Needed – One platform, many different workloads, and operationally efficient
Data Prep, Simple counts, Summarization: batch processing; every item in a dataset once (throughput matters)
Statistics, Machine Learning: iterative algorithms; same subset of data several times
Graphs, Search: interactive exploration; different subsets each time (latency matters)
• Analytics pipelines are a sequence of tasks, possibly using different tools and processing different sized datasets
• These tasks place very different infrastructure demands, and have resulted in narrowly optimized task-specific solutions and infrastructure (e.g. DW appliances)
• Need: support the widest range of analytic workloads on a single system footprint, without compromising performance
General Hadoop Analytics: Needs
Managing Stack Fluidity and Complexity
• Continuous Apache improvements
• Technology outside the Hadoop distribution e.g. Cassandra
• Clusters are hard to balance, build and maintain reliably
• Need a supportable appliance with an open distribution
• Large graph is becoming a part of the pipeline
HDFS – Grade level B
• 3x on-cluster replication is not a recipe for success
• Scale compute and storage independently
• Storage should be global and have a performance/capacity hierarchy
• And be much, much faster
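To put a number on the replication point above, here is a minimal sketch of the raw capacity needed when every block is stored several times on the cluster. The function names and the 100 TB dataset size are illustrative assumptions, not figures from this deck.

```python
def raw_capacity_tb(usable_tb: float, replication_factor: int = 3) -> float:
    """Raw disk needed to hold `usable_tb` of data when every block is
    stored `replication_factor` times (3 is the HDFS default)."""
    if usable_tb < 0 or replication_factor < 1:
        raise ValueError("usable_tb must be >= 0 and replication_factor >= 1")
    return usable_tb * replication_factor


def replication_overhead_tb(usable_tb: float, replication_factor: int = 3) -> float:
    """Extra raw capacity consumed purely by the additional copies."""
    return raw_capacity_tb(usable_tb, replication_factor) - usable_tb


if __name__ == "__main__":
    usable = 100.0  # hypothetical dataset size in TB
    print(raw_capacity_tb(usable))          # raw TB needed at 3x
    print(replication_overhead_tb(usable))  # TB spent on extra copies
```

Because the copies live on the compute nodes' own disks, growing the dataset in this model also means buying more compute nodes, which is the coupling the "scale independently" bullet argues against.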
Environmentals
• Can improve significantly on a commodity cluster
• Density, Power, Cooling
Bringing Cray HPC Technologies to Big Data Analytics
Enabling transformative ROI through extreme technology
Cray’s Vision:
The Fusion of Supercomputing and Big & Fast Data
Modeling The World
Cray Supercomputers solving “grand challenges” in science, engineering and analytics
Compute
Math Models: Modeling and simulation augmented with data to provide the highest fidelity virtual reality results
Store
Data-Intensive Processing: High throughput event processing & data capture from sensors, data feeds and instruments
Analyze
Data Models: Integration of datasets and math models for search, analysis, predictive modeling and knowledge discovery
Cray’s Assets we bring into Analytics
Performance Philosophy
• History at the forefront of supercomputing
• Scaling and reliability
• Great people who live and breathe performance
Systems Philosophy
• Balanced systems design: memory, CPU, storage
• Green environmental focus: density, power, cooling
• Systems and appliances vs. components
• Prebuilding and testing
Unique Technology
• Aries interconnect
• Global address spaces and languages
• Cray Graph Engine (port of Apache Jena for massive shared memory)
Urika Product Line Today
Urika-GD
• ‘Discovery’ appliance
• Optimized for graph analytics (XMT2)
• Fine-grained shared memory
• Massively multithreaded hardware accelerator to speed access to large, shared memory
• Graph data model (RDF)
• SPARQL query language: pattern-matching
• Parallelized graph algorithms
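For readers new to the RDF model, the sketch below illustrates the general idea behind SPARQL-style pattern matching: data is a set of (subject, predicate, object) triples, and a query is a list of triple patterns whose `?variables` must bind consistently. This is a toy in-memory matcher for illustration only, not how Urika-GD or the Cray Graph Engine is implemented; all entity and predicate names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]


def is_var(term: str) -> bool:
    """Terms starting with '?' are variables, as in SPARQL."""
    return term.startswith("?")


def match_pattern(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Try to extend `binding` so that `pattern` matches `triple`."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # Bind the variable, or check it agrees with an earlier binding.
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def query(triples: List[Triple], patterns: List[Triple]) -> List[Binding]:
    """Join all patterns by naive nested-loop matching."""
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        bindings = [
            nb
            for b in bindings
            for t in triples
            if (nb := match_pattern(pattern, t, b)) is not None
        ]
    return bindings


# Hypothetical compliance-style data.
TRIPLES: List[Triple] = [
    ("alice", "worksFor", "deskA"),
    ("bob", "worksFor", "deskA"),
    ("carol", "worksFor", "deskB"),
    ("alice", "tradedWith", "cp1"),
    ("bob", "tradedWith", "cp1"),
    ("carol", "tradedWith", "cp2"),
]

if __name__ == "__main__":
    # "Which employees of deskA traded with which counterparties?"
    for row in query(TRIPLES, [("?e", "worksFor", "deskA"), ("?e", "tradedWith", "?cp")]):
        print(row)
```

Each added pattern is a join against the whole triple set, so access is scattered across the graph rather than sequential, which is why the slide stresses large shared memory.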
Urika-XA
• Analytic platform
• Supports wide range of analytic applications
• Hadoop, Spark, and future workloads
• Batch and low-latency
• Data mining, machine learning, interactive data exploration
GRAPH: What’s All the Fuss About?
FS: All about relationships…
Bad things
• Fraud: Employees, customers, transactions
• Compliance: Rules, policies, counterparties, transactions, employees
Good things
• Opportunity: news, securities, companies, market data
• Portfolio optimization: demographics, customers, goals, securities, news
Graph Use Case Example in FS – Smart Compliance (Mphasis)
Data Discovery: The Real Promise of Big Data…
“Take all these different data sources and put them together and then
help me find something about the data that I don’t already know…”
Graph – A Summary
• If you are not doing or planning Graph apps now, your competitors are and you will be soon
• Cray was well ahead of the commercial market with world leading graph technology
• Cray’s scaling technology is still far ahead of everyone else
• Graph needs to be a general part of the analytics pipeline
• ISVs with Graph needs approach us to make their apps enterprise scale
A Common Technology Combination
Real-time Data: Kafka
Persistence: Cassandra
Processing: Storm or Spark
Reference: Large Graph
For this to work well you need:
- Heavy memory and compute
- Common persistence layer
- Large unpartitioned graph capability (this means globally addressable shared memory)
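A minimal sketch of the combination above, using plain Python stand-ins: an iterable plays the role of the real-time feed, a list plays the role of the persistence layer, and an in-memory adjacency map plays the role of the large reference graph. No Kafka, Storm, Spark or Cassandra APIs are used, and the event fields, entity names and the two-hop rule are all hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set

Graph = Dict[str, Set[str]]


def within_hops(graph: Graph, start: str, target: str, max_hops: int) -> bool:
    """Breadth-first search: is `target` reachable from `start`
    in at most `max_hops` edges?"""
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        node, depth = frontier.popleft()
        if node == target:
            return True
        if depth == max_hops:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return False


def process(events: Iterable[dict], graph: Graph, store: List[dict],
            max_hops: int = 2) -> None:
    """Enrich each streamed event with a graph lookup, then persist it."""
    for event in events:
        related = within_hops(graph, event["trader"], event["counterparty"], max_hops)
        store.append({**event, "related_party": related})


# Hypothetical reference graph: trader t1 controls acct9, which is linked to cp7.
REFERENCE_GRAPH: Graph = {"t1": {"acct9"}, "acct9": {"cp7"}, "t2": set()}

if __name__ == "__main__":
    feed = [
        {"trader": "t1", "counterparty": "cp7", "qty": 100},
        {"trader": "t2", "counterparty": "cp7", "qty": 50},
    ]
    persisted: List[dict] = []
    process(feed, REFERENCE_GRAPH, persisted)
    for row in persisted:
        print(row)
```

Every event can start a traversal from any vertex, so there is no obvious way to partition the graph so that lookups stay local; that is the motivation for the "large unpartitioned graph" requirement.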
Cray Analytics Direction
● Merging HPC and Compute
● Run analytics alongside traditional HPC
● Aries Interconnect
● In memory working storage
● Include Graph in the analytics pipeline
● Aries -> PGAS-> Cray Graph Engine
● Flexible Clusters
● Storage scaling flexibility
Memory-first Analytics – 3 Drivers
1 Use Cases
● Then: Summarization/Aggregation in Batch mode
● Now:
• Interactive Data Mining
• Streaming analytics (Internet of Things)
• Low-latency analytic query processing at scale
2 ‘Data Gravity’
• Biggest productivity killer in analytics is data movement
• High Cost vs. Low value of moving data
• The ‘No Movement’ movement
• Databases in memory (SAP HANA, Oracle 12c)
• Analytic processing in memory: Spark/Databricks
• Filesystems in memory: (Tachyon)
• Complete memory-first datacenter architecture (e.g. RAMCloud)
3 Economics
• Many useful analytic datasets fit in “memory”
– Amazon customer orders, UPS tracking details for 1 year
• “Not whether memory is cheaper than disk, but cheap enough”
YarcData COMPANY CONFIDENTIAL – DO NOT DISTRIBUTE
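The "fits in memory" claim is easy to sanity-check with back-of-envelope arithmetic. The sketch below is one way to do it; the record count, record size, node count, memory per node and the usable fraction are hypothetical placeholders, not actual Amazon or UPS volumes and not Cray configurations.

```python
GIB = 2 ** 30


def dataset_bytes(records: int, bytes_per_record: int) -> int:
    """Approximate in-memory size of a dataset, ignoring index overhead."""
    return records * bytes_per_record


def fits_in_memory(records: int, bytes_per_record: int, nodes: int,
                   memory_per_node_bytes: int, usable_fraction: float = 0.5) -> bool:
    """True if the dataset fits in the share of aggregate cluster memory
    left for data after the OS, runtimes and working space
    (`usable_fraction` is an assumed planning figure)."""
    usable = nodes * memory_per_node_bytes * usable_fraction
    return dataset_bytes(records, bytes_per_record) <= usable


if __name__ == "__main__":
    # Hypothetical: one billion records of about 200 bytes each.
    size = dataset_bytes(1_000_000_000, 200)
    print(f"dataset size: {size / GIB:.0f} GiB")
    print(fits_in_memory(1_000_000_000, 200, nodes=8, memory_per_node_bytes=256 * GIB))
    print(fits_in_memory(1_000_000_000, 200, nodes=1, memory_per_node_bytes=256 * GIB))
```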
Our vision for Data Intensive Supercomputing
Cray – Around the corner
Compute Layer
• Appliance with multiple distribution support
• Technology outside the Hadoop distribution e.g. Cassandra
• Replace InfiniBand with faster and smarter Aries interconnect -> large shared memory
• Large graph to become a part of the pipeline
• Embrace new technologies such as OpenStack, Mesos and Docker
Storage Layer
• In-memory file system support
• SSD for HDFS
• POSIX-compliant Lustre parallel file system for a scalable, performant global namespace
• Scale compute and storage independently
Environmentals
• Can improve significantly on a commodity cluster
• Density, Power, Cooling excellence
Capital Markets – A Case For Convergence
[Figure: today's architecture by data horizon, one cluster per horizon]
Data horizon: Hours (Gb), Days (Tb), Years (Pb)
Usage: Metrics, Monitoring and Alerts; RT Discovery and visualization; Time Series; Search; Surveillance; Trade Analytics; Compliance
Processing Layer: Apache Storm, Apache Spark, Streambase, Apache Ignite; SQL: MemSQL, VoltDB; SQL Analytics Store: Actian, Netezza; Non SQL: MongoDB, Hbase, Cassandra; SQL Data Warehouse: DB2, Teradata; Non SQL: Graph
Server Tier: Ingest and on-wire cluster; Operational analytics cluster; Deep and Complex Analytics
Persistence Layer: SSD: Redis, Hazelcast, IMDGrid; (On cluster): MongoDB, Cassandra, HDFS, Proprietary Parallel FS; (On cluster): MPP RDBMS, HDFS, Proprietary Parallel FS, Graph Store
Performance Tier, Capacity Tier: Internal Drives or External Storage; SEC 17a-4 Immutable Store
Architectural Simplification with Cray
[Figure: simplified architecture by data horizon on a shared Cray platform]
Data horizon: Hours (Gb), Days (Tb), Years (Pb)
Usage: Metrics, Monitoring and Alerts; RT Discovery and visualization; Time Series; Search; Surveillance; Trade Analytics; Compliance
Data Operating System: Yarn or Mesos
Processing Layer: Apache Storm, Apache Spark, Streambase, Apache Ignite; SQL: XtremeData with Non SQL: MongoDB, Hbase, Cassandra; SQL: XtremeData with Non SQL: Hbase, MapReduce, Graph
Persistence Layer: SSD: Redis, Hazelcast, IMDGrid, Tachyon; (OFF cluster): Cray Lustre: with Cassandra, Hadoop, Graph, XtremeData
Server Tier: Cray Urika
Performance & Capacity Tier: Cray TAS; SEC 17a-4 Immutable Store
Summary – Extreme Technology Gives Extreme Flexibility
● You invest in an analytic platform for several years
● Technology and Use Cases are changing fast
● A platform needs to handle as much as can be thrown at it for that investment cycle.
● “Oh, you want to do scalable graph in your analytic pipeline? Sorry, you can’t do that”
● NOT acceptable
● With one inexpensive platform you can converge:
● On the wire analytics
● Deep analytics
● ETL, data prep, analytics etc
● …and graph (even if you don’t know your use case today)