Virtualnation, D4M discussion

D4M – Signal Processing On Databases
42 Sydney St
Artarmon NSW 2064
Australia
Virtualnation
Starting with Big Data
• Why care?
• Within your reach: big data and big compute on a budget
• Start with data and apply math
• D4M with Accumulo: new technology from MIT and NSA that claims to
  • require 100x less code; and
  • be 100x faster than other approaches
• Fundamentally mathematical analysis for big data
• Lift the lid.
Understand the world through data and math
• How do you want to understand the world?
• IT approaches have evolved from a past where IT was expensive and controlled by the few
• Problems were modeled and constrained not only to fit onto limited computers but also to fit in with the politics of the enterprise
• If you could observe without built-in constraints and preconceived bias, how would you approach computing?
• Understand through the scientific method: data and math
The Primordial Web (92)
[Diagram elements: Browser (html); http put; Server (http); SQL; Database (sql); data; http get; Gopher. Column labels: Language, Client, Server, Database.]
• Browser GUI? HTTP for files? Perl for analysis? SQL for data?
• A lot of work just to view data.
The Modern Web
[Diagram elements: Game (data); http put; Server (http); java; Database (triples); data; http get. Column labels: Language, Client, Server, Database.]
• Game GUI! HTTP for files? Perl for analysis? Triples for data!
• A lot of work to view a lot of data.
• Great view. Massive data.
Future Web?
[Diagram elements: Game (data); http put; Server (http); java; Database (triples); data; http get. Column labels: Language, Client, Server, Database.]
• Game GUI! Fileserver for files! D4M for analysis! Triples for data!
• A little work to view a lot of data. Securely.
• Great view. Massive data.
Big Data and Big Compute on a budget
• ~$9K server with 256 GB RAM, 32 CPU cores and 1.7 TB SSD
• ~$26K storage server with 270 TB
• $199 4 TB USB drive
• ZFS / SmartOS as a free virtualization technology
• ~68 TB: the entire transactional corpus of a $45B Australian retailer
• How big are your possible data sets?
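To put the figures above side by side, here is a rough back-of-envelope sketch in Python. It uses only the approximate prices and capacities quoted on this slide and ignores redundancy, power, and throughput, so treat the results as order-of-magnitude only. The variable and function names are illustrative.

```python
import math

# Approximate figures quoted on the slide (not current market prices).
STORAGE_SERVER_COST_USD = 26_000
STORAGE_SERVER_TB = 270
USB_DRIVE_COST_USD = 199
USB_DRIVE_TB = 4
RETAIL_CORPUS_TB = 68  # entire transactional corpus of the retailer


def cost_per_tb(cost_usd, capacity_tb):
    """Raw purchase cost per terabyte, ignoring redundancy and running costs."""
    return cost_usd / capacity_tb


server_per_tb = cost_per_tb(STORAGE_SERVER_COST_USD, STORAGE_SERVER_TB)
usb_per_tb = cost_per_tb(USB_DRIVE_COST_USD, USB_DRIVE_TB)

# How many USB drives would hold the retail corpus, and what would they cost?
drives_needed = math.ceil(RETAIL_CORPUS_TB / USB_DRIVE_TB)
drives_cost = drives_needed * USB_DRIVE_COST_USD

# What fraction of the storage server would the corpus occupy?
corpus_share = RETAIL_CORPUS_TB / STORAGE_SERVER_TB

print(f"storage server: ${server_per_tb:.2f} per TB")
print(f"USB drive:      ${usb_per_tb:.2f} per TB")
print(f"corpus fits on {drives_needed} USB drives (${drives_cost:,})")
print(f"corpus uses {corpus_share:.0%} of the storage server")
```

On these numbers the whole 68 TB corpus takes up roughly a quarter of one storage server.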
Apache Accumulo
• NSA’s BigTable implementation, now a top-level Apache project
• Cell-level security to support privacy and need-to-know
• Supports large-scale processing of sparse matrices…
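The idea of cell-level security can be sketched with a toy Python model. This is not Accumulo's API: Accumulo attaches a visibility expression to each key and filters scans by the reader's authorizations, whereas this sketch assumes a simplified rule format (a list of label sets, read as an OR of ANDs). All names and sample cells are hypothetical.

```python
# Toy model of cell-level security (assumption: simplified rule format,
# not Accumulo's visibility expression syntax).
# A rule is a list of label sets. A cell is visible if the reader holds
# every label in at least one of the sets. An empty rule means "public".


def is_visible(rule, authorizations):
    """Return True if the reader's authorizations satisfy the cell's rule."""
    if not rule:
        return True
    return any(required <= authorizations for required in rule)


def scan(cells, authorizations):
    """Return only the (row, column, value) triples the reader may see."""
    return [
        (row, col, value)
        for row, col, rule, value in cells
        if is_visible(rule, authorizations)
    ]


# Hypothetical sample cells: (row, column, visibility rule, value).
cells = [
    ("customer42", "name", [], "A. Shopper"),
    ("customer42", "purchase_total", [{"analyst"}], 1234.50),
    ("customer42", "card_token", [{"fraud", "pci"}, {"admin"}], "SAMPLE-TOKEN"),
]

for auths in (set(), {"analyst"}, {"fraud", "pci"}, {"admin", "analyst"}):
    visible_columns = [col for _, col, _ in scan(cells, auths)]
    print(sorted(auths), "->", visible_columns)
```

The point of the design is that filtering happens per cell at read time, so readers with different authorizations can share one table without separate copies of the data.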
Packaged into a secure production configuration
Parallel Warehouse Scale Computer
[Figure: a parallel architecture of nodes, each with CPU, RAM, and disk, connected by a network switch, shown beside a memory hierarchy.
Memory Hierarchy: Registers, Cache, Local Memory, Remote Memory, SSD, Disk.
Unit of Memory: Instruction Operands, Blocks, Messages, Pages.
Implications: Bandwidth, Latency, Programmability, Capacity, each marked "High" at one end of the hierarchy.]
See http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
Starting with Big Data
• It is now cheap to collect all data forever.
• Unconstrained approach to data acquisition
• No up-front analysis or modeling
• Much of it involves graph analytics:
  • ISR. GOAL: Identify anomalous patterns of life
  • Social. GOAL: Identify hidden social networks
  • Cyber. GOAL: Detect cyber attacks or malicious software
D4M - Signal Processing on Database
• Weak Signatures, Noisy Data, Dynamics
• Novel Analytics for: Text, Cyber, Bio
• High Level Composable API: D4M (“Databases for Matlab”)
• Distributed Database: Accumulo/HBase (triple store)
• Distributed Database / Distributed File System
• Interactive Supercomputing
• High Performance Computing: Cluster + Hadoop
Detection Theory
Matlab Demo - Reuters Corpus V1 (NIST)
• 810,000 Reuters news items
• The demonstration selected 70,000 items and found 13,000 entities
• A is a 70K x 13K associative array with 500K entries
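A quick calculation shows how sparse A is. This sketch uses only the three sizes quoted above. The 8 bytes per cell used for the dense comparison is an assumption (a double-precision value), not something stated on the slide.

```python
# Sizes quoted on the slide.
ROWS = 70_000       # news items
COLS = 13_000       # entities
NONZERO = 500_000   # stored entries in A

possible_cells = ROWS * COLS
density = NONZERO / possible_cells   # fraction of cells that hold a value
entries_per_row = NONZERO / ROWS     # average entities per news item

# Assumption: a dense array would spend 8 bytes on every cell.
dense_bytes = possible_cells * 8

print(f"possible cells: {possible_cells:,}")
print(f"density: {density:.4%}")
print(f"average entries per news item: {entries_per_row:.1f}")
print(f"dense storage at 8 bytes/cell: {dense_bytes / 1e9:.2f} GB")
```

Only about 0.05% of the cells are populated, around seven entities per news item, which is why storing just the non-empty entries pays off.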
D4M demonstrations
7 Universal Constructs for Analytics
Multi-Dimensional Associative Array
Universal Exploded Schema
D4M Stores Giant Sparse Matrices in the Accumulo Triple Store Database
[Diagram: a Triple Store (Distributed Database) connected through D4M (Dynamic Distributed Dimensional Data Model) to Associative Arrays in a Numerical Computing Environment, with a graph of nodes A to E. Query: T(:,ggaatctgcc)]
• Triple stores are high-performance distributed databases for heterogeneous data
• A D4M query returns a sparse matrix or graph from a triple store…
• …for statistical signal processing or graph analysis in Matlab
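The query-to-matrix idea can be sketched in plain Python. This is a toy in-memory stand-in, not D4M's or Accumulo's API: the `TripleStore` class, the `to_sparse` helper, the row keys (`seq1` and so on), and every column key other than `ggaatctgcc` are made up for illustration.

```python
from collections import defaultdict


class TripleStore:
    """Toy in-memory table of (row, column, value) triples (hypothetical)."""

    def __init__(self):
        self._by_col = defaultdict(dict)  # column key -> {row key: value}

    def put(self, row, col, value):
        self._by_col[col][row] = value

    def column(self, col):
        """Rough analogue of T(:,col): every triple stored under one column key."""
        stored = self._by_col.get(col, {})
        return [(row, col, value) for row, value in sorted(stored.items())]


def to_sparse(triples):
    """Convert triples to sorted row keys, column keys, and a dict-of-keys matrix."""
    rows = sorted({r for r, _, _ in triples})
    cols = sorted({c for _, c, _ in triples})
    row_index = {r: i for i, r in enumerate(rows)}
    col_index = {c: j for j, c in enumerate(cols)}
    matrix = {(row_index[r], col_index[c]): v for r, c, v in triples}
    return rows, cols, matrix


# Invented sample data: which sequence IDs contain which 10-letter subsequence.
T = TripleStore()
T.put("seq1", "ggaatctgcc", 1)
T.put("seq1", "ttgacctaga", 1)
T.put("seq2", "ggaatctgcc", 1)
T.put("seq3", "acgtacgtac", 1)

hits = T.column("ggaatctgcc")
rows, cols, matrix = to_sparse(hits)
print(rows)    # row keys of the result
print(cols)    # column keys of the result
print(matrix)  # non-zero entries keyed by (row index, column index)
```

Keeping the string keys alongside the numeric matrix is what lets the result be read either as a sparse matrix for numerical work or as a graph linking rows to columns.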
Big Data for High Speed Sequence Matching