Transcript: Data Packs

Infobright Meetup
Host: Avner Algom
May 28, 2012
Agenda
 Infobright: Paul Desjardins, VP Business Development, Infobright Inc.
   What part of the Big Data problem does Infobright solve?
   Where does Infobright fit in the database landscape?
 Technical Overview: Joon Kim, Sr. Sales Engineer
   Use cases
 WebCollage: Lior Haham
   WebCollage Case Study: Using Analytics for Hosted Web Applications
 Zaponet: Asaf Birenzvieg, CEO
 Q&A
Introduction / Experience with Infobright

Growing Customer Base across Use Cases and Verticals
 1,000+ direct and OEM installations across North America, EMEA and Asia
 8 of the top 10 global telecom carriers use Infobright via OEMs/ISVs
 Verticals served: logistics, manufacturing, business intelligence; online and mobile advertising / web analytics; government, utilities, research; financial services; telecom and security; gaming, social networks
The Machine-Generated Data Problem

“Machine-generated data is the future of data management.”
Curt Monash, DBMS2

Machine-generated/hybrid data:
 Weblogs
 Computer and network events
 CDRs
 Financial trades
 Sensors, RFID, etc.
 Online game data

Human-generated data: input from most conventional kinds of transactions:
 Purchase/sale
 Inventory
 Manufacturing
 Employment status change

[Chart: rate of growth, machine-generated vs. human-generated data]
The Value in the Data
“Analytics drives insights; insights lead to greater understanding of customers
and markets; that understanding yields innovative products, better customer
targeting, improved pricing, and superior growth in both revenue and profits.”
Accenture Technology Vision, 2011
Network Analytics
 Network optimization
 Troubleshooting
 Capacity planning
 Customer assurance
 Fraud detection

CDR Analytics
 Customer behavior analysis
 Marketing campaigns/services analysis
 Optimize network capacity
 Fraud detection
 Compliance and audit

Mobile Advertising Analytics
 Need to capture web data, mobile data, network data
 Mobile ad campaign analytics
 Customer behavior analysis
Current Technology: Hitting the Wall
Today’s database technology requires huge effort and massive hardware.

How performance issues are typically addressed, by pace of data growth
(percentage of respondents; high-growth organizations vs. low-growth organizations):

 Tune or upgrade existing databases: 75% / 66%
 Upgrade server hardware/processors: 70% / 54%
 Upgrade/expand storage systems: 60% / 33%
 Archive older data on other systems: 44% / 30%
 Upgrade networking infrastructure: 32% / 21%
 Don’t know / unsure: 4% / 7%

Source: “Keeping Up with Ever-Expanding Enterprise Data,” Joseph McKendrick, Research Analyst, Unisphere Research, October 2010
Infobright Customer Performance Statistics
Fast query response with no tuning or indexes:

 Mobile data analytic queries: 2+ hours with MySQL vs. <10 seconds with Infobright
 Oracle query set: 10 seconds – 15 minutes in Oracle vs. 0.43 – 22 seconds with Infobright
 BI report (15MM events): 43 min with SQL Server vs. 23 seconds with Infobright
 Data load: 7 hrs in Informix vs. 17 seconds with Infobright
 Data load: 11 hours in MySQL (ISAM) vs. 11 minutes with Infobright
Save Time, Save Cost
 Fastest time to value
   Download in minutes, install in minutes
   No indexes, no partitions, no projections
   No complex hardware to install
 Minimal administration
   Self-tuning
   Self-managing
   Eliminate or reduce aggregate table creation
 Outstanding performance
   Fast query response against large data volumes
   Load speeds over 2TB/hour with DLP
   High data compression: 10:1 to 40:1+
 Economical
   Low subscription cost
   Less data storage
   Industry-standard servers
Where does Infobright fit in the database landscape?
 One size DOESN’T fit all.
 Specialized databases deployed
   Excellent at what they were designed for
   More open-source specialized databases than commercial
   Cloud / SaaS use of specialty DBMSs becomes popular
 Database virtualization
   Significantly lowered DBA costs

[Diagram: Your Warehouse at the center, surrounded by Row, Column, Hadoop, NewSQL, and NoSQL technologies]
The Emerging Database Landscape

Row / NewSQL*
 Basic description: structured data stored in rows on disk
 Common use cases: transaction processing, interactive transactional applications
 Positives: strong for capturing and inputting new records; robust, proven technology
 Negatives: scale issues; less suitable for queries, especially against large databases
 Key players: MySQL, Oracle, SQL Server, Sybase ASE

Columnar
 Basic description: structured data is vertically striped and stored in columns on disk
 Common use cases: historical data analysis, data warehousing, business intelligence
 Positives: fast query support, especially for ad hoc queries on large datasets; compression
 Negatives: not suited for transactions; import and export speed; heavy computing resource utilization
 Key players: Infobright, Aster Data, Sybase IQ, Vertica, ParAccel

NoSQL – Key-Value Store
 Basic description: data stored usually in memory with some persistent backup
 Common use cases: used as a cache for storing frequently requested data for a web app
 Positives: scalability; very fast storage and retrieval of unstructured and partly structured data
 Negatives: usually all data must fit into memory; no complex query capabilities
 Key players: Memcached, Amazon S3, Redis, Voldemort

NoSQL – Document Store
 Basic description: persistent storage along with some SQL-like querying functionality
 Common use cases: web apps or any app that needs better performance without having to define columns in an RDBMS
 Positives: persistent store with built-in scalability features such as sharding, and better query support than key-value stores
 Negatives: lack of sophisticated query capabilities
 Key players: MongoDB, CouchDB, SimpleDB

NoSQL – Column Store
 Basic description: very large data storage, MapReduce support
 Common use cases: real-time data logging such as in finance or web analytics
 Positives: very high throughput for Big Data; strong partitioning support; random read/write access
 Negatives: low-level API; inability to perform complex queries; high latency of response for queries
 Key players: HBase, BigTable, Cassandra
Why use Infobright to deal with large volumes of machine-generated data?

EASY
 To install
 To use

AFFORDABLE
 Less hardware
 Low software cost
Technical Overview of Infobright
Joon Kim
Senior Sales Engineer
[email protected]
Key Components of Infobright
 Column-oriented
 Knowledge Grid: statistics and metadata “describing” the super-compressed data
 Data Packs: data stored in manageably sized, highly compressed data packs
 Data compressed using algorithms tailored to data type

Smarter architecture
 Load data and go
 No indices or partitions to build and maintain
 Knowledge Grid automatically updated as data packs are created or updated
 Super-compact data footprint can leverage off-the-shelf hardware
Infobright Architecture

1. Column Orientation

Incoming data:

EMP_ID  FNAME  LNAME   SALARY
1       Moe    Howard  10000
2       Curly  Joe     12000
3       Larry  Fine    9000

Column-oriented layout:
(1,2,3; Moe,Curly,Larry; Howard,Joe,Fine; 10000,12000,9000)

 Works well with aggregate results (sum, count, avg.)
 Only columns that are relevant need to be touched
 Consistent performance with any database design
 Allows for very efficient compression
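To make the layout concrete, here is a minimal Python sketch (illustrative only, not Infobright internals) of the same employee table stored column-wise, where an aggregate touches only the SALARY column:

```python
# Why a column layout helps aggregates: summing salaries reads one
# contiguous list, while a row store would read every field of every row.

rows = [
    (1, "Moe", "Howard", 10000),
    (2, "Curly", "Joe", 12000),
    (3, "Larry", "Fine", 9000),
]

# Column-oriented layout: one contiguous list per column.
columns = {
    "EMP_ID": [r[0] for r in rows],
    "FNAME": [r[1] for r in rows],
    "LNAME": [r[2] for r in rows],
    "SALARY": [r[3] for r in rows],
}

# The aggregate reads exactly one column.
total = sum(columns["SALARY"])
print(total)  # 31000
```

Because each list holds values of a single type and domain, it also compresses far better than interleaved row data.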
2. Data Packs and Compression

Data Packs
 Each data pack contains 65,536 (64K) data values
 Compression is applied to each individual data pack
 The compression algorithm varies depending on data type and distribution

Compression (patent-pending algorithms)
 Results vary depending on the distribution of data among data packs
 A typical overall compression ratio seen in the field is 10:1
 Some customers have seen results of 40:1 and higher
 For example, 1TB of raw data compressed 10:1 would require only 100GB of disk capacity
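Why ratios depend on data distribution can be seen with a toy sketch (illustrative, not Infobright’s codecs): run-length encoding 65,536-value packs of a clustered, low-cardinality column collapses each pack to a handful of runs.

```python
# Split a column into 65,536-value packs and run-length encode each pack.
# Clustered, low-cardinality data compresses dramatically better, which is
# why observed ratios vary with distribution.

PACK_SIZE = 65_536

def packs(column):
    for i in range(0, len(column), PACK_SIZE):
        yield column[i:i + PACK_SIZE]

def rle(values):
    """Run-length encode a pack into [(value, run_length), ...]."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

# A clustered status column: 131,072 values = exactly two packs.
column = ["OK"] * 100_000 + ["FAIL"] * 31_072
encoded = [rle(p) for p in packs(column)]
print([len(e) for e in encoded])  # runs per pack: [1, 2]
```

A pack that is one long run stores two numbers instead of 65,536 values; a pack of random values would gain almost nothing from the same codec.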
3. The Knowledge Grid

 Knowledge Grid: applies to the whole table
 Knowledge Nodes: built for each Data Pack; information about the data
   Basic statistics (numerical ranges, character maps): calculated during load
   Dynamic: calculated during query
Knowledge Grid Internals

Data Pack Nodes (DPNs)
A separate DPN is created for every data pack in the database to store basic statistical information.

Character Maps (CMAPs)
Every Data Pack that contains text gets a matrix recording the occurrence of every possible ASCII character.

Histograms
A histogram with 1,024 MIN-MAX intervals is created for every Data Pack that contains numeric data.

Pack-to-Pack Nodes (PPNs)
PPNs track relationships between Data Packs when tables are joined, so query performance gets better as the database is used.

This metadata layer is roughly 1% of the compressed data volume.
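A rough Python sketch of the kind of per-pack metadata described above (the `numeric_dpn` and `cmap` helpers are illustrative names, not Infobright’s API):

```python
# Per-pack metadata: MIN/MAX statistics for a numeric pack (a Data Pack
# Node) and a character-occurrence map (CMAP) for a text pack. Both let
# the engine rule out packs without decompressing them.

def numeric_dpn(values):
    """Basic statistics stored for a numeric Data Pack."""
    return {"min": min(values), "max": max(values), "count": len(values)}

def cmap(strings):
    """Set of characters occurring anywhere in a text Data Pack."""
    return set().union(*(set(s) for s in strings))

salary_pack = [10000, 12000, 9000]
city_pack = ["Toronto", "Boston", "Ottawa"]

dpn = numeric_dpn(salary_pack)
chars = cmap(city_pack)

# A query WHERE salary > 50000 can skip this pack: its max is 12000.
print(dpn["max"] < 50000)  # True -> pack is completely irrelevant
# A query WHERE city LIKE '%Z%' can skip it too: no 'Z' in the CMAP.
print("Z" in chars)        # False
```

Real DPNs and CMAPs are compact binary structures kept in memory; the point is that a few bytes of statistics can exclude 64K values at a time.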
Optimizer / Granular Computing Engine
1. Query received (e.g., “How are my sales doing this year?”)
2. Engine iterates on the Knowledge Grid, which is about 1% the size of the compressed data
3. Each pass eliminates Data Packs
4. If any Data Packs are needed to resolve the query, only those are decompressed
How the Optimizer Works

SELECT count(*) FROM employees
WHERE salary > 50000
  AND age < 65
  AND job = 'Shipping'
  AND city = 'Toronto';

1. Find the Data Packs with salary > 50000
2. Find the Data Packs that contain age < 65
3. Find the Data Packs that have job = 'Shipping'
4. Find the Data Packs that have city = 'Toronto'
5. Now eliminate all packs that have been flagged as irrelevant
6. Finally, we have identified the one data pack that needs to be decompressed

[Diagram: for each column (salary, age, job, city), packs covering rows 1 to 65,536, 65,537 to 131,072, 131,073 to ..., each marked Completely Irrelevant, Suspect, or All values match; all ignored packs are skipped, and only the single suspect pack is decompressed]
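The Completely Irrelevant / Suspect / All-values-match classification can be sketched with MIN-MAX metadata alone (illustrative Python, not the actual engine):

```python
# Three-way pack classification for a predicate 'value > threshold',
# using only per-pack MIN/MAX metadata -- no decompression needed.

def classify(pack_min, pack_max, threshold):
    if pack_max <= threshold:
        return "irrelevant"   # no row can match: skip the pack entirely
    if pack_min > threshold:
        return "all match"    # every row matches: count it, don't decompress
    return "suspect"          # must decompress to check row by row

packs = [(10000, 42000), (55000, 90000), (48000, 62000)]  # (min, max) per pack
print([classify(lo, hi, 50000) for lo, hi in packs])
# ['irrelevant', 'all match', 'suspect']
```

Only “suspect” packs ever touch the decompressor, which is why a query over millions of rows may decompress a single 64K pack.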
Infobright Architected on MySQL
“The world’s most popular open source database”
Sample Script (Create Table, Import, Export)

USE Northwind;

DROP TABLE IF EXISTS customers;
CREATE TABLE customers (
  CustomerID varchar(5),
  CompanyName varchar(40),
  ContactName varchar(30),
  ContactTitle varchar(30),
  Address varchar(60),
  City varchar(15),
  Region char(15),
  PostalCode char(10),
  Country char(15),
  Phone char(24),
  Fax varchar(24),
  CreditCard float(17,1),
  FederalTaxes decimal(4,2)
) ENGINE=BRIGHTHOUSE;

-- Import the text file.
SET AUTOCOMMIT=0;
SET @bh_dataformat = 'txt_variable';
LOAD DATA INFILE "/tmp/Input/customers.txt"
  INTO TABLE customers
  FIELDS TERMINATED BY ';' ENCLOSED BY 'NULL'
  LINES TERMINATED BY '\r\n';
COMMIT;

-- Export the data in BINARY format.
SET @bh_dataformat = 'binary';
SELECT * INTO OUTFILE "/tmp/output/customers.dat"
  FROM customers;

-- Export the data in TEXT format.
SET @bh_dataformat = 'txt_variable';
SELECT * INTO OUTFILE "/tmp/output/customers.text"
  FIELDS TERMINATED BY ';' ENCLOSED BY 'NULL'
  LINES TERMINATED BY '\r\n'
  FROM customers;
Infobright 4.0 – Additional Features
Built-in intelligence for machine-generated data: find the “needle in the haystack” faster.

DomainExpert: intelligence about machine-generated data drives faster performance and near-real-time, ad-hoc analysis of Big Data.
 Enhanced Knowledge Grid with domain intelligence
 DLP: linear scalability of data load for very high performance
 Automatically optimizes the database; no fine tuning
 Rough Query: data-mining “drill down” at RAM speed
 Users can directly add domain expertise to drive faster performance
Work with Data Even Faster

DomainExpert: intelligence to automatically optimize the database.

DomainExpert: Breakthrough Analytics
 Enables users to add intelligence into the Knowledge Grid directly, with no schema changes
 Pre-defined/optimized for web data analysis:
   IP addresses
   Email addresses
   URL/URI
 Can cut query time in half when using this data definition
DomainExpert: Prebuilt plus DIY options
 Pattern recognition enables faster queries
   Patterns are defined and stored
   Complex fields are decomposed into more homogeneous parts
   The database uses this information when processing a query
 Users can also easily add their own data patterns
   Identify strings, numerics, or constants
   Financial trading example: the ticker feed “AAPL-350,354,347,349” encoded as “%s-%d,%d,%d,%d”
   Enables higher compression
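As a sketch of the decomposition idea (the `decompose` helper is hypothetical, not the DomainExpert API), the ticker feed can be split per the pattern so each part is stored as a homogeneous sub-column:

```python
# Decompose a ticker-feed string per the pattern "%s-%d,%d,%d,%d":
# the symbol becomes one homogeneous string part and each price becomes
# a numeric part, so all parts compress and filter better than the
# original mixed string.
import re

PATTERN = re.compile(r"^(\w+)-(\d+),(\d+),(\d+),(\d+)$")

def decompose(feed):
    m = PATTERN.match(feed)
    if m is None:
        raise ValueError("feed does not match %s-%d,%d,%d,%d")
    ticker, *nums = m.groups()
    return ticker, [int(n) for n in nums]

print(decompose("AAPL-350,354,347,349"))
# ('AAPL', [350, 354, 347, 349])
```

Once decomposed, the numeric parts get numeric codecs and MIN-MAX statistics instead of being treated as opaque text.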
Get Data In Faster: DLP

Near-real-time, ad-hoc analysis: linear scalability of data load for very high performance.

Distributed Load Processor (DLP)
 Add-on product to IEE that linearly scales load performance
 Remote servers compress data and build Knowledge Grid elements on the fly, then append them to the data server running the main Infobright database
 It’s all about speed: faster loads and queries
Get Data In Faster: Hadoop

Near-real-time, ad-hoc analysis: Hadoop connectivity; use the right tool for the job.

Big Data – Hadoop Support
 DLP Hadoop connector extracts data from HDFS and loads it into Infobright at high speed
 Load 100s of TBs or petabytes into Hadoop for bulk storage and batch processing
 Then load TBs into Infobright for near-real-time analytics using the Hadoop connector and DLP

Infobright / Hadoop: a perfect complement for analyzing Big Data
Rough Query: Speed Up Data Mining by 20x

Near-real-time, ad-hoc analysis: data-mining “drill down” at RAM speed.

Rough Query – Another Infobright Breakthrough
 Enables very fast iterative queries to quickly drill down into large volumes of data
 “Select roughly” to instantaneously see the interval range for relevant data
   Uses only the in-memory Knowledge Grid information
   Filtering can narrow results
 Need more detail? Drill down further with another rough query, or query for the exact answer
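A rough query can be approximated conceptually from pack metadata alone; this Python sketch (illustrative, not Infobright’s SELECT syntax) bounds a column maximum without decompressing anything:

```python
# Answer a "rough" question from in-memory pack statistics only.
# The exact answer is guaranteed to lie inside the returned interval.

pack_stats = [  # (min, max) per Data Pack of one column
    (3, 250), (180, 4100), (90, 775),
]

def rough_max(stats):
    """Interval guaranteed to contain the true column maximum.

    The true max is at least the largest pack minimum (that pack holds a
    value >= its min) and at most the largest pack maximum.
    """
    return max(lo for lo, _ in stats), max(hi for _, hi in stats)

print(rough_max(pack_stats))  # (180, 4100)
```

Because only RAM-resident statistics are read, such answers come back at memory speed, and the interval can be tightened by adding filters before ever paying for decompression.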
The Value Infobright Delivers
High performance with much less work and lower cost:

Faster queries without extra work
 No indexes
 No projections or cubes
 No data partitioning
 Faster ad-hoc analytics

Fast load / high compression
 Multi-machine Distributed Load Processor
 Query while loading (DLP)
 10:1 to 40:1+ compression

Lower costs
 Less storage and fewer servers
 Low-cost hardware
 Low-cost subscriptions
 90% less administration

Faster time to production
 Download in minutes
 Minimal configuration
 Implement in days
Q&A
Infobright Use Cases
Infobright and Hadoop in Video Advertising: LiveRail

LiveRail’s Need
 LiveRail’s platform enables publishers, advertisers, ad networks and media groups to manage, target, display and track advertising in online video.
 With a growing number of customers, LiveRail was faced with managing increasingly large data volumes.
 They also needed to provide near-real-time access to their customers for reporting and ad hoc analysis.

Infobright’s Solution
 LiveRail chose two complementary technologies to manage hundreds of millions of rows of data each day: Apache Hadoop and Infobright.
 Detail is loaded hourly into Hadoop and at the same time summarized and loaded into Infobright.
 Customers access Infobright 7x24 for ad-hoc reporting and analysis, and can schedule time if needed to access cookie-level data stored in Hadoop.

“Infobright and Hadoop are complementary technologies that help us manage large amounts of data while meeting diverse customers’ needs to analyze the performance of video advertising investments.”
Andrei Dunca, CTO of LiveRail
Example in Mobile Analytics: Bango

Bango’s Need
 A leader in mobile billing and analytics services utilizing a SaaS model
 450GB per month on SQL Server; SQL Server could not support the required query performance
 Received a contract with a large media provider: 150 million rows per month
 Needed a database that could
   scale for much larger data sets
   with fast query response
   with fast implementation
   and low maintenance
   in a cost-effective solution

Infobright’s Solution
 Reduced queries from minutes to seconds:

  Query                          SQL Server   Infobright
  1 Month Report (5MM events)    11 min       10 secs
  1 Month Report (15MM events)   43 min       23 secs
  Complex Filter (10MM events)   29 min       8 secs

 Reduced the size of one customer’s database from 450 GB to 10 GB for one month of data
Online Analytics: Yahoo!

Customer’s Need
 Pricing and Yield Management team responsible for pricing online display ads
 Requires sophisticated analysis of terabytes of ad impression data
 With the prior database, could only store 30 days of summary data
 Needed a database that could:
   Store 6+ months of detailed data
   Reduce the hardware needed
   Eliminate database admin work
   Execute ad-hoc queries much faster

Infobright’s Solution
 Loading over 30 million records per day
 Can now store all detailed data; retains 6 billion records
 6TB of data is compressed to 600GB on disk
 Queries are very fast; Yahoo! can do ad-hoc analysis without manual tuning
 Easy to maintain and support

“Using Infobright allows us to do pricing analyses that would not have been possible before. We now have access to all of our detailed Web impression data, and we can keep 6x the amount of data history we could previously.”
Sr. Director PYM, Yahoo!
Case Study: JDSU
 Annual revenues exceeded $1.3B in 2010
 4,700 employees based in over 80 locations worldwide
 The communications sector offers instruments, systems, software, services, and integrated solutions that help communications service providers, equipment manufacturers, and major communications users maintain their competitive advantage

JDSU Service Assurance Solutions
 Ensure high quality of experience (QoE) for wireless voice, data, messaging, and billing
 Used by many of the world’s largest network operators

JDSU Project Goals
 A new version of the Session Trace solution that would:
   Support very fast load speeds to keep up with increasing call volume and the need for near-real-time data access
   Reduce the amount of storage by 5x, while also keeping a much longer data history
   Reduce overall database licensing costs by 3x
   Eliminate customers’ “DBA tax,” meaning it should require zero maintenance or tuning while enabling flexible analysis
   Continue delivering the fast query response needed by Network Operations Center (NOC) personnel when troubleshooting issues, supporting up to 200 simultaneous users

High-Level View

Session Trace Application: for deployment at Tier 1 network operators, each site will store between 6 and 45TB of data, and the total data volume will range from 700TB to 1PB.
Infobright Implementation
What Our Customers Say

“Using Infobright allows us to do pricing analyses that would not have been possible before.”

“With Infobright, [this customer] has access to data within minutes of transactions occurring, and can run ad-hoc queries with amazing performance.”

“Infobright offered the only solution that could handle our current data load and scale to accommodate a projected growth rate of 70 percent, without incurring prohibitive hardware and licensing costs.”

“Using Infobright allowed JDSU to meet the aggressive goals we set for our new product release: reducing storage and increasing data history retention by 5x, significantly reducing costs, and meeting the fast data load rate and query performance needed by the world’s largest network operators.”
NoSQL: Unstructured Data Kings

Tame the unstructured: store anything, keep everything.
 Schema-less designs
 Extreme transaction rates
 Massive horizontal scaling
 Heavy data redundancy
 Niche players

Top NoSQL Offerings – NoSQL Breakout
 Key-Value
 Document Store
 Column Store
 Graph
 Hybrid
120+ variants: find more at nosql-databases.org
What do we see with NoSQL

Strengths
 Application focused
 Programmatic API
 Capacity
 Lookup speed
 Streaming data

Weaknesses
 Generally no SQL interface
 Programmatic interfaces
 Expensive infrastructure
 Complex
 Limits with analytics
Lest We Forget Hadoop

A scalable, fault-tolerant distributed system for data storage and processing:
 Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
 MapReduce: fault-tolerant distributed processing

Value Add
 Flexible: store schema-less data and add as needed
 Affordable: low cost per terabyte
 Broadly adopted: Apache project with a large, active ecosystem
 Proven at scale: petabyte+ implementations in production today

Hadoop Data Extraction
NewSQL: Operational, Relational Powerhouses

Overclock relational performance: scale out, scale “smart.”
 New, scalable SQL
 Extreme transaction rates
 Diverse technologies
 ACID compliance