Transcript Slide 1
Innovations in Database Technology
IRMAC BI/DW SIG
May 28, 2009
Agenda
About Infobright
Data Warehousing Challenge
Use Cases
Infobright Approach
Infobright Architecture
Infobright Versions & System Requirements
2
About Infobright
3
About Infobright
Founded
2006
Headquarters
Toronto, Canada; offices in Boston, MA and Warsaw, Poland
The Infobright Data
Warehouse
Simplicity: No new schemas, no indices, no data partitioning, easy to maintain
Scalability: Designed for rapidly growing volumes. Ideal for up to 30 TB
Low TCO: Industry-leading compression, less storage, industry standard servers,
low software costs, minimal ongoing operational expenses
The Open Source
Solution
Community (open source) and Enterprise Editions are available
Leverages MySQL connectivity to ETL and BI
MySQL Integration
Provides MySQL customers with scalable, enterprise-ready data warehouse
MySQL/SUN Microsystems invests in Infobright Sept 15, 2008
4
Data Warehousing Challenge
5
.
Data Warehousing Challenges
Limited Resources
and Budget
More Data,
More Data Sources
More Kinds of Output
Needed by More Users,
More Quickly
010101010101010101010101010
01010101010101010101010101
Real time data
01 1
010101010101010101010
0101010101010101010101010
Multiple databases
1010
01010101010101010101
0101010101010101010
External Sources 01
101
1
10
1
1
1
010
1
0
Labor intensive, heavy
66
Traditional Data
Warehousing
indexing and partitioning
Hardware intensive:
massive storage; big servers
Expensive and complex
Data Warehousing – Raising The Bar
New Demands:
Larger transaction volumes driven by the internet
Impact of Cloud Computing
More -> Faster -> Cheaper
Data Warehousing Matures:
Near real time updates
Integration with master data management
Data mining using discrete business transactions
Provision of data for business critical applications
Early Data Warehouse Characteristics:
Integration of internal systems
Monthly and weekly loads
Heavy use of aggregates
7
Use Cases
8
Use Cases
Infobright is a good fit for;
•
•
•
•
Loading millions of transactions with a limited batch window
Summarizing transactional data for trend analysis
Extracting transactional detail based on specific constraints
Ad hoc query support across many dimensional attributes
Avoid using Infobright for;
• Real-time transactional updates (operational data entry)
• Full data extracts (select * from …)
• Row based operations that need to access all columns of
a table are typically better suited to row based databases
9
Customer Experience – Load Speed
Business Requirement
• Mavenir - OEM customer deploying a world wide telco application
• Application provides operators with access to detailed SMS traffic
• Needed a low cost solution with the ability to load 20K records
per second
• Peak of 70M messages per hour during Chinese New year
Solution
• Custom front end developed using MySQL JDBC driver
• Completed design, test, deployment in < 3 months with no assistance from Infobright
• Allowed for expansion from 7 to 90 days of online SMS history
• Supports plan for 70% annual growth
• Rollout to allow for 120 concurrent users
10
Customer Experience – Query Performance
Business Requirement
• Sulake - Online Social Networking service with 126M users across
31 countries
• 990M page impressions per month
• Need to quickly analyze online spend on a daily basis to enhance
online experience and drive additional revenue
• Existing InnoDB solution was able to process business queries in a
reasonable time frame (queries taking hours to complete)
Solution
• Business opportunities were being lost due to inability to analyze
subscriber behavior using transactions
• Customer used existing data model and deployed the application using
Business Objects – Data Integrator for ETL, Web-Intelligence for BI
• Existing ETL workflows were converted to Infobright in less than 4 weeks
without assistance
• Historically long running queries (hours) now running in minutes and seconds
• Additional benefits due to compression were a reduced need for disk storage
and an overall reduction in I/O and network traffic
11
Customer Experience - TCO
Business Requirement
• A global provider of electronic trading solutions across 22 time
zones and 700 financial exchanges
• Wanted to expand analytical access to financial transactions to
include both current (30 days) and archived transactions (4 years)
• Expansion of existing Sybase solution was too costly
Solution
• Infobright was able to achieve performance benchmarks within the first 3 days of a
proof of concept using production data
• 28,000 records per second load speed
• Join 100M row with a 30Mrow table -> 400k rows, returned in 185 seconds
• Additional queries that did not complete using Sybase, finished in minutes using
Infobright
• Final solution deployed using Pentaho Kettle for ETL and Crystal Reports for BI
• Success with modest data size (150GB) has opened opportunities for additional more
detailed transactional analysis
12
Customer Experience – Query Performance and TCO
Business Requirement
• TradeDoubler – Based in Sweden, a global digital marketing
company, serving 1600+ online advertisers across Europe and Asia.
• TradeDoubler optimizes Web marketing campaigns by analyzing
Web clicks, impressions and purchases.
• Analyzing terabytes of data about the results of its programs is
central to the company’s success.
• Selected Infobright to produce analytical results rapidly, seamless
interoperability with their MySQL database and low TCO
Solution
• Deployed solution using a single, $12,500 Dell server with 8 CPU cores and 16 GB
RAM
• Used Pentaho Kettle for ETL and Jaspersoft Server Pro Reports for BI
• Needed to process and analyze data 20 billion online transactions/month
• In POC, loaded > 3.2 billion rows at > 300,000 rows / second
• In production, achieved 30x data compression
13
• Extremely fast query speed. 3 queries that previously did not return, now returned
within a minute
Infobright Approach
14
Introducing Infobright
Column Orientation
Knowledge Grid – statistics and
metadata “describing” the supercompressed data
Data Packs – data stored
in manageably sized,
highly compressed data
packs
Data compressed using
algorithms tailored to
data type
151
5
Smarter architecture
Load data and go
No indices or partitions
to build and maintain
Knowledge Grid
automatically updated as
data packs are created or
updated
Super-compact data footprint can leverage off-theshelf hardware
Column vs. Row-Oriented
EMP_ID
1
2
3
FNAME
Moe
Curly
Larry
LNAME
Howard
Joe
Fine
SALARY
10000
12000
9000
Row Oriented (1,Moe,Howard,10000; 2,Curly, Joe,12000; 3,Larry,Fine,9000;)
Works well if all the columns are needed for
every query.
Efficient for transactional processing if all the
data for the row is available
Column Oriented (1,2,3; Moe,Curly,Larry; Howard,Joe,Fine; 10000,12000,9000;)
Works well with aggregate results (sum, count, avg.
)
Only columns that are relevant need to be touched
Consistent performance with any database design
Allows for very efficient compression
16
Data Packs and Compression
Data Packs
64K
Each data pack contains 65, 536 data values
Compression is applied to each individual data pack
The compression algorithm varies depending on data type
64K
and data distribution
Compression
64K
64K
17
Results vary depending on the
Patent Pending
Compression
Algorithms
distribution of data among data packs
A typical overall compression ratio
seen in the field is 10:1
Some customers have seen results
have been as high as 40:1
Knowledge Grid
Data Pack Nodes (DPN)
A separate DPN is created for every data pack created in
the database to store basic statistical information
Character Maps (CMAPs)
Every Data Pack that contains text creates a matrix that
records the occurrence of every possible ASCII
character
Histograms
Histograms are created for every Data Pack that
contains numeric data and creates 1024 MIN-MAX
intervals.
Pack-to-Pack Nodes (PPN)
PPNs track relationships between Data Packs when
tables are joined. Query performance gets better as the
database is used.
This metadata layer = 1% of the compressed volume
18
A Simple Query using the Knowledge Grid
SELECT count(*) FROM employees
WHERE salary > 50000
AND age < 65
AND job = ‘Shipping’
AND city = ‘TORONTO’;
1.
Find the Data Packs with salary > 50000
2.
Find the Data Packs that contain age < 65
3.
Find the Data Packs that have job =
‘Shipping’
4.
salary
age
job
All packs
ignored
Rows 1 to
65,536
65,537 to
131,072
All packs
ignored
131,073 to
……
Find the Data Packs that have City =
“Toronto’
5.
Now we eliminate all rows that have been
flagged as irrelevant.
6.
Finally we have identified the data pack that
needs to be decompressed
All packs
ignored
Only this pack will
be decompressed
Completely Irrelevant
Suspect
All values match
19
city
A Join Query using the Knowledge Grid
Car Sales
id
sale
Sales Person
discount prov
id
date
Pack-to-Pack
carsales_id vs salesperson_id
salesperson.id
0
1
1
1
1.
Eliminate the Car Sales Data Packs that are
irrelevant based on constraints in the SQL
2.
Determine the related Sales Person Data
Packs based on the values of carsales_id
found in the relevant Car Sales Data Packs.
3.
Create a Pack-to-Pack node that stores the
results of the join condition between Car
Sales and Sales Person.
4.
Any subsequent queries will be able to use
the PPN to resolve joins between Car Sales
and Sales Person
0
0
carsales.id
20
SELECT MIN(sale), MAX(discount), name
FROM carsales, salesperson
WHERE carsales.id = salesperson.id
AND carsales.prov = ‘ON’
AND carsales.date = ‘2008-02-29’
GROUP BY name;
name
Indicates that the
Data Packs are
related
Infobright Architecture
21
Infobright – Embedded With MySQL
MySQL/Infobright Architecture
CONNECTORS: Native C API, JDBC, ODBC, .NET, PHP, Python, Perl, Ruby, VB
MyISAM
•Views
•Users
•Permissions
•Tables Defs
Caches & Buffers
Parser
MySQL Loader
Management
Services &
Utilities
SQL Interface
CONNECTION POOL: Authentication, Thread Reuse,
Connection Limits, Check Memory, Caches
Infobright
Loader / Unloader
Infobright Optimizer
and Executor
My SQL Optimizer
Knowledge
Knowledge Grid
Grid
Knowledge
Knowledge Knowledge
Knowledge Knowledge
Knowledge Knowledge
Knowledge Knowledge
Knowledge Knowledge
Knowledge Knowledge
Knowledge
Node
Node
Node
Node
Node
Node
Node
Node
Node
Node
Node
Node
Node
Node
Data
Pack
Data
Pack
Data
Pack
Data
Pack
Data
Pack
Data
Pack
Data
Data Pack
Data Pack Data Pack
Data Pack Data Pack
Data Pack Data Pack
Pack
Node
Node
Node
Node
Node
Node
Node
Node
Node
Node
Node
Node
Node
Node
Infobright ships with the full
MySQL binaries. The MySQL
architecture is used to support
database components such as
connectors, security and
memory management.
Compressor
Compressor // Decompressor
Decompressor
Data
Data Pack
Pack
Data
Data Pack
Pack
Data
Data Pack
Pack
Data
Data Pack
Pack
Data
Data Pack
Pack
Data
Data Pack
Pack
Data
Data Pack
Pack
Infobright Components
•IB Storage Engine consisting of 64Kb Data Packs, Compressor, and the Knowledge Grid
• IB Optimizer that uses rough set algorithms and the knowledge grid to navigate the database
• IB Loader supports text based and binary data formats
22
Optimized SQL for Infobright
MySQL
The Infobright Optimizer supports a large amount
of MySQL syntax and functions. When the
optimizer encounters SQL syntax that is not
supported, then the query is executed using the
MySQL optimizer.
Infobright Optimized SQL
•Select Statements
•Comparison Operators
•Logical Operators
•String Comparison Functions (LIKE, ..)
•Aggregate Functions
•Arithmetic Operators
•Data Manipulation Language (I/U/D)
23
• Data Definition Language (CREATE & DROP)
• String Functions
• Date/Time Functions
• Numeric Functions
• Trigonometric Functions
• Case Statements
String
Numeric
Date
Infobright Data Types
24
Most of the data types expected for a MySQL
database engine are fully supported. The
data types that are currently not
implemented within Infobright include BLOB,
ENUM, SET and Auto Increment.
ETL Integration
Leverage existing IT tools and resources for
fast, simple deployments and low TCO
Increased efficiency with popular platforms
Deeper ETL Integration
Jaspersoft, Talend, Pentaho
Leverages end-to-end data management
provided by ETL tools
Improved support for Data Manipulation
Language (DML)
25
Data Loading with & without custom ETL connectors
Loading Infobright tables with custom connectors:
Kettle from Pentaho
Talend ETL from Talend
Jaspersoft ETL (Talend) from Jaspersoft
Two ways to invoke Infobright loader without connectors
1. Generate a CSV or binary file and invoke the Infobright loader to load the file
2. Named pipe technique:
Create a named pipe (i.e. mkfifo /home/mysql/s_mysession1.pipe)
Launch the Infobright loader in the background to read from the pipe
Launch the ETL process that writes data to the named pipe
When the ETL process runs, as records are written to the named pipe,
the loader reads them and writes them to an Infobright database table
26
Infobright Versions & System Requirements
27
Comparison of ICE and IEE
Features
Forums and/or
one-time 4-hr support pack
Available
Warranty and
Indemnification
No
Included
INSERT/UPDATE/DELETE
No
Supported
Infobright Loader
Up to 50 GB/hr
Multi-threaded, Up to 300 GB/hr
Data Load Types
Text only
Text & Binary
(100% faster)
No
Supported
64-bit Intel and AMD
RHEL 5, CentOS 5, Debian
32-bit Intel and AMD for Windows
XP, Ubuntu 8.04, Fedora 9
64-bit Intel and AMD
Windows Server 2003,
Windows Server 2008, RHEL 5,
CentOS 5, Debian, Solaris 10
Technical Support
MySQL Loader
Platform Support
28
System Requirements
29
For More Information
Data Warehouse Evangelist
Bob Newell
[email protected]
Or join our open source community at
www.infobright.org
Thank you
30
Query performance
Infobright
# Query
Query name
Traditional DB
Intervall
No cache
Cache
No cache
Cache
1
Affiliate/minor/sum(order)/year
20060101-20061231
7,72
0,99
13,00,91
4,03,21
2
Affiliate/major/sum(order)/year
20060101-20061231
31,52
7,81
N/A
N/A
1
Affiliate/minor/sum(order)/month
20060101-20060131
1,32
0,43
1,00,43
10,69
2
Affiliate/major/sum(order)/month
20060101-20060131
3,23
0,65
2,12,34
18,55
3
Events/Cat=2/Country/sum(no of)/year
20060101-20061231
37,16
24,42
N/A
N/A
4
Events/Cat=*/Country/sum(no of)/year
20060101-20061231
41,67
29,62
N/A
N/A
3
Events/Cat=2/Country/sum(no of)/month
20060101-20060131
15,16
7,15
8,08,13
2,10,15
4
Events/Cat=*/Country/sum(no of)/month
20060101-20060131
22,12
8,01
15,08,32
3,12,82
Time in minutes, seconds, milliseconds
313
1