Big Data - CERN Indico

Oracle’s Big Data solutions
Jean-Philippe Breysse
Oracle Suisse
The following is intended to outline our general product
direction. It is intended for information purposes only, and
may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality,
and should not be relied upon in making purchasing
decisions.
The development, release, and timing of any features or
functionality described for Oracle’s products remain at the
sole discretion of Oracle.
USE CASE 3: LOGS ANALYSIS OF SERVERS
• Short Description:
– Daily logs analysis
• Issues:
– Find correlations that explain what leads to failures
– Log files stored as flat files
Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
Oracle Technology mapped to Analytics Landscape
[Diagram: Oracle products placed across the phases Acquire, Organize, Analyze, Decide and the data types Unstructured, Semi-structured, Structured]
• Data shown: Master & Reference; Transactions; Machine Generated; Files; Text, Image, Video, Audio
• Products shown: Oracle 12g, Oracle NoSQL, Oracle Hadoop HDFS, Oracle Hadoop MapReduce, Oracle Data Integrator, Oracle GoldenGate, Oracle TimesTen, Endeca MDEX, Oracle R Enterprise & Oracle Data Mining, Oracle Essbase, Oracle BI Enterprise, Oracle Real Time Decisions, Oracle Endeca Information Discovery
Agenda
• Big Data
• Solution Spectrum
• Inside the Big Data Appliance
• Big Data Applications Software
• Big Data Analytics
• Conclusions
Big Data
Why Everyone Should Care
Tapping into Diverse Data Sets
• Information architectures today: decisions based on database data
• Big Data: decisions based on all your data
• Data sets shown: Transactions, Documents, Social Data, Machine-Generated Data, Video and Images
A bit of history ...
• Hadoop: developed initially by Doug Cutting (Nutch open-source web search engine) and Yahoo, inspired by Google’s papers on MapReduce and GFS (2003-2004); resulted in Apache Hadoop (2006)
• Amazon Dynamo (2007): distributed systems technologies
• Cassandra: developed at Facebook (2008) to power their Inbox Search feature; a column-oriented distributed DB based initially on Dynamo and Bigtable (built by Google)
• Voldemort: a distributed key-value data store used by LinkedIn for high-scalability storage (NoSQL key-value)
• Cloudera: contributes to Hadoop and related Apache projects and provides a commercial distribution of Hadoop
So What is Big Data Anyway?
It’s a matter of perspective. Big Data is both:
• LARGE AND VARIABLE DATASETS that are difficult for traditional
database tools to manage easily – including datasets that once
seemed unimportant or too problematic to deal with. Big Data
datasets include:
− Extremely large files of unstructured or semi-structured data
− Large and highly distributed datasets that are otherwise difficult
to manage as a single unit of information
• A NEW SET OF TECHNOLOGIES that can economically capture, store,
manage, and extract value from Big Data datasets – thus facilitating
better, more informed business decisions
Structured Data vs. Unstructured Data
Relational databases work best with structured data
– data that has an underlying structure (schema) and a size
that easily fits the specific confines of database columns
and rows. Unstructured data is highly variable, lacks fixed
structure, and is often too large for RDBMS systems to
handle easily.
Source: IDC Digital Universe Study, Extracting Value from Chaos, June 2011 (sponsored by EMC)
Drive Value from Big Data
Building a Big Data Platform
Divided Solution Spectrum
[Diagram: data variety (unstructured, schema-less to schema) against the phases Acquire, Organize, Analyze]
• NoSQL side (flexible, specialized, developer centric): Distributed File Systems, Transaction (Key-Value) Stores, MapReduce Solutions
• SQL side (trusted, secure, administered): DBMS (OLTP), ETL, DBMS (DW), Advanced Analytics
Hadoop to Oracle – Bridging the Gap
[Diagram: data variety (unstructured, schema-less to schema) against the phases Acquire, Organize, Analyze]
• Schema-less side: HDFS, Cassandra, Hadoop MapReduce
• Bridge: Oracle Loader for Hadoop
• Schema side: RDBMS (OLTP), ETL, RDBMS (DW), Advanced Analytics
Oracle Integrated Software Solution
[Diagram: data variety (unstructured, schema-less to schema) against the phases Acquire, Organize, Analyze]
• Schema-less side: HDFS, Oracle NoSQL DB, Hadoop
• Bridge: Oracle Loader for Hadoop, Oracle Data Integrator
• Schema side: Oracle (OLTP), Oracle (DW)
• Oracle Analytics: Data Mining, R, Spatial, Graph, MapReduce, OBI EE
Inside the Big Data Appliance
Overview
Oracle Engineered Solutions
[Diagram: data variety (unstructured to schema) and information density against the phases Acquire, Organize, Analyze]
Big Data Appliance
• Hadoop
• NoSQL Database
• Oracle Loader for Hadoop
• Oracle Data Integrator
Oracle Exadata
• OLTP & DW
• Data Mining & Oracle R
• Semantics
• Spatial
Exalytics
• Speed of Thought Analytics
Also shown: HDFS, Oracle NoSQL DB, Oracle Database (OLTP), Oracle Database (DW), In-DB Analytics (“R”, Mining, Text, Graph, Spatial), Oracle BI EE
Big Data Appliance
Usage Model
[Diagram: data streams into Oracle Big Data Appliance, which is linked over InfiniBand to Oracle Exadata and Oracle Exalytics. Phases shown: Acquire, Organize, Analyze & Visualize]
Oracle Big Data Appliance Hardware
• 18 Sun X4270 M2 Servers
– 48 GB memory per node = 864 GB memory
– 12 Intel cores per node = 216 cores
– 24 TB storage per node = 432 TB storage
• 40 Gb/s InfiniBand
• 10 Gb/s Ethernet
Scale Out to Infinity
Scale out by connecting racks
to each other using InfiniBand
• Expand up to eight racks without
additional switches
• Scale beyond eight racks by adding
an additional switch
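The per-rack totals on the hardware slide follow directly from the per-node figures, and the same arithmetic extends to several racks. The sketch below is only an illustration: the names `RackSpec` and `cluster_totals` are hypothetical, the totals are raw figures before any replication or system overhead, and linear scaling across racks is an assumption rather than a published sizing rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RackSpec:
    """Per-rack figures taken from the hardware slide."""
    nodes: int = 18
    memory_gb_per_node: int = 48
    cores_per_node: int = 12
    storage_tb_per_node: int = 24


def cluster_totals(racks: int, spec: RackSpec = RackSpec()) -> dict:
    """Return raw aggregate capacity for a number of identical racks.

    Assumes capacity scales linearly with the rack count.
    """
    if racks < 1:
        raise ValueError("racks must be at least 1")
    nodes = racks * spec.nodes
    return {
        "nodes": nodes,
        "memory_gb": nodes * spec.memory_gb_per_node,
        "cores": nodes * spec.cores_per_node,
        "raw_storage_tb": nodes * spec.storage_tb_per_node,
        # The slide states that more than eight racks need an extra switch.
        "needs_extra_switch": racks > 8,
    }
```

Calling `cluster_totals(1)` reproduces the single-rack numbers on the slide (864 GB memory, 216 cores, 432 TB storage), and `cluster_totals(8)` gives the raw totals for the largest configuration that needs no additional switch.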
Oracle Big Data Appliance Software
• Oracle Enterprise Linux 5.6
• Oracle HotSpot Java VM
• Cloudera’s Distribution
including Apache Hadoop
• Cloudera Manager
• Open Source Distribution of R
• Oracle NoSQL Database
Community Edition
Big Data Application Software
Acquire New Information
Key-Value Store Workloads
• Large dynamic schema based data repositories
• Data capture
– Web applications (click-through capture)
– Online retail
– Sensor/statistics/network capture (factory automation for example)
– Backup services for mobile devices
• Data services
– Scalable authentication
– Real-time communication (MMS, SMS, routing)
– Personalization
– Social Networks
Oracle NoSQL DB
A distributed, scalable key-value database
• Simple Data Model
– Key-value pair with major+sub-key paradigm
– Read/insert/update/delete operations
• Scalability
– Dynamic data partitioning and distribution
– Optimized data access via intelligent driver
• High availability
– One or more replicas
– Disaster recovery through location of replicas
– Resilient to partition master failures
– No single point of failure
• Transparent load balancing
– Reads from master or replicas
– Driver is network topology & latency aware
• Elastic (Planned for Release 2)
– Online addition/removal of Storage Nodes
– Automatic data redistribution
[Diagram: Applications, each with a NoSQLDB Driver, connected to Storage Nodes in Data Center A and Data Center B]
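To make the major+sub-key data model and hash-based partitioning more concrete, here is a minimal in-memory sketch. It is not the Oracle NoSQL Database API: the class `TinyKVStore` and its methods are hypothetical, replication and the network driver are left out, and hashing only the major key (so that records sharing a major key land in the same partition) is an assumption of this sketch.

```python
import hashlib
from collections import defaultdict


class TinyKVStore:
    """Toy key-value store: a key is a (major, minor) pair of string tuples."""

    def __init__(self, num_partitions: int = 4):
        self.num_partitions = num_partitions
        # One dict per partition: major key -> {minor key -> value}
        self.partitions = [defaultdict(dict) for _ in range(num_partitions)]

    def partition_for(self, major: tuple) -> int:
        """Hash only the major key, so related records are co-located."""
        digest = hashlib.sha256("/".join(major).encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % self.num_partitions

    def put(self, major: tuple, minor: tuple, value: bytes) -> None:
        """Insert or update one record."""
        self.partitions[self.partition_for(major)][major][minor] = value

    def get(self, major: tuple, minor: tuple):
        """Return the value for a full key, or None if it is absent."""
        records = self.partitions[self.partition_for(major)].get(major, {})
        return records.get(minor)

    def multi_get(self, major: tuple) -> dict:
        """Return all records that share a major key."""
        records = self.partitions[self.partition_for(major)].get(major, {})
        return dict(records)

    def delete(self, major: tuple, minor: tuple) -> bool:
        """Remove one record; return True if it existed."""
        records = self.partitions[self.partition_for(major)].get(major, {})
        return records.pop(minor, None) is not None
```

With this layout, a key such as major `("user", "42")` and minor `("profile",)` or `("inbox", "1")` keeps everything about one user in a single partition, so `multi_get(("user", "42"))` touches one partition only.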
Big Data Application Software
Organizing Data for Analysis
Oracle Loader for Hadoop Features
• Load data into a partitioned or non-partitioned table
– Single level, composite or interval partitioned table
– Support for scalar datatypes of Oracle Database
– Load into Oracle Database 11g Release 2
• Runs as a Hadoop job and supports standard options
• Pre-partitions and sorts data on Hadoop
• Online and offline load modes
Oracle Loader for Hadoop
[Diagram: INPUT 1 and INPUT 2 pass through chains of MAP, SHUFFLE/SORT, and REDUCE stages; ORACLE LOADER FOR HADOOP is itself one MAP, SHUFFLE/SORT, REDUCE stage in the chain]
Oracle Loader for Hadoop: Online Option
• Read target table metadata from the database
• Perform partitioning, sorting, and data conversion
• Connect to the database from reducer nodes, load into database partitions in parallel
[Diagram: ORACLE LOADER FOR HADOOP as a MAP, SHUFFLE/SORT, REDUCE job]
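The partition-then-sort idea behind the loader can be sketched as a small map, shuffle/sort, and reduce pipeline. This is an illustration only and does not use Oracle Loader for Hadoop or Hadoop itself: the function names are hypothetical, the target table is assumed to be range partitioned with exclusive upper bounds on one key column, and the `load_partition` callback stands in for whatever a reducer would do with its partition (a database load, or writing a file).

```python
import bisect
from collections import defaultdict


def map_phase(rows, key, upper_bounds):
    """Tag each row with the index of the range partition it belongs to.

    upper_bounds is a sorted list of exclusive upper bounds; rows at or
    above the last bound fall into one final partition.
    """
    for row in rows:
        yield bisect.bisect_right(upper_bounds, row[key]), row


def shuffle_sort(tagged_rows, key):
    """Group rows by partition index and sort each group on the key."""
    groups = defaultdict(list)
    for partition_id, row in tagged_rows:
        groups[partition_id].append(row)
    return {
        partition_id: sorted(group, key=lambda r: r[key])
        for partition_id, group in sorted(groups.items())
    }


def reduce_phase(groups, load_partition):
    """Hand each pre-sorted partition to the loader callback."""
    for partition_id, rows in groups.items():
        load_partition(partition_id, rows)


def run_load(rows, key, upper_bounds, load_partition):
    """Run the three phases in sequence on an in-memory list of rows."""
    tagged = map_phase(rows, key, upper_bounds)
    reduce_phase(shuffle_sort(tagged, key), load_partition)
```

In a real cluster each reducer would run on its own node, so the per-partition calls would proceed in parallel; here they run one after another, which keeps the sketch easy to follow.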
Oracle Loader for Hadoop: Offline Option
• Read target table metadata from the database
• Perform partitioning, sorting, and data conversion
• Write from reducer nodes to Oracle Data Pump files
• Import into the database in parallel using external table mechanism
[Diagram: ORACLE LOADER FOR HADOOP as a MAP, SHUFFLE/SORT, REDUCE job]
Selecting an Output Option for a Use Case
Oracle Loader for Hadoop
Output Option | Use Case Characteristics
Online load with JDBC | The simplest use case for non-partitioned tables
Online load with Direct Path | Fast online load for partitioned tables
Offline load with Data Pump files | Fastest load method for external tables
On Oracle Big Data Appliance: Direct HDFS | Leave data on HDFS; parallel access from database; import into database when needed
Automate Usage of Oracle Loader for Hadoop
Oracle Data Integrator (ODI)
• ODI has knowledge modules to
– Generate data transformation code to run on Hive/Hadoop
– Invoke Oracle Loader for Hadoop
• Use the drag-and-drop interface in ODI to
– Include invocation of Oracle Loader for Hadoop in any ODI
packaged flow
Big Data Analytics
Real Time Analytics Platform
R Statistical Programming Language
• Open source language and environment
• Used for statistical computing and graphics
• Strength in easily producing publication-quality plots
• Highly extensible with open source community R packages
Drive Value from Big Data
Conclusions
Big Data Appliance
Big Data for the Enterprise
• Optimized and Complete
– Everything you need to store and integrate your lower information density data
• Integrated with Oracle Exadata
– Analyze all your data
• Easy to Deploy
– Risk Free, Quick Installation and Setup
• Single Vendor Support
– Full Oracle support for the entire system and software set
Oracle Integrated Solution Stack for Big Data
[Diagram: products arranged across the phases ACQUIRE, ORGANIZE, ANALYZE, DECIDE]
• ACQUIRE: HDFS, Oracle NoSQL Database, Enterprise Applications
• ORGANIZE: Hadoop (MapReduce), Oracle Loader for Hadoop, Oracle Data Integrator
• ANALYZE: Data Warehouse, In-Database Analytics
• DECIDE: Oracle Analytic Applications
Oracle: Big Data for the Enterprise
• The most comprehensive solution
– Includes everything needed to acquire, organize and analyze all your data
• Optimized for Extreme Analytics
– Deepest analytics portfolio with access to all data
• Engineered to Work Together
– Eliminate deployment risk and support risk
• Enterprise Ready
– Deliver extreme performance and scalability
Questions