IBM Research
Impliance:
an Information Management Appliance
Bishwaranjan Bhattacharjee
IBM Watson Research Center
Vuk Ercegovac, Joseph Glider, Richard Golding, Guy Lohman, Volker Markl,
Hamid Pirahesh, Jun Rao, Robert Rees, Frederick Reiss, Eugene Shekita, Garret Swart
Almaden Research Center
Impliance -- Information Management Appliance
© 2002 IBM Corporation
Agenda
• Motivation: Observations → Requirements
• What is Impliance?
• How is Impliance different from…?
• Research opportunities
• Conclusions
After all our successes (and last night’s revelry), it’s easy to become self-congratulatory.
Sorry, time for…
Some embarrassing questions:
• Why is most (>80%) of the world’s data still not in databases?
• Didn’t we “solve” this problem in the 1980s with object-relational systems?
• Do you use a database to store your data on your laptop?
• Why not? (You are a database bigot, aren’t you?)
• Have you ever tried to query (with SQL) a database that:
  – You didn’t create, and…
  – Had more than 500 tables?
• Just how easy is it to incrementally add DB capacity beyond 1 machine? 100 machines?
• Have “self-managing” databases significantly simplified administration?
Observation → Requirements (1 of 5)
Observation #1: Information converging
• Many types of data in today’s enterprise
  – Structured (traditional Data Base)
  – Semi-structured (traditional Content Management, XML)
  – Unstructured (text, multimedia)
• Each needs a different search interface, today
  – SQL
  – JSR-170
  – Keyword search / Information Retrieval
Requirement #1: Store / Search / Analyze all data
• Need to rapidly relate information of different types
• With one unified interface!
• Real use cases in paper
Observation → Requirements (2 of 5)
Observation #2: Awash in data, but not information
• Typical complaint: “I can’t find what I’m looking for!”
• But just finding data isn’t enough!
• Today’s Business Intelligence is too human-intensive
Requirement #2: Pro-actively derive useful information
• Need to glean more business value from enterprise data
• What sort of analytics exploit unstructured data?
• Need to automatically extract the semantics of text
• A rebirth of data mining?
Observation → Requirements (3 of 5)
Obs. #3: Total Cost of Ownership (TCO) is paramount
• People costs dominate TCO
  – Hardware often less than 50% of TCO
• Minimize Time To Value
  – Databases take too long to set up!
• Wizards & Advisors simply mask complexity, add brittleness
Reqmt. #3: System must be simple, robust, & secure
• Sacrifice resource utilization for radical simplification of:
  – Setup / Configuration / Deployment (e.g., Self-Organizing)
  – Operation
• KISS (you know this one) + KIWI – Kill It With Iron [Weikum]!
• Example: “Good enough” plans exploiting massive parallelism
Observation → Requirements (4 of 5)
Observation #4: Data volumes growing fast
• Data is kept longer
• Lots of new kinds of data: RFID, email, photos, videos
• Disk densities improving, but not seek times!
  – 1 TB disk for $399 (Hitachi)
Requirement #4: Simple & massive scale-out
• 1000s of nodes
• With low management overhead
• No single point of failure
Observation → Requirements (5 of 5)
Obs. #5: Today’s Info. Mgmt. software based upon hardware of 30 yrs. ago
• Example: Update-in-place databases due to expensive disk
• Today: Cheap CPUs, large storage, fast networks
Requirement #5: Need new (software) architecture
• Opportunity to radically rethink Info. Mgmt. software architecture (Stonebraker: “refactor”), based upon:
  – Hardware economics
    • e.g., cheap (multi-core) CPUs, storage, memory, network
  – Software:
    • Formats (e.g., XML, semi-structured data)
    • Functionality required (e.g., unstructured search, analytics)
  – Specified in the right order:
    • Service requirements → Software → Hardware
What is Impliance?
Administrator-less:
• Low Time to Value by Self-Organizing
• Low Total Cost of Ownership
Scalable:
• Massively parallel scale-out…
• …to Petabytes!
Bundled:
• HW & SW
• Pre-configured
• Pre-tuned
• Limited APIs
Manage and Search All Data:
• Structured, Semi-Structured, …
• …Even Unstructured Text!
[Figure labels: Structured Data (Tables), XML, Text]
Pro-actively Mine Information:
• Glean business insight from data
What Does Impliance Actually Do?
• All enterprise information:
  – Stores & Retrieves (Search / Query)
  – Composes / Integrates / Mashups
  – Finds trends & exceptions (Business Intelligence)
Think of Impliance as…
• Content Management on steroids (beyond JSR-170)
• File System with all content searchable
• Data Warehouse with all your enterprise’s data
  – Not just structured information
  – Excluding high-rate OLTP (web site)
• A Jambalaya
Where does Impliance fit?
[Figure: product categories (DBMS, Content Management, Archiving, Impliance) placed on two axes. Types of Data: Structured, Semi-Structured, XML, Unstructured. Lifetime of Data: Transaction Ingestion, OLTP, Warehousing/OLAP, Archiving.]
How is Impliance related to…
• Google Base?
  – Primary data store
  – Appliance (product, i.e., sits at customer site), not a Service
  – Enterprise, not “the masses”
• DataSpaces / Google “Pay as you go”?
  – Primary data store (vs. lazy federation of existing data sources)
  – Enterprise, not “the web”
• Database “Appliances” (Netezza, DATAllegro, Greenplum, etc.)?
  – Not just structured (relational) data
  – Discovery of semantics
  – More pro-active
Research Opportunities
• Reducing TCO – Make categories of administration just GO AWAY
  – Self-Organizing to obviate database design
  – Exploit appliance’s limited externalized interfaces
• New HW & SW architectures using off-the-shelf components
  – Achieving fine-grained scale-out
  – Targeting robust, “good enough” designs
  – Exploiting integration of components
• Data and query models that
  – Unify all data, yet are simple
  – Tolerate “schema chaos”
  – Combine best features of keyword search & SQL
• Automated discovery of
  – Data & query semantics for
    • Improving precision of queries
    • Organizing data adaptively
  – Trends, exceptions, etc. (pro-active Business Intelligence)
Conclusions
• We’ve come a long way towards
  – the autonomic dream
  – incorporating all data
• But we can do much more!
• Impliance provides an exciting opportunity for DB research
  – To lower TCO for information management
  – To exploit today’s hardware and software advances
  – To rethink information management in a fundamentally new way
• Join us!
Thank You
English: Thank You · Spanish: Gracias · Brazilian Portuguese: Obrigado · German: Danke · Italian: Grazie · French: Merci
[Also shown in Hindi, Thai, Traditional Chinese, Russian, Arabic, Simplified Chinese, Tamil, Japanese, and Korean]
Appendix
Redefining Information Systems -- Players
Web 2.0 oriented next generation systems (delivered through services or appliances):
• Google, Yahoo, MSN, (IBM)
  – Google Base (a semi-structured/unstructured information base)
  – Google OneBox
• NextGen systems built by integration of successful open source (Greenplum)
  – Data models: RSS/Atom/Wiki/…
  – Architecture: DB+Search+Content systems (e.g., MySQL+Lucene+Jackrabbit)
Entrenched HW/Storage/middleware companies
• Storage-driven:
  – EMC – Moving up the value chain, brought in a classic Content system
  – IBM – IDS: synergy between classic CM (JCR) and storage
• Server-driven:
  – Netezza, DATAllegro (for BI)
  – Zantaz (for email compliance)
  – DataPower (XSLT filtering)
• Middleware-driven (IBM, Oracle, Microsoft)
  – Oracle Secure Enterprise Search
Research Focus 1: Reducing TCO
• Make entire categories of administration JUST GO AWAY
• Reducing time-to-value through new design principles
  – Self-organization of “schema chaos” obviates lengthy logical & physical design, REORG
  – Fine-grained scale-out (instead of scale-up) obviates need for load balancing, etc.
• New software architecture
  – Target robust, highly-predictable, “good enough” utilization (KIWI = Kill It With Iron)
  – Componentization
    • Each component simple, robust, and adaptive
  – Virtual service model
    • Service Broker optimizes resources and assigns the workload
• Exploit integrated hardware and storage systems to provide
  – Built-in redundancy and availability
  – Automated backup and archiving (ILM)
  – Easy cluster management
  – Schema chaos support at storage level (semantic storage)
  – Ability to use new types of grid elements (cell blade server) seamlessly
Research Focus 2: Scalability
• True Grid Model
  – Off-the-shelf, commodity hardware
  – Dedicate blades to different tasks
    • Data: storage and simple filtering
    • Analytical: aggregation & mining
    • Transaction: search, transactional get/put
  – Supports Mixed Workloads
    • Analytics, Search, Content, …
• Fine-grained scale-out
  – Different blade types scale independently
  – From SMB to largest enterprises
• Integrating modern HW & storage, e.g. BC3, intelligent bricks
  – Logic pushdown into storage
    • Predicate application
    • Aggregation
    • Redundancy management
[Figure: Xaction, Data, Content, and Archive/ILM streams feed a Transactional Cluster (Transaction Blades) and an Analytic Grid (Analytic Blades), connected by a Commodity Interconnect to Data Arrays of Data Blades (data proc + RAID). Caption: Data+Content+Search+Digital Media]
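The "logic pushdown into storage" item on this slide (predicate application and aggregation performed next to the data) can be illustrated with a minimal Python sketch. The function names `data_blade_scan` and `analytic_blade_merge`, the record layout, and the sample rows are all hypothetical; the slides do not describe an actual Impliance API, so this only shows the general idea of filtering and pre-aggregating on each data blade and merging small partial results elsewhere.

```python
from collections import defaultdict

def data_blade_scan(rows, predicate, key, value):
    """Runs on one (hypothetical) data blade: apply the predicate locally,
    then pre-aggregate, so only a small partial result leaves the blade."""
    partial = defaultdict(float)
    for row in rows:
        if predicate(row):                  # predicate application at the storage
            partial[row[key]] += row[value]  # local (partial) aggregation
    return dict(partial)

def analytic_blade_merge(partials):
    """Runs on a (hypothetical) analytic blade: merge per-blade partial sums."""
    total = defaultdict(float)
    for partial in partials:
        for group, subtotal in partial.items():
            total[group] += subtotal
    return dict(total)

# Illustrative data spread over two data blades.
blade_a = [
    {"region": "east", "amount": 10.0, "year": 2007},
    {"region": "west", "amount": 5.0, "year": 2006},
]
blade_b = [
    {"region": "east", "amount": 2.5, "year": 2007},
    {"region": "west", "amount": 4.0, "year": 2007},
]

def is_2007(row):
    return row["year"] == 2007

partials = [data_blade_scan(b, is_2007, "region", "amount") for b in (blade_a, blade_b)]
result = analytic_blade_merge(partials)
print(result)
```

Only the per-group subtotals cross the interconnect in this sketch, which is the usual motivation for pushing predicates and aggregation toward storage.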
Parallel Run-time: Comparison of Plumbing
Platform          | Application                                                             | Querying model    | Parallelism    | Fault tolerance | Resource scheduling
WS XD             | Transactional (composition; no search, no BI)                           | limited           | moderate       | yes             | yes
DataStage (E2)    | ETL (streaming): cleansing, transformation, composition                 | rich              | high           | yes             | yes
GPFS              | Storage                                                                 | extremely limited | extremely high | yes             | limited
DB2 ESE with DPF  | Analytics for relational                                                | rich              | high           | yes             | yes
Google Map/Reduce | Analytics for anything (search, transformation, simplistic composition) | limited           | extremely high | yes             | yes
Impliance         | Analytics for anything, Search, Composition                             | rich              | extremely high | yes             | yes
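[Figure: Impliance architecture. Applications access content, relational data, XML, web pages, video, archive/ILM, objects, … through JCR, SQL, XSLT, and HTTP interfaces. Layers shown: Data Analyzer / Discovery / Query, Data/Query Modeler, Scalable Reliable Runtime Support, Resource Modeler, Distributed Data Store, Virtual Storage and Computing Resource, with Security and Control alongside.]
• Data Analyzer, Discovery, Query: Large-scale computation
• Data Modeler: Simple, generic
• SRRS (Scalable Reliable Runtime Support): Fault tolerant
• DDS (Distributed Data Store): Provide reliability
• VSCR (Virtual Storage and Computing Resource): Commodity HW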
Research Focus 3: Information Modeling and Querying
• Simple, rich, unified information model & associated query languages, e.g.
  – Google Base approach promising
    • Defined typed attributes for navigation
    • Defined label for keyword search
  – Infosphere, MUSIC
  – Open community (RSS / Atom / wiki)
• Automatic schema discovery and integration – self-organizing!
  – Integrating solutions from Infosphere, CLIO
• Intelligence discovery
  – Automatic discovery of semantics (UIMA, WebFountain, Avatar)
  – Pro-active, continuous mining (vs. passive BI model)
  – Contextual information supply
  – Including reporting and advanced analytics
Eliminate Admin Tasks…
…Rather than adding layers (1 of 3):
• Special-purpose, turn-key appliances for basic services
  – vs. today’s general-purpose SW (but still uses off-the-shelf hardware!)
  – Bundled, Pre-installed, Pre-configured, Pre-tuned software!
  – Examples:
    • Information Management appliance
    • Web Server appliance
  – Minimizes interfaces user has to worry about
    • No need to externalize underlying operating system, storage details
  – Eliminates need to install, configure, and tune
• Self-organizing data systems
  – Automatic discovery of data structure
  – Obviates need to
    • Define logical and physical schema a priori, reducing time to value
    • Migrate schema when organization changes
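As a rough illustration of "automatic discovery of data structure", the Python sketch below infers a schema from records that arrive with no schema declared up front. The function `discover_schema`, the record contents, and the choice to track only attribute names, observed types, and counts are assumptions made for this example; the slides do not specify how Impliance would perform discovery.

```python
from collections import defaultdict

def discover_schema(records):
    """Infer, per attribute name, which value types occur and how often.
    Nothing is declared a priori; the schema is whatever the data shows."""
    observed = defaultdict(lambda: {"types": set(), "count": 0})
    for record in records:
        for attr, value in record.items():
            entry = observed[attr]
            entry["types"].add(type(value).__name__)
            entry["count"] += 1
    # Return plain, deterministic structures (sorted type names).
    return {
        attr: {"types": sorted(entry["types"]), "count": entry["count"]}
        for attr, entry in observed.items()
    }

# Hypothetical heterogeneous input: two person-like records and one document.
records = [
    {"name": "Ann", "age": 41},
    {"name": "Bob", "age": "unknown", "email": "bob@example.com"},
    {"title": "Q3 report", "body": "Revenue grew."},
]

schema = discover_schema(records)
for attr, info in schema.items():
    print(attr, info)
```

Because the schema is recomputed from the data, a new attribute (such as `email` here) simply appears in the result instead of requiring a schema migration.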
Eliminate Admin Tasks (2 of 3):
• Universal Data Management
  – Today: Plethora of special-purpose data managers:
    • Databases for structured data
    • Content managers for semi-structured data
    • File systems for unstructured data
  – For each, very different
    • User interfaces (SQL, JSR 170, file interface)
    • Degrees of semantic knowledge about the data’s contents
    • Degrees of searchability
    • Consistency semantics (e.g., transactions) when updated
    • Management capabilities and interfaces
  – Tomorrow: Single mechanism for managing all data
    • Uniform interfaces for all types of data, for Searching, Updating, Management
    • Universal indexing (“Google model”) of all data – default search mechanism
    • Plus more precise searching for auto-discovered (above) structured information
    • Obviates need to impose naming conventions to find desired data
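The "universal indexing" idea above (keyword search as the default over all data, with more precise attribute-based search where structure is known) might look roughly like the following Python sketch. The class `UniversalIndex`, its methods, the item identifiers, and the sample items are hypothetical and only illustrate the combination; this is not a description of Impliance's actual index.

```python
import re
from collections import defaultdict

class UniversalIndex:
    """One inverted index over every item, whatever its shape."""

    def __init__(self):
        self.items = {}                   # item id -> attribute dict
        self.postings = defaultdict(set)  # token -> ids of items containing it

    def add(self, item_id, item):
        self.items[item_id] = item
        for value in item.values():
            for token in re.findall(r"\w+", str(value).lower()):
                self.postings[token].add(item_id)

    def search(self, keywords, **attrs):
        """Keyword search by default; optional attribute filters add precision."""
        tokens = re.findall(r"\w+", keywords.lower())
        if not tokens:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return sorted(
            item_id for item_id in hits
            if all(self.items[item_id].get(k) == v for k, v in attrs.items())
        )

index = UniversalIndex()
index.add("row:1", {"kind": "order", "customer": "Acme", "total": 120})
index.add("doc:7", {"kind": "email", "body": "Acme asked about the late order"})
index.add("xml:3", {"kind": "invoice", "customer": "Globex", "note": "late fee"})

print(index.search("acme"))                # keyword search across all item types
print(index.search("acme", kind="order"))  # narrowed by a structured attribute
```

The same call serves a relational row, an email, and an XML-derived record, so no naming convention or per-type interface is needed to find them.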
Eliminate Admin Tasks (3 of 3):
• Robust storage mechanisms to eliminate need for backups
  – Never throw out data – keep versions!
    • Update-in-place
      – Is an anachronism from days of expensive disk
      – Increases complexity of transactions
      – Jeopardizes compliance requirements (Sarbanes-Oxley)
    • Versions permit queries “as of” some time
    • Exploits storage density increases (relative to number of disk arms)
  – RAID provides local reliability
    • Widely accepted and deployed
    • Weaver Codes extend to multiple simultaneous failures
  – How to provide universal reliability (i.e., against site disasters)?
    • Selective, automated replication of new versions?
    • Cross-site RAID?
• Universal “Call Home” technology for remote management of
  – Monitoring
  – Problem determination
  – Software maintenance & upgrades
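The "never throw out data, keep versions" point on this slide, including queries "as of" some time, can be sketched in a few lines of Python. The class `VersionedStore`, its integer timestamps, and the sample keys are assumptions for illustration only; the slides do not define a storage interface.

```python
import bisect

class VersionedStore:
    """Append-only store: a write adds a new version instead of updating in place."""

    def __init__(self):
        self._times = {}   # key -> ascending list of version timestamps
        self._values = {}  # key -> list of values, parallel to _times

    def put(self, key, value, timestamp):
        times = self._times.setdefault(key, [])
        values = self._values.setdefault(key, [])
        if times and timestamp <= times[-1]:
            raise ValueError("timestamps must increase for each key")
        times.append(timestamp)   # old versions are never overwritten
        values.append(value)

    def get(self, key, as_of=None):
        """Return the latest version, or the version visible at time `as_of`."""
        times = self._times.get(key, [])
        pos = len(times) if as_of is None else bisect.bisect_right(times, as_of)
        if pos == 0:
            return None           # key did not exist yet at that time
        return self._values[key][pos - 1]

store = VersionedStore()
store.put("salary:ann", 100, timestamp=1)
store.put("salary:ann", 120, timestamp=5)

print(store.get("salary:ann"))           # current value
print(store.get("salary:ann", as_of=3))  # value as of time 3
```

Since earlier versions stay in place, an audit question such as "what did this record say at time 3?" is an ordinary read rather than a restore from backup.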
Observation / Requirements
• Information converging: Store / Search / Analyze ALL data
  – Structured (traditional Data Base)
  – Semi-structured (traditional Content Management, XML, multi-media, call center records)
  – Unstructured (text)
  – Same advanced functionality required
• Data volume growing fast: On Demand strategy requires massive scale-out
  – Lots of new data: RFID, email, photos, videos (Deep Internet-scale systems being built)
  – Data is kept longer, due to compliance requirements
• Total Cost of Ownership (TCO) is paramount: System simple & robust (not smart & fragile)
  – People costs dominate TCO: Hardware often less than 50% of TCO
  – Hence, sacrifice resource utilization for radical simplification
  – Delivered in services or appliances
• Today’s IM software based upon hardware of 30 yrs ago: Need new software architecture
  – Cheap CPUs, large storage, fast network in hardware
  – Opportunity to radically rethink IM software architecture, based upon:
    • Hardware economics (e.g., cheap CPUs, storage, memory, & network)
    • Data:
      – Formats (e.g., XML, semi-structured data)
      – Functionality required (e.g., unstructured search, analytics)
Total Cost of Ownership is the Driver
Cost of management and administration is outpacing spending on new systems
[Figure: Spending (US$B, $0 to $160) and Installed base (M Units, 0 to 35) by year, 1996 to 2008. Series: New server spending (US$M), 3% CAGR; Cost of management and administration, 10% CAGR.]
Source: IDC, On-Demand Enterprises and Utility Computing: A Current Market Assessment and Outlook, IDC #31513, July 2004
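To see why a 10% CAGR "outpaces" a 3% CAGR, it helps to compound both rates over the chart's time axis. The short Python sketch below assumes each rate holds constant over the full 1996 to 2008 span shown on the x-axis, which the slide does not state explicitly; it uses no dollar figures from the chart, only the two quoted rates.

```python
def growth_factor(cagr, years):
    """Total multiplier after compounding a constant annual growth rate."""
    return (1.0 + cagr) ** years

years = 2008 - 1996  # span of the chart's x-axis (assumed to apply to both rates)

server_growth = growth_factor(0.03, years)  # new server spending, 3% CAGR
admin_growth = growth_factor(0.10, years)   # management and administration, 10% CAGR

print(f"New server spending grows by a factor of {server_growth:.2f}")
print(f"Management and administration cost grows by a factor of {admin_growth:.2f}")
```

Under these assumptions, administration cost roughly triples over the period while new server spending grows by less than half.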
Changing Characteristics of Data
[Figure: three categories of data, each drawn on the axes Actionability, Heterogeneity, and Scale.]
Categories:
• Transactions and structured data
• Text and other human data
• Machine-generated and unstructured data
Examples:
• Seat on an airplane: Easy to find, structured
• LifeScience data - protein folding, gene expression: Difficult to analyze but we know where to look
• Satellite and surveillance data: An infinite space of "patterns"
Impliance: A Highly-Scalable, Rich-Functional Information Management Appliance
A box with software pre-installed
How delivered to enterprise: appliance or service
What functions?
• Store and manage all information
  – accept all types of enterprise data
• Deliver all intelligence
  – Integrate cross-silo information
  – Advanced analytics with richer semantics
What properties?
• Low TCO
  – easy to deploy (“plug & play”)
  – simple and stable
• Scalability
  – From SMB to Very Large (PetaBytes)
  – (Not for high-end OLTP!)
[Figure: Impliance box with its interfaces: JCR (native content retrieval), SQL interface (relational data), XSLT (XML), HTTP (web page), and a native Impliance update/load interface (video, archive, ILM, …). Caption: Data+Content+Digital Media]