ISTORE Software Architecture

ISTORE Software Runtime
Architecture
Aaron Brown, David Oppenheimer, Kimberly Keeton,
Randi Thomas, Jim Beck, John Kubiatowicz, and
David Patterson
http://iram.cs.berkeley.edu/istore
1999 Winter IRAM Retreat
Slide 1
ISTORE Runtime
Software Architecture
• Runtime system goals for the ISTORE meta-appliance
(1) Provide mechanisms that allow network service applications
to exploit introspection (monitor + adapt)
(2) Allow appliance designer to tailor runtime system policies
and interfaces
• How the goals are achieved
(1) Introspection: layered local and global runtime system
libraries that manipulate and react to monitoring data
(2) Specialization: runtime system is extensible using domain-specific languages (DSLs)
Slide 2
Roadmap
• Layered software structure
• Example of introspection
• Runtime system extensibility using DSLs
• Conclusion
Slide 3
Layered software structure
[Diagram: front-end node(s) run Application Front-End Code over the Distributed Global Runtime, Local Runtime, Device Interface & RTOS, and HW Device (NIC); device nodes run Parallel App. Worker Code over the same stack; nodes are connected by a switch, with the front-end attached to the LAN/WAN]
Slide 4
Device Interface Layer
[Diagram: the same layered stack, highlighting the Device Interface & RTOS layer on both front-end node(s) and device nodes]
Slide 5
Device interface layer
• Microkernel OS modules
• Traditional OS services
– Networking, mem management, process scheduling, threads, …
• Device-specific monitoring
– Raw access patterns
– Utilization statistics
– Environmental parameters
– Indications of impending failure
• Self-characterization of performance, functional
capabilities
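
A minimal C++ sketch of the kind of per-device monitoring interface this layer might export; the DiskMonitor class, its counters, and the failure threshold are illustrative assumptions, not part of the ISTORE design:

    // Hypothetical sketch of a device-interface monitoring module for a disk.
    // All names (Observation, DiskMonitor, thresholds) are illustrative.
    #include <cstdint>
    #include <deque>

    struct Observation {            // one raw monitoring sample
        uint64_t timestamp_us;      // when the sample was taken
        uint64_t sector;            // raw access pattern: which sector was touched
        bool     was_read;          // read vs. write
        uint32_t ecc_retries;       // indication of impending failure
        int16_t  temperature_c;     // environmental parameter
    };

    class DiskMonitor {
    public:
        // Called by the driver on every I/O completion.
        void Record(const Observation& o) {
            samples_.push_back(o);
            total_ios_++;
            total_ecc_retries_ += o.ecc_retries;
            if (samples_.size() > kWindow) samples_.pop_front();
        }

        // Utilization statistic over the current window.
        double Utilization() const {
            return static_cast<double>(samples_.size()) / kWindow;
        }

        // Simple self-characterization / health check: rising ECC retry rate.
        bool FailurePredicted() const {
            return total_ios_ > 0 &&
                   static_cast<double>(total_ecc_retries_) / total_ios_ > 0.01;
        }

    private:
        static constexpr size_t kWindow = 4096;  // recent-history window
        std::deque<Observation> samples_;
        uint64_t total_ios_ = 0;
        uint64_t total_ecc_retries_ = 0;
    };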
Slide 6
Local runtime layer
[Diagram: the same layered stack, highlighting the Local Runtime layer on both front-end node(s) and device nodes]
Slide 7
Local runtime layer
• Non-distributed mechanisms needed by network service
applications
• Feeds information to global layer or performs local
operations on behalf of global layer
• Example mechanisms
– Application-specific filtering/aggregation of device monitoring data
» Example: OLTP server vs. DSS server
– Data layout and naming
» Example: record-based interface for DB, file-based for web
server
– Device scheduling
» Example: maximize TPS vs. maximize disk bandwidth utilization
– Caching
» Coherence essential vs. coherence unnecessary
» More efficient caching implementation possible in second case
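
As a rough illustration of application-specific filtering/aggregation, the following C++ sketch summarizes raw I/O events differently for an OLTP server than for a DSS server; the LocalAggregator type and its summary metric are hypothetical:

    // Hypothetical sketch: application-specific aggregation of raw device
    // events in the local runtime. Types and metrics are illustrative only.
    #include <cstdint>

    enum class Workload { OLTP, DSS };

    struct IoEvent { uint64_t bytes; uint64_t latency_us; };

    class LocalAggregator {
    public:
        explicit LocalAggregator(Workload w) : workload_(w) {}

        void OnIo(const IoEvent& e) {
            if (workload_ == Workload::OLTP) {
                // OLTP server cares about small-request latency (TPS).
                ops_++;
                latency_sum_us_ += e.latency_us;
            } else {
                // DSS server cares about sustained scan bandwidth.
                bytes_ += e.bytes;
            }
        }

        // Summary the global runtime layer polls periodically.
        double Summary() const {
            if (workload_ == Workload::OLTP)
                return ops_ ? static_cast<double>(latency_sum_us_) / ops_ : 0.0;
            return static_cast<double>(bytes_);
        }

    private:
        Workload workload_;
        uint64_t ops_ = 0, latency_sum_us_ = 0, bytes_ = 0;
    };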
Slide 8
Global runtime layer
[Diagram: the same layered stack, highlighting the Distributed Global Runtime layer spanning front-end node(s) and device nodes]
Slide 9
Global runtime layer
• Aggregate, process, react to monitoring data
• Relies on local per-device runtime mechanisms to
provide monitoring data, implement control actions
• Provides application interface that hides distributed
implementation of runtime services
• Example services
– High-level services
» Load balancing: replicate and/or migrate heavily used data
objects when a disk becomes over-utilized
» Availability: replicate data from failed or failing component to
restore required redundancy
» Plug-and-play: integrate new devices into the system
– Low-level services used to implement high-level global services
» Distributed directory tracks data and metadata objects
» Migration, replication, caching
» Inter-brick communication
» Distributed transactions
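
One way the high-level load-balancing service might be layered on the low-level directory and migration mechanisms is sketched below in C++; the Directory and GlobalRuntime interfaces, the local map standing in for the distributed directory, and the 0.9 utilization threshold are assumptions for illustration only:

    // Hypothetical sketch: a global load-balancing service built from
    // low-level directory and migration mechanisms. Interfaces are assumed.
    #include <string>
    #include <unordered_map>
    #include <vector>

    using ObjectId = std::string;
    using DeviceId = int;

    class Directory {                       // distributed directory, sketched as a local map
    public:
        std::vector<DeviceId> Replicas(const ObjectId& o) const {
            auto it = map_.find(o);
            return it == map_.end() ? std::vector<DeviceId>{} : it->second;
        }
        void AddReplica(const ObjectId& o, DeviceId d) { map_[o].push_back(d); }
    private:
        std::unordered_map<ObjectId, std::vector<DeviceId>> map_;
    };

    class GlobalRuntime {
    public:
        explicit GlobalRuntime(Directory& dir) : dir_(dir) {}

        // High-level service: replicate a hot object when its disk is over-utilized.
        void RebalanceIfHot(const ObjectId& o, DeviceId hot, double utilization) {
            if (utilization < 0.9) return;             // assumed threshold
            DeviceId target = PickLightlyLoadedDevice();
            CopyObject(o, hot, target);                // low-level migration mechanism
            dir_.AddReplica(o, target);                // low-level directory update
        }

    private:
        DeviceId PickLightlyLoadedDevice() { return 0; }          // placeholder policy
        void CopyObject(const ObjectId&, DeviceId, DeviceId) {}   // placeholder transfer
        Directory& dir_;
    };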
Slide 10
Distributed application worker code
[Diagram: the same layered stack, highlighting the Parallel App. Worker Code layer on the device nodes]
Slide 11
Distributed application worker code
• Runs on top of global runtime system
• Written by appliance designer
• Application-specific
– Database
» scan, sort, join, aggregate, update record, delete record, ...
– Transformational web proxy
» fetch web page (from disk or remote site), apply transformation
filter, update user preferences database, ...
• System administration tools implemented at this level
– Customized runtime system defines administrative interface
tailored to application
Slide 12
Application front-end code
[Diagram: the same layered stack, highlighting the Application Front-End Code layer on the front-end node(s)]
Slide 13
Application front-end code
• Runs on front-end interface bricks
• Accepts requests from LAN/WAN connection
– Incoming requests made using standard high-level protocols
» HTTP, NFS, SQL, ODBC, …
• Invokes and coordinates appropriate worker code
components that execute on internal blocks
– Takes into account locality and load balancing
– Database: front-end performs SQL query optimization, invokes
distributed relational operators on data storage devices
– Transformational proxy: front-end invokes distiller thread on
appropriate device brick
» if data is cached, invoke on disk node
» otherwise, fetch data from web and invoke on compute node or
disk node
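
A minimal C++ sketch of the cached-vs-uncached dispatch decision just described; the Cluster interface and its methods are hypothetical stand-ins for the front-end's directory lookup, load balancer, and worker invocation:

    // Hypothetical sketch of the front-end's dispatch decision for the
    // transformational-proxy example; names and interfaces are illustrative.
    #include <optional>
    #include <string>

    using NodeId = int;

    struct Cluster {
        // Assumed directory lookup: the disk node caching `url`, if any.
        std::optional<NodeId> CachedOn(const std::string& url) {
            (void)url; return std::nullopt;            // stubbed for the sketch
        }
        NodeId LeastLoadedComputeNode() { return 0; }  // stubbed load-balancing choice
        void InvokeDistiller(NodeId node, const std::string& url) {
            (void)node; (void)url;                     // would start worker code
        }
    };

    void HandleRequest(Cluster& cluster, const std::string& url) {
        if (auto node = cluster.CachedOn(url)) {
            // Data already cached: run the distiller on that disk node for locality.
            cluster.InvokeDistiller(*node, url);
        } else {
            // Otherwise fetch from the web and distill on a lightly loaded compute node.
            cluster.InvokeDistiller(cluster.LeastLoadedComputeNode(), url);
        }
    }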
Slide 14
Roadmap
• Layered software structure
• Example of introspection
• Runtime system extensibility using DSLs
• Conclusion
Slide 15
From introspection to adaptation
[Diagram: intelligent HW components + extensible, application-tailored runtime system → continuous monitoring → adaptive, self-maintaining appliance]
• Example: slowly-failing data disk in large DB system
(1) Detect problem
(2) Repair problem while continuing to handle incoming requests
(3) Return to normal system operation
Slide 16
Failing disk: detection
• Microkernel monitoring module continuously monitoring
disk’s health detects exceptional condition, e.g.
– ECC failures
– Media errors
– Increased rates of ECC retries
• Notifies global fault handling mechanism
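
A rough C++ sketch of how the monitoring module could escalate such a condition to the global fault handler; the FailureDetector class and its thresholds are assumptions, not the actual microkernel module:

    // Hypothetical sketch: device-level monitoring escalating a slowly-failing
    // disk to the global fault handling mechanism. Thresholds are assumptions.
    #include <cstdint>
    #include <functional>

    struct DiskHealth {
        uint64_t ios = 0;
        uint64_t ecc_retries = 0;
        uint64_t media_errors = 0;
    };

    class FailureDetector {
    public:
        using Notify = std::function<void(int device_id)>;
        FailureDetector(int device_id, Notify notify)
            : device_id_(device_id), notify_(std::move(notify)) {}

        void OnSample(const DiskHealth& h) {
            // Exceptional conditions: any media error, or an elevated ECC retry rate.
            bool failing = h.media_errors > 0 ||
                           (h.ios > 1000 && h.ecc_retries * 100 > h.ios);
            if (failing && !reported_) {
                reported_ = true;
                notify_(device_id_);   // hand off to the global fault handling mechanism
            }
        }

    private:
        int device_id_;
        Notify notify_;
        bool reported_ = false;
    };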
Slide 17
Failing disk: reaction
• Global fault handling mechanism…
– Prevents system from sending more work to failed device
» Modifies global directory to remove entries corresponding to
failed component’s data
– Application-specific response to impending failure
» Transactional system: discard work currently in progress on
failing device, reissue to another data replica
» Non-transactional system w/o coherent replicas: checkpoint
computation, restore on another data replica
» Transformational web proxy: do nothing
– Instruct disk runtime system to shut disk down
» Disk device is considered failed
Slide 18
Failing disk: return to normal operation
• Global fault handling mechanism...
– Rebuilds data redundancy
» By allocating space for a new replica on a functioning disk and
copying data to it from existing replicas
» Using an application-specific data replication mechanism
• Where to allocate new replicas, how to copy data, how to lay
out data for new replicas, how to update global directory
• Example in upcoming slide
• Life returns to normal
– Degree of fault-tolerance has been restored
• Failed component can be replaced during regularly scheduled maintenance
Slide 19
Roadmap
• Layered software structure
• Example of introspection
• Runtime system extensibility using DSLs
• Conclusion
Slide 20
Runtime system extensibility
• Two ways of looking at system
– Partitioned on functional/mechanism boundaries
» Collection of libraries: failure detection, transactions, ...
» Mechanisms are isolated
[Diagram: the application sits on top of isolated mechanism libraries — libfail, librepl, libtrxn, libcache — layered over the OS]
Slide 21
Runtime system extensibility
• Two ways of looking at system
– Partitioned on functional/mechanism boundaries
» Collection of libraries: failure detection, transactions, ...
» Mechanisms are isolated
– Partitioned on global system properties
» This is how the programmer thinks about the system (high-level)
» e.g. application-specific data availability policy
• Failure detection (which devices to monitor, …)
• Replication (used to restore redundancy)
• Transactions (how to restart work in progress)
• Caching (how to handle dirty cached objects during failure)
[Diagram: the application sits on top of isolated mechanism libraries — libfail, librepl, libtrxn, libcache — layered over the OS]
Slide 22
Runtime system extensibility
• Two ways of looking at system
– Partitioned on functional/mechanism boundaries
» Collection of libraries: failure detection, transactions, ...
» Mechanisms are isolated
– Partitioned on global system properties
» This is how the programmer thinks about the system (high-level)
» e.g. application-specific data availability policy
• Failure detection (which devices to monitor, …)
• Replication (used to restore redundancy)
• Transactions (how to restart work in progress)
• Caching (how to handle dirty cached objects during failure)
[Diagram: a policy specification is fed to a compiler, producing a customized runtime system library that coordinates libfail, librepl, libtrxn, and libcache over the OS, beneath the application]
Slide 23
Extensibility using DSLs
• DSLs are languages specialized for a particular task
• Each ISTORE DSL
– Encapsulates high-level semantics of one system behavior
– Allows declarative specification of
» Behavior of one aspect of the system (a “policy”)
» Interfaces to coordinated mechanisms that implement the policy
– Is compiled into an implementation that might coordinate
several local and/or global base runtime system mechanisms
» May be implemented as background and/or foreground tasks
• Analysis tools can potentially infer unspecified
emergent system behaviors from the specifications
– e.g. what impact will a new redundancy policy have on
transaction commit time
• Extensions compiled together with local and global
base mechanisms form the distributed runtime system
Slide 24
Extensibility using DSLs: Example
Avail::FailureDetected(Device d) {
  Object o; ObjList objs;
  Transaction t; TxnList txns;
  Replica x, c, r;

  Directory::MarkDeviceDisabled(d);
  Admin::AlertFailure(d);
  objs = Directory::GetObjects(d);                      // objs stored on failed device
  foreach o (objs) {
    x = Directory::GetReplica(o,d);                     // find o's replica on d
    Directory::DeleteReplica(x);                        // delete from global directory
    txns = Txn::GetActiveTxns(x);
    foreach t (txns) {
      Txn::AbortTxn(t);                                 // abort pending txns for o on d
    }
    c = Directory::GetReplica(o);                       // find still-accessible copy
    r = LoadBalancer::AllocateReplica(o);               // get space for new replica
    LocalRuntime::CopyObject(c,c->device,r,r->device);  // copy it
    Directory::AddReplica(r,r->device);                 // update directory
    foreach t (txns) {
      Txn::IssueTxn(t,r);                               // reissue txns on new replica
    }
  }
}
Slide 25
Extensibility using DSLs (cont.)
• Similar specification written for each extension to
base library
• Other examples of extensible system behaviors
– Transaction response time requirements
– Prioritizing operations based on type of data processed
– Resource allocation
– Backup policy
– Exported administrative interface
Slide 26
Why use DSLs?
• Possible choices
– Each appliance designer writes runtime system from scratch
» Similar to exokernel operating systems
– All designers use single parameterized runtime system library
» Similar to tunable kernel parameters in modern OSs
– Designer writes high-level specification of system behavior
» DSL compiler automatically translates specification into runtime
system extensions that coordinate base mechanisms
» Advantages include
• Programmability
• Performance
• Reliability, verifiability, safety
• Artificial diversity
Slide 27
DSL advantages (cont.)
• Programmability
– High-level specification close to designer’s abstraction level
» Easier to write, reason about, maintain, modify runtime system
code
» Simple enough to allow site-specific customization at installation
time
• Performance
– Aggressive DSL compiler can take advantage of high-level
semantics of specification language
– Base library mechanisms can be highly optimized; optimization
complexity hidden from appliance designer
– Web example: infer that TCP checksums should be stored
with web pages
Slide 28
DSL advantages (cont.)
• Reliability
– Automatically generate code that’s easy to forget or get wrong
» Example: synchronization operations to serialize accesses to
distributed data structure
• Verifiability
– Of input code (DSL specification)
» More abstract form of semantic checking
» e.g. DSL supports types natural to behavior being specified =>
type-checking verifies some semantic constraints
• e.g. “ensure no unencrypted objects are written to disk”
– Of output code (coordinated use of base mechanisms)
» DSL compiler writer satisfied DSL compiler is correct =>
appliance designer inherits verification effort
• Safety (prevent runtime errors)
– Whole classes of general programming errors not possible
» DSLs hide details: runtime memory management, IPC, ...
» Compiler automatically adds code: synchronization, ...
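
To make the type-checking point concrete, here is a small C++ sketch (standing in for a DSL type system) in which only encrypted objects can reach the disk-write interface, so the "no unencrypted objects written to disk" constraint becomes a compile-time error; all names are hypothetical:

    // Hypothetical sketch: enforcing "no unencrypted objects written to disk"
    // with types rather than runtime checks.
    #include <string>
    #include <vector>

    struct PlainObject { std::vector<char> bytes; };

    // The only way to obtain an EncryptedObject is through Encrypt().
    class EncryptedObject {
    public:
        friend EncryptedObject Encrypt(const PlainObject& p);
        const std::vector<char>& Bytes() const { return bytes_; }
    private:
        explicit EncryptedObject(std::vector<char> b) : bytes_(std::move(b)) {}
        std::vector<char> bytes_;
    };

    EncryptedObject Encrypt(const PlainObject& p) {
        std::vector<char> out = p.bytes;   // a real system would encrypt here
        return EncryptedObject(std::move(out));
    }

    // Disk writes accept only EncryptedObject; passing a PlainObject is a type error.
    void DiskWrite(const std::string& name, const EncryptedObject& o) {
        (void)name; (void)o;               // would issue the actual disk I/O
    }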
Slide 29
DSL advantages (cont.)
[Diagram: a single specification is fed to the DSL compiler, producing Implementation 1, Implementation 2, Implementation 3, ...]
• Artificial diversity
– Potentially allow system to continue operation in face of
internal bugs or malicious attack
» Multiple implementations of component run simultaneously on
different data replicas
» Continuously check each other with respect to high-level behavior
» Non-traditional fault-tolerance, but related to process pairs
– Potentially usable to enhance performance
» Select best-performing implementation(s) for future use;
periodically reevaluate choice
– Examples of possible implementation differences
» Low-level: runtime memory layout, code ordering and layout
» High-level: system resource usage (recompute vs. use stored
data, general space/time/bandwidth tradeoffs)
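
A minimal C++ sketch of the cross-checking idea, in the spirit of process pairs: two compiled implementations of the same specification handle each request and their high-level results are compared; the types and the divergence handling are illustrative assumptions:

    // Hypothetical sketch: two independently generated implementations of the
    // same DSL specification, cross-checked on each request.
    #include <functional>
    #include <stdexcept>
    #include <string>

    using Request = std::string;
    using Result  = std::string;
    using Impl    = std::function<Result(const Request&)>;

    class DiverseService {
    public:
        DiverseService(Impl a, Impl b) : a_(std::move(a)), b_(std::move(b)) {}

        Result Handle(const Request& req) {
            Result ra = a_(req);
            Result rb = b_(req);
            if (ra != rb)
                throw std::runtime_error("implementations diverge: possible bug or attack");
            return ra;   // high-level behavior agreed; either result can be returned
        }

    private:
        Impl a_, b_;     // e.g. compiled with different layouts or resource strategies
    };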
Slide 30
ISTORE software summary
• ISTORE software architecture provides an extensible
runtime environment for distributed network service
application code
– Layered local and global mechanism libraries provide
introspection and self-maintenance
– Mechanisms can be customized using DSL-based specifications
of application policy
» DSL code coordinates base mechanisms to implement application
semantics and interfaces
» DSL-based extension offers significant advantages in
programmability, performance, reliability, safety, diversity
Slide 31
ISTORE summary
• Network services are increasing in importance
– Self-maintaining, scalable storage appliances match the needs of these services
• ISTORE provides a flexible architecture for
implementing storage-based network service apps
– Modular, intelligent, fault-tolerant hardware platform is easy
to configure, scale, and administer
– Runtime system allows applications to leverage intelligent
hardware, achieve introspection, and provide self-maintenance through
» Layered runtime software structure
» DSL-based extensibility that allows easy application-specific
customization
Slide 32
Agenda
• Overview of ISTORE: Motivation and Architecture
• Hardware Details and Prototype Plans
• Software Architecture
• Discussion and Feedback
Slide 33
Backup slides
Slide 34
What ISTORE is not
• An extensible operating system
– Use commodity OS, only add hardware monitoring module
» MM could just be a device driver => no need for microkernel OS
» ISTORE could be built on top of an extensible operating system
for even greater flexibility
• An attempt to make commodity OS’s extensible
– Extensible runtime system allows designer to customize
higher-level operations than OS extensions do
– Closest to an extensible distributed operating system built on
top of a commodity single-node operating system
• A multiple-protection-domain system
– Assumes non-malicious programmer
– If user-downloaded code permitted, sandbox must be
implemented as part of (trusted) application
– DSLs specify resource allocation/scheduling policies, appliance
designer responsible for ensuring fairness
• A framework for building generic servers
Slide 35
ISTORE boot process
(1) Initially, undifferentiated ISTORE system
(2) On boot, each device block contacts system boot
server
(3) Device blocks download customized runtime system and
application worker code
– Front-end blocks also download application front-end code
• Runtime system libraries structured as shared
libraries => hot upgrade
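
A small C++ sketch of the hot-upgrade idea using POSIX dlopen/dlsym to (re)load a downloaded runtime library; the library path and the istore_runtime_init entry-point name are assumptions. A real hot upgrade would also need to quiesce callers and unload the old version; this shows only the loading step.

    // Hypothetical sketch of loading a runtime library packaged as a shared
    // object; the library path and entry-point symbol name are assumptions.
    #include <dlfcn.h>
    #include <cstdio>

    using EntryFn = void (*)();

    // Load (or reload) the downloaded runtime library and return its entry point.
    EntryFn LoadRuntime(const char* path) {
        void* handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
        if (!handle) {
            std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
            return nullptr;
        }
        // Look up a well-known initialization symbol exported by the library.
        auto entry = reinterpret_cast<EntryFn>(dlsym(handle, "istore_runtime_init"));
        if (!entry) std::fprintf(stderr, "dlsym failed: %s\n", dlerror());
        return entry;
    }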
Slide 36
Example Appliances
• E-commerce
• Web search engine
• Transformational web/PDA proxy
• Election server
• Mail server
• News server
• NFS server
• Database server: OLTP, DSS, mixed OLTP-DSS
• Video server
Slide 37