
Simplifying Cluster-Based Internet Service
Construction with SDDS
Steven D. Gribble
Qualifying Exam Proposal
April 19th, 1999
Committee Members:
Eric Brewer, David Culler, Joseph Hellerstein, and
Marti Hearst
Clusters for Internet Services
• Previous observation (TACC, Inktomi, NOW):
  – clusters of workstations are a natural platform for constructing Internet services
• Internet service properties
  – support large, rapidly growing user populations
  – must remain highly available and cost-effective
• Clusters offer a tantalizing solution
  – incremental scalability: cluster grows with the service
  – natural parallelism: high-performance platform
  – software and hardware redundancy: fault tolerance
Problem
• Internet service construction on clusters is hard
  – load balancing, process management, communications abstractions, I/O balancing, fail-over and restart, …
  – toolkits proposed to help (TACC, AS1, River, …)
• Even harder if shared, persistent state is involved
  – data partitioning, replication, and consistency, interacting with the storage subsystem, …
  – existing solutions are not geared to clustered services
    • use a (distributed) RDBMS: expensive, powerful semantic guarantees, generality at the cost of performance
    • use a network/distributed FS: overly general, high overhead (e.g. double-buffering penalties). Fault-tolerance?
    • roll your own custom solution: not reusable, complex
Idea / Hypothesis
• It is possible to:
  – isolate clustered services from the vagaries of state management,
  – do so with adequately general abstractions,
  – build those abstractions in a layered fashion (reuse),
  – and exploit clusters for performance and simplicity.
• Use Scalable Distributed Data Structures (SDDS)
  – take a conventional data structure (hash table, tree, log, …)
  – partition it across the nodes in a cluster (see the sketch below)
    • parallel access, scalability, …
  – replicate partitions within replica groups in the cluster
    • availability in the face of failures, further parallelism
  – store replicas on disk
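A minimal sketch of the partitioning/replication idea above, not the prototype's code: hash a key to a partition, then map the partition to its replica group. All names and sizes here (NUM_PARTITIONS, brick_addr, the FNV-1a hash) are illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_PARTITIONS     16  /* table partitions spread across the cluster */
#define REPLICAS_PER_GROUP  2  /* mirrored replicas per partition */

typedef struct {
    const char *brick_addr[REPLICAS_PER_GROUP];  /* bricks holding this partition */
} replica_group;

static replica_group partition_map[NUM_PARTITIONS];  /* filled in at startup */

/* FNV-1a; any hash that spreads keys uniformly would do */
static uint32_t hash_key(const void *key, size_t len) {
    const unsigned char *p = key;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 16777619u; }
    return h;
}

/* writes go to every replica in the group (mirroring); reads may go to any */
static replica_group *lookup_group(const void *key, size_t len) {
    return &partition_map[hash_key(key, len) % NUM_PARTITIONS];
}
```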
Why SDDS?
• 1st-year undergrad software engineering principle:
  – separation of concerns
    • decouple persistency/consistency logic from the rest of the service
    • simpler (and cleaner!) service implementations
• service authors understand data structures
  – familiar behavior and interfaces from the single-node case
  – should enable rapid development of new services
• structure and access patterns are self-evident
  – access granularity is manifestly a structure element
  – coincidence of logical and physical data units (ALF/ADU)
    • cf. file systems, SQL in RDBMS, VM pages in DSM
Challenges (== Research Agenda)
• Overcoming complexities of distributed systems
  – data consistency, data distribution, request load balancing, hiding network latency and OS overhead, …
  – ace up the sleeve: cluster ≠ wide area
    • single, controlled administrative domain
      – engineer to (probabilistically) avoid network partitions
      – use a low-latency, high-throughput SAN (5 µs, 40-120 MB/s)
      – predictable behavior, controlled heterogeneity
• “I/O is still a problem”
  – plenty of work on fast network I/O, some on fast disk I/O
  – less work bridging network and disk I/O in a cluster environment
High-level plan (10,000’ view)
1) Implement several SDDS’s
  – hypothesis: log, tree, and hash table are sufficient
    • investigate different consistency levels and mechanisms
    • experiment with scalability mechanisms
  – get the top and bottom interfaces correct
    • top interface: particulars of the hash, tree, and log implementations
      – specialization hooks?
    • bottom interface: segment-based cluster I/O layer
      – filtered streams between disks, network, and memory
2) Evaluate by deploying real services
  – goal: measurable reduction of complexity
    • port existing Ninja services to the SDDS platform
  – goal: ease of adoption by service authors
    • have co-researchers implement new services on SDDS
Outline
• Prototype description
  – distributed hash table implementation
  – “Parallelisms” service and lessons
• SDDS design and research issues
• Evaluation methodology
• Segment-based I/O layer for clusters
• Related work
• Pro Forma timeline
Prototype hash table
• storage bricks provide local, network-accessible hash tables
• interaction with the distributed hash table is through abstraction libraries
  – “C”, Java APIs available (a hypothetical C sketch follows this slide)
  – partitioning and mirrored replication logic in each library
• distributed table semantics
  – handles node failures
  – no consistency
    • or transactions, online recovery, etc.
[Diagram: clients issue requests to service front-ends; each front-end links the hash library and talks over the SAN to a set of storage “bricks”.]
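A hypothetical sketch of what the client-side “C” abstraction library interface above might look like; the prototype's actual signatures may differ, and the ddhash names are assumptions.

```c
#include <stddef.h>

typedef struct ddhash ddhash;   /* opaque handle to one distributed hash table */

ddhash *ddhash_open(const char *table_name);      /* bind to the bricks over the SAN */
int  ddhash_put(ddhash *h, const void *key, size_t klen,
                const void *val, size_t vlen);    /* write to all mirrors in the group */
int  ddhash_get(ddhash *h, const void *key, size_t klen,
                void *val, size_t *vlen);         /* read from any live mirror */
int  ddhash_remove(ddhash *h, const void *key, size_t klen);
void ddhash_close(ddhash *h);
```

Partitioning and mirroring stay hidden behind this interface, which is what lets the service front-end stay free of persistence logic.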
Storage bricks
• Individual nodes are storage bricks
  – consistent, atomic, network-accessible operations on a local hash table
  – uses MMAP to handle data persistence
  – no transaction support
• Clients communicate with the set of storage bricks using an RPC marshalling layer
[Storage brick diagram: a worker thread pool (one thread dispatched per request) sits atop RPC and argument marshalling, local hash table primitives, and MMAP region management (alloc() and free() implementations); messaging and an event queue run over communications abstractions with transport-specific comm. and naming (TCP and AM layers, virtual/physical node names).]
[Service front-end diagram: service application logic atop virtual-to-physical node naming and inter-node hashing, over the same communications abstractions (TCP and AM layers).]
Parallelisms service
• Provides “relevant site” information given a URL
  – an inversion of the Yahoo! directory
• Parallelisms builds an index of all URLs and returns other URLs in the same topics
  – read-mostly traffic, nearly no consistency requirements
  – large database of URLs
    • ~1 gigabyte of space for 1.5 million URLs and 80,000 topics
• The service FE itself is very simple
  – 400 semicolons of C
    • 130 for app-specific logic
    • 270 for threads, HTTP munging, …
  – hash table code: 4K semicolons of C
http://ninja.cs.berkeley.edu/~demos/paralllelisms/parallelisms.html
Some Lessons Learned
• mmap() simplified the implementation, but at a price
  + service “working sets” naturally apply
  – no pointers: breaks the usual linked list and hash table libraries (see the offset-based sketch below)
  – little control over the order of writes, so consistency cannot be guaranteed if crashes occur
    • if a node goes down, it may incur a lengthy sync before restart
• same for the abstraction libraries: simplicity with a cost
  – each storage brick could be totally independent
    • because policy is embedded in the abstraction libraries
  – bad for administration and monitoring
    • no place to “hook in” to get a view of the complete table
    • each client makes isolated decisions
      – load balancing and failure detection
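A sketch of the offset-based chaining that the “no pointers” lesson forces: records inside an mmap()ed region refer to each other by byte offset from the region base, since raw pointers are invalid after a remap or restart. Field and type names are illustrative, not the prototype's.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t region_off;            /* 0 plays the role of NULL */

typedef struct {
    region_off next;                    /* offset of the next record in this bucket */
    uint32_t   key_len, val_len;        /* key bytes, then value bytes, follow inline */
} record_hdr;

static inline record_hdr *deref(void *region_base, region_off off) {
    return off ? (record_hdr *)((char *)region_base + off) : NULL;
}
```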
More lessons learned
• the service-simplicity premise seems valid
  – Parallelisms service code is devoid of persistence logic
  – Parallelisms front-ends contain only session state
    • no recovery necessary if they fail
• interface selection is critical
  – originally, just supported put(), get(), remove()
  – wanted to support a java.util.hashtable subclass
    • needed enumerations, containsKey(), containsObject()
    • supporting these efficiently required significant replumbing (see the cursor sketch below)
• the thread subsystem was troublesome
  – the JDK has its own, and it conflicted; had to remove threads from the client-side abstraction library
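Illustrative only: the kind of cursor interface that enumeration support forces onto a partitioned table, since each brick must expose a scan position. This is not the prototype's actual API; all names are assumptions.

```c
#include <stddef.h>

typedef struct ddhash ddhash;

typedef struct {
    int    brick;         /* which partition/brick is currently being scanned */
    size_t bucket, slot;  /* scan position within that brick's local table */
} ddhash_cursor;

int ddhash_contains_key(ddhash *h, const void *key, size_t klen);
int ddhash_enum_start(ddhash *h, ddhash_cursor *c);
int ddhash_enum_next(ddhash *h, ddhash_cursor *c,
                     void *key, size_t *klen, void *val, size_t *vlen);
```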
Outline
• Prototype description
  – distributed hash table implementation
  – “Parallelisms” service and lessons
• SDDS design and research issues
• Evaluation methodology
• Segment-based I/O layer for clusters
• Related work
• Pro Forma timeline
SDDS goal: simplicity
• hypothesis: simplify construction of services
  – evidence: Parallelisms
    • distributed hash table prototype: 3000 lines of “C” code
    • service: 400 lines of “C” code, 1/3 of which is service-specific
  – evidence: Keiretsu service
    • instant messaging service between heterogeneous devices
    • crux of the service is in sharing binding/routing state
    • original: 131 lines of Java; SDDS version: 80 lines of Java
• management/operational aspects
  – to be successful, authors must want to adopt SDDSs
    • simple to incorporate and understand
    • operational management must be nearly transparent
      – node fail-over and recovery, logging, etc. behind the scenes
      – plug-and-play extensibility to add capacity
SDDS goal: generality
• potential criticism of SDDSs:
  – no matter which structures you provide, some services simply can’t be built with only those primitives
  – response: pick a basis that enables many interesting services
    • log, hash table, and tree: our guess at a good basis
    • the layered model will allow people to develop other SDDS’s
    • allow GiST-style specialization hooks?
• example services and the structures they need (HT = hash table, T = tree, L = log):
  – Web server: read-mostly HT for static documents and cache; L for hit tracking
  – Proxy cache: HT for soft-state cache; L for hit tracking
  – Search engine: HT for query cache and document table; HT, T for indexes; L for crawl data and hit log
  – PIM server: HT for repository; HT, T for indexes; L for hit log, WAL for re-filing, etc.
SDDS Research Ideas: Consistency
• consistency / performance tradeoffs
  – stricter consistency requirements imply worse performance
  – we know some intended services have weaker requirements
• rejected alternatives:
  – build only strict consistency and force people to use it
  – investigate “extended transaction models”
• our choice: pick a small set of consistency guarantees
  – level 0 (atomic but not isolated operations)
  – level 3 (ACID)
    • get help with this one - it’s a bear
SDDS Research Ideas: Consistency (2)
• replica management
  – what mechanism will we use between replicas?
    • 2-phase commit for distributed atomicity (see the sketch below)
    • log-based on-line recovery
• exploiting cluster properties
  – low network latency → fast 2-phase commit
    • especially relative to WAN latency for Internet services
  – given a good UPS, node failures are independent
    • “commit” to the memory of a peer in the group, not to disk
  – (probabilistically) engineer away network partitions
    • unavailable ⇒ failure, therefore a consensus algorithm is not needed
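A sketch of 2-phase commit between mirrors with the cluster twist named above: the commit point is every peer holding the update in memory (UPS plus independent failures), with disk writes happening in the background. The send_prepare/send_commit/send_abort messaging calls are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

bool send_prepare(int peer, const void *update, size_t len);  /* peer buffers update in RAM */
void send_commit(int peer);
void send_abort(int peer);

bool replicate_update(const int *peers, int npeers, const void *update, size_t len) {
    /* phase 1: every mirror must acknowledge the prepare */
    for (int i = 0; i < npeers; i++) {
        if (!send_prepare(peers[i], update, len)) {
            for (int j = 0; j < i; j++) send_abort(peers[j]);
            return false;
        }
    }
    /* phase 2: commit; the update is now "durable" in the peers' memories */
    for (int i = 0; i < npeers; i++)
        send_commit(peers[i]);
    return true;
}
```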
SDDS Research Ideas: load management
• data distribution affects request distribution
  – start simple: static data distribution (except for extensibility)
  – given a request, look up or hash to determine the partition
• optimizations
  – locality-aware request distribution (LARD) within replica groups (see the sketch below)
    • if no failures, replicas further “partition” data in memory
  – “front ends” often colocated with storage nodes
    • front-end selection based on data distribution knowledge
    • smart clients (Ninja redirector stubs..?)
• Issues
  – graceful degradation: RED/LRP techniques to drop requests
  – given many simultaneous requests, what is the service ordering policy?
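A sketch of the locality-aware replica choice mentioned above: within a replica group, deterministically prefer one mirror per key so each mirror keeps a disjoint slice of the partition hot in memory, falling back to any live mirror on failure. replica_alive() is an assumed failure detector, not part of the prototype.

```c
#include <stdbool.h>
#include <stdint.h>

#define REPLICAS_PER_GROUP 2

bool replica_alive(int group, int replica);   /* assumed failure detector */

int choose_replica(int group, uint32_t key_hash) {
    int preferred = (int)(key_hash % REPLICAS_PER_GROUP);
    for (int i = 0; i < REPLICAS_PER_GROUP; i++) {
        int r = (preferred + i) % REPLICAS_PER_GROUP;
        if (replica_alive(group, r))
            return r;          /* LARD-style: a key sticks to "its" mirror when possible */
    }
    return -1;                 /* whole replica group is down */
}
```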
Incremental Scalability
• logs and trees have a natural solution
  – pointers are ingrained in these structures
  – use the pointers to (re)direct the structures onto new nodes
Incremental Scalability - Hash Tables
• the hash table is the tricky one
  – why? mapping is done by client-side hash functions
    • unless the table is chained, there are no pointers inside the hash structure
    • need to change the client-side functions to scale the structure
  – Litwin’s linear hashing? (see the sketch below)
    • the client-side hash function evolves over time
    • clients independently discover when to evolve their functions
  – Directory-based map?
    • move hashing into the infrastructure (inefficient)
    • or, have the infrastructure inform clients when to change functions
      – AFS-style registration and callbacks?
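A sketch of Litwin-style linear hashing addressing: the client's view of the table is just a level and a split pointer, which evolve as buckets (and bricks) are added one at a time. Variable names are illustrative.

```c
#include <stdint.h>

typedef struct {
    uint32_t n0;     /* initial number of buckets */
    uint32_t level;  /* how many complete doublings have finished */
    uint32_t split;  /* next bucket to be split at this level */
} lh_state;

uint32_t lh_bucket(const lh_state *st, uint32_t key_hash) {
    uint32_t b = key_hash % (st->n0 << st->level);
    if (b < st->split)                                   /* bucket already split this round: */
        b = key_hash % (st->n0 << (st->level + 1));      /* use the finer-grained hash */
    return b;
}
```

The open question from the slide is how clients learn that (level, split) has advanced: lazily, as in LH*, or via infrastructure callbacks.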
Getting the Interfaces Right
• upper interfaces: sufficient generality
  – setting the bar for functionality (e.g. java.util.hashtable)
  – opportunity: reuse of existing software (e.g. Berkeley DB)
• lower interfaces: use a segment-based I/O layer?
  – log, tree: natural sequentiality, segments make sense
  – hash table is much more challenging
    • aggregating small, random accesses into large, sequential ones
    • rely on commits to other nodes’ memory
      – periodically dump deltas to disk, LFS-style (see the sketch below)
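A sketch of the LFS-style delta dumping described above: small, random hash-table updates (already committed to a peer's memory) are appended to an in-memory buffer and flushed to disk as one large sequential segment write. segment_write() is a hypothetical whole-segment call.

```c
#include <stddef.h>
#include <string.h>

#define SEGMENT_SIZE (1 << 20)          /* 1 MB segments, per the I/O layer */

static char   delta_buf[SEGMENT_SIZE];
static size_t delta_used;

int segment_write(const void *buf, size_t len);   /* assumed whole-segment write */

int log_delta(const void *delta, size_t len) {
    if (len > SEGMENT_SIZE)
        return -1;                                /* a delta never spans segments here */
    if (delta_used + len > SEGMENT_SIZE) {        /* buffer full: one big sequential write */
        if (segment_write(delta_buf, delta_used) != 0)
            return -1;
        delta_used = 0;
    }
    memcpy(delta_buf + delta_used, delta, len);
    delta_used += len;
    return 0;
}
```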
Outline
• Prototype description
  – distributed hash table implementation
  – “Parallelisms” service and lessons
• SDDS design and research issues
• Evaluation methodology
• Segment-based I/O layer for clusters
• Related work
• Pro Forma timeline
Evaluation: use real services
• metrics for success
  1) measurable reduction in complexity to author Internet services
  2) widespread adoption of SDDS by Ninja researchers
• approach 1: port/reimplement existing Ninja services
  – Keiretsu, Ninja Jukebox, the multispace Log service
  – explicitly demonstrate code reduction & performance boon
• approach 2: convince people to use SDDS for new services
  – NinjaMail, Service Discovery Service, ICEBERG services
  – will force us to tame the operational aspects of SDDS
• goal: using SDDS should be as simple as the single-node, non-persistent case
Outline
• Prototype description
  – distributed hash table implementation
  – “Parallelisms” service and lessons
• SDDS design and research issues
• Evaluation methodology
• Segment-based I/O layer for clusters
• Related work
• Pro Forma timeline
Segment layer (motivation)
• it’s all about disk bandwidth & avoiding seeks
  – 8 ms random seek, 25-80 MB/s throughput
    • must read ~320 KB per seek to break even (8 ms × 40 MB/s ≈ 320 KB)
• build a disk abstraction layer based on segments
  – 1-2 MB regions on disk, read and written in their entirety
  – force upper layers to design with this in mind
  – small reads/writes treated as an uncommon failure case
• SAN throughput is comparable to disk throughput
  – stream from disk to network and saturate both channels (see the sketch below)
    • stream through service-specific filter functions
    • selection, transformation, …
  – apply lessons from high-performance networks
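A sketch of the disk-to-network streaming path this slide motivates: segments are read in their entirety and pushed through a service-specific filter (selection, transformation) before being sent over the SAN. All of the names here are illustrative assumptions, not the layer's real API.

```c
#include <stddef.h>

#define SEGMENT_SIZE (2 << 20)   /* 2 MB segments, read and written whole */

typedef size_t (*filter_fn)(const char *in, size_t in_len,
                            char *out, size_t out_cap);

int segment_read(int seg_id, char *buf);               /* assumed whole-segment read */
int net_send(int conn, const char *buf, size_t len);   /* assumed SAN send */

int stream_segment(int seg_id, int conn, filter_fn filter,
                   char *seg_buf, char *out_buf) {
    if (segment_read(seg_id, seg_buf) != 0)
        return -1;
    size_t n = filter(seg_buf, SEGMENT_SIZE, out_buf, SEGMENT_SIZE);
    return net_send(conn, out_buf, n);
}
```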
Segment layer challenges
• thread and event model
  – the lowest-level model dictates the entire application stack
    • dependency on a particular thread subsystem is undesirable
  – asynchronous interfaces are essential
    • especially for Internet services with thousands of connections
  – potential model: VIA completion queues (see the sketch below)
• reusability for many components
  – toughest customer: the Telegraph DB
    • must dictate write ordering and be able to “roll back” modifications for aborts
    • if content is paged, make sure it is not overwritten on disk
    • no mmap()!
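A sketch of the completion-queue model named above, in the spirit of VIA: callers post asynchronous segment operations and a single loop drains a completion queue, so no thread blocks per connection. The API shown is hypothetical.

```c
#include <stdbool.h>

typedef struct {
    int   op_id;    /* which posted operation finished */
    int   status;   /* 0 on success */
    void *cookie;   /* identifies the request/connection that posted it */
} completion;

int  seg_read_async(int seg_id, void *buf, void *cookie);  /* post and return immediately */
bool cq_poll(completion *out);                             /* nonblocking dequeue */

void drain_completions(void) {
    completion c;
    while (cq_poll(&c)) {
        /* hand the finished segment back to the service request identified
           by c.cookie; errors are reported via c.status */
    }
}
```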
Segment Implementation Plan
• Two versions planned
  – one version using POSIX syscalls and a vanilla filesystem
    • definitely won’t perform well (copies to handle shadowing)
    • portable to many platforms
    • good for prototyping and getting the API right
  – one version on Linux with kernel modules for specialization
    • I/O-lite style buffer unification
    • use VIA or AM for network I/O
    • modify the VM subsystem for copy-on-write segments, and/or paging dirty data to a separate region
Outline
• Prototype description
  – distributed hash table implementation
  – “Parallelisms” service and lessons
• SDDS design and research issues
• Evaluation methodology
• Segment-based I/O layer for clusters
• Related work
• Pro Forma timeline
Related work
• (S)DSM
  – a structural element is a better atomic unit than a page
  – fault tolerance as a goal
• Distributed/networked FS [NFS, AFS, xFS, LFS, ..]
  – FS is more general, has less chance to exploit structure
  – often not in a clustered environment (except xFS, Frangipani)
• Litwin SDDS [LH, LH*, RP, RP*]
  – significant overlap in goals
  – but little implementation experience
    • little exploitation of cluster characteristics
  – consistency model not clear
Related Work (continued)
• Distributed & Parallel Databases [R*, Mariposa, Gamma, …]
  – different goal (generality in structure/queries, xacts)
  – stronger and richer semantics, but at a cost
    • both $$ and performance
• Fast I/O research [U-Net, AM, VIA, IO-lite, fbufs, x-kernel, …]
  – network and disk subsystems
    • main results: get the OS out of the way, avoid (unnecessary) copies
  – use these results in our fast I/O layer
• Cluster platforms [TACC, AS1, River, Beowulf, Glunix, …]
  – many issues we aren’t tackling
    • harvesting idle resources, process migration, single-system view
  – our work is mostly complementary
Outline
• Prototype description
  – distributed hash table implementation
  – “Parallelisms” service and lessons
• SDDS design and research issues
• Evaluation methodology
• Segment-based I/O layer for clusters
• Related work
• Pro Forma timeline
Pro Forma Timeline
• Phase 1 (0-6 months)
  – preliminary distributed log and extensible hash table SDDS
    • weak consistency for the hash table, strong for the log
  – prototype design/implementation of the segment I/O layer
    • both POSIX and Linux-optimized versions
• Phase 2 (6-12 months)
  – measure/optimize the SDDSs, improve hash table consistency
  – port existing services to these SDDS’s, help with new services
• Phase 3 (12-18 months)
  – add a tree/treap SDDS to the repertoire
  – incorporate lessons learned from phase 2 services
  – measure, optimize, measure, optimize, …, write, graduate.
Summary
• SDDS hypothesis
  – simplification of cluster-based Internet service construction
  – possible to achieve generality, simplicity, and efficiency
    • exploit properties of clusters to our benefit
  – build in a layered fashion on a segment-based I/O substrate
• Many interesting research issues
  – consistency vs. efficiency, replica management, load balancing and request distribution, extensibility, specialization, …
• Measurable success metric
  – simplification of existing Ninja services
  – adoption for use in constructing new Ninja services
Taxonomy of Clustered Services
• Stateless
  – State mgmt. requirements: little or none
  – Examples: TACC distillers, River modules, Video Gateway
• Soft-state
  – State mgmt. requirements: high availability; perhaps consistency; persistence is an optimization
  – Examples: TACC aggregators, AS1 servents or RMX, Squid web cache
• Persistent state
  – State mgmt. requirements: high availability and completeness; perhaps consistency; persistence necessary
  – Examples: Inktomi search engine, Parallelisms, Scalable PIM apps, HINDE mint
• Goal: simplify the job of constructing these classes of services
• Bulk-loading of database dominated by disk access time
–
can achieve 1500 inserts per second per node on 100 Mb/s
Ethernet cluster, if hash table fits in memory (dominant cost
is messaging layer)
–
otherwise, degrades to about 30 inserts per second (dominant
cost is disk write time)
• In steady state, all nodes operate primarily out of
memory, as the working set is fully paged in
–
similar principle to research Inktomi cluster
–
handles hundreds of queries per s. on 4 node cluster w/ 2 FE’s
Deliverables
[Layer diagram: Ninja services sit on the SDDS primitives (log, HT, HT’, tree), which sit on the I/O layer; the legend distinguishes the pieces “I will build”, “I will help build/influence”, and “I will spectate”.]