Transcript ppt

Distributed Systems
http://net.pku.edu.cn/~course/cs501/2013
Zhi Yang
School of EECS, Peking University
Load balance
Exchange server load
[Plot: load (reqs/s/volume) on a log scale from 100 to 100,000 vs. time of day.]
Everest: write off-loading for I/O peaks
Write off-loading
[Diagram: three phases (no off-loading, off-loading, reclaiming). The Everest client intercepts reads and writes to the volume; while off-loading, writes go to the Everest stores and reads are served from the volume or the stores; while reclaiming, data moves from the stores back to the volume.]
Reclaiming in the background
[Diagram: the Everest client issues a “read any” to the Everest stores, gets back <block range, version, data>, writes the data to the volume, and then sends delete(block range, version) to the stores.]
 Multiple concurrent reclaim “threads”
 Efficient utilization of disk/network resources
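A minimal sketch of this reclaim flow in Python, using toy in-memory stand-ins for the volume and the Everest stores (the class and method names here are assumptions, not the real Everest interfaces):

    # Toy reclaim loop: drain off-loaded records back to the volume, then delete them.
    class Volume:
        def __init__(self):
            self.blocks = {}                        # block range -> data

        def write(self, block_range, data):
            self.blocks[block_range] = data

    class EverestStore:
        def __init__(self):
            self.log = {}                           # (block range, version) -> data

        def offload(self, block_range, version, data):
            self.log[(block_range, version)] = data

        def read_any(self):
            """Return any off-loaded record <block range, version, data>, or None."""
            if not self.log:
                return None
            (block_range, version), data = next(iter(self.log.items()))
            return block_range, version, data

        def delete(self, block_range, version):
            self.log.pop((block_range, version), None)

    def reclaim(stores, volume):
        """Background reclaiming: read any record, write it home, delete all copies."""
        for store in stores:
            while (record := store.read_any()) is not None:
                block_range, version, data = record
                volume.write(block_range, data)
                for s in stores:
                    s.delete(block_range, version)

    volume, store = Volume(), EverestStore()
    store.offload("blocks 7-9", version=1, data=b"...")
    reclaim([store], volume)
    print(volume.blocks)                            # {'blocks 7-9': b'...'}

Several such loops can run concurrently, one per store, which is the “multiple concurrent reclaim threads” point above.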
Scaling with #stores
[Diagram: an OLTP client drives the SQL Server binary; the Everest client is interposed via Detours DLL redirection and off-loads writes over the LAN to multiple stores, each with its own log, alongside the data volume.]
Circular on-disk log
[Diagram: a store’s circular on-disk log with HEAD and TAIL pointers; writes append block ranges (e.g., 7-9) at the head, reclaim consumes entries, superseded entries are marked invalid (X), and the store’s disk is spun up when needed.]
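A toy circular log in Python to make the head/tail mechanics concrete; the fixed-size list standing in for the disk and the capacity of 8 are arbitrary assumptions:

    # Toy circular log: head advances on writes, tail advances on reclaim.
    class CircularLog:
        def __init__(self, capacity):
            self.slots = [None] * capacity
            self.head = 0            # next slot to write
            self.tail = 0            # oldest un-reclaimed record
            self.size = 0

        def write(self, record):
            if self.size == len(self.slots):
                raise RuntimeError("log full: reclaim before writing")
            self.slots[self.head] = record
            self.head = (self.head + 1) % len(self.slots)
            self.size += 1

        def reclaim(self):
            """Pop the oldest record so it can be written back to the volume."""
            if self.size == 0:
                return None
            record = self.slots[self.tail]
            self.slots[self.tail] = None
            self.tail = (self.tail + 1) % len(self.slots)
            self.size -= 1
            return record

    log = CircularLog(8)
    log.write(("blocks 7-9", 1, b"data"))
    print(log.reclaim())             # ('blocks 7-9', 1, b'data')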
Why not always off-load?
[Diagram: two OLTP clients drive SQL Server 1 and SQL Server 2; reads and writes flow through the Everest client to the data volumes and to the stores.]
Architectural styles: Layered style
Observation: the layered style is commonly used for client-server systems.
2.2 System architecture
Another Description of 3-Tier Architecture
Physical 3-Tiered architecture
 Horizontally distributed servers may talk to each other.
E.g., Internet Search Engine
3-Tier Example: Clients Invoke Individual Servers
[Diagram: client processes invoke a Web server process, which invokes storage server processes; each process runs on its own computer.]
 An example of horizontal distribution of a Web service.
[Diagram: replicated Web servers, each with its own disks.]
Stateless Servers
 Never keep accurate information about the status of a
client after having handled a request:
 Don’t record whether a file has been opened (simply close it
again after access)
 Don’t promise to invalidate a client’s cache
 Don’t keep track of your clients
 Consequences:
 Clients and servers are completely independent
 State inconsistencies due to client or server crashes are
reduced
 Possible loss of performance
 because, e.g., a server cannot anticipate client behavior (think of
prefetching file blocks)
Stateful Servers
 Keeps track of the status of its clients:
 Record that a file has been opened, so that
prefetching can be done
 Knows which data a client has cached, and allows
clients to keep local copies of shared data
 Observation: The performance of stateful
servers can be extremely high, provided clients
are allowed to keep local copies.
 Session state vs. permanent state
 As it turns out, reliability is not a major problem.
Thin Client
The X Window System
 Protocol tends to be heavyweight.
 Other examples of similar systems?
 VNC
 Remote desktop
Why the hype now?
 Desktops are out of control, users load and run whatever
they like on “their” workstations
Hard to support, no standard software
Hard to manage configuration, no telling what’s installed
Risk of data loss, company data not safe
 Businesses need to “prove” compliance with IT security
policies
Shared Services
 Many users share one machine
A user can “run away” with RAM or CPU
Different users may need different apps
that conflict with each other
 Can be relatively simple to deploy
 Sun Global Desktop, Citrix XenApp, Windows Terminal
Services
Shared Services
(a la Terminal Services, XenApp)
Desktop Virtualization
 Similar to shared services but:
 Each user gets “their own” virtual machine
 Machines can be spawned on demand
from a golden image
 Desktop controller server manages user
connections, VM power states, load
balancing
 Users CAN share a machine if appropriate
 Citrix XenDesktop, VMware View
Desktop Virtualization Production Example – VMware View
P2P Systems
Use the vast resources of
machines at the edge of the
Internet to build a network
that allows resource sharing
without any central authority.
More than a system for
sharing pirated music/movies
Characteristics of P2P Systems
 Exploit edge resources.
 Storage, content, CPU, Human presence.
 Significant autonomy from any centralized
authority.
 Each node can act as a Client as well as a Server.
 Resources at edge have intermittent connectivity,
constantly being added & removed.
 Infrastructure is untrusted and the components are
unreliable.
Client/Server Architecture
 Well known, powerful, reliable server is a data source
 Clients request data from server
 Very successful model
 WWW (HTTP), FTP, Web services, etc.
[Diagram: clients connect to the server across the Internet.]
* Figure from http://project-iris.net/talks/dht-toronto-03.ppt
P2P Architecture
 All nodes are both clients and servers
 Provide and consume data
 Any node can initiate a connection
 No centralized data source
 “The ultimate form of democracy on the Internet”
 “The ultimate threat to copyright protection on the Internet”
[Diagram: nodes connect to one another across the Internet.]
* Content from http://project-iris.net/talks/dht-toronto-03.ppt
2.2 Summary of P2P
Centralized P2P (Napster):
A central server acts as the index for addressing
Pure P2P (Gnutella 0.4):
No central entities
Any terminal can be removed
Hybrid P2P (Gnutella 0.6):
Dynamic central entities
Any terminal can be removed
Napster
Further services:
 Chat program, instant messaging service,
tracking program,…
Centralized system
 Single point of failure => limited fault tolerance
 Limited scalability (server farms with load
balancing)
Query is fast and an upper bound on its duration can be given
Napster
[Diagram: peers and a central DB. 1. Query goes to the central DB; 2. Response names the peer holding the file; 3. Download request goes to that peer; 4. File is transferred directly between peers.]
BitTorrent: Overview
Seeder – peer having the entire file
Leecher – peer downloading the file
File.torrent contains:
- URL of tracker
- File name
- File length
- Chunk length
- Checksum for each chunk (SHA1 hash)
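As a rough illustration, the sketch below gathers exactly the fields listed above into a plain Python dict (the real .torrent format is bencoded; the 256 KB chunk length and the function name are made up):

    import hashlib, os

    def make_torrent_metadata(path, tracker_url, chunk_length=256 * 1024):
        """Collect the fields a .torrent carries: tracker URL, name, length,
        chunk length and one SHA1 checksum per chunk (simplified, not bencoded)."""
        checksums = []
        with open(path, "rb") as f:
            while chunk := f.read(chunk_length):
                checksums.append(hashlib.sha1(chunk).hexdigest())
        return {
            "tracker_url": tracker_url,
            "file_name": os.path.basename(path),
            "file_length": os.path.getsize(path),
            "chunk_length": chunk_length,
            "chunk_checksums": checksums,
        }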
Overlay Network
A P2P network is an overlay network. Each link
between peers consists of one or more IP links.
Overlays : All in the application layer
Tremendous design
flexibility
 Topology, maintenance
 Message types
 Protocol
 Messaging over TCP or UDP
Underlying physical
network is transparent to
developer
 But some overlays exploit
proximity
Overlay Graph
Virtual edge
 TCP connection
 or simply a pointer to an IP address
Overlay maintenance
 Periodically ping to make sure neighbor is still
alive
 Or verify aliveness while messaging
 If neighbor goes down, may want to establish
new edge
 New incoming node needs to bootstrap
 Could be a challenge under high rate of churn
 Churn : dynamic topology and intermittent access due
to node arrival and failure
Overlay Graph
 Unstructured overlays
e.g., new node randomly chooses
existing nodes as neighbors
 Structured overlays
e.g., edges arranged in restrictive
structure
Gnutella
 pure peer-to-peer
 very simple protocol
 no routing "intelligence"
 Constrained broadcast
Life-time of packets limited by TTL
(typically set to 7)
Packets have unique ids to detect
loops
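A small Python sketch of constrained broadcast over an assumed adjacency-list overlay: every packet carries a unique id and a TTL, duplicates are dropped, and forwarding stops once the TTL is used up (function calls stand in for network messages):

    # Constrained broadcast over an overlay graph (adjacency list assumed).
    import uuid

    def flood(overlay, start, payload, ttl=7, seen=None, msg_id=None):
        """Forward (msg_id, payload) to neighbours until the TTL runs out;
        message ids let nodes detect and drop looping packets."""
        if msg_id is None:
            msg_id = uuid.uuid4()
            seen = set()
        if ttl == 0 or (start, msg_id) in seen:
            return
        seen.add((start, msg_id))
        print(f"node {start} handles {payload} (ttl={ttl})")
        for neighbour in overlay.get(start, []):
            flood(overlay, neighbour, payload, ttl - 1, seen, msg_id)

    overlay = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
    flood(overlay, 1, "query: file.mp3", ttl=3)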
Gnutella - PING/PONG
[Diagram: node 1 sends Ping messages that are flooded through its neighbours; Pong replies (e.g., Pong 3,4,5 and Pong 6,7,8) travel back along the reverse path, so node 1 fills its Known Hosts list with the discovered peers.]
Query/Response works analogously.
Free riding
File sharing networks rely on users sharing
data
Two types of free riding
 Downloading but not sharing any data
 Not sharing any interesting data
On Gnutella
 15% of users contribute 94% of content
 63% of users never responded to a query (they didn’t have “interesting” data)
KaZaA
 Hierarchical approach between
Gnutella and Napster
 Two-layered architecture.
 Powerful nodes (supernodes) act as
local index servers, and client queries
are propagated to other supernodes.
 Each supernode manages around 100-150 children
 Each supernode connects to 30-50
other supernodes
 More efficient lookup than Gnutella and more scalable than Napster
Structured Overlay Networks / DHTs
Chord, Pastry, Tapestry, CAN,
Kademlia, P-Grid, Viceroy
[Diagram: keys of nodes and keys of values are hashed into a common identifier space; node identifiers and value identifiers share this space, and the nodes are connected smartly.]
Hash Tables
 Store arbitrary keys and
satellite data (value)
 put(key,value)
 value = get(key)
 Lookup must be fast
 Calculate hash function
h() on key that returns a
storage cell
 Chained hash table: Store
key (and optional value)
there
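A minimal chained hash table along these lines; Python’s built-in hash() plays the role of h(), and the bucket count of 16 is arbitrary:

    # Minimal chained hash table: put/get with separate chaining.
    class ChainedHashTable:
        def __init__(self, num_buckets=16):
            self.buckets = [[] for _ in range(num_buckets)]

        def _bucket(self, key):
            return self.buckets[hash(key) % len(self.buckets)]   # h(key) -> storage cell

        def put(self, key, value):
            bucket = self._bucket(key)
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)      # overwrite an existing key
                    return
            bucket.append((key, value))

        def get(self, key):
            for k, v in self._bucket(key):
                if k == key:
                    return v
            raise KeyError(key)

    table = ChainedHashTable()
    table.put("song.mp3", "host 10.0.0.7")
    print(table.get("song.mp3"))                  # host 10.0.0.7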
Distributed Hash Table
The Principle Of Distributed Hash Tables
 A dynamic distribution of a hash table onto a set of cooperating nodes

Key   Value
1     Algorithms
9     Routing
11    DS
12    Peer-to-Peer
21    Networks
22    Grids

[Diagram: the table entries are partitioned across nodes A, B, C and D; any node can resolve a key, e.g. Node D : lookup(9).]
• Basic service: lookup operation
• Key resolution from any node
• Each node has a routing table
• Pointers to some other nodes
• Typically, a constant or a logarithmic number of pointers
Chord [MIT]
consistent hashing (SHA-1) assigns each node and object an m-bit ID
IDs are ordered in an ID circle ranging from 0 to 2^m - 1
New nodes assume slots in the ID circle according to their ID
Key k is assigned to the first node whose ID ≥ k
 successor(k)
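A compact sketch of successor(k) with SHA-1-based IDs; the node names and the choice of m = 8 bits are made up for illustration:

    import hashlib
    from bisect import bisect_left

    M = 8                                         # m-bit identifier space (2**M slots)

    def chord_id(name, m=M):
        return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** m)

    def successor(key_id, node_ids):
        """First node whose ID >= key_id, wrapping around the identifier circle."""
        ring = sorted(node_ids)
        i = bisect_left(ring, key_id)
        return ring[i % len(ring)]                # wrap to the smallest ID if needed

    nodes = [chord_id(n) for n in ("alpha", "bravo", "charlie", "delta")]
    k = chord_id("some-file.txt")
    print(f"key {k} is stored on node {successor(k, nodes)}")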
Consistent Hashing - Successor Nodes
[Diagram: an identifier circle 0-7 with nodes 0, 1 and 3; key 1 maps to successor(1) = 1, key 2 maps to successor(2) = 3, and key 6 maps to successor(6) = 0.]
Consistent Hashing – Join and
Departure
 When a node n joins the network, certain keys previously
assigned to n’s successor now become assigned to n.
 When node n leaves the network, all of its assigned keys are
reassigned to n’s successor.
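The two rules translate into a few lines of Python; below, assignment maps each node ID to its set of keys and the numbers reuse the running example above (nodes 0, 1, 3 with keys 1, 2, 6):

    # Toy model of key movement on join/leave; assignment maps node ID -> set of keys.
    def in_interval(x, lo, hi):
        """True if x lies in (lo, hi] going clockwise on the identifier circle."""
        return lo < x <= hi if lo < hi else (x > lo or x <= hi)

    def successor_of(node_id, ring):
        return next((n for n in ring if n > node_id), ring[0])

    def join(new_node, assignment):
        """A joining node takes over, from its successor, the keys that now map to it."""
        ring = sorted(assignment)
        succ = successor_of(new_node, ring)
        pred = max((n for n in ring if n < new_node), default=ring[-1])
        moved = {k for k in assignment[succ] if in_interval(k, pred, new_node)}
        assignment[succ] -= moved
        assignment[new_node] = moved

    def leave(node, assignment):
        """A leaving node hands all of its keys to its successor."""
        keys = assignment.pop(node)
        assignment[successor_of(node, sorted(assignment))] |= keys

    assignment = {0: {6}, 1: {1}, 3: {2}}     # nodes 0, 1, 3 holding keys 6, 1, 2
    join(6, assignment)                        # node 6 joins and takes key 6 from node 0
    print(assignment)                          # {0: set(), 1: {1}, 3: {2}, 6: {6}}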
Consistent Hashing – Node Join
[Diagram: on the identifier circle 0-7, a node joins at ID 6; the keys that now map to it move from its successor to the new node.]
Consistent Hashing – Node Departure
[Diagram: a node leaves the identifier circle 0-7 and all of its keys are reassigned to its successor.]
Scalable Key Location – Finger Tables
 Each node n maintains a routing table with up to m entries (m being the number of bits in the identifiers), called the finger table.
 The ith entry in the table at node n contains the identity of the first node s that succeeds n by at least 2^(i-1) on the identifier circle.
 s = successor(n + 2^(i-1))
 s is called the ith finger of node n, denoted by n.finger(i)
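The definition above translates almost directly into code; this sketch rebuilds the finger tables for the 3-bit example with nodes 0, 1 and 3 that follows:

    M = 3
    NODES = [0, 1, 3]

    def successor(ident, nodes=NODES, m=M):
        """First node whose ID >= ident on the 2**m identifier circle."""
        ident %= 2 ** m
        candidates = sorted(nodes)
        return next((n for n in candidates if n >= ident), candidates[0])

    def finger_table(n, m=M):
        """ith finger of node n is successor(n + 2**(i-1)), for i = 1..m."""
        return [(n + 2 ** (i - 1), successor(n + 2 ** (i - 1))) for i in range(1, m + 1)]

    for n in NODES:
        print(n, finger_table(n))
    # 0 [(1, 1), (2, 3), (4, 0)]
    # 1 [(2, 3), (3, 3), (5, 0)]
    # 3 [(4, 0), (5, 0), (7, 0)]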
Scalable Key Location – Finger Tables (example: m = 3, nodes 0, 1, 3)

Finger table of node 0 (keys: 6)
start        succ.
0+2^0 = 1    1
0+2^1 = 2    3
0+2^2 = 4    0

Finger table of node 1 (keys: 1)
start        succ.
1+2^0 = 2    3
1+2^1 = 3    3
1+2^2 = 5    0

Finger table of node 3 (keys: 2)
start        succ.
3+2^0 = 4    0
3+2^1 = 5    0
3+2^2 = 7    0
Chord key location
Look up in the finger table the furthest node that precedes the key
-> O(log n) hops
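A self-contained sketch of that lookup rule for the same 3-bit example: at each hop, jump to the furthest finger that precedes the key; the hop count it reports illustrates the O(log n) behaviour (this is a simplification of the real Chord protocol, not a faithful reimplementation):

    M = 3
    NODES = [0, 1, 3]

    def successor(ident):
        ident %= 2 ** M
        return next((n for n in sorted(NODES) if n >= ident), min(NODES))

    def fingers(n):
        return [successor(n + 2 ** i) for i in range(M)]        # fingers 1..m

    def between(x, a, b):
        """True if x lies in (a, b) going clockwise on the circle."""
        return a < x < b if a < b else (x > a or x < b)

    def lookup(start, key):
        """Hop to the furthest finger preceding the key until its successor is reached."""
        node, hops = start, 0
        while successor(key) != node:
            nxt = node
            for f in reversed(fingers(node)):                   # furthest finger first
                if between(f, node, key):
                    nxt = f
                    break
            if nxt == node:                                     # no closer finger: final hop
                nxt = successor(key)
            node, hops = nxt, hops + 1
        return node, hops

    print(lookup(3, key=1))   # (1, 2): key 1 lives on node 1, reached in 2 hops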
An Example of CAN
[Diagram sequence: nodes 1, 2, 3, 4, … join one after another, each taking over part of the 2-D coordinate space, so the space ends up partitioned into one zone per node.]
An Example of CAN (cont) – insert
node I::insert(K,V)
(1) a = hx(K)
    b = hy(K)
(2) route(K,V) -> (a,b)
(3) (a,b) stores (K,V)
An Example of CAN (cont) – retrieve
node J::retrieve(K)
(1) a = hx(K)
    b = hy(K)
(2) route “retrieve(K)” to (a,b)
[Diagram: the request reaches the node whose zone contains (a,b), which holds (K,V).]
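A toy version of insert/retrieve in Python: two hash functions map K to a point (a, b) in the unit square, and each node owns a rectangular zone (the three-node zone layout below is hard-coded for illustration; real CAN nodes obtain their zones by splitting):

    import hashlib

    def h(key, salt):
        """Hash a key to a coordinate in [0, 1)."""
        digest = hashlib.sha1(f"{salt}:{key}".encode()).hexdigest()
        return int(digest, 16) / 16 ** len(digest)

    def point_for(key):
        return h(key, "x"), h(key, "y")                 # (a, b) = (hx(K), hy(K))

    # Each node owns a rectangular zone (x0, x1, y0, y1); this layout is made up.
    ZONES = {
        "node-1": (0.0, 0.5, 0.0, 1.0),
        "node-2": (0.5, 1.0, 0.0, 0.5),
        "node-3": (0.5, 1.0, 0.5, 1.0),
    }

    def owner(point):
        a, b = point
        for node, (x0, x1, y0, y1) in ZONES.items():
            if x0 <= a < x1 and y0 <= b < y1:
                return node

    storage = {node: {} for node in ZONES}

    def insert(key, value):
        node = owner(point_for(key))                    # route(K,V) -> (a,b)
        storage[node][key] = value                      # the zone owner stores (K,V)

    def retrieve(key):
        return storage[owner(point_for(key))].get(key)  # route "retrieve(K)" to (a,b)

    insert("video.avi", "peer 10.1.2.3")
    print(retrieve("video.avi"))                        # peer 10.1.2.3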
Routing in CAN
[Diagram: a message addressed to point (a,b) is forwarded hop by hop through neighbouring zones, from the node whose zone contains (x,y) toward the node that owns (a,b).]
Motivation
 Online storage services are getting increasingly popular
 Amazon’s S3, EMC’s Mozy …
 Rely on data centers.
 Challenges
 Threatened by the single point of failure.
 Amazon suffers outages (3 times); Gmail is down (4 times) …
 Social networks make downtime harder to hide.
 Incur high hardware, network and cooling costs.
 P2P storage
 Uses idle resources of users to avoid these costs
 Provides low availability because of churn
Motivation
 Best of both worlds
 Stability of data center
 Low cost of P2P
[Diagram: a data center combined with a P2P storage layer.]
AmazingStore Design
 Combine data center and P2P storage system
[Diagram: masters in the data center and a DHT among the peers both track the location of replicas; the diagram also shows the case of a data center outage.]
Availability Improvement
 Overall availability jumps from 93.22% to 99.13%
 Availability gained at peer side is 83.8%
[Plot: fraction of available objects vs. date (05/25-05/28), showing server-side, peer-side, and system availability; the P2P layer compensates during a power failure on the server side.]
Erasure codes
AmazingStore Design
 Protect data security
 Encrypted with the key specified by the user.
 Periodically checks peer eviction.
Bloom Filter
[Diagram: a bit array, initially all 0, after inserting elements x1 and x2 and querying y.]
Each element of S is hashed k times
Each hash location is set to 1
To check if y is in S, check its k hash locations
If a 0 appears, y is not in S
If only 1s appear, conclude that y is in S
This may yield a false positive
Bloom Filter
[Diagram: x is hashed by h1(x), h2(x), h3(x), …, hk(x) into the bit vector V0 … Vm-1, and the corresponding bits are set to 1.]
Bloom Errors
[Diagram: after inserting a, b, c and d, the query x maps via h1(x) … hk(x) onto bits of V0 … Vm-1 that are all already set.]
x didn’t appear, yet its bits are already set
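The scheme above in a few lines of Python; the sizes m = 1024 and k = 4 are arbitrary, and deriving the k hash functions from one SHA-1 digest (double hashing) is an implementation choice, not something the slide prescribes:

    import hashlib

    class BloomFilter:
        def __init__(self, m=1024, k=4):
            self.m, self.k = m, k
            self.bits = [0] * m

        def _positions(self, item):
            # derive k hash positions from one SHA-1 digest (double hashing)
            digest = hashlib.sha1(str(item).encode()).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:16], "big") | 1
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = 1                       # set each hash location to 1

        def __contains__(self, item):
            # a single 0 proves absence; all 1s may still be a false positive
            return all(self.bits[pos] for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("x1"); bf.add("x2")
    print("x1" in bf, "y" in bf)                         # True, (almost certainly) False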
ICP – With Summary Cache
[Diagram: a client, several cooperating proxy caches, and the Internet; each proxy keeps compact summaries (Bloom filters) of the other proxies’ cache contents.]
Outline
 2.1 Architectural styles
 2.2 System architectures
 2.3 Architectures versus middleware
2.3.1 Interceptor
2.3.2 General Approaches to adaptive
software
2.3.3 Discussion
 2.4 Self-management in distributed systems
 We have talked about the physical architecture.
 Does middleware also have an architectural style?
If it does, how does it affect flexibility,
extensibility?
Sometimes, the “native” style may not be
optimal.
 Can we build messaging over RPC?
 Can we build RPC over messaging?
Interceptors
 Request-level interceptors could handle replication.
 Message-level interceptors could handle fragmentation.
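A rough sketch of a request-level interceptor that handles replication: a wrapper object intercepts each invocation before it leaves the client and fans it out to all replicas (the remote call is simulated by a local method call, and all names are invented):

    # Request-level interceptor that replicates an invocation to several replicas.
    class Replica:
        def __init__(self, name):
            self.name = name
            self.state = {}

        def invoke(self, method, *args):
            print(f"{self.name}: {method}{args}")
            if method == "put":
                key, value = args
                self.state[key] = value
                return "ok"
            return self.state.get(args[0])

    class ReplicationInterceptor:
        """Intercepts each request and forwards it to every replica."""
        def __init__(self, replicas):
            self.replicas = replicas

        def invoke(self, method, *args):
            results = [r.invoke(method, *args) for r in self.replicas]
            return results[0]                 # hand one result back to the caller

    proxy = ReplicationInterceptor([Replica("replica-A"), Replica("replica-B")])
    proxy.invoke("put", "x", 42)              # the caller sees a single object
    print(proxy.invoke("get", "x"))           # 42

A message-level interceptor would sit lower in the stack and could, analogously, split one large message into fragments before sending.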
Adaptive Middleware
 Separation of concerns: Try to separate extra functionalities and
later weave them together into a single implementation ⇒ only
toy examples so far.
 Computational reflection: Let a program inspect itself at runtime
and adapt/change its settings dynamically if necessary ⇒
mostly at language level and applicability unclear.
 Component-based design: Organize a distributed application
through components that can be dynamically replaced when
needed ⇒ highly complex, also many intercomponent
dependencies.
 Observation: Do we need adaptive software at all, or is the issue
adaptive systems?
Outline
 2.1 Architectural styles
 2.2 System architectures
 2.3 Architectures versus middleware
 2.4 Self-management in distributed systems
Self-managing Distributed
Systems
 Observation: Distinction between system and software
architectures blurs when automatic adaptivity needs to
be taken into account:
Self-configuration
Self-managing
Self-healing
Self-optimizing
Self-*
 Note: There is a lot of hype going on in this field of
autonomic computing.
Feedback Control Model
 Observation: In many cases, self-* systems are organized
as a feedback control system:
Example: Differentiating Replication
Strategies in Globule
 Globule: Collaborative CDN that analyzes traces to
decide where replicas of Web content should be placed.
Decisions are driven by a general cost model:
cost = (w1 × m1) + (w2 × m2) + · · · + (wn × mn)
 Globule origin server collects traces and does what-if analysis by
checking what would have happened if page P would have been
placed at edge server S.
 Many strategies are evaluated, and the best one is chosen.
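A tiny what-if evaluation in the shape of the cost formula above: each candidate strategy yields metric values m1..mn, the weighted sum is computed, and the cheapest strategy is chosen (the strategy names, weights and metric values below are invented purely for illustration):

    # Weighted cost model: cost = w1*m1 + w2*m2 + ... + wn*mn

    weights = {"latency": 0.5, "bandwidth": 0.3, "staleness": 0.2}

    # Hypothetical metrics produced by trace-driven what-if analysis, per strategy.
    candidate_strategies = {
        "no-replication":      {"latency": 9.0, "bandwidth": 1.0, "staleness": 0.0},
        "full-replication":    {"latency": 2.0, "bandwidth": 8.0, "staleness": 3.0},
        "replicate-on-demand": {"latency": 4.0, "bandwidth": 3.0, "staleness": 1.0},
    }

    def cost(metrics):
        return sum(weights[name] * value for name, value in metrics.items())

    best = min(candidate_strategies, key=lambda s: cost(candidate_strategies[s]))
    for name, metrics in candidate_strategies.items():
        print(f"{name:22s} cost = {cost(metrics):.2f}")
    print("chosen strategy:", best)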
[Figure: the dependency between prediction accuracy and trace length.]
Summary
 Architectural styles
 System architectures
 Architectures versus middleware
 Self-management in distributed systems