Transcript ppt
Distributed Systems
http://net.pku.edu.cn/~course/cs501/2013
Zhi Yang
School of EECS, Peking University
Load balance
[Figure: Exchange server load over time of day; y-axis: load (reqs/s/volume), log scale from 100 to 100,000]
Everest: write off-loading for I/O peaks
Write off-loading
[Diagram: the Everest client sits in front of the volume; with no off-loading, reads and writes go to the volume; during off-loading, writes go to Everest stores and reads may be served from Everest stores; reclaiming moves off-loaded data back to the volume]
Reclaiming in the background
[Diagram: the Everest client issues "read any" to an Everest store, receives <block range, version, data>, writes the data back to the volume, then sends delete(block range, version) to the store]
Multiple concurrent reclaim "threads"
Efficient utilization of disk/network resources
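A minimal sketch of such a reclaim loop, assuming hypothetical Store and Volume objects with read_any/write/delete methods (these names are illustrative, not the Everest interfaces):

```python
# Hedged sketch: Store/Volume and their methods are invented stand-ins.
import threading

def reclaim_worker(lock, stores, volume):
    """One reclaim "thread": pull off-loaded records back to the volume until none remain."""
    while True:
        record, source = None, None
        for store in stores:
            record = store.read_any()            # <block range, version, data> or None
            if record is not None:
                source = store
                break
        if record is None:
            return                               # nothing left to reclaim
        block_range, version, data = record
        with lock:
            volume.write(block_range, data)      # copy the data back to the volume
        source.delete(block_range, version)      # then drop it from the store

def reclaim(stores, volume, n_threads=4):
    """Multiple concurrent reclaim "threads" keep disk and network resources busy."""
    lock = threading.Lock()
    workers = [threading.Thread(target=reclaim_worker, args=(lock, stores, volume))
               for _ in range(n_threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```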
Scaling with #stores
[Diagram: an OLTP client drives the SQL Server binary; Detours DLL redirection interposes the Everest client, which off-loads over the LAN to several stores, each holding a log and data]
Circular on-disk log
[Diagram: blocks arranged in a circular log between HEAD and TAIL; Write appends new entries at the head, Reclaim consumes entries from the tail, stale entries are marked invalid (X); labelled operations: Write, Reclaim, Spin up]
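An illustrative in-memory model of such a circular log (the real structure is on disk; the class and method names here are invented for the sketch):

```python
class CircularLog:
    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0          # next slot to write
        self.tail = 0          # oldest live entry
        self.size = 0

    def write(self, entry):
        if self.size == len(self.slots):
            raise RuntimeError("log full: reclaim before writing")
        self.slots[self.head] = entry
        self.head = (self.head + 1) % len(self.slots)
        self.size += 1

    def reclaim(self):
        """Remove and return the oldest entry (to be written back to the volume)."""
        if self.size == 0:
            return None
        entry = self.slots[self.tail]
        self.slots[self.tail] = None
        self.tail = (self.tail + 1) % len(self.slots)
        self.size -= 1
        return entry

log = CircularLog(8)
for block in (7, 8, 9):
    log.write(block)
print(log.reclaim())   # 7 -- reclaim consumes from the tail
```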
Why not always off-load?
[Diagram: two OLTP clients drive SQL Server 1 and SQL Server 2; only SQL Server 1 runs the Everest client; both servers issue reads and writes against their data volumes and the stores]
Architectural styles: Layered style
Observation: the layered style is commonly used for client-server systems.
2.1 Architectural styles
2.2 System architecture
Another Description of 3-Tier Architecture
Physical 3-Tiered architecture
Horizontally distributed servers may talk to each other.
E.g., Internet Search Engine
3-Tier Example: Clients Invoke Individual Servers
[Diagram: two clients invoke a Web server, which invokes two storage servers; the legend distinguishes processes from computers]
An example of horizontal distribution of a Web service.
Stateless Servers
Never keep accurate information about the status of a
client after having handled a request:
Don’t record whether a file has been opened (simply close it
again after access)
Don’t promise to invalidate a client’s cache
Don’t keep track of your clients
Consequences:
Clients and servers are completely independent
State inconsistencies due to client or server crashes are
reduced
Possible loss of performance because, e.g., a server cannot anticipate client behavior (think of prefetching file blocks)
Stateful Servers
Keeps track of the status of its clients:
Record that a file has been opened, so that
prefetching can be done
Knows which data a client has cached, and allows
clients to keep local copies of shared data
Observation: The performance of stateful
servers can be extremely high, provided clients
are allowed to keep local copies.
Session state vs. permanent state
As it turns out, reliability is not a major problem.
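A toy contrast between the two styles, with invented class names (not any particular file-server protocol):

```python
# Illustrative only: shows where per-client state lives in each style.
class StatelessFileServer:
    def read(self, path, offset, length):
        with open(path, "rb") as f:          # open and close on every request:
            f.seek(offset)                   # no per-client state survives the call
            return f.read(length)

class StatefulFileServer:
    def __init__(self):
        self.open_files = {}                 # per-client state enables prefetching

    def open(self, client, path):
        self.open_files[(client, path)] = open(path, "rb")

    def read(self, client, path, length):
        f = self.open_files[(client, path)]
        return f.read(length)                # sequential reads from a kept-open file

    def close(self, client, path):
        self.open_files.pop((client, path)).close()
```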
Thin Client
The X Window System
Protocol tends to be heavyweight.
Other examples of similar systems?
VNC
Remote desktop
Why the hype now?
Desktops are out of control, users load and run whatever they like on “their” workstations
Hard to support, no standard software
Hard to manage configuration, no telling what’s installed
Risk of data loss, company data not safe
Businesses need to “prove” compliance with IT security policies
Shared Services
Many users share one machine
A user can “run away” with RAM or CPU
Different users may need different apps that conflict with each other
Can be relatively simple to deploy
Sun Global Desktop, Citrix XenApp, Windows Terminal Services
Shared Services
(a la Terminal Services, XenApp)
Desktop Virtualization
Similar to shared services but:
Each user gets “their own” virtual machine
Machines can be spawned on demand from a golden image
Desktop controller server manages user connections, VM power states, load balancing
Users CAN share a machine if appropriate
Citrix XenDesktop, VMware View
Desktop Virtualization: Production Example – VMware View
P2P Systems
Use the vast resources of machines at the edge of the Internet to build a network that allows resource sharing without any central authority.
More than a system for sharing pirated music/movies
Characteristics of P2P Systems
Exploit edge resources.
Storage, content, CPU, Human presence.
Significant autonomy from any centralized authority.
Each node can act as a Client as well as a Server.
Resources at the edge have intermittent connectivity, constantly being added & removed.
Infrastructure is untrusted and the components are unreliable.
Client/Server Architecture
A well-known, powerful, reliable server is a data source; clients request data from the server.
A very successful model: WWW (HTTP), FTP, Web services, etc.
[Diagram: several clients connected to a server over the Internet]
* Figure from http://project-iris.net/talks/dht-toronto-03.ppt
P2P Architecture
All nodes are both clients and servers: they provide and consume data, and any node can initiate a connection.
No centralized data source.
“The ultimate form of democracy on the Internet”
“The ultimate threat to copyright protection on the Internet”
[Diagram: nodes connected to each other over the Internet]
* Content from http://project-iris.net/talks/dht-toronto-03.ppt
2.2 Summary of P2P
Centralized P2P (Napster): a central server acts as the index for addressing
Pure P2P (Gnutella 0.4): no central entities; any terminal can be removed
Hybrid P2P (Gnutella 0.6): dynamic central entities; any terminal can be removed
Napster
Further services: chat program, instant messaging service, tracking program, …
Centralized system:
Single point of failure => limited fault tolerance
Limited scalability (server farms with load balancing)
Query is fast and an upper bound for its duration can be given
Napster
[Diagram: 1. a peer sends a Query to the central DB; 2. the DB returns a Response listing peers holding the file; 3. the peer sends a Download Request to one of them; 4. the File is transferred directly between peers]
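A small sketch of this centralized-index interaction; the class and method names are hypothetical, not Napster's actual protocol:

```python
# Hedged sketch of a Napster-style centralized index.
class CentralDB:
    def __init__(self):
        self.index = {}                       # filename -> set of peer addresses

    def register(self, peer_addr, filenames):
        for name in filenames:
            self.index.setdefault(name, set()).add(peer_addr)

    def query(self, filename):                # steps 1/2: query and response
        return sorted(self.index.get(filename, ()))

class Peer:
    def __init__(self, addr, files):
        self.addr, self.files = addr, dict(files)

    def download_request(self, filename):     # steps 3/4: request and file transfer
        return self.files[filename]

db = CentralDB()
p = Peer("peer-2", {"song.mp3": b"...bytes..."})
db.register(p.addr, p.files)
holders = db.query("song.mp3")                # ["peer-2"]
data = p.download_request("song.mp3")         # direct peer-to-peer transfer
```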
BitTorrent : Overview
File.torrent:
- URL of tracker
- File name
- File length
- Chunk length
- Checksum for each chunk (SHA-1 hash)
Seeder – peer having the entire file
Leecher – peer downloading the file
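A sketch of how per-chunk SHA-1 checksums let a downloader verify each chunk; the dictionary below is an illustrative stand-in for the metadata, not the real bencoded .torrent format:

```python
import hashlib

CHUNK_LEN = 256 * 1024          # chunk length recorded in the metadata

def make_metadata(data, tracker_url, name):
    """Illustrative stand-in for a .torrent file."""
    chunks = [data[i:i + CHUNK_LEN] for i in range(0, len(data), CHUNK_LEN)]
    return {
        "tracker": tracker_url,
        "name": name,
        "length": len(data),
        "chunk_length": CHUNK_LEN,
        "checksums": [hashlib.sha1(c).hexdigest() for c in chunks],
    }

def verify_chunk(metadata, index, chunk):
    """A leecher checks every downloaded chunk against the published SHA-1 hash."""
    return hashlib.sha1(chunk).hexdigest() == metadata["checksums"][index]

data = b"some file contents" * 10000
meta = make_metadata(data, "http://tracker.example/announce", "file")
print(verify_chunk(meta, 0, data[:CHUNK_LEN]))   # True
```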
Overlay Network
A P2P network is an overlay network: each link between peers consists of one or more IP links.
Overlays: all in the application layer
Tremendous design flexibility
Topology, maintenance
Message types
Protocol
Messaging over TCP or UDP
Underlying physical network is transparent to the developer
But some overlays exploit proximity
Overlay Graph
Virtual edge: a TCP connection, or simply a pointer to an IP address
Overlay maintenance
Periodically ping to make sure a neighbor is still alive, or verify aliveness while messaging
If a neighbor goes down, may want to establish a new edge
A new incoming node needs to bootstrap
Could be a challenge under a high rate of churn
Churn: dynamic topology and intermittent access due to node arrival and failure
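A sketch of such a maintenance loop; the node object, its ping() method and neighbors set, and the list of known peers are assumptions made for illustration:

```python
import random
import time

def maintain_neighbors(node, known_peers, target_degree=4, interval=30.0):
    """Hypothetical maintenance loop: ping neighbors, drop dead ones, add replacements."""
    while True:                                       # run as a background daemon
        for neighbor in list(node.neighbors):
            if not node.ping(neighbor):               # neighbor did not answer
                node.neighbors.remove(neighbor)       # drop the virtual edge
        while len(node.neighbors) < target_degree and known_peers:
            candidate = random.choice(known_peers)
            if candidate not in node.neighbors and node.ping(candidate):
                node.neighbors.add(candidate)         # establish a new edge
        time.sleep(interval)
```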
Overlay Graph
Unstructured overlays: e.g., a new node randomly chooses existing nodes as neighbors
Structured overlays: e.g., edges arranged in a restrictive structure
Gnutella
pure peer-to-peer
very simple protocol, no routing "intelligence"
Constrained broadcast:
Lifetime of packets limited by TTL (typically set to 7)
Packets have unique IDs to detect loops
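A sketch of constrained broadcast with TTL and duplicate detection; the node object (seen_ids, neighbors, process, send) is an assumed stand-in, not Gnutella's wire protocol:

```python
import uuid

def handle_message(node, msg, came_from):
    """Flood a message to neighbors; unique ids detect loops, TTL limits lifetime."""
    if msg["id"] in node.seen_ids:            # duplicate: already forwarded, drop it
        return
    node.seen_ids.add(msg["id"])
    node.process(msg)                          # e.g. answer a Query locally
    if msg["ttl"] <= 1:                        # packet has reached its lifetime limit
        return
    fwd = dict(msg, ttl=msg["ttl"] - 1, hops=msg["hops"] + 1)
    for neighbor in node.neighbors:
        if neighbor != came_from:
            node.send(neighbor, fwd)

def new_query(keywords, ttl=7):                # TTL typically set to 7
    return {"id": uuid.uuid4().hex, "ttl": ttl, "hops": 0,
            "type": "Query", "keywords": keywords}
```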
Gnutella - PING/PONG
[Diagram: node 1 sends Ping messages that are flooded to nodes 2–8; Pong replies carrying the addresses of known hosts travel back along the reverse path, so node 1 learns hosts 2, 3, 4, 5, 6, 7, 8]
Query/Response works analogously.
Free riding
File sharing networks rely on users sharing data
Two types of free riding:
Downloading but not sharing any data
Not sharing any interesting data
On Gnutella:
15% of users contribute 94% of content
63% of users never responded to a query (they didn’t have “interesting” data)
KaZaA
Hierarchical approach between Gnutella and Napster: a two-layered architecture.
Powerful nodes (supernodes) act as local index servers, and client queries are propagated to other supernodes.
Each supernode manages around 100-150 children
Each supernode connects to 30-50 other supernodes
More efficient lookup than Gnutella and more scalable than Napster
Structured Overlay Networks / DHTs
Chord, Pastry, Tapestry, CAN, Kademlia, P-Grid, Viceroy
[Diagram: keys of nodes and keys of values are mapped into a common identifier space; the nodes are then connected "smartly"]
Hash Tables
Store arbitrary keys and satellite data (values):
put(key, value)
value = get(key)
Lookup must be fast: calculate a hash function h() on the key that returns a storage cell
Chained hash table: store the key (and optional value) there
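A minimal chained hash table along these lines (the class is illustrative; Python's built-in dict would normally be used):

```python
class ChainedHashTable:
    """h(key) picks a bucket; colliding keys are chained in a list."""
    def __init__(self, n_buckets=64):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]   # the storage cell

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)       # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable()
t.put("Algorithms", 1)
print(t.get("Algorithms"))                     # 1
```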
Distributed Hash Table
The Principle Of Distributed Hash Tables
A dynamic distribution of a hash table onto a set of cooperating nodes
Key   Value
1     Algorithms
9     Routing
11    DS
12    Peer-to-Peer
21    Networks
22    Grids
• Basic service: lookup operation
• Key resolution from any node
[Diagram: keys distributed over nodes A, B, C, D; lookup(9) is resolved to node D]
• Each node has a routing table
• Pointers to some other nodes
• Typically, a constant or a logarithmic number of pointers
Chord [MIT]
consistent hashing (SHA-1) assigns each node and object an m-bit ID
IDs are ordered in an ID circle ranging from 0 to 2^m - 1
New nodes assume slots in the ID circle according to their ID
Key k is assigned to the first node whose ID ≥ k, called successor(k)
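A compact sketch of consistent hashing with successor(k); the 16-bit ID space and the Ring class are illustrative simplifications, not Chord itself:

```python
import hashlib
from bisect import bisect_left, insort

M = 16                                   # m-bit identifier space (SHA-1, truncated here)

def chord_id(name):
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** M)

class Ring:
    def __init__(self):
        self.node_ids = []               # sorted node IDs on the identifier circle

    def join(self, node_name):
        insort(self.node_ids, chord_id(node_name))

    def successor(self, k):
        """First node whose ID >= k, wrapping around the circle."""
        i = bisect_left(self.node_ids, k % (2 ** M))
        return self.node_ids[i % len(self.node_ids)]

ring = Ring()
for n in ("node-A", "node-B", "node-C"):
    ring.join(n)
owner = ring.successor(chord_id("some-object"))   # key k is stored at successor(k)
print(owner)
```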
Consistent Hashing - Successor Nodes
[Diagram: an identifier circle with m = 3 (IDs 0-7); nodes 0, 1, and 3; keys 1, 2, and 6; successor(1) = 1, successor(2) = 3, successor(6) = 0]
Consistent Hashing – Join and Departure
When a node n joins the network, certain keys previously assigned to n’s successor now become assigned to n.
When node n leaves the network, all of its assigned keys are reassigned to n’s successor.
Consistent Hashing – Node Join
[Diagram: the identifier circle (0-7) with nodes 0, 1, 3; a new node joins and takes over from its successor the keys that now fall in its range]
Consistent Hashing – Node Departure
[Diagram: the identifier circle (0-7); a node leaves and its keys are reassigned to its successor]
Scalable Key Location – Finger Tables
Each node n maintains a routing table with up to m entries (m is the number of bits in identifiers), called the finger table.
The ith entry in the table at node n contains the identity of the first node s that succeeds n by at least 2^(i-1) on the identifier circle:
s = successor(n + 2^(i-1))
s is called the ith finger of node n, denoted by n.finger(i)
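A short sketch that computes the finger tables for the example on the next slide (m = 3, nodes 0, 1, 3):

```python
M = 3
NODES = sorted([0, 1, 3])                         # the nodes in the example

def successor(k):
    for n in NODES:
        if n >= k:
            return n
    return NODES[0]                               # wrap around the circle

def finger_table(n):
    """finger(i) = successor((n + 2^(i-1)) mod 2^m), for i = 1..m."""
    return [(i, (n + 2 ** (i - 1)) % 2 ** M, successor((n + 2 ** (i - 1)) % 2 ** M))
            for i in range(1, M + 1)]

for n in NODES:
    print(n, finger_table(n))
# 0 [(1, 1, 1), (2, 2, 3), (3, 4, 0)]
# 1 [(1, 2, 3), (2, 3, 3), (3, 5, 0)]
# 3 [(1, 4, 0), (2, 5, 0), (3, 7, 0)]
```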
Scalable Key Location – Finger Tables (example: m = 3, nodes 0, 1, 3)

finger table of node 0 (keys: 6)
i  start (0+2^(i-1))  succ.
1  1                  1
2  2                  3
3  4                  0

finger table of node 1 (keys: 1)
i  start (1+2^(i-1))  succ.
1  2                  3
2  3                  3
3  5                  0

finger table of node 3 (keys: 2)
i  start (3+2^(i-1))  succ.
1  4                  0
2  5                  0
3  7                  0
Chord key location
Look up in the finger table the furthest node that precedes the key
-> O(log n) hops
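A sketch of key location over the finger tables of the m = 3 example above; a real Chord node would issue remote calls to other nodes rather than consult local dictionaries:

```python
# Finger tables and immediate successors of the example (nodes 0, 1, 3).
FINGERS = {0: [(1, 1, 1), (2, 2, 3), (3, 4, 0)],    # (i, start, successor)
           1: [(1, 2, 3), (2, 3, 3), (3, 5, 0)],
           3: [(1, 4, 0), (2, 5, 0), (3, 7, 0)]}
SUCC = {0: 1, 1: 3, 3: 0}                            # each node's immediate successor

def in_interval(x, a, b):
    """True if x lies strictly inside the circular interval (a, b)."""
    return (a < x < b) if a < b else (x > a or x < b)

def closest_preceding_finger(n, key):
    for _, _, node in reversed(FINGERS[n]):          # scan fingers furthest-first
        if in_interval(node, n, key):
            return node
    return n

def lookup(start, key):
    """Route towards successor(key); each hop roughly halves the remaining distance."""
    n = start
    while not in_interval(key, n, SUCC[n]) and key != SUCC[n]:
        nxt = closest_preceding_finger(n, key)
        if nxt == n:
            break
        n = nxt
    return SUCC[n]                                   # the node responsible for key

print(lookup(3, 1))   # 1: starting at node 3, key 1 resolves to node 1 = successor(1)
```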
An Example of CAN
[Diagram sequence: nodes 1, 2, 3, 4 join a 2-d coordinate space one after another; each join splits an existing node's zone, so every node owns a rectangular zone]

node I::insert(K,V)
(1) a = h_x(K), b = h_y(K)
(2) route(K,V) -> (a,b)
(3) the node owning point (a,b) stores (K,V)

node J::retrieve(K)
(1) a = h_x(K), b = h_y(K)
(2) route "retrieve(K)" to (a,b)
Routing in CAN
[Diagram: a message is routed greedily from the node at (x,y) towards the node owning the target point (a,b), hopping between neighboring zones]
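A sketch of CAN-style placement and greedy routing; the Zone class, its neighbor lists, and the hash-to-point scheme are invented for illustration, not the actual CAN protocol:

```python
import hashlib

def point(key):
    """Map a key K to a point (a, b), i.e. a = h_x(K), b = h_y(K)."""
    h = lambda salt: int(hashlib.sha1((salt + key).encode()).hexdigest(), 16) % 10**6 / 10**6
    return h("x:"), h("y:")

class Zone:
    """A node's rectangular zone of the 2-d coordinate space."""
    def __init__(self, x0, x1, y0, y1):
        self.x0, self.x1, self.y0, self.y1 = x0, x1, y0, y1
        self.store = {}                      # (K, V) pairs whose point falls in this zone
        self.neighbors = []                  # zones adjacent to this one

    def owns(self, p):
        return self.x0 <= p[0] < self.x1 and self.y0 <= p[1] < self.y1

    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)

def route(node, p):
    """Greedy routing: hop to the neighbor whose zone centre is closest to the target."""
    while not node.owns(p):
        node = min(node.neighbors,
                   key=lambda nb: (nb.center()[0] - p[0]) ** 2 + (nb.center()[1] - p[1]) ** 2)
    return node

def insert(start, key, value):               # node I::insert(K, V)
    route(start, point(key)).store[key] = value

def retrieve(start, key):                    # node J::retrieve(K)
    return route(start, point(key)).store.get(key)
```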
Motivation
Online storage services are getting increasingly popular
Amazon’s S3, EMC’s Mozy …
Rely on data centers.
Challenges
Threatened by the single point of failure.
Amazon suffers outages (3 times); Gmail is down (4 times) …
Social networks make downtime harder to hide.
Incur high hardware, network and cooling costs.
P2P storage
Uses idle resources of users to avoid these costs
Provides low availability because of churn.
Motivation
Best of both worlds
Stability of data center
Low cost of P2P
[Diagram: a data center on top of a P2P storage layer]
AmazingStore Design
Combine data center and P2P storage system
[Diagram: the location of replicas is tracked both by masters in the data center and by a DHT among peers; the design tolerates a data center outage]
Availability Improvement
Overall availability jumps from 93.22% to 99.13%
Availability gained at peer side is 83.8%
[Plot: fraction of available objects vs. date (05/25 - 05/28), showing server-side, peer-side, and system availability; the P2P layer compensates during a server-side power failure]
Erasure codes
AmazingStore Design
Protect data security
Encrypted with the key specified by the user.
Periodically checks for peer eviction.
Bloom Filter
[Diagram: a bit vector V0 … Vm-1; an element x is hashed by h1(x), h2(x), h3(x), …, hk(x) and the corresponding bits are set to 1]
Initially all bits are 0.
Each element of S is hashed k times; each hash location is set to 1.
To check if y is in S, check the k hash locations: if a 0 appears, y is not in S; if only 1s appear, conclude that y is in S.
This may yield a false positive.

Bloom Errors
[Diagram: the same bit vector after inserting elements a, b, c, d]
x didn’t appear, yet its bits are already set
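A minimal Bloom filter along these lines; the parameters and the salted-SHA-1 hash family are illustrative choices:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = [0] * m                   # V0 .. Vm-1, initially all 0

    def _locations(self, item):
        return [int(hashlib.sha1(f"{i}:{item}".encode()).hexdigest(), 16) % self.m
                for i in range(self.k)]       # h1(x) .. hk(x)

    def add(self, item):
        for loc in self._locations(item):
            self.bits[loc] = 1                # each hash location is set to 1

    def __contains__(self, item):
        # If any location is 0, the item is definitely not in S;
        # if all are 1, it is *probably* in S (false positives possible).
        return all(self.bits[loc] for loc in self._locations(item))

bf = BloomFilter()
bf.add("x1")
bf.add("x2")
print("x1" in bf, "y" in bf)                  # True, (almost certainly) False
```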
ICP – With Summary Cache
[Diagram: a client's proxy cache and several other proxy caches connected over the Internet, exchanging ICP queries]
Outline
2.1 Architectural styles
2.2 System architectures
2.3 Architectures versus middleware
2.3.1 Interceptor
2.3.2 General Approaches to adaptive software
2.3.3 Discussion
2.4 Self-management in distributed systems
We have talked about the physical architecture.
Does middleware also have an architectural style?
If it does, how does it affect flexibility,
extensibility?
Sometimes, the “native” style may not be
optimal.
Can we build messaging over RPC?
Can we build RPC over messaging?
2.3 Architectures vs. middleware
Interceptors
Request-level interceptors could handle replication.
Message-level interceptors could handle fragmentation.
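An illustrative request-level interceptor that fans a single logical invocation out to all replicas of an object; the class names and the "first result wins" policy are assumptions for the sketch, not any particular middleware's API:

```python
class ReplicationInterceptor:
    def __init__(self, replicas):
        self.replicas = replicas              # stubs/addresses of all replicas

    def invoke(self, method, *args):
        # The caller issues one request; the interceptor turns it into one
        # request per replica and returns a single result.
        results = [getattr(r, method)(*args) for r in self.replicas]
        return results[0]                     # e.g. first (or majority) result

class CounterReplica:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

proxy = ReplicationInterceptor([CounterReplica(), CounterReplica()])
print(proxy.invoke("increment"))              # both replicas see the call
```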
Adaptive Middleware
Separation of concerns: Try to separate extra functionalities and later weave them together into a single implementation ⇒ only toy examples so far.
Computational reflection: Let a program inspect itself at runtime and adapt/change its settings dynamically if necessary ⇒ mostly at language level and applicability unclear.
Component-based design: Organize a distributed application through components that can be dynamically replaced when needed ⇒ highly complex, also many intercomponent dependencies.
Observation: Do we need adaptive software at all, or is the issue adaptive systems?
Outline
2.1 Architectural styles
2.2 System architectures
2.3 Architectures versus middleware
2.4 Self-management in distributed systems
Self-managing Distributed
Systems
Observation: Distinction between system and software
architectures blurs when automatic adaptivity needs to
be taken into account:
Self-configuration
Self-managing
Self-healing
Self-optimizing
Self-*
Note: There is a lot of hype going on in this field of
autonomic computing.
2.4 Self-management in distributed systems
Feedback Control Model
Observation: In many cases, self-* systems are organized
as a feedback control system:
Example: Differentiating Replication Strategies in Globule
Globule: a collaborative CDN that analyzes traces to decide where replicas of Web content should be placed.
Decisions are driven by a general cost model:
cost = (w1 × m1) + (w2 × m2) + · · · + (wn × mn)
The Globule origin server collects traces and does what-if analysis by checking what would have happened had page P been placed at edge server S.
Many strategies are evaluated, and the best one is chosen.
The dependency between prediction accuracy and trace length
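A hedged sketch of this what-if analysis around cost = (w1 × m1) + … + (wn × mn); the weights, metric values, and candidate strategies below are made up for illustration:

```python
WEIGHTS = [0.5, 0.3, 0.2]                               # w1 .. wn (invented)

def cost(metrics):
    """cost = (w1 x m1) + (w2 x m2) + ... + (wn x mn)"""
    return sum(w * m for w, m in zip(WEIGHTS, metrics))

# Metrics m1..mn as they *would have been* had page P been placed at each
# candidate server, obtained by replaying the collected trace (values invented).
candidates = {
    "keep P at origin":       [0.9, 0.2, 0.1],
    "replicate P to edge S1": [0.3, 0.5, 0.2],
    "replicate P to edge S2": [0.4, 0.4, 0.2],
}
best = min(candidates, key=lambda name: cost(candidates[name]))
print(best, round(cost(candidates[best]), 2))           # the cheapest strategy is chosen
```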
Summary
Architectural styles
System architectures
Architectures versus middleware
Self-management in distributed systems