ppt - 北京大学网络与信息系统研究所 (Institute of Network Computing and Information Systems, Peking University)


Introduction to Distributed Systems
&
PageRank Algorithm
http://net.pku.edu.cn/~course/cs402/2009/
彭波
[email protected]
School of Electronics Engineering and Computer Science, Peking University (北京大学信息科学技术学院)
7/7/2009
Outline

Homework review
Fundamentals of distributed systems
MapReduce implementation of the PageRank algorithm
Course project
Review of Lecture 1
SaaS
PaaS
Utility Computing
“Data Center is a Computer”
Parallelism everywhere
Massive Scalable Reliable
Resource Management
Data Management
Programming Model & Tools
Divide and Conquer
[Diagram: the "Work" is Partitioned into units w1, w2, w3; each unit goes to a "worker", producing results r1, r2, r3, which are Combined into the final "Result"]
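The partition/combine flow above can be sketched in a few lines of Python (a minimal illustration; the three-way split and the summing workload are assumptions, standing in for any real per-worker computation):

```python
from concurrent.futures import ThreadPoolExecutor

# Partition the "work", hand each part w1/w2/w3 to a worker, then
# combine the partial results r1/r2/r3. Summing integers stands in
# for whatever the real per-worker computation would be.
def worker(part):
    return sum(part)

work = list(range(12))
w1, w2, w3 = work[0:4], work[4:8], work[8:12]     # Partition

with ThreadPoolExecutor(max_workers=3) as pool:
    r1, r2, r3 = pool.map(worker, [w1, w2, w3])   # the "worker"s

result = r1 + r2 + r3                             # Combine
print(result)  # 66
```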
What's MapReduce?

A parallel/distributed computing programming model.
[Diagram: input → split → shuffle → output]
CodeLab1

1. Eclipse failed to run at first, mainly because the local jar files had not been added.
2. In the 0.13.0 VM the input directory could not be found, there was no /user directory, and uploads failed. Some students said the /user directory appeared after a restart, but I have not verified this. I added the input directory from the command line, which creates the /user directory automatically. In the later 0.17.0 VM I have already placed the input directory, so this problem is gone.
3. Unfamiliarity with Java. Many students reported difficulty, especially when modifying code for the 0.13.0 VM; it appears most students have little Java background.
HW1 Exercises

Have you ever encountered a Heisenbug? How did you isolate and fix it?
For the different failure types listed above, consider what makes each one difficult for a programmer trying to guard against it. What kinds of processing can be added to a program to deal with these failures?
Explain why each of the 8 fallacies is actually a fallacy.
Contrast TCP and UDP. Under what circumstances would you choose one over the other?
Exercises

What's the difference between caching and data replication?
What are stubs in an RPC implementation?
What are some of the error conditions we need to guard against in a distributed environment that we do not need to worry about in a local programming environment?
Why are pointers (references) not usually passed as parameters to a Remote Procedure Call?
Exercises

Here is an interesting problem called partial connectivity that can occur in a distributed environment. Let's say A and B are systems that need to talk to each other. C is a master that also talks to A and B individually. The communications between A and B fail. C can tell that A and B are both healthy. C tells A to send something to B and waits for this to occur. C has no way of knowing that A cannot talk to B, and thus waits and waits and waits. What diagnostics can you add in your code to deal with this situation?
Exercises

This is the Byzantine Generals problem: Two generals are on hills on either side of a valley. They each have an army of 1000 soldiers. In the woods in the valley is an enemy army of 1500 men. If each general attacks alone, his army will lose. If they attack together, they will win. They wish to send messengers through the valley to coordinate when to attack. However, the messengers may get lost or caught in the woods (or brainwashed into delivering different messages). How can they devise a scheme by which they either attack with high probability, or not at all?
Introduction to Distributed Systems
Parallelization & Synchronization
Parallelization Idea

If the work can be cleanly split into n units, parallelization is very "easy".
[Diagram: the problem's "work" is partitioned into units w1, w2, w3]
Parallelization Idea (2)
Spawn worker threads:
[Diagram: one thread is spawned for each work unit w1, w2, w3]
Parallelization Idea (3)
Workers process data:
[Diagram: each thread processes its own work unit w1, w2, w3]
Parallelization Idea (4)
[Diagram: the threads report their results back, which are collected into the final results]
Parallelization Pitfalls
But this model is too simple!

How do we assign work units to worker threads?
What if we have more work units than threads?
How do we know all the workers have finished?
How do we aggregate the results at the end?
What if the work cannot be divided into completely separate tasks?

What is the common theme of all of these problems?
Parallelization Pitfalls (2)

The key to all of these problems is that multiple threads must communicate with one another, or access a shared resource.
Golden rule: Any memory that can be used by multiple threads must have an associated synchronization system!
Process synchronization refers to the coordination of simultaneous threads or processes to complete a task in the correct runtime order, avoiding unexpected race conditions.
And if you thought I was joking…
What is Wrong With This?

Thread 1:
void foo() {
    x++;
    y = x;
}

Thread 2:
void bar() {
    y++;
    x++;
}

If the initial state is y = 0, x = 6, what happens after these threads finish running?
Multithreaded = Unpredictability

Many things that look like "one step" operations actually take several steps under the hood:

Thread 1:
void foo() {
    eax = mem[x];
    inc eax;
    mem[x] = eax;
    ebx = mem[x];
    mem[y] = ebx;
}

Thread 2:
void bar() {
    eax = mem[y];
    inc eax;
    mem[y] = eax;
    eax = mem[x];
    inc eax;
    mem[x] = eax;
}

When we run a multithreaded program, we don't know what order threads run in, nor do we know when they will interrupt one another.
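The unpredictability can be made concrete by enumerating every possible interleaving of the two micro-op sequences above (a sketch; the closure-based micro-op encoding is my own illustration, not from the slides):

```python
from itertools import combinations

# Model each thread from the slide as a list of register-level
# micro-ops over shared memory, then enumerate every interleaving.
def ops_foo():
    # x++; y = x  (five micro-ops)
    return [
        lambda m, r: r.update(eax=m["x"]),       # eax = mem[x]
        lambda m, r: r.update(eax=r["eax"] + 1), # inc eax
        lambda m, r: m.update(x=r["eax"]),       # mem[x] = eax
        lambda m, r: r.update(ebx=m["x"]),       # ebx = mem[x]
        lambda m, r: m.update(y=r["ebx"]),       # mem[y] = ebx
    ]

def ops_bar():
    # y++; x++  (six micro-ops)
    return [
        lambda m, r: r.update(eax=m["y"]),
        lambda m, r: r.update(eax=r["eax"] + 1),
        lambda m, r: m.update(y=r["eax"]),
        lambda m, r: r.update(eax=m["x"]),
        lambda m, r: r.update(eax=r["eax"] + 1),
        lambda m, r: m.update(x=r["eax"]),
    ]

def run(schedule):
    mem = {"x": 6, "y": 0}          # initial state from the slide
    progs = [ops_foo(), ops_bar()]
    regs = [{}, {}]                  # each thread has its own registers
    pc = [0, 0]
    for t in schedule:
        progs[t][pc[t]](mem, regs[t])
        pc[t] += 1
    return (mem["x"], mem["y"])

n1, n2 = 5, 6
outcomes = set()
for slots in combinations(range(n1 + n2), n1):   # thread 1's positions
    schedule = [0 if i in slots else 1 for i in range(n1 + n2)]
    outcomes.add(run(schedule))

print(sorted(outcomes))
```

Running either thread to completion before the other yields (8, 8), but the enumeration finds several other final states as well, which is exactly the point of the slide.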
Multithreaded = Unpredictability

This applies to more than just integers:

Pulling work units from a queue
Reporting work back to the master unit
Telling another thread that it can begin the "next phase" of processing

… All require synchronization!
Synchronization Primitives

Semaphore / Lock

Thread 1:
void foo() {
    sem.lock();
    x++;
    y = x;
    sem.unlock();
}

Thread 2:
void bar() {
    sem.lock();
    y++;
    x++;
    sem.unlock();
}
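A minimal runnable version of this slide's idea, with Python's threading.Lock playing the semaphore's role and the earlier slide's initial state x = 6, y = 0:

```python
import threading

# The slide's foo()/bar(), guarded by one lock so each
# critical section runs atomically.
x, y = 6, 0
sem = threading.Lock()

def foo():
    global x, y
    with sem:        # sem.lock() ... sem.unlock()
        x += 1
        y = x

def bar():
    global x, y
    with sem:
        y += 1
        x += 1

t1, t2 = threading.Thread(target=foo), threading.Thread(target=bar)
t1.start(); t2.start()
t1.join(); t2.join()
print(x, y)  # (8, 8) in either execution order
```

With the lock, both possible orders (foo before bar, or bar before foo) end at x = 8, y = 8: the nondeterminism of the unsynchronized version is gone.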
Barriers
Too Much Synchronization? Deadlock

Thread A:
semaphore1.lock();
semaphore2.lock();
/* use data guarded by semaphores */
semaphore1.unlock();
semaphore2.unlock();

Thread B:
semaphore2.lock();
semaphore1.lock();
/* use data guarded by semaphores */
semaphore1.unlock();
semaphore2.unlock();

(Image: RPI CSCI.4210 Operating Systems notes)
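The classic fix is a global lock order: if every thread acquires semaphore1 before semaphore2, the circular wait can never form. A small sketch (Python's threading.Lock standing in for the semaphores):

```python
import threading

# Deadlock-avoidance sketch: both threads acquire the locks in the
# same global order (semaphore1 before semaphore2), so neither can
# hold one lock while waiting for the other held by its peer.
semaphore1 = threading.Lock()
semaphore2 = threading.Lock()
done = []

def worker(name):
    with semaphore1:       # always acquired first
        with semaphore2:   # always acquired second
            done.append(name)  # use data guarded by semaphores

threads = [threading.Thread(target=worker, args=(n,)) for n in ("A", "B")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(done))
```

Both threads run to completion; with the slide's opposite acquisition orders, the same program can hang forever.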
And if you thought I was joking…
The Moral: Be Careful!

Synchronization is hard:
Need to consider all possible shared state
Must keep locks organized and use them consistently and correctly
Knowing there are bugs may be tricky; fixing them can be even worse!
Keeping shared state to a minimum reduces total system complexity
Introduction to Distributed Systems
Fundamentals of Networking
[Word cloud: TCP, ROUTER, IP, FTP, UDP, HTTP, Gateway, Protocol, SOCKET, Firewall, SWITCH, PORT]
What makes this work?

Underneath the socket layer are several more protocols. Most important are TCP and IP, which are used hand-in-hand so often that they're frequently spoken of as one protocol: TCP/IP.
[Diagram: a packet — IP header | TCP header | your data]
Why is This Necessary?

The Internet is not actually tube-like "under the hood". Unlike the phone system (circuit switched), the packet-switched Internet uses many routes at once.
[Diagram: many simultaneous routes between you and www.google.com]
Networking Issues

If a party to a socket disconnects, how much data did they receive?
Did they crash? Or did a machine in the middle?
Can someone in the middle intercept/modify our data?
Traffic congestion makes switch/router topology important for efficient throughput
Introduction to Distributed Systems
Distributed Systems
Outline

Models of computation
Connecting distributed modules
Failure & reliability
Models of Computation

Flynn's Taxonomy (instruction streams × data streams):
  SISD (single instruction, single data): single-threaded process
  SIMD (single instruction, multiple data): vector processing
  MISD (multiple instruction, single data): pipeline architecture
  MIMD (multiple instruction, multiple data): multi-threaded processes
SISD
[Diagram: one processor executes a single instruction stream over a single data stream]

SIMD
[Diagram: one processor applies each instruction to many data items D0 … Dn in lockstep]

MIMD
[Diagram: multiple processors, each executing its own instruction stream over its own data stream]

[Diagram: shared-memory MIMD — several processors exchange instructions and data with one shared memory, each with its own interface to the external world]

[Diagram: distributed-memory MIMD — processor/memory pairs, each with its own interface to the external world, connected by a network]
Outline

Models of computation
Connecting distributed modules
Failure & reliability
System Organization

Having one big memory would make it a huge bottleneck, eliminating all of the parallelism.
CTA: Memory is Distributed
[Diagram: each CPU has a local cache and local RAM (instructions and data), and connects through an interface to a shared interconnect network]
Interconnect Networks

The bottleneck in the CTA is transferring values from one local memory to another
Interconnect network design is very important; several options are available
Design constraint: How to minimize interconnect network usage?
A Brief History… 1985-95

"Massively parallel architectures" start rising in prominence
Message Passing Interface (MPI) and other libraries developed
Bandwidth was a big problem, for external interconnect networks in particular
A Brief History… 1995-Today

Cluster/grid architecture increasingly dominant
Special node machines eschewed in favor of COTS technologies
Web-wide cluster software
Companies like Google take this to the extreme (10,000-node clusters)
More About Interconnects

Several types of interconnect are possible:
Bus
Crossbar
Torus
Tree
Interconnect Bus
[Diagram: five processors P sharing one common bus]
• Simplest possible layout
• Not realistically practical
• Too much contention
• Little better than "one big memory"
Crossbar
[Diagram: five processors P connected through a crossbar circuit]
• All processors have "input" and "output" lines
• Crossbar connects any input to any output
• Allows for very low contention, but lots of wires, complexity
• Will not scale to many nodes
Toroidal networks
[Diagram: a 1-dimensional torus of 4 nodes; a 2-dimensional torus of 4×3 nodes]
• Nodes are connected to their logical neighbors
• Node-node transfer may include intermediaries
• Reasonable trade-off for space/scalability
Tree
[Diagram: switch nodes S form a hierarchy above processors P]
• Switch nodes transfer data "up" or "down" the tree
• Hierarchical design keeps "short" transfers fast, with incremental cost for longer transfers
• Aggregate bandwidth demands often very large at the top
• Most natural layout for most cluster networks today
Outline

Models of computation
Connecting distributed modules
Failure & reliability
“A distributed system is one in which the failure
of a computer you didn't even know existed can
render your own computer unusable”
-- Leslie Lamport
Reliability Demands

Support partial failure
  The total system must support graceful decline in application performance rather than a full halt

Data Recoverability
  If components fail, their workload must be picked up by still-functioning units

Individual Recoverability
  Nodes that fail and restart must be able to rejoin the group activity without a full group restart
Reliability Demands

Consistency
  Concurrent operations or partial internal failures should not cause externally visible nondeterminism

Scalability
  Adding increased load to a system should not cause outright failure, but a graceful decline
  Increasing resources should support a proportional increase in load capacity

Security
  The entire system should be impervious to unauthorized access
  Requires considering many more attack vectors than single-machine systems
Ken Arnold, CORBA designer: "Failure is the defining difference between distributed and local programming"

Component Failure

Individual nodes simply stop
Data Failure

Packets omitted by an overtaxed router, or dropped by a full receive buffer in the kernel
Corrupt data retrieved from disk or net
Network Failure

External & internal links can die
  Some can be routed around in a ring or mesh topology
  Star topology may cause individual nodes to appear to halt
  Tree topology may cause a "split"
  Messages may be sent multiple times, or not at all, or in corrupted form…
Timing Failure

Temporal properties may be violated
  Lack of a "heartbeat" message may be interpreted as component halt
  Clock skew between nodes may confuse version-aware data readers
Byzantine Failure

Difficult-to-reason-about circumstances arise
  Commands sent to a foreign node are not confirmed: what can we reason about the state of the system?
Malicious Failure

A malicious (or maybe naïve) operator injects invalid or harmful commands into the system
Correlated Failures

Multiple CPUs/hard drives from the same manufacturer lot may fail together
A power outage at one data center may cause demand overload at the failover center
Preparing for Failure

Distributed systems must be robust to these failure conditions
But there are lots of pitfalls…
The Eight Design Fallacies

1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.

-- Peter Deutsch and James Gosling, Sun Microsystems
Dealing With Component Failure

Use heartbeats to monitor component availability
A "buddy" or "parent" node is aware of the desired computation and can restart it elsewhere if needed
Individual storage nodes should not be the sole owner of data
  Pitfall: How do you keep replicas consistent?
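A heartbeat monitor of the kind described can be sketched in a few lines (the Monitor class, worker names, and 2-second timeout below are illustrative assumptions):

```python
import time

# Sketch: a "parent" node records the last heartbeat time per worker
# and declares any worker dead whose heartbeat is older than TIMEOUT.
TIMEOUT = 2.0  # seconds without a heartbeat before a worker is "dead"

class Monitor:
    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, worker, now=None):
        # Workers call this periodically; 'now' is injectable for testing.
        self.last_seen[worker] = now if now is not None else time.time()

    def dead_workers(self, now=None):
        now = now if now is not None else time.time()
        return [w for w, t in self.last_seen.items() if now - t > TIMEOUT]

m = Monitor()
m.heartbeat("worker-1", now=100.0)
m.heartbeat("worker-2", now=103.0)
print(m.dead_workers(now=104.0))  # worker-1 missed its heartbeat window
```

In a real system the parent would react to a dead worker by restarting its computation elsewhere, which is exactly the "buddy"/"parent" role on this slide.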
Dealing With Data Failure

Data should be check-summed and verified at several points
  Never trust another machine to do your data validation!
Sequence identifiers can be used to ensure commands and packets are not lost
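Check-summing plus sequence identifiers might look like this (a sketch; the packet-as-dict format and the MD5 choice are assumptions, and real systems often prefer CRCs or stronger hashes):

```python
import hashlib

# Sketch: checksum each payload before sending, verify on receipt,
# and tag packets with sequence identifiers so loss can be detected.
def make_packet(seq, payload: bytes):
    return {"seq": seq,
            "payload": payload,
            "checksum": hashlib.md5(payload).hexdigest()}

def verify(packet):
    return hashlib.md5(packet["payload"]).hexdigest() == packet["checksum"]

pkt = make_packet(1, b"hello")
ok_before = verify(pkt)          # packet intact on arrival
pkt["payload"] = b"hellp"        # simulate corruption in transit
ok_after = verify(pkt)           # checksum mismatch is detected
print(ok_before, ok_after)
```

A receiver that tracks the seq field can also notice gaps (a lost packet) or repeats (a duplicated one), covering the "commands and packets are not lost" point.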
Dealing With Network Failure

Have a well-defined split policy
  Networks should routinely self-discover topology
  Well-defined arbitration/leader-election protocols determine authoritative components
  Inactive components should gracefully clean up and wait for network rejoin
Dealing With Other Failures

Individual application-specific problems can be difficult to envision
Make as few assumptions about foreign machines as possible
Design for security at each step
Conclusions

Parallel systems evolved toward current distributed systems usage
Hard to avoid failure
  Determine what is reasonable to plan for
  Keep protocols as simple as possible
  Be mindful of common pitfalls
The PageRank Algorithm
Background
Digital Equipment Corporation

1995: 10M web pages; by its launch on 1995/12/15, altavista.digital.com had already indexed 16M web pages
1997: 25M queries/day, $50M revenue

But … how do we calculate the importance of a Web page?
Social Networks

Any natural phenomenon, social activity, or technical mechanism that establishes connections between individuals can form a network:

"Friendship" (symmetric, undirected graph)
"Acquaintance" (asymmetric, directed graph)
"Citation" (asymmetric, directed graph)
Co-authorship (symmetric, undirected graph, with dense "cliques")
Phone calls, correspondence
Virus infection (biological, computer)
Web page links (asymmetric, directed graph)

We can also consider different "scales": between web sites, between cities, between provinces, between countries, …
What is the point of studying these "relationship graphs"?

First-order metrics ("in-degree")
  Acquaintance: social visibility
  Citation: degree of recognition
"Higher-order metrics"
  The "distance" to co-authorship with a famous figure (e.g., Paul Erdös): the shorter it is, the more "prestigious" it seems (hence the Erdös number)
Structure alone can carry rich "semantics"
  For example, differences in link counts between provinces may have meaningful interpretations
Reputation, Prestige, Importance, …

"In-degree": (2,4) > (1,3)

More than "in-degree":
The people who know A may be just as numerous as the people who know B, but if everyone who knows B is an "important person", then B should usually be considered more important than A
The same holds for papers, not just people: a paper cited by important papers is probably itself more important
Who is more important?
Random Walks Over the Web

Model:
  A user starts at a random Web page
  The user randomly clicks on links, surfing from page to page
What's the amount of time that will be spent on any given page?
This is PageRank
PageRank: Defined

Given page x with in-bound links t1 … tn:

    PR(x) = α · (1/N) + (1 − α) · Σ_{i=1..n} PR(t_i) / C(t_i)

where:
  C(t) is the out-degree of t
  α is the probability of a random jump
  N is the total number of nodes in the graph

[Diagram: pages t1, t2, …, tn each link in to page X]
Example

Applying PR(x) = α · (1/N) + (1 − α) · Σ_{i=1..n} PR(t_i) / C(t_i) iteratively on a small four-page graph converges to p = (0.165, 0.321, 0.230, 0.284).
Computing PageRank

Properties of PageRank:
  Can be computed iteratively
  Effects at each iteration are local

Sketch of algorithm:
  Start with seed PRi values
  Each page distributes its PRi "credit" to all pages it links to
  Each target page adds up the "credit" from its multiple in-bound links to compute PRi+1
  Iterate until values converge
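The sketch above, written out as straight-line Python (the 4-page adjacency list and α = 0.15 are illustrative assumptions, not the slide's example graph):

```python
# Iterative PageRank, following the sketch: seed values, distribute
# "credit" along out-links, add the random-jump term, repeat.
ALPHA = 0.15                      # probability of a random jump
links = {                         # page -> pages it links to (assumed)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
N = len(links)
pr = {p: 1.0 / N for p in links}  # seed PR values

for _ in range(100):              # iterate toward convergence
    nxt = {p: ALPHA / N for p in links}          # random-jump term
    for p, outs in links.items():
        share = (1 - ALPHA) * pr[p] / len(outs)  # credit per out-link
        for q in outs:
            nxt[q] += share
    pr = nxt

print({p: round(v, 3) for p, v in pr.items()})
```

Because every page here has at least one out-link, total rank mass stays at 1 on each iteration; page D, with no in-bound links, bottoms out at α/N.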
PageRank in MapReduce

Map: distribute PageRank "credit" to link targets
Reduce: gather up PageRank "credit" from multiple sources to compute the new PageRank value
Iterate until convergence
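One iteration of this map/reduce phrasing, sketched in plain Python (the tiny graph and α = 0.15 are assumptions; a real Hadoop job would also re-emit each page's adjacency list so the graph structure survives the shuffle):

```python
from collections import defaultdict

ALPHA, N = 0.15, 4  # illustrative random-jump probability and graph size

def map_phase(page, rank, out_links):
    """Map: distribute this page's PageRank 'credit' to its link targets."""
    share = rank / len(out_links)
    return [(target, share) for target in out_links]

def reduce_phase(page, credits):
    """Reduce: gather credit from in-bound links; apply the PageRank formula."""
    return ALPHA / N + (1 - ALPHA) * sum(credits)

# page -> (current rank, out-links); a hypothetical 4-page graph
graph = {"A": (0.25, ["B", "C"]), "B": (0.25, ["C"]),
         "C": (0.25, ["A"]), "D": (0.25, ["C"])}

# "shuffle": group the map output by target page
shuffled = defaultdict(list)
for page, (rank, outs) in graph.items():
    for target, share in map_phase(page, rank, outs):
        shuffled[target].append(share)

new_ranks = {p: reduce_phase(p, c) for p, c in shuffled.items()}
print(new_ranks)
```

Note that page D never reaches the reduce phase because nothing links to it, so a production job must carry every page forward explicitly; related bookkeeping questions (such as dangling links) appear on the next slide.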
PageRank: Issues

Is PageRank guaranteed to converge? How quickly?
What is the "correct" value of α, and how sensitive is the algorithm to it?
What about dangling links?
How do you know when to stop?
Homework

Lab 3 – PageRank over Wikipedia Corpus
HW3 – Read GFS [2]

[Ghemawat et al., 2003] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," SIGOPS Oper. Syst. Rev., vol. 37, pp. 29-43, 2003.
Q&A
Summary
Failure Types

Halting failures: A component simply stops. There is no way to detect the failure except by timeout: it either stops sending "I'm alive" (heartbeat) messages or fails to respond to requests. Your computer freezing is a halting failure.
Fail-stop: A halting failure with some kind of notification to other components. A network file server telling its clients it is about to go down is a fail-stop.
Omission failures: Failure to send/receive messages, primarily due to lack of buffering space, which causes a message to be discarded with no notification to either the sender or receiver. This can happen when routers become overloaded.
Network failures: A network link breaks.
Network partition failure: A network fragments into two or more disjoint sub-networks within which messages can be sent, but between which messages are lost. This can occur due to a network failure.
Timing failures: A temporal property of the system is violated. For example, clocks on different computers used to coordinate processes are not synchronized, or a message is delayed longer than a threshold period.
Byzantine failures: This captures several types of faulty behavior, including data corruption or loss and failures caused by malicious programs.
8 Fallacies

1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
Error Conditions

Network data loss resulting in retransmit: Often, a system tries to achieve 'at most once' transmission. In the worst case, if duplicate transmissions occur, we try to minimize any damage done by the data being received multiple times.
Server process crashes during RPC operation: If a server process crashes before it completes its task, the system usually recovers correctly because the client will initiate a retry request once the server has recovered. If the server crashes after completing the task but before the RPC reply is sent, duplicate requests sometimes result due to client retries.
Client process crashes before receiving response: Client is restarted. Server discards response da