
ecs251 Winter 2013 final
Name:
Student ID:
Email:
Open book/laptop, but NO Internet or any other type of electronic communication. 5 questions in total, 6% each; 8 pages in total.
Please write precisely and cleanly. Please READ the questions VERY CAREFULLY before putting down your final answer, and please mark your answers clearly. You can use the back of the pages. Every page of this exam book needs to be returned. Some of the questions might not have a standard solution – i.e., you can argue either way (and the grading will be based on your description/explanation).
If we suspect any cheating behavior, we will pass the case to the academic committee immediately.
Score: ___/30
03/22/2013
Q-01
With the “backward and serial validation” of OCC (Optimistic Concurrency Control), for the following
execution history, which transaction(s) will be aborted and WHY?
[Execution-history diagram, garbled in this transcript: transactions T1–T4 interleave read and write operations on objects W, X, Y, and Z, and each transaction ends with a validation phase; the exact schedule is not recoverable here.]
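For reference, here is a minimal sketch of the backward serial validation rule the question refers to (plain Python with invented names, not code from the course): a validating transaction passes only if its read set does not intersect the write set of any transaction that committed after it started.

# Sketch of OCC backward (serial) validation in the style of Kung & Robinson:
# the validating transaction is checked against the write sets of all
# transactions that committed while it was running. All names are illustrative.

def backward_validate(txn, committed):
    """txn: dict with 'start_tn' and 'read_set'; committed: list of dicts
    with 'tn' (commit number) and 'write_set'."""
    for other in committed:
        # Only transactions that committed after txn started can conflict with it.
        if other["tn"] > txn["start_tn"]:
            if txn["read_set"] & other["write_set"]:
                return False  # txn read something overwritten concurrently -> abort
    return True               # no conflict -> txn may enter its write phase

# Hypothetical example (NOT the history in the question):
t2 = {"start_tn": 0, "read_set": {"X", "Y"}}
committed = [{"tn": 1, "write_set": {"Y"}}]   # some T1 committed after T2 started
print(backward_validate(t2, committed))        # False -> T2 would have to abort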
Q-02
(Two-Phase Commit) Under exactly which condition, under the 2PC protocol, do we have to wait
for a crashed server to recover before we can continue the commit protocol?
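As background, a minimal sketch of the participant side of 2PC (hypothetical Python, not from the course); it highlights the one state in which a participant cannot decide on its own and has to wait.

# Illustrative 2PC participant state machine. After voting YES (prepared) the
# participant can neither commit nor abort unilaterally, so a coordinator
# crash at that point forces it to block until the outcome can be learned.

from enum import Enum

class State(Enum):
    INIT = 1
    PREPARED = 2    # voted YES, decision unknown -> the "uncertain" window
    COMMITTED = 3
    ABORTED = 4

class Participant:
    def __init__(self):
        self.state = State.INIT

    def on_prepare(self, can_commit: bool) -> str:
        if can_commit:
            self.state = State.PREPARED    # must now obey the coordinator
            return "VOTE-COMMIT"
        self.state = State.ABORTED         # may still abort unilaterally
        return "VOTE-ABORT"

    def on_decision(self, decision: str):
        self.state = State.COMMITTED if decision == "COMMIT" else State.ABORTED

    def on_timeout(self) -> str:
        if self.state == State.PREPARED:
            # Cannot decide alone; must wait for the coordinator (or a peer
            # that already knows the outcome) to become reachable again.
            return "BLOCKED"
        if self.state == State.INIT:
            return "ABORT"                 # safe to abort before voting
        return self.state.name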
Q-03
(Transaction Support for GFS and HDFS) In this question, we consider building a
transaction system on top of GFS/HDFS (instead of on top of a database). When a transaction
needs to access an object (for either read or write access), it first needs to contact the Master Index
server (the Master under GFS or the NameNode under HDFS, with some extensions to
support transactions). To support the ACID properties, we can potentially use either
2PL (Two-Phase Locking) or OCC (Optimistic Concurrency Control) to guarantee serializability.
As the system architect, please decide which option (2PL or OCC) is better for GFS/HDFS
transactions and WHY?
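To make the trade-off concrete, here is a hypothetical sketch of what the 2PL option would add to the Master/NameNode (all names invented for illustration); the OCC option would instead let transactions run without locks and apply a commit-time validation check like the one sketched under Q-01.

# Hypothetical 2PL extension to the Master/NameNode: the master grants shared
# (read) or exclusive (write) locks per object, and a transaction holds its
# locks until commit/abort. Names and interface are invented for this sketch.

class LockingMaster:
    def __init__(self):
        self.locks = {}                       # object id -> (mode, set of owner txn ids)

    def acquire(self, txn_id, obj, mode):     # mode is "R" or "W"
        held = self.locks.get(obj)
        if held is None:
            self.locks[obj] = (mode, {txn_id})
            return True
        if mode == "R" and held[0] == "R":    # shared readers are compatible
            held[1].add(txn_id)
            return True
        return False                          # conflict: caller must wait or abort (deadlock risk)

    def release_all(self, txn_id):            # called at commit/abort (end of the shrinking phase)
        for obj, (_mode, owners) in list(self.locks.items()):
            owners.discard(txn_id)
            if not owners:
                del self.locks[obj]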
Q-04
(DHT/Chord) Here is a short description of Chord from Wikipedia:
The Chord protocol is one solution for connecting the peers of a P2P network. Chord consistently maps
a key onto a node. Both keys and nodes are assigned an m-bit identifier. For nodes, this identifier is a
hash of the node's IP address. For keys, this identifier is a hash of a keyword, such as a file name. It is
not uncommon to use the words "nodes" and "keys" to refer to these identifiers, rather than actual nodes
or keys. There are many other algorithms in use by P2P, but this is a simple and common approach.
A logical ring with positions numbered 0 to 2^m − 1 is formed among nodes. Key k is assigned to node
successor(k), which is the node whose identifier is equal to or follows the identifier of k. If there are N
nodes and K keys, then each node is responsible for roughly K / N keys.
When a new node joins or leaves the network, responsibility for O(K / N) keys changes hands.
If each node knows only the location of its successor, a linear search over the network could locate a
particular key. This is a naive method for searching the network, since any given message could
potentially have to be relayed through most of the network. Chord implements a faster search method.
Chord requires each node to keep a "finger table" containing up to m entries. The i-th entry of node n will
contain the address of successor((n + 2^(i−1)) mod 2^m).
With such a finger table, the number of nodes that must be contacted to find a successor in an N-node
network is O(log N).
QUESTION: Should the last part be O(log N) or O(m) (i.e., the size of the network versus the size of the
ring)? Please justify your answer!
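To ground the question, here is a simplified Chord-style lookup sketch (illustrative Python; the ring and finger tables are built up front rather than maintained with join/stabilize as in the real protocol). Each hop consults the finger table to jump closer to the key, which is what the O(log N)-versus-O(m) discussion is about.

# Simplified, illustrative Chord lookup on a 2^M identifier ring.

M = 6                                   # identifier bits -> ring of size 2^M = 64

def between(x, a, b):
    """True if x lies in the half-open ring interval (a, b]."""
    return (a < x <= b) if a < b else (x > a or x <= b)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.finger = []                # finger[i] = successor((id + 2^i) mod 2^M), i = 0..M-1

    def closest_preceding(self, key):
        for f in reversed(self.finger): # scan fingers from farthest to nearest
            if between(f.id, self.id, key) and f.id != key:
                return f
        return self

    def find_successor(self, key, hops=0):
        succ = self.finger[0]           # first finger is the immediate successor
        if between(key, self.id, succ.id):
            return succ, hops + 1
        nxt = self.closest_preceding(key)
        if nxt is self:
            return succ, hops + 1
        return nxt.find_successor(key, hops + 1)

def build_ring(ids):
    nodes = {i: Node(i) for i in ids}
    ring = sorted(ids)
    def succ_of(x):                     # node responsible for identifier x
        for i in ring:
            if i >= x:
                return nodes[i]
        return nodes[ring[0]]           # wrap around the ring
    for n in nodes.values():
        n.finger = [succ_of((n.id + 2 ** i) % (2 ** M)) for i in range(M)]
    return nodes

nodes = build_ring([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
node, hops = nodes[8].find_successor(54)
print(node.id, hops)                    # -> 56 3: key 54 lives on node 56, reached in 3 hops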
Q-04
(Cont.)
Q-05
(Map and Reduce) From the Wikipedia article on MapReduce: "… David DeWitt and Michael Stonebraker,
computer scientists specializing in parallel databases and shared-nothing architectures, have been
critical of the breadth of problems that MapReduce can be used for. They called its interface too low-level and questioned whether it really represents the paradigm shift its proponents have claimed it is.
They challenged the MapReduce proponents' claims of novelty, citing Teradata as an example of prior
art that has existed for over two decades. They also compared MapReduce programmers to Codasyl
programmers, noting both are "writing in a low-level language performing low-level record
manipulation.” MapReduce's use of input files and lack of schema support prevents the performance
improvements enabled by common database system features such as B-trees and hash partitioning,
though projects such as Pig (or PigLatin), Sawzall, Apache Hive, Ysmart, HBase and BigTable are
addressing some of these problems.
Greg Jorgensen wrote an article rejecting these views. Jorgensen asserts that DeWitt and Stonebraker's
entire analysis is groundless as MapReduce was never designed nor intended to be used as a database.
DeWitt and Stonebraker have subsequently published a detailed benchmark study in 2009 comparing
performance of Hadoop's MapReduce and RDBMS approaches on several specific problems. They
concluded that databases offer real advantages for many kinds of data use, especially on complex
processing or where the data is used across an enterprise, but that MapReduce may be easier for users to
adopt for simple or one-time processing tasks. They have published the data and code used in their
study to allow other researchers to do comparable studies.”
Question: What is your view of DeWitt/Stonebraker's opinion (versus, e.g., Jorgensen's) about Map
and Reduce? Please briefly describe the rationale behind your view.
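For context on the "low-level interface" point, here is a minimal word count written in the map/reduce style (plain Python, not Hadoop code): the programmer supplies per-record map and reduce functions, and the framework handles grouping, shuffling, and scheduling.

# Minimal map/reduce word count. The tiny "framework" below is only a
# stand-in for what a real MapReduce runtime does across many machines.

from collections import defaultdict

def map_fn(_key, line):
    for word in line.split():
        yield word, 1                    # emit one (word, 1) record per occurrence

def reduce_fn(word, counts):
    yield word, sum(counts)              # aggregate all values for one key

def run_mapreduce(records, mapper, reducer):
    groups = defaultdict(list)           # the "shuffle": group map outputs by key
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    out = []
    for k, vs in sorted(groups.items()):
        out.extend(reducer(k, vs))
    return out

print(run_mapreduce(enumerate(["to be or not to be"]), map_fn, reduce_fn))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]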
Q-05
(Cont.)