Transcript Document

Homework 2

What is the role of the secondary database
that we have to create?

A relational DBMS supports multiple index
structures on a table:




Create B-tree index on Salary attribute of Emp
Create Hash index on SS# of Emp
When an application deletes a record from the Emp
table (say using the hash index given the SS# of the
employee to be fired), all index structures are updated.
The same concept exists with a relational
storage manager (Berkeley DB).

Create primary and secondary databases and associate
them with one another:



Create a B+-tree primary database on Salary
Create a Hash secondary database on SS#
Associate the two indexes together.
Homework 2

When we verify that all of our records have been
stored correctly, is it sufficient to just count the
number of retrieved records, or do we have to keep
our own copy of all the records and verify each row
in the database
against our "shadow" copy?



Either approach is acceptable.
Hint 1: Start by defining very small amount of
memory for your main memory database (say 10
MB) to minimize time required to debug your
program. Once your program is stable, scale to a
large amount of memory.
Hint 2: Do not be surprised if Berkeley DB stores 12
MB of data into 10 MB – read the documentation
carefully!
Google’s Bigtable
Shahram Ghandeharizadeh
Computer Science Department
University of Southern California
Overall Architecture

Shared-nothing architecture consisting of
thousands of nodes!

A node is an off-the-shelf, commodity PC.
Yahoo’s Pig Latin
Google’s Map/Reduce Framework
Google’s Bigtable Data Model
Google File System
…….
Bigtable





A data model (a schema).
A sparse, distributed persistent multi-dimensional
sorted map.
Data is partitioned across the nodes seamlessly.
The map is indexed by a row key, column key, and a
timestamp.
Output value in the map is an un-interpreted array of
bytes.

(row: byte[ ], column: byte[ ], time: int64)  byte[ ]
Rows

A row key is an arbitrary string.





Every read or write of data under a single row is
atomic.
Data is maintained in lexicographic order by row
key.
The row range for a table is dynamically partitioned.
Each partition (row range) is named a tablet.


Typically 10-100 bytes in size, up to 64 KB.
Unit of distribution and load-balancing.
Objective: make read operations single-sited!

E.g., In Webtable, pages in the same domain are grouped
together by reversing the hostname components of the
URLs: com.google.maps instead of maps.google.com.
Column Families




Column keys are grouped into sets called
column families.
A column family must be created before data
can be stored in a column key.
Hundreds of static column families.
Syntax is family:key, e.g., Language:English,
Language:German, etc.
Timestamps


64 bit integers
Assigned by:




Bigtable: real-time in microseconds,
Client application: when unique timestamps are
a necessity.
Items in a cell are stored in decreasing
timestamp order.
Application specifies how many versions (n)
of data items are maintained in a cell.

Bigtable garbage collects obsolete versions.
Bigtable

Used in different applications supported by
Google.
Application 1: Google Analytics

Enables webmasters to analyze traffic
pattern at their web sites. Statistics such as:



Number of unique visitors per day and the page
views per URL per day,
Percentage of users that made a purchase given
that they earlier viewed a specific page.
How?



A small JavaScript program that the webmaster
embeds in their web pages.
Every time the page is visited, the program is
executed.
Program records the following information about
each request:


User identifier
The page being fetched
Application 1: Google Analytics (Cont…)

Two of the Bigtables

Raw click table (~ 200 TB)


Single-sited 


A row for each end-user session.
Row name include website’s name and the time at
which the session was created.
Clustering of sessions that visit the same web site. And
a sorted chronological order.
Compression factor of 6-7.
Summary table (~ 20 TB)





Stores predefined summaries for each web site.
Generated from the raw click table by periodically
scheduled MapReduce jobs.
Each MapReduce job extracts recent session data from
the raw click table.
Row name includes website’s name and the column
family is the aggregate summaries.
Compression factor is 2-3.
Application 2: Google Earth & Maps


Functionality: Pan, view, and annotate
satellite imagery at different resolution
levels.
One Bigtable stores raw imagery (~ 70 TB):

Single-sited


Row name is a geographic segments. Names are
chosen to ensure adjacent geographic segments
are clustered together.
Column family maintains sources of data for
each segment.
There are different sets of tables for serving
client data, e.g., index table.
Application 3: Personalized Search



Records user queries and clicks across
Google properties.
Users browse their search histories and
request for personalized search results
based on their historical usage patterns.
One Bigtable:
Single-sited



Row name is userid
A column family is reserved for each action type,
e.g., web queries, clicks.
User profiles are generated using MapReduce.


These profiles personalize live search results.
Replicated geographically to reduce latency and
increase availability.
Bigtable API

Implements interfaces to






create and delete tables and column families,
modify cluster, table, and column family
metadata such as access control rights,
Write or delete values in Bigtable,
Look up values from individual rows,
Iterate over a subset of the data in a table,
Atomic R-M-W sequences on data stored in a
single row key (No support for Xacts across
multiple rows).
Function Shipping

Similar to Gamma, Bigtable is based on
function shipping.
Yay!
Very smart!
Assumptions


Uses GFS to store log and data files.
Bigtable processes share the same
machines with processes from other
applications.


A shared cluster of commodity PCs.
A cluster management system:



Schedules jobs,
Manages resources on shared machines,
Monitors PC status and handles failures.
Building Blocks

Google File System


SSTable


High availability.
A key/value database.
Chubby

Name space.
SSTable

A database similar to a BDB database:

Stores and retrieves key/data pairs.




Key and data are arbitrary byte arrays.
Cursors to iterate key/value pairs given a
selection predicate (exact and range).
Configurable to use either persistent store (disk)
or main-memory based.
A SSTable is stored in GFS.
Bigtable: Hybrid Range Partitioning [VLDB’90]

To minimize the impact of load imbalance,
construct more (HN) ranges than (N) nodes,
e.g., 10 ranges for a 5 node system; H = 2.
0-10
51-60


11-20
61-70
21-30
71-80
31-40
81-90
41-50
91-100
H is higher in practice; 10 in the
experimental section of the paper.
A range is named a tablet. A tablet is
represented as:


A set of SSTable files.
A set of redo points which are pointers into any
commit logs that main contain data for the tablet.
Chubby




A persistent and distributed lock service.
Consists of 5 active replicas, one replica is
the master and serves requests.
Service is functional when majority of the
replicas are running and in communication
with one another – when there is a quorum.
Implements a nameservice that consists of
directories and files.
Software Infrastructure
1.
2.
A Bigtable library linked to every client.
Many tablet servers.



3.
One master server responsible for:





Tablet servers are added and removed
dynamically.
Ten to a thousand tablets assigned to a tablet
server.
Each tablet is typically 100-200 MB in size.
Assigning tablets to tablet servers,
Detecting the addition and deletion of tablet
servers,
Balancing tablet-server load,
Garbage collection of files in GFS.
Client communicates directly with tablet
server for reads/writes.
Location of Tablets (Ranges)

A 3-level hierarchy:

1st Level: A file stored in chubby contains location
of the root tablet, i.e., a directory of ranges (tablets)
and associated meta-data.




The root tablet never splits.
2nd Level: Each meta-data tablet contains the
location of a set of user tablets.
3rd Level: A set of SSTable identifiers for each
tablet.
Analysis:


Each meta-data row stores ~ 1KB of data,
With 128 MB tablets, the three level store addresses 234
tablets (261 bytes in 128 MB tablets).

Approaches a Zetabyte (million Petabytes).
Client/Master

Client caches tablet locations.
Bigtable and Chubby

Bigtable uses Chubby to:






Ensure there is at most one active master at a
time,
Store the bootstrap location of Bigtable data
(Root tablet),
Discover tablet servers and finalize tablet server
deaths,
Store Bigtable schema information (column
family information),
Store access control list.
If Chubby becomes unavailable for an
extended period of time, Bigtable becomes
unavailable.
Placement of Tablets


A tablet is assigned to one tablet server at a time.
Master maintains:



The set of live tablet servers,
Current assignment of tablets to tablet servers (including
the unassigned ones)
Chubby maintains tablet servers:



A tablet server creates and acquires an eXclusive lock on a
uniquely named file in a specific chubby directory (named
server directory),
Master monitors server directory to discover tablet server,
A tablet server stops processing requests if it loses its X
lock (network partitioning).


Tablet server will try to obtain an X lock on its uniqely named
file as long as it exists.
If the uniquely named file of a tablet server no longer exists
then the tablet server kills itself. Goes back to a free pool to
be assigned tablets by the master.
Placement of Tablets

Master detects when a tablet server is in the
free pool.

How? Master periodically probes each tablet
server for the status of its lock.
Master

Should the Master die, a new Master is
initiated. The master executes the following
steps:
Client Write & Read Operations

Write operation arrives at a tablet server:




Server ensures the client has sufficient privileges for the
write operation (Chubby),
A log record is generated to the commit log file,
Once the write commits, its contents are inserted into the
memtable.
Read operation arrives at a tablet server:


Server ensures client has sufficient privileges for the read
operation (Chubby),
Read is performed on a merged view of (a) the SSTables
that constitute the tablet, and (b) the memtable.
Write Operations


As writes execute, size of memtable increases.
Once memtable reaches a threshold:






Memtable is frozen,
A new memtable is created,
Frozen metable is converted to an SSTable and written to
GFS.
This minor compaction minimizes memory usage of
tablet server, and reduces recovery time in the
presence of crashes (checkpoints).
Merging compaction (in the background) reads a
few SSTables and memtable to produce one
SSTable. (Input SSTables and memtable are
discareded.)
Major compaction rewrites all SSTables into exactly
one SSTable (containing no deletion entries).
System Performance

Experiments involving random reads (from
GFS and main memory) and writes,
sequential reads and writes, and scans.

Scan: A single RPC fetches a large sequence of
values from the tablet server.
Random Reads
Random Reads
Sequential Reads

A read request is for 1000 bytes.
Sequential reads perform better because a
tablet server caches the 64 KB SSTable
block (from GFS) and uses it to serve the
next 64 read requests.
Random Reads from Memory
Random reads from memory avoid the overhead
of fetching a 64 KB block from GFS.
Data is mapped onto the memory of the tablet server
Writes

Tablet server appends all incoming writes to
a single commit log and uses group commit
to stream these writes to GFS efficiently.
Scale-up

As the number of tablet servers is increased by a factor of 500:



Performance of random reads from memory increases by a factor
of 300.
Performance of scans increases by a factor of 260.
Why?
Scale-up

As the number of tablet servers is increased by a factor of 500:



Performance of random reads from memory increases by a factor
of 300.
Performance of scans increases by a factor of 260.
Why?
1
R1
R1
R2
6
R2
Ideal
R3
cases
R3
{R1, R3}
{R1, R3}
R2
2
3
R2
R3
R1
R3
R1
R2
R2
R3
R2
R3
R1
R2
R1
{R1, R3}
{R1, R3}
R2
R2
{R2, R3}
{R2, R3}
R1
R1
{R2, R3}
{R2, R3}
R1
R1
{R2, R1}
{R2, R1}
R3
R3
{R2, R1}
{R2, R1}
R3
R3
{R1, R2, R3}
R2
R2
{R1, R3}
{R1, R3}
R1
R1
{R2, R3}
{R2, R3}
R3
R3
{R2, R1}
{R2, R1}
{R1, R2, R3}
{R1, R2, R3}
27 ways to
assign 3
requests to
the 3
nodes!
Brain Teaser

Given N servers and M requests,

compute the probability of:



M/N requests per node.
Number of ways M requests may map onto N servers
and the probability of each scenario.
Reward for correct answer:
Data Shipping?


Data shipping will saturate resources.
Do not be fooled by this discussion because
Bigtable has “function shipping” (not
reported in the evaluation section).
Lessons



Many types of errors in a real system.
Delay adding new features until it is clear
how the new feature will be used.
Very important to have eyes that can see:
Conclusion

From the very first lecture of this semester: