Homework 1: Common Mistakes
Memory Leak
Storing memory pointers instead of data
Memory Leak
A program that uses “new” (“malloc”) without “delete” (“free”) suffers from a memory leak. Why?
C++ “new” (C “malloc”) allocates space from the heap. If a program loses access to this space, that memory remains allocated, yet unusable, until the program stops executing.
Example pseudo-code:
For cntr ranging from 1 to 100 do BEGIN
    Record = new byte[1024];
    Set Record fields
    Insert the record into the database
END
Memory Leak
Why is a memory leak bad?
As the program executes, the available heap
space shrinks.
This space is allocated from the virtual memory
managed by the operating system.
If the virtual address space exceeds the available physical memory, the system starts to thrash and the program becomes very, very, …, very slow!
Memory Leak
Correct pseudo-code:
For cntr ranging from 1 to 100 do BEGIN
    Record = new byte[1024];
    Set Record fields
    Insert the record into the database
    delete Record;
END
The delete returns the 1 kilobyte of space back to the heap on every iteration!
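The same loop in real C++ (a minimal sketch; the database call is a stand-in, and note that memory allocated with new[] must be released with delete[]):

#include <cstddef>
#include <cstdio>

// Stand-in for the homework's database insert; illustrative only.
static void insertIntoDatabase(const unsigned char* record, std::size_t len) {
    std::printf("inserted %zu-byte record at %p\n", len, static_cast<const void*>(record));
}

int main() {
    for (int cntr = 1; cntr <= 100; ++cntr) {
        unsigned char* record = new unsigned char[1024]; // allocate 1 KB from the heap
        record[0] = static_cast<unsigned char>(cntr);    // "Set Record fields"
        insertIntoDatabase(record, 1024);
        delete[] record; // new[] pairs with delete[]; omit this line and 1 KB leaks per iteration
    }
    return 0;
}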
Bad Design: Store Pointers
Insertion of record 1 inserts pointers (not
data):
[Figure: Record 1 with fields id=5 and Age=25, where Name holds a memory address (e.g., 0101010101010) pointing at the string “Shahram” rather than the characters themselves.]
Why is this design bad?
When the program stops execution, all the
memory addresses (pointers) stored in the
database become invalid.
The operating system may move “Shahram”
around, invalidating the stored memory address.
Good Design: Store Data
Serialize the record to generate data:
[Figure: Record 1 serialized as data: id=5, Age=25, Name=“Shahram”.]
Insertion of record 1 inserts data into the
DBMS. This is the right design!
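A minimal C++ sketch of such serialization, with an assumed record layout (the field types and the buffer format are illustrative assumptions): the buffer handed to the DBMS contains the bytes of every field, and no memory address survives.

#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct Record {        // assumed layout for illustration
    int32_t id;
    int32_t age;
    std::string name;  // lives on the heap; its address must never be stored
};

// Serialize the record into a self-contained byte buffer: the fixed-size
// fields first, then the name's length followed by its characters.
std::vector<uint8_t> serialize(const Record& r) {
    std::vector<uint8_t> buf(sizeof r.id + sizeof r.age + sizeof(int32_t) + r.name.size());
    uint8_t* p = buf.data();
    std::memcpy(p, &r.id, sizeof r.id);    p += sizeof r.id;
    std::memcpy(p, &r.age, sizeof r.age);  p += sizeof r.age;
    const int32_t len = static_cast<int32_t>(r.name.size());
    std::memcpy(p, &len, sizeof len);      p += sizeof len;
    std::memcpy(p, r.name.data(), r.name.size());
    return buf;
}

int main() {
    Record r{5, 25, "Shahram"};
    std::vector<uint8_t> bytes = serialize(r); // these bytes are what the DBMS stores
    return bytes.empty() ? 1 : 0;
}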
Homework 2
Posted on the 585 web site
(http://dblab.usc.edu/csci585) and is due on
Feb 24th.
Objective:
Use of primary and secondary indexes
Highlight a few limitations of BDB when
configured in main memory.
Use of transactions to maintain integrity of a
main-memory database.
Read the homework description for in-class
review this Thursday.
Gamma DBMS (Part 3):
Function Shipping versus Data Shipping
Evaluation
Shahram Ghandeharizadeh
Computer Science Department
University of Southern California
Data Shipping
Client retrieves data
from the node.
Client performs
computation locally.
Limitation: the servers remain dumb, and shipping raw data consumes the limited network bandwidth.
[Figure: a node transmits its data to the client, and the client processes f(x) locally.]
Function Shipping
Client ships the
function to the node
for processing.
Relevant data is sent
to client.
Function f(x) should
produce less data than
the original data
stored in the database.
This minimizes the demand on network bandwidth.
[Figure: the client ships f(x) to a node; the node processes f(x) and returns only the output of f(x).]
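The contrast can be sketched with a hypothetical node interface (the names and types below are illustrative assumptions, not Gamma's actual API): data shipping moves every row to the client, while function shipping moves only f(x) to the node and the qualifying rows back.

#include <functional>
#include <string>
#include <vector>

using Row = std::string;
using Predicate = std::function<bool(const Row&)>;

// A hypothetical node holding one partition of a table.
struct Node {
    std::vector<Row> rows;

    // Data shipping: the "dumb" server transmits everything.
    std::vector<Row> fetchAll() const { return rows; }

    // Function shipping: the node evaluates f(x) and transmits only its output.
    std::vector<Row> evaluate(const Predicate& f) const {
        std::vector<Row> out;
        for (const Row& r : rows)
            if (f(r)) out.push_back(r);
        return out;
    }
};

int main() {
    Node node{{"red", "green", "red"}};
    Predicate f = [](const Row& r) { return r == "red"; };

    // Data shipping: all 3 rows cross the network; the client filters locally.
    std::vector<Row> local;
    for (const Row& r : node.fetchAll())
        if (f(r)) local.push_back(r);

    // Function shipping: only the 2 qualifying rows cross the network.
    std::vector<Row> remote = node.evaluate(f);
    return local == remote ? 0 : 1;
}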
Gamma
Gamma is based on function shipping.
Hybrid-hash join partitions the referenced tables
across the nodes of the shared-nothing
architecture. Data does not leave the realm of
the shared-nothing hardware.
Service Time
Focus on query service time (only 1 request
executing in the system) as a function of
input table size.
Hash partition the table.
Store the results of each query back in the
database.
Why?
Why?
Seek time is a function of the distance
traveled by the disk head.
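The “hash partition the table” step above can be sketched in a few lines of C++ (the choice of key and the node count are illustrative assumptions): each row is routed to exactly one node by hashing its partitioning attribute.

#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Route a row to a node by hashing its partitioning attribute.
std::size_t nodeFor(const std::string& key, std::size_t numNodes) {
    return std::hash<std::string>{}(key) % numNodes;
}

int main() {
    const std::size_t numNodes = 8;
    std::vector<std::vector<std::string>> partitions(numNodes);
    for (std::string key : {"Shahram", "Gamma", "join", "select"})
        partitions[nodeFor(key, numNodes)].push_back(key); // each row lands on one node
    return 0;
}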
Join Queries
Join tables A and Bprime.
A is 10X the size of Bprime.
The join produces the same number of records as Bprime. Why?
Note that re-partitioning the table is not that expensive.
How to Evaluate?
Focus on the use of parallelism and the scalability of the system. How?
Speedup: given a table with r rows and a query, if the service time of the system is X with one node, does it speed up by a factor of n with n nodes?
Scaleup: if the service time of a query referencing a table with r rows on a system with n nodes is X, does the service time remain X with a table of mr rows and mn nodes?
Both metrics measure the service time of the system because only one request is submitted to the system.
[Figures: ideal speedup and scaleup curves plotted against the number of nodes.]
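In symbols, with T(rows, nodes) denoting service time, the two definitions above read:

\[
\textit{Speedup}(n) = \frac{T(r,\,1)}{T(r,\,n)}, \qquad \text{ideally } \textit{Speedup}(n) = n
\]
\[
\textit{Scaleup}(m) = \frac{T(r,\,n)}{T(mr,\,mn)}, \qquad \text{ideally } \textit{Scaleup}(m) = 1
\]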
Selection Predicates: Speedup
Super-linear speed-up for the 1% non-clustered index selection and the 10% clustered index selection. The referenced table consists of 1 million rows.
Selection Predicates: Scaleup
Join Predicates: Speedup
One bucket, starting with 5 nodes.
Results would have been super-linear if Bprime did not fit in the main memory of 5 nodes.
Join Predicates: Scaleup
Overhead of parallelism: the scheduler coordinates the activation and de-activation of the different operators.
2009: Evolution of Gamma
Shared-nothing architecture consisting of
thousands of nodes!
A node is an off-the-shelf, commodity PC.
Yahoo’s Pig Latin
Google’s Map/Reduce Framework
Google’s Bigtable Data Model
Google File System
…….
Divide & Conquer
Gamma in 2009
Source code for Pig and Hadoop is available for free download.
Yahoo’s Pig Latin (Pig)
Google’s Map/Reduce Framework (Hadoop)
Google’s Bigtable Data Model
Google File System
…….
References
Pig Latin: Olston et al. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008.
Map/Reduce: Dean and Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, Vol. 51, No. 1, January 2008.
Bigtable: Chang et al. Bigtable: A Distributed Storage System for Structured Data. OSDI 2006.
GFS: Ghemawat et al. The Google File System. SOSP 2003.
Overview: Pig Latin
A high level program that specifies a query
execution plan.
Example: For each sufficiently large category,
retrieve the average pagerank of high-pagerank
urls in that category.
SQL, assuming a table urls(url, category, pagerank):
SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 1000000;
Pig Latin:
Good_urls = FILTER urls BY pagerank > 0.2;
Groups = GROUP Good_urls BY category;
Big_groups = FILTER Groups BY COUNT(Good_urls) > 1000000;
Output = FOREACH Big_groups GENERATE category, AVG(Good_urls.pagerank);
Overview: Map/Reduce (Hadoop)
A programming model to make parallelism
transparent to a programmer.
Programmer specifies:
a map function that processes a key/value pair to
generate a set of intermediate key/value pairs.
a reduce function to merge all intermediate values
associated with the same intermediate key.
Divides the problem into smaller “intermediate key/value”
sub-problems.
Solves each sub-problem.
Final results might be stored across R files.
Run-time system takes care of:
Partitioning the input data across nodes,
Scheduling the program’s execution,
Handling node failures,
Coordination among multiple nodes.
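As a concrete illustration of the model (word count, in the spirit of the MapReduce paper's canonical example; this single-process C++ sketch is not the Hadoop API, and the runtime's partitioning, scheduling, and fault handling are elided):

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Map: for every word in the input value, emit an intermediate (word, 1) pair.
std::vector<std::pair<std::string, int>> mapFn(const std::string& value) {
    std::vector<std::pair<std::string, int>> out;
    std::istringstream in(value);
    std::string word;
    while (in >> word) out.emplace_back(word, 1);
    return out;
}

// Reduce: merge all intermediate values that share the same intermediate key.
int reduceFn(const std::string& /*key*/, const std::vector<int>& values) {
    int sum = 0;
    for (int v : values) sum += v;
    return sum;
}

int main() {
    std::vector<std::string> inputs = {"a rose is a rose", "is a rose"};
    // The "shuffle" step: group intermediate pairs by key before reducing.
    std::map<std::string, std::vector<int>> groups;
    for (const std::string& in : inputs)
        for (const auto& kv : mapFn(in)) groups[kv.first].push_back(kv.second);
    for (const auto& g : groups)
        std::cout << g.first << ": " << reduceFn(g.first, g.second) << "\n";
    return 0;
}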
Overview: Bigtable
A data model (a schema).
A sparse, distributed, persistent, multi-dimensional sorted map.
Data is partitioned across the nodes seamlessly.
The map is indexed by a row key, a column key, and a timestamp.
Each value in the map is an un-interpreted array of bytes.
(row: byte[ ], column: byte[ ], time: int64) → byte[ ]
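A toy C++ model of this map (the row/column/timestamp example values follow the Bigtable paper's webtable illustration; the single std::map stands in for what the real system partitions into tablets across many nodes):

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <tuple>
#include <vector>

using Key = std::tuple<std::string, std::string, int64_t>; // (row, column, time)
using Value = std::vector<uint8_t>;                        // un-interpreted bytes

int main() {
    std::map<Key, Value> table; // kept sorted by key, as in Bigtable
    const std::string html = "<html>...</html>";
    table[Key{"com.cnn.www", "contents:", 3}] = Value(html.begin(), html.end());
    for (const auto& [key, value] : table)
        std::cout << std::get<0>(key) << " / " << std::get<1>(key)
                  << " @ " << std::get<2>(key) << " -> "
                  << std::string(value.begin(), value.end()) << "\n";
    return 0;
}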
Overview: Bigtable
Used in different applications supported by
Google.
Overview: GFS
A highly available, distributed file system for
inexpensive commodity PCs.
Treats node failures as the norm rather than the exception.
Stores and retrieves multi-GB files.
Assumes files are append-only (instead of updates that modify existing data in place).
Atomic append operation to enable multiple clients to
append to a file with minimal synchronization.
Relaxed consistency model to simplify the file
system and enhance performance.
How to start?
Bottom-up, starting with GFS.
Yahoo’s Pig Latin
Google’s Map/Reduce Framework
Google’s Bigtable Data Model
Google File System
…….
Google File System: Assumptions
GFS: Interfaces
Create, delete, open, close, read, and write
files.
Snapshot a file:
Create a copy of the file.
Record append operation:
Allows multiple clients to append data to the
same file concurrently, while guaranteeing the
atomicity of each individual client’s append.
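A hypothetical C++ rendering of this interface (the names and signatures are assumptions for illustration; the actual GFS client library is internal to Google):

#include <cstdint>
#include <string>
#include <vector>

class GfsClient {
public:
    virtual ~GfsClient() = default;
    virtual bool create(const std::string& path) = 0;
    virtual bool remove(const std::string& path) = 0;
    virtual int64_t open(const std::string& path) = 0;
    virtual void close(int64_t handle) = 0;
    virtual int64_t read(int64_t handle, int64_t offset, std::vector<uint8_t>& out) = 0;
    virtual int64_t write(int64_t handle, int64_t offset, const std::vector<uint8_t>& data) = 0;
    // Snapshot: create a copy of a file.
    virtual bool snapshot(const std::string& source, const std::string& copy) = 0;
    // Record append: GFS picks the offset and appends atomically, so many
    // clients can append to the same file concurrently; returns that offset.
    virtual int64_t recordAppend(int64_t handle, const std::vector<uint8_t>& data) = 0;
};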
GFS: Architecture
1 Master
Multiple chunkservers
File is partitioned into fixed-size (64 MB) chunks.
Each chunk has a globally unique 64-bit chunk handle.
Each chunk is replicated on
several chunkservers.
Degree of replication is
application specific; default is 3.
Master maintains all file
system meta-data:
namespace, access control
info, mapping from files to
chunks, current location of
chunks.
GFS client caches meta-data about the file system.
Client chooses one of the replicas and receives data from that chunkserver directly.
Client and chunkserver do
not cache file data.
GFS Master
1 master simplifies software design.
Master monitors availability of chunkservers
using heart-beat messages.
1 master is a single point of failure. The master’s metadata:
File and chunk namespaces,
Mapping from files to chunks,
Location of each chunk’s replicas.
Master does not store chunk location information persistently: when the master starts, it asks each chunkserver about its chunks (and does so whenever a chunkserver joins).
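One way to picture the master's state (the types and names are illustrative assumptions, not Google's code): the namespace and file-to-chunk mapping are persistent, while replica locations are rebuilt from chunkserver reports.

#include <cstdint>
#include <map>
#include <string>
#include <vector>

using ChunkHandle = uint64_t; // globally unique 64-bit chunk handle

struct MasterState {
    // Persistent: file namespace and the mapping from files to chunks.
    std::map<std::string, std::vector<ChunkHandle>> fileToChunks;

    // Not persistent: chunk -> replica locations, rebuilt at startup by
    // asking every chunkserver (and refreshed whenever a chunkserver joins).
    std::map<ChunkHandle, std::vector<std::string>> chunkLocations;

    void onChunkserverReport(const std::string& server,
                             const std::vector<ChunkHandle>& chunks) {
        for (ChunkHandle c : chunks)
            chunkLocations[c].push_back(server);
    }
};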
Mutation = Update
Mutation is an operation that changes the contents or the metadata of a chunk.
Content mutation:
Performed on all chunk’s replicas.
Master grants a chunk lease to one of the replicas, called the primary.
Primary picks a serial order for all mutations to
the chunk.
Lease:
Granted by master, typically 60 seconds.
Primary may request extensions.
If master loses communication with a primary, it
can safely grant a new lease to another replica
after the current lease expires.
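A sketch of this lease bookkeeping on the master (the 60-second term is from the slide; the structure and names are illustrative assumptions):

#include <chrono>

using Clock = std::chrono::steady_clock;

struct Lease {
    int primaryReplica = -1;    // which replica currently holds the lease
    Clock::time_point expiry{}; // when the lease lapses

    static constexpr std::chrono::seconds kTerm{60}; // typical lease term

    bool expired() const { return Clock::now() >= expiry; }

    // Grant (or safely re-grant) the lease only when no unexpired lease exists.
    bool grant(int replica) {
        if (primaryReplica != -1 && !expired()) return false;
        primaryReplica = replica;
        expiry = Clock::now() + kTerm;
        return true;
    }

    // The primary may request extensions while its lease is still live.
    bool extend(int replica) {
        if (replica != primaryReplica || expired()) return false;
        expiry = Clock::now() + kTerm;
        return true;
    }
};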
Updates