Lecture 2: Beyond Relational Databases
Prof. Shahram Ghandeharizadeh
Director of USC Database Lab
http://dblab.usc.edu
Computer Science Department
University of Southern California
An Emerging Phenomenon
[Figure: User 1 and User 2, each with application programs, on top of a DBMS; data managed by DBMS.]
Why?
Marketing campaigns have become too
exaggerated!
Relational vendors claim RDBMS is the
answer to all data management needs.
Not true.
What are some examples?
Data Warehousing
Retail organizations record every customer
transaction, producing Terabytes of data.
Objective: Mine database for information
about customers’ purchasing patterns,
trends in product popularity, geographical
preferences, and others.
Database characteristics:
Large tables (tens of Terabytes in size),
Updated in bulk periodically,
Read by analysts invoking tools to mine trends.
Queries access only a few of the many
columns in a table, and scan tables sorted in
different ways.
Directory Services
International organizations with distributed
resources and personnel.
LDAP standard
Requirement: fast lookup of entities arranged in
a hierarchical structure that corresponds to a
hierarchy of the organization.
Core of identification and authentication system
from a number of vendors, e.g., IBM Tivoli,
Microsoft Active Directory Server, SUN ONE
Directory Server.
Bulk updates similar to data warehousing.
Multi-valued attributes.
Queries are single-row retrieval or lookups
based on attribute values.
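The lookup pattern above (entries arranged in a hierarchy, multi-valued attributes, single-entry retrieval) can be sketched as follows. This is a minimal illustration, not LDAP itself: the slash-separated paths and the names Entry and Directory are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical entry type: each attribute name maps to a list of values,
// which models multi-valued attributes.
struct Entry {
    std::map<std::string, std::vector<std::string>> attrs;
};

// Hypothetical directory keyed by a slash-separated path such as "usc/cs/alice".
// Keeping paths in sorted order places every subtree in one contiguous key range.
class Directory {
public:
    void put(const std::string& path, const Entry& e) { entries_[path] = e; }

    // Single-entry retrieval by full path; returns nullptr if absent.
    const Entry* lookup(const std::string& path) const {
        auto it = entries_.find(path);
        return it == entries_.end() ? nullptr : &it->second;
    }

    // Paths of all entries strictly below the given node of the hierarchy.
    std::vector<std::string> subtree(const std::string& node) const {
        std::vector<std::string> out;
        const std::string prefix = node + "/";
        for (auto it = entries_.lower_bound(prefix);
             it != entries_.end() &&
             it->first.compare(0, prefix.size(), prefix) == 0;
             ++it) {
            out.push_back(it->first);
        }
        return out;
    }

private:
    std::map<std::string, Entry> entries_;
};
```

A caller would put entries under paths such as "usc/cs/alice" and then call lookup for one entry or subtree("usc/cs") for everything under a department.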
Web Search
Semi-structured data
Queries are keyword lookups and the
desired response is a sorted list of
possible answers.
HTML pages instead of raw data.
Need for efficient inverted indices.
Bulk updates, read mostly.
Need for nontraditional indexing.
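A minimal sketch of an inverted index for keyword lookup with a sorted answer list. The class name InvertedIndex, the whitespace tokenizer, and ranking by raw term frequency are all assumptions made for illustration; real search engines use far more elaborate tokenization and ranking.

```cpp
#include <algorithm>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical inverted index: word -> (document id -> occurrence count).
class InvertedIndex {
public:
    // Split the text on whitespace and record each word against doc_id.
    void add_document(int doc_id, const std::string& text) {
        std::istringstream in(text);
        std::string word;
        while (in >> word) {
            ++postings_[word][doc_id];
        }
    }

    // Return ids of documents containing the keyword, highest count first.
    // Ties keep ascending document-id order because the sort is stable.
    std::vector<int> search(const std::string& keyword) const {
        std::vector<int> result;
        auto it = postings_.find(keyword);
        if (it == postings_.end()) return result;

        std::vector<std::pair<int, int>> hits(it->second.begin(),
                                              it->second.end());
        std::stable_sort(hits.begin(), hits.end(),
                         [](const std::pair<int, int>& a,
                            const std::pair<int, int>& b) {
                             return a.second > b.second;
                         });
        for (const std::pair<int, int>& h : hits) result.push_back(h.first);
        return result;
    }

private:
    std::map<std::string, std::map<int, int>> postings_;
};
```

Documents are added in bulk with add_document and then queried many times with search, which matches the "bulk updates, read mostly" workload.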
Other Examples
Mobile device caching
Your cell phone’s directory as a transient
cache of a global directory.
Stream management
Real-time filtering of streams for
interesting patterns. Example: identify
a hotly traded stock, or a stock that is not
traded as heavily as expected.
Filters look like SQL selection predicates,
causing developers to mistake an RDBMS
for the right choice.
XML management
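The stream filtering example (spotting a hotly traded stock) can be sketched as a filter that sees each trade once and keeps only a running total per symbol. The names Trade and VolumeFilter and the fixed threshold are hypothetical; a real stream system would also expire old trades with a time window.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical trade event arriving on the stream.
struct Trade {
    std::string symbol;
    long shares;
};

// Hypothetical filter: flags a symbol once its running volume reaches
// the threshold. Nothing is stored except one counter per symbol.
class VolumeFilter {
public:
    explicit VolumeFilter(long hot_threshold) : hot_threshold_(hot_threshold) {}

    // Consume one trade; return true if the symbol now counts as hot.
    bool on_trade(const Trade& t) {
        long& total = volume_[t.symbol];  // starts at 0 for a new symbol
        total += t.shares;
        return total >= hot_threshold_;
    }

private:
    long hot_threshold_;
    std::unordered_map<std::string, long> volume_;
};
```

The predicate resembles a SQL selection, but it is evaluated as each trade arrives rather than over a stored table.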
Summary
25 years ago, there was one market for DBMS: business data
processing. This has changed to include different
applications with different requirements.
Relational DBMS have been designed for transaction
processing and workloads consisting of ad hoc
queries and a significant number of updates.
Example applications are read-dominated:
No need for transactional guarantees.
SQL is the wrong choice for stream processing.
One software architecture will not support the diverse
needs of these applications. Possible solutions:
1) each application re-builds its own storage manager from
scratch,
2) provide a flexible solution that can be tailored to the needs of
a particular application.
A handful of configurable storage systems, each of
which is useful across a broad application class.
Evolution
Before having a handful of configurable storage managers: [figure not recovered]
After having a handful of configurable storage managers: [figure not recovered]
Requirements
A flexible storage manager must be:
Modular
Configurable
Adapt to the hardware and software environment of the
application.
Physical data design (physical clustering, choice of
indexes, internal structure of items in the database).
Ultimate goal:
An application should not have to “pay” for
a functionality that it does not use.
“Pay” means:
Memory consumption,
Disk and CPU utilization.
Application developer should be able to
exclude major subsystems.
Modularity
Simple, re-usable, plug-n-play components.
View a transaction processing system as:
A single table selection component that has a B+
tree index that supports simple indexing,
updating and selection.
Add concept of transactions.
Add a select-project-join operator
Add aggregates
…..
Transforms a sophisticated system into a
collection of components. Each component
may support a large number of application
domains.
Data Availability
High availability and data replication as
components
Challenge: the component must fit in a
company’s high availability infrastructure,
e.g., heartbeat protocols to detect
failures, fail-over techniques, and
redundant communication channels.
Modularity: Advantages
Modularity manages size and complexity of
the final application while also enabling the
application and data management
capabilities to seamlessly interact.
Modularity provides for extensibility (not
provided by a monolithic system). Example:
A transactional system consists of a transaction
manager, a lock manager, and a log manager.
If these modules are open and extensible then
the developer may build systems that
incorporate items that are not managed by the
database itself.
A network switch with an operation such as
“power up the backup network interface card” as
a transaction using locking and logging
components.
Configurability
While modularity is an architectural mechanism, configuration is
mostly an initialization and runtime mechanism.
Configure the system with a buffer pool size of 1 GB and a disk page size of
32 KB.
Space consumed by transaction logs.
Buffer pool size is a run-time parameter.
Disk page size is an initialization parameter. Why?
Configurability refers to how well a system can be matched to its
environment and application requirements.
Underlying hardware platform: PDAs, embedded systems, 64-way
multiprocessor with gigabytes of DRAM.
Neutral to different network protocols and how a developer may decide to
use a protocol.
Different Operating Systems: Linux, Windows (to be portable, a storage
manager must use services common to different OSs).
Whether database is main memory resident or not.
A configurable system would try to use the CPU cache.
Use of compression is another good example. It depends on the
hardware platform, and tradeoff associated with the amount of power
consumed by the processor to compress versus transmitting larger
chunks.
Flash as a layer in the memory hierarchy. Challenge: sensitive to the
number of write operations.
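The split between initialization and run-time parameters can be sketched as below. The class StorageConfig and its methods are hypothetical and do not correspond to any particular storage manager's API. One likely reason for the split, offered here as an assumption rather than the lecture's answer: pages already on disk are laid out in units of the page size, so it cannot change without rewriting the database, while the buffer pool lives only in memory.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical configuration object for a storage manager.
class StorageConfig {
public:
    // The page size is an initialization parameter: set once, then constant.
    StorageConfig(std::size_t page_size_bytes, std::size_t buffer_pool_bytes)
        : page_size_(page_size_bytes), buffer_pool_(buffer_pool_bytes) {
        if (page_size_ == 0 || buffer_pool_ < page_size_) {
            throw std::invalid_argument("buffer pool must hold at least one page");
        }
    }

    std::size_t page_size() const { return page_size_; }
    std::size_t buffer_pool_bytes() const { return buffer_pool_; }

    // Number of whole pages that fit in the buffer pool.
    std::size_t buffer_pool_pages() const { return buffer_pool_ / page_size_; }

    // The buffer pool size is a run-time parameter: it may change later.
    void resize_buffer_pool(std::size_t bytes) {
        if (bytes < page_size_) {
            throw std::invalid_argument("buffer pool must hold at least one page");
        }
        buffer_pool_ = bytes;
    }

private:
    const std::size_t page_size_;  // no setter on purpose
    std::size_t buffer_pool_;
};
```

With a 1 GB buffer pool and 32 KB pages, buffer_pool_pages() reports 32768 page frames.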
Spectrum of configuration
Ideally, a full spectrum of possible
choices should be supported (relative
to the extreme ends of the spectrum).
Different policies should be implemented
by the same transactional component.
[Figure: spectrum from “Data in main memory with no xact guarantees” at one end to “Persistence with full xact guarantees” at the other.]
Physical Data Design
A configurable storage manager must
support components for physical layout of
data and indexing techniques:
Physical clustering: design & runtime decisions.
Indexing mechanism (B+-tree, Hash): runtime
configuration decisions.
Grouping of relevant data to enhance cache hit ratio
and minimize seek time.
The criterion used for clustering is paramount.
Extensibility means the developer may introduce a new
indexing mechanism as a component.
Internal structure of items in the database:
design decision.
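Treating the indexing mechanism as a pluggable component chosen at run time can be sketched as follows. The interface Index, the two implementations, and make_index are hypothetical names; std::map stands in for a B+-tree only in the sense that both keep keys ordered.

```cpp
#include <map>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical component interface that every index must implement.
class Index {
public:
    virtual ~Index() = default;
    virtual void put(int key, const std::string& value) = 0;
    // Returns true and fills value_out if the key is present.
    virtual bool get(int key, std::string& value_out) const = 0;
};

// Ordered index; std::map is a stand-in for a B+-tree here.
class OrderedIndex : public Index {
public:
    void put(int key, const std::string& value) override { data_[key] = value; }
    bool get(int key, std::string& value_out) const override {
        auto it = data_.find(key);
        if (it == data_.end()) return false;
        value_out = it->second;
        return true;
    }
private:
    std::map<int, std::string> data_;
};

// Hash index: fast point lookups, no key ordering.
class HashIndex : public Index {
public:
    void put(int key, const std::string& value) override { data_[key] = value; }
    bool get(int key, std::string& value_out) const override {
        auto it = data_.find(key);
        if (it == data_.end()) return false;
        value_out = it->second;
        return true;
    }
private:
    std::unordered_map<int, std::string> data_;
};

enum class IndexKind { Ordered, Hash };

// Run-time configuration decision: pick the index component by kind.
std::unique_ptr<Index> make_index(IndexKind kind) {
    if (kind == IndexKind::Hash) return std::unique_ptr<Index>(new HashIndex());
    return std::unique_ptr<Index>(new OrderedIndex());
}
```

Extensibility in this sketch means a developer adds a new class derived from Index and one more case in make_index, without touching the code that uses the index.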
Application is King!
At the end of the day, the choice of a
storage manager must match the
requirements of an application!
If the requirements of an application
call for use of relational technology
with SQL then DO use such a system.
Think as follows: you have options
when it comes to data management,
and you should select the right tool to
get the job done as efficiently,
robustly, and simply as possible.
Homework 1
Download C++ version of Berkeley DB
storage manager, compile it using Visual
Studio, author a project to insert 100 records
into a database.
Each record has the following attributes:
Id: an integer (4 bytes)
MemberName: a variable sized array of
characters constructed by concatenating a string
token (JaneDoe) with the id.
Age: an integer (4 bytes) and a function of the
record Id; Age = 20 + (id % 15)
Salary: an integer (4 bytes) and a function of
age; Salary = 40,000 + (Age * 1000)
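The attribute formulas above can be sketched as a small helper that builds one record from its id. The names MemberRecord and make_record are hypothetical, and inserting the record into Berkeley DB is deliberately left out.

```cpp
#include <cstdint>
#include <string>

// Hypothetical in-memory form of one homework record.
struct MemberRecord {
    std::int32_t id;          // 4-byte integer
    std::string member_name;  // variable-sized: "JaneDoe" followed by the id
    std::int32_t age;         // 20 + (id % 15)
    std::int32_t salary;      // 40000 + (age * 1000)
};

// Derive every attribute from the record id, following the formulas above.
MemberRecord make_record(std::int32_t id) {
    MemberRecord r;
    r.id = id;
    r.member_name = "JaneDoe" + std::to_string(id);
    r.age = 20 + (id % 15);
    r.salary = 40000 + (r.age * 1000);
    return r;
}
```

For id 17 this gives MemberName JaneDoe17, Age 22 (20 + 17 % 15), and Salary 62,000.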
Homework 1
Due date: January 27th lunch time.
Assumption: You’ll use your own
PC/laptop. Let me know if you need
access to resources and we’ll try to
make SAL 200C available.
Please send email by Thursday, Jan
22nd at the latest.
Visual Studio 2005 SP1 is available to
you for free download. Use Google
with keywords “Visual Studio 2005 SP1
Free”
Steps to download BDB
Download Berkeley DB from
http://www.oracle.com/technology/software/products/berkeleydb/index.html
Download Berkeley DB 4.7.25.zip
Extract db-4.7.25 folder
Place it under the projects folder in your Visual Studio
Open project in db-4.7.25\build_windows (help pages in http://doc.gnudarwin.org/ref/build_win/intro.html)
Make sure your Platform is Win32 and Configuration is Debug x86. Check
mark db_dll as your build. Start to compile.
Introduce a new “Win32 Console Application” project with your desired
project name. This will contain the code that you will write for your
homework.
Build your new project to make sure that it does build.
Now, you are ready to add code to your software.
Start by including “db.h” file and the BDB library that you just compiled.
In the “Property Pages” add a new reference to your compiled db_dll
In the “Property Pages”, choose “Configuration Properties”, expand
“C/C++”, choose “General” and type in the absolute path of build_windows
in the “Additional Include Directories”
Make sure your project compiles with a simple “db.h” addition.
Berkeley DB
Start to read the manual (in the docs
directory). Make sure you understand:
A database is a collection of key/data
pairs.
Flags are used to create a database and
open a database.
Use of secondary index structures is
somewhat complicated and requires a
careful read of documentation and sample
code.
Storing Key/Data pair
Make sure you understand the concept
of a pointer and the memory space the
pointer points to.
Storing Key/Data pair
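Data must be a
sequence of bytes.
Good: [figure not recovered: data laid out as one contiguous sequence of bytes]
Do not represent
data as a collection
of pointers.
Bad: [figure not recovered: data represented as a collection of pointers]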
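The difference between a sequence of bytes and a collection of pointers can be sketched as follows. A struct holding a std::string contains a pointer to heap memory, so copying the struct's raw bytes would store an address rather than the characters. The layout below (three 4-byte integers followed by the name characters, in native byte order) and the names Member, pack, and unpack are assumptions chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical in-memory record. member_name owns heap memory through an
// internal pointer, so the struct itself is NOT a storable byte sequence.
struct Member {
    std::int32_t id;
    std::string member_name;
    std::int32_t age;
    std::int32_t salary;
};

// Copy every field's actual contents into one contiguous buffer.
std::vector<char> pack(const Member& m) {
    const std::size_t fixed = 3 * sizeof(std::int32_t);
    std::vector<char> buf(fixed + m.member_name.size());
    char* p = buf.data();
    std::memcpy(p, &m.id, sizeof m.id);         p += sizeof m.id;
    std::memcpy(p, &m.age, sizeof m.age);       p += sizeof m.age;
    std::memcpy(p, &m.salary, sizeof m.salary); p += sizeof m.salary;
    std::memcpy(p, m.member_name.data(), m.member_name.size());
    return buf;
}

// Rebuild the record from the bytes; the name is whatever follows the integers.
Member unpack(const char* data, std::size_t size) {
    const std::size_t fixed = 3 * sizeof(std::int32_t);
    if (size < fixed) throw std::invalid_argument("buffer too small");
    Member m{};
    std::memcpy(&m.id, data, sizeof m.id);
    std::memcpy(&m.age, data + sizeof m.id, sizeof m.age);
    std::memcpy(&m.salary, data + sizeof m.id + sizeof m.age, sizeof m.salary);
    m.member_name.assign(data + fixed, size - fixed);
    return m;
}
```

The buffer returned by pack, together with its length, is the kind of value that can be handed to a key/data store; the record survives a round trip through unpack because no addresses were stored.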