Talk as local powerpoint

Mike Carey
Information Systems Group
Computer Science Department
UC Irvine

Carnegie-Mellon University, 1975-80
 B.S. and M.S. Student, EE/ECE

UC Berkeley, 1980-83
 Ph.D. Student, CS

University of Wisconsin, 1983-95
 Assistant/Associate/Full Professor, CS

IBM, 1995-2000
 Industrial Researcher & Software R&D Manager

Trivia tidbit: Here’s a photo of my first (ever) CS TA

Propel Software, 2000-01
 Startup Company Fellow/CTO/VP of Software

BEA Systems, Inc., 2001-08 (acquired by Oracle)
 Industrial Software Architect & Sr. Engineering Director

And now I’m here…

Okay, so just what is a database system?
 Based on lecture notes from the UW-Madison database
curriculum, as immortalized in Database Management
Systems (Ramakrishnan & Gehrke, a.k.a. “the Cow book”)

The database field is a vertical slice of all of CS!
 You’ll see what I mean (and why)…

What’s exciting in “database systems” today?
 UCI Information Systems Group (ISG) and beyond!

So what’s a database?
 A very large, integrated collection of data

Usually a model of a real-world enterprise or a
history of real-world events
 Entities (e.g., students, courses, Facebook users, …)
 Relationships (e.g., Susan is taking CS 234, Susan is a
friend of Lynn, Mike filed a grade change for Lynn, …)

What’s a database management system (DBMS)?
 A software system designed to store, manage, and
provide access to one or more such databases
[Figure: progression of data management technologies]

Files: Manual Coding
 Byte streams
 Majority of application development effort goes towards
building and then maintaining data access logic

CODASYL/IMS: Early DBMS Technologies
 Records and pointers
 Large, carefully tuned data access programs that have
dependencies on physical access paths, indexes, etc.

Relational: Relational DB Systems
 Declarative approach
 Tables + views bring “data independence”
 Details left to system
 Designed to simplify data-centric application development

New Data: ???

Reduced application development time

Efficient (and automatic!) data access

Data independence

Data integrity and security

Uniform data administration

Concurrent access and recovery from crashes
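
Two of the benefits listed above, data integrity and all-or-nothing updates, can be seen in miniature with a small sketch. It uses Python's built-in sqlite3 module as a stand-in DBMS; the tables are cut-down versions of the ones used later in the talk, and the function name `enroll_all`, the course id `c1`, and the sample rows are made up for illustration. Rolling back a failed transaction is related to, but much simpler than, full crash recovery.

```python
import sqlite3

def enroll_all(conn, sid, cids):
    """Enroll one student in several courses as a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for cid in cids:
                conn.execute(
                    "INSERT INTO Enrolled(sid, cid) VALUES (?, ?)", (sid, cid))
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked
conn.execute("CREATE TABLE Courses(cid TEXT PRIMARY KEY, cname TEXT)")
conn.execute("""CREATE TABLE Enrolled(
    sid TEXT,
    cid TEXT REFERENCES Courses(cid),
    PRIMARY KEY (sid, cid))""")
conn.execute("INSERT INTO Courses VALUES ('c1', 'Computer Game Design')")
conn.commit()

# The second course id does not exist, so the DBMS rejects the insert
# and the whole transaction, including the first insert, is undone.
ok = enroll_all(conn, "s1", ["c1", "no-such-course"])
count = conn.execute("SELECT count(*) FROM Enrolled").fetchone()[0]
print(ok, count)
```

The application never checks whether a course exists or cleans up a half-finished enrollment; the declared constraint and the transaction boundary do that work.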

Shift from computation to information
 At the “low end”: explosion of the web (a mess!)
 At the “high end”: scientific applications

Datasets increasing in diversity and volume
 Digital libraries, interactive video, social media,
genomic data, big science data, …
 ... need for DBMS exploding!

DBMS field encompasses most of CS
 OS, languages, theory, AI, multimedia, logic, …
A data model is a collection of concepts for
describing data (to one another or to a DBMS)
 A schema is a description of a particular collection of
data, using a given data model
 The relational model is the most widely used data
model today
▪ Relation – basically a table with rows and (named) columns
▪ Schema – describes the tables and their columns
Lies!

Many views of one conceptual (logical) schema and an
underlying physical schema

Views describe how different users or groups see the data

Conceptual schema defines the logical structure of the database

Physical schema describes the files and indexes used
“under the covers”

[Figure: View 1, View 2, and View 3 sit on top of the Conceptual Schema
(logical model), which sits on top of the Physical Schema (on-disk data
structures), which sits on top of the stored bits]

Conceptual schema:
 Students(sid: string, name: string, login: string,
age: integer, gpa: real)
 Courses(cid: string, cname: string, credits: integer)
 Enrolled(sid: string, cid: string, grade: string)

Physical schema:
 Relations each stored as unordered files
 Have indexes on first and third columns of Students

External schema (a.k.a. view):
 CourseInfo(cid: string, cname: string,
enrollment: integer)
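
The three schema levels above can be written down as SQL DDL. This is a rough sketch using Python's built-in sqlite3 module, not the setup used in the talk: the index names (`students_sid`, `students_login`) and the exact definition of the `CourseInfo` view are assumptions, and SQLite keeps each table in a B-tree rather than an unordered file, so only the index part of the physical schema is mirrored here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Conceptual schema: the three relations
CREATE TABLE Students(sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL);
CREATE TABLE Courses(cid TEXT, cname TEXT, credits INTEGER);
CREATE TABLE Enrolled(sid TEXT, cid TEXT, grade TEXT);

-- Physical schema: indexes on the first and third columns of Students
CREATE INDEX students_sid ON Students(sid);
CREATE INDEX students_login ON Students(login);

-- External schema: CourseInfo is derived from the base tables (assumed definition)
CREATE VIEW CourseInfo AS
  SELECT c.cid, c.cname, count(*) AS enrollment
  FROM Enrolled e, Courses c
  WHERE e.cid = c.cid
  GROUP BY c.cid, c.cname;
""")

# The catalog records every object, whichever level it belongs to.
objects = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name").fetchall()
for obj_type, name in objects:
    print(obj_type, name)
```

Applications written against `CourseInfo` only ever name the view; the tables and indexes underneath are free to change.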

Applications are insulated from how data is actually
structured and stored!

Logical data independence: Protection from
changes in the logical structure of data

Physical data independence: Protection from
changes in the physical structure of data

One of the most important benefits of using a DBMS!
 Allows changes to be made w/o application rewrites
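
Physical data independence can be checked directly: change the physical schema and rerun the same query text. The sketch below again uses sqlite3 as a stand-in DBMS; the generated rows, the index name `students_login`, and the helper `run` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students(sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [(f"s{i}", f"name{i}", f"login{i}", 18 + i % 10, 2.0 + (i % 20) / 10)
     for i in range(1000)])

# The application's query: this text never changes.
QUERY = "SELECT name FROM Students WHERE login = ?"

def run(conn):
    """Return the query's rows plus the access plan the system chose."""
    rows = conn.execute(QUERY, ("login42",)).fetchall()
    plan_rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, ("login42",))
    plan = " ".join(row[3] for row in plan_rows)  # column 3 holds the plan text
    return rows, plan

rows_before, plan_before = run(conn)

# Change the physical schema only: add an index on login.
conn.execute("CREATE INDEX students_login ON Students(login)")

rows_after, plan_after = run(conn)
print(rows_before == rows_after)
print(plan_before)
print(plan_after)
```

The answer is identical before and after; only the plan changes, and the application is never told.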

User query (in SQL, against the external schema):

SELECT c.cid, c.enrollment
FROM CourseInfo c
WHERE c.cname = 'Computer Game Design'

Equivalent query (against the conceptual schema):

SELECT c.cid, count(*) AS enrollment
FROM Enrolled e, Courses c
WHERE e.cid = c.cid AND c.cname = 'Computer Game Design'
GROUP BY c.cid
Under the hood (against the physical schema)
 Access Courses – use index on cname to find associated cid
 Access Enrolled – use index on cid to count the enrollments
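
The claim that the two queries are equivalent can be tried out on a toy database. This is a minimal sketch using sqlite3: the view definition, the index names, and the sample rows are assumptions, and the conceptual-schema query is written with `count(*)` and `c.cid` in the SELECT list so that it runs on any SQL engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Courses(cid TEXT, cname TEXT, credits INTEGER);
CREATE TABLE Enrolled(sid TEXT, cid TEXT, grade TEXT);

-- Indexes matching the "under the hood" access paths
CREATE INDEX courses_cname ON Courses(cname);
CREATE INDEX enrolled_cid ON Enrolled(cid);

-- Assumed definition of the external schema
CREATE VIEW CourseInfo AS
  SELECT c.cid, c.cname, count(*) AS enrollment
  FROM Enrolled e, Courses c
  WHERE e.cid = c.cid
  GROUP BY c.cid, c.cname;

INSERT INTO Courses VALUES ('c1', 'Computer Game Design', 4);
INSERT INTO Courses VALUES ('c2', 'Databases', 4);
INSERT INTO Enrolled VALUES ('s1', 'c1', 'A');
INSERT INTO Enrolled VALUES ('s2', 'c1', 'B');
INSERT INTO Enrolled VALUES ('s1', 'c2', 'A');
""")

# User query, against the external schema
via_view = conn.execute("""
    SELECT c.cid, c.enrollment
    FROM CourseInfo c
    WHERE c.cname = 'Computer Game Design'
""").fetchall()

# Equivalent query, against the conceptual schema
via_tables = conn.execute("""
    SELECT c.cid, count(*)
    FROM Enrolled e, Courses c
    WHERE e.cid = c.cid AND c.cname = 'Computer Game Design'
    GROUP BY c.cid
""").fetchall()

print(via_view, via_tables)
```

Both queries report two enrollments for course `c1`; the user of the view never has to know that `enrollment` is computed rather than stored.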



A typical DBMS has a layered architecture

The figure doesn’t show the concurrency control
and recovery components

This is one of several possible architectures;
each actual system has its own variations

[Figure: layers, top to bottom: Queries, Query Optimization and Execution,
Relational Operators, Files and Access Methods, Buffer Management,
Disk Space Management, DB. Note: These layers must consider concurrency
control and recovery]

“I like programming languages and compilers”
 Consider high-level, declarative languages like SQL

“I like low-level operating systems issues”
 DBMSs manage records, memory, locks, logs, …

“I really want to work on distributed systems”
 Distributed and parallel database systems are rife with
distributed algorithms and systems issues (!)

“Data structure and algorithm design is really cool”
 Database indexes are data structures on disk (or flash)

(And so on!)

The Web is full of database challenges (“Big Data”!)
 A box for keywords only goes so far…
▪ How can I query the web, e.g., “Find me 5-string Fender bass guitars
for sale in the $1000-1500 price range”?
 Click streams and social networks generate lots of data
▪ How can I query and analyze all that data (e.g., to act on it)?

Ubiquitous computing is data-rich, too (IoT)
 Build, deploy, and use location-based data services
 Query and aggregate streams of sensor or video data

There’s data everywhere, and of all shapes and sizes
 How do we integrate it, e.g., for rapid crisis response?
 And when we do, how do we ensure privacy/security?

Data store for low-latency, high-traffic Web sites
 Only have a few hundred milliseconds to generate an entire page
 Data heavily cached outside the DBMS today, which is “far from ideal”

Data systems for offline/batch-oriented processing
 I mentioned this before: clickstream analysis, graph analysis, etc.
 Potentially interested in faster, approximate answers
 Would like to do this in real time as well, as data arrives

Hardware trends (always) present new opportunities
 Flash storage, for example
 Multicore CPUs (nobody uses them super well yet)

Some open source work from Facebook related to DBs
 Hive: Open source SQL on top of Hadoop
 Cassandra: Large-scale distributed storage for semistructured data
[Figure: ASTERIX system overview. Three nodes, each with CPU(s), main
memory, and a disk holding ADM data, are joined by a hi-speed interconnect.
Inputs: data loads & feeds from external sources (XML, JSON, …) and
AQL queries & scripting requests and programs. Output: data publishing
to external sources and apps]

ASTERIX Goal: To ingest, digest, persist, index, manage, query,
analyze, and publish massive quantities of semi-structured information…
(ADM = ASTERIX Data Model, AQL = ASTERIX Query Language)


A DBMS is for storing and querying big datasets
Benefits of using one are many: rapid development of
new applications (“what, not how”), recovery after crashes,
support for (safe) concurrent access, help in ensuring data
integrity and security, …
Levels of schema abstraction → data independence
DB research is a vertical slice of all of CS (“for data”)
 Big Data experts are in high industrial demand!
 Data is what it’s all about today! So, consider
taking our three classes: CS 122A/B/C (and
occasionally offered special topics classes)


