Introduction to Data Science
Lecture 3
Manipulating Tabular Data
CS 194 Spring 2014
Michael Franklin
Dan Bruckner, Evan Sparks,
Shivaram Venkataraman
Outline for this Evening
• Lecture – Data Models, Tables, Structure, etc.
– SQL, NoSQL
– Schema on Read vs. Schema on Write
– Non-Tabular Structures
• Exercise – Data Manipulation with Pandas
(Shivaram: Tutorial and Lab)
• Review of exercise
• About the course
– Overview of Assignments and rough schedule
Data Science – One Definition
The Big Picture
Extract
Transform
Load
LET’S TALK ABOUT STRUCTURE
The Structure Spectrum
Structured (schema-first): Relational Database, Formatted Messages
Semi-Structured (schema-later): Documents, XML, Tagged Text/Media
Unstructured (schema-never): Plain Text, Media
When people use the word database,
fundamentally what they are saying is that the
data should be self-describing and it should
have a schema. That’s really all the word
database means.
-- Jim Gray, “The Fourth Paradigm”
Key Concept: Structured Data
A data model is a collection of concepts for
describing data.
A schema is a description of a particular
collection of data, using a given data model.
The Relational Model*
• The Relational Model is Ubiquitous
• MySQL, PostgreSQL, Oracle, DB2, SQLServer, …
• Foundational work done at
• IBM Santa Teresa Labs (now IBM Almaden): “System R”
• UC Berkeley CS – the “Ingres” System
• Note: some Legacy systems use older models
• e.g., IBM’s IMS
(E. F. “Ted” Codd, Turing Award 1981)
• Object-oriented concepts have been merged in
• Early work: POSTGRES research project at Berkeley
• Informix, IBM DB2, Oracle 8i
• As has support for XML (semi-structured data)
*Codd, E. F. (1970). "A relational model of data for large shared data banks".
Communications of the ACM 13 (6): 37
Relational Database: Definitions
• Relational database: a set of relations
• Relation: made up of 2 parts:
Schema: specifies name of relation, plus name and type of each column
Students(sid: string, name: string, login: string, age: integer, gpa: real)
Instance: the actual data at a given time
• #rows = cardinality
• #fields = degree / arity
Ex: Instance of Students Relation
sid    name   login       age  gpa
53666  Jones  jones@cs    18   3.4
53688  Smith  smith@eecs  18   3.2
53650  Smith  smith@math  19   3.8
• Cardinality = 3, arity = 5, all rows distinct
• Do all values in each column of a relation instance
have to be unique?
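A minimal sketch of these definitions in Python, assuming the relation instance is held as a list of tuples (the variable names are illustrative and not from the lecture). It computes the cardinality and arity of the Students instance and checks which columns happen to hold unique values.

```python
# Schema of the Students relation: (column name, type) pairs.
schema = [("sid", str), ("name", str), ("login", str), ("age", int), ("gpa", float)]

# Instance: the rows present at a given time.
students = [
    ("53666", "Jones", "jones@cs", 18, 3.4),
    ("53688", "Smith", "smith@eecs", 18, 3.2),
    ("53650", "Smith", "smith@math", 19, 3.8),
]

cardinality = len(students)   # number of rows
arity = len(schema)           # number of fields
rows_distinct = len(set(students)) == cardinality

# A column is unique if no value repeats across rows.
unique_columns = [
    name
    for i, (name, _type) in enumerate(schema)
    if len({row[i] for row in students}) == cardinality
]

print(cardinality, arity, rows_distinct)
print(unique_columns)
```

In this instance the rows are all distinct, yet name and age both repeat, so individual columns need not be unique.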
SQL - A language for Relational DBs
• Say: “ess-cue-ell” or “sequel”
– But spelled “SQL”
• Data Definition Language (DDL)
– create, modify, delete relations
– specify constraints
– administer users, security, etc.
• Data Manipulation Language (DML)
– Specify queries to find tuples that satisfy criteria
– add, modify, remove tuples
• The DBMS is responsible for efficient evaluation.
Creating Relations in SQL
• Creates the Students relation.
– Note: the type (domain) of each field is specified,
and enforced by the DBMS whenever tuples are
added or modified.
CREATE TABLE Students
(sid CHAR(20),
name CHAR(20),
login CHAR(10),
age INTEGER,
gpa FLOAT)
Table Creation (continued)
• Another example: the Enrolled table holds
information about courses students take.
CREATE TABLE Enrolled
(sid CHAR(20),
cid CHAR(20),
grade CHAR(2))
Adding and Deleting Tuples
• Can insert a single tuple using:
INSERT INTO Students (sid, name, login, age, gpa)
VALUES ('53688', 'Smith', 'smith@ee', 18, 3.2)
• Can delete all tuples satisfying some condition (e.g., name = Smith):
DELETE
FROM Students S
WHERE S.name = 'Smith'
Powerful variants of these commands are available;
more later!
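The statements above can be tried with Python's built-in sqlite3 module. This is a sketch that assumes SQLite as a stand-in DBMS and an in-memory database; the second inserted row is taken from the Students instance shown earlier. Note that SQLite applies type affinity rather than rejecting mismatched values, so it is looser than the domain enforcement described for CREATE TABLE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# DDL: create the Students relation.
conn.execute(
    "CREATE TABLE Students "
    "(sid CHAR(20), name CHAR(20), login CHAR(10), age INTEGER, gpa FLOAT)"
)

# DML: insert tuples one at a time, using placeholders for the values.
insert_sql = "INSERT INTO Students (sid, name, login, age, gpa) VALUES (?, ?, ?, ?, ?)"
conn.execute(insert_sql, ("53688", "Smith", "smith@ee", 18, 3.2))
conn.execute(insert_sql, ("53666", "Jones", "jones@cs", 18, 3.4))

# DML: delete every tuple whose name is Smith.
deleted = conn.execute("DELETE FROM Students WHERE name = 'Smith'").rowcount
remaining = conn.execute("SELECT sid, name FROM Students").fetchall()
print(deleted, remaining)
```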
Queries in SQL
• Single-table queries are straightforward.
• To find all 18 year old students, we can write:
SELECT *
FROM Students S
WHERE S.age=18
• To find just names and logins, replace the first line:
SELECT S.name, S.login
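A short sketch of both single-table queries, again assuming SQLite via Python's sqlite3 module and the three-row Students instance from earlier. The ORDER BY clause is an addition here, used only to make the row order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@eecs", 18, 3.2),
        ("53650", "Smith", "smith@math", 19, 3.8),
    ],
)

# All fields of the 18 year old students.
all_fields = conn.execute(
    "SELECT * FROM Students S WHERE S.age = 18 ORDER BY S.sid"
).fetchall()

# Same query, but only names and logins.
names_logins = conn.execute(
    "SELECT S.name, S.login FROM Students S WHERE S.age = 18 ORDER BY S.sid"
).fetchall()

print(all_fields)
print(names_logins)
```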
Querying Multiple Relations
• Can specify a join over two tables as follows:
SELECT S.name, E.cid
FROM Students S, Enrolled E
WHERE S.sid=E.sid AND E.grade='B'
Enrolled:
sid    cid          grade
53831  Carnatic101  C
53831  Reggae203    B
53650  Topology112  A
53666  History105   B
Students:
sid    name   login     age  gpa
53666  Jones  jones@cs  18   3.4
53688  Smith  smith@ee  18   3.2
Note: obviously no referential integrity constraints have been used here.
result =
S.name  E.cid
Jones   History105
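The join can be reproduced with the two small tables shown on this slide. The sketch below assumes SQLite through Python's sqlite3 module; the column types are simplified to TEXT, INTEGER and REAL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
conn.execute("CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT)")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@ee", 18, 3.2),
    ],
)
conn.executemany(
    "INSERT INTO Enrolled VALUES (?, ?, ?)",
    [
        ("53831", "Carnatic101", "C"),
        ("53831", "Reggae203", "B"),
        ("53650", "Topology112", "A"),
        ("53666", "History105", "B"),
    ],
)

# Join Students and Enrolled on sid, keeping only grade B enrollments.
result = conn.execute(
    "SELECT S.name, E.cid FROM Students S, Enrolled E "
    "WHERE S.sid = E.sid AND E.grade = 'B'"
).fetchall()
print(result)
```

Only sid 53666 appears in both tables with a grade of B, so a single row comes back. The Enrolled rows for 53831 and 53650 have no matching student, which is the missing referential integrity the note points out.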
Basic SQL Query
SELECT [DISTINCT] target-list
FROM relation-list
WHERE qualification
• relation-list : A list of relation names
• possibly with a range-variable after each name
• target-list : A list of attributes of tables in relation-list
• qualification : Comparisons combined using AND, OR and NOT.
• Comparisons are Attr op const or Attr1 op Attr2, where op is one of =, ≠, <, >, ≤, ≥
• DISTINCT: optional keyword indicating that the
answer should not contain duplicates.
• In SQL SELECT, the default is that duplicates are not
eliminated! (Result is called a “multiset”)
SQL Query Semantics
Semantics of an SQL query are defined in terms
of the following conceptual evaluation strategy:
1. do FROM clause: compute cross-product of tables
(e.g., Students and Enrolled).
2. do WHERE clause: Check conditions, discard tuples
that fail. (i.e., “selection”).
3. do SELECT clause: Delete unwanted fields. (i.e.,
“projection”).
4. If DISTINCT specified, eliminate duplicate rows.
Probably the least efficient way to compute a
query!
– An optimizer will find more efficient strategies to
get the same answer.
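The four conceptual steps can be written out literally in plain Python. This sketch evaluates the earlier join query (S.sid = E.sid AND E.grade = 'B') over in-memory tuples; the function name evaluate is hypothetical, and a real DBMS would not execute the query this way.

```python
from itertools import product

students = [  # (sid, name, login, age, gpa)
    ("53666", "Jones", "jones@cs", 18, 3.4),
    ("53688", "Smith", "smith@ee", 18, 3.2),
]
enrolled = [  # (sid, cid, grade)
    ("53831", "Carnatic101", "C"),
    ("53831", "Reggae203", "B"),
    ("53650", "Topology112", "A"),
    ("53666", "History105", "B"),
]


def evaluate(distinct=False):
    # 1. FROM: compute the cross-product of the two tables.
    crossed = product(students, enrolled)
    # 2. WHERE: keep pairs with S.sid = E.sid AND E.grade = 'B'.
    selected = [(s, e) for s, e in crossed if s[0] == e[0] and e[2] == "B"]
    # 3. SELECT: project onto S.name, E.cid.
    projected = [(s[1], e[1]) for s, e in selected]
    # 4. DISTINCT: drop duplicate rows, keeping first occurrences.
    if distinct:
        projected = list(dict.fromkeys(projected))
    return projected


print(evaluate())
```

The cross-product here has 2 × 4 = 8 pairs, of which one survives the WHERE step. That growth is why this strategy defines the meaning of a query rather than how to run it.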
Data Model (Tabular)
• SQLite
– Table: fixed number of named columns of specified
type
– 5 storage classes for columns
• NULL
• INTEGER
• REAL
• TEXT
• BLOB
– Data stored on disk in a single file in row-major order
– Operations performed via sqlite3 shell
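The five storage classes can be observed with SQLite's typeof() function. The sketch below uses Python's sqlite3 module instead of the sqlite3 shell, and the literal values are arbitrary examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# typeof() reports the storage class of each value.
row = conn.execute(
    "SELECT typeof(NULL), typeof(42), typeof(3.2), typeof('smith@ee'), typeof(x'00ff')"
).fetchone()
print(row)
```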
OTHER “TABLE-LIKE” DATA MODELS
Data Model (Tabular)
• Python
– DataFrame: a dict of Series objects
• Each Series object represents a column
– Series: a named, ordered dictionary
• The keys of the dictionary are the indexes
• Built on NumPy’s ndarray
• Values can be any NumPy data type object
– Data stored in memory
– Operations performed from Python shell
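A small sketch of this data model, assuming pandas is installed. It builds a DataFrame from a dict of Series, reusing values from the Students example; using sid as the index is a choice made for this illustration.

```python
import pandas as pd

index = ["53666", "53688", "53650"]  # sid values used as the index

# A DataFrame built from a dict of Series: one Series per column.
df = pd.DataFrame(
    {
        "name": pd.Series(["Jones", "Smith", "Smith"], index=index),
        "age": pd.Series([18, 18, 19], index=index),
        "gpa": pd.Series([3.4, 3.2, 3.8], index=index),
    }
)

age = df["age"]                # selecting one column returns a Series
print(type(age).__name__)
print(age["53650"])            # look up a value by its index key
print(df.shape)                # (number of rows, number of columns)
```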
Operations
• integrate (join), transform, clean, impute
• aggregate: sum, count, average, max, min
• sort
• pivot
• Relational
– union, intersection, difference, cartesian product (CROSS JOIN)
– select/filter, project
– join: natural join (INNER JOIN), theta join, semi-join, etc.
– rename
Operations
• Summary() (descriptive statistics)
• map()
• Pandas
– Group By/split-apply-combine
(aggregation/transformation)
– Merge/join
– Pivot/reshape
– Sampling
• Add/remove columns (feature enrichment)
• Clone (CTAS)
• Chaining (correlated subqueries)
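A brief sketch of three of these operations in pandas, assuming pandas is installed. The data reuses the Students and Enrolled examples, and the honors column and its 3.5 threshold are hypothetical.

```python
import pandas as pd

students = pd.DataFrame(
    {
        "sid": ["53666", "53688", "53650"],
        "name": ["Jones", "Smith", "Smith"],
        "age": [18, 18, 19],
        "gpa": [3.4, 3.2, 3.8],
    }
)
enrolled = pd.DataFrame(
    {
        "sid": ["53831", "53831", "53650", "53666"],
        "cid": ["Carnatic101", "Reggae203", "Topology112", "History105"],
        "grade": ["C", "B", "A", "B"],
    }
)

# Group By / split-apply-combine: mean gpa per age.
mean_gpa = students.groupby("age")["gpa"].mean()

# Merge/join: inner join of the two frames on sid.
joined = students.merge(enrolled, on="sid", how="inner")

# Add a column (feature enrichment); the threshold is made up.
students["honors"] = students["gpa"] >= 3.5

print(mean_gpa)
print(joined[["name", "cid"]])
print(students)
```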
Data Model (Tabular)
• R
– data.frame: a list of vector objects
• Each vector object represents a column
– Possible vector types
• logical, integer, double, complex, character, raw
– Data stored in memory
– Operations performed from the R shell
What’s Wrong with Tables?
• Too limited in structure?
• Too rigid?
• Too old fashioned?
BEYOND TABLES
Column Family Data Models
• Roots in “Big Table” system at Google
• Used in Cassandra and other Key Value Stores
NoSQL Storage Systems
CouchDB Data Model (JSON)
• “With CouchDB, no schema is enforced, so new
document types with new meaning can be safely
added alongside the old.”
• A CouchDB document is an object that consists of
named fields. Field values may be:
– strings, numbers, dates,
– ordered lists, associative maps
Example document:
"Subject": "I like Plankton"
"Author": "Rusty"
"PostedDate": "5/23/2006"
"Tags": ["plankton", "baseball", "decisions"]
"Body": "I decided today that I don't like baseball. I like plankton."
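To make the "no schema is enforced" point concrete, here is a sketch using only Python's json module. It does not use any CouchDB API; the second document (comment) and its field names are hypothetical.

```python
import json

# The example document, parsed from JSON text.
post = json.loads("""
{
  "Subject": "I like Plankton",
  "Author": "Rusty",
  "PostedDate": "5/23/2006",
  "Tags": ["plankton", "baseball", "decisions"],
  "Body": "I decided today that I don't like baseball. I like plankton."
}
""")

# A hypothetical second document type with different fields; nothing
# forces it to match the shape of the blog post above.
comment = {"Author": "Rusty", "InReplyTo": "I like Plankton", "Text": "Still true."}

documents = [post, comment]

# Readers must cope with missing fields, since no schema guarantees them.
tags_per_doc = [doc.get("Tags", []) for doc in documents]
print(tags_per_doc)
```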
MongoDB Data Model
“With Mongo, you do less "normalization" than you would perform designing a relational schema because there are no server-side joins. Generally, you will want one database collection for each of your top level objects.”
-- from the MongoDB manual
Dremel Nested Data Model
Schema: Teaching a Pig to Sing?
• “Pig Latin” [Olston et al. SIGMOD 08]
• Why have a schema?
1) Referential (and other) Consistency
2) Fast point look ups through indexes
3) Curation for future (other) users
• But many Big Data Workloads
• Are Read Mostly/Append Only
• Scan (not look up) Focused
• On fairly Transient data sets
• Pig (and other NoSQL systems) have:
– A flexible, optional, nested data model
– Data that remains in files (no admin)
Q: What about Query Optimization?
Pig
• Started at Yahoo! Research
• Runs about 50% of Yahoo!’s jobs
• Features:
– Expresses sequences of MapReduce jobs
– Data model: nested “bags” of items
• Schema is optional
– Provides relational (SQL) operators
(JOIN, GROUP BY, etc)
– Easy to plug in Java functions
An Example Problem
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25.
Load Users
Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
In MapReduce
In Pig Latin
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
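For comparison, the same dataflow can be sketched in a few lines of plain Python. The sample users and pages below are invented stand-ins for the 'users' and 'pages' files, and the membership test used for the join assumes each user name appears only once in the users data.

```python
from collections import Counter

# Hypothetical sample inputs standing in for the 'users' and 'pages' files.
users = [("amy", 22), ("bob", 31), ("cal", 19)]
pages = [
    ("amy", "a.com"),
    ("amy", "b.com"),
    ("bob", "a.com"),
    ("cal", "a.com"),
    ("cal", "c.com"),
]

# Filter by age.
filtered = {name for name, age in users if 18 <= age <= 25}

# Join on name, then group on url and count clicks.
clicks = Counter(url for user, url in pages if user in filtered)

# Order by clicks (descending) and take the top 5.
top5 = clicks.most_common(5)
print(top5)
```

This runs on one machine over in-memory lists. Pig's contribution is to compile the same sequence of steps into MapReduce jobs over files.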
Translation to MapReduce
Notice how naturally the components of the job translate into Pig Latin.
Dataflow step    Pig Latin
Load Users       Users = load …
Filter by age    Filtered = filter …
Load Pages       Pages = load …
Join on name     Joined = join …
Group on url     Grouped = group …
Count clicks     Summed = … count()…
Order by clicks  Sorted = order …
Take top 5       Top5 = limit …
Translation to MapReduce
Notice how naturally the components of the job translate into Pig Latin.
Dataflow step    Pig Latin              MapReduce job
Load Users       Users = load …         Job 1
Filter by age    Filtered = filter …    Job 1
Load Pages       Pages = load …         Job 1
Join on name     Joined = join …        Job 1
Group on url     Grouped = group …      Job 2
Count clicks     Summed = … count()…    Job 2
Order by clicks  Sorted = order …       Job 3
Take top 5       Top5 = limit …         Job 3
Hive
• Developed at Facebook
• Relational database built on Hadoop
– Maintains table schemas
– SQL-like query language (which can also call
Hadoop Streaming scripts)
– Supports table partitioning,
complex data types, sampling,
some query optimization
• Used for most Facebook jobs
– Less than 1% of daily jobs at Facebook use
MapReduce directly!!! (SQL – or PIG – wins!)
– Note: Google also has several SQL-like systems in use.
(Thanks to David DeWitt for the following slides on Hive)
Tables in Hive
• Like a relational DBMS, data stored in tables
• Richer column types than SQL
– Primitive types: ints, floats, strings, date
– Complex types: associative arrays, lists, structs
Example:
CREATE TABLE Employees
(
Name STRING,
Salary INT,
Children ARRAY<STRUCT<firstName: STRING, DOB: DATE>>
)
(credit: David DeWitt)
Hive Data Storage
• Like a parallel DBMS, Hive tables can be partitioned
• Example data file: Sales(custID, zipCode, date, amount) partitioned by state
Hive DDL:
Create Table Sales(
custID INT,
zipCode STRING,
date DATE,
amount FLOAT)
Partitioned By (state STRING)
HDFS Directory: Sales (1 HDFS file per state)
Alabama
custID  zipCode  …
201     12345    …
105     12345    …
933     12345    …
Alaska
custID  zipCode  …
13      54321    …
67      54321    …
45      74321    …
…
Wyoming
custID  zipCode  …
78      99221    …
345     99221    …
821     99221    …
HiveQL Example #1
Sales(custID, zipCode, date, amount) partitioned by state
(Figure: HDFS directory Sales with one HDFS file per state partition, Alabama through Wyoming, each holding custID, zipCode, … rows.)
Query 1: For the last 30 days obtain total sales by zipCode:
SELECT zipCode, sum(amount)
FROM Sales
WHERE date > getDate()-30 AND date < getDate()
GROUP BY zipCode
Query will be executed against all 50 partitions of Sales
HiveQL Example #2
Sales(custID, zipCode, date, amount) partitioned by state
(Figure: HDFS directory Sales with one HDFS file per state partition, Alabama through Wyoming, each holding custID, zipCode, … rows.)
Query 2: For the last 30 days obtain total sales by zipCode for Alabama:
SELECT zipCode, sum(amount)
FROM Sales
WHERE State = 'Alabama' AND
date > getDate()-30 AND date < getDate()
GROUP BY zipCode
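The difference between the two queries is which partitions must be scanned. The sketch below models partitions as a Python dict keyed by state; the amounts and the assignment of rows to states are invented, the date filter is left out for brevity, and total_sales_by_zip is a hypothetical helper rather than anything Hive provides.

```python
from collections import defaultdict

# Hypothetical partitions: one list of (custID, zipCode, amount) rows per state.
partitions = {
    "Alabama": [(201, "12345", 10.0), (105, "12345", 5.0), (933, "12345", 2.5)],
    "Alaska": [(13, "54321", 7.0), (67, "54321", 1.0), (45, "74321", 4.0)],
    "Wyoming": [(78, "99221", 3.0), (345, "99221", 6.0), (821, "99221", 9.0)],
}


def total_sales_by_zip(state=None):
    """Sum amount per zipCode, scanning only the partitions that can match."""
    # A predicate on the partitioning column narrows the scan to one partition.
    scanned = [state] if state is not None else list(partitions)
    totals = defaultdict(float)
    for name in scanned:
        for _cust_id, zip_code, amount in partitions[name]:
            totals[zip_code] += amount
    return dict(totals), scanned


print(total_sales_by_zip())           # like Query 1: every partition is read
print(total_sales_by_zip("Alabama"))  # like Query 2: only Alabama is read
```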
Whither Schemas?
DB: Schemas are necessary for correctness, robustness, performance and evolvability
NoSQL: a) Schemas keep me from getting my job done. b) Messy data is reality
Fact: Most of the world’s data is unstructured.
Not “IF” But “WHEN”?
• “Schema on Write”
– Traditional Approach
• “Schema on Read”
– Data is simply copied to the file store, no
transformation is needed.
– A SerDe (Serializer/Deserializer) is applied during read
time to extract the required columns (late binding)
– New data can start flowing anytime and will appear
retroactively once the SerDe is updated to parse it.
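A toy illustration of schema on read in plain Python. The pipe-delimited record format and the serde_v1 / serde_v2 functions are hypothetical; they stand in for a real SerDe only to show late binding and the retroactive effect of updating the deserializer.

```python
# Raw records are stored exactly as they arrived; no schema is applied on write.
raw_store = [
    "53666|Jones|18",
    "53688|Smith|18|smith@ee",  # a newer record carrying an extra field
]


def serde_v1(line):
    """Deserializer that only knows the first three fields."""
    sid, name, age = line.split("|")[:3]
    return {"sid": sid, "name": name, "age": int(age)}


def serde_v2(line):
    """Updated deserializer that also extracts login when present."""
    record = serde_v1(line)
    fields = line.split("|")
    record["login"] = fields[3] if len(fields) > 3 else None
    return record


def read(serde):
    # The schema is bound here, at read time (late binding).
    return [serde(line) for line in raw_store]


print(read(serde_v1))
print(read(serde_v2))
```

The stored lines never change. Swapping serde_v1 for serde_v2 makes the login field visible for data that was already in the store.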
DATASPACES – WHAT ARE THEY?
What’s in a name?
M. Franklin
BNCOD 2009
7 July 2009
Dataspaces
• Inclusive
– Deal with all the data of interest, in whatever form
• Co-existence not Integration
– No integrated schema, no single warehouse, no ownership required
• Pay-as-you-go
– Keyword search is bare minimum.
– More function and increased consistency as you add work.
Compare to Data Integration
A quintessential schema-first approach.
(Figure: a Mediated Schema linked by Semantic mappings to five wrappers. Courtesy of Alon Halevy)
Whither Structured Data?
• Conventional Wisdom: only 20% of data is structured.
• Decreasing due to:
– Consumer applications
– Enterprise search
– Media applications
But Structure Matters!
Structure enables computers to help users manipulate and maintain the data.
(Figure: Functionality plotted against Time (and cost) for three approaches: Structured (schema-first), Dataspaces (pay-as-you-go), and Unstructured (schema-less).)
An Alternative View
(Figure: systems positioned along two axes, Administrative Control (weak to strong) and Semantic Integration (weak to strong): Virtual Organization, Federated DBMS, DBMS, Web Search, Desktop Search.)
A Recent Example
• Hadapt’s “Schemaless SQL”
“Schemaless SQL”
• Schema Evolution – adding a column
Next Time
• Data Cleaning and Integration
• But now – time for a Pandas query processing and analytics exercise