SasDay-SQL-Hendel

Download Report

Transcript SasDay-SQL-Hendel

SQL and SAS
[email protected], Ph.d A.S.A
SAS DAY – Oct 31 2007
OVERVIEW - GOALS
 What is SQL?
 When do you use it?
 Virtues of SQL over SAS –
readability, ease of learning
 Virtues of SAS over SQL – speed,compactness
 How can SQL help me when thinking about
requirements for projects? (For laypeople)
 NO PREREQUISITES NEEDED
METHOD TO PRESENT
 We will use a THEMATIC presentation
 We will take one project-the THEME
 And show how each step can be done in
SQL and SAS
 Along the way we will learn everything
we need to know about SQL and its
virtues
THEMATIC PROJECT
 START FILE
 File with Medicare Eligibles
 Contains state, county, codes (SSA and
FIPS), zip, gender, agegroup, Pt A
entitlement, Pt A term, Pt B entitlement,
Part B term, Eligible count for this record
THEMATIC PROJECT
 END FILE
 Want Subtotals and aggregation:
 Official State, County, Code list used in Office
of Actuary, along with All Part D eligibles
(Entitled to Part A or B)
 May also want Part A and B eligibles (Part C)
 Goal of presentation: To get FROM the START
TO the END
SMALL OVERVIEW OF SQL
 What is sql?
 SQL is a computer language designed for
databases
 SQL is a summary of several dozen attempts
to create database computer languages
 It was found that all such languages had 8
items in common. SQL has these 8 items and
is a COMPUTER STANDARD for any database
language
 Each database language should be SQL
compliant (DB2, SAS, ORACLE, etc)
OVERVIEW OF SQL
 A database FILE is simply a file with ROWS
and COLUMNS
 Each ROW is a RECORD
 Each COLUMN is a FIELD
 The most typical example is say the children in
your family
 Each child would have Fields associated with
them FirstName, Age, School they attend,
Birthday etc
 This information is STORED in a rectangular
table
OVERVIEW OF SQL
 A DATABASE is a collection of DATABASE
FILES
 For example besides your database file for
your family you may have a database from the
yellow pages which lists for each age group
stores carrying clothing, toys and other FIELDS
 You can use the DATABASE to find out what
stores are good for each child to buy what they
want
OVERVIEW OF SQL
 We just explained what a DATABASE is
 We also gave a simple example
 Now we explain what a DATABASE
LANGUAGE is
 That is we explain what you would want to do
with your database files
 It turns out there are 8 operations that every
database should be able to do
 We will give an alternate version of these 8
basic operations that are used in practice
SQL OVERVIEW
 A database language should enable you to
 1)Take a SUBSET OF ROWS of a database
file (For example: All children in your family
under 5) In sql we call this SELECTING
 2) Take a SUBSET OF COLUMNS (e.g.
FirstName and School) In SQL we call this
PROJECTING
 3) Make a new database file by UNIONing two
database files with common fields
SQL OVERVIEW
 4) JOIN two tables by keeping the number of
rows the same but adding columns(For
example take my family database file and add
a column showing the store where to buy
clothing for that child)
 5) Suppose you take your FAMILY DATABASE
FILE and JOIN to it your EXPENSES for each
child(Another database file).
 You might want to SUBTOTAL expenses by
GENDER
SQL OVERVIEW
 In other words: You might want a new
table with two rows that tells you BOY –
SO MANY $; GIRLS – SO MANY $
 In SQL we refer to these as
AGGREGATE FUNCTIONS.
 SQL allows 5 types of AGGREGATION:
Sum, count, average, min, max
SQL OVERVIEW








SUMMARY:
Every database file should allow you to
SELECT certain rows (e.g. children <5)
PROJECT columns (e.g. School data)
UNION (Add rows)
JOIN (Add columns (e.g. add cloth stores
AGGREGATE FUNCTIONS (sum,…)
The good news: Only need to learn 5 pieces
of code – Great advantage of SQL-ease of
learning
HOW DO YOU DO A PROJECT






To do a project you
START with certain FIELDS
Decide what FIELDS you want to END
Decide how to GO from START to END
At each step you can do 1 of 5 things
SELECT,PROJECT,JOIN, UNION,
AGGREGATE
QUERY OPTIMIZATION
 In general each project will allow
SEVERAL ways to go from START to
END
 Database experts have identified rules
that OPTIMIZE projects The basic rule is
 Do SELECT,PROJECT early
 Do JOIN, AGGREGATE later
OUR PROJECT: STEP 1




Step 1 in ANY project is data cleaning
I have all this data on eligibles.
But is the data all OK
One way of checking is listing the STATES
involved. I will then DELETE those states that
are BAD
 This is a SELECTION (reduction of ROWS)
and uses the OPTIMIZATION RULE (SELECT
early)
OUR PROJECT STEP 1
 STEP 1 is done the same whether in SAS or
SQL. Here is the code. We will explain the
underlined keywords
 proc sql;

create table d.state as

select distinct substr(sc,1,2) as st

from d.Start;
 ;quit;
 Run;
OUR PROJECT: Step 2
 The output of the code gives me a list of
all states codes in my eligible file
 I find two codes I don’t recognize 00 99
 These are codes for badly coded records
 About 1% of the records are coded this
way
 This is actually quite normal
 I will now get rid of them using a SELECT
PROJET: STEP 2 SELECT
 PROGRAM to delete bad states
 proc sql;

create table d.Start2 as

select distinct *

from d.Start

where substr(sc,1,2) not in
('99','00');
 quit;
PROJECT: Step 2 SAS
 Here is the SAS code accomplishing the same
function – Notice how the SQL and SAS are
about the same in readability and compactness
 data d.Start2;

set d.Start;

if substr(sc,1,2) in ('99','00')
then delete;
 run;
 SAS in general is usually more compact; SQL
is more readable (EXERCISE: Why / How ?)
SQL JOINS - INTERLUDE







PROBLEM: Many versions of SQL
Each one slightly different
Differences lie in JOIN operation
Best not to rely on documentation
Here are some simple programs
We test the 5 types of JOINS
JOIN, INNER JOIN, OUT JOIN, LEFT
JOIN and RIGHT JOIN
SQL JOINS
 We need two datasets TEMP and
TEMP2 each with two fields. Here is
TEMP which has 2 columns A,B
 data d.temp;

input A B;
SQL JOINS








Here is the CARD statement for TEMP
cards;
11
12
13
Etc. FIELD / COL A can be 1,2,3
FIELD colum B can be 1,2,3
We get 9 records
SQL JOIN








Data set TEMP2
data d.temp2;
input A C;
cards;
17
28
49
;
SQL JOIN







What should you notice
TEMP.A has 3 values 1,2,3
TEMP2.A has 3 values 1,2,4
TEMP.A has values not in TEMP2.A
TEMP2.A has values not in TEMP
There are also common values
Can you the audience name the above
SQL JOIN








What else should you notice
TEMP has COLUMNS A,B
TEMP2 has COLUMNS A,C
The two files have a COMMON COLUMN
The two files also have their own columns
The common column allows a JOIN
Think back to CLOTH stores for KIDS
What was the field we JOINED ON (ANS=)
SQL JOIN
 Here is code for an SQL JOIN
 proc sql;

create table d.temp3 as

select *

from d.temp join d.temp2

on temp.A = temp2.A;
 quit;
SQL JOIN







The OUTPUT of the JOIN query is a table
A
B
C Joined
1
1
7
1
2
8 Temp
Temp2
1
3
9 (1,3) join (1,9)
The table only has A=1,2. A cannot equal 3,4
The JOIN in SQL SAS takes only A values
common to BOTH tables and leaves out A
values in only one table
SQL JOIN
 In other languages the JOIN behaves
differently
 When you want only values common to
both files you use the INNER JOIN
 The INNER JOIN query has identical
language to the JOIN query except that
the word JOIN is replaced by INNER
JOIN. In SAS the output is the same
SQL JOIN
 Suppose I wanted all values from my
TEMP table whether or not they are
linked in my temp2 table
 Is this reasonable?
 Sure it is. Think back to our example
 You want all your children listed
WHETHER OR NOT there is store to buy
clothing in
SQL JOIN
 So you want the A column in the JOIN
table to have all A values in the TEMP
table whether or not they have values in
the TEMP2 table
 We call this a LEFT JOIN because in
describing the JOIN OF TEMP WITH
TEMP2, “TEMP” is on the LEFT and
“TEMP2” is on the right. Neat!?!
SQL JOIN
 You make a LEFT JOIN query with identical
language to a JOIN query except that the word
JOIN is replaced by LEFT JOIN
 A typical record in the LEFT JOIN query with
values from TEMP but not from TEMP2 could
look like this
 A=3 B=1 C=.
 Here the “.” in the C column indicates a
MISSING value.
 If the field was not numeric the C col is BLANK
SQL JOIN
 The RIGHT JOIN query has all A values
from the TEMP2 table even if they don’t
occur in the TEMP1 table
 The language is identical except you use
the word RIGHT JOIN
 The output has the PERIOD (or blank) for
missing values.
PROJECT: Step 3
 So far I removed badly coded states
(00,99)
 I want my final table to only list states
and counties found in the office of the
actuary tables for which we have rates
 So I have to do some more ROW
reduction (SELECTION)
 How do I accomplish this
PROJECT STEP 3
 Well I take my START2 table
 I will JOIN START2 (add columns) from the
Office of Actuary Table
 The JOIN column will be _________?
 The STATE field (or its code)
 What type of JOIN? Well I don’t want anything
not in the ACTUARY table and I don’t want
anything not in my START2
 So I want an INNER JOIN (or JOIN in SAS)
PROJECT: STEP 3
 The SQL codes is as follows
 proc sql;

create table d.START3 as

SELECT *

FROM d.Start2 Inner Join d.Actuary

ON Start2.sc = Actuary.sc;
 quit;
PROJECT STEP 3
 I now have in my file
 All records with eligible counts
 Where the state and county codes are
also recognized in the actuary file
 I am now in a position to do some more
subsetting and get my totals
PROJECT STEP 3
 But am I done with step 3?
 Shouldn’t I look at the data deleted
 That is: Shouldn’t I look at the records
whose county codes arent in the actuary
file
 I can do this doing some queries as
follows
PROJECT: STEP 3
 proc sql;

CREATE table d.Start3LeftOut as

SELECT *

FROM d.Start2 Left Join d.Actuary

ON Start2.sc = Actuary.sc ;
CREATE table d.Start3LeftOut as

select *

from d.Start3LeftOut

where d.Start3LeftOut.state = ' ‘; quit;
PROJECT: STEP 3
 IN WORDS
 I Left Join my original file with the actuary
 I then inspect those records with a blank
or period
 Those are precisely the records that
could not be matched
 I now review these codes or subtotal
them to find out how bad the data is
PROJECT STEP 3
 Although SQL is READABLE and carries
across many platforms, SAS is much more
compact. Here is the SAS code to accomplish
all the preceding;
 SAS however requires sorting the data first
 SAS also requires using the IN variable
 This is technical sas code which should only be
used by people who are familiar with it
PROJECT STEP 3









proc sort data=d.Actuary;by sc; run;
proc sort data=d.Start2; by sc; run;
data d.Start3 d.Start3a d.Start3b;
merge d.Start3(in=t1) d.Actuary(in=t2);
by sc;
if t1 and t2 then output d.Start3;
if t1 and not t2 then output d.Start3A;
if not t1 and t2 then output d.Start3B;
run;
PROJECT: STEP 4






What next?
Remember the OPTIMIZE golden rule
SELECT / PROJECT early
GET RID Of unwanted ROWS/COL
We got rid of the unwanted ROWS
So next we get rid of unwanted COLS
PROJECT STEP 4







SAS CODE (Compact)
data d.Start4;
set d.Start3;
keep sc st county elig fips fg fa fat fb fbt;run;
SQL CODE (Slightly more readable?);
PROC SQL; CREATE TABLE d.Start3 AS
SELECT sc,st ,county ,elig ,fips ,fg ,fa ,fat
,fb ,fbt

FROM d.Start3; QUIT;
PROJECT STEP 5
 Next step is given as an exercise
 I explained the importance of LOOKING at the
data vs. EXPECTING it TO LOOK a certain
way
 So I should CHECK on the various possibliities
of A,B Entitlement Term;
 This is similar to checking on the STATES
 It allows me to see all the cases I am revuing
 QUIZ: What are the key words to be used?
PROJECT STEP 6
 I have done all my row/col reductions
 I am now ready to SUBTOTAL by STATE
COUNTY (I don’t want to subtotal by zip)
 The sas code is very compact /technical
 (Fa, fat=Pt A entitlement flag, Part A “T”erm;)
 PROC MEANS Data=d.Start4 noprint;

var elig;

by sc st county fips fa fat fb fbt;

output out=d.Start6 SUM; run;
data d.Start6;set d.Start6 ; drop _Type_ _Freq_;
PROJECT STEP 6
 Here is the more readable SQL code
 proc sql;

create table d.Start6 as

select sc,st,county,fips,FA,FAT,FB, FBT ,
,SUM(Elig) as Elig

from d.Start4

group by sc,st,county,fips,FA,FAT,FB,FBT

order by elig desc;
 Note New keywords (SUM, GROUP BY,
ORDER)
PROJECT STEP 7
 What next? Well we have ENTITLEMENT
STATUSES and TERM STATUSES
 But I want PART D status
 Part D = Entitled to A or B
 = Pt A status=Yes(1) and Part A term=0
 OR Pt B status=Yes and Part B term=0
 Need code to make a new field
 Will only give SQL version since late
 Might also want (to add) A and B for which we
use AND vs OR
PROJECT: STEP 7










PROC SQL;create table d.Start7 as
select *,
CASE
WHEN (fa='1' and fat='0')
and (fb='1' and fbt='0')
THEN 'A and B'
WHEN (fa='1' and fat='0')
or (fb='1' and fbt='0')
THEN 'A or B'
ELSE ' '
END as STATUS
from d.Start6; QUIT; RUN;
PROJECT COMPLETION
 There are two steps left
 First: We have to ONLY select rows where the
status is A and B or A or B (depending what we
want).
 Exercise: What is the KEYWORD? What is a
previous example?
 LASTLY: We have to AGGREGATE AGAIN
(Why again???)
 What do we GROUP BY? What do we
AGGREGATE?
SUMMARY









Databases consist of collections of files
Each file is row/record x column/field
You only have to know 5 things
SELECT(rows);PROJECT(Columns);
UNION(Rows);JOIN(columns); AGGREGATE
Project=Sequence Queries (SQL!)
PROJECT/SELECT early;JOIN/AGG late
Identify START and END files at Project begin
Be sure to check data as you proceed
Sometimes you DEFINE new fields.
Advanced Reference





SQL Processing with the SAS System;
Book Code 58231
SAS INSTITUTE 13MAY92, CARY NC
Excellent REFERENCE TIP
To review/see ex any PARTICULAR SQL
keyword like CASE, JOIN, simply GOOGLE
SQL <KEYWORD>
 Note: Overview given here (optimazation
theory and 5 basic operations is not on web It
should guide you in all you do