SasDay-SQL-Hendel
Download
Report
Transcript SasDay-SQL-Hendel
SQL and SAS
[email protected], Ph.d A.S.A
SAS DAY – Oct 31 2007
OVERVIEW - GOALS
What is SQL?
When do you use it?
Virtues of SQL over SAS –
readability, ease of learning
Virtues of SAS over SQL – speed,compactness
How can SQL help me when thinking about
requirements for projects? (For laypeople)
NO PREREQUISITES NEEDED
METHOD TO PRESENT
We will use a THEMATIC presentation
We will take one project-the THEME
And show how each step can be done in
SQL and SAS
Along the way we will learn everything
we need to know about SQL and its
virtues
THEMATIC PROJECT
START FILE
File with Medicare Eligibles
Contains state, county, codes (SSA and
FIPS), zip, gender, agegroup, Pt A
entitlement, Pt A term, Pt B entitlement,
Part B term, Eligible count for this record
THEMATIC PROJECT
END FILE
Want Subtotals and aggregation:
Official State, County, Code list used in Office
of Actuary, along with All Part D eligibles
(Entitled to Part A or B)
May also want Part A and B eligibles (Part C)
Goal of presentation: To get FROM the START
TO the END
SMALL OVERVIEW OF SQL
What is sql?
SQL is a computer language designed for
databases
SQL is a summary of several dozen attempts
to create database computer languages
It was found that all such languages had 8
items in common. SQL has these 8 items and
is a COMPUTER STANDARD for any database
language
Each database language should be SQL
compliant (DB2, SAS, ORACLE, etc)
OVERVIEW OF SQL
A database FILE is simply a file with ROWS
and COLUMNS
Each ROW is a RECORD
Each COLUMN is a FIELD
The most typical example is say the children in
your family
Each child would have Fields associated with
them FirstName, Age, School they attend,
Birthday etc
This information is STORED in a rectangular
table
OVERVIEW OF SQL
A DATABASE is a collection of DATABASE
FILES
For example besides your database file for
your family you may have a database from the
yellow pages which lists for each age group
stores carrying clothing, toys and other FIELDS
You can use the DATABASE to find out what
stores are good for each child to buy what they
want
OVERVIEW OF SQL
We just explained what a DATABASE is
We also gave a simple example
Now we explain what a DATABASE
LANGUAGE is
That is we explain what you would want to do
with your database files
It turns out there are 8 operations that every
database should be able to do
We will give an alternate version of these 8
basic operations that are used in practice
SQL OVERVIEW
A database language should enable you to
1)Take a SUBSET OF ROWS of a database
file (For example: All children in your family
under 5) In sql we call this SELECTING
2) Take a SUBSET OF COLUMNS (e.g.
FirstName and School) In SQL we call this
PROJECTING
3) Make a new database file by UNIONing two
database files with common fields
SQL OVERVIEW
4) JOIN two tables by keeping the number of
rows the same but adding columns(For
example take my family database file and add
a column showing the store where to buy
clothing for that child)
5) Suppose you take your FAMILY DATABASE
FILE and JOIN to it your EXPENSES for each
child(Another database file).
You might want to SUBTOTAL expenses by
GENDER
SQL OVERVIEW
In other words: You might want a new
table with two rows that tells you BOY –
SO MANY $; GIRLS – SO MANY $
In SQL we refer to these as
AGGREGATE FUNCTIONS.
SQL allows 5 types of AGGREGATION:
Sum, count, average, min, max
SQL OVERVIEW
SUMMARY:
Every database file should allow you to
SELECT certain rows (e.g. children <5)
PROJECT columns (e.g. School data)
UNION (Add rows)
JOIN (Add columns (e.g. add cloth stores
AGGREGATE FUNCTIONS (sum,…)
The good news: Only need to learn 5 pieces
of code – Great advantage of SQL-ease of
learning
HOW DO YOU DO A PROJECT
To do a project you
START with certain FIELDS
Decide what FIELDS you want to END
Decide how to GO from START to END
At each step you can do 1 of 5 things
SELECT,PROJECT,JOIN, UNION,
AGGREGATE
QUERY OPTIMIZATION
In general each project will allow
SEVERAL ways to go from START to
END
Database experts have identified rules
that OPTIMIZE projects The basic rule is
Do SELECT,PROJECT early
Do JOIN, AGGREGATE later
OUR PROJECT: STEP 1
Step 1 in ANY project is data cleaning
I have all this data on eligibles.
But is the data all OK
One way of checking is listing the STATES
involved. I will then DELETE those states that
are BAD
This is a SELECTION (reduction of ROWS)
and uses the OPTIMIZATION RULE (SELECT
early)
OUR PROJECT STEP 1
STEP 1 is done the same whether in SAS or
SQL. Here is the code. We will explain the
underlined keywords
proc sql;
create table d.state as
select distinct substr(sc,1,2) as st
from d.Start;
;quit;
Run;
OUR PROJECT: Step 2
The output of the code gives me a list of
all states codes in my eligible file
I find two codes I don’t recognize 00 99
These are codes for badly coded records
About 1% of the records are coded this
way
This is actually quite normal
I will now get rid of them using a SELECT
PROJET: STEP 2 SELECT
PROGRAM to delete bad states
proc sql;
create table d.Start2 as
select distinct *
from d.Start
where substr(sc,1,2) not in
('99','00');
quit;
PROJECT: Step 2 SAS
Here is the SAS code accomplishing the same
function – Notice how the SQL and SAS are
about the same in readability and compactness
data d.Start2;
set d.Start;
if substr(sc,1,2) in ('99','00')
then delete;
run;
SAS in general is usually more compact; SQL
is more readable (EXERCISE: Why / How ?)
SQL JOINS - INTERLUDE
PROBLEM: Many versions of SQL
Each one slightly different
Differences lie in JOIN operation
Best not to rely on documentation
Here are some simple programs
We test the 5 types of JOINS
JOIN, INNER JOIN, OUT JOIN, LEFT
JOIN and RIGHT JOIN
SQL JOINS
We need two datasets TEMP and
TEMP2 each with two fields. Here is
TEMP which has 2 columns A,B
data d.temp;
input A B;
SQL JOINS
Here is the CARD statement for TEMP
cards;
11
12
13
Etc. FIELD / COL A can be 1,2,3
FIELD colum B can be 1,2,3
We get 9 records
SQL JOIN
Data set TEMP2
data d.temp2;
input A C;
cards;
17
28
49
;
SQL JOIN
What should you notice
TEMP.A has 3 values 1,2,3
TEMP2.A has 3 values 1,2,4
TEMP.A has values not in TEMP2.A
TEMP2.A has values not in TEMP
There are also common values
Can you the audience name the above
SQL JOIN
What else should you notice
TEMP has COLUMNS A,B
TEMP2 has COLUMNS A,C
The two files have a COMMON COLUMN
The two files also have their own columns
The common column allows a JOIN
Think back to CLOTH stores for KIDS
What was the field we JOINED ON (ANS=)
SQL JOIN
Here is code for an SQL JOIN
proc sql;
create table d.temp3 as
select *
from d.temp join d.temp2
on temp.A = temp2.A;
quit;
SQL JOIN
The OUTPUT of the JOIN query is a table
A
B
C Joined
1
1
7
1
2
8 Temp
Temp2
1
3
9 (1,3) join (1,9)
The table only has A=1,2. A cannot equal 3,4
The JOIN in SQL SAS takes only A values
common to BOTH tables and leaves out A
values in only one table
SQL JOIN
In other languages the JOIN behaves
differently
When you want only values common to
both files you use the INNER JOIN
The INNER JOIN query has identical
language to the JOIN query except that
the word JOIN is replaced by INNER
JOIN. In SAS the output is the same
SQL JOIN
Suppose I wanted all values from my
TEMP table whether or not they are
linked in my temp2 table
Is this reasonable?
Sure it is. Think back to our example
You want all your children listed
WHETHER OR NOT there is store to buy
clothing in
SQL JOIN
So you want the A column in the JOIN
table to have all A values in the TEMP
table whether or not they have values in
the TEMP2 table
We call this a LEFT JOIN because in
describing the JOIN OF TEMP WITH
TEMP2, “TEMP” is on the LEFT and
“TEMP2” is on the right. Neat!?!
SQL JOIN
You make a LEFT JOIN query with identical
language to a JOIN query except that the word
JOIN is replaced by LEFT JOIN
A typical record in the LEFT JOIN query with
values from TEMP but not from TEMP2 could
look like this
A=3 B=1 C=.
Here the “.” in the C column indicates a
MISSING value.
If the field was not numeric the C col is BLANK
SQL JOIN
The RIGHT JOIN query has all A values
from the TEMP2 table even if they don’t
occur in the TEMP1 table
The language is identical except you use
the word RIGHT JOIN
The output has the PERIOD (or blank) for
missing values.
PROJECT: Step 3
So far I removed badly coded states
(00,99)
I want my final table to only list states
and counties found in the office of the
actuary tables for which we have rates
So I have to do some more ROW
reduction (SELECTION)
How do I accomplish this
PROJECT STEP 3
Well I take my START2 table
I will JOIN START2 (add columns) from the
Office of Actuary Table
The JOIN column will be _________?
The STATE field (or its code)
What type of JOIN? Well I don’t want anything
not in the ACTUARY table and I don’t want
anything not in my START2
So I want an INNER JOIN (or JOIN in SAS)
PROJECT: STEP 3
The SQL codes is as follows
proc sql;
create table d.START3 as
SELECT *
FROM d.Start2 Inner Join d.Actuary
ON Start2.sc = Actuary.sc;
quit;
PROJECT STEP 3
I now have in my file
All records with eligible counts
Where the state and county codes are
also recognized in the actuary file
I am now in a position to do some more
subsetting and get my totals
PROJECT STEP 3
But am I done with step 3?
Shouldn’t I look at the data deleted
That is: Shouldn’t I look at the records
whose county codes arent in the actuary
file
I can do this doing some queries as
follows
PROJECT: STEP 3
proc sql;
CREATE table d.Start3LeftOut as
SELECT *
FROM d.Start2 Left Join d.Actuary
ON Start2.sc = Actuary.sc ;
CREATE table d.Start3LeftOut as
select *
from d.Start3LeftOut
where d.Start3LeftOut.state = ' ‘; quit;
PROJECT: STEP 3
IN WORDS
I Left Join my original file with the actuary
I then inspect those records with a blank
or period
Those are precisely the records that
could not be matched
I now review these codes or subtotal
them to find out how bad the data is
PROJECT STEP 3
Although SQL is READABLE and carries
across many platforms, SAS is much more
compact. Here is the SAS code to accomplish
all the preceding;
SAS however requires sorting the data first
SAS also requires using the IN variable
This is technical sas code which should only be
used by people who are familiar with it
PROJECT STEP 3
proc sort data=d.Actuary;by sc; run;
proc sort data=d.Start2; by sc; run;
data d.Start3 d.Start3a d.Start3b;
merge d.Start3(in=t1) d.Actuary(in=t2);
by sc;
if t1 and t2 then output d.Start3;
if t1 and not t2 then output d.Start3A;
if not t1 and t2 then output d.Start3B;
run;
PROJECT: STEP 4
What next?
Remember the OPTIMIZE golden rule
SELECT / PROJECT early
GET RID Of unwanted ROWS/COL
We got rid of the unwanted ROWS
So next we get rid of unwanted COLS
PROJECT STEP 4
SAS CODE (Compact)
data d.Start4;
set d.Start3;
keep sc st county elig fips fg fa fat fb fbt;run;
SQL CODE (Slightly more readable?);
PROC SQL; CREATE TABLE d.Start3 AS
SELECT sc,st ,county ,elig ,fips ,fg ,fa ,fat
,fb ,fbt
FROM d.Start3; QUIT;
PROJECT STEP 5
Next step is given as an exercise
I explained the importance of LOOKING at the
data vs. EXPECTING it TO LOOK a certain
way
So I should CHECK on the various possibliities
of A,B Entitlement Term;
This is similar to checking on the STATES
It allows me to see all the cases I am revuing
QUIZ: What are the key words to be used?
PROJECT STEP 6
I have done all my row/col reductions
I am now ready to SUBTOTAL by STATE
COUNTY (I don’t want to subtotal by zip)
The sas code is very compact /technical
(Fa, fat=Pt A entitlement flag, Part A “T”erm;)
PROC MEANS Data=d.Start4 noprint;
var elig;
by sc st county fips fa fat fb fbt;
output out=d.Start6 SUM; run;
data d.Start6;set d.Start6 ; drop _Type_ _Freq_;
PROJECT STEP 6
Here is the more readable SQL code
proc sql;
create table d.Start6 as
select sc,st,county,fips,FA,FAT,FB, FBT ,
,SUM(Elig) as Elig
from d.Start4
group by sc,st,county,fips,FA,FAT,FB,FBT
order by elig desc;
Note New keywords (SUM, GROUP BY,
ORDER)
PROJECT STEP 7
What next? Well we have ENTITLEMENT
STATUSES and TERM STATUSES
But I want PART D status
Part D = Entitled to A or B
= Pt A status=Yes(1) and Part A term=0
OR Pt B status=Yes and Part B term=0
Need code to make a new field
Will only give SQL version since late
Might also want (to add) A and B for which we
use AND vs OR
PROJECT: STEP 7
PROC SQL;create table d.Start7 as
select *,
CASE
WHEN (fa='1' and fat='0')
and (fb='1' and fbt='0')
THEN 'A and B'
WHEN (fa='1' and fat='0')
or (fb='1' and fbt='0')
THEN 'A or B'
ELSE ' '
END as STATUS
from d.Start6; QUIT; RUN;
PROJECT COMPLETION
There are two steps left
First: We have to ONLY select rows where the
status is A and B or A or B (depending what we
want).
Exercise: What is the KEYWORD? What is a
previous example?
LASTLY: We have to AGGREGATE AGAIN
(Why again???)
What do we GROUP BY? What do we
AGGREGATE?
SUMMARY
Databases consist of collections of files
Each file is row/record x column/field
You only have to know 5 things
SELECT(rows);PROJECT(Columns);
UNION(Rows);JOIN(columns); AGGREGATE
Project=Sequence Queries (SQL!)
PROJECT/SELECT early;JOIN/AGG late
Identify START and END files at Project begin
Be sure to check data as you proceed
Sometimes you DEFINE new fields.
Advanced Reference
SQL Processing with the SAS System;
Book Code 58231
SAS INSTITUTE 13MAY92, CARY NC
Excellent REFERENCE TIP
To review/see ex any PARTICULAR SQL
keyword like CASE, JOIN, simply GOOGLE
SQL <KEYWORD>
Note: Overview given here (optimazation
theory and 5 basic operations is not on web It
should guide you in all you do