run - Stanford University

HRP 222
Topic 3 –
Showing Data
Copyright © 1999-2001 Leland Stanford Junior University. All rights reserved.
Warning: This presentation is protected by copyright law and international treaties.
Unauthorized reproduction of this presentation, or any portion of it, may result in
severe civil and criminal penalties and will be prosecuted to maximum extent possible
under the law.
From Last Time
Oops - libname
Last time I had the library name and v6
statement transposed. This is correct:
libname ingridv6 v6 'c:\projects\ingrid\dis\old';
From Last Time
New Data
When you get new data do the following:
1. Scan the files for viruses
2. Make the file read only
3. Verify the number of records with the sender
4. Verify the first and last records
5. Verify the content
- Missing values
- Permitted values
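Steps 3 to 5 can be sketched in SAS roughly like this. The data set work.newData and its variables are hypothetical stand-ins for a file you have just received; treat this as one possible approach, not a required recipe.

```sas
/* Hypothetical sample standing in for a newly received file */
data work.newData;
  input id sex $ age;
  datalines;
1 M 34
2 F 29
3 . 41
4 X .
;
run;

/* Step 3: the record count appears in the proc contents output */
proc contents data=work.newData;
run;

/* Step 4: keep only the first and last records, then print them */
data work.ends;
  set work.newData end=last;
  if _n_ = 1 or last then output;
run;

proc print data=work.ends;
run;

/* Step 5: look for missing values and values that are not permitted */
proc freq data=work.newData;
  tables sex / missing;
run;
```

Compare the printed first and last records, and the count from proc contents, with what the sender tells you.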
From Last Time
The PDV
The program data vector (PDV) is the storage for all the
variables that SAS is working on. The contents
of the PDV are used to create new data
sets. Variables and their values get into the PDV
if they appear:
in a source “set” statement in a data step
in an “input” statement
on the left side of an equal sign
in a retain statement
as an automatic variable
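One way to watch the PDV at work is to dump it to the log on every pass through the data step. This is a minimal sketch; the data set work.pdvDemo and the variables x, total and doubled are made up for illustration.

```sas
data work.pdvDemo;
  input x;            /* x enters the PDV through the input statement */
  retain total 0;     /* total enters through the retain statement */
  doubled = x * 2;    /* doubled enters by being on the left of an equal sign */
  total = total + x;  /* total keeps its value from one record to the next */
  put _all_;          /* writes every PDV variable, including _N_ and _ERROR_, to the log */
  datalines;
1
2
3
;
run;
```

The log should show one line per record, with the automatic variables _N_ and _ERROR_ listed next to your own variables and total reaching 6 on the last record.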
Examples of Retain
Here is an example of the use of retain
which counts the cases of gdm.
data blah;
set grace.analysis;
retain dx_gdm 0; /* the 0 is an optional default value; you should always give one */
if gdm=1 then dx_gdm=dx_gdm+1;
/*the same thing as
if gdm then dx_gdm+1;
*/
run;
Complex Retains
Combining the first. and last. variables with retain
statements gives you real power. This code counts
the total diagnoses for a woman.
data totaldx (keep=id dx_total);
set fakebaby.analysis;
by fake_id;
retain dx_total 0;
if first.fake_id then dx_total = 0;
dx_total=dx_total+sum(of gdm--thyroid); /* "of" is needed before a variable list */
if last.fake_id then output;
run;
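Here is a self-contained version of the same pattern that you can run without the course data. The data set work.visits and the diagnosis flags gdm, htn and thyroid are hypothetical, and this sketch keeps fake_id in the output so each total can be traced back to a woman.

```sas
/* Hypothetical data: two records per woman, already sorted by fake_id */
data work.visits;
  input fake_id gdm htn thyroid;
  datalines;
1 1 0 0
1 0 1 1
2 0 0 0
2 1 0 0
;
run;

data work.totaldx (keep=fake_id dx_total);
  set work.visits;
  by fake_id;
  retain dx_total 0;
  if first.fake_id then dx_total = 0;          /* reset at the start of each woman */
  dx_total = dx_total + sum(of gdm--thyroid);  /* add this record's diagnoses */
  if last.fake_id then output;                 /* one output row per woman */
run;

proc print data=work.totaldx noobs;
run;
```

With this sample, woman 1 ends up with a total of 3 and woman 2 with a total of 1.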
Security
Assume that somebody is always looking over
your shoulder on the web and people are
reading your email.
Put a firewall between you and the web.
That said, the biggest threats to computer
security are the legal users of the system.
Walking away from a terminal
Using passwords that are easy to crack by script
kiddies
Taking data off of restricted machines
Viruses and Trojan horses will kill you if you let them!
Security Issues
(2)
The left red arrow points to Norton
Antivirus.
Right click on it to open it up.
Before you send me your homework,
update your definitions and scan the files
of interest.
Security Issues
(2b)
The newest Norton AntiVirus has a lousy
interface.
Click this to find the file you want to scan.
Update your definitions by clicking the live update button.
Security Issues
(2c)
Click on the files you want the scanner to
check.
Security
(3)
Securing your email:
There are programs which will scramble your email
while it is en route, effectively making it impossible for
people to read it without your permission.
The best way to encrypt data is by using PGP
encryption.
If you use a PC or Mac, visit the first site below for the
latest version information.
http://cws.internet.com/encrypt.html
http://web.mit.edu/network/pgp.html
Security
(4)
You can secure the connection between
machines by using encrypted transmissions.
PGP
SSH
SSL
Virtual Private Networks (VPNs) are all the rage.
Machines can recognize each other:
Kerberos – make a .klogin file on your unix account
SSH
More on Finding Problems
I showed you how to identify problems
and write them to the log. This is an
important task but documenting problems
with reports that look good is an equally
important task.
Checking Variables 2
Proc Print
Use proc print to print stuff to the output
(not the log) window.
proc print data= newData;
var id sex;
where sex not in ('M', 'F');
run;
The if statement in a data step is replaced with a where
statement in a procedure.
Dressing up output
You can add up to ten lines of titles and ten
lines of footnotes to each page of output.
title1 People who have bad sex;
proc print data= newData noobs; /* noobs: do not print the observation number */
var id sex;
where sex not in ('M', 'F');
run;
Dressing up output
title1;
proc print data= newData noobs label;
label sex = "Gender";
var id sex;
where sex not in ('M', 'F');
run;
You can tell the procedure you want to use labels instead of
variable names and provide the labels like this.
ODS
The Output Delivery System allows you to control
what you print and how it looks. Use it to make
your output web-ready and pretty.
ods html
file="blah-body.htm"
contents="blah-contents.htm"
frame="blah-frame.htm"
page="blah-page.htm"
path="c:\projects\blah\LS\" (url=none)
gpath="c:\projects\blah\LS\"(url=none);
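An ODS destination stays open until you close it, so the usual pattern is to open it, run the procedures you want captured, and then close it. This is a minimal sketch that assumes the current directory is writable; sashelp.class is a sample data set shipped with SAS, used here only as a stand-in for your own data.

```sas
/* Open the HTML destination; the file name is the one used above */
ods html file="blah-body.htm";

/* Anything printed here is written to the HTML file */
proc print data=sashelp.class noobs;
run;

/* Close the destination so the file is finished and released */
ods html close;
```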
A Look at Data
If a variable is categorical (i.e., nominal or
ordinal) you would take your first look at it
with proc freq. You would look at it
graphically with proc gchart.
If a variable is continuous (i.e., interval or
ratio measure) you can take your first look
at it with proc means or proc univariate.
You would visualize it with proc gplot,
proc gchart, proc univariate, or proc
boxplot.
Categorical Data
You can represent categorical data as strings of
letters or numbers.
The choice is up to you but most programmers
use numbers. Never use free form text for
categories.
Plotting Frequencies
I prefer to see my data in chart format.
SAS/Graph is like dental surgery. Your results may be beautiful but getting them can be excruciating.
Plotting Frequencies
(2)
Counting observations
If you want to get a tabular count of all
the different values stored in a variable,
use proc freq (pronounced “freak”) with
this very simple syntax.
proc freq data= gen6sas.at;
tables race;
run;
proc freq data= gen6sas.at;
where center = 'stan';
tables race;
run;
Counting observations
Counting the missing
(2)
You can tell SAS to include the missing
records in the body of the table like this:
proc freq data= gen6sas.at;
tables race / missing;
run;
Counting Observations
Lots of Tables
(3)
Cody and Smith mention that double dash
notation can be used to get all tables between
two variables.
tables gender -- cities;
You can also specify just the numeric or character
variables like this:
tables gender-numeric-cities;
tables gender-character-cities;
Counting Observations
Warning!
(4)
Proc freq only examines the first 16 positions
of a character variable. These two strings are
identical to proc freq.
Do not put beans or raisins in your nose
Do not put beans
Capitalization and spacing are both
meaningful to proc freq. These are different:
Spam & Eggs, Spam&Eggs, spam & Eggs, spam & eggs
Dealing With Strings
Try not to use strings for your categorical variables but if you have to….
SAS has functions that will convert your variables to all upper or lower case and sack the spaces.
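As a rough sketch of that clean-up, the upcase and compress functions can be combined so that variants of the same string collapse into one category. The data set work.clean and the variables raw and tidy are hypothetical names.

```sas
data work.clean;
  length raw $ 20 tidy $ 20;
  infile datalines truncover;
  input raw $char20.;
  tidy = compress(upcase(raw));  /* upper case, then remove every blank */
  datalines;
Spam & Eggs
Spam&Eggs
spam & Eggs
spam & eggs
;
run;

/* raw has four different values; tidy has only one */
proc freq data=work.clean;
  tables raw tidy;
run;
```

The lowcase function works the same way if you prefer lower case.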
Dealing With Strings(2)
Dealing With Strings(3)
The right way to deal with strings is to not use them at all!
Code your variables numerically and translate them with a format.
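A minimal sketch of numeric coding with a format follows. The coding scheme (1 = Male, 2 = Female), the format name sexFmt and the data set work.people are assumptions made up for this example.

```sas
proc format;
  value sexFmt 1 = "Male"
               2 = "Female"
               other = "Invalid";  /* catches anything else, including missing */
run;

data work.people;
  input id sex;
  format sex sexFmt.;  /* stored as a number, displayed as text */
  datalines;
1 1
2 2
3 2
4 9
;
run;

proc freq data=work.people;
  tables sex;
run;
```

The table shows the text labels while the stored values stay numeric, so there is no capitalization or spacing to go wrong.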
Dealing With Strings
(4)
Dealing With Strings
(5)
Continuous Variables
You can now describe numerically or
graphically a categorical variable.
Continuous variables are generally easier
to work with.
Proc means by default will give you N, mean,
SD, min and max for one or more
variables.
Proc Means (1)
Easy Examples
proc means data = x;
var age_st yob;
run;
proc means data = x;
var age_st yob;
where age_st not in (0, 9999) and yob not in (0, 8888, 9999) ;
run;
Proc Means (2)
Easy Examples
If your data is sorted then you can do
statistics for subgroups of your data by
using the keyword by.
proc sort data= x; by sex; run;
proc means data = x nonobs mean maxdec=0;
by sex;
var age_st yob;
where age_st not in (0,9999)
and yob not in (0,8888,9999);
run;
Proc Means (3)
Easy Examples
A couple of procedures, including proc means, will
allow you to use a class statement instead of sorting
and using by. If you have the RAM try it because it
is faster.
proc means data = x nonobs mean maxdec=0;
by sex;
var age_st yob;
where age_st not in (0,9999) and yob not in (0,8888,9999);
run;
proc means data = x nonobs mean maxdec=0; /* nonobs: don't print the N used in the stats */
class sex;
var age_st yob;
where age_st not in (0,9999) and yob not in (0,8888,9999);
run;
Proc Means (4)
A Complex Example
You can make procedures, including proc
means, create new data sets:
proc means data = x nonobs mean std maxdec=0 noprint;
by sex;
where age_st not in (0,9999) and yob not in (0,8888,9999);
var    age_st yob;
output out = work.themeans
mean = age_m  yob_m   /* line these up with the var list! */
std  = age_s  yob_s;
run;
Many other procedures produce datasets
which can be used for further work.
Proc Means (4)
A Complex Example - 2
The output data set includes the statistics you
requested plus two automatic variables. The
_freq_ value tells you how many observations were
in each group. The _type_ value comes into play
when you invoke means with a class statement (with a
by statement it is always 0). You can use it to see the means
for the whole group and within the levels.
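A small example makes the two automatic variables easier to see. The data set work.x and its values are invented for this sketch.

```sas
data work.x;
  input sex $ age_st;
  datalines;
F 30
F 40
M 50
;
run;

proc means data=work.x noprint;
  class sex;
  var age_st;
  output out=work.themeans mean=age_m;
run;

proc print data=work.themeans noobs;
run;
```

The output data set has three rows. The row with _type_ = 0 is the whole group (_freq_ = 3, age_m = 40). The rows with _type_ = 1 are the levels of sex (F with _freq_ = 2 and age_m = 35, M with _freq_ = 1 and age_m = 50).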
Proc Univariate
Proc univariate generates a sea of
information on your numeric variables. It
is syntactically easy.
Like proc means, it can output into a new
data set and you can use it for further
analysis (high resolution plots).
Proc Univariate
(2)
I like to do this:
proc univariate data=junk.babyweight noprint; /* noprint suppresses all the statistical output */
var fetal_wgt_;
histogram;
run;
Proc Univariate
(3)
Actually, I do something like this….
proc univariate data=junk.babyweight noprint;
var fetal_wgt_;
histogram /midpoints = 1350 to 4300 by 100;
run;
Statistics available in Proc Means and Proc Univariate (based on DiIorio page 89).
Means default: Y means Proc Means prints it without being asked.
Means option: the keyword that requests it in Proc Means.
Univariate: the option or statement needed in Proc Univariate; it produces nearly all of the other statistics by default.

Statistic                                  Means default  Means option     Univariate
Number of nonmissing observations          Y              N                -
Number of missing observations             -              NMISS            -
Total number of observations               -              -                -
Mean                                       Y              MEAN             -
Median                                     -              MEDIAN           -
Mode                                       -              -                -
Sum                                        -              SUM              -
Standard deviation                         Y              STD              -
Variance                                   -              VAR              -
Minimum                                    Y              MIN              -
Maximum                                    Y              MAX              -
Range                                      -              RANGE            -
Uncorrected sum of squares                 -              USS              -
Corrected sum of squares                   -              CSS              -
Coefficient of variation                   -              CV               -
Skewness                                   -              SKEW             -
Kurtosis                                   -              KURT             -
Student's t                                -              T                -
Probability of non-0 t                     -              PRT              -
Quartiles                                  -              Q1 Q3            -
Interquartile range                        -              QRANGE           -
Percentiles                                -              P1 P5 P10 P25…   -
Signed rank test                           -              -                -
Kolmogorov statistic                       -              -                NORMAL (option)
Shapiro-Wilk statistic                     -              -                NORMAL (option)
Test for H0: normal distribution           -              -                NORMAL (option)
Box plots (low resolution)                 -              -                PLOTS (option)
Stem-and-leaf plots (low resolution)       -              -                PLOTS (option)
Normal probability plot (low resolution)   -              -                PLOTS (option)
Histogram (high resolution)                -              -                HISTOGRAM (statement)
Probability plot (high resolution)         -              -                PROBPLOT (statement)
QQ plot (high resolution)                  -              -                QQPLOT (statement)
Formats
Formats are typically used to indicate that a
numeric value corresponds to a text value.
You can also use formats to deal
effectively with missing or invalid values.
Using Formats and Nulls
proc format;
value badAge
.U = "Unknown"
.N = "Not Applicable"; /* quote the labels, especially ones with spaces */
run;
data blah;
input ageAtCancer @@;
format ageAtCancer badAge.;
datalines;
34 35
.U
.N
36
;
run;
Using Formats and Nulls
(2)
When you do statistics on variables
that include null values, the null values
are excluded.
proc means data = blah maxdec = 0;
var ageAtCancer;
run;
Dates
You know how to import numbers and
character data. I have alluded to the fact
that dates in SAS are difficult to work with
because dates are stored as the number of
days since Jan 01, 1960. Importing
requires an informat and viewing a date
requires a date format.
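You can see the underlying numbers by writing a few dates to the log with and without a format. This is a small sketch; the variable names d0, d1 and dob are made up.

```sas
data _null_;
  d0  = '01jan1960'd;      /* day zero */
  d1  = '02jan1960'd;      /* one day later */
  dob = mdy(6, 24, 1967);  /* build a date from month, day, year */
  put d0= d1= dob=;                       /* raw day counts */
  put dob= mmddyy10. dob= date9.;         /* same value, two display formats */
run;
```

The first line in the log shows d0=0 and d1=1 along with the raw day count for dob. The second line shows the same dob value displayed as a date.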
Dates (2)
Importing a Date
To import a date you need to tell SAS how
the date is structured:
This is optional
data form; input id dob : mmddyy10.;
datalines;
1 06/24/1967
2 01/18/1967
;
run;
Dates (3)
Importing a Date
Dates are stored as the number of days since
Jan 01, 1960. If you need to specify a lot of
dates you can use an informat statement:
data form;
informat dob dom mmddyy10.;
input id dob dom @@;
datalines;
1 06/24/1967 06/10/1990
2 01/18/1967 06/10/1990
;
run;
Dates (4)
Displaying a Date
To see the date correctly, specify a format
in the importing datastep or later:
data form;
informat dob dom mmddyy10.;
format dob dom mmddyy10.;
input id dob dom;
datalines;
1 06/24/1967 06/10/1990
2 01/18/1967 06/10/1990
; run;
Formats stick around when you create
new data sets but can be changed.
Dates (5)
Changing a Date Format
data form;
informat dob dom mmddyy10.;
input id dob dom; datalines;
1 06/24/1967 06/10/1990
2 01/18/1967 06/10/1990
; run;
data blah; set form;
format dob dom mmddyy10.;
run;
data blah2; set blah;
format dob dom date8.;
run;
Dates (6)
Two Digit Dates and Y2K
SAS has done a lousy job with this…
Don’t use two digit dates if you can help it.
You can specify a year cut-off of
something like 1920. If you use yearcutoff=1920
then your two digit dates refer to
the range 1920 to 2019.
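A quick way to check how a cut-off of 1920 treats two digit years is to read a couple of them and print the result with a four digit year. This is a minimal sketch; the variable names a and b are made up.

```sas
options yearcutoff=1920;  /* two digit years map onto 1920 through 2019 */

data _null_;
  a = input('06/24/19', mmddyy8.);  /* 19 falls at the end of the window */
  b = input('06/24/20', mmddyy8.);  /* 20 falls at the start of the window */
  put a= date9. b= date9.;
run;
```

The log shows a=24JUN2019 and b=24JUN1920, which is why four digit years are the safer choice.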
Converting
From Text to Dates
Converting a text date to a SAS date is
useful for determining study eligibility:
data eligible; set blah;
if dom > "01jan1990"d then output;
run;
You also have a pack of useful date
functions to do things like:
data eligible; set blah;
if (("01jan1990"d-mdy(monthOfB,dayOfB,yearOfB))/365.25)
> 65 then output;
run;
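To try the date functions without any input data, you can work on a single made-up date in a data _null_ step. The variable names below are hypothetical, and the age calculation uses the same divide-by-365.25 approximation as the code above.

```sas
data _null_;
  dob = mdy(6, 24, 1967);           /* month, day, year to a SAS date */
  ref = '01jan1990'd;               /* a date constant */
  age = (ref - dob) / 365.25;       /* approximate age in years at ref */
  y = year(dob);                    /* pull the pieces back out */
  m = month(dob);
  d = day(dob);
  put y= m= d= age= 5.1;
run;
```

The log shows y=1967 m=6 d=24 followed by the approximate age.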
Before Next Time
Cody & Smith – Read the rest of Chapter
2, and all of Chapter 3
In Class Exercise
Import the data.
Get the contents.
Verify the contents.
Generate frequency tables on all the
variables.
Get descriptive statistics on the numeric
variables.