Slides - Zhangxi Lin`s - Texas Tech University

Download Report

Transcript Slides - Zhangxi Lin`s - Texas Tech University

ISQS 6339, Data Management & Business Intelligence
Data Preparation for
Analytics Using SAS
Zhangxi Lin
Texas Tech University
ISQS 6347, Data & Text Mining
1
Outline


An overview of data preparation for analytics
SAS Programming Essentials




Running SAS programs
Mastering fundamental concepts
SAS program debugging
Make use of SAS Enterprise Guide for programming
ISQS 6347, Data & Text Mining
2
Structure and Components of
Business Intelligence
ISQS 6347, Data & Text Mining
3
Overview: From Data Warehousing to
Data Analysis

Previous major topics in data warehousing (using SQL Server
2008)




Dimensional model design
ETL
Cubes design and OLAP
Data analysis topics (using SAS)

Data preparation






Analytic business questions
Data format and data conversion
Data cleansing
Data exploratory
Data analysis
Data visualization
ISQS 6347, Data & Text Mining
4
US Car Theft

The number of U.S. motor vehicle thefts decreased
by 1.9 percent from 2003 to 2004, the first decrease
since 1999. In 2004, the value of stolen motor
vehicles was $7.6 billion, down from $8.6 billion in
2003. The average value of a motor vehicle reported
stolen in 2004 was $6,143, compared with $6,797 in
2003.
ISQS 6347, Data & Text Mining
5
2004 Theft Statistics







Every 26 seconds, a motor vehicle is stolen in the United States.
The odds of a vehicle being stolen were 1 in 190 in 2003. The
odds are highest in urban areas.
U.S. motor vehicle thefts fell 1.9 percent in 2004 from 2003,
according to the FBI's Uniform Crime Reports. In 2004, 1,237,114
motor vehicles were reported stolen.
The West was the only region with an increase in motor vehicle
thefts from 2003 to 2004, up 3.2 percent. Thefts fell 9.7 percent in
the Northeast, 4.4 percent in the Midwest and 2.9 percent in the
South.
Nationwide, the 2004 motor vehicle theft rate per 100,000 people
was 421.3, down 2.9 percent from 433.7 in 2003.
Only 13.0 percent of thefts were cleared by arrests in 2004.
Carjackings occur most frequently in urban areas. They account
for only 3.0 percent of all motor vehicle thefts.
The average comprehensive insurance premium in the U.S. rose
11.2 percent from 1999 to 2003
ISQS 6347, Data & Text Mining
6
Business Question


If the number of used Honda Accord thefts is
ranked the top in auto theft, should the
premium of insurance for Honda Accord be
high enough than other brand of cars?
Should the insurance for a user Honda higher
than a brand new Honda?
Why?
ISQS 6347, Data & Text Mining
7
Analytic Business Questions





How do factors such as time, branch, promotion, and price influence
the sale of a soft drink?
Which customers have a high cancellation risk in the next month?
How can customers be segmented based on their purchase
behavior?
Statistics showed that an online recommendation system may
increase the sale 20%, and the accuracy rate of the system is 40%.
A newer algorithm can increase the accuracy rate to 50%. Should
the sale be promoted to 20%*125% = 25%?
The airline companies are considering allowing seats over-booked
because certain percentage of customers will cancel their flight at
the last minute. If the average cancellation rate is 10%, should the
over-booking rate be 10% as well? If a cancellation is charged 5% of
the fare and how much should the penalty for sold-out situation with
over-booking?
ISQS 6347, Data & Text Mining
8
Analysis Process






Selecting an analysis method
Identify data source
Prepare the data (collecting, cleansing, reorganizing, extracting
transforming, loading)
Execute the analysis
Interpret the analysis
Automate data preparation and execution of analysis, if the business
question has to be answered more than once




ETL
Stored procedures
The above steps can also be iterated, not necessarily performed in
sequential order
We focus on the data preparation step
ISQS 6347, Data & Text Mining
9
Characteristics of Analytic Business
Questions









Analysis complexity: real analysis or reporting
Analysis paradigm: statistics or data mining
Data preparation paradigm: as much as data as possible or
business knowledge first
Analysis method: supervised or unsupervised analysis
Scoring needed – yes/no
Periodicity of analysis: one-shot or re-run
Historic data needed, yes/no
Data structure: one row or multiple rows per subject
Complexity of the analysis team
ISQS 6347, Data & Text Mining
10
Components of the SAS System
Reporting
And
Graphics
Data Access
And
Management
User
Interface
Analytical
Base SAS
Application
Development
Visualization
And Discovery
Business
Solutions
Web
Enablement
ISQS 6347, Data & Text Mining
11
SAS Programming Essentials

Find more information from

http://support.sas.com
ISQS 6347, Data & Text Mining
12
Data-driven Tasks

The functionality of the SAS System is built around
four data-driven tasks common to virtually any
applications




Data access
Data management
Data analysis
Data presentation
ISQS 6347, Data & Text Mining
13
Turning Data into Information

Process of delivery meaningful information

80% data-related






Access
Scrub
Transform
Mange
Store and retrieve
20% analysis
ISQS 6347, Data & Text Mining
14
Turning Data into Information
Data
DATA
Step
SAS
Data Sets
PROC
Steps
Information
ISQS 6347, Data & Text Mining
15
Design of the SAS System
MultiVendor Architecture
90%
independent
PC
10%
dependent
Workstation
Servers/
Midrange
Mainframe
ISQS 6347, Data & Text Mining
Super
Computer
16
...
Design of the SAS System
MultiEngine Architecture
DB2
Teradata
SAP
dBase
DATA
ORACLE
SYBASE
Microsoft Excel
ISQS 6347, Data & Text Mining
17
SAS Programming – Level I





Fundamentals (ch1-3)
Producing list reports (ch4)
Enhancing output (ch5)
Creating data sets (ch6)
Data step programming (ch7)








Reading data
Creating variables
Conditional processing
Keeping and dropping variables
Reading Excel files
Combining SAS data sets (ch8)
Producing summary reports (ch9)
SAS graphing (ch10)
ISQS 6347, Data & Text Mining
18
Course Scenario
In
this course, you work with business data
from International Airlines (IA). The various
kinds of data that IA maintains are listed below:





flight data
passenger data
cargo data
employee data
revenue data
ISQS 6347, Data & Text Mining
19
Course Scenario
The
following are some tasks that you will
perform:





importing data
creating a list of employees
producing a frequency table of job codes
summarizing data
creating a report of salary information
ISQS 6347, Data & Text Mining
20
SAS Programs
A SAS program is a sequence of steps that the user
submits for execution.
Raw
Data
DATA steps are typically used to create SAS
data sets.
DATA
Step
SAS
Data
Set
SAS
Data
Set
PROC
Step
Report
PROC steps are typically used to process
SAS data sets (that is, generate reports
and graphs, edit data, and sort data).
ISQS 6347, Data & Text Mining
21
SAS Programs
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
DATA
Step
proc print data=work.staff;
run;
proc means data=work.staff;
class JobTitle;
var Salary;
run;
ISQS 6347, Data & Text Mining
PROC
Steps
22
Step Boundaries
SAS steps begin with either of the following:
 DATA statement
 PROC statement
SAS detects the end of a step when it encounters
one of the following:
 a RUN statement (for most steps)
 a QUIT statement (for some procedures)
 the beginning of another step (DATA statement
or PROC statement)
ISQS 6347, Data & Text Mining
23
Step Boundaries
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff;
proc means data=work.staff;
class JobTitle;
var Salary;
run;
ISQS 6347, Data & Text Mining
24
Running a SAS Program
You can invoke SAS in the following ways:
 interactive windowing mode (SAS windowing
environment)
 interactive menu-driven mode (SAS Enterprise Guide,
SAS/ASSIST, SAS/AF, or SAS/EIS software)
 batch mode
 noninteractive mode
ISQS 6347, Data & Text Mining
25
Preparation of SAS Programming


Data sets: \SAS-Programming
Create a user defined library reference

Statement
LIBNAME libref ‘SAS-data-library’ <options>;

Example
LIBNAME ia ‘c:\workshop\winsas\prog1’;

Two-levels of SAS files names
Libref.fielname
ISQS 6347, Data & Text Mining
26
SAS Programming Essentials


Demon: c02s2d1
Exercise: c02ex1
ISQS 6347, Data & Text Mining
27
Browsing the Descriptor Portion

General form of the CONTENTS procedure:

PROC CONTENTS DATA=SAS-data-set;
RUN;

Example:
proc contents data=work.staff;
run;
ISQS 6347, Data & Text Mining
c02s3d1
28
SAS Data Sets: Data Portion
The data portion of a SAS data set is a rectangular table
of character and/or numeric data values.
JobTitle
Salary
TORRES
LANGKAMM
SMITH
WAGSCHAL
TOERMOEN
JAN
SARAH
MICHAEL
NADJA
JOCHEN
Pilot
Mechanic
Mechanic
Pilot
Pilot
50000
80000
40000
77500
65000
Character values
Variable
values
FirstName
Variable
names
LastName
Numeric
values
Variable names are part of the descriptor portion, not the
data portion.
ISQS 6347, Data & Text Mining
29
SAS Variable Values
There are two types of variables:
character
contain any value: letters, numbers, special
characters, and blanks. Character values are
stored with a length of 1 to 32,767 bytes. One
byte equals one character.
numeric
stored as floating point numbers in 8 bytes
of storage by default. Eight bytes of floating point
storage provide space for 16 or 17 significant
digits. You are not restricted to
8 digits.
ISQS 6347, Data & Text Mining
30
SAS Data Set and Variable Names
SAS




names have these characteristics:
can be 32 characters long.
can be uppercase, lowercase, or mixed-case.
are not case sensitive.
must start with a letter or underscore.
Subsequent characters can be letters,
underscores, or numerals.
ISQS 6347, Data & Text Mining
31
Valid SAS Names

Select the valid default SAS names.
 data5mon
ISQS 6347, Data & Text Mining
32
...
Valid SAS Names

Select the valid default SAS names.
 data5mon
ISQS 6347, Data & Text Mining
33
...
Valid SAS Names

Select the valid default SAS names.
 data5mon

 5monthsdata
ISQS 6347, Data & Text Mining
34
...
Valid SAS Names

Select the valid default SAS names.
 data5mon

 5monthsdata
ISQS 6347, Data & Text Mining
35
...
Valid SAS Names

Select the valid default SAS names.
 data5mon

 5monthsdata
 data#5
ISQS 6347, Data & Text Mining
36
...
Valid SAS Names

Select the valid default SAS names.
 data5mon

 5monthsdata
 data#5
ISQS 6347, Data & Text Mining
37
...
Valid SAS Names

Select the valid default SAS names.
 data5mon

 5monthsdata
 data#5
 five months data
ISQS 6347, Data & Text Mining
38
...
Valid SAS Names

Select the valid default SAS names.
 data5mon

 5monthsdata
 data#5
 five months data
ISQS 6347, Data & Text Mining
39
...
Valid SAS Names

Select the valid default SAS names.
 data5mon

 5monthsdata
 data#5
 five months data
 fivemonthsdata
ISQS 6347, Data & Text Mining
40
...
Valid SAS Names

Select the valid default SAS names.
 data5mon

 5monthsdata
 data#5
 five months data
 fivemonthsdata
ISQS 6347, Data & Text Mining
41
...
Valid SAS Names

Select the valid default SAS names.
 data5mon

 5monthsdata
 data#5
 five months data
 fivemonthsdata
 FiveMonthsData
ISQS 6347, Data & Text Mining
42
...
Valid SAS Names

Select the valid default SAS names.
 data5mon

 5monthsdata
 data#5
 five months data
 fivemonthsdata
 FiveMonthsData
ISQS 6347, Data & Text Mining
43
...
Missing Data Values
A value must exist for every variable for each observation.
Missing values are valid values.
LastName
FirstName
JobTitle
Salary
TORRES
LANGKAMM
SMITH
WAGSCHAL
TOERMOEN
JAN
SARAH
MICHAEL
NADJA
JOCHEN
Pilot
Mechanic
Mechanic
Pilot
50000
80000
.
77500
65000
A character missing
value is displayed as
a blank.
ISQS 6347, Data & Text Mining
A numeric
missing value
is displayed
as a period.
45
Browsing the Data Portion
The
PRINT procedure displays the data
portion
of a SAS data set.
By
default, PROC PRINT displays the
following:



all observations
all variables
an Obs column on the left side
ISQS 6347, Data & Text Mining
46
Browsing the Data Portion

General form of the PRINT procedure:

PROC PRINT DATA=SAS-data-set;
RUN;

Example:
proc print data=work.staff;
run;
ISQS 6347, Data & Text Mining
c02s3d1
47
SAS Data Set Terminology
SAS documentation and text in the SAS windowing
environment use the following terms interchangeably:
SAS Data Set
SAS Table
Variable
Column
Observation
Row
ISQS 6347, Data & Text Mining
48
SAS Syntax Rules
SAS statements have these characteristics:
 usually begin with an identifying keyword
 always end with a semicolon
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff;
run;
proc means data=work.staff;
class JobTitle;
var Salary;
run;
ISQS 6347, Data & Text Mining
49
SAS Syntax Rules
SAS statements are free-format.
 One or more blanks or special characters can
be used to separate words.
 They can begin and end in any column.
 A single statement can span multiple lines.
 Several statements can be on the same line.
Unconventional Spacing
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc means data=work.staff;
class JobTitle; ISQS 6347,
var
Salary;run;
Data & Text Mining

50
...
SAS Syntax Rules
SAS statements are free-format.
 One or more blanks or special characters can
be used to separate words.
 They can begin and end in any column.
 A single statement can span multiple lines.
 Several statements can be on the same line.
Unconventional Spacing
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc means data=work.staff;
class JobTitle; ISQS 6347,
var
Salary;run;
Data & Text Mining

52
...
SAS Syntax Rules
SAS statements are free-format.
 One or more blanks or special characters can
be used to separate words.
 They can begin and end in any column.
 A single statement can span multiple lines.
 Several statements can be on the same line.
Unconventional Spacing
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc means data=work.staff;
class JobTitle; ISQS 6347,
var
Salary;run;
Data & Text Mining

53
...
...
SAS Syntax Rules
SAS statements are free-format.
 One or more blanks or special characters can
be used to separate words.
 They can begin and end in any column.
 A single statement can span multiple lines.
 Several statements can be on the same line.
Unconventional Spacing
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc means data=work.staff;
class JobTitle; ISQS 6347,
var
Salary;run;
Data & Text Mining

54
...
...
SAS Syntax Rules
SAS statements are free-format.
 One or more blanks or special characters can
be used to separate words.
 They can begin and end in any column.
 A single statement can span multiple lines.
 Several statements can be on the same line.
Unconventional Spacing
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc means data=work.staff;
class JobTitle; ISQS 6347,
var
Salary;run;
Data & Text Mining

55
SAS Syntax Rules
Good spacing makes the program easier to read.
Conventional Spacing
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff;
run;
proc means data=work.staff;
class JobTitle;
var Salary;
run;
ISQS 6347, Data & Text Mining
56
SAS Comments

Type /* to begin a comment.

Type your comment text.

Type */ to end the comment.
/* Create work.staff data set */
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
/* Produce listing report of work.staff */
proc print data=work.staff;
run;
ISQS 6347, Data & Text Mining
c02s3d2
57
Syntax Errors
Syntax errors include the following:
misspelled keywords
 missing or invalid punctuation
 invalid options
daat work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;

proc print data=work.staff
run;
proc means data=work.staff average max;
class JobTitle;
var Salary;
run;
ISQS 6347, Data & Text Mining
58
Debugging a SAS
Program
c02s4d1.sas
userid.prog1.sascode(c02s4d1)
c02s4d2.sas
userid.prog1.sascode(c02s4d2)

This demonstration illustrates how to
submit a
SAS program that contains errors,
diagnose the errors, correct the
errors, and save the corrected
program.
ISQS 6347, Data & Text Mining
59
Recall a Submitted Program
Program statements accumulate in a recall buffer
each time you issue a SUBMIT command.
daat work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff
run;
proc means data=work.staff average max;
class JobTitle;
var Salary;
run;
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff;
run;
proc means data=work.staff mean max;
class Jobtitle;
var Salary;
ISQS 6347, Data & Text Mining
run;
Submit
Number 1
Submit
Number 2
60
Recall a Submitted Program
Issue the RECALL command once to recall the most
recently submitted program.
Submit
Number 1
Issue RECALL
once.
Submit
Number 2
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff;
run;
proc means data=work.staff mean max;
class JobTitle;
var Salary;
run;
Submit Number 2 statements
are recalled.
ISQS 6347, Data & Text Mining
61
Recall a Submitted Program
Issue the RECALL command again to recall Submit
Number 1 statements.
Submit
Number 1
Issue RECALL
again.
Submit
Number 2
daat work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff
run;
proc means data=work.staff average max;
class JobTitle;
var Salary;
run;
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff;
run;
proc means data=work.staff mean max;
class JobTitle;
var Salary;
run;
ISQS 6347, Data & Text Mining
62
Exercise 8: Basic SAS Programming






Define library IA and Out
Go through all SAS programs in Chapter 2-5.
Write a SAS program to read a dataset created by
yourself or simply use Person0.txt in
\\TechShare\coba\d\ISQS3358\OtherDatasets\ .
The dataset is output to your library Out.
Try to apply whatever SAS features in Chapter 5
of Prog-I to general a nice looking report.
Go through all exercises for Ch 2, 3, 4, 5, 6 (answer keys
are available, so no need to submit the results)
ISQS 6347, Data & Text Mining
63
Hands-on exercise



Write a SAS program to calculate the number
of dates passed in 2012 to 3/3/2012. The
input is in the format: date9.
01JAN2012 03MAR2012
Answer: 62 days
ISQS 6347, Data & Text Mining
64
Making Use of SAS Enterprise Guide
Code

Import a text file


Example: Orders.txt
Import an Excel file

Example: SupplyInfo.xls
ISQS 6347, Data & Text Mining
65
Learn from Examples

SAS Help


Contents -> Learning to use SAS -> Sample SAS
Programs -> Base SAS
“Base Usage Guide Examples”

Chapter 3, 4
ISQS 6347, Data & Text Mining
66
Import an Excel Sheet
proc import out=work.commrex
datafile ="C:\Lin\Shared\ISQS6339\Commrex_3358.xls" dbms=excel
replace;
sheet="Company";
getnames=yes;
mixed=no;
scantext=yes;
usedate=yes;
scantime=yes;
run;
proc print data=work.commrex;
run;
ISQS 6347, Data & Text Mining
67
Excel SAS/ACCESS LIBNAME Engine
libname xlsdata 'C:\Lin\Shared\ISQS6339\Commrex_3358.xls';
proc print data=xlsdata.New1;
run;
ISQS 6347, Data & Text Mining
68
Exercise 9: SAS Data Step Programming

http://zlin.ba.ttu.edu/6339/ExerciseInstructions9.htm
ISQS 6347, Data & Text Mining
69