SAS: Reading Census Data

Census data is exceptionally large. The 2016 census profile for Ontario is 4.5G and has more than 46,694,909 lines of records. To extract the data efficiently, StatsCan provides a csv file that identifies the starting row number for each geography. Using this file, you can compile the parameter list for the geographical area of interest at the selected geographic level, eg. province level, census division level, and census subdivision level.
Census file can be downloaded at link
Step 1: Compile parameter lists

%put &name.;
Canada Ontario Durham York Toronto Peel Halton
 %put &start.;
2 2249 7287023 9513800 12198965 20521853 25289987
 %put &end.;
2248 4495 7289269 9516046 12201211 20524099 25292233

Step 2: Extract and Compile data

%macro ext (namelst=, startlst=, endlst=);
proc datasets library=work noprint;
delete census;
quit;
%let i=1;
%do %while (%scan(&namelst., &i, ' ') ne );
%let parm1=%scan(&namelst., &i, ' ');
%let parm2=%scan(&startlst., &i, ' ');
%let parm3=%scan(&endlst., &i, ' ');
data census_&parm1.;
infile 'X:\Work\Stats Can\98-401-X2016044_ONTARIO_eng_CSV\98-401-X2016044_ONTARIO_English_CSV_data.csv'
delimiter = ',' firstobs=&parm2. obs=&parm3. TRUNCOVER  DSD LRECL=32767 ;
INFORMAT 
year 8.
geo_code  $13.
geo_level 8.
geo_name $80.
gnr 8.1
gnr_lf 8.1
quality_flag $5. 
alt_geo_code 8.
Item $50.
itemID 8.
Notes 8.
Total 8.
Male $8.
Female $8.
;
FORMAT 
year 8.
geo_code  $13.
geo_level 8.
geo_name $50.
gnr 8.1
gnr_lf 8.1
quality_flag $5. 
alt_geo_code 8.
Item $80.
itemID 8.
Notes 8.
Total 8.
Male $8.
Female $8.
;
input
year 
geo_code $
geo_level 
geo_name $
gnr
gnr_lf
quality_flag 
alt_geo_code 
Item $
itemID 
Notes 
Total
Male $
Female $
;
run;
proc append base = census data = census_&parm1.;
run;
%let i = %eval (&i +1);
%end;
%mend ext;
%ext (namelst=&name., startlst=&start., endlst=&end. );

The resulting dataset is only 4.1mb, which you can manipulate efficiently.

Reading Data from fixed position txt file

Read data from fixed position txt file to SAS.

This type of file normally don’t have the variable name as the first observation because the length of the variable name might not be consistent with the length of the variable. The starting position for each variable is the same for all the records in the txt file.

Below in the admit table in fixed position txt file.

2458 Murray, W      M 27 1  72 168 HIGH 85.20
2462 Almers, C      F 34 .  66 152 HIGH 124.80
2501 Bonaventure, T F 31 .  61 123 LOW  149.75
2523 Johnson, R     M 43 31 63 137 MOD  149.75
2539 LaMance, K     M 51 4  71 158 LOW  124.80
2544 Jones, M       M 29 6  76 193 HIGH 124.80
  • Require . for missing value
  • use $ in input statement to identify character variable
  • use X-X to identify the starting position and ending position for the character variables.  This also determines the length of the character variable.
  • use @x to identify the starting position for the numeric variable.
  • use x. to identify the length of the numeric variable. eg: @23 age 2. indicates to read from position 23 the age variable with a length of 2.

Method1:  not identify the position for the numeric variable.

data admit;
infile 'C:\Users\irisan\Documents\My SAS Files(32)\9.3\Tutorial\admit_fixed.txt' ;
input ID $ 1-4 name $ 6-19 sex $ 21 age date height weight actlevel $ 36-39 fee;
run;

Method2: identify the position for each variable.

data admit;
infile 'C:\Users\irisan\Documents\My SAS Files(32)\9.3\Tutorial\admit_fixed.txt' ;
input
ID $ 1-4
name $ 6-19
sex $ 21
@23 age 2.  /* length for age is 2 */
@26 date 
@29 height
@32 weight
actlevel $ 36-39
@41 fee ;
run;

Read only  certain observations that meet the criteria for selected variables from the txt file, using 2 input statements and if statement.  Second input statement only contains the required variables, so not all the variables are read into the target SAS dataset.

data admit;
infile 'C:\Users\irisan\Documents\My SAS Files(32)\9.3\Tutorial\admit_fixed.txt' ;
input actlevel $ 36-39 @23 age  @;
if  age < 35 and actlevel EQ 'HIGH' then delete;
input
name $ 6-19
sex $ 21
@29 height
@32 weight
@41 fee ;
run;

 

SAS: Organizing data through Proc Format

Proc Format can be used to group categorical variables or numeric values for reporting.  The following examples show the coversion of sex, age, and id variables into desired format. Method 1 uses put and input statements and creates new variables, while method 2 use format statement to just format the existing variables.

  • Need $ in the format name for the existing character variable. eg. $gender
  • Need to put . after the format name when applying the format. eg. gender = put(sex, $gender.); format age agerange. sex $gender.;

/**** Proc Format ****/
proc format;
value
$gender
‘M’ = ‘Male’
‘F’ = ‘Female’
;
value
agerange
1 – 30 = ‘le 30’
30<-40 = '31 to 40'
40<-50 = '41 to 50'
50<-60 = '51 to 60'
60<-100 = 'gt 60'
other = 'Missing'
;
run;
options fmtsearch = (work.formats);

Method1:

data admit;
set sasuser.admit;
agecatgory = put(age, agerange.);
gender = put (sex, $gender.);
numericid = input(id, 4.);
run;

Method2:

data admit;
set sasuser.admit;
format age agerange. sex $gender.;
run;

Character to Numeric, using INVALUE.
Example: convert character gpa to numeric gpa.

/*** calculate the ESL grade points that needs to be backed out from the GPA ***/
/** format grade to index **/
proc format ; 
invalue   /* UES INVALUE, NOT VALUE */
$grade
'A+' = 9
'A'= 8
'B+'= 7
'B'= 6
'C+' = 5
'C' = 4
'D+' = 3
'D' = 2
'E' = 1 
'F' = 0
'P' = 0
;
run;
proc options  option=fmtsearch;run;
data backgpa;
set pes;
array crs{3} esl1000 esl1010 esl1015  ;
array ncr{3} esl1000ncr esl1010ncr esl1015ncr  ;
array cr{3} esl1000cr esl1010cr esl1015cr  ;
array back{3} esl1000bo esl1010bo esl1015bo;
do i =1 to dim(crs);
if crs{i} ne '' and ncr{i} ne 'NCR' then  back{i} =  input(crs{i}, $Grade.) * input(cr{i}, 8.);
end;
drop course1-course20 gr1-gr20 credit1-credit20 ncr1-ncr20 i;
run;
Heuristic Andrew

Good-enough solutions for an imperfect world

Social Media Oreo

Social Media Oreo

Real Data

Adventures in Data Science

A. C. Cockerill

Past to Future in a Picosecond

Social Mathematics

The interaction between Mathematics and the modern day world.

PH Bytes

Code is the New Literacy

Data Science Insights

In Pursuit of Smarter Decisions for Performance