SAS: Reading Census Data

Census data is exceptionally large. The 2016 census profile for Ontario is 4.5G and has more than 46,694,909 lines of records. To extract the data efficiently, StatsCan provides a csv file that identifies the starting row number for each geography. Using this file, you can compile the parameter list for the geographical area of interest at the selected geographic level, eg. province level, census division level, and census subdivision level.
Census file can be downloaded at link
Step 1: Compile parameter lists

%put &name.;
Canada Ontario Durham York Toronto Peel Halton
 %put &start.;
2 2249 7287023 9513800 12198965 20521853 25289987
 %put &end.;
2248 4495 7289269 9516046 12201211 20524099 25292233

Step 2: Extract and Compile data

%macro ext (namelst=, startlst=, endlst=);
proc datasets library=work noprint;
delete census;
quit;
%let i=1;
%do %while (%scan(&namelst., &i, ' ') ne );
%let parm1=%scan(&namelst., &i, ' ');
%let parm2=%scan(&startlst., &i, ' ');
%let parm3=%scan(&endlst., &i, ' ');
data census_&parm1.;
infile 'X:\Work\Stats Can\98-401-X2016044_ONTARIO_eng_CSV\98-401-X2016044_ONTARIO_English_CSV_data.csv'
delimiter = ',' firstobs=&parm2. obs=&parm3. TRUNCOVER  DSD LRECL=32767 ;
INFORMAT 
year 8.
geo_code  $13.
geo_level 8.
geo_name $80.
gnr 8.1
gnr_lf 8.1
quality_flag $5. 
alt_geo_code 8.
Item $50.
itemID 8.
Notes 8.
Total 8.
Male $8.
Female $8.
;
FORMAT 
year 8.
geo_code  $13.
geo_level 8.
geo_name $50.
gnr 8.1
gnr_lf 8.1
quality_flag $5. 
alt_geo_code 8.
Item $80.
itemID 8.
Notes 8.
Total 8.
Male $8.
Female $8.
;
input
year 
geo_code $
geo_level 
geo_name $
gnr
gnr_lf
quality_flag 
alt_geo_code 
Item $
itemID 
Notes 
Total
Male $
Female $
;
run;
proc append base = census data = census_&parm1.;
run;
%let i = %eval (&i +1);
%end;
%mend ext;
%ext (namelst=&name., startlst=&start., endlst=&end. );

The resulting dataset is only 4.1mb, which you can manipulate efficiently.

SAS: Batch Macro Variable Assignment by Class and Application in Dynamic Labeling

Set up “Class” table for Proc Tabulation.  “Class” table can force missing value in the table so the layout of the combination of class variables will not change due to 0 counts of the frequency of the variables.

data class;
do rank= 1, 2, 3;
do area = '0', '1';
do gender = 'F', 'M';
output;
end;
end;
end;
run;

Output Proc Tabulate summary to data set using “Class” layout. “Layout” data set contains frequency counts by rank by area and by gender.

proc tabulate  data = data out = layout missing classdata = class;
class rank area gender ;
var count;
table rank*(area="" all), (gender all)*count*(sum="");
run;

Add customary titles by rank by area in the “Layout” data set. Assign counts and titles to macro variables.

data layout;
set layout;
length varname $11 sum $2 title1 $60 title2 $60 vartitle1 $12 vartitle2 $12;
if count_sum = . then count_sum = 0;
sum = put(count_sum, 2.);
drop _TYPE_ _PAGE_ _TABLE_;
if gender = '' then gender = 'N';
if rank = 1 then title1 = "2017 Assistant Professors/Lecturers";
if rank=2 then title1 = "2017 Associate Professors/Lecturers";
if rank =3 then title1 = "2017 Fall Professors/Senior Lecturers";
if area = 0 then title2 = "General Arts"; 
if area = 1 then title2 = "Engineering";
if area = '' then do;
title2 = "All Areas of Specialization";
varname ="rank"||trim(left(rank_cd))||gender;
vartitle1 = "rank"||trim(left(rank_cd))||'t1';
vartitle2 = "rank"||trim(left(rank_cd))||'t2';
end;
else do;
varname = "rank"||trim(left(rank_cd))||"area"||trim(left(area))||gender;
vartitle1 = "rank"||trim(left(rank_cd))||"area"||trim(left(area))||'t1';
vartitle2 = "rank"||trim(left(rank_cd))||"area"||trim(left(area))||'t2';
end;
call symput(varname, sum);
call symput (vartitle1, title1);
call symput (vartitle2, title2);
run;
options symbolgen;

Application
Print the regression line by rank by area with corresponding titles and insert counts information of total observation and by gender on the respective chart. The “Inset” statement will translate the macro variables into format like N=75(46F/29M) and put into a text box inside of the axes of the plot.

ods pdf file="c:\test.pdf";
%macro plots;
%do j = 1 %to 3;
%do i = 0 %to 1;
title1 height=14pt "&&rank&j.area&i.t1";
title2 height=12pt "&&rank&j.area&i.t2";
proc sgplot data=glm_out (where=(rank=&j. and area = "&i.")) ;
scatter x=experience y=salary / group=gender grouporder=ascending  name='plot' markerattrs=(symbol=circlefilled) ;
series x=experience y=yhat / name='predict' legendlabel='ln(Predicted Sal)' lineattrs=(color=blue ) transparency = 0.5 ;
inset "N=&&rank&j.area&i.N (&&rank&j.area&i.f.F/&&rank&j.area&i.m.M)"  / position = bottomright  BORDER;
run;
quit;
%end;
%end;
%mend plots;
%plots
Ods pdf close;

SAS: Compile varlist and run report for each variable

Recently, I found Proc Sql can allow you to create macro variable with long string values separated by either space or other delimiters directly without extra step for table generation and null data step. Here is the modify codes.

proc sql;
select distinct varname into :varlist separated by ' ' from table;
quit;
%put &varlist.;

Sometime the macro variable is created within a macro, then the macro variable will be just local macro variable and it will not be available beyond the macro statement. To force the macro variable to be available in the global environment, using %Global.

%macro new;
proc sql;
select distinct varname into :varlist separated by ' ' from table;
quit;
%global varlist;
%mend new;
%put &varlist.;

If you would like the varlist items to be wrapped with quotation and separated with comma, then use the following code.

%macro new;
proc sql;
select distinct '"'||varname||'"' as temp into :varlist separated by ', ' from table;
quit;
%global varlist;
%mend new;
%put &varlist.;

If the same macro variable is defined or used in multiple macros, making sure the macro variable is set at the global level, because macro variables are local at default. You might get error message like below if the macro variable is not forced to be global variable.

ERROR: Attempt to %GLOBAL a name (NAME) which exists in a local environment.

Here is an example of use previously created macro variable in another macro.

%macro del;
%global varlist;
proc datasets nolist;
delete &varlist;
run;
%mend del;

previous reference: http://www2.sas.com/proceedings/sugi30/028-30.pdf

proc sql;
create table varlist as
select distinct varname from table;
quit;

%let varlist =;

data _null_;
set varlist;
call symput('varlist', trim(resolve('&varlist'))||' '||trim(varname));
run;

null is a special data set name, although sas doesn’t create a data set called “_null”. It allows sas to carry out commands in the data steps.  There is no output data set.

Omitting value in the %let syntax will produce a null value for the macro variable.

resolve function: resolve the value of macro variable in data step.

CALL SYMPUT  assigns value produced in a DATA step to macro variable(S).

SAS: Variables/Fields with same Prefix

I have a SAS file that stores students course selection (up to 20 fields) and grades for each course (up to 20 fields) for a certain semester. To set the sas working file, I use the follow data step. Both course variables and grades variables have the common prefix “crs” and “grade”.  It is more efficient to list the 20 course variables using “crs1-crs20” than list each and every one of them.

%let varlist = crs1-crs20 grade1-grade20 studentid major;

Data tem1;

Set  coursefile_fw16 (keep = &varlist.);

run;

 

Heuristic Andrew

Good-enough solutions for an imperfect world

Social Media Oreo

Social Media Oreo

Real Data

Adventures in Data Science

A. C. Cockerill

Past to Future in a Picosecond

Social Mathematics

The interaction between Mathematics and the modern day world.

PH Bytes

Code is the New Literacy

Data Science Insights

In Pursuit of Smarter Decisions for Performance