Economics 321: Applied Econometrics
Prof. George Jakubson

A Quick Primer on SAS Commands

A SAS job has two distinct components.

1. A DATA step (or steps) readies a dataset for analysis.
2. A PROC (for procedure) step (or steps) performs the analysis on the dataset created from a prior DATA step.

I'm not going to show you everything there is to know about SAS, but enough so you can do the exercises for this course and be able to do similar things for other courses (e.g. for a term paper).

I. General Comments

1. SAS commands end with a semicolon (;). They can span multiple lines - the program considers everything up to the next semicolon as part of the same logical statement. A common source of errors is forgetting a semicolon.

2. In all environments, comments start with an asterisk (*) and end with a semicolon, as follows:

      ** this is a comment ;

   In some but not all environments, the following syntax will also work:

      /* this is also a comment sometimes */

3. There are two kinds of variables. Numeric variables contain numerical values (numbers) and character variables contain text. Therefore, the number 1 and the character '1' are different. Means, regressions, etc., can only be calculated on numeric variables. Frequency distributions are the exception - they can be calculated on character variables (but it's a real nuisance).

II. The DATA Step version 1 (reading in raw data from within the program)

   DATA NEW ;          /* create the temporary dataset called new */
   INPUT A $ B C ;     /* read in the variables A, B, and C.
                          A is character, B and C numeric */
   CARDS ;             /* The data are on the lines following this statement */
   George 1 2
   Jen 3 4
   ;                   /* semicolon marks the end of the data */
   RUN ;               /* run command marks the end of a DATA or PROC step */

III. The DATA Step version 2 (reading in data from an existing dataset and creating new variables)

   DATA NEW2 ;         /* create a new temporary dataset called new2 */
   SET NEW ;           /* read in the data from a dataset called new.
                          If both names are the same, this will overwrite
                          the original dataset */
   LNB = LOG(B) ;      /* take the natural logarithm of variable B */

   IF (C GT 3) THEN DUM1 = 1 ;
   ELSE DUM1 = 0 ;
      /* Create a dummy variable which takes the value 1 if variable C is
         greater than 3 and 0 otherwise. You have the following operators
         available to you:
            EQ for equals, NE for not equal to,
            GT for greater than, GE for greater than or equal to,
            LT for less than, LE for less than or equal to,
            AND for a logical and, OR for a logical or, NOT for a logical not,
            MIN and MAX for minimum and maximum, respectively. */

   D = (B+LNB)**3 ;
      /* create a new variable D using arithmetic operations on existing
         variables. You have the following symbols available:
         + for addition, - for subtraction, * for multiplication,
         / for division, and ** for exponentiation.
         You can use parentheses to group operations as I did above. */

   IF (A EQ 'Jen') ;
      /* this is a subsetting if statement - only keep those observations
         for which the variable A takes the value 'Jen'. Note that A is a
         character variable, so the value we match must be character and
         not numeric */

   IF (B LT 2) ;
      /* Keep the observation if variable B has a value less than 2. Note
         that successive subsetting if statements have a cumulative effect -
         in this example, only those observations for which A equals 'Jen'
         and B is less than 2 will be kept. (With the two observations read
         in above, none satisfies both conditions, so new2 ends up empty;
         drop these two statements if you want data left to analyze.) */

   RUN ;               /* end the data step */

IV. The PROC Step

PROC steps perform analyses. The syntax varies with the procedure. They all start

   PROC procname DATA=dataset ;

I'll sketch out means, correlations, frequencies, and regression for you:

A. PROC MEANS to get means, standard deviations, etc.

   PROC MEANS DATA=NEW ;   /* take means of variables in dataset new */
   VAR B C ;               /* only analyze variables B and C. If not included,
                              the default action is to analyze all numeric
                              variables */
   RUN ;                   /* end the PROC step */

B. PROC CORR to get correlations

   PROC CORR DATA=NEW ;
   VAR B C ;
   RUN ;

C. PROC FREQ to get frequency distributions

   PROC FREQ DATA=NEW2 ;   /* use new2, since DUM1 was created there and
                              does not exist in new */
   TABLES DUM1 A DUM1*A ;
      /* TABLES tells the procedure which variables to analyze. To get a
         frequency distribution on a variable, include its name in the
         tables command. To get a 2-way crosstabulation of the values of
         DUM1 against the values of A, use the DUM1*A syntax. */
   RUN ;

D. PROC REG to run regressions

   PROC REG DATA=NEW2 ;    /* use new2, since LNB and DUM1 were created there */
   MODEL LNB = C DUM1 ;
      /* The MODEL command specifies a regression equation to be estimated
         from the data. It starts with the word MODEL. The next element is
         the name of the dependent (Y) variable. Then there is an equals
         sign. Then come the names of the explanatory (X) variables. By
         default SAS will include an intercept for you. */
   RUN ;

V. Temporary and Permanent SAS datasets

SAS datasets are either permanent or temporary.

a. A temporary dataset has a one-level name, for example, new. Temporary datasets are erased when the job has completed.

b. A permanent dataset has a two-level name, for example, sasdat.new. Permanent datasets remain in existence until they are explicitly deleted. The first level of the name (sasdat, above) refers to the directory which contains the dataset. That is specified using a LIBNAME statement:

      LIBNAME sasdat 'directory location' ;

   so to make the current directory the location you would put '.' (Un*x speak for the current directory), or to make it /usr2/gj10 you would put '/usr2/gj10'.

My examples use temporary datasets - to use permanent datasets just use two-level names.

There's lots more that one can do, but these basics will cover the vast majority of the tasks you'll ever need to do. If you start using SAS more regularly, or for an honors thesis, the manuals are a reasonable investment. Alternatively, check
out the information under "Introduction to SAS" for a middle ground between
this primer and the (expensive) investment in manuals.
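
As a postscript to Section V, here is a minimal sketch of a job that saves a permanent dataset and then analyzes it. It assumes the libref sasdat points at the current directory ('.') and reuses the two made-up observations from Section II; substitute your own directory and data as needed.

```sas
LIBNAME sasdat '.' ;           /* assumption: keep permanent datasets in the
                                  current directory */

DATA sasdat.new ;              /* two-level name, so the dataset is permanent */
INPUT A $ B C ;                /* A is character, B and C numeric */
CARDS ;
George 1 2
Jen 3 4
;
RUN ;

PROC MEANS DATA=sasdat.new ;   /* analyze the permanent dataset */
VAR B C ;
RUN ;
```

A later job can skip the DATA step entirely: issue the same LIBNAME statement and refer to sasdat.new in a SET statement or in the DATA= option of a PROC.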