PROC SQL for DATA Step Die-Hards

Christianna S. Williams, Yale University

ABSTRACT

PROC SQL can be rather intimidating for those who have

learned SAS data management techniques exclusively

using the DATA step. However, when it comes to data

manipulation, SAS often provides more than one method

to achieve the same result, and SQL provides another

valuable tool to have in one’s repertoire. Further,

Structured Query Language is implemented in many

widely used relational database systems with which SAS

may interface, so it is a worthwhile skill to have from that

perspective as well.

This tutorial will present a series of increasingly complex

examples. In each case I will demonstrate the DATA

step method with which users are probably already

familiar, followed by SQL code that will accomplish the

same data manipulation. The simplest examples will

include subsetting variables (columns, in SQL parlance)

and observations (rows), while the most complex

situations will include MERGEs (JOINs) of several types

and the summarization of information over multiple

observations for BY groups of interest. This approach

will clarify the situations in which the DATA step or,

conversely, PROC SQL is the better choice. The

emphasis will be on writing clear, concise, debuggable

SAS code, not on which types of programs run the fastest

on which platforms.

INTRODUCTION

The DATA step is a real workhorse for virtually all SAS

users. Its power and flexibility are probably among the

key reasons why the SAS language has become so

widely used by data analysts, data managers and other

“IT professionals”. However, at least since version 6.06,

PROC SQL, which is the SAS implementation of

Structured Query Language, has provided another

extremely versatile tool in the base SAS arsenal for data

manipulation. Still, for many of us who began using SAS

prior to the addition of SQL or learned from hardcore

DATA step programmers, change may not come easily.

We are often too pressed for time in our projects to learn

something new or venture from the familiar, even though

it may save us time and make us stronger programmers

in the long run. Often SQL can accomplish the same

data manipulation task with considerably less code than

more traditional SAS techniques.

This paper is designed to be a relatively painless

introduction to PROC SQL for users who are already

quite adept with the DATA step. Several examples of row

selection, grouping, sorting, summation and combining

information from different data sets will be presented.

For each example, I’ll show a DATA step method

(recognizing that there are often multiple techniques to

achieve the same result) followed by an SQL method.

Throughout the paper, when I refer to “DATA step

methods”, I include under this term other base SAS

procedures that are commonly used for data

manipulation (e.g., PROC SORT, PROC SUMMARY). In each code

example, SAS keywords are in ALL CAPS, while arbitrary

user-provided parameters (i.e. variable and data set

names) are in lower case.

THE DATA

First, a brief introduction to the data sets. Table 1

describes the four logically linked data sets, which

concern the hospital admissions for twenty make-believe

patients. The variable or variables that uniquely identify

an observation are indicated in bold; the data sets are

sorted by these keys. Complete listings are included at

the end of the paper. Throughout the paper, it is

assumed that these data sets are located in a data library

referenced by the libref EX.
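
As a minimal sketch of what this assumption looks like in practice, the libref can be assigned with a LIBNAME statement before any of the examples are run. The path below is a placeholder, not the location used for this paper, and the WORK data set names are hypothetical. The two steps that follow simply confirm that the library resolves, by reading two of the variables listed in Table 1 once with the DATA step and once with PROC SQL.

```sas
* Hypothetical path - substitute the folder that holds the example data sets;
LIBNAME ex '/path/to/example/data';

* DATA step - read the admits data set and keep two of its variables;
DATA work.admits_peek;
   SET ex.admits (KEEP = pt_id admdate);
RUN;

* PROC SQL - the same selection, written as a query (output name is hypothetical);
PROC SQL;
   CREATE TABLE work.admits_peek_sql AS
   SELECT pt_id, admdate
   FROM ex.admits;
QUIT;
```

If the libref is not assigned, both steps stop with an error saying that the libref EX is not assigned, which makes this a quick check to run before working through the examples.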

Table 1. Description of data sets for examples

Data set   Variable   Description
admits     pt_id      patient identifier
           admdate    date of admission
           disdate    date of discharge
           hosp       hospital identifier
           bp_sys     systolic blood pressure (mmHg)
           bp_dia     diastolic blood pressure