Monday, June 5, 2017

SQL Advanced Joins (SAS)

This article is a short tutorial pertaining to the joining of data sets within SAS. Though SAS code does contain its own capabilities for data merging, I personally feel that data merging is far cleaner and efficient, from a coding perspective, when performed within SQL, or within SAS through the utilization of PROC SQL.

I am assuming that you are familiar with how SQL joins are utilized. However, if you are not, here is a quick summary of the most useful types of joins and how they operate:

The Left Join - All data from the left data set is joined with the data from the right data set. The matches are made based on primary unique identifiers from each table. All data that does not match from the right table is discarded. However, all data from the left data set remains regardless.

The Right Join - All data from the right data set is joined with the data from the left data set. The matches are made based on primary unique identifiers from each table. All data that does not match from the left set is discarded. However, all data from the right data set remains regardless.

The Inner Join - All data from the right data set is joined with data from the left data set The matches are made based on primary unique identifiers from each table. All data that does not match from either set is discarded.

The Full Outer Join - All data from the right data set is joined with data from the left data set. The matches are made based on primary unique identifiers from each table. All data that does not match from either set is kept regardless.

The easiest way to join two tables is to utilize a modified version of the code below:

proc sql ;
create table <tablename> as /* Name of data set that will be created */
select a.<varnames or *>, b.<varnames or *> /* Name of the variables that will be included within the new set */

from <TABLEA> as a left join /* type of join */ <TABLEB> as b /* The names of the original two sets that will be joined */

on a.<VARA> = b.<VARB> /* The primary keys from each data set from which matches will be assessed */

;

quit;


I would recommend enabling the OPTION NOTHREADS; option that was discussed in a previous article. This option prevents SAS from potentially corrupting the original data sets while merging them to create a new set.

As an example, let us say that we wanted to left join SETA with SETB based on matching id variables.

proc sql ;

create table SETC as 
/* Name of data set that will be created */

select a.*, b.* 
/* Name of the variables that will be included within the new set */

from SETA as a left join /* All data that does not match SETA from SET B will be dropped from the set */ SETB as b 
/* The names of the original two sets that will be joined */

on a.ID1 = b.ID2 /* In this particular case, ID1 is primary key utilized from SETA, and ID2 is the primary key utilized from SETB */

;

quit;


For the code to run correctly, the primary keys which are being compared must be of the same data type. Also, be sure that the data variables which will be consolidated through the join do not have the same variable name.

Format/Informat/Length (SAS)

There exists some confusion, even amongst talented SAS Programmers, as to what the exact difference is between Format, Informat, and Length. This article attempts the explain the characteristics which differentiate each statement.

Format - The format statement specifies how stored SAS variable data is displayed.

Length - The length statement specifies the amount of memory allotted to a specific data variable. If length is modified, data could be potentially modified if the original variable length exceeds the newly established length.

Informat - The informat statement specifies the format in which a variable from an input source is read into SAS.

SQL Advanced Queries (SAS/SQL)

There are many useful SQL Queries that can be utilized within The SAS Enhanced Editor. Below are examples and descriptions of query types that I found to be particularly useful.

/* Selects multiples variables from the same data set and creates a new variable within the query results. This variable reflects the sum of all VARD observation values which satisfy the specifications of the WHERE clause. The query is ordered by the observation values of VARB. */

PROC SQL;
SELECT VARA, VARB, VARC, sum(VARD) as VARDSUM
FROM TABLEA
WHERE VARA >500
ORDER BY VARB
;
QUIT;


/****************************************************************************/

/* Selects multiple variables from the same data set and creates a new variable within the query results. This variable reflects the average of all VARD observation values which satisfy the specifications of the WHERE clause. The query is ordered the by the observation values of VARB. */

PROC SQL;
SELECT VARA, VARB, VARC, avg(VARD) as VARDSUM
FROM TABLEA
WHERE VARA >500
ORDER BY VARB
;
QUIT;


/****************************************************************************/

/* Selects multiple variables from the same data set and creates a new variable within the query results. This variable is comprised of the product value of VARD observation values multiplied by .5. These values must satisfy the specifications of the WHERE clause. The query is ordered by the observation values of VARB. */

PROC SQL;
SELECT VARA, VARB, VARC, VARD * .5 as VARDPRODUCT
FROM TABLEA
WHERE VARA >500
ORDER BY VARB
;
QUIT;


/****************************************************************************/

/* Selects multiple variables from the same data set. Only displays results which have a VARA value greater than 500. Order the data by VARB in descending order.*/

PROC SQL;
SELECT VARA, VARB, VARC, VARD
FROM TABLEA
WHERE VARA >500
ORDER BY VARB DESC
;
QUIT;


/****************************************************************************/

/* Select multiple variables from the same data set. Only display results which have a VARA value greater than 500. Order the results first by VARB and then by VARA.*/

PROC SQL;
SELECT VARA, VARB, VARC, VARD
FROM TABLEA
WHERE VARA >500
ORDER BY VARB VARA
;
QUIT;


/****************************************************************************/

/* Selects multiple variables from multiple data sets. Only display results which have a VARA value greater than 500. Order the results by VAR2.*/

PROC SQL;
SELECT TABLEA.VARA, TABLEA.VARB, TABLEA.VARC, TABLEB.VAR1, TABLEB.VAR2
FROM TABLEA, TABLEB
WHERE VARA >500
ORDER BY VAR2
;
QUIT;


/****************************************************************************/

/* Selects results only from a range that occurs between two limits. These limits will also be included in the query results. This particular operator is useful when querying  by date values. */

PROC SQL;
SELECT VARA
FROM TABLEA
WHERE VARA between 800 and 900
;
QUIT;


/****************************************************************************/

/* Selects results from a group of variable observations that contain certain string. These limits will also be included in the query results. */

PROC SQL;
SELECT VARA
FROM TABLEA
WHERE VARA contains ‘SAS’
;
QUIT;


/****************************************************************************/

/* Selects results from a group of variable observations that potentially match a selection of varying values. */

PROC SQL;
SELECT VARA
FROM TABLEA
WHERE VARA in (‘SPADE’, ‘DIAMOND’, CLUB’)
;
QUIT;


/****************************************************************************/

/* Selects results from a group of variable observations that potentially match. However, in this case, wildcard characters are utilized to specify potential matches. */

PROC SQL;
SELECT VARA
FROM TABLEA
WHERE VARA LIKE ’S_ADE’
;
QUIT;


/****************************************************************************/

/* Select results from a group of variable observations that potentially match against another value. However, in this case, wildcard characters are utilized to specify potential matches. This differs from the previous example as the ‘%’ character is used to act as a match against entire words */

PROC SQL;
SELECT VARA
FROM TABLEA
WHERE VAR LIKE ’% Jones’ /*This would return all individuals in a query with the last name of Jones */
;
QUIT;


/****************************************************************************/

/* When sorting by a variable that was defined within the submitted query, the keyword CALCULATED must be utilized. */

PROC SQL;
SELECT VARA, VARB, VARC, VARD * .5 as VARDPRODUCT
FROM TABLEA
WHERE CALCULATED VARD < 500
ORDER BY VARB
;
QUIT;


/****************************************************************************/

/* If you want to specify a title for a query that will display in the output, utilize the title1 statement. However, be aware that this title will be utilized as a title for every subsequent query unless a blank title is provided prior to the execution of the next query.*/

Title1 ‘This is a test query’;
PROC SQL;
SELECT VARA, VARB, VARC, VARD * .5 as VARDPRODUCT
FROM TABLEA
WHERE CALCULATED VARD < 500
ORDER BY VARB
;
QUIT;
Title1 ‘ ‘; /*This prevents the title from re-displaying on subsequent query results */


/****************************************************************************/

/* GROUP BY allows you to group by the observation types specified within the variable column specified.*/

PROC SQL;
SELECT VARA, VARB, VARC
FROM TABLEA
WHERE CALCULATED VARD < 500
GROUP BY VARA
;
QUIT;


/****************************************************************************/

/* Stacks two data sets vertically. */

PROC SQL;
SELECT *
FROM TABLEA
UNION ALL
SELECT *
FROM TABLEB
;
QUIT;


/****************************************************************************/

/* Formats a newly created variable displayed within the query results. */

PROC SQL;
SELECT VARA, VARB, VARC, VARD * .5 as VARDPRODUCT format=best32.
FROM TABLEA
WHERE CALCULATED VARD < 500
ORDER BY VARB
;
QUIT;