Comprehensive guide for Data Exploration in SAS (using Data step and Proc SQL)

Sunil Ray Last Updated : 19 Jul, 2020

9 min read

Introduction

I would like to extend my sincere gratitude to our readers for their overwhelming response on my previous articles on data exploration. These articles featured: variable identification, Univariate and Bivariate analysis, Missing and Outlier identification and treatment and feature engineering.

In this guide, I will take a step ahead and show all these steps to explore data sets practically in SAS. I will also perform some exercises that will help you understand the concept better. You can look at this article as practical implementation of my previous articles (in SAS).

I am hoping that this guide can act as a ready reference for our followers trying to navigate SAS on their own. Let’s get down to work!

Part 1: How to load data file(s) into SAS Data set?

The sources of input data sets can be in various formats (.XLS, .TXT, .CSV) and sources like databases. In SAS, we can use multiple methods to load data from these sources. Let’s look at the commands to load data from each dataset type mentioned above:

Proc Import
Data Step

Importing XLS/ CSV file using PROC Import:

Notes:

Both .xls and .xlsx files can be imported.
To import a .xlsx file, simply change xls to xlsx in both the datafile and the dbms line.
For importing CSV file, use CSV for dbms line.

We can also create a library from excel files using Libname statement (Each worksheet in the Excel workbook is treated as a SAS data set. Worksheet name appears with a dollar sign at the end of the name).

If SAS has a libref assigned to an Excel workbook, the workbook cannot be opened in Excel. To disassociate a libref, use a LIBNAME statement and specify the libref and the CLEAR option.

Importing raw text file (Txt) using PROC Import:

If your data file is a simple text file, you can use following commands:

It is assumed that the first row of the data set contains column names. If first row is not the column name, then we would change getnames=yes to getnames=no. After that, names of the columns would get stored as VAR1 to VARn.

Importing using Data step:

You can also make use of Data Step to import data from csv or text file.

Syntax:

Data output_set;
 INFILE 'raw_data_file_name';
 Input specifications;
 <additional statements>;
 Run;

Example: Import data from a csv file using data step, assuming values are separated by comma(,).

Above, we looked at multiple methods to load data set in SAS. To load data set from databases like ORACLE, SQL SERVER and others, we would require authorization from both SAS Admin or Database admin.

To explore this in detail, you can refer to links below:

Part 2: How to convert a variable to different data type?

We can convert character to numeric and numeric to character and also change the format of variable like number to date, date to number, number to currency format etc. Let’s look at some of the commands to perform these conversions:

Convert numeric variables to Character variables and vice versa

To perform this, we will use INPUT function. It takes two arguments: the name of a character variable and a SAS informat or user-defined informat to read the data.

Syntax:

INPUT (Source, Informat)

In snapshot below, you can see that variable Avg is in character format. Now to convert it into number, we’ll use Input function.

See below codes:

Similarly, if we want to convert a numeric variable to character, it can be done using PUT function.

Syntax:

Put(Source, Format)

2. Convert character date to Date

For more details on Input and Put function, you can refer below links:

Part 3: How to transpose a Data set?

Let us say, we want to transpose Table A into Table B on variable Product. This task can be accomplished in SAS using PROC Transpose:

For more detail on PROC Transpose, refer below link:

Part 4: How to sort Data set in SAS?

Sorting of data can be done using procedure PROC SORT. It can be based on multiple variables and ascending or descending both order.

Syntax:

PROC SORT Data = Input_data_set <Out = Output_data_set>;
 By <Descending> Variable_1 <Descending Variable_2 ....;
 Run;

Above, we have a table with variables ID, Product and Sales. Now, we want to sort it by Product and Sales (in descending order) as shown in table 2. This can be done using Proc Sort as shown below.

Part 5: How to create plots (Histogram, Scatter, Box Plot) in SAS?

Let’s understand plots using the example shown above. We have employee details with their EmpID, Gender, Age and Sales Detail. We want to understand:

The distribution of age
Relation between age and sales; and
If sales are normally distributed or not?

These tasks can be accomplished by using Scatter, Box and Histogram representation.

Histogram:

Now to understand the distribution and check whether the data is distributed normally or not, we will plot a Histogram. In SAS, histograms can be produced using PROC UNIVARIATE, PROC CHART, or PROC GCHART. Here we will use PROC UNIVARIATE with the HISTOGRAM statement.

Scatter plot:

It is used to find the relation b/w two continuous variables. Here we will use PROC SGPLOT to plot scatter graph.

Box-plot:

Box-Plot is used to understand the distribution of continuous variables. This is also known as five number summary plot of Min, First Quartile, Median, Third Quartile and Max. We will again use PROC SGPLOT to display the Box-plot.

For more details on PROC Univariate and PROC SGPLOT, you can refer below links:

Part 6: How to generate frequency tables in SAS?

Frequency Tables can be used to understand the distribution of a categorical variable or n categorical variables using frequency tables. We will use PROC FREQ procedure to perform this.

PROC FREQ is capable of producing statistical test and other statistical measures in order to analyze categorical data based on the cell frequencies in 2-way or higher tables.

I have added another variable BMI to above mentioned employee table. Now, to understand the distribution between GENDER and BMI, I will use PROC FREQ procedure with CHISQ statistical test.

For more detail on PROC FREQ, you can refer below link:

Resource

Part 7: How to do sampling of Data set in SAS?

To select an unbiased sample from a larger data set in SAS, we use procedure PROC SURVEYSELECT. Here we will go with PROC SURVEYSELCT.

Let’s say, from EMP table, I want to select random sample of 3 employee.

Part 8: How to remove duplicate values of a variable?

Often, we encounter duplicate observations. To tackle this, SAS has multiple options like FIRST., LAST., NODUPKEY with PROC SORT ,PROC SQL and others. Let’s understand these options one by one:

Method 1. Using First. or Last.

To use First. or Last. option, data set must be sorted by variable(s) on which we want to identify the unique records. First. and Last. automatic variables created by SAS when using by-group processing. It has value of 0 and 1.

If it is first observation of duplicate values of ‘by variable’ then value of variable First. would be 1 else 0
For Last variable, it would be 1 if last observation of duplicate values of ‘by variable’ else 0.

Above, you can see that how value of First. and last. is populated. Now, let’s see how can we use these two values to identify unique records.

Above, we have used first. to filter first observation and to filter last observation, we can use Last.

Method 2. NODUPKEY with Proc SORT

We can use NODUPKEY option with Proc Sort to remove duplicate values.

Method 3. Binning Numerical Variable

We can use conditional statements and logical operators to bin numerical variables. In Emp data set, we have variable Age. Here we will bin variable Age as <25, >=25 and <35, >=35.

Part 9: How to group variables in SAS to calculate count, average, sum?

To understand the count, average and sum of variable, I would suggest you to use PROC SQL with group by. There are other methods also like Proc FREQ and PROC Means to perform.

Let’s look at the syntax of these Procedures:

PROC SQL:

PROC SQL;
 Create table <Output Data set> as
 Select Count(Var1), Sum(Var2), Average(Var2) from <Input Data set> group by Var4, Var5...;
 Quit;

PROC Means:

PROC MEANS Data=<Input Data Set>;
 VAR Varibales(s);
 Class Classification_Varibale(s);
 Run;

Part 10: How to recognize missing values and outliers?

To identify outliers in a variable, we can go with Proc Univariate procedure and use PROC FREQ to identify missing values. Let’s look at the output below to understand these two procedures:

Above, you can see that PROC Univariate as shown top and bottom 5 values whereas PROC FREQ shows the distribution of unique values of variable.

Part 11: How to impute missing values and outliers?

There are various imputation methods available for missing and outlier imputation. You can refer these articles for methods to detect Outlier and Missing values. Imputation methods for both missing and outlier values are almost similar. Here we will discuss general case imputation methods to replace missing values. Let’s do it using an example:

Let’s say we have an employee data set comprising of multiple variables like Empid, Name, Gender, Sales, Age, Region, Product and other. Here, we want to predict the sales of employee. But, one of the concern is variable Age has missing values and variable Age appeared as significant variable.

Now to deal with this missing values, I have written below SAS statements:

Identify Values to Impute Using General Case Method (Average of Age):

Imputation Using Data Step

Second

Above, you have seen one of the methods to deal with it. You can also use multiple methods using SAS statements. I would suggest you to practice all the discussed method in my previous post on missing values and outliers.

Part 12: How to drop and rename variables in a data set?

Let’s say, during data exploration stage, we want to exclude variables those are not required in the data modelling exercise or want to rename few variables also. These two operations can be performed using DROP and RENAME options using DATA STEP.

Let’s say, we want to drop variable AGE and rename variable Gender as Sex. This can be performed using below statement.

Part 13: How to merge / join data set effectively?

Merging / Joining can be of various types. It depends on the business requirement and relationship between data sets. In SAS, we can perform this in various ways using DATA STEP, PROC SQL and PROC Format. Now, question is, which is the most appropriate method to perform merging and joining?

You can refer on of my post on this topic for detailed info. here: Introduction to Merging.

End Notes:

In this guide, we looked at the SAS statements for various steps in data exploration and munging like loading of data, converting data type, transposing tables, sorting, plotting, removing duplicate values, binning, grouping, identifying missing & outlier values, dropping & renaming variables, merging & joining tables and imputing values for missing and outlier values. We also looked at the basic SAS statement to perform this and have given links to look at more advance methods.

In one of the next article, I will reveal the codes to perform these steps in Python. Stay Tuned!

Did you find the article useful? Do let us know your thoughts about this guide through comments below.

If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on twitter or like our facebook page.

Sunil Ray

Sunil Ray is Chief Content Officer at Analytics Vidhya, India's largest Analytics community. I am deeply passionate about understanding and explaining concepts from first principles. In my current role, I am responsible for creating top notch content for Analytics Vidhya including its courses, conferences, blogs and Competitions.

I thrive in fast paced environment and love building and scaling products which unleash huge value for customers using data and technology. Over the last 6 years, I have built the content team and created multiple data products at Analytics Vidhya.

Prior to Analytics Vidhya, I have 7+ years of experience working with several insurance companies like Max Life, Max Bupa, Birla Sun Life & Aviva Life Insurance in different data roles.

Industry exposure: Insurance, and EdTech

Major capabilities: Content Development, Product Management, Analytics, Growth Strategy.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

krishna kant Dixit

Many Thanks Kunal, Great Work ! Thanks for sharing your understanding on different topics of data analytics , I didn't find better and easy explanation of things than in your blog so couldn't stop myself to thank you. Please keep posting ,Many analytics Beginners/enthusiast are following you . Thanks, Krishna Kant Dixit

Mohit

Hi , Thanks for your awesome tutorials.I need to understand Array with do loop? Do you have any theory explanations for that?

Reading list

Data analyst Learning Path

Tableau Learning Path

NLP Learning Path

Data Scientist Learning Path

Data Engineer Learning Path

MLOps Learning Path

AI Engineer Learning Path

Computer Vision Learning Path

Generative AI Learning Path

Generative AI Roadmap for Enterprises

LLMs Roadmap

Prompt Engineer Leaning Path

Comprehensive guide for Data Exploration in SAS (using Data step and Proc SQL)

Introduction

Contents:

Part 1: How to load data file(s) into SAS Data set?

Importing XLS/ CSV file using PROC Import:

Importing raw text file (Txt) using PROC Import:

Importing using Data step:

Part 2: How to convert a variable to different data type?

Convert numeric variables to Character variables and vice versa

2. Convert character date to Date

Part 3: How to transpose a Data set?

Part 4: How to sort Data set in SAS?

Part 5: How to create plots (Histogram, Scatter, Box Plot) in SAS?

Histogram:

Scatter plot:

Box-plot:

Part 6: How to generate frequency tables in SAS?

Part 7: How to do sampling of Data set in SAS?

Part 8: How to remove duplicate values of a variable?

Method 1. Using First. or Last.

Method 2. NODUPKEY with Proc SORT

Method 3. Binning Numerical Variable

Part 9: How to group variables in SAS to calculate count, average, sum?

Part 10: How to recognize missing values and outliers?

Part 11: How to impute missing values and outliers?

Part 12: How to drop and rename variables in a data set?

Part 13: How to merge / join data set effectively?

End Notes:

If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on twitter or like our facebook page.

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or