How to Upload Data Into SAS: Data Exploration
Introduction
I would like to extend my sincere gratitude to our readers for their overwhelming response to my previous articles on data exploration. These articles covered variable identification, univariate and bivariate analysis, missing value and outlier identification and treatment, and feature engineering.
In this guide, I will take a step ahead and show all these steps to explore data sets practically in SAS. I will also perform some exercises that will help you understand the concepts better. You can look at this article as a practical implementation of my previous articles (in SAS).
I am hoping that this guide can act as a ready reference for our followers trying to navigate SAS on their own. Let's get down to work!
Contents:
Since this is an exhaustive guide, it is a good idea to list all the things I'll cover:
- How to load data file(s) into a SAS data set?
- How to convert a variable to a different data type?
- How to transpose a data set?
- How to sort a data set in SAS?
- How to create plots (Histogram, Scatter, Box Plot) in SAS?
- How to generate frequency tables in SAS?
- How to do sampling of a data set in SAS?
- How to remove duplicate values of a variable?
- How to group variables in SAS to calculate count, average, sum?
- How to recognize missing values and outliers?
- How to impute missing values and outliers?
- How to drop and rename variables in a data set?
- How to merge / join data sets effectively?
Part 1: How to load data file(s) into a SAS data set?
Input data can come in various file formats (.XLS, .TXT, .CSV) and from sources like databases. In SAS, we can use multiple methods to load data from these sources. Let's look at the commands to load data from each of the file types mentioned above:
- PROC IMPORT
- Data Step
Importing XLS/CSV file using PROC IMPORT:
Notes:
- Both .xls and .xlsx files can be imported.
- To import a .xlsx file, simply change xls to xlsx in both the datafile and the dbms line.
- For importing a CSV file, use CSV in the dbms line.
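The notes above refer to a PROC IMPORT call along the following lines. This is a minimal sketch: the file path, worksheet name and output data set name are hypothetical, and reading .xls/.xlsx files this way assumes SAS/ACCESS Interface to PC Files is licensed.

```sas
/* Hypothetical path and names; adjust to your environment. */
proc import datafile="C:\data\emp.xls"
            out=work.emp        /* output SAS data set */
            dbms=xls            /* use xlsx or csv for those file types */
            replace;            /* overwrite work.emp if it already exists */
    sheet="Sheet1";             /* assumed worksheet name; omit for CSV */
    getnames=yes;               /* first row holds the column names */
run;
```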
We can also create a library from Excel files using the LIBNAME statement. Each worksheet in the Excel workbook is treated as a SAS data set, and the worksheet name appears with a dollar sign at the end of the name.
If SAS has a libref assigned to an Excel workbook, the workbook cannot be opened in Excel. To disassociate a libref, use a LIBNAME statement and specify the libref and the CLEAR option.
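As a rough illustration of the LIBNAME approach, the sketch below assumes the EXCEL engine is available (it is part of SAS/ACCESS Interface to PC Files on Windows); the libref xl, the path and the worksheet name Sheet1 are hypothetical.

```sas
/* Assign a libref to the workbook (hypothetical path). */
libname xl excel "C:\data\emp.xls";

/* Worksheet names end in $, so they need a name literal. */
data work.emp;
    set xl.'Sheet1$'n;
run;

/* Release the workbook so it can be opened in Excel again. */
libname xl clear;
```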
Importing raw text file (TXT) using PROC IMPORT:
If your data file is a simple text file, you can import it with PROC IMPORT as well.
It is assumed that the first row of the data set contains column names. If the first row does not contain the column names, we would change getnames=yes to getnames=no. In that case, the columns are named VAR1 to VARn.
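A minimal sketch of such a text import, assuming a tab-delimited file at a hypothetical path:

```sas
/* Hypothetical path; the file is assumed to be tab-delimited. */
proc import datafile="C:\data\emp.txt"
            out=work.emp_txt
            dbms=dlm            /* generic delimited file */
            replace;
    delimiter='09'x;            /* hex code for the tab character */
    getnames=yes;               /* set to no if there is no header row */
run;
```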
Importing using Data Step:
You can also make use of the Data Step to import data from a CSV or text file.
Syntax:
DATA output_set;
    INFILE 'raw_data_file_name';
    INPUT specifications;
    <additional statements>;
RUN;
Example: Import data from a CSV file using a data step, assuming values are separated by a comma (,).
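One way this example could look, assuming a hypothetical file emp.csv with a header row and the columns EmpID, Name, Gender, Age and Sales:

```sas
data work.emp;
    /* dsd handles quoted values and consecutive commas;
       firstobs=2 skips the header row (assumed to exist). */
    infile "C:\data\emp.csv" dlm=',' dsd firstobs=2 truncover;
    /* The colon modifier reads character values up to the delimiter. */
    input EmpID Name :$20. Gender :$1. Age Sales;
run;
```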
Above, we looked at multiple methods to load a data set in SAS. To load data from databases like Oracle, SQL Server and others, we would require authorization from the SAS admin, the database admin, or both.
To explore this in detail, you can refer to the links below:
- Resource 1
- Resource 2
- Resource 3
Part 2: How to convert a variable to a different data type?
We can convert character to numeric and numeric to character, and also change the format of a variable, such as number to date, date to number, number to currency format, etc. Let's look at some of the commands to perform these conversions:
1. Convert numeric variables to character variables and vice versa
To convert a character variable to numeric, we will use the INPUT function. It takes two arguments: the name of a character variable and a SAS informat or user-defined informat to read the data.
Syntax:
INPUT(Source, Informat)
Suppose the variable Avg is stored in character format. To convert it into a number, we'll use the INPUT function.
See the code below:
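A small sketch of the conversion. The data set and the values are made up for illustration; only the variable name Avg comes from the text.

```sas
/* Illustrative data: Avg is deliberately read as character ($). */
data work.scores;
    input Name $ Avg $;
    datalines;
A 45.5
B 60
C 72.25
;
run;

data work.scores_num;
    set work.scores;
    /* The 8. informat reads up to 8 characters as a number. */
    Avg_num = input(Avg, 8.);
run;
```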
Similarly, if we want to convert a numeric variable to character, it can be done using the PUT function.
Syntax:
PUT(Source, Format)
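The reverse direction can be sketched the same way, again with made-up data. Note that PUT takes a format (how to write the value), whereas INPUT takes an informat (how to read it).

```sas
/* Illustrative data: Avg is numeric here. */
data work.scores_n;
    input Name $ Avg;
    datalines;
A 45.5
B 60
;
run;

data work.scores_char;
    set work.scores_n;
    /* put() returns a right-aligned string; strip() removes the padding. */
    Avg_char = strip(put(Avg, 8.2));
run;
```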
2. Convert character date to date
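A character date can be converted with the same INPUT function and a date informat. The sketch below assumes the text is in DDMONYYYY form; a different layout would need a different informat (for example mmddyy10.).

```sas
data work.dates;
    char_date = "15APR2015";               /* date stored as text */
    sas_date  = input(char_date, date9.);  /* numeric SAS date value */
    format sas_date date9.;                /* display it as a date */
run;
```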
For more details on the INPUT and PUT functions, you can refer to the links below:
- Resource 1
- Resource 2
Part 3: How to transpose a data set?
Let us say we want to transpose Table A into Table B on variable Product. This task can be accomplished in SAS using PROC TRANSPOSE.
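Since the two tables are not reproduced here, the sketch below assumes Table A has one row per ID and Product with a Sales value, and that Table B should have one row per ID with one column per Product. The data values are invented.

```sas
data work.table_a;
    input ID Product $ Sales;
    datalines;
1 A 100
1 B 200
2 A 150
2 B 250
;
run;

/* BY-group processing requires sorted input. */
proc sort data=work.table_a;
    by ID;
run;

proc transpose data=work.table_a out=work.table_b(drop=_name_);
    by ID;        /* one output row per ID */
    id Product;   /* Product values become column names */
    var Sales;    /* values that fill the new columns */
run;
```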
For more detail on PROC TRANSPOSE, refer to the links below:
- Resource 1
- Resource 2
Part 4: How to sort a data set in SAS?
Sorting of data can be done using PROC SORT. It can be based on multiple variables, each in ascending or descending order.
Syntax:
PROC SORT DATA = Input_data_set <OUT = Output_data_set>;
    BY <DESCENDING> Variable_1 <DESCENDING> Variable_2 ...;
RUN;
Suppose we have a table with variables ID, Product and Sales, and we want to sort it by Product and Sales (in descending order). This can be done using PROC SORT.
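A sketch of that sort with invented data. It assumes the descending order applies to Sales only; DESCENDING must be repeated before each variable it should affect.

```sas
data work.sales_data;
    input ID Product $ Sales @@;
    datalines;
1 B 200  2 A 150  3 A 300  4 B 120
;
run;

proc sort data=work.sales_data out=work.sales_sorted;
    by Product descending Sales;   /* Product ascending, Sales descending */
run;
```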
Part 5: How to create plots (Histogram, Scatter, Box Plot) in SAS?
Let's understand plots using an example. We have employee details with their EmpID, Gender, Age and Sales. We want to understand:
- The distribution of age
- The relation between age and sales; and
- Whether sales are normally distributed or not
These tasks can be accomplished using scatter, box and histogram plots.
Histogram:
To understand the distribution and check whether the data is normally distributed or not, we will plot a histogram. In SAS, histograms can be produced using PROC UNIVARIATE, PROC CHART, or PROC GCHART. Here we will use PROC UNIVARIATE with the HISTOGRAM statement.
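A minimal sketch with an invented employee table; the NORMAL option, which overlays a fitted normal curve, is my addition to make the normality check visual.

```sas
/* Illustrative employee data. */
data work.emp;
    input EmpID Gender $ Age Sales @@;
    datalines;
1 M 24 120  2 F 31 180  3 M 45 210  4 F 28 150  5 M 38 190  6 F 52 230
;
run;

proc univariate data=work.emp;
    var Sales;
    histogram Sales / normal;   /* histogram with a fitted normal curve */
run;
```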
Scatter plot:
It is used to find the relation between two continuous variables. Here we will use PROC SGPLOT to plot a scatter graph.
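For the age versus sales question, a scatter plot could be sketched like this (the data values are invented):

```sas
data work.emp;
    input EmpID Gender $ Age Sales @@;
    datalines;
1 M 24 120  2 F 31 180  3 M 45 210  4 F 28 150  5 M 38 190  6 F 52 230
;
run;

proc sgplot data=work.emp;
    scatter x=Age y=Sales;   /* one marker per employee */
run;
```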
Box-plot:
A box plot is used to understand the distribution of continuous variables. It is also known as a five-number summary plot of Min, First Quartile, Median, Third Quartile and Max. We will again use PROC SGPLOT to display the box plot.
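A sketch of a box plot of Age on invented data; the CATEGORY= option is an optional extra that draws one box per Gender.

```sas
data work.emp;
    input EmpID Gender $ Age Sales @@;
    datalines;
1 M 24 120  2 F 31 180  3 M 45 210  4 F 28 150  5 M 38 190  6 F 52 230
;
run;

proc sgplot data=work.emp;
    vbox Age / category=Gender;   /* drop category= for a single box */
run;
```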
For more details on PROC UNIVARIATE and PROC SGPLOT, you can refer to the links below:
- Resource 1
- Resource 2
- Resource 3
Part 6: How to generate frequency tables in SAS?
Frequency tables can be used to understand the distribution of one categorical variable or of n categorical variables together. We will use PROC FREQ to perform this.
PROC FREQ is capable of producing statistical tests and other statistical measures in order to analyze categorical data based on the cell frequencies in two-way or higher tables.
I have added another variable, BMI, to the above-mentioned employee table. Now, to understand the distribution between GENDER and BMI, I will use PROC FREQ with the CHISQ statistical test.
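A sketch of the two-way table, assuming BMI is stored as a category (the labels Normal and Over are invented). With so few rows SAS will warn that the chi-square test may not be valid; the data is only there to make the example runnable.

```sas
data work.emp_bmi;
    input EmpID Gender $ BMI $ @@;
    datalines;
1 M Normal  2 F Over  3 M Over  4 F Normal  5 M Normal  6 F Over
;
run;

proc freq data=work.emp_bmi;
    tables Gender*BMI / chisq;   /* cross-tabulation plus chi-square test */
run;
```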
For more detail on PROC FREQ, you can refer to the link below:
- Resource
Part 7: How to do sampling of a data set in SAS?
To select an unbiased sample from a larger data set in SAS, we use PROC SURVEYSELECT.
Let's say, from the EMP table, I want to select a random sample of 3 employees.
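This could be sketched as a simple random sample without replacement. The data and the seed value are arbitrary; fixing the seed just makes the draw repeatable.

```sas
data work.emp;
    input EmpID Gender $ Age Sales @@;
    datalines;
1 M 24 120  2 F 31 180  3 M 45 210  4 F 28 150  5 M 38 190  6 F 52 230
;
run;

proc surveyselect data=work.emp out=work.emp_sample
                  method=srs     /* simple random sampling */
                  sampsize=3     /* number of employees to draw */
                  seed=1234;     /* arbitrary seed for reproducibility */
run;
```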
Part 8: How to remove duplicate values of a variable?
Often, we encounter duplicate observations. To tackle this, SAS has multiple options like FIRST., LAST., NODUPKEY with PROC SORT, PROC SQL and others. Let's understand these options one by one:
Method 1. Using FIRST. or LAST.
To use the FIRST. or LAST. option, the data set must be sorted by the variable(s) on which we want to identify the unique records. FIRST. and LAST. are automatic variables created by SAS when using BY-group processing. Each takes a value of 0 or 1.
- If an observation is the first among the duplicate values of the BY variable, then FIRST. is 1, else 0.
- Similarly, LAST. is 1 if an observation is the last among the duplicate values of the BY variable, else 0.
With FIRST. and LAST. populated this way, let's see how we can use these two values to identify unique records.
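A minimal sketch with invented data in which EmpID 1 and 3 appear twice; EmpID is assumed to be the variable that defines a duplicate.

```sas
data work.emp_dup;
    input EmpID Sales @@;
    datalines;
1 120  1 130  2 180  3 210  3 215
;
run;

/* BY-group processing needs the data sorted by the BY variable. */
proc sort data=work.emp_dup;
    by EmpID;
run;

data work.emp_unique;
    set work.emp_dup;
    by EmpID;
    if first.EmpID;   /* keep only the first row of each EmpID group */
run;
```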
Filtering on FIRST. keeps the first observation of each group; to keep the last observation instead, we can use LAST.
Method 2. NODUPKEY with PROC SORT
We can use the NODUPKEY option with PROC SORT to remove duplicate values.
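The same de-duplication sketched with NODUPKEY, again on invented data. Within each EmpID, the observation that comes first in the sorted output is the one kept.

```sas
data work.emp_dup;
    input EmpID Sales @@;
    datalines;
1 120  1 130  2 180  3 210  3 215
;
run;

proc sort data=work.emp_dup out=work.emp_nodup nodupkey;
    by EmpID;   /* duplicates are judged on the BY variable(s) only */
run;
```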
Method 3. Binning a Numerical Variable
We can use conditional statements and logical operators to bin numerical variables. In the Emp data set, we have the variable Age. Here we will bin Age as <25, >=25 and <35, and >=35.
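One way to write the binning, with invented data and hypothetical bin labels. The explicit missing check matters because SAS treats a missing numeric value as smaller than any number, so it would otherwise fall into the <25 bin.

```sas
data work.emp;
    input EmpID Age @@;
    datalines;
1 24  2 31  3 45  4 .  5 38
;
run;

data work.emp_binned;
    set work.emp;
    length Age_bin $8;
    if missing(Age) then Age_bin = ' ';       /* leave missing ages unbinned */
    else if Age < 25 then Age_bin = '<25';
    else if Age < 35 then Age_bin = '25-34';  /* >=25 and <35 */
    else Age_bin = '>=35';
run;
```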
Part 9: How to group variables in SAS to calculate count, average, sum?
To get the count, average and sum of a variable, I would suggest using PROC SQL with GROUP BY. There are other methods too, like PROC FREQ and PROC MEANS.
Let's look at the syntax of these procedures:
PROC SQL:
PROC SQL;
    CREATE TABLE <Output data set> AS
    SELECT COUNT(Var1), SUM(Var2), AVG(Var3)
    FROM <Input data set>
    GROUP BY Var4, Var5...;
QUIT;
PROC MEANS:
PROC MEANS DATA=<Input data set>;
    VAR Variable(s);
    CLASS Classification_Variable(s);
RUN;
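Applied to an invented employee table, the two templates might look as follows. The output column names (n_emp, total_sales, avg_sales) are my own.

```sas
data work.emp;
    input EmpID Gender $ Age Sales @@;
    datalines;
1 M 24 120  2 F 31 180  3 M 45 210  4 F 28 150  5 M 38 190  6 F 52 230
;
run;

/* Count, sum and average of Sales per Gender with PROC SQL. */
proc sql;
    create table work.sales_by_gender as
    select Gender,
           count(EmpID) as n_emp,
           sum(Sales)   as total_sales,
           avg(Sales)   as avg_sales
    from work.emp
    group by Gender;
quit;

/* The same statistics with PROC MEANS. */
proc means data=work.emp n sum mean;
    class Gender;
    var Sales;
run;
```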
Part 10: How to recognize missing values and outliers?
To identify outliers in a variable, we can use PROC UNIVARIATE, and we can use PROC FREQ to identify missing values.
PROC UNIVARIATE shows the top and bottom 5 values, whereas PROC FREQ shows the distribution of unique values of a variable.
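Both checks can be sketched on an invented table with one suspiciously large Sales value and one missing Age. The MISSING option makes PROC FREQ list missing values as their own level.

```sas
data work.emp;
    input EmpID Age Sales @@;
    datalines;
1 24 120  2 31 180  3 . 210  4 28 150  5 38 990  6 52 230
;
run;

/* The Extreme Observations table lists the lowest and highest values. */
proc univariate data=work.emp;
    var Sales;
run;

proc freq data=work.emp;
    tables Age / missing;   /* include missing values in the table */
run;
```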
Part 11: How to impute missing values and outliers?
There are various imputation methods available for missing values and outliers. You can refer to these articles for methods to detect outliers and missing values. Imputation methods for missing and outlier values are almost the same. Here we will discuss general case imputation methods to replace missing values. Let's do it using an example:
Let's say we have an employee data set comprising multiple variables like EmpID, Name, Gender, Sales, Age, Region, Product and others. Here, we want to predict the sales of an employee. One concern is that the variable Age has missing values, and Age appeared as a significant variable.
Now, to deal with these missing values, I have written the SAS statements below:
Identify Values to Impute Using General Case Method (Average of Age):
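A minimal sketch of this step on invented data. PROC MEANS ignores missing values when computing the mean.

```sas
data work.emp;
    input EmpID Age Sales @@;
    datalines;
1 24 120  2 31 180  3 . 210  4 28 150  5 . 190  6 52 230
;
run;

proc means data=work.emp n nmiss mean;
    var Age;   /* shows non-missing count, missing count and average */
run;
```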
Imputation Using Data Step
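One way to do the replacement is to store the average in a macro variable and use it in a data step. The macro variable name avg_age and the data are hypothetical.

```sas
data work.emp;
    input EmpID Age Sales @@;
    datalines;
1 24 120  2 31 180  3 . 210  4 28 150  5 . 190  6 52 230
;
run;

/* Store the average of the non-missing ages in a macro variable. */
proc sql noprint;
    select avg(Age) into :avg_age trimmed
    from work.emp;
quit;

data work.emp_imputed;
    set work.emp;
    if missing(Age) then Age = &avg_age;   /* replace missing with the mean */
run;
```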
Above, you have seen one of the methods to deal with missing values. You can also apply other methods using SAS statements. I would suggest you practice all the methods discussed in my previous post on missing values and outliers.
Part 12: How to drop and rename variables in a data set?
Let's say, during the data exploration stage, we want to exclude variables that are not required in the data modelling exercise, or want to rename a few variables. These two operations can be performed using the DROP and RENAME options in a Data Step.
Let's say we want to drop the variable Age and rename the variable Gender as Sex. This can be performed using the DROP= and RENAME= data set options.
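A sketch on invented data. Here both options are applied to the input data set, so Age is never read into the step.

```sas
data work.emp;
    input EmpID Gender $ Age Sales @@;
    datalines;
1 M 24 120  2 F 31 180  3 M 45 210
;
run;

data work.emp_clean;
    set work.emp(drop=Age rename=(Gender=Sex));
run;
```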
Part 13: How to merge / join data sets effectively?
Merging / joining can be of various types. It depends on the business requirement and the relationship between data sets. In SAS, we can perform this in various ways using DATA STEP, PROC SQL and PROC FORMAT. Now, the question is, which is the most appropriate method to perform merging and joining?
You can refer to one of my posts on this topic for detailed info here: Introduction to Merging.
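To make two of these options concrete, here is a rough sketch of the same inner join written as a Data Step merge and as a PROC SQL join. The two tables and the key EmpID are invented.

```sas
data work.emp_info;
    input EmpID Name $ @@;
    datalines;
1 Amit  2 Bina  3 Carl
;
run;

data work.emp_sales;
    input EmpID Sales @@;
    datalines;
2 180  3 210  4 150
;
run;

/* A Data Step merge requires both inputs sorted by the key. */
proc sort data=work.emp_info;  by EmpID; run;
proc sort data=work.emp_sales; by EmpID; run;

data work.merged;
    merge work.emp_info(in=a) work.emp_sales(in=b);
    by EmpID;
    if a and b;   /* keep only keys present in both tables */
run;

/* The same inner join in PROC SQL; no prior sorting needed. */
proc sql;
    create table work.joined as
    select a.EmpID, a.Name, b.Sales
    from work.emp_info as a
         inner join work.emp_sales as b
         on a.EmpID = b.EmpID;
quit;
```

Changing the IF condition (for example to `if a;`) turns the merge into a left join, which mirrors switching to LEFT JOIN in the SQL version.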
End Notes:
In this guide, we looked at the SAS statements for various steps in data exploration and munging, such as loading data, converting data types, transposing tables, sorting, plotting, removing duplicate values, binning, grouping, identifying missing and outlier values, dropping and renaming variables, merging and joining tables, and imputing missing and outlier values. We also looked at the basic SAS statements to perform these steps and have given links to more advanced methods.
In one of the next articles, I will reveal the code to perform these steps in Python. Stay tuned!
Did you find the article useful? Do let us know your thoughts about this guide through comments below.
Source: https://www.analyticsvidhya.com/blog/2015/04/data-exploration-sas-data-step-proc-sql/