Table of Contents
- Replacing missing values (nulls) with 0s
- Removing unwanted characters from a string
- Cleansing multiple columns at the same time
- Creating Variables: Finding the distance between two locations
- Creating Variables: Days between application date and term start
- Creating Variables: Out-of-state Flag
- Creating Variables: Age
Replacing missing values (nulls) with 0s
When values are missing from a data source, they show in Construct as null. As part of data cleaning best practices, it is often desired to remove all instances of missing values and replace with a place holder. For instance, when tracking the number of courses a student has registered for, if a student has not registered for any courses then they will have nothing in their record. Instead of showing null for this student, it would be more appropriate to show that they had registered for a count of 0 courses.
The Cleanse node can be used to find all instances of a value, and replace it with a different value. In this case the "value" we are looking to replace is actually no value at all, so the "Is Missing" option is selected from the list of operators.
Removing unwanted characters from a string
The Cleanse node not only allows user to find and replace whole values, but also subsections within values. The Sub-String Replacement tool on the Modify tab of the Cleanse node can be used to replace unwanted characters within text fields. To replace characters within text values, enter the character(s) that should be replaced, and then enter the character or string you want to replace it with. If you want to remove a character within a text value, simply do not enter a replacement field.
Cleansing multiple columns at the same time
There are often several columns that have the same data type and need the same cleanup rules applied. For instance, you may want to replace missing values (nulls) with zeros for multiple columns that contain nulls. This can be done all at once in the Cleanse node by multi-selecting columns to be cleaned, and then applying the same cleanup rule to all selected columns.
To do this, in the Cleanse node when you go to establish your cleanse operations, hold down the Ctrl key and click to select multiple columns from the columns list. Then, the "replace rule" you define will apply to all selected columns at once when you click Add.
Note: All of the columns you multi-select must be the same data type for this to work.
Creating Variables: Finding the distance between two locations
Distances between ZIP codes may be calculated in one of two ways in the Transform node – to find the distance between two ZIP codes that are each stored as values in the dataset, use the DistanceBetween function. To find the distance from one ZIP code stored in the dataset and a “known”, constant ZIP code, use the DistanceFrom function. These are available as embedded functions in the Transform node under the f(?) option.
Function Definitions
DistanceBetween determines and returns a number representing the miles between two U.S. postal zip codes. The inputs can be either 5 or 9 digits. The zip codes can be any combination of two fixed points (i.e. static reference points) or variables from the dataset (which might change from record to record).
DistanceFrom returns a number representing the number of miles between a zip code and a constant zip code.
Example
DISTANCEBETWEEN([A],[B])
where [A]=03860 and [B]=03818
Result = 4.94
DISTANCEFROM([A],'03818')
where [A]=03860
Result = 4.94
Video Tutorial
Creating Variables: Days between application date and term start
For any institution that accepts applications on a non-rolling basis, there exists a window of time during which applications must be filed to be considered for acceptance. The amount of time between when an application is submitted and when the relevant admission term begins can be an indication of a student’s interest in a particular institution.
For example, a student may turn in an application to his first-choice college during the first week that applications are accepted, but this same student might wait until the day or week before the deadline to turn in applications to his safety or back-up schools. In this way, the amount of time between the day that a student turns in an application and the term start date can be seen as an indicator of that student’s interest.
To create this variable, the first step is to connect applicant data to a Transform node. Then use the DaysBetween function from the f(?) drop-down menu to calculate the new variable.
Function Definition
DaysBetween determines and returns the number of days between two dates.
Example
DAYSBETWEEN([A],[B])
where [A]=07/02/2022 and [B]=08/28/2022
Result = 57 days
Video Tutorial
Creating Variables: In-State Flag
A flag is a variable consisting of 0's and 1's indicating whether something is true or false. In the case of an "In-State flag", a value of 1 would represent that the record DOES have a value that is equal to the state considered to be in-state. A value of 0 would indicate that the state associated with that record is NOT equal to the state considered to be in-state.
An IF statement formula in the Transform node can be used to create this type of flag.
Function Definition
The IF function tests a chosen expression/formula to see if it's true or false. If it's true, it returns the value designated as the 'true' outcome. If it's false, it returns the value designated as the 'false' outcome.
IF(Condition, True Outcome, False Outcome)
Example
IF([A]=='NH',1,0)
where [A] is the variable containing the state
and New Hampshire (NH) is considered to be the local/current in-state
Creating Variables: Age
Because 'age' is constantly changing with each day that passes, 'date of birth' is a much more common measure to have in a dataset. Again, because of the changing nature of age, it's not as straight forward to calculate from a date of birth as you would initially think.
Age can be calculated using the Transform node, and a few built in functions: Month, Day, YearsBetween and Today(). These functions are used to check whether the birthday occurs in the current month, and if the day has passed yet. If the day has not passed yet, then the age is 1 less than the number of years between the birth year and today. If the day has passed already, or if the entire month has passed already, then the age is the number of years between the birth year and today.
Function Definitions
Month returns an integer representing the month part of a specified date (1-12).
Day returns an integer representing the day part of a specified date (1-31).
YearsBetween returns an integer representing the # of 12 month boundaries between 2 dates.
Today() returns the current date as of the exact time the job is executed.
Example
IF(MONTH([A]) == MONTH(today()) AND DAY([A]) > DAY(today()), YEARSBETWEEN([A], today())-1, YEARSBETWEEN([A], today()))
where [A] is the variable containing a birthdate
and today() is a function that returns the current date
Comments
0 comments
Please sign in to leave a comment.