About: This article walks through how to create new variables from the original dataset in Predict.
Table of Contents
- Introduction
- Binning
- Multi-Variable Formula
- Text Functions
- Date/Time
- Transform Toolbar
- Video Tutorial
- Related Articles
Introduction
The Transform tool allows users to create new columns from existing columns in the dataset without leaving Predict. This tool shows a list of all of the fields that exist in the analysis to be used as part of a formula to create new variables. There are several options for different types of variables that can be created.
Along the top of the Transform window, there are four tabs:
- Binning: Creating groups of specific values from existing values
- Multi-Variable Formula: Creating a new variable from existing variables using basic formulas
- Text Functions: Accessing the values of variables and applying conditions or formulas
- Date Time: Addressing Date and Time specific variables and applying functions and formulas
Binning
The binning option allows users to create buckets that every record distinctly fits into. To do this, first select the field that the bins will be based off from the list of variables on the left. Next name the first bin/category by entering a name where it says "First Category =". This is the value the new column will contain for each record that fits into this bin. Next, define the definition for the first bin by using the operators from the dropdown list and typing the criteria into the following space.
For example, perhaps users have a field containing SAT Math scores, and want to define low, medium, and high bins. In that case, name the first category "Low SAT". Then define that category as (SAT Math <= 500).
Once defined, click the validate button. Then name and define the next category.
Perhaps a user wants to name the next category as "Medium SAT", and define it as SAT Math > 500 and SAT Math <= 600. Click validate again, and define the next category. Once all categories are defined, type the name of the new column containing the bin values under New Variable Name.
Finally, click Create. This new column now appears at the bottom of the window in the Transform Operations section. Transforms that have already been created can be edited at any time by clicking the name of the transform operation, making any changes, and then clicking the Update.
Tip: If the variable being used to create the binned variable is a categorical variable like state, users can retrieve all of the unique values by clicking the get sample button in the upper right corner. Users can then choose the IN operator, hold down the shift or control keys and choose multiple values to be placed in each bin.
Multi-Variable Formula
Multi-Variable Formulas offer the most flexibility in creating new columns as the formulas can make use of variables of varying data types. First, select any variables to be used in the formula by clicking them from the variables list on the left. When a variable is selected from that list, Construct assigns a variable name to each of the selected variables starting with [A] and moving down the alphabet depending on the order of selection. These variables are listed in the Assign Variables section.
Once variables are assigned, any formula can be entered into the formula box such as [A]+[B]. This example would add the two fields.
Tip: If you are unsure how to start writing formulas, check out the Function Definition drop down for a list of all built in functions along with syntax and examples.
Next choose a Result Type from the drop-down box. The default result type is always Real, however if the output contains text and the user leave the result type as Real, the output will contain entirely null values.
Finally, name the new variable in the New Variable Name section and click Create.
IF Operation
An IF statement evaluates a statement and returns a result based on whether the statement evaluates as True or False. The syntax for and IF statement is: IF(P1,P2,P3), where:
P1 is the statement to be evaluated
P2 is the value to return if the expression is True
P3 is the value to return if the expression is False
As an example IF( [A] > 50 , 'High' , 'Low' ) would return the value of High if the variable [A] was greater than 50, and Low if it was not greater than 50.
Note: As seen in this example, anytime a text value is included in a formula it should be surrounded by single quotes and the Result Type should be set to Text.
Function Operations
Clicking the f(?) button on the right side panel displays a list of functions. Clicking a function shows the definition and parameter requirements of the function. The function can either be typed directly into the formula box, double clicked, or the caret to the right of the function can be used to populate the function into the formula box.
Tip: Notes can be included within the formula window which will not be considered part of the formula. Notes are denoted by /* to open the note and */ to close the note.
Matches REGEX Function
MatchesRegEx is a function that takes a text column and returns a "1" when the text string matches the desired Regular Expression (RegEx). In effect, MatchesRegEx acts as a universal IS function, although users must develop the testing expression. Fortunately, regex syntax is well documented online.
Examples
- This RegEx will only match a valid social security number (i.e. it won't match 000-00-0000): ^(?!000)([0-6]\d{2}|7([0-6]\d|7[012]))([ -]?)(?!00)\d\d\3(?!0000)\d{4}$
- This RegEx will match currency (with or without cents, and optional $ sign): ^\$?\d+(\.(\d{2}))?$
- This RegEx will match positive integer: ^\d+$
Boolean Operations
Users can create Boolean (True/False or 0/1) variables in the Multi-Variable tab of the Transform node.
To do this, type an expression including a variable that will either evaluate to True or False. Then select a Result Type of either Boolean (which results in True/False) or Integer (which results in 0/1).
Examples
A>0
This formula results in a True if A is greater than 0, and results in a False otherwise.
A>0 AND B>0
This formula results in a True if A and B are both greater than 0, and results in a False otherwise.
A>0 OR B>0
This formula results in a True if A or B is greater than 0, and results in a False otherwise.
Text Functions
To use a Text Function, first choose a variable from the left side that contains text values. If there are no variables that are type text, this list will be blank. Next, fill in any parameters that are requested. Examples of the function definition as well as the parameter definitions can be found in the Function Definition box. Once the required parameters are filled in, name the new transform and click the Create button.
Date/Time
Date/Time functions will create a new column that contains additional information relating to a date variable that already exists in your dataset. For example, if a date exists in your dataset, you can create a new variable that will calculate the day of the week that date is within the year. Other date/time functions are listed under the Transforms function list on the right side of the window. If no date/time fields are found, the column list is blank.
Transform Toolbar
Editing a Transform
Transforms can be edited at any time by clicking the specific transform formula within the Transform Operations section at the bottom of the window. Once the edits have been made, click the Update button to update and save the changes. A transform formula can also be removed by right-clicking the formula and clicking Delete. Several formulas can be selected at once by holding the Ctrl key and clicking more than one transform formula.
Test Formula
The Test Formula feature allows users to preview the result of the transform formula depending on different values of the variables that are being used in their formula. Clicking the Get Values button will populate the unique values of the variable within the Test Value drop-down menu. After specifying Test Values, click the Test button to view the output of the formula using the specific Test Values in the Result window. This is a useful feature to test your formula as you craft it, rather than running your job and observing the result to check your work.
Validate Formula
This button appears above the formula box. Clicking this button will check the contents of the formula box for syntax errors. If a problem is detected, an Invalid Formula error is displayed. If no message appears, then the syntax is valid.
Exporting and Importing a Transform Operation
Transform formulas can be exported by selecting the Transform Operation and the clicking the Export icon shown in the top left of the window or by right clicking on the Transform Operation name and selecting Export transform. Transform formulas are saved as a *.ritd (Rapid Insight Transform Definition) file.
Formulas can also be imported using the Import icon shown in the top left of the Transform node window, or by right-clicking in the Transform Operation list at the bottom of the window and selecting Import transform. You can then browse to and select a saved *.ritd file and its operation will be added to the existing Transform Operation list.
Function Descriptions
Date and Time
DAYSBETWEEN - Returns an integer representing the number of days between two dates.
HOURSBETWEEN - Returns an integer representing the number of hours between two dates.
MINUTESBETWEEN - Returns an integer representing the number of minutes between two dates.
MONTHSBETWEEN - Returns an integer representing the number of month boundaries between two dates.
MONTHSBETWEEN - Returns an integer representing the absolute number of months between two dates.
SECONDSBETWEEN - Returns an integer representing the number of seconds between two dates.
WORKDAYSBETWEEN - Returns an integer representing the number of workdays between two dates.
YEARSBETWEEN - Returns an integer representing the number of 12 month boundaries between two dates.
YEARSBETWEEN - Returns an integer representing the absolute number of years between two dates.
DATE - Converts a value to a date.
DATEFIRSTOFMONTH - Determines the date of the first day of the month.
DATEFIRSTOFMONTH - Determines the date of the first day of the month.
DATELASTOFMONTH - Determines the date of the last day of the month.
DATELASTOFMONTH - Determines the date of the last day of the month.
DATESTR - Convert a date to a string.
DATESTR - Formats a date as a string.
DATETIME - Converts a value to a date and time.
DAY - Returns an integer representing the day datepart of the specified date (1-31).
DAYOFWEEK - Returns a string representing the day datepart of the specified date (Sunday, Monday, etc.).
DAYOFWEEKN - Returns an integer representing the day datepart of the specified date (1-7).
DAYOFYEAR - Returns an integer representing the day datepart of the specified date (1-366).
HOUR - Returns an integer that represents the hour part of a specified datetime.
MINUTE - Returns an integer that represents the hour part of a specified datetime.
MONTH - Returns an integer that represents the month part of a specified date (1-12).
MONTHOFYEAR - Returns a string that represents the month part of a specified date (January, February, etc.).
NOW - Returns the current date/time.
QUARTER - Returns an integer representing the 'Quarter' specified date falls within.
TODAY - Returns the current date.
TIME - Converts a value to a time.
WEEK - Returns an integer expression representing the week that the specified date falls in (1-53).
WEEKDAY - Returns 'Weekday' or 'Weekend'.
YEAR - Returns an integer that represents the year part of a specified date.
Geographic Distance and Points
DISTANCEBETWEEN - Returns a real representing the number of miles between two zipcodes.
DISTANCEBETWEENLATLON - Returns a real representing the number of miles between two Latitude/Longitude coordinates.
DISTANCEBETWEENMGRS - Returns a integer representing the number of meters between two MGRS (Military Grid Reference System) coordinates in same grid zone designator.
DISTANCEFROM - Returns a real representing the # of miles between a zipcode and a constant zip code.
DISTANCEFROMLATLON - Returns a real representing the number of miles between a variable and a constant Latitude/Longitude coordinate pair.
DISTANCEFROMMGRS - Returns a real representing the number of meters between a MGRS (Military Grid Reference System) coordinate and a constant MGRS coordinate in the same grid zone designator.
METERSTOMILES - Returns the real number of miles equivalent to the specified number of meters.
ZIPCODE( ) - Returns a 5 character zipcode converted from string or numeric data. Illegal values are output as text.
ZIPCODE( , 5 ) - Returns a US zip or zip+4 converted from string or numeric data.
ZIPCODELATITUDE - Returns the latitude of a zipcode. Unrecognized zipcodes return a missing value.
ZIPCODELONGITUDE - Returns the longitude of a zipcode. Unrecognized zipcodes return a missing value.
Mathematical Functions
ABS - Returns the absolute, positive value of the given numeric expression.
ACOS - Returns the angle, in radians, whose cosine is the given real expression.
ASIN - Returns the angle, in radians, whose sine is the given real expression.
ATAN - Returns the angle in radians whose tangent is the given real expression.
ATN2( #y, #x ) - Returns the angle, in radians, whose tangent is the quotient (Y/X) of given real expressions.
DEGREES - Given an angle in radians, returns the corresponding angle in degrees.
PI - Returns the constant value of PI (3.14…).
RADIANS - Returns radians when a numeric expression, in degrees, is entered.
TAN - Returns the tangent of the input expression.
AVERAGE - Returns the average value of a list of one or more numeric values.
GREATEST - Returns the largest value in a list of two or more values.
GREATEST - Returns the index of the largest value in a list of two or more values.
LEAST - Returns the smallest value in a list of two or more values.
General Data Strings
CHAR - A string function that converts an int ASCII code to a character.
CHARINDEX - Returns the starting position of the specified expression in a character string. Returns 0 if not found.
CONTAINS - Returns 1 if the specified expression is in the character string, otherwise returns 0.
LEN - Returns the number of characters of the given string, excluding trailing blanks.
REPLACE - Replaces all occurrences of the first string expression with the second string expression.
STUFF - Deletes a specified length of characters and inserts another set of characters at a specified starting point.
SUBFIELDCOUNT( ) - Returns the number of 'sub-fields' in a string where the fields are delimited by a comma.
SUBFIELDCOUNT( , ',' ) - Returns the number of 'sub-fields' in a string where the fields are delimited by the specified delimiter (default is comma).
LEFT - Returns the left part of a string the specified number of characters in length.
LTRIM - Returns a character expression after removing leading blanks.
PADLEFT - Returns a string padded on the left with specified character to the specified length.
PADRIGHT - Returns a string padded on the right with specified character to the specified length.
REMOVELEFT - Removes specified number of characters from left end of string expression.
REMOVERIGHT - Removes specified number of characters from right end of string expression.
RIGHT - Returns the right part of a string the specified number of characters in length.
RTRIM - Returns a character string after truncating all trailing blanks.
SUBFIELD( , #fieldNum ) - Returns the Nth 'sub-field' from a string where the fields are delimited by commas.
SUBFIELD( ,#fieldNum, ',' ) - Returns the Nth 'sub-field' from a string where the fields are delimited by the specified delimiter (default is comma).
SUBSTRING( , #start ) - Returns part of a string beginning at the specified start position to the end of the string.
SUBSTRING( , #start, #len ) - Returns part of a string beginning at the specified start position and continuing for the specified length.
If Statements and General Expressions
IF - Tests an expression result, returns one value if true and another value if false.
ISNULL ( , replacement_value ) - Tests if an expression result is null, returns the result if it is not null, otherwise a replacement value.
IS NULL - Tests if an expression result is null, evaluates to boolean.
NULLIF - Returns a null if two expressions are equivalent.
Text Edits and Data Type Conversions
LOWER - Returns a character expression after converting uppercase character data to lowercase.
TITLECASE - Returns a character expression after converting the character data to Title case.
UPPER - Returns a character expression with lowercase character data converted to uppercase.
NUM - Returns a numeric data type converted from character (text) data type.
STR - Returns the character (text) data type converted from a numeric data type.
Other Available Functions
Miscellaneous functions
COALESCE - Returns the first non-null value in a list of two or more values.
LAG - Returns the prior value of the expression.
LAG - Returns a value of the expression from P2 rows ago.
RANDOM - Returns a random number between 0.0 and 1.0.
REVERSE - Returns the reverse of a character expression.
ROUND - Round value to the nearest integer value.
SECOND - Returns an integer that represents the second part of a specified datetime.
SIGN - Returns the positive (+1), zero (0), or negative (-1) sign of the given expression.
SPACE - Returns a string of repeated spaces.
Special Parameters
This section documents unique parameters that do not fit neatly into other categories yet expand the user’s abilities and contribute to the utility of the program.
Symbol | Type | Behavior | Description |
* | Parameter | Wildcard | When an asterisk is used as a formula parameter, it acts as a wildcard.COMBINE INPUTS Table Pattern Example: Enrollment Data.Selects all files, regardless of type, that start with the characters Enrollment Data. |
Regex: | Functional Parameter | Wildcard | When the function is used as a formula parameter, it acts as a wildcard with features beyond those of an asterisk.COMBINE INPUTS Table Pattern Example: Regex:CX..\.xls.*\.*$ Selects all worksheets from all Excel (.xls) files that start with the characters ”CX” followed by exactly two characters. |
[_] | Parameter | Lag | An underscore in square brackets used as a formula parameter applies the value of the previous record. An example that creates an incrementing series in the new column starting at 100: IF ([_] IS NULL, 100, [_] + 1) The LAG () Transform function performs a similar function, but allows users to designate the column being lagged in a multivariable formula. It is recommended that if the function is complex, that the value of the lag be defined as a separate step: [prior]:= LAG([B]); IF([C]=7, [prior], [B]) |
Comments
0 comments
Article is closed for comments.