About: This article introduces the SQL Command for Each node, a Process node within Construct.
Location: Node panel
Table of Contents
Feature Overview
The DeDup node identifies and removes duplicate records based upon criteria set up by the user. There are three elements in the DeDup Node:
- Input Columns: Input columns are the available columns from the data that can be used in the DeDup node. These columns can be used in the DeDup By Columns window or as a Selection Rule.
-
DeDup By Columns: Column(s) in which duplicate values are identified and removed based on the Selection Rules. Multiple columns can be added to the DeDup By Columns section. To be considered a duplicate for removal, the record would have to have an exact match of all the identified DeDup By Columns.
For example if the DeDup By Columns included First Name, Last Name and Birth Date, a record with values of:First Name, Last Name, Birth Date
would not match
Jon, Doe, 1/1/2000First Name, Last Name, Birth Date
because they have different Last Names which was one of the Dedup By Columns. The value of all three columns need to match to be considered a duplicate.
Jon, Deer, 1/1/2000 - Selection Rules: These allow you to have control over which version of the duplicate record is discarded and which version is kept. In the example below, the SAT Math variable is being used as a Selection Rule. Therefore, if a duplicate Student ID is found, the version with the higher (Max) SAT Math score will be kept. Other selection rule options are Min, Not Null and Null.
Note: Selecting the Discard All Copies of Duplicate Records option removes all copies of a record that has duplicates. For example, if a dataset contains records: 1, 1, 2, 4, 5, and 6. Choosing Discard All Copies of Duplicate Records will result in a dataset containing records: 2, 4, and 6 only.
Video Tutorial
Related Article
Comments
0 comments
Article is closed for comments.