About: This article introduces the Fuzzy matching node, an Add In node within Construct.
Location: Rapid Insight Collaborative Cloud
Feature Overview
Fuzzy matching is a data preparation technique used to unite records that should match but currently do not. The Jaro-Winkler score is one of the formulas Construct’s fuzzy matching node uses to associate similar records.
Jaro-Winkler assigns a score between 0 and 1 to indicate the degree of similarity between two entries. In Construct, a score of 1 is a perfect match (the same characters in the same order), and scores closer to 0 indicate less similar entries.
Construct uses the following formula to calculate a Jaro score:
⅓*(Matching Characters/Length of String1 + Matching Characters/Length of String2 + (Matching Characters – Transpositions)/Matching Characters)
To make the formula easier to read, let’s assign variables to these values, where:
- S1 and S2 are the lengths of strings 1 and 2
- M is the number of matching characters
- T is the number of transpositions
With these variables the formula would be as follows:
⅓*(M/S1 + M/S2 + (M – T)/M)
Now let’s see what happens when we run an example through the Jaro score:
Record 1: Macadamia
Record 2: Macedemia
Matching Characters (M) = 7
Length of String 1 (S1) = 9
Length of String 2 (S2) = 9
Transpositions (T) = 0
This leaves us with:
⅓*(7/9 + 7/9 + (7 - 0)/7) ≈ 0.85
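The arithmetic above can be checked with a short Python sketch. It only evaluates the Jaro formula from counts that have already been worked out by hand (M, S1, S2, and T as listed in the example); it does not find matching characters or transpositions itself, and the function name `jaro_from_counts` is made up for this illustration, not part of Construct.

```python
def jaro_from_counts(m: int, s1_len: int, s2_len: int, t: int) -> float:
    """Evaluate 1/3 * (M/S1 + M/S2 + (M - T)/M) from pre-counted values.

    Hypothetical helper for illustration only; the counts are supplied
    by the caller rather than derived from the strings.
    """
    if m == 0:
        return 0.0  # no matching characters: avoid dividing by zero
    return (m / s1_len + m / s2_len + (m - t) / m) / 3


# Counts taken from the Macadamia / Macedemia example above.
score = jaro_from_counts(m=7, s1_len=len("Macadamia"), s2_len=len("Macedemia"), t=0)
print(round(score, 2))  # 0.85
```

The unrounded value is 23/27, or roughly 0.852, which the example rounds to 0.85.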
Now, 0.85 is a relatively high score, but you may have noticed that we have so far left out Winkler. The Jaro-Winkler score adds a prefix scale (p), which puts more emphasis on the beginning of a string of characters. If two strings share the same leading characters, Jaro-Winkler grants them a higher score than it would for the same number of matching characters later in the string.
With the addition of this prefix scale the formula is as follows:
Jaro Score + Prefix Length * Prefix Scale * (1 – Jaro Score)
The Prefix Length is the number of characters the two strings have in common at the start, counted up to a maximum of 4. The Prefix Scale is a constant scaling factor, usually set between 0 and 0.25; keeping it at or below 0.25 ensures that the final score cannot exceed 1.
In this case, Macadamia and Macedemia share the first 3 characters, “Mac”, so the Prefix Length is 3. Using a Prefix Scale of 0.1 gives:
0.85 + 3 * 0.1 * (1 – 0.85) = 0.895
Only the first 3 characters match here, so the prefix length falls short of its maximum of 4. A prefix scale of 0.1 is the value Winkler most commonly used in his work, but you can choose any value between 0 and 0.25. With the additional points awarded for the matching prefix, the score rises from 0.85, the initial Jaro score, to 0.895, the Jaro-Winkler score.
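As a rough illustration of the prefix adjustment, the Python sketch below counts the shared prefix (capped at 4 characters) and applies it to a Jaro score that has already been calculated. The function names are hypothetical and not part of Construct, and the case-sensitive comparison is an assumption, since this article does not say how Construct treats letter case.

```python
def common_prefix_length(a: str, b: str, max_len: int = 4) -> int:
    """Count identical leading characters, capped at max_len.

    Assumption: comparison is case-sensitive.
    """
    n = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b or n == max_len:
            break
        n += 1
    return n


def jaro_winkler(jaro: float, prefix_len: int, prefix_scale: float = 0.1) -> float:
    """Apply Jaro Score + Prefix Length * Prefix Scale * (1 - Jaro Score)."""
    return jaro + prefix_len * prefix_scale * (1 - jaro)


prefix_len = common_prefix_length("Macadamia", "Macedemia")
result = jaro_winkler(0.85, prefix_len)  # 0.85 is the rounded Jaro score from the example
print(prefix_len, round(result, 3))  # 3 0.895
```

Starting from the unrounded Jaro score (about 0.852) instead of 0.85 would give a slightly higher result of about 0.896.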