Use a fuzzy duplicate check to ensure that no (almost) similar records exist in a table.

To set up a fuzzy duplicate check rule, use a duplicate check of type Fuzzy matching. If the desired Fuzzy matching duplicate check does not exist, set up a new Fuzzy matching duplicate check.

A Fuzzy matching duplicate check checks for duplicates applying fuzzy logic. A Fuzzy matching duplicate check compares, for a selected record, several field values with the values of the same fields of other records. Based on the comparison, a duplicate score is calculated.

On the duplicate check, you define:

  • Which dynamic query is used. The dynamic query defines the records that are checked for duplicate values and the fields that can be checked for duplicate values.
    In the dynamic query, the first defined table must be the main table, on which you want to check for duplicates. This table must be the same table that you define in the duplicate check header, in the Table name field.
    A form can use several related tables. In this case, in the dynamic query, use data entities for the next table records. For each of the data entity table records, define the applicable parent. Use the data entities to select the fields that you want to check for duplicates.
  • The fields which values are checked for duplicates. You only can use fields that are defined in the dynamic query.
  • For each field, the weightage. The weightage expresses the importance of a duplicate value. Express the weightage in a number (with or without decimals). Define the weightage number in such a way that the importance is expressed compared to the other fields. If you do not define a weightage for a field, the field value is not checked for duplicates.
  • The threshold that is used to indicate if a record is a possible duplicate. The threshold is expressed in a percentage. Only if the calculated duplicate score for a record is equal to or higher than the threshold, a record is marked as possible duplicate.
    The duplicate score is calculated in this way: [Weightage sum of fields with duplicates] / [Total weightage sum] * 100%

Example:

Duplicate check on CustTable

Threshold: 50%

Table name Datasource name Field Field label Weightage
CustTable CustTable AccountNum Customer account  
CustCustomerV3Entity CustCustomerV3Entity AddressStreet Street 1
CustCustomerV3Entity CustCustomerV3Entity AddressZipCode ZIP/postal code 1
CustCustomerV3Entity CustCustomerV3Entity OrganizationName Organization name 6
CustCustomerV3Entity CustCustomerV3Entity PrimaryContactEmail Primary email 3
CustCustomerV3Entity CustCustomerV3Entity PrimaryContactPhone Primary phone 3

Calculation examples:

  • Duplicate values exist in the Primary email field and in the Primary phone field. The duplicate score is: 6 / 14 * 100 = 42,86. The record is not reported as possible duplicate.
  • Duplicate values exist in the Organization name field and the ZIP/postal code field. The duplicate score is: 7 / 14 * 100 = 50. The record is reported as possible duplicate.


Standard procedure

1. Click Data quality management.
2. Click Duplicate checks.
3. Click New.
4. In the Duplicate check name field, type a value.
5. In the Duplicate check type field, select 'Fuzzy matching'.
6. Use a dynamic query to define the records that are checked for duplicate values and the fields that can be checked for duplicate values.
  In the Query field, enter or select a value.
7. Define the threshold that is used to indicate if a record is a possible duplicate. The threshold is expressed in a percentage. Only if the calculated duplicate score for a record is equal to or higher than the threshold, a record is marked as possible duplicate.
  In the Threshold % field, enter a number.
8. Sub-task: Select the fields to be checked on duplicate values.
  8.1 In the Fields section, click Add.
  8.2 In the Select fields dialog, in the available grid, select the desired fields.
  8.3 Click Add.
  8.4 Click OK.
  8.5 You can use a phonetic search algorithm to check on duplicate names that sound similar, for example, John and Jon.
  Select the Use phonetic search check box.
  8.6 If you have selected the Use phonetic search check box, define the applicable phonetic search rule.
  In the Phonetic search rule field, enter or select a value.
  8.7 The weightage expresses the importance of a duplicate value. Express the weightage in a number (with or without decimals). Define the weightage number in such a way that the importance is expressed compared to the other fields.
  In the Weightage field, enter a number.
 

Note: If you do not define a weightage for a field, the field value is not checked for duplicates.

  8.8 You can hide a field and its duplicate check results from the Duplicate records found page. For example, if a field shows sensitive data, you can check the field for duplicate values but hide the results of the check.
  Select the Hide in results check box.
9. Sub-task: Select the fields that can be merged for a duplicate record.
  9.1 In the Merge fields section, click Add.
  9.2 On the dialog, in the list, find and select the desired fields.
  9.3 Click Add (->).
  9.4 Click OK.
10. Sub-task: Activate duplicate check.
  10.1 If the duplicate check setup is finished, you can make it available for selection on the duplicate check rules.
  Select Yes in the Active field.
 

Note:

  • If active, you cannot edit the duplicate check.
  • If active and used in an active data quality policy version, you cannot make the duplicate check inactive.

11. Close the page.
Related to Notes

Set up duplicate check

 

See also

Provide feedback