Addressing the John Smith Problem

Many databases contain duplicate data, especially when records are entered manually. Cleaning the data means identifying and resolving these unnecessary duplicates. However, many duplicates are not exact matches: two records may refer to the same entity yet differ because of, for example, spelling errors. Such duplicates are hard to identify reliably with SQL alone, because standard SQL comparisons rely on exact matching (a consequence of the tenets of relational database theory). Other methods are therefore needed to identify non-matching duplicates, and this is where fuzzy matching can be used.
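To make the contrast between exact and fuzzy matching concrete, here is a minimal sketch in Python. It assumes the standard-library `difflib.SequenceMatcher` as the similarity measure, which is only one of several possible choices and is not necessarily the method used in the article. The function names, the sample names, and the 0.85 threshold are all illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] after light normalization.

    Normalization (trim + lowercase) is an assumption made for this sketch.
    """
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def find_fuzzy_duplicates(names, threshold=0.85):
    """Return (name_a, name_b, score) for every pair at or above the threshold.

    The threshold is a hypothetical value; in practice it needs tuning
    against the data being cleaned.
    """
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = similarity(a, b)
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs


# Hypothetical sample records with spelling variations.
names = ["John Smith", "Jon Smith", "John Smyth", "Jane Doe"]

# An exact comparison treats the misspelling as a different person.
exact_match = names[0] == names[1]

# A fuzzy comparison flags likely duplicates for review.
duplicates = find_fuzzy_duplicates(names)

if __name__ == "__main__":
    print("Exact match:", exact_match)
    for a, b, score in duplicates:
        print(f"Possible duplicate: {a!r} ~ {b!r} (score {score})")
```

Comparing every pair of names scales quadratically with the number of records, so this approach is only practical for small tables; larger datasets usually need some way of narrowing down the candidate pairs first.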

Using Fuzzy Logic to Identify Non-Matching Duplicates

TowardsDataScience.com