
Name Matching Techniques with Python



Metaphone is another phonetic algorithm, published in 1990. It is among the more complex of the algorithms covered here: it includes special rules for handling spelling inconsistencies and also looks at combinations of consonants. It was later refined into a more precise version called Double Metaphone, which returns both a “primary” and a “secondary” code for each name.

The result contains two hash values; returning both can be enabled to make the matching more precise.


A Soundex value is created by keeping the first letter and converting the remaining consonants (not vowels) to digits using a basic lookup table. Even though it appears to be a naive approach, it is useful in many cases.
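To make the lookup-table idea concrete, here is a minimal pure-Python sketch of the classic American Soundex rules. It is an illustration written for this article, not the implementation of any particular library; the function name `soundex` and the table name `CODES` are my own.

```python
# Minimal American Soundex sketch (illustrative; not taken from any library).
# Consonants with similar sounds share a digit; vowels get no digit at all.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code for `name` (ASCII letters only)."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]                  # the first letter is kept as-is
    prev = CODES.get(letters[0], "")   # digit of the previously seen letter
    for ch in letters[1:]:
        if ch in "HW":
            continue                   # H and W do not separate equal codes
        digit = CODES.get(ch, "")      # vowels (and Y) map to ""
        if digit and digit != prev:
            code += digit              # skip runs of the same digit
        prev = digit                   # a vowel resets prev, so repeats count
    return (code + "000")[:4]          # pad or truncate to 4 characters


print(soundex("Robert"), soundex("Rupert"))  # both R163
```

Because "Robert" and "Rupert" collapse to the same code, a lookup on the Soundex key treats them as candidates for the same name.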

The official documentation suggests an output like this. When I checked it myself, however, I faced lots of issues, and on looking into the errors it became clear that the Soundex implementation appears to be broken.

[Image: example output from the official documentation]

This method lists all the possible spelling variations of each name component. That is, it generates almost every variation of a given name (which is computationally expensive), and matches are then looked up in that list.

It comes with a computational cost and reduced speed. But whenever a user complains about a mismatch, it is easy to add the new match to the list.
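A rough sketch of how such a variation list could be stored and extended is shown below. The table contents and the helper names (`NAME_VARIANTS`, `build_index`, `same_name`, `add_variant`) are hypothetical and only meant to illustrate the idea; a real list would be far larger.

```python
# Hypothetical variant table: canonical name -> known spelling variations.
NAME_VARIANTS = {
    "catherine": {"katherine", "kathryn", "catharine"},
    "mohammed": {"muhammad", "mohamed", "mohammad"},
}


def build_index(variants):
    """Map every spelling (canonical or variant) back to its canonical name."""
    index = {}
    for canonical, spellings in variants.items():
        index[canonical] = canonical
        for spelling in spellings:
            index[spelling] = canonical
    return index


def same_name(a, b, index):
    """Two names match if both resolve to the same canonical entry."""
    key_a = index.get(a.strip().lower())
    key_b = index.get(b.strip().lower())
    return key_a is not None and key_a == key_b


def add_variant(variants, index, canonical, new_spelling):
    """Register a user-reported mismatch as a new known variation."""
    canonical = canonical.strip().lower()
    new_spelling = new_spelling.strip().lower()
    variants.setdefault(canonical, set()).add(new_spelling)
    index[canonical] = canonical
    index[new_spelling] = canonical


index = build_index(NAME_VARIANTS)
print(same_name("Kathryn", "Catherine", index))   # True
print(same_name("Katarina", "Catherine", index))  # False: not in the list yet
add_variant(NAME_VARIANTS, index, "catherine", "Katarina")
print(same_name("Katarina", "Catherine", index))  # True after the fix
```

The lookup itself is cheap; the cost mentioned above comes from generating and storing the variations in the first place.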

Methods like the Levenshtein distance, the Jaro–Winkler distance, and the Jaccard similarity coefficient can be used to measure the character-by-character distance between two names. Quantifying the difference in terms of characters is thus another way to check for a match.
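As an illustration of one of these measures, here is a small pure-Python sketch of the Levenshtein distance using the standard dynamic-programming recurrence. The `similarity` helper, which rescales the distance to a 0 to 1 score, is my own convenience wrapper and just one possible normalization.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a                      # keep the inner row short
    previous = list(range(len(b) + 1))   # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Rescale the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(similarity("kale", "kyle"))        # 0.75
```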

Another approach suggested by the community is to train a model that takes two names and returns a similarity score between them. To do this, we have to train the model on pairs of similar and dissimilar names so that it learns the patterns in such data and gives us a good similarity score.

Sometimes names contain synonyms; this is usually seen in organization names. In these cases, we can use word embeddings, which are numerical vector representations of a word’s semantic meaning. If two words or documents have similar vectors, we can consider them semantically similar. This idea can be applied to name matching.
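The sketch below shows the mechanics of comparing two organization names through word vectors. The three-dimensional vectors in `TOY_VECTORS` are hand-made for illustration and do not come from any trained model; in practice you would load real embeddings, which typically have hundreds of dimensions. The helper names `embed` and `cosine` are also my own.

```python
import math

# Hand-made toy vectors for illustration only (NOT from a trained model).
TOY_VECTORS = {
    "company": [0.90, 0.10, 0.00],
    "corporation": [0.85, 0.15, 0.05],
    "bakery": [0.10, 0.90, 0.20],
    "acme": [0.30, 0.30, 0.90],
}
DIMENSIONS = 3


def embed(name):
    """Average the vectors of the known words in a name."""
    vectors = [TOY_VECTORS[w] for w in name.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return [0.0] * DIMENSIONS
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


base = embed("Acme Company")
print(cosine(base, embed("Acme Corporation")))  # close to 1: synonyms
print(cosine(base, embed("Acme Bakery")))       # noticeably lower
```

With these toy vectors, "Acme Company" lands closer to "Acme Corporation" than to "Acme Bakery", even though a character-level distance would treat both substitutions as equally large edits.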

To catch typos and errors in names, textual similarity search is another option. The Jaro-Winkler distance, the Hamming distance, the Damerau-Levenshtein distance, and the regular Levenshtein distance can all be used for this.
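For reference, two of these measures can be sketched in plain Python as follows. This is a minimal illustration of the usual textbook definitions (Jaro similarity with the Winkler prefix bonus, and Hamming distance for equal-length strings), not the code of any specific library; the default `prefix_scale=0.1` is the commonly used value.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def jaro(a: str, b: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)  # how far a match may drift
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transpositions.
    a_seq = [ch for ch, hit in zip(a, a_matched) if hit]
    b_seq = [ch for ch, hit in zip(b, b_matched) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


print(hamming("kale", "kyle"))                      # 1
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))   # 0.9611
```

Jaro-Winkler rewards names that agree at the start, which suits typos made later in a name; Hamming distance only applies when both strings have the same length.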

We tried querying the name “kale” against the rest of the names in the dict. Check out the results in the image below.

[Image: results of querying “kale”, generated by author]

By looking at those outputs, you must have figured out that no single method gives us perfect recall and precision together. So we have to use ensemble approaches to get good accuracy and precision.

  1. Combining a common-key (phonetic) score with a Jaro or Levenshtein score can be a good criterion.
  2. We also observed that NYSIIS and Double Metaphone (first hash value) can be combined to obtain a better approach.
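The first combination above could be sketched roughly like this. To keep the example self-contained, it uses stand-ins rather than the exact algorithms named in the list: `crude_phonetic_key` is a deliberately simple consonant skeleton (not Soundex, NYSIIS, or Metaphone), and `difflib.SequenceMatcher` from the standard library stands in for a Jaro or Levenshtein score. The weight and threshold values are arbitrary assumptions you would tune on your own data.

```python
from difflib import SequenceMatcher


def crude_phonetic_key(name: str) -> str:
    """Stand-in phonetic key: first letter plus the remaining consonants,
    with immediate repeats collapsed. Swap in a real phonetic algorithm."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    for ch in letters[1:]:
        if ch not in "aeiouyhw" and ch != key[-1]:
            key += ch
    return key


def string_similarity(a: str, b: str) -> float:
    """Stand-in for a Jaro or Levenshtein score, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def ensemble_score(a: str, b: str, phonetic_weight: float = 0.5) -> float:
    """Weighted mix of a phonetic-key match (0 or 1) and string similarity."""
    same_key = crude_phonetic_key(a) == crude_phonetic_key(b)
    phonetic = 1.0 if same_key else 0.0
    return (phonetic_weight * phonetic
            + (1 - phonetic_weight) * string_similarity(a, b))


def best_matches(query, candidates, threshold=0.7):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, ensemble_score(query, name)) for name in candidates]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


for name, score in best_matches("kale", ["kyle", "karl", "kale"]):
    print(name, round(score, 3))
```

The phonetic term pulls together names that sound alike, while the string-similarity term penalizes candidates that share a key but are spelled very differently, which is the trade-off between recall and precision discussed above.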

Check out the GitHub repo for the code.
