29th June 2020

Fingerprinting names

On Screener, we source many public documents with “names”. The problem with names is that we can write the same name in multiple ways. A simple name like MAPL Industries Limited can also be written like:

- M A P L Industries Limited
- MAPL Industries Ltd
- M.A.P.L Industries Ltd
- MAPL Ind. Ltd.

These all should be treated as the same person. We created a fingerprinting algorithm to generate the same string for all the above inputs. This fingerprinting makes it easy to store, search and match these names in the database.

It handles:
- abbreviations such as Ind, dev and corp
- common suffixes and prefixes such as Mr, Mrs and Ms

Code is given below.

Update
There are some good algorithms for fuzzy comparison. However, these compute the similarity in real-time. We needed a way to do such searches at the database level. Santhosh's Soundex library for Indian languages is a wonderful way. We didn't use it because we needed exact searches and didn't want any false positives.

import re


NON_WORD = re.compile(r"[\W]+")


def get_fingerprint(name):
    """
    Strips non-alphanumeric characters and common prefixes and suffixes
    Motilal Oswal Services -> motilaloswalservices
    Motilal Oswal -> motilaloswal
    """
    original_name = name.replace("\n", " ").strip()
    name = original_name.lower()
    name = NON_WORD.sub(" ", name)

    removals = [
        r"^the ",
        r" and ",
        r"^mr ",
        r"^mrs ",
        r"^ms ",
        # public private limited company
        r"\bp ltd\b",
        r"\blim[ited]+\b",
        r"\bltd\b",
        r"\bpvt\b",
        r"\bprivate\b",
        r"\bpublic\b",
        r"\bco\b",
        r"\bco[mpany]+\b",
        r"\bplc\b",
    ]
    for removal in removals:
        name = re.sub(removal, "", name)

    replacements = {
        r"\bcorp[oration]+\b": "corp",
        r"\bdev[elopment]+\b": "dev",
        r"\bdev[lopers]+\b": "dev",
        r"\binv[estments]+\b": "inv",
        r"\bind[ia]+\b": "ind",
        r"\bind[ustries]+\b": "ind",
        r"\bind[ustrial]+\b": "ind",
        r"\bint[ernational]+\b": "intl",
    }
    for pattern, replacement in replacements.items():
        name = re.sub(pattern, replacement, name)

    # join everything
    name = name.replace(" ", "")
    return name

Liked this post?

Get new posts in your email. The updates are free.