Fully-Faltoo blog by Pratyush


29th June 2020

Fingerprinting names

On Screener, we source many public documents that contain “names”. The problem with names is that the same name can be written in multiple ways. A simple name like MAPL Industries Limited can also be written as:

- M A P L Industries Limited
- MAPL Industries Ltd
- M.A.P.L Industries Ltd
- MAPL Ind. Ltd.

All of these should be treated as the same entity. We created a fingerprinting algorithm that generates the same string for all the above inputs. This fingerprint makes it easy to store, search and match these names in the database.

It handles:
- abbreviations such as Ind, dev and corp
- common prefixes such as Mr, Mrs and Ms, and common suffixes such as Ltd, Pvt and Co

The code is given at the end of this post.

Update
There are some good algorithms for fuzzy comparison. However, they compute the similarity at query time. We needed a way to run such searches at the database level. Santhosh's Soundex library for Indian languages is a wonderful option, but we didn't use it because we needed exact matches and didn't want any false positives.
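
To make the false-positive concern concrete, here is a minimal sketch using Python's standard `difflib` as a stand-in for a fuzzy matcher. It is not one of the libraries we evaluated, and "MAPS Industries Ltd" is a made-up name used only for illustration. The helper `similarity` is hypothetical as well.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity score for two names (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Two spellings of the same company
same_company = similarity("MAPL Industries Ltd", "MAPL Ind. Ltd.")

# A different (made-up) company that differs by a single letter
different_company = similarity("MAPL Industries Ltd", "MAPS Industries Ltd")

print(f"same company:      {same_company:.2f}")
print(f"different company: {different_company:.2f}")
```

In this sketch the different company scores higher than the alternate spelling of the same company, so any threshold that accepts the real match also accepts the wrong one. The score also has to be computed pair by pair, which is hard to push down into a database query.
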
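import re


# One or more non-word characters (anything other than letters, digits and "_")
NON_WORD = re.compile(r"\W+")


def get_fingerprint(name):
    """
    Strips non-word characters and common prefixes and suffixes
    Motilal Oswal Services -> motilaloswalservices
    Motilal Oswal -> motilaloswal
    M.A.P.L Industries Ltd -> maplind
    """
    name = name.replace("\n", " ").lower()
    # punctuation becomes a single space, so "m.a.p.l" turns into "m a p l";
    # strip afterwards so that the "^" anchors below see the first word
    name = NON_WORD.sub(" ", name).strip()

    # NOTE: patterns such as lim[ited]+ use a character class, so they match
    # "lim" followed by any run of those letters ("limited", but also "limit")
    removals = [
        r"^the ",
        r" and ",
        r"^mr ",
        r"^mrs ",
        r"^ms ",
        # public private limited company
        r"\bp ltd\b",
        r"\blim[ited]+\b",
        r"\bltd\b",
        r"\bpvt\b",
        r"\bprivate\b",
        r"\bpublic\b",
        r"\bco\b",
        r"\bco[mpany]+\b",
        r"\bplc\b",
    ]
    for removal in removals:
        name = re.sub(removal, "", name)

    # long forms are reduced to their abbreviation; the bare abbreviations
    # ("corp", "dev", "inv", "ind") match none of these patterns and stay as-is
    replacements = {
        r"\bcorp[oration]+\b": "corp",
        r"\bdev[elopment]+\b": "dev",
        r"\bdev[lopers]+\b": "dev",
        r"\binv[estments]+\b": "inv",
        r"\bind[ia]+\b": "ind",
        r"\bind[ustries]+\b": "ind",
        r"\bind[ustrial]+\b": "ind",
        r"\bint[ernational]+\b": "intl",
    }
    for pattern, replacement in replacements.items():
        name = re.sub(pattern, replacement, name)

    # join everything
    name = name.replace(" ", "")
    return name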
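
To show what "searching at the database level" can look like, here is a minimal sketch using Python's built-in `sqlite3`. The table layout, the helpers `add_name` and `find_names`, and the cut-down `toy_fingerprint` are all assumptions made for this example, not Screener's actual schema or code. In practice the full `get_fingerprint` above would take the place of `toy_fingerprint`.

```python
import re
import sqlite3

NON_WORD = re.compile(r"\W+")


def toy_fingerprint(name):
    """Simplified stand-in for get_fingerprint, just enough for this demo."""
    name = NON_WORD.sub(" ", name.lower()).strip()
    name = re.sub(r"\b(limited|ltd)\b", "", name)
    name = re.sub(r"\b(industries|ind)\b", "ind", name)
    return name.replace(" ", "")


# Hypothetical schema: keep the original name and its fingerprint side by side
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT, fingerprint TEXT)"
)
# An ordinary index is enough because lookups are exact matches
conn.execute("CREATE INDEX idx_names_fingerprint ON names (fingerprint)")


def add_name(name):
    """Store a name together with its fingerprint."""
    conn.execute(
        "INSERT INTO names (name, fingerprint) VALUES (?, ?)",
        (name, toy_fingerprint(name)),
    )


def find_names(query):
    """Return every stored spelling that shares the query's fingerprint."""
    rows = conn.execute(
        "SELECT name FROM names WHERE fingerprint = ? ORDER BY id",
        (toy_fingerprint(query),),
    )
    return [row[0] for row in rows]


add_name("MAPL Industries Limited")
add_name("MAPL Ind. Ltd.")
add_name("Motilal Oswal Services")

matches = find_names("M.A.P.L Industries Ltd")
print(matches)
```

The lookup is a plain equality check on an indexed column, so the database does the matching and there is no similarity threshold to tune.
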
