Fingerprinting names
On Screener, we source many public documents that contain names. The problem with names is that the same name can be written in several ways. A simple name like MAPL Industries Limited can also appear as:
- M A P L Industries Limited
- MAPL Industries Ltd
- M.A.P.L Industries Ltd
- MAPL Ind. Ltd.
These should all be treated as the same entity. We created a fingerprinting algorithm that generates the same string for all of the above inputs. This fingerprint makes it easy to store, search and match these names in the database.
It handles:
- abbreviations such as ind, dev and corp
- common prefixes and suffixes such as Mr, Mrs, Ms, Ltd and Pvt
The code is given below.
Update
There are some good algorithms for fuzzy comparison. However, these compute the similarity at query time, and we needed a way to run such searches at the database level. Santhosh's Soundex library for Indian languages is a wonderful option, but we didn't use it because we needed exact matches and didn't want any false positives.
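To make the database-level idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names and the `find_by_name` helper are assumptions made for illustration, and `toy_fingerprint` is a deliberately cut-down stand-in that only knows the rules needed for the MAPL example; it is not the function from this post.

```python
import re
import sqlite3


def toy_fingerprint(name):
    # Hypothetical, cut-down stand-in for get_fingerprint():
    # it only handles the rules needed for the MAPL example.
    name = re.sub(r"\W+", " ", name.lower())
    name = re.sub(r"\b(limited|ltd)\b", "", name)
    name = re.sub(r"\bindustries\b", "ind", name)
    return name.replace(" ", "")


conn = sqlite3.connect(":memory:")
# Table and column names are illustrative assumptions.
conn.execute(
    "CREATE TABLE names (id INTEGER PRIMARY KEY, raw_name TEXT, fingerprint TEXT)"
)
conn.execute("CREATE INDEX idx_names_fingerprint ON names (fingerprint)")

variants = [
    "M A P L Industries Limited",
    "MAPL Industries Ltd",
    "M.A.P.L Industries Ltd",
    "MAPL Ind. Ltd.",
]
conn.executemany(
    "INSERT INTO names (raw_name, fingerprint) VALUES (?, ?)",
    [(v, toy_fingerprint(v)) for v in variants],
)


def find_by_name(conn, query):
    # Fingerprint the query once, then do a plain equality match
    # that the index on the fingerprint column can serve.
    rows = conn.execute(
        "SELECT raw_name FROM names WHERE fingerprint = ? ORDER BY id",
        (toy_fingerprint(query),),
    )
    return [row[0] for row in rows]


matches = find_by_name(conn, "mapl industries limited")
print(matches)  # all four stored spellings
```

Because the fingerprint is an ordinary string column, the lookup is an exact comparison rather than a similarity score, which is what keeps false positives out.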
import re

NON_WORD = re.compile(r"[\W]+")


def get_fingerprint(name):
    """
    Strips non-alphanumeric characters and common prefixes and suffixes
    Motilal Oswal Services -> motilaloswalservices
    Motilal Oswal -> motilaloswal
    """
    original_name = name.replace("\n", " ").strip()
    name = original_name.lower()
    name = NON_WORD.sub(" ", name)

    removals = [
        r"^the ",
        r" and ",
        r"^mr ",
        r"^mrs ",
        r"^ms ",
        # public private limited company
        r"\bp ltd\b",
        r"\blim[ited]+\b",
        r"\bltd\b",
        r"\bpvt\b",
        r"\bprivate\b",
        r"\bpublic\b",
        r"\bco\b",
        r"\bco[mpany]+\b",
        r"\bplc\b",
    ]
    for removal in removals:
        name = re.sub(removal, "", name)

    replacements = {
        r"\bcorp[oration]+\b": "corp",
        r"\bdev[elopment]+\b": "dev",
        r"\bdev[lopers]+\b": "dev",
        r"\binv[estments]+\b": "inv",
        r"\bind[ia]+\b": "ind",
        r"\bind[ustries]+\b": "ind",
        r"\bind[ustrial]+\b": "ind",
        r"\bint[ernational]+\b": "intl",
    }
    for pattern, replacement in replacements.items():
        name = re.sub(pattern, replacement, name)

    # join everything
    name = name.replace(" ", "")
    return name
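One property of these patterns may be worth checking against your own data. The bracketed parts such as `[ited]+` are character classes, so a pattern matches any word that starts with the prefix and continues with letters from that set, not only the intended spellings. The sketch below compares two patterns copied from the code above with stricter alternation-based variants. The stricter variants are an assumption for illustration, not part of the original code.

```python
import re

# Two patterns copied from the removal and replacement lists above.
loose_limited = re.compile(r"\blim[ited]+\b")
loose_intl = re.compile(r"\bint[ernational]+\b")

# Stricter variants (an assumption, not the post's code): list the
# accepted spellings explicitly instead of using a character class.
strict_limited = re.compile(r"\b(?:limited|ltd)\b")
strict_intl = re.compile(r"\b(?:international|intl)\b")


def matches_any(patterns, word):
    # True if at least one of the compiled patterns matches the word.
    return any(p.search(word) for p in patterns)


loose = [loose_limited, loose_intl]
strict = [strict_limited, strict_intl]

for word in ["limited", "lime", "international", "internet"]:
    print(word, matches_any(loose, word), matches_any(strict, word))
# limited True True
# lime True False
# international True True
# internet True False
```

With the loose patterns, a name containing "Internet" is rewritten to the same token as "International". If exact matching with no false positives is the goal, listing the spellings explicitly trades a little brevity for tighter control.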