24th Jan. 2022

Extending full-text search in MySQL

We recently added a full-text search feature in Screener. It searches for all the exchange announcements. We initially faced a few issues while implementing it. However, we were easily able to find a way around them by creating a simple query parser.

The problems
MySQL offers 2 main types of searches. BOOLEAN and NATURAL MODE. The natural mode is a user-friendly mode. It takes the input and searches it across the database. The users don't need to learn anything new.

On the downside, the NATURAL mode doesn't allow much customisation. We cannot use "OR" in queries. It also doesn't search for partial matches and stems. For example, searching for "right" won't include results for "rights issue".

This was a deal-breaker for us. We needed to have partial matches.

The BOOLEAN mode is much more flexible. We can combine complex AND and OR matches. We can also use it for finding partial matches. However, the queries are not very user friendly. The BOOLEAN query for searching right issue with partial matches is "+right* +issue*".

The idea of a parser
The ease of the natural mode and the flexibility of the boolean mode gave us an idea. How about writing a parser to convert a natural query into a boolean query? That's what we did.

The solution
We implemented a parser using the following code. It basically works like this:
- split the query into OR clauses
- Split each clause into separate words
- For each word, find its stem and convert it to a partial search: +stem*
- User can add quotes around the words to search exact phrases

import re
from nltk.stem.snowball import SnowballStemmer


def _to_boolean_term(term):
    # break term on: space, comma, period, phrases
    regex = '\s|,|\.|([+-]?"[^"]+")'
    words = re.split(regex, term)

    boolean_words = []
    stemmer = SnowballStemmer(language="english")
    for word in words:
        if not word:
            continue

        if (
            word.startswith('"')
            or word.startswith('+"')
            or word.startswith('-"')
        ) and word.endswith('"'):
            # is phrase
            # no cleaning required
            boolean_word = f"+{word}" if word.startswith('"') else word
        elif word.startswith("-"):
            # is negative
            # strip non-alphanumeric characters at start or end
            # clean the word
            # and use as negative search
            word = re.sub(r"[^a-zA-Z0-9-]", "", word).strip("-+")
            if not word:
                continue
            boolean_word = f"-{word}"
        else:
            # remove non-alphanumeric characters
            word = re.sub(r"[^a-zA-Z0-9-]", "", word)

            # let mysql handle tiny words
            if len(word) < 3:
                boolean_word = f"{word}"
            else:
                # get stemmed word
                stem = stemmer.stem(word).strip("-")
                # we don't want to use bad stems such as daili for daily
                # these happen very rarely
                word = stem if stem in word else word
                boolean_word = f"+{word}*"
        boolean_words.append(boolean_word)

    boolean_term = " ".join(boolean_words)
    return f"({boolean_term})"


def to_boolean_query(natural_query):
    natural_query = natural_query.lower()
    or_terms = natural_query.split(" or ")
    return " ".join(_to_boolean_term(term) for term in or_terms)

The above works pretty well for us. We take the query from a user and then convert it to boolean form using to_boolean_query(raw_query).

Liked this post?

Get new posts in your email. The updates are free.