Extending full-text search in MySQL
We recently added a full-text search feature in Screener. It searches for all the exchange announcements. We initially faced a few issues while implementing it. However, we were easily able to find a way around them by creating a simple query parser.
The problems
MySQL offers 2 main types of searches. BOOLEAN and NATURAL MODE. The natural mode is a user-friendly mode. It takes the input and searches it across the database. The users don't need to learn anything new.
On the downside, the NATURAL mode doesn't allow much customisation. We cannot use "OR" in queries. It also doesn't search for partial matches and stems. For example, searching for "right" won't include results for "rights issue".
This was a deal-breaker for us. We needed to have partial matches.
The BOOLEAN mode is much more flexible. We can combine complex AND and OR matches. We can also use it for finding partial matches. However, the queries are not very user friendly. The BOOLEAN query for searching right issue with partial matches is "+right* +issue*".
The idea of a parser
The ease of the natural mode and the flexibility of the boolean mode gave us an idea. How about writing a parser to convert a natural query into a boolean query? That's what we did.
The solution
We implemented a parser using the following code. It basically works like this:
- split the query into OR clauses
- Split each clause into separate words
- For each word, find its stem and convert it to a partial search: +stem*
- User can add quotes around the words to search exact phrases
The problems
MySQL offers 2 main types of searches. BOOLEAN and NATURAL MODE. The natural mode is a user-friendly mode. It takes the input and searches it across the database. The users don't need to learn anything new.
On the downside, the NATURAL mode doesn't allow much customisation. We cannot use "OR" in queries. It also doesn't search for partial matches and stems. For example, searching for "right" won't include results for "rights issue".
This was a deal-breaker for us. We needed to have partial matches.
The BOOLEAN mode is much more flexible. We can combine complex AND and OR matches. We can also use it for finding partial matches. However, the queries are not very user friendly. The BOOLEAN query for searching right issue with partial matches is "+right* +issue*".
The idea of a parser
The ease of the natural mode and the flexibility of the boolean mode gave us an idea. How about writing a parser to convert a natural query into a boolean query? That's what we did.
The solution
We implemented a parser using the following code. It basically works like this:
- split the query into OR clauses
- Split each clause into separate words
- For each word, find its stem and convert it to a partial search: +stem*
- User can add quotes around the words to search exact phrases
import re from nltk.stem.snowball import SnowballStemmer def _to_boolean_term(term): # break term on: space, comma, period, phrases regex = '\s|,|\.|([+-]?"[^"]+")' words = re.split(regex, term) boolean_words = [] stemmer = SnowballStemmer(language="english") for word in words: if not word: continue if ( word.startswith('"') or word.startswith('+"') or word.startswith('-"') ) and word.endswith('"'): # is phrase # no cleaning required boolean_word = f"+{word}" if word.startswith('"') else word elif word.startswith("-"): # is negative # strip non-alphanumeric characters at start or end # clean the word # and use as negative search word = re.sub(r"[^a-zA-Z0-9-]", "", word).strip("-+") if not word: continue boolean_word = f"-{word}" else: # remove non-alphanumeric characters word = re.sub(r"[^a-zA-Z0-9-]", "", word) # let mysql handle tiny words if len(word) < 3: boolean_word = f"{word}" else: # get stemmed word stem = stemmer.stem(word).strip("-") # we don't want to use bad stems such as daili for daily # these happen very rarely word = stem if stem in word else word boolean_word = f"+{word}*" boolean_words.append(boolean_word) boolean_term = " ".join(boolean_words) return f"({boolean_term})" def to_boolean_query(natural_query): natural_query = natural_query.lower() or_terms = natural_query.split(" or ") return " ".join(_to_boolean_term(term) for term in or_terms)
The above works pretty well for us. We take the query from a user and then convert it to boolean form using to_boolean_query(raw_query).