Pratyush Mittal
Hobby coder and a stock investor.
Co-founder Screener.in
We recently added a full-text search feature to Screener. It searches across all the exchange announcements. We faced a few issues while implementing it, but found a way around them by writing a simple query parser.

The problems
MySQL offers 2 main types of full-text search: BOOLEAN MODE and NATURAL LANGUAGE MODE. The natural language mode is the user-friendly one. It takes the input as typed and searches for it across the indexed columns. Users don't need to learn anything new.

On the downside, the NATURAL mode doesn't allow much customisation. We cannot use "OR" in queries. It also doesn't search for partial matches and stems. For example, searching for "right" won't include results for "rights issue".

This was a deal-breaker for us. We needed to have partial matches.

The BOOLEAN mode is much more flexible. We can combine complex AND and OR matches. We can also use it for finding partial matches. However, the queries are not very user-friendly. For example, the BOOLEAN query for searching "right issue" with partial matches is "+right* +issue*".

The idea of a parser
The ease of the natural mode and the flexibility of the boolean mode gave us an idea. How about writing a parser to convert a natural query into a boolean query? That's what we did.

The solution
We implemented a parser using the following code. It basically works like this:
- Split the query into OR clauses
- Split each clause into separate words
- For each word, find its stem and convert it to a partial search: +stem*
- Users can add quotes around words to search for exact phrases

import re
from nltk.stem.snowball import SnowballStemmer


def _to_boolean_term(term):
    # break term on: space, comma, period, phrases
    # (raw string, so that \s and \. reach the regex engine unescaped)
    regex = r'\s|,|\.|([+-]?"[^"]+")'
    words = re.split(regex, term)

    boolean_words = []
    stemmer = SnowballStemmer(language="english")
    for word in words:
        if not word:
            continue

        if (
            word.startswith('"')
            or word.startswith('+"')
            or word.startswith('-"')
        ) and word.endswith('"'):
            # is phrase
            # no cleaning required
            boolean_word = f"+{word}" if word.startswith('"') else word
        elif word.startswith("-"):
            # is negative
            # strip non-alphanumeric characters at start or end
            # clean the word
            # and use as negative search
            word = re.sub(r"[^a-zA-Z0-9-]", "", word).strip("-+")
            if not word:
                continue
            boolean_word = f"-{word}"
        else:
            # remove non-alphanumeric characters
            word = re.sub(r"[^a-zA-Z0-9-]", "", word)
            # skip words that were made up of punctuation only
            if not word:
                continue

            # let mysql handle tiny words
            if len(word) < 3:
                boolean_word = word
            else:
                # get stemmed word
                stem = stemmer.stem(word).strip("-")
                # we don't want to use bad stems such as daili for daily
                # these happen very rarely
                word = stem if stem in word else word
                boolean_word = f"+{word}*"
        boolean_words.append(boolean_word)

    boolean_term = " ".join(boolean_words)
    return f"({boolean_term})"


def to_boolean_query(natural_query):
    natural_query = natural_query.lower()
    or_terms = natural_query.split(" or ")
    return " ".join(_to_boolean_term(term) for term in or_terms)

The above works pretty well for us. We take the query from a user and then convert it to boolean form using to_boolean_query(raw_query).
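
For context, here is a minimal sketch of how the converted query might be handed to MySQL. The table name announcements, the column title and the helper build_search_sql are hypothetical and are not Screener's actual schema or code. The %s placeholders assume a DB-API driver such as mysqlclient or PyMySQL.

```python
def build_search_sql(boolean_query, limit=20):
    """Return (sql, params) for a parameterised BOOLEAN MODE search.

    `announcements` and `title` are hypothetical names used only for
    illustration. `title` is assumed to have a FULLTEXT index.
    """
    sql = (
        "SELECT id, title FROM announcements "
        "WHERE MATCH(title) AGAINST (%s IN BOOLEAN MODE) "
        "LIMIT %s"
    )
    # The query string is passed as a parameter, never formatted into the SQL.
    return sql, (boolean_query, limit)


# A hand-written string in the shape that to_boolean_query returns.
sql, params = build_search_sql("(+right* +issue*)")
# With a DB-API cursor this would be run as: cursor.execute(sql, params)
```

Passing the boolean query as a parameter keeps user input out of the SQL text itself, even though the parser has already stripped most special characters.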

I loved this essay by PG. He explains the concept of "schlep blindness".

It explains why no one solved payments before Stripe. Every coder knew the problem. Every coder wanted a solution. Yet no one solved it! Everyone thought it was a hard problem. It would involve making deals with banks, and then taking on a lot of risk because of the flow of money involved. These are "schleps" that hackers like to avoid.

But there is a fallacy here. ALL businesses involve schleps:

But I soon learned from experience that schleps are not merely inevitable, but pretty much what business consists of. A company is defined by the schleps it will undertake. And schleps should be dealt with the same way you'd deal with a cold swimming pool: just jump in. Which is not to say you should seek out unpleasant work per se, but that you should never shrink from it if it's on the path to something great.

Another paragraph that really stuck with me was on how "founders grow with the problems."

In practice the founders grow with the problems. But no one seems able to foresee that, not even older, more experienced founders. So the reason younger founders have an advantage is that they make two mistakes that cancel each other out. They don't know how much they can grow, but they also don't know how much they'll need to. Older founders only make the first mistake.
We saw a sudden increase in disk usage last month.

We store all persistent data on /data-volume. I checked what was using the space with:
sudo du -h --max-depth 1 /data-volume | sort -h
It showed that the MySQL folder was using around 150 GB. That was huge. Last week it was ~120 GB.

I was very concerned about this increase and dug more into it. I thought it was because of the full-text search index. But I wanted to be sure.

I was searching for a way to get stats on what was using space inside MySQL. Interestingly, the easiest solution was to list the directory sizes here too.
sudo du -h --max-depth 1 /data-volume/mysql | sort -h

It showed only ~18 GB being used by production_db. The directory as a whole, however, was ~120 GB.

Doing ls on the directory revealed lots of binlog files. Some of these were in GBs.

Psssst! It was the event logs that were taking up the space.
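
To put a number on it, the binlog files can be totalled separately from the databases. This is a small sketch, not what we ran at the time. The helper name binlog_usage is made up, and the "binlog." prefix assumes MySQL 8.0's default binlog file names (binlog.000001 and so on, plus the small binlog.index file); a server with a custom log_bin setting will use a different prefix.

```python
from pathlib import Path


def binlog_usage(data_dir, prefix="binlog."):
    """Return (file_count, total_bytes) of files in data_dir named prefix*.

    The default prefix is an assumption based on MySQL 8.0 defaults.
    Only the top level of data_dir is scanned.
    """
    files = [
        p for p in Path(data_dir).iterdir()
        if p.is_file() and p.name.startswith(prefix)
    ]
    return len(files), sum(p.stat().st_size for p in files)


# Example (needs read access to the MySQL data directory):
# count, total = binlog_usage("/data-volume/mysql")
# print(f"{count} binlog files, {total / 1024 ** 3:.1f} GiB")
```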

The binlog files are used for replication. The updates are communicated between the primary and replica servers using the binlog files.

Earlier versions of MySQL (<8.0.11) retained binlog files of the last 5 days. However, the default is now 30 days (binlog_expire_logs_seconds = 2592000). This was taking up most of the space.

We purged the old binlog files using:
PURGE BINARY LOGS BEFORE '2021-10-10 22:46:26';
This freed up around 80 GB.
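
The cutoff timestamp in that statement does not have to be typed by hand. As a rough sketch, it can be derived from the number of days to keep. The helper purge_statement and its parameters are illustrative names, not part of MySQL or of our tooling.

```python
from datetime import datetime, timedelta


def purge_statement(keep_days, now=None):
    """Build a PURGE BINARY LOGS statement that keeps the last keep_days days."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=keep_days)
    # MySQL expects the cutoff as 'YYYY-MM-DD hh:mm:ss'
    return f"PURGE BINARY LOGS BEFORE '{cutoff:%Y-%m-%d %H:%M:%S}';"


print(purge_statement(15))
```

Before purging, it is worth checking that every replica has already read the logs being removed. A replica that still needs a purged file cannot continue replicating from it.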

We also updated the default value for binlog_expire_logs_seconds to retain logs only for 15 days going forward.
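
The variable is expressed in seconds, so 15 days works out to 15 × 86,400 = 1,296,000. The sketch below only builds the statement; the helper name binlog_expiry_statement is hypothetical, and using SET PERSIST (rather than editing the config file) is one possible way to apply it, not necessarily how we did it.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def binlog_expiry_statement(days, persist=True):
    """Build the SQL that sets binlog retention to the given number of days.

    SET PERSIST (MySQL 8.0+) also saves the value across restarts;
    SET GLOBAL changes it only until the server restarts.
    """
    scope = "PERSIST" if persist else "GLOBAL"
    return f"SET {scope} binlog_expire_logs_seconds = {days * SECONDS_PER_DAY};"


print(binlog_expiry_statement(15))
```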
I loved this interview with AR Rahman. The part I loved the most was where he explains how he thinks about creativity [10.06].

I think it is a constant seeking. When we try 500 things out of which 5 become amazing; I think it's a blessing. We can try 1000 years and still not crack something. But if you get 1 line that lingers on in your heart, in people's hearts, I think that's a blessing. Because somewhere deep there's a soul which is actually connecting to all other souls

And for that, you have to constantly be cleansing yourself.

Your mind, the way you think about, because what you think actually manifests. And you can't fake it. You can't be an evil person and try to do a very good song.

There is goodness in all of us. I feel like how can you enhance that stuff. How can you trigger that? How can you manifest it in another when people are listening to it and transport people.

[...]

One thing I do is fasting. When you fast, all the bad energies go. And then you are more in one with yourself. That helps me think this is right this is wrong.

Loved this short essay by Taleb. He highlights the difference in the way nature and we humans behave:

Nature builds with extra spare parts (two kidneys), and extra capacity in many, many things (say lungs, neural system, arterial apparatus, etc.), while design by humans tend to be spare, overoptimized, and have the opposite attribute of redundancy, that is, leverage—we have a historical track record of engaging in debt, which is the reverse of redundancy (fifty thousand in extra cash in the bank or, better, under the mattress, is redundancy; owing the bank an equivalent amount is debt).

We extrapolate the current optimism. We take the historical past as the worst-case scenario. And then we make provisions based on the estimated probabilities of that happening (or not happening) again.

However, nature learns from the past. It adapts to prevent those failures from happening again.

So if humans fight the last war, nature fights the next war.

Wonderful read.
