Improving search experience using Elasticsearch

October 29, 2024

We use Elasticsearch in NeetoCourse for our search needs. Recently, we made some changes to our Elasticsearch configuration to improve the search experience. In this blog post, we will share the changes we made and what we learned during the process.

Definitions

These are some of the Elasticsearch terms that we use in this blog.

Document: A document in Elasticsearch is similar to a row in a database table. It is a collection of key-value pairs.
Index: An index is a collection of documents. It is similar to a database table. Indexing is the process of building this index, and we can configure each step of this process.
Analyzer: An analyzer converts a string into a list of searchable tokens. An analyzer has three components, applied in order: character filters, a tokenizer, and token filters.
Character Filter: A character filter is a function that adds, removes, or changes characters in the input string before it is tokenized. For example, it can strip HTML tags so that only the text content is analyzed.
Tokenizer: A tokenizer is a function that splits the input string into tokens. This can be based on whitespace, punctuation, or any other character.
Token Filter: A token filter is a function that adds, modifies, or removes tokens produced by the tokenizer. For example, it can be used to remove stop words (a, the, and, etc.), which serve no purpose in the search.
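
To make the order of these components concrete, here is a minimal sketch in plain JavaScript. It only imitates the flow of an analyzer; the function names (stripHtml, whitespaceTokenizer, removeStopWords, analyze) and the stop word list are made up for illustration and are not part of Elasticsearch.

```javascript
// Toy model of an analyzer: character filters -> tokenizer -> token filters.
// It imitates the flow only; it is not how Elasticsearch is implemented.

// Character filter: works on the raw string (here, a crude HTML tag stripper).
const stripHtml = (text) => text.replace(/<[^>]*>/g, " ");

// Tokenizer: splits the filtered string into tokens (here, on whitespace).
const whitespaceTokenizer = (text) => text.split(/\s+/).filter(Boolean);

// Token filters: each one receives the token list and returns a new list.
const lowercase = (tokens) => tokens.map((token) => token.toLowerCase());
const STOP_WORDS = new Set(["a", "an", "and", "the"]);
const removeStopWords = (tokens) =>
  tokens.filter((token) => !STOP_WORDS.has(token));

// Runs the three stages in order and returns the final token list.
const analyze = (text, { charFilters, tokenizer, tokenFilters }) => {
  const filteredText = charFilters.reduce((acc, fn) => fn(acc), text);
  return tokenFilters.reduce((acc, fn) => fn(acc), tokenizer(filteredText));
};

const tokens = analyze("<p>The Elephant and the Room</p>", {
  charFilters: [stripHtml],
  tokenizer: whitespaceTokenizer,
  tokenFilters: [lowercase, removeStopWords],
});

console.log(tokens); // [ 'elephant', 'room' ]
```

Note that the order of token filters matters: lowercasing runs before stop word removal here, so "The" is removed as well.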

Analyzer

This is our analyzer setup:

{
  default: {
    tokenizer: "whitespace",
    filter: ["lowercase", "autocomplete"]
  },
  search: {
    tokenizer: "whitespace",
    filter: ["lowercase"]
  },
  english_exact: {
    tokenizer: "whitespace",
    filter: [
      "lowercase"
    ]
  }
}

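default is the analyzer used for indexing, and for searching when no other search analyzer applies. The search terms from the user are passed through the search analyzer, which leaves out the autocomplete filter so that the terms are not expanded into n-grams. english_exact is the analyzer used for exact matches.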
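
For readers wondering how analyzers like these get attached to fields, the following is a sketch of one possible field mapping. It is an assumption, not our actual mapping: the analyzer and search_analyzer mapping parameters are standard Elasticsearch options, while the textField helper and the exact subfield layout are hypothetical (the subfield name is modeled on the quote_field_suffix used in the query later in this post).

```javascript
// Hypothetical mapping sketch. Only the analyzer names (default, search,
// english_exact) and the field names come from this post.
const textField = () => ({
  type: "text",
  analyzer: "default", // applied to the field's content at index time
  search_analyzer: "search", // applied to the user's search term
  fields: {
    // Subfield for quoted (exact) searches, analyzed without n-grams.
    exact: { type: "text", analyzer: "english_exact" },
  },
});

const mappings = {
  properties: {
    content: textField(),
    meta: {
      properties: {
        pageTitle: textField(),
        chapterTitle: textField(),
      },
    },
  },
};

console.log(JSON.stringify(mappings, null, 2));
```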

Tokenizer

By default, Elasticsearch uses the standard tokenizer, which splits the input string on word boundaries and drops most punctuation. Our content is mostly about technical concepts and programming, where symbols carry meaning (a term like "C++" would be reduced to "C"), so the standard tokenizer does not work well for us. The whitespace tokenizer splits the input string only on whitespace and keeps punctuation intact, which is suitable for our use case. Hence, we use the whitespace tokenizer for all the analyzers.
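
The difference can be sketched in a few lines of JavaScript. The first function mirrors what a whitespace tokenizer does. The second is only a crude stand-in for a tokenizer that drops punctuation; it is not the real standard tokenizer, which follows Unicode word-boundary rules, and both function names are made up for this example.

```javascript
// Whitespace tokenizing: split on runs of whitespace, keep everything else.
const whitespaceTokenize = (text) => text.split(/\s+/).filter(Boolean);

// Crude stand-in for a punctuation-dropping tokenizer: keep only runs of
// letters and digits. This is NOT the real standard tokenizer.
const lettersAndDigitsTokenize = (text) => text.match(/[\p{L}\p{N}]+/gu) || [];

const sample = "Arrays in C++ and C#";

console.log(whitespaceTokenize(sample));
// [ 'Arrays', 'in', 'C++', 'and', 'C#' ]

console.log(lettersAndDigitsTokenize(sample));
// [ 'Arrays', 'in', 'C', 'and', 'C' ]
```

With the second approach, "C++" and "C#" become indistinguishable, which is the kind of loss we wanted to avoid.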

Filter

The lowercase filter converts all the tokens to lowercase before storing them in the index. This is a common requirement as we want the search to be case-insensitive. We also use the custom autocomplete filter. Let's see its definition:

{
  autocomplete: {
    "type": "edge_ngram",
    "min_gram": 3,
    "max_gram": 20,
    "preserve_original": true
  }
}

The custom autocomplete filter is an instance of the edge_ngram token filter. Let's see the result of this filter when applied on its own to the phrase "Elephant in the room". The lowercase filter is not part of this run, so the tokens keep their original case.

{
  "tokens": [
    {
      "token": "Ele",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "Elep",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "Eleph",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "Elepha",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "Elephan",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "Elephant",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "in",
      "start_offset": 9,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "the",
      "start_offset": 12,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "roo",
      "start_offset": 16,
      "end_offset": 20,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "room",
      "start_offset": 16,
      "end_offset": 20,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}

The edge_ngram filter creates n-grams anchored at the first character of each token. Since we specified min_gram as 3 and max_gram as 20, the filter emits every prefix of a token that is between 3 and 20 characters long. Because preserve_original is true, tokens shorter than min_gram, such as "in", are kept as they are. This means that the index holds a token for each of these prefixes. If we start typing "ELEP", the search analyzer lowercases it to "elep", which matches the prefix token stored in the index for "Elephant", and we will get the result "Elephant in the room".
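
The prefix generation described above can be sketched as follows. This is a simplified model rather than the Elasticsearch implementation: edgeNgrams and its option names are hypothetical, and the lowercase step is included to mimic the default analyzer.

```javascript
// Sketch of an edge n-gram token filter with preserve_original behaviour.
const edgeNgrams = (token, { minGram, maxGram, preserveOriginal }) => {
  const grams = [];
  const longest = Math.min(maxGram, token.length);
  // Emit every prefix whose length lies between minGram and maxGram.
  for (let length = minGram; length <= longest; length += 1) {
    grams.push(token.slice(0, length));
  }
  // The original token is added only when no gram equals it, that is, when
  // it is shorter than minGram or longer than maxGram.
  const outsideRange = token.length < minGram || token.length > maxGram;
  if (preserveOriginal && outsideRange) grams.push(token);
  return grams;
};

const config = { minGram: 3, maxGram: 20, preserveOriginal: true };

// Mimic the default analyzer: whitespace split, lowercase, then n-grams.
const indexedTokens = "Elephant in the room"
  .split(/\s+/)
  .map((token) => token.toLowerCase())
  .flatMap((token) => edgeNgrams(token, config));

console.log(indexedTokens);
// [ 'ele', 'elep', 'eleph', 'elepha', 'elephan', 'elephant',
//   'in', 'the', 'roo', 'room' ]

// The search analyzer only lowercases, so "ELEP" becomes "elep".
console.log(indexedTokens.includes("ELEP".toLowerCase())); // true
```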

The Query

The query object in the search request is equally important for getting relevant results. Elasticsearch sorts matching search results by relevance score, which measures how well each document matches a query. The query clause contains the criteria for calculating the relevance score. This is our query object:

{
  bool: {
    should: [
      {
        simple_query_string: {
          query: requestBody.searchTerm,
          fields: ["content", "meta.pageTitle", "meta.chapterTitle"],
          quote_field_suffix: ".exact",
        },
      },
      {
        match: {
          content: {
            query: requestBody.searchTerm,
            ...baseFuzzyQueryConfig(),
          },
        },
      },
      {
        match: {
          "meta.pageTitle": {
            query: requestBody.searchTerm,
            ...baseFuzzyQueryConfig(),
            boost: 1.5,
          },
        },
      },
      {
        match: {
          "meta.chapterTitle": {
            query: requestBody.searchTerm,
            ...baseFuzzyQueryConfig(),
            boost: 1.5,
          },
        },
      },
    ],
    minimum_should_match: 1,
  },
};

The bool query combines several queries into a compound query. With minimum_should_match set to 1, a document must match at least one of the queries in the should clause, and each additional query it matches adds to its relevance score. match is the standard query for full-text search. content, meta.pageTitle and meta.chapterTitle are the fields that we created while indexing the data.

We have provided a boost value of 1.5 for page title and chapter title. This makes sure that a match in a page or chapter title gets a higher relevance score than a match in the middle of the page content.

The simple_query_string query handles exact matches. When the search term contains double quotes, the quoted phrase is searched in the .exact variant of each field (set through quote_field_suffix), where the english_exact analyzer is used. The double quotes operator ( " ) is one of several operators supported by the simple_query_string query.
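
Since the three match clauses differ only in field name and boost, one way to assemble the same query is with a small helper. This is a sketch under assumptions: fuzzyMatchClause and buildSearchQuery are hypothetical names, and baseFuzzyQueryConfig is copied from the definition shown in the fuzzy searching section below.

```javascript
// Same shape as the helper defined in the fuzzy searching section.
const baseFuzzyQueryConfig = () => ({
  prefix_length: 0,
  fuzziness: "AUTO",
});

// Hypothetical helper: one fuzzy match clause per field, boost is optional.
const fuzzyMatchClause = (field, searchTerm, boost) => ({
  match: {
    [field]: {
      query: searchTerm,
      ...baseFuzzyQueryConfig(),
      ...(boost === undefined ? {} : { boost }),
    },
  },
});

// Hypothetical builder that returns the query object for a search term.
const buildSearchQuery = (searchTerm) => ({
  bool: {
    should: [
      {
        simple_query_string: {
          query: searchTerm,
          fields: ["content", "meta.pageTitle", "meta.chapterTitle"],
          quote_field_suffix: ".exact",
        },
      },
      fuzzyMatchClause("content", searchTerm),
      fuzzyMatchClause("meta.pageTitle", searchTerm, 1.5),
      fuzzyMatchClause("meta.chapterTitle", searchTerm, 1.5),
    ],
    minimum_should_match: 1,
  },
});

// A quoted term asks simple_query_string for an exact phrase match.
const query = buildSearchQuery('"elephant in the room"');
console.log(JSON.stringify(query, null, 2));
```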

Fuzzy searching

We also use fuzzy searching in the match query. Fuzzy searching helps return relevant results even if there are typos in the search term. Elasticsearch uses Levenshtein distance to calculate the similarity between the search term and the indexed data. Previously, we used the fuzzy query to implement fuzzy searching.

{
  fuzzy: {
    content: {
      value: requestBody.searchTerm,
      ...extendedFuzzyQueryConfig(),
    },
  },
},

But this caused several issues:

Fuzzy results were prioritized over exact matches. For example, searching for "five" returned results for "dive", even when "five" was present in the content.
Fuzzy results were not returned when the search term contained multiple words.

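Upon investigation, we found that the fuzzy query is a term-level query: it does not analyze the search term, so the whole input, including spaces and capital letters, is compared against individual indexed tokens. Instead, we now use the match query with the fuzziness parameter, which analyzes the search term first.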

const baseFuzzyQueryConfig = () => ({
  prefix_length: 0,
  fuzziness: "AUTO",
});
...
{
  match: {
    content: {
      query: requestBody.searchTerm,
      ...baseFuzzyQueryConfig(),
    },
  },
},
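
To illustrate why analyzing the search term matters, here is a small simulation in plain JavaScript. It is a sketch under assumptions, not Elasticsearch code: all function names are made up, the index is reduced to whole words (no n-grams), and the distance is the classic Levenshtein distance, whereas the match query by default also counts a swap of two adjacent characters as a single edit. The autoFuzziness function follows the default thresholds of fuzziness "AUTO": terms of 1 to 2 characters must match exactly, 3 to 5 characters allow one edit, and longer terms allow two.

```javascript
// Classic Levenshtein distance, computed row by row.
const levenshtein = (a, b) => {
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i += 1) {
    const current = [i];
    for (let j = 1; j <= b.length; j += 1) {
      const substitution = previous[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      current[j] = Math.min(previous[j] + 1, current[j - 1] + 1, substitution);
    }
    previous = current;
  }
  return previous[b.length];
};

// Allowed number of edits for fuzziness "AUTO" with its default thresholds.
const autoFuzziness = (term) => {
  if (term.length <= 2) return 0;
  if (term.length <= 5) return 1;
  return 2;
};

const fuzzyMatches = (term, token) =>
  levenshtein(term, token) <= autoFuzziness(term);

// Simplified index: whole lowercase words only.
const indexedTokens = ["elephant", "in", "the", "room"];

// Term-level behaviour: the whole search string is compared as one term.
const termLevelHits = (searchTerm) =>
  indexedTokens.filter((token) => fuzzyMatches(searchTerm, token));

// Analyzed behaviour: the search string is lowercased and split first.
const analyzedHits = (searchTerm) =>
  searchTerm
    .toLowerCase()
    .split(/\s+/)
    .filter(Boolean)
    .flatMap((word) => indexedTokens.filter((token) => fuzzyMatches(word, token)));

console.log(levenshtein("five", "dive")); // 1
console.log(termLevelHits("elephnt rom")); // []
console.log(analyzedHits("elephnt rom")); // [ 'elephant', 'room' ]
```

The multi-word search finds nothing when treated as a single term, because no indexed token is within two edits of the whole string, while the analyzed version matches each misspelled word separately.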

Conclusion

There is no silver bullet when it comes to Elasticsearch configuration, and there is no definitive metric to determine whether the search is giving quality results. We have tweaked our configuration through trial and error and based on user feedback. We hope these techniques are useful for anyone who is looking to improve their search experience.
