---
title: "Improving search experience using Elasticsearch"
description:
  "Learn how we improved our search experience in NeetoCourse by tweaking
  Elasticsearch config."
canonical_url: "https://www.bigbinary.com/blog/elasticsearch-improvements"
markdown_url: "https://www.bigbinary.com/blog/elasticsearch-improvements.md"
---

# Improving search experience using Elasticsearch

Learn how we improved our search experience in NeetoCourse by tweaking
Elasticsearch config.

- Author: Sayooj Surendran
- Published: October 29, 2024
- Categories: Elasticsearch

We use Elasticsearch in [NeetoCourse](https://neeto.com/neetocourse) for our
searching needs. Recently, we have made some changes to the Elasticsearch config
to improve the search experience. In this blog, we will share the changes we
made and what we learned during the process.

## Definitions

These are some of the Elasticsearch terms we use in this blog.

- **Document:** A document in Elasticsearch is similar to a row in a database
  table. It is a collection of key-value pairs.

- **Index:** An index is a collection of documents. It is similar to a database
  table. Indexing is the process of creating the index, and we can configure
  each step of this process.

- **Analyzer:** An analyzer converts a string into a list of searchable tokens.
  An analyzer is made up of three building blocks: character filters, a
  tokenizer, and token filters.

- **Character Filter:** A character filter is a function that preprocesses the
  input string before tokenization by adding, removing, or changing characters.
  For example, it can strip HTML tags so that only the text content remains.

- **Tokenizer:** A tokenizer is a function that splits the input string into
  tokens. This can be based on whitespace, punctuation, or any other character.

- **Token Filter:** A token filter is a function that adds, removes, or modifies
  the tokens produced by the tokenizer. For example, it can be used to remove
  stop words (a, the, and, etc.) which serve no purpose in the search.
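
To make the three stages concrete, here is a minimal sketch of an analyzer as a plain JavaScript pipeline. It is a toy model for illustration only and not how Elasticsearch is implemented; the function names and the stop word list are assumptions made for this example.

```js
// Toy model of an analyzer: character filter -> tokenizer -> token filters.

// Character filter: replace anything that looks like an HTML tag with a space.
const stripHtml = (text) => text.replace(/<[^>]*>/g, " ");

// Tokenizer: split on runs of whitespace and drop empty strings.
const whitespaceTokenizer = (text) => text.split(/\s+/).filter(Boolean);

// Token filters: lowercase every token, then drop a few stop words.
const lowercase = (tokens) => tokens.map((token) => token.toLowerCase());
const STOP_WORDS = new Set(["a", "an", "the", "and"]);
const removeStopWords = (tokens) =>
  tokens.filter((token) => !STOP_WORDS.has(token));

const analyze = (text) => {
  const filtered = stripHtml(text);
  const tokens = whitespaceTokenizer(filtered);
  return removeStopWords(lowercase(tokens));
};

console.log(analyze("<p>The Elephant and the Room</p>"));
// [ 'elephant', 'room' ]
```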

## Analyzer

This is our analyzer setup:

```js
{
  default: {
    tokenizer: "whitespace",
    filter: ["lowercase", "autocomplete"]
  },
  search: {
    tokenizer: "whitespace",
    filter: ["lowercase"]
  },
  english_exact: {
    tokenizer: "whitespace",
    filter: [
      "lowercase"
    ]
  }
}
```

`default` is the analyzer used for indexing, and for searching when no other
analyzer is specified. The search terms from the user are passed through the
`search` analyzer, which skips the `autocomplete` filter. `english_exact` is the
analyzer used for exact matches.

### Tokenizer

By default, Elasticsearch uses the `standard` tokenizer, which splits the input
string on word boundaries defined by the Unicode Text Segmentation algorithm and
removes most punctuation. Since our content is mostly about technical concepts
and programming, where symbols often carry meaning, the `standard` tokenizer is
not a good fit. The `whitespace` tokenizer splits the input string only on
whitespace and leaves punctuation intact, which is suitable for our use case.
Hence, we use the `whitespace` tokenizer for all the analyzers.
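
The difference is easiest to see on text that contains symbols. The sketch below approximates the `whitespace` tokenizer locally and builds request bodies for the `_analyze` API so that the real output of both tokenizers can be compared. The sample sentence and the helper names are made up for this example, and the local function is only an approximation of what Elasticsearch does.

```js
// Local approximation of the `whitespace` tokenizer: split on whitespace only,
// leaving punctuation inside and around each token untouched.
const whitespaceTokens = (text) => text.split(/\s+/).filter(Boolean);

// Request body for the _analyze API (POST /_analyze). Sending one request per
// tokenizer shows what Elasticsearch actually produces for the same text.
const buildAnalyzeRequest = (tokenizer, text) => ({ tokenizer, text });

const sample = "Use #include in C++ and @Override in Java";

console.log(whitespaceTokens(sample));
console.log(buildAnalyzeRequest("standard", sample));
console.log(buildAnalyzeRequest("whitespace", sample));
```

With the `whitespace` tokenizer, tokens such as `#include`, `C++`, and `@Override` are kept as written, whereas the `standard` tokenizer is expected to drop the surrounding symbols.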

### Filter

The `lowercase` filter converts all the tokens to lowercase before storing them
in the index. This is a common requirement as we want the search to be
case-insensitive. We also use the custom `autocomplete` filter. Let's see its
definition:

```js
{
  autocomplete: {
    "type": "edge_ngram",
    "min_gram": 3,
    "max_gram": 20,
    "preserve_original": true
  }
}
```

The custom `autocomplete` filter is a configured instance of the
[edge_ngram](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenfilter.html)
token filter. Let's see the result of this filter when applied to the phrase
"Elephant in the room".

```js
{
  "tokens": [
    {
      "token": "Ele",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "Elep",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "Eleph",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "Elepha",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "Elephan",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "Elephant",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "in",
      "start_offset": 9,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "the",
      "start_offset": 12,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "roo",
      "start_offset": 16,
      "end_offset": 20,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "room",
      "start_offset": 16,
      "end_offset": 20,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
```

The `edge_ngram` filter creates [n-grams](https://en.wikipedia.org/wiki/N-gram)
anchored at the first character of each token. Since we specified `min_gram` as
3 and `max_gram` as 20, the filter emits every prefix of a token that is 3 to 20
characters long. Because `preserve_original` is `true`, tokens shorter than
`min_gram`, such as "in", are still kept. This means that the index contains a
token for every 3 to 20 character prefix of a word. If we start typing "ELEP",
the lowercased search term matches the indexed prefix "elep" and we get the
result "Elephant in the room".

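The prefix generation can be sketched in a few lines of JavaScript. This is a simplified model of the filter's behaviour for illustration, not the actual Elasticsearch implementation; the function name `edgeNgrams` and its option names are assumptions made for this example.

```js
// Simplified model of an edge n-gram token filter applied to a single token.
const edgeNgrams = (
  token,
  { minGram = 3, maxGram = 20, preserveOriginal = true } = {}
) => {
  const grams = [];
  const longest = Math.min(maxGram, token.length);
  // Emit every prefix whose length is between minGram and maxGram.
  for (let length = minGram; length <= longest; length++) {
    grams.push(token.slice(0, length));
  }
  // Keep tokens that fall outside the min/max range (for example "in").
  const outsideRange = token.length < minGram || token.length > maxGram;
  if (preserveOriginal && outsideRange) grams.push(token);
  return grams;
};

// Tokens as they would look after whitespace tokenization and lowercasing.
const indexedTokens = "elephant in the room"
  .split(" ")
  .flatMap((token) => edgeNgrams(token));

console.log(indexedTokens);
// [ 'ele', 'elep', 'eleph', 'elepha', 'elephan', 'elephant',
//   'in', 'the', 'roo', 'room' ]

// A partially typed search term matches one of the indexed prefixes.
console.log(indexedTokens.includes("ELEP".toLowerCase())); // true
```
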
## The Query

The query object in the search request is equally important for getting relevant
results. Elasticsearch sorts matching search results by _relevance score_, which
measures how well each document matches a query. The query clause contains the
criteria for calculating the relevance score. This is our query object:

```js
{
  bool: {
    should: [
      {
        simple_query_string: {
          query: requestBody.searchTerm,
          fields: ["content", "meta.pageTitle", "meta.chapterTitle"],
          quote_field_suffix: ".exact",
        },
      },
      {
        match: {
          content: {
            query: requestBody.searchTerm,
            ...baseFuzzyQueryConfig(),
          },
        },
      },
      {
        match: {
          "meta.pageTitle": {
            query: requestBody.searchTerm,
            ...baseFuzzyQueryConfig(),
            boost: 1.5,
          },
        },
      },
      {
        match: {
          "meta.chapterTitle": {
            query: requestBody.searchTerm,
            ...baseFuzzyQueryConfig(),
            boost: 1.5,
          },
        },
      },
    ],
    minimum_should_match: 1,
  },
};
```

The `bool` query combines several queries into a compound query. Its `should`
clause lists optional queries, and `minimum_should_match: 1` requires at least
one of them to match. `match` is the standard query used for full-text search.
`content`, `meta.pageTitle`, and `meta.chapterTitle` are the fields that we
created while indexing the data.

We have provided a `boost` value of 1.5 for page title and chapter title. This
is to make sure that a match in a page or chapter title gets a higher relevance
score than a match in the middle of the page content.
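
As a rough mental model, a `bool` query adds up the scores of its matching `should` clauses, and `boost` scales the contribution of a clause. The sketch below uses hypothetical per-clause scores to show the effect; real scores come from Elasticsearch's similarity function, and `combinedScore` is a made-up helper, not an Elasticsearch API.

```js
// Toy scoring model: sum the scores of matching clauses, scaled by boost.
// The numeric scores below are invented purely for illustration.
const combinedScore = (clauses) =>
  clauses
    .filter((clause) => clause.matched)
    .reduce((total, clause) => total + clause.score * (clause.boost ?? 1), 0);

// Document A matches the search term only in its page title.
const titleHit = combinedScore([
  { field: "content", matched: false, score: 0 },
  { field: "meta.pageTitle", matched: true, score: 2, boost: 1.5 },
]);

// Document B matches the search term only in its content.
const contentHit = combinedScore([
  { field: "content", matched: true, score: 2 },
  { field: "meta.pageTitle", matched: false, score: 0, boost: 1.5 },
]);

console.log(titleHit, contentHit); // 3 2
console.log(titleHit > contentHit); // true
```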

The `simple_query_string` query is used for exact matches. When the search term
contains double quotes, the quoted text is matched against the `.exact`
subfields set through `quote_field_suffix`, which use the `english_exact`
analyzer. The double quotes operator ( `"` ) is one of
[several operators](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html#simple-query-string-syntax)
that can be used in the `simple_query_string` query.
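
For `quote_field_suffix: ".exact"` to have an effect, each searched field needs a matching `.exact` subfield. The actual mapping is not shown in this post, so the following is only a sketch of what such a mapping could look like, assuming multi-fields analyzed with `english_exact`. The helper names `textFieldWithExact` and `resolveField` are hypothetical.

```js
// Hypothetical mapping: every text field gets an `.exact` subfield that uses
// the `english_exact` analyzer instead of the n-gram based `default` one.
const textFieldWithExact = () => ({
  type: "text",
  analyzer: "default",
  search_analyzer: "search",
  fields: {
    exact: { type: "text", analyzer: "english_exact" },
  },
});

const mappings = {
  properties: {
    content: textFieldWithExact(),
    meta: {
      properties: {
        pageTitle: textFieldWithExact(),
        chapterTitle: textFieldWithExact(),
      },
    },
  },
};

// Which field a part of the search term is matched against, depending on
// whether that part was wrapped in double quotes.
const resolveField = (field, isQuoted, suffix = ".exact") =>
  isQuoted ? `${field}${suffix}` : field;

console.log(resolveField("content", true)); // content.exact
console.log(resolveField("meta.pageTitle", false)); // meta.pageTitle
```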

### Fuzzy searching

We also use fuzzy searching in the `match` query. Fuzzy searching helps in
giving proper results even if there are typos in the search term. Elasticsearch
uses [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance)
to calculate the similarity between the search term and the indexed data.
Previously, we used the
[fuzzy query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html)
to implement fuzzy searching.

```js
{
  // Term-level `fuzzy` query: the search term is used as-is.
  fuzzy: {
    content: {
      value: requestBody.searchTerm,
      ...extendedFuzzyQueryConfig(),
    },
  },
},
```

But this caused several issues:

- Fuzzy results being prioritized over exact matches. For example, searching for
  "five" returned results for "dive", even when "five" was present in the
  content.
- Fuzzy results not being returned when the search term contained multiple
  words.

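Upon investigation, we found that the `fuzzy` query is a term-level query and
does not perform analysis on the search term. A multi-word search term is
therefore compared as a single string against the individual tokens in the
index, which rarely matches. Instead, we now use the `match` query with the
`fuzziness` parameter, which analyzes the search term first and then applies
fuzzy matching to each resulting token.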

```js
// Shared fuzziness options for every `match` clause.
const baseFuzzyQueryConfig = () => ({
  prefix_length: 0,
  fuzziness: "AUTO",
});
// ...
{
  match: {
    content: {
      query: requestBody.searchTerm,
      ...baseFuzzyQueryConfig(),
    },
  },
},
```
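
To illustrate why analysis matters for fuzzy matching, here is a small self-contained sketch. It uses a plain Levenshtein distance and the documented `AUTO` thresholds (terms of 0 to 2 characters must match exactly, 3 to 5 characters allow one edit, longer terms allow two). It is a simplified model: Elasticsearch also counts transpositions as a single edit by default, and the indexed tokens and search term below are made up for this example.

```js
// Classic dynamic-programming Levenshtein distance
// (insertions, deletions, and substitutions each cost 1).
const levenshtein = (a, b) => {
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1, // deletion
        current[j - 1] + 1, // insertion
        previous[j - 1] + cost // substitution
      );
    }
    previous = current;
  }
  return previous[b.length];
};

// Allowed edits for fuzziness "AUTO", based on the length of the term.
const autoFuzziness = (term) =>
  term.length <= 2 ? 0 : term.length <= 5 ? 1 : 2;

const fuzzyMatches = (term, indexed) =>
  levenshtein(term, indexed) <= autoFuzziness(term);

const indexedTokens = ["elephant", "in", "the", "room"];
const searchTerm = "elephnt room";

// Term-level behaviour: the whole search term is compared with each token.
const termLevelHit = indexedTokens.some((token) =>
  fuzzyMatches(searchTerm, token)
);

// Analyzed behaviour: the search term is split into tokens first.
const analyzedHit = searchTerm
  .split(/\s+/)
  .some((word) => indexedTokens.some((token) => fuzzyMatches(word, token)));

console.log(levenshtein("five", "dive")); // 1
console.log(termLevelHit); // false
console.log(analyzedHit); // true
```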

## Conclusion

There is no silver bullet when it comes to Elasticsearch configuration, and
there is no single metric to determine whether the search is giving quality
results. We have tweaked our configuration through trial and error and user
feedback. We hope these techniques are useful for anyone who is looking to
improve their search experience.

## Links

- [Human page](https://www.bigbinary.com/blog/elasticsearch-improvements)
