We use Elasticsearch in NeetoCourse for our search needs. Recently, we made some changes to our Elasticsearch configuration to improve the search experience. In this blog, we will share the changes we made and what we learned in the process.
Here are some Elasticsearch terms that we use in this blog.
Document: A document in Elasticsearch is similar to a row in a database table. It is a collection of key-value pairs.
Index: An index is a collection of documents. It is similar to a database table. Indexing is the process of adding documents to the index, and we can configure each step of this process.
Analyzer: An analyzer converts a string into a list of searchable tokens. It is made up of three building blocks: character filters, a tokenizer, and token filters.
Character Filter: A character filter preprocesses the input string before it is tokenized. For example, it can strip HTML tags and keep only the text.
Tokenizer: A tokenizer is a function that splits the input string into tokens. This can be based on whitespace, punctuation, or any other character.
Token Filter: A token filter modifies, adds, or removes the tokens produced by the tokenizer. For example, it can remove stop words (a, the, and, etc.), which serve no purpose in the search.
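To make these three building blocks concrete, here is a small sketch that runs them together through Elasticsearch's _analyze API. It assumes an Elasticsearch instance on localhost:9200 without authentication; the sample text and the chosen filters are arbitrary.

// Chains a character filter, a tokenizer, and a token filter on a sample string.
const response = await fetch("http://localhost:9200/_analyze", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    char_filter: ["html_strip"], // character filter: strips HTML tags
    tokenizer: "whitespace",     // tokenizer: splits on whitespace
    filter: ["lowercase"],       // token filter: lowercases every token
    text: "<p>Ruby on Rails</p>",
  }),
});
const { tokens } = await response.json();
console.log(tokens.map((t: { token: string }) => t.token)); // ["ruby", "on", "rails"]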
This is our analyzer setup:
{
  default: {
    tokenizer: "whitespace",
    filter: ["lowercase", "autocomplete"]
  },
  search: {
    tokenizer: "whitespace",
    filter: ["lowercase"]
  },
  english_exact: {
    tokenizer: "whitespace",
    filter: ["lowercase"]
  }
}
default is the analyzer used when documents are indexed. The search terms entered by the user are passed through the search analyzer, and english_exact is the analyzer used for exact matches.
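For context, this is roughly how analyzers like these could be wired into an index. The mapping below is an illustrative sketch, not our exact schema: an analyzer named default is picked up automatically at index time, while the search analyzer and the exact sub-field have to be referenced from the field mapping.

// Illustrative index settings and mapping; the "content" field here is hypothetical.
const indexConfig = {
  settings: {
    analysis: {
      filter: {
        autocomplete: {
          type: "edge_ngram",
          min_gram: 3,
          max_gram: 20,
          preserve_original: true,
        },
      },
      analyzer: {
        default: { tokenizer: "whitespace", filter: ["lowercase", "autocomplete"] },
        search: { tokenizer: "whitespace", filter: ["lowercase"] },
        english_exact: { tokenizer: "whitespace", filter: ["lowercase"] },
      },
    },
  },
  mappings: {
    properties: {
      content: {
        type: "text",
        analyzer: "default",        // used while indexing
        search_analyzer: "search",  // used on the query text
        fields: {
          exact: { type: "text", analyzer: "english_exact" }, // backs exact matches
        },
      },
    },
  },
};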
By default, Elasticsearch uses the standard tokenizer, which splits the input string on word boundaries and drops most punctuation. Since our content is mostly about technical concepts and programming, where punctuation and symbols often carry meaning, the standard tokenizer does not work well for us. The whitespace tokenizer splits the input string only on whitespace, which is suitable for our use case. Hence, we use the whitespace tokenizer for all the analyzers.
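To see the difference, the _analyze API can be used to compare the two tokenizers. The snippet below is a sketch that assumes a local Elasticsearch and uses an arbitrary code-like input.

// Analyzes the same text with two different tokenizers and returns the tokens.
const analyze = async (tokenizer: string, text: string) => {
  const response = await fetch("http://localhost:9200/_analyze", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ tokenizer, text }),
  });
  const { tokens } = await response.json();
  return tokens.map((t: { token: string }) => t.token);
};

console.log(await analyze("standard", "C++ != C"));   // ["C", "C"] - punctuation is lost
console.log(await analyze("whitespace", "C++ != C")); // ["C++", "!=", "C"] - code-like terms survive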
The lowercase filter is used to convert all the tokens to lowercase before storing them in the index. This is a common requirement, as we want the search to be case-insensitive. We also use a custom autocomplete filter. Let's see its definition:
{
  autocomplete: {
    "type": "edge_ngram",
    "min_gram": 3,
    "max_gram": 20,
    "preserve_original": true
  }
}
The custom autocomplete filter is an implementation of the edge_ngram token filter. Let's see the result of this filter when applied to the phrase "Elephant in the room".
{
  "tokens": [
    { "token": "Ele",      "start_offset": 0,  "end_offset": 8,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "Elep",     "start_offset": 0,  "end_offset": 8,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "Eleph",    "start_offset": 0,  "end_offset": 8,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "Elepha",   "start_offset": 0,  "end_offset": 8,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "Elephan",  "start_offset": 0,  "end_offset": 8,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "Elephant", "start_offset": 0,  "end_offset": 8,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "in",       "start_offset": 9,  "end_offset": 11, "type": "<ALPHANUM>", "position": 1 },
    { "token": "the",      "start_offset": 12, "end_offset": 15, "type": "<ALPHANUM>", "position": 2 },
    { "token": "roo",      "start_offset": 16, "end_offset": 20, "type": "<ALPHANUM>", "position": 3 },
    { "token": "room",     "start_offset": 16, "end_offset": 20, "type": "<ALPHANUM>", "position": 3 }
  ]
}
The edge_ngram filter creates n-grams anchored to the first character of each token. Since we specified min_gram as 3 and max_gram as 20, the filter emits every prefix of a word that is between 3 and 20 characters long, and each of these prefixes is stored as a searchable term in the index. So if a user starts typing "Elep", the index already contains a matching term (lowercased to "elep" by the preceding lowercase filter), and we will get the result "Elephant in the room".
The query object in the search request is equally important for getting relevant results. Elasticsearch sorts matching search results by relevance score, which measures how well each document matches a query. The query clause contains the criteria for calculating the relevance score. This is our query object:
{
  bool: {
    should: [
      {
        simple_query_string: {
          query: requestBody.searchTerm,
          fields: ["content", "meta.pageTitle", "meta.chapterTitle"],
          quote_field_suffix: ".exact",
        },
      },
      {
        match: {
          content: {
            query: requestBody.searchTerm,
            ...baseFuzzyQueryConfig(),
          },
        },
      },
      {
        match: {
          "meta.pageTitle": {
            query: requestBody.searchTerm,
            ...baseFuzzyQueryConfig(),
            boost: 1.5,
          },
        },
      },
      {
        match: {
          "meta.chapterTitle": {
            query: requestBody.searchTerm,
            ...baseFuzzyQueryConfig(),
            boost: 1.5,
          },
        },
      },
    ],
    minimum_should_match: 1,
  },
}
The bool and should clauses are used to create a compound query. The should clause means that at least one of the listed queries should match. match is the standard query used for full text search. content, meta.pageTitle, and meta.chapterTitle are the fields that we created while indexing the data.
We have provided a boost value of 1.5 for the page title and chapter title. This makes sure that a match in a page or chapter title gets a higher relevance score than a match somewhere in the middle of the page content.
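For completeness, a query object like this is posted to the index's _search endpoint. The sketch below uses fetch directly; the index name neeto_course_pages is an illustrative placeholder rather than our real index, and a local unauthenticated Elasticsearch is assumed.

// Sends the query object to _search; hits come back sorted by _score,
// so boosted title matches surface first.
const search = async (query: object) => {
  const response = await fetch("http://localhost:9200/neeto_course_pages/_search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  });
  const { hits } = await response.json();
  return hits.hits;
};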
The simple_query_string query is used for exact matches: when the search term contains double quotes, the english_exact analyzer is used. The double quote operator (") is one of several operators that can be used in the simple_query_string query.
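For example, when the user wraps the search term in double quotes, the quoted part is matched against the corresponding .exact sub-fields, which use the english_exact analyzer. The sketch below uses an arbitrary phrase and assumes those sub-fields exist in the mapping.

// The double quotes trigger an exact match, and quote_field_suffix reroutes
// the quoted text to the ".exact" sub-fields (e.g. content.exact).
const exactQuery = {
  simple_query_string: {
    query: '"linked list"',
    fields: ["content", "meta.pageTitle", "meta.chapterTitle"],
    quote_field_suffix: ".exact",
  },
};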
We also use fuzzy searching in the match query. Fuzzy searching helps in giving proper results even if there are typos in the search term. Elasticsearch uses the Levenshtein distance to calculate the similarity between the search term and the indexed data.
Previously, we used the fuzzy query to implement fuzzy searching.
{
  content: {
    value: requestBody.searchTerm, // the fuzzy query takes a "value" and matches it as-is
    ...extendedFuzzyQueryConfig(),
  },
},
But this caused several issues. Upon investigation, we found that the fuzzy query is a term-level query and does not analyze the search term, so the query text never goes through filters such as lowercase. Instead, we now use the match query with the fuzziness parameter, which analyzes the search term before performing the fuzzy match.
const baseFuzzyQueryConfig = () => ({
  prefix_length: 0,   // number of leading characters that must match exactly
  fuzziness: "AUTO",  // allowed edit distance: 0 for 1-2 char terms, 1 for 3-5, 2 for longer
});

...

{
  content: {
    query: requestBody.searchTerm,
    ...baseFuzzyQueryConfig(),
  },
},
There is no silver bullet when it comes to Elasticsearch configuration, and there is no single metric to determine whether the search is giving quality results. We have tweaked our configuration based on trial and error and on user feedback. We hope these techniques are useful for anyone looking to improve their search experience.
If this blog was helpful, check out our full blog archive.