Sort strings alphabetically rather than lexicographically in Elasticsearch?

Let’s say we have a text field “name” in an Elasticsearch index with the following values: Siddhant, SIDTECHNOCRAT, and sid.

The usual convention, described in String Sorting in Elasticsearch, is to sort on a version of the field that is not analyzed, such as the name.keyword sub-field used here.

I am assuming that you’ve followed that convention.

For the demo I am using Elasticsearch 6.4.1.

Let’s index the names:

PUT /my_index/_doc/1
{ "name": "Siddhant" }

PUT /my_index/_doc/2
{ "name": "SIDTECHNOCRAT" }

PUT /my_index/_doc/3
{ "name": "sid" }

Let’s sort the names:

GET /my_index/_doc/_search?sort=name.keyword

Output:

SIDTECHNOCRAT 
Siddhant 
sid

Wait! Weren’t you expecting the results in the order sid, Siddhant, SIDTECHNOCRAT?

You’re getting the results in the above order because the bytes used to represent capital letters have a lower value than the bytes used to represent lowercase letters (in ASCII, “S” is 83 and “s” is 115, “I” is 73 and “i” is 105). Elasticsearch compares the values of an unanalyzed field byte by byte, which is why the names are sorted with the lowest bytes first.

In other words, we’re getting the results in lexicographical order, which is perfectly fine for a machine but does not make much sense to human beings, who expect the results in alphabetical order.
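The byte-order behaviour can be reproduced outside Elasticsearch. The following is a minimal Python sketch, assuming the keyword sub-field compares the UTF-8 bytes of each value; the function name byte_order_sort is made up for this illustration and is not part of any Elasticsearch API.

```python
# The three names indexed in the demo above.
names = ["Siddhant", "SIDTECHNOCRAT", "sid"]


def byte_order_sort(values):
    """Sort strings by their raw UTF-8 bytes, lowest byte first.

    Assumption: this mimics how an unanalyzed (keyword) field is ordered.
    """
    return sorted(values, key=lambda s: s.encode("utf-8"))


if __name__ == "__main__":
    # Uppercase letters have smaller byte values than lowercase ones.
    print(ord("S"), ord("s"))  # 83 115
    print(ord("I"), ord("i"))  # 73 105
    print(byte_order_sort(names))
    # ['SIDTECHNOCRAT', 'Siddhant', 'sid']
```

“SIDTECHNOCRAT” comes before “Siddhant” because the two differ at the second character, and “I” has a lower byte value than “i”.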

If you want the results sorted in alphabetical order, you should index each name in a form that ignores case.

To achieve this, create a custom analyzer that combines the keyword tokenizer with the lowercase token filter.

Then configure the field you want to sort on with the custom analyzer. If my_index still exists from the first demo, delete it first with DELETE /my_index, because PUT /my_index fails when the index already exists:

PUT /my_index
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "custom_keyword_analyzer" : {
          "tokenizer" : "keyword",
          "filter" : ["lowercase"]
        }
      }
    }
  },
  "mappings" : {
    "_doc" : {
      "properties" : {
        "name" : {
          "type" : "text",
          "fields" : {
            "raw" : {
              "type" : "text",
              "analyzer" : "custom_keyword_analyzer",
              "fielddata": true
            }
          }
        }
      }
    }
  }
}
  • The keyword tokenizer treats the whole string as a single token instead of splitting it up.
  • The lowercase filter converts that token to lowercase.
  • custom_keyword_analyzer is applied to the multi-field name.raw, which is the field we sort on. Sorting on a text field requires fielddata, which is why the mapping sets "fielddata": true.
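The effect of the analyzer on the sort order can be sketched in Python. This is only an approximation under stated assumptions: the helper names analyze and alphabetical_sort are hypothetical, and Python’s str.lower() stands in for the lowercase token filter, which may differ from it for some non-ASCII characters.

```python
# The three names indexed in the demo.
names = ["Siddhant", "SIDTECHNOCRAT", "sid"]


def analyze(value):
    """Approximate custom_keyword_analyzer.

    The keyword tokenizer keeps the whole string as one token, and the
    lowercase filter lowercases that token (str.lower() is a stand-in).
    """
    return value.lower()


def alphabetical_sort(values):
    """Order the original values by the bytes of their analyzed form."""
    return sorted(values, key=lambda s: analyze(s).encode("utf-8"))


if __name__ == "__main__":
    print([analyze(n) for n in names])
    # ['siddhant', 'sidtechnocrat', 'sid']
    print(alphabetical_sort(names))
    # ['sid', 'Siddhant', 'SIDTECHNOCRAT']
```

The original values are returned unchanged; only the comparison key is lowercased, which matches the idea of sorting on name.raw while still displaying name.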

Index your data:

POST my_index/_doc/1
{ "name" : "Siddhant" }

POST my_index/_doc/2
{ "name" : "SIDTECHNOCRAT" }

POST my_index/_doc/3
{ "name" : "sid" }

Perform sort:

GET my_index/_doc/_search?sort=name.raw

Output:

sid 
Siddhant 
SIDTECHNOCRAT

Bingo! You’ve got what you were expecting.

