String sorting in Elasticsearch

We should not sort on an analyzed text field; instead, we should sort on a not_analyzed field.

Let’s understand this with an example:

Index some documents with a text field “name”.

POST my_index/_doc/1
{
  "name" : "technocrat sid"
}

POST my_index/_doc/2
{
  "name" : "siddhant01"
}

POST my_index/_doc/3
{
  "name" : "sid 01"
}

POST my_index/_doc/4
{
  "name" : "agnihotry siddhant"
}

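Let’s sort the results in ascending order on the name field. (On Elasticsearch 5.x and later, sorting on a text field like this only works if fielddata is enabled for it; by default the request is rejected.)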

GET my_index/_search
{
  "sort": [
    {
      "name": {
        "order": "asc"
      }
    }
  ]
}

We get the results in the order:

sid 01

agnihotry siddhant

technocrat sid

siddhant01

Wait! Why did we not get the results in alphabetical order? We were expecting something like this:

agnihotry siddhant

sid 01

siddhant01

technocrat sid

The reason we did not get the results in the above order:

As we haven’t specified an index mapping beforehand, we are relying on the default mapping. In this case, the text field above is analyzed with the Standard Analyzer by default, which mainly splits the text into terms on word boundaries (such as spaces) and lowercases them.

For example, analyzing “agnihotry siddhant” produces two terms: “agnihotry” and “siddhant”.

This means that when we index the text, it is stored as tokens:

text --> tokens 
technocrat sid --> technocrat, sid 
siddhant01 --> siddhant01 
sid 01 --> sid, 01 
agnihotry siddhant --> agnihotry, siddhant
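
To see these tokens for yourself, you can pass a sample string to the _analyze API. A minimal sketch, assuming the standard analyzer and using one of the names indexed above as the sample text:

```console
POST _analyze
{
  "analyzer": "standard",
  "text": "agnihotry siddhant"
}
```

The response should list two tokens, agnihotry and siddhant, each with its position and character offsets.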

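When a field holds several terms per document, an ascending sort compares the smallest term of each document: “01”, “agnihotry”, “sid” and “siddhant01” here, which is exactly the order we got. But we probably want to sort alphabetically on the first term, then on the second term, and so forth. For that, we should consider the text as a whole instead of splitting it into tokens.

That is, we should treat “technocrat sid”, “sid 01” and “agnihotry siddhant” each as a single value, which means we should not analyze the text field.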

How do we not analyze a text field?

Before Elasticsearch 5.x

Before Elasticsearch 5.x, text fields were mapped with the string type. To treat a string field as a whole, it must not be analyzed, but we still need to run full-text queries on that same field.

So what we really want is to index the same field in two different ways, so that we can both sort and search on the same string field.

We can do this using a multi-field mapping:

"name": {
  "type": "string",
  "fields": {
    "raw": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}

The main name field is the same as before: an analyzed full-text field. The new name.raw sub-field is not_analyzed.

That means we can use the name field for search and the name.raw field for sorting:

GET my_index/_search
{
  "sort": [
    {
      "name.raw": {
        "order": "asc"
      }
    }
  ]
}
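
Since both fields are built from the same source value, a single request can search on one and sort on the other. A rough sketch, assuming the index was created with the multi-field mapping before the documents were indexed; the search term sid is only an illustrative choice:

```console
GET my_index/_search
{
  "query": {
    "match": {
      "name": "sid"
    }
  },
  "sort": [
    {
      "name.raw": {
        "order": "asc"
      }
    }
  ]
}
```

With the four documents above, this should match “sid 01” and “technocrat sid” and return them in that order, because the query runs against the analyzed terms while the sort compares the whole strings.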

Elasticsearch 5.x and later

In Elasticsearch 5.x, the string type has been replaced by two new types: text, which should be used for full-text search, and keyword, which should be used for exact values and therefore for sorting.

For instance, if you index the following document:

{
  "name": "sid"
}

Then the following dynamic mappings will be created:

{
  "name": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}
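
You can check what Elasticsearch generated for your own index by asking for its mapping. A small sketch, where my_index is the index name used throughout this post:

```console
GET my_index/_mapping
```

The response nests the field definition shown above under the index name and its mappings, and the exact shape varies a little between versions.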

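So from ES 5.x onward, you don’t have to specify not_analyzed explicitly: a dynamically mapped string field automatically gets a keyword sub-field. Note that with ignore_above set to 256, values longer than 256 characters are not indexed in name.keyword.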

You can use name.keyword for sorting:

GET my_index/_search
{
  "sort": [
    {
      "name.keyword": {
        "order": "asc"
      }
    }
  ]
}
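
If you would rather define the mapping up front than rely on dynamic mapping, the same idea can be written explicitly. A minimal sketch, assuming Elasticsearch 7.x or later (where mappings no longer take a type name) and an index that does not exist yet; the sub-field name raw is just a choice that mirrors the older example:

```console
PUT my_index
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
```

With this mapping you would sort on name.raw instead of name.keyword, and since ignore_above is not set here, long values are not skipped.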