Elasticsearch 6.x Analyzers

An Elasticsearch analyzer is a wrapper around three functions:
  • Character filter: Mainly used to strip out unwanted characters or replace some characters before tokenization.
  • Tokenizer: Breaks the text into individual tokens (or words) based on certain criteria such as whitespace or n-grams.
  • Token filter: Receives the individual tokens from the tokenizer and applies filters to them (for example, changing uppercase terms to lowercase).

In a nutshell, an analyzer tells Elasticsearch how a text or phrase should be indexed and searched.
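To make the three stages concrete, here is a minimal sketch in plain Python. It does not call Elasticsearch at all; the function names (strip_html, whitespace_tokenizer, lowercase_filter, analyze) are made up for illustration and only mimic the order in which an analyzer applies its parts.

```python
import re
from typing import Callable, List


def strip_html(text: str) -> str:
    """Character filter (illustrative): remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer (illustrative): split the text on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Token filter (illustrative): lowercase every token."""
    return [token.lower() for token in tokens]


def analyze(
    text: str,
    char_filters: List[Callable[[str], str]],
    tokenizer: Callable[[str], List[str]],
    token_filters: List[Callable[[List[str]], List[str]]],
) -> List[str]:
    """Run the stages in analyzer order: char filters, then tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<b>Hi</b> there I am sid",
    [strip_html],
    whitespace_tokenizer,
    [lowercase_filter],
)
print(tokens)  # ['hi', 'there', 'i', 'am', 'sid']
```

The character filter sees the raw string, the tokenizer sees the cleaned string, and the token filter only ever sees the list of tokens.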

Why do we need analyzers?

Analyzers are generally used when you want to index a text or phrase. Breaking the text into words lets you search on individual terms to find the document.

Example: Let’s say you have an index (my_index) with a field “intro” and you index a document:

{ "intro" : "Hi there I am sid" }

The following requests are performed in Kibana:

  • Create an index my_index:
PUT my_index
  • Put index mapping:
PUT my_index/_mapping/doc
{
  "properties": {
   "intro" : {
    "type": "keyword",
    "index": true
   }
  }
}
  • Index data:
POST my_index/doc/1
{
  "intro": "Hi there I am sid"
}

The keyword type is not analyzed, so the above text “Hi there I am sid” is indexed as is, i.e. it is stored as a single term and not split into tokens.

If you want to query the above document, you have to supply the complete phrase, i.e. (find documents where intro = “Hi there I am sid”).

This query returns the indexed document:

GET my_index/_search
{
  "query": {
   "match": {
    "intro": "Hi there I am sid"
   }
  }
}

But this one will not, because “Hi there” is not equal to the whole indexed value:

GET my_index/_search
{
  "query": {
   "match": {
    "intro": "Hi there"
   }
  }
}
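The behaviour of the two queries above can be sketched in plain Python. This is only a toy model, not how Elasticsearch is implemented: keyword_analyzer and matches are hypothetical helpers that treat the whole field value as a single term, both at index time and at query time.

```python
from typing import Iterable, List, Set


def keyword_analyzer(text: str) -> List[str]:
    """Toy model of a keyword field: the whole value is kept as one term."""
    return [text]


def matches(indexed_terms: Set[str], query_terms: Iterable[str]) -> bool:
    """True if any query term is exactly equal to an indexed term."""
    return any(term in indexed_terms for term in query_terms)


# Index time: the field value becomes exactly one term.
indexed = set(keyword_analyzer("Hi there I am sid"))

# Query time: the query string is also kept as one term.
full_hit = matches(indexed, keyword_analyzer("Hi there I am sid"))
partial_hit = matches(indexed, keyword_analyzer("Hi there"))

print(full_hit, partial_hit)  # True False
```

Because the only indexed term is the complete sentence, anything shorter than the full value has nothing to match against.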

But if the phrase is indexed as tokens, then even a query for a single token (find documents where intro = "sid") returns the document. Here my_index2 has no explicit mapping, so Elasticsearch dynamically maps intro as a text field:

POST my_index2/doc/1
{
  "intro": "Hi there I am sid"
}

Note: By default, the standard analyzer is used for all text fields. It provides grammar-based tokenization and lowercases the tokens.

GET my_index2/_search
{
  "query": {
   "match": {
    "intro": "sid"
   }
  }
}

The above query will return the document.
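Here is a rough Python sketch of why the token query succeeds. The simple_standard_analyzer function is only an approximation of the standard analyzer (word splitting plus lowercasing), and inverted_index, index_document and search are hypothetical names for a toy term-to-document lookup, not Elasticsearch APIs.

```python
import re
from typing import Dict, List, Set


def simple_standard_analyzer(text: str) -> List[str]:
    """Rough stand-in for the standard analyzer: extract word tokens and lowercase them."""
    return [token.lower() for token in re.findall(r"\w+", text)]


# Toy inverted index: term -> set of document ids containing that term.
inverted_index: Dict[str, Set[int]] = {}


def index_document(doc_id: int, text: str) -> None:
    """Analyze the text and register the document under every resulting term."""
    for term in simple_standard_analyzer(text):
        inverted_index.setdefault(term, set()).add(doc_id)


def search(query: str) -> Set[int]:
    """Analyze the query and return documents containing any of its terms."""
    hits: Set[int] = set()
    for term in simple_standard_analyzer(query):
        hits |= inverted_index.get(term, set())
    return hits


index_document(1, "Hi there I am sid")

print(search("sid"))       # {1}
print(search("Hi there"))  # {1}
print(search("bob"))       # set()
```

Since the document is stored under each of its tokens (hi, there, i, am, sid), a query for any one of them finds it.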

Hope this is helpful!

Reference: My Stack Overflow answer.