Elasticsearch 6.x Analyzers

An Elasticsearch analyzer wraps three building blocks, applied in this order:
  • Character filter: Strips unwanted characters from the raw text or replaces them with others.
  • Tokenizer: Breaks the text into individual tokens (or words) according to a rule such as whitespace or n-grams.
  • Token filter: Receives the individual tokens from the tokenizer and modifies them (for example, changing uppercase terms to lowercase).

In a nutshell, an analyzer tells Elasticsearch how a text or phrase should be indexed and searched.
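To make the three stages concrete, here is a toy sketch in plain Java. It is not how Elasticsearch implements analysis; the class name AnalyzerSketch and its methods are made up for illustration, and the rules (replace punctuation, split on whitespace, lowercase) are just one possible choice for each stage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model of the three analyzer stages (hypothetical names, not the Elasticsearch API).
class AnalyzerSketch {

    // Character filter: replace everything except letters, digits and whitespace with a space.
    static String charFilter(String text) {
        return text.replaceAll("[^\\p{L}\\p{N}\\s]", " ");
    }

    // Tokenizer: split the filtered text on runs of whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Token filter: lowercase every token.
    static List<String> lowercase(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            result.add(token.toLowerCase(Locale.ROOT));
        }
        return result;
    }

    // The "analyzer": character filter, then tokenizer, then token filter.
    static List<String> analyze(String text) {
        return lowercase(tokenize(charFilter(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("Hi there, I am sid!")); // prints [hi, there, i, am, sid]
    }
}
```

The order matters: the character filter sees the raw string, the tokenizer sees the filtered string, and the token filter only ever sees individual tokens.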

Why do we need analyzers?

Analyzers are used when you index a text or phrase. Breaking the text into words lets you search on individual terms to find the document.

Example: Let’s say you have an index (my_index) with a field “intro” and you index a document:

{ "intro": "Hi there I am sid" }

The following requests are performed in Kibana:

  • Create an index my_index:
PUT my_index
  • Put index mapping:
PUT my_index/_mapping/doc
{
  "properties": {
   "intro" : {
    "type": "keyword",
    "index": true
   }
  }
}
  • Index data:
POST my_index/doc/1
{
  "intro": "Hi there I am sid"
}

The keyword type is not analyzed, so the text “Hi there I am sid” is indexed as is, i.e. it is not split into tokens.

If you want to query the above document, you have to supply the complete phrase, i.e. (find documents where intro = “Hi there I am sid”).

The following query returns the indexed document:

GET my_index/_search
{
  "query": {
   "match": {
    "intro": "Hi there I am sid"
   }
  }
}

But this will not:

GET my_index/_search
{
  "query": {
   "match": {
    "intro": "Hi there"
   }
  }
}

But if the phrase is indexed as tokens, then even a query for a single token (find documents where intro = “sid”) returns the document.

POST my_index2/doc/1
{
  "intro": "Hi there I am sid"
}

Note: my_index2 has no explicit mapping, so “intro” is mapped dynamically as a text field. By default, the standard analyzer is used for all text fields; it provides grammar-based tokenization.

GET my_index2/_search
{
  "query": {
   "match": {
    "intro": "sid"
   }
  }
}

The above query will return the document.
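The difference between the two indices can be sketched with a toy inverted index in plain Java. This is only an illustration under simplifying assumptions: the class InvertedIndexSketch is hypothetical, "analysis" is reduced to lowercasing and splitting on whitespace, and a query matches when at least one of its terms is present.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy inverted index (hypothetical, not the Elasticsearch API).
class InvertedIndexSketch {
    // term -> ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final boolean analyzed;

    // analyzed = false behaves like a keyword field, true like a text field
    InvertedIndexSketch(boolean analyzed) {
        this.analyzed = analyzed;
    }

    // Keyword: the whole value is a single term. Text: one lowercase term per word.
    private String[] terms(String text) {
        return analyzed
                ? text.toLowerCase(Locale.ROOT).trim().split("\\s+")
                : new String[] { text };
    }

    void index(int docId, String text) {
        for (String term : terms(text)) {
            postings.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
        }
    }

    // The query goes through the same term extraction as the indexed value.
    Set<Integer> match(String query) {
        Set<Integer> hits = new HashSet<>();
        for (String term : terms(query)) {
            Set<Integer> docs = postings.get(term);
            if (docs != null) {
                hits.addAll(docs);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        InvertedIndexSketch keywordField = new InvertedIndexSketch(false);
        keywordField.index(1, "Hi there I am sid");
        System.out.println(keywordField.match("Hi there"));          // no hit: only the full value is a term
        System.out.println(keywordField.match("Hi there I am sid")); // hit: exact value

        InvertedIndexSketch textField = new InvertedIndexSketch(true);
        textField.index(1, "Hi there I am sid");
        System.out.println(textField.match("sid"));                  // hit: "sid" is one of the tokens
    }
}
```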

Hope this is helpful!

Reference: My Stack Overflow answer.

Why doesn’t ES 6.x allow multiple types?

Before Elasticsearch 6.x, the analogy with relational databases was:

Relational DB ⇒ Databases ⇒ Tables ⇒ Rows ⇒ Columns
Elasticsearch ⇒ Indices ⇒ Types ⇒ Documents ⇒ Fields

which led to incorrect assumptions.

SQL tables are independent of each other. If two tables have a column with the same name, the columns are stored separately and can even have different definitions (e.g. Table_1 and Table_2 both have a column “date” that means something different in each table). This is not the case for Elasticsearch mapping types. Internally, fields that have the same name in different mapping types of one index are stored as the same Lucene field, which implies that both fields must have the same mapping definition. This breaks the analogy mentioned above.

So, to get rid of this misleading analogy, ES 6.x doesn’t allow more than one mapping type per index. There are also plans to remove _type altogether in upcoming versions.

Question: How do you differentiate documents within the same index then?

Answer: You can do this in the following ways:

  • Add a custom type field to the index mapping.
  • Create a separate index for each type.

How to re-index an index in Elasticsearch using Java?

To re-index an index using Java, build a re-index request with the ReindexRequestBuilder API, for example:

// client is an existing Elasticsearch Client instance
ReindexRequestBuilder reindexRequest =
    new ReindexRequestBuilder(client, ReindexAction.INSTANCE)
        .source("source_index")
        .destination("destination_index")
        .refresh(true);

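After creating the request, execute it. Calling get() on the builder runs the request and blocks until the response arrives:

BulkByScrollResponse response = reindexRequest.get();

To validate that the request succeeded, check the response rather than calling execute() a second time (which would start another re-index):

if (response.getBulkFailures().isEmpty()) {
    System.out.println("Request is executed, documents created: " + response.getCreated());
}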

Bingo! Your index is re-indexed.

Reflection in Java

Reflection is a powerful feature of Java that provides the ability to inspect and modify a program's classes, fields and methods at run time (that is, to manipulate the internal properties of the program).

For example, a Java class can obtain the names of all its members and display them. We can also use reflection to instantiate an object, invoke its methods and change field values.

How it is done?

For every class it loads, the JVM creates a Class object, which reflection uses to get the run-time properties of objects of that class; once we have access, we can change those properties. Reflection is not something used in everyday programming tasks, as it also has drawbacks. One is security: using reflection we can get access to the private variables of a class and change their values.

How do we get access to the Class object?

object.getClass();

Once we have the Class object, we can get its methods, fields, constructors and so on.
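A minimal sketch of these steps, assuming a made-up Person class with a private field name and a public method greet (both hypothetical and defined only for this example). It lists the declared fields, changes the private field and invokes the method, all through the Class object.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

class ReflectionSketch {

    // Hypothetical class used only for this illustration.
    public static class Person {
        private String name = "sid";

        public String greet() {
            return "Hi, I am " + name;
        }
    }

    // Changes the private field "name" and then calls greet(), both via reflection.
    static String renameAndGreet(Person person, String newName) {
        try {
            Class<?> type = person.getClass();

            Field nameField = type.getDeclaredField("name");
            nameField.setAccessible(true); // bypasses the private modifier
            nameField.set(person, newName);

            Method greet = type.getMethod("greet");
            return (String) greet.invoke(person);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Reflection failed", e);
        }
    }

    public static void main(String[] args) {
        Person person = new Person();

        // Inspect: print the names of the fields declared by Person.
        for (Field field : person.getClass().getDeclaredFields()) {
            System.out.println("field: " + field.getName());
        }

        // Modify and invoke.
        System.out.println(renameAndGreet(person, "alice")); // prints Hi, I am alice
    }
}
```

The call to setAccessible(true) is exactly the security concern mentioned above: it lets code outside the class read and write a private field.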

 

Stop the world phase

Garbage collection (GC) can literally stop the world: during a stop-the-world pause, all application threads are suspended until the collection finishes.

When a GC occurs in the young generation space, it completes quickly because the young generation space is small.

The young generation is where newly instantiated objects are stored. Internally, it also has two survivor spaces: when a GC occurs, objects that are still referenced are moved to a survivor space. If an object survives many GC cycles, it is promoted to the old generation space.

The problem arises when GC occurs in the old generation space, which contains long-lived objects. This space uses a lot more memory than the young generation, so a collection there takes longer, and while it pauses the application it halts all requests made to that JVM process.

So, the world literally stops !!
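One rough way to see how much time a JVM has spent collecting is the standard java.lang.management API, sketched below. The class name GcStatsSketch is made up, the collector names printed depend on the JVM and the GC in use, and the reported time is an approximate accumulated collection time, so treat it as an indicator rather than an exact measure of stop-the-world pauses.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcStatsSketch {

    // Sums the approximate collection time (in milliseconds) over all collectors.
    static long totalCollectionTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime(); // -1 if the collector does not report it
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Typically one bean covers young generation collections and another the old generation.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", time(ms)=" + gc.getCollectionTime());
        }
        System.out.println("total GC time (ms): " + totalCollectionTimeMillis());
    }
}
```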

Why Java 8?

In simple words, Java 8 lets us write code more concisely and expressively than the verbose code required in earlier Java versions.

Example: Let’s sort a collection of cars based on their speed.

Java versions prior to Java 8:

Collections.sort(fleet, new Comparator<Car>() {
  @Override
  public int compare(Car c1, Car c2) {
    // getSpeed() must return a Comparable type such as Integer
    return c1.getSpeed().compareTo(c2.getSpeed());
  }
});

Instead of writing verbose code like the above, with Java 8 we can write the same thing as:

Java 8:

fleet.sort(Comparator.comparing(Car::getSpeed));

The above code is more concise and could be read as “sort fleet comparing Car’s speed”.

So why write boilerplate code that is unrelated to the problem statement? Instead, you can write concise code that relates directly to the problem statement and has SQL-like readability.
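For completeness, here is a small runnable sketch of the Java 8 version. The Car class, its model field and the helper modelsBySpeed are assumptions made up for this example; the post itself only shows getSpeed().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class FleetSortSketch {

    // Hypothetical Car class; only getSpeed() appears in the snippets above.
    static class Car {
        private final String model;
        private final Integer speed;

        Car(String model, Integer speed) {
            this.model = model;
            this.speed = speed;
        }

        String getModel() {
            return model;
        }

        Integer getSpeed() {
            return speed;
        }
    }

    // Returns the model names ordered from slowest to fastest car.
    static List<String> modelsBySpeed(List<Car> fleet) {
        List<Car> sorted = new ArrayList<>(fleet); // copy, so the caller's list is left untouched
        sorted.sort(Comparator.comparing(Car::getSpeed));

        List<String> models = new ArrayList<>();
        for (Car car : sorted) {
            models.add(car.getModel());
        }
        return models;
    }

    public static void main(String[] args) {
        List<Car> fleet = Arrays.asList(
                new Car("roadster", 200),
                new Car("van", 120),
                new Car("sedan", 160));
        System.out.println(modelsBySpeed(fleet)); // prints [van, sedan, roadster]
    }
}
```

Sorting from fastest to slowest only needs Comparator.comparing(Car::getSpeed).reversed().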

Learn, Collaborate & Share !!
