Elasticsearch 7.0 Cookbook (Fourth Edition)

How it works...

The bulk operation aggregates several calls into a single one. Each call is made of a header, which carries the action to be performed, and, for actions such as index, create, and update, a body with the data.

The header is composed of the action name and an object that holds its parameters. Looking at the previous index example, we have the following:

{ "index":{ "_index":"myindex", "_id":"1" } }

For indexing and creating, an extra body is required with the data:

{ "field1" : "value1", "field2" : "value2" }

The delete action doesn't require a body, so it consists of the header only:

{ "delete":{ "_index":"myindex", "_id":"1" } }
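
The header and body lines shown above are sent to the _bulk endpoint as newline-delimited JSON: one JSON object per line, with the last line also terminated by a newline. The following is a minimal Python sketch of that serialization using only the standard library. The helper name bulk_lines is hypothetical, and the index name and document values are simply the ones from the examples above.

```python
import json

def bulk_lines(actions):
    """Serialize (header, body) pairs into a newline-delimited bulk payload.

    `body` is None for actions, such as delete, that consist of a header only.
    """
    lines = []
    for header, body in actions:
        lines.append(json.dumps(header))
        if body is not None:
            lines.append(json.dumps(body))
    # Every line, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"

actions = [
    # index: header followed by the document to store
    ({"index": {"_index": "myindex", "_id": "1"}},
     {"field1": "value1", "field2": "value2"}),
    # delete: header only
    ({"delete": {"_index": "myindex", "_id": "1"}}, None),
]
payload = bulk_lines(actions)
print(payload, end="")
```

The resulting string is what would be posted to the _bulk endpoint with the application/x-ndjson content type; the HTTP call itself is left out to keep the sketch independent of any client library.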

Finally, it is possible to use an update action in a bulk, with a format similar to the index one:

{ "update":{ "_index":"myindex", "_id":"3" } }

The body of an update item accepts the common options of the update action, such as doc, upsert, doc_as_upsert, and script (with its lang and params). To control the number of retries in the case of a version conflict, the header accepts the retry_on_conflict parameter (spelled _retry_on_conflict before 7.x), set to the number of retries to be performed before raising an exception.

So, a possible body for the update would be as follows:

{ "doc":{"field1" : "value1", "field2" : "value2" }}
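
As an illustration, a complete update item (header plus body) could be assembled as in the following Python sketch. The retry count of 3 and the use of doc_as_upsert are assumptions chosen for the example. The retry parameter is written here without a leading underscore, which is the spelling accepted by 7.x.

```python
import json

# Header: target document plus the number of retries on version conflict
# (3 is an arbitrary value chosen for this example).
header = {"update": {"_index": "myindex", "_id": "3", "retry_on_conflict": 3}}

# Body: partial document to merge; doc_as_upsert asks Elasticsearch to
# create the document from `doc` if it does not exist yet.
body = {"doc": {"field1": "value1", "field2": "value2"}, "doc_as_upsert": True}

# One bulk item = header line + body line, each terminated by a newline.
update_item = json.dumps(header) + "\n" + json.dumps(body) + "\n"
print(update_item, end="")
```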

The bulk item can accept several parameters, such as the following:

  • routing, to control the routing shard.
  • parent, to select a parent item shard. This is required if you are indexing some child documents. (With the join field used in 7.x, child documents are placed on the parent's shard through routing.)

Global bulk parameters that can be passed using query arguments are as follows:

  • wait_for_active_shards (default 1, that is, the primary shard only), which controls the number of active shard copies required before executing write operations. It replaces the consistency parameter (one, quorum, all) of older versions.
  • refresh (default false), which forces a refresh in the shards involved in bulk operations. The newly indexed documents will be available immediately, without having to wait for the standard refresh interval (1s).
  • pipeline, which forces an index using the ingest pipeline provided.

Previous versions of Elasticsearch required users to pass the _type value, but this is no longer required in version 7.x due to type removal.
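
Global parameters are appended to the _bulk URL as query arguments. The following Python sketch shows one way to build such a URL with the standard library. The node address, the helper name bulk_url, and the pipeline name my-pipeline are assumptions made for the example.

```python
from urllib.parse import urlencode

# Assumption: a local node listening on the default HTTP port.
base_url = "http://localhost:9200"

def bulk_url(base, **params):
    """Return the _bulk URL with global parameters as query arguments."""
    query = urlencode(params)
    return f"{base}/_bulk?{query}" if query else f"{base}/_bulk"

# Force a refresh and run every indexed document through an ingest pipeline.
url = bulk_url(base_url, refresh="true", pipeline="my-pipeline")
print(url)
```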

Elasticsearch client libraries that use the Elasticsearch REST API usually serialize bulk commands automatically.

The correct number of commands to serialize in a bulk execution is a user choice, but there are some things to consider:

  • In standard configuration, Elasticsearch limits the HTTP call to 100 MB in size. If the size is over that limit, the call is rejected.
  • Multiple complex commands take a long time to process, so pay attention to the client timeout.
  • A bulk that contains only a few commands doesn't improve performance.

If the documents aren't big, 500 commands in a bulk can be a good number to start with, and it can be tuned depending on data structures (number of fields, number of nested objects, complexity of fields, and so on).
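
The sizing considerations above can be sketched as a small batching helper that splits a stream of commands into bulk payloads, flushing when either the command count or the byte size would be exceeded. This is only an illustration under stated assumptions: chunk_bulk is a hypothetical helper, 500 is the starting value suggested above, and the byte limit mirrors the default 100 MB HTTP limit.

```python
import json

MAX_COMMANDS = 500             # starting point suggested above; tune per workload
MAX_BYTES = 100 * 1024 * 1024  # default HTTP request size limit (100 MB)

def chunk_bulk(items, max_commands=MAX_COMMANDS, max_bytes=MAX_BYTES):
    """Yield bulk payloads (bytes) that respect both limits.

    `items` is an iterable of (header, body) pairs; body is None for delete.
    A single item larger than max_bytes is still emitted on its own.
    """
    batch, size = [], 0
    for header, body in items:
        text = json.dumps(header) + "\n"
        if body is not None:
            text += json.dumps(body) + "\n"
        data = text.encode("utf-8")
        # Flush before adding an item that would break either limit.
        if batch and (len(batch) >= max_commands or size + len(data) > max_bytes):
            yield b"".join(batch)
            batch, size = [], 0
        batch.append(data)
        size += len(data)
    if batch:
        yield b"".join(batch)

# 1,200 small index commands, generated lazily.
docs = (
    ({"index": {"_index": "myindex", "_id": str(i)}}, {"field1": f"value{i}"})
    for i in range(1200)
)
payloads = list(chunk_bulk(docs))
# Each index command occupies two lines, so halve the newline count.
print([p.count(b"\n") // 2 for p in payloads])  # prints [500, 500, 200]
```

Each yielded payload can then be sent as one bulk call; measuring throughput while varying max_commands is a simple way to tune the value for a given document structure.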