
Batch indexing to speed up your indexing process

In Chapter 1, Getting Started with Elasticsearch Cluster, we saw how to index a particular document into Elasticsearch. It required opening an HTTP connection, sending the document, and closing the connection. Of course, we were not responsible for most of that as we used the curl command, but in the background this is what happened. However, sending the documents one by one is not efficient. Because of that, it is now time to find out how to index a large number of documents in a more convenient and efficient way than doing so one by one.

Preparing data for bulk indexing

Elasticsearch allows us to merge many requests into one package. This package can be sent as a single request. What's more, we are not limited to a single type of request in the so-called bulk; we can mix different types of operations together, which include:

  • Adding new documents or replacing existing ones in the index (index)
  • Removing documents from the index (delete)
  • Adding new documents to the index only when a document with the same identifier doesn't already exist (create)
  • Modifying existing documents or creating new ones if a document doesn't exist (update)

The format of the request was chosen for processing efficiency. It assumes that every line of the request contains a JSON object with the description of the operation, followed by a second line containing the document, which is another JSON object. We can treat the first line as a kind of information line and the second as the data line. The exception to this rule is the delete operation, which contains only the information line, because the document is not needed. Let's look at the following example:

{ "index": { "_index": "addr", "_type": "contact", "_id": 1 }}
{ "name": "Fyodor Dostoevsky", "country": "RU" }
{ "create": { "_index": "addr", "_type": "contact", "_id": 2 }}
{ "name": "Erich Maria Remarque", "country": "DE" }
{ "create": { "_index": "addr", "_type": "contact", "_id": 2 }}
{ "name": "Joseph Heller", "country": "US" }
{ "delete": { "_index": "addr", "_type": "contact", "_id": 4 }}
{ "delete": { "_index": "addr", "_type": "contact", "_id": 1 }}

It is very important that every document or action description is placed in one line (ended by a newline character). This means that the document cannot be pretty-printed. There is a default limitation on the size of the bulk indexing file, which is set to 100 megabytes and can be changed by specifying the http.max_content_length property in the Elasticsearch configuration file. This lets us avoid issues with possible request timeouts and memory problems when dealing with requests that are too large.
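If we really need to send larger requests, the limit can be raised in the elasticsearch.yml configuration file; the following value is only an illustration, not a recommendation:

http.max_content_length: 500mb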

Note

Note that with a single batch indexing file, we can load data into many indices, and the documents in the bulk request can have different types.
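For example, a single file could target two different indices at once (the books index and book type below are hypothetical):

{ "index": { "_index": "addr", "_type": "contact", "_id": 5 }}
{ "name": "Bruno Schulz", "country": "PL" }
{ "index": { "_index": "books", "_type": "book", "_id": 1 }}
{ "title": "The Street of Crocodiles", "author": "Bruno Schulz" }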

Indexing the data

In order to execute the bulk request, Elasticsearch provides the _bulk endpoint. This can be used as /_bulk or with an index name as /index_name/_bulk or even with a type and index name as /index_name/type_name/_bulk. The second and third forms define the default values for the index name and the type name. We can omit these properties in the information line of our request and Elasticsearch will use the default values from the URI. It is also worth knowing that the default URI values can be overwritten by the values in the information lines.
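For example, with the defaults provided in the URI, the information lines can skip _index and _type. The following sketch assumes a hypothetical contacts.json file and reuses the addr index and contact type:

curl -XPOST 'localhost:9200/addr/contact/_bulk?pretty' --data-binary @contacts.json

where contacts.json could contain lines such as:

{ "index": { "_id": 3 }}
{ "name": "Stanislaw Lem", "country": "PL" }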

Assuming we've stored our data in the documents.json file, we can run the following command to send this data to Elasticsearch:

curl -XPOST 'localhost:9200/_bulk?pretty' --data-binary @documents.json

The ?pretty parameter is, of course, not necessary; we've used it only to make the response of the preceding command easier to read. What is important in this case is using curl with the --data-binary parameter instead of -d. This is because the standard -d parameter ignores newline characters which, as we said earlier, Elasticsearch needs in order to parse the bulk request content. Now let's look at the response returned by Elasticsearch:

{
  "took" : 469,
  "errors" : true,
  "items" : [ {
    "index" : {
      "_index" : "addr",
      "_type" : "contact",
      "_id" : "1",
      "_version" : 1,
      "_shards" : {
        "total" : 2,
        "successful" : 1,
        "failed" : 0
      },
      "status" : 201
    }
  }, {
    "create" : {
      "_index" : "addr",
      "_type" : "contact",
      "_id" : "2",
      "_version" : 1,
      "_shards" : {
        "total" : 2,
        "successful" : 1,
        "failed" : 0
      },
      "status" : 201
    }
  }, {
    "create" : {
      "_index" : "addr",
      "_type" : "contact",
      "_id" : "2",
      "status" : 409,
      "error" : {
        "type" : "document_already_exists_exception",
        "reason" : "[contact][2]: document already exists",
        "shard" : "2",
        "index" : "addr"
      }
    }
  }, {
    "delete" : {
      "_index" : "addr",
      "_type" : "contact",
      "_id" : "4",
      "_version" : 1,
      "_shards" : {
        "total" : 2,
        "successful" : 1,
        "failed" : 0
      },
      "status" : 404,
      "found" : false
    }
  }, {
    "delete" : {
      "_index" : "addr",
      "_type" : "contact",
      "_id" : "1",
      "_version" : 2,
      "_shards" : {
        "total" : 2,
        "successful" : 1,
        "failed" : 0
      },
      "status" : 200,
      "found" : true
    }
  } ]
}

As we can see, every result is part of the items array. Let's briefly compare these results with our input data. The first two operations, index and create, were executed without any problems. The third operation failed because we wanted to create a record with an identifier that already existed in the index. The last two operations were deletions. Both were accepted, although the first of them tried to delete a nonexistent document; this wasn't a problem for Elasticsearch, but note that for the nonexistent document the status was 404, which as an HTTP response code means not found (http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html). As you can see, Elasticsearch returns information about each operation, so for large bulk requests the response can be massive.

The _all field

The _all field is used by Elasticsearch to store data from all the other fields in a single field for ease of searching. This kind of field may be useful when we want to implement a simple search feature and we want to search all the data (or only the fields we copy to the _all field), but we don't want to think about the field names and things like that. By default, the _all field is enabled and contains all the data from all the fields from the document. However, this field makes the index a bit bigger and that is not always needed.

For example, when you input a search phrase into a search box on a library catalog site, you expect to be able to search by the author's name, the ISBN, and the words the book title contains, but searching by the number of pages or the cover type usually does not make sense. We can either disable the _all field completely or exclude certain fields from being copied to it. In order not to include a certain field in the _all field, we use the include_in_all property, which was discussed earlier in this chapter. To completely turn off the _all field functionality, we modify our mappings file as follows:

{
  "book" : {
    "_all" : {
      "enabled" : false
    },
    "properties" : {
      . . .
    }
  }
}
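If we prefer to keep the _all field enabled and only leave a particular field out of it, we can set the include_in_all property on that field instead. The following is a minimal sketch that assumes a hypothetical pages field of type integer:

{
  "book" : {
    "properties" : {
      "pages" : {
        "type" : "integer",
        "include_in_all" : false
      }
    }
  }
}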

In addition to the enabled property, the _all field supports the following ones:

  • store
  • term_vector
  • analyzer

For information about the preceding properties, refer to the Mappings configuration section in this chapter.
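As an illustration only (the meaning of these properties is described in the section mentioned above), an _all definition that uses them could look as follows:

{
  "book" : {
    "_all" : {
      "enabled" : true,
      "store" : true,
      "term_vector" : "with_positions_offsets",
      "analyzer" : "standard"
    },
    "properties" : {
      . . .
    }
  }
}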

The _source field

The _source field allows us to store the original JSON document that was sent to Elasticsearch during indexing. By default, the _source field is turned on because some Elasticsearch functionalities depend on it (for example, the partial update feature). In addition to that, the _source field can be used as the source of data for the highlighting functionality if a field is not stored. However, if we don't need such functionality, we can disable the _source field as it causes some storage overhead. In order to do that, we need to set the _source object's enabled property to false, as follows:

{
  "book" : {
    "_source" : {
      "enabled" : false
    },
    "properties" : {
      . . .
    }
  }
}

We can also tell Elasticsearch which fields we want to exclude from the _source field and which fields we want to include. We do that by adding the includes and excludes properties to the _source field definition. For example, if we want to exclude all the fields in the author path from the _source field, our mappings will look as follows:

{
  "book" : {
    "_source" : {
      "excludes" : [ "author.*" ]
    },
    "properties" : {
      . . .
    }
  }
}
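Similarly, the includes property keeps only the listed paths in the stored _source. A sketch that assumes a hypothetical title field could look like this:

{
  "book" : {
    "_source" : {
      "includes" : [ "title", "author.*" ]
    },
    "properties" : {
      . . .
    }
  }
}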

Additional internal fields

There are additional fields that are internally used by Elasticsearch, but which we can't configure. Those fields are:

  • _id: This field is used to hold the identifier of the document inside the index and type
  • _uid: This field is used to hold the unique identifier of the document in the index and is built of _id and _type (this allows documents with the same identifier but different types to exist in the same index)
  • _type: This field is the type name for the document
  • _field_names: This field is the list of fields existing in the document
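Although we can't configure these fields, we can still refer to some of them in queries. For example, the _uid field value is built as type#id, so the following hypothetical search against the addr index from earlier in this section would match the contact with identifier 2:

curl -XGET 'localhost:9200/addr/_search?pretty' -d '{
  "query" : {
    "term" : { "_uid" : "contact#2" }
  }
}'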