Elasticsearch Server(Third Edition)
上QQ阅读APP看书,第一时间看更新

Understanding the querying process

After reading the previous section, we now know how querying works in Elasticsearch. You know that Elasticsearch, in most cases, needs to scatter the query across multiple nodes, get the results, merge them, fetch the relevant documents from one or more shards, and return the final results to the client requesting the documents. What we didn't talk about are two additional things that define how queries behave: search type and query execution preference. We will now concentrate on these functionalities of Elasticsearch.

Query logic

Elasticsearch is a distributed search engine and so all functionality provided must be distributed in its nature. It is exactly the same with querying. Because we would like to discuss some more advanced topics on how to control the query process, we first need to know how it works.

Let's now get back to how querying works. We started the theory in the first chapter and we would like to get back to it. By default, if we don't alter anything, the query process will consist of two phases: the scatter and the gather phase. The aggregator node (the one that receives the request) will run the scatter phase first. During that phase, the query is distributed to all the shards that our index is built from (of course if routing is not used). For example, if it is built of 5 shards and 1 replica then 5 physical shards will be queried (we don't need to query a shard and its replica as they contain the same data). Each of the queried shards will only return the document identifier and the score of the document. The node that sent the scatter query will wait for all the shards to complete their task, gather the results, and sort them appropriately (in this case, from top scoring to the lowest scoring ones).

After that, a new request will be sent to build the search results. However, now only to those shards that held the documents to build the response. In most cases, Elasticsearch won't send the request to all the shards but to its subset. That's because we usually don't get the complete result of the query but only a portion of it. This phase is called the gather phase. After all the documents are gathered, the final response is built and returned as the query result. This is the basic and default Elasticsearch behavior but we can change it.

Search type

Elasticsearch allows us to choose how we want our query to be processed internally. We can do that by specifying the search type. There are different situations where different search types are appropriate: sometimes one can care only about the performance while sometimes query relevance is the most important factor. You should remember that each shard is a small Lucene index and, in order to return more relevant results, some information, such as frequencies, needs to be transferred between the shards. To control how the queries are executed, we can pass the search_type request parameter and set it to one of the following values:

  • query_then_fetch: In the first step, the query is executed to get the information needed to sort and rank the documents. This step is executed against all the shards. Then only the relevant shards are queried for the actual content of the documents. This is the search type used by default if no search type is provided with the query and this is the query type we described previously.
  • dfs_query_then_fetch: This is similar to query_then_fetch. However, it contains an additional query phase comparing to query_then_fetch which calculates distributed term frequencies.

There are also two deprecated search types: count and scan. The first one is deprecated starting from Elasticsearch 2.0 and the second one starting with Elasticsearch 2.1. The first search type used to provide benefits where only aggregations or the number of documents was relevant, but now it is enough to add size equal to 0 to your queries. The scan request was used for scrolling functionality.

So if we would like to use the simplest search type, we would run the following command:

curl -XGET 'localhost:9200/library/book/_search?pretty&search_type=query_then_fetch' -d '{
 "query" : {
 "term" : { "title" : "crime" }
 }
}'

Search execution preference

In addition to the possibility of controlling how the query is executed, we can also control on which shards to execute the query. By default, Elasticsearch uses shards and replicas on any node in a round robin manner – so that each shard is queried a similar number of times. The default behavior is the proper method of shard execution preference for most use cases. But there may be times when we want to change the default behavior. For example, you may want the search to only be executed on the primary shards. To do that, we can set the preference request parameter to one of the following values:

  • _primary: The operation will be only executed on the primary shards, so the replicas won't be used. This can be useful when we need to use the latest information from the index but our data is not replicated right away.
  • _primary_first: The operation will be executed on the primary shards if they are available. If not, it will be executed on the other shards.
  • _replica: The operation will be executed only on the replica shards.
  • _replica_first: This operation is similar to _primary_first, but uses replica shards. The operation will be executed on the replica shards if possible and on the primary shards if the replicas are not available.
  • _local: The operation will be executed on the shards available on the node which the request was sent from and, if such shards are not present, the request will be forwarded to the appropriate nodes.
  • _only_node:node_id: This operation will be executed on the node with the provided node identifier.
  • _only_nodes:nodes_spec: This operation will be executed on the nodes that are defined in nodes_spec. This can be an IP address, a name, a name or IP address using wildcards, and so on. For example, if nodes_spec is set to 192.168.1.*, the operation will be run on the nodes with IP addresses starting with 192.168.1.
  • _prefer_node:node_id: Elasticsearch will try to execute the operation on the node with the provided identifier. However, if the node is not available, it will be executed on the nodes that are available.
  • _shards:1,2: Elasticsearch will execute the operation on the shards with the given identifiers; in this case, on shards with identifiers 1 and 2. The _shards parameter can be combined with other preferences, but the shards identifiers need to be provided first. For example, _shards:1,2;_local.
  • Custom value: Any custom, string value may be passed. Requests with the same values provided will be executed on the same shards.

For example, if we would like to execute a query only on the local shards, we would run the following command:

curl -XGET 'localhost:9200/library/_search?pretty&preference=_local' -d '{
 "query" : {
 "term" : { "title" : "crime" }
 }
}'

Search shards API

When discussing the search preference, we would also like to mention the search shards API exposed by Elasticsearch. This API allows us to check which shards the query will be executed on. In order to use this API, run a request against the search_shards rest end point. For example, to see how the query will be executed, we run the following command:

curl -XGET 'localhost:9200/library/_search_shards?pretty' -d '{"query":"match_all":{}}'

The response to the preceding command will be as follows:

{
  "nodes" : {
    "my0DcA_MTImm4NE3cG3ZIg" : {
      "name" : "Cloud 9",
      "transport_address" : "127.0.0.1:9300",
      "attributes" : { }
    }
  },
  "shards" : [ [ {
    "state" : "STARTED",
    "primary" : true,
    "node" : "my0DcA_MTImm4NE3cG3ZIg",
    "relocating_node" : null,
    "shard" : 0,
    "index" : "library",
    "version" : 4,
    "allocation_id" : {
      "id" : "9ayLDbL1RVSyJRYIJkuAxg"
    }
  } ], [ {
    "state" : "STARTED",
    "primary" : true,
    "node" : "my0DcA_MTImm4NE3cG3ZIg",
    "relocating_node" : null,
    "shard" : 1,
    "index" : "library",
    "version" : 4,
    "allocation_id" : {
      "id" : "wfpvtaLER-KVyOsuD46Yqg"
    }
  } ], [ {
    "state" : "STARTED",
    "primary" : true,
    "node" : "my0DcA_MTImm4NE3cG3ZIg",
    "relocating_node" : null,
    "shard" : 2,
    "index" : "library",
    "version" : 4,
    "allocation_id" : {
      "id" : "zrLPWhCOSTmjlb8TY5rYQA"
    }
  } ], [ {
    "state" : "STARTED",
    "primary" : true,
    "node" : "my0DcA_MTImm4NE3cG3ZIg",
    "relocating_node" : null,
    "shard" : 3,
    "index" : "library",
    "version" : 4,
    "allocation_id" : {
      "id" : "efnvY7YcSz6X8X8USacA7g"
    }
  } ], [ {
    "state" : "STARTED",
    "primary" : true,
    "node" : "my0DcA_MTImm4NE3cG3ZIg",
    "relocating_node" : null,
    "shard" : 4,
    "index" : "library",
    "version" : 4,
    "allocation_id" : {
      "id" : "XJHW2J63QUKdh3bK3T2nzA"
    }
  } ] ]
}

As you can see, in the response returned by Elasticsearch, we have the information about the shards that will be used during the query process. Of course, with the search shards API, we can use additional parameters that control the querying process. These properties are routing, preference, and local. We are already familiar with the first two. The local parameter is a Boolean (values true or false), one that allows us to tell Elasticsearch to use the cluster state information stored on the local node (setting local to true) instead of the one from the master node (setting local to false). This allows us to diagnose problems with cluster state synchronization.