Querying Elasticsearch
So far, when we have searched our data, we used the REST API and a simple query or the GET request. Similarly, when we were changing the index, we also used the REST API and sent the JSON-structured data to Elasticsearch. Regardless of the type of operation we wanted to perform, whether it was a mapping change or document indexation, we used a JSON-structured request body to inform Elasticsearch about the operation details.
A similar situation happens when we want to send more than a simple query to Elasticsearch: we structure it using JSON objects and send it to Elasticsearch in the request body. This is called the query DSL. In a broader view, Elasticsearch supports two kinds of queries: basic ones and compound ones. Basic queries, such as the term query, are used for querying the actual data. We will cover these in the Basic queries section of this chapter. The second type of query is the compound query, such as the bool query, which can combine multiple queries. We will cover these in the Compound queries section of this chapter.
However, this is not the whole picture. In addition to these two types of queries, certain queries can have filters that are used to narrow down your results with certain criteria. Filter queries don't affect scoring and are usually very efficient and easily cached.
To make it even more complicated, queries can contain other queries (don't worry; we will try to explain all this!). Furthermore, some queries can contain filters and others can contain both queries and filters. Although this is not everything, we will stick with this working explanation for now. We will go over this in greater detail in the Compound queries section in this chapter and the Filtering your results section in Chapter 4, Extending Your Querying Knowledge.
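To make the distinction between basic and compound queries more concrete, here is a minimal Python sketch that only builds the two request bodies as dictionaries. The choice of the title field and of the terms crime and punishment is an assumption made for illustration; nothing is sent to Elasticsearch.

```python
import json

# A basic query: a single term query against the "title" field.
basic_query = {"query": {"term": {"title": "crime"}}}

# A compound query: the bool query wraps other queries. Here both term
# queries are placed in the "must" section (illustrative values only).
compound_query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"title": "crime"}},
                {"term": {"title": "punishment"}},
            ]
        }
    }
}

print(json.dumps(basic_query))
print(json.dumps(compound_query, indent=2))
```

Either dictionary, once serialized with json.dumps, is the kind of JSON object that goes into the request body.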
The example data
If not stated otherwise, the following mappings will be used for the rest of the chapter:
{
  "book" : {
    "properties" : {
      "author" : { "type" : "string" },
      "characters" : { "type" : "string" },
      "copies" : { "type" : "long", "ignore_malformed" : false },
      "otitle" : { "type" : "string" },
      "tags" : { "type" : "string", "index" : "not_analyzed" },
      "title" : { "type" : "string" },
      "year" : { "type" : "long", "ignore_malformed" : false, "index" : "analyzed" },
      "available" : { "type" : "boolean" }
    }
  }
}
The preceding mappings represent a simple library and were used to create the library index. One thing to remember is that Elasticsearch will analyze the string-based fields if we don't configure it differently.
The preceding mappings were stored in the mapping.json file and, in order to create the mentioned library index, we can use the following commands:
curl -XPOST 'localhost:9200/library'
curl -XPUT 'localhost:9200/library/book/_mapping' -d @mapping.json
We also used the following sample data for the examples in this chapter:
{ "index": {"_index": "library", "_type": "book", "_id": "1"}}
{ "title": "All Quiet on the Western Front","otitle": "Im Westen nichts Neues","author": "Erich Maria Remarque","year": 1929,"characters": ["Paul Bäumer", "Albert Kropp", "Haie Westhus", "Fredrich Müller", "Stanislaus Katczinsky", "Tjaden"],"tags": ["novel"],"copies": 1, "available": true, "section" : 3}
{ "index": {"_index": "library", "_type": "book", "_id": "2"}}
{ "title": "Catch-22","author": "Joseph Heller","year": 1961,"characters": ["John Yossarian", "Captain Aardvark", "Chaplain Tappman", "Colonel Cathcart", "Doctor Daneeka"],"tags": ["novel"],"copies": 6, "available" : false, "section" : 1}
{ "index": {"_index": "library", "_type": "book", "_id": "3"}}
{ "title": "The Complete Sherlock Holmes","author": "Arthur Conan Doyle","year": 1936,"characters": ["Sherlock Holmes","Dr. Watson", "G. Lestrade"],"tags": [],"copies": 0, "available" : false, "section" : 12}
{ "index": {"_index": "library", "_type": "book", "_id": "4"}}
{ "title": "Crime and Punishment","otitle": "Преступлéние и наказáние","author": "Fyodor Dostoevsky","year": 1886,"characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],"tags": [],"copies": 0, "available" : true}
We stored our sample data in the documents.json file and we use the following command to index it:
curl -s -XPOST 'localhost:9200/_bulk' --data-binary @documents.json
This command runs bulk indexing. You can learn more about it in the Batch indexing to speed up your indexing process section in Chapter 2, Indexing Your Data.
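The layout of documents.json (one action line followed by one document line, each on its own line) can also be generated programmatically. The following is a minimal Python sketch; the helper name to_bulk_lines is hypothetical, and the two shortened documents are only an excerpt of the sample data.

```python
import json

def to_bulk_lines(index, doc_type, docs):
    """Build a bulk body: one action line, then one source line, per document."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": str(doc_id)}}
        lines.append(json.dumps(action))
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        lines.append(json.dumps(source, ensure_ascii=False))
    # Every line, including the last one, is terminated with a newline.
    return "\n".join(lines) + "\n"

body = to_bulk_lines("library", "book", [
    (2, {"title": "Catch-22", "author": "Joseph Heller", "year": 1961}),
    (4, {"title": "Crime and Punishment", "author": "Fyodor Dostoevsky", "year": 1886}),
])
print(body, end="")
```

Writing body to a file and passing that file to curl with --data-binary keeps the newlines intact, which is why the command above uses that switch rather than -d.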
A simple query
The simplest way to query Elasticsearch is to use the URI request query. We already discussed it in the Searching with the URI request query section of Chapter 1, Getting Started with Elasticsearch Cluster. For example, to search for the word crime in the title field, you could send a query using the following command:
curl -XGET 'localhost:9200/library/book/_search?q=title:crime&pretty'
This is a very simple, but limited, way of submitting queries to Elasticsearch. If we look from the point of view of the Elasticsearch query DSL, the preceding query is a query_string query. It searches for the documents that have the term crime in the title field and can be rewritten as follows:
{ "query" : { "query_string" : { "query" : "title:crime" } } }
Sending a query using the query DSL is a bit different, but still not rocket science. We send an HTTP GET request to the _search REST endpoint as earlier and include the query in the request body (POST is also accepted, in case your tool or library doesn't allow sending a request body in HTTP GET requests). Let's take a look at the following command:
curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "query" : { "query_string" : { "query" : "title:crime" } } }'
As you can see, we used the request body (the -d switch) to send the whole JSON-structured query to Elasticsearch. The pretty request parameter tells Elasticsearch to structure the response in such a way that we humans can read it more easily. In response to the preceding command, we get the following output:
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "4",
      "_score" : 0.5,
      "_source" : {
        "title" : "Crime and Punishment",
        "otitle" : "Преступлéние и наказáние",
        "author" : "Fyodor Dostoevsky",
        "year" : 1886,
        "characters" : [ "Raskolnikov", "Sofia Semyonovna Marmeladova" ],
        "tags" : [ ],
        "copies" : 0,
        "available" : true
      }
    } ]
  }
}
Nice! We got our first search results with the query DSL.
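If you prefer to send the same request from a program instead of curl, the following Python sketch shows one possible way using only the standard library. The function name build_search_request and its default host are assumptions for illustration; the sketch uses POST, which the _search endpoint also accepts, and only prepares the request without sending it.

```python
import json
import urllib.request

def build_search_request(body, host="localhost:9200", index="library", doc_type="book"):
    """Prepare (but do not send) a POST request to the _search endpoint."""
    url = "http://%s/%s/%s/_search?pretty" % (host, index, doc_type)
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={"Content-Type": "application/json"},
    )

request = build_search_request({"query": {"query_string": {"query": "title:crime"}}})
print(request.get_method(), request.full_url)

# Sending the request needs a running Elasticsearch node, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     result = json.loads(response.read().decode("utf-8"))
```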
Paging and result size
Elasticsearch allows us to control how many results we want to get (at most) and from which result we want to start. The following are the two additional properties that can be set in the request body:
- from: This property specifies the offset of the first document that we want in our results. Its default value is 0, which means that we want to get our results starting from the first document.
- size: This property specifies the maximum number of documents we want as the result of a single query (which defaults to 10). For example, if we are only interested in aggregation results and don't care about the documents returned by the query, we can set this parameter to 0.
If we want our query to get documents starting from the tenth item on the list and fetch 20 documents, we send the following query:
curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "from" : 9, "size" : 20, "query" : { "query_string" : { "query" : "title:crime" } } }'
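Because from is a zero-based offset, it is easy to be off by one when translating "the tenth item" or "page three" into request properties. The following Python sketch shows the arithmetic; both helper names are hypothetical.

```python
def window_from_item(first_item, count):
    """Translate a 1-based item position into the from/size properties."""
    if first_item < 1 or count < 0:
        raise ValueError("first_item must be >= 1 and count must be >= 0")
    return {"from": first_item - 1, "size": count}

def page_window(page, per_page):
    """Translate a 1-based page number into the from/size properties."""
    if page < 1 or per_page < 0:
        raise ValueError("page must be >= 1 and per_page must be >= 0")
    return {"from": (page - 1) * per_page, "size": per_page}

# Start at the tenth item and fetch 20 documents, as in the query above.
print(window_from_item(10, 20))  # {'from': 9, 'size': 20}
# Third page when showing 10 documents per page.
print(page_window(3, 10))        # {'from': 20, 'size': 10}
```

The returned dictionary can be merged into the request body next to the query property.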
Returning the version value
In addition to all the information returned, Elasticsearch can return the version of the document (we mentioned versioning in Chapter 1, Getting Started with Elasticsearch Cluster). To do this, we need to add the version property with the value of true to the top level of our JSON object. So, the final query, which requests the version information, will look as follows:
curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "version" : true, "query" : { "query_string" : { "query" : "title:crime" } } }'
After running the preceding query, we get the following results:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.5,
"hits" : [ {
"_index" : "library",
"_type" : "book",
"_id" : "4",
"_version" : 1,
"_score" : 0.5,
"_source" : {
"title" : "Crime and Punishment",
"otitle" : "Преступлéние и наказáние",
"author" : "Fyodor Dostoevsky",
"year" : 1886,
"characters" : [ "Raskolnikov", "Sofia Semyonovna Marmeladova" ],
"tags" : [ ],
"copies" : 0,
"available" : true
}
} ]
}
}
As you can see, the _version section is present for the single hit we got.
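On the client side, the version can be read from each hit like any other property. The following Python sketch works on a trimmed copy of the preceding response; the helper name versions_by_id is hypothetical.

```python
# A trimmed copy of the response shown above (only the parts used here).
response = {
    "hits": {
        "total": 1,
        "hits": [
            {"_index": "library", "_type": "book", "_id": "4",
             "_version": 1, "_score": 0.5}
        ],
    }
}

def versions_by_id(search_response):
    """Map each hit's identifier to its version (None if it was not requested)."""
    return {hit["_id"]: hit.get("_version")
            for hit in search_response["hits"]["hits"]}

print(versions_by_id(response))  # {'4': 1}
```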
Limiting the score
For nonstandard use cases, Elasticsearch provides a feature that lets us filter the results on the basis of a minimum score value that the document must have to be considered a match. In order to use this feature, we must provide the min_score property at the top level of our JSON object with the value of the minimum score. For example, if we want our query to only return documents with a score of at least 0.75, we send the following query:
curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "min_score" : 0.75, "query" : { "query_string" : { "query" : "title:crime" } } }'
We get the following response after running the preceding query:
{ "took" : 3, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 0, "max_score" : null, "hits" : [ ] } }
If you look at the previous examples, the score of our document was 0.5, which is lower than 0.75, and thus we didn't get any documents in response.
Limiting the score usually doesn't make much sense because comparing scores between the queries is quite hard. However, maybe in your case, this functionality will be needed.
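The following Python sketch shows how the min_score property could be added to an existing request body, together with a purely local illustration of the effect on a list of hits. Both helper names are hypothetical, and the local filter assumes the threshold is inclusive; it is not Elasticsearch's own implementation.

```python
import copy

def with_min_score(body, min_score):
    """Return a copy of the request body with min_score set at the top level."""
    limited = copy.deepcopy(body)
    limited["min_score"] = min_score
    return limited

def keep_hits(hits, min_score):
    """Local illustration only: keep hits whose score reaches the threshold.

    Assumes the threshold is inclusive (score >= min_score).
    """
    return [hit for hit in hits if hit["_score"] >= min_score]

query = {"query": {"query_string": {"query": "title:crime"}}}
print(with_min_score(query, 0.75))

# Our only hit scored 0.5, so nothing survives a threshold of 0.75.
print(keep_hits([{"_id": "4", "_score": 0.5}], 0.75))  # []
```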
Choosing the fields that we want to return
With the use of the fields array in the request body, Elasticsearch allows us to define which fields to include in the response. Remember that you can only return these fields if they are marked as stored in the mappings used to create the index, or if the _source field was used (Elasticsearch uses the _source field to provide the stored values, and the _source field is turned on by default).
So, for example, to return only the title and the year fields in the results (for each document), send the following query to Elasticsearch:
curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "fields" : [ "title", "year" ], "query" : { "query_string" : { "query" : "title:crime" } } }'
In response, we get the following output:
{ "took" : 5, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5, "fields" : { "title" : [ "Crime and Punishment" ], "year" : [ 1886 ] } } ] } }
As you can see, everything worked as we wanted. There are four things we would like to share with you at this point, which are as follows:
- If we don't define the fields array, it will use the default value and return the _source field if available.
- If we use the _source field and request a field that is not stored, then that field will be extracted from the _source field (however, this requires additional processing).
- If we want to return all the stored fields, we just pass an asterisk (*) as the field name.
- From a performance point of view, it's better to return the _source field instead of multiple stored fields. This is because getting multiple stored fields may be slower compared to retrieving a single _source field.
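Note that in the response to a fields request, every value is wrapped in an array, even for single-valued fields such as title and year. The following Python sketch unwraps such values on the client side; the helper name first_values is hypothetical, and the hit is a trimmed copy of the one returned for our query.

```python
# A trimmed copy of the hit returned for the fields query shown earlier.
hit = {
    "_id": "4",
    "_score": 0.5,
    "fields": {"title": ["Crime and Punishment"], "year": [1886]},
}

def first_values(search_hit):
    """Unwrap single-element arrays; leave multi-valued fields as lists."""
    return {
        name: (values[0] if len(values) == 1 else values)
        for name, values in search_hit.get("fields", {}).items()
    }

print(first_values(hit))  # {'title': 'Crime and Punishment', 'year': 1886}
```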
Source filtering
In addition to choosing which fields are returned, Elasticsearch allows us to use so-called source filtering. This functionality allows us to control which fields are returned from the _source field. Elasticsearch exposes several ways to do this. The simplest source filtering allows us to decide whether the source of a document should be returned or not. Consider the following query:
curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "_source" : false, "query" : { "query_string" : { "query" : "title:crime" } } }'
The result returned by Elasticsearch should be similar to the following one:
{ "took" : 12, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5 } ] } }
Note that the response is limited to base information about a document and the _source field was not included. If you use Elasticsearch as a secondary source of data and the content of the document is served from an SQL database or a cache, the document identifier is all you need.
The second way is similar to the fields array described in the preceding section, although here we define which fields should be returned within the document source itself. Let's see that using the following example query:
curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "_source" : ["title", "otitle"], "query" : { "query_string" : { "query" : "title:crime" } } }'
We wanted to get the title and the otitle document fields in the returned _source field. Elasticsearch extracted those values from the original _source value and included the _source field only with the requested fields. The whole response returned by Elasticsearch looked as follows:
{ "took" : 2, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5, "_source" : { "otitle" : "Преступлéние и наказáние", "title" : "Crime and Punishment" } } ] } }
We can also use an asterisk to select which fields should be returned in the _source field; for example, title* will return values for the title field and for title10 (if we have such a field in our data). If we have documents with nested parts, we can use notation with a dot; for example, title.* to select all the fields nested under the title object.
Finally, we can also specify explicitly which fields we want to include and which to exclude from the _source field. We can include fields using the include property and we can exclude fields using the exclude property (both of them are arrays of values). For example, if we want the returned _source field to include all the fields starting with the letter t but not the title field, we will run the following query:
curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "_source" : { "include" : [ "t*"], "exclude" : ["title"] }, "query" : { "query_string" : { "query" : "title:crime" } } }'
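To see which fields such include and exclude patterns select, the following Python sketch approximates the matching locally with shell-style wildcards from the standard fnmatch module. This is an illustration for top-level fields only, not how Elasticsearch implements source filtering; the helper name filter_source is hypothetical.

```python
from fnmatch import fnmatchcase

def filter_source(source, include=(), exclude=()):
    """Keep top-level fields matching an include pattern and no exclude pattern.

    An empty include list is treated as "include everything".
    """
    result = {}
    for name, value in source.items():
        included = not include or any(fnmatchcase(name, p) for p in include)
        excluded = any(fnmatchcase(name, p) for p in exclude)
        if included and not excluded:
            result[name] = value
    return result

# The source of the "Crime and Punishment" document from the sample data.
source = {
    "title": "Crime and Punishment",
    "otitle": "Преступлéние и наказáние",
    "author": "Fyodor Dostoevsky",
    "year": 1886,
    "characters": ["Raskolnikov", "Sofia Semyonovna Marmeladova"],
    "tags": [],
    "copies": 0,
    "available": True,
}

# Only "title" and "tags" start with t, and "title" is excluded.
print(filter_source(source, include=["t*"], exclude=["title"]))  # {'tags': []}
```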
Using the script fields
Elasticsearch allows us to use script-evaluated values that will be returned with the result documents (we will discuss Elasticsearch scripting capabilities in greater detail in the Scripting capabilities of Elasticsearch section in Chapter 6, Make Your Search Better). To use the script fields functionality, we add the script_fields section to our JSON query object and an object with a name of our choice for each scripted value that we want to return. For example, to return a value named correctYear, which is calculated as the year field minus 1800, we run the following query:
curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "script_fields" : { "correctYear" : { "script" : "doc[\"year\"].value - 1800" } }, "query" : { "query_string" : { "query" : "title:crime" } } }'
Note
By default, Elasticsearch doesn't allow us to use dynamic scripting. If you tried the preceding query, you probably got an error with information stating that the scripts of type [inline] with operation [search] and language [groovy] are disabled. To make this example work, you should add the script.inline: on property to the elasticsearch.yml file. However, this exposes a security threat. Make sure to read the Scripting capabilities of Elasticsearch section in Chapter 6, Make Your Search Better, to learn about the consequences.
Using the doc notation, as we did in the preceding example, lets Elasticsearch cache the field values it reads, which speeds up script execution at the cost of higher memory consumption. We are also limited to single-valued and single-term fields. If we care about memory usage, or if we are using more complicated field values, we can always use the _source field. The same query using the _source field looks as follows:
curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "script_fields" : { "correctYear" : { "script" : "_source.year - 1800" } }, "query" : { "query_string" : { "query" : "title:crime" } } }'
The following response is returned by Elasticsearch with dynamic scripting enabled:
{ "took" : 76, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.5, "hits" : [ { "_index" : "library", "_type" : "book", "_id" : "4", "_score" : 0.5, "fields" : { "correctYear" : [ 86 ] } } ] } }
As you can see, we got the calculated correctYear field in response.
Passing parameters to the script fields
Let's take a look at one more feature of the script fields: the passing of additional parameters. Instead of having the value 1800 in the equation, we can use a variable name and pass its value in the params section. If we do this, our query will look as follows:
curl -XGET 'localhost:9200/library/book/_search?pretty' -d '{ "script_fields" : { "correctYear" : { "script" : "_source.year - paramYear", "params" : { "paramYear" : 1800 } } }, "query" : { "query_string" : { "query" : "title:crime" } } }'
As you can see, we added the paramYear variable as part of the scripted equation and provided its value in the params section. This allows Elasticsearch to execute the same script with different parameter values in a slightly more efficient way.
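When the request is built by a program, keeping the script text constant and changing only the parameters is straightforward. The following Python sketch builds the script_fields section this way; the helper name script_field is hypothetical, and the script itself is only placed in the request body, not evaluated by Python.

```python
import json

def script_field(name, script, **params):
    """Build one script_fields entry; params is added only when values are given."""
    field = {"script": script}
    if params:
        field["params"] = params
    return {name: field}

def correct_year_query(param_year):
    """Same script text every time; only the parameter value changes."""
    return {
        "script_fields": script_field(
            "correctYear", "_source.year - paramYear", paramYear=param_year
        ),
        "query": {"query_string": {"query": "title:crime"}},
    }

body = correct_year_query(1800)
print(json.dumps(body, indent=2))

# For our document the year is 1886, so with paramYear set to 1800
# the scripted value is 1886 - 1800.
print(1886 - 1800)  # 86
```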