Learn MongoDB 4.x

Discovering what's new and different in MongoDB 4.x

What's new and different in the MongoDB 4.x release can be broken down into two main categories: new features and internal enhancements. Let's look at the most significant new features first.

Significant new features

The most significant new features introduced in MongoDB 4.x include the following:

  • Multidocument ACID transaction support
  • Nonblocking secondary replica reads
  • In-progress index build interruption 
Transactions, secondary replica reads, and the aggregation pipeline are covered in detail in later chapters of this book. For an excellent brief overview of the major changes from MongoDB 3 to 4, go to https://www.mongodb.com/blog/post/mongodb-40-release-candidate-0-has-landed.
Multidocument ACID transaction support

In the database world, a transaction is a block of database operations that should be treated as if the entire block were a single command. An example would be an application performing end-of-the-month payroll processing. In order to maintain the integrity of the database, and your accounting files, you would need this set of operations to be safeguarded in the event of a failure. ACID is an acronym that stands for atomicity, consistency, isolation, and durability. It represents a set of principles that the database needs to follow in order to safeguard a block of database updates. For more information on ACID, you can refer to https://en.wikipedia.org/wiki/ACID.

In MongoDB 3, a write operation on a single document, even a document containing other embedded documents, was considered atomic. In MongoDB 4.x, multiple documents can be included in a single atomic transaction. Although invoking this support negatively impacts performance, the gain in database integrity might prove attractive. It's also worth noting that the lack of such support prior to MongoDB 4.x was a major criticism leveled against MongoDB, and slowed its adoption at the corporate level.

Invoking transaction support impacts read preferences and write concerns. For more information, you can refer to https://docs.mongodb.com/manual/core/transactions/#transaction-options-read-concern-write-concern-read-preference. Although these topics are covered in detail later in the book, we can briefly summarize them by stating that read preferences allow you to direct operations to specific members of a replica set. For example, you might want to indicate a preference for the primary member server in a replica set rather than allowing any member, including secondaries, to be used in a read operation. Write concerns allow you to adjust the level of acknowledgment when writing to the database, thereby ensuring that data integrity is maintained. In MongoDB 4.x, you are able to set read preferences and write concerns at the transaction level, which in turn influences the individual document operations within the transaction.
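Here is a minimal mongo shell sketch of a multidocument transaction with the read concern and write concern set at the transaction level; the hr database, the payroll and ledger collections, and the field names are all hypothetical:

session = db.getMongo().startSession();
session.startTransaction({
    readConcern: { level: "snapshot" },
    writeConcern: { w: "majority" }
});
try {
    var payroll = session.getDatabase("hr").payroll;
    var ledger  = session.getDatabase("hr").ledger;
    // Both writes commit together or not at all
    payroll.updateOne({ empId: 1001 }, { $inc: { paidYtd: 5000 } });
    ledger.insertOne({ empId: 1001, amount: 5000, period: "2020-06" });
    session.commitTransaction();
} catch (error) {
    // Any failure rolls back every operation in the block
    session.abortTransaction();
    throw error;
} finally {
    session.endSession();
}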

In MongoDB version 4.2 and above, the 16 MB limit on transaction size is removed. Also, as of MongoDB 4.2, full support for multidocument transactions is added for sharded clusters. In addition, full transaction support is extended to replica sets whose secondary members are using the in-memory storage engine (https://docs.mongodb.com/manual/core/inmemory/).

Replica sets are discussed in Chapter 13, Deploying a Replica Set. Sharded clusters are covered in Chapter 15, Deploying a Sharded Cluster.
Nonblocking secondary reads

MongoDB developers have often included read preferences (as mentioned previously) in their operations in order to shift the burden of response from the primary server in a replica set to its secondaries instead. This frees up the primary to process write operations. Prior to version 4, MongoDB blocked such secondary reads while an update from the primary was in progress. The block ensured that any data read from the secondary would appear exactly the same as data read from the primary.

The downside to this approach, however, was that while the block was in place, the secondary read operation had to wait, which in turn negatively impacted read performance. Likewise, if a read operation was requested prior to a write, it would hold up update operations between the primary and secondary, negatively impacting write performance.

Because of internal changes introduced in MongoDB 4.x to support multidocument transactions, storage engine timestamps and snapshots are now used, which has the side effect of eliminating the need to block secondary reads. The net effect is an overall improvement in consistency and lower latency in terms of reads and writes. Another way to view this enhancement is that it allows an application to read from a secondary at the same time writes are being applied without delay.
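As a minimal sketch, assuming a replica set connection and a hypothetical orders collection, the following read is directed at a secondary; under MongoDB 4.x it no longer waits for in-flight replicated writes to be applied:

// Direct this query to a secondary member of the replica set
db.orders.find({ status: "shipped" }).readPref("secondary")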

In-progress index build interruption

A big problem with versions of MongoDB prior to 4.4 is that the following commands error out if an index build (https://docs.mongodb.com/master/core/index-creation/#index-builds-on-populated-collections) operation is in progress:

  • db.dropDatabase()
  • db.collection.drop()
  • db.collection.dropIndexes()

In MongoDB 4.4, when this happens, an attempt to abort the in-progress index build is made. If successful, the index build is halted and the drop*() operation continues without error. In the case of a drop*() performed on a replica set, the abort attempt is made on the primary. Once the primary commits the abort, it is then replicated to the secondaries.
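A brief sketch of the behavior, assuming a hypothetical events collection with a long-running index build started from a separate shell session:

// Session 1: a long-running index build on a large collection
db.events.createIndex({ payload: 1 })

// Session 2, while the build is still in progress:
// prior to 4.4 this command errored out; in 4.4 it attempts to abort
// the index build and, if successful, drops the indexes without error
db.events.dropIndexes()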

Other noteworthy new features

There are a number of other new features that do not represent a massive paradigm shift, but are extremely useful nonetheless. These include improvements to the aggregation pipeline, field-level encryption, the passwordPrompt() function, and wildcard indexes. Let's first have a look at aggregation pipeline improvements.

Aggregation pipeline type conversions

Another major new feature we discuss here involves the introduction of $convert, a new aggregation pipeline operator. This operator allows the developer to change the data type of a document field while the document is being processed in the pipeline. Target data types include double, string, Boolean, date, integer, long, and decimal. In addition, you can convert a field in the pipeline to the data type objectId, which is critically useful when you need direct access to the autogenerated unique identification field _id. For more information on the aggregation pipeline and $convert, go to https://docs.mongodb.com/master/core/aggregation-pipeline/#aggregation-pipeline.
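As a brief illustration, assuming a hypothetical orders collection whose price field was stored as a string, $convert can cast it to decimal inside the pipeline, with onError and onNull supplying fallback values:

db.orders.aggregate([
    { $addFields: {
        price: {
            $convert: {
                input: "$price",
                to: "decimal",
                onError: null,   // value used if the cast fails
                onNull: null     // value used if the field is missing or null
            }
        }
    } }
])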

Client-side field-level encryption

The official programming language drivers for MongoDB 4.2 now support client-side field-level encryption. The implications for security improvements are enormous. This enhancement means that your applications can now provide end-to-end encryption for transmitted data down to the field level. So you could have a transmission of data from your application to MongoDB that includes, for example, an encrypted national identification number mixed in with otherwise plain-text data.
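The following heavily abbreviated mongo shell (4.2+) sketch illustrates the idea; the key vault namespace, the local master key placeholder, and the hr.employees collection are all hypothetical, and the key-management calls are simplified:

var csfleOptions = {
    keyVaultNamespace: "encryption.__keyVault",
    kmsProviders: { local: { key: BinData(0, "<96-byte-base64-master-key>") } }  // placeholder
};
var conn = Mongo("mongodb://localhost:27017", csfleOptions);
// Simplified; additional arguments may be required depending on the KMS provider
var keyId = conn.getKeyVault().createKey("local");
var clientEncryption = conn.getClientEncryption();
// The national identification number is encrypted before it leaves the client
var encryptedNin = clientEncryption.encrypt(
    keyId, "123-45-6789", "AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic");
conn.getDB("hr").employees.insertOne({ name: "Jane Doe", nin: encryptedNin });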

Password prompt

In many cases, it is highly undesirable to include a hard-coded password in a mongo script. Starting with MongoDB 4.2, in place of a hard-coded password value, you can substitute the built-in JavaScript function passwordPrompt(). This causes the mongo shell to pause and wait for manual user input before proceeding; the password that is entered is then used as the password value. You can refer to https://docs.mongodb.com/master/reference/method/passwordPrompt/#passwordPrompt for more details on the function.
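For example, a user can be created without ever writing the password into the script (the reportUser account and the role shown here are hypothetical):

db.createUser({
    user: "reportUser",
    pwd: passwordPrompt(),   // shell pauses and prompts for the password
    roles: [ { role: "read", db: "reporting" } ]
})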

Wildcard indexes

Starting with MongoDB 4.2, support has been added for wildcard indexes (https://docs.mongodb.com/master/core/index-wildcard/#wildcard-indexes). This feature is useful for situations where the fields to be indexed are either not known in advance or vary from document to document. For situations where the field is known and well established, it's best to create a normal index. There are cases, however, where you have a subset of documents that contain a particular field otherwise lacking in other documents in the collection. You might also be in a situation where a field is added later, and where the DBA has not yet had a chance to create an index on the new field. In these cases, adding a wildcard index allows MongoDB to perform a query more efficiently.
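As a short sketch, assuming a hypothetical products collection whose documents carry an open-ended attributes subdocument:

// Index every current and future field under attributes
db.products.createIndex({ "attributes.$**": 1 })

// A query on any single attributes field can now use the wildcard index
db.products.find({ "attributes.color": "red" })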

Extended JSON v2 support

Starting with MongoDB 4.2, support for the Extended JSON v2 (https://docs.mongodb.com/master/reference/mongodb-extended-json/#mongodb-extended-json-v2) specification has been enabled for the following utilities (a sample follows the list):

  • bsondump
  • mongodump
  • mongoexport
  • mongoimport
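For example, exporting a hypothetical test.orders collection in canonical mode might look as follows, with each BSON type wrapped in its v2 representation:

mongoexport --db=test --collection=orders --jsonFormat=canonical --out=orders.json

A document such as { qty: 5, price: NumberDecimal("19.99") } is then written as:

{ "_id": { "$oid": "5dd66227f3a5e3a6c1000001" }, "qty": { "$numberInt": "5" }, "price": { "$numberDecimal": "19.99" } }
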
Improved logging and diagnostics

Starting with MongoDB 4.2, there are now five verbosity log levels, each revealing increasing amounts of information. Starting with MongoDB 4.0.6, you can set a threshold on the maximum time it should take for data to replicate between members of a replica set; an entry is written to the MongoDB log file if that time is exceeded.

A further enhancement to diagnostics capabilities includes additional fields that are added to the output of the db.serverStatus() command that can be issued from a Mongo shell.
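For instance, the verbosity of a single component can be raised at runtime and the serverStatus() output inspected from a mongo shell; the component and field chosen here are illustrative:

// Raise the replication component's verbosity (0 is the default level)
db.setLogLevel(2, "replication")

// Inspect one of the diagnostic sections reported by serverStatus()
db.serverStatus().opLatencies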

Hedged reads

MongoDB 4.4 adds the ability to perform a hedged read (https://docs.mongodb.com/master/core/sharded-cluster-query-router/#hedged-reads) on a sharded cluster. By setting a hedged read preference option, applications are able to direct read requests to servers in replica sets other than the primary. The advantage of this approach is that the application simply takes the first result response, improving performance. The potential cost, of course, is that if a secondary responds, the data might be slightly out of date.
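A minimal sketch, run against a mongos (4.4+) with a hypothetical orders collection; the empty tag-set document matches any eligible member:

// Request hedged reads explicitly for a non-primary read preference
db.orders.find({ status: "shipped" }).readPref("nearest", [ {} ], { enabled: true })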

TCP fast open support

MongoDB version 4.4 introduces support for TCP Fast Open (TFO) connections. For this to work, TFO must be supported by the operating system hosting MongoDB. The following configuration file (and command line) parameters have been added to enable and control support under the setParameter configuration option: tcpFastOpenServer, tcpFastOpenClient, and tcpFastOpenQueueSize. In addition, four new TFO-related information counters have been added to the output of the serverStatus() database command.

You can refer to https://tools.ietf.org/html/rfc7413 for more details on TFO. For more information on the parameters, you can refer to https://docs.mongodb.com/master/reference/parameters/#param.tcpFastOpenServer. Refer to https://docs.mongodb.com/master/reference/command/serverStatus/#serverstatus for more information on serverStatus.
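A hypothetical mongod configuration file (YAML) fragment enabling the feature might look as follows; the queue size shown is illustrative, and the host operating system must also support TFO:

setParameter:
   tcpFastOpenServer: true
   tcpFastOpenClient: true
   tcpFastOpenQueueSize: 1024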
Natural sort

MongoDB version 4.4 introduces a new operator, $natural, which is used in a cursor.hint() operation. This operator causes the results of a sort operation to return a list in natural (also called human-readable) order. As an example, take these values:

['file19.txt','file10.txt','file20.txt','file8.txt']

An ordinary sort would return the list in this order:

['file10.txt','file19.txt','file20.txt','file8.txt']

Whereas a natural sort would return the following:

['file8.txt','file10.txt','file19.txt','file20.txt']
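Note that, independently of the $natural hint, one way to obtain the human-readable ordering shown above in the mongo shell is a sort combined with a numeric-aware collation (available since MongoDB 3.4); the files collection and name field here are hypothetical:

// Compare numeric substrings as numbers, so file8 sorts before file10
db.files.find().sort({ name: 1 }).collation({ locale: "en", numericOrdering: true })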

Internal enhancements

The first enhancement we examine is related to nonblocking secondary reads (as mentioned earlier). After that, we cover shard migration and change stream enhancements.

Timestamps in the storage engine

One of the major new features introduced in MongoDB version 3 was the integration of the WiredTiger storage engine. Prior to December 2014, WiredTiger Inc. was a company that specialized in database storage engine technology. Its impressive list of customers included Amazon Inc. In December 2014, WiredTiger was acquired by MongoDB after partnering with them on multiple projects.

In MongoDB version 3, multidocument transactions were not supported. Furthermore, in the replication process (https://docs.mongodb.com/manual/replication/#replication), changes accepted by the primary server in a replica set were pushed out to the secondary servers in the set, a process largely controlled through the oplog (https://docs.mongodb.com/manual/core/replica-set-oplog/#replica-set-oplog) and programming logic outside of the storage engine. Simply stated, the oplog represents changes made to the database. When a secondary synchronizes with a primary, it copies the primary's oplog and then applies the changes to its own local copy of the database. The logic that controlled this process in MongoDB 3 was quite complicated and consumed resources that could otherwise have been used to satisfy user requests.
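You can inspect the oplog yourself from a mongo shell connected to a replica set member; this one-liner is a minimal sketch that fetches the most recent entry:

// The oplog is a capped collection in the local database
db.getSiblingDB("local").oplog.rs.find().sort({ $natural: -1 }).limit(1).pretty()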

In MongoDB 4, the internal update mechanism of WiredTiger, the storage engine, was rewritten to include a timestamp in each update document. The main reason for this change was to provide support for multidocument transactions. It was soon discovered, however, that this seemingly simple change could potentially revolutionize the entire replication process.

In MongoDB 4, much of the logic required to ensure data integrity during replication synchronization has now been shifted to the storage engine itself, which in turn frees up resources to service user requests. The net effect is threefold: improved data integrity, read operations producing a more up-to-date set of documents, and improved performance.

For an excellent in-depth explanation of how timestamps work in the WiredTiger storage engine, have a look at the video at https://www.mongodb.com/presentations/wiredtiger-timestamps-enforcing-correctness-in-operation-ordering-across-the-distributed-storage-layer, which features Dr. Michael Cahill, formerly of WiredTiger, Inc., now Director of Engineering at MongoDB.
Shard migration

Typically, DevOps engineers distribute the database into shards to support a massive amount of data. There comes a time, however, when the data needs to be moved. For example, let's say a host server needs to be upgraded or replaced. In MongoDB version 3.2 and earlier, this process could be quite daunting. In one documented case, a 500 GB shard took 13 days to migrate. In MongoDB 3.4, parallelism support was provided that sped up the migration process. Part of the reason for the improvement was that the chunk balancer logic was moved to the config server (https://docs.mongodb.com/manual/core/sharded-cluster-config-servers/#config-servers), which must be configured as part of a replica set. 

Another improvement, available with MongoDB 4.0.3, allows the sharded cluster balancer (https://docs.mongodb.com/manual/core/sharding-balancer-administration/#sharded-cluster-balancer) to preallocate chunks if zones and ranges have been defined, which facilitates rapid capacity expansion. DevOps engineers are able to add and remove nodes from a sharded cluster in real time. The sharded cluster balancer handles the work of rebalancing data between the nodes, thereby alleviating the need for manual intervention. This gives DevOps engineers the ability to scale database capacity up or down on demand. This feature is especially needed in environments that experience seasonal shifts in demand. An example would be a retail outlet that needs to scale up its database capacity to support increased consumer spending during holidays.

For more information on chunks, refer to https://docs.mongodb.com/manual/core/sharding-data-partitioning/#data-partitioning-with-chunks, and for more information on zones and ranges, refer to https://docs.mongodb.com/manual/core/zone-sharding/.
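A brief sketch of the zone setup that enables this preallocation, with hypothetical shard, zone, collection, and shard key names; with 4.0.3 and later, sharding the empty collection after the ranges are defined creates the chunks for each zone up front:

// Associate a shard with a zone
sh.addShardToZone("shardA", "us-east")

// Define the key range the zone should own
sh.updateZoneKeyRange(
    "sales.orders",
    { region: "US-E", _id: MinKey },
    { region: "US-E", _id: MaxKey },
    "us-east"
)

// Sharding the (empty) collection now preallocates chunks per zone range
sh.shardCollection("sales.orders", { region: 1, _id: 1 })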
Change streams

As the database is updated, changes are recorded in the oplog maintained by the primary server in the replica set, which is then used to replicate changes to the secondaries. Trying to read a list of changes via the oplog is a tedious and resource-intensive process, so many developers choose to use change streams (https://docs.mongodb.com/manual/changeStreams/#change-streams) to subscribe to all changes on a collection. For those of you who are familiar with software design patterns, this is a form of the publish/subscribe pattern.

Aside from their obvious use in troubleshooting and diagnostics, change streams can also be used to give an indication of whether or not data changes are durable.

What is new and different in MongoDB 4.x is the introduction of a startAtOperationTime parameter that allows you to specify the timestamp at which you wish to tap into the change stream. This timestamp can also be in the past, but cannot extend beyond what is recorded in the current oplog.
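A minimal sketch of subscribing to a collection's change stream from a point in time; the orders collection and the timestamp value are hypothetical, and the timestamp must still be covered by the current oplog:

var watchCursor = db.orders.watch([], {
    startAtOperationTime: Timestamp(1546300800, 1)  // seconds since epoch, ordinal
});
while (watchCursor.hasNext()) {
    printjson(watchCursor.next());  // one notification per change
}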

If you set another parameter, featureCompatibilityVersion, to 4.0, then change streams return their resume token data, used to restart the stream, in the form of a hex-encoded string, which gives you greater flexibility when comparing blocks of token data. An interesting side effect of this is that a replica set based on MongoDB 4.x could theoretically make use of a change stream token opened on a replica set based on MongoDB 3.6. Another new feature in MongoDB 4 is that change streams opened on multidocument transactions include the transaction number.