Learn T-SQL Querying

Introducing the Cardinality Estimator

In Chapter 2, Understanding Query Processing, we discussed how the Query Optimizer is a fundamental piece of the overall query processor. In this chapter, we will dig deeper into the core component of cost-based query optimization: the Cardinality Estimator (CE).

As the name suggests, the role of the CE is to provide fundamental estimation input to the query optimization process. For example, the cardinality of a table that contains the name of every living human on Earth today is about 7,600,000,000. But if a predicate is applied on this table to find only inhabitants of the United States of America, the cardinality after the predicate is applied is only 327,000,000. Reading through 7,600,000,000 or 327,000,000 records may result in different data-access operations, such as a full scan or a range scan in this case. As such, early knowledge of the estimated number of rows is fundamental for creating an accurate query execution plan. It would be very inefficient if SQL Server had to incur the high cost of accessing actual data to make this estimation – that would be like executing the query in order to figure out how to execute the query. Instead, it uses metadata kept in statistics.

Statistics are the building blocks for the process of CE: if statistics don't accurately portray underlying data distributions, the Query Optimizer will work with inaccurate data and estimate cardinalities that don't adhere to the reality of the data.
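
A quick way to look at the data-distribution information a statistic holds is DBCC SHOW_STATISTICS, which returns the header, density vector, and histogram for a given statistics object. The following is a minimal sketch that assumes the AdventureWorks sample database (used later in this chapter) and the primary key statistic on the Person.Person table:

-- Return the header, density vector, and histogram of the statistic
-- backing the primary key of Person.Person; the histogram steps are
-- what the CE uses to estimate how many rows match a predicate.
DBCC SHOW_STATISTICS ('Person.Person', 'PK_Person_BusinessEntityID');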

To ensure statistics are kept updated, SQL Server keeps a modification counter on each table referenced by the statistic and, when enough changes have been made to the table or indexed view columns tracked by a statistic, an update to that statistic is needed. When a query is compiled or recompiled, SQL Server loads all required statistics based on which columns are being used and determines whether statistics need to be updated.
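
The modification counter that drives this decision can be observed with the sys.dm_db_stats_properties dynamic management function. The following is a minimal sketch, again assuming the AdventureWorks sample database and the Person.Person table:

-- List each statistic on Person.Person, when it was last updated, the
-- number of rows it covered, and how many modifications have
-- accumulated against its leading column since that update.
SELECT s.name AS statistics_name,
       sp.last_updated,
       sp.rows,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('Person.Person');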

If the database option for automatic statistics update is enabled (which is the default), SQL Server will update the outdated statistic before proceeding with query execution of any execution plan that referenced that statistic – this is known as a synchronous update. If asynchronous automatic statistics update is enabled, SQL Server will proceed with query execution based on the existing statistic as-is and update the outdated statistic as a background process. Once any statistics object has been updated, the next time any cached query plan that references that statistic is loaded for use, it is recompiled.
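
Both behaviors are controlled by database options. The following sketch, using the AdventureWorks sample database as an example, enables asynchronous automatic statistics updates and then verifies the settings; note that AUTO_UPDATE_STATISTICS_ASYNC only has an effect when AUTO_UPDATE_STATISTICS is also ON:

-- Keep automatic statistics updates on (the default) and make them
-- run asynchronously in the background.
ALTER DATABASE AdventureWorks SET AUTO_UPDATE_STATISTICS ON;
ALTER DATABASE AdventureWorks SET AUTO_UPDATE_STATISTICS_ASYNC ON;

-- Verify the current settings.
SELECT name, is_auto_update_stats_on, is_auto_update_stats_async_on
FROM sys.databases
WHERE name = 'AdventureWorks';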

Up to SQL Server 2014, unless trace flag 2371 is used, SQL Server uses a threshold based on the percentage of rows changed, regardless of the number of rows in the table. You can learn more about this threshold at https://docs.microsoft.com/sql/relational-databases/statistics/statistics, under the Statistics Options topic, in the AUTO_UPDATE_STATISTICS Option section.

Starting with SQL Server 2016 and database compatibility level 130, SQL Server uses an improved threshold that had already been introduced in earlier versions under trace flag 2371, and which adjusts to the number of rows in the table or indexed view. The threshold is the smaller of the SQL Server 2014 threshold and the square root of the product of 1,000 and the current table's cardinality. For example, if our table contains 1,000,000 rows, the calculation is SQRT(1000 * 1000000) = 31,622. When the table grows to 2,000,000 rows, the threshold is only 44,721 rows, whereas the SQL Server 2014 threshold would be 400,500 rows. With this change, statistics on large tables are updated more often, which decreases the chances of producing an inefficient query execution plan and, consequently, poor query performance.
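
The arithmetic for the 2,000,000-row example can be verified with a simple calculation. The pre-2016 threshold for tables with more than 500 rows is 500 rows plus 20 percent of the table, while the dynamic threshold is the square root of 1,000 times the table cardinality:

-- Compare the two automatic-update thresholds for a given table size.
DECLARE @rows BIGINT = 2000000;   -- table cardinality used in the text

SELECT @rows                 AS table_rows,
       500 + (0.20 * @rows)  AS sql2014_threshold,   -- 400,500 rows
       SQRT(1000.0 * @rows)  AS dynamic_threshold;   -- ~44,721 rows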

Database compatibility level is a setting that tells SQL Server to execute T-SQL statements in that database using the same functional and query optimization behaviors that were the default for a given database engine version. For example, SQL Server 2016 introduced database compatibility level 130 and a set of new default behaviors, but setting database compatibility level 120 forces functional and query optimization behaviors that were the default in SQL Server 2014, which maps to the version when database compatibility level 120 was introduced.
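
The database compatibility level is set per database and can be checked in the sys.databases catalog view. The following sketch uses the AdventureWorks sample database as an example:

-- Make the database behave as it would by default on SQL Server 2014.
ALTER DATABASE AdventureWorks SET COMPATIBILITY_LEVEL = 120;

-- Confirm the setting.
SELECT name, compatibility_level
FROM sys.databases
WHERE name = 'AdventureWorks';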

The CE operates on mathematical models based on certain assumptions about the T-SQL statements that will be executed. These assumptions are considered during computations to produce reasonable predictions about how many rows are expected to flow through each plan operator, and these predictions are used in the query optimization process to estimate the cost of each query plan.

CE 70, as introduced back in SQL Server 7.0, used four basic assumptions about how users query their data:

  • Independence assumption: Data distributions on different columns of the same table are assumed to be independent of each other, and so are predicates on different columns of the same table. For example, in a fictitious database for a large retail store chain where customer data is stored, a report shows which customers exist per store location, using a query like the following example: SELECT * FROM Customers WHERE FirstName = 'James' AND City = 'San Francisco'. We can assume that a person's first name has little to do with the city they live in, so treating these two columns as independent is reasonable (see the arithmetic sketch after this list). 
  • Uniformity assumption: Distinct values are assumed to be evenly distributed within a given histogram step, all with the same frequency.
  • Simple Containment assumption: Join predicates are assumed to be dependent on filter predicates. When users join different tables and also set filter predicates on those tables, it's assumed that the filters apply to the joined data, and they are taken into account when estimating the number of rows returned by the join. For the example of a fictitious database for the same large retail store chain, different tables record items sold and items returned, and a report shows the number of returns per sold item type and date, using a query like the following example: SELECT * FROM Sales INNER JOIN Returns ON Sales.ReceiptID = Returns.ReceiptID WHERE Sales.Type = 'Toys' AND Returns.Date = '2019-04-18'. Throughout the year there is a fairly steady number of returns per item sold, so the estimation shouldn't change for any given day. However, when the query predicate changes to WHERE Sales.Type = 'Toys' AND Returns.Date = '2018-12-27', accounting for the filters can greatly impact the estimation, because in the days after Christmas it's expected that many toys are returned.
  • Inclusion assumption: For filter predicates where a column is compared with a constant (for example, WHERE col1 = 10), it is assumed that the constant value actually exists in that column.
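
To illustrate the independence assumption with some arithmetic, consider the FirstName and City predicates from the first bullet. Under CE 70, the selectivity of each predicate is estimated from statistics and the two selectivities are simply multiplied. The row count and selectivities below are hypothetical figures chosen for illustration, not values taken from a real statistics object:

-- Hypothetical figures for the Customers example.
DECLARE @table_rows BIGINT = 1000000;  -- rows in Customers
DECLARE @sel_name   FLOAT  = 0.01;     -- 1% of rows have FirstName = 'James'
DECLARE @sel_city   FLOAT  = 0.03;     -- 3% of rows have City = 'San Francisco'

-- Independence: the combined selectivity is the product of the two,
-- so the estimate is 1,000,000 * 0.01 * 0.03 = 300 rows.
SELECT @table_rows * @sel_name * @sel_city AS estimated_rows_ce70;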

But application workloads don't always follow the model assumptions, which can result in inefficiently-optimized query execution plans.

We will discuss some out-of-model T-SQL constructs in more detail in Chapter 6, Easily Identified T-SQL Anti-Patterns, and Chapter 7, Discovering T-SQL Anti-Patterns in Depth.

Observation of, and experience with, query performance led to a major redesign of the CE, with the release of SQL Server 2014 and CE 120.

The main objectives of this new CE were as follows:

  • To improve the quality of cardinality estimation for a broad range of queries and modern workloads, such as Online Transaction Processing (OLTP), Data Warehousing (DW), and Decision Support Systems (DSS)
  • To generate more efficient and predictable query execution plans for most use cases, especially complex queries

With that new release, some model assumptions about how users queried their data were changed:

  • Independence became partial Correlation, where different column values are not necessarily independent of each other, which is assumed to resemble real-life data-querying patterns more closely. For the example of a fictitious database for a large retail store chain where customer data is stored, a report lists the names of all customers using a query like the following example: SELECT * FROM Customers WHERE FirstName = 'James' AND LastName = 'Kirk'. We can assume a tight correlation between a customer's first and last names, meaning that while there are many Jameses, there are not many James Kirks (see the sketch after this list).
  • Simple Containment became Base Containment, meaning that filter predicates and join predicates are assumed to be independent. For the example of a fictitious database for the same large retail store chain, the HR department runs a report that shows the base salary for full-time employees, using a query like the following example: SELECT * FROM Payroll INNER JOIN Employee ON Payroll.EmployeeID = Employee.EmployeeID WHERE Payroll.CompType = 'Base' AND Employee.Type = 'FTE'. All employees have a base salary and the workforce is one-third FTEs, one-third part-time employees, and one-third contractors, so whichever employee type is queried, and whether or not the WHERE clause is present, the estimation wouldn't change, and the default base containment works best.
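
To contrast the two models with some arithmetic, take the FirstName and LastName predicates from the first bullet. Instead of multiplying the selectivities in full, CE 120 and higher softens the independence assumption using what is commonly described as exponential backoff: the most selective predicate is applied in full and each additional predicate contributes progressively less (its square root, then fourth root, and so on). The selectivities below are hypothetical figures chosen for illustration:

-- Hypothetical figures for the Customers example.
DECLARE @table_rows BIGINT = 1000000;  -- rows in Customers
DECLARE @sel_first  FLOAT  = 0.01;     -- FirstName = 'James'
DECLARE @sel_last   FLOAT  = 0.005;    -- LastName = 'Kirk' (most selective)

SELECT
    -- CE 70: full independence, so 1,000,000 * 0.01 * 0.005 = 50 rows.
    @table_rows * @sel_first * @sel_last       AS estimated_rows_ce70,
    -- CE 120+: exponential backoff, so 1,000,000 * 0.005 * SQRT(0.01)
    -- = 500 rows, a higher estimate that accounts for correlation.
    @table_rows * @sel_last * SQRT(@sel_first) AS estimated_rows_ce120;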

It's common to see these CE models referred to as the legacy CE and the new CE. These are in fact side-by-side implementations, more accurately referred to as CE 70, and CE 120 or higher. Being side by side means that developers can opt in to either CE version, as newer changes and enhancements are tied to the database compatibility level.
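
For example, a database running under a newer compatibility level can be reverted to CE 70 with a database scoped configuration, or a single query can be pinned to either model with a USE HINT query hint (available starting with SQL Server 2016 SP1). The following sketch uses the Person.Person table from AdventureWorks purely as an example:

-- Revert the entire database to the legacy CE (CE 70) while keeping the
-- newer compatibility level and its other behaviors.
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;

-- Pin a single query to the legacy CE...
SELECT COUNT(*) FROM Person.Person
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));

-- ...or to the CE version that matches the database compatibility level.
SELECT COUNT(*) FROM Person.Person
OPTION (USE HINT ('FORCE_DEFAULT_CARDINALITY_ESTIMATION'));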

We will see examples of how these CE assumptions cannot cover all workloads in Chapter 13, Managing Optimizer Changes with the Query Tuning Assistant.

CE versions are tied to the database compatibility level setting of the SQL Server version when it was first introduced. The following table contains a mapping reference between database compatibility levels and CE versions:

Database compatibility level | CE version | Introduced with
100                          | 70         | SQL Server 2008
110                          | 70         | SQL Server 2012
120                          | 120        | SQL Server 2014
130                          | 130        | SQL Server 2016
140                          | 140        | SQL Server 2017

This mapping between database compatibility levels and CE versions is especially useful when the topic is application certification. For example, if a given application were written and optimized for SQL Server 2012 (CE 70) and later upgraded as-is to SQL Server 2017 (CE 140), there's a chance that part of that application's workload may be susceptible to the model changes of a higher CE version, and as a result may perform worse than it did in SQL Server 2012. These types of performance regressions can be handled easily, and SQL Server includes several features designed to assist in overcoming them. We will discuss these further in Chapter 8, Building Diagnostic Queries Using DMVs and DMFs, through Chapter 12, Troubleshooting Live Queries, where we discuss how to assemble our query troubleshooting toolbox.

CE 120+ changes target mainly non-leaf-level operators that support logical operations, such as JOIN, UNION, GROUP BY, and DISTINCT. Other T-SQL constructs that only exist at runtime still behave the same, such as multi-statement table-valued functions (MSTVFs), table variables, local variables, and table-valued parameters. We will discuss these out-of-model constructs in Chapter 7, Discovering T-SQL Anti-Patterns in Depth.

The inverse is the more common case, though: without refactoring a query, CE 120+ can do a better job of optimizing a query plan than CE 70. For example, the AdventureWorks sample database has several tables with employee data. To write a query that returns the employee name and details such as contacts, address, and job title, a series of inner joins is used.

The query would look like the following example:

SELECT e.[BusinessEntityID], p.[Title], p.[FirstName], p.[MiddleName], 
p.[LastName], p.[Suffix], e.[JobTitle], pp.[PhoneNumber],
pnt.[Name] AS [PhoneNumberType], ea.[EmailAddress],
p.[EmailPromotion], a.[AddressLine1], a.[AddressLine2],
a.[City], sp.[Name] AS [StateProvinceName], a.[PostalCode],
cr.[Name] AS [CountryRegionName], p.[AdditionalContactInfo]
FROM [HumanResources].[Employee] AS e
INNER JOIN [Person].[Person] AS p
ON RTRIM(LTRIM(p.[BusinessEntityID])) = RTRIM(LTRIM(e.[BusinessEntityID]))
INNER JOIN [Person].[BusinessEntityAddress] AS bea
ON RTRIM(LTRIM(bea.[BusinessEntityID])) = RTRIM(LTRIM(e.[BusinessEntityID]))
INNER JOIN [Person].[Address] AS a
ON RTRIM(LTRIM(a.[AddressID])) = RTRIM(LTRIM(bea.[AddressID]))
INNER JOIN [Person].[StateProvince] AS sp
ON RTRIM(LTRIM(sp.[StateProvinceID])) = RTRIM(LTRIM(a.[StateProvinceID]))
INNER JOIN [Person].[CountryRegion] AS cr
ON RTRIM(LTRIM(cr.[CountryRegionCode])) = RTRIM(LTRIM(sp.[CountryRegionCode]))
LEFT OUTER JOIN [Person].[PersonPhone] AS pp
ON RTRIM(LTRIM(pp.BusinessEntityID)) = RTRIM(LTRIM(p.[BusinessEntityID]))
LEFT OUTER JOIN [Person].[PhoneNumberType] AS pnt
ON RTRIM(LTRIM(pp.[PhoneNumberTypeID])) = RTRIM(LTRIM(pnt.[PhoneNumberTypeID]))
LEFT OUTER JOIN [Person].[EmailAddress] AS ea
ON RTRIM(LTRIM(p.[BusinessEntityID])) = RTRIM(LTRIM(ea.[BusinessEntityID]));

With CE 70, the elapsed execution time for this query is 101,975 ms. But with the exact same query on the exact same database on CE 140, the elapsed execution time is only 103 ms.
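
To reproduce such a comparison on a database running under compatibility level 140, the elapsed time of each run can be captured with SET STATISTICS TIME, executing the query once as-is and once forced back to CE 70 with the query hint shown earlier:

-- Report CPU and elapsed time for each execution in the Messages tab.
SET STATISTICS TIME ON;

-- First run: execute the preceding query unchanged, so it uses the CE
-- version implied by the database compatibility level (CE 140 here).

-- Second run: append the following hint to the same query to force
-- CE 70, and compare the elapsed times reported for the two runs:
--   OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));

SET STATISTICS TIME OFF;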

As seen in the following screenshot, the query execution plans are radically different in shape, and, given the observed execution times, better optimized using newer versions of the CE.

The query plan shape for CE 70 is as follows:

The query plan shape for CE 140 is as follows:

The preceding two images depict the difference in plan shapes, as stated earlier; the text within the images does not need to be read.

We will revisit the preceding query example in greater depth in Chapter 4, Exploring Query Execution Plans, and Chapter 12, Troubleshooting Live Queries.