A common beginner’s misconception about Common Table Expressions (CTEs) is that they produce a real, stored result set, like a temporary table or table variable. They don’t: a CTE is really just a way to simplify and encapsulate your code. For this month’s TSQL-Tuesday, focused on CTEs, I want to illustrate this difference with examples of how a (non-recursive) CTE can be either more or less efficient than a temporary table at accessing data.
Let’s start with a simple query that compares each product’s average unit price in a given year with its average price in the following year, written as both a CTE and a derived table:
- /* CTE Version */
- WITH SalesData as
- (
- SELECT sd.ProductId
- , SalesYr = YEAR(sh.OrderDate)
- , AvgPrice = Avg(UnitPrice)
- FROM AdventureWorks2008R2.sales.SalesOrderHeader sh
- JOIN AdventureWorks2008R2.sales.SalesOrderDetail sd
- ON sh.SalesOrderID = sd.SalesOrderID
- GROUP BY sd.ProductId,YEAR(sh.OrderDate)
- )
- SELECT s1.ProductId
- , s1.SalesYr as year1
- , s2.SalesYr as year2
- , s1.AvgPrice as Year1AvgPrice
- , s2.AvgPrice as Year2AvgPrice
- , s2.AvgPrice/s1.AvgPrice as Year2Change
- FROM SalesData s1
- INNER JOIN SalesData s2
- ON s1.Productid = s2.ProductId
- AND s1.SalesYr = s2.SalesYr -1
- /* Derived Table Version */
- SELECT s1.ProductId
- , s1.SalesYr as year1
- , s2.SalesYr as year2
- , s1.AvgPrice as Year1AvgPrice
- , s2.AvgPrice as Year2AvgPrice
- , s2.AvgPrice/s1.AvgPrice as Year2Change
- FROM (
- SELECT sd.ProductId
- , SalesYr = YEAR(sh.OrderDate)
- , AvgPrice = Avg(UnitPrice)
- FROM AdventureWorks2008R2.sales.SalesOrderHeader sh
- JOIN AdventureWorks2008R2.sales.SalesOrderDetail sd
- ON sh.SalesOrderID = sd.SalesOrderID
- GROUP BY sd.ProductId,YEAR(sh.OrderDate)
- ) s1
- INNER JOIN
- (
- SELECT sd.ProductId
- , SalesYr = YEAR(sh.OrderDate)
- , AvgPrice = Avg(UnitPrice)
- FROM AdventureWorks2008R2.sales.SalesOrderHeader sh
- JOIN AdventureWorks2008R2.sales.SalesOrderDetail sd
- ON sh.SalesOrderID = sd.SalesOrderID
- GROUP BY sd.ProductId,YEAR(sh.OrderDate)
- ) s2
- ON s1.Productid = s2.ProductId
- AND s1.SalesYr = s2.SalesYr -1
Both the CTE and the derived table generate the same execution plan:
Statistics IO for both queries is also identical:
- (347 row(s) affected)
- Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
- Table 'SalesOrderDetail'. Scan count 2, logical reads 2480, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
- Table 'SalesOrderHeader'. Scan count 2, logical reads 1372, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
So, from an execution standpoint, the derived table and CTE are essentially the same. Now, consider the same query using a temporary table:
- /* Aggregate Data in Temp Table */
- SELECT sd.ProductId
- , SalesYr = YEAR(sh.OrderDate)
- , AvgPrice = Avg(UnitPrice)
- INTO #TmpSales
- FROM AdventureWorks2008R2.sales.SalesOrderHeader sh
- JOIN AdventureWorks2008R2.sales.SalesOrderDetail sd
- ON sh.SalesOrderID = sd.SalesOrderID
- GROUP BY sd.ProductId,YEAR(sh.OrderDate)
- /*Return Sales Info */
- SELECT s1.ProductId
- , s1.SalesYr as year1
- , s2.SalesYr as year2
- , s1.AvgPrice as Year1AvgPrice
- , s2.AvgPrice as Year2AvgPrice
- , s2.AvgPrice/s1.AvgPrice as Year2Change
- FROM #TmpSales s1
- INNER JOIN #TmpSales s2
- ON s1.Productid = s2.ProductId
- AND s1.SalesYr = s2.SalesYr -1
Along with its query plan:
And Statistics IO:
- Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
- Table 'SalesOrderDetail'. Scan count 1, logical reads 1240, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
- Table 'SalesOrderHeader'. Scan count 1, logical reads 686, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
- (613 row(s) affected)
- (347 row(s) affected)
- Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
- Table '#TmpSales'. Scan count 2, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
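If you want to reproduce these numbers yourself, here is a minimal sketch of how the temp table version might be wrapped for measurement. It assumes the AdventureWorks2008R2 sample database is installed. The `OBJECT_ID` guard is my addition rather than part of the original script; it is there so the script can be rerun in the same session. Your read counts may differ from mine depending on version and data.

```sql
-- Sketch only: assumes the AdventureWorks2008R2 sample database is available.
SET STATISTICS IO ON;

-- Guard (my addition): SELECT ... INTO fails if #TmpSales already exists
-- in this session, so drop any leftover copy first.
IF OBJECT_ID('tempdb..#TmpSales') IS NOT NULL
    DROP TABLE #TmpSales;

-- Read and aggregate the base tables once.
SELECT sd.ProductID
     , SalesYr  = YEAR(sh.OrderDate)
     , AvgPrice = AVG(sd.UnitPrice)
INTO   #TmpSales
FROM   AdventureWorks2008R2.Sales.SalesOrderHeader sh
JOIN   AdventureWorks2008R2.Sales.SalesOrderDetail sd
  ON   sh.SalesOrderID = sd.SalesOrderID
GROUP BY sd.ProductID, YEAR(sh.OrderDate);

-- Self-join the small temp table instead of the base tables.
SELECT s1.ProductID
     , s1.SalesYr  AS year1
     , s2.SalesYr  AS year2
     , s1.AvgPrice AS Year1AvgPrice
     , s2.AvgPrice AS Year2AvgPrice
     , s2.AvgPrice / s1.AvgPrice AS Year2Change
FROM   #TmpSales s1
JOIN   #TmpSales s2
  ON   s1.ProductID = s2.ProductID
 AND   s1.SalesYr   = s2.SalesYr - 1;

SET STATISTICS IO OFF;
```

The IO messages appear on the Messages tab in Management Studio, one set per statement.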
Notice that in both the CTE and derived table versions, SalesOrderHeader and SalesOrderDetail are each scanned twice and the aggregation is calculated twice. The temporary table version reads each base table once, aggregates the results, and then uses the much smaller temporary table to produce the final results. As a result, the CTE version needs about twice the logical reads of the temporary table version (3,852 versus 1,932). If a CTE were truly a “result set” (as some authors and speakers have presented it), we would see SalesOrderHeader and SalesOrderDetail accessed only once, just as with the temporary table, and a similar IO cost. We don’t. Conclusion: a CTE is not a temp table or stored result set.
In this particular case, I’ve structured my queries so that the CTE is the less efficient way to access the data. There are times, however, when the query optimizer can take advantage of the CTE structure and find a more efficient way to access the data.
Here are two (oversimplified) queries to illustrate this point:
- /* CTE Version */
- WITH SalesData as
- (
- SELECT sd.ProductId
- , SalesYr = YEAR(sh.OrderDate)
- , AvgPrice = Avg(UnitPrice)
- , AvgOrderQty =AVG(OrderQty)
- FROM AdventureWorks2008R2.sales.SalesOrderHeader sh
- JOIN AdventureWorks2008R2.sales.SalesOrderDetail sd
- ON sh.SalesOrderID = sd.SalesOrderID
- GROUP BY sd.ProductId,YEAR(sh.OrderDate)
- )
- SELECT ProductId
- , AvgPrice
- , AvgOrderQty
- FROM SalesData
- WHERE SalesYr = '2006';
- /* Aggregate Data in Temp Table */
- SELECT sd.ProductId
- , SalesYr = YEAR(sh.OrderDate)
- , AvgPrice = Avg(UnitPrice)
- , AvgOrderQty =AVG(OrderQty)
- INTO #TmpSales
- FROM AdventureWorks2008R2.sales.SalesOrderHeader sh
- JOIN AdventureWorks2008R2.sales.SalesOrderDetail sd
- ON sh.SalesOrderID = sd.SalesOrderID
- GROUP BY sd.ProductId,YEAR(sh.OrderDate)
- /*Return Sales Info */
- SELECT ProductId
- , AvgPrice
- , AvgOrderQty
- FROM #TmpSales
- WHERE SalesYr = '2006'
Along with their Statistics IO:
- ***CTE***
- (132 row(s) affected)
- Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
- Table 'SalesOrderDetail'. Scan count 1, logical reads 285, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
- Table 'SalesOrderHeader'. Scan count 1, logical reads 686, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
- ***Temp Table***
- Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
- Table 'SalesOrderDetail'. Scan count 1, logical reads 1240, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
- Table 'SalesOrderHeader'. Scan count 1, logical reads 686, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
- (613 row(s) affected)
- (132 row(s) affected)
- Table '#TmpSales'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
The temporary table approach still pays the cost of aggregating all the data first, so loading it costs the same IO as it did in the first example. With the CTE, however, the optimizer can apply the WHERE clause while it expands the query. As a result, it aggregates less data and uses about half the IO involved in creating and reading the unfiltered temporary table (971 logical reads versus 1,930). The text showplan reveals this application of the WHERE clause:
- StmtText
- ----------------------------------------------------------------------------------------------------------------------------------------
- |--Compute Scalar(DEFINE:([Expr1005]=CASE WHEN [Expr1018]=(0) THEN NULL ELSE [Expr1019]/CONVERT_IMPLICIT(money,[Expr1018],0) END, [Expr1006]=CASE WHEN [Expr1018]=(0) THEN NULL ELSE [Expr1020]/CONVERT_IMPLICIT(int,[Expr1018],0) END))
- |--Hash Match(Aggregate, HASH:([sd].[ProductID]) DEFINE:([Expr1018]=COUNT(*), [Expr1019]=SUM([AdventureWorks2008R2].[Sales].[SalesOrderDetail].[UnitPrice] as [sd].[UnitPrice]), [Expr1020]=SUM([AdventureWorks2008R2].[Sales].[SalesOrderDetail].[OrderQ
- |--Merge Join(Inner Join, MERGE:([sh].[SalesOrderID])=([sd].[SalesOrderID]), RESIDUAL:([AdventureWorks2008R2].[Sales].[SalesOrderDetail].[SalesOrderID] as [sd].[SalesOrderID]=[AdventureWorks2008R2].[Sales].[SalesOrderHeader].[SalesOrderID] as [
- |--Clustered Index Scan(OBJECT:([AdventureWorks2008R2].[Sales].[SalesOrderHeader].[PK_SalesOrderHeader_SalesOrderID] AS [sh]), WHERE:(datepart(year,[AdventureWorks2008R2].[Sales].[SalesOrderHeader].[OrderDate] as [sh].[OrderDate])=(2006))
- |--Clustered Index Scan(OBJECT:([AdventureWorks2008R2].[Sales].[SalesOrderDetail].[PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID] AS [sd]), ORDERED FORWARD)
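One way to read that plan is that the filter has been moved inside the CTE body before the aggregation runs. The sketch below is my own illustration, not part of the original test: it asks for the text plan of the CTE query and of a hand-written query with the year filter placed directly in the WHERE clause. It assumes AdventureWorks2008R2 and a client that understands `GO` as a batch separator (Management Studio or sqlcmd). I would expect both plans to show the year predicate on the SalesOrderHeader scan, but the exact plan shape depends on your version and statistics.

```sql
-- Sketch only: assumes the AdventureWorks2008R2 sample database.
-- SET SHOWPLAN_TEXT must be the only statement in its batch; GO is a
-- client-side batch separator (SSMS, sqlcmd), not a T-SQL statement.
SET SHOWPLAN_TEXT ON;
GO

-- The CTE version: with SHOWPLAN_TEXT on, the plan is returned and the
-- query itself is not executed.
WITH SalesData AS
(
    SELECT sd.ProductID
         , SalesYr     = YEAR(sh.OrderDate)
         , AvgPrice    = AVG(sd.UnitPrice)
         , AvgOrderQty = AVG(sd.OrderQty)
    FROM   AdventureWorks2008R2.Sales.SalesOrderHeader sh
    JOIN   AdventureWorks2008R2.Sales.SalesOrderDetail sd
      ON   sh.SalesOrderID = sd.SalesOrderID
    GROUP BY sd.ProductID, YEAR(sh.OrderDate)
)
SELECT ProductID
     , AvgPrice
     , AvgOrderQty
FROM   SalesData
WHERE  SalesYr = 2006;
GO

-- Hand-written equivalent (my illustration): filter on the year first,
-- then group by product only, since the year is now a constant.
SELECT sd.ProductID
     , AvgPrice    = AVG(sd.UnitPrice)
     , AvgOrderQty = AVG(sd.OrderQty)
FROM   AdventureWorks2008R2.Sales.SalesOrderHeader sh
JOIN   AdventureWorks2008R2.Sales.SalesOrderDetail sd
  ON   sh.SalesOrderID = sd.SalesOrderID
WHERE  YEAR(sh.OrderDate) = 2006
GROUP BY sd.ProductID;
GO

SET SHOWPLAN_TEXT OFF;
GO
```

A temp table loaded without the filter cannot benefit from this, because the aggregation over every year has already been paid for by the time the WHERE clause is applied.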
So which approach is better, the CTE or temporary table? As always, it depends on your data and your use. For a little help deciding, why not check out the rest of the posts in this month’s TSQL-Tuesday?
