Problem
- We want to count rows based on two distinct criteria.
- We want to calculate 2 or more aggreagates over overlapping sets.
Setup script
USE tempdb; CREATE TABLE t1 ( Id INT IDENTITY PRIMARY KEY, Date datetime, Number int, Flag tinyint, ); BEGIN TRAN; SET NOCOUNT ON; DECLARE @i INT; SET @i = 0; WHILE ( @i
Sample 1
Return row count where Number3 = 1 and row count where Number3 = 2.
The simpliest solution is to write
SELECT COUNT(*) FROM t1 WHERE Flag = 1; SELECT COUNT(*) FROM t1 WHERE Flag = 2;
However, this approach scans the base table 2x times, yielding excessive IO – 2x 324 page reads on my machine.
Sample 2
Sum Number column for the following time spans: after April, 2008; after April 2009; after April 2010.
SELECT SUM(Number) FROM t1 WHERE Date > '2008-04-30'; SELECT SUM(Number) FROM t1 WHERE Date > '2009-04-30'; SELECT SUM(Number) FROM t1 WHERE Date > '2010-04-30';
The optimalizer used index seek this time, however still the same data are scaned multiple times (193 + 118 + 44 page reads).
Solution
Our aim is to scan the required rows only once, regardless of how many aggregates we want to calculate.
What would happen if we query the data and conditionaly include them in aggregates?
Sample 1 – solution
SELECT COUNT(CASE WHEN Flag = 1 THEN 1 END) as 'Flag1', COUNT(CASE WHEN Flag = 2 THEN 1 END) as 'Flag2' FROM t1
Only one table scan – 324 page reads.
Sample 2 – solution
SELECT SUM(CASE WHEN Date > '2008-04-30' THEN Number END), SUM(CASE WHEN Date > '2009-04-30' THEN Number END), SUM(CASE WHEN Date > '2010-04-30' THEN Number END) FROM t1;
Now, data is scaned once with 274 page reads. However, we can optimalize it further. All we need to do is to restrict the set to the rows needed by the largest date range.
SELECT SUM(CASE WHEN Date > '2008-04-30' THEN Number END), SUM(CASE WHEN Date > '2009-04-30' THEN Number END), SUM(CASE WHEN Date > '2010-04-30' THEN Number END) FROM t1 WHERE Date > '2008-04-30';
Great – 193 page reads.
Sumary
Conditional aggregation can be used to:
- optimalize overlapping aggregates which normally would require multiple data passes.
- non-overlaping aggregates grouped by a column with low selectivity.
- create custom aggregates.