Monday, June 08, 2015

Behavior of ORDER BY + UNION Can Be Confusing

This is an interesting example and it exposes a quirk with the concept of a UNION, which is, that the various subsets of a UNION cannot be guaranteed to be sorted by their subset character.

The difference is important to understand - it is the difference between a result set being determinant (in this case, predictably ordered) or not. It's also the difference between a syntax error and a properly ordered UNION'd result set.

Here's the key: after two results sets have been UNION’d, you cannot independently sort the rows in the first query, then the results in the second query, because the origin rowset is lost.

Here's a demonstration:
DECLARE @tbl TABLE ( id INT primary key, val VARCHAR(10) )
INSERT @tbl
( id, val )
VALUES 
( 3, 'Last' ),
( 1, 'First' ),
( 2, 'Beta' )
 
-- Works
SELECT *
FROM (
  SELECT TOP 1 *
FROM @tbl t
ORDER BY t.id DESC
  UNION ALL
  SELECT TOP 1 *
FROM @tbl t
ORDER BY t.id DESC
) x
id val
3 Last
3 Last
-- Same intent, but throws a syntax error
SELECT TOP 1 *
FROM @tbl t
ORDER BY t.id DESC
UNION ALL
SELECT TOP 1 *
FROM @tbl t
ORDER BY t.id DESC
Msg 156, Level 15, State 1, Line 25
Incorrect syntax near the keyword 'UNION'.

-- Note that the ORDER BY is sorting the post-union result set, but that our TOP 1 is non-determinant.
SELECT TOP 1 *
FROM @tbl t

UNION ALL
SELECT TOP 1 *
FROM @tbl t
ORDER BY t.id DESC

id val
1 First
1 First



Note that the last line in last query, is not sorting the bottom reference to @tbl, but rather is sorting the post-union result set.

Look at that values we're getting in the last subquery. Why the value of 1?

Note that I have a primary key on the table variable, which by default creates a clustered index. This is not the "default sort order" of an unsorted query return, but in this particular and highly simplistic case, it does provide us a sort when getting a TOP 1. Without the PRIMARY KEY in the DECLARE, both of these valid-syntax queries return 3 and 3, because it's now an unsorted heap.

Move the PRIMARY KEY clustered index to the val column, and the last query results in 2 and 2, because 'Beta' is now the first row returned, when sorted (ascending, by default).

The default sort for an query, without an ORDER BY in a query, will be non-determinant from query to query. In some queries, a nonclustered index or a joined table or other operation could provide the sort. In order words, sort order without an ORDER BY may be sometimes predictable, not reliable.

So, why are ORDER BY's allowed after each UNION statement in a subquery?  The ordering of the post-union result set can now be provided with an ORDER BY outside of the subquery, so I’m speculating that the syntax allows it.  Inside a subquery, we can assume that the ORDER BY for each query in a UNION is actually not referring to the post-UNION result set.

That’s weird, but wrapping the union statement in a subquery is not “hiding an error” as it may appear to be, it’s actually allowing you to now define three different ORDER BY’s if you want it – one for each pre-union result set and one for the post-union set.

Without parenthesis, there’s no syntax to support those three different ORDER BY’s, there is only one post-union result set ORDER BY.

So, why make this a blog post?

Because one of the developers I work with found the last query above in legacy code for an application they are supporting. A real developer made the very real mistake of using a TOP without an ORDER BY in each half of a UNION (of two far more complicated queries than the above examples), resulting in unpredictable and inconsistent results that also failed to enforce the business logic that was required.

Remember, NEVER use a TOP without an ORDER BY, and be aware of how a UNION can make this tricky!

1 comment:

Unknown said...

SELECT TOP 1 *, ROW_NUMBER() OVER (ORDER BY id DESC) OrderBy
FROM @tbl t

UNION ALL
SELECT TOP 1 *, ROW_NUMBER() OVER (ORDER BY id) OrderBy
FROM @tbl t