Jul 22, 2006

Resolution for concatenate column values issues

I would like to add a note before you go ahead and read this article. Please do not take this as an approach guaranteed to work. I have come across instances when the solution I suggested didn't work. Please read through this, purely from an academic perspective, and do not, I repeat, do not incorporate these suggestions in your production environment.

I am still keeping the article as I had originally written since I would like to maintain the arguements and the counter arguements.

Please read through the comments (given by Hugo Kornelis), to understand the complete picture. I would like to thank Hugo for his valuable inputs.

Finally, I would like to stress on the one point I had written in the article "Absence of proof is not proof of absence" :) Just because you don't see a reason for it to fail doesn't mean it won't.

On the lighter side, this reminds me of the conversation between Henry Wu and Ian Malcolm in Jurassic Park :)

"Henry Wu: You're implying that a group composed entirely of female animals will... breed?
Ian Malcolm: No, I'm simply saying that life, uh... finds a way. "

Feel free to give your comments. Happy Reading.
-Omni
-----------------------------------------------------------------------------------------------------

I am writing this post based on the a few issues pointed out by Anith Sen for using an approach for concatenating the column values (in SQL Server 2000). The approach that I suggested (in question) is given here.

Concatenate the values in a column in SQL Server 2000 and 2005

The examples, that I am going to use to explain on how to resolve the issue, are actually given, by Anith Sen and Umachander (Thanks to them), as issues with the "set based concatenation", in the SQL Server Discussion Forum.

Before I start it off, I want make clear that I am attempting to explain why are there issues in concatenation and what I think might be a solution to overcome the issues. Nevertheless, whether the approach I suggest works or not depends entirely on the query execution plan that you get for the query. And maybe after reading this if you cannot make it work, you will atleast know why it doesn't work :)

I am just trying to figure out why do we don't get the results as expected and maybe suggest a work around. Waiting for the experts inputs on this though :)

To start with, lets take the following query. It works fine..

declare @au_id varchar( 8000 )
set @au_id = ''
select @au_id = @au_id + a.au_id + ','
from pubs..authors as a
order by a.au_id
print @au_id


But, this doesn't.

declare @au_id varchar( 8000 )
set @au_id = ''
select @au_id = @au_id + a.au_id + ','
from pubs..authors as a
order by a.au_id + ','
print @au_id


If I analyze the query execution plan, i see that there is an index scan, followed by compute scalar, followed by a sort and then a select (from right to left, as always). Here the problem is, compute scalar computes 2 expressions. One for the value that is selected and assigned and one for the order by (appending a comma to the column and then use it for sorting). Since the order by expression has to be computed before the ordering, SQL Server brings in the other computation here (work of the optimizer :) and so @au_id + a.au_id + ',' is computed for all the rows with the initial value of @au_id( as ''). Since the sort happens after the computation, the value for @au_id is re-assigned for every row and will take the au_id of the last row (its the max here cos of the order).

Well, for this concatenation to work, what we need is just this:
In the query execution plan the select should happen immediately after the computation for the concatenation expression.

There are a few ways to do it.
1. Don't order by any column (Best of all)

2. No computation other than the appending should happen in the query. if you are ordering by a computed column (or getting the distinct column values), use that computation within an inline view.

Well, I would suggest the thumb rule is:
When you write your query, execute and check the execution plan, the last operation will be a select. Make sure the penultimate operation is the compute scalar (which does only the concatenation operation and nothing else), and if any other operation gets between these two, then your result is screwed.


I guess thats the bottom line. So to make the above query work. I would do something like this, to force the two computations to be split and bring the concatenation just before the select.

declare @au_id varchar( 8000 )
set @au_id = ''
select @au_id = @au_id + a.au_id + ','
from (select top 100 percent au_id, au_id + ',' as ord from pubs..authors
order by au_id + ',') as a
order by ord
print @au_id


Lets take the other examples. Check the query execution plans for the following as see when the compute scalar operation is with respect to the select.

This one doesn't work (why? check the QEP)..

DECLARE @Str VARCHAR(8000)
SET @Str = SPACE(0)
SELECT @Str = @Str + CustomerID + ','
FROM NorthWind..Orders
ORDER BY RIGHT(ShipAddress, 2)
PRINT @Str


Well, I can change it this way and it works (Check the QEP now)

DECLARE @Str VARCHAR(8000)
SET @Str = SPACE(0)
SELECT @Str = @Str + CustomerID + ','
FROM (select top 100 percent *,RIGHT(ShipAddress, 2) as ord from NorthWind..Orders
ORDER BY RIGHT(ShipAddress, 2)) as a
order by ord
PRINT @Str


The above strategy can be applied to all the queries that has an issue in the ordering or distinct. Here is an example for distinct, for it to work you may use it this way...

DECLARE @Str VARCHAR(8000)
SET @Str = SPACE(0)
SELECT @Str = @Str + c1.column_name from (select DISTINCT column_name
FROM pubs.INFORMATION_SCHEMA.COLUMNS where table_name = 'authors') AS c1
PRINT @Str


Can we be sure that this will work? There is nothing I know of now that will fail. Again, as someone said, absence of proof is not proof of absence :) I will just wait for the special cases, which is bound to come from someone.
But, more than the approach its the QEP that matters. I am sure it will work if we get the QEP as I had suggested.
Let me know if anybody thinks there is an issue with this approach. Let sort this once and for all.. its been a long time :)

6 comments:

Hugo Kornelis said...

Hi Omnibuzz,

>>But, more than the approach its the QEP that matters. I am sure it will work if we get the QEP as I had suggested.
Let me know if anybody thinks there is an issue with this approach<<

I think there is an issue. Getting the QEP right is nice - but what guarantee do you have that you get the same QEP when running the query on the production box? More memory, more processors, bigger tables, different statistics, more simultaneous users - all circumstances are different, so you have a fair chance of ending up with a completely different QEP!

And even if you do get the correct QEP on production, what guarantee do you have that you'll get the same QEP tomorrow? Or next week? Again, changing in available memory of other concurrently executing queries can lead to a different QEP.

In SQL Server 2000, the only guaranteed safe way to concatenate is to use a cursor-based approach. Even though many of the set-based appproaches give the correct result 99.99% of the times, that's still not enough (to me, at least, in an application that will be used in production - YMMV).

Best regards,

Hugo Kornelis

Omnibuzz said...

Hi Hugo,
Thanks for the insight. I know what I am going to say might sound ridiculous, but I am putting it across anyways. The optimization engine is built based on a defined set of rules. And though it can change the QEPs it cannot change the rules. When you say there will be a 0.01% of uncertainity in getting the result right, I would like to know what this uncertainity might be? Let me take this one at a time. Can we take the following two cases to be always true?
1.Any ordered result (with an order specified) will not change how many ever users or processors there are.
2.A select query with only computations in the columns selected will always be done just before the select.

Well, the first one is pretty straight forward and cannot be questioned :)

But I understand that the second point is not so straight forward. But can that be proved wrong. Any query. Or anybody who had seen it happening otherwise.

Please understand I am just trying to understand how the engine thinks. If there is a QEP that disproves this point, I would like to analyse it.

Thanks again for your input, Hugo. Appriciate it.

Hugo Kornelis said...

Hi Omnibuzz,

First: don't take the 99.99% too literally. I just filled in an arbitrarily chosen high number. The point I was trying to make is that even though it usually works, it is not guaranteed and hence not good for production.

>> 1.Any ordered result (with an order specified) will not change how many ever users or processors there are.

Correct. Provided that the ORDER BY is used in the outermost SELECT. An ORDER BY in a view or in a derived table can (and often will) be ignored.

>> 2.A select query with only computations in the columns selected will always be done just before the select.

Nope. AFAIK, this is not documented behaviour - so even if you are seeing it now, that's not saying anything.

You're asking for queries to prove me wrong, but you've provided them yourselves. The second query in your blog entry does a sort after the assignment on both SQL Server 2000 and 2005. The third query (the one you claim to be working correct on both 2000 and 2005) does indeed work on SQL Server 2000, but it doesn't work on SQL Server 2005 - at least not on my test system. I get the exact same execution plan for the second and third queries, and the exact same output.

>> Please understand I am just trying to understand how the engine thinks. If there is a QEP that disproves this point, I would like to analyse it.

Problem is, this won't work. Sure, you can tweak with the query until you do get something that works as expected on both your and my SQL Server 2005 test database. But how do you know if it also works on my neighbours computer? And ow do you know if you'll get the same execution plan tomorrow?

How the engine thinks is, for the most part, undocumented (though I believe that some aspects are covered by some whitepapers and some blogs from people on the SQL dev team). The optimizer can and will create different plans for the same query as circumstances (statistics, available memory, concurrent workloads, etc) change. Plus, the dev team is constantly tweaking the optimizer, so what appears to work now may well break when you upgrade to the next version, install a service pack, or even install a patch.

Bottom line is: never ever rely on undocumented behaviour in production. Many people have been bitten by this.
In SQL Server 6.5, people omitted an ORDER BY on a query with GROUP BY, because sorting was the only technique the optimizer had for grouping - and they had to pay the price when the dev team added other grouping operations in SQL Server 7.0.
In SQL Server 7.0 and 2000, people used SELECT TOP 100 PERCENT ... ORDER BY ... in views and omitted the ORDER BY in queries selecting from those views, because the optimizer always had the sort step as the final step in the execution plan anyway - and they paid the prices when the dev team added optimization techniques that eliminated or moved the sort step and their results were suddenly unordered.

And now, I'm afraid that people reading your blog will start relying on the undocumented behaviour of SELECT @x = @x + abc FROM def, and they might pay the price when they upgrade to the next version of SQL Server. Or to the next service pack. Or maybe even when they add extra memory to the server.....

Bottom bottom line: never ever rely on undocumented behaviour in production. And if you do, then at least make sure that it's not me carrying the pager at night. <g>

Best regards,

Hugo Kornelis

Omnibuzz said...

Hi Hugo,
Thanks for the insight. Thats what I needed to know. And I completely agree with you.

>And now, I'm afraid that people reading your blog will start relying on the undocumented behaviour of SELECT @x = @x + abc FROM def, and they might pay the price when they upgrade to the next version of SQL Server. Or to the next service pack. Or maybe even when they add extra memory to the server.....<

Don't worry about this. i didn't start this post with the intention of proving that my method is right. I wanted a definite yes or no. And I got NO for the answer. And your comment stays there for people to see :)

Thanks a lot for your time and patience.

With High Regards
-Omni

Mahesh said...

It does not work when the select is from a join of two tables. I found this and was surprised that this was provided as a solution. Mine is a many to many situation and hence there are 3 tables. function selects from 2 tables. I did not bother to check the QEP, but will do that.

I could have of course taken the result and run a concat query on that but now i think im better off concatnating in the front end.

Like someone on the blog said
"Absence of proof is not proof of absence"
-Mahesh

Mahesh said...

Just checked the query execution plan and the Compute Scalar comes before the NestedLoops/Inner Join Step, so the instead of comma seperated i get only the first child row, feel good for learning something new today. Thanks Guys!

Post a Comment