Jun 6, 2006

Understanding Outer Joins

Most of the times when we talk about the ANSI join syntax, we say that we use the ON clause for the joining columns and where for filter. Though there is no rule that we need to include joining columns in the on clause or the filter in the where clause. It is only for readability.

For inner joins, where exactly the join or the filter conditions appear do not have any impact on the result. But for outer join it does.

To understand it we need to know what the on clause and where clause means in the outer join.

Lets say table1 is left outer joined to table2 based on a few conditions.

  • All the rows in table1 is taken and the ON clause filter is applied to the rows in table2.
  • If all the conditions in the ON clause is met then the corresponding row in table2 is returned, else its not.
  • The filter is applied on table1 only if there is one in the WHERE clause.
Lets see what is the difference in having the filter in the ON clause and the WHERE clause with an example.

Let's say we create 2 temp tables
create table #temp1 (id int)
insert into #temp1
select 1
union all select 2
union all select 3
union all select 4
union all select 5

create table #temp2 (id int)
insert into #temp2
select 1
union all select 3
union all select 5


Now, Look at the following query.

select a.id as a_id, b.id as b_id
from #temp1 a left outer join #temp2 b
on a.id = b.id
where a.id <5


The result of this query is
a_id b_id
------- --------
1.00 1.00
2.00 NULL
3.00 3.00
4.00 NULL

Here the result is fairly straight-forward.

And If I want all the IDs selected from #temp1 but only those rows from #temp2 where a.id<5

select a.id as a_id, b.id as b_id
from #temp1 a left outer join #temp2 b
on a.id = b.id
and a.id <5



Here, the result is

a_id b_id
------- --------
1.00 1.00
2.00 NULL
3.00 3.00
4.00 NULL
5.00 NULL

Here, I am selecting all the rows from #temp1, but only those rows from #temp2 where the filter on #temp1 is true.

Now take this query where you join just on a filter:

select a.id as a_id, b.id as b_id
from #temp1 a left outer join #temp2 b
on a.id <5

The result goes this way:


a_id b_id
----------- -----------
1 1
1 3
1 5
2 1
2 3
2 5
3 1
3 3
3 5
4 1
4 3
4 5
5 NULL


The result is actually a cross product which , as stated by Tom Cooper, is exactly what the rules of SQL require since the select statement says match every row in a to every row in b where a.id < id =" 5">

And to that query if I join in WHERE clause, like this.

select a.id as a_id, b.id as b_id
from #temp1 a left outer join #temp2 b
on a.id <5

where a.id = b.id


The condition a.id = b.id is applied to the above result set and there are only two matches (since NULL doesn’t match to anything).
The result is that of an INNER JOIN with filter. You get this.

a_id b_id
------- --------
1.00 1.00
3.00 3.00

Just proves a few points here.. Most of them obvious.
1. There is no such rule as join conditions go in ON clause and filter in WHERE.
2. Where the filter goes depends on when and what has to be filtered.
3. For outer joins, Think carefully of the requirement and use the the join and filter conditions at appropriate places.
4. Unless you are very clear on how the outer join works, its very easy to screw up :)



Here is the order of processing the select query, as quoted by Tom Cooper (I couldn't have done it better, so I am giving it as is)
1. Apply the join clause(s), getting an intermediate result that has all the columns from both tables and all the rows which meet the ON clause criteria.
2.After the join clause has been applied and the intermediate result built, use the WHERE clause to filter out rows in the intermediate result that are not wanted. Any reference in the where clause is to this intermediate result, not the original tables.
3. If you have a GROUP BY clause, the rows left in the first intermediate result are grouped together by the group by expressions along with any desired aggregate function values (like SUM() or MAX()). This generates a second intermediate result and the first intermediate result is discarded.

4.If you have a HAVING clause, it is applied to the second intermediate result and any unwanted rows are filtered out.
5.The SELECT list is built from the intermediate result and the intermediate result is discarded.
6.The ORDER BY clause is applied to the SELECT list and the result is returned.

I should thank Tom Cooper for pointing out a few issues in this article and clarifying my doubts in the process.


1 comments:

Anonymous said...

Really interesting article! Thanks.

Post a Comment