SQL Garbage Collector

SQL to ECL - Metamorphosis

2013-02-27T19:22:00.000-05:00

As the old saying goes, you can tell whether someone is a data guy just by the way they solve a problem:

An imperative programmer goes: for each order in the orders table, find the corresponding order details. . .

And the SQL programmer goes: take the orders table and join with the order details on the order ID key. . .

As a SQL programmer, you look at tables as sets and not as individual records. You do not want to bother about whether the join happens as a nested loop, hash, merge or map-reduce for that matter. You want the join to produce the result you intend (functionally) and let your query engine find the best way to do it based on the data distribution, size etc. That is the SQL programming style.
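To make the point concrete, here is a minimal sketch in Python (used only for illustration; the rows and function names are made up and are not part of any query engine). It runs the same logical join two ways, as a nested loop and as a hash join, and both give the same rows. Which one runs is exactly the decision you would rather leave to the engine.

```python
# Hypothetical data: (order_id, customer) and (order_id, item).
orders = [(1, "alice"), (2, "bob"), (3, "carol")]
order_details = [(1, "pen"), (1, "ink"), (3, "paper")]


def nested_loop_join(left, right):
    """Compare every order with every detail row."""
    return [
        (order_id, customer, item)
        for order_id, customer in left
        for detail_id, item in right
        if order_id == detail_id
    ]


def hash_join(left, right):
    """Build a lookup on the order ID first, then probe it once per order."""
    items_by_id = {}
    for detail_id, item in right:
        items_by_id.setdefault(detail_id, []).append(item)
    return [
        (order_id, customer, item)
        for order_id, customer in left
        for item in items_by_id.get(order_id, [])
    ]


nested_result = nested_loop_join(orders, order_details)
hashed_result = hash_join(orders, order_details)
```

The two strategies differ only in how much work they do, not in what they return, which is why a set-based language can hide the choice from you.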

Now, ECL is not SQL but has the SQL programming style. I find it to be a pseudo functional-declarative programming style with some object oriented concepts tossed in. Don’t pull your hair out. Not just yet!!!

In ECL, there are only 2 types of statements that I have come across.

Action: Something that produces an output.
Declaration: This is a single assignment (or, it cannot be re-assigned). These are called attributes in the ECL world.

In my experiments with ECL, I have created attributes that hold:
- a value (string, int, set, record set, table, file etc)
- a function or action
- a definition (more like a table definition)

Most ECL code is definitions. Your action can call a definition, or chain your definitions so that they cascade to a result. More on this later.

Of course, you noticed that there is no concept of a variable. And that is the only thing to “get” in ECL. If you really think about it, ECL gives you an amazing abstraction over all the threading and grid distribution that happens behind the scenes. How is it able to do that? By signing a contract with you that says a definition (or an attribute) does not change its state, ever, ever. So, the attribute points to the same thing whoever asks for it, or more specifically, whichever machine in the grid asks for it. There, it eliminated any race condition you might introduce in your code (which is common in the parallel computing world). As long as you adhere to that contract, you can tell HPCC what you need and it will get you the result. With me so far??

It’s not as hard as you think to code without variables. Purely functional languages like Haskell do not allow you to re-assign.

So, how do you work around this constraint?

Blunt answer: Create a new definition to hold the mutated value.
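Here is a minimal sketch of that idea in Python (this is not ECL; the order rows and names are hypothetical). Every name is defined once, and each "mutation" becomes a new definition built from an earlier one.

```python
# Hypothetical order rows as (order_id, status, amount).
# Tuples are used so that nothing can be changed in place.
orders = (
    ("A-1", "Active", 120),
    ("A-2", "Closed", 80),
    ("A-3", "Active", 45),
)

# Each name below is defined exactly once and never re-assigned.
active_orders = tuple(o for o in orders if o[1] == "Active")
discounted_orders = tuple(
    (order_id, status, amount - 10) for order_id, status, amount in active_orders
)
total_after_discount = sum(amount for _, _, amount in discounted_orders)
```

Because `orders` is never modified, anything that reads it sees the same rows no matter when or where it runs.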

Think about it, you don’t “really” use variables in the SQL world either.

Now, why definitions? If it's SQL programmer friendly, why not use SQL? When you try to answer this question, you will uncover the brilliance of this powerful language. Here is my attempt.

Rewind a few years: SQL Server 2005 introduced common table expressions (CTEs), or the “WITH” clause. What problem was it really trying to solve? Helping you generate a numbers table using recursion? No!!!

In production applications, the biggest use of CTEs was to remove code clutter. I was able to remove inline views and move them to the top of the query, which improved code readability.

Take this cooked-up query in SQL Server 2000:

SELECT *
FROM   OrderDetails A
       LEFT OUTER JOIN (SELECT productid,
                               Max(transactiondate) AS LatestDate
                        FROM   transactions
                        WHERE  orderid <> ''
                               AND type = 'Sell'
                        GROUP  BY productid
                        HAVING Year(Max(transactiondate)) > 2011) B
         ON B.productid = A.productid
WHERE  A.status = 'Active'

With a CTE, you can change it to:

WITH B
     AS (SELECT productid,
                Max(transactiondate) AS LatestDate
         FROM   transactions
         WHERE  orderid <> ''
                AND type = 'Sell'
         GROUP  BY productid
         HAVING Year(Max(transactiondate)) > 2011)
SELECT *
FROM   OrderDetails A
       LEFT OUTER JOIN B
         ON B.productid = A.productid
WHERE  A.status = 'Active'

This is cleaner. I am now looking at two queries: one that generates the table B and another that uses it. Come to think of it, wouldn't it be great if I could move each of the complexities into its own shell (or definition), so that each definition can be reused as needed? Like this:

Voila!!! This is exactly how the ECL code looks (well, it's about 95% close and the compiler will help you fix the rest). But that's about it. This will run just fine on a single node with gigabytes of data or on 100 nodes with petabytes of data. Welcome to the world of big data.

Once you get past this stage, you will quickly move beyond SQL scope. There is a world of constructs to handle any type or size of data. It's got some powerful detergents built-in to clean up the dirtiest of data you have, that will blow your mind.

And, just so you know, you can write the above query as a single line of code by substituting your definitions inline (recursively). That way you can rebuild your SQL Server 2000 query in ECL. But why you would want to do that is a different discussion!!!

ECL - Big data for the SQLly Inclined

2013-02-26T17:00:00.001-05:00

On the last project I worked on, we pushed the limits of the RDBMS (a misfit for our requirements), and I decided that next time I would look beyond SQL for my data needs. I started exploring the NoSQL world (MongoDB, Neo4j, Redis etc.) and understood that, had we been open to these technologies when we started our last project, we might have had a much easier life.

Eventually, I started seeing Hadoop everywhere: our company was talking about it, customers were talking about it, my friends were talking about it. I started learning the jargon around Hadoop (MapReduce, Hive, Sqoop, HDFS, HBase) and used to throw these words into my conversations along with a few zoo animals. I figured out that most of my friends were doing the same, and we had a happy ecosystem going on. But, deep down, I knew that I was ignoring the elephant in the room that was staring at me.

I read through the famed MapReduce research paper and was able to get the concept. But I was not able to get started playing with Hadoop. Setting up was easy and you can get the word count sample working in an hour. After that, I was stuck. I understood the power of what it can do, but I felt I did not know the right language to communicate with it. It's like someone asking me to write a web server in SQL. Of course you can do it, but I don't want to. To me, SQL is "the" reference implementation of a Domain Specific Language. The ease with which you can instruct your RDBMS to do a complex task is mind blowing, as long as you are operating within the problem domain.

In retrospect, I understood that the reason I was so reluctant to get into Hadoop was that I am not a technology guy (there, I said it). I like to solve logical problems (puzzles or problems, I don't care). From a problem solver's perspective, my problem statement does not change whether I am working with 1 record or 1 gazillion records. It does not change if I have 1 line of text or the entire world wide web to process. I wanted that abstraction. SQL was giving me that (almost), till the data spilled over to the next machine. So, I started looking for languages on top of Hadoop that could help me out. I looked at Pig and Hive and a few others, but I felt this was like LINQ for SQL: you can change the programming language but you cannot change the fundamental building blocks. Don't get me wrong. I love LINQ, but not so much when I have to write complex SQL queries in LINQ.

And so I started exploring options outside Hadoop. Come on, big data is such a "big" pie, and it will not be monopolized. Anyway, my search ended with ECL, a programming language for taming the supercomputing grid called High-Performance Computing Cluster (HPCC). It is open source, and installation was just like it was for SQL Server:
"Download the VM and download an IDE (looks like your SQL Server management studio). Connect to the server and get going."

I played with it for a few days. They have some tutorial videos on their site (google HPCC Systems). My interest was mainly because the programming style was so different, not similar to any language that I knew. Yet I was able to relate to it. I didn't have to skew my thinking to fit their programming style.

Also, it has been "the" programming language for HPCC for the last 10 years or so and has undergone a lot of refinement over the years. So, I knew that I could take a deep breath and give myself some time to understand this language.

Fast forward a few months... I am still in love with ECL. And some day, someone may write ECL for Hadoop. But, till then, I am taming big data, the ECL way.

If you have an hour or so, give ECL a try and let me know what you think.

There is a mental switch that you need to turn on to start thinking in ECL easily, and with that, it becomes pretty much like SQL, actually even more elegant in a few cases. I will write about it in my next post... breaking down a complex SQL query and building it in ECL. It should be a lot easier to understand, I hope.

Custom SQL Stored Procedure Best Practices Analyzer - SQL Cop, Maybe?

2010-04-16T06:26:00.002-04:00

Recently, I moved to Seattle as a Data Architect for a product our company was developing. I also started taking up most of the DBX roles in that project.

Though our architecture is predominantly CRUD based and we have our in-house CRUD proc generator, some business logic invariably seeps into the SPs. We had left it to the discretion of the developers whether to do the processing in the Data Layer or the Business Layer. I thought it would make sense to keep a rudimentary check on what logic was going into the SPs, to make sure our developers don't misuse that liberty and continue to follow the defined guidelines.

After a futile search for a tool like FxCop for SQL Server, I thought it would be fun to build a simple Stored Procedure validation tool that would serve the purpose. What follows is the process I followed to build the tool. The code is given at the end of this post. Feel free to use it, extend it and share it.

To start with, I wanted the validation to be done on 4 entities
1. Proc Name
2. Proc Definition
3. Input parameters
4. Output parameters

I decided to use the EVENTDATA generated by the DDL trigger to do the validations on the proc name and the definition, and to query the sys.parameters catalog view to do any validations on the proc parameters.

We will be doing 3 types of validations on the entities:
1. Verify that entity contains a specific text
2. Verify that entity doesn't contain a specific text
3. For names, verify the length of the value is within a min and max length

With the primary requirements stated, let's define the other constraints and requirements:
1. Version: SQL Server 2005.
2. The validation rules have to run every time a SP is compiled to check for best practices.
3. Existing validation rules can be updated and new ones added.
4. The administrator should be able to set whether non-compliance with a validation fails the compile or not (IsCriticalValidation).
5. Log every non-compliant SP that gets compiled, for reference.
6. There should be an option for the admin to override any validation for any SP.
7. CLR is disabled in the server (Just didn't want the solution to mandate CLR)

The above requirements compelled me to follow this design approach:
1. Use a DDL Trigger (Requirements: 1,2)
2. Have a table to store validation rules (Requirements: 2,3,4)
3. Have a table for logging non-compliance (Requirement: 5)
4. Have a table for storing SP exceptions (Requirement: 6)
5. Everything in T-SQL (Requirement: 7)

Let's start with building the tables:

-- Table for proc validation rules
CREATE TABLE dbo.__SYS_ProcValidationRules(
      RuleID smallint NOT NULL,
      RuleDescription nvarchar(256) NOT NULL,
      IsActive bit NOT NULL,
      SearchableEntity nvarchar(32) NULL,
      SearchType nvarchar(32) NULL,
      IsCaseSensitive bit NULL,
      RuleParameter sql_variant NULL,
      IsCriticalValidation bit NOT NULL, 
      PRIMARY KEY CLUSTERED (RuleID ASC)
)

Next, I added to this rules table a few best practices that we want our developers to follow. The table now looks like this:

Now we create the other tables for the sake of completeness. We will not be filling any data now.

--Table to hold SP exceptions
CREATE TABLE dbo.__SYS_ProcExceptions(
      SchemaName nvarchar(256) NOT NULL DEFAULT ('dbo'),
      ProcName nvarchar(1000) NOT NULL,
      ExceptionReason nvarchar(2000) NULL,
      ExceptionRuleID smallint NULL,
      CreateDate datetime NOT NULL DEFAULT (getdate()),
      CreateUser nvarchar(100) NOT NULL DEFAULT (suser_sname())
)

GO

--Table to maintain log on non-compliant SPs created
CREATE TABLE dbo.__SYS_ProcNonCompliance(
      SchemaName nvarchar(256) NOT NULL DEFAULT ('dbo'),
      ProcName nvarchar(1000) NOT NULL,
      RuleDescription nvarchar(256) NOT NULL,
      IsException bit NOT NULL,
      CreateDate datetime NOT NULL DEFAULT (getdate())
)

We now create the function that accepts the rule and the entity value and returns whether the rule passed or not.

-- Function to evaluate any rule
CREATE FUNCTION dbo.__SYS_ValidateRule(@in_SearchText varchar(MAX), 
@in_SearchType varchar(32),
@in_IsCaseSensitive bit, 
@RuleParameter sql_Variant)
RETURNS BIT
WITH ENCRYPTION
AS
BEGIN
    IF @in_SearchType = 'MAX LEN' AND (LEN(@in_SearchText) <= CAST(@RuleParameter AS INT))
        RETURN(1);
    IF(@in_SearchType = 'MIN LEN' AND LEN(@in_SearchText) >= CAST(@RuleParameter AS INT))
        RETURN(1);
    IF(@in_SearchType = 'LIKE')
    BEGIN
        IF(@in_IsCaseSensitive = 0 AND @in_SearchText LIKE CAST(@RuleParameter AS VARCHAR(256)))
        RETURN(1);
        IF(@in_IsCaseSensitive = 1 AND @in_SearchText COLLATE Latin1_General_BIN LIKE 
CAST(@RuleParameter AS VARCHAR(256)) COLLATE Latin1_General_BIN)
        RETURN(1);
    END
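    IF(@in_SearchType = 'NOT LIKE')
    BEGIN
        IF(ISNULL(@in_IsCaseSensitive, 0) = 0 AND @in_SearchText NOT LIKE CAST(@RuleParameter AS VARCHAR(256)))
        RETURN(1);
        IF(@in_IsCaseSensitive = 1 AND @in_SearchText COLLATE Latin1_General_BIN NOT LIKE 
CAST(@RuleParameter AS VARCHAR(256)) COLLATE Latin1_General_BIN)
        RETURN(1);
    END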
    
    RETURN(0);
END
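To show how the rule types behave on sample input, here is a rough Python sketch of the same checks (an illustration only, not the T-SQL function). The sample proc names and patterns are hypothetical. The LIKE translation handles only the % and _ wildcards, not the [ ] character classes that T-SQL also supports, and it applies the case-sensitivity flag to NOT LIKE as well.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern (% and _ only) into a regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches any run of characters
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def validate_rule(text, search_type, case_sensitive, parameter):
    """Return True when the entity value satisfies the rule."""
    if search_type == "MAX LEN":
        return len(text) <= int(parameter)
    if search_type == "MIN LEN":
        return len(text) >= int(parameter)
    if search_type in ("LIKE", "NOT LIKE"):
        flags = re.DOTALL | (0 if case_sensitive else re.IGNORECASE)
        matched = re.fullmatch(like_to_regex(parameter), text, flags) is not None
        return matched if search_type == "LIKE" else not matched
    return False  # unknown rule types fail, like the final RETURN(0)


# Hypothetical rules applied to a hypothetical proc name and definition.
name_ok = validate_rule("usp_GetOrders", "LIKE", False, "USP%")
name_ok_strict = validate_rule("usp_GetOrders", "LIKE", True, "USP%")
no_select_star = validate_rule("SELECT * FROM Orders", "NOT LIKE", False, "%SELECT *%")
```

Here `name_ok` passes because the comparison ignores case, `name_ok_strict` fails because it does not, and `no_select_star` fails because the definition contains the forbidden text.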

The only thing remaining now is the database trigger for CREATE and ALTER PROCEDURE. You can get the definition of that trigger from the link below.

This link holds all the necessary scripts. After running the scripts, here is what happens when I try to compile a sub-standard procedure.

Suggestions and feedback welcome!!

A Scenario to Ponder #16

2010-03-14T12:11:00.005-04:00

During my school days, I used to play a game that I called "word mutation". The problem statement gives the starting and ending 4-letter words.

The game is to find a path from the starting word to the ending word by changing one letter at a time and, of course, each of the words in the path should be a valid word.

Given below is one example:

Start: WORD
End : COME

One solution is
WORD
WORE
CORE
COME

So, here is the SQL puzzle. Given the starting and ending 4-letter words, can you write a query to find the shortest path between these two words? If there are multiple shortest paths, just show one.

You need to have a table of words and you can build it using the list available here:
http://www.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt

Please post your answers in the comments section.
Happy Querying!!!

Given below is my take at solving this puzzle. I am not very proud of the solution. I feel I could just as well have written it in C#.
All you SQL gurus... I am really looking forward to seeing a better looking/performing solution.

/* 
    I created a view VW_WORD that will only 
    select 4 letter words from the Words table. 
    I am using that as my Word Dictionary for 
    this solution.
*/

SET NOCOUNT ON;

DECLARE        @StartWord        CHAR(4);
DECLARE        @EndWord        CHAR(4);

SET            @StartWord        = 'WORD';
SET            @EndWord        = 'COME';

DECLARE @WordsTable as TABLE
( 
    Word                char(4),
    WordsBranch            varchar(max),
    Closeness            tinyint,
    IterationCounter    int
)    
    
DECLARE    @IterationCounter        int;
SET        @IterationCounter        = 1;

;WITH Positions AS
(
    SELECT 1 AS Pos
    UNION ALL
    SELECT 2
    UNION ALL
    SELECT 3
    UNION ALL
    SELECT 4
)
INSERT INTO @WordsTable (Word, WordsBranch, Closeness, IterationCounter)
SELECT    
        UPPER(Word),
        CAST(UPPER(@StartWord) + '-->' + UPPER(Word) AS VARCHAR(MAX)) AS WordsBranch,
        (SELECT COUNT(*) FROM Positions WHERE SUBSTRING(Word, Pos,1) = SUBSTRING(@EndWord,Pos,1)) as Closeness,
        @IterationCounter as IterationCounter
FROM 
        VW_WORD
WHERE 
        DIFFERENCE(Word,@StartWord) >= 3 
AND     (SELECT COUNT(*) FROM Positions WHERE SUBSTRING(Word, Pos,1) = SUBSTRING(@StartWord,Pos,1)) >= 3  
AND    (SELECT COUNT(*) FROM Positions WHERE SUBSTRING(Word, Pos,1) = SUBSTRING(@EndWord,Pos,1)) >= 1  

DELETE FROM @WordsTable WHERE Closeness <> (SELECT MAX(Closeness) FROM @WordsTable)  

WHILE NOT EXISTS (SELECT 1 FROM @WordsTable WHERE Word = @EndWord)
BEGIN
    SET @IterationCounter = @IterationCounter + 1

    IF(@IterationCounter = 10)
        BREAK; /* No solution Found */

    ;WITH Positions AS
    (
        SELECT 1 AS Pos
        UNION ALL
        SELECT 2
        UNION ALL
        SELECT 3
        UNION ALL
        SELECT 4
    )
    INSERT INTO @WordsTable (Word, WordsBranch, Closeness, IterationCounter)
    SELECT    
            UPPER(A.Word),
            WordsBranch + '-->' + UPPER(A.Word) AS WordsBranch,
            (SELECT COUNT(*) FROM Positions WHERE SUBSTRING(A.Word, Pos,1) = SUBSTRING(@EndWord,Pos,1)),
            @IterationCounter as IterationCounter
    FROM 
            VW_WORD AS A, 
            (SELECT * FROM @WordsTable WHERE IterationCounter = @IterationCounter - 1) AS B
    WHERE 
            DIFFERENCE(A.Word,B.Word) >= 3 
    AND    (SELECT COUNT(*) FROM Positions WHERE SUBSTRING(A.Word, Pos,1) = SUBSTRING(B.Word,Pos,1)) = 3  
    AND    (SELECT COUNT(*) FROM Positions WHERE SUBSTRING(A.Word, Pos,1) = SUBSTRING(@EndWord,Pos,1)) >= B.Closeness 
    AND    NOT EXISTS (SELECT 1 FROM @WordsTable C where C.Word = A.Word)
    
    DELETE FROM @WordsTable WHERE IterationCounter < @IterationCounter - 1
END

IF EXISTS(SELECT 1 FROM @WordsTable WHERE Closeness = 4)
    SELECT WordsBranch FROM @WordsTable WHERE Closeness = 4
ELSE
    PRINT 'No solution found'
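
For comparison, here is a minimal breadth-first search sketch in Python. This is a different technique from the T-SQL above, which prunes each iteration by closeness to the end word. Breadth-first search visits words in order of distance from the start, so the first path that reaches the end word is a shortest one. The small word list is made up for the example; a real run would load the 4-letter words from the word list mentioned earlier.

```python
from collections import deque

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def shortest_ladder(start, end, words):
    """Return one shortest path from start to end, or None if there is none."""
    start, end = start.upper(), end.upper()
    dictionary = {w.upper() for w in words if len(w) == len(start)}
    previous = {start: None}          # word -> the word we reached it from
    queue = deque([start])
    while queue:
        word = queue.popleft()
        if word == end:
            path = []
            while word is not None:   # walk back to the start word
                path.append(word)
                word = previous[word]
            return path[::-1]
        for i in range(len(word)):
            for letter in LETTERS:
                candidate = word[:i] + letter + word[i + 1:]
                if candidate in dictionary and candidate not in previous:
                    previous[candidate] = word
                    queue.append(candidate)
    return None


# Hypothetical mini dictionary, just enough for the WORD -> COME example.
sample_words = ["word", "wore", "core", "come", "cord", "worm", "corm"]
ladder = shortest_ladder("WORD", "COME", sample_words)
```

WORD and COME differ in three positions, so no path can be shorter than three changes, and the search returns a path of exactly that length.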

Solving Sudoku using SQL Server 2005 - Step by Step - Part #6

2009-11-14T03:06:00.006-05:00

Implementation of RunSolveAlgorithm4:

This is the last post in this series. The previous algorithm is the last one that tries to solve the puzzle logically. This one will take the latest unsolved sudoku board that we get after running through the first three algorithms and use brute force to solve the puzzle. I am not going to reinvent the wheel here.

This is a SQL Server implementation of "solving sudoku using recursive subquery factoring" written by Anton Scheffer for Oracle 11gR2.

Here is the logic that was used. Using recursive CTEs, start filling the blank cells (one cell at a time) with all possible valid values for the cell, going from left to right, top to bottom. Each empty cell will be a root node with subsequent solutions branching under it. Each valid value filled in that cell will be a new solution node in the branch, which will become the root node for the next blank cell. Now, fill the second cell the same way for each of the nodes and continue.

The branching of a node will terminate when one of the following conditions is met:
1. If a node can no longer have a valid value for the next empty cell, we stop branching that node - it can no longer lead us to a valid solution.
2. When all cells in the sudoku board are filled, stop processing that node. The node holds the solution.

Here is the implementation:

ALTER PROC RunSolveAlgorithm4
AS
SET NOCOUNT ON
BEGIN
    WITH SUDOKU_BOARD_AS_STRING AS /*Convert our latest Sudoku Board as a continuous string*/
    (
    SELECT CAST((
             SELECT ISNULL(VAL,0) FROM SUDOKU_BOARD ORDER BY YPOS,XPOS FOR XML PATH('')
           ) AS VARCHAR(81)) AS SUDOKU_PROBLEM
    ),
    ALL_SOLNS AS
    ( SELECT SUDOKU_PROBLEM AS SOLN, CHARINDEX('0',SUDOKU_PROBLEM,1) CUR_POSN FROM SUDOKU_BOARD_AS_STRING
      UNION ALL /*Recursive member - Replace the next blank cell with a possible value*/
      SELECT CAST(LEFT(SOLN,CUR_POSN-1) + CAST(A.NUM AS CHAR(1)) + RIGHT(SOLN,81-CUR_POSN) AS VARCHAR(81)), 
             CHARINDEX( '0',SOLN,CUR_POSN+1)  
      FROM ALL_SOLNS,NUMBERS A 
      WHERE CUR_POSN > 0  /*Branching termination condition 2. No more cells to fill. SOLUTION!!! */
        AND A.NUM NOT IN (/*Pick the solution only when the current value is not in any other cell of 
                            the same row, column and 3X3 block. Branching termination condition 1*/
                          SELECT substring(SOLN,(B.NUM-1)*9 + C.NUM,1)  FROM NUMBERS B, NUMBERS C
                          WHERE B.NUM = (CUR_POSN-1)/9+1 
                             OR C.NUM = (CUR_POSN-1)%9 + 1 
                             OR ((B.NUM-1)/3)*3 + (C.NUM-1)/3 = (((CUR_POSN-1)/27)*3 + ((CUR_POSN-1)%9)/3))
    ),
    SUDOKU_SOLUTION AS
    ( SELECT A.NUM AS XPOS, B.NUM AS YPOS, SUBSTRING(SOLN,(B.NUM-1)*9 + A.NUM,1) AS VAL 
      FROM NUMBERS A,NUMBERS B,(SELECT TOP 1 SOLN FROM ALL_SOLNS WHERE CUR_POSN = 0) C
    )
    UPDATE SUD
    SET SUD.VAL = SOL.VAL
    FROM SUDOKU_BOARD SUD, SUDOKU_SOLUTION SOL
    WHERE SUD.XPOS = SOL.XPOS
    AND SUD.YPOS = SOL.YPOS    

    /* Update the solution board too so that its in sync with sudoku board*/
    UPDATE SOL
    SET SOL.VAL = SUD.VAL
    FROM SUDOKU_BOARD SUD, SOLUTION_BOARD SOL
    WHERE SUD.XPOS = SOL.XPOS
      AND SUD.YPOS = SOL.YPOS    
END
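
The same search can be sketched outside SQL. The Python below is an illustration only, with hypothetical function names and a well-known sample puzzle as input. It keeps the board as an 81-character string with '0' for blanks, like the CTE, and replaces the first blank with every digit not already used in the same row, column or 3X3 block. One difference is that it explores branches depth-first with a stack and stops at the first solution, whereas the recursive CTE describes all branches and the final query picks one.

```python
def used_digits(board, pos):
    """Digits already present in the row, column and 3X3 block of cell pos."""
    row, col = divmod(pos, 9)
    used = set()
    for i in range(81):
        r, c = divmod(i, 9)
        if r == row or c == col or (r // 3 == row // 3 and c // 3 == col // 3):
            used.add(board[i])
    return used


def solve(board):
    """Return a solved 81-character board, or None if no solution exists."""
    stack = [board]
    while stack:
        current = stack.pop()
        pos = current.find("0")          # next blank cell, left to right, top to bottom
        if pos == -1:
            return current               # termination condition 2: board is full
        used = used_digits(current, pos)
        for digit in "123456789":
            if digit not in used:        # a node with no valid digit simply dies out
                stack.append(current[:pos] + digit + current[pos + 1:])
    return None


# Sample puzzle (not one from this series), one string per row.
PUZZLE = "".join([
    "530070000",
    "600195000",
    "098000060",
    "800060003",
    "400803001",
    "700020006",
    "060000280",
    "000419005",
    "000080079",
])
solution = solve(PUZZLE)
```

A branch that runs out of valid digits pushes nothing back onto the stack, which is the same effect as termination condition 1 in the CTE.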

This algorithm can be run directly after loading the data in the sudoku board, without running the other three algorithms. However, due to the recursive nature of this algorithm, the number of branches can be reduced exponentially by filling in a few extra cells in the sudoku board first. So, I keep this as the last algorithm to be used.

That brings this series to an end (at least for now).

One final note: If you are planning to build more algorithms on this, add them before this one, for obvious reasons - so that you don't work on a solved puzzle :) You can get the source of all the code in this series from here.

Solving Sudoku using SQL Server 2005 - Step by Step - Part #5

2009-11-14T02:13:00.011-05:00

Implementation of RunSolveAlgorithm3:

The last algorithm that we implemented was able to solve an easy puzzle. Now, let's take a hard one and see if the solution we have built so far in this series can solve it.

 EXEC SolveSudoku 
'030,001,000,,006,000,050,,500,000,983,,080,006,302,,000,050,000,,903,800,060,,714,000,009,,020,000,800,,000,400,030'

post-solve sudoku board - before implementing algorithm 3

post-solve solution board - before implementing algorithm 3

We see that the current solve methods cannot handle this puzzle. Let's build the next algorithm.

The next algorithm is again an implementation from sudoku solver, which they call solve method B.
It is a little complex and I enjoyed implementing this one. Any row in the board spans three 3X3 blocks, with 3 cells in each block. Now, check whether any value in the row occurs in only one of the 3 blocks. If it does, it means that within that block, the value must lie in that row. So, that value can be removed from the other 6 cells in the block. We do the same for the columns.

We can do the check the other way round too: any block spans 3 rows and has 3 cells in each row. Now, check whether any value in the block occurs in only one of the 3 rows. If it does, it means that within that row, the value must lie in that block. So, that value can be removed from the other 6 cells in the row. We can do the same for columns.

With the algorithm explained, here is the implementation:

ALTER PROC RunSolveAlgorithm3
AS
SET NOCOUNT ON
BEGIN
  DECLARE @RowCount SMALLINT,
          @Counter SMALLINT

  SET     @RowCount = 1
  SET     @Counter = 1

  WHILE(@RowCount > 0 AND dbo.VerifySolve() = 0)
  BEGIN
      SET @RowCount = 0;
      SET @Counter = 0;
      WHILE (@Counter <=9) 
      BEGIN  

        /*For each row, check if any number occur only in a specific block, then the number will be 
          in that row, remove it from all other rows in that block */
        WITH YSOL AS
        (
        SELECT YPOS,(XPOS-1)/3 + 1 AS XBLK, SUBSTRING(VAL,NUM,1) AS VAL FROM SOLUTION_BOARD A, NUMBERS B
                       WHERE B.NUM <=LEN(A.VAL)
                       AND LEN(VAL) > 1 
        GROUP BY YPOS,(XPOS-1)/3 + 1 ,SUBSTRING(VAL,NUM,1)  
        ),
        YSOL_DEL AS
        (
        SELECT D.NUM AS XPOS,C.NUM AS YPOS,A.VAL FROM YSOL A LEFT OUTER JOIN YSOL B ON A.YPOS = B.YPOS
        AND A.XBLK <> B.XBLK AND A.VAL= B.VAL 
        INNER JOIN NUMBERS C ON (A.YPOS -1)/3 =(C.NUM-1)/3
        INNER JOIN NUMBERS D ON A.XBLK =(D.NUM-1)/3+1
        WHERE B.YPOS IS NULL
        AND A.YPOS <> C.NUM
        AND A.VAL = @Counter
        )
        UPDATE SOL SET VAL = REPLACE(SOL.VAL,YDEL.VAL,'')
          FROM   SOLUTION_BOARD SOL, YSOL_DEL YDEL
          WHERE  SOL.XPOS = YDEL.XPOS
          AND SOL.YPOS = YDEL.YPOS    
          AND LEN(SOL.VAL) > 1;

        /*For each column, check if any number occur only in a specific block, then the number will be 
          in that column, remove it from all other columns in that block */
        WITH XSOL AS
        (
        SELECT XPOS,(YPOS-1)/3 + 1 AS YBLK, SUBSTRING(VAL,NUM,1) AS VAL FROM SOLUTION_BOARD A, NUMBERS B
                       WHERE B.NUM <=LEN(A.VAL)
                       AND LEN(VAL) > 1 
        GROUP BY XPOS,(YPOS-1)/3 + 1 ,SUBSTRING(VAL,NUM,1)  
        ),
        XSOL_DEL AS
        (
        SELECT D.NUM AS YPOS,C.NUM AS XPOS,A.VAL FROM XSOL A LEFT OUTER JOIN XSOL B ON A.XPOS = B.XPOS
        AND A.YBLK <> B.YBLK AND A.VAL= B.VAL 
        INNER JOIN NUMBERS C ON (A.XPOS -1)/3 =(C.NUM-1)/3
        INNER JOIN NUMBERS D ON A.YBLK =(D.NUM-1)/3+1
        WHERE B.XPOS IS NULL
        AND A.XPOS <> C.NUM
        AND A.VAL = @Counter
        )
        UPDATE SOL SET VAL = REPLACE(SOL.VAL,XDEL.VAL,'')
          FROM   SOLUTION_BOARD SOL, XSOL_DEL XDEL
          WHERE  SOL.YPOS = XDEL.YPOS
          AND SOL.XPOS = XDEL.XPOS    
          AND LEN(SOL.VAL) > 1;

        /*For each block, check if any number occur only in a specific row, then the number will be 
          in that block, remove it from all other blocks in that row*/
         WITH YBLK AS
         (
             SELECT ((YPOS-1)/3)*3 + (XPOS-1)/3 + 1 AS BPOS, YPOS, 
                    SUBSTRING(VAL,NUM,1) AS VAL 
             FROM   SOLUTION_BOARD A, NUMBERS B
             WHERE  B.NUM <=LEN(A.VAL)
                AND LEN(VAL) > 1 
             GROUP BY 
                   ((YPOS-1)/3)*3 + (XPOS-1)/3,YPOS,SUBSTRING(VAL,NUM,1)  
         ),
         YBLK_DEL AS 
         (
            SELECT  A.YPOS,C.NUM AS XPOS ,A.VAL 
            FROM YBLK A LEFT OUTER JOIN YBLK B 
            ON      A.BPOS = B.BPOS
                AND A.YPOS <> B.YPOS 
                AND A.VAL= B.VAL 
            INNER JOIN NUMBERS C ON 1 = 1
            WHERE B.YPOS IS NULL 
            AND (C.NUM-1)/3 + 1 <> A.BPOS  - ((A.YPOS-1)/3)*3 
            AND A.VAL = @Counter
         )
         UPDATE SOL SET VAL = REPLACE(SOL.VAL,YDEL.VAL,'')
         FROM  SOLUTION_BOARD SOL, YBLK_DEL YDEL
         WHERE SOL.XPOS = YDEL.XPOS
           AND SOL.YPOS = YDEL.YPOS    
           AND LEN(SOL.VAL) > 1;
        
        /*For each block, check if any number occur only in a specific column, then the number will be 
          in that block, remove it from all other blocks in that column*/
         WITH XBLK AS
         (
            SELECT ((XPOS-1)/3)*3 + (YPOS-1)/3 + 1 AS BPOS, XPOS, 
                   SUBSTRING(VAL,NUM,1) AS VAL 
            FROM   SOLUTION_BOARD A, NUMBERS B
            WHERE  B.NUM <=LEN(A.VAL)
              AND  LEN(VAL) > 1 
             GROUP BY 
                  ((XPOS-1)/3)*3 + (YPOS-1)/3,XPOS,SUBSTRING(VAL,NUM,1)  
         ),
         XBLK_DEL AS 
         (
            SELECT A.XPOS, C.NUM AS YPOS ,A.VAL 
            FROM XBLK A LEFT OUTER JOIN XBLK B 
              ON     A.BPOS = B.BPOS
                 AND A.XPOS <> B.XPOS 
                 AND A.VAL= B.VAL 
            INNER JOIN NUMBERS C ON 1 = 1
            WHERE B.XPOS IS NULL
            AND (C.NUM-1)/3 + 1 <> A.BPOS  - ((A.XPOS-1)/3)*3 
            AND A.VAL = @Counter
         )
         UPDATE SOL SET VAL = REPLACE(SOL.VAL,XDEL.VAL,'')
         FROM  SOLUTION_BOARD SOL, XBLK_DEL XDEL
         WHERE SOL.YPOS = XDEL.YPOS
           AND SOL.XPOS = XDEL.XPOS    
           AND LEN(SOL.VAL) > 1;

        SET @Counter = @Counter + 1
     END

    /* If the above updates led to determining the value of any cell in the solution 
           board (Only one digit exists in the cell), then we update the sudoku_board */
    UPDATE SUD
    SET VAL = SOL.VAL
    FROM SUDOKU_BOARD SUD, SOLUTION_BOARD SOL
    WHERE SUD.XPOS = SOL.XPOS
      AND SUD.YPOS = SOL.YPOS
      AND LEN(SOL.VAL) = 1
      AND SUD.VAL IS NULL

    SET @RowCount = @@ROWCOUNT    
    
    /* We rerun SolveAlgorithms 1 and 2 to see if there are any more solves possible */

    EXEC RunSolveAlgorithm1;
    EXEC RunSolveAlgorithm2;
  END
END
GO
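
The row-to-block half of this rule can be sketched in a few lines of Python. This is an illustration only, not a translation of the procedure: the function name and the dictionary of candidate sets are hypothetical stand-ins for the SOLUTION_BOARD table. It assumes, as the procedure does, that solved digits have already been removed from the candidates of their peers.

```python
def eliminate_by_row(cands):
    """cands maps (row, col) to a set of candidate digits.

    If a digit's candidates in a row all fall inside one 3X3 block, remove the
    digit from the other two rows of that block. Returns the number of removals.
    """
    removed = 0
    for row in range(9):
        for digit in "123456789":
            # Blocks (0..2, left to right) where the digit is still open in this row.
            blocks = {
                col // 3
                for col in range(9)
                if len(cands[row, col]) > 1 and digit in cands[row, col]
            }
            if len(blocks) != 1:
                continue
            block = blocks.pop()
            band = row // 3
            for r in range(band * 3, band * 3 + 3):
                if r == row:
                    continue
                for c in range(block * 3, block * 3 + 3):
                    cell = cands[r, c]
                    if len(cell) > 1 and digit in cell:
                        cell.discard(digit)
                        removed += 1
    return removed


# Hypothetical board: every cell open, except that in the top row
# the digit 5 is only possible in the first block.
board = {(r, c): set("123456789") for r in range(9) for c in range(9)}
for c in range(3, 9):
    board[0, c].discard("5")
removals = eliminate_by_row(board)
```

In this example the digit 5 is removed from the six cells of the first block that lie in the second and third rows. The column-to-block and block-to-row checks follow the same pattern with the roles swapped.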

post-solve sudoku board - after implementing Algorithm 3 (Solved)

With this implementation, we should be able to solve most medium and some hard puzzles. There are a lot more well-known algorithms available to solve the puzzle logically, which should be equally interesting to implement. If you come up with an implementation other than the ones given here, please feel free to post the link or the actual implementation in the comments section. But I am done with my implementations for now. The next post covers our last algorithm, a brute force one, which will be a fallback in case our first 3 algorithms are not able to solve the puzzle.

Solving Sudoku using SQL Server 2005 - Step by Step - Part #4

2009-11-14T01:37:00.006-05:00

Implementation of RunSolveAlgorithm2:

We implemented RunSolveAlgorithm1 in the previous post of this series. The next algorithm is the implementation of Solve Method A from sudoku solver.

In this algorithm, we check all the cells (having multiple values) in each row and see if a particular value occurs only once in that row. If it does, we update that as the solution for the cell having that value. We do a similar check for each column and each 3X3 block.

This can solve easy to medium puzzles. Here goes the implementation.

ALTER PROC RunSolveAlgorithm2
AS
SET NOCOUNT ON
BEGIN
    DECLARE @RowCount int,
        @UpdateRowCount int

    SET     @RowCount = 1
    SET    @UpdateRowCount = 0

    WHILE(@RowCount > 0 AND dbo.VerifySolve() = 0)
    BEGIN
        SET @RowCount = 0;

        /* Take all the cells, having multiple values, in each column (XPOS) and see if a particular value occurs only 
           once in that column. Then update that as the solution for the cell */
        WITH XSOL AS
        (
        SELECT XPOS,SUBSTRING(VAL,NUM,1) AS VAL FROM SOLUTION_BOARD A, NUMBERS B
                       WHERE B.NUM <=LEN(A.VAL)
                       AND LEN(VAL) > 1
        GROUP BY XPOS,SUBSTRING(VAL,NUM,1) HAVING COUNT(*) = 1
        )
        UPDATE SOL 
        SET VAL = XSOL.VAL
        FROM SOLUTION_BOARD SOL, XSOL
        WHERE 
              SOL.XPOS = XSOL.XPOS
          AND LEN(SOL.VAL) > 1
          AND CHARINDEX(XSOL.VAL,SOL.VAL) > 0;

        SET @UpdateRowCount = @@ROWCOUNT;
        SET @RowCount = @RowCount + @UpdateRowCount;
        IF(@UpdateRowCount > 0) /* Need to rerun algorithm 1 for clean up if any cell was updated */
            EXEC RunSolveAlgorithm1; 


        /* Take all the cells, having multiple values, in each row (same YPOS) and see if a particular value occurs only 
           once in that row. Then update that as the solution for the cell */
        WITH YSOL AS
        (
        SELECT YPOS,SUBSTRING(VAL,NUM,1) AS VAL FROM SOLUTION_BOARD A, NUMBERS B
                       WHERE B.NUM <=LEN(A.VAL)
                       AND LEN(VAL) > 1
        GROUP BY YPOS,SUBSTRING(VAL,NUM,1) HAVING COUNT(*) = 1
        )
        UPDATE SOL 
        SET VAL = YSOL.VAL
        FROM SOLUTION_BOARD SOL, YSOL
        WHERE 
              SOL.YPOS = YSOL.YPOS
          AND LEN(SOL.VAL) > 1
          AND CHARINDEX(YSOL.VAL,SOL.VAL) > 0;

        SET @UpdateRowCount = @@ROWCOUNT;
        SET @RowCount = @RowCount + @UpdateRowCount;
        IF(@UpdateRowCount > 0) /* Need to rerun algorithm 1 for clean up if any cell was updated */
            EXEC RunSolveAlgorithm1; 


        /* Take all the cells, having multiple values, in each 3X3 block and see if a particular value occurs only 
           once in that block. Then update that as the solution for the cell */
        WITH BSOL AS
        (
        SELECT ((YPOS-1)/3)*3 + (XPOS-1)/3 AS BPOS,SUBSTRING(VAL,NUM,1) AS VAL FROM SOLUTION_BOARD A, NUMBERS B
                       WHERE B.NUM <=LEN(A.VAL)
                       AND LEN(VAL) > 1
        GROUP BY ((YPOS-1)/3)*3 + (XPOS-1)/3,SUBSTRING(VAL,NUM,1) HAVING COUNT(*) = 1
        )
        UPDATE SOL 
        SET VAL = BSOL.VAL
        FROM SOLUTION_BOARD SOL, BSOL
        WHERE 
              ((SOL.YPOS-1)/3)*3 + (SOL.XPOS-1)/3 = BSOL.BPOS
          AND LEN(SOL.VAL) > 1
          AND CHARINDEX(BSOL.VAL,SOL.VAL) > 0;

        SET @UpdateRowCount = @@ROWCOUNT;
        SET @RowCount = @RowCount + @UpdateRowCount;
        IF(@UpdateRowCount > 0) /* Need to rerun algorithm 1 for clean up if any cell was updated */
            EXEC RunSolveAlgorithm1; 

    END
END
GO
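
All three checks above rest on the same idiom: joining the candidate strings to a numbers table splits each string into one row per digit, and GROUP BY with HAVING COUNT(*) = 1 then keeps the digits that appear in only one cell of the group. Here is a minimal sketch of that idiom on its own. The table variables @Cells and @Nums and their contents are made up for illustration; they stand in for the unsolved cells of one column of SOLUTION_BOARD and for the NUMBERS table.

```sql
/* Illustrative data only: three unsolved cells of one column */
DECLARE @Cells TABLE (YPOS INT, VAL VARCHAR(9));
DECLARE @Nums  TABLE (NUM INT);

INSERT INTO @Nums
SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL
SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL
SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9;

INSERT INTO @Cells
SELECT 1, '258' UNION ALL
SELECT 2, '25'  UNION ALL
SELECT 3, '25';

/* Split every candidate string into digits, then keep the digits
   that occur in exactly one cell. MIN(YPOS) names that cell. */
SELECT SUBSTRING(C.VAL, N.NUM, 1) AS DIGIT,
       MIN(C.YPOS)                AS YPOS
FROM   @Cells C, @Nums N
WHERE  N.NUM <= LEN(C.VAL)
GROUP BY SUBSTRING(C.VAL, N.NUM, 1)
HAVING COUNT(*) = 1;
```

With this data the query should return a single row: digit 8 in cell 1, since 2 and 5 each appear in all three cells.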

When I call the proc SolveSudoku now, you can see that the problem is solved and, once solved, the solution board and sudoku board are in sync.

EXEC SolveSudoku 
'790,000,300,,000,006,900,,800,030,076,,000,005,002,,005,418,700,,400,700,000,,610,090,008,,002,300,000,,009,000,054'

post-solve sudoku board - before implementing Algorithm 2

post-solve sudoku board - after implementing Algorithm 2 (Solved)

post-solve solution board - before implementing Algorithm 2

post-solve solution board - after implementing Algorithm 2 (Same as the sudoku board)

For the next algorithm, we will take up a harder puzzle and see how well we fare.

Solving Sudoku using SQL Server 2005 - Step by Step - Part #3

2009-11-12T15:14:00.007-05:00

Implementation of RunSolveAlgorithm1:

In the previous post of this series, we created the procedure stub for each algorithm. This post will implement RunSolveAlgorithm1.

The first algorithm will do the primary clean up on the solution board. It will implement the basic rule of Sudoku: a number cannot appear in a cell if it already appears in any other cell in the same row, column or block of the sudoku board. It will remove those numbers from the possible candidate values of each cell in the solution board.

This cannot usually solve the puzzle completely unless the problem is extremely easy. But, this will make sure our other algorithms need not worry about figuring out the obvious.

So, here goes the implementation. Once you understand the row level update, the column and block level update are pretty much the same. I have added comments in place, so it should be fairly easy to understand.

ALTER PROC RunSolveAlgorithm1
AS
SET NOCOUNT ON
BEGIN
   DECLARE @RowCount SMALLINT,
           @Counter  SMALLINT
 
   SET     @RowCount = 1
   SET     @Counter = 1

   /* We will keep running this algorithm till there are no more
      cells left to update in the sudoku board  */
   
   WHILE(@RowCount > 0 AND dbo.VerifySolve() = 0)
   BEGIN
   /* We need to use the counter logic to make sure we don't do 
   multiple updates on the same value such that they overwrite 
   each other */
      SET @Counter = 1; /* ROW_NUMBER() starts at 1, so there is nothing to match at 0 */
      WHILE (@Counter <=9) 
      BEGIN  
         /*Remove the number from each cell in solution board if it exists in any of the cells 
           in the same column from the sudoku board */
          WITH SUD_XPOS AS
          (
            SELECT XPOS,VAL,
            ROW_NUMBER() OVER (PARTITION BY XPOS ORDER BY VAL ) AS COUNTER 
            FROM SUDOKU_BOARD 
            WHERE VAL IS NOT NULL
          )
          UPDATE SOL SET VAL = REPLACE(SOL.VAL,SUD.VAL,'')
          FROM   SOLUTION_BOARD SOL, SUD_XPOS SUD
          WHERE  SOL.XPOS = SUD.XPOS
          AND LEN(SOL.VAL) > 1
          AND COUNTER = @Counter;
          
          /*Remove the number from each cell in solution board if it exists in any of the cells 
            in the same row from the sudoku board */
          WITH SUD_YPOS AS
          (
            SELECT YPOS,VAL,
            ROW_NUMBER() OVER (PARTITION BY YPOS ORDER BY VAL ) AS COUNTER  
            FROM SUDOKU_BOARD 
            WHERE VAL IS NOT NULL
          )
          UPDATE SOL SET VAL = REPLACE(SOL.VAL,SUD.VAL,'')
          FROM   SOLUTION_BOARD SOL, SUD_YPOS SUD
          WHERE  SOL.YPOS = SUD.YPOS
          AND LEN(SOL.VAL) > 1
          AND COUNTER = @Counter;

          /*Remove the number from each cell in solution board if it exists in any of the cells 
            in the same 3X3 block from the sudoku board */
          WITH SUD_BLOCK AS
          (
            SELECT ((YPOS-1)/3)*3 + (XPOS-1)/3 AS BPOS,VAL,
                   ROW_NUMBER() OVER (PARTITION BY ((YPOS-1)/3)*3 + (XPOS-1)/3 ORDER BY VAL ) AS COUNTER  
            FROM SUDOKU_BOARD 
            WHERE VAL IS NOT NULL
          )
          UPDATE SOL SET VAL = REPLACE(SOL.VAL,SUD.VAL,'')
          FROM   SOLUTION_BOARD SOL, SUD_BLOCK SUD
          WHERE  ((SOL.YPOS-1)/3)*3 + (SOL.XPOS-1)/3 = SUD.BPOS
          AND LEN(SOL.VAL) > 1
          AND COUNTER = @Counter;
          
          SET @Counter = @Counter  + 1
      END /* WHILE (@Counter <=9) */

      /*If the above updates led to a determining the value of any cell in the solution 
        board (Only one digit exists in the cell), then we update the sudoku_board */
      UPDATE SUD
      SET VAL = SOL.VAL
      FROM SUDOKU_BOARD SUD, SOLUTION_BOARD SOL
      WHERE SUD.XPOS = SOL.XPOS
      AND SUD.YPOS = SOL.YPOS
      AND LEN(SOL.VAL) = 1
      AND SUD.VAL IS NULL

      SET @RowCount = @@ROWCOUNT
   END /* WHILE(@RowCount > 0 AND dbo.VerifySolve() = 0) */
   
END
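
The counter loop above guards against a quirk of UPDATE ... FROM: when one target row joins to several source rows, SQL Server applies only one of them, so a single statement cannot strip several digits from the same cell. Here is a small sketch of the effect. The table variables are made up for illustration: @Target plays one solution cell and @Source plays two solved digits found in the same column.

```sql
DECLARE @Target TABLE (ID INT, VAL VARCHAR(9));
DECLARE @Source TABLE (ID INT, DIGIT CHAR(1));

INSERT INTO @Target SELECT 1, '123456789';
INSERT INTO @Source SELECT 1, '7' UNION ALL SELECT 1, '9';

/* The single target row matches both source rows */
UPDATE T
SET    VAL = REPLACE(T.VAL, S.DIGIT, '')
FROM   @Target T, @Source S
WHERE  T.ID = S.ID;

/* Only one of the two digits has been removed;
   which one is not guaranteed */
SELECT VAL FROM @Target;
```

Numbering the solved digits with ROW_NUMBER() and applying one number per pass, as the procedure does, makes sure each pass has at most one matching source row per cell and per statement.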

When I call the proc SolveSudoku, you can see that though the problem is not solved, it does fill a few more cells in the sudoku board, and the solution board has far fewer choices in the unsolved cells.

EXEC SolveSudoku 
'790,000,300,,000,006,900,,800,030,076,,000,005,002,,005,418,700,,400,700,000,,610,090,008,,002,300,000,,009,000,054'

post-solve sudoku board - before implementing Algorithm 1

post-solve sudoku board - after implementing Algorithm 1

post-solve solution board - before implementing Algorithm 1

post-solve solution board - after implementing Algorithm 1

That ends this post. Let's see if the next algorithm solves this puzzle.

Solving Sudoku using SQL Server 2005 - Step by Step - Part #2

2009-11-10T13:05:00.008-05:00

In the first part of this series, we created the base objects needed to work on our solution. Now, there are two parts to building the solution:

The core algorithms that will solve the puzzle for us
The surrounding objects that will facilitate the solve and execute the core algorithms

The core algorithms are the crux of this whole exercise. Each of these algorithms will ideally take the unsolved sudoku board and try to fill as many cells as possible using the logic implemented in the specific algorithm.

We will be creating the procedure stubs for the core algorithms (right now I assume we will have 4 of them), which will be implemented later. All my future posts in this series will be about implementing each of the algorithm stubs given below.

/* YET TO IMPLEMENT */
CREATE PROC RunSolveAlgorithm1
AS
BEGIN
     PRINT 'SOLVE ALGORITHM 1 NOT IMPLEMENTED'
RETURN 0    
END
GO

/* YET TO IMPLEMENT */
CREATE PROC RunSolveAlgorithm2
AS
BEGIN
     PRINT 'SOLVE ALGORITHM 2 NOT IMPLEMENTED'
RETURN 0    
END
GO

/* YET TO IMPLEMENT */
CREATE PROC RunSolveAlgorithm3
AS
BEGIN
     PRINT 'SOLVE ALGORITHM 3 NOT IMPLEMENTED'
RETURN 0    
END
GO

/* YET TO IMPLEMENT */
CREATE PROC RunSolveAlgorithm4
AS
BEGIN
     PRINT 'SOLVE ALGORITHM 4 NOT IMPLEMENTED'
RETURN 0    
END
GO

Here is the main procedure that will be called to solve the sudoku puzzle. It prints the sudoku board immediately after loading the puzzle (pre-solve) and again after running all the algorithms (post-solve).

/* This proc is the main proc that will accept the problem as an input and do the following:
   Verify that the input is valid
   Load the problem into the sudoku and solution board
   Call the algorithms to solve the problem
 
   Sample Inputs:
   EXEC SolveSudoku 
'030,001,000,,006,000,050,,500,000,983,,080,006,302,,000,050,000,,903,800,060,,714,000,009,,020,000,800,,000,400,030'
   EXEC SolveSudoku 
'790,000,300,,000,006,900,,800,030,076,,000,005,002,,005,418,700,,400,700,000,,610,090,008,,002,300,000,,009,000,054'
*/
CREATE PROC SolveSudoku
(@in_szProblem varchar(120))
AS
SET NOCOUNT ON
BEGIN
     declare @szProblem   varchar(81)
     set     @szProblem = replace(@in_szProblem,',','')

     /* CHECK IF THE DATA IS VALID */
     IF(PATINDEX('%[^0-9]%', @szProblem) > 0) 
     BEGIN
          UPDATE SUDOKU_BOARD SET VAL = NULL;
          PRINT 'BAD DATA'
          RETURN 0
     END     
 
     /* PROCEDURE TO LOAD THE INPUT DATA INTO SUDOKU AND THE SOLUTION BOARD */
     EXEC LoadInputData @szProblem;
 
     /* RUN SOLVE ALGORITHM 1 */
     EXEC RunSolveAlgorithm1
     IF(dbo.VerifySolve() = 1) /*SOLVED*/
     BEGIN
          EXEC PrintBoard
          PRINT 'SOLVED'
          RETURN 0
     END

     /* RUN SOLVE ALGORITHM 2 */
     EXEC RunSolveAlgorithm2
     IF(dbo.VerifySolve() = 1) /*SOLVED*/
     BEGIN
          EXEC PrintBoard
          PRINT 'SOLVED'
          RETURN 0
     END

     /* RUN SOLVE ALGORITHM 3 */
     EXEC RunSolveAlgorithm3
     IF(dbo.VerifySolve() = 1) /*SOLVED*/
     BEGIN
          EXEC PrintBoard
          PRINT 'SOLVED'
          RETURN 0
     END

     /* RUN SOLVE ALGORITHM 4 */
     EXEC RunSolveAlgorithm4
     IF(dbo.VerifySolve() = 1) /*SOLVED*/
     BEGIN
          EXEC PrintBoard
          PRINT 'SOLVED'
          RETURN 0
     END

     EXEC PrintBoard /*UNSOLVED*/
     PRINT 'UNSOLVED'
     RETURN 0    
END
GO

The main procedure accepts the input problem as a string. It verifies that the input string contains only numeric data and then loads the data into the solution and sudoku boards using the following procedure:

/* This procedure will place the problem in the sudoku board and
   the starting solution used for processing in the solution board. 
   Input data is a contiguous string filling cells from left to right, top to bottom. 
   Blank cells will have 0 
*/
CREATE PROC LoadInputData
(@in_szData varchar(81)) 
AS
SET NOCOUNT ON
BEGIN 
    UPDATE SUDOKU_BOARD 
    SET VAL = CASE WHEN substring(@in_szData,(YPOS-1)*9 + XPOS,1) = '0' 
              THEN NULL 
              ELSE substring(@in_szData,(YPOS-1)*9 + XPOS,1) END  
  
    UPDATE SOLUTION_BOARD 
    SET VAL = CASE WHEN substring(@in_szData,(YPOS-1)*9 + XPOS,1) = '0' 
              THEN '123456789' 
              ELSE substring(@in_szData,(YPOS-1)*9 + XPOS,1) END
 
    EXEC PrintBoard  
END
GO
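
The expression (YPOS-1)*9 + XPOS is what maps a cell to its character in the input string. The query below is only a sketch to make that mapping visible; it assumes SUDOKU_BOARD has been populated as in the first part of this series, and the names @Data, STRING_POS and DIGIT are mine.

```sql
DECLARE @Data VARCHAR(81);
SET @Data = REPLACE('790,000,300,,000,006,900,,800,030,076,,000,005,002,,005,418,700,,400,700,000,,610,090,008,,002,300,000,,009,000,054', ',', '');

/* Show which character of the input each cell of the second row reads */
SELECT YPOS,
       XPOS,
       (YPOS-1)*9 + XPOS                      AS STRING_POS,
       SUBSTRING(@Data, (YPOS-1)*9 + XPOS, 1) AS DIGIT
FROM   SUDOKU_BOARD
WHERE  YPOS = 2
ORDER BY XPOS;
```

For this puzzle the second row is 000006900, so the query should list string positions 10 through 18, with the 6 at XPOS = 6 and the 9 at XPOS = 7.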

You will see that the main procedure calls a function VerifySolve() after running each algorithm. This will check for the following in the sudoku board:

the sum of value of cells across each row (summing up values grouping by YPOS) = 45
the sum of value of cells across each column (summing up values grouping by XPOS) = 45
the sum of value of cells across each 3X3 block (summing up values grouping by ((YPOS-1)/3)*3 + (XPOS-1)/3) = 45
There are nine unique values filling the board (overkill, to make sure any scenario that escapes the above 3 checks is caught here). For example, filling the board with all cells as 5 will pass the first 3 checks
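
The block expression ((YPOS-1)/3)*3 + (XPOS-1)/3 relies on integer division to number the nine 3X3 blocks from 0 to 8, left to right and top to bottom. A quick way to convince yourself is to list the block number for every cell. This sketch assumes the NUMBERS table from the first part of this series; the BPOS alias is borrowed from the algorithm code.

```sql
/* One row per cell with the block it falls in */
SELECT Y.NUM AS YPOS,
       X.NUM AS XPOS,
       ((Y.NUM-1)/3)*3 + (X.NUM-1)/3 AS BPOS
FROM   NUMBERS X, NUMBERS Y
ORDER BY Y.NUM, X.NUM;
```

For example, the cell at YPOS = 5, XPOS = 8 works out to (4/3)*3 + 7/3 = 1*3 + 2 = 5.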

Here is the implementation of the verification function:

/* This function can be called anytime to verify if the sudoku board has a complete solution */
CREATE FUNCTION dbo.VerifySolve()
RETURNS BIT
AS
BEGIN RETURN(
 SELECT CASE WHEN COUNT(DISTINCT SUM_VAL) = 1 
              AND MAX(SUM_VAL)=45 
              AND (SELECT COUNT(DISTINCT VAL) FROM SUDOKU_BOARD) = 9 
             THEN 1 ELSE 0 END 
 FROM
 ( 
   SELECT SUM(VAL) AS SUM_VAL FROM SUDOKU_BOARD GROUP BY ((YPOS-1)/3)*3 + (XPOS-1)/3
   UNION ALL
   SELECT SUM(VAL) FROM SUDOKU_BOARD GROUP BY YPOS
   UNION ALL
   SELECT SUM(VAL) FROM SUDOKU_BOARD GROUP BY XPOS
 ) AS GROUP_TOTAL)
END
GO
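
VerifySolve() only answers yes or no. While building the algorithms, it can help to see which rows, columns or blocks are still off. The query below is a hedged sketch of such a diagnostic and is not part of the solution; the column names GROUP_TYPE and GROUP_NUM are my own.

```sql
/* List every row, column and block whose values do not yet add up to 45.
   ISNULL covers groups where every cell is still NULL. */
SELECT 'ROW' AS GROUP_TYPE, YPOS AS GROUP_NUM, SUM(VAL) AS SUM_VAL
FROM   SUDOKU_BOARD
GROUP BY YPOS
HAVING ISNULL(SUM(VAL), 0) <> 45
UNION ALL
SELECT 'COLUMN', XPOS, SUM(VAL)
FROM   SUDOKU_BOARD
GROUP BY XPOS
HAVING ISNULL(SUM(VAL), 0) <> 45
UNION ALL
SELECT 'BLOCK', ((YPOS-1)/3)*3 + (XPOS-1)/3, SUM(VAL)
FROM   SUDOKU_BOARD
GROUP BY ((YPOS-1)/3)*3 + (XPOS-1)/3
HAVING ISNULL(SUM(VAL), 0) <> 45;
```

An empty result means the three sum checks pass; the distinct-value check in VerifySolve() still has to hold for the board to count as solved.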

Now, we have the basic solution in place. As you can see, we are no closer to solving the puzzle than we were when we started this series. We are just making sure that when we implement an algorithm, we can concentrate on just the core algorithm and nothing else. Since none of the algorithms are implemented, when we call the procedure SolveSudoku, it will just print the sudoku board without solving it. As we start implementing each algorithm, SolveSudoku will start filling more and more cells in the board. Right now, the pre-solve (before running the algorithms) and post-solve (after running the algorithms) sudoku boards are the same. Here is an example:

EXEC SolveSudoku 
'790,000,300,,000,006,900,,800,030,076,,000,005,002,,005,418,700,,400,700,000,,610,090,008,,002,300,000,,009,000,054'

pre-solve sudoku board:

post-solve sudoku board:

post solve - solution board (EXEC PrintSolution)

All that is left to do now is implementing the core algorithms. I will be giving the pre-solve and post-solve boards after implementing each core algorithm. The algorithms will start knocking off the invalid numbers from each cell in the solution board. We have to keep building better algorithms till each cell in the solution board is left with only one number. We have then solved the problem. Long way to go, huh?

We will start with building the first algorithm in the next post.

Solving Sudoku using SQL Server 2005 - Step by Step - Part #1

2009-11-09T18:31:00.008-05:00

This one is going to be a series. I thought I was going to come up with a single query to solve Sudoku (without choosing the brute force method). When I started creating the tables I needed, I figured out that there are way too many aspects to solving Sudoku logically. I thought it would be a good idea to give a continuous update as I go about building the solution.

This post is all about creating the necessary tables and basic procedures which I think we need to build the solution.

First, I am creating a database called Sudoku which will hold all our objects and the numbers table which I hope we will use frequently.

CREATE DATABASE SUDOKU
GO
USE SUDOKU
GO

/* NUMBERS TABLE USED TO LOAD DEFAULT TABLE DATA*/
CREATE TABLE NUMBERS
(
  NUM INT NOT NULL
)
GO

/* FILL THE NUMBER TABLE */
DECLARE @COUNT INT
SET @COUNT = 0
WHILE (@COUNT <9)
BEGIN
    SET @COUNT = @COUNT + 1
    INSERT INTO NUMBERS SELECT @COUNT
END
GO

We need 2 main tables to start with. One is the SUDOKU_BOARD which will hold the original sudoku problem and will be updated with the confirmed correct value of each cell that we identify in our solve.
The second table is SOLUTION_BOARD, which is like a work table we will play around with to solve the problem. Solution board will hold all possible candidate values for each cell. Both the tables will be prefilled with 81 rows, one row for each cell in the sudoku board. Each row will hold the xy position of the cell and the value of the cell.

/* THE PROBLEM TO BE SOLVED WILL BE PLACED IN THIS BOARD */
CREATE TABLE SUDOKU_BOARD 
(
    XPOS SMALLINT NOT NULL CHECK (XPOS BETWEEN 1 AND 9),
    YPOS SMALLINT NOT NULL CHECK (YPOS BETWEEN 1 AND 9),
    VAL SMALLINT NULL CHECK(VAL BETWEEN 1 AND 9)
)
GO

/* BUILD THE SUDOKU BOARD WITH ALL POSITIONS. THIS IS A ONE TIME OPERATION*/
INSERT INTO SUDOKU_BOARD(XPOS,YPOS)
SELECT X.NUM,Y.NUM FROM NUMBERS X, NUMBERS Y
GO

/* THE SOLUTION WILL BE DERIVED IN THIS BOARD FOR THE GIVEN PROBLEM*/
CREATE TABLE SOLUTION_BOARD 
(
    XPOS SMALLINT NOT NULL CHECK (XPOS BETWEEN 1 AND 9),
    YPOS SMALLINT NOT NULL CHECK (YPOS BETWEEN 1 AND 9),
    VAL VARCHAR(9) DEFAULT '123456789'
)
GO

/* BUILD THE SOLUTION BOARD WITH ALL POSITIONS AND DEFAULT VALUES*/
INSERT INTO SOLUTION_BOARD(XPOS,YPOS)
SELECT X.NUM,Y.NUM FROM NUMBERS X, NUMBERS Y
GO

Though the two tables are effective for our solve, the data in them is not easy to comprehend. So, I am creating 2 procedures that will display the board in the way we are used to seeing it - one for viewing the sudoku board and the other for viewing the solution board. We can call these procedures whenever we need to view either of the tables during our solves.

/* CALL THIS PROCEDURE TO VIEW THE SUDOKU BOARD AT ANY TIME*/
CREATE PROC PrintBoard
AS
SET NOCOUNT ON
BEGIN
        SELECT TOP 100 PERCENT 'Y' + CAST(YPOS AS CHAR) AS 'Y/X', 
               ISNULL(CAST([1] AS VARCHAR(1)),'') AS [X1],
               ISNULL(CAST([2] AS VARCHAR(1)),'') AS [X2],
               ISNULL(CAST([3] AS VARCHAR(1)),'') AS [X3],
               ISNULL(CAST([4] AS VARCHAR(1)),'') AS [X4],
               ISNULL(CAST([5] AS VARCHAR(1)),'') AS [X5],
               ISNULL(CAST([6] AS VARCHAR(1)),'') AS [X6],
               ISNULL(CAST([7] AS VARCHAR(1)),'') AS [X7],
               ISNULL(CAST([8] AS VARCHAR(1)),'') AS [X8],
               ISNULL(CAST([9] AS VARCHAR(1)),'') AS [X9]
        FROM   SUDOKU_BOARD
        PIVOT (SUM(VAL) FOR XPOS IN ([1],[2],[3],[4],[5],[6],[7],[8],[9])) AS SB
        ORDER BY YPOS
END
GO

/* CALL THIS PROCEDURE TO VIEW THE SOLUTION BOARD AT ANY TIME*/
CREATE PROC PrintSolution
AS
SET NOCOUNT ON
BEGIN 
        SELECT TOP 100 PERCENT 'Y' + CAST(YPOS AS CHAR) AS 'Y/X', 
               [1] AS [X1],
               [2] AS [X2],
               [3] AS [X3],
               [4] AS [X4],
               [5] AS [X5],
               [6] AS [X6],
               [7] AS [X7],
               [8] AS [X8],
               [9] AS [X9]
        FROM   SOLUTION_BOARD
        PIVOT (MIN(VAL) FOR XPOS IN ([1],[2],[3],[4],[5],[6],[7],[8],[9])) AS SB
        ORDER BY YPOS
END
GO

I guess this should set up the environment that we need to start building our solution. Please feel free to use the scripts above if you want to come up with your own solution.

If you plan to write a solution, I suggest you read through this link. They have a good write-up on solving sudoku by logic and a javascript implementation of the same. The solution I plan to write, if not a direct implementation of their logic, will at least be based on theirs. And I wish to give them the due credit.

And if you do come up with a solution, please post it in the comments section. I would love to see it.

The next part of this series is available here.

Object Oriented SQL Programming with SQL Server 2005

2009-11-09T03:56:00.004-05:00

In this post, I have attempted a crude implementation of Object Oriented SQL Programming using the APPLY operator in SQL Server 2005. I feel that by giving some thought along these lines, we can bring more flexibility, abstraction and reusability to the way we query the database and, maybe, create a new style for data access.

There is only one rule that I am going to follow here. All table access will be done using an INLINE TABLE VALUED function. No query will directly access the table. I would like to illustrate it with a small example.

I am going to use the AdventureWorks database and write queries to fetch sales order information. To follow the rules defined, I will be creating two functions to get the order header and detail information respectively.

CREATE FUNCTION GetOrderHeader(@OrderID int)
RETURNS TABLE
AS
RETURN(
         SELECT * FROM Sales.SalesOrderHeader 
         WHERE SalesOrderID = coalesce(@OrderID,SalesOrderID)
      )
GO

CREATE FUNCTION GetOrderDetail(@OrderID int,@DetailID int)
RETURNS TABLE
AS
RETURN(
         SELECT * FROM Sales.SalesOrderDetail 
         WHERE SalesOrderID = coalesce(@OrderID,SalesOrderID)
           AND SalesOrderDetailID = coalesce(@DetailID,SalesOrderDetailID)
      )
GO

With the above 2 functions in place, here are some of the queries I can perform to fetch Order Information

--Fetch all order header information
select * from GetOrderHeader(NULL) A 

--Fetch order header information for order ID 43659
select * from GetOrderHeader(43659) A 

--Fetch all order header/detail information
select * from GetOrderHeader(NULL) A 
CROSS APPLY GetOrderDetail(A.SalesOrderID,NULL) B

--Fetch order header/detail information for Order ID 43659
select * from GetOrderHeader(43659) A 
CROSS APPLY GetOrderDetail(A.SalesOrderID,NULL) B

--Fetch information for Order ID 43659 and Detail ID 1 
select * from GetOrderHeader(43659) A 
CROSS APPLY GetOrderDetail(A.SalesOrderID,1) B

Adding another function to fetch the product information

CREATE FUNCTION GetProductInfo(@ProductID int,@Name nvarchar(50))
RETURNS TABLE
AS
RETURN(
         SELECT * FROM Production.Product 
         WHERE ProductID = coalesce(@ProductID,ProductID)
           AND Name like '%' + isnull(@Name,'') + '%' 
)
GO

I can now search for orders that contain a particular product using this query:

select * from GetOrderHeader(43659) A 
CROSS APPLY GetOrderDetail(A.SalesOrderID,NULL) B
CROSS APPLY GetProductInfo(B.ProductID,'Mountain') C

In my opinion, the above query is easier to read and much more maintainable than the one that we would usually write:

SELECT * FROM Sales.SalesOrderHeader A 
INNER JOIN Sales.SalesOrderDetail B 
        ON A.SalesOrderID = B.SalesOrderID
INNER JOIN Production.Product C 
        ON B.ProductID = C.ProductID
WHERE A.SalesOrderID = 43659
  AND C.Name like '%Mountain%'

This type of querying through functions has its own advantages.

For instance, there might be different types of data access that can happen on a particular table (filter, check for existence, etc). Each of these requirements can be implemented as a separate function, and the functions will be written around a particular object, like Orders.
Sometimes, a table may be normalized for better data access. The function can join the normalized tables and code tables, if any. All of this can be abstracted from the query.
The query will focus on business implementation while the function will focus on data access.

Having said all this, the SQL Server engine generally performs joins better than APPLY. So, none of the above suggestions is really practical for a production system. This post is just an attempt to show that there is scope for changing the way data access is done; whether it is a better option or not is something that I leave as an open question. So, what do you think?

A Scenario to Ponder #15

2009-11-06T01:58:00.003-05:00

I guess, by now, most of you are familiar with the SQL Server 2005 sample database - AdventureWorks. We will use one of the tables from this database to create our scenario - Production.BillOfMaterials. To simplify our requirements, we will be selecting only those records in this table that have the EndDate as NULL.

Before I go ahead and explain the scenario, let's see how this table is built. It's a hierarchical table. It shows the relationship of each component (Column: ComponentID) with its parent component (Column: ProductAssemblyID). There can be multiple levels of hierarchy.

For example, take ComponentID 749. It is the ultimate parent, since the ProductAssemblyID for this component is NULL. Finding the components that have 749 as the parent (ProductAssemblyID), we see that we need 14 other components to build 749. One such child component is 519, which in turn needs 4 other components for it to be built, and so on.
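
To get a feel for the hierarchy before attempting the scenario, a recursive common table expression can walk down from a parent one level at a time. This is only a sketch for exploring the table, not an answer to the scenario; the CTE name BOM and the Depth column are my own, and it assumes the standard AdventureWorks Production.BillOfMaterials table.

```sql
WITH BOM AS
(
    /* Anchor: direct children of component 749 */
    SELECT ComponentID, ProductAssemblyID, 1 AS Depth
    FROM   Production.BillOfMaterials
    WHERE  ProductAssemblyID = 749
      AND  EndDate IS NULL
    UNION ALL
    /* Recursive step: children of the components found so far */
    SELECT C.ComponentID, C.ProductAssemblyID, P.Depth + 1
    FROM   Production.BillOfMaterials C
    INNER JOIN BOM P
            ON C.ProductAssemblyID = P.ComponentID
    WHERE  C.EndDate IS NULL
)
SELECT ProductAssemblyID, ComponentID, Depth
FROM   BOM
ORDER BY Depth, ProductAssemblyID, ComponentID;
```

A component used by more than one parent shows up once per path, which is worth keeping in mind when reading the output.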

Now, here is the scenario:
Our company uses this table Production.BillOfMaterials to identify the bill of materials for any component. Let's say that the company has decided to discontinue production/use of a particular Component A.

To optimize the inventory, we need to come up with a query using this table that will generate the list of components that can also be discontinued along with this component A based on the following rules:

All parent components (at any level) that use component A can be discontinued.
All child components (at any level) that are used to build component A but are not used to build any other component can be discontinued.
Child components (at any level) that are used to build a parent (identified in rule 1) but are not used to build any other component can also be discontinued.

Can you come up with a query to achieve this? The query will accept one input - the discontinued ComponentID.

Please post your answer in the comments section.

Understanding Transaction Isolation Levels in SQL Server

2009-10-28T15:14:00.003-04:00

One of the most important concepts in any DBMS that every database programmer must know is transaction isolation levels. Sometimes even the most seasoned database developers get confused about how multiple connections running at different isolation levels affect each other. This post is an attempt to explain, through examples, how the isolation levels differ from each other.

The following table (taken from SQL Server Books Online) shows the different isolation levels (SQL-99 Standard) and its effect on concurrency.
Read uncommitted is the lowest isolation level and allows the highest concurrency with the least locking. Serializable is the highest isolation level with the lowest concurrency and holds key-range locks on the data read in the transaction.
Rather than looking at each Isolation Level and figuring out what it does, I thought it will make sense to understand how each concurrency problem is addressed as isolation level goes higher:

DIRTY READ: Ability to read uncommitted changes made by another transaction.
NON-REPEATABLE READ: Read locks are released immediately after the data is read. If the first transaction, that allows non-repeatable read, selects a row from a table, a second transaction can update (and commit) the same row even before the first transaction is complete. It means that if the first transaction selects the data again after the second transaction has completed, it will read the updated data.
PHANTOM: Read locks are held till the end of the transaction but no range locks are acquired. If the first transaction selects all the records in a table, a second transaction can insert (and commit) into the table. If the first transaction, reads through the table again, it will see the inserted (phantom) record.

The table below is a different view of the isolation levels. We will be using this table as a reference for our testing. We will take each scenario and compare the highest isolation level that allows against the lowest isolation level that denies.
Let's first set the environment for testing.

/* TEST_TABLE IS THE TABLE THAT WILL PARTICIPATE IN THE TRANSACTION */
CREATE TABLE TEST_TABLE
(
ID   INT         PRIMARY KEY,
NAME VARCHAR(10)
)

/* SOME TEST DATA FOR OUR TEST_TABLE */
INSERT INTO TEST_TABLE
SELECT 1, 'SMITH'
UNION ALL
SELECT 3, 'ADAMS'
UNION ALL
SELECT 5, 'SANDERS'

/* THIS TABLE WILL HOLD THE LOG OF WHAT IS HAPPENING. WILL BE EXPLAINED LATER */
CREATE TABLE TRANSACTION_STATUS
(
EXEC_ORDER          INT          IDENTITY(1,1),
PROCESS_DESCRIPTION VARCHAR(100),
ABSOLUTE_TIME       DATETIME
)

/* A BETTER VIEW OF THE LOG TABLE FOR OUR ANALYSIS */
CREATE VIEW VW_TRANSACTION_STATUS_LOG
AS
SELECT TOP 100 PERCENT
TS1.PROCESS_DESCRIPTION,
DATEDIFF(SS,TS2.ABSOLUTE_TIME,TS1.ABSOLUTE_TIME) AS RELATIVE_TIME
FROM
TRANSACTION_STATUS TS1,
TRANSACTION_STATUS TS2
WHERE
TS2.EXEC_ORDER = 1 
ORDER BY
TS1.EXEC_ORDER

Now, we create a transaction (Connection 1) that will select from the TEST_TABLE twice, with a delay of 10 seconds between them. We will also be logging each step happening in the transaction into our TRANSACTION_STATUS table. Given below is a sample screen shot of connection 1. Taking the logging part out, only the script within the selected area is pertinent to our testing.

We create another transaction (connection 2), that will start 5 seconds after the first transaction (connection 1). The transaction will update a record in the test table and wait for 10 seconds and rollback the transaction.
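
Since the screen shots may not be visible, here is a rough sketch of what the two connections described above boil down to, with the logging left out. It is a reconstruction under assumptions: the updated value 'JONES' and the choice of row are hypothetical, and the isolation level on connection 1 is the line to change for each scenario. Each part goes in its own query window, and both are started together.

```sql
/* ---- Query window 1 (connection 1) ---- */
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED; /* change per scenario */
BEGIN TRAN;
    SELECT * FROM TEST_TABLE;      /* first read  */
    WAITFOR DELAY '00:00:10';
    SELECT * FROM TEST_TABLE;      /* second read */
COMMIT TRAN;

/* ---- Query window 2 (connection 2) ---- */
WAITFOR DELAY '00:00:05';          /* start 5 seconds after connection 1 */
BEGIN TRAN;
    UPDATE TEST_TABLE
    SET    NAME = 'JONES'          /* hypothetical change */
    WHERE  ID = 1;
    WAITFOR DELAY '00:00:10';
ROLLBACK TRAN;
```

With connection 1 at READ UNCOMMITTED, its second read falls inside connection 2's open transaction and can return the uncommitted 'JONES'. At READ COMMITTED, the second read has to wait for the rollback and returns 'SMITH'.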

Now, if we start both the transactions simultaneously and look at the log, we will know the sequence of execution of commands across the two transactions.
Note: TEST_TABLE will be refreshed to have the three rows as given below, before starting each test scenario.
With the setup done, let's start with the first scenario:
Scenario 1 - Dirty Read: Comparing READ UNCOMMITTED and READ COMMITTED isolation level.
READ UNCOMMITTED: The below screen shot shows 3 connections. I tried to keep the image pretty self-explanatory, but I will walk through the screen for the first one:

Connection 1 runs at READ UNCOMMITTED isolation level. Connection 1 and 2 were started simultaneously.
You can see from the result set of Connection 1 that the second query of connection 1 reads through the uncommitted data of Connection 2.
Connection 3 shows the event log of the sequence of the commands between the first two connections.

READ COMMITTED: Here is what happens when the first connection runs in READ COMMITTED isolation level.

Scenario 2 - Non-repeatable Read: Comparing READ COMMITTED and REPEATABLE READ isolation level.
READ COMMITTED: The second connection updates the table and commits. And connection 1 reads the updated data in the second select.

REPEATABLE READ: And the same in REPEATABLE READ isolation level.

Scenario 3 - Phantom: Comparing REPEATABLE READ and SERIALIZABLE isolation level.
REPEATABLE READ: Connection 2 inserts a record and connection 1 reads the inserted record in the second select.

SERIALIZABLE: Connection 2 had to wait for connection 1 to commit before it can insert.
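
For reference, the phantom test can be sketched along the same lines as the earlier scenarios. Again this is an assumption-laden reconstruction: the inserted row (2, 'JONES') is hypothetical, logging is omitted, and the two parts run in separate query windows started together.

```sql
/* ---- Query window 1 (connection 1) ---- */
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE; /* or REPEATABLE READ */
BEGIN TRAN;
    SELECT * FROM TEST_TABLE;      /* first read  */
    WAITFOR DELAY '00:00:10';
    SELECT * FROM TEST_TABLE;      /* second read */
COMMIT TRAN;

/* ---- Query window 2 (connection 2) ---- */
WAITFOR DELAY '00:00:05';
INSERT INTO TEST_TABLE (ID, NAME)
VALUES (2, 'JONES');               /* hypothetical new row */
```

Under REPEATABLE READ the insert goes through at once and the second read returns four rows. Under SERIALIZABLE the insert waits until connection 1 commits, so both reads return the same three rows.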

You can see that the SERIALIZABLE isolation level gives the best data consistency within a transaction. However, any attempt to update the table is put on hold till the transaction is complete. Though we would love to have all our transactions running at the SERIALIZABLE isolation level, we have to go for the lowest isolation level (that will work for the transaction) to gain performance and allow concurrency.

The default isolation level (READ COMMITTED) of SQL Server, should work in most scenarios as it makes sure you read only clean (committed) data. However, specific scenarios might require you to go for a higher level of isolation. I would try avoiding READ UNCOMMITTED isolation level (or using the NOLOCK table hint). Instead, as a best practice, try to keep the duration of transactions as short as possible.

That ends this post. For further reading, there is a wealth of information on isolation levels on the web and BOL (of course).

In the above post, I haven't considered READ COMMITTED SNAPSHOT and DATABASE SNAPSHOT introduced in SQL Server 2005. I hope I can write a post on that later. Till then, happy querying!!

Custom Sorting and Paging in SQL Server Stored Procedure

2009-10-23T12:23:00.002-04:00

This post is about achieving pagination using stored procedure, without using dynamic SQL. The solution presented should work for both SQL Server 2000 and 2005.

Most web applications that display transactional data have a data grid with sorting and paging functionality and some filter criteria.

We usually use the default sorting/paging functionality available in the datagrid. But this requires the entire search result set to be cached on the client side, which is fine for a few hundred records but can be a drag as the number of records increases.

I was pulled into a situation where there were millions of records in the table and the filter criteria was not good enough to bring down the number of records being sent to the client side.

To fix this issue, we had to implement the sorting and paging in the database and send only the required data to be viewed in the page.

A quite common solution is to use dynamic SQL. I however wanted to come up with a solution without using dynamic SQL. So here it goes.

These are my requirements:
1) In the AdventureWorks database, I need to write a stored procedure to select data from Sales.SalesOrderDetail
2) The table should be joined with the Production.Product table to get the product name for the product code
3) The SP should allow sorting (ASC or DESC) on the following columns

ORDER_ID
TRACKING_NBR
PRODUCT_NAME
MODIFIED_DATE

4) The default sorting will be based on ORDER_ID, DETAIL_ID
5) The filtering can be done on the columns available for sorting
6) The SP will accept the page number and page size as input and return only that page based on the sort and filter condition
7) The SP will also have to return the total number of records that match the filter condition, ignoring pagination. This will be returned as an output parameter.

The above requirements should encompass all the functionality expected out of a database side pagination.

Below is the solution without using dynamic SQL.
Please note: I am not the guy who came up with all the techniques used to build the solution SP. Many of them were in common use even before I knew what SQL Server was. I am just collating the ideas in one place.

Please feel free to give your comments on how to improve this. If you think I missed a common functionality required in the SP, let me know, I will be happy to add it.



CREATE PROC SalesOrderDetailCustomSort
(
    @in_SalesOrderID          INT           = NULL,         /* FILTER - ORDER ID */
    @in_CarrierTrackingNumber VARCHAR(25)   = '',           /* FILTER - LIKE SEARCH ON TRACKING NUMBER */
    @in_ProductName           NVARCHAR(50)  = '',           /* FILTER - PRODUCT NAME */
    @in_ModifiedDateFrom      DATETIME      = '1753-01-01', /* FILTER - MODIFIED DATE FROM */
    @in_ModifiedDateTo        DATETIME      = '9999-12-31', /* FILTER - MODIFIED DATE TO */
    @PageSize                 SMALLINT      = 25,           /* NUMBER OF RECORDS PER PAGE */
    @TargetPage               SMALLINT      = 1,            /* PAGE THAT NEEDS TO BE RETURNED */
    @OrderBy                  VARCHAR(50)   = '',           /* ORDER BY COLUMN. WILL HAVE THE SAME NAME
                                                               AS THE RESULT SET COLUMN. IN OUR EXAMPLE
                                                               WE ALLOW SORTING ON THE FOLLOWING COLUMNS
                                                                   ORDER_ID
                                                                   TRACKING_NBR
                                                                   PRODUCT_NAME
                                                                   MODIFIED_DATE
                                                            */
    @SortOrder                VARCHAR(4)    = '',           /* SORT ORDER - ASC OR DESC */
    @TotalRecCount            INT        OUTPUT             /* TOTAL NUMBER OF RECORDS FETCHED
                                                               BASED ON THE FILTERS INCLUDING
                                                               ALL PAGES */
)
AS
BEGIN

    /* DECLARE LOCAL VARIABLES FOR FILTER CONDITIONS */

    DECLARE @CarrierTrackingNumber  VARCHAR(27), /* 2 CHARACTERS MORE THAN THE INPUT BECAUSE
                                                    WE WILL BE DOING A LIKE SEARCH */
            @ProductName            NVARCHAR(52),
            @ModifiedDateFrom       DATETIME,
            @ModifiedDateTo         DATETIME

    SELECT  @CarrierTrackingNumber  =       '%' +   @in_CarrierTrackingNumber   + '%',
            @ProductName            =       '%' +   @in_ProductName             + '%',
            @ModifiedDateFrom       =               @in_ModifiedDateFrom,
            @ModifiedDateTo         =               @in_ModifiedDateTo

    /* THE TABLE BELOW WILL STORE THE PRIMARY KEYS TO ALL THE RECORDS THAT SATISFY THE FILTER
       CRITERIA, SORTED IN THE ORDER REQUIRED. THIS IS NEEDED TO FIND THE TOTAL RECORDS FETCHED
       BY THE FILTER CRITERIA */

    DECLARE @SORTED_DETAILS_LIST TABLE
    (
        ID              INT     IDENTITY(1,1)   PRIMARY KEY,
        ORDER_ID        INT,
        DETAIL_ID       INT
    )

    INSERT INTO @SORTED_DETAILS_LIST
    (
        ORDER_ID,
        DETAIL_ID
    )
    SELECT
        SalesOrderID,
        SalesOrderDetailID
    FROM
        Sales.SalesOrderDetail OD
        INNER JOIN Production.Product     PR
        ON
            OD.ProductID = PR.ProductID
    WHERE
            OD.SalesOrderID           =             ISNULL(@in_SalesOrderID,OD.SalesOrderID)
        AND OD.CarrierTrackingNumber  LIKE          @CarrierTrackingNumber
        AND PR.Name                   LIKE          @ProductName
        AND OD.ModifiedDate           BETWEEN       @ModifiedDateFrom  AND  @ModifiedDateTo
    ORDER BY

        /* WE NEED TO CONVERT ALL THE COLUMNS THAT CAN QUALIFY FOR
           A SORT POSITION TO HAVE THE SAME DATATYPE. SO CONVERTING
           ALL OF THEM TO VARCHAR */

        CASE WHEN @OrderBy = 'PRODUCT_NAME'    AND @SortOrder <> 'DESC'
                THEN CAST(PR.Name AS VARCHAR(50))
             WHEN @OrderBy = 'TRACKING_NBR'    AND @SortOrder <> 'DESC'
                THEN OD.CarrierTrackingNumber
             WHEN @OrderBy = 'MODIFIED_DATE'   AND @SortOrder <> 'DESC'
                THEN CONVERT(VARCHAR,OD.ModifiedDate,102)
             ELSE NULL
        END     ASC,

        /* THE SAME HAS TO BE REPEATED FOR DESCENDING SORT ORDER */

        CASE WHEN @OrderBy = 'PRODUCT_NAME'    AND @SortOrder =  'DESC'
                THEN CAST(PR.Name AS VARCHAR(50))
             WHEN @OrderBy = 'TRACKING_NBR'    AND @SortOrder =  'DESC'
                THEN OD.CarrierTrackingNumber
             WHEN @OrderBy = 'MODIFIED_DATE'   AND @SortOrder =  'DESC'
                THEN CONVERT(VARCHAR,OD.ModifiedDate,102)
             ELSE NULL
        END    DESC,

        /* WE WILL ALWAYS SORT BY ORDER_ID AND DETAIL_ID EVEN IF NO SORT CONDITIONS ARE SPECIFIED */

        CASE WHEN @SortOrder <> 'DESC'
                THEN OD.SalesOrderID       ELSE NULL END ASC,
        CASE WHEN @SortOrder =  'DESC'
                THEN OD.SalesOrderID       ELSE NULL END DESC,
        CASE WHEN @SortOrder <> 'DESC'
                THEN OD.SalesOrderDetailID ELSE NULL END ASC,
        CASE WHEN @SortOrder =  'DESC'
                THEN OD.SalesOrderDetailID ELSE NULL END DESC

    /* STORE THE TOTAL RECORDS SELECTED BY THE LAST QUERY INTO THE OUTPUT PARAMETER */

    SET @TotalRecCount =  @@ROWCOUNT

    /* FETCH THE COMPLETE DATA TO BE RETURNED BACK ONLY FOR THOSE ORDERS THAT ARE SELECTED BASED
       ON FILTER AND PAGINATION REQUIREMENT */

    SELECT
        SDL.ID                   AS ROW_ID,
        OD.SalesOrderID          AS ORDER_ID,
        OD.SalesOrderDetailID    AS DETAIL_ID,
        OD.CarrierTrackingNumber AS TRACKING_NBR,
        OD.OrderQty              AS ORDERED_QTY,
        PR.Name                  AS PRODUCT_NAME,
        OD.SpecialOfferID        AS SPECIAL_OFFER,
        OD.UnitPrice             AS UNIT_PRICE,
        OD.UnitPriceDiscount     AS DISCOUNT,
        OD.LineTotal             AS TOTAL_COST,
        OD.ModifiedDate          AS MODIFIED_DATE
    FROM
        @SORTED_DETAILS_LIST      SDL
        INNER JOIN Sales.SalesOrderDetail  OD
        ON
                SDL.ORDER_ID  = OD.SalesOrderID
            AND SDL.DETAIL_ID = OD.SalesOrderDetailID
        INNER JOIN Production.Product PR
        ON
            OD.ProductID = PR.ProductID
    WHERE
        SDL.ID BETWEEN (@TargetPage-1)*@PageSize + 1 AND (@TargetPage)*@PageSize
    ORDER BY
        SDL.ID ASC

END
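
On SQL Server 2005 only, the same page arithmetic can be applied with ROW_NUMBER() instead of an identity column in a table variable. The following is a simplified sketch, not a replacement for the procedure above: it assumes a fixed sort on ORDER_ID, DETAIL_ID, keeps only the product name filter, does not return the total record count, and its variable names are illustrative.

```sql
DECLARE @PageSize INT, @TargetPage INT, @ProductName NVARCHAR(52);
SELECT @PageSize = 25, @TargetPage = 8, @ProductName = N'%Tour%';

WITH SORTED_DETAILS AS
(
    SELECT
        -- Number the filtered rows in the required sort order
        ROW_NUMBER() OVER (ORDER BY OD.SalesOrderID, OD.SalesOrderDetailID) AS ROW_ID,
        OD.SalesOrderID          AS ORDER_ID,
        OD.SalesOrderDetailID    AS DETAIL_ID,
        OD.CarrierTrackingNumber AS TRACKING_NBR,
        PR.Name                  AS PRODUCT_NAME
    FROM Sales.SalesOrderDetail OD
    INNER JOIN Production.Product PR
        ON OD.ProductID = PR.ProductID
    WHERE PR.Name LIKE @ProductName
)
SELECT ROW_ID, ORDER_ID, DETAIL_ID, TRACKING_NBR, PRODUCT_NAME
FROM SORTED_DETAILS
-- Keep only the rows that belong to the requested page
WHERE ROW_ID BETWEEN (@TargetPage - 1) * @PageSize + 1 AND @TargetPage * @PageSize
ORDER BY ROW_ID;
```

With a page size of 25 and target page 8, the filter keeps rows 176 through 200 of the sorted list.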

When I execute the SP with the following parameters:

DECLARE @OutTotalRecCount int

EXEC SalesOrderDetailCustomSort @in_ModifiedDateFrom = '2003-01-01',                                
@in_ProductName = 'Tour',
@OrderBy = 'TRACKING_NBR',
@PageSize = 25,
@TargetPage = 8,
@TotalRecCount = @OutTotalRecCount OUTPUT

SELECT @OutTotalRecCount AS TOTAL_RECORDS_FETCHED

This is the result set that is returned back:

A Scenario to Ponder #14

2009-10-10T11:44:00.001-04:00

Lots of things happened since the last post I made in this blog. I got married. Didn't do much justice to my blog followers since then. My wife couldn't stand me being interested in ONLY her :) So, she pushed me to get this blog started again and I hope she keeps pushing me to continue this blog.

Now, here is the scenario.
I had been playing Chicktionary (google it if you have not heard of it) on Club Bing. There are other versions of the same game: Word Warp, Text Twist, Jumble, etc. In short, it's a game where you are given a set of letters and need to come up with as many words as you can from those letters in a given time. It's a pretty addictive game.

Within a few days, I thought it would be even more interesting if I could write a query that generates the words for me when I give it the set of letters.

To begin with that, I had to set the environment:
1. I need the list of words that my query will be searching from. On googling I found that the following link had a good list of words:

http://www.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt

2. I loaded this list to a table [WordList] ([Word] varchar(50)) in a SQL Server database.

3. I know that none of these games accept any word that is less than 3 characters long. So I deleted those words from the table.

4. When you do all the above steps, you should have 109441 words in your table.

With the environment set, I need to write a query (or a stored procedure) that accepts a string of letters and returns all the words in the WordList table that can be formed from those letters, ordered from the longest word to the shortest.

There is always a brute force method to solve this kind of scenario. But, since this game is timed, I want the most optimal solution so that the result is generated as soon as possible.

So, get me the best solution you can think of. Post as many solutions as you want/can. It can be in SQL Server 2000 or SQL Server 2005. The solution that I have is for SQL Server 2005.
Happy Querying!!
Here is an example of the output I am looking for, for the input string USILDF:

This is the most optimized solution I can come up with for this scenario. If you can find a better solution, please post it in the comments section. I would love to see it.

DECLARE @INPUT_WORD           VARCHAR(10)

SET         @INPUT_WORD =     'USILDF';

WITH NUMBERS AS
(
SELECT 1 AS NUM
UNION ALL
SELECT NUM+1 FROM NUMBERS WHERE NUM+1 <= LEN(@INPUT_WORD)
),
WRDLIST AS
(
SELECT WORD,SUBSTRING(WORD,NUM,1) CHR,COUNT(*) CNT
FROM WORDLIST, NUMBERS
GROUP BY WORD,SUBSTRING(WORD,NUM,1)
),
INPUTWORD AS
(
SELECT SUBSTRING(@INPUT_WORD,NUM,1) CHR,COUNT(*) CNT
FROM NUMBERS
GROUP BY SUBSTRING(@INPUT_WORD,NUM,1)
)
SELECT
A.WORD
FROM
WRDLIST A, INPUTWORD B
WHERE
A.CHR = B.CHR
AND A.CNT <=B.CNT
GROUP BY
A.WORD
HAVING
SUM(A.CNT) = LEN(A.WORD)
ORDER BY
LEN(A.WORD) DESC,A.WORD DESC
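
To see how the letter counting works without loading the full word list, here is the same query run against a tiny sample. The table variable @SAMPLE and its three words are hypothetical stand-ins for the WordList table; everything else follows the query above, with comments added.

```sql
DECLARE @INPUT_WORD VARCHAR(10);
SET @INPUT_WORD = 'USILDF';

-- Hypothetical three-word sample standing in for the WordList table
DECLARE @SAMPLE TABLE (WORD VARCHAR(50));
INSERT INTO @SAMPLE (WORD) VALUES ('FLUID');
INSERT INTO @SAMPLE (WORD) VALUES ('FILL');
INSERT INTO @SAMPLE (WORD) VALUES ('LIDS');

WITH NUMBERS AS
(
    -- Positions 1 .. LEN(@INPUT_WORD)
    SELECT 1 AS NUM
    UNION ALL
    SELECT NUM + 1 FROM NUMBERS WHERE NUM + 1 <= LEN(@INPUT_WORD)
),
WRDLIST AS
(
    -- Letter counts per word, looking only at the first LEN(@INPUT_WORD) positions
    SELECT WORD, SUBSTRING(WORD, NUM, 1) AS CHR, COUNT(*) AS CNT
    FROM @SAMPLE, NUMBERS
    GROUP BY WORD, SUBSTRING(WORD, NUM, 1)
),
INPUTWORD AS
(
    -- Letter counts of the input string
    SELECT SUBSTRING(@INPUT_WORD, NUM, 1) AS CHR, COUNT(*) AS CNT
    FROM NUMBERS
    GROUP BY SUBSTRING(@INPUT_WORD, NUM, 1)
)
SELECT A.WORD
FROM WRDLIST A, INPUTWORD B
WHERE A.CHR = B.CHR
  AND A.CNT <= B.CNT              -- a word may not use a letter more often than the input has it
GROUP BY A.WORD
HAVING SUM(A.CNT) = LEN(A.WORD)   -- every letter of the word must have been matched
ORDER BY LEN(A.WORD) DESC, A.WORD DESC;
```

FLUID and LIDS come back, in that order. FILL is dropped because it needs two Ls and the input has only one, so its matched letter count never reaches the length of the word.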

A Scenario to Ponder #13

2007-10-09T11:26:00.002-04:00

Last February, I came over to Atlanta, Georgia from India for a new project and was so overwhelmed by the new way of life, the work and the country that I wasn't able to do justice to this blog. I apologize to those who had been following this mini series of scenarios. I hope I can find some time now to revive it.

Here goes the scenario:

I have a response/resolution time reporting requirement for a Service Call Management System:
My main table is #ServiceCalls. The table definition goes like this:

create table #ServiceCalls
(
SvcCallNo int,
SvcCallDesc varchar(100),
CreateTime datetime,
ResolutionTime datetime
)

Here is the data for this table

insert into #ServiceCalls
select 4709,'I have the complaint #4709','Oct 9 2007 6:59AM','Oct 10 2007 12:59PM' union all
select 4716,'I have the complaint #4716','Oct 8 2007 12:23PM','Oct 10 2007 6:23PM' union all
select 4685,'I have the complaint #4685','Oct 7 2007 12:04PM','Oct 9 2007 6:04PM' union all
select 4695,'I have the complaint #4695','Oct 6 2007 10:59AM','Oct 9 2007 4:59PM' union all
select 4654,'I have the complaint #4654','Oct 5 2007 10:29AM','Oct 8 2007 4:29PM' union all
select 4692,'I have the complaint #4692','Oct 4 2007 10:00AM','Oct 8 2007 4:00PM' union all
select 4637,'I have the complaint #4637','Oct 3 2007 9:27AM','Oct 8 2007 3:27PM' union all
select 4674,'I have the complaint #4674','Oct 2 2007 1:52PM','Oct 6 2007 7:52PM' union all
select 4689,'I have the complaint #4689','Oct 1 2007 9:36AM','Oct 5 2007 3:36PM' union all
select 4700,'I have the complaint #4700','Oct 1 2007 4:43AM','Oct 4 2007 10:43AM'

With the above table, I need to generate a report having these fields:
SvcCallNo, ActualCreateTime, ActualResolutionTime, Effort

Points to note:
1. Service calls are worked upon between 9:00 AM and 6:00 PM, Monday through Friday (working hours).
2. ActualCreateTime is the closest working time on or after the CreateTime.
For example, for SvcCallNo 4709, the ActualCreateTime is 'Oct 9 2007 9:00AM'.
3. ActualResolutionTime is the closest working time on or before the ResolutionTime.
For example, for SvcCallNo 4716, the ActualResolutionTime is 'Oct 10 2007 6:00PM'.
4. Effort is the number of working hours between the ActualCreateTime and the ActualResolutionTime.

How do I go about writing the query? The solution can be in SQL Server 2000 or 2005.
Hint: You can create a calendar table and solve this (it's still a challenge). If you can come up with an efficient query without a calendar table for SQL Server 2000, that would be pure genius.
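
For the calendar-table hint, here is one possible starting point, not the solution. The table name #WorkCalendar and its columns are my own illustration; the sketch only builds the working windows for October 2007 and leaves the report query as the challenge.

```sql
-- Hypothetical calendar table: one row per working day with its 9:00 AM to 6:00 PM window
CREATE TABLE #WorkCalendar
(
    WorkDate DATETIME PRIMARY KEY,
    DayStart DATETIME,
    DayEnd   DATETIME
);

DECLARE @d DATETIME;
SET @d = '20071001';

WHILE @d <= '20071031'
BEGIN
    -- (DATEPART(dw, ...) + @@DATEFIRST) % 7 is 0 for Saturday and 1 for Sunday,
    -- whatever SET DATEFIRST value is in effect
    IF (DATEPART(dw, @d) + @@DATEFIRST) % 7 NOT IN (0, 1)
        INSERT INTO #WorkCalendar (WorkDate, DayStart, DayEnd)
        VALUES (@d, DATEADD(hour, 9, @d), DATEADD(hour, 18, @d));

    SET @d = DATEADD(day, 1, @d);
END

SELECT WorkDate, DayStart, DayEnd
FROM #WorkCalendar
ORDER BY WorkDate;
```

The idea is that each service call can then be joined to the rows whose window overlaps the period between CreateTime and ResolutionTime, so weekends and off-hours never enter the calculation.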

A Scenario to Ponder #12

2007-02-07T00:22:00.002-05:00

It's been a long time since I used this medium of communication. Without getting into excuses for the dormancy, I will get on with the question...

Here is the scenario. Say, I am in a business consulting company and I am given a table EmployeeCustomerOrders.

The table can be generated from the orders table in the Northwind database using this query.

use Northwind
go
select employeeid, customerid, count(orderid) as OrderCount into #EmployeeCustomerOrders
from orders
group by employeeID, customerID
order by employeeid, customerid

Here you can see that an employee can process multiple customers and, similarly, a customer can be processed by multiple employees. The consulting experts have advised that, to improve sales, the following changes have to be made:

A customer should be tied to only one employee
An employee can process orders for multiple customers
The employee who has created the most orders should be mapped to the customers who have placed the fewest orders, and vice versa

My job here is to come up with that mapping between customers and employees: which customers go to which employee.
For example: say there are 100 customers and 10 employees. The employee with the maximum orders will be mapped to the 10 customers who have placed the fewest orders, and the employee with the minimum orders will be mapped to the top 10 customers who placed the most orders.
The result will have two columns: EmployeeID, CustomerID
How do I go about writing the query? The solution can be in SQL Server 2000 or 2005.

Obvious SQL Tip #2

2006-12-04T02:02:00.009-05:00

Thinking in sets:
The thought process that goes into set-based programming is slightly different from writing procedural code.

Say I have 2 tables, one holding order header information and the other holding the details of the products ordered. I want to get the items ordered for all the orders.

The procedural way of looking at it is "for each order in the orders table get the order details from the [order details] table".
In DB terms, it's "for all orders in the orders table get the order details from the [order details] table".

It's the subtleties that make the difference.

Tip: When it comes to database programming, you no longer look at each row. You should understand that a table is a collection of rows with similar characteristics. Or it's more like, "you've seen one row.. you've seen 'em all".

Say, for example, you need only the orders placed in the US and not the rest of the world.

The procedural way: "for each order, check if the order is placed in the US. If so, then get the corresponding order details, else ignore the order".
The set-based way: "get the order details for all the orders placed in the US".
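
The set-based sentence translates almost word for word into a query. Here is a minimal sketch against the Northwind database, assuming "placed in US" is read as ShipCountry = 'USA' (the choice of column is my assumption):

```sql
-- Set-based: one statement describes the whole result, no row-by-row check
SELECT OD.OrderID, OD.ProductID, OD.UnitPrice, OD.Quantity, OD.Discount
FROM Orders O
INNER JOIN [Order Details] OD
    ON OD.OrderID = O.OrderID
WHERE O.ShipCountry = 'USA';
```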

Thinking the procedural way will only lead to one conclusion for any requirement- CURSORS. Stop thinking the procedural way, if not possible, stop thinking :)

Tip: With your procedural code, you build your result set from scratch one row at a time. Don't do that. Take all the rows and eliminate whichever is not necessary.

Here is an analogy: Say you have a bag with 100 coins. Of those, 10 coins are silver and 90 are gold. You want to take the 10 silver coins out as fast as you can. How will you do it? If you think you will take one coin out at a time and check whether it's silver or gold, then database programming is not the best option for you.

Get the procedural code out of your database:

CURSORs with IF.. ELSE can usually be replaced with CASE expressions in a SELECT.
Scalar user defined functions (S-UDFs) in SQL Server are the bridge between procedural code and T-SQL code. Or, to be precise, they are just procedural code in T-SQL. Avoid them if possible.
In an S-UDF, it might be tempting to use recursion for scenarios like converting an integer to binary or computing a factorial. It might be excellent code, but a while loop performs better than recursion, a CLR function (SQL Server 2005) performs better than a while loop, and doing the calculation in the front end is definitely better than using a CLR function.
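
As an illustration of the first point, here is a sketch against the Northwind orders table. The freight thresholds and the FreightBand labels are made up for the example; the point is only that the branching a cursor would do per row moves into a CASE expression.

```sql
-- Instead of a cursor that tests each row with IF.. ELSE,
-- classify every row in a single pass with CASE
SELECT
    OrderID,
    Freight,
    CASE
        WHEN Freight >= 100 THEN 'HIGH'
        WHEN Freight >= 25  THEN 'MEDIUM'
        ELSE 'LOW'
    END AS FreightBand
FROM Orders;
```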

A Scenario to Ponder #11

2006-11-29T06:31:00.001-05:00

This time it's a pretty interesting scenario to solve. I have a table called distance_tbl holding the distances between different cities.
Here is the table definition:

create table #distance_tbl (city1 char(1), city2 char(1), distance int)

And the sample data:

insert into #distance_tbl values('A','B',200)
insert into #distance_tbl values('A','C',100)
insert into #distance_tbl values('C','B',90)
insert into #distance_tbl values('C','D',300)
insert into #distance_tbl values('C','E',200)
insert into #distance_tbl values('E','F',50)
insert into #distance_tbl values('F','D',50)
insert into #distance_tbl values('F','C',220)

Let's assume the information is pretty comprehensive and so the table is pretty huge.
I want to put this information to good use. So, I am planning to build a site where people can search for the shortest distance between two cities and also the route between them.

The site will in turn use a query to get the information. Two parameters come from the user to the query: the start city and the end city, say @Start and @End, both of type char(1).

How the result is presented, I leave to your imagination. But here are a few examples of the information required for a given input.

Example 1:
Input: @Start = 'A', @End = 'B'
Output: Distance = 190 and you go from A to C and C to B

Example 2:
Input: @Start = 'C', @End = 'D'
Output: Distance = 270 and you go from C to F and F to D

A few rules to note:
In the table, A to C being 100 miles also means C to A is 100 miles; the reverse direction will not be stored as a separate row.
If two different paths have the same shortest distance, then I need the path that touches the fewest cities.
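
The first rule means any solution has to treat each row as a two-way road. One small building block, offered as a sketch rather than part of the answer, is to expand the table so both directions are visible (the column aliases from_city and to_city are my own):

```sql
-- Expand each stored row into both directions of travel
SELECT city1 AS from_city, city2 AS to_city, distance
FROM #distance_tbl
UNION ALL
SELECT city2, city1, distance
FROM #distance_tbl;
```

With the eight sample rows, this returns sixteen rows, so a search that starts from 'B' can find the road to 'C' even though only C to B is stored.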

The solution can be given in SQL Server 2000 or SQL Server 2005 and you may use temp tables, cursors or while loops, if necessary.

Obvious SQL Tip #1

2006-11-27T08:46:00.001-05:00

I don't really know why I am starting this series, whether it will be really useful to anyone, or whether I will be able to keep it running for long. But it's here because I am convinced that sometimes the most obvious escapes our mind. And as someone rightly said, "common sense is not so common".

I just intend to illustrate a few common issues (one per post) where we try to come up with complicated queries and realise, in the end, that we have a painfully obvious alternative. But the question is: is it obvious, and is it really an alternative? Read ahead...

My requirement is this. I have SQL Server 2000. In the Northwind database, I need to select some information for the order (in the orders table) which has the maximum freight charge.

The query that strikes our mind immediately is this:

select OrderID, CustomerID, EmployeeID, OrderDate, freight
from orders
where freight = (select max(freight) from orders)

On running the above query, I get this result:

What we are doing here is a self-join which, if you notice carefully, can be avoided. Can I use this query to get the result?

select top 1 OrderID, CustomerID, EmployeeID, OrderDate, freight
from orders
order by freight desc

Of course, I can. And I reduced one join. Isn't this great? Yes, it is.

Happily basking in the glory of ineptitude, seldom do we realise that there can be more than one order having the same max freight charge and we got the same result just by chance.

Maybe, if I wanted to find the last order by orderdate, my safe query would look this way:

select OrderID, CustomerID, EmployeeID, OrderDate, freight
from orders
where orderdate = (select max(orderdate) from orders)

This is the result:

And my seemingly better query:

select top 1 OrderID, CustomerID, EmployeeID, OrderDate, freight
from orders order by orderdate desc

This gives me the wrong result (of course):

Now, I have realised the problem, but I don't want to abandon my approach. I would write my query this way to get the correct result:

select top 1 with ties OrderID, CustomerID, EmployeeID, OrderDate, freight
from orders
order by orderdate desc

It got me the result. Let's see how well it performs when compared to my safe query. So, I run the following queries in parallel:

--Query #1
select top 1 with ties OrderID, CustomerID, EmployeeID, OrderDate, freight
from orders
order by orderdate desc

--Query #2
select OrderID, CustomerID, EmployeeID, OrderDate, freight
from orders
where orderdate = (select max(orderdate) from orders)

and when I check the query execution plan, I am in for a surprise:

I find that Query #1 incurs almost double the cost of Query #2 (the safe query), and Query #2 doesn't have a self-join after all (the reason I started this exercise was to avoid a self-join that is not even there). I still don't lose hope. I want to make my Query #1 perform. Assuming that orderdate will always be less than or equal to the current date, I can convert my scan to a seek. So I add a useless filter to it, and my query now is this:

select top 1 with ties OrderID, CustomerID, EmployeeID, OrderDate, freight
from orders
where orderdate <= getdate()
order by orderdate desc

Here is the query execution plan for the above query:

Now, I am able to make the query execute as fast as my safe query. Good that I did it without a join. But I see now that my query looks more complicated, rests on an assumption which might fail anytime, and gives me the same execution plan and cost as my safe query. So, which query do you choose: good and safe, or just good?

Moral of the story:
"Don't stop thinking too hard for better alternatives, but remember that it doesn't become a better alternative just because you thought too hard" :)

A Scenario to Ponder #10

2006-11-22T10:49:00.000-05:00

Say, I have a requirement. My business demands an auto-generated key (or an identity) for a table called CharTbl. But it needs to be a character identity column. The sequence generation should be as given in the figure below (partial result displayed).

How would I go about creating the table definition?

Pivot Query Generator for SQL Server 2000

2006-11-20T06:26:00.001-05:00

This post is written with reference to the post Pivot Query Generator for SQL Server 2005. Please read through that post to get the context and the data setup before you read any further.

Now that you have read through the other post, you will realise that in SQL Server 2005 we have the PIVOT operator to generate crosstab reports. But in SQL Server 2000, we have to use an indirect approach to generate the cross-tab. Here is the SQL Server 2000 counterpart of the Pivot Query Generator for SQL Server 2005.

This SP is still at a rudimentary stage and does not handle join conditions. You may use it to see how to generate cross-tab reports and extend it to suit your requirements. And if you see a way it can be improved, leave a comment and I will try to update it if possible.

Though the implementation of this SP is different from the one written for SQL Server 2005, the end result should be the same. You can get the generated query from the messages tab of your Query Analyzer.

Here is the stored procedure definition:

create proc GeneratePivotQuery (
@TableName varchar(100), -- Table to select from
@AggregateFunction varchar(10), -- Aggregate to be done on the value column
@ValueColumn varchar(100),
@PivotColumn varchar(100), -- Column for which the values needs to be transposed
@FilterCondition varchar(1000), -- Filtering to be done on choosing the pivot values
@OtherColumns varchar(1000)) -- Columns selected other than the pivot columns
as
begin
declare @List varchar(2000)
set @List = '';

declare @ConcatQuery varchar(4000)
declare @CurColumn varchar(100)

set @ConcatQuery = 'DECLARE C1 CURSOR global FAST_FORWARD FOR select distinct '+
@PivotColumn + ' from ' + @TableName + isnull(' where ' + @FilterCondition,'')

exec(@ConcatQuery)

open C1

fetch next from C1 into @CurColumn

set @List = @list + @AggregateFunction + '( case when [' + @PivotColumn +
'] = ''' + @CurColumn + ''' then ' +
@ValueColumn + ' else null end ) as [' + @CurColumn + ']'

while (@@fetch_status = 0)
begin

fetch next from C1 into @CurColumn
if(@@fetch_status = 0)
set @List = @list + ',' + char(10) + char(9) + @AggregateFunction + '( case when [' + @PivotColumn +
'] = ''' + @CurColumn + ''' then ' +
@ValueColumn + ' else null end ) as [' + @CurColumn + ']'

end

close c1
deallocate c1

/* this will print the query used to generate the pivoted result */
print '/*query begin*/' + char(10) + char(10) + 'select '+ char(10) + char(9) + @OtherColumns + ',' + char(10)+ char(9) + @List + char(10) + ' from ' + @TableName +
char(10) + ' group by ' + char(10)+ char(9) + @OtherColumns + char(10) + char(10) + '/*query end*/' + replicate(char(10),5)

/* this will generate the result set */
exec ('select ' + @OtherColumns + ',' + @List + ' from ' + @TableName +
' group by ' + @OtherColumns )

end

Here is the call to the stored procedure, that will pivot those values in city2 where city2 is between 'B' and 'D' and display the result. You can get the query, used to generate the result, from the messages tab.

DECLARE @TableName varchar(100)
DECLARE @AggregateFunction varchar(10)
DECLARE @ValueColumn varchar(100)
DECLARE @PivotColumn varchar(100)
DECLARE @FilterCondition varchar(1000)
DECLARE @OtherColumns varchar(1000)

SELECT @TableName = 'distance_tbl'
SELECT @AggregateFunction = 'sum'
SELECT @ValueColumn = 'distance'
SELECT @PivotColumn = 'city2'
SELECT @FilterCondition = 'city2 between ''B'' and ''D'''
SELECT @OtherColumns = 'city1'

EXEC [dbo].[GeneratePivotQuery]
@TableName,
@AggregateFunction,
@ValueColumn,
@PivotColumn,
@FilterCondition,
@OtherColumns
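
For reference, with the parameters above the generator builds a query of the following shape. This one is written by hand, and it assumes distance_tbl holds the sample rows from A Scenario to Ponder #11, so that the distinct city2 values between 'B' and 'D' are B, C and D. The column order in the generated query depends on the order in which the cursor returns those values.

```sql
-- Hand-written equivalent of the generated cross-tab query
select
    city1,
    sum( case when [city2] = 'B' then distance else null end ) as [B],
    sum( case when [city2] = 'C' then distance else null end ) as [C],
    sum( case when [city2] = 'D' then distance else null end ) as [D]
from distance_tbl
group by
    city1
```

Each pivoted value becomes one aggregated CASE expression, which is the indirect approach SQL Server 2000 needs in place of the PIVOT operator.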

Parameter Sniffing & Stored Procedures Execution Plan

2006-11-17T08:23:00.002-05:00

This post is an attempt to explain what parameter sniffing is and how it affects the performance of a stored procedure. I will walk through an example to demonstrate its effect, showing the query execution plans generated by SQL Server 2005. To my knowledge, this process works the same way in SQL Server 2000 and SQL Server 2005, so you should be able to relate it to SQL Server 2000 as well.

According to the white paper Batch Compilation, Recompilation, and Plan Caching Issues in SQL Server 2005, published on the Microsoft site:

"Parameter sniffing" refers to a process whereby SQL Server's execution environment "sniffs" the current parameter values during compilation or recompilation, and passes it along to the query optimizer so that they can be used to generate potentially faster query execution plans. The word "current" refers to the parameter values present in the statement call that caused a compilation or a recompilation.

Before I go ahead with the example, you need to understand that the query execution plan generated by the query optimizer depends on a lot of factors and parameter sniffing is just one of them. So the execution plans I show here might not be the execution plan you get if you run the same query on your server.

Let's consider the Orders table in the Northwind database.

Say, I create a stored procedure with the definition as:

create procedure GetOrderForCustomers(@CustID varchar(20))
as
begin
select * from orders
where customerid = @CustID
end

Remember, that the query execution plan is not generated when you create the procedure. It gets created and cached the first time you run it.

First, let's look at the distribution of the number of orders for each customer.

Running the following query in the Northwind database:

select top 100 percent customerid,count(*) as OrderCount from orders
group by customerid order by count(*)

We get the number of orders placed by each customer (Only partial result shown)

The first and the last customer are the ones we are interested in:

CENTC has OrderCount (min) = 1

SAVEA has OrderCount (max) = 31

Let's say I want to find all the orders for CustomerID = 'SAVEA' by calling the stored procedure that we created before.

exec GetOrderForCustomers 'SAVEA'

Since this is the first time the stored procedure is called, it will create an optimized query execution plan and execute it.

Now, the stored proc will return me 31 rows. Let's look at the query execution plan.

Two values in the Clustered Index Scan information are of interest to us.

Actual Number of Rows 31
Estimated Number of Rows 31.0747

How did the optimizer correctly estimate the actual number of rows?

It's because of what we call "parameter sniffing". The optimizer created the plan knowing the fact that it was going to get the information for the customerID 'SAVEA' and hence retrieve 31 rows.

Then how did the optimizer know that 'SAVEA' has 31 orders?

SQL Server internally maintains statistics on the distribution of values in the columns used for filtering, which in turn is essentially the information in the result of this query (which we used above):

select top 100 percent customerid,count(*) as OrderCount
from orders group by customerid
order by count(*)
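
If you want to look at what the optimizer actually keeps, rather than approximating it with a GROUP BY, DBCC SHOW_STATISTICS displays the stored statistics. A sketch, assuming the index on Orders.CustomerID is named CustomerID as in the stock Northwind script (an assumption worth checking on your copy):

```sql
-- Show the statistics header, density and histogram for an index on Orders.CustomerID.
-- 'CustomerID' is assumed to be the index name; run sp_helpindex 'Orders'
-- to check the index names on your copy of Northwind.
DBCC SHOW_STATISTICS ('Orders', 'CustomerID');
```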

Then what is the problem with parameter sniffing, if it's helping the query optimizer do better optimization?

Check out the query execution plan, again, for a different input (CustomerID = 'CENTC'):

exec GetOrderForCustomers 'CENTC'

Did you notice that the estimated number of rows is still 31 when the actual number of rows is 1? This plan was optimized for retrieving 31 rows (in the first run), and it stays in the cache for reuse until the server is restarted, the procedure is recompiled, or the plan is removed from the procedure cache for any other reason. So we need to understand that this plan may or may not work with the same efficiency for retrieving 1 row.

Now, let's see what the efficient plan (on my server) is for retrieving the orders for 'CENTC'. For that I need to clear the current execution plan. Let's drop and recreate the stored procedure so that the cached plan is removed.

drop proc GetOrderForCustomers
go
create procedure GetOrderForCustomers(@CustID varchar(20))
as
begin
select * from orders
where customerid = @CustID
end
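
Dropping and recreating is not the only way to discard the cached plan. As an alternative sketch to the drop and recreate above, the system procedure sp_recompile flags the procedure so that a new plan is compiled the next time it runs:

```sql
-- Marks the procedure so that a new plan is compiled on its next execution
EXEC sp_recompile 'GetOrderForCustomers';
```

Either route leaves the procedure without a reusable plan.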

Now that the plan cache is cleared, a fresh query execution plan will be generated for the input I give in the following procedure call.

exec GetOrderForCustomers 'CENTC'

Let's look at the query execution plan now.

Now, do you notice that the plan has changed and the estimates have changed as well?

Actual Number of Rows 1
Estimated Number of Rows 1.00241

If you call the procedure again with 'SAVEA' as the input, you will see that the plan will be the same and estimated number of rows will still be 1 though the actual rowcount will be 31.

So, either way it all depends on which input I give the first time I call the stored procedure.

But, It's impossible to keep track of when the plan was created and what was the input it used. In that case, we have an option here to disable the dependency on the input parameter.

We can create local variables, assign the input parameter to them, and use the local variables in the query. Since the procedure parameter is not used directly in the query, the optimizer generates a generic query execution plan.

The same stored procedure can be written this way to avoid parameter sniffing:

create procedure GetOrderForCustomersWithoutPS(@CustID varchar(20))
as
begin
declare @LocCustID varchar(20)
set @LocCustID = @CustID

select * from orders
where customerid = @LocCustID
end

Now, run the stored procedure with 'SAVEA' as the input:

exec GetOrderForCustomersWithoutPS 'SAVEA'

This is the query execution plan that I get.

Here you see that the actual number of rows is 31, whereas the estimated number of rows is not 31 but close to 9. It's clear in this case that the query optimizer did not use the parameter value when generating the query execution plan. Then where did it get the value 9?

You may guess, just like I did. We saw the distribution of the number of orders for each customer. Now, let's find the average number of orders per customer. This query averages the OrderCount of all customers from the previous query (the 1.0* keeps the average from being truncated to an integer):

select avg(1.0*OrderCount) from
(select top 100 percent customerid,count(*) as OrderCount
from orders
group by customerid
order by count(*)) b

Or it can be simply written this way:

select 1.0*count(*)/count(distinct customerid) from orders

You will see that the output is:

Well, this is the estimated row count that we saw in the execution plan. If you look at the current data, 67 out of 89 distinct customers have between 4 (9-5) and 14 (9+5) orders. So the execution plan generated should work well for the majority of the customers. Hence, disabling parameter sniffing is a good choice in this case.
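
To see why the two queries above agree, here is a minimal Python sketch of the same arithmetic. The customer IDs and order counts are made up for illustration (they are not the Northwind data), and the variable names are my own.

```python
from collections import Counter

# Hypothetical data: one customer ID per order row (NOT the Northwind numbers).
orders = ["SAVEA"] * 6 + ["CENTC"] * 1 + ["ALFKI"] * 3 + ["BONAP"] * 2

# Equivalent of grouping orders by customerid and counting rows per group.
order_counts = Counter(orders)

# Average of the per-customer counts (the derived-table query).
avg_of_counts = sum(order_counts.values()) / len(order_counts)

# Total rows divided by distinct customers (the one-line query).
rows_per_distinct = len(orders) / len(set(orders))

print(avg_of_counts, rows_per_distinct)
```

Both expressions reduce to total rows divided by the number of distinct values, which is why either one reproduces the generic estimate regardless of the parameter value.
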
A few points to note:

Parameter sniffing can be enabled or disabled at the stored procedure level.
We need not use local variables (to disable parameter sniffing) if every value of the filtered column retrieves roughly the same number of rows (for example, a search by primary key or unique key).
Parameter sniffing can be disabled (by using local variables) if the number of rows retrieved per value of the filtered column follows a bell curve distribution.

A Scenario to Ponder #9

2006-11-15T08:48:00.001-05:00

This scenario is a slightly modified (simplified) version of a question posted in a discussion forum.
Let's say I have a SQL Server 2000 box. And I have the following table.

create table #runningavg(id int identity(1,1), curval decimal(16,10))

And the following data:

insert into #runningavg(curval) values(12)
insert into #runningavg(curval) values(14)
insert into #runningavg(curval) values(20)
insert into #runningavg(curval) values(30)
insert into #runningavg(curval) values(10)
insert into #runningavg(curval) values(6)
insert into #runningavg(curval) values(16)
insert into #runningavg(curval) values(8)
insert into #runningavg(curval) values(4)
insert into #runningavg(curval) values(2)
insert into #runningavg(curval) values(32)
Now, with the identity column filled, here is the table data:

I would like to come up with a query that will give me the following result:

In the result, there is one calculated column called the average.

The formula goes this way:
average(n) = (average(n-1) + curval(n))/2
where n is the id column.
And
average(0) = 0
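
To make the recurrence concrete, here is a small Python sketch that applies it row by row to the sample data. It only illustrates the formula (it is not the set-based SQL answer the scenario asks for), and the function name running_average is my own.

```python
def running_average(values):
    """Apply average(n) = (average(n-1) + curval(n)) / 2 with average(0) = 0."""
    averages = []
    previous = 0.0  # average(0)
    for curval in values:
        previous = (previous + curval) / 2
        averages.append(previous)
    return averages


# The curval column from #runningavg, in id order.
curvals = [12, 14, 20, 30, 10, 6, 16, 8, 4, 2, 32]

for row_id, average in enumerate(running_average(curvals), start=1):
    print(row_id, curvals[row_id - 1], average)
```

The first four averages come out as 6, 10, 15 and 22.5. Each row depends on the previous row's result, which is what makes a purely set-based solution tricky.
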

The SQL Server 2005 query for the above result goes like this:

with cte as
(select *,cast(curval/2 as decimal(16,10)) as average from #runningavg where id = 1
union all
select a.*, cast((a.curval + b.average)/2 as decimal(16,10)) from #runningavg a, cte b
where a.id = b.id + 1
)
select * from cte

Though this seems easy in SQL Server 2005, I wouldn't recommend this solution for the following reasons:

It's a performance-intensive query and close to a cursor implementation (since it processes one row at a time).
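As written, it fails once the recursion goes deeper than the default limit of 100 levels. The MAXRECURSION query hint can raise that limit to at most 32767 (a value of 0 removes the limit altogether).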

What is the best possible solution (performance-wise) that you can come up with?
The implementation can either be for SQL Server 2000 or 2005.

Recursive Self Join - A futile endeavor!!

2006-11-13T23:14:00.001-05:00

This post is just a walkthrough of my attempt to come up with an elegant solution in SQL Server 2005 to a (so far) seemingly impossible problem. I don't really know if I am trying to reinvent the wheel, but I did realise that the approach I followed was terribly flawed.

What I wanted (and still want) to achieve might be best explained with an example. Take the Employees table in the Northwind database. With the recursive common table expressions in SQL Server 2005, I can get all the subordinates of an employee (say employeeID = 2) with this query.

--Query 1
with cte as
(
select lastname,firstname, employeeID, reportsto
from employees where employeeID = 2
union all
select a.lastname,a.firstname,a.employeeID,a.reportsto
from employees a, cte b where a.reportsto=b.employeeid
)
select * from cte

Here is the result for the above query:

Now, I wanted the result to be flattened to have a horizontal roll down from the manager to the subordinates (at the leaf level), like this:

Now, I can write a query that will get all the subordinates up to the second level (the above result) this way:

--Query 2
select
c.lastname as ln_mgr2,c.firstname as fn_mgr2,c.employeeID as mgr2,
b.lastname as ln_mgr,b.firstname as fn_mgr,b.employeeID as mgr,
a.lastname as ln_emp,a.firstname as fn_emp,a.employeeID as emp
from employees a
right outer join employees b on a.reportsto = b.employeeid
right outer join employees c on b.reportsto = c.employeeid
where
c.employeeid = 2

The above query cannot be used if the depth of the hierarchy increases or decreases.

I was wondering whether I could come up with a depth-agnostic implementation with all the new features that we have in SQL Server 2005. I realised that the only T-SQL feature (to my knowledge) that can get me close to the solution was the APPLY operator.

So, I started by creating a recursive user-defined table-valued function (a parameterized view) like this:

create function GetAllSubordinates(@EmpId int)
returns table
as
return(
SELECT a.EmployeeID,a.LastName,a.FirstName, b.*
FROM northwind.dbo.Employees a
outer apply dbo.GetAllSubordinates(a.EmployeeId) as b
where a.ReportsTo = @EmpId
)

The above function could not be created because it references itself. Ignoring the signs, I went about building a workaround: create a dummy function under the same name, then use alter function to overwrite it, like this:

create function GetAllSubordinates(@EmpId int)
returns table
as
return(select 1 as dummy)

go

alter function GetAllSubordinates(@EmpId int)
returns table
as
return(
SELECT a.EmployeeID,a.LastName,a.FirstName, b.*
FROM northwind.dbo.Employees a
outer apply dbo.GetAllSubordinates(a.EmployeeID) as b
where a.ReportsTo = @EmpId
)

Now, in an ideal case, this function, when called this way:

select * from GetAllSubordinates(2)

will find all the subordinates of 2, pass each of them to the function again to get their subordinates, and keep going until an employee has no subordinates (the termination condition), and then return the result.

I couldn't see a reason why it should fail. But when I actually executed it, the following error was thrown:
"View or function 'dbo.GetAllSubordinates' contains a self-reference. Views or functions cannot reference themselves directly or indirectly."

Under the misconception that this too could be circumvented, I went ahead and tried to use the OPENROWSET function, and somewhere in the middle of the implementation it struck me why my approach can never work.

The definition from Books Online (BOL): "The APPLY operator allows you to invoke a table-valued function for each row returned by an outer table expression of a query."

In other words, it's a union of all the cross joins between each row in the outer table and the result of the table-valued function for that row.

It means that all the rows returned by the apply must follow the same schema, which, in turn, means that the schema of any function must be fixed at design time.

The schema of the table returned by the function GetAllSubordinates will vary with the depth of the leaf-level employee below the input employeeID. And you can't UNION result sets with different schemas. Maybe we could make the table returned by GetAllSubordinates always have the same schema (I haven't figured out how, yet) by filling the remaining columns with nulls. But then I don't need a recursive function for that; I can do it by just using query 2.
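
To illustrate the schema problem outside of T-SQL, here is a rough Python sketch that flattens a hierarchy into manager-to-leaf rows and pads the shorter chains with None (standing in for NULL). The employee IDs and reporting lines are hypothetical, and the helper names leaf_paths and pad_to_fixed_width are my own.

```python
# Hypothetical (employee_id -> reports_to) pairs; None marks the top of the tree.
reports_to = {2: None, 1: 2, 5: 2, 6: 5, 7: 5}


def leaf_paths(root, reports_to):
    """Return every manager-to-leaf chain below root, root included."""
    children = {}
    for emp, mgr in reports_to.items():
        children.setdefault(mgr, []).append(emp)

    paths = []

    def walk(emp, path):
        path = path + [emp]
        subordinates = children.get(emp, [])
        if not subordinates:
            paths.append(path)  # termination: nobody reports to emp
        for sub in subordinates:
            walk(sub, path)

    walk(root, [])
    return paths


def pad_to_fixed_width(paths):
    """Pad shorter chains with None so every row has the same columns."""
    width = max(len(p) for p in paths)
    return [p + [None] * (width - len(p)) for p in paths]


rows = pad_to_fixed_width(leaf_paths(2, reports_to))
for row in rows:
    print(row)
```

Note that the row width is only known after walking the data. That is exactly the information a function with a design-time schema does not have.
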

I have a feeling (though I cannot put my finger on it now) that the answer to the question "Why the PIVOT operator cannot take a select sub-query in the IN clause?" is much more complex and very much related to the issue we have now (though, at the database engine level).

Or is there a simple explanation that I'm missing?