Feb 27, 2013

SQL to ECL - Metamorphosis


As the old saying goes, you can tell whether someone is a data person or not just by the way they solve a problem:
An imperative programmer goes: for each order in the orders table, find the corresponding order details. . .
And the SQL programmer goes: take the orders table and join with the order details on the order ID key. . .  

As a SQL programmer, you look at tables as sets, not as individual records. You don't want to bother with whether the join is executed as a nested loop, a hash, a merge, or a map-reduce for that matter. You state what the join should produce (functionally) and let your query engine find the best way to do it based on the data distribution, size, and so on. That is the SQL programming style.

Now, ECL is not SQL, but it has the SQL programming style. I find it to be a pseudo functional-declarative programming style with some object-oriented concepts tossed in. Don’t pull your hair out. Not just yet!!!

In ECL, there are only two types of statements that I have come across.
  1. Action: something that produces an output.
  2. Declaration: a single assignment (that is, it cannot be re-assigned). These are called attributes in the ECL world.
In my experiments with ECL, I have created attributes to:
    - hold a value (string, integer, set, record set, table, file, etc.)
    - represent a function or action
    - describe a definition (more like a table definition)

Most of your ECL code will be definitions. Your action can then call a definition, or pipe definitions together so they cascade into a result. More on this later.
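To make that concrete, here is a minimal sketch (the names and values are made up for illustration, not from any real code base):

// Definitions (attributes): nothing runs yet, each one just names an expression
UnitPrice := 25;                                // holds a value
LineTotal(UNSIGNED qty) := qty * UnitPrice;     // a function
OrderRec := RECORD                              // a definition (a record layout)
    STRING20 productid;
    STRING10 status;
END;

// Action: this is what actually produces output when the workunit runs
OUTPUT(LineTotal(4));    // 100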

Of course, you did notice that there is no concept of a variable. And that is the only thing to “get” in ECL. If you really think about it, ECL gives you an amazing abstraction over all the threading and grid-ding that happens behind the scenes. How is it able to do that? By signing a contract with you that says a definition (or attribute) never changes its state, ever. So the attribute points to the same thing for whoever asks, or more specifically, for whichever machine in the grid asks. Just like that, it eliminates any race condition you might otherwise introduce in your code (a common hazard in the parallel computing world). As long as you adhere to that contract, you can tell HPCC what you need and it will get you the result. With me so far??

It’s not as hard as you think to code without variables. Purely functional languages like Haskell do not allow you to re-assign.

So, how do you work around this constraint? 
Blunt answer: Create a new definition to hold the mutated value.
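A tiny sketch of what that looks like (again, the names are made up):

// You cannot write  total := total + tax;  an attribute never changes
Subtotal     := 100;
Tax          := Subtotal * 0.08;
TotalWithTax := Subtotal + Tax;    // a new definition holds the "mutated" value
OUTPUT(TotalWithTax);              // 108.0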

Think about it: you don’t “really” use variables in the SQL world either.

Now, why definitions? If it's SQL programmer friendly, why not use SQL? When you try to answer this question, you will uncover the brilliance of this powerful language. Here is my attempt.

Rewind a few years: SQL Server 2005 introduced common table expressions (CTEs), or the “WITH” clause. What problem was it really trying to solve? Helping you generate a numbers table using recursion? No!!!
In production applications, the biggest use of CTEs was to remove code clutter. I was able to pull inline views out and move them to the top of the query, which improved code readability.

Take this cooked-up query in SQL Server 2000:
SELECT *
FROM   OrderDetails A
LEFT OUTER JOIN (SELECT productid,
                        Max(transactiondate) AS LatestDate
                 FROM   Transactions
                 WHERE  orderid <> ''
                        AND type = 'Sell'
                 GROUP  BY productid
                 HAVING Year(Max(transactiondate)) > 2011) B
ON     B.productid = A.productid
WHERE  A.status = 'Active'

With CTE, you can change it to:
WITH B
     AS (SELECT productid,
                Max(transactiondate) AS LatestDate
         FROM   Transactions
         WHERE  orderid <> ''
                AND type = 'Sell'
         GROUP  BY productid
         HAVING Year(Max(transactiondate)) > 2011)
SELECT *
FROM   OrderDetails A
LEFT OUTER JOIN B
ON     B.productid = A.productid
WHERE  A.status = 'Active'

This is cleaner; I am now looking at two queries, one that generates the table B and another that uses it. Come to think of it, wouldn't it be great if I could move each piece of complexity into its own shell (or definition), so that each definition can be reused as needed? Like this:
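Here is roughly what those definitions could look like in ECL. This is just a sketch: the logical file names, record layouts, and the YYYYMMDD integer date format are my own assumptions.

TransRec := RECORD
    STRING20  productid;
    STRING20  orderid;
    STRING10  type;
    UNSIGNED4 transactiondate;    // assumed to be a YYYYMMDD integer
END;

OrderRec := RECORD
    STRING20 productid;
    STRING20 orderid;
    STRING10 status;
END;

Transactions := DATASET('~tutorial::transactions', TransRec, THOR);
OrderDetails := DATASET('~tutorial::orderdetails', OrderRec, THOR);

// Each piece of the old subquery gets its own definition
SellTransactions := Transactions(orderid <> '' AND type = 'Sell');

LatestByProduct := TABLE(SellTransactions,
                         {productid,
                          UNSIGNED4 LatestDate := MAX(GROUP, transactiondate)},
                         productid);

B := LatestByProduct(LatestDate DIV 10000 > 2011);   // year part of YYYYMMDD

ActiveOrderDetails := OrderDetails(status = 'Active');

Result := JOIN(ActiveOrderDetails, B,
               LEFT.productid = RIGHT.productid,
               LEFT OUTER);

// The one action that makes it all run
OUTPUT(Result);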

Voila!!! This is exactly how the ECL code looks (well, it's about 95% close and the compiler will help you fix the rest). But that's about it. This will run just fine on a single node with gigabytes of data or on 100 nodes with petabytes of data. Welcome to the world of big data.

Once you get past this stage, you will quickly move beyond the scope of SQL. There is a world of constructs to handle any type or size of data. ECL has some powerful detergents built in to clean up the dirtiest data you have, and they will blow your mind.

And, just so you know, you can write the above query as a single statement by substituting your definitions inline (recursively). That way you can even rebuild your SQL Server 2000 style query in ECL. But why you would want to do that is a different discussion!!!
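For what it's worth, inlining the (assumed) definitions from the sketch above, keeping only the two DATASET declarations, would look roughly like this:

OUTPUT(JOIN(OrderDetails(status = 'Active'),
            TABLE(Transactions(orderid <> '' AND type = 'Sell'),
                  {productid, UNSIGNED4 LatestDate := MAX(GROUP, transactiondate)},
                  productid)(LatestDate DIV 10000 > 2011),
            LEFT.productid = RIGHT.productid,
            LEFT OUTER));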

1 comment:

Anonymous said...

Nice article! I told you that you would soon be dangerous!
