Feb 27, 2013

SQL to ECL - Metamorphosis


As the old saying goes, it can be figured out whether you are a data guy or not just by the way you solve the problem:
An imperative programmer goes: for each order in the orders table, find the corresponding order details. . .
And the SQL programmer goes: take the orders table and join with the order details on the order ID key. . .  

As a SQL programmer, you look at tables as sets and not as individual records. You would not want to bother about whether the join happens as a nested loop, hash, merge or map-reduce for that matter. You want the join to happen the way you intend to (functionally) and let your query engine find the best way to do it based on the data distribution, size etc. That is the SQL programming style.

Now, ECL is not SQL but has the SQL programming style. I find it to be a pseudo functional-declarative programming style with some object oriented concepts tossed in. Don’t pull your hair out. Not just yet!!!

In ECL, there are only 2 types of statements that I have come across.
  1. Action: Something that produces an output.
  2. Declaration: This is a single assignment (or, it cannot be re-assigned). These are called attributes in the ECL world.   
In my experiments with ECL, I have created  attributes to:
    - hold a value (string, int, set, record set, table, file etc)
    - a function or action
    - a definition (More like a table definition)

Most of the ECL code are definitions. And your action can call a definition or pipe your definitions to cascade to a result. More on this later.

Of course, you did notice that there is no concept of a variable. And that is the only thing to “get” in ECL. If you really think about it, ECL gives you amazing abstraction over all the threading and grid-ding that happens behind the scenes. How is it able to do that? By signing a contract with you that says, a definition (or an attribute) does not change its state, ever, ever. So, the attribute points to the same thing whoever wants, or more specifically whichever machine in the grid wants. There, it eliminated any custom race condition you may introduce in your code (which is common in parallel computing world). As long as we adhere to that contract, you can tell HPCC what you need and it will get you the result. With me, so far??

It’s not as hard as you think to code without variables. Purely functional languages like Haskell do not allow you to re-assign.

So, how do you work around this constraint? 
Blunt answer: Create a new definition to hold the mutated value.

Think about it, you don’t “really” use variables in the SQL world either. 

Now, why definitions? If it's SQL programmer friendly, why not use SQL? When you try to answer this question, you will uncover the brilliance of this powerful language. Here is my attempt.

Rewind a few years, SQL Server 2005 introduced the common table expressions (CTE) or the “WITH” clause. What problem was it really trying to solve? Help generate number table using recursion, No!!!
In production applications, the biggest use of CTE was to remove code clutter. I was able to remove inline views and move it to the top of the query. Improved code readability.

Take this cooked-up query in SQL Server 2000:
SELECT *
FROM   OrderDetails A
LEFT OUTER JOIN (SELECT productid,
                        Max(transactiondate) AS LatestDate
                 FROM   transactions
                 WHERE  orderid <> ''
                    AND type = 'Sell'
                 GROUP  BY productid
                 HAVING Year(latestdate) > 2011) B
ON 

     B.productid = A.productid
WHERE  

     A.status = 'Active' 

With CTE, you can change it to:
WITH B
     AS (SELECT productid,
                Max(transactiondate) AS LatestDate
         FROM   transactions
         WHERE  orderid <> ''
                AND type = 'Sell'
         GROUP  BY productid
         HAVING Year(latestdate) > 2011)
SELECT *
FROM   orderdetails A
LEFT OUTER JOIN b
ON 

       b.productid = A.productid
WHERE  

       A.status = 'Active' 

This is cleaner, I am now looking at two queries. One that generates the table B and the other one that uses it. Coming to think of it, Wouldn't it be great if I can move each of the complexities to its own shell (or definition). So that each of definition can be reused as you need. Like this:
































Voila!!! This is exactly how the ECL code looks (Well, its a about 95% close and the compiler will help you fix the rest). But that's about it. This will run just fine for a single node with gigabytes of data or 100 nodes petabytes of data. Welcome to the world of big data.

Once you get past this stage, you will quickly move beyond SQL scope. There is a world of constructs to handle any type or size of data. It's got some powerful detergents built-in to clean up the dirtiest of data you  have, that will blow your mind.

And, just so you know you can write the above query as a single line of code by substituting your definitions inline (recursively). And you will be able to build your SQL Server 2000 query in ECL. But, why you would want to do that is a different discussion!!!

Feb 26, 2013

ECL - Big data for the SQLly Inclined

The last project I worked, we pushed the limits of RDBMS (misfit for our requirements) and I decided that the next time, I will consider beyond SQL for my data needs. I started exploring the NOSQL world - MongoDB,Neo4j, REDIS etc and I understood that had we been open to these technologies when we started our last project, we might have had a lot easier life.

Eventually, I started seeing Hadoop everywhere, our company was talking, customers were talking, my friends were talking. And I started learning the jargons around Hadoop (map reduce, hive, sqoop, HDFS,HBase) and I used to throw these words in my conversation along with a few zoo animals and figured out most of my friends were doing the same and we had a happy ecosystem going on. But, deep down I knew that I was ignoring the elephant in the room that was staring at me. I read through the famed map reduce research paper and I was able to get the concept. But, I was not able to get to start playing with hadoop. Setting up was easy and you can get the word count sample working in an hour. But, after that I was stuck. I understood the power of what it can do. But, I felt I did not know the right language to communicate with it. It's like someone asked me to write a web server in SQL. Of course you can do it, but I don't want to. To me, SQL is "the" reference implementation of a Domain Specific Language. And the ease at which you can instruct your RDBMS to do a complex task was mind blowing as long as you are operating within the problem domain.

In retrospect, I understood that the reason why I had so much reluctance to get into hadoop was because I am not a technology guy (there, I said it). I like to solve logical problems (puzzles or problems, I don't care). From a problem solver perspective, my problem statement does not change whether I am working with 1 record or 1gazellion records. It does not change if I have 1 line of text or the entire world wide web to process. I wanted that abstraction. SQL was giving me that (almost), till the data spilled over to the next machine. So, I started looking for languages on top of Hadoop that can help me out. Looked at Pig and Hive and a few others but, i felt this was like LINQ for SQL. You can change the programming languange but you cannot change the fundamental building blocks. I don't want to come out wrong. I love LINQ, but not so much when i have to write complex SQL queries in LINQ.

And, so I started exploring options that were outside Hadoop. Come on, big data is such a "big" pie, and it will not be monopolized. Anyways, my search ended with ECL. A programing language for taming the super computing grid called High-Performance Computing Cluster. It was open source, installation was exactly like how it was for SQL Server.
    "Download the VM and download an IDE (looks like your SQL Server management studio). Connect to the server and get going."

Played with it for a few days. They have some tutorial videos in their site (google HPCC systems). My interest was mainly because the programming style was so different. Not similar to any language that I knew. But, I was able to relate to it. I didn't have to skew my thinking to fit to their programing style.

Also, it's been "the" programming language for HPCC for the last 10 years or so and it has undergone a lot of refinement over the years. So, I knew that I can take a deep breath and give some time to understand this language.

Fast forward a few months... I am still in love with ECL. And some day, someone may write ECL for Hadoop. But, till then, I am taming big data, the ECL way.

If you got an hour or so, give ECL a try and let me know what you think.

There a mental switch that you need to turn on to be able to easily starting thinking in ECL and with that, it becomes pretty much like SQL, actually even more elegant in a few cases. I will anyway write about it in my next post...breaking down a complex SQL query and building it in ECL. It should be a lot more easy to understand, I hope.  

Apr 16, 2010

Custom SQL Stored Procedure Best Practices Analyzer - SQL Cop, Maybe?

Recently, I moved to Seattle as a Data Architect for a product our company was developing. I also started taking up most of the DBX roles in that project.

Though our architecture is predominantly CRUD based and we have our in-house CRUD proc generator, some business logic invariably seeps into the SPs. And we had let it to the discretion of the developers on whether to do the processing at Data or the Business Layer. I thought it would make sense to keep a rudimentary check on what logic was going into the SP and make sure our developers don't misuse the liberty and continue to follow the guidelines defined.

Mar 14, 2010

A Scenario to Ponder #16

During my school days, I used to play this game that I would call as "word mutation". The problem statement gives the starting and ending 4-letter words.

The game is to find the path from the starting to the ending word by changing one alphabet at a time and ,of course, each of the words in the path should be a valid word.

Nov 14, 2009

Solving Sudoku using SQL Server 2005 - Step by Step - Part #6

Implementation of RunSolveAlgorithm4:

This is the last post in this series. The previous algorithm is the last one that tries to solve the puzzle logically. This one will take the latest unsolved sudoku board that we get after running through the first three algorithms and use brute force to solve the puzzle. I am not going to reinvent the wheel here.

Solving Sudoku using SQL Server 2005 - Step by Step - Part #5

Implementation of RunSolveAlgorithm3:

The last algorithm that we implemented was able to solve an easy puzzle. Now, lets take a hard one and see if the solution we have built till now in this series can solve it.

 EXEC SolveSudoku 
'030,001,000,,006,000,050,,500,000,983,,080,006,302,,000,050,000,,903,800,060,,714,000,009,,020,000,800,,000,400,030'

Solving Sudoku using SQL Server 2005 - Step by Step - Part #4

Implementation of RunSolveAlgorithm2:

We implemented RunSolveAlgorithm1 in previous post of this series . The next algorithm is the implementation of Solve Method A from sudoku solver.

In this algorithm, we check all the cells (having mutiple values) in each row and see if a particular value occurs only once in that row. Then update that as the solution for the cell having that value. We do the similar check for column and the 3X3 block.

Nov 12, 2009

Solving Sudoku using SQL Server 2005 - Step by Step - Part #3

Implementation of RunSolveAlgorithm1:

In the previous post of this series, we created the procedure stub for each algorithm. This post will implement RunSolveAlgorithm1.

The first algorithm will do the primary clean up on the solution board. It will implement the basic rules of Sudoku. The rule is that a number cannot appear in a cell if it already appears in any other cell in the same row or column or the block in the sudoku board. It will remove those numbers from the possible cadidate values of each cell in the solution board.

Nov 10, 2009

Solving Sudoku using SQL Server 2005 - Step by Step - Part #2

In the first part of this series, we created the base objects needed to work on our solution. Now, there are two parts to building the solution:

  • The core algorithms that will solve the puzzle for us
  • The surrounding objects that will facilitate the solve and execute the core algorithms
The core algorithms are the crux of this whole exercise. Each of these algorithm will ideally take the unsolved sudoku board and try to fill as many cells as possible using the logic implemented in the specific alogorithm.

Nov 9, 2009

Solving Sudoku using SQL Server 2005 - Step by Step - Part #1

This one is going to be a series. I thought I was going to come up with a single query to solve Sudoku (without choosing the brute force method). When I started creating the tables I needed, I figured out there there are way too many aspects to solving Sudoku logically. I thought it would be a good idea to give a continous update as I go about building the solution.

This post is all about creating the necessary tables and basic procedures which I think we need to build the solution. 

Object Oriented SQL Programming with SQL Server 2005

In this post, I have attempted a crude implementation of Object Oriented SQL Programming using the APPLY operator in SQL Server 2005. I feel that giving some thought in these lines, we can bring in more flexibility, abstraction and reusability to the way we query the database and, may be, create a new style for data access.

There is only one rule that I am going to follow here. All table access will be done using an INLINE TABLE VALUED function. No query will directly access the table. I would like to illustrate it with a small example.

Nov 6, 2009

A Scenario to Ponder #15

I guess, by now, most of you are familiar with the SQL Server 2005 sample database - AdventureWorks. We will use one of the tables from this database to create our scenario - Production.BillOfMaterials. To simplify our requirements, we will be selecting only those records in this table, that has the EndDate as NULL.

Oct 28, 2009

Understanding Transaction Isolation Levels in SQL Server

One of the most important concepts pertaining to any DBMS that every database programmer must know is the Transaction Isolation Levels. Sometimes, even the most seasoned database developers get confused on how multiple connections running in different isolation levels affect each other. This post is an attempt to explain through examples on how the isolation levels differ from each other.

Oct 23, 2009

Custom Sorting and Paging in SQL Server Stored Procedure

This post is about achieving pagination using stored procedure, without using dynamic SQL. The solution presented should work for both SQL Server 2000 and 2005.

Most of the web application displaying transactional data will have a data grid with sorting and paging functionality and some filter criteria.

We usually use the default sorting/paging functionality available in the datagrid. But, this requires the entire search resultset to be cached in the client side, which is fine for a few hundered records but can be a drag as the number of records increase.

Oct 10, 2009

A Scenario to Ponder #14

Lots of things happened since the last post I made in this blog. I got married. Didn't do much justice to my blog followers since then. My wife couldn't stand me being interested in ONLY her :) So, she pushed me to get this blog started again and I hope she keeps pushing me to continue this blog.

Now, Here is the scenario.
I had been playing Chicktionary (google it if you had not heard of it) in Club Bing. There are some other versions of the same game - Word Warp, Text Twist, Jumble, etc. In a gist, its a game where you are given a chain of letters and you need to come up with as many words as you can with these letters in a given time. It's a pretty addictive game.

Oct 9, 2007

A Scenario to Ponder #13

Last February, I had come over to Atlanta, Georgia from India for a new project and had been quite overwhelmed by the new way of life, the work and the country that I wasn't able to do justice to this blog. I apologize for those who had been following this mini series of scenarios. Hope I can get some time now to revive this.

Here goes the scenario:

Feb 7, 2007

A Scenario to Ponder #12

Its been a long time since I had used this medium of communication. Without getting into the excuses part of the dormancy I will get on with the question...

Here is the scenario. Say, I am in a business consulting company and I am given a table EmployeeCustomerOrders.

The table can be generated from the orders table in the Northwind database using this query.

Dec 4, 2006

Obvious SQL Tip #2

Thinking in sets:
The thought process that goes into set- based programming is slightly different from writing procedural code.

Say, I have 2 tables -one holding order header information and the other holding the details of the products ordered. I want to get the items ordered for all the orders:

  • The procedural way of looking at it is "for each order in the orders table get the order details from the [order details] table".
  • In DB terms, its "for all orders in the orders table get the order details from the [order details] table".

Nov 29, 2006

A Scenario to Ponder #11

This time it's a pretty interesting scenario to solve. I have a table called the distance_tbl holding distance between different cities.
Here is the table definition:

create table #distance_tbl (city1 char(1), city2 char(1), distance int)

And the sample data:

Nov 27, 2006

Obvious SQL Tip #1

I don't really know why I am starting this series, if it will be really useful for anyone and whether I will be able to keep it running for long. But, its here because I am convinced that sometimes the most obvious escapes our mind. And as someone rightly said "common sense is not so common".

I just intend to illustrate few common issues(one per post) where we try to come up with complicate queries and realise, in the end, that we have a painfully obvious alternative. But, the question is - Is it obvious and is it really an alternative? Read ahead...