Feb 26, 2013

ECL - Big data for the SQLly Inclined

On the last project I worked on, we pushed the limits of an RDBMS (a misfit for our requirements), and I decided that next time I would look beyond SQL for my data needs. I started exploring the NoSQL world (MongoDB, Neo4j, Redis, etc.) and understood that had we been open to these technologies when we started that project, we might have had a much easier life.

Eventually, I started seeing Hadoop everywhere: our company was talking about it, customers were talking about it, my friends were talking about it. I started learning the jargon around Hadoop (MapReduce, Hive, Sqoop, HDFS, HBase) and used to throw these words into my conversations along with a few zoo animals. I figured out that most of my friends were doing the same, and we had a happy ecosystem going. But deep down, I knew I was ignoring the elephant in the room that was staring at me. I read through the famed MapReduce research paper and was able to get the concept, but I was not able to start playing with Hadoop. Setting it up was easy, and you can get the word count sample working in an hour. After that, though, I was stuck. I understood the power of what it could do, but I felt I did not know the right language to communicate with it. It's like someone asking me to write a web server in SQL. Of course you can do it, but I don't want to. To me, SQL is "the" reference implementation of a domain-specific language, and the ease with which you can instruct your RDBMS to do a complex task is mind-blowing, as long as you are operating within the problem domain.

In retrospect, I understood that I was so reluctant to get into Hadoop because I am not a technology guy (there, I said it). I like to solve logical problems (puzzles or problems, I don't care). From a problem solver's perspective, my problem statement does not change whether I am working with 1 record or 1 gazillion records. It does not change if I have 1 line of text or the entire World Wide Web to process. I wanted that abstraction. SQL was giving me that (almost), until the data spilled over to the next machine. So I started looking for languages on top of Hadoop that could help me out. I looked at Pig, Hive, and a few others, but I felt this was like LINQ for SQL: you can change the programming language, but you cannot change the fundamental building blocks. Don't get me wrong. I love LINQ, but not so much when I have to write complex SQL queries in LINQ.

And so I started exploring options outside Hadoop. Come on, big data is such a "big" pie, and it will not be monopolized. Anyway, my search ended with ECL, a programming language for taming the supercomputing grid called the High-Performance Computing Cluster (HPCC). It is open source, and installation was exactly like it was for SQL Server:
    "Download the VM and download an IDE (it looks like your SQL Server Management Studio). Connect to the server and get going."

I played with it for a few days. They have some tutorial videos on their site (google HPCC Systems). My interest was mainly because the programming style was so different, not similar to any language that I knew. But I was able to relate to it. I didn't have to skew my thinking to fit their programming style.

Also, it has been "the" programming language for HPCC for the last 10 years or so, and it has undergone a lot of refinement over the years. So I knew that I could take a deep breath and give myself some time to understand this language.

Fast forward a few months... I am still in love with ECL. And some day, someone may write ECL for Hadoop. But, till then, I am taming big data, the ECL way.

If you've got an hour or so, give ECL a try and let me know what you think.

There is a mental switch that you need to turn on to be able to start thinking in ECL easily, and with that, it becomes pretty much like SQL, actually even more elegant in a few cases. I will write about it in my next post anyway, breaking down a complex SQL query and building it in ECL. It should be a lot easier to understand, I hope.

3 comments:

Dexter said...

Nice article! ECL for Hadoop need not be written again. We have Pig and Hive. Both have the same layers of components in the stack. No more reinvention is required. Pig and ECL share more or less the same semantics, but the way the work is executed is different. At the end of the day, what determines your choice is what the system is capable of, which IMO can't just be ignored. :)

Omnibuzz said...

I can understand your perspective, Dexter. But how will you decide when to use Pig, when to use Hive, and when to use neither? To put it in a different perspective, ECL is the mother tongue of HPCC, and MapReduce is the same for Hadoop. Pig and Hive are two different translators that you can use to communicate with Hadoop. Translators are okay for small talk. When I want to have a deep, long conversation with someone, I would prefer to talk directly. Just my 2 cents.
-omni

Dexter said...

Pig and Hive compile/plan and translate the query into MapReduce; the ECL compiler compiles/plans and translates your query into C++ code (distributed computing). ;)