Source Guru

Everything but the Kitchen Sink

by Mez on Dec.07, 2009, under Personal

Transforming data is hard. When I joined my current company, there were stupendous amounts of Perl/PHP/Bash/<insert random programming language here> scripts that would run on a cron job and do magic things to our data. They’d create reports, they’d tell the purchasers when we were running out of stock, they’d synchronise data between our Frontend and backend databases, they’d collect, they’d collate, they’d do everything and anything.

Except, with all these scripts, in all these random languages, written by a multitude of previous developers (at different skill levels), they weren’t particularly maintainable (and sometimes, they weren’t particularly readable or understandable either – imagine a 6000 line perl script that pretty much ran different permutations of the same data over and over again)

Enter Pentaho, and specifically it’s “Kettle” project. (since renamed “Pentaho Data Integration”), a tool that lets you manipulate your data in pretty much any way you can imagine, in the simplest and easiest way imaginable.

That’s right, it’s a GUI for data manipulation.

I know a lot of you are probably sceptical right now.  The first time I ever saw this was when a previous boss of mine put it forward as a potential solution for one of our problems (getting our orders from the front end database down to the office/warehouse).  I saw it, and I thought “GUI? Nah, that’s not how real programmers do things!”, so after the development team put forward another proposal to solve this, and it got accepted, I thought I’d never see the thing again.

That was until my current boss started playing with it, trying to work out what it was doing so that he could get these evil GUI based scripts into something manageable, like nice, pretty code.  Thing s, when my boss plays with things that he doesn’t know about, he tends to read up, research, and, 9 times out of 10, change his mind.

We wiped the previous server (it was rather noisy! We’re glad it’s no longer switched on!) and set up a new server to house our “BI platform”.  Starting off with a few scripts, my boss learnt to love this tool, and then, as I’m his “2nd in command” (aka general lackey) – started making me learn how to use it.

Again, I was sceptical, I didn’t want to learn, and I put up resistance, but my boss was going away for nearly a month, and by this time, a few of our key business processes relied on Kettle, so, grudgingly, I sat down, and started to learn.

You may be wondering now, why I started off this story talking about all those magical and wonderful scripts that no one seemed to know the inner workings of.  These scripts, as I’ve already mentioned were unwieldy, and at times, god-damned awful.  The plan was to move them to the BI system (as my boss had been doing already).

I like to think of Kettle as a bridge between the process-flow diagram, and the code.  I started converting these scripts, and I was astonished by the fact that most of the conversions I was doing was converting a long perl script into 3 or 4 “Integration steps”

I’m totally besotted with this program now.  Any time I have to do data manipulation, I turn to it.  I can’t describe how (once you’ve got used to it’s quirks) easy it is to use, how simple it is, and how much it just makes sense. Best of all, most of those evil scripts are gone now, and replaced with “pretty” diagrams that do the work for you.

If you have to play with large data sets on a regular basis, I urge you to try it out.  You can buy me a beer for reccommending it next time you see me at $conference.

:, , ,

9 Comments for this entry

  • Eduardo Robles Elvira

    Hello!
    Interesting application indeed. I am currently doing something even better than trying kettle, I’m developing a new KDE application similar to it.

    How different it is from Automator? It seems to be like an Automator pro, or Automator Enterprise. The app I’m developing is heavily inspired in Automator and it’s called RoboDo. Each action is a configurable plasmoid. Well, it’s still in very early development stage, I’ve actually been developing it only for some days :P . The code is in http://gitorious.org/robodo/

    Is Kettle open source? Does it work in linux? Is there a .deb for it? :P

  • Anonymous

    Mind giving a specific example of a problem and its solution in Kettle?

  • Mez

    I’ll have a look @ the jobs tomorrow, the only ones I can think of off the top of my head would probably give a bit too much insight into our company than I should post publically.

  • Anonymous

    Kettle looks awesome – ETL tools are normally ultra-expensive and closed-source so this is a fantastic project. Thanks for highlighting it.

    Although we don’t use Kettle (yet) where I work we use a very similar ETL tool to extract application access rights from the systems and run an automatic comparison against the official records of who has been granted what. This reconciliation was previously being done manually in a spreadsheet and was both time-consuming (often measured in months for some of our more complex apps where there are thousands of access permissions and thousands of users) and error-prone. With the Kettle-like tool, we now have an automated process that we run on a daily basis to perform the reconciliation across around 40-50 apps and push an email with any exceptions out to the appropriate part of the business for review. The ETL tool means that we don’t have to worry about the database being used as this is all managed for us.

    We also use it to produce a great deal of reporting for senior management as it can help deal with some of the complexity of pulling information from multiple systems, transforming it into a common format and then pushing the resulting data into a database where where the reporting app can query it.

  • wjl

    Sounds more like an ETL than a “BI” tool to me. And, as Eduardo asked: is it open source?

  • wjl

    oic – GPLv2. I could have seen, then asked – mea culpa ;-)

  • Franck

    Hi,

    for those insterested in opensource ETLs, have also a look at Talend Open Studio (http://www.talendforge.org/)

    or CloverETL (http://www.cloveretl.com/), the latter having an opensource engine, but the GUI is closed source.

    Franck

  • Gunni

    I love Kett^H^H Pentaho Data Integration, too. I once used Jitterbit for ETL, but as i ran into problems with the 3.0 Version of Jitterbit, i searched for something new, and found Talend and Pentaho.
    Both look quite similar, and use a different approach than Jitterbit. Jitterbit is easier to use in the beginning, but Talend and Pentaho are so much more powerful, that once you took the beginner hurdle, you can do so much with it.
    The decision for Pentaho was mostly because of my most loved integration step “synchronise after merge”.
    Just wanted to leave my opinion here.

  • seb

    Reading this post it reminded me off the LUG session I attended in Sydney some time ago.

    Back in May 2008 Zachary Zeus held a talk on “Open Source Business Intelligence with Pentaho”

    http://slug.org.au/node/100

    Back then I did not get everything he explained and showed us but sure it was stunning how fast and simple he was able to pull different sources together.

    Not sure if there was a video camera recording…

Leave a Reply

Looking for something?

Use the form below to search the site:

Still not finding what you're looking for? Drop a comment on a post or contact us so we can take care of it!