Open Source ETL with Hadoop
Friday, July 22, 2011 at 08:43AM
Aaron Rosenbaum in #Cloudera, #HIHO, #NoSQL, #Pentaho, #Sqoop, BI, NoSQL

As a follow-up to my article on BI projects for 2012, I got a few questions about ETL and Hadoop. Here are some of the leading options for doing ETL projects with Hadoop.

 

Cloudera/Sqoop

Lots of nifty tools. Sqoop moves data between HDFS and RDBMSs. Flume moves log files. Transform logic gets written as part of Map(). I think they are bundling connectors for Netezza and some stuff from Quest for Oracle, but I'm fuzzy on the licensing terms. I know all these tools are in CDH3; if you are running CDH2, they may not be there.
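To make "transform logic in Map()" a bit more concrete, here is a rough sketch of a map-side transform in Python, in the style you might use with Hadoop Streaming. The record layout (id,name,amount) and the function names are made up for illustration; nothing here comes from Sqoop or CDH itself.

```python
def transform(line):
    """Clean one comma-separated record on the map side.

    Assumed (hypothetical) input layout: id,name,amount
    Returns a tab-separated string, or None to drop the record.
    """
    fields = [f.strip() for f in line.rstrip("\n").split(",")]
    if len(fields) != 3:
        return None  # skip malformed rows
    record_id, name, amount = fields
    try:
        # Store money as integer cents to avoid float drift downstream.
        cents = int(round(float(amount) * 100))
    except (ValueError, OverflowError):
        return None  # non-numeric or non-finite amount
    return f"{record_id}\t{name.upper()}\t{cents}"


def map_records(lines):
    """Apply transform() to each input line, yielding only the good records."""
    for line in lines:
        result = transform(line)
        if result is not None:
            yield result
```

In a streaming job, the script would feed sys.stdin through map_records() and print each result; the point is only that the "T" in ETL ends up living inside the mapper.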

Pentaho

Pentaho has a well-known open source BI suite. I think they are leveraging Hive/JDBC. I haven't used it, but it's worth looking at, especially if you have played with Pentaho before. Kettle certainly can migrate relational data to JSON (seems sort of backwards, but for some applications you can't argue with the performance). I know they were trying to get Hadoop integration into Kettle 4.x. Any comments? I'll revise with more info.
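For anyone wondering what "relational to JSON" looks like in practice, here is a small sketch in plain Python. It is not Kettle's API; the customers and orders tables and their column names are hypothetical, and the idea is just to fold child rows into one document per parent row.

```python
import json
from collections import defaultdict


def rows_to_documents(customers, orders):
    """Denormalize two relational row sets into one JSON document per customer.

    Both inputs are lists of dicts with hypothetical column names.
    """
    orders_by_customer = defaultdict(list)
    for order in orders:
        orders_by_customer[order["customer_id"]].append(
            {"order_id": order["order_id"], "total": order["total"]}
        )
    for customer in customers:
        doc = {
            "id": customer["id"],
            "name": customer["name"],
            # Customers with no orders get an empty list.
            "orders": orders_by_customer.get(customer["id"], []),
        }
        yield json.dumps(doc, sort_keys=True)
```

Each document carries its own orders, so a reader never has to join at query time, which is where the performance argument comes from.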

Oozie/Pig

If you are tackling things in a more native way with Pig, don't forget about Oozie for controlling your external calls to existing transform logic. Oozie is part of Cloudera's distribution, and I'm sure it will be in Hortonworks' as well.

HIHO

New, and I haven't used it. Looks interesting. HDFS-centric instead of RDBMS-centric.

Clover and Talend

Both are great open source ETL/EAI tools, but neither seems to be making any Hadoop-specific efforts. Of course, both could leverage Sqoop. I'm not sure Pentaho is really that far ahead, but at least they seem to be making an effort.

Article originally appeared on Hillsborough, CA (http://www.al.net/).
See website for complete article licensing information.