Open Source ETL with Hadoop
After my article on BI projects for 2012, I got a few questions about ETL and Hadoop. Here are some of the leading options for doing ETL projects with Hadoop.
Cloudera/Sqoop
Lots of nifty tools. Sqoop moves data between HDFS and RDBMSs. Flume moves log files. Transform logic gets written as part of Map(). I think they are bundling connectors for Netezza and some stuff from Quest for Oracle, but I'm fuzzy on the licensing terms. I know all these tools are in CDH3; if you are running CDH2, they may not be there.
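To make the "transform logic lives in Map()" idea concrete, here is a minimal sketch of a map-side transform in Python, written in the line-in, line-out style that Hadoop Streaming mappers use. The record layout (customer id, country, amount, tab-separated) and the names `map_record` and `transform` are made up for illustration; they are not part of Sqoop, Flume, or CDH.

```python
# Sketch of a map-side ETL transform in the Hadoop Streaming style.
# Assumption: input lines are tab-separated as customer_id, country, amount.
# The layout and the function names are hypothetical.

def map_record(line):
    """Clean one input line; return (key, value) or None to drop the row."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        return None  # malformed row: wrong number of columns
    customer_id, country, amount = fields
    try:
        cents = int(round(float(amount) * 100))  # normalize money to integer cents
    except ValueError:
        return None  # amount was not numeric
    return customer_id.strip(), f"{country.strip().upper()}\t{cents}"


def transform(lines):
    """Apply map_record to every line and yield tab-separated output lines."""
    for line in lines:
        pair = map_record(line)
        if pair is not None:
            yield f"{pair[0]}\t{pair[1]}"


# Small local demo with two good rows and one bad one.
sample = ["42\t us \t19.99\n", "not a valid row\n", "7\tde\t5\n"]
for out in transform(sample):
    print(out)
```

Run as a real streaming mapper, the same `transform` would be fed `sys.stdin` and its output written to stdout; the cleaning rules stay in one small function that is easy to test off the cluster.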
Pentaho
Pentaho has a well-known open source BI suite. I think they are leveraging Hive over JDBC. I haven't used it, but it's worth looking at, especially if you have played with Pentaho before. Kettle can certainly migrate relational data to JSON (seems sort of backwards, but for some applications you can't argue with the performance). I know they were trying to get Hadoop integration into Kettle 4.x. Any comments? I'll revise with more info.
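For anyone wondering what "relational to JSON" looks like in practice, here is a rough sketch using only the Python standard library. This is not how Kettle does it; it is a stand-in that reads rows from an in-memory SQLite table and emits one JSON document per row. The `orders` table, its columns, and the helper names are hypothetical.

```python
# Sketch: turn relational rows into JSON documents, one per row.
# Assumption: SQLite stands in for the source RDBMS; the "orders" table
# and its columns are invented for this example.
import json
import sqlite3


def rows_to_json_lines(conn, query):
    """Run a query and yield each result row as a JSON object string."""
    cur = conn.execute(query)
    columns = [col[0] for col in cur.description]  # column names from the cursor
    for row in cur:
        yield json.dumps(dict(zip(columns, row)), sort_keys=True)


def demo_connection():
    """Build a throwaway in-memory database with two sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount_cents INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "acme", 1999), (2, "globex", 500)],
    )
    return conn


for doc in rows_to_json_lines(demo_connection(), "SELECT * FROM orders ORDER BY id"):
    print(doc)
```

One JSON object per line is a convenient shape for HDFS, since the file can be split on line boundaries and each record parsed on its own.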
Oozie/Pig
If you are tackling things in a more native way with Pig, don't forget about Oozie for controlling your external calls to existing transform logic. Oozie is part of Cloudera's distribution, and I'm sure it will be in Hortonworks' as well.
HIHO
New, and I haven't used it. Looks interesting. HDFS-centric instead of RDBMS-centric.
Clover and Talend
Both are great open source ETL/EAI tools, but neither seems to be making any Hadoop-specific efforts. Of course, both could leverage Sqoop. I'm not sure Pentaho is really that far ahead, but at least they seem to be making an effort.