January has so far been an exciting month in the Hadoop world. Three heavyweights have come out with different offerings built around the MapReduce platform commonly used in data analytics.
First up was Netflix, introducing us to Genie. Genie provides additional abstraction on top of Amazon’s EMR through a RESTful API. It also allows the management of job flows on different clusters. What is not clear is whether there are any plans to make Genie available as open source or otherwise. Will we see it on GitHub soon?
In typical Netflix fashion, the blog post also provides technical insights into the Netflix kitchen, this time about their data warehouse architecture. The biggest surprise to me was the reliance on Amazon S3, which I know from experience can stifle Hadoop performance. Their main reason seems to be that with S3 it is possible to spin up multiple clusters simultaneously without needing to manually replicate data stores or worry about inadvertent data loss.
Next is Hortonworks Sandbox, a virtual machine packed with a couple of things that a Hadoop novice would find handy: Hortonworks’ own flavor of Hadoop preinstalled, as well as a step-by-step tutorial to get started with Hadoop. I wonder if the VM is useful beyond its standalone environment, e.g. to easily manage an external Hadoop cluster. There is also an associated webinar, and a promise to provide more tutorials and exercises in the coming months.
Last but not least is Joyent, which announced an enterprise Hadoop-as-a-service offering that is supposed to perform much better than others on the market. They claim “3x faster or at 1/3 the cost of other cloud offerings”. Something’s gotta give; I’m not sure what yet.
Hadoop might indeed be old hat to some, but no one can deny that it is still an important programming framework, widely used for crunching data.