I noticed some interesting announcements recently concerning the open-source Apache project Hadoop. In the last week, both Intel and EMC have announced their own distributions. It seems that the big-iron hardware vendors are finally coming around to seeing Hadoop as the standard for big data processing. It makes sense for these vendors to optimize it and integrate it into their platforms, but in reading these articles, I have to wonder whether they are focused too much on Hadoop as the single solution for all data processing needs.
I stumbled across an article some time ago by Mike Miller at Cloudant that made the case that Hadoop’s days are numbered. Mike makes the point that while Google MapReduce and its open-source cousin Hadoop were great innovations when first introduced, even Google has moved on to other technologies that have fewer limitations and perform better. I have personally struggled with determining the best approach to handling streaming data sets. In hindsight, something like Storm might have been more appropriate.
Part of the job of a good software architect is knowing which of the available tools best fits the job and recommending it to the team. This means that you need to know about a range of tools with different strengths and weaknesses. So, the next time someone mentions Hadoop as the solution for a large-scale processing need, take a step back and make sure the problem maps well to it. If the problem involves ad-hoc analytics, dynamic data sets, or other features that Hadoop has trouble supporting, look for alternatives that might perform much better.