What is Hadoop good for?
Published on March 3, 2022 by Kevin Graham
If you are new to Hadoop, the landscape must be confusing at the moment. There are so many conflicting opinions about what it should be used for, as anecdotally demonstrated at the end of this beautifully written piece about the Big Data London event.
At one end of the spectrum you have people saying, “it’s only good for storage,” and at the other, Cloudera has a vision for Hadoop as the only data platform you will ever need (although they no longer refer to the “H” word; that would be too easy. They now call it the Enterprise Data Hub). They believe that anything you want to do with data, you can do on Hadoop (or rather Enterprise Data Hub), whether that be ETL, streaming, BI, AI, ML or whatever.
Besides the obvious data storage use case, the accepted typical uses for Hadoop today include:
- data preparation and batch processing
- scheduled BI reporting
- data science and advanced analytics
- streamed data analysis
However, there is a perception that Hadoop should not be used for:
- interactive BI
- self-serve analytics
- data warehouse replacement
So really, what is Hadoop good for?
To answer that question, I would first ask, “what is Hadoop?”
Hadoop is a software technology that allows clusters of low-cost commodity servers to be connected together to form a very powerful, scale-out and resilient data storage and processing platform. Originally it had two key components: a distributed file system called HDFS, and a framework for programming data processing and analysis tasks in parallel, called MapReduce. MapReduce is now pretty much defunct, so if you think of Hadoop in its original form it is pretty much just a distributed, low-cost data storage platform. However, in the years since it was launched, a huge ecosystem of software technologies has emerged that allows these clusters of low-cost commodity servers to perform a wide variety of data-related tasks effectively.
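To make the MapReduce model concrete, here is an illustrative single-process sketch of the classic word-count job. This is not Hadoop code; a real Hadoop job distributes the map and reduce phases across the cluster, with the framework performing the shuffle in between. The function names here are just labels for the three conceptual stages.

```python
from collections import defaultdict

def map_phase(documents):
    # map: emit a (word, 1) pair for every word in every document
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    # shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # reduce: combine the grouped values for each key into a result
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data on Hadoop", "Hadoop stores big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["hadoop"])  # prints 2: each document mentions Hadoop once
```

The value of the model is that the map and reduce functions are stateless per key, so the framework can run thousands of them in parallel across the cluster without the programmer writing any coordination code.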
Because the Hadoop distributors have to provide ongoing support for the technologies they ship, only a subset of the ecosystem technologies actually comes with the distributions. Many people’s view is therefore confined to the capabilities of this subset. But to realize Hadoop’s full potential you really need to think outside your distribution’s box and investigate the other complementary technologies out there. These technologies can massively extend the capabilities of your Hadoop cluster, and many have excellent levels of support, as discussed in this paper by analyst Mike Ferguson.
Presto, a SQL on Hadoop solution supported by Teradata, is an example of a powerful Hadoop technology that is not part of any of the main distributions. Another is the Kognitio SQL engine for Hadoop.
If you consider Hadoop as a powerful, scale-out hardware platform supported by a vast ecosystem of data-related software technologies, then by choosing the right combination of software components you can make your Hadoop cluster into almost anything data-related, including interactive, self-service BI and data warehousing.
For example, Inmar uses Kognitio on Hadoop to provide retail clients with high-speed interactive access, via a web portal, to massive amounts of retail transaction data stored in a Hortonworks cluster. Using the Hive SQL layer that came with Hortonworks made the “high-speed” part of that hard to accomplish. With Kognitio they found that queries ran around 300x faster.
So what is Hadoop good for? Well, with the right combination of software components, pretty much anything.