
Why Hadoop is Dying



September 21, 2017

All right, I know that I’m going to ruffle some feathers here, but… Why is Hadoop dying? It was once the darling of the corporate data management space. Boardroom meetings suddenly had line items about Big Data strategies, and some of the distributions booked record levels of venture capital funding. But today, nobody cares. Why is that? What happened?

Well, let’s talk about Hadoop’s roots for a moment. In 2006 the Apache Hadoop open source project was released to the public. The project was based on two technologies Google had developed to store and process its index of the World Wide Web: the Google File System for storage and MapReduce for processing. What it did phenomenally well was offer an open door to data, while allowing organizations to scale that data footprint on cheap hardware. Up until that point, organizations had been paying through the nose to store data on expensive enterprise-class hard drives. So there was real value in using Hadoop to store this massive quantity of data. This was Hadoop’s strength. Its weaknesses at the time, however, were security, orchestration, query speed, and the complexity of queries. Over time those weaknesses have been addressed, but query speed still remains a thorn in the side of many Hadoop deployments. This is largely because of the way Hadoop stores its data.

Let me try to give you a tangible example. Imagine I gave you a college history book, but in this book there are no titles, no chapter headings, and no subheadings, just 500 pages of textbook-sized reading. Now imagine I asked you to find out how much World War 2 cost each country. It would certainly take quite a bit of time to come up with an answer, especially because I’m asking for a deeper level of specificity. I don’t just want to know where World War 2 is in the book; I want to know exactly how much it cost each country. But if I add those chapter headings, and everything else, back to the book, it suddenly takes a lot less time to come up with that answer. Albeit imperfect, this is a good analogy for why many corporate customers were disappointed when they tried to use Hadoop for enterprise-wide analytics. I’m not saying that Hadoop doesn’t have enterprise-wide success stories to tell, but where we find those success stories we also find the tried and true practices of data aggregation and data schemas.
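To make the book analogy a bit more concrete, here is a minimal Python sketch. It is only an illustration, not how Hadoop works internally: the `pages` list, the field names, and the cost figures are all made-up placeholders. The first function answers the question by reading every page; the second builds the "chapter headings" once and then looks the answer up directly.

```python
# Hypothetical "book": every page is raw text with no headings.
# The cost figures are placeholders, not real historical data.
pages = [
    "topic=Industrial Revolution;country=UK;cost=10",
    "topic=World War 2;country=US;cost=300",
    "topic=World War 2;country=UK;cost=120",
    "topic=Cold War;country=US;cost=50",
]


def parse(page):
    """Turn one raw page into a dict of its fields."""
    return dict(field.split("=", 1) for field in page.split(";"))


def full_scan(pages, topic):
    """No headings: every page has to be read and parsed to answer the question."""
    costs = {}
    pages_read = 0
    for page in pages:
        pages_read += 1
        record = parse(page)
        if record["topic"] == topic:
            costs[record["country"]] = int(record["cost"])
    return costs, pages_read


def build_index(pages):
    """Add the 'chapter headings': group the records by topic once, up front."""
    index = {}
    for page in pages:
        record = parse(page)
        by_country = index.setdefault(record["topic"], {})
        by_country[record["country"]] = int(record["cost"])
    return index


# Answering by scanning touches every page in the book.
scan_costs, pages_read = full_scan(pages, "World War 2")

# Answering from the index is a single lookup once the index exists.
index = build_index(pages)
lookup_costs = index.get("World War 2", {})
```

Both approaches return the same answer. The difference is that the scan pays the full reading cost on every question, while the index pays it once and then answers each later question with a direct lookup.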

But query speed is only half of the reason Hadoop lost steam. The cloud is at the root of its demise. The big vendors in the cloud space have been offering their own storage layers, which now do basically everything Hadoop does, but without the hassle of managing a ton of hardware. Interestingly enough, most of these cloud vendors were hosting Hadoop deployments while creating their own cheaper alternatives over the last five years. And these alternatives don’t require their users to do much in terms of managing redundancy, server uptime, etc. Now, just like Hadoop, these storage layers are not super great at high-speed, high-frequency queries, but that’s OK because there are other solutions within the cloud space which are good at such queries, and they integrate seamlessly with the vendors’ native cheap storage layers.

So what we’re seeing now are organizations trying to leave Hadoop. Many of these organizations are already in the process of offloading Hadoop sequence files into Amazon S3 and other cloud storage platforms, because they’re cheaper to administer and scale. However, this introduces its own problem. Amazon S3, for example, can’t natively read Hadoop sequence files without spinning up another Hadoop instance. Because of the sheer number of companies offloading Hadoop to Amazon S3, Intricity created a solution called readSEQ, which allows companies to read Hadoop sequence files right from Amazon S3. Additionally, it allows organizations to choose the format they would like to convert their data to, such as JSON, Avro, or Parquet. If you’d like to learn more about the Intricity readSEQ solution, click on this link. And if you’d like to discuss your scenario with a specialist, you can reach out to us at any time.
