What is Hadoop: Scalability

Jared Hillam

October 30, 2014


A simplified explanation of how Hadoop addresses the bottlenecks found in traditional databases and how these are overcome in HDFS.

Text from the Video:

Imagine for a moment that you are the owner of a hotel, and that every morning you serve a famous breakfast to your guests in the dining area. Now imagine that your hotel becomes so popular that you decide to increase the hotel size to 3000 rooms. This also requires you to expand pretty much everything: the dining area, the speed of the elevators, the size of the lobby, and so on. You still serve your famous breakfast, and it’s a huge hit. Your hotel becomes a worldwide sensation, and you find yourself again needing to expand. This time you decide to add another 3000 rooms, doubling the size of your hotel. You keep your super-fast elevators and simply add the additional floors. The first day the new capacity opens, you discover a new problem. The movement of people going up and down the elevators creates a massive traffic jam. Even though your elevators are going REALLY fast, they just aren’t fast enough to clear the crowd. Additionally, the dining area is overworked and can’t process the massive quantity of people.

Let’s pause the story for a moment and explain the parable. The people in the hotel represent the data. The hotel represents the storage or database. The dining area represents the data processing, and the elevators represent the network. Everything I’ve explained in the parable matches up fairly well with the challenges of data growth in a traditional database environment. As data volumes become really large, we need to deal with weaknesses in the environment. For example, a 10 gigabit network is indeed fast, but what if I’m transferring a terabyte of data across it? I suddenly have a very real bottleneck to deal with. The same is true for the dining area, or the processing engine. It might be super-efficient, but the volume of data is so extensive that the centralized nature of a single processing target creates a massive bottleneck.
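To put a rough number on that network bottleneck, here is a minimal back-of-the-envelope sketch in Python. It assumes a decimal terabyte (10^12 bytes) and an ideal 10 gigabit per second link with no protocol overhead and no competing traffic, so real transfers would take longer. The function and constant names are made up for this illustration.

```python
def transfer_seconds(data_bytes: float, link_bits_per_second: float) -> float:
    """Ideal time to move data_bytes over a link.

    Assumption: the link runs at full line rate, with no protocol
    overhead and no other traffic sharing it.
    """
    return data_bytes * 8 / link_bits_per_second


TERABYTE = 10**12         # decimal terabyte, in bytes (assumption)
TEN_GIGABIT = 10 * 10**9  # 10 Gb/s link, in bits per second

seconds = transfer_seconds(TERABYTE, TEN_GIGABIT)
print(f"{seconds:.0f} s (~{seconds / 60:.1f} min)")  # prints "800 s (~13.3 min)"
```

Even in this best case, a single terabyte ties up the link for over thirteen minutes, and that is before anyone else in the hotel needs the elevator.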

Now let’s talk about what the Hadoop Hotel would look like.

Knowing ahead of time that your hotel is going to be so popular that the entire world population might want to stay there and eat your famous breakfast, you design your hotel like this:

Each floor of the hotel has its own little kitchen and dining area. Each morning the cooks are given their recipes, and they go up to their designated floors together with their ingredients to cook their great meal. This way the elevators don’t get choked with guests going to and from a central dining area, and the famous breakfast gets served to everybody who wants it. This model allows you to expand the hotel to a huge size without a major impact on the hotel infrastructure.

This is, in essence, how Hadoop behaves. Imagine each rack or computer in the Hadoop cluster as a floor of the Hadoop hotel. Rather than moving the data between floors, we take the calculation instructions (the cooks) to the machine where the data lives.
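The difference between the two hotels can be sketched as a toy simulation in Python. This is not Hadoop code and uses no Hadoop API. The cluster layout, node names, and both functions are hypothetical, and "values moved" is just a stand-in for network traffic.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical cluster: each "floor" (node) holds its own slice of the data.
cluster: Dict[str, List[int]] = {
    "node-1": [4, 8, 15],
    "node-2": [16, 23],
    "node-3": [42],
}


def centralized_sum(nodes: Dict[str, List[int]]) -> Tuple[int, int]:
    """Ship every record to one central processor, then compute there."""
    shipped = [record for records in nodes.values() for record in records]
    return sum(shipped), len(shipped)  # (result, values moved over the network)


def data_local_sum(
    nodes: Dict[str, List[int]], task: Callable[[List[int]], int]
) -> Tuple[int, int]:
    """Send the task to each node; only one small result per node travels back."""
    partials = [task(records) for records in nodes.values()]
    return sum(partials), len(partials)  # (result, values moved over the network)


central_total, central_moved = centralized_sum(cluster)
local_total, local_moved = data_local_sum(cluster, sum)

print(central_total, central_moved)  # prints "108 6"
print(local_total, local_moved)      # prints "108 3"
```

Both approaches produce the same answer, but the data-local version moves one value per node instead of one value per record. With six records the gap is trivial. With billions of records per node, it is the difference between a quiet elevator and a traffic jam.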

Now, there are a lot of solutions that leverage the central tenets of the Hadoop infrastructure, but many forget the core tenet of keeping the cooks on the same floors as the guests. The temptation to move all the guests around, or to move them to a dining hall, is very high because it’s often more difficult to adhere to the fully distributed nature of Hadoop. There’s an inevitable network chatter that happens between Hadoop nodes, but improperly architected Hadoop solutions can turn this chatter into an absolute tumult. Because of the sheer quantities of data we’re dealing with, Hadoop can easily max out a network if it is improperly architected. The most scalable solutions for Hadoop keep network traffic to a minimum by adhering to the principle of calculating where the data lives. Small oversights in network chattiness can lead to expensive changes later in the lifecycle of your Hadoop implementation, when it grows to production capacity.

This is one of the many considerations that organizations need to take into account when defining a Hadoop infrastructure. We recommend you reach out to Intricity and talk with one of our specialists about your Hadoop strategy, because we can help you avoid expensive detours in your big data journey.

