Transformations are the meat of an ETL tool. The intellectual property a corporation encodes in how its data is reshaped is what makes an ETL tool sticky for the organizations that use it. Most companies don't have a clear understanding of what those transformations actually do; they simply know that the data behaves itself when it is queried (well, most of the time). If they really wanted to know what the ETL tool was doing, most tools provide a graphical representation of the transformations. Even then, the details that drive those logical steps sit in deeper definition screens that have to be drilled into.


For 30 years, ETL tools have served the data pipelines of organizations. Now, however, there is a growing pull toward computing transformations where the data is stored, which is driving companies to move their ETL computations closer to the database. That pull brings other factors into play as well. The pain of negotiating, planning around, and eventually exiting ETL tools has organizations looking for solutions that are less vendor-locked. But where do they go? The presumption that there is a silver-bullet answer is a little fanciful. Still, for organizations that take a conservative approach, the future is uncertain, so maximizing optionality is high on the agenda. This is producing a trend of PySpark becoming the target for those parties.

 

Open and Broad

PySpark is a target that does allow for non-cloud, non-vendor-specific, high-level data processing, which can be particularly attractive for organizations that operate in environments where data sovereignty and security are paramount. By utilizing PySpark, these organizations can keep the data processing close to the data storage layer, often within their own data centers or private clouds, thereby maintaining control and governance over their data workflows.

The allure of PySpark lies in its broad open-source adoption. Even though vendors layer their own syntax extensions on top of it, it still offers an escape from most of the proprietary clutches of traditional ETL vendors.

With PySpark, data engineers and scientists can script complex transformations using Python, a language with which a vast number of professionals are familiar. This democratization of data processing means that organizations are not just less vendor-reliant, but they are also tapping into a larger pool of talent to maintain and enhance their data pipelines.
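As a rough sketch of what that looks like in practice, the snippet below expresses a typical filter-derive-aggregate step as PySpark code. The table and column names are hypothetical, stand-ins for whatever a converted ETL job would actually read and write.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-example").getOrCreate()

# Hypothetical source table: one row per order line.
orders = spark.read.table("raw.orders")

# A typical "widget" chain expressed as code: filter, derive a column, aggregate.
daily_revenue = (
    orders
    .filter(F.col("status") == "COMPLETED")
    .withColumn("order_date", F.to_date("order_ts"))
    .groupBy("order_date", "region")
    .agg(
        F.sum("amount").alias("revenue"),
        F.countDistinct("order_id").alias("orders"),
    )
)

# Hypothetical curated target table.
daily_revenue.write.mode("overwrite").saveAsTable("curated.daily_revenue")
```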

Let's take a look at the current landscape of cloud vendors and whether they support PySpark. 

  • On-premises Hadoop… Yes
  • AWS… Yes
  • Google Cloud… Yes
  • Azure… Yes
  • Databricks… Yes
  • Snowflake… No (Well, no to the Spark part, but the Python part is Yes)

This broad support has large, conservative organizations that are debating their future taking a close look at PySpark as a target platform for ETL jobs. The allure is the ability to negotiate an exit without putting the organization at operational risk. The sheer expense of re-platforming native compute from one cloud to another is itself a force pushing organizations toward PySpark as a target. That risk is something conservative organizations weigh heavily; the last thing they want is to be tethered to a single vendor and unable to negotiate their costs. So much so that some organizations still haven't moved to the cloud at all, instead touting their own "clouds". Even then, the target language for them is PySpark.

 

Go To

All of these advantages see PySpark becoming the "go-to" code base for many ETL modernization efforts. However, this can be overdone. In the zeal to modernize to PySpark, organizations will often try to convert everything, right down to their SQL database logic. While it is possible to convert SQL logic to PySpark, it is often overkill: reconstructing simple ANSI SQL in a far more complex programming structure just creates additional noise in the code. Where logic can be maintained in SQL, it makes far more sense to do so, and with open options like SparkSQL that noise can be avoided entirely.
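As an illustration, with hypothetical table names, a simple ANSI SQL step can usually be carried over almost verbatim through spark.sql() rather than being rebuilt in the DataFrame API:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-example").getOrCreate()

# Existing ANSI SQL logic kept as SQL instead of being rewritten as DataFrame calls.
monthly_totals = spark.sql("""
    SELECT customer_id,
           date_trunc('month', order_ts) AS order_month,
           SUM(amount)                   AS total_amount
    FROM   raw.orders
    WHERE  status = 'COMPLETED'
    GROUP  BY customer_id, date_trunc('month', order_ts)
""")

monthly_totals.write.mode("overwrite").saveAsTable("curated.monthly_totals")
```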

What you do lose with PySpark are the ETL widgets, as everything is now represented in code. Platforms such as Databricks allow for more organized segmentation of code in their notebooks. Because those widgets are gone, it is important that the code conversion be conducted in a highly structured way so that consistent patterns persist across ETL transformations. That discipline is what keeps the code maintainable and automatable in the future.
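One way to keep that structure, offered here as a sketch rather than a prescribed standard, is to express each former widget as a small named function and chain them with DataFrame.transform, so every converted job follows the same recognizable pattern (the names below are hypothetical):

```python
from pyspark.sql import DataFrame, SparkSession, functions as F

spark = SparkSession.builder.appName("structured-etl").getOrCreate()

# Each former "widget" becomes a small, named, testable function.
def completed_only(df: DataFrame) -> DataFrame:
    return df.filter(F.col("status") == "COMPLETED")

def add_order_date(df: DataFrame) -> DataFrame:
    return df.withColumn("order_date", F.to_date("order_ts"))

def daily_region_revenue(df: DataFrame) -> DataFrame:
    return (df.groupBy("order_date", "region")
              .agg(F.sum("amount").alias("revenue")))

# The pipeline reads as an ordered list of steps, much like the widget view did.
result = (
    spark.read.table("raw.orders")
    .transform(completed_only)
    .transform(add_order_date)
    .transform(daily_region_revenue)
)

result.write.mode("overwrite").saveAsTable("curated.daily_region_revenue")
```

Because each step is a plain function, it can be unit-tested in isolation and reused across jobs, which is what keeps a large converted code base automatable over time.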

 

Once in a Career

While ETL tools have been sticky for decades, their grip is finally loosening. Organizations that wouldn't have dreamed of upsetting the applecart of their data pipelines are now making a once-in-a-career pipeline change. With that change comes a healthy skepticism and a focus on future-proofing the next move, and PySpark is one of the targets coming out on top.

 
