This article is based on my interview with Bart Farrell, Head of Community at Data on Kubernetes, on March 15, 2022. For further context, check out the full episode below.
The current data engineering ecosystem is filled with a wide range of tools, both open-source and third-party. While there still isn’t a consensus on which path to choose, I think it’s interesting to explore the possibilities of building an open-source data stack. With the current state of the market, it’s honestly the best time to reconsider how you designed your data stack and begin exploring open-source alternatives.
The current consensus is to deploy your open-source data stack on Kubernetes, as it provides the ability to scale seamlessly and operate complex infrastructure using standard tooling.
Having operated large-scale systems for nearly a decade, I’ve seen firsthand the challenges of deploying an open-source application on your own. On the other hand, I realize how valuable it is for organizations to self-host their data stacks to better navigate privacy and security concerns relevant to their field of work.
While this is certainly challenging for developers, I do think it’s reasonable to deploy a self-hosted open-source data stack, provided your organization has the right tooling and processes in place.
In this article, you’ll learn:
The most common open-source tools to choose from at each layer of the data stack
Why you should choose to self-host your data stack on Kubernetes
Three potential challenges I have seen when deploying an open-source data stack
First, you need to pick the right open-source tools for each layer of your data stack. Good open-source tools are well-documented, have a strong community presence, and are being actively developed. Bonus points for good community support channels and contribution programs.
For data ingestion, a popular solution is Airbyte, an open-source EL(T) platform that helps you replicate data from disparate sources to your data warehouses, data lakes, and databases. Due to its ease of use and low barrier to startup, it is quickly becoming the industry standard for replication jobs.
Another ingestion tool I commonly see companies using is Singer, an open-source ETL tool that powers data extraction and consolidation for all of your organization's data.
dbt has emerged as the dominant way forward for transforming data. Not only is it a great open-source solution, but it’s also simple to get up and running. It also has a very strong community of over 25,000 data practitioners to answer any questions you may have during implementation.
Another interesting option I have seen companies using every now and then is Dataform – more specifically their open-source SDK – a framework that helps data teams develop, test, and deploy SQL-based data workflows to the warehouse.
Having an orchestration engine to schedule jobs and manage a large tree of dependencies between them is a crucial part of today’s data stack. For years, Apache Airflow has been the standard for data engineers when it comes to orchestrating jobs in their data pipelines.
Recently, Dagster has been picking up traction in the space as an alternative to Airflow. Unlike Airflow, Dagster was built with the full software development lifecycle in mind, allowing developers to orchestrate their pipelines locally before deploying to production. Another key feature I’d like to highlight is Dagster’s modern interface, which lets developers visualize the status of their DAGs and any logs they output.
A tool to keep your eye on in the upcoming months is Prefect, a workflow management system designed to handle modern data infrastructure.
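To make "managing a tree of dependencies" concrete, here is a minimal sketch of the core idea using only Python's standard library. It is not the API of Airflow, Dagster, or Prefect; the job names and the `execution_order` helper are hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "load_warehouse": {"extract_orders", "extract_customers"},
    "transform_models": {"load_warehouse"},
    "refresh_dashboards": {"transform_models"},
}


def execution_order(jobs):
    """Return one valid run order in which every job follows its dependencies."""
    # static_order() raises graphlib.CycleError if the jobs form a cycle.
    return list(TopologicalSorter(jobs).static_order())


order = execution_order(pipeline)
print(order)
```

A real orchestrator layers scheduling, retries, logging, and parallel execution on top of this ordering, which is a large part of why teams adopt a dedicated tool rather than writing it themselves.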
For the data storage layer, there are a few common open-source solutions that have gotten a lot of traction over the years.
ClickHouse: Performant, easy to set up, and works well on simple hardware. Good if you want to set up a columnar data store and you know exactly what kind of data you want to store.
Apache Spark: One of the largest open-source projects in the data processing industry. Since it began as a research project in 2009, the unified analytics engine has become a staple of data and machine learning operations at some of the largest companies.
Once you have the data in a presentable fashion you still need to have a BI team query the data to provide analysis for internal and external stakeholders. Three of the most popular open-source solutions in this area are Apache Superset, Lightdash, and Metabase.
Apache Superset and Metabase are fairly similar and this choice will often boil down to which user experience you prefer. Metabase has been known to have a more intuitive UI, whereas Superset can be more powerful but has a steeper learning curve.
On the other hand, Lightdash is an interesting option, especially if you use dbt heavily inside your organization, as everything you need for BI is written as code in your dbt project.
While self-hosting your data stack isn’t a new concept, there are still some misconceptions that are important to address. Figuring out how to deploy an open-source data stack is not as hard as it was previously, especially if you have experience or are willing to learn Helm and Kubernetes. You can likely get an application like Metabase or Airbyte spun up, but it might take longer than you anticipated.
I do want to make it clear that if you are building entirely from scratch, the Kubernetes learning curve is steep, and picking it up in a short period of time is not trivial for most developers. However, over the past few years, the data infrastructure landscape has matured considerably, and installing applications in the cloud is far easier than it used to be.
Another common misconception I consistently hear is that scaling applications is challenging. I think this needs some clarification. Stateful systems (e.g., data warehouses) are extremely challenging to scale, as you need to manage data partitioning and other quirks that are particular to data warehouses.
However, as a community, it is safe to say that we have figured out how to scale pure compute systems (which is what a good amount of data software is). Scaling compute systems with the proper infrastructure is not too complex and is largely a matter of tuning a few dials.
You’ll typically use a Kubernetes Deployment or a Kubernetes StatefulSet and tune the replica count to whatever is appropriate for the amount of traffic you’re getting. Also, extraordinarily powerful autoscaling constructs now exist, so scaling is not as challenging as it once was. The challenge is more around knowing the appropriate metrics to monitor and tune against.
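To show what "tuning against a metric" means in practice, here is a minimal sketch of the replica calculation that the Kubernetes Horizontal Pod Autoscaler documents: the desired replica count is the current count scaled by the ratio of the observed metric to the target, rounded up. The function name, the default bounds, and the example numbers are my own assumptions, and the real autoscaler also handles pod readiness, missing metrics, and stabilization windows that are left out here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA-style calculation (illustrative, not the real controller)."""
    ratio = current_metric / target_metric
    # Skip scaling when the metric is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds, as an autoscaler would.
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical example: 4 replicas at 90% average CPU against a 60% target.
print(desired_replicas(4, current_metric=90, target_metric=60))
```

The arithmetic is simple; the hard part, as noted above, is choosing a metric and a target that actually reflect load on your application.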
I often get asked why organizations should consider running their data stack on Kubernetes. While there is a long list of benefits to doing so, I think it’s important to highlight the three main benefits I have seen in running your data stack on Kubernetes.
1. Cost savings: In my opinion, the biggest benefit is the money organizations save, especially as the business matures. Larger organizations that run infrastructure at scale can realistically reduce their cloud bill by upwards of a million dollars per year. I have also seen incredible cost savings when compared against the managed service layer. A managed service solution often comes with a 40% markup on compute, and if you are running large-scale batch processing jobs, that cost adds up quickly.
2. Simpler security model: Everything runs inside a hardened network that you control, which reduces privacy and compliance concerns. Strong compliance and privacy guarantees around product analytics suites are especially helpful in GDPR and CCPA environments, where rules are challenging to enforce across the board at scale.
3. Operationally simpler to scale out to multiple solutions: In most cases, you’ll likely have to run Airbyte alongside other solutions such as Superset, Airflow, and Presto. Doing so effectively is challenging and time-consuming. If you are committed to running all those applications together, a self-hosted Kubernetes model becomes quite powerful because it unifies all applications in a single environment. Developers can create a unified management experience with a web UI on top. The other benefit is that the application upgrade process is unified. Upgrading a tool like Kubeflow is complex because it requires many dependencies at the Kubernetes level. Those dependencies might clash with your deployment of an application like Airflow, which requires something similar to Knative or Istio under the hood. With a unified platform, this process is simplified: developers can easily run a diff check between the dependencies of all the applications and validate that they will be upgradable at any given time.
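The dependency diff check mentioned in the third benefit can be sketched roughly as follows. Everything here is hypothetical: the application names, the shared components, the version ranges, and the `find_conflicts` helper are placeholders and are not real compatibility data for any project. The idea is only that each application declares the range of versions it supports for each shared cluster component, and a conflict exists when those ranges do not overlap.

```python
def parse(version):
    """Turn a dotted version string such as '1.16' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


# Hypothetical data: app -> component -> (lowest, highest) supported version.
requirements = {
    "app_a": {"service_mesh": ("1.14", "1.16"), "kubernetes": ("1.22", "1.25")},
    "app_b": {"service_mesh": ("1.17", "1.18"), "kubernetes": ("1.23", "1.27")},
}


def find_conflicts(reqs):
    """Return the components whose supported ranges share no common version."""
    by_component = {}
    for app, deps in reqs.items():
        for component, bounds in deps.items():
            by_component.setdefault(component, {})[app] = bounds

    conflicts = {}
    for component, ranges in by_component.items():
        lowest_allowed = max(parse(low) for low, _ in ranges.values())
        highest_allowed = min(parse(high) for _, high in ranges.values())
        if lowest_allowed > highest_allowed:
            conflicts[component] = ranges
    return conflicts


conflicts = find_conflicts(requirements)
print(sorted(conflicts))
```

Running a check like this before an upgrade flags the clash early, instead of after one application has already broken another in the cluster.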
I’ll be upfront and admit that for some engineering teams running a self-hosted data stack is not always reasonable and can be challenging, especially if you don’t have any prior experience in getting a self-hosted stack up and running.
While it definitely is possible to do so, I want to quickly call out the three main challenges I have seen in deploying a self-hosted open-source data stack.
1. Managing Upgrade Cycles: What makes upgrades challenging is dependency complexity. An example of this is Kubeflow, which has around 15 deployments operating under the hood. When running Kubeflow, you need to run applications such as Knative and Istio alongside it, and keep track of Kubernetes version dependencies, to ensure upgrades install properly. If any of those change within your cluster, your entire Kubeflow deployment can break. Managing the upgrade lifecycle is difficult, and the reality is that it requires manual effort. You have to do manual validation in all your target clouds before baking, upgrading, and delivering it. You also end up with manual effort tracking security patches and new versions published to the upstream open-source project.
2. Handling Security Challenges: A challenging part of working with open source is figuring out how to make everything secure. A lot of open-source software has a weak security story. Plenty of open-source applications have no login, so they have no solution for authentication; where one exists, it is often outdated, locked behind an enterprise plan, or not properly implemented. The other part of security that is extremely complicated is the software supply chain management of your solution. You have to implement image scanning to make sure you understand any vulnerabilities that might exist in Docker images. It is very common for base images to contain security vulnerabilities, and eliminating those is not a trivial task. Implementing network security is still difficult. Service meshes are meant to be the solution for this, but we’ve seen compatibility issues when you start to push them, especially in interactions with database operators, where things start to break down.
3. Integration of different tools is tricky: Right now the process is still quite manual and requires a bit of thought. The shared network layer is a really big win here, which is why we like running single-cluster. It is also why developers are getting more bullish about Kubernetes. It is a consistent operational environment and gives you the ability to apply a holistic security profile to all of your applications. However, keep in mind that realistically this isn’t going to be a full solution. There are a ton of other applications that you need to integrate manually, and there isn’t a true infrastructure-level solution for that.
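For the image-scanning step described under the security challenge above, here is a minimal sketch of a deployment gate. It assumes your scanner's report has already been parsed into a list of findings; the finding structure, the `EXAMPLE-` identifiers, the package names, and the `blocking_findings` helper are all hypothetical, and real scanners emit their own report formats that you would need to map into something like this.

```python
# Hypothetical severity ordering; adjust to match your scanner's labels.
SEVERITY_RANK = {"LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}


def blocking_findings(findings, threshold="HIGH"):
    """Return the findings at or above the severity threshold."""
    limit = SEVERITY_RANK[threshold]
    return [f for f in findings if SEVERITY_RANK[f["severity"]] >= limit]


# Hypothetical scan results for one container image.
findings = [
    {"id": "EXAMPLE-0001", "package": "libexample", "severity": "LOW"},
    {"id": "EXAMPLE-0002", "package": "libsample", "severity": "CRITICAL"},
    {"id": "EXAMPLE-0003", "package": "libdemo", "severity": "MEDIUM"},
]

blocked = blocking_findings(findings)
if blocked:
    print("deploy blocked by:", [f["id"] for f in blocked])
else:
    print("deploy allowed")
```

The gate itself is the easy part. Deciding on a threshold and actually removing the vulnerabilities from base images is where the manual effort goes.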
While deploying your open-source data infrastructure stack on Kubernetes does offer an array of benefits for data teams, it’s not meant for every organization. I do want to caution that while Kubernetes is great, there is a steep learning curve if you haven’t used it previously. If you are comfortable with that learning curve or have team members with previous Kubernetes experience, then by all means it makes sense to deploy via Kubernetes.