The modern data stack is a collection of cloud-native applications that serves as the foundation of an enterprise data infrastructure. The concept has quickly gained popularity and has become the de facto way for organizations of all sizes to extract value from data. Like an industrial value chain, the modern data stack follows a logic of ingesting, transforming, storing, and productizing data.
This blog is an introduction to the modern data stack, and in the following articles, we will dive into each of the different components.
The modern data stack can be defined as a collection of technologies used to transform raw data into actionable business insights. The infrastructure built from these tools reduces the complexity of managing a data platform while ensuring that the data is exploited to its fullest. Companies can adopt the tools that best fit their needs, and additional layers can be added depending on the use case.
For decades, on-premise databases were enough for the limited amount of data companies stored for their use cases. Over time, however, data volumes increased exponentially, forcing companies to find ways to keep all of the information coming their way. This led to the emergence of new technologies, like Hadoop, Vertica, and MongoDB, that allowed organizations to deal with large amounts of data. This was the Big Data era of the 2000s, when systems were typically distributed SQL or NoSQL.
The Big Data era lasted less than a decade before it was disrupted by the widespread adoption of cloud technologies in the early and mid-2010s. Traditional, on-premise Big Data technologies struggled to move to the cloud. Their higher complexity, cost, and required expertise put them at a disadvantage compared to the more agile cloud data warehouse. This all started in 2012 with Redshift, followed shortly after by BigQuery and Snowflake.
It is important to remember that the modern data stack as we currently know it is a very recent development in data. Only very recent changes in technology allowed companies to exploit the potential of their data fully. Let’s walk through some of the key developments that enabled the adoption of the modern data stack.
The rise of the cloud data warehouse
In 2012, when Amazon’s Redshift was launched, the data warehousing landscape changed forever. All of the other solutions in the market today, like Google BigQuery and Snowflake, followed the revolution set by Amazon. The development of these data warehousing tools is linked to the difference between MPP (Massively Parallel Processing) or OLAP systems like Redshift and OLTP systems like PostgreSQL. But we’ll discuss this topic in more detail in our blog focused on data warehousing technologies.
So what changed with the cloud data warehouse?
The move from ETL to ELT
In a previous blog, we’ve already talked about the importance of the transition from ETL to ELT. In short, with an ETL process, the data is transformed before being loaded into the data warehouse. With ELT, on the other hand, the raw data is loaded into the data warehouse before any transformations are made. Data transformation consists of cleaning, deduplicating, adapting the data format to the target database, and more. Making all these transformations before the data was loaded into the warehouse allowed companies to avoid overloading their databases, which is why traditional data warehouse solutions relied so heavily on ETL. With the rise of the cloud, storage was no longer an issue, and the cost of pipeline management decreased dramatically. This enabled organizations to load all of their data into a database without having to make critical strategic decisions at the extraction and ingestion phase.
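The difference can be sketched in a few lines of code. This is a minimal, illustrative ELT toy (all names and records here are made up, not any real tool's API): raw records land first, and cleaning only happens afterwards, inside the "warehouse".

```python
# Minimal ELT sketch: load raw data first, transform later.
# All names and records are illustrative, not a real warehouse API.

RAW_EVENTS = [
    {"user": "Ada", "email": "ADA@EXAMPLE.COM"},
    {"user": "Ada", "email": "ada@example.com"},    # duplicate after cleaning
    {"user": "Grace", "email": "grace@example.com"},
]

def load(records):
    """E + L: land the raw data untouched, no upfront modeling decisions."""
    return list(records)

def transform(warehouse_rows):
    """T happens after loading: normalize formats and drop duplicates."""
    seen, clean = set(), []
    for row in warehouse_rows:
        key = (row["user"], row["email"].lower())
        if key not in seen:
            seen.add(key)
            clean.append({"user": key[0], "email": key[1]})
    return clean

raw = load(RAW_EVENTS)             # everything lands in the warehouse first
analytics_ready = transform(raw)   # transformations run when needed
```

In a classic ETL pipeline, `transform` would run before `load`, forcing the cleaning decisions to be made at ingestion time.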
Self-service analytics and data democratization
The rise of the cloud data warehouse has not only contributed to the transition from ETL to ELT but also the widespread adoption of BI tools like Power BI, Looker, and Tableau. These easy-to-use solutions allow more and more personas within organizations to access data and make data-driven business decisions.
It’s essential to remember that merely having a cloud-based platform does not make a data stack a modern data stack. As Jordan Volz wrote in his blog, many cloud architectures fail to meet the requirements to be included in the category. For him, technologies need to offer five main capabilities to be included in the modern data stack.
The objective of the modern data stack is to make data more actionable and reduce the time and complexity needed to gain insights from the information in the organization. However, the modern data stack is becoming increasingly complex due to the growing amount of data tools and technologies. Let’s break down some key technologies within the modern data stack. The individual components can vary, but they usually include the following:
Data integration
Main tools: Airbyte, Fivetran, Stitch
Organizations collect large amounts of data from various systems: databases, CRM systems, application servers, and more. Data integration is the process of extracting data from these disparate source systems and loading it into a centralized system, like a data warehouse or a data lake, resulting in a single, unified location for accessing all the information flowing through the organization.
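As a rough sketch of the idea, the snippet below pulls records from two made-up sources and lands them in one unified store, tagged with their origin. The source names and extractors are hypothetical stand-ins for real connectors like those Airbyte or Fivetran provide.

```python
# Hypothetical data integration sketch: records from disparate sources are
# extracted and loaded into one unified store. Sources are stand-ins.

def extract_crm():
    """Stand-in for a CRM connector."""
    return [{"id": 1, "name": "Acme"}]

def extract_app_db():
    """Stand-in for an application database connector."""
    return [{"id": 7, "event": "signup"}]

def integrate(sources):
    """Tag each record with its origin and land everything in one place."""
    unified = []
    for name, extractor in sources.items():
        for record in extractor():
            unified.append({"source": name, **record})
    return unified

warehouse = integrate({"crm": extract_crm, "app_db": extract_app_db})
```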
Data transformation
Main tools: dbt
Raw data is entirely useless if it doesn’t get structured in such a way that allows organizations to analyze it. This means you need to transform your data before using it to gain insights and make predictions about your business.
Data transformation can be defined as the process of changing the format or structure of data. As we have explained in this blog, data can be transformed at two different stages of the data pipeline. Typically, organizations that make use of on-premise data warehouses utilize an ETL (extract, transform, load) process, in which the transformation happens in the middle — before the data is loaded into the warehouse. However, most organizations today make use of cloud-based warehouses, which makes it possible to increase the storage capacity of the warehouse dramatically — therefore allowing companies to store raw data and transform it when needed. This model is called ELT (extract, load, transform).
Once you have decided on the transformations, you need a way to schedule them so that they run at your preferred frequency. Data orchestration automates and coordinates these processes: bringing data together from multiple sources, combining it, and preparing it for analysis, in the right order and at the right time.
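At its core, an orchestrator runs pipeline steps in dependency order. The toy below shows just that ordering idea with Python's standard-library `graphlib`; real orchestrators such as Airflow or Dagster add scheduling, retries, and monitoring on top. The step names are illustrative.

```python
# Toy orchestration sketch: run pipeline steps in dependency order.
# Real orchestrators add scheduling, retries, and observability on top.
from graphlib import TopologicalSorter

RESULTS = []

def ingest():    RESULTS.append("ingest")
def combine():   RESULTS.append("combine")
def transform(): RESULTS.append("transform")

# Each step maps to the set of steps it depends on.
DAG = {
    "combine": {"ingest"},
    "transform": {"combine"},
}

TASKS = {"ingest": ingest, "combine": combine, "transform": transform}

# static_order() yields steps with all dependencies satisfied first.
for step in TopologicalSorter(DAG).static_order():
    TASKS[step]()
```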
Data warehouse
Main tools: Snowflake, Firebolt, Google BigQuery, Amazon Redshift
Nothing reaches your data without going through the warehouse: it is the hub that connects all the other pieces, and all your data flows in and out of it. This is why we consider the data warehouse the center of the modern data stack.
Reverse ETL
To put it simply, reverse ETL is the exact inverse process of ETL. Basically, it’s the process of moving data from a warehouse into an external system — like a CRM, an advertising platform, or any other SaaS app — to make the data operational. In other words, reverse ETL allows you to make the data you have in your data warehouse available to your business teams — bridging the gap between the work of data teams and the needs of the final data consumers.
The challenge here is that more and more people within organizations are asking for data. This is why organizations today aim to engage in what is called operational analytics, which means making the data available to operational teams, like sales and marketing, for functional use cases. However, without a pipeline moving data directly from the warehouse to the different business applications, it is difficult for business teams to access the cloud data warehouse and make the most of the available data; its use remains limited to creating dashboards and BI reports. This is where the bridge provided by reverse ETL becomes crucial to making full use of your data.
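Conceptually, a reverse ETL sync reads rows computed in the warehouse and upserts them into an operational tool. The sketch below uses an in-memory stand-in for a CRM; a real sync would use the SaaS vendor's HTTP API, and all names here are assumptions for illustration.

```python
# Hedged reverse ETL sketch: push warehouse rows into an operational tool.
# FakeCRM is a stand-in for a real SaaS API; all names are illustrative.

WAREHOUSE_TABLE = [
    {"email": "ada@example.com", "lifetime_value": 1200},
    {"email": "grace@example.com", "lifetime_value": 300},
]

class FakeCRM:
    """In-memory stand-in for a CRM (a real sync would make HTTP calls)."""
    def __init__(self):
        self.contacts = {}

    def upsert(self, email, fields):
        self.contacts.setdefault(email, {}).update(fields)

def reverse_etl(rows, crm):
    """Sync warehouse rows into the CRM so business teams can act on them."""
    for row in rows:
        crm.upsert(row["email"], {"lifetime_value": row["lifetime_value"]})

crm = FakeCRM()
reverse_etl(WAREHOUSE_TABLE, crm)
```

After the sync, a salesperson can see each contact's lifetime value directly in the CRM instead of asking the data team for a report.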
BI & Analytics
Main tools: Power BI, Looker, Tableau
BI and analytics include the applications, infrastructure, and tools that enable access to the data collected by the organization to optimize business decisions and performance.
Data observability
We are biased here, but data observability has become an integral part of the modern data stack. For Gartner, data observability is essential to support and enhance any modern data architecture. As we explain in this blog, data observability is a concept from the DevOps world adapted to the context of data, data pipelines, and platforms. Similar to software observability, data observability uses automation to monitor data quality and identify potential data issues before they become business issues. In other words, it empowers data engineers to monitor the data and quickly troubleshoot any problem.
As organizations deal with more data daily, their data stacks become increasingly complex. At the same time, bad data is not tolerated, making it increasingly critical to have a holistic view of the state of the data pipelines and to monitor for loss of quality, performance, or efficiency. On top of this, organizations need to identify data failures before they can propagate, and to generate insights on what to do next when a data incident occurs.
Data observability can take several forms, and these help technical personas like data engineers and analytics engineers, but also business personas like data analysts and business analysts.
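One simple form such a check can take is volume monitoring: alerting when a table's row count drops sharply against its recent history. The sketch below is a crude, illustrative stand-in for automated anomaly detection; the threshold and names are assumptions, not any vendor's method.

```python
# Illustrative data observability check: flag a table whose row count
# drops sharply versus its recent history. Threshold is an assumption.
from statistics import mean

def volume_anomaly(history, today, tolerance=0.5):
    """Alert if today's row count falls below (1 - tolerance) of the
    recent average; a crude stand-in for automated anomaly detection."""
    baseline = mean(history)
    return today < baseline * (1 - tolerance)

daily_row_counts = [1000, 1050, 980, 1020]
alert_on_drop = volume_anomaly(daily_row_counts, 90)     # sudden drop
alert_on_normal = volume_anomaly(daily_row_counts, 990)  # normal day
```

Real observability platforms extend this idea across freshness, schema changes, distribution shifts, and lineage, rather than a single hand-tuned threshold.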
Now you may ask yourself: is there a perfect data stack that every organization should adopt? The quick answer is no. There is no one-size-fits-all approach when it comes to selecting the best tools and technologies to deal with your data. Every organization has a different level of data maturity, different data teams, different structures, processes, and so on. However, we can categorize modern data stacks into three distinct groups: essential, intermediate, and advanced.
Essential data stack
An essential data stack should be able to ingest data, store it, and model it, with dbt or custom SQL. Then, depending on the use case, organizations either craft their own reverse ETL process for operational analytics or invest in a BI tool. Usually, BI tools in essential data stacks are open-source tools like Metabase. On top of that, an essential data stack includes a basic form of testing or data quality monitoring.
Intermediate data stack
An intermediate data stack is a bit more sophisticated: on top of what we described above, it also includes some form of workflow orchestration and a data observability tool. Basic testing and data quality monitoring allow teams to gain visibility into the quality status of their data assets, but they offer no way of quickly troubleshooting potential issues. This is where data observability comes in. Data observability collects signals across the entire data stack (logs, jobs, datasets, pipelines, BI dashboards, data science models, and more), enabling monitoring and anomaly detection at scale.
Advanced data stack
To conclude, an advanced data stack builds on the intermediate stack with additional technologies and tools.
Infrastructures, technologies, processes, and practices have changed at a dramatic speed over the past few years. The rise of the cloud, the separation of compute and storage, and data democratization have propelled the modern data stack movement. Keep in mind that there is no one-size-fits-all approach; the right stack is hugely use-case specific, depending on factors like the data maturity level of the organization, your data team, and more. Find out more about how we think you should build your modern data team based on your organization’s maturity level in this blog.