The Modern Data Stack has been the hot topic of the data world for quite some time now. In the 12 years since Redshift was initially released, we’ve seen a mass explosion of capital and innovation that has enabled organizations to transform how they think about and leverage data.
For this article, I’d like to discuss one of the major advantages of the MDS — the shift from ETL to ELT — and the primary users who are served and underserved in this paradigm. Today, I believe that the MDS provides a massive unlock in value for data engineers and analytic engineers but lacks the easy-to-use and intuitive toolkits for the data analyst, citizen data scientist, and knowledge worker that sits closest to the decision.
The purpose of this blog is not to discuss the past, present, or future of the Modern Data Stack. There are many articles that provide a fantastic history lesson on the transition of on-prem legacy systems to the truly cloud native, modern data stack most of us know and love today. If you’re new to the MDS and interested in learning more, dbt’s CEO, Tristan Handy, wrote a great post, The Modern Data Stack: Past, Present, and Future, in December 2020, that perfectly articulates the transformation at hand.
You’ve likely heard about ELT — Extract Load and Transform… the Modern Data Stack’s evolution on ETL. This is a game changer by nature in that it enables organizations to ingest raw data into the data warehouse and transform it later. ELT gives end-users access to the entirety of the datasets they need by circumventing downstream issues of missing data that could prevent a specific business question from being answered.
ELT isn’t a new topic. I remember the early days of my career in the Big Data/Hadoop space. Originally introduced by Google in 2005, the promise of Hadoop was, more or less, ELT. Don’t transform data on its way into the data platform and risk losing critical data that you need down the road. Instead, ship the entirety of the raw data into Hadoop, and you’ll have it…. for whenever you need it.
The promise was great! Truly an amazing sales pitch, but unfortunately, it just didn’t work out. Data processing was too slow. It frustrated end-users, was not usable by the knowledge workers, and didn’t enable agility in analytically driven decision making. Spark made huge progress in reducing the speed of processing, but a better solution has emerged.
This is where the Modern Data Stack, and the Cloud Data Warehouse (the core foundation of the MDS), differ. By separating storage from compute and enabling elastic computation in the cloud, tools such as Snowflake, BigQuery, Delta Lake, and Redshift have changed this paradigm. Data teams can now extract and load raw data directly into cloud storage for a super low cost per terabyte, and the users who need prepared data for analysis can get lightning-fast speeds throughout their data analysis workflow.
To make things simple, I’ve outlined a few of the major technologies being leveraged today. Again, for a complete overview of the MDS, check out Tristan Handy’s blog that I posted above.
To answer this question, it’s first important to level set on the key users we see in the end to end data stack. I like to think of this by breaking down users into data producers and data consumers. Data producers are the individuals who are integrating and modeling data on behalf of the organization. Data consumers consume the cleaned and modeled data to answer critical business questions from that data.
In working in this space for years, I’ve seen firsthand that these tools (mentioned above) enable the data engineer and the (newly minted job title) analytic engineer super well. They provide modern, easy-to-use, and cloud native code-first tools for the ELT workflow and provide the organizations that employ these users with better speed and governance of their data assets.
Said differently, the Modern Data Stack is awesome in that it enables centralization, speed, agility, and performance. But, today, it is a walled garden for users who are highly proficient in SQL development.
So who is left out in this world? In Tristan’s article on the past, present, and future of the Modern Data Stack (posted above), he argues that one of the biggest gaps still to be closed in the MDS is the bridge to the data consumer.
Tristan goes as far as saying, “Data consumers were actually more self-serve prior to the advent of the modern data stack: Excel skills are widely dispersed through the population of knowledge workers. There has not yet been an analogous interface where all knowledge workers can seamlessly interact with data in the modern data stack in a horizontal way.”
I couldn’t agree more! In my viewpoint, the data consumer, in the analytics workflow, is not only underserved but arguably the most essential user who needs a solution. After all, this user is the one who sits closest to the business decision that can drive differentiated advantage in the marketplace.
This knowledge user is different. They typically understand SQL conceptually, and many are proficient at writing SQL code, but the true power of this user is that they act as both the data SME and the business SME that truly understands the problem at hand and the business context of what they are trying to solve. Said differently, the critical value of the knowledge user is domain expertise, not programming expertise.
This user needs to answer questions from data. They need insights from data…. fast. They are typically on the hook for preparing critical data insights that lead to executive and senior management decisions on the direction of the company. Important stuff!
In my opinion, ELT represents a transformational advantage for this user. Traditional BI tools such as Tableau do an amazing job at enabling visualization of data to answer questions but rely on the data engineer or analytic engineer to first prepare raw data into an analytical table that Tableau can query.
This means that the visual insights from raw data workflow is, today, inherently not self-service. An analyst needs to partner with the analytics engineering team to get the data they need in the format they need. This slows down decision making to a halt and typically annoys the analytic engineering team with ad hoc requests that fall outside the scope of their prioritized sprint.
Analytic engineers own production-level data pipelines that the business relies on. These pipelines are static, hardened, and scalable. This is a fundamentally different requirement than ad hoc data insights which are agile, iterative, and experimental. A new solution paradigm is needed.
This requirement is not new. Alteryx built a massive business around no-code data preparation, but it was built for the traditional, on-premise world. It does not give users the advantages of ELT within the Modern Data Stack. It is, instead, ETL, forcing users to extract data from the central warehouse, impacting performance, speed, and auditability of the transformation pipeline.
In another blog from Tristan, “Code and No-Code, Having it Both Ways”, he describes a major gap in the marketplace requiring no-code tools that, behind the scenes, write Modern Data Stack native SQL for the end user that is automatically pushed down to data in the Cloud Data Warehouse.
A solution like this would enable data consumers to capture the value of ELT in the Modern Data Stack without forcing them to become an expert-level SQL developer. It would enable a rapid up-leveling of skill sets within a user community overnight.
At Rasgo, this is a problem we’ve thought a lot about. We’ve worked with hundreds of data analysts, business users, and citizen data scientists who have heard a lot about the Modern Data Stack and the value their data engineering teams have gleaned, but haven’t yet been able to fully capture the value of the cloud data ecosystem themselves.
From my viewpoint, a new breed of tools purpose built for these end users needs to emerge to truly bring data consumers into the MDS ecosystem. These tools need to be UI first, enable a more iterative workflow, and be fully native to the underlying dbt transformation engine. They need to enable business users to answer ad-hoc questions from data as they come up, whether or not these questions rely on metrics that have already been created by the Analytic Engineering team. Without this, data consumers will always be at the mercy of their engineering counterparts, and the engineering teams will always be responding to ad hoc requests from the business, instead of thinking about modeling data for the holistic needs of the enterprise.
Lately, I’ve been describing this to folks within the industry as “The UI for the Modern Data Stack.” It’s something we at Rasgo are heavily committed to perfecting, and we believe that, without this, true data driven decision making will never fully be realized.
Really appreciate you reading!! Super exciting times to be in the data world and we look forward to what’s ahead :)
A majority of business leaders believe data insights are key to the success of their business in a digital environment. However, many companies struggle to build a data-driven culture, with a key reason being the lack of a sound data democratization strategy.
Just like data mesh or the metrics layer, active metadata is the latest hot topic in the data world. As with every other new concept that gains popularity in the data stack, there’s been a sudden explosion of vendors rebranding to “active metadata”, ads following you everywhere and… confusion.
As data became more and more available to companies, data integration became one of the most crucial challenges for organizations. In the past decades, ETL (extract, transform, load) and later ELT (extract, load, transform) emerged as data integration methods that transfer data from a source to a data warehouse.
As the amount of data rapidly increases, so does the importance of data wrangling and data cleansing. Both processes play a key role in ensuring raw data can be used for operations, analytics, insights, and inform business decisions.
Do you know the current status — quality, reliability, and uptime — of your data and data systems? Not last month or last week, but where they stand at this moment. As businesses grow, being able to confidently answer this question becomes more important. That’s because data needs to be clean, accurate, and up-to-date to be considered reliable for analysis and decision-making. This confidence comes through what’s known as data observability.
In the past years, organizations have been investing heavily to convert themselves into data-driven organizations with the objective to personalize customer experiences, optimize business processes, drive strategic business decisions, etc. As a result, modern data environments are constantly evolving and becoming more and more complex. In general, more data means more business insights that can lead to better decision-making. However, more data also means more complex data infrastructure, which can cause decreased data quality, a higher chance of data breaking, and consequently erosion of data trust within organizations and risk of not being compliant with regulations. The data observability category — which has quickly been developing during the past couple of years — aims to solve these challenges by enabling organizations to trust their data at all times. Although the category is relatively young, there are already a wide variety of players with different offerings and applying various technologies to solve data quality problems.
Data governance is more than just having a strategy – it is about establishing a culture where quality data is achieved, maintained, valued, and used to drive the business. Modern-day businesses are supported by data and information in many ways and forms. In recent years, data has become the foundation for competition, productivity, growth, and innovation. We are seeing successful organizations shift their focus from producing data to consuming it, and data governance strategies becoming increasingly important to support their crucial business initiatives. Executives and shareholders are starting to realize that data is a strategic asset and data governance is a must if they want to get value from data.
I started my career as a first-generation analyst focusing on writing SQL scripts, learning R, and publishing dashboards. As things progressed, I graduated into Data Science and Data Engineering where my focus shifted to managing the life-cycle of ML models and data pipelines. 2022 is my 16th year in the data industry and I am still learning new ways to be productive and impactful. Today, I am now the head of a data science & data engineering function in one of the unicorns and I would like to share my findings and where I am heading next.