Mar 21, 2023 · 36 min

S02 E05: What's Fundamentally Wrong with Modern Data Stack with Lauren Balik, Owner at Upright Analytics

Lauren Balik, who runs Upright Analytics and is a leading data consultant and investor, discusses why she believes the modern data stack is flawed and the three factors that affect the cost of a data platform. Balik also compares building versus buying a data platform and recommends an OLAP database in the cloud for small companies. However, she thinks centralizing data out of a line of business is a mistake for larger companies. Balik does not anticipate consolidation in the modern data stack and thinks that large language models such as GPT-3 will be crucial.

Available On:
Google Podcasts
Amazon Music
Apple Podcasts

About the guest

Lauren Balik

Lauren Balik runs Upright Analytics and is a leading data consultant and investor. With her company, Lauren and her team are solving the toughest data challenges and empowering data-driven cultures, helping people fall in love with their data.

In this episode

  • What's wrong with Modern Data Stack
  • Cost factors for a data platform
  • Positive outcome from MDS revolution
  • Way forward for the data space
  • Thoughts on future consolidation


Welcome to the Modern Data Show, where we explore the latest advancements and insights in the world of data. Today we have a very special guest, Lauren Balik, who runs Upright Analytics and is a leading data consultant and investor. With her company, Lauren and her team are solving the toughest data challenges and empowering data-driven cultures, helping people fall in love with their data. Join us as we dive deep into the mind of a true data expert and learn about the cutting-edge solutions and strategies she's bringing to the industry. So sit back, relax, and get ready to learn all about the world of modern data with Lauren Balik on the Modern Data Show. Welcome to the show, Lauren.
Thanks for having me.
Okay, so Lauren, let's start with a very quick primer on you. I'd love to know a little bit more about your journey, how you got into data, and your story of starting Upright Analytics.
Yeah, absolutely. In college, I studied economics. I fell in love with data there and with applying it. I was using everything from MATLAB to some SQL I learned, and did all that in school. I went to the University of Virginia, and that was a little over 10 years ago now. So time flies, but I fell in love with data, graduated, and then got into sales. In my first job out of school, I was working at a tech startup selling mobile messaging software. Nothing to do with data, and not applying the things I like to do at all. It was a very interesting learning experience. I learned about the tech startup space and everything, but after about a year, I had to get back into data, rather than prospecting people and doing the phone calls, doing what today we'd call an SDR job right out of school. I've worked with organizations over the last decade or so ranging from some of the biggest names in the world, the Fortune 500, large electronics companies, down to tech startups. I've seen the rise of Tableau, the fall of Tableau, the rise of the cloud data warehouse, and what might be the start of the fall of the cloud data warehouse now. So I've seen it all. I like to be pretty pragmatic, keep my head screwed on right about things, and separate buzz terms from actually providing customer value, which are two very different things. At Upright Analytics, I run a very small, lean team. I would say over the last year and a half, we have mostly just been doing open-heart surgery, is how I describe it. If you've ever seen a movie like Pulp Fiction, they have to call in the Wolf at one point to clean up a big mess. So they call up the Wolf, and the Wolf comes in: you go over here, you don't say anything, first we clean this, then we clean this, then we fix this up.
I've been running my company for a few years now, but the last year and a half has been a lot of that. I think it's a function of a lot of the over-eagerness, overspending, and maybe not hiring people in the right roles. It's a function of a lot of things, but that's what we do. And we work with organizations large and small.
Amazing. And let's address the elephant in the room. You're not a big fan of the modern data stack. You already mentioned Tableau is dead. You already mentioned cloud warehouses are about to be dead. Before we go into that, let's get on the same page. What's your definition of the modern data stack? Which modern data stack are we talking about?
Yeah, when I say the modern data stack, what I mean, and I think what most people think of, is the idea of a centralized cloud data warehouse that operates as the computing centre and a single source of truth, around which sit outcomes that range up to, potentially, machine learning. Usually, it's analytics and BI. And I think there's a heavy lean towards SaaS applications that plug in versus coding, especially when you look at ingestion jobs and ETL. So that's what I would define as the modern data stack. I can put the lakehouse in there too, some of the things that are going on with a lakehouse-type architecture. But I think it's all the same at the end of the day.
This story sounds good. You have the central data warehouse as the central source of business knowledge, and then you build data applications on top of that to support business decisions. What's wrong with this? What's fundamentally wrong with the modern data stack, or where do you think the modern data stack fails to deliver?
I think if we look through a broad lens, if we take off our engineering or analyst hats and just look at the broader state of how we got here, what's happening, et cetera, the whole thing is based on the NRR or NDR, the net dollar retention, of Snowflake, Databricks, Google Cloud BigQuery and the Google Cloud ecosystem. It's all based on the net dollar retention of these compute centres. So you store your data, then you do stuff with it: you make DAGs, you make SQL statements, et cetera, in a variety of different ways. And this is broader than just the modern data stack. You can look at the monitoring and observability products in the broader engineering and infrastructure world, like Datadog and HashiCorp, and they all have great NDRs. Every year, from the previous period to this period, customers are spending 130%, 140% of what they spent before. I think Snowflake's has gotten up to 178%. So these customers are spending more and more every year on a lot of these cloud products. And when we're talking about the modern data stack, we're talking about the compute centres here. No one has ever brought this up, I think, but if you have a credit card and you're paying 30% interest on it, that's a bad credit card. You're getting screwed over a little bit by the credit card company. It's high interest. Now compare that to the net dollar retention of some of these companies. And let's be honest, it all centres around the compute centre of the cloud warehouse in the modern data stack world. That's what drives everything else forward. So I think right off the bat there's maybe a misplaced incentive around what customer success looks like versus how much a customer is paying. There are a lot of incentives that may not line up with perfect customer value in that second year, in that third year, versus what they're paying.
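To make the credit-card comparison concrete, here is a minimal sketch of how a constant net dollar retention compounds. It assumes every customer's spend grows by exactly the cohort-average NDR each year, which published NDR figures do not guarantee for any single customer. The function names and the $100,000 starting spend are illustrative assumptions, not figures from the episode.

```python
def implied_growth_rate(ndr: float) -> float:
    """Year-over-year spend growth implied by an NDR, e.g. 1.30 -> 0.30."""
    return ndr - 1.0


def spend_after(initial_spend: float, ndr: float, years: int) -> float:
    """Annual spend after `years` renewals, assuming the same NDR every year."""
    return initial_spend * ndr ** years


if __name__ == "__main__":
    starting_spend = 100_000  # hypothetical first-year spend in USD
    for ndr in (1.30, 1.40, 1.78):  # NDR levels mentioned in the conversation
        year3 = spend_after(starting_spend, ndr, years=2)
        print(f"NDR {ndr:.0%}: implied growth {implied_growth_rate(ndr):.0%}, "
              f"year-3 spend ${year3:,.0f}")
```

Read this way, an NDR of 130% behaves like the 30% credit card in the analogy: the same customer base pays 30% more each year unless someone actively works the bill down.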
And what do you think is the biggest culprit here? Is it the data warehouse itself, or the ingestion pipelines that are dumping an insane amount of data into the warehouse, and this whole shift from ETL to ELT? What's the main culprit there?
Yeah, the rebranding of ETL as ELT is funny, because instead of processing the data beforehand, in Spark or wherever you're going to process it, or defining your schema up front, you're just dumping it somewhere else and then defining it later. So they own the compute there. But overall, I think the main culprit is that a lot of companies aren't educated on what it looks like to succeed. Snowflake, for all the great things they've done, and Google Cloud with BigQuery as well, have lowered the bar and lowered the entry point of what it takes for a company to get started. Companies want the data now. They wanna build some analytics. They wanna build some reports and things, and that's great, but if you don't design stuff correctly up front, you're taking a payday loan at 30% interest, or in the case of Snowflake, I don't know what they are, 160% NDR or something, so you're taking a loan at 60% interest on average. And this adds up. It adds up in the second year. It adds up in the third year. A lot of it comes down to SQL. What we've seen is the rise of SQL in the last couple of years. SQL is easier to learn than an object-oriented programming language. Using declarative SQL to define your data schema in Snowflake, in a database, works fine. It works great. But if you're not writing optimal SQL and you're doing things like joining on high-cardinality data, that can add up very quickly, and if you're doing looping functions or nested loops in SQL, that's gonna add up. A lot of people don't think about this, but I think both individual contributors and data directors and managers need to think more about the exponential growth. If you're running some kind of function, right now it can cost you, I don't know, one Snowflake credit. Next period, when your volume goes up and you're running it again, it can cost you two Snowflake credits.
Now, if it's growing linearly, it should cost you three credits next time. But if it's growing exponentially, it's not gonna cost you three. It's gonna cost you four. And then next time it's gonna cost you eight. You're gonna double it every time because you're making all of these N-squared types of solutions, and I've seen this pretty constantly. The same goes for indexing data or figuring out how data is partitioned. A lot of basic stuff falls by the wayside because of the desire to move fast, and moving fast is good. Moving fast has been the name of the game for a lot of these tech companies in the last few years, when it's all been run, run, run. But now it's time to optimize a little bit. I even wrote a piece recently on my Medium about how Snowflake is a table game, but the Google Cloud ecosystem is a casino. And if we're talking about things that are wrong or maybe look off, here's one of the funniest ones. I love Google. I like BigQuery a lot, and I like their ecosystem. I like all their solutions and use them all the time. But one thing I've come across throughout my career is that these tech companies will start using Google Analytics, they'll start doing Google paid ads, and so on. If they're selling consumer goods to individuals, they're definitely using Google Ads, Facebook ads, and everything else, and the Google ones can be big. Same with some B2B companies. So they're paying Google lots of money for all these applications, Firebase for their app, and they're collecting all kinds of data. And they get to a certain point where they say, wow, we're spending so much money on Firebase and Google Ads. Why are we spending so much on Google Ads? We shouldn't be paying Google so much money. We need to get our CAC down and improve our customer acquisition costs. We should be targeting better customers. And so then what do they do?
Everything plugs in nicely to the Google Cloud Platform, like all of Google's products. So then they hire machine learning, analytics, and data engineering people to work on Google Cloud. They get their CAC down. They might pay Google less money for ads, but now they're paying Google more for BigQuery and Google Cloud. So even though Google goes down over here on the ads, they go up on what they're taking out of these companies on the compute side, for putting together dashboards, machine learning, and everything else to improve what they're spending on other Google products. I think that's just really funny overall, and I think a lot of people have not thought about it fully. But it's definitely what the game is. So, very funny stuff.
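The credit arithmetic from earlier in this answer (one credit, then two, then either three or four) can be sketched as below. This is only an illustration under simplified assumptions: the function names are hypothetical, one "credit" is an abstract unit, and real warehouse billing depends on warehouse size and runtime rather than a clean formula. The quadratic case is included because the doubling pattern described in the conversation and a literal N-squared cost are not the same curve, though both outrun linear growth.

```python
def linear_credits(period: int, base: float = 1.0) -> float:
    """Cost grows by a fixed amount each period: 1, 2, 3, 4, ..."""
    return base * period


def doubling_credits(period: int, base: float = 1.0) -> float:
    """Cost doubles each period: 1, 2, 4, 8, ..."""
    return base * 2 ** (period - 1)


def quadratic_credits(volume_multiple: float, base: float = 1.0) -> float:
    """Cost of an N-squared operation (e.g. a nested-loop style join)
    when input volume is `volume_multiple` times the original."""
    return base * volume_multiple ** 2


if __name__ == "__main__":
    for period in range(1, 5):
        print(period,
              linear_credits(period),
              doubling_credits(period),
              quadratic_credits(period))
```

The first two periods look identical for the linear and doubling cases, which is why the problem tends to go unnoticed until the third or fourth billing cycle.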
Yeah, so isn't that an argument in favour of the modern data stack, where you're essentially unbundling a lot of these functions into composable elements and giving people a good amount of control over how they want their data stack to be?
You're giving them more control, and at scale, it works. If you're a big enough company, you can get good cost savings out of this. Most companies are not that big. If you're selling e-commerce and you only have a million customers, or 2 million, or 5 million, meaning actual humans out there who have completed an order with your business and become what's defined as a customer, then you're operating at a level that's small. You're probably overspending on data if you're trying to tinker with these margins and fix your Google Ads and everything else there. If you have a hundred million customers, if you're McDonald's or, I don't know, Ford Motor Company or somebody, then yes, it would potentially make sense to take this on and do all this analytics to tinker with what you're spending on Google, using Google compute to do it. I don't know if those companies are on Google, but you get the idea.
Right. And this also raises a parallel question: what are your thoughts on build versus buy? Basically what you're saying is that companies rush into buying managed services to remove that initial friction and get started, but that turns out to be a very costly mistake in the long run. At the same time, building requires engineering and development effort that is non-core to what you're doing as a business. Isn't that counterproductive? Take your example of an e-commerce company. Why would an e-commerce company with less than a million users, or whatever scale you would define, invest resources in building an internal data platform instead of focusing those resources on building the business itself? Isn't that a counterargument?
What you have to look at with these types of things is that there are three factors at play in what the cost is. There are your cloud costs, your storage and compute of data. There are all the products you may or may not want to add to it. If you wanna use open-source stuff, you're not gonna pay. If you wanna pay for a bunch of SaaS applications, you can pay there. And then there are the HR costs, the human resources or headcount costs. And when one goes up, another should go down. So if you hire more people, your cloud costs should go down, because people can spend time optimizing. And if you're hiring more people, maybe you shouldn't be using so many SaaS applications, because if you're hiring good people, they can build something, and maybe it's more efficient. But what we've seen in the last couple of years is companies spending lots of money on extra heads. The data team was two people, now it's six people, and now it's eight people. And they're using all these applications to move data around in various ways, ETL, reverse ETL, and to manage all the tables. We could do a whole bit on the proliferation of tables. So they're spending money in these three different buckets. I don't think people talk about build versus buy in those terms. The benefit of building is that you have more ownership over your data and more control. But you're also probably gonna have to hire more people to do it, or hire more expensive people to do it. And right now we see expensive people, we see a lot of SaaS apps, and we see cloud costs going up. So all three are going up. And that's been the challenge.
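The three buckets described above can be written down as a tiny cost model. This is a sketch only: the class name, field names, and every dollar figure are made up for illustration, and the idea that buckets should trade off against each other is the speaker's rule of thumb rather than a formula.

```python
from dataclasses import dataclass, fields


@dataclass
class DataPlatformCost:
    """Annual cost in USD, split into the three buckets from the conversation."""
    cloud: float      # storage and compute
    tooling: float    # paid SaaS products layered on top
    headcount: float  # salaries and related people costs

    @property
    def total(self) -> float:
        return self.cloud + self.tooling + self.headcount


def buckets_that_grew(before: DataPlatformCost, after: DataPlatformCost) -> list[str]:
    """Names of the buckets whose cost increased between two snapshots."""
    return [f.name for f in fields(before)
            if getattr(after, f.name) > getattr(before, f.name)]


if __name__ == "__main__":
    two_person_team = DataPlatformCost(cloud=100_000, tooling=50_000, headcount=300_000)
    eight_person_team = DataPlatformCost(cloud=180_000, tooling=90_000, headcount=1_200_000)
    # If hiring were paying for itself, "cloud" or "tooling" would be missing here.
    print(buckets_that_grew(two_person_team, eight_person_team))
    print(eight_person_team.total - two_person_team.total)
```

Comparing two snapshots this way makes the warning sign easy to spot: a healthy build-versus-buy trade shows at least one bucket shrinking as another grows.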
And we have this page where we list the various categories within the modern data stack. It's just our interpretation of those categories. There are roughly 28 or 30 different categories out there, and that seemed to annoy you. So, a question there. What is the one good thing that you really believe came out of this whole modern data stack revolution of the past couple of years? What is that one true thing that's not a vendor-driven narrative, not a VC-driven narrative? That one thing that everyone needed and, bam, someone solved it.
Yeah. I think a lot of people have learned SQL pretty well. A lot of people whose title may have been marketing ops or sales ops are now, I don't know, a data engineer or data scientist or analytics engineer or something. A lot of people have become more educated on data, that's for sure, and I think that's a good thing. So I would say the number one best thing that's come out of this is that a lot of people have gotten more educated. And the funny part about the categories, too, is this. Think about what's core, what you start with, and what you need to do data at your company. You need to collect data somehow. In the simplest case, it could be an Excel sheet of all your orders, and then you wanna make a chart and look at orders by day or over time or something. But the more categories you build, it all pyramids up. In the modern data stack, let's say everyone has Snowflake or BigQuery or Databricks or whatever. They've got a BI tool and they've got some way to get data in. So they've got one, two, three, they've got the whole pipe there. Then you move up one step, and some of those customers now also have observability, or reliability, or whatever you want to call it. Some of them have a catalog on top of that, but that's an even smaller number. And so you get to the point where it turns into this dependency ladder. Take observability, for example. If a product is just for the cloud data warehouse, it's not gonna have more customers than there are customers using a cloud data warehouse, right? That second number is always gonna be equal or higher, by default. And so the more assumptions you have to make about these categories, about what's already in place, how big the team needs to be, and how complex it is, the riskier it gets as an investment, and maybe it's not something that's considered necessary.
And maybe the TAM, the total addressable market, or the total serviceable market, is a lot smaller than people think.
Fair enough. And you consult for a lot of companies, ranging from startups to Fortune 500 companies, on building a data strategy, if that's the right term for it. What's your typical advice for them? For example, let's take a scenario. You have an early-stage company that has just started to hit scale. They come to you and say, fine, we have found that initial product-market fit as a company, whatever we are doing as a business has started to catch on, and now we need to start leveraging data to grow even further. What's your advice? What's the typical journey you would take with a customer in helping them understand what they need? What would be a typical flow for you?
Yeah. So, initially, it's about what matters to the business in the next quarter and the next six months. That's what matters. In the last couple of years, with the way the economy's been, the way VC incentives work, and the way these startup incentives are, it's all been growth. We don't have to worry about margin too much. Let's just grow. And so you see all these growth roles. A lot of companies have a director of growth, or growth ops, and these people are doing a lot of ads and other growth hacking types of things. So if what's important is hitting a growth metric to get to that next round of financing, or to get to that next stage you want to reach, then it's gonna be a growth initiative. Now, a lot of larger businesses that have been around a while have either gone past the VC game of growth or are just financed by the cash they collect, sometimes by debt, plus any equity financing they have. They aren't VC companies. These companies care about their margin and about operational excellence, and for them, improving CAC would be a very important thing. That comes up pretty constantly as it relates to growth. Another point comes up when people talk about centralization, the modern data stack, et cetera. A lot of these businesses that have been around for a while and have over 1,000 or 2,000 people, independent of industry, are gonna have some databases over here and some databases over there. They may have a good analyst who only works with Salesforce data and marketing data, and none of that is connected to the product data. And when people ask how things are at a different company versus the company where they work, a lot of these larger companies have something in place that resembles a data mesh a little bit.
They often have decentralized analysts, and they're typically able to serve themselves decently. Something like "let's spin up Snowflake and stuff everything in it" or "let's stuff it all into Redshift" is something they're not gonna want to do, just because they're already humming along and already very much delivering on things, and good enough works for these types of companies.
Right. And if not the modern data stack, what do you think is the way forward?
I think, because the bar has been lowered to run an OLAP database in the cloud, that's gonna stay. And that's a good way for these small companies that are growing, maybe VC-funded, to get started, get some data in place, and start delivering some dashboards. It shouldn't be the earliest thing you do, but maybe when you get to 50 or 100 people, somewhere around there, that's when you should start pulling this together. So for those types of companies, I think this is gonna stay an operating paradigm for the next couple of years. For larger companies, I think it's gonna stay decentralized like it always has been. Trying to force the centralization of data, centralizing it out of a line of business, is I think a mistake in most cases. There are gonna be a lot of failed projects as this attempts to move further upmarket.
Do you see consolidation happening in the modern data stack? Last year was about unbundling. Do you see a bundling in the modern data stack?
So I think there are already a lot of bundled solutions out there. There's Keboola, for example. You can do orchestration in it. You can do your ingestion. You can manage the flow of data in and out of Snowflake or whatever you want. Same with a bunch of other solutions. Rivery is one, and Nexla is a great one. They have a lot of good enterprise customers. So when people talk about these companies, reverse ETL and ETL companies and everything else, merging or acquiring each other, I don't think that's gonna happen, because they're all way too expensive. They're all valued at $500 million or more. And outside of Fivetran and some of the BI tools, most of these companies are under 20 million dollars in revenue. So you're not really gonna buy them for their revenue or their customer base. And if you're not gonna buy them for the customer base, okay, they may have a complementary feature, but you would just build that yourself. So I don't actually see this scenario of these companies buying one another and consolidating that way. I think the Keboolas, the Ascends, the Nexlas, the Riverys that already do multiple things are just gonna rise more and more. And I don't know what's gonna happen to a lot of these modern data stack companies. They all raised a lot of money in 2020 and 2021, and a lot of them have dozens to hundreds of employees. Fivetran makes a lot of money. Some of the BI tools make tens of millions or hundreds of millions. But everybody else is gonna run out of cash by the end of the year. So they're either gonna have to do a down round and basically put all their employees underwater, or they're just gonna have to start letting people go. And I think we're probably a couple of months away from that. If you just look at the employees, employees are a big expense.
At a lot of these companies, if you're paying an employee on average 200,000 US dollars a year, with their salary and the insurance you pay on them, and you multiply that out by a hundred employees or 200 employees, you can pretty much see how the cash is burning down in that organization. If they're not making the revenue to cover that and they're not raising new money, then I don't know what's gonna happen, and I think they're gonna have to make some tough decisions here.
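The back-of-the-envelope burn calculation above can be sketched as follows. The $200,000 fully loaded cost per employee comes from the conversation; the cash balance and revenue in the example are hypothetical, and the sketch deliberately ignores every cost other than payroll, so real runway would be shorter.

```python
import math


def annual_payroll(employees: int, cost_per_employee: float = 200_000) -> float:
    """Fully loaded yearly people cost in USD."""
    return employees * cost_per_employee


def runway_months(cash: float, employees: int, annual_revenue: float,
                  cost_per_employee: float = 200_000) -> float:
    """Months until cash runs out, counting payroll as the only expense."""
    net_burn = annual_payroll(employees, cost_per_employee) - annual_revenue
    if net_burn <= 0:
        return math.inf  # revenue covers payroll, so cash is not shrinking
    return 12 * cash / net_burn


if __name__ == "__main__":
    # Hypothetical company: 100 employees, $5M revenue, $15M left in the bank.
    print(annual_payroll(100))
    print(runway_months(cash=15_000_000, employees=100, annual_revenue=5_000_000))
```

With 100 employees the payroll alone is $20 million a year, which is already at or above the revenue ceiling the conversation puts on most of these companies.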
Let's just hope that doesn't happen. I hope there is a happy way out. But as we inch closer to the end of the episode, there's another interesting topic I'd love to talk to you about, which I saw in one of your articles: the impact of GPT-3 on tech workers. Big question. What impact do you think GPT-3, or these large language models in general, will have on the data industry? What are the most obvious effects to you?
Yeah, so the most obvious one here is writing SQL that queries into an established model. And I should say what I mean by model, because the word has gotten misused in the last couple of years by data science, analytics engineering, et cetera. I mean the data mart, the data model, whatever you want to call it. Querying into that makes a lot of sense. And it makes sense because if you've defined the entities, this is a customer, this is a transaction, this is this type of transaction, and it's all well defined, there's no reason that GPT-4, when that's released later this year, and GPT-5 after that, won't be able to generate this code very successfully with maybe only a mild amount of handholding. But the entity resolution piece is where the money is. You can go on ChatGPT right now and ask, what is the female population of New York State? Because the states are well-defined public data, and so is the population that's female, it can come up with an answer very quickly. Those entities are well defined and they're public. Now, if you go into a business and ask, how many orders did we do, and what was our net operating profit last quarter, you have all these metrics and entities that have to be defined, and I still think building that is gonna have a human element for the next couple of years.
Yeah. Because you have an orders_1 table, an orders_2 table, and an orders_final table. Okay, that's very interesting. I think we are seeing quite early results from this whole natural-language-to-SQL generation. And I think the bigger problem is not the ability to generate SQL from text. That's a solved problem and it works really well. The bigger question, as you rightly said, is how you organize the knowledge within an organization so that you can actually understand where to query this data. So that's a fair point. Is there anything else you see for data teams specifically in the applications of large language models?
Yeah, I think GitHub Copilot is gonna continue to spread in popularity. It was controversial. I know Amazon, and I don't think it was AWS, I think it was Amazon's retail side of the house, actually banned Copilot because they were concerned about what code was potentially being collected by GitHub, which is owned by Microsoft. So there will be a little bit to watch there, but I think that if the automation really saves so many human working hours, then it's only gonna continue. And I know plenty of people who are already saving half their time or more using this.
Amazing. As a closing thought, and more than a prediction, what would be your top three insights for data teams for this next year? What are the top three things you would advise data teams, and companies that have built out their data functions, to focus on over the next year? One obvious one, as you've rightly said, is cost. People will start to think about how much they're spending on the warehouse, on ETL credits, and so on. What would those things be for you?
Yeah, that's one, I think, and the human costs are what's most expensive, right? That's why we're seeing so many layoffs. So I would advise anyone listening: if you're a data scientist, data engineer, or analytics engineer, really figure out where you fit into the company. As I've talked about, a lot of these data jobs have become human middleware. You're not on the application side, you're just in the middle. You're not doing analytics, you're just in the middle. The more in the middle you are, the worse off your situation might be. So I would encourage people to think about that. Second, I think we're gonna see the rise of Ops. Over the last decade we've seen developers and DevTools: make a tool that does this thing, and you can turn this knob over here, and it's open source, and we have a community around it. We've seen this everywhere, not just in the data world, but much more broadly. And I think that as a lot of these companies rethink this, they'll see that a lot of these developers are just doing Ops. They're just doing operational things. A lot of data especially is really just operations, and it takes more and more steps, tools, and people to get a number on a dashboard. If someone can come in and do that better, with Retool or Airtable instead of a database and a BI tool, they're gonna win, because they're gonna be cheaper and they're gonna get it done more efficiently. I think we already see this. And the last one is that I would encourage every company to do an audit of how many tables they have in their cloud data warehouse. This is where the costs add up, with every table you add. If you have a medallion architecture, where you have a staging layer, then another layer, then the final production layer, gold, silver, bronze, or whatever it's called, how many tables do you actually have ingested? And how normalized is it?
Because one of the big trends was that your OLTP would be normalized, then you would ETL it, and your OLAP would be de-normalized to serve reporting, BI, analytics, and everything. And we've moved away from that. Now we're just copying things from an OLTP to an OLAP, from different endpoints that come off APIs. Just bring it in de-normalized and then roll it back up. If you look at costs, at efficiency, and at where your team's time is being spent, the more tables you make and then roll back up, the more of that is wasted time. And so I think a lot of people have realized that and have started to move to more de-normalized structures as a first goal when they're building out their Snowflake, BigQuery, or whatever else.
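A first pass at the table audit suggested above could look like the sketch below. It assumes table names of the form schema.table and hypothetical layer names bronze, silver, and gold. In practice the list would come from the warehouse's own metadata, for example an INFORMATION_SCHEMA.TABLES view where one is available, rather than being typed in by hand.

```python
from collections import Counter


def tables_per_layer(qualified_names: list[str]) -> Counter:
    """Count tables per schema, given names of the form 'schema.table'."""
    return Counter(name.split(".", 1)[0].lower() for name in qualified_names)


def tables_per_ingested_table(layer_counts: Counter, source_layer: str = "bronze") -> float:
    """Total tables in the warehouse divided by tables in the ingestion layer."""
    ingested = layer_counts[source_layer]
    if ingested == 0:
        raise ValueError(f"no tables found in layer {source_layer!r}")
    return sum(layer_counts.values()) / ingested


if __name__ == "__main__":
    # Hypothetical inventory; the table names are made up for illustration.
    inventory = [
        "bronze.orders", "bronze.customers",
        "silver.orders", "silver.customers", "silver.orders_enriched",
        "gold.daily_orders",
    ]
    counts = tables_per_layer(inventory)
    print(counts)
    print(tables_per_ingested_table(counts))
```

A ratio that keeps climbing from one audit to the next is a rough signal of the table proliferation described in the conversation, and a prompt to ask which downstream tables are still being read by anyone.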
Interesting. Thank you so much for that lovely piece of advice, and thank you so much for this amazing conversation. Lauren, it was such a pleasure to have you here on the show, and I hope we all learned quite a few things from this episode.
Yeah, thank you so much for having me. I appreciate it.