May 30, 2023 · 25 min

S02 E14: Transforming Data Pipelines for the Future: An Interview with Sean Knapp, CEO of Ascend.io

Uncover the secret to turning data engineering into a superpower! Sean Knapp, CEO and founder of Ascend.io, joined us to discuss the value of depth and breadth in capturing the entire data value chain, emphasizing the need for an automation layer to adapt to the evolving data landscape. Ascend's platform enables intelligent data pipeline creation and management, with a dynamic control plane that detects and responds to changes in real time across extensive pipeline networks. Sean also explored the potential of generative AI in data engineering and shared his optimism about the future of the modern data stack, foreseeing consolidation and the emergence of new parallel spaces in the data ecosystem.

Available On:
Spotify
Google Podcasts
YouTube
Amazon Music
Apple Podcasts

About the guest

Sean Knapp
CEO and Founder

Sean Knapp is the founder and CEO of Ascend.io, a leading company in data pipeline automation. With a remarkable background as a co-founder and CTO of Ooyala, Sean has extensive experience in scaling companies, raising funds, and overseeing successful acquisitions. He also made significant contributions to Google's revenue growth as the frontend technical lead for the Web Search team. With a strong educational background from Stanford University, Sean brings valuable knowledge and expertise to Ascend.io. The company provides a comprehensive platform for data pipeline automation, offering intelligent capabilities to detect and manage changes, ensure data accuracy, and measure the cost of data products.

In this episode

  • Sean Knapp's background as a software engineer and his experience with data pipelines at Google.
  • Challenges faced by data engineers in terms of productivity and creating powerful data products.
  • Ascend.io's comprehensive solution for data pipeline automation and its global presence.
  • The three planes (build, control, and ops) in Ascend.io's approach to automating and managing pipelines effectively.
  • The potential of generative AI in data engineering to simplify tasks, bridge skill gaps, and optimize code, accelerating individual growth and career progression.

Transcript

00:00:00
Welcome back, data enthusiasts. You're tuned into another exhilarating episode of the Modern Data Show, where we dive deep into the ever-evolving world of data and its incredible impact on our lives. Today, we have Sean Knapp, the CEO and founder of Ascend.io, which is at the forefront of data pipeline automation, revolutionizing the way organizations build and manage intelligent data pipelines. Welcome to the show, Sean.
00:00:23
Thanks for having me, Aayush.
00:00:24
Sean, let's start with the very first question: tell us a little bit more about your background. We'd love to know more about Ascend and what it does, but before we even get into what Ascend does, we'd love to get the backstory of how it all happened.
00:00:38
Yeah, happy to. My background story is getting longer and longer, so I'll try to make it compact. I started my career as a software engineer; I did my undergrad and master's in computer science at Stanford way back. But my early career was actually at Google, and I think the reason that's important and relevant here is that I ended up using data pipelines within a month of starting my job at Google in 2004, writing MapReduce jobs to analyze web logs. While my primary job was pushing pixels on the website, running experiments, and doing all these other interesting things, I very quickly found myself being an accidental data engineer, even really early in my career. So really, over the last 19 years, I've spent a lot of time in and around the data ecosystem. I was fortunate enough to be an early user of BigQuery, which back then was called Dremel, and when I founded my first company with a couple of other Googlers back in 2007, we had our own Hadoop cluster running within six months and were already running a big data stack. We built that out to about a 60-person data team before we were acquired, much later on, by a much larger company. The reason a lot of that took me to building out Ascend was that, as a former data engineer and a former CTO, I really saw the trends starting to emerge, and not just around "every company is a software company, which means every company is a data company." As I looked at how the landscape was maturing, it was becoming abundantly clear that the impending challenge facing all of us in the data ecosystem was going to be developer productivity. It was no longer going to be a challenge of size and scale of data, the classic volume, variety, velocity, et cetera. It was going to be: how do we, with our limited data engineering resources across every industry and every company, produce more? How do we get ourselves out of the weeds and create these really powerful, compelling data products? To me, that's an automation problem. In late 2015, I got a bunch of really great investors excited about this idea, and we started to hunker down and build some really hard technology to help bring a whole new wave of automation into the data pipeline space.
00:03:15
And how big is the company? What stage are you at right now? Tell us a little bit about the current state of the company.
00:03:21
Yep, great question. We've raised 50 million dollars from Tiger Global, Sequoia, Accel, and Lightspeed Ventures; our last round was a Series B last year. We're headquartered in North America, but we have recently opened up offices in both APAC and EMEA as well, and we're continuing to grow pretty nicely, even despite the macro market headwinds today. We have customers around the globe and across industries, which we're very flattered and excited about.
00:03:58
Okay, amazing. Now let's dive deeper into what Ascend does. I looked through the website, and one thing that caught my attention is the way you've described three different planes within pipeline automation: the build plane, the control plane, and the ops plane. Tell us a little bit more about that, and tell us why people even need something like Ascend. Why not something like plain, old, simple Airflow DAGs?
00:04:27
Yeah, absolutely. It helps to first think about this notion of an operations plane, a control plane, and a build plane because, despite it being a little academic as a philosophy, we find that most maturing technology spaces do need an architecture around this. As we think about the modern data stack and all the amazing innovation that's happened over the last few years, we've seen this massive surge, and I would contend sprawl at the same time, of all these amazing tools and technologies. Without a well-designed framework and architecture around them, that pushes a huge amount of burden onto engineers to go and integrate everything, and it oftentimes feels like we're all just writing a ton of code connecting all of these various systems together, which feels very brittle and ends up making us connection engineers rather than actual platform and data pipeline engineers. So when we think about a framework, we think about it in this notion of operations, control, and build. The really important part here is the control plane. The evolution of most mature technology spaces, those that have something continuously running, generally moves from an imperative world to a declarative world. If we think about Airflow, it is an imperative scheduler, and an amazingly powerful tool and technology. But imperative schedulers generally operate on timers and triggers, and they're generally not overly intelligent tools. When I say not intelligent: they're very powerful, but they run code on timers and triggers, they produce side effects, and they're unaware of the side effects they produce. They just know to run one task after another based on the dependency chain. The reason that becomes very scary is that the side effects of your code have ripple effects across your ecosystem, and the thing driving and powering all of it is unaware of those side effects, so it can't help you do anything about them, right? When you're writing a pipeline and you read some data and write it somewhere else, the scheduler doesn't know the nature of that data; oftentimes it doesn't even know that it wrote that data. That burden is on you as an engineer: to track it, register it, trace the lineage, validate it, and do all the other things. What we see in rapidly maturing spaces, for example container orchestration with Kubernetes, is that you go from a simplistic scheduler-based model, which is classically an imperative construct, to what really is a context-aware, domain-specific control plane, which is a really fancy way of saying a really smart, badass scheduler. The idea is that when you make that shift, you can move into a declarative model where the system actually understands what is happening: when you run code, what are the side effects of that code? What is it producing? What is the dependency chain? As a result, as developers we can lean on that automation, on the control plane, to handle all the things that could break on us, and instead we get a lot higher productivity; we can pull ourselves out of the weeds. So when we think about this notion of a control plane, what it really amounts to is that it is like a scheduler, but in a declarative model.
It tracks incredible amounts of metadata, and it ends up becoming this metadata backbone and bus. That's how it connects up to the upper level, the operations plane, because everything we do on operations (reliability, costing, observability, et cetera) taps into that same metadata. But the power is that with a control plane, you can now actually lean on it to drive more of the activity and track the side effects and the operations metadata tied to them, which really helps connect those two other planes.
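To make the imperative-versus-declarative contrast concrete, here is a minimal sketch. The first half is ordinary Airflow 2.x code; the second half is a hypothetical declarative spec invented purely for illustration. It is not Ascend's actual API, which the episode doesn't show.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # read raw events from somewhere

def transform():
    ...  # write cleaned events somewhere else

# Imperative: the scheduler fires tasks on a timer and knows only the
# task ordering, not the data the tasks read or write, nor their side effects.
with DAG("daily_events", start_date=datetime(2023, 1, 1), schedule="@daily") as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_extract >> t_transform

# Declarative (hypothetical spec, NOT Ascend's real API): the developer states
# what each dataset is derived from; a control plane can fingerprint code and
# inputs, track lineage itself, and re-run only what changed.
pipeline_spec = {
    "datasets": {
        "raw_events": {"source": "s3://my-bucket/events/"},  # hypothetical source
        "clean_events": {
            "from": "raw_events",
            "sql": "SELECT user_id, event_ts FROM raw_events WHERE event_ts IS NOT NULL",
        },
    }
}
```

The contrast: in the imperative version, the ordering `t_extract >> t_transform` is everything the scheduler knows; in the declarative version, the system knows what `clean_events` is derived from, so a control plane can decide for itself when a rebuild is needed.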
00:08:41
Understood. Something I saw on the website while reading about the product is that Ascend as a company tries to capture the entire value chain of data. I saw you've got over 300 connectors that allow people to pull in data from various operational systems, put it into a data warehouse, and manage the entire data life cycle. That approach is quite atypical when we're talking about companies within the modern data stack, where what we see is a lot of companies targeting a very specific problem within the data life cycle and going deep into that. What's your take on going broad versus going deep into something?
00:09:32
Yeah, it's a really good question, and I think you need both. The one part I want to be really clear on first is that Ascend has really deep partnerships and really deep connectivity with a large amount of the ecosystem. Obviously we're really close partners with Snowflake and Databricks and Amazon and Google, but we also partner with a lot of other companies. We have many customers who, on the observability side, are using Monte Carlo or Great Expectations. We have a number of customers using third-party catalog tools that we integrate with. We actually partner with a company called CData to offer many of our data connectors. And so while it certainly looks like an end-to-end solution, part of our goal and value prop is to pull together a lot of these capabilities, tied to that metadata bus that powers the control plane itself. The reason I think this is so important is that in nascent, very early technology spaces, you tend to see this massive proliferation of tools as people sprint to create a huge amount of value in particular vertical spaces tied to the larger space. When we think about the maturation cycle in technology, never mind vendors and all the other stuff, just as technologists: it tends to start with a bunch of pip packages you install, that eventually evolves into some very verticalized SaaS tools, and over the course of time you start to see fewer vendors and fewer technologies, and they actually go wider. The reason is that in the early days it's this gold rush of which teams at which companies can harness the power faster, and you get really high-horsepower teams building up a lot of these capabilities. What happens, however, is tech debt starts to set in, and we spend all of our time integrating all these tools and technologies at the speed the space is evolving. We have a saying internally: last year's innovation is next year's anchor. So what we see with a lot of teams is that they built these really impressive stacks pulling together a bunch of tools, but all of a sudden the space is moving so fast, and you're trying to stay ahead, and now you're spending 80% of your time just on tech debt, trying to evolve your stack while everybody else is sprinting past you. That becomes this very exasperating and mentally challenging thing that we see with a lot of teams. Last year, you were the team way out ahead; you were writing your Medium post about how awesome it is and all the cool things you have. Now you're desperately trying to keep up with everybody who read your Medium blog post and said, hey, we can do that plus one, and now they're out ahead of you. So over the course of time, and we think of it as like a high watermark, the most common patterns and the best practices become far more clear, and as those patterns become emergent, the benefit is less about grabbing a bunch of point solutions and integrating them. Instead, we think there's huge value in having an automation layer.
Very similar to how Kubernetes is an automation and control layer for a really broad and diverse ecosystem, when we think about Ascend's strategy, our goal is to take the depth and the breadth of the ecosystem and pull it together so that it is a more seamless, lower-maintenance-burden, and far more automated experience for users and customers. That allows them to keep up with that pace of innovation, as opposed to falling behind as they try to pull everything together themselves.
00:13:30
Amazing. And tell us, Sean, a little bit more about the kind of customers you work with. At what stage in their life cycle would a customer be an ideal fit for you? Do you work mostly with early-stage companies, mid-scale companies, enterprises, or the giga-size companies?
00:13:54
All of the above, and I'd say that we certainly find what we describe as an inverted bell curve of resonance. To be really clear with folks: some of our smallest customers are literally data or analytics engineering teams at small startups, or just small businesses where it's a guy, a girl, and a dog as the whole team, and they're building out their first major deployments. At the same time, we have folks who are Fortune 100 multinational conglomerates running complete multi-cloud, multi-data-cloud data meshes and leaning heavily on the automation. I think that's part of the excitement: the power of automation actually serves both. It gives a small team tremendous leverage, and it gives these huge organizations with incredibly complex data ecosystems the same kind of leverage, which is really neat. The thing I'd highlight too: in the middle, we have a lot of tech startups. They may have 20- to 50-person data teams, and they use Ascend to automate a lot of their pipelines. The reason I talk about the inverted bell curve of resonance is that most data teams tend to go on a two-to-three-year architectural cycle. We're finding this less today, but we do find that teams tend to get emotionally and intellectually pot-committed to a particular architecture. So if they've recently committed to an imperative architecture, they're generally not in a position to re-evaluate and say, hey, we actually want to go to a declarative model. To me, it's very similar to how Kubernetes is a really hard sell to a team that just decided they really want to go with Docker Swarm or do their own manual container orchestration.
00:16:00
Right. And Sean, another thing: what does the whole implementation process look like? Because we're talking about a lot of moving elements, and I would assume that probably no customer of yours realizes the value of all of these moving components from day one. What's the typical journey of a customer who implements Ascend, and how does that journey evolve?
00:16:27
Yeah, absolutely. We first start to get touch points with customers in two ways. One, we have a classic sales force that is going to go knock on your door, and hopefully you answer, because I promise they will make your life much easier and you'll be much happier, and they are amazing people. The other is we have a free cloud offering: we have a free developer tier, and it's free forever. Go to cloud.ascend.io, simply sign up for a free account, and use it to power your data plane, whether it's Snowflake, Databricks, or BigQuery. Both tend to be how folks start with Ascend. For our larger customers, we'll generally do four-week trials, and we actually help them really learn and understand this declarative model. What we usually focus on with folks is: hey, don't try to just re-architect everything from day one. We also try to get people to stay away from the feature-and-function death-by-spreadsheet, the "do you do this thing and this thing" exercise, which is generally trying to make a decision by mass consensus and tends to slow down a lot of folks. We usually see folks adopting this by asking, hey, what are our core principles? It's usually lower maintenance burden, faster development cycles, better data reliability, and it's the team saying, hey, we want to move faster. So where we usually first engage folks is: go pick two or three data pipeline use cases, the things that are just driving you crazy with your existing architecture, that are really painful, really expensive, really slow to either build new capabilities on top of or simply maintain. It's amazing how fast, literally within days to low single-digit weeks, entire systems can be migrated over, put on full autopilot, and truly running as intelligent data pipelines. That's the beauty of data pipelines: usually you can connect them even across disparate systems. That's where we see a lot of the initial adoption, and then a lot of customers generally have mandates, which is: hey, if something's running, just leave it; don't try to migrate it over. But the second it goes bump in the middle of the night and wakes somebody up, migrate it to Ascend, as the maintenance burden is way lower, things just don't break, and generally it'll also be more cost-optimized.
00:19:12
And Sean, how do you quantify the ROI for the investment in Ascend?
00:19:18
Ah, good question. I think it's two- or three-fold. The obvious one is just actual dollar cost. The things that we're able to do as an automated platform are really hard to do with manual pipelines, like with a classic Airflow-orchestrated pipeline. A really good example: we're able to do things like generate individual jobs per partition to optimize incremental data propagation, so you can run on smaller warehouses more efficiently and compress costs. We're also able to do things like job and data de-duplication: if you're doing dev-to-stage-to-prod workflows, we're able to de-duplicate all of the work associated with that, which significantly reduces your Snowflake or Databricks footprint. Similarly, we're able to do really advanced things like mid-pipeline pausing, restarts, and rollbacks, which also reduce your cost, as you don't have to replay all of your data for all time. There are a lot of these really advanced capabilities that just help you optimize your spend. That's the easiest one. The second part that I think really matters is a combination of developer satisfaction and productivity. Most teams are now being asked to do twice as much with hopefully the same resources, hopefully not fewer. In the current market landscape, we see a lot of companies freezing their headcount or even reducing it, but they're not actually changing the demand on their team, which really sucks for the team, because you have these incredibly talented, hardworking data experts who now have to maintain just as many data pipelines. When you cut headcount, the maintenance burden generally doesn't go away, so even for the awesome people you kept, you're disproportionately pushing more and more of their time and workload into maintenance and sustaining work, and that makes it really hard on those teams. So it's both: can you help make them more productive, and even more importantly, from a satisfaction perspective, all the residual effects of happier developers when you embed technology that offloads all the maintenance stuff and frees your really high-powered team to go build new things again, which is what everybody wants to do. That's where we see continued expansion, both in satisfaction and reduced attrition, and then just the overall productivity of the team itself.
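As a rough illustration of the partition-level de-duplication idea, here is a minimal sketch of fingerprint-based incremental processing: a transform is re-run only for partitions whose code or input data changed. All names here (`Partition`, `fingerprint`, `plan_work`) are hypothetical and invented for this sketch; this is not Ascend's implementation.

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass
class Partition:
    key: str        # e.g. a date like "2023-05-30"
    data_hash: str  # content hash of the partition's input data

def fingerprint(transform_code: str, input_hashes: list[str]) -> str:
    """Stable hash of the transform's code plus its input data hashes."""
    payload = json.dumps({"code": transform_code, "inputs": sorted(input_hashes)})
    return hashlib.sha256(payload.encode()).hexdigest()

def plan_work(partitions: list[Partition], transform_code: str, cache: dict) -> list[Partition]:
    """Return only the partitions whose code or input data changed."""
    dirty = []
    for part in partitions:
        fp = fingerprint(transform_code, [part.data_hash])
        if cache.get(part.key) != fp:
            dirty.append(part)
            cache[part.key] = fp  # record the new fingerprint
    return dirty

# Example: after the first full run, only the partition with new data is replanned.
cache: dict = {}
parts = [Partition("2023-05-29", "aaa"), Partition("2023-05-30", "bbb")]
plan_work(parts, "SELECT 1", cache)       # first run: both partitions are dirty
parts[1].data_hash = "ccc"                # new data lands in one partition
print([p.key for p in plan_work(parts, "SELECT 1", cache)])  # ['2023-05-30']
```

The same fingerprints can serve the dev-to-stage-to-prod case: if a transform's code and inputs hash identically across environments, the work has already been done once and need not be repeated.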
00:22:12
Amazing. We have talked a lot about automation, so let's talk about the one thing that's on everyone's mind: generative AI. Sean, two questions here. One: what should we expect in Ascend's offering when it comes to generative AI capabilities that can help your customers? And the second is more general to the industry, or specifically the data industry: where do you think generative AI will be an incredible game changer?
00:22:43
Yeah. So I think users can expect that Ascend will really march down what I would describe as a co-pilot, not auto-pilot, strategy. The reason I think this really matters is that we're very careful and sensitive to the tiers of things we can use AI for. I think of it from a data sensitivity perspective. The first tier is purely things AI can help with that require no customer context whatsoever. The second is things it can help with that require metadata context, for example: how many jobs are running? What's their performance? What's the code that's running? But it doesn't require data access itself. The third tier is things that require actual data. We're obviously very sensitive as we move through those levels; where we're drawing the line right now is after that second level, and we'll be very cautious as to how we step into that third level. When we think about the effect this has on data engineering in general, the framework I like to use is this: for pretty much all of us white-collar workers, the vast majority of our jobs breaks up into three stages for every task, and then we wash, rinse, and repeat. The three stages are ideate, create, and refine. The place where we actually spend the vast majority of our time is the creation part. This happens to be where AI is very good, and moreover, I would contend, this tends to be the most boring part of the job, because it's fundamentally limited by the laws of physics, the slow laws: for example, how fast can our mouths move, or how fast can our hands type? Over my 20 years, I've generally found that ideation is when I'm running at the highest clock speed. Creation is when my brain is just getting really frustrated, because my hands simply can't type fast enough to keep up with how fast I want to experiment and explore ideas. Then refinement is actually pretty fast again. So using this framework, I think AI has tremendous potential, because for me, I would love to run at max clock speed all day, every day. I get so excited by all the amazing things I could go do if I could just get my hands, metaphorically speaking, to type as fast as my brain can go. So when we turn our focus to how we can leverage AI in the creation part of what we do, I think there are a few different examples, and to continue down this general theme of frameworks of threes, I think there are three things we can use AI for there. The really obvious one first is the things that we as engineers just hate doing. No engineer likes documenting their code. No engineer likes analyzing performance data to provide recommendations. Most engineers don't like writing tests or data quality checks. AI is awesome for this; it's actually very good. Generative AI is very good at analyzing code and describing what it does, and imagine analyzing all the code of a pipeline along with all the metadata around the lineage and the performance metrics. That would be amazing, and I think this is one of the easiest places to start to help and delight users, because it just makes our jobs so much easier. Then the next stage really becomes the "how do I do X?" teaching and training.
We see a lot of our users really excited about bridging from SQL to Python, and look, I've been coding for 19 years, literally; even as the CEO of a startup, I still spend my weekends coding, and there are parts of Python where I'm still learning new things every day. I can only imagine somebody coming from a SQL world to a Python world and trying to write a Snowpark for Python transform or a PySpark transform and trying to understand all those nuances. This is where generative AI can be really helpful: it helps people bridge and learn powerful new skill sets an order of magnitude faster than they ordinarily could, which is super cool. That part I'm really excited about. It's like an Iron Man suit for everybody who wants to go be a data engineer. So that's the second. And then the third, still in what I call the creation domain, really gets into the "help me optimize" things. As we do a lot of our internal prototyping and experimentation, we already see that AI can generally be really good at helping you optimize your code. Say you're trying to sessionize user data and you're doing a big join versus window functions; things like that, where it helps you by analyzing the complexity. Again, all three of these together really compress down the cycle of creation, which gives users this incredible power. And the reason I think this is so cool and really important, as we've certainly had these conversations internally: I understand that there's a lot of fear around, is AI coming for our jobs? The way that I look at this is that AI is actually an accelerant for every individual who embraces it to propel their way through their career. As you use AI to learn more, when AI gives you an answer to something, dissect it, learn from it; it just short-circuited you Googling and reading Stack Overflow for the next half hour. Not only does it do that, but it actually helps propel you into an architect-style role so much faster, and it helps you ideate and move pieces around on the puzzle board faster and really flex those muscles way faster. So how we look at it internally, and how we're hopefully helping a lot of our customers look at it, is that this just up-levels so much of your team and gives everybody these superpowers they just didn't have before.
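To ground the sessionization example Sean mentions, here is a minimal PySpark sketch of the window-function approach: split each user's event stream into sessions wherever the gap between consecutive events exceeds a threshold, with no self-join. The 30-minute threshold and the tiny inline dataset are illustrative assumptions.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sessionize").getOrCreate()

# Toy event stream: (user_id, event timestamp in seconds).
events = spark.createDataFrame(
    [("u1", 100), ("u1", 160), ("u1", 4000), ("u2", 50)],
    ["user_id", "event_ts"],
)

SESSION_GAP = 30 * 60  # 30 minutes of inactivity starts a new session

w = Window.partitionBy("user_id").orderBy("event_ts")

sessions = (
    events
    # Timestamp of the user's previous event; null for their first event.
    .withColumn("prev_ts", F.lag("event_ts").over(w))
    # Flag rows that start a new session (first event, or gap too large).
    .withColumn(
        "is_new_session",
        (F.col("prev_ts").isNull()
         | (F.col("event_ts") - F.col("prev_ts") > SESSION_GAP)).cast("int"),
    )
    # Running sum of the flags yields a per-user session number.
    .withColumn("session_id", F.sum("is_new_session").over(w))
)

sessions.show()
```

The naive alternative, a self-join of the events table against itself to find each event's predecessor, scans the data twice and blows up on heavy users; the window function needs only one partitioned sort. This is exactly the kind of rewrite a code-analyzing assistant can suggest.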
00:29:39
Wow, that was an amazing thought, Sean. As we inch closer to the end of the episode, let me ask you for some parting thoughts: where do you see the modern data stack evolving from here? The modern data stack itself has seen a lot of ups and downs; it has seen its peaks and troughs by now. What do you think is the future of the whole modern data stack?
00:30:07
Yeah. So I think we have a really exciting future. When I zoom out and look at the macro market, the big data industry is hundreds of billions of dollars of annualized revenue. That's not to say this is the TAM for the modern data stack, but when I look at our macro ecosystem, it is so large that it can sustain trillions of dollars of market cap. That's really the first thing I anchor on: the data space is massive and amazingly large, so it can sustain a tremendously large number of really interesting, really large companies. Against the current backdrop, though, I do think we'll see some compression in the number of companies, and this is very natural, right? You hit this nascent stage of any market and you get this massive explosion of all these really cool companies. It reminds me of that old saying: are you building a tool, a product, or a company? Oftentimes it's unclear, and so I think we're going to see a lot of tools, and even products, start to get merged in. I think that gets accelerated a little bit by the macro market funding cycle, too. Because of how hot the data space was, a lot of companies raised a lot of money and really spiked their burn, and I think that puts a number of companies in fairly precarious positions, so that's going to accelerate some compaction. We already see this: Snowflake and Databricks are doing a number of acquisitions, small token acquisitions right now, but I think that will continue. Over the course of time, we'll wind up with a smaller number of more holistic, broader companies with broader product portfolios, and in doing so, we'll see those companies continue to grow. I think of it as a bit of a pyramid, in the sense that as the space matures, everybody gets pushed up towards the top, so you see consolidation as you get much larger players. I think we'll continue to see that, while at the same time seeing a lot of really exciting new companies in new, parallel, emergent spaces.
00:32:33
Perfect. Sean, thank you so much for your time on the episode, and I hope both we and all of our listeners had an amazing time listening to this whole episode. Thank you again for your time.
00:32:44
Thanks for having me.