Feb 28, 2023 · 33 min

S02 E02: Building for Scale: When and How to Invest in Data Platforms with Brennon York, Head of Data Platform at Lyft

In this episode of the Modern Data Show, Brennon York, Head of Data Platform at Lyft, gives insights into the critical aspects of the data platform ecosystem in the early stages when there is no scale. Brennon also discusses the structure of the data platform team and new emerging technologies within the modern data stack that have impressed him, such as machine learning orchestration systems like SageMaker, Union.ai, and Flyte. The episode provides valuable insights into building a data platform that can scale with the growth of a company, enabling businesses to stay competitive in the fast-paced technological landscape.

Available On:
Google Podcasts
Amazon Music
Apple Podcasts

About the guest

Brennon York
Head of Data Platform at Lyft

Brennon York, the Head of Data Platform at Lyft, is a true expert in the field of data with years of experience under his belt, including his current role leading the data team at Lyft. He has also held leadership positions at other companies like Simpatica Medicine and Capital One, and even served on the Board of Directors for the University of San Francisco Data Institute. Brennon's knowledge and skills include distributed computing, data science, and data analysis, and he has a proven track record of creating innovative research prototypes for businesses. And not only is he a leader, but he also acts as a mentor, coach, architect, and technical lead, and has a leadership style that focuses on humility, clear communication, and keeping the team motivated.

In this episode

  • Rationale behind building vs buying
  • Building for scale
  • Thoughts on Modern Data Stack
  • Supporting tech innovations in big companies


Hello everyone. I'm excited to welcome Brennon York, the Head of Data Platform at Lyft, to our podcast, the Modern Data Show. Brennon is a true expert in the field of data with years of experience under his belt, including his current role leading the data team at Lyft. He has also held leadership positions at other companies like Simpatica Medicine and Capital One, and even served on the Board of Directors for the University of San Francisco Data Institute. Brennon's knowledge and skills include distributed computing, data science, and data analysis, and he has a proven track record of creating innovative research prototypes for businesses. And not only is he a leader, but he also acts as a mentor, coach, architect, and technical lead, and has a leadership style that focuses on humility, clear communication, and keeping the team motivated. Welcome to the show, Brennon.
Thank you. Thank you. Appreciate it.
So Brennon, tell us something more about your journey. You started your career as a computer consultant and then, switching across multiple profiles over the years, worked your way up to your current position as Head of Data Platform at Lyft. We would love to understand from you, how did that journey happen?
Yeah, it's a funny story, I guess. I got my grad degree in computer security, actually. But even while I was in grad school for that stuff, I think what attracted me to the security realm was, you know, it was just very hard, right? Mathematically it's very intricate and tricky. That led me through a number of areas, and all the while, I found my niche in security, in distributed systems. So that's where kind of the start of, you know, this distributed systems world comes in. That got me playing with Hadoop when Hadoop was coming out and it was all just MapReduce and things like that. And I just fell in love. I always had a heart for the operating system and making things extremely performant and efficient. And then when you get into things like distributed systems to do MapReduce, and you learn about Lamport clocks and all that other fun stuff, it's very challenging to a new degree. And I just always had this, I don't know, love for making 10,000 machines do one thing very well together, right? Data then comes after that. And you learn about machine learning and model training and all these things and what you do with these 10,000 or more machines, but that's where everything got started. And then it snowballed. I just followed that interest, and it started with security to secure those things. And then that moved into building analytical tools on top of MapReduce for other companies. And then it just kept on going and going until here I am. But I just followed the passion and yeah, that, that took me to Lyft.
Nice. And I'm sure all of our listeners know what Lyft is and the kind of business you guys are into. But talking specifically about your role at Lyft, walk us through your role at Lyft right now.
Yeah, so director of data platform basically means that, my teams and I, we work through managing all of the, I would argue, front of house and back of house for data. So that means anybody who's using the app, there's events that fire and trigger those go through all the streaming, the online services that trigger, real-time ML, things like that, experimentation platforms that run through there. You know, then you get into S3. And you get storage, then you have the backend which is a lot of the traditional, your ETLs, your orchestration systems, your batch compute systems that then create, various reports or, model training, et cetera on the other end. And we run basically all of that. At this point we look at everything as you know, trying to be the framework or the series of APIs that are for internal Lyft customers, right? To just move quickly with data. And just go from there.
Amazing. And you know, just so that our audience can put things into perspective, what kind of data volumes are we talking about?
You know, it's funny. So I believe, I was checking the numbers recently, and we compute on about a hundred petabytes of data a day. I don't know if I can necessarily go into the event volume,
right? Yeah.
Because you can kinda extrapolate there. But let's just say they're very high. We're, I think the candid thing is, we're not on the order of Facebook or anything like that; we aren't into exabyte scale. I would argue we're in petabyte scale. Which is a lot, right? That's still more than enough. I think, you know, one thing is Lyft being completely in the cloud. I think there's a big difference between scale when you own your own compute and scale when you don't; there are very different challenges when you deal with on-demand, spot instance types, reserved instances, especially as, I'm sure we're gonna talk about in a minute, I feel like the market kind of moves to crunch all this stuff. You know, I do think there are very different, unique challenges there. And one of the companies that we look up to is Netflix, right? They were, a long time ago, very cloud-first, cloud-native, and have paved the way for a lot of other companies to start looking at cloud as the sole kind of backbone for their compute. And we have a great partnership with them and a number of other companies. But they spearheaded a lot of it, and you see them probably on the forefront of making more tools than most of the rest of us when it comes to cloud tooling and things like that.
Yeah. And that brings to my next question in terms of like you say you guys love cloud, right? And as a head of data platform, while making your decisions in terms of building your own stack for various use cases or something like that, do you have an affinity to use any of the managed service provided by these cloud providers? Or do you guys tend to wherever possible build stuff on your own and what's the rationale behind whatever your answer is?
Yeah, the classic build versus buy. I think whatever I say is probably gonna make somebody angry out there, because it is very opinionated to some degree. I think as an engineer, I'd love to build. As, you know, someone who has to represent the business, I very much respect the buy, and I think there are these varying trade-offs. I would say, again, a lot of it, as I'm still learning in my own career, is that the trade-off in my mind comes very much from the business need at the end of the day. So there are a couple factors that come in: you know, what's the run rate? How much money do you have? Where can you put those resources, if you're gonna have someone actually build the thing? And then, I think, what's the long-term ROI there? Sometimes you might take a layered approach, where you would go with a managed service today and also build your own solution at the same time. Because long term, the maintainability of the long-term solution may be a little bit higher, right? But your ability to mold with the business becomes so much faster than with a managed service. So you've gotta weigh all those things in, and I think there's a lot of times where, admittedly, especially with a lot of these big cloud providers, I would say the managed services are great. You're probably not gonna get a lot of features necessarily as quickly as your business wants to move. But if you need something simple, you know, Kinesis or something for Pub/Sub or whatever, if you just need Pub/Sub, right? And you can work within the limits of Kinesis, just use it. It's probably gonna be more cost-effective. Now, the moment you need something else or more than just the simple Pub/Sub, you know, choose whatever you want. You're getting into the Kafka territory, for instance, or something like that, and you wanna use all these custom features, it's gonna be a while before you're gonna get those potentially out of Kinesis, right?
You gotta go talk to Amazon, and you know, it's a natural kind of evolution. So for me, I tend to try to look at things like that and just say, where do we need to be with the business? Like at the bleeding edge, and where do we just need to maintain? And I think when everyone looks at their data stack, you've gotta look at those things, right? You've got Pub/Sub solutions, you've got orchestration solutions, you've got batch compute, you've got online compute; you could even bucket in just those four categories. And I'm gonna leave out machine learning right now, 'cause I think that's a way more nascent territory but also has its own buckets as well. And choose where you want to be at the edge and where you don't, right? We run our own Flink, we run our own Beam. As I think everyone could probably imagine, Lyft is a very online business. It's a very online company, for those exact reasons. We need to make sure that we are with the business when we're online. Batch compute and other things like that, we actually do run open source, but we don't need to be nearly as cutting edge, right? We can fast follow and use managed services and things like that.
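One way to keep the "start with Kinesis, move to Kafka later" option open, as Brennon describes, is to hide the messaging system behind a thin interface. The sketch below is purely illustrative (not Lyft's code, and the class names are hypothetical): a toy in-memory backend stands in for whichever managed or self-hosted service implements the same two methods.

```python
# Illustrative only: pub/sub behind a minimal interface so the backend
# (e.g. a managed Kinesis wrapper vs. self-hosted Kafka) can be swapped
# without touching application code.
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Callable


class PubSub(ABC):
    """Thin interface; Kinesis- or Kafka-backed classes would implement it."""

    @abstractmethod
    def publish(self, topic: str, message: str) -> None: ...

    @abstractmethod
    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None: ...


class InMemoryPubSub(PubSub):
    """Toy backend standing in for a real messaging service."""

    def __init__(self) -> None:
        self._handlers = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, message: str) -> None:
        # Deliver the message to every subscriber of this topic.
        for handler in self._handlers[topic]:
            handler(message)


bus = InMemoryPubSub()
received = []
bus.subscribe("ride_events", received.append)
bus.publish("ride_events", "ride_requested")
```

The point of the indirection is exactly the trade-off Brennon names: you get the managed service's cost profile today without welding its API into every caller.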
Yeah. You know, I know you probably won't be able to get into specifics, but what do you think, what percentage of your systems are off-the-shelf managed services versus the ones that you have built? Roughly?
Where do you put open source?
So open source, if it is hosted by, let's say a managed MongoDB by Mongo, I would say off the shelf.
Okay. Okay. But if we host it, then it's...
That's yours. That's yourself yeah.
I would say, quick swag, we're probably 60/40. We host our own.
60/40, you host your own. Nice.
70/30. We host our own.
Got it. Okay. Fair enough. Now the next question that I have is, and this is a common fallacy that I think most engineering leaders fall into, building for scale when the scale is not even there. And you probably have experience across a wide spectrum of companies. You have seen the scale of Lyft. You probably have seen different scales. First question, what are the most important things within the broader ecosystem of data platform? What's the single most important thing that people need to think about in the early stages, when there is no scale, to be able to be ready or prepared when the scale comes in? What are those core components where you would say, spend as much as you can in the earliest days possible?
You know, it's a great question, and I think, in my personal opinion, the thing that matters most at every scale is the product, right? Data doesn't matter if you don't have a product, or if you're selling something, if it's B2B, B2C, whatever. I would let that be the driver. The way I look at things nowadays is, scale when you need to. I think that's why these managed services are amazing, because they work zero to one and one to a hundred. Maybe they don't work to a thousand or 10,000, you know, and when I talk scale, it's factors. I would argue at every factor you should just deal with the scale when it comes. You know, I think, what was it, Knuth that basically said premature optimization is like the worst thing you can do, effectively. And arguably, I think that is actually the pitfall that many people fall into: we think we can get ahead of the game and predict it. It's pretty rare, right? Even if, let's say, you go hire a senior engineer who had done way more scale or whatever, and they come in and you're like, hey, I want you to prep for X, Y, or Z. Well, what you still don't know is whether you're actually gonna hit X, Y, or Z scale. And you may end up wasting, effectively, those resources, right? Everything comes down to money at the end of the day. And while it'd be great to be prepared for it, if it never comes, that's probably a worse problem than if it does come and then you kind of need to put some band-aids on it. Or, to the same point, there are a number of managed services that can handle the scale for a lot less than to go off and run your own whatever, build your own X, Y, Z thing that is exponentially more cost-prohibitive, I think, in the moment. Like, it's not gonna come out well on your balance sheet for many years. And so I think taking that risk is almost, in my mind, nowadays just not worth it. It's better to start with a bunch of managed services or a lot of these vendor solutions and things.
I think they work great. Personally, I wish vendors and these other companies had a better relationship to know when their scale ended, and you could say, hey, it's not for us anymore, you know. And unfortunately you have to go through that sad divorce, which doesn't always work out super well. But I think at the end of the day, that, in my mind, is what I would advise everyone. Scale when you're there, right? Don't scale thinking you're gonna be there. I guess the other caveat I will say is, for Lyft, we've done a lot of work in effectively simulating our ride volume. So one thing that I can say is, basically, it's a giant performance stress test on all of our services, microservices, API, data platform, et cetera. And I think that allows us to handle peak loads when they happen. I think as you're looking towards scale, and you need to be at least dynamic enough to handle these random peaks, those are the next kind of worlds that you would want to invest in. That'll usually shed light on areas in which you can go focus.
Wow. Amazing. And just to understand a little bit more about the way you have structured your team, you know, talk about this data platform team. What does a data platform team even do? How is it different from, let's say, traditional teams like head of data, the teams under the head of data, where you're more responsible for analytics and BI? How is data platform different from analytics and BI functions?
Yeah. I would argue the big differentiator is that they're our customers still. So data engineers, scientists, analysts, all those, you know, and the tools that surround their work, to some degree, are for them to own. We're kind of a much more centralized function. You know, our job and ability is really to make sure that it is easy to build whatever you need. Tables, reports, models, those kinds of end-state artifacts that are gonna power the business. You know, at the same time making it easy for, say, production engineers to create a new event, have that run through the pipeline, test it, et cetera. I think, back to the question of how we structure ourselves, effectively, I look at the world as online and offline, with that storage layer in the middle that kind of links everything together. Which, even before my time, they had chosen a file format and a structure and everything, which I think was probably one of the smartest decisions, right? We started with Parquet; could have gone with ORC, either or, but we had Parquet, and I think having that good, compressed columnar format was the best ground. At the end of the day, get your, what people are calling the data lake, correct? And then you can build outwards from there, both online and offline. And so everything we do revolves around effectively reading and writing Parquet, you know, into our data lake. And so that's the first separation, right? And then I think everything comes in, admittedly, around resourcing and things like that. In offline we have teams around, basically, our core components. We have, you could argue, effectively three teams. I could break it up into more; let's just say it's some version of compute and query, right? Then there's orchestration, and then there are kind of the front-end systems. We have some of our own, we have some vendor solutions, just easy ways to access the data, the logs, be able to diagnose what's going on. In the online world, we have the same thing.
We have a couple, but I would argue, like, online front-end systems, as you're well aware of Amundsen, right? That being one of them. We also have some internal things to do data quality checks, things like that. And then you get into our actual stream compute systems. And in between is that layer of persistence, right? These are the teams that leverage our stream compute systems to run these customized Flink, Beam jobs, et cetera, to kind of handle the volume of events and get those written in.
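The intuition behind Brennon's point that a compressed columnar format like Parquet is "the best ground" for a data lake can be sketched in a few lines of plain Python (illustrative only; real Parquet uses far richer encodings): pivoting rows into columns keeps like-typed, repetitive values together, so even a simple run-length encoding compresses well.

```python
# Illustrative sketch: why columnar layout (as in Parquet or ORC)
# compresses better than row-oriented storage. Not real Parquet code.
rows = [
    {"city": "SF", "status": "completed"},
    {"city": "SF", "status": "completed"},
    {"city": "SF", "status": "canceled"},
    {"city": "NYC", "status": "completed"},
]

# Pivot row-oriented records into columns, as a columnar writer does.
columns = {key: [row[key] for row in rows] for key in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded


# Four row values collapse to two pairs; at Lyft's petabyte scale this
# kind of redundancy is exactly what a columnar format exploits.
encoded_city = run_length_encode(columns["city"])
```

Row storage interleaves `city` and `status` values, breaking up these runs, which is one reason columnar files are the standard substrate for analytical data lakes.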
So is stream compute a major part of the work you guys are doing? I have heard Apache Flink, Beam, Samza and everything. So does it take a substantial chunk of the work that you guys are doing in your team?
Yeah, I would say compute on both sides of the house is probably the most substantial. So Flink, Beam on one side; Spark, Trino on the other side. And that is where the majority of our power goes. If those go down, right, Lyft can't effectively do its job nearly as well. So that's our most critical point.
The next question is probably a bit of a controversial one: what are your thoughts on the modern data stack? I believe, as a data leader, you probably have 10 emails in your inbox every single day where there is a vendor who is pitching the new category that they have created with this amazing tool and amazing technology. So my first question to you is, have you seen any implementation of any of these new tools and technologies that actually made a difference to you or your teams? And if possible, if you can take names, we would love to hear those success stories of things that actually worked for you guys.
Yeah, it's a great question. I think my team, if they are hearing this, would hate the answer. And it's, I don't think we've done enough exploring on this front, to be honest. And I think that admittedly gets into one of those pieces that we talked about in a prior question. I think there is a reality of the business need and the market, right? Like, the market is compressing a lot of these tech companies lately. I'm sure we're all reading this stuff in the news. And so I think naturally we're moving from growth to profit, and I think, as many people would want to say, oh, well, that doesn't really affect these lower layers, but it really does, right? Everything then becomes about efficiencies and things like that. What gets cut? Well, innovation effectively gets cut. Your innovations become running these things extremely reliably at a very low cost; you focus more on your reliability, your SRE style of work, than you do around, I think, Hudi, Iceberg, Delta Lake, these kind of new upsert-level tools and systems that I think are moving us in these way better directions. But we personally don't get to necessarily play in that realm right now; we have to just keep the business online. That doesn't mean that we can't innovate, but like I said, I think, candidly, that's actually where I get really excited. Like I said, my whole journey has been around making things really efficient. You know, once you start capping your number of resources, then all you get to do is be more efficient. I find that actually quite intriguing. But it's not like we're gonna be launching a new Amundsen anytime soon, whatever that new thing would be, unfortunately. I think we're gonna make some incremental progress, effectively a portfolio approach of things. But our portfolio for long-term endeavors is gonna be a lot smaller. So yeah, unfortunately, I don't think we're gonna be there nearly as much.
I will say my own controversial opinion on all this is, you know, there were some papers that came out a long time ago by Google and everything else on streaming. I generally align with the fundamental principle that streaming is the basis for everything, and batch is a subset of streaming. And I know that's the Lambda/Kappa architecture everybody argues about, but my fundamental thought is that database technologies need to catch up in such a way that the event volumes that we're seeing in stream can be leveraged in effectively near real time. We need those reads and writes in those databases to go down, along with the guarantees that we typically see and want: immediate consistency, not eventual consistency, and things like that, right? Normally you have the CAP theorem and you've gotta trade off things. And so I think that's what we wanna see, right? We want all that stuff to get a lot faster. That way you can do a lot of this real-time, online computation. We're starting to see that with the Druids, the ClickHouses, things like that, so we're getting there. I actually know a buddy of mine who's been working on this technology, Pilosa, for a long time, which is very cool technology, very nascent in this kind of database world, which is all bitmaps and things like that, going into the same realm as ClickHouse and Druid. And I'm super excited to see where that goes. 'Cause I think all of those technologies are where we need to get to, paired with streaming. That's gonna be the way.
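The "batch is a subset of streaming" view Brennon aligns with can be shown in a tiny sketch (illustrative only; real engines like Flink or Beam add windowing, state backends, and fault tolerance): if a computation is written as an incremental fold over events, running it on a bounded input is just the special case where the stream ends.

```python
# Illustrative sketch: one operator serves both streaming and batch.
# A streaming consumer reads every intermediate result; a batch
# consumer only keeps the final state after the bounded input ends.
from typing import Dict, Iterable, Iterator


def running_counts(events: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Emit the per-key event count after every incoming event."""
    counts: Dict[str, int] = {}
    for event in events:
        counts[event] = counts.get(event, 0) + 1
        yield dict(counts)  # snapshot, so earlier results aren't mutated


events = ["ride", "ride", "cancel", "ride"]

# Streaming view: intermediate results are available as events arrive.
stream_results = list(running_counts(events))

# Batch view: same operator over the same (now bounded) input; we just
# discard everything except the final state.
batch_result = stream_results[-1]
```

This is why, in the Lambda vs. Kappa debate, the Kappa position argues a separate batch layer is redundant in principle: the fold is identical, only the boundedness of the input differs.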
Nice. Amazing. And just on a very quick side detour, have you ever explored any of the materialized-view databases, for example Materialize (materialize.com) or ksqlDB? Is that something you guys have ever considered?
No. No. Have not. Yeah.
Okay. So next question. You know, you talked about these streaming databases and databases that can adapt to changes. Other than this, has there been any kind of category or any new emerging technology within the whole modern data stack that has impressed you, and, with the business constraints and the business need all sorted, is something that you would love to try out someday?
Yeah, again, I think admittedly, I've been so focused, I haven't had time to necessarily move out as much. You know, one of the things that I'd love to pull in would be the Iceberg/Hudi world. I know I mentioned that already, and I think that's also not super new, but that's an area that, very tactically, I'd love to be able to try out. Another area that I wish we could pull in is the Alluxio world, of this kind of pinned memory. I still have this dream that it could actually help us speed up a ton of things, but again, it's a resource constraint issue. And the last one is, just in general, the area that I'm super passionate about, or I really love seeing, is where a lot of the machine learning world is going. We see a lot of this with SageMaker and Union.ai, Flyte, which came out of Lyft. A lot of these machine learning orchestration systems and things like that are moving forward. I think that whole world is really cool. And I think that's demonstrating the new needs, right? It goes back to that streaming database thing, but also compute tied with caching, tied with versioning and kind of dynamic tasks and things like that. So it'll be interesting to see where all the machine learning, AI world plays out in a lot of this.
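The "compute tied with caching, tied with versioning" idea behind ML orchestrators like Flyte can be sketched minimally (illustrative only; this is not Flyte's API, and the `task` decorator here is a hypothetical stand-in): cache each task's result under a key built from its name, a version tag, and its inputs, so re-running a pipeline skips any step that hasn't changed.

```python
# Illustrative sketch of version-keyed task caching, one core idea in
# ML orchestration systems. Names and semantics are hypothetical.
import functools

_cache: dict = {}
call_log: list = []  # records which tasks actually executed


def task(version: str):
    """Decorator: memoize a task's result by (name, version, inputs).

    Bumping `version` invalidates old cache entries, mimicking how
    orchestrators rerun only the tasks whose code or inputs changed.
    """
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args):
            key = (fn.__name__, version, args)
            if key not in _cache:
                call_log.append(fn.__name__)
                _cache[key] = fn(*args)
            return _cache[key]
        return inner
    return wrap


@task(version="v1")
def featurize(x):
    return (x, x * x)


@task(version="v1")
def train(features):
    return sum(features)


model = train(featurize(3))
model_again = train(featurize(3))  # second run: both results come from cache
```

Real systems add distributed storage for the cache, content-hashing of inputs, and dynamic task graphs, but the cache-key shape is the same basic mechanism.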
Nice. And I also happened to notice that you co-authored this book called Spark: Big Data Cluster Computing in Production. How did that happen?
Like everything in the world, it starts with an email, and you say yes to something when you probably shouldn't have. And I will say it was great. It was a super fun time. I am not an author. I'm not a great writer. I'm sure the editor had a lot of work to do for me. You know, for anyone out there who's debating a book, I would encourage you: write 10 pages of something and then say whether you wanna write a book or not. Because it was a great experience, but I don't know if I would necessarily do it again, just because I think for certain people it's just a slog to get through, you know, writing that many words. But hopefully people found it useful. It was at a time when Spark was coming out, I think. I was there working with a lot of the Databricks, the early Databricks folks and things, and so I had known a lot of things then. Now I'm definitely less well-versed.
And you have also, you were academically involved as adjunct faculty at the University of Maryland. Do you see a gap in terms of what academia has to offer versus what the industry needs? And if there is a gap, how do you think young professionals, or people who are just getting into the industry, can bridge that gap?
Yeah, so my time in Maryland, admittedly, was also, I was actually teaching a grad course in computer security, on system administration, things like that. A lot more of what is probably more apropos is the time when I was on the board in San Francisco, because I think they asked a lot of the same questions, right? How do we prep our young people who are coming out with the right skills and technologies and tools? And I think, at the end of the day, in my mind, nothing beats trying things out, prototyping. I think, yes, there's candidly a level of luck plus motivation and grit that gets everybody everywhere. But there's the reality that mostly all of the solutions that everybody runs at all of these companies are open source to some degree, right? The vendor solutions, most of them are open source, right? Confluent, Kafka, right? Amundsen, Flyte, obviously Spark, Databricks, et cetera, right? There are these pairings, or, you know, they're out there: Starburst and Trino. But Trino runs at Facebook as well. I think it's easy to pull these things down, make some incremental improvements, right? That's a lot of where I got my start: pulling down Spark, starting to make edits. I made edits to the build scripts and things because I was like, this is taking too long. And so I think just having a little bit of motivation, just drive to be passionate about something and follow the passion. I think that's the one thing I tell everybody: be passionate and go with it. And so at the end of the day, I don't know if I would augment, I think every public curriculum is a little bit different. What I've focused more of my time on lately is coaching people who want to transition from other careers into a tech-based career, whether it be in data or something else. And I think that's where these questions come in again, of do I go back for a four-year degree?
Do I go try to get a master's somewhere, or do I go to one of these boot camps or something like that, which are becoming quite accredited? You know, in that domain, I see a lot of places where I think these boot camps, getting people's feet wet, shall we say, in a nine-month or one-year program, that's a lot of where I've been spending my time and focus now: again, those career transitions and how to augment all these other skills that I think we have. I would argue that's the bigger need. People are gonna be passionate about computer science; they're gonna come out. It's the lawyers, the nurses, the doctors, the, you know, the other marketing practitioners, recruiters, et cetera, that all want to start writing code and automating things for themselves. And I think that's a whole world of people that also have great ideas that we wanna engender.
Yeah. Nice. And you know, just before we wrap up today's episode, let me leave you with one last question. Lyft has been a hotbed for innovation when it comes to data products. You have Amundsen coming out of it; you've mentioned Flyte coming out of it. Two kind of interlinked questions. A, what would be your advice to individuals who are working in big companies like Lyft, working on some of these amazing, cool projects and technologies, to be able to incubate or create such projects within their organization, and yet be able to spin that project off away from the organization, share it with the community, and eventually probably build a business out of it? And B, what is it that these companies can do to support that culture of innovation and the amazing capabilities of those individuals, to be able to contribute not just to the company, but to the rest of the world? What are your thoughts on that?
Yeah, no, great questions. I think, to answer the first one very candidly, focus on the business need is my advice. And I know that's maybe not what engineers necessarily want to hear, but I think you have to build a product that solves a real need, even a data product. First you have to demonstrate that it solves a real need. And I think that is oftentimes where I see things end sooner than they should: because it gets very pie in the sky, and, oh, this thing could handle everything and do everything. And so my advice is build incrementally and build to a need. Whatever that need is, right? That's the same thing for a startup, right? Build to a real need. And that is exactly why, for what we've seen, Amundsen and Flyte, those were built for real needs that existed inside Lyft. And that is why they came to fruition and they snowballed, right? And they got to a point. I think the second thing, which Lyft does really well, is you gotta let it go. To answer the second part of your question, I think, if I could give advice to any company, more often than not, while we think that it may be a competitive advantage, and there could be some argument for that as well, letting these technologies go and flourish often becomes more beneficial back to the inside of the company as well. Because, you know, what happens is, let's say you let go this 10-person team who is building this thing, and you let them open-source it. They're gonna go and keep building on that thing because they're super passionate about it. And then company X, or whatever, that let it go is then just gonna reap those benefits. And so you become an incubator of these things, and I think we see this over and over again. There might be some companies who wanna hold some of these things in, but I think oftentimes, especially in tech, many of us are quite altruistic, and, you know, we do see this continue to come out.
Again, and again, and again. Twitter was prolific with a lot of this; I think they started a lot of the kind of open source streaming world with Storm, and then it snowballed from there, you know. So I think we see this with a lot of the tech companies. I don't think we're necessarily gonna see people holding back. But, you know, you'll probably get features and components held back, and I think that's okay.
That's a great perspective. That's a great perspective. So thank you so much for your time again, Brennon. It was such a pleasure having this conversation with you. I'm sure our listeners will love it.
Appreciate it. Thank you. Thank you so much for having me.
Thank you again for your time.