Oct 04, 2022 · 30 min

S01 E04: Navigating the future of Modern Data Stack with Chris Riccomini, Seed Investor

There is a lot of content out there about what the future holds for the Modern Data Stack from the vendors' perspective, but very little about what is actually going to stay relevant in the market. To understand this, we have Chris Riccomini joining us for this episode on navigating the future of the Modern Data Stack.

Available On:
Google Podcasts
Amazon Music
Apple Podcasts

About the guest

Chris Riccomini
Seed Investor

Chris Riccomini is a software engineer, an author, an investor, and an advisor with more than a decade of experience working in tech companies like PayPal, LinkedIn, and WePay. He started his career as a software engineer at PayPal back in 2007, after which he transitioned to a data role as a Senior Data Scientist at LinkedIn in 2011. At LinkedIn, he was one of the creators of Apache Samza, which is LinkedIn's streaming system infrastructure built on top of Apache Kafka (similar to Storm, Spark Streaming, Flink, etc.). After six years at LinkedIn he joined WePay, a payment platform where most of his time was spent running the data infrastructure and managing the engineering team. Chris also recently co-authored a book called “The Missing README: A Guide for the New Software Engineer”, a book that has golden nuggets for budding software engineers.

In this episode

  • Evolution of data engineering and its practices.
  • Hypothesis for investment in Modern Data Stack.
  • Future consolidation of tools in the data space.
  • Projects emerging from big tech companies.
  • Future of the MDS from the market perspective.


Welcome back to The Modern Data Show. For today's episode, we have Chris Riccomini with us, who is a software engineer, an author, an investor, and an advisor with more than a decade of experience working in tech companies like PayPal, LinkedIn, and WePay. He started his career as a software engineer at PayPal back in 2007, after which he transitioned into a data role as a senior data scientist at LinkedIn in 2011. At LinkedIn, he was one of the creators of Apache Samza, which is LinkedIn's streaming system infrastructure built on top of Apache Kafka, similar to Storm, Spark Streaming, Flink, et cetera. After six years at LinkedIn, he joined WePay, a payment platform where most of his time was spent running data infrastructure and managing the engineering team. Chris also recently co-authored a book called 'The Missing README', a book that has golden nuggets for budding software engineers. Welcome to the show, Chris. We are excited to have you as a guest.
Thank you. It's my pleasure to be here.
First of all, Chris, tell us a little bit more about your journey from being a software engineer to your later roles as a data engineer and leading a data engineering team.
Yeah. So when I started my career, I actually started more on the data science side of the house, really doing data visualization and exploration at PayPal. We just had a lot of data, especially transaction data. I was on their fraud team, and they just didn't understand what was going on a lot of the time. They had so many payment flows that we did a lot of visualization just to see what was happening. When I switched over to LinkedIn, I spent a bunch of time on the data science side but quickly realized that a lot of the high-leverage work getting done at the time was really more on the infrastructure side of things: simply adding more data to the model, getting more features, speeding up the model training cycles. So I switched over to engineering pretty quickly at LinkedIn and got involved in Hadoop, converting some of their machine learning models, specifically their 'People You May Know' algorithm, to Hadoop and productionizing it. Once I had that epiphany, I was down the road to engineering, but I was spending more time on infrastructure development, and really that's most of what I did while I was at LinkedIn. When I joined WePay, it was a much smaller company, and philosophically I wanted to keep WePay from developing some of the not-invented-here syndrome that I had seen at LinkedIn. Also, the data and infrastructure ecosystem had evolved in such a way that now there were vendors, people you could pay to get reasonable infrastructure. So my role at WePay shifted from infrastructure development to what I would call a data engineering role. At the time it wasn't called that; I think it predates the term. So it was setting up the data pipeline using a cloud data warehouse.
We were very early into BigQuery, for example, very early into Airflow, and very early into Debezium as well. Our data pipeline was really those three things: Debezium, Kafka, BigQuery. And then Airflow was our orchestrator for running queries in the cloud.
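For readers unfamiliar with this style of pipeline, here is a minimal sketch of the change-data-capture idea behind it. It assumes Debezium-style change events carrying `op`, `before`, and `after` fields and replays them into an in-memory dict that stands in for a warehouse table. The `apply_change_event` helper, the `id` key, and the sample rows are hypothetical illustrations; the actual WePay pipeline code is not described in this conversation.

```python
# Sketch only: replay Debezium-style change events into a dict that stands in
# for a warehouse table. Field names follow the Debezium event envelope
# ("op", "before", "after"); everything else here is a made-up illustration.

def apply_change_event(table, event, key="id"):
    """Apply one change event to `table` (a dict keyed by primary key)."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # create, snapshot read, or update: the new row image is in "after"
        row = event["after"]
        table[row[key]] = row
    elif op == "d":
        # delete: the old row image is in "before"
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unknown op: {op!r}")
    return table


events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "pending"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "pending"}},
    {"op": "u", "before": {"id": 1, "status": "pending"},
     "after": {"id": 1, "status": "settled"}},
    {"op": "d", "before": {"id": 2, "status": "pending"}, "after": None},
]

table = {}
for event in events:
    apply_change_event(table, event)
```

In a real deployment the events would arrive through Kafka and land in the warehouse rather than a dict, with the orchestrator scheduling the queries that run on top.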
Amazing. And Chris, what do you think has changed, for good or for bad, in data engineering since the days you worked back in 2007 to what you're seeing now, apart from the whole explosion of tools that we have?
Well, that was the thing I was gonna say: for both good and bad, the number of options you have is pretty staggering. When I started back in the day, there really wasn't a lot of choice. Even when I was looking for an orchestrator at WePay, going back seven years, there was Airflow and there was Luigi, and Airflow was fairly new. Other than that, we were back in the days of Oozie and Azkaban and stuff. Now there are three modern orchestrators all duking it out, and they're all great. So I think that's definitely one thing. In terms of other things that have changed, I think things are less integrated now as a result of all the tools, and that can be a little bit frustrating for users, and for me personally as well. A lot of the work we were doing toward the end was really about gluing all these different solutions together. Those are the two things that come to mind.
And what do you think from a people perspective? What has changed from an organization perspective? Now we are seeing mature data teams set up in various organizations. Data engineering was an ad hoc thing a few years back, but it has become so much more mainstream. So what do you think has changed from a people perspective?
Yeah, that's an excellent question, actually. A few things come to mind. One of them is organizational. One of the things I was personally pushing for quite a bit at WePay toward the end was essentially the federation of the tools that we were running. We had a centralized data engineering team. In my opinion, a centralized data engineering team doesn't scale beyond a certain point, so you need to start having other teams help you out with some of that stuff. In order for them to help you out, whether that's curating pipelines, tagging metadata, or handling data quality checks, whatever it is, they need to have the tooling to do that. So the data engineering team shifts to be more like a dev tools, DevOps, or dev platform team that provides the tools that enable the entire org to be their own data engineers, in a way. So that's one organizational thing: a federation of a lot of these tools and pipelines. The second one is the advent of the analytics engineer, which I think is a fairly new term, but once I saw it, it immediately clicked with me, because we had that pattern at WePay. We had these, I think we called them business analysts, who sat near the data engineers but were much more data focused. At the time, the relationship between the business analysts and the data engineers was very amorphous: who owns what, where things should sit, who's doing the queries, who's owning the dbt, and blah, blah, blah. I think that stuff is starting to get hammered out a bit more, and analytics engineers are really carving out a space: what it means to be an analytics engineer, what it means to do that work, what kind of tools they have, and so on. So I think that's for the good, and I think dbt is probably driving a lot of that.
Yeah, organizationally and people-wise, those are the two things that I see. At a meta level, you just see ever-increasing specialization, right? There are data engineers, there are now business analysts, there are now data scientists. I think you mentioned 2007. If you go back to 2007, there was one person doing all of that, right? The data science team at LinkedIn used to have a state of the union thing that they would put together, which was all the business metrics and stuff for a given year. They were also doing the data science stuff, and they were also working with the infrastructure engineers on transformation and ETL and whatnot. So we've grown a lot in that regard.
Yeah, amazing. And Chris, you have been an early investor in a lot of data companies: Confluent, Prefect, StarTree, Stemma, Anomalo, Transform, and many more. As an investor, what's your hypothesis in this space?
So at a meta level, my hypothesis is that the data ecosystem and modern data stack is gonna figure itself out and become ubiquitous across all companies. We spent maybe the last decade or two building out a lot of the infrastructure, right? We've got the cloud data warehouse, we've got Kafka, we've got all the AWS and GCP stuff that you get from that. And now I think we're building out a bunch of the tooling that is a level or two up from that: data quality checking, data catalogs, streaming transformation, stuff like that. There's gonna be a bunch of these tools. So for me, it's just about figuring out which categories I'm interested in and then finding teams and products that I like in those categories. I think I had a list, maybe 10 or 12 items long: headless BI, reverse ETL, stream processing, real-time data warehousing, data catalogs, data mesh, on and on. Right now there's data products and data modeling stuff happening. That's what I would say is my cobbled-together hypothesis. In practice, a lot of it is just opportunistic. A lot of those early ones you mentioned were just people I knew in my network. So I got really lucky working with just fantastic people like Kishore at StarTree. Or Jeremiah from Prefect, which is one of the companies you mentioned: he's just somebody I worked with on Airflow open source and met on the mailing list. So it's less structured than it might appear.
Yeah. And what would be your advice to fellow investors who are looking to find these hot data companies? Any specific word of advice you would share?
Oh gosh, I have a hard time giving advice, because I feel like everybody's story is a little bit different in terms of how they approach it and how they got there. I'm hesitant to give advice, especially to fellow investors. They are probably doing as well as I am, or better. But I guess philosophically, I try as best I can to help people out and be unselfish, and to go in knowing that things aren't always gonna work out. It'll be fine.
Yeah. One of the other things that we keep hearing a lot, and I would appreciate your thoughts on this, is the question: is the modern data stack suffering from the symptom of, hey, I've got this solution, now let's go out and find a problem?
I think there are definitely shades of that, and some of the categories I just rattled off in that list have that symptom. I'll pick on one. Reverse ETL is one that I'm personally not a huge fan of. I see pragmatically that there is some value there and it is solving some problems, but in my opinion, there are better long-term architectures and solutions for the problems they're talking about. I think Eric Sammer, for example, has done a good job of outlining thoughts that I share. He's posted on that extensively. He's the CEO of Decodable, which is a real-time streaming platform. In my opinion, that style of architecture is better suited, for a number of reasons, to solve the kind of problems that reverse ETL purports to solve. So I think there are definitely shades of that in some places. But in actuality, I think most of the problems that these vendors and all this chaos are coming from are real problems. Data quality is a real problem. ETL, moving data around, transformation: that's all real stuff. Even some of the stuff that's a little more sketchy, like the metrics and headless BI stuff, I think is a real thing. I have been at organizations where no one agrees on revenue. Going back to that comment I made about the state of the union at LinkedIn: we were cooking up these metrics ad hoc, on the fly, and even year to year they were evolving sometimes. So agreeing on what a metric is and having it live in one place is actually a real thing. By and large, I would push back against the idea that a lot of this stuff is a solution in search of a problem. I've been at organizations that don't have data quality checks. It's not fun. I promise you, it's not fun.
I've been at organizations that have unstable orchestrators and can't execute tasks, and on and on. I've been at organizations that are rolling their own ETL, and it's all bad. It's all bad. So yeah, I think they're solving real problems for the most part.
Yeah, I agree. And you talked about streaming a little bit, but I'll get back to that in a while. Before I jump into that, another follow-up question. In one of the previous episodes, one of our guests said that we'll see a lot of consolidation in the data space because of feature overlap. In the example you just mentioned, reverse ETL, we are seeing a lot of ETL companies having reverse ETL as part of the offering itself. What are your thoughts on consolidation?
Hundred percent
As an investor?
Yeah. Hundred percent agree. I could be proven wrong, and a lot of these verticals could end up being multi-billion dollar verticals, but I think it's far more likely that there are going to be some winners that end up slurping up a lot of these verticals, and you end up with a more integrated solution, which is just a better experience. Nobody wants to go to eight different cloud UIs across eight different vendors to manage their data stack. That's just not what they want to do. That's not what I wanna do. So I think it's highly likely that there will be some consolidation. I think you're already seeing some of this with the orchestrators. Airflow, for example, slurped up, I forgot the name of it, but essentially the Marquez company, Julien's company. And I think orchestrators are well positioned. I thought for a while maybe the data catalog stuff would go that way as well. That remains to be seen, but there's a lot of affinity in the orchestration layer around lineage and data catalogs and that kind of thing, so that seems to be the running narrative. I also see a lot of stuff happening with data quality. I think that's another interesting area with a lot of directions you can go. That's Great Expectations, Anomalo, Monte Carlo, Bigeye; there are a lot of those too. So which direction? I don't know, but I think it's going to happen, and the prevailing wisdom is probably that the orchestration layers are gonna start. They're also well funded. Look at how much money Astronomer or Prefect has. I don't know the funding for Dagster, but just from Astronomer, you can see that they've got a good chunk of cash. So, highly likely. I guess the other direction would be these larger companies, like Databricks or something. So we'll see, but yeah, I think consolidation is likely.
Yeah. One of the other trends that we are seeing very often in the modern data space is projects developed in big tech companies, the LinkedIns, Ubers, and Lyfts of the world, transforming into full-fledged businesses. We have seen some good success with Confluent, and with Stemma emerging from project Amundsen, and I think you have personally seen a lot of these journeys firsthand, right? So first of all, why do you think that's a great idea? And what are the common pitfalls that an engineering person leading these projects in those organizations should be aware of?
Yeah. Okay. That's a really good question, and probably something that's not talked about enough. The most common pitfall I see when a team is trying to bring an open source project from a company into its own startup is essentially having multiple factions within the same project. That either manifests as multiple different startups for the same open source project, or as an adversarial system where you have the startup with half the team members, and the other half of the team members are still at the parent company, and the incentives no longer align on the direction, the roadmap, what needs to get built and when, the velocity of the project, and so on. That pattern is very common, and it's really detrimental to the project and to the success of the companies, frankly. That is by far the biggest thing. Figuring out how to deal with it is gonna be unique for each individual situation. It depends on the team members, how many people are involved, sharing incentives, and stuff like that. But figuring that model out is just critical, because if you get into a situation where you no longer control your open source project, or people are forking or reverting each other's commits, which I've actually seen, it gets really nasty. That's the big highlight to me. On the positive side, for most of the folks that take those open source projects and spin them out, the company usually seems to support them pretty nicely. It's usually amicable, it's not usually rancorous, which is surprising when you're taking a chunk of talented engineers from an organization. In some cases the company will invest, and in some cases it will not, so it's worth being aware that it may or may not happen. So that's actually a healthy thing, the way companies have navigated this.
The individual people in the company are sometimes tougher.
Yeah. As an extension to this, another question that comes to my mind is: what are your thoughts on the future of the modern data stack from an open source versus commercial offering perspective? Take the example of the ETL space. You have seen Fivetran emerging as a kind of leader in the commercial offering, but at the same time we have tools like Meltano and Airbyte that are competing really hard. Specifically from an ETL perspective, one of the biggest selling points for these open source solutions is addressing the long tail of connectors, which is difficult for a commercial offering to cover. So, two questions here. One, what's your take on open source being a direction for the modern data stack? And two, if you were to pick an ETL company to invest in now, what would you be looking for? What is it in the ETL space that you think is still unsolved, where if you see a pitch, you'll have your checkbook ready?
Yeah, I'll start with the second question first. As for the stuff that's unsolved, or what would get me really excited about an ETL system I came across: one, I'm a huge fan of real-time ETL. So anything real time, I'm about it. Hate is a strong word, but I've lived in a world with batch ETL and I've lived in a world with real-time ETL, and the real-time ETL world is definitely better. In terms of spaces, SaaS ETL is something that I think needs more attention. I spent some time with the Jamstack last year, and I had this epiphany that we're getting near a point where you could build software products that are essentially just a bunch of SaaS glued together. But then ETL stops being about ETLing from your OLTP database and your event system, and it becomes much more about getting all the data from all the various SaaS vendors you have into yet another SaaS vendor, which is probably some cloud data warehouse. There are definitely systems that do this. I would argue that some of these are more batch based and less friendly. I think there's some innovative thinking going on in this area, so that's definitely something that gets me excited. I've had some recent conversations with folks that got me really thinking in that space. To your first question around open source, the caveat here is that I'm so steeped in the open source world that it's hard for me not to think about things in that frame of mind. So I'm of course very, very pro open source and into it. That's just...
But at the same time, you're invested heavily in commercial offerings as well.
Yeah, but many of the commercial offerings have open source, right? Probably the majority of them. Tabular has Iceberg, Confluent is Kafka, Prefect is Prefect, StarTree is Pinot, right? These are almost all open-source-to-SaaS-vendor type companies, and I think that model is pretty reasonable. But if you wanna do open source in, say, the next five, 10 years, simply coming to the table with 'I'm gonna do open source, and then I'm going to have cloud-hosted open source' is probably not going to be enough. I think you're going to need to build something a little bit more than that. The model of something like Elastic or Mongo or Confluent is, I think, not going to work as well, because this space is so much more competitive now than it was five, 10 years ago. In my opinion, the way that's gonna look is people are gonna have to start solving for, like, use cases or verticals. It's taking open source and applying it to fintech or to health or to IoT or to edge computing or to some specific vertical. I think that's gonna be the way forward. That's my personal theory. We'll see. The reason I don't think that just cloud hosting open source projects is gonna work is that literally everyone else does that now. It's the cookie-cutter pattern of looking at what Mongo and Confluent and Elastic and all these other companies did. The velocity on that is so fast now that you're instantly gonna have N competitors. So you need to offer something more than just...
One thing you just mentioned is real-time ETL, right? And Chris, you have been one of the creators of Apache Samza, which is a mind-blowing project, by the way. But we haven't seen a lot of progress in terms of commercial offerings around event processing or complex event processing. Esper CEP was written 20, 25 years back. And real-time processing is still far behind in terms of maturity compared to what we have seen with batch processing, like ETL. Tell us, from an engineering perspective, why is that?
Man, it's a pretty easy answer: it's really hard. I think what you're saying is true. However, I would point out that there's a lot of work being done in this space now by vendors, right? So I would argue there has been some progress from the stuff that we did with Samza to now. I think there are two things holding back the streaming space. One of them, I think we're on the cusp of solving; the other one, not yet. The one I think we're on the cusp of solving is that operating real-time stuff is just a pain in the butt. It's more like operating microservices than like operating batch stuff. So it's probably more fair to compare stream processing to microservices, but then stream processors are very stateful and finicky. And whereas with microservices you generally have to process as many queries per second as you receive, with asynchronous processing, stuff can build up and you can have back pressure. So suddenly you might have to process a workload that looks a lot more like batch than real time. So there's an operational aspect to it. What I see there is that the vendors and cloud hosting are gonna solve that, so you will pay someone to deal with a lot of those problems. The second problem, which I don't think we've really cracked the nut on, and this is the one that's probably more important, frankly, is the usability aspect. Everybody likes SQL; writing SQL is great. Writing SQL on streaming is complicated because of time, right? You wanna do aggregations across windows and time, and then there are all these complexities around semantics and late arrivals.
And you get into these really elegant models where the Dataflow folks from Google, for example, have come up with really beautiful ways of dealing with aggregation in a streaming environment. I'm kind of being high level here, but you can definitely, yeah, you...
So basically talking about stuff like the Dataflow algorithms?
Yeah. Yep. Yeah, exactly. But then to explain it, you need like a tensor of dimensions and space, and there's all these animations and stuff, and it's just not something that's approachable for your average application engineer who's trying to get a job done, or a business analyst or analytics engineer who's trying to do something. It's just way, way too complicated. And then you overlay on top of that some of the physical stuff that's leaking in, like partitions, event arrival ordering, and transactionality guarantees, and it gets to be really hard to build reliable systems if you don't know what you're doing. So I think the holy grail in this space is really figuring out a model that works out of the box, intuitively, the way that people expect it to, so that they can just use it. The alternative is maybe that stream processing becomes so ubiquitous and such a cool-kid thing that people are motivated to learn all of these complexities, and suddenly the application engineer does understand late arrivals and window aggregations and sliding versus tumbling windows and so on. But I think realistically, a lot of the work that needs to happen, and this is PhD, postdoc kind of work as well as vendor work, is figuring out a model that application engineers can use to build streaming applications that's not prohibitively complex.
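To make the terms in this answer concrete, here is a minimal sketch of tumbling windows, sliding windows, and late arrivals. It assumes integer event times and a simple watermark defined as "largest event time seen so far minus an allowed lateness". The function names, the window sizes, and the sample events are all hypothetical; this is a toy illustration of the concepts, not how Dataflow, Samza, or any other system mentioned here implements them.

```python
from collections import defaultdict


def tumbling_window_start(ts, size):
    """Start of the single fixed, non-overlapping window containing ts."""
    return ts - (ts % size)


def sliding_window_starts(ts, size, slide):
    """Starts of every overlapping window [start, start + size) containing ts."""
    starts = []
    start = ts - (ts % slide)
    while start > ts - size:
        if start >= 0:
            starts.append(start)
        start -= slide
    return sorted(starts)


def count_per_tumbling_window(events, size, allowed_lateness):
    """Count events per (window start, key), processing in arrival order.

    events: iterable of (event_time, key) in the order they arrive.
    An event is dropped when its window ended at or before the watermark.
    """
    counts = defaultdict(int)
    dropped = []
    max_seen = None
    for ts, key in events:
        max_seen = ts if max_seen is None else max(max_seen, ts)
        watermark = max_seen - allowed_lateness
        start = tumbling_window_start(ts, size)
        if start + size <= watermark:
            dropped.append((ts, key))  # too late: its window is already closed
            continue
        counts[(start, key)] += 1
    return dict(counts), dropped


# Arrival order differs from event-time order: 3 and 8 arrive late.
arrivals = [(1, "a"), (4, "a"), (12, "a"), (3, "a"), (25, "a"), (8, "a")]
counts, dropped = count_per_tumbling_window(arrivals, size=10, allowed_lateness=5)
```

Even in this toy version, the result depends on arrival order and on the lateness setting: the event at time 3 is still counted, while the event at time 8 is dropped because the watermark has already passed the end of its window. That sensitivity is the kind of complexity being described above.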
I think one of the very nice examples of what you have just mentioned, SQL on streaming data from an end user perspective, there are a couple of examples that we have personally seen working really well. One is Confluent, where they have this thing called ksqlDB, and a parallel to that is a commercial vendor offering called Materialize.
Yep. Yep.
I personally had a chance to try out both of these streaming databases, and the experience was mind-blowing. I think we are getting there.
Yeah, I hope so. People really love Materialize, right? That's one that definitely comes up in this conversation. And I think the real-time streaming database kind of system is definitely new, but if we could crack that nut, and if real-time DBs end up being the solution to this problem, that's huge, right? That really is a game changer. And like you said, there are a bunch of these. Actually, Gunnar Morling, he's just a fantastic engineer, you should absolutely follow him on Twitter. He runs Debezium and a number of other projects, and he had a thread, I think earlier this week, where he listed all the different startups and open source projects in this space now, like PranaDB, which I think is from Cash or Block or whatever the company's called now. It listed Materialize and DeltaStream and on and on. There are just tons and tons of them. So that's promising, right? Hopefully it works out.
Amazing. Chris, we are towards the end of our episode, and before letting you go, I'll ask you one last question: what is the one company in the modern data space, apart from dbt, that you wish you had invested in?
I didn't really have the opportunity on this particular company, but I'm really coming around to DuckDB. I was very much a skeptic, and I've had a number of troll-y tweets about it, and the information and responses I've got from that have been super useful. I'm just really impressed with that project. Again, that's an early-days one, but I would've loved to have been involved in one of those companies. So that's sort of my go-to right now. You might not expect that reading my tweets, because it's all complaining about security and, you know, what's the big deal and stuff, but genuinely, they're winning me over on that one. I think it's pretty interesting in ways I hadn't thought about.
Nice, amazing. I think that's pretty much it from my side. Chris, thank you so much for having this lovely, candid conversation with us.
Yeah, likewise. Thank you very much. I appreciate it.