Nov 15, 2022 · 37 min

S01 E10: Commoditizing data integration with Airbyte, Michel Tricot, Co-founder and CEO, Airbyte

When Michel and his team founded Airbyte back in 2020 there were already a ton of data integration tools out there and by 2020, it was a pretty mature space altogether. So what led them to start this company and what unique problem did they aim to address? To answer this, for this week's episode we have Michel Tricot, the co-founder and CEO of Airbyte.


About the guest

Michel Tricot

Michel Tricot is the CEO and co-founder of Airbyte, an open-source data integration platform with a vision to commoditize data integration. Airbyte stands at a total valuation of about $1.5 billion after four rounds of funding led by investors like Altimeter Capital, Thrive Capital, Salesforce Capital, Salesforce Ventures, Benchmark, Accel, and SV Angel. Before founding Airbyte, Michel was the head of integration at LiveRamp, where he managed over a thousand integrations.

In this episode

  • Story behind Airbyte.
  • Change in licensing.
  • Move towards Airbyte Cloud.
  • Airbyte's community and its contribution.
  • Acquisition of Grouparoo and reverse ETL.


Hello everyone and welcome to another episode of the Modern Data Show. Our guest today is Michel Tricot, who is the CEO and co-founder of Airbyte, an open-source data integration platform with a vision to commoditize data integration. Airbyte stands at a total valuation of about $1.5 billion after four rounds of funding led by investors like Altimeter Capital, Thrive Capital, Salesforce Capital, Salesforce Ventures, Benchmark, Accel, and SV Angel. Before founding Airbyte, Michel was the head of integration at LiveRamp, where he managed over a thousand integrations. Welcome to the show, Michel.
Hi Aayush. Thank you so much for having me.
So Michel, let's start with the very basic question. Can you help us understand what exactly Airbyte is, and what's the story behind starting Airbyte?
Yeah. So what is Airbyte? Simply, it's pipes from any kind of source where you have data to any destination where you want to be able to leverage that data. What we see in every company is that they have data everywhere. They have data in SaaS products, they have data in files, they have data in databases, in spreadsheets, et cetera. I mean, there's an infinite number of places where companies can have data. And what we are doing with Airbyte is just connecting all these little data silos and helping them move the data to a place where they can run analytics, or send this data to operational systems. And this is what Airbyte is all about: going from point A to point B. The story behind Airbyte is, first of all, like you said, I've always been in the data space. I started my career in 2007. I was first working on financial data, bringing together data from a lot of analysts on companies and trying to get that data back to traders and others. Then I moved to the US in 2011 and started at this company called LiveRamp, and this is really when I discovered the scale of how many silos you can have. I was fortunate enough to build the system that was powering the LiveRamp product, and that was allowing LiveRamp to connect to the whole marketing and adtech ecosystem, so making sure that people can send the data where they can leverage it. It's a very hard problem because these integrations break all the time. So it's both a technical challenge and a people challenge. And coming out of LiveRamp, I also realized that the world of data was changing. Before, companies that were doing data were big data companies. They had Hadoop clusters, they had Spark clusters, so they had to be very data savvy. But with modern warehouses, suddenly any company can do data. You're using BigQuery, you're using Snowflake, you're using Redshift, you're using any kind of data warehouse. They're easy to use. 
They scale, and anyone can just go write SQL and do something with data. And from there you realize, okay, that's cool. You have a very efficient, very approachable system for processing data. The question is, how do you feed data into it? And that's when we got this idea of Airbyte, which is taking everything we've learned from very data-savvy companies and bringing it to a much larger audience in a way that is simple to use and simple to extend. And that was the story behind why we started Airbyte.
And you know, when you started Airbyte back in 2020 there were already a ton of data integration tools out there. It's a pretty well-established space. And if you look at ETL and ELT as a space overall, by 2020 it was pretty mature altogether. What do you think was primarily missing in whatever tools were out there that led you to start this company? I know all the challenges around ETL and everything, but what was missing out there in the world that made you start this?
Yeah, so people like to think that ETL and ELT are fully mature. The thing is, it's mature for a very small subset of the places where you have data. Yes, you can go to another company and they will have, I don't know, 50 connectors that allow you to pull data from Salesforce, from HubSpot, et cetera. And these are mature: they work, you can use them, and you will get your data. The problem is the moment you need to customize these connectors, meaning that for some reason they decided they were not going to import this particular data from Salesforce, or this particular field in a record. Then what do you have to do? You have to ask them, please add this new feed of data, add this new field. And if they don't, and you need it for your business, then what are you going to do? You're just going to rebuild it on the side and recreate your own custom homegrown connector. And the other one is around the long tail. Today we have so many SaaS services. If you look at companies, they start with one SaaS service that they're using. Then they have 10, then they have 20, then they have 100. It's very hard to address this long tail of data connectors. The moment you have to pull data from, I don't know, any kind of random SaaS service, because you want to have this 360 view of your customers, for example, or you need it for your finance analytics, then you're going to ask your team to build a connector for this particular service that the platform you're paying for does not have. And so that's why I like to challenge it when people say it's mature, because it is mature, but just for a very small subset. And what we wanted to do with Airbyte is focus on the extensibility and the long tail. 
And that was the motivation behind Airbyte: can we build a platform and a mechanism, a product that does ETL and ELT, that has these two components, long tail and extensibility?
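To make the "point A to point B" idea and the customization problem concrete, here is a minimal sketch in Python. It is not Airbyte's code or protocol: `Source`, `Destination`, `InMemorySource`, `ListDestination`, and `sync` are hypothetical names invented for illustration. The only point it makes is that when the connector is yours to edit, adding a missing field is a one-line change rather than a vendor request.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Protocol

Record = Dict[str, object]


class Source(Protocol):
    """Anything that can produce records (hypothetical interface)."""

    def read(self) -> Iterator[Record]: ...


class Destination(Protocol):
    """Anything that can accept records (hypothetical interface)."""

    def write(self, records: Iterable[Record]) -> int: ...


class InMemorySource:
    """Stand-in for a SaaS API; `fields` controls which fields are exported."""

    def __init__(self, rows: List[Record], fields: Optional[List[str]] = None):
        self.rows = rows
        self.fields = fields

    def read(self) -> Iterator[Record]:
        for row in self.rows:
            if self.fields is None:
                yield dict(row)
            else:
                # Only the selected fields leave the source.
                yield {name: row[name] for name in self.fields if name in row}


class ListDestination:
    """Stand-in for a warehouse table."""

    def __init__(self) -> None:
        self.rows: List[Record] = []

    def write(self, records: Iterable[Record]) -> int:
        before = len(self.rows)
        self.rows.extend(records)
        return len(self.rows) - before


def sync(source: Source, destination: Destination) -> int:
    """Move every record from the source to the destination."""
    return destination.write(source.read())
```

For example, a connector that ships only `id` and `email` can be extended to include a `plan` field by passing `fields=["id", "email", "plan"]`, with no change to the destination or to `sync`.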
Right, right. And I think this is something that you guys have been pretty vocal about. We have seen your investor decks. Love them, by the way, they were amazing. A couple of core value propositions stood out in those decks. One was the open-source way of, as you said, addressing this long tail of connectors. And second, giving the power back to the organization to run their own data integration processes. Now, a couple of things have changed since those decks, right? One is the change in your licenses. Second is the move towards Airbyte Cloud. And to some, it might seem like Airbyte is moving towards a kind of managed service, the very paradigm it stood against. So help us understand both of these points, the change in your licenses and the move to Airbyte Cloud.
Yeah, that's a good question. So the first thing I would say is that one of the core methodologies we have at Airbyte is what we call community-driven. And I think there have been different models of open source. You have the Linux type of model, where it's something that is a hundred percent community-driven, but then you have companies like Red Hat and others that take on a huge chunk of the development and monetize it. The same thing goes for Kubernetes. And you have other types of open-source projects, and here I'm thinking about MongoDB, Elastic, Airbyte, where there is a reason why it's open source, but at the same time we're also building a company and a business on top of open source. And that is normal. That's something we've always said from the moment we started Airbyte: we will always have Airbyte open source, and there will be an Airbyte that is a paid version. We have our public handbook, and you can find that in our strategy and in the way we see open source. Open source is here to create the standard. For us, the standard is how you build connectors. It's how you provide the core feature to exchange data. Now, what we want to have on the cloud is more what goes above the standard, which is how do you add features around, I don't know, privacy on data? How do you add user management? How do you do role permissioning? As an open-source user, you probably won't need this, but as an organization, you will. So our goal is to empower small teams and individuals with open source, and to empower organizations with the paid version. And if you think about it, we have to support open source, and the way we can do that is if we can make revenue. So that is one of the reasons why it's so important for us. 
And this is a model that has worked. Terraform is, oh sorry, HashiCorp is very much like that, where they have open tools, but they're also building cloud services on top of them. So it's a model that has proven to work, and it's not because you have a paid product that open source is a second-class citizen. Quite the opposite: one powers the other, and the other pushes the first forward. So that's, yeah.
Amazing. Michel, quick question on the licensing. Can you help people understand the things they cannot do with the open-source Airbyte license?
Yeah. So first of all, let's just break down how we've applied the license. We added a little bit of complexity to the repo because we wanted to change the license only on a specific piece of it. So typically, if you look at connectors, which are the standard, we want to give our community the freedom to decide what license they want to use for connectors when they create them. The specific piece we changed is where we went for the Elastic License. What we wanted was to keep the ease of adoption for companies using Airbyte, so we wanted to have as few limitations as possible. The main limitation you get from the Elastic License is that you cannot take Airbyte and create a company that sells Airbyte as a cloud service. That's the main limitation. And the reason is simple: we want that revenue to help us power the project.
So what about other SaaS businesses that are not competing with Airbyte by selling the whole of Airbyte as a solution, but are, let's say, building more capabilities on top of data integration, while their core data integration pipeline is supported by Airbyte? Is that allowed?
It's allowed as long as you don't expose the Airbyte API, the Airbyte UI, and things like that. Some companies do that today, and it's fine. They are just using Airbyte open source as their connectivity layer. It's okay. It's more the backend engine than what they are exposing to their customers.
That's good to know.
And it's because their product is probably not an ETL product. They are saying, we are building an analytics product, and to do these analytics, we need to have this connectivity layer. And that's perfectly fine.
Perfect. So is that also the reason why your connector repo is structured as a monorepo, versus multiple repos like Singer, where you have different repos for different taps and targets? Is that the reason for that?
So the monorepo is because we started everything at once. We started open source, we started to build the community at the same time, and we also started to build this Airbyte protocol at the same time. And for us, going with a monorepo when we started put all the eyeballs onto one single repo. It allowed us to concentrate the community into a single point. And it also allowed us to have more control as we were iterating on what the V1 should be. Because if the Airbyte connectors are everywhere, and tomorrow we decide, you know what, we made a huge mistake at the beginning and we want to fix it, then if they're spread across multiple repos, it's too late. But here, because we have a monorepo, we can decide ourselves to say, you know what, we're going to change that across the 250 connectors that we have now. Is it going to stay like that in the future? It's an ongoing conversation, and there is something interesting about spreading connectors across repos. But today it allows us to ensure that Airbyte goes in the right direction.
Right. And have you ever seen situations where you wonder if, had these connectors been separately organized and separate from the core? Would it make it any easier for developers to contribute to them?
Potentially, yes. We have some experiments. We have a small CLI that we wrote that allows us to build a connector in a separate repo. But to be clear, we have pretty high growth in terms of contributions and new connectors being created. So it might be a little bit of friction, but it's not a wall.
On that note, Michel, talk about the CDK, the connector development kit. How did that come along, and how's that going?
Yeah, that's a very good question actually. So for us, we released the CDK in May 2021. The moment we released it, we started to have contributors, and this is why we are open source: we want to give people a lot of ability to build on this standard and to build the standard's features, which are connectors. So we put a lot of effort into the CDK. The first version of the CDK, the one that has been running for the past year, was code-heavy. You write something in Python, and it requires you to know a little bit of coding, to understand APIs, et cetera. Now we are going for the v2 of the CDK, which is what we call the low-code CDK, and we've seen it during Hacktoberfest: it is just a game changer, because suddenly the only thing you need to write is a YAML file. With it you can cover, I don't know, today maybe 30% of the APIs, but as we add more features, we'll get to 50%, 70%, et cetera. It changes how fast you can build. But more importantly, it removes a lot of pain when you're maintaining the connectors, because now you don't have to read through the code. You don't have to understand the logic. It's just, okay, this stream has changed, let me change this field in the YAML file. Because the maintenance is where the cost is. So for us, it's giving something to the community so that they can start building and also make the best use of their time.
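The low-code idea described here, a declarative file interpreted by a generic framework, can be sketched in a few lines of Python. This is an illustration under stated assumptions and not Airbyte's actual manifest format: the `SPEC` keys (`path`, `records_key`, `next_page_key`, `fields`) are made up, a plain dict stands in for the parsed YAML file, and `fake_fetch` replaces a real HTTP call.

```python
from typing import Any, Callable, Dict, Iterator, Optional

# Hypothetical declarative spec; in a low-code setup this would live in a YAML file.
SPEC: Dict[str, Any] = {
    "stream": "customers",
    "path": "/v1/customers",
    "records_key": "data",    # where the records sit in each response
    "next_page_key": "next",  # where the pagination cursor sits
    "fields": ["id", "email"],
}

# A fetch function takes (path, cursor) and returns one decoded response page.
FetchFn = Callable[[str, Optional[str]], Dict[str, Any]]


def read_stream(spec: Dict[str, Any], fetch: FetchFn) -> Iterator[Dict[str, Any]]:
    """Generic reader: all behaviour comes from the spec, none from custom code."""
    cursor: Optional[str] = None
    while True:
        page = fetch(spec["path"], cursor)
        for raw in page.get(spec["records_key"], []):
            yield {name: raw.get(name) for name in spec["fields"]}
        cursor = page.get(spec["next_page_key"])
        if cursor is None:
            break


def fake_fetch(path: str, cursor: Optional[str]) -> Dict[str, Any]:
    """Stand-in for an HTTP client, returning two canned pages."""
    pages = {
        None: {"data": [{"id": 1, "email": "a@example.com", "plan": "pro"}], "next": "p2"},
        "p2": {"data": [{"id": 2, "email": "b@example.com", "plan": "free"}], "next": None},
    }
    return pages[cursor]
```

In this sketch, "this stream has changed" becomes a data edit: adding `"plan"` to `SPEC["fields"]` changes what is extracted without touching `read_stream`, which is also why a review of such a change can stay small.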
Yeah. And on that note, Michel, as they say, with great power comes great responsibility. How are you tackling the issue of the reliability of these connectors? One of the basic arguments against open-source data integration solutions is that connectors are not reliable, and that it's easier to reach out to a vendor and have the responsibility for addressing that problem sit with them. How are you tackling that?
Yeah. So one thing that we've done with Airbyte since the beginning is that we collect a lot of telemetry on connectors. And when you think about connectors, they are living organisms. They start simple and they grow to become more reliable. And that's what we call the thousand paper cuts problem. Meaning that if you write a connector, it'll work for you. If I use your connector, it might work or it might break. If I want to use yours, I will have to go and tune it so that it also works for me. And bit by bit, you start building reliability. So we always consider that the first time you use a connector, it is not going to be a good connector, but as more people use it, you get the proof that it's working more and more. And if you see that the usage is plateauing, maybe there is something that's blocking more users from using it. We have this certification process, and a key metric in this process is how many users are using the connector. If it's only one, you cannot say it's reliable. You can say it works for them, but you cannot say it is reliable. But when you get 50 people using a connector, you can say, yep, it works. And the good thing is, when you have 50 people using a connector, depending on the connector, if it fails for these 50, the likelihood of someone going into the connector and fixing it is very high. And the moment they do, boom, 50 people have access to the new, fixed version of the connector. And the low-code CDK is going to help tremendously with that, because now it takes you one minute to fix a connector.
Yeah. And please excuse me if that's a very stupid question, but how frequently would you expect Airbyte Cloud to be updated with these connectors? Let's say we find some connector to be broken and someone puts in a request: how soon can they get it fixed on Airbyte Cloud?
Yeah, first of all, there is no stupid question, only stupid answers. So I hope I won't give you a stupid answer. Our goal is to have a very fast turnaround. That's also one reason why we're pushing so much on the low-code CDK: we want to make sure that we can have a speedy turnaround. If you have low code, doing a review is very simple because it's just YAML, and it's going to be one or two or three or four lines. If it's code, then you also need to put a lot more effort into the security side: are the libraries being used malicious, or are they good? Is this little snippet of code doing something that it shouldn't do? Whereas with low code, it's very simple. You can't do anything. It's just our framework behind the scenes that knows what to do with that YAML file. So for us, yes, the moment it's fixed, and if we also see the same problem on the cloud, we need to take that change and apply it to the cloud.
Yeah. Let's dive a little bit deeper into Airbyte Cloud. I guess there is a chasm that every open-source project needs to cross before it picks up a cloud version. We'd love to get some sense of the challenges of building Airbyte Cloud, and a high-level overview of what the architecture of the cloud even looks like.
Yeah. Yep. Ah, I will have to describe an architecture, that's gonna be fun. So the first thing is, when we built Airbyte, we were very specific about what dimensions we need to be scaling on. You scale on connections. You scale on streams, and a stream is one type of data. And, maybe a later one, you scale on partitions. So if you're getting something from Kafka, for example, you probably need to have multiple workers working on the same topic but spread across partitions. So these are the three dimensions, and we've always built Airbyte with this in mind, whether it's for open source or for the cloud. Now, what we've done with the cloud is push a lot on Kubernetes, because we want to create a very cloud-agnostic way of running Airbyte. And so the way it's architected is you have the control plane on one side and the data plane on the other side. Today both of them run on Kubernetes, but the data planes can run on any kind of Kubernetes cluster. And what happens when you issue a request to a connector is that it's going to spin up a new pod. If it's a replication job, it's going to spin up the source, the worker, and the destination, and everything is going to flow between the source and the destination. If you're running a check command, it's going to spin up a pod, issue the check query, and return the result to you in the UI. But overall you have this very strict separation, and then you also have the connection to a KMS to manage secrets. This is extremely key for the cloud, and it's something that you can also do in open source: connecting Airbyte to a KMS. The advantage of having the data plane be very separate is that we are going to be launching Airbyte in Europe, and for data privacy and regulation reasons, we had to have the data planes run over there. So what have we done? We just go to Europe, we spin up an AWS Kubernetes cluster, and boom, now we have Airbyte running over there. 
And it's always controlled by the same control plane, but the US never sees the European data.
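The separation described above, one control plane that decides and several regional data planes that touch the data, can be sketched as a toy model in Python. This is not Airbyte's scheduler: `DataPlane`, `ControlPlane`, `run_sync`, and `run_check` are hypothetical names, launching a "pod" is reduced to recording a string, and treating the source, worker, and destination as three separate launches is an assumption made only to keep the example simple.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataPlane:
    """A cluster in one region; the only place where jobs (and data) run."""

    region: str
    launched: List[str] = field(default_factory=list)

    def launch(self, job_name: str) -> str:
        pod = f"{self.region}/{job_name}"
        self.launched.append(pod)
        return pod


@dataclass
class ControlPlane:
    """Holds configuration and decides where a job runs; never moves records."""

    planes: Dict[str, DataPlane]

    def run_sync(self, connection_id: str, region: str) -> List[str]:
        # A replication job starts a source, a worker, and a destination,
        # all inside the data plane chosen for this connection.
        plane = self.planes[region]
        return [
            plane.launch(f"{connection_id}-{role}")
            for role in ("source", "worker", "destination")
        ]

    def run_check(self, connector: str, region: str) -> str:
        # A check command needs a single short-lived job.
        return self.planes[region].launch(f"{connector}-check")
```

With `ControlPlane({"us": DataPlane("us"), "eu": DataPlane("eu")})`, calling `run_sync("conn-1", "eu")` records work only on the `eu` plane, which is the property the European launch relies on in the description above.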
Right. And tell us a little bit more about the challenges, specifically around scaling.
Yeah. So the thing is, every connection is its own job.
And just to be sure, by connection do you mean a particular job within a connection? Like, for every batch you are talking about, that's a connection?
Yeah, so basically a connection... okay, no, the connection is more abstract. It's more about how you configure a source to push data into a destination. That's what a connection is for me. Now, how it translates to the infrastructure is that it creates a pod, and that pod can run anywhere. And because we're using Kubernetes, and because we can have as many data planes as we want, we can automatically balance where this pod is going to run, whether it's for regulation reasons or for scale reasons. It can run on our main Kubernetes cluster in the US, or it can run on another one. So that's one place where we're working on scale today. Another type of scale is the volume of data and the speed at which we want to replicate. That's a very different one, and that's when the second dimension of scaling comes into play. A big project that we had in Q3 is what we call per-stream state, meaning that now every type of record has its own state, meaning that if we have 10 streams, we can run 10 jobs and each of them is going to be replicating its stream in parallel, and this is a very big deal. The other one is, if one stream has a lot of data, how can we make that stream faster? And that's why we are also exploring a binary protocol. Today it's very much based on JSON, which is extremely useful at the beginning of a project because you can debug very fast. But we want to go for far faster serialization and deserialization, because that's where a lot of time is spent today. So those are the aspects where we are scaling today.
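Why per-stream state unlocks parallelism can be shown with a small Python sketch. It is a simplification and not Airbyte's state format: each stream is modelled as a list of integer cursor values (think of an increasing `updated_at`), and `sync_stream` and `sync_all` are hypothetical helpers. Because each stream carries its own cursor, the streams share nothing and can be replicated concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def sync_stream(name: str, rows: List[int], cursor: int) -> Tuple[str, List[int], int]:
    """Copy only rows newer than this stream's own cursor, then advance it."""
    new_rows = [row for row in rows if row > cursor]
    new_cursor = max(new_rows, default=cursor)
    return name, new_rows, new_cursor


def sync_all(
    data: Dict[str, List[int]], state: Dict[str, int]
) -> Tuple[Dict[str, List[int]], Dict[str, int]]:
    """Run one job per stream in parallel; each job reads only its own state."""
    with ThreadPoolExecutor(max_workers=len(data) or 1) as pool:
        futures = [
            pool.submit(sync_stream, name, rows, state.get(name, 0))
            for name, rows in data.items()
        ]
        results = [future.result() for future in futures]
    copied = {name: rows for name, rows, _ in results}
    new_state = {name: cursor for name, _, cursor in results}
    return copied, new_state
```

With a single connection-wide state, a failure in one stream would force the others to restart from the same checkpoint; with one cursor per stream, as sketched here, each stream resumes from where it left off.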
Nice. Amazing. And another thing that is very noticeable is Airbyte's focus on replicating data from cloud sources, right? That's where the majority of the focus has been. But what we are also seeing in the industry, and one of your competitors recently acquired a company that was more into this, is database replication through the CDC methodology. What's your thought on tackling that piece of the market, where you are actually dealing with traditional relational databases and replicating data across them using CDC?
Yeah, so we have a huge advantage there, both from the fact that we are open source and from the fact that we have this concept of a data plane. Because the moment you start connecting your internal databases, the red tape and the security concerns just rocket, and people are much more concerned. And the way we see the data plane is that today we can own it, but what if people own their data plane? What if we can have the data plane actually run directly within your infrastructure? It makes the security story a lot better. It makes the speed a lot better, because now you don't have to go from you to a SaaS service and back to you. And so when you're talking about CDC, the volume is going to be high, the concerns around security are going to be high, and that's where we have a huge advantage. Today we have a lot of people who are using open source directly, because we don't yet offer the ability to have the data plane run elsewhere, like on the customer's infrastructure, so they're using open source and that's how they're doing it. The moment we have it, this is something that they will be able to use through the cloud.
That's very interesting. And that's actually very powerful, now that we think about it. So Michel, another thing: you talked about community, right? You've got an amazing community. You've got almost 10,000 people on your Slack channel. It's a great community.
We beat 10,000
Oh, nice.
Two weeks ago.
Well, nice. Very nice. So tell us about some of the most interesting, unexpected, or innovative ways in which you have seen the community using Airbyte.
Yeah, so I have two stories. One has been instrumental in how we are developing the product. The other one, I think, is a very interesting use case. So the first one, around how it changed the way we think about the product: we released Airbyte, and the only use case we had in mind at the time was doing analytics on your warehouse. You are using it for your own use case. And boom, three weeks after we released it, people were asking us, okay, can we have an API so that we can offer Airbyte as the connectivity layer for our customers? It was an analytics company, and they needed to pull data from Shopify, Stripe, et cetera. They didn't want to build the connectors. They just wanted to offer these connectors to their customers to power their product. At the time it was an interesting use case. Now it seems like, oh yeah, okay, that makes complete sense. The other one, which was very fun, is people using Airbyte to warm up caches and to refresh caches. They had a Redis instance and they wanted to cache a database into Redis, and they would just use Airbyte to replicate from the database to Redis. This is the kind of thing where, if your infrastructure and your mission is to get data from point A to point B, wow, you can do a lot of things with that.
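The cache warm-up use case is replication where point B happens to be a key-value store. The sketch below illustrates the idea in Python with the standard library only; it is not how Airbyte's connectors work. An in-memory SQLite database stands in for the source database, a plain dict stands in for Redis, and `warm_cache` plus the `table:id` key format are assumptions made for the example.

```python
import json
import sqlite3
from typing import Dict


def warm_cache(
    conn: sqlite3.Connection,
    cache: Dict[str, str],
    table: str = "products",
    key_column: str = "id",
) -> int:
    """Copy every row of `table` into the cache as JSON, keyed by `table:key`."""
    # The table name is interpolated directly, so it must come from trusted config.
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    count = 0
    for row in cursor:
        record = dict(zip(columns, row))
        cache[f"{table}:{record[key_column]}"] = json.dumps(record)
        count += 1
    return count


def demo_database() -> sqlite3.Connection:
    """Build a tiny source database for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO products (id, name) VALUES (?, ?)",
        [(1, "kettle"), (2, "teapot")],
    )
    return conn
```

Running `warm_cache(demo_database(), cache)` on an empty dict fills it with one entry per row; rerunning it on a schedule overwrites the same keys, which is the "refresh" half of the use case.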
Absolutely. Absolutely. And we've talked about a lot of cases where Airbyte can be incredibly helpful. I have two follow-up questions. One, tell us when Airbyte is not the best option. And two, what is the hidden cost of ownership of Airbyte? Because one of the arguments that comes from commercial offerings versus open source is the hidden cost of ownership. So when is Airbyte the wrong choice, and how do you justify this hidden cost of ownership of an open-source platform like Airbyte?
Yeah. So when is Airbyte not the best option? It depends. If we're talking about open source, I would say if you have no technical people in your company, or people who are technical but cannot focus on the infrastructure, that's going to be a problem for you, because there are operations that you need to do. If you cannot tackle them, you should not use open source. But at the same time, that's why we have the cloud. We are the experts in operating Airbyte for these users. So yeah, maybe don't choose open source and go on the cloud. You're trading people's time for cloud spend. In terms of ownership, it always depends on what the company is optimizing for. The good thing about open source... okay, let's go back to the basics here. With data integration in general, the default for companies is to build it themselves, because it always starts with one integration, and you never see the value in buying if it's only one. With open source, we replace build. We get into companies the moment they decide to build something, and that is great. So we are replacing their cost of building and their cost of maintaining, in exchange for having to operate the system, but they would have operated one anyway. So I would say the only argument on cost is going to be whether you should go for the cloud or build it yourself, slash use open source. That's why we are building this hybrid version of Airbyte: it reduces the cost of operation in exchange for you paying us instead of paying people in the team to do the operation. But there will always be cost involved, whether it's your time, which translates in the end into salaries, or whether it's on the cloud. It's just, can we find the right balance between the two? It becomes an internal choice. And if you have big security requirements around data, then never go to a cloud. Then you have to do it internally and you have to pay people. 
And we are here to actually reduce these costs, because we provide something that works out of the box.
That makes so much sense. Michel, Airbyte also recently acquired the company Grouparoo, which is to power the reverse ETL capabilities. Are you bullish on reverse ETL?
Very. On reverse ETL, I would say it's a sequencing thing. I always see data as a pyramid. Airbyte, as ETL/ELT, is at the bottom, meaning we are helping companies build the foundation of their data engine and their data system. Reverse ETL is at the top of the pyramid. It means that your organization has a level of maturity that allows you to actually use this data for very operational and business-critical use cases, meaning how are you going to reach out to people on HubSpot? How are you going to affect your sales process, because now you are flagging or labeling customers differently? So for me it's a journey, and today a lot of people are at the bottom. They are just starting to use data warehouses, understanding analytics, understanding dbt, understanding Airbyte. There is a little bit of delay before they start saying, oh yeah, it's great to have analytics, now I have all this data that's been joined across multiple datasets, let's send this data into Salesforce. That requires a level of maturity that not a lot of companies have. So for us, there is a timing thing: most of the market is still on the first level of the pyramid. But we want to do reverse ETL. That's why we acquired Grouparoo, and what I really like about what they're doing within Airbyte is that they are shaping decisions to make sure we're not boxing ourselves into something that makes it super hard to do reverse ETL. So they are shaping how the protocol should evolve so that we can do reverse ETL easily.
Yeah. And how's the product integration coming along?
And so what's the plan then?
The plan is for them to figure out, with everything they know, how that adapts to Airbyte. Sometimes when you do a product acquisition, if the products overlap, the integration is very hard and you spend a lot of time on it. Sometimes just the knowledge is more powerful, because you can build it faster and you don't have to live with decisions that were made for one system and that you're trying to reshape into another. The thing is, today we need to be the best at the foundation. We also need to look to the future, so we make decisions that allow us to go there. But today we're not building reverse ETL per se. This is something that we'll do next year, and it's part of the rationale for going from point A to point B: point B should also be an API.
So basically, for the near future, we should expect both products to exist and operate independently. Nice. So before we let you go today, Michel, tell us about what's next for Airbyte. What are the hot new things coming out of Airbyte?
Yeah, so first, we're going to be releasing the low-code CDK, both as a YAML file and with a UI that we're building on top of it. It's something that will make it so easy to build connectors that 2023 is just going to be a massive ramp-up in the coverage of the long tail. The second one is that we need to focus on scale for high-volume data sources. We have done a lot of groundwork there, and it is going to come to fruition in the next quarters. And the other one is the cloud going hybrid. This is a very important project for us. And yeah, for us it's continuing to grow the community, and continuing to provide more valuable content about what the data world is and where it's going. We're providing a lot of support to the data community on, yeah, what does it mean to be a data engineer? What are all the different concepts in the data world if you want to get into it? And just helping the community grow and helping companies become extremely solid with data.
Nice. Amazing. So thank you so much again, Michel, for taking the time for this one. We all love Airbyte, we are super thrilled to see the kind of journey that you guys have made in the past couple of years, and we wish you all the best for your future. Thank you very much for being on the show.
Thank you. Thank you very much for having me.