Oct 25, 2022 · 30 min

S01 E07: Powering real time Change Data Capture using Fivetran with Mark Van de Wiel, Field CTO at Fivetran

In this episode, Mark Van de Wiel (Field CTO at Fivetran, previously CTO at HVR Software) walks us through the tech architecture behind HVR (acquired by Fivetran in 2021), why they merged with Fivetran, how the two technologies complement each other, and how they now power real-time Change Data Capture. We also dig deep into open-source technology and the problem of long-tail connectors.

About the guest

Mark Van de Wiel
Field CTO

Mark Van de Wiel is the Field CTO at Fivetran, the leader in automated data integration, delivering ready-to-use connectors to thousands of customers globally. Mark has a strong background in data replication and real-time business intelligence and analytics. Before joining Fivetran, Mark was the CTO at HVR Software which provides a real-time cloud data replication solution to support enterprise modernization efforts. HVR Software was acquired by Fivetran in 2021.

In this episode

  • About HVR and its architecture.
  • Why HVR merged with Fivetran.
  • Limitations of Change Data Capture.
  • Open source ELT/ETL technology.
  • Expansion of Fivetran in adjacent categories.

Transcript

00:00:00
Hello everyone and welcome to another episode of the Modern Data Show. Today we have Mark Van de Wiel, who is the Field CTO at Fivetran. Fivetran needs no introduction: it is the leader in automated data integration, delivering ready-to-use connectors to thousands of customers globally. Mark has a strong background in data replication and real-time business intelligence and analytics. Before joining Fivetran, Mark was the CTO at HVR Software, which provides a real-time cloud data replication solution from traditional DBMS systems to support enterprise modernization efforts and was acquired by Fivetran last year. Welcome to the show, Mark.
00:00:35
Thank you, happy to be here today.
00:00:37
Mark, tell us a little bit about your journey with HVR. How did it all start? What kind of problems were you guys solving at HVR, and how did it evolve from when you started to what HVR is with Fivetran today?
00:00:52
Yeah. It's been quite a journey, certainly, if I think back on it. I started at HVR in March of 2014. At the time I was the 12th employee, and I was the first one based in the United States. We opened our US offices. We had a handful of customers in the United States, but no physical presence yet. So my role was to get started and build out an organization and a presence for HVR Software in the United States. With the customers, of course, I started supporting them. Initially, my role included everything from account executive to support analyst and everything in between. The IT function was mine for the United States almost until the very end of the journey with HVR. But then of course we started hiring a team over time. Sales came in 2015. We made a leadership transition, from a CEO perspective, in 2016, put in place a marketing function, and started really building a business, essentially building a startup from a technology that was already there, and then grew it quite significantly leading up to the acquisition by Fivetran. Thinking back on it, on the first day all I had was a desk, a phone and a laptop, sharing an office with one of the investors in the company. From that to the acquisition by Fivetran, and now a year later, it has been quite a journey.
00:02:26
And did you guys raise any capital with HVR?
00:02:29
Yeah, we did. When I started, HVR had just raised, call it, something like Series A funding, although it wasn't labelled as such. That allowed us to start operating in the United States. Following that, in 2019, we had an investment round led by Level Equity that was in the news. That eventually led to the acquisition by Fivetran.
00:02:59
Tell us a little bit about HVR as a product. From a technical perspective, from a product perspective, what was special about HVR?
00:03:07
Yeah. So, with HVR the core of our intellectual property was in change data capture, specifically on relational database technology, where we focused on log-based change data capture. I say that our intellectual property was there simply because we invested a lot of effort into building what we believed was the most efficient approach to change data capture on these relational databases. The goal was to get customers to unlock their data and replicate it out of mission-critical production systems, where any slowdown or any impact on the system could potentially have a direct revenue impact on the organization. So our premier focus was around that challenge. Now, on top of that challenge, of course, once you take the data out, you wanna be able to efficiently deliver it to a destination. Our focus had been on real-time, or as close as possible to real-time, replication. And then we also provided a data validation capability for the customer to understand at any moment in time: hey, is my destination system the correct representation of my source? And we provide metrics around that. So that was in a sense the bread and butter, in a heterogeneous environment where there was support for many different source technologies as well as many different destination technologies, and customers could then hook up any source with any destination.
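To make the data validation idea concrete, here is a minimal sketch of comparing a source and a destination by primary key and row fingerprint. It is an illustration only, not HVR's actual compare feature: the function names, the in-memory row format, and the SHA-256 fingerprint are all assumptions made for the example.

```python
import hashlib


def row_fingerprint(row):
    """Return a stable hash of a row given as a dict of column -> value."""
    # Sort columns so the fingerprint does not depend on dict ordering.
    canonical = "|".join(f"{col}={row[col]!r}" for col in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_tables(source_rows, dest_rows, key):
    """Compare two row collections by primary key and report differences."""
    src = {r[key]: row_fingerprint(r) for r in source_rows}
    dst = {r[key]: row_fingerprint(r) for r in dest_rows}
    return {
        "missing_in_dest": sorted(k for k in src if k not in dst),
        "extra_in_dest": sorted(k for k in dst if k not in src),
        "mismatched": sorted(k for k in src if k in dst and src[k] != dst[k]),
    }


# Hypothetical sample data: row 2 differs, row 3 is missing, row 4 is extra.
source = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}]
dest = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}, {"id": 4, "amount": 40}]

report = compare_tables(source, dest, key="id")
print(report)
```

The three lists in the report are the kind of metrics a validation step could surface. A real tool would also have to deal with data type differences between heterogeneous systems and with rows that change while the comparison runs, which this sketch ignores.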
00:04:41
Correct me if I've understood it right. So your focus was mostly towards traditional DBMS systems like Oracles and the SAPs of the world. And what HVR would do is empower organizations to have data replication of those traditional systems onto the cloud using this change data capture methodology.
00:05:01
Yes, that indeed became absolutely the common use case. You're correct. From a traditional relational database technology perspective, SAP and other ERP systems like Oracle, and also lots of homegrown systems running on relational database technologies, became the dominant sources. And indeed, thank you for adding that, the destination was becoming a cloud destination, certainly over the last 5-10 years. Well, it's not been quite 10 years since 2014, but yeah, the cloud became more and more prevalent. And of course, the technologies that are part of the modern data stack became the dominant destinations. Think Snowflake, think Databricks, think technologies like BigQuery, numerous customers on Redshift, Synapse. Those kinds of technologies became very popular, I think for the reasons behind the modern data stack. Yeah.
00:05:59
So that brings me to my next question, the kind of very obvious one: why did you merge with Fivetran?
00:06:06
Yeah, so, very good question. I think what is unique about the merger between Fivetran and HVR is that even though we both played in the data integration space, and we both had this concept of change data capture and incrementally updating the destinations, dominantly the modern data stack destinations, we would rarely compete against each other. So the merger was very complementary. Let me explain this. Fivetran originally became big around the concept of taking software-as-a-service applications and delivering the data into the modern data stack, and Fivetran had, from the ground up, been delivered as a managed service. Now, when we were independent as HVR, we'd been very successful in delivering real-time database replication using a software distribution model. Customers would come to our website, download the software, put it in their environment, and it would be theirs to manage in that environment. We recognized in 2018 and 2019, and started projects towards this, that there were two main objectives we needed to achieve to continue to grow the company and maintain our relevance in the market. One was to start delivering our solution as a service. With the advent of cloud technologies and everybody consuming services in the cloud using an operational model rather than a software download model, that seemed to be a natural evolution of where we were. The second thing we realized was that we're in databases today, but more and more applications are used as software as a service. We needed to start unlocking software-as-a-service platforms, because organizations want to integrate using a single solution. So as we were looking at how we were going to build this out and get the funding, in the summer of 2021, right around that time, Fivetran was looking at how it could become better at database replication, and specifically on-prem database replication.
Fivetran has been incredibly successful in building out the platform based on software as a service and getting into databases, with rich support for a variety of databases, but recognized that to support some of the most mission-critical and largest databases, specifically ones that still reside on-premises, you need an agent technology. To support that, of course, there's the build-versus-buy decision, and I think in the summer of 2021 that culminated in a buy decision. The combination of these technologies, and also the presence that HVR already had in the enterprise, I think was very enticing to Fivetran, which had grown very large out of the commercial and growth market but had limited presence in the enterprise at the time of the acquisition. So it was an incredibly complementary acquisition from that point of view, even though we played in the same space and even though we essentially solved a very similar problem for our customers.
00:09:30
Great. So two follow-up questions on that. Is HVR still an agent-based collector mechanism that you would download and place next to your databases to collect those CDC logs and push them into a data warehouse, or has HVR itself moved towards a more managed SaaS kind of model? And if not, when can customers expect that?
00:09:54
Great question. So we are actively integrating our technologies. With HVR, of course, as I mentioned, we recognized we needed to become a managed service. Our goal as Fivetran is to offer extract and load, followed by transform, as a service. We believe that when customers have a choice between managing their own environment and consuming it as a service, with all else being equal, including the cost, they will always go for the solution that gives them the least amount of work, which is the service. So we are integrating the HVR capabilities into the managed service. Now we're in a transition period. We've started this work. We have the initial deliverables out of this already in a beta state at this point, with high-volume agents. The agent is what the customer will need to download and install in their environment. But beyond that, it's a fully managed service. Until we get further in our integration, we still also offer the HVR software as a download for customers to install and use. And of course, we continue to support our existing customer base as well.
00:11:12
So I'm sure a lot of people wouldn't appreciate the difficulty of these challenges, so let me ask you two follow-up questions. Why was it so hard to replicate the DBMS data, and how did HVR go on to become so good at it? Let me ask it in a very naive way: why can't one write a simple Python script to batch-replicate that data from the source system and push it into a DBMS system? What was the technical complexity around that?
00:11:43
I think that's a great question, because indeed that goes to the guts of why HVR was in a position to become so successful in large enterprises. I think that's where we took the time to invest in what I would call binary log readers. If you look at the different database technologies, many relational databases like Oracle, SQL Server, Postgres, and MySQL will have an interface to perform change data capture. The capability is available from the database vendor, and it uses the database's resources to perform the function. You can use that as the foundation for a change data capture or replication solution. There are quite a few vendors out there who've done this, and Fivetran had also done this. Now, when I say HVR went beyond that by building a binary log reader, I mean that for select platforms, or actually for quite a few platforms, we went all the way to essentially parsing the transaction log records based on direct file access. So running completely outside the database, directly accessing the file system, looking at what changes in the file when we insert a row, when we update a row, when we delete a row, et cetera. I think it's been that investment that's been so incredibly valuable, because it ultimately allowed us to offer the most efficient way to get the data set with the least possible impact. Our experience is that when you use some of the native solutions, you're gonna have to live with whatever resource consumption those solutions require, and you're gonna have to live with, let's say, the limitations that are imposed upon you by whatever the solution provides. At the end of the day, no database vendor is providing the solution to make it very easy to migrate off of their platform to somebody else's. So there are gonna be limitations.
We do experience that in some cases resource consumption is relatively high, for various reasons. By building our binary log reader we managed to circumvent those disadvantages and have full control over the solution ourselves, and ultimately to get into those environments that are the absolute most critical systems. Think of a major enterprise, call it CPG, call it manufacturing, call it finance, where transaction processing is so core to the primary business process that if you were to slow that down, it's gonna hurt the business. Customers trusted us to put our solution in place to essentially unlock that data and move it into an analytical solution in a modern data stack, consolidate other data sources with it, and build these machine-learning algorithms at the end of the day.
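For contrast with the log-based approach described above, here is a sketch of the "simple Python script" from the question: polling the source table with a query on a last-modified column. The table, column names, and values are hypothetical, and SQLite stands in for a production database. Two weaknesses are visible, and the second is an editorial observation rather than something raised in the conversation: every poll is a query the source database has to execute, and a deleted row simply stops appearing, so the poller never learns about the delete.

```python
import sqlite3

# Hypothetical source table with a last-modified column, held in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "new", 100), (2, "new", 101)],
)


def poll_changes(conn, since):
    """Query-based capture: fetch rows modified after the last seen timestamp."""
    cursor = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (since,),
    )
    return cursor.fetchall()


# First poll picks up both rows.
first = poll_changes(conn, since=0)

# The source then updates one row and deletes the other.
conn.execute("UPDATE orders SET status = 'shipped', updated_at = 102 WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 2")

# The second poll sees the update, but the delete of id 2 is invisible.
second = poll_changes(conn, since=101)
print(first)
print(second)
```

A log-based reader avoids both issues in principle, because inserts, updates, and deletes are all recorded in the transaction log and can be read without querying the tables.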
00:14:49
So basically what you're saying is through this change data capture-based replication methodology, what you were able to achieve is have a replication of the database without actually making any query to the database itself. Is that understanding right?
00:15:03
That is right, except of course for the original historical load. For that, we take the existing data set by running a SQL query against the database. But beyond that, it would be purely based on the transactions that get committed against the table. We parse those out of the transaction logs, in many cases by directly accessing the file, and move them on to the destination.
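The two phases described here, a one-time initial load followed by replaying committed changes, can be sketched as follows. This is a toy model under stated assumptions: the change records are already decoded into dicts with hypothetical `seq`, `op`, and `row` fields, whereas a real log reader has to parse vendor-specific binary log formats, and the replica is just an in-memory dict.

```python
def initial_load(source_table):
    """One-time snapshot: copy the current rows, keyed by primary key."""
    return {row["id"]: dict(row) for row in source_table}


def apply_change(replica, change):
    """Apply a single decoded change record to the replica."""
    op, row = change["op"], change["row"]
    if op in ("insert", "update"):
        replica[row["id"]] = dict(row)
    elif op == "delete":
        replica.pop(row["id"], None)
    else:
        raise ValueError(f"unknown operation: {op}")


def replay(replica, log, last_applied_seq):
    """Apply changes newer than last_applied_seq; return the new position."""
    for change in log:
        if change["seq"] <= last_applied_seq:
            continue  # already applied, so re-reading the log is harmless
        apply_change(replica, change)
        last_applied_seq = change["seq"]
    return last_applied_seq


# Hypothetical source state at snapshot time.
source_table = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
replica = initial_load(source_table)

# Hypothetical committed changes, in commit order.
log = [
    {"seq": 1, "op": "insert", "row": {"id": 3, "name": "c"}},
    {"seq": 2, "op": "update", "row": {"id": 1, "name": "z"}},
    {"seq": 3, "op": "delete", "row": {"id": 2}},
]

position = replay(replica, log, last_applied_seq=0)
print(position, replica)
```

Tracking the last applied sequence number is the design choice that lets the replay resume after an interruption without applying a change twice.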
00:15:29
Okay, so basically, apart from the initial sync, all the further syncs happen through the binlog files that are being generated. That means no impact on the original database when it comes to running those batch jobs; any bulk operations that are being run will not slow down your database. If it's so magical, what are the limitations of CDC? If you're able to get all those CRUD operations (create, read, update, and delete) through binlogs, what are the limitations?
00:16:03
So what are the limitations? There are always challenges, right? So rather than calling them limitations, I call them challenges. The challenges include things like cloud-hosted systems, where you don't always have access to the underlying logs. We've come across customer scenarios where the customer says, look, my production environment is so critical, I do not want any impact, no third-party software on this system. We developed solutions to capture from a physical standby, or from an environment where we had access to only the backups of the log, so that we'd have additional latency, of course, but still the benefits of ongoing change data capture with incredible amounts of change volume hitting the database. So those are challenges. But then there are additional challenges: relational databases provide native, or transparent, data encryption methods, and you have to be able to cope with those to unlock the data. We see a lot of SAP environments, and SAP tables can be very complex in the table structure in and of itself. In long-time SAP ECC environments we also see cluster and pool tables, where the structure of the data in the database does not match the structure of the data per the application, and some decoding needs to happen along the way. We built decoders for that. So there are numerous challenges that we have to overcome to continue to successfully enable organizations to unlock data in these busy relational database systems.
00:17:53
Right. So we talked about the technical challenges that HVR saw. Let's talk about Fivetran now. Help our audience understand why it is so hard to replicate SaaS data, which Fivetran does so amazingly well right now. Why do you think that's a technically challenging problem?
00:18:12
Yeah. So I think it's a technically challenging problem maybe not so much to initially get it to work. The big challenge, and I think Fivetran has an excellent model for addressing this, is to make sure that it continues to work. What happens is that the source application is, of course, software as a service, and these applications continue to evolve: new entities and new tables arise in the system, maybe existing definitions change, and as a result data pipelines might break. So what Fivetran provides is a managed service where, first of all, as an organization, we look at how we believe the customer can get the best value out of accessing the software-as-a-service application. We will define essentially what the schema will look like for the organization. Then the customer gets to select what objects and entities they want, and they can start replicating. But then it's Fivetran's responsibility to make sure that it continues to work. When the API changes, we will make adjustments without the customer having to get involved. And because we predefine the model for the different systems and different use cases, we also provide open-source packages in dbt that can be the source for the customer to then consolidate that data downstream with similar data sets from other sources. If you use different marketing sources or different ad services, for example, we have a predefined model there, with essentially pre-available packages that model into that world. So, to summarize, I think what's valuable in this world is not only to get it to work in a way that's quick to get up and running, but to continue to run it and maintain it going forward. I think that's arguably almost the bigger challenge than doing the initial data sync.
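As one illustration of the "keeps working when the source changes" problem, the sketch below loads a batch of API records and widens the destination schema when a previously unseen field shows up, instead of failing. This is not Fivetran's implementation; the function name, the list-of-dicts destination, and the choice to backfill older rows with None are assumptions made for the example.

```python
def sync_batch(dest_columns, dest_rows, records):
    """Load records, widening the destination schema when new fields appear.

    Returns the list of columns that had to be added during this batch.
    """
    added = []
    for record in records:
        for field in record:
            if field not in dest_columns:
                # New field in the source: add a column rather than break.
                dest_columns.append(field)
                added.append(field)
        dest_rows.append({col: record.get(col) for col in dest_columns})
    # Backfill rows loaded before a column existed so every row is complete.
    for row in dest_rows:
        for col in dest_columns:
            row.setdefault(col, None)
    return added


# Hypothetical destination state before the sync.
dest_columns = ["id", "email"]
dest_rows = [{"id": 1, "email": "a@example.com"}]

# Hypothetical API response: the third record carries a new "plan" field.
records = [
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "c@example.com", "plan": "pro"},
]

added = sync_batch(dest_columns, dest_rows, records)
print(added, dest_columns)
```

Handling added fields is the easy case. Renamed or removed fields, changed types, and API version changes need explicit decisions, which is part of the ongoing maintenance effort described above.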
00:20:13
Okay, so that brings me to the next question. There is a new set of tools that has emerged recently, built on top of open-source technologies: the open-source ETL alternatives. A few of them have emerged recently and are getting a lot of traction, actually. One of their core value propositions, when they talk about open source, is the exact problem that you talked about: maintainability of the long tail of connectors, which is very hard for one particular organization to keep up and maintain. So the USP, the core value proposition, is democratizing the maintenance of these connectors, moving it from individual maintenance to a whole community that can benefit from the active maintenance of those connectors. So my question to you is, what is your thought on open source as the future of ETL?
00:21:10
Yeah. So if I take a step back, throughout my career I've been around the data analytics, ETL, data integration and replication space. There's always been, and there still is, a good amount of essentially in-house development happening around it, whether it's SQL scripts or Python scripts or something else. But I think what organizations have underestimated, and potentially continue to underestimate, is the effort that goes into maintaining it. Now, specifically on open source: there are lots of examples, of course, of open-source technologies that did catch on and have become incredibly successful. There are also examples out there of open-source technologies that have not been quite so successful and fizzled out over time. And of course, the jury is out on where we're gonna go with ETL. There is also the larger economic point of view, and market conditions, et cetera, that might drive different behaviours at different times. But I think one thing to recognize in closed source versus open source is: where is the responsibility to make sure that it works when it has to work? In the closed-source world, certainly from a Fivetran perspective, it's our responsibility to make sure that it continues to work. And I agree, there is a long tail of connectors. We have some solutions in our pipeline to deal with those, so by all means look forward to announcements from us in that regard. But yes, I recognize that in the open-source world there are readily available platforms that you can start utilizing to build this. I think the challenge is this: say you're an organization that relies on a particular connector to, let's say, close the books, or to determine what the next marketing campaign is that you're going to run, and now suddenly your connector no longer works.
Then you have to rely on an open-source community to help you address that problem. It's a big challenge. It's much easier in a scenario like that to be able to reach out to a vendor and essentially have the responsibility lie with them to address the problem. And I think where the jury is out is on that maintenance effort: how reliable is it going to be to fall back on a community-maintained connector? So I think that's where it's at. Obviously Fivetran, certainly at this moment, is closed source, and we take a lot of responsibility for the reliability of our platform. We believe that's one of our differentiators. But yes, that's where I see the market.
00:24:02
Does that mean that if the open-source community, and open-source platforms in general, are able to solve the problem of reliability of those connectors, Fivetran would see open source as a threat?
00:24:16
Well, on the one hand, of course, we'd see open source as a threat, right? If all else is equal and you can get something that you don't have to pay for versus something that you do have to pay for, then absolutely there is a significant threat there. Organizations are gonna gravitate toward the lower-cost solution. So it's our challenge to be able to provide value-add on top of the open-source replication, or the CDC component, of the long tail of extract and load. Now, all of that said, look at the grand scheme of the modern data stack and what it has enabled organizations to do. If you think back a couple of decades, there were concepts similar, to a large degree, to the technologies and concepts that are available today, but they were only available to a select few organizations because of the cost of the initial investment and the effort to build it. A lot of investment had to go into a platform like that. With the modern data stack, consumption-based pricing, and scalability essentially available on demand, very valuable data services are readily available in the cloud at scale using a pay-as-you-go model. A lot of that has allowed organizations that traditionally couldn't afford to build enterprise data warehouse systems or analytical environments to now do that. That's democratized the space, if you like. I think there is a lot of room in this space, thanks to the modern data stack, for new opportunities.
And from that perspective, we welcome technologies that help activate and enable those kinds of environments, because we think there are still a lot of green fields, if you like, of organizations that have not traditionally been in a position to leverage data analytics solutions and now can. Open source can help unlock that and essentially get these organizations started. From that perspective, we welcome open source as a competitor.
00:26:39
That's a very fair point, Mark. So that leads to my next question. Fivetran recently launched the metadata API, and you talked about Fivetran not just enabling the ETL part but being the enabler for the entire value chain within the modern data stack. ETL is always the entry point to the modern data stack, right? Once you have the data, that's where you make sense of it. Let's talk about a couple of things. First, what are your thoughts on reverse ETL? We are seeing a lot of ETL companies also working towards bringing in reverse ETL capabilities, and that kind of makes sense. You have connectors to bring your data to a destination. Now you have reverse connectors that pull data from those destinations and push it back into the sources. You guys are experts in these connectors. You know the Shopify connectors, the Google Analytics connectors, or any other connector better than anyone else in the market. So tell us, what are your thoughts on reverse ETL, and should we expect reverse ETL as a functionality coming out of Fivetran anytime soon? A lot of your competitors already have it. Airbyte recently acquired a company called Grouparoo to get that reverse ETL functionality. So the market is asking for it. What are your thoughts?
00:27:48
Yes. Great question. And yes, reverse ETL is here. Even back in the independent HVR days, we used to say that eventually every destination becomes a source. That was kind of part of our mantra and how we looked at it. At Fivetran, we have dominantly focused on building source connectors to deliver the data into the modern data stack. That is currently our focus, and I think focus is incredibly important for an organization to be able to move forward. We are a sizeable presence in the market, but currently, and for the foreseeable future, we continue to work with partners from a reverse ETL perspective. We have no plans to deliver the data back into a number of our source connectors at any point.
00:28:38
So you talked about focus. That means that, at least for now, the focus is gonna be on ETL and delivering that data value. There are a lot of moving elements within the whole modern data stack. So you are not seeing any kind of immediate expansion of Fivetran into any of those adjacent categories within the modern data stack?
00:28:59
No. So you look at Fivetran, right, and you mentioned ETL. I do want to correct you in that we look at it from an ELT perspective. Now, I think part of the success of the modern data stack is the ecosystem. There is no single vendor who tries to grab everything, and I think that's part of what's made us successful. So we focus on the part that we think is our strength. Likewise, we see the Snowflakes, the Databricks, the BI technologies: everybody focuses on what they're good at. And it's the combination of options, with the open standards that we've agreed upon, specifically think about the SQL layer, that enables a lot of the technologies in the modern data stack and allows other organizations to come in and contribute to its value. Now, you made a reference earlier to the metadata API, and I think that is a very important area where we are doing what I would call our duty as part of the modern data stack: to provide that insight, to provide access to essentially lineage information through our platform. When data resides in a modern data stack destination, we know where that data came from, and we know, to the extent that there has been some limited transformation, what that transformation was. We can also provide that when you utilize the ability through our platform to make callouts to dbt, which is another technology that's part of the modern data stack. We provide data lineage, where we show you where the data came from, and we make that available through a metadata API. Organizations talk more and more about governance, data cataloguing and things like that. Companies need to understand, when they provide access to a particular data source, which users and which parts of the organization are supposed to have that access, and where all of this is documented.
So with the metadata API, we provide that component. Now, going back to your question about focus: are we planning to expand beyond our current focus? The answer is no. There are currently no plans. But we do want to make sure that we provide as much as we can and do our essential duty as a citizen of the modern data stack for our partners in this space.
00:31:34
Right, right. So before we wrap up today's episode, Mark, let me leave you with one last question. What do you think is next in ELT? What are some of those meaty problems in the ELT space that are yet to be solved, and that Fivetran is actively working on?
00:31:53
Well, I'll say three things. The first one is high volumes from complex on-prem data processing engines, think the SAPs of the world, and making those available through the managed service. So that's number one. Number two, as we already talked about, is the long tail of connectors. And the last one that I think is relevant in this space is to provide trust in the data: to provide organizations with, let's say, metrics about the quality of the data in the destination, and essentially the level of trust that these organizations need, whether it's for regulatory reasons or for essentially feel-good reasons. They need to understand, hey, is the data that we moved into the destination identical to the representation of the data in the source? And of course, in certain industries there is a mandate to provide that if, let's say, you want to close the books based on a replicated data set. So those three things, the high-volume complex environments, the long tail, and the trust in the data, are three areas that I think are in the not too distant future.
00:33:12
Wow. So thank you so much for letting us know that, and we wish you all the best to be able to lead the path for the industry around this whole data integration space. Thank you for your time on this episode. Mark, it was such a pleasure having you on the show.
00:33:26
Yeah, thank you. It was fun.