Mar 07, 2023 · 34 min

S02 E03: Innovating the Modern Data Stack: Change Data Capture and Beyond with Gunnar Morling, Senior Staff Software Engineer at Decodable

In this episode of the Modern Data Show, Gunnar Morling discusses his interest in software engineering and databases and his recent move to Decodable, a real-time stream processing platform based on Apache Flink. He talks about the importance of cohesive data pipelines, from source to sink, and how his work with Debezium led him to become interested in stream processing. Gunnar also discusses how Decodable provides managed stream processing based on Apache Flink, ingesting real-time data streams, processing them, and putting the data into other systems.

Available On:
Google Podcasts
Amazon Music
Apple Podcasts

About the guest

Gunnar Morling
Senior Staff Software Engineer

Gunnar Morling is a software engineer and open-source enthusiast by heart. Before joining Decodable, he worked at Red Hat, where he led the Debezium project, a platform for change data capture (CDC). He is a Java Champion, the spec lead for Bean Validation 2.0 (JSR 380), a prolific blogger, and has founded multiple open source projects such as kcctl, JfrUnit, and MapStruct. Gunnar is based in Hamburg, Germany.

In this episode

  • Move from Debezium to Decodable
  • How stream processing works
  • Decodable and Apache Flink
  • About Change Data Capture
  • When CDC doesn't work


Hello everyone. Welcome to the Modern Data Show, where we explore the world of the modern data stack with leading experts and innovators. Today we are thrilled to have Gunnar Morling with us. Gunnar is the founder of Debezium, an open-source project that provides a platform for change data capture to stream data changes from the database into event streams. He's currently working with Decodable, a real-time stream processing platform based on Apache Flink, and was previously a software engineer at Red Hat. Gunnar has extensive experience in database technologies, event-driven architecture, and data integration, and has spoken at numerous conferences and meetups on these topics. We're excited to have him join us today to share his experience and insights on change data capture and stream processing. So without further ado, let's welcome Gunnar to the show.
Awesome. Well, thank you so much for the nice intro. This was very nice. I'm very happy to be here. Can I correct one small detail? I'm not the founder of Debezium; credit where credit is due. It was Randall Hauch who created the project. I took it over after some time and I was the project lead, but I did not initially create it.
Oh, thank you. Thank you so much for that clarification. So Gunnar, let's start with a very personal question for you. You have had an inspiring career so far, and the very first question I have for you is how did you first become interested in software engineering and databases, and what motivated you to pursue a career in this field?
Okay, wow. You are starting with the tough questions right away. I like that. So to be honest, I think it partly was even a coincidence. When I was going to school, I was interested in all kinds of things; at some point I even was considering becoming a journalist, for instance. And actually, I'm very happy I did not become a journalist. But yeah, then I learned about computers. I always had an interest in programming. I started to program when I was still quite young, and then I thought, okay, let me make a career out of it. But then, one step happened after another, right? So I explored some areas. I saw, okay, actually I'm interested in the data space. Let me go a bit there. And one thing led to another; it was not, like, a super planned trajectory, I would say.
After spending a good number of years at Red Hat and working on the Debezium project, you have recently moved to Decodable. Can you tell us a little bit more about Decodable and what inspired you to join that company?
Oh yeah. So Decodable is a startup in the data streaming space. So essentially what they do, or, well, what we do, is managed stream processing based on Apache Flink. So this means it ingests real-time data streams, let's say from Kafka, or something like Apache Pulsar or Kinesis, processes them, and then puts the data into other systems. So it allows you to do real-time data integration with all kinds of connectors using streaming queries. So you can filter your data. You can map your data, aggregate it, process it in time windows, and so on, as it arrives. So it's not batch driven. You don't have to go to your data and poll for your latest data. It all happens in a push-based manner in real-time. So that's what Decodable does in a nutshell. Now, why I was curious about it is, well, I was working on Debezium, as you mentioned, for quite some time, and after a while, well, I realized, okay, Debezium and change data capture – and I guess we will touch on it later on in more depth – is a part of a data pipeline. This concerns itself with the question, how do we take data in real-time out of a database into a data streaming platform like Kafka? And I was curious. Okay, I would like to explore the entire space. So I would like to explore how we can build cohesive pipelines from source to sink without having to integrate parts that don't quite fit. How can we make this one unified experience? I was curious about this processing part of course, because I saw in the Debezium community the need for that. It came up all the time. So people had their change data capture streams, maybe in Kafka. Now they wanted to do stream processing. And I was again curious, okay, how can we use Flink for this? How can we make a nice user experience based on all this stuff? So I was interested in this space. And I just felt, okay, this is something I would like to dive into in more depth.
And then also, to be honest, I had been working always in large companies until this point. So Red Hat, when I left, it was like 21 K employees, I believe. I had been working at even larger corporations before, and I felt, okay, I would like to have this small team, this small startup experience, like everyone on deck, everyone pulling in one direction. Everybody is super energetic, and that's what I wanted to explore. And well, then I learned about the opportunity to join Decodable and I felt, okay, this is really what allows me to do this. It allows me to go into this small space and, hopefully, make an impact on this small team, and that's why I'm here. And so far, I've been very happy.
Right. And what about stream processing captured or piqued your interest as an engineer? What makes stream processing such a hard thing? Why isn't it a solved problem yet? What makes it tough?
Right. I mean, so I think at the core there is the stateful nature, which makes this hard. So let's say you have two data streams and you would like to somehow correlate them. Now, this means you need to reason about those data streams in terms of some sort of time window. And the question is, how do you demarcate those time windows? How do you identify which data belongs together? How do you deal with data which arrives late? Right? This happens very often. You have some sort of time window-based processing, and you really want to be sure you only close a specific time window once you have all the data. And now what happens if maybe one hour later you still have an item for an earlier time window, which you already assumed to be fully processed? So those kinds of things. Then, well, there are no transactional semantics around it. In particular, when people come from a context of relational databases, they tend to think very strongly in terms of transactions. Maybe I'm doing some sort of data change, like I persist a purchase order. This reflects in changes in multiple tables in a relational database, and only once this transaction has been committed do all those changes become visible to other transactions. Whereas real-time stream processing systems don't have the notion of a transaction. I mean, at least not the commonly used systems, I would say. Which means establishing these transactional boundaries, that's an unsolved issue, I would say, and I'm looking forward to exploring that space, and hopefully we can provide some solutions there.
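To make the time-window problem concrete, here is a purely illustrative Python sketch of event-time windows with a toy watermark and a grace period for late data. All names and numbers are invented for this example; real engines like Flink implement watermarks and allowed lateness far more carefully:

```python
from collections import defaultdict

WINDOW = 60            # tumbling window size, in seconds of event time
ALLOWED_LATENESS = 30  # grace period before a window is considered closed

windows = defaultdict(list)  # window start time -> events in that window
late_events = []             # events that arrived after their window closed
watermark = 0                # toy watermark: "all earlier data has arrived"

def process(event):
    """Assign an event to its tumbling window, tracking late arrivals."""
    global watermark
    start = event["ts"] - event["ts"] % WINDOW
    if watermark >= start + WINDOW + ALLOWED_LATENESS:
        # The window was already assumed fully processed: the hard case.
        late_events.append(event)
    else:
        windows[start].append(event)
    # Advance the watermark, trailing the newest event by the grace period.
    watermark = max(watermark, event["ts"] - ALLOWED_LATENESS)

# The event with ts=15 arrives long after its window [0, 60) was closed.
for e in [{"ts": 10}, {"ts": 70}, {"ts": 65}, {"ts": 250}, {"ts": 15}]:
    process(e)
```

Here the event with ts=15 lands in late_events because the watermark has moved past its window; deciding what to do with such stragglers (drop them, send them to a side output, reopen the window) is exactly the design question raised above.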
Yeah. And how does stream processing work? Like, if you were to explain it to a budding engineer: how does stream processing even work? How is it different from a database? I had an opportunity to work with stream processing almost a decade back, when I was working with this tool called Esper CEP. So that was my first exposure to complex event processing. Right. How does it even work? Let's talk from a fundamental perspective. So you have this stream of data that's coming in. The most beautiful definition of stream processing that I've ever come across in my life is: in a traditional database, you have data on which you run a query, and in stream processing, you have a query through which you run that data. Right? How does that work?
I mean, the fundamental difference is the direction of this interaction, right? In a relational database setting – or any kind of classic database setting, it could also be like MongoDB or whatever – this interaction is pull-based, right? So this means if you have a query and you want to have new results, you need to rerun this query and it will give you a fixed result set, which is valid at this point in time. And in regards to how this query is executed, well, usually it runs in a pull-based way from the top to the bottom. So let's say you do something like select something from something, right? Just to give you a very basic example, we would first go to this select clause. We would identify, okay, that's the stuff we wanna select. And then we would go to this from clause and we would fetch some, maybe even one, item from that from clause. And maybe then in a loop, we would process more and more, until our query is complete. So maybe we have like a limit clause, where we would only fetch as many results as we need to satisfy this limit. Otherwise, we would do it until we have fetched all the results. Whereas, exactly as you say, for a stream processing system, it's the other way around: in those kinds of systems, query evaluation happens in a push-based way. So this means in this simple example, we would start at the bottom. So essentially, when there's a new row which satisfies this particular clause, this would then be pushed upwards to the select clause. The select clause would project whatever data it needs and push it to the consumer. So it is, as you say, like a query you can subscribe to, and you will be notified whenever there's a new result coming in.
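The push-based evaluation described here can be sketched as a query object you subscribe to; this is a hypothetical toy in Python, not any real streaming API:

```python
# A "query" that evaluates bottom-up: each arriving row that satisfies the
# WHERE predicate is pushed through the SELECT projection to subscribers,
# instead of consumers re-running a pull-based query.
class StreamingQuery:
    def __init__(self, predicate, projection):
        self.predicate = predicate    # stands in for the WHERE clause
        self.projection = projection  # stands in for the SELECT clause
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def on_row(self, row):
        if self.predicate(row):
            result = self.projection(row)
            for callback in self.subscribers:
                callback(result)

# Roughly: SELECT name FROM orders WHERE amount > 100, as a subscription.
query = StreamingQuery(lambda r: r["amount"] > 100, lambda r: r["name"])
results = []
query.subscribe(results.append)

query.on_row({"name": "a", "amount": 50})   # filtered out
query.on_row({"name": "b", "amount": 150})  # pushed to the consumer
```

The consumer never re-runs anything; it is simply notified as qualifying rows arrive, which is the inversion the definition above describes.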
Right, so there would be a lot of state management internally to manage the state of that query. Talk to us a little bit more about the state management. How does that happen?
Yeah, I mean, so again, as you say, those operators can be stateful. So let's say we want to do an aggregation, like, I don't know, the aggregated revenue of my purchase orders per category and maybe by day, something like that. So now this means if a new purchase order comes in, which would then sit at the very bottom of this query, we need to identify, essentially, which are those buckets in this grouping which need to be updated. So we need to keep all the information. Or let's say we do some sort of join. Well, then we also need to essentially keep the state of this join so we can add more rows to it. Now, how is this implemented? Well, typically some sort of state store is used underneath those systems. So let's say in the case of Kafka Streams, and also in Flink, you could use RocksDB as a state store, which then allows us to retrieve that state in an efficient way.
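The aggregation example above ("revenue per category per day") boils down to keyed state; in this minimal sketch a plain Python dict stands in for a state store like RocksDB, and the field names are invented:

```python
# Keyed state behind a streaming aggregation: each incoming purchase order
# updates only the bucket (category, day) it belongs to, and the updated
# aggregate is emitted downstream.
state = {}  # (category, day) -> running revenue

def on_purchase_order(order):
    key = (order["category"], order["day"])
    state[key] = state.get(key, 0) + order["amount"]
    return key, state[key]  # emit the updated aggregate downstream

on_purchase_order({"category": "books", "day": "2023-03-07", "amount": 20})
on_purchase_order({"category": "books", "day": "2023-03-07", "amount": 15})
on_purchase_order({"category": "games", "day": "2023-03-07", "amount": 60})
```

A real state store adds persistence, checkpointing, and efficient lookup on disk, but the access pattern is the same: find the bucket for the incoming record, update it, emit the new value.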
Right. And Apache Flink has been out there for a while, right? And you just mentioned Decodable is a platform that is built on top of Apache Flink. So is Decodable like a kind of managed version of Apache Flink? That's a common question that you often get, right? Is this a managed service? Is it just Flink packaged for enterprise applications? Or is it different from that? And if so, how is it different?
Yes. So it is a managed Flink in a sense, but then also with a special twist to it. And to take a step back there, in Apache Flink, you have essentially multiple ways for how you can interact with Flink, or how you can define your stream processing jobs with Flink. There are programmatic APIs, which you could use with Java or Scala; there's even a Python API. And then there's also Flink SQL, which is a SQL layer on top of Apache Flink. And actually, this is what we right now use in Decodable. So the user experience in Decodable is fully based on Flink SQL. So this means you define your stream processing jobs in standard SQL, and then the Decodable platform takes care of taking that SQL definition and, essentially, deploys Flink jobs in a cluster based on that. But that's the part you don't need to care about. So you solely care about defining the logic of your query: what are the kinds of data transformations, joins, grouping, and whatever your use case requires. So that's what you think about. You connect it, of course, with your source and sink connectors. So you can take data from something like Kafka and write it into, I don't know, Snowflake, let's say, or Elasticsearch. But then how this all is executed via Flink, this happens underneath. So in that sense, it's not a managed Flink service where you would have the Flink admin UI and you could use that. You interact with Decodable via Flink SQL.
And do you also extend Flink SQL, in terms of introducing some different kind of event processing, like a new rolling time window function or something like that?
Yes. Right. So there are a few specific additional UDFs (user-defined functions) which are provided there on top. What we right now don't have is the ability for the user to provide their own custom UDFs; that's a requirement which comes up now and then. We are exploring it, but it's not something which is there right now. If a customer has a specific need and we feel this makes sense for the user base at large, we would add such a UDF and then everybody can use it. It's not something which you could do yourself. But again, it's an area which is under exploration.
Very interesting. And can you provide some examples of real-time applications or use cases that Decodable is well suited for? Any interesting or amazing use cases that you have come across, of customers using Decodable?
One, of course, which I'm personally interested in, is just everything related to CDC, change data capture, and reacting to data changes in a database in real-time. So this definitely is a popular use case for Decodable. And people use this, for instance, to join multiple CDC streams and emit one joined aggregate structure to something like Elasticsearch. Because if you have a data structure which is like a purchase order – which is like a hierarchical structure of multiple items – you want to have this in a single document in your search index in Elasticsearch. And using Flink SQL and Decodable, for instance, allows you to join multiple table-level CDC streams into a single document. So that's a common use case. Then, of course, just filtering your data. Maybe you have sensitive columns in your database which you don't want to expose to external consumers. So you could mask those values, you could drop those values. Maybe you wanna establish some sort of data contract. Maybe there's a schema change in your source database and you want to shield your consumers from that. You could use Decodable to establish a data contract which gives a stable interface based on that raw stream. So that's CDC in the widest context, but then of course there are many other, let's say, business use cases. Something like fraud detection. Maybe you have a stream of transaction data which comes in via Kinesis or Kafka. And you look at the location of where those credit card transactions took place, and maybe I see your credit card has been used in Germany, and in the next minute it has been used in some store in the US. This probably is fishy, right? So maybe your card has been stolen or your number got revealed. Then we could do this kind of pattern matching, this kind of analysis, on this real-time transaction stream.
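The Elasticsearch use case, joining table-level CDC streams into one nested document, can be sketched like this; it is a hypothetical illustration with invented table and field names, not Decodable's or Flink's actual API:

```python
# Two table-level CDC streams (orders and order lines) are joined into a
# single nested document per order, with a sensitive column masked out,
# ready to be indexed into something like Elasticsearch.
orders = {}     # order id -> latest order row
lines = {}      # order id -> list of order line rows
documents = {}  # order id -> joined, denormalized document

def rebuild_document(order_id):
    order = orders.get(order_id)
    if order is None:
        return  # line arrived before its order; join completes later
    doc = dict(order)
    doc.pop("customer_email", None)  # drop a sensitive column
    doc["lines"] = lines.get(order_id, [])
    documents[order_id] = doc

def on_order_event(row):
    orders[row["id"]] = row
    rebuild_document(row["id"])

def on_line_event(row):
    lines.setdefault(row["order_id"], []).append(row)
    rebuild_document(row["order_id"])

on_order_event({"id": 1, "total": 99, "customer_email": "x@example.com"})
on_line_event({"order_id": 1, "item": "widget", "qty": 3})
```

Each change event on either stream re-emits the joined document, which is the same continuous-join behavior a Flink SQL join over two CDC streams provides.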
And then, I don't know, send, for instance, those suspicious transactions to another topic. There could be an application which consumes that and then raises an alert or sends you a text message, something like that. So we see that a lot. What else is there? IoT, I would say, is also a big thing, right? So you have sensor data, I don't know, temperature measurements. Or let's say here in Hamburg, where I live in Germany, we have a network of bicycle counting stations. So whenever a cyclist comes by, you will be able to react to this event via MQTT. So you could take that data and, I don't know, visualize it on a map. So you have like a heat map of where most cyclists are, and all this kind of stuff.
Wow. Very interesting. Very interesting. Talking a little bit about Debezium, right? So can you provide a brief overview of what change data capture is and how it works and more importantly, why is that an important concept in the whole modern data architecture?
Right, right. Absolutely. Yeah. So what it is, I mean, as I mentioned, it's change data capture (CDC), which means it taps into the transaction log of your database and extracts changes from that. So whenever there's an insert or an update or a delete, the CDC process will react to this event, which gets appended to the transaction log in your database, and it'll propagate this event to any downstream consumers. So just to take a step back there: all transactional databases have what's called a transaction log, like the write-ahead log in Postgres, the binlog in MySQL, or the redo log in Oracle. You always have that for transaction recovery, replication, and so on. And this is the canonical source of changes in a database. So whenever something changes, an event will be appended to the transaction log, and log-based CDC tools, as implemented by projects like Debezium, are a very powerful way to react to those data changes. And there are a few very important characteristics which come with this log-based approach. So for instance, we will never miss an event, even if updates or inserts, or maybe an insert and a delete, happen in very close proximity. Because sometimes people think we also could implement a query-based CDC approach, where we go to our database and, like, every minute we poll for changed data. But then, well, if within one minute something gets inserted and something gets deleted, you wouldn't even know about it, right? And you couldn't identify that something got deleted, to begin with. And then of course you could say, okay, let me poll more often, like every second. But that would create a huge load on your database and you still wouldn't be quite sure that you don't miss anything. And all those problems go away with the log-based approach. So that's why I think log-based CDC, that's the way to go. And why is it important in a data architecture?
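The insert-plus-delete scenario is easy to see in a small sketch; this toy Python model contrasts query-based polling with a log-based reader (everything here is invented for illustration):

```python
# A table plus its append-only transaction log. An insert and a delete both
# happen between two polls: comparing snapshots shows nothing, while the
# log records both events.
table = {}   # current table state, what a poll-based query sees
txlog = []   # append-only transaction log, what log-based CDC reads

def insert(key, value):
    table[key] = value
    txlog.append(("insert", key, value))

def delete(key):
    table.pop(key, None)
    txlog.append(("delete", key))

snapshot_before = dict(table)
insert("a", 1)   # happens between two polls...
delete("a")      # ...and is gone again before the next one
snapshot_after = dict(table)

poll_saw_changes = snapshot_after != snapshot_before  # the diff is empty
log_saw_changes = len(txlog)                          # both events captured
```

The polling reader sees two identical snapshots and concludes nothing happened; the log reader sees both events, which is exactly the guarantee the log-based approach gives you.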
Well, people, of course, have large volumes of data in their databases, and they would like to react to it with low latency. And just to give you one very common use case, it's taking data into a data warehouse, something like Snowflake, or Apache Pinot, maybe, as a real-time analytics system. So you wanna do those analytics queries which you cannot do on your operational database, because it's not designed for that. And now, of course, those analytical queries should work on current data, right? You wanna run your reports, you wanna run your real-time queries on fresh data, not on the data from yesterday. And this is why CDC is so important, because it allows you to feed a system like Snowflake or Pinot or ClickHouse or whatever it is, with very low latency. So, for instance, I know some users in the Debezium community who go from MySQL to Google BigQuery, and they have an end-to-end latency below two seconds. So within less than two seconds, their data will be updated in BigQuery and they can run very current queries there. And people realize that. And very often, what I also observed is, okay, maybe users have one particular use case where they feel, okay, we would like to use CDC, we would like to have this low latency. And once they have done it, once they have seen, oh wow, I can have my data in two seconds, they want to have the same experience for other use cases. And they see, oh, I can also use it for, I don't know, streaming queries, for building audit logs, all this kind of stuff. And that's why people are excited about it.
Right. So is change data capture the underlying technology that supports database replication as well?
So I would say the technical foundation is the same, right? Because, I mean, of course, there are different ways replication works, but take something like logical replication in Postgres; it's exactly the same mechanism. Whenever something gets appended to the WAL, the write-ahead log in Postgres, this will be sent over to any replicas. And in that sense, Debezium, as a log-based CDC implementation, is like just another replication client, only that it then of course takes the data, formats it into a generic change event format, which is pretty much the same across all the Debezium connectors, and then sends it out to all kinds of systems, right? It could be another database, but it also could be a microservice, or something like Elasticsearch, a cache, and so on.
Right. So, glad you mentioned this generic change event format. How does Debezium capture data changes from a variety of data systems? Each database would have its own different log format. How do you manage that, and how do you even scale with that?
Yeah, that's a challenge. So, as you say, those formats and those APIs – how a connector would get data out of Postgres, out of MySQL, out of Oracle, SQL Server and so on – they are different. And until the arrival of powerful open-source CDC tools like Debezium, this definitely was a challenge, because as a user you would have to explore all those ways of ingesting changes. And it is not exactly trivial, depending on the specific database. So the good thing is, now with Debezium, which comes with support for a variety of connectors and databases, you as a user don't have to care about it, right? So the Debezium engineering team does this sort of original research, exploration, and engineering, and the goal is to provide one uniform change event. So for you as a user, it doesn't matter which database this event comes from, right? Is it from Postgres? Is it from SQL Server? It pretty much looks the same. And by the way, the core structure of the events, of course, resembles your table structure, right? So the core schema resembles the schema of your captured tables. And now what's interesting is that Debezium is becoming, and of course I'm biased, but I would say it's becoming a de facto standard there. Because what we are seeing is that other companies and other teams, which are not even part of the Debezium core project itself, also use the Debezium change event format and the Debezium connector framework. So, for instance, there is ScyllaDB, which is like an API-compatible implementation of Cassandra, and in general just a scalable NoSQL store. They implemented their CDC connector for Kafka based on the Debezium framework. So now they also emit the same Debezium change event format. And again, it then doesn't matter whether this event comes from ScyllaDB or from any of the built-in Debezium connectors.
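A simplified, hand-written version of such a uniform change event might look like the following; the real Debezium envelope carries more metadata, but the core shape (operation code, before/after state, source info) is what lets consumers stay database-agnostic:

```python
# A pared-down change event for an UPDATE, loosely modeled on Debezium's
# envelope ("c" = create, "u" = update, "d" = delete).
change_event = {
    "op": "u",
    "before": {"id": 42, "status": "pending"},
    "after":  {"id": 42, "status": "shipped"},
    "source": {"connector": "postgresql", "table": "orders"},
}

def apply_change(state, event):
    """A consumer applies events the same way regardless of source database."""
    if event["op"] == "d":
        state.pop(event["before"]["id"], None)
    else:
        state[event["after"]["id"]] = event["after"]
    return state

replica = apply_change({}, change_event)
```

Only the "source" block differs between databases; the consumer logic never has to branch on whether the event came from Postgres, SQL Server, or ScyllaDB.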
It's the same for YugabyteDB; they also have their CDC connector based on Debezium. And just lately, and I'm super excited about it, Google contributed their CDC connector for Google Cloud Spanner to the Debezium project itself. So now this is part of Debezium. And again, you get change events out of Spanner in the Debezium format. And I think this is just super powerful for users, because, I mean, you always try to converge on a single database, but it's never going to happen, right? Because you have different teams, different needs, different requirements. Maybe you're doing an acquisition; then all bets are off anyway. So you always have the situation where you have multiple databases, and one canonical format makes tons of sense.
Wow. Wow. And one important question that comes to mind is, how does Debezium handle schema changes? What happens when the schema of the underlying table itself changes? How does Debezium handle this?
Yeah. So the way it is handled is, essentially, all the messages which Debezium emits are associated with a schema. What this looks like depends a little bit on what kind of serialization format you are using for your messages. So let's say you are using JSON: you could embed the schema into every message, which makes those messages self-contained. But of course, it's also very verbose and generally has lots of overhead, because if you think about it, you don't change the schema of your database tables every day, right? So this means we would give you tons of messages with the same schema, which doesn't change. So it's not what I would recommend in production; definitely, people use what's called schema registries, and this means your messages then just contain the actual data, of course, and a reference, like an ID, to a schema which is in the registry. This could be used with JSON, but also, and I guess more often, this is used with Avro, or maybe protocol buffers, Google protocol buffers. Now, this means as a consumer, you just go to that registry. You received a message, it tells you, okay, this message uses schema 123. So you can go to this registry, get this schema 123, and I guess you would cache it locally, of course. And then you can interpret, let's say, this Avro payload. And those schema registries also come with schema evolution rules. So typically, if there was a schema change and we would push a new version which, let's say, isn't backwards compatible, maybe we drop a column, then this new schema version would be rejected. So that's one angle: using a registry which enforces certain kinds of rules. What I also often see, and what I would recommend, and this comes a little bit back to Decodable and Flink, is to use stream processing to shield your users from certain kinds of changes. So let's say you rename a column.
Well, in that case, what you could do is use stream processing, or maybe, if you are just in Kafka and Kafka Connect, you could use a message transformation in Kafka Connect, and you could use it to add back that field using its old name. So for some sort of transition period, you would have the same field under the old and new names in your messages. And then this would allow consumers to react to that renaming at their own pace, and then, I guess, you would define some sort of SLA where you say, after six months or something, you finally drop the old name. Until then, everybody needs to be migrated. But stream processing would allow you to handle schema changes in a more graceful way.
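The transition shim for a renamed column could look like the following sketch; in Kafka Connect this role would be played by a message transformation, and the field names here are made up for illustration:

```python
# During the transition period, copy the renamed field back under its old
# name so existing consumers keep working while they migrate.
def rename_shim(message, old_name="addr", new_name="address"):
    if new_name in message and old_name not in message:
        message = dict(message)  # don't mutate the caller's message
        message[old_name] = message[new_name]
    return message

out = rename_shim({"id": 7, "address": "Main St 1"})
# out now carries the value under both "addr" and "address"
```

Once the agreed migration window (six months, say) has passed, the shim is removed and only the new name remains.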
Right. And so CDC sounds like a bit of a fairy tale as of now. It looks like a perfect system that has low overhead and very little performance bottleneck. There has to be a catch. So what are the cases where it doesn't work?
Well, I mean, I would be lying if I said that there is no operational complexity to all of that. I mean, it already starts with the database. You typically need to configure your database in a way, for instance, that the transaction log is structured so we can extract changes out of it. So this means you need to, I dunno, reconfigure your RDS database, or you need to go to your on-prem DBA and tell them, hey, can you change those particular settings in the configuration, so our database, for instance, keeps more extensive information in the logs and we can interpret them. So that's the first challenge. Then another challenge is how you react to failure scenarios, and what you do then. So let's say your connector crashes and you need to restart it. This means, once you restart this connector, of course, you want to consume that change feed from the database from the point where you left off before. And the way this works with Kafka Connect is, there are offsets which are stored. And if we get restarted, we can go to this offset storage and we will know from where we need to continue to read. But those offsets are stored in intervals in Kafka Connect. So it doesn't happen for every record, which means if we crash and we didn't get to commit an offset, then we would read a few events another time, right? So we would have what's called at-least-once semantics. Oftentimes people wanna have exactly-once semantics. You can get this with Flink and Decodable, by the way. But that is a challenge. And then, just to name one very particular example: in Postgres, we use what's called a replication slot, which is like the logical identifier of this connection. And this keeps track of how far we have read the change stream. And now what can happen is, if you don't consume from such a replication slot, the database will not discard any WAL which comes after the latest offset which has been confirmed for that slot.
The reason is, well, if you want to continue to read from that replication slot at a particular offset, the database still needs to have the transaction logs for that particular offset. Now, if you have this slot set up, but then you don't consume from it for one week, let's say, because, I dunno, you stopped that connector and you forgot about it, well, the database will then retain all those transaction logs. And of course, this can be a problem: you might run out of disk space. So what I'm trying to say is, you need to add some sort of monitoring to it, so you are aware of that situation. And that, by the way, again, is the reason why I joined Decodable, because there I feel we can provide such an experience for you. So we could make this a part of Decodable and give you this sort of alerting function: if there's this connection which you configured, but you don't consume from it, and now there's a chance that WAL piles up in Postgres, you should do something about it.
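The monitoring he describes amounts to watching how far a slot's confirmed position trails the current log position. A toy check might look like this; the byte offsets and threshold are invented, and in real Postgres you would compare the slot's confirmed LSN against the current WAL position:

```python
# If a replication slot's confirmed position falls far behind the current
# WAL position, the database is retaining that much log on disk.
ALERT_THRESHOLD = 10 * 1024 * 1024  # alert once 10 MiB of WAL is retained

def slot_lag_alert(current_wal_pos, slot_confirmed_pos):
    retained = current_wal_pos - slot_confirmed_pos
    return retained, retained > ALERT_THRESHOLD

# A slot nobody consumed from for a week has fallen far behind:
retained, should_alert = slot_lag_alert(50 * 1024 * 1024, 2 * 1024 * 1024)
```

The point is not the arithmetic but that someone has to run this check continuously and act on the alert, which is exactly the operational burden a managed platform can take off your hands.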
Right, right. And now, as we inch closer to the end of the episode, one question I have for you is, how do you see the field of data management and stream processing evolving in the next few years? And what impact do you think this will have on organizations and developers? We have seen a lot of fast-moving changes happen in the modern data stack in 2022. What do you think is gonna happen now?
One of the things I'm excited about is that we are embracing more and more that data gets duplicated and denormalized, right? So, I mean, when I started my career, the aspiration always was: okay, this data sits in this database, and it should be highly normalized and there should be no duplication. This is the one system of record, and if you wanna have this data, you need to go to this particular database. And now I feel we are coming more and more to embrace that data needs to live in different systems. It needs to live in our operational system and in this real-time analytics store, so we can have like a real-time dashboard, this kind of stuff. Maybe we wanna push data to the edge, so we have read models of our data, I dunno, maybe in an embedded SQLite database, close to the user, so we can satisfy read requests with a very low latency, because we already have the data close to the user. And I feel managing all this, that's a challenge, right? So, of course, you need to keep this data in sync. This is where CDC comes into the picture; this is where stream processing comes into the picture. But you need to think about, I dunno, who is the owner of specific data? Who are my consumers? How do they talk to each other? Who do we need to notify if we wanna do a schema change? So there's lots of operational management to all of that. And I feel there's a need for, like, I don't know, maybe a cohesive platform which provides all of that. So I think that's a part which we're going to see. And then, yes, generally speaking, I feel like this trend towards low latency and real-time data is just continuing. So I feel people will move more and more to those kinds of streaming queries. Databases will have those CDC capabilities built in. I feel like in a few years it'll be odd if there was a database which wouldn't give you such an interface. So I feel those trends definitely continue and also speed up.
Perfect. Perfect. So Gunnar, thank you so much again for your time on this episode. It was such a pleasure having you on the show and learning all of these amazing concepts from you. So thank you for giving your time.
Absolutely. Thank you so much for having me. And if people have questions for me, feel free to reach out to me on Twitter; @gunnarmorling is my handle, and I'm very happy to exchange ideas and learn from you as well. Yeah.
Thank you. Thank you so much, Gunnar.