Right, right. Absolutely. Yeah. So what it is, as I mentioned, is change data capture (CDC), which means it taps into the transaction log of your database and extracts changes from it.
So whenever there's an insert, an update, or a delete, the CDC process will react to this event, which gets appended to the transaction log in your database, and it'll propagate the event to any downstream consumers. So just to take a step back there: all transactional databases have what's called a transaction log, like the write-ahead log in Postgres, the binlog in MySQL, or the redo log in Oracle.
You always have that for transaction recovery, replication, and so on, and it is the canonical source of changes in a database. So whenever something changes, an event will be appended to the transaction log, and log-based CDC, which is implemented by projects like Debezium, is a very powerful tool for reacting to those data changes.
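To make that concrete, a Debezium change event carries the old and new row state plus metadata about the operation. Here is a rough sketch in Python; the real Debezium envelope has more fields (schema information, a fuller source block), and the table and column names are made up for illustration:

```python
# Simplified sketch of a Debezium-style change event for an UPDATE.
# Field names "before"/"after"/"op"/"ts_ms"/"source" follow Debezium's
# event envelope; the row content here is hypothetical.
change_event = {
    "before": {"id": 42, "email": "old@example.com"},  # row state before the change
    "after":  {"id": 42, "email": "new@example.com"},  # row state after the change
    "op": "u",               # operation: "c" = create, "u" = update, "d" = delete
    "ts_ms": 1700000000000,  # when the connector processed the change
    "source": {"connector": "mysql", "db": "inventory", "table": "customers"},
}

# A downstream consumer can dispatch on the operation type:
def describe(event):
    op_names = {"c": "insert", "u": "update", "d": "delete"}
    return f"{op_names[event['op']]} on {event['source']['table']}"

print(describe(change_event))  # → update on customers
```

Because both the before and after state are present, consumers can compute diffs or build audit trails without ever querying the source database.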
And there are a few very important characteristics which come with this log-based approach. For instance, we will never miss an event, even if updates or inserts, or maybe an insert and a delete, happen in very close proximity. Sometimes people think we could also implement a query-based CDC approach, where we go to our database and, say, every minute we poll for changed data. But then, if within one minute something gets inserted and then deleted, you wouldn't even know about it, right? And you couldn't identify deletes to begin with. And then of course you could say, okay, let me poll more often, like every second.
But that would create a huge load on your database, and you still wouldn't be quite sure that you don't miss anything. All those problems go away with the log-based approach, and that's why I think log-based CDC is the way to go. And why is it important for data architecture? Well, people of course have large volumes of data in their databases, and they would like to react to changes with low latency.
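The polling pitfall described above is easy to simulate. In this sketch, the "database" is a dict and the "transaction log" an append-only list (both hypothetical stand-ins): a row is inserted and deleted again between two polls, so a snapshot-diffing poller sees nothing, while the log records both events.

```python
# Simulate query-based CDC (periodic snapshot diff) vs. log-based CDC.
table = {}        # current table state: primary key -> row
txn_log = []      # append-only log of every change

def apply_change(op, key, row=None):
    """Apply a change to the table and append it to the transaction log."""
    txn_log.append((op, key, row))
    if op == "insert":
        table[key] = row
    elif op == "delete":
        table.pop(key, None)

# Poll 1: take a snapshot of the table.
snapshot_before = dict(table)

# Between polls, a row is inserted and then deleted again.
apply_change("insert", 1, {"name": "alice"})
apply_change("delete", 1)

# Poll 2: diff the current state against the previous snapshot.
snapshot_after = dict(table)
polled_changes = set(snapshot_before) ^ set(snapshot_after)

print(polled_changes)   # → set(): the polling approach saw nothing
print(len(txn_log))     # → 2: the log captured both the insert and the delete
```

Polling faster only narrows the window; reading the log closes it entirely, because every change is appended there before it can be overwritten or undone.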
And just to give you one very common use case, it's taking data into a data warehouse like Snowflake, or maybe into Apache Pinot as a real-time analytics system. So you wanna do those analytics queries which you cannot do on your operational database, because it's not designed for that.
And now, of course, those analytical queries should work on current data, right? You wanna run your reports, you wanna run your real-time queries on fresh data, not on the data from yesterday. And this is why CDC is so important: it allows you to feed a system like Snowflake or Pinot or ClickHouse, or whatever it is, with very low latency.
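Conceptually, the consumer side of such a pipeline just replays the change events against the target store. A minimal sketch under simplified assumptions: the target is a plain dict standing in for Snowflake, Pinot, or ClickHouse, and the event shape is the reduced envelope from earlier (names are illustrative, not a real connector API):

```python
# Replay CDC events into a downstream analytical store (modeled as a dict).
analytics_store = {}

def apply_event(event):
    """Upsert on create/update, remove on delete, keyed by the row's id."""
    key = (event["after"] or event["before"])["id"]
    if event["op"] in ("c", "u"):
        analytics_store[key] = event["after"]   # keep the latest row state
    elif event["op"] == "d":
        analytics_store.pop(key, None)          # drop deleted rows

events = [
    {"op": "c", "before": None, "after": {"id": 1, "total": 10}},
    {"op": "u", "before": {"id": 1, "total": 10}, "after": {"id": 1, "total": 25}},
    {"op": "d", "before": {"id": 1, "total": 25}, "after": None},
]
for e in events:
    apply_event(e)

print(analytics_store)  # the store mirrors the source after the replay
```

In a real deployment the events would typically arrive via Kafka and the upserts would be batched, but the shape of the logic is the same: the target converges on the source state, change by change.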
So, for instance, I know some users in the Debezium community who go from MySQL to Google BigQuery, and they have an end-to-end latency below two seconds. So within less than two seconds, their data will be updated in BigQuery, and they can run queries on very current data there. And people realize that. Very often, what I've also observed is that users have one particular use case where they feel, okay, we would like to use CDC, we would like to have this low latency. And once they have done it, once they have seen, oh wow, I can have my data in two seconds, they want to have the same experience for other use cases. And they see, oh, I can also use it for, I don't know, streaming queries, for building audit logs, all this kind of stuff. And that's why people are excited about it.