Jun 06, 2023 · 26 min

S02 E15 Data Source to API in Minutes with Matteo Pelati and Vivek Gudapuri, founders at Dozer

Prepare to be amazed in this episode as Matteo Pelati and Vivek Gudapuri, the brilliant minds behind Dozer, reveal their experience in pushing the boundaries of data management and analysis. By simplifying the process of data serving and allowing companies to create APIs quickly and efficiently, Dozer's approach sets them apart from the modern data stack. Their open-source approach allows developers to build custom operators and extend connectors, ensuring that Dozer can cover a wide range of use cases while still offering customization at each step. They also discuss the challenges they faced during the development of Dozer and how they are positioned to adapt to upcoming trends and developments in real-time data processing.

Available On:
Spotify
Google Podcasts
YouTube
Amazon Music
Apple Podcasts

About the guest

Matteo Pelati
Co-founder at Dozer
Vivek Gudapuri
Co-founder at Dozer

About Matteo Pelati: Matteo is a software architect, manager, and entrepreneur with 20+ years of experience leading teams and delivering big data analytics and machine learning products. He co-founded Dozer and previously held leadership roles at Goldman Sachs and DBS Bank. Matteo also designed big data analytics platforms at DataRobot and Singtel. With his extensive expertise in software engineering and management, Matteo is well positioned to make further impactful contributions to the technology industry.

About Vivek Gudapuri: As an experienced technology professional and entrepreneur, Vivek has a strong track record of building and scaling successful platforms from scratch. He has deep expertise in data and platform engineering and helped bring two startups to scale before co-founding Dozer. As Co-Founder of Dozer, Vivek is focused on building a data infrastructure backend that enables engineers to build highly scalable APIs from any data source. With his strong technical background and entrepreneurial spirit, Vivek is poised to continue making a significant impact in the technology industry.

In this episode

  • The Journey of Dozer's Co-Founders
  • The Core Functionality of Dozer
  • Dozer's Contrarian Approach to the Modern Data Stack
  • Conversation on the Dozer Platform and the Benefits of the Rust Programming Language
  • Discussion on Licensing and Success Stories

Transcript

00:00:00
Welcome to another episode of the Modern Data Show, where we have the pleasure of speaking to Matteo Pelati and Vivek Gudapuri, the co-founders of Dozer. With impressive backgrounds as software architects, managers, and entrepreneurs, Matteo and Vivek bring over 20 years of experience in software engineering and management to the table. Previously, they have been involved in groundbreaking projects that have pushed the boundaries of what's possible in data management and analysis. Now, as the co-founders of Dozer, Matteo and Vivek are leading the charge in revolutionizing the industry with innovative solutions. Join us today as we delve into their journey and gain insights from their vast knowledge and expertise in the field. Welcome to the show, Matteo and Vivek. Guys, let's start with the very first, basic introductions: tell us a little bit more about your backgrounds and how you got together to build Dozer.
00:00:52
Yeah, so we basically started our journey last year; it's been less than a year since we started. Vivek and I have been really good friends for the last 10 years. We have both been living in Singapore for 10 years, and we actually met during our first job here, when we were working for a Sequoia-backed company. We have been iterating over different ideas in these 10 years, and we were so convinced about Dozer that we decided to start something. A little bit about my background: I come from a mix of experiences between startups and financial services, always in the data space. In the early days I was part of DataRobot, which was quite successful. After that, I moved to DBS Bank, where I basically built the data team, and later on I moved to Goldman Sachs, where I was leading the data group for Asia Pacific. After these experiences, I saw a lot of opportunity in building something like Dozer. I proposed it to Vivek, and here we are, just eight months after we started.
00:02:18
Yeah. Personally, we came across this problem multiple times in our previous experiences, and we solved it in multiple different ways. We realized every company has to do some amount of data plumbing itself. There's a lot of innovation in the data space, and there are many companies that do very nice things, but when it comes to data serving, we still saw that companies end up building a lot of things from scratch, integrating several platforms to achieve the purpose of data serving. We wanted to productize that, so that's where we come in. About myself: I've known Matteo since we worked together almost 10, 12 years ago. I've mostly been in startups; I was involved with a few successful ones, mainly Series A and Series B, and some of them had an exit. I was also a CTO for a publicly listed company in Australia. I've held leadership and CTO positions and managed product, data, and engineering teams across Singapore, Eastern Europe, Vietnam, India, and several other places. That's a little about me.
00:03:19
And before we go all in and explore the details of Dozer, what's the elevator pitch for Dozer? Help us understand, in the simplest words, what does Dozer actually do?
00:03:31
Point us at a data source, get APIs in minutes. That's what Dozer can do.
00:03:35
Brilliant. Perfect. Let's start with the very first question: how did you identify the need for a solution like Dozer in the market? And what were the challenges you saw in real-time data processing that motivated you to build this product?
00:03:57
Yeah, I think I can take this, and I can tell it with an interesting story. Having worked at different organizations, the 'aha' moment came when I was at DBS. At DBS, we built an entire data API infrastructure layer that needed to connect to multiple source systems, do some real-time data processing, and prepare the data to be served through the mobile banking application. This seems like a very simple task, but in reality it's more complex than it looks. When you venture into something like this, you have to put a lot of different tools together and stick them together with custom code, and one of the most painful parts is maintaining all this infrastructure. And sometimes it is not really necessary to build such a complicated system: if you don't have that volume of data, you don't really need a fully distributed system. So that's when we started thinking about Dozer, saying: what if we could build something that simplifies the entire process? Instead of putting together all these pieces, we can create a solution, what we call a data API backend, that goes all the way from the sources down to the serving part. That's how we identified the solution. DBS was one case; I saw a similar case at Goldman Sachs, and Vivek encountered similar cases as well. So we said, maybe there is an interesting opportunity here.
00:06:02
Yeah, that's pretty interesting. Can you help us understand how this whole thing works? Because I believe there would be so many moving elements in this. You're talking about real-time data extraction from the source systems; you're talking about stuff like caching, scheduling, indexing. It's a pretty complicated stack, right? So walk us through how it works. What's the stack under the hood? Tell us a little bit about your database systems. How are you managing streams? How are you managing caching? How are you managing scheduling? Walk us through how all of these huge pieces work together to give end users a simplified API experience of just querying the source system directly.
00:06:59
At a very high level, when we talk about Dozer's architecture: we have a streaming SQL engine that can take in data as a stream, where you can write SQL on top of tables and columns just like you would query a relational database. The output of that is put into a cache built on top of LMDB. LMDB is very low-level and very performant, so we have built secondary indexes on top of it to power APIs for queries. So you can do MongoDB-style queries on top of pre-materialized data. That's the end-to-end flow of what we do. From a connector standpoint, you can control what data you want to move and how you want to move it, whether in real time or on a schedule; that's what developers can do. So with a simple configuration, you can say that you're moving data from several source systems, put SQL in place to combine the data and produce the data that needs to be cached, and you get APIs that are automatically generated from the output of the SQL, in terms of gRPC and REST. And because we control data movement end to end, all the types are controlled end to end as well, which is why we can also control the performance. Protobuf and OpenAPI documentation come out of the box because of that. So with a simple YAML configuration, you can combine and aggregate data and produce APIs. And each of these steps is configurable in such a way that you can implement custom components on top of it.
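To make that flow concrete, here is a minimal sketch of what such a configuration could look like. The key names, connector fields, and endpoint paths below are illustrative of the source-to-API shape Vivek describes, not Dozer's verbatim schema:

```yaml
# Illustrative Dozer-style configuration; field names are assumptions,
# not the exact schema.
connections:
  - name: orders_db              # a Postgres source system
    config:
      Postgres:
        host: localhost
        port: 5432
        user: app
        database: shop

sources:
  - name: orders
    table_name: orders
    connection: orders_db

sql: |
  -- streaming SQL: keep a running aggregate per customer
  SELECT customer_id, COUNT(id) AS order_count, SUM(amount) AS total
  INTO customer_totals
  FROM orders
  GROUP BY customer_id;

endpoints:
  - name: customer_totals        # exposed over both REST and gRPC
    path: /customer-totals
    table_name: customer_totals
```

From one file like this, the engine knows what to ingest, how to transform it, and which pre-materialized view each generated endpoint should serve.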
00:08:31
And this is quite a refreshing take on the existing modern data stack, right? Because what we're seeing in the modern data stack is a kind of Cambrian explosion of tools, which allow companies to collect data from their operational systems, put it into a data warehouse, build an analytics layer, put a BI tool on top, and then actually consume this data. This is quite a different take; it's a contrarian approach to the whole modern data stack, right? Why did you take this approach?
00:09:09
Okay, the way you brought it up is very interesting, because it doesn't necessarily go against it. The way we approach the problem is that you can source data from either data warehouses like Snowflake or Databricks, or from your source systems as well. So you have a choice: if you are okay getting data with a certain delay, you can get it from your data warehouse, but if you want real-time data, you get it from the source system. Since you brought up the modern data stack, it's interesting, because if you look at the data stack diagram, you see a lot of arrows going in, like ingestion tools such as Airbyte and Fivetran, and you see a few arrows going out, for example reverse ETL and dashboarding. What you don't see is a platform that is specifically designed to serve data and integrate it into customer applications. And that's where we fit very well in the data stack ecosystem. We take care of data wherever it is coming from, whether a data warehouse or a source system, and we make it very easy for a user to prepare the data to be integrated into a customer-facing application, with all the serving and API aspects handled.
00:10:42
Yeah. And how did you guys decide which parts of the whole data processing pipeline to handle in a plug-and-play fashion versus giving full configurability to the customers?
00:10:55
Yeah. All the way from the real-time SQL engine, we have made each of the SQL operators available. That's one of our motivations to be open source as well: developers can build custom operators, in WASM and in other languages down the line, to modify the functionality of the SQL. Similarly, we also made it possible to extend connectors, which means you can bring more connectors to the table for whatever you want to integrate with. And on top of that, from an API functionality standpoint, soon we will have a middleware functionality where you can customize API behavior, right? So each of these steps is customizable, but if you want something that already works in terms of what Dozer provides, you can do that with a simple configuration. That's the main reason for our open-source approach as well: we know we are taking on a big scope, and we would not be able to cover all the cases users want, so we've made each of these steps customizable.
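As an illustration of the extension pattern Vivek describes, a pluggable connector in a Rust codebase typically implements a trait that the engine drives. The trait and types below are a hypothetical sketch of that pattern, not Dozer's actual internal API:

```rust
// Hypothetical sketch of a pluggable connector; the trait and types
// are illustrative, not Dozer's actual API.
use std::collections::HashMap;

/// A single change event flowing out of a source system.
pub enum Operation {
    Insert { new: HashMap<String, String> },
    Delete { old: HashMap<String, String> },
}

/// Anything that can feed change events into the streaming pipeline.
pub trait Connector {
    /// Validate configuration and open connections to the source.
    fn initialize(&mut self) -> Result<(), String>;
    /// Push events into `emit` until the source is exhausted.
    fn start(&mut self, emit: &mut dyn FnMut(Operation)) -> Result<(), String>;
}

/// A toy connector that replays a fixed set of rows, e.g. for tests.
pub struct StaticConnector {
    pub rows: Vec<HashMap<String, String>>,
}

impl Connector for StaticConnector {
    fn initialize(&mut self) -> Result<(), String> {
        Ok(())
    }

    fn start(&mut self, emit: &mut dyn FnMut(Operation)) -> Result<(), String> {
        for row in self.rows.drain(..) {
            emit(Operation::Insert { new: row });
        }
        Ok(())
    }
}
```

The same shape applies to custom SQL operators: the engine defines the interface, and a user-supplied implementation (native Rust or, as mentioned above, WASM) slots into the pipeline.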
00:11:53
Got it. And how do you ensure that the latency of the data queries and API responses is as low as possible? Let's take an example. Let's say a customer has one of their operational systems, say Salesforce, as the source of record for their CRM data, right? From what I understand, Dozer would help them by giving them an API that lets them directly integrate with and query the Salesforce data, versus having to write a whole ETL pipeline to put all of this data into a data warehouse and then build an application layer on top of that. How do you ensure the latency in this case? Is the data fetched just in time with the query, or is there some scheduling under the hood that syncs all the data into an intermediary place, on top of which the APIs are served? How does that work, and what are some of the optimizations you have done on that part? Vivek, if you would like to take that.
00:13:03
Sure. What we actually do is, based on your SQL query, we move all the data into a cache, and your queries directly hit the cache. And because it's coming out of LMDB, the response times are really low. This is where we pre-optimize the queries, pre-materialize the views, and move the data into a very fast read-only cache. In essence, we take a copy of all the data, in the form of the transformations you have put in place. Based on the SQL, it becomes a full DAG processor, in a way: if we move data based on aggregations, we'll construct several nodes, and the sink of the processing pipeline is the cache. You can query the cache, and you can distribute the cache on several nodes or a single node, so we can easily scale the deployment as well. So this is the opinionated approach, where we don't actually query the source system; rather, we move the data into a cache, and the cache is very fast.
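For a feel of what querying that cache looks like, here is a hypothetical request against a generated REST endpoint, using the MongoDB-style filter syntax mentioned earlier; the endpoint path and field names are carried over from the illustrative configuration above:

```bash
# Hypothetical query against a generated endpoint; the path and fields
# are illustrative. The filter is answered from LMDB-backed secondary
# indexes on the pre-materialized view, never by hitting the source system.
curl -X POST http://localhost:8080/customer-totals/query \
  -H "Content-Type: application/json" \
  -d '{
        "$filter": { "total": { "$gte": 1000 } },
        "$order_by": { "order_count": "desc" },
        "$limit": 10
      }'
```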
00:14:02
Understood. And I also happened to notice that Dozer is built in the Rust programming language. Can you elaborate on why Rust was chosen as the foundation of your platform? What specific benefits or features of Rust aligned with your goals and requirements?
00:14:22
Yeah, this is a very good question, actually. Coming from a data background, data tooling has traditionally been based on JVM-based tools, Java or Scala. That comes with some pain: managing production software that runs on the JVM, with all the GC and so on, is quite painful. Rust gave us the opportunity to have incredible performance without the pain of the memory management of C++. And if you look at the landscape of what's happening in the data engineering space, people have started to realize this: you can squeeze out more CPU cycles and have a much more performant tool, and it's especially in the data space that you need these tools. On top of that, the new processors that we have, mostly the ARM-based processors, can give you core scalability. So today it's possible to run on fundamentally a single machine what would have required a distributed system before. That's why we believe Rust is going to be the future of data engineering, and that's the reason we decided to use Rust to implement Dozer.
00:15:59
And Matteo, a follow-up question: Dozer provides a comprehensive solution that takes care of the whole data movement from the source systems, the whole transformation, and the whole API generation, everything in a single layer. What would this mean for your customers? Let's say for people who today have separate tools, maybe Airbyte or Fivetran for data ingestion, then a data warehouse, and then an analytics layer, maybe Elasticsearch, to serve this data. What does it really mean for them? What would be your call to those people?
00:16:43
What it means for them is that it simplifies the entire development phase. Another important aspect is that in a bigger company there is always a time constraint between the data engineering team and the product engineering team. When the product engineering team requires some data, it has to depend on the data engineering team to get it, and that's where problems can arise. With Dozer, we empower a single engineer to build what needs to be embedded in the customer-facing application. You can source your data from a data warehouse or a source system, do all the transformations you want, and you automatically get your API; we do all the management. So fundamentally, we enable a single data engineer, or even a single product engineer, to own the entire cycle.
00:17:46
That's very interesting, and I think that's very powerful. Vivek, tell us a little bit more about the biggest technical challenges that you and your team faced while building Dozer.
00:18:00
Yeah. Just to follow up on the previous point, we are talking about what would typically take an entire ETL tool, an entire Elasticsearch, an entire streaming database; that's what Dozer solves end to end. Today there are vertical products, each solving one problem very nicely and comprehensively. We have chosen to build an end-to-end experience for developers, where we solve a part of ETL, a part of a streaming database, a part of a caching layer on top of that, and a part of API orchestration. So we have dealt with some of the more difficult problems, and it has been a challenge; being developers, we also quickly gravitate towards the hardest problems rather than what solves the product problem, so we had to consciously choose the right problems to take on. It has been a very interesting journey so far. We have built an entire streaming SQL engine, and we have dealt with some real database problems here: we have built secondary indexes and worked out how data is to be cached and queried, so there are query engine capabilities we have built, as well as how data types move around the system. We care about query latency as well as data latency, in terms of how fast the data moves. We cared about performance, we cared about user experience, and we had to solve some of these fundamental problems from scratch.
00:19:17
Just to follow up on that, Vivek, I saw in your TechCrunch article that Dozer takes an opinionated approach, right? Tell us more about that. What do you mean by an opinionated approach?
00:19:32
Yeah, this is what we just discussed as well, in a way. Today, for a company looking to build data APIs, let's say on top of several microservices, you would create an intermediate layer where data gets combined and stored, and on top of that you would use something else to build the APIs. Typically you would put Kafka or a message queue in place, a Redis, an Elasticsearch; several things come into play. This easily means a few months of effort and maintenance. Every time you change a schema on one end, you have to think about API versioning on the other end, so there are several moving parts. Dozer takes an opinionated approach: we want to simplify and unify all of this development, right? So a single developer with a single configuration can now say, I want to create a new version of the API, with a single command. You get two versions of the API deployed in a blue-green fashion on the cache, and you can migrate data in a very easy way. This is our opinionated approach.
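As a rough sketch of that workflow (the commands and flags below are hypothetical, meant only to illustrate the blue-green idea, not Dozer's documented CLI), a new API version is materialized into a fresh cache alongside the live one, and traffic switches only once it has caught up:

```bash
# Hypothetical commands; illustrative of blue-green API versioning,
# not Dozer's documented CLI.
dozer deploy -c dozer-config.yaml          # v1 goes live (the "blue" cache)
# ...edit the SQL or schema in dozer-config.yaml...
dozer deploy -c dozer-config.yaml          # v2 builds in a new "green" cache
curl http://localhost:8080/customer-totals # requests still served by v1
dozer promote customer-totals --version 2  # switch traffic to v2
```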
00:20:34
Thanks for that. Another follow-up question is regarding your license, right? Dozer is currently not quite open source, and you guys plan to switch to a dual license. Help us understand this license: why did you choose this licensing model, and what were the trade-offs you considered?
00:20:57
That's right. This was a debated topic, even within our community, and since then we have switched to the Apache 2.0 license for our core. So now it's fully open source, which is what the community wanted. The reason we chose the Elastic License v2 before was to give ourselves a head start, not to prohibit people from going to production. We always wanted people to be able to self-deploy and self-host in production; that was never a question for us. It was more to protect ourselves from the big players taking our code and running a parallel service. As for what is available today: the main Dozer core is fully available under Apache 2.0. You can take it, you can build on top of it, you can deploy it yourself; all of that is possible. We have a separate commercial offering called Dozer Cloud, where we scale, host, and manage your deployments for you. That is the commercial offering.
00:21:57
Understood, thanks for that. Matteo, are there any notable success stories or use cases of companies or developers who have already implemented Dozer that you would like to share with us?
00:22:09
Yeah. We are very early in our journey, actually; we only open sourced a few months ago. Nevertheless, the journey has been quite exciting. In just a couple of months, we got to a thousand stars on GitHub, and we started getting traction with developers using it. Now we are working together with some of our open-source users, as well as talking about, let's say, POCs with enterprises, and these range from gaming companies to telcos to financial services. Obviously, I cannot name names here because we are still talking to them, but that's what we are working on, and we are happy about how things are progressing.
00:23:10
Amazing. As we inch closer towards the end of the episode, let me leave you guys with one last question. Looking ahead, how do you see the role of real-time data processing evolving in the future? What trends and developments do you anticipate will impact the industry, and how are you positioned to adapt to these changes?
00:23:36
Yeah, it's very interesting. Even among enterprises, which are usually the last ones to adopt new technology, we see increasing usage of real-time processing; enterprise companies are realizing the value of real-time processing. And for Dozer, that's actually important, because if you generate a dashboard, it's okay for it to be generated once or twice a day. But if you are starting to integrate data into your customer-facing application, then for a good part of that data, not all of it, it needs to be real time. Real time means that you have to be connected to your source systems, so that becomes extremely important. The second aspect is not just, let's say, the consumption and visualization of data, but the interaction with the source system itself. On that note, for example, we are releasing a new feature, which we have in beta right now, that we call a change data function. It basically allows an application to react to data changes and take actions: that could be sending messages, or it could be writing back to the source system, and this feature requires real time. What we think is that the moment you get closer to your customer, the moment you want to provide real value to your customer, you need to integrate with your data systems in real time. And that's why we believe, in general, real-time systems will keep growing.
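As a rough sketch of the idea behind such a change data function (the types and handler below are hypothetical, not Dozer's actual interface), an application registers logic that fires whenever a record in a materialized view changes:

```rust
// Hypothetical sketch of a "change data function"; the types are
// illustrative, not Dozer's actual API.
enum Change {
    Inserted { customer_id: u64, total: f64 },
    Updated { customer_id: u64, old_total: f64, new_total: f64 },
}

/// React to a change flowing out of the pipeline and take an action,
/// e.g. send a message or write back to the source system.
fn on_change(change: Change) {
    match change {
        Change::Updated { customer_id, old_total, new_total }
            if new_total > old_total * 2.0 =>
        {
            // e.g. notify the customer or trigger a write-back
            println!("alert: customer {customer_id} spend jumped to {new_total}");
        }
        Change::Inserted { customer_id, total } => {
            println!("new customer {customer_id} with total {total}");
        }
        _ => {}
    }
}

fn main() {
    // In a real deployment the pipeline would push these events;
    // here we simulate one to show the handler firing.
    on_change(Change::Updated { customer_id: 42, old_total: 100.0, new_total: 250.0 });
}
```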
00:25:34
I love that phrase, change data function; a good play on the term change data capture. Love it, and we'd love to give it a try one day. Guys, thank you so much. On that note, we'd love to close this episode for today. Thank you so much for joining us. It has been such a pleasure hosting you guys and learning more about Dozer. Wish you all the best, and thank you again for joining us. Thank you.
00:25:57
Thank you very much.