Sep 13, 2022 · 35 min

S01 E01: Understanding the data platform at Canva with Greg Roodt

Managing ETL processes, data infrastructure, and data teams is already an arduous task; now imagine doing all of this in a very data-intensive organisation! In this episode, Greg takes us through how he and his team have kept things rolling at Canva: the data platforms they use, their ETL pipelines, how the data team is structured, and much more.

Available On:
Spotify
Google Podcasts
YouTube
Amazon Music
Apple Podcasts

About the guest

Greg Roodt
Head of Data Platform at Canva

Greg Roodt is Head of Data Platform at Canva, the world’s leading online design and publishing platform, with a mission to empower everyone in the world to design anything and publish anywhere. He joined Canva in 2017 to set up the data engineering team and built the data platforms from scratch. Before Canva, Greg was the CTO and co-founder of AirHelp, a claims management company that promotes and enforces passenger rights in the case of flight disruptions, globally.

In this episode

  • Greg’s journey into data.
  • The data stack at Canva.
  • ETL and reverse ETL processes at Canva.
  • Managing stakeholder expectations.
  • How the data team is structured, and the skills to look for in new data hires.

Transcript

00:00:00
Hello, everyone. You're listening to the Modern Data Show. For today's episode, we have Greg Roodt joining us from Sydney, Australia. Greg is the Head of Data Platform at Canva, an online design and publishing tool with a mission to empower everyone in the world to design anything and publish anywhere. Canva needs no introduction: it's used by over 60 million monthly active users, and over 7 billion designs have been created on it. Before Canva, Greg was the CTO and co-founder of AirHelp, a claims management company that promotes and enforces passenger rights in the case of flight disruptions, globally. Welcome, Greg. We are super happy to have you on the podcast, and thank you for joining us today.
00:00:42
Thanks for having me, Aayush.
00:00:45
So Greg, before we jump into the specific ETL part of things, tell us a little bit about your journey from AirHelp to where you are now at Canva.
00:00:55
Yeah. My background is in large data processing systems and distributed systems; I've worked on them as a software engineer for most of my career. I was fortunate enough to co-found AirHelp as a technical co-founder and CTO. As you can imagine, that business deals with a lot of flight data, so we were ingesting flight data in real time into our platform. We were using Redshift at the time as well, so we were already entering that world of the warehouse. That went really well. After a couple of years I moved on, went travelling around the world, and this exciting opportunity at Canva popped into my inbox, and the rest is history. So about five and a half years ago I joined Canva here in Sydney, Australia, and set up what was originally called the data engineering team, and we built the platforms from scratch: very large-scale processing, as you can expect. And I've been fortunate enough to ride the wave of the modern data stack and hire a bunch of awesome people to help us build it.
00:02:19
Now that you mention the modern data stack, I think the audience would love to know a little bit more about the data stack at Canva. What platforms are you using for the various components of the stack?
00:02:33
Sure. So originally, data at Canva was a little bit siloed away from the rest of the product. My philosophy has been to treat analytics as code and to shift analytics as far left as possible, and we've been on a journey to do that. We used to have a separate data lake and a separate data warehouse, and that was frustrating for the data scientists and the data analysts: they needed the same data set, but one copy was in S3 in the lake and one was in the warehouse. Over the years we have consolidated everything into one unified, modern cloud warehouse. We use Snowflake, so Snowflake is our data platform. We are heavy users of dbt; dbt is the central nervous system of all of the transformation that we do. To get data into the system, we use Fivetran for third-party loading, and we have our own custom-built first-party loaders for our primary data stores and events. So we built a first-party data loading service and system that loads into Snowflake, and Fivetran also loads into Snowflake. From there, the warehouse platform team has built a framework around dbt that enables our analysts to model, manipulate, and transform the data through various stages. On the consumption side, we have Looker for our high-quality, curated business metrics, and for more ad hoc analysis and exploration, to empower users who know SQL, we've got Mode Analytics connected as well. Finally, we do a bit of reverse ETL: we operationalize some of this data and use Census to sync it out to various third-party destinations. So we've got that full end-to-end data stack.
00:04:44
Yeah, thanks, Greg. So there's a school of thought that reverse ETL is essentially an extension of ETL. I'd love to understand your journey here: how did you manage syncing the data warehouse back to those operational systems before using a reverse ETL tool? What was the process before adopting a dedicated tool?
00:05:13
Yeah, totally. It used to just be a bunch of bespoke scripts, and it was done a lot less frequently. Or you would have a silo: maybe your events would go into Segment or something, and then Segment would forward them on. So it was very ad hoc. You weren't really able to operationalize your enriched data the way you can now; before, we had very little operational usage of our data. It was very much, okay, let's build a dedicated pipeline. And we would always have marketing teams asking, can we get LTV into the platform? And it's like, okay, someone's got to write a processing job specifically to do that. Now that we're able to use the enriched data in the warehouse, you can build a table of, say, customer and LTV, and then sync it wherever it needs to go. It's wonderful how easy it has become.
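The pattern Greg describes boils down to reading an enriched table from the warehouse and pushing rows into an operational tool in batches. Here is a minimal sketch of that idea; the table name, column names, and helpers are hypothetical, not Canva's or Census's actual implementation:

```python
# Minimal reverse-ETL sketch: sync a hypothetical customer_ltv table
# from the warehouse into an operational destination, in batches.

def fetch_customer_ltv(cursor):
    """Read the enriched customer/LTV table built in the warehouse."""
    cursor.execute("SELECT customer_id, ltv FROM analytics.customer_ltv")
    return [{"customer_id": cid, "ltv": float(ltv)}
            for cid, ltv in cursor.fetchall()]

def sync_batch(records, send, batch_size=2):
    """Upsert records into the destination in fixed-size batches.

    `send` stands in for the destination call, e.g. an HTTP POST to the
    tool's upsert endpoint. Returns the number of batches sent.
    """
    batches = [records[i:i + batch_size]
               for i in range(0, len(records), batch_size)]
    for batch in batches:
        send(batch)
    return len(batches)
```

A real sync would also track state (which rows changed since the last run) and handle retries, which is exactly the undifferentiated work a reverse ETL tool takes off your plate.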
00:06:11
A couple of seconds back, you mentioned analytics as code, and I think that school of thought emerges from your software engineering background. A lot of companies these days are trying to apply the principles of software engineering to data engineering. What are your thoughts on that? Is it too far-fetched, or too much of a stretch, to apply software engineering practices to data engineering?
00:06:44
I definitely think it's the way everything is moving. We've seen something similar with the whole DevOps movement: there used to be historical silos between operations and software engineering, with things thrown over the wall, and the same was true for data engineering and analytics. We've actually broken down that silo as much as possible, so we no longer have a data engineering specialty at Canva. We've really tried to embrace analytics engineering, with end-to-end enablement. We do have analytics engineers who look after the more platform-centric pieces, but we also have analytics engineers embedded in the product groups, who work closer to the subject matter experts. We try to keep everything as code because it gives you history, it gives you an audit record, and it gives you the ability to share your work. It seems to be the direction everything is going: we had infrastructure as code, now analytics as code, and we're seeing data quality rules as code and data transformation as code. I think the days of visual UIs for wiring up your systems are behind us. People coming out of university are learning how to use Git, and everyone is becoming more and more technical. There's that saying that software is eating the world, and I think it's similar for data and analytics.
00:08:19
Amazing. Then, Greg, one specific thing I'd love to get your thoughts on is ETL. You said you use Fivetran as the main platform to sync data from various operational sources into the data warehouse. One recent trend in the industry is open-source platforms for data integration: we have tools like Airbyte coming up, an open-source ETL platform competing with Fivetran. One of the key value propositions of open source is tackling the long tail of connectors when building data pipelines. How often do you find yourself in a situation where you want to integrate one of your systems into the data warehouse but don't have a ready connector, and what happens in those cases?
00:09:30
Sure. I'm certainly of the opinion that engineers shouldn't write ETL; there's that blog post, I think it was from Stitch Fix, and we resonate a lot with it. Because Canva is a very engineering- and product-heavy organization, we originally thought, let's write our own integrations. We're software engineers, let's do this. And we did, but we ended up writing just three. It was an enormous amount of effort for those engineers, and it wasn't very interesting work for them. And it was like, well, we've got three; it's going to take us forever to do a hundred. So quite quickly we got business support to justify the expense of purchasing a service provider such as Fivetran, and once we switched we were immediately able to connect something like 20 connectors with hardly any effort. That mindset really enabled us. Presently we have over 70 connectors through Fivetran, and for the long tail there are only about five or six we've had to write ourselves. And it is incredibly easy with Fivetran, and I think with other providers too, because they have a Lambda connector: they'll call your Lambda function. So yes, you have to write a bit of code, but you don't have to do all the scheduling and the monitoring, etc.; they provide a bit of a framework. I think the expense of paying for a SaaS product like Fivetran, or the other vendors out there, is well worth it if you compare it to an engineer's salary. We are quite a small team powering quite a large organization, so we think Fivetran is value for money for us. We are able to tackle the long tail, but we might just be in a fortunate situation: we're a fairly typical SaaS company with all the typical SaaS connectors you might expect, the Salesforces, the Google Ads, et cetera. So we've got very common connections. In other industries you might have more unique needs, and then you might need to complement a paid product with open-source services you run yourself. We didn't have to do that; we tackle the custom stuff using Fivetran's framework, and so far everything has worked really well for us.
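To make the Lambda-connector idea concrete, here is a rough sketch of what such a function might look like: the sync service invokes your function with the last saved state, and your function returns new rows plus an updated cursor, while the service handles scheduling, retries, and loading. The response field names below follow my understanding of Fivetran's function-connector format and should be checked against their docs; the source call is a stand-in:

```python
# Sketch of a "function" connector: given the last saved state, return
# new rows and an updated cursor. Field names are an assumption based on
# Fivetran's function-connector format; verify against the official docs.

def fetch_rows_since(since):
    """Stand-in for the real API or database call."""
    data = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"},
            {"id": 3, "value": "c"}]
    return [r for r in data if r["id"] > since]

def lambda_handler(event, context=None):
    state = event.get("state") or {}
    since = state.get("cursor", 0)

    rows = fetch_rows_since(since)
    new_cursor = max([r["id"] for r in rows], default=since)

    return {
        "state": {"cursor": new_cursor},   # saved and passed back next run
        "insert": {"my_table": rows},      # table name -> list of row dicts
        "hasMore": False,                  # True would trigger another call
    }
```

The appeal is that the business logic stays small and stateless; everything operational lives in the managed service.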
00:12:14
That's a very interesting point I didn't know about: if you have a very specific connector, you can define the business logic in a Lambda and let Fivetran take care of the scheduling and infrastructure. Amazing. So, tell us a little bit more about data volumes. Given the scale at which Canva operates, the amount of data you're managing and moving around must be huge. You mentioned you're on Snowflake, and as with any cloud data warehouse, you pay for compute; the major cost comes from the compute. How do you manage these exploding costs, specifically for the cloud data warehouse? I'm not even talking about the Fivetran piece of the picture, just the warehouse: how do you manage those costs?
00:13:21
Yeah, look, it is a constant challenge. Snowflake and these cloud warehouses are consumption-based: they charge you for the compute, so you do have to keep your eye on it. There are controls you can put in place with Snowflake that alert you when you're using too many credits, et cetera, and you can put quotas on specific users or specific areas. We are on a journey; we are trying to keep our costs under control. We're starting a project to attribute costs in a more fine-grained way, to specific business areas. At the moment we've got very broad buckets, like, okay, this is for experiment analysis, this is for business intelligence, but we'd really like to be able to say: this is specifically for the marketing department, this is for the finance department. So we're going to move to more granular monitoring, and you can do that with tags: we're going to tag our data models and be able to attribute the cost more effectively. But yes, you have to make your clusters auto-suspend, and you have to point your BI tools at the more aggregated layers; don't let the BI tools point at the raw layers that will be slow to query, et cetera. So yeah, you have to keep an eye on it. It is a challenge. What you really need to do is sell a positive ROI story for your data group. That helps: people can see that, yes, these tools cost a lot of money, but without them, think of all the use cases we're powering at Canva. We power our machine learning platforms, our business intelligence platforms, our experiment analysis and A/B testing framework processing data through Snowflake, and we've got the reverse ETL stuff. You need to sell that story, and you do have to keep an eye on it. It's not easy.
However, the engineering time you save by being enabled with easier, more end-to-end tools like dbt, where your analysts can almost build a pipeline themselves, is kind of gold. Pre-Snowflake we used to have EMR jobs running over S3, and that's a whole lot of heavy lifting with AWS infrastructure and Terraform; when the Spark job fails, it's like, whoa, that's complicated. So yes, things get expensive. We've put petabytes of data into Snowflake; you do have to archive data, get it out of Snowflake, and keep an eye on your compute. But as long as you're getting business value out of it, if you can say you spend $1 but it returns you $1.50, it's easy to make the business case.
00:16:19
Wow. Amazing. You know, this is the first time I'm hearing about attributing cost within data teams. It's a very common concept on the marketing side of things, where you attribute marketing costs, but attributing cost for data teams is new to me. So let's dive a little deeper into that. When you attribute data processing costs to various teams, do you do it just at the data warehouse level, or do you attribute the cost across the value chain, across all the tools in your data engineering processes?
00:17:01
At the moment, we're just looking at the big buckets, which are the processing cost and the warehouse; that's the easiest place to do it, so that's where we are starting. We see it in a similar way to how you might see your AWS or cloud infrastructure costs, where each service team or area might be told, okay, you spend X thousand dollars on AWS. It's almost like a chargeback model: your infrastructure team can't justify that spend themselves. It's like, well, why are you running this compute? It's not actually for the infrastructure team; it's for a marketing purpose or a product purpose. The data platforms are similar. We're running a data platform; we're not necessarily the subject matter experts on all of the data flowing through our systems. We help connect things, we help things run on time, we help keep them operational, but if someone asks us about a specific column, we'll point them at the data catalog, for example. So for us to justify, oh, you're spending so many thousand dollars on Snowflake, we need to be able to say, well, that's actually usage driven by the marketing team or the sales team or the education team. So yeah, presently we're just doing it at the Snowflake level, the data warehouse level, but you're right: perhaps in future we could attribute the Fivetran costs the same way, based on who owns each connector, since those connectors also cost credits and money. That's actually a good idea; maybe I should do that. We'd be able to say, well, that connector is actually driven by the marketing team. Do we think the marketing team is using it wisely? Great, no problem. Or, oh, maybe we should reconsider.
So presently we're just looking at the big cost, which is the warehouse cost, but you could even do it for your data loading and your data unloading through Fivetran or Census as well.
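The tag-based attribution Greg describes reduces to a simple roll-up once each unit of usage carries an owning-team tag (which Snowflake's object tagging makes possible). A hedged sketch of the arithmetic, with made-up team names and prices:

```python
# Sketch of per-team cost attribution: given credit usage annotated with
# an owning-team tag, roll up dollar cost per team. Tag names, credit
# figures, and the per-credit price below are illustrative only.

def attribute_costs(usage_rows, price_per_credit):
    """usage_rows: iterable of (team_tag, credits). Returns dollars per team."""
    totals = {}
    for team, credits in usage_rows:
        totals[team] = totals.get(team, 0.0) + credits
    return {team: round(c * price_per_credit, 2)
            for team, c in totals.items()}

# Example: marketing used 10 + 6 credits, finance used 4, at $3/credit.
report = attribute_costs(
    [("marketing", 10.0), ("finance", 4.0), ("marketing", 6.0)], 3.0)
```

In practice the `usage_rows` would come from joining the warehouse's metering views against tag metadata; the roll-up itself is the easy part.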
00:19:05
So, Greg, recently there's been a lot of talk about data trust and data quality, and the most common complaint among data consumers is that they don't trust the data that comes in. Part of creating a value chain around data is focusing on quality itself, right? So I want you to help our listeners understand how you handle data quality both for data at rest, once the data is in your data warehouse, and for data in motion, as it moves through your ETL pipelines. Help us understand your processes and practices around data quality for both aspects.
00:20:02
If we're honest, we're still very early on this journey, though we are focusing on it more and more. It's really challenging. At the moment we've got alerting and monitoring in the data loading framework that we built ourselves: we've got Datadog monitoring that all of those extraction processes are running, we've got observability over our Kubernetes clusters, et cetera. With Fivetran you get error notifications when there's a failure to load. So we've got some simplistic stuff there. We really want to look into reconciliation checks, to make sure that the data in the source actually matches the destination. We're early there; we've got some potential approaches we're going to look into, but we're presently not very mature on that front. As data moves through the transformation pipelines in dbt, et cetera, we do have some basic dbt tests. We use dbt tests a lot, but they're not really at a high semantic level; they're more technical controls, like making sure things are unique, or making sure a key actually matches another table. So we've got some basic technical controls running in the pipeline, and they do help; they're actually quite effective and they do catch things. But we do want to put a layer on top of that with more semantic, meaningful tests, looking at things like Great Expectations, or maybe the dbt-expectations plugin. We do want to focus more on the data governance and data management aspects as we mature, but presently we're struggling just like everyone else. With the explosion of the modern data stack and the ability to do all of this stuff so quickly, sometimes you're not exactly sure what's flowing through the system. So all of that observability tooling and quality tooling is definitely an area we're going to invest in next.
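The reconciliation checks Greg mentions amount to comparing what's in the source against what landed in the destination, for example by row counts or primary keys. A minimal sketch of one such check; the key sets and tolerance are illustrative, not a description of Canva's actual checks:

```python
# Sketch of a source-vs-destination reconciliation check: compare the
# primary keys extracted from the source with those loaded into the
# warehouse. Key values and the tolerance here are illustrative.

def reconcile(source_keys, dest_keys, tolerance=0):
    """Return a report of missing/extra keys; ok if within tolerance."""
    source, dest = set(source_keys), set(dest_keys)
    missing = source - dest   # in the source but never loaded
    extra = dest - source     # loaded but not in the source (e.g. dupes)
    ok = len(missing) <= tolerance and len(extra) == 0
    return {"ok": ok, "missing": sorted(missing), "extra": sorted(extra)}
```

The same shape extends naturally to count checks or checksums per time window; the hard part in practice is querying both sides consistently, not the comparison itself.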
00:22:14
Amazing. The next thing I want to talk about, Greg, is data teams and how they're organized within companies. There are always a lot of business stakeholders going after data engineers and data teams: I need to get this done, I need this data. How do you manage those expectations, specifically when it comes to justifying the ROI of a particular data request that comes in? Have you made any attempts at, or given any thought to, building a self-serve analytics platform for various stakeholders? Or do a lot of business requests come directly through the data teams, where you would have your prioritization processes? Walk us through that.
00:23:09
I think we struggle just like everyone else, probably. We have a federated model, so we don't have a central intake. We hire data analysts into the teams and groups where they can be most effective; group leads are able to hire data analysts into their own groups. And recently, over the last year or so, we've also been pushing federated analytics engineers, so sometimes you pair a data analyst with an analytics engineer if it's a particularly data-heavy area. Marketing is quite a heavy area, with a lot of specialized transformation, attribution modeling, et cetera, so we're doing similar there. We effectively allow groups and departments to own their own head count. Analytics engineering has been successful at Canva, so those roles are quite in demand, and we're continuing to bulk up analytics engineers across the organization. They can tell the central platform teams what's working well and what's not. Managing stakeholder expectations is still hard, though. Stakeholders say, I want a dashboard. They don't realize that creating the dashboard is one thing: that's hard, skilled work for the data analyst or business intelligence analyst who's creating it. And then actually getting the data there in the first place, if it's not there, is a whole bunch of transformation. Do we have the data at all? How do we model it? So I think that's an education thing, an empathy thing. Most people have come to realize that you don't ask an engineer to create a web app or a page by the next day; they understand that it takes time and work.
I think data literacy needs to improve to a level where people understand that it takes time to create these data products, these assets, and that they need to value them. If they just need a number for a very low-priority reason, they should say so: oh, I just need a rough estimate, I don't really need an actual report. So that's challenging, and I think you need some advocacy in your company. I'm not saying we have it perfect; we don't have it right, and some stakeholders can be challenging. We are under-resourced just like everyone else; there can never be enough data people, so we're always hiring data analysts and analytics engineers, and it never seems to be enough. So it's just hard, but it is getting easier and easier. The cloud warehouse and dbt have made this explosion possible, so it's almost a victim of its own success: the more data you have, the more answers people want. You provide some data, then there are just more questions, so you provide more data. It accelerates; there is no end.
00:26:34
Yeah, amazing. I actually have a question about what kind of skills you look for when you're making a new data hire, but before we go into that, help us understand how your data organization is structured. You mentioned you have a global data platform team and then federated engineers; tell us a little more about the org structure, specifically around data.
00:27:00
Sure. So we have a federated structure. We have a central data group that looks after the data platforms, the central key metrics and insights, and the data governance and data management areas. So we've got a central, but more platform-centric, view of how to enable the organization holistically: not only Canva itself, but also some of the acquisitions we've made. I sit in that org and look after the more technical teams there. We have specialties we call analytics engineer and machine learning engineer, plus traditional software engineers, frontend, backend, et cetera, and these roles federate across the organization. So you might be an analytics engineer in one of my platform teams, building frameworks around dbt, setting best practices around dbt, making sure the dbt jobs run on time. Those platform roles are data-agnostic: you solve the problem in an agnostic way so the solution can be federated out to solve every problem. They're more generalist, more technically minded folks. And then we federate those roles out into other teams in the organization. The content and discovery team, for example, deals with a lot of the content at Canva, and there are analytics engineers there helping them build their own insights over their own data. There are folks in the marketing space and the sales space. So we're hiring federated roles across the organization. I look after three key areas. We've got our own event processing system for clickstream data, data in motion; that's more traditional software engineering at scale, thousands of events per second. Then we've got the warehouse area, with data at rest: data comes in, gets processed, gets transformed. And then we've got a machine learning platform as well, operating training and serving platforms for the organization.
For this podcast, it's probably the analytics engineers and the warehouse org that are most relevant, but we also federate the ML engineers: we've got ML engineers on my team building the ML platform, and ML engineers in the product teams building ML-powered product features. I think the federated model probably applies at most medium to large organizations. You get to a point where you can't be a central bottleneck and you need to build a platform mindset. It's not perfect, but it works pretty well for us, and it seems to be a model that applies in many places; it's the way I'm seeing most modern organizations moving. And it does enable a level of self-serve, because once you've got these building blocks together, the Fivetran, the Snowflake, the dbt, the BI tool, you can basically take that stack and put it anywhere in the org and replicate it. Like everything else, it's software-defined: cool, you've got a config file that lets you set the stack up, and you're able to copy and paste it to some extent. The challenge is making sure that each implementation follows the same patterns, so you do need a way to ensure that everyone continues down the same paved road, which is what we call them at Canva. We solve a problem once, ideally, and then replicate it across the organization.
00:31:02
Perfect. Amazing. That brings us to the next question: what skills do you look for when you're making a new data hire, and specifically young data hires? Are there any particular skills you look for, maybe technical skills or soft skills?
00:31:24
Yeah. So Canva has quite a structured engineering hiring pipeline, and we build the pipelines bespoke for the role. For analytics engineers we have a specific set of interview questions and challenges that candidates go through: some technical pre-screening questions, an hour-long coding interview, and then a few sessions on warehousing, data platforms, data modeling, et cetera. But we also have the soft skills interview: communication, leadership, strategic thinking. So we want all-rounders. We're generally not looking for any particular technology; people don't need to have worked with AWS or Snowflake or even dbt. We talk about fundamentals. On the analytics engineering side, it's Python coding, so can they write basic scripts; general Python skills are valued here. And then classic warehousing: how do you move data through a system, ELT, what are the best practices for modeling data, can you do dimensional modeling, can you solve this business scenario? We put a business scenario in front of the candidate and work with them during the interview session to solve the problem with SQL, or however they'd like to solve it. So yeah, technical skills: I believe that analytics is code, so for us everyone needs to be able to code. Self-taught is fine; it doesn't matter where they studied or what they did, but we do value coding skills, so we do put technical challenges in front of candidates.
00:33:14
Amazing. So, are you guys currently hiring?
00:33:17
We're always hiring! Yes, we are fortunate enough to be hiring. We're not hiring as much as we had planned, but fortunately we haven't had to stop or anything like that; we've just reduced the rate at which we're hiring while we keep an eye on the market. But yes, we are still hiring for analytics engineers, so if anyone's listening and wants to join, please reach out.
00:33:40
We'll have the link to apply for the open roles in the show notes as well.
00:33:45
Okay.
00:33:46
Yes. I think that's pretty much it from our side. Greg, thank you so much for answering all of those questions in such an amazing way. I hope a lot of our listeners will be able to learn from the data practices and processes you have at Canva and make the world a little better. Thank you very much for joining the show, Greg; it was such a pleasure talking to you.
00:34:11
You too. Thanks very much.
