Apr 25, 2023 · 25 min

S02 E09: Building Data Pipelines at Shopify: Insights from Marc Laforet, Senior Data Engineer at Shopify

With its widespread popularity and success in the e-commerce industry, it is difficult to imagine anyone who has not at least heard of Shopify. This episode features Marc Laforet, a senior data engineer at Shopify, who shares his journey of how he transitioned from being a biochemist to a data engineer at Shopify. Marc explains the type of data Shopify works with, which is diverse in format and comes from different sources, and how the company determines which tools to build to extract the most value from the data. Marc also discusses data governance and explains two possible architectures: a gating process or a trust-but-verify approach.

Available on: Spotify, Google Podcasts, YouTube, Amazon Music, Apple Podcasts

About the guest

Marc Laforet
Senior Data Engineer

Marc Laforet is a senior data engineer at Shopify, a renowned e-commerce platform that simplifies online selling for businesses large and small. With over seven years of diverse experience in the tech industry, Marc has excelled in building and managing cloud data platforms, designing ETL workflows, developing time-series models, and collaborating with experts in deep learning and computer vision. Marc has also built SaaS applications and leveraged AI to design pharmaceuticals and forecast their effects on the human body.

In this episode

  • Marc’s unconventional journey into the world of data engineering.
  • The types of data Shopify works with.
  • What tools to build to extract the most value out of the data.
  • Ideal data stack for Shopify.
  • Shopify’s approach to Data Governance.

Transcript

00:00:00
Ladies and gentlemen, welcome to the Modern Data Show. Today we are thrilled to have Marc Laforet as our guest. Marc is a senior data engineer at Shopify, a renowned e-commerce company that simplifies online selling for small and large businesses. With over seven years of diverse experience in the tech industry, Marc has excelled in building and managing cloud data platforms, designing ETL workflows, developing time series models, and collaborating with experts in deep learning and computer vision. Marc has also built SaaS applications and leveraged AI to design pharmaceuticals and forecast their effects on the human body. Thanks for being a part of the show, Marc.
00:00:37
Thanks for having me.
00:00:38
So Marc, let's start with the very first question. Share a little bit about your background and your role at Shopify; we would love to understand what your day-to-day responsibilities look like.
00:00:48
Yeah. So I entered the data world by accident. I started out as a researcher; I was actually trained as a biochemist. I started working in deep sequencing, also using AI to predict protein-DNA interactions. From there I completed a thesis and did a masters in that field. But once I completed my masters, I was wondering what was next for me. I didn't wanna pursue a PhD, and I decided to enter the job market. In my masters, I didn't really feel limited by my data science ability; I felt more limited by my programming ability. So I decided to go to a pharmaceutical startup and basically plead my case that I would be able to be a good software engineer for them because I understood what their AI was doing; it was related to my thesis work. So from there I was a full stack software engineer. I learned how to deploy stuff, learned how to build a backend, worked with databases, built containers, even did some front end, and I was there for two years. With that new skillset in my arsenal, I decided to re-enter the data world. But this time I was a lot better at programming, and where I started finding my niche was deploying data applications and AI. I would really collaborate, as you mentioned, with experts in AI and computer vision, and I would help them get a hold of the data that they needed. But then also, on the other side, once they built the application, I would help deploy the application into the real world.
00:02:37
No, so I was just saying, just to expand on that one point, Marc: a lot of people that we see in the data industry are non-programmers who want to get into data engineering, and with tools like dbt and a basic knowledge of SQL, most of these data-driven applications are primarily SQL driven, right? What was your biggest challenge in terms of acquiring those technical and programming skills, and what advice would you give to other people who are in a similar situation to yours on how they can get into data and build up the programming skills they require on the job?
00:03:15
Yeah, so I would say the biggest challenge for me was just, you don't know what you don't know. At my first job, a mentor of mine showed me Docker. It was hard for me to even wrap my head around what was happening; I had to build my own computer in a way and then ship it, and it was hard for me to actually understand what I was doing. That was probably the biggest challenge. My advice to people who want to build that skillset is to find something you're passionate about, something really small, and just go through the entire product journey. Think of an idea of something you wanna build, then go out and build it and stumble along the way. There are so many resources online; I'm taught by the internet more or less, Google is your friend. There are tons of resources out there to help you. Really just build that product, deploy that product, trip over your own feet, stumble, make mistakes. The best thing about computers is it doesn't really matter how many mistakes you make. In my early days I did actually brick my computer a few times because I just did stupid stuff on it, and I had to go to Apple and be like, can you reset my computer, please? Because I deleted some startup script or downloaded too much data onto my computer and it just wouldn't even move anymore. For example, I really like computational photography, so I would just take images and start building applications that would process the pixels in images and maybe animate between different pictures and stuff like that. So just find something you're interested in and build a product around it.
00:05:10
That's lovely advice. So Marc, let's talk about your role at Shopify. What do you do at Shopify?
00:05:15
Yeah, so that's a good question. In a couple weeks it'll be just over two years since I joined Shopify. Before Shopify, I was this full stack data person who really brought the breadth of my knowledge to my work. But Shopify is my first time working at a big tech company, and you're no longer really required to cover the whole breadth of the cycle. Obviously that's helpful, but when you're working at a big tech company, it's actually more valuable to have depth in one specific area. So my journey at Shopify has really been centered around storage; I've really focused on the storage layer of our data lake. And there are many different facets to storage. There's the technical aspect, the physical aspect of it. But there are also these ideas around governance and standards: yes, we can build a data lake, but you want to prevent, as best you can, your lake turning into a data swamp. So we have this other aspect of storage that's more about governance.
00:06:34
Right! And tell us, what kind of data are we talking about?
00:06:39
It's mostly structured data, but when you work at such a big company it's not so uniform. Yes, we have some Parquet data, some data is in JSON files; it's very diverse. Some of it comes from Kafka, some of it comes from third party tools, some of it comes from employees wanting to upload Excel spreadsheets. So it's really diverse, and that's one of the biggest challenges: you can't build every tool in the world. So it's, okay, where's 80% of our user base? What kind of data are they using, and what kind of value can we give them? And that's mostly data from our production databases, of course, but then also data coming off Kafka, which sometimes can be the same.
00:07:31
Right! And you said most of the data is structured. For a lot of people there is not a lot of clarity around terms like Data Lake versus Data Warehouse versus Data Lakehouse. How would you explain them?
00:07:49
Honestly, I cringe at our industry sometimes, because I even find it confusing having been in the field for eight years. Some people would say, oh, when it's highly structured it's a data warehouse, but if it's unstructured then you're in the data lake. To me, I don't really care. Both of those terms mean the central data assets of a company. Whether you want to say they're in a Data Lake or a Data Warehouse, who cares? Essentially it's data on disk somewhere, probably in the cloud.
00:08:26
Right! And the next question: if you would care to draw the modern data stack for Shopify, what does your data stack look like? What tools, what open source tools and technologies are you using, right from the storage layer to the insights and decision layer? Would you walk us through that?
00:08:46
Yeah, for sure. So I'll give you the canonical stack, but people who work at large companies know that you're always pivoting: you have some remnants of your data in the old stack and you have this new stack that you're driving towards. So it's not so easy to say, this is the stack that we're using, because we're actually using, unfortunately, a couple of stacks. But I'll just define the ideal stack for us, which would be this idea of: you have data on Kafka, that's our canonical case. We stream our data using a Kafka Connect plugin out of the Kafka brokers into GCS, so it's cloud object storage. We register that data using a Hive metastore, and the table format is actually Iceberg. So that is how our data lands in our data warehouse, data lake, whatever you want to call it. From there we use something like Trino or Spark to hook into this data lake, which allows people to run SQL or their Spark applications to further model their data. Once they model their data, they write it back into the data lake using the Hive metastore, writing it in the Iceberg table format. Then you can come along and put your favorite visualization tool on it, whether that's Looker Studio or Tableau or anything else you want to hook on top of that. Of course, we have things like Airflow that we use to schedule all our workloads, and yeah, that's pretty much the stack.
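The landing flow Marc describes (Kafka topic streamed to object storage, then registered in a metastore as an Iceberg table) can be sketched as a toy model. All the paths, bucket names, and the `land_topic` helper below are invented for illustration; this is not Shopify's actual layout or tooling.

```python
# Toy model of the landing flow: a connector streams each Kafka topic into
# object storage and registers a metastore entry mapping table name to its
# storage location and table format. All names here are hypothetical.

def land_topic(topic: str, bucket: str = "gs://example-lake") -> dict:
    """Return the metastore entry a connector might register for a topic."""
    return {
        "table": topic.replace("-", "_"),   # topic names become table names
        "location": f"{bucket}/raw/{topic}/",
        "format": "iceberg",
    }

# Register two example topics; query engines like Trino or Spark would
# then resolve tables to locations through this catalog.
metastore = {e["table"]: e for e in (land_topic(t) for t in ["orders", "checkout-events"])}
print(metastore["checkout_events"]["location"])  # gs://example-lake/raw/checkout-events/
```

The point of the sketch is the indirection: readers and writers only know table names, and the metastore owns the mapping to physical storage, which is what lets the storage layer evolve underneath them.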
00:10:42
And do you guys use dbt?
00:10:45
So our data modelers use dbt, I myself do not use dbt. I'm like a raw SQL kind of guy.
00:10:54
Got it. And another important thing: one of the issues we see for people working on the storage layer is around Data Quality and Data Governance. You mentioned that you and your team are also responsible for Data Governance. One of the biggest reasons people tend to turn their Data Lakes into Data Swamps is these one-off, ad hoc things that come in. How do you guys decide? Walk us through the process. For example, let's say there is a product team and they say, we want this to be stored in our data lake. What's your process for serving those requests, and where and how do you put those governance processes in place?
00:11:49
Yeah, I'm glad you asked this question, because this is actually something I have been thinking a lot about recently, and in my head there are two different architectures that you can build. You can build a gating process, where someone says they wanna publish into the data lake, they come to you with their data, you do the quality checks right there, and you say yes, you can publish here, or no, you cannot. Or you can do something that I call trust but verify: you let people just publish, wild west, but then have a crawler that comes through after the fact and constantly checks that your data adheres to the quality standards you've put in place. There are pros and cons to each approach. Obviously the gating approach is a lot more strict and heavy handed, and you might be blocking people from a normal workflow; you actually have to implement this new software that everyone has to interact with. Versus the crawler: people probably won't even necessarily know it's happening, because it happens after the fact. You're trusting, but then you're verifying, so you can go back to them and be like, hey, you published this data into the data lake, but you didn't identify what PII columns you have, or set an owner on the table, or put an SLO on the table, or put a criticality on the table. And you can start implementing things that way. Based on how I've described them, I'm sure you can guess which one Shopify is deciding to go with. But yeah, that's how I view that problem.
00:13:34
Right! And one thing that I was amazed to see on the Shopify engineering blog is this very well written post about how to structure data teams, right? There are a lot of ways you can do that, and it's there on the blog, and we'll share the link to the blog post with the podcast episode. But we'd love to hear from you: how did you take this approach to structuring the data teams? How are the data teams structured within the organization, serving various stakeholders?
00:14:08
It's funny you mention this, because I have a lot of opinions about it, and I feel like this is actually an area where Shopify can really improve, and somewhere we're actually trying to. I think it's really difficult because resourcing of teams is a technical problem, but more so it's a people problem. These are people way above my pay grade negotiating how the teams should look: what someone like me, as a data platform engineer, should actually provide to data scientists. Should I expect people on those teams to be able to program in something other than Python, or even in Python, or should they just do SQL? This is something we constantly debate, and something we're trying to get more alignment on. I think in the idealized world, every data team has the ability to self-serve; if they run into problems, they can maybe extend the platform themselves. But I think that's a little unrealistic. I keep coming back to working at such a big org: you don't have just one kind of team here. We have some teams that can do that, but we definitely have teams that can't; they can only write SQL. And expecting every data team to write really complex Scala apps, I think that's impractical. So you have to build your platform so that it can service diverse teams and meet the needs of the company. In the ideal world, every data team has a really good data engineer who can write Scala apps and self-serve on their own team on the platform, but I think that's impractical in the real world.
00:16:00
Right. And on a very similar note, you also have experience working with smaller companies and now with Shopify. What would be your advice to founders who are building tech businesses in terms of how to start thinking about a data team? Data as a function usually comes up pretty late within an organization: you typically start with engineers, you start with business facing teams, and data as a function comes up really late. What would be your advice to those founders? If they keep in mind that there will come a time when they structure data as a function, what can they do in the initial days to accommodate for that and do it the right way?
00:16:49
Yeah! That's such a good question, and I'm so glad you asked, because at one company I was the founding data team member, so maybe I can talk through my experience, give you some horror stories, and then a recommendation. This was after my first two years as a full stack engineer. I decided to re-enter the data world, and I went to this company that was doing computer vision based on medical imaging. I knew from the interview process, I read the job posting and it was like a mix between a Data Engineer and a Data Scientist, and I had done both, so I felt confident. But in the interview I literally asked them, I'm like, it sounds like a mix. Do you want a Data Engineer or do you want a Data Scientist? And they said, we want both. And I was like, what does your team currently look like? And they're like, we have six Data Scientists and no Data Engineers. I was like, okay. I knew from that moment that they really wanted a Data Engineer, and lo and behold, in my first couple weeks on the job, I went to the data scientists and I'm like, how do you get your data? And, I kid you not, they were like, oh, I take this thumb drive, I go to DevOps, I ask them to dump the production database onto it, and then I bring it over to my computer and analyze it. It took a couple months for me to really get into the founder's head that this wasn't scalable. The data scientists were working off the raw OLTP schema of the database; they weren't restructuring it. So with their queries, they're like, oh, my query takes four hours, and it's, yeah, you're doing the most complicated join like seven times. You should probably just have a pipeline that does it once for you. And once I showed them how to do it, the queries took seconds. That was a really pivotal role for me because it showed the power of good Data Engineering practices and let me get my feet wet with it.
I had free rein to do anything I wanted and really experiment with star schemas and pipelines, using Airflow and other tools like Fivetran and Snowflake that I set up at the company. As for my recommendation for a startup or a founder: every company apparently these days is a data company. So if you consider yourself a data company, start with a data engineer, maybe someone who has a little bit of data science knowledge, but you really want the data engineering first. Just from my experience going into these startups, you have all these data scientists; they hire the data scientists first, but they're not equipped to be productive, to actually do their job, until a data engineer comes in and sets everything up for them. It's really like a pyramid, and the foundation is Data Engineering. So I would start with a data engineer who can get your data ready for data scientists to come in and actually build models on. That would be my recommendation.
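The fix Marc describes, running an expensive join once in a pipeline and letting analysts query the pre-joined table, can be sketched with stdlib `sqlite3` standing in for a warehouse. All table and column names here are made up for illustration.

```python
import sqlite3

# Sketch of the pattern: instead of every analyst query repeating an
# expensive join against the raw OLTP schema, a pipeline step
# materializes the join once into an analytics table.

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL);
    CREATE TABLE customers (id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 10, 50.0), (2, 10, 25.0), (3, 11, 80.0);
    INSERT INTO customers VALUES (10, 'NA'), (11, 'EU');
""")

# The "pipeline" step: run the join once and store the result.
conn.execute("""
    CREATE TABLE fct_orders AS
    SELECT o.id AS order_id, c.region, o.total
    FROM orders o JOIN customers c ON o.customer_id = c.id
""")

# Analysts now aggregate over the pre-joined table directly.
rows = conn.execute(
    "SELECT region, SUM(total) FROM fct_orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('EU', 80.0), ('NA', 75.0)]
```

At warehouse scale the same shape would be a scheduled dbt model or Spark job writing a fact table, so the seven-way join runs once per refresh instead of once per query.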
00:20:00
Amazing. And tell us a little bit more about the data culture at Shopify. Is there any particular thing you would like to highlight where you think Shopify is doing really great with respect to data?
00:20:17
I think that we're on the cutting edge of Iceberg, which is pretty cool. Iceberg has been a fabulous community to interact with and work with. We've really shown how open-source technologies can be beneficial to a company even at the scale of Shopify. That's not to say it's easy: operating open-source software is challenging and does require human capital. That's on the storage side. But what's also been really cool is that when you go into a bigger company, you work with a lot more diverse types of engineers, people who are really good at what they're doing. Before, I was more of a breadth person, okay at a lot of different things, but now I'm working with people who are experts in their domain and blow me out of the water at things like data modeling and AI. So it's really cool to work with such diverse people solving such diverse problems.
00:21:24
And one other question. Now that you mention open-source, and Shopify is still working on a lot of open-source technologies, tell us about a recent experience where Shopify procured a third party vendor for any tool within your stack. We are seeing an explosion of data stack tools in every single category you can possibly imagine. Have you guys recently procured any tool that you are amazed by?
00:22:00
So I'll say broadly, hearkening back to what I said about working at a large company, you do end up operating different stacks in parallel. One general trend I'm noticing at Shopify is that we are moving more to an enterprise model where we purchase instead of build, which is distinct from our past, where we would pretty much never purchase things; we would always build in-house.
00:22:30
Okay, good to know. As we approach the end of this episode, Marc, one thing that is probably on everyone's mind right now is generative AI. The world has seen what generative AI can do, can potentially do. What are the particular areas where generative AI can bring a lot of positive impact to the world of Data Engineering? The obvious ones are Metadata and Data Cataloging, where people have already started to demonstrate the value of generative AI in structuring the huge amounts of unstructured data that a lot of organizations have. Where else do you think generative AI can create a lot of impact in your work?
00:23:19
Yeah. I would really see the impact in generating pipelines, like ETL pipelines, for you. Let's say our data has landed, let's take for granted that it's already in our data lake. Anyone who touches the data knows it's really messy: maybe there are tons of duplicates, maybe there's some JSON column that needs to be flattened, and the schema changes. A bunch of issues. I could really see some generative AI coming along where you just say, de-duplicate my data, and it just does it for you. It generates this new table and it's de-duplicated and it's great. And you say, schedule this pipeline every hour. I could see that being really useful in my field. I think where I see the future of our work going is that it's more about delivering business value. The modern data stack is pretty complicated; it requires a lot of different tools, and when you talk to someone in business, like maybe our founder Toby, he doesn't really care about all the different tools. He cares about what you're doing for Shopify. So the more of the technology you can abstract and get for free, obviously it's not gonna be for free, but let's just go with that, and the more value you can provide to the business, the more business leaders aren't gonna look at you as a cost center. They're gonna see the value that you're generating, and that's really where I think we want to be as an industry.
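The ETL chores Marc imagines generative AI writing, de-duplicating records and flattening a nested JSON column, look something like this hand-written sketch. The record shape and field names (`id`, `updated_at`, `payload`) are illustrative only.

```python
import json

# Toy ETL step: keep only the latest version of each record (dedup by key),
# then flatten a nested JSON payload column into plain fields.

raw = [
    {"id": 1, "updated_at": 2, "payload": json.dumps({"user": {"name": "Ada"}})},
    {"id": 1, "updated_at": 5, "payload": json.dumps({"user": {"name": "Ada L."}})},
    {"id": 2, "updated_at": 1, "payload": json.dumps({"user": {"name": "Alan"}})},
]

def dedupe_latest(rows):
    """Keep only the most recent row per id."""
    latest = {}
    for row in rows:
        k = row["id"]
        if k not in latest or row["updated_at"] > latest[k]["updated_at"]:
            latest[k] = row
    return list(latest.values())

def flatten(row):
    """Pull nested JSON payload fields up into flat columns."""
    payload = json.loads(row["payload"])
    return {"id": row["id"], "user_name": payload["user"]["name"]}

clean = [flatten(r) for r in dedupe_latest(raw)]
print(clean)  # [{'id': 1, 'user_name': 'Ada L.'}, {'id': 2, 'user_name': 'Alan'}]
```

Wrapping logic like this in an hourly Airflow task is the scheduling half of the request; the appeal of the generative-AI scenario is describing both halves in one sentence instead of writing them by hand.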
00:25:03
Amazing. So with that, we are gonna wrap up this episode. Thank you so much for sharing all of these experiences with us, Marc, and thank you again for being on the show. Thank you so much.