Nov 22, 2022 | 32 min

S01 E11: Unlocking behavioral data at scale with Alex Dean, CEO and Co-founder of Snowplow

'Data as oil' is a widely used metaphor, and its impact can be gauged by how heavily businesses depend on data provided by third-party sources. Source data systems are finite: they hold a certain amount of data with a limited scope. This is where Snowplow comes in, helping businesses deliberately create that data. In the latest episode of the Modern Data Show, Alex Dean, CEO and Co-founder of Snowplow, discusses data creation, behavioural analytics, data contracts, tracking catalogs and where the modern data stack is heading in 2023.

Available on: Spotify, Google Podcasts, YouTube, Amazon Music, Apple Podcasts

About the guest

Alex Dean

CEO and Co-founder of Snowplow

Alex Dean is the co-founder and CEO of Snowplow Analytics, the behavioural data platform that transforms data into actionable insights. Snowplow has raised $55 million from investors including NEA, MMC and Atlantic Bridge. Before founding Snowplow, Alex worked at a company called OpenX, where he met his co-founder Yali. Working at OpenX gave them an initial understanding of the open source business models and how you can potentially build a community and grow from community to business.

In this episode

Story behind Snowplow.
Data Governance and privacy challenges.
Product roadmap for Snowplow.
What are data contracts?
How does Snowplow enforce tracking catalogs?

Transcript

00:00:00

Hello everyone, and thanks for tuning into the Modern Data Show. We hope you are enjoying these episodes and learning from the amazing guests that we have had on the show so far. For today's episode, we have Alex Dean, who is the co-founder and CEO of Snowplow Analytics, the behavioural data platform that transforms data into actionable insights. Snowplow has raised 55 million dollars from investors including NEA, MMC and Atlantic Bridge, after being bootstrapped for quite some time. Before founding Snowplow, Alex worked at a company called OpenX, where he met his co-founder Yali, and working at OpenX gave them an initial understanding of open source business models and how you can potentially build a community and grow from community to commercializing a business. Welcome to the show, Alex.

00:00:39

Hi, Aayush. Yeah, great to be on the show. Thank you for having me.

00:00:43

So, Alex, let's start with the basics. Tell us about your journey in building Snowplow.

00:00:48

Yeah, so I think you alluded to some of the journey in your intro. It's been quite an interesting and lengthy journey. I mean, Snowplow is a commercial open-source software company, and those kinds of companies typically have a long gestation period as you're working on project-community fit and then figuring out how to commercialize those projects. So yeah, it's been an interesting ride. As you said, my co-founder Yali Sassoon and I met back in 2007. We've been working together ever since. We met in an open source ad tech business and learned a lot about open source, and learned a lot about clickstream data technologies as well. And then we left that and went back into consulting in London, doing these customer 360 projects for brands in the UK, but we wanted to build a software company. And there weren't a lot of startups in London back then, so we knew we had to build one ourselves. And yeah, we spotted this gap in the market. The brands we were working with had really good transactional data, and we could use that to get some insights on their customers and build a kind of customer golden record. But we wanted to add in the behaviour of those customers, the digital footprint of those customers as they interacted with the websites and the mobile apps of those organizations. So we took our understanding of clickstream technology, made it run natively on AWS and open-sourced it as Snowplow back in 2012. And that started to get the adoption that then snowballed, pardon the pun, into what Snowplow is today. So yeah, it was a super fun time. We were very early to the space, before even Redshift was launched. And yeah, it's been a fun ride ever since.

00:02:42

Amazing. And how would you explain Snowplow as a product in the simplest terms? If you were to explain it to someone very new to the data industry, how would you explain what Snowplow as a product does?

00:02:54

So the way I would explain it is, in 2022, everyone has adopted a kind of modern data stack. They have a cloud data warehouse or a data lake, and very often they have some sort of ETL tool that gets data into that warehouse or lake, a tool like Fivetran or Airbyte or Stitch or Matillion. And that's good. That gets your data out of your SaaS systems, out of your databases, replicates it, and extracts it into your warehouse. But what that misses is the really interesting data that comes from, like I said, your customers, your users, your viewers, your consumers, merchants, whoever it is: your people fundamentally working through and interacting with your websites, your mobile apps, maybe some of your backend systems. And so Snowplow helps you create that data. We have a bunch of tracking SDKs that you embed into your digital properties, and those generate net new data, behavioural data that is very rich, high quality and schematized, about your customers' behaviour. And we flow that through. We validate it, we enhance it, we model it in the warehouse, and you can then use that to build cool data products and data apps. And those are gonna vary a lot based on your industry. But yeah, that's what Snowplow does. So the way we think of it is, it's data creation, and it sits alongside data extraction and data replication, and gives you data that's very bespoke and proprietary to your business.
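To make the idea of "schematized" behavioural data concrete, here is a minimal Python sketch of what a tracking call might produce: an event payload that names the schema it claims to conform to. This is an illustration only and not Snowplow's SDK; the `StructuredEvent` class, the `track` function, the `com.example/add_to_cart/1-0-0` identifier and the field names are all hypothetical.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class StructuredEvent:
    """A behavioural event that declares which schema it claims to follow."""
    schema: str              # hypothetical schema identifier, including a version
    data: dict[str, Any]     # the event's properties
    created_at_ms: int       # client-side creation time in milliseconds


def track(schema: str, data: dict[str, Any], now_ms: Optional[int] = None) -> str:
    """Build a structured event and serialize it to JSON for sending to a collector."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    event = StructuredEvent(schema=schema, data=data, created_at_ms=now_ms)
    return json.dumps(asdict(event), sort_keys=True)


# Example: a hypothetical "add to cart" interaction on a website.
payload = track("com.example/add_to_cart/1-0-0", {"sku": "A1", "quantity": 2})
```

Because every payload carries its schema identifier, a downstream pipeline can decide how to validate and model the event without guessing at its shape.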

00:04:25

Who are the typical consumers of this data once you've generated it in the warehouse?

00:04:30

So, historically there were a lot of general insights and BI workloads, people building reporting and insights off of the data. What we've seen emerging quite rapidly in the last couple of years is people building much richer kinds of interactive data products. Some of those are operational, and some of those are real-time. Some of those are powered by machine learning. It's a bunch of different use cases now. But very typically the data is being integrated by data engineers; analytics engineers play a big part, and data scientists too. So all these people that are trying to build differentiated data products off the data in the warehouse or lakehouse.

00:05:11

Right. And we have seen a trend in recent history. Previously, CDPs, or customer data platforms, were typically owned by big third-party companies, and you would have separate CDP platforms that were intended to act as a source of truth for everything around your customer, but they just became one more tool where you have all of the customer information. And there is a shift that we are seeing in the industry, where we are taking a data-warehouse-first approach in which the database or the data warehouse acts as the source of truth. Would love to hear your comments on, first of all, why this is happening. Why is the data warehouse becoming the heart of this modern data stack? And what are the significant advantages of having your data warehouse as a CDP rather than a third-party, 360-degree customer overview kind of marketing tool that you could implement separately?

00:06:12

So this is an area that's seen a huge amount of evolution and change in the last 12 months. And I think it's only accelerating. There are a few different strands, I think, to pull on here. One is just being clear on the terms, because people get super confused about this sometimes. Snowplow is all about creating this very rich customer behavioural data. And that's why we call Snowplow a Behavioral Data Platform, because we help you generate, enhance and model that data and get it into the data lake or the data warehouse. So we are not a CDP; lots of our customers use CDPs alongside Snowplow. A CDP to us is a customer data platform that is primarily brought in by the marketing team to build some kind of single customer view and then to activate that downstream into different marketing channels, because the marketers are trying to grow the business and deploy advertising against rich audience segments and all that sort of thing. The trend you're talking about in the market is interesting and it's very meaningful. The way we see it at Snowplow, essentially we are on this kind of 10-year path to cloud data storage, lakes and warehouses, becoming this central source of truth, the central area where more and more data unification and data mastering is going on. And inevitably that involves more and more customer data. And we, as a vendor and as an open source project that helps you get the customer behavioural data in, are accelerating that trend. The other people accelerating that trend are the cloud data warehouses, and Databricks as well. When we first shipped Snowplow in 2012, you had to work with the data in S3 using things like Athena and Hive, and it was pretty difficult and unergonomic. In 2013, Redshift came out. Then a few years later, BigQuery and Snowflake started to get a lot of adoption, and Databricks is getting a lot of adoption now. Those big companies have been working on the cost, the ergonomics and the responsiveness, making it much easier to make those data stores a really core, central source of truth. So then it comes back to the CDPs, and I think we're seeing a couple of really interesting trends. I think we're seeing some of the CDPs and market engagement hubs, people like MessageGears and Supergrain and Vero, starting to become data warehouse native, so that they just literally sit on top of things like Snowflake and Databricks. And that's super interesting. And then you have reverse ETL vendors like Census and Hightouch making it easier to have your customer records in the warehouse and then send them on to other places. And I think all of these trends are real. For us, the most important point is that you do your customer mastering and build your customer behavioural profiles in the warehouse or the lakehouse, and then we think there's gonna be a mix of different things downstream of that, which marketers want to use to deliver on their goals. So we certainly don't see the death of the CDP or anything like that. Companies have made really good investments in CDPs. We just think more and more CDPs are gonna recognize that the data mastering and unification and the behavioural profiles are gonna live inside Snowflake, Databricks or BigQuery going forwards.

00:09:46

Yeah. And do you see Snowplow going beyond data creation at some point to be able to capture that value chain? From your product vision and product roadmap perspective, do you see yourselves taking any steps in that direction beyond data creation?

00:10:02

So data creation's the heart of what we do, and in the midterm we're interested in what other types of enterprise data we could create. In the short term, we need to just make sure that the behavioural data we create is well set up for marketing and growth use cases downstream. So we don't want to steal anyone else's lunch in the MarTech or CDP spaces. But we do wanna make it very low friction to get the customer behavioural profiles that are being built in the warehouse and send those downstream to tools. So yes, we're very interested in helping with that. But we don't for a minute think that the CDPs are gonna be displaced, or should be.

00:10:50

Yeah. And you guys have been very strong and vocal advocates for the whole concept of data creation within an organization, right? And one of the challenges for data creation, or for having that utopian view of your customer behaviour data, is the cultural or people problem that you might have within those organizations. Speaking personally, I've always ended up seeing events like user_click, user_click_v2, and user_click_v2_final. First of all, how does Snowplow as a product address this problem of governance in analytics tracking, and what are your thoughts on that?

00:11:35

Yeah, I think it's a hot topic again, Aayush. What we've seen over the last 10 years is real ebbs and flows around data quality, data governance and schematisation versus move fast and break things, and it's come back into being a hot discussion recently with Chad Sanderson, who's been doing a huge lift to get data contracts back out, front and centre, as people think about this. I have a couple of different views on it. One is, I see that in a very early stage company that's pre-product-market fit, you wanna move fast and break things. You wanna experiment, and you can end up just wanting to track things fast and do that in a low-friction way. I think, however, there's a point, and it's earlier in the journey than a lot of startups realize, where you've got to start moving into a more governance-centric approach: data contracts, strong schematisation, making sure that the data passing through the pipeline is well understood end to end. And the reason for that is that as you grow up as a company, the uses of that data, the data products and the reporting, become much more serious. We have customers that use the Snowplow data for board reporting, for production fraud detection, for identifying high-value customers, and for churn prediction. These are really serious use cases. And so I think that switch from permissive to wanting to be structured needs to happen. And then that brings you into the data contracts world. I think the increased work at the front end, at the data production end, is quite visible, but sometimes what's missed is how much time and energy and how many mistakes are saved at the data consumption end, which starts to become a major saving.

00:13:29

I'm glad you touched on the point around data contracts, right? So, can you help us understand what exactly is a data contract and how the Snowplow event-driven model kind of fits into this whole concept around data contracts?

00:13:44

I think data contracts to me are a combination of, like everything in tech, people, process and technology. A data contract is, to me, data producers and data consumers asserting that there are parameters, there are structures, there are expectations, if you will, around this data. You're asserting those as a data producer, and you're potentially enforcing them as a data consumer. So we love the discourse coming back to data contracts, because philosophically we've been working on this since about 2015, when we built our schema technology and then baked it into Snowplow. When we were designing Snowplow as a real-time engine with schemas in it, we were very influenced by the concept of happy paths and failure paths and a kind of railway-oriented design, where essentially if your data fails validation, if you're in breach of contract, so to speak, then that data gets moved into a separate area for quarantine, and then the organization figures out what to do with that and whether it can be recovered. So yeah, we are big believers in data contracts, and I think it's more than just schematisation, so we've been doing quite a lot of work to help data producers. We just shipped a tracking catalog inside our commercial product to help producers. And we're thinking a lot about how we make things easier for the data consumers, given that this data is conforming to the schema. How can we make it easier for the consumers to work with it in dbt and things like that? So it's a really fun space. We're happy to see the debate.
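The happy-path and failure-path idea described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Snowplow's implementation: the contract is modelled as a plain mapping of required field names to Python types, and `USER_CLICK_CONTRACT`, `violations`, `route` and `PipelineResult` are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical contract: required field name -> expected Python type.
USER_CLICK_CONTRACT: dict[str, type] = {
    "user_id": str,
    "element_id": str,
    "timestamp_ms": int,
}


@dataclass
class PipelineResult:
    good: list[dict[str, Any]] = field(default_factory=list)         # happy path
    quarantined: list[dict[str, Any]] = field(default_factory=list)  # failure path


def violations(event: dict[str, Any], contract: dict[str, type]) -> list[str]:
    """Return human-readable contract violations; an empty list means valid."""
    problems = []
    for name, expected in contract.items():
        if name not in event:
            problems.append(f"missing field: {name}")
        elif not isinstance(event[name], expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    for name in event:
        if name not in contract:
            problems.append(f"unexpected field: {name}")
    return problems


def route(events: list[dict[str, Any]], contract: dict[str, type]) -> PipelineResult:
    """Send valid events down the happy path and quarantine the rest with reasons."""
    result = PipelineResult()
    for event in events:
        problems = violations(event, contract)
        if problems:
            # Keep the original event next to the reasons so it can be repaired later.
            result.quarantined.append({"event": event, "problems": problems})
        else:
            result.good.append(event)
    return result
```

Nothing is dropped in this sketch: an event that breaches the contract is set aside together with the reasons it failed, which is what makes a later recovery decision possible.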

00:15:31

Tell us more about the tracking catalog. What exactly is that?

00:15:34

So, to zoom up a level, I think one of the interesting things about data mesh and the idea of data producers and consumers is that data mesh gets most interesting when you have very different data producers from data consumers. A kind of counter-example is Kafka. Very often in the Kafka ecosystem it's the same team that is the data producer and the data consumer. It's often even the same person. It's like Bob or Susan in the morning writes to Kafka and in the afternoon they read from Kafka. Web and mobile behavioural data is quite different. The data consumers are the people that listen to this podcast: data engineers, data scientists, and analytics engineers. But the data producers are people that don't listen to this podcast. It's people that are just trying to ship the latest version of their Android app. It's people that are grappling with a rewrite of their website, a move from one front end to React or whatever. It's a very different persona. And so the tracking catalog is a part of Snowplow that's built for that persona, those front-end engineers, those full-stack web engineers, folks like that, to try and educate them and help them on the journey of creating structured data, structured events that are then much, much easier to consume on the other end.

00:17:07

And how does Snowplow enforce tracking catalogs? If there is anything that's out of catalog, what happens to that?

00:17:14

So it's all under the hood, it all flows through into our schema technology. We've seen a few other people in the market work on tracking catalogs and things like that, which is cool. It moves the state of the art forwards. But it's really important to have schema technology under the hood so that the tracking catalog doesn't get stale, so that it's an actual expression of the data flowing through the pipeline.

00:17:39

Right, right. And does this catalog get updated automatically, as and when new events come in?

00:17:44

So, it's updated by the data producers. And then we also have some interesting features that monitor the data flowing through, to see what actual events and entities are flowing through Snowplow, and report that back.
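One way to picture that kind of monitoring is a comparison between what the catalog declares and what is actually observed in the pipeline. The Python sketch below is only an illustration of the idea, not a description of Snowplow's feature: the `catalog_drift` function, the report keys and the event names are assumptions made for the example.

```python
from collections import Counter


def catalog_drift(declared: set[str], observed_events: list[str]) -> dict[str, object]:
    """Compare the event names declared in a catalog with those seen in the pipeline."""
    counts = Counter(observed_events)
    return {
        # Events that are flowing through but were never declared, with their volume.
        "undeclared": {name: n for name, n in counts.items() if name not in declared},
        # Events that are declared but were not seen in this window.
        "unused": sorted(declared - set(counts)),
    }


report = catalog_drift(
    declared={"user_click", "page_view"},
    observed_events=["user_click", "user_click_v2", "user_click_v2", "user_click"],
)
```

A report like this gives producers a prompt to either declare the new event properly or stop sending it, which is how a catalog stays an accurate expression of the data rather than going stale.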

00:17:58

Right, right. So, moving on to a parallel question around this: of late, we have seen Google Analytics getting a lot of heat concerning privacy. There are voices in organizations that want people to move away from GA, or to move to something that does a better job at privacy. What do you guys think about privacy, especially when you are collecting a lot of behavioural data? Where do you draw the line, and what's your take on that?

00:18:24

So privacy is super, super important, and one thing I'd always stress about Snowplow is that we are not collecting any data ourselves. We're a software vendor. We have open source and then we have our commercial offerings, but we are always just running the software for a brand, for an organization. And it's their data. It's the behavioural data of their customers. So it's not something that we have access to. It's not something we do anything with. We're purely a data processor. We don't control data in any way. The evolution on the privacy side's been super interesting. From a Snowplow timeline perspective, we launched the open source in 2012, and then from 2015 onwards we had a commercial offering where we would run Snowplow for customers in their AWS account. That's evolved up to today to be a kind of private SaaS offering where there's a UI, there's an API. That control plane is essentially hosted, but the data plane is still inside the customer's own AWS or Google Cloud account. And so it's a high-ownership model, and the behavioural data being created is being sent to that brand's endpoint, so events.moderndatastack.xyz or whatever. That model's always been strong on data ownership. We've got a new hosted version of Snowplow that we're working on now. It's in private preview and will go into general availability early next year. But even with that version, we are still just processing the data and then landing it into people's Snowflake data warehouses. So we're not the data controller on that either. So it's interesting, I think, to take it back to Google. GDPR was launched a few years ago. Different countries in Europe are interpreting it in different ways, and a lot of those interpretations are quite unfavourable to Google Analytics specifically. And so that's leading companies in Europe, and increasingly in the US as well, with global brands headquartered in the States, to think hard about where they go from here. From a Snowplow perspective, what we're seeing is that some of those brands are not very data mature and they just want packaged web analytics, a digital analytics tool that is more compliant, or held to be more compliant by the authorities. And so they're going in one direction. However, a bunch of other organizations have gone up the data maturity curve in the last few years. They've made big investments in Snowflake or Databricks or BigQuery, and they're now thinking, well, hang on, what if we reimagine our digital analytics needs inside the warehouse, and use something like Snowplow to power that, and then BI tooling and ML tooling on the back end to serve those use cases? So that's exciting for us. We're getting some awesome customers out of that.

00:21:30

And from a product perspective, one of the other popular tools in the event tracking industry is Segment. If you look at Segment, they have a lot of SaaS destinations that they deliver this analytics or event data to. Did you guys take a conscious decision to not support various other SaaS tools for delivering this event data and just keep focusing on the data warehouse? If so, why?

00:21:54

Yeah, it's a great question. Segment grew up pretty much alongside us, so I think they started in 2013. A lot of the analytics and data players in this space started in 2012, 2013. I dunno what was in the water back then. But Segment evolved on a different tech tree from us. They went much more with the Mixpanel-style, unstructured JSON event tracking approach. But they did something quite powerful, which is they relayed the data and those events into lots of different SaaS tools. And we took a different path. We were much more focused on the data and analytics teams and their use cases, which were much more around reporting off of the data in the data warehouse. We did look at that area, that concept of relaying into a lot of different SaaS tools, a few times over the years, and we do have an integration and solution with Google Tag Manager server-side that does that. But to be honest, we think that is less of a priority, because essentially it comes back to that point around the warehouse or the data lake becoming the central source of truth. What we see is you wanna get all your data, your behavioural data, your transactional data, your demographic data, and you wanna master that data. You wanna build these very rich customer behavioural profiles, and then you want to activate those downstream into different tools. But you don't want your event flow sitting in like 10 different downstream SaaS systems because, I mean, you're gonna pay for those events 10 times over, but more importantly you're gonna have a very siloed experience. And none of those 10 tools is gonna deeply understand a unified view of that customer. The emergence of reverse ETL about 15 months ago, with Census and Hightouch and a couple of other players, was great for us in terms of getting the market to understand and accelerating this view of, no, you build your rich customer behavioural profiles inside the warehouse or lakehouse, and then you activate downstream.

00:24:07

Yeah, I agree. Reverse ETL tools like Census, as you mentioned, are a very strong validation of your hypothesis in terms of how the CDP should be organized. So that's a fair point. Moving in the same direction, what do you see next from a product perspective for Snowplow? What are the cool things that are coming in early Q1 next year? What are the cool things that you guys are working on?

00:24:30

Yes, a couple of really fun things. So I mentioned BDP Cloud, that's our hosted version. And it's really exciting because we've never had it before. We've always had the open source and we've had the private SaaS, more like an enterprise flavour. So it's super exciting to build this kind of middle SKU, if you will. We think that's going to get some exciting adoption. And we think we're gonna learn a lot about what the friction points are, or the more fiddly things in a Snowplow setup that people want us to automate away. So that should be good, and we wanna build that in a way that gives people a lot of control. It still needs to feel like Snowplow, it needs to feel like you're getting a full Snowplow experience. It's just that you don't have to have your own AWS sub-account or your own SREs working on this or whatever. So that's quite exciting. The other big piece that we've been working on is a library of what we are calling data product accelerators. Essentially what we found is that there's a kind of blank-sheet-of-paper problem in this space, where there are great data teams, they've got some great tooling flowing some great data into warehouses, but they don't necessarily know how to get from that into rich use cases, sophisticated insights, ML-powered data products, operational analytics. And so the data product accelerators are basically rich templates that use Snowplow and other technologies like Snowflake and dbt and Databricks and Hightouch and Streamlit, you name it, and show the art of the possible and show you how to get started. They're hello worlds, if you will, but for data products. We've been pioneering this by working quite closely with the big centres of data gravity. And we've been pioneering it because we just haven't seen a lot of this out in the market. We haven't seen other modern data stack vendors doing a lot of this. And so we just see a lot of people ending up having to recreate the wheel. We see that a lot with our customers. So yeah, we're very excited about that library.

00:26:47

That's very interesting. And I can see a lot of potential for hello world use cases on that front as well. Tell us a little bit more about your decision to launch the cloud. Apart from the cost of ownership for an open source project, what do you think are the key benefits for any customer to go with Snowplow's cloud?

00:27:11

So I think it's a great question. I think for us and our user base, it's gonna open up a new path to trying out Snowplow. Essentially, the open source is really powerful, but it's a bunch of Lego bricks, and bringing them together is fairly complex, and a lot of people just want to get their feet wet and try this out. We launched a fully hosted sandbox environment called Try Snowplow early last year, and that did well. There's a lot of people that are trying it out, and then they just wanna move into some sort of structured way of working with us and paying for Snowplow. So that's a big thing. I think it's gonna open up some interesting market segments for us. We're seeing some marketing agencies signing up to the private preview. We think that smaller data teams in earlier-stage startups are gonna start looking at this as well. So yeah, it's exciting.

00:28:19

And we talked a lot about the evolution that has happened in the modern data stack, from CDPs to the warehouse to reverse ETL. Where do you think the modern data stack is heading in 2023? What are some things where you have a feeling we will see a lot of activity as we come to the new year, and what are the things where you think the noise will die down next year?

00:28:46

That's a fun one. Prediction-wise, I'll call out two. The first is this idea of being data warehouse-backed or lakehouse-backed. I think that's a real runner. I think that's gonna grow and grow.

00:29:04

Would you see applications being built on top of the data warehouse?

00:29:08

Yeah. So I think more and more existing CDPs, and even new entrants to these spaces, are gonna say, well, hang on, if this organization's doing the data mastering in Snowflake or Databricks or whatever, and if there are very rich, unified customer behavioural profiles being built in there, then why don't I just treat that as the data model and treat that as the data store? And I think that vendors doing a lot of shadow copying into their own databases behind the scenes will come under a bit more pressure. So I think that's one interesting trend. The other trend I think is gonna increase is this: we've been in a bit of a honeymoon period where data teams have been growing. They've been staffing up and they've been given a budget to build the kind of platonic modern data stack and have a bit of everything: a bit of data quality, a bit of observability, a bit of extraction, a bit of replication, a bit of catalog, et cetera, et cetera. And I think that's gonna come under more pressure as the macro environment changes and becomes much more demanding. And so I think we're gonna see a couple of things. I think we're gonna see the lines of business saying, well, hang on, we've been tasked with some aggressive growth or savings or whatever it is. Can we use the data investment we've made over the last few years to power that? So I think lines of business are gonna be coming to the data team much more aggressively over the next 12 months, asking for stuff and asking for significant new data products and apps and things like that. And then on the modern data stack budget, if you will, I think there's gonna be more pressure to prove that the individual components are moving the needle, that they're an important part of delivering those end-to-end use cases at the right quality and velocity and all that kind of stuff. So I think there's just gonna be more demands, fundamentally, on modern data teams and their tooling.

00:31:18

Wow. That was insightful. So, Alex, as we come close to the end of the show, any parting thoughts that you would like to share with our audience, anything specific to Snowplow, any upcoming features, or any activity from Snowplow that people should be watching?

00:31:41

I mean, like always, I'd just love to thank the Snowplow community and our customers and partners. It's been really fun bringing these data product accelerators together, getting more sleeves-up with partners and testing those out with customers. So yeah, just a big thanks to the whole ecosystem, and a big thanks to you, Aayush, for all your work on the Modern Data Stack. It's a very complicated space and you've done a huge amount in terms of making it way more tractable. And I think everyone in the space and all the vendors owe you a big debt of gratitude for clearing it all up for people.

00:32:25

Thank you so much, Alex. Thank you for giving your time for the episode. It was such a pleasure having you.

00:32:32

Thank you. Thank you very much for having me.