May 17, 2023 · 36 min

S02 E12 Unveiling Twilio’s Data Transformation: A Journey into Modern Data Stack with Don Oriti, Head of Data Platform and Engineering at Twilio

Twilio has built an open source data lake using AWS technologies and Databricks, processing billions of events daily through its Kafka environment. The goal is to provide a cohesive view of data across platforms and enable the other businesses inside Twilio to use data wherever they want. Don, the Head of Data Platform and Engineering at Twilio, shares insights into Twilio's data stack in the latest episode of the Modern Data Show. The conversation walks through that stack end to end: data ingestion through Kafka or CDC for Aurora databases, storage in S3, high-level aggregation and curation using Spark, and the consumption layer, including Kudu, reverse ETL, data governance, cataloging, and BI tools.

Available On:
Spotify
Google Podcasts
YouTube
Amazon Music
Apple Podcasts

About the guest

Don Oriti
Head of Data Platform and Engineering

Don Oriti of Twilio is an experienced leader in data engineering and architecture who has spent multiple years in senior roles in the field. Before joining Twilio, Don served as senior director of data engineering and architecture at Genesys, and before that spent six years at GoDaddy as director of data engineering and architecture.

In this episode

  • Exploring the Journey of a Data Platform and Engineering Leader
  • Data Lake Architecture and Enablement at Twilio
  • Investing in Data Platforms and Teams During Economic Uncertainty
  • Benefits of Leveraging Managed and Serverless Offerings
  • Generative AI and Its Impact on Businesses Leveraging Data

Transcript

00:00:00
Hello, data folks. Welcome back to another episode of the Modern Data Show, where we delve into the most recent trends and technologies in the space of data and analytics, engaging with some of the most brilliant minds in the industry. Today, we welcome Don Oriti from Twilio, an experienced leader in data engineering and architecture who has spent multiple years in senior roles in this field. Before joining Twilio, Don served as senior director of data engineering and architecture at Genesys, and before that spent six years at GoDaddy as director of data engineering and architecture. Welcome to the show, Don.
00:00:36
Hey, thank you. Thanks for having me. Excited to get started.
00:00:39
Absolutely. So Don, let's start with the very first and very basic question. Tell us a little bit more about your background and how you ended up in your current role as Head of Data Platform and Engineering at Twilio?
00:00:50
Yeah, my journey probably started back in 2004. I used to work for HEB Grocery Company out of Texas, where we were standing up an in-store inventory service, an automated ordering system that would track and maintain inventory within our stores and then place orders to the warehouse to replenish the store stock levels. Part of my job was to analyse the data that was coming in from the system and then tweak forecasts and make adjustments as necessary, working with people in the stores to make sure that we had the right stock levels to service our customers. Projecting the trajectory from there, I eventually moved into the martech space, doing marketing operations, and that led me to business intelligence. Back in those days we were using Teradata appliances with MicroStrategy on top of that, and a data architecture that lives inside MicroStrategy. We were very strong data practitioners at HEB. When you're working with very thin margins, you have to have very well organised, curated, governed data. Taking those lessons with me, I went over to GoDaddy, where we worked with a lot more big data technologies and a lot more of the modern data stack. There we were using on-prem Hadoop, we had Kafka for real-time data use cases, and we also leveraged SQL Server CDC to scoop data into the data lake. That was my first taste of what we call today data mesh, right? This practice of providing data capabilities for our producers to push data into the lake. We weren't curating data the way data mesh would propose, but we were starting to see that process develop there at GoDaddy. Then in 2018, we moved to the cloud and changed our tech stack to leverage AWS technologies: moving from Kafka to Kinesis, leveraging EMR, leveraging serverless Glue, leveraging Lake Formation, and taking our data mesh architecture to the next level. My responsibility there was to expand on the data that was being pushed to the lake and do more centralised data engineering and data curation. One of the things I love about data mesh is that we're really asking our data producers to be part of this larger data community and build data sets that are ready to use. But when you're working in a company that has a lot of products, and even a lot of businesses, sometimes the cross-product things we want to look at aren't something those product teams are able to produce on their own. So there's still probably some level of further curation that needs to happen, whether you do that centrally or within a distributed analytics engineering environment. Those data models and data lake curation mechanisms are still very important, and we were running that as part of our data and analytics org at GoDaddy. From there, I went over to Genesys and worked on building out a data platform from scratch in the CIO organisation. There we built out a fundamentally open source data lake where we leveraged a lot of AWS technologies again, but we also incorporated Databricks to help build out the lakehouse model. We used a lot of Lake Formation and a lot of Glue, but we had two patterns for users: for anything critical coming into the lake, we used a combination of Fivetran plus our own internal tooling to publish to the lake.
And then, taking the data mesh principle of being an enablement team and treating the data platform as infrastructure, we really started to work on data platform capabilities that engineering teams around the org could leverage. For our analytics, data science, and machine learning customers, we provided easier, out-of-the-box solutions like Databricks, but ones we could run on top of our open source data lake. Continuing on to Twilio, we have a much larger platform team there. It spans everything from data publishing, which we do through Kafka again in AWS; that data is published in real time into our data lake, and then we have automated data curation processes that help aggregate data for teams to more easily consume. We also have a centralised data engineering team that does further curation, specifically for our strategy, operations, finance, and marketing teams, and we're looking to expand to cover more areas of the business. There we leverage open source Presto for access to our data lake for analytics consumers, and we have Glue available along with Lake Formation. So if teams want to fire up EMR, serverless Glue, or Ray, they can fire up those resources as well and leverage data in the data lake, really expanding on that enablement principle. Right now, we're processing billions of events a day through our Kafka environment, so we run tons and tons of data through our infrastructure. We're a messaging company; companies are sending out messages all the time that generate tons of event data. Everything that enters our lake is being pushed into Kafka, and then some resources are being copied from Aurora into our data lake as well. But we have data sources from our products, from our finance systems, from our internal back office systems, and, as you described, from the other businesses inside Twilio. I think anybody that has been in the space long enough realises that M&A activity brings in new patterns, new technologies, and obviously new data. One of the challenges that we're looking towards as we start to consolidate all of that is, one, how do we help those other businesses continue to push data to the lake? But they also have processes and analytics motions in their own spaces. One of our orgs leverages Snowflake, and so we're working with them on pushing their Snowflake data into the lake, but also on how we expose external tables into Snowflake for them to consume, while working on a long-term strategy for how we start to consolidate all of the technology pieces. So it's a fun adventure. We have teams that support technologies like Elasticsearch and real-time data access using Kudu databases. We support teams to leverage Spark infrastructure. We again support Presto. We have multiple visualisation tools that we support as well. We're really trying to provide our users with a catalog of technologies to help enable their use of data, to help our producers produce data, but also to help our consumers more easily leverage data for insights around their work.
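To make the publishing pattern described above more concrete, here is a minimal sketch of how events from a Kafka topic might be landed into an S3 data lake with Spark Structured Streaming. The broker addresses, topic, and bucket paths are hypothetical placeholders; this illustrates the general Kafka-to-lake pattern rather than Twilio's actual ingestion tooling.

```python
# Minimal sketch: land raw Kafka events in S3 as Parquet using Spark Structured Streaming.
# Requires the spark-sql-kafka connector on the classpath; all names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("kafka-to-lake").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # hypothetical brokers
    .option("subscribe", "message-events")                           # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
)

# Keep the payload as a string and derive a partition-friendly event date.
events = raw.select(
    col("key").cast("string").alias("event_key"),
    col("value").cast("string").alias("payload"),
    col("timestamp").alias("event_ts"),
).withColumn("event_date", to_date(col("event_ts")))

query = (
    events.writeStream.format("parquet")
    .option("path", "s3://example-data-lake/raw/message_events/")  # hypothetical bucket
    .option("checkpointLocation", "s3://example-data-lake/checkpoints/message_events/")
    .partitionBy("event_date")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```

Downstream curation jobs of the kind Don describes would then read these raw Parquet files, aggregate them, and register the curated tables in the Glue catalog for Presto and other consumers.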
00:08:57
Amazing. And given that, as you just mentioned, Twilio is a big company, right? And it's not just a big company with multiple products; it's literally different businesses entangled into one moving entity, right? Tell us about the structure of your data teams. Do you have data teams per business? I assume you guys have even made a couple of acquisitions, right? And they would have their own data teams. How do all of those data teams across businesses, across products, work together? We'd love to hear more of your thoughts on that.
00:09:32
Yeah. So our data teams in our org tend to be a little bit more centred around the original Twilio company. Those other teams, as you mentioned, do have their own data teams. We work closely with them to partner on getting data into different systems or helping access data eventually. I think where we need to be is a place where data is easily accessible across platforms, and I think technologies like Lake Formation make that possible. In my prior roles we used a very distributed AWS account architecture. I don't want to get too much into the weeds on that, but the idea was that, for security purposes and service-level management, every team and every system had their own AWS account. As we publish data into each one of those, the challenge becomes: how do I get a cohesive view of that data, even if it's living in disparate systems? I think Lake Formation makes that super easy to do, and if you're in the Databricks environment, Unity Catalog helps out quite a bit with that. I think what we're also seeing in the technology space is that even vendors like Snowflake and Databricks, which we just talked about, are looking for ways to stitch together data that lives in disparate systems, maybe because of M&A activity, but also because these teams may want to work with data inside each of their own environments. And if you go back to the data mesh white paper and look at the proposal, data mesh is more about getting consumable, easy-to-use data into the hands of your consumers. It's not really a technology mandate, so to speak. The data can live anywhere: it can live in an Aurora database, it can be in Redshift, it can be in Snowflake, it can be in S3 in open source file formats. How do we make that data look like a cohesive data lake and enable our customers to access it no matter where it lives? Vendors are trying to take on that challenge, and I think we're getting much better than where we've been before. But that's our big challenge with the multiple business pieces: how do we meet them where they're at, and how do we enable them to work with our data and vice versa? They already have data practices, analytics practices, data science, and machine learning in place, and we don't want to disrupt that too much. So thinking about how to be a better enablement platform really is my goal moving forward in 2023 and into 2024: not to be the central gate for everything, but to support teams to be able to use data wherever they want to use it.
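The cross-account sharing Don alludes to with Lake Formation can be expressed in a few API calls. The sketch below, with hypothetical account IDs, database, and table names, grants a consuming account read access to a Glue/Lake Formation table so it can be queried in place rather than copied.

```python
# Minimal sketch: grant another AWS account read access to a Lake Formation governed table.
# Account IDs, database, and table names are hypothetical placeholders.
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

lf.grant_permissions(
    Principal={
        # The consuming account, e.g. an acquired business unit's analytics account.
        "DataLakePrincipalIdentifier": "111122223333"
    },
    Resource={
        "Table": {
            "CatalogId": "444455556666",  # producing account that owns the Glue catalog
            "DatabaseName": "curated",
            "Name": "message_events_daily",
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
    PermissionsWithGrantOption=[],
)
```

On the consumer side, the shared table typically surfaces through a resource link in that account's own Glue catalog, so tools like Athena, Presto, or Databricks can query it without the data ever being copied, which is the cohesive-view-across-accounts idea in practice.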
00:12:36
And you just briefly mentioned this term or technology called Kudu, right? I'm not sure if I've heard of that before. Would you want to explain that further?
00:12:44
Sure. It's a high-speed database. You can think of it a little bit like Druid. It makes it a little bit easier to stream data into a database, but it behaves more like a SQL database than something like DynamoDB, for example. So it's a more RDBMS-style database for real-time data use cases.
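A common way to work with Kudu is through Impala SQL, since Kudu tables support fast, keyed upserts while still looking like ordinary relational tables. The sketch below uses the impyla client with hypothetical host and table names; it shows one typical access pattern, not necessarily how Twilio's teams use Kudu.

```python
# Minimal sketch: create and query a Kudu-backed table through Impala using impyla.
# Host, port, and table names are hypothetical placeholders.
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.internal", port=21050)
cur = conn.cursor()

# Kudu tables behave like SQL tables but support fast upserts on the primary key,
# which is what makes them handy for real-time use cases.
cur.execute("""
    CREATE TABLE IF NOT EXISTS live_events (
        event_id BIGINT,
        account_id STRING,
        event_ts TIMESTAMP,
        status STRING,
        PRIMARY KEY (event_id)
    )
    PARTITION BY HASH (event_id) PARTITIONS 4
    STORED AS KUDU
""")

cur.execute("UPSERT INTO live_events VALUES (1, 'AC123', now(), 'delivered')")

cur.execute("SELECT status, count(*) FROM live_events GROUP BY status")
for row in cur.fetchall():
    print(row)
```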
00:13:06
Understood. And although you have mentioned a few of the technologies that you guys are using at Twilio, would you help me map out the data stack of Twilio, right from the ETL and data ingestion layer all the way to the consumer layer? We'd love to know more about the tools that you're using for ETL. For example, one of the things that Segment recently introduced is reverse ETL, which allows you to take all of that data in the data warehouse and put it back into the operational systems. I would also love to understand the tools you're using around data governance, data cataloging, and BI: what kind of BI tools you're using, and what kinds of tools you're using for the modelling layer. So let's dive deeper into that.
00:13:53
Yeah, absolutely. All data starts with data ingestion. Data producers can publish data through Kafka, but we also support CDC ingestion for Aurora databases. That data lands in S3, and then there are some Spark processes that run some high-level aggregation, a kind of initial step of data curation on top of that. From there it goes in multiple directions. We do further data curation using Spark and tailor data sets to the different areas of the business that we support. We also push data into Redshift for several reporting use cases, primarily SOX-type use cases where it's a little easier for us to manage and test. For the consumer layer, we have the Redshift databases as I described, but most users access our data through our managed Presto environment. More sophisticated data science and machine learning users, and even some of our product team engineers, leverage our data through Spark. And then basically our SQL interface for analysts and for folks that want to do reporting is Presto. For reporting we actually provide two options: we have Tableau, and we also provide Looker. The Looker model was, I think, a little bit more of our initial visualization offering, and we brought Tableau in later, but we have a lot sitting in both. We could potentially look at QuickSight in the future as well, and I think we've been talking with ThoughtSpot a little bit around some of their offerings; what would be interesting there is some of the ChatGPT stuff, but we'll get into that in a minute. For all of our orchestration we're using Airflow in almost all areas of our tech stack. For data governance, we practise decentralized data governance. We use Acryl DataHub as our data catalog, which provides us a ton of great tooling, not just for documenting our data sets: it also ingests metadata from Airflow, so we can see the status of a dataset based on its Airflow schedule. One of the challenges with the Airflow UI is that if you're building several tables out of one workflow, you can tell that the workflow has finished, but it's harder to map that back to the datasets. DataHub gives us an easy way to break down those pipelines and show status at the table level versus the pipeline level, which is great. On top of that we can trace lineage through a lot of our data sets, especially our key data sets around financial reporting, show SLA statuses, and, using the Chrome plug-in, show a warning right in Looker if we have data quality issues or an SLA violation. So customers aren't looking at a dashboard that's old or stale or maybe has bad data in it; we're able to warn them ahead of time. And we're leveraging anomaly detection to do a lot of our data quality checks. From a governance perspective, we don't have a centralized data governance council; I think that's hard to initialize at that level. So what we're doing is more grassroots data governance. These are things like data naming standards, setting SLAs for your data, setting the standards for what complete, production-ready data is, and building out documentation around any business calculations that we use in our data.
We document those both in code and in plain language, so that we understand what the intent is; then we can validate the code a little easier, and people understand what we're trying to do with the code that's building out the data set. We also set owners for data: we want a technical owner who is responsible for running the pipeline, and a business owner who is responsible for the requirements of that data set. As that process matures and gets a little more traction, and people see the value in it, we can start to think about how we define calculations across the company. I think that's a difficult place for companies to get to. One of the things I took from HEB was that, because our profit margins were so slim, there was no room for debate: this is the calculation for gross profit, that's it. Every company right now in this macroeconomic environment is concerned about the financials, but when your margins aren't as razor thin and you have a little bit of cushion to play with, some of those governance pieces tend to fall by the wayside. And then when you bring them up, everybody freaks out a little bit. We don't want people to freak out; we want people to see the value and then naturally gravitate towards governance. And we continue to leverage DataHub to support that motion as well. We're also working on setting up ETL as a service. I think that's the next phase of where we want to go. We have a distributed analytics environment, so those analysts live within the different teams that they support, and we want to make sure that they have the ability to schedule and orchestrate jobs, as well as make it easy for them to write the code to produce those jobs. That's on the roadmap for later this year. And then there's always tech debt and KTLO reduction: deprecating old processes and old infrastructure, and moving more towards serverless and managed AWS technologies to further reduce the ops overhead. That should give us the ability to scale out more, do even more data curation, and look at other data capabilities that we can provide both to our analytics and business customers and to our engineering customers.
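Since Airflow orchestrates most of this, and dataset status flows from those runs into the catalog, here is a rough sketch of how one curated table's pipeline might be wired up: a Spark curation step followed by a simple quality gate. DAG names, connections, file paths, and table names are hypothetical, and the catalog/lineage integration (DataHub ingesting Airflow metadata) happens outside this file.

```python
# Minimal sketch of an Airflow DAG: run a Spark curation job, then a basic data quality gate.
# DAG id, connection ids, file paths, and table names are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.presto.hooks.presto import PrestoHook


def check_row_count(**_):
    # A very simple gate: fail the task if the curated table came out empty, so stale or
    # bad data never reaches the dashboards. Real checks would be richer than this.
    hook = PrestoHook()  # uses the default Presto connection configured in Airflow
    row = hook.get_first("SELECT count(*) FROM curated.message_events_daily")
    if not row or row[0] == 0:
        raise ValueError("curated.message_events_daily produced zero rows")


with DAG(
    dag_id="curate_message_events_daily",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=10)},
) as dag:
    curate = SparkSubmitOperator(
        task_id="curate_message_events",
        application="jobs/curate_message_events.py",  # hypothetical PySpark job
        conn_id="spark_default",
    )

    quality_gate = PythonOperator(
        task_id="check_row_count",
        python_callable=check_row_count,
    )

    curate >> quality_gate
```

A catalog that ingests Airflow metadata can then surface this run status at the table level, which is the behaviour Don describes for dataset-level SLAs and freshness warnings.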
00:20:50
And one other thing, now that I've heard your answer on that: is there any specific area of the entire modern data stack, or the data stack that you guys have at Twilio, where you're looking to procure or build a technology on your own? Is that something that's still out in the open?
00:21:16
Yeah, a little bit. You know, there's always the build versus buy discussion. I think the easy answer to that, not just for us but for a lot of companies, is: hey, if your cloud provider provides you serverless or managed services, go there first, probably, right? They're easy to spin up, and you probably even have discount rates with those cloud providers, depending on your size and scale. When I think about build versus buy, one of the key aspects I look at is: is this thing we're looking at going to be part of our critical infrastructure? Are products going to be down if it breaks? Is Twilio as a business not going to be able to function if this piece of infrastructure is unavailable? At a high level, if AWS loses an AZ or a region goes down, there's not really a whole lot you can do about that; the world has bigger problems if that's the case. But when we think about our critical infrastructure, Kafka is the lifeblood of Twilio. It produces data into the lake, and teams subscribe to Kafka topics to be able to pull data into the product. So we need to ensure that Kafka is alive and operational at all times. When we talk about five-nines reliability, this is the system we really want to have at that level. As we look at how we make that stronger and better, we're looking more towards MSK at the moment, but we're also talking to vendors like Confluent who run managed Kafka as well. And we're weighing: where do we use Confluent for some use cases? Where do we use MSK for some use cases? Is it a combination of both? Is it either or? We weigh those decisions based on, of course, ease of use and cost and all of those things, but also on what happens if something goes down. If our messaging product goes down, do we feel comfortable working with Confluent to get stood back up if they're the vendor? Or is that something we want to have a tighter piece of control over? We can leverage MSK; it's a serverless service, a managed service, but we control the inside of what's going on with the outage and how we bring it back up faster. We have a little bit more control over that. So it's that trade-off: if we're working with our customers on why Twilio is down, how comfortable do we feel saying, hey, we're working with a vendor that's providing some of the infrastructure around that? Depending on the piece of infrastructure, that could be totally fine, or it could be a combination: certain services we may want to run ourselves, and certain less critical services we may hand over to a vendor. That's one of the great reasons why I push open source technologies: it gives you a lot of that flexibility pretty easily, especially at our scale, where we're processing trillions of events a day. We can write that data out to S3 and either run EMR on top of that and leverage Spark or Ray, or go to a vendor like Databricks and say, hey, we want something that's a little bit more managed, we have serverless Databricks SQL, but we need it to read off of the data that we're processing into our open source data lake. That flexibility gives us a ton of options for different use cases where it's like, hey, we need to centrally run a set of infrastructure, but it's low tier; we want our analytics customers to have data and compute all the time.
But if we're down for 30 minutes there, no big deal. If our Kafka cluster goes down, then that is a huge deal. So do we want to own more of that, or figure out a way to distribute that, so that we can leverage partners like Confluent or embed more into the AWS managed offerings?
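When Kafka is treated as the lifeblood, the reliability knobs on the producer side matter as much as who operates the cluster. Here is a minimal sketch of a producer configured for stronger delivery guarantees, using the kafka-python client; broker and topic names are hypothetical, and Twilio's internal publishing libraries are not shown here.

```python
# Minimal sketch: a Kafka producer configured for stronger delivery guarantees.
# Broker and topic names are hypothetical placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092", "broker2:9092", "broker3:9092"],
    acks="all",    # wait for all in-sync replicas before treating a send as successful
    retries=5,     # retry transient broker failures instead of dropping events
    linger_ms=20,  # small batching window: trades a little latency for throughput
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    key_serializer=lambda k: k.encode("utf-8"),
)

event = {"event_type": "message.delivered", "account_id": "AC123", "ts": "2023-05-17T00:00:00Z"}

# Keying by account keeps a given account's events ordered within a partition.
future = producer.send("message-events", key=event["account_id"], value=event)
metadata = future.get(timeout=10)  # block briefly so send errors surface here
print(metadata.topic, metadata.partition, metadata.offset)

producer.flush()
producer.close()
```

Whether the brokers behind those bootstrap servers are self-managed, MSK, or Confluent is invisible to this code, which is part of why the MSK-versus-Confluent decision can be made per use case.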
00:25:38
And you mentioned previously these macroeconomic headwinds leading to almost everyone focusing on profitability and sustainability over the mad rush for growth. Often in the industry, the data platform and data engineering, or the data division as a whole, is looked at as a cost centre rather than a business centre. As a data leader, what is your rationale, what is your pitch to executives, not just at Twilio but in general, for a sustained investment in the data platform and data teams?
00:26:27
Yeah, that's a great question. I think that times like this are not the time to cut back on data capabilities. We need to understand our data as much as possible in order to look for the opportunities where we can help our customers succeed, look for new opportunities to find customers, and look at ways that we can upsell and cross-sell to our customers. So if you're just looking at how to generate more revenue, data is going to be the answer to that. Now is not the time to cut data; it's time to put more money into data. Now, I'm going to counter that by saying that one thing that would help maximize the investment in data is to think more about enablement versus centralization. You have multiple analytics teams and multiple engineering teams that need to leverage data capabilities. If you centralize your infrastructure such that a central team is either managing all of your ETL or managing all of your infrastructure, you're going to hit a scale limit, right? You can hire more people, you can put more money into it, but the reality is that if you have, say, 10 people supporting all of your data ingestion and pulling data from your sources, then at some point you're going to max out the ability of those 10 people to manage all of that. Either they're going to run into KTLO issues, or they're just not going to be able to manage that many pipelines. So really take data mesh to heart and start thinking about how to turn your organization into a data community, not just a centralized data platform team. Data is so critical to organizations that you can't just throw it over the wall to a central team and say, okay, these folks manage everything in data, and we're going to go off and do the analytics, or do platform engineering, or product engineering. Everybody plays a role. When you start thinking about these horizontal teams, like a data platform team or a cloud services or cloud platform team, I think the role needs to switch from "we run the infrastructure, we manage all of the processes that run through us" to thinking more of platform as infrastructure: how can we give you the infrastructure you need to run on your own for the things that are data related? So when you think about things like data publishing: how do I build a Terraform module or a CloudFormation template that allows an engineering team to deploy that infrastructure, but have it already pre-configured, or with light configuration, so that as soon as they deploy it, it's up and running, and they don't have to think about managing it or trying to get it set up and wired into their product? I think our role as horizontal platform teams shifts a bit, from "we manage all the hardware and make sure it's up and running" to a model of: all of these engineering teams also need to handle data, so how do we help them handle that data and stand up infrastructure on their own? Not be the gatekeeper for it, but be the enabler for all these teams to do that. And that scales a lot better. So now we're supportive, saying: hey, if you're having problems running an Elasticsearch cluster, or you're having problems standing up EMR, we can jump in and help you and give you the support you need.
But we're going to get out of your way, so that if you want to run EMR or serverless Glue, or you want to leverage Athena, or you want to run Databricks on your own, you can do that. We're not going to get in your way; instead of competing with you, we're going to support you on that mission and help make everybody as data savvy and as powerful as we are as a centralized organization. I think when you look at it that way, you can start to maximize the investment in your data platform team quite a bit. Your data team isn't just strapped to the gills trying to keep up with KTLO and unable to give your company and organization more tools. Your platform teams and your product teams are already running a lot of KTLO to keep their lights on anyway, and if we're leveraging easy-to-use infrastructure, that means less load on those teams. Also, if your Spark cluster or your EMR cluster goes down, just kill it and restart it, right? You don't have to worry about paging another team who then has to go figure out what's going on. At the beginning of our Airflow job, we spin up serverless EMR, we run our job, we spin it back down, and you don't have to worry about any KTLO in that. Taking advantage of managed and serverless offerings when you can further reduces the ops load on those products. But again, package up those pieces so that they work as soon as they're deployed, make it easy for your data producers to publish data, and make sure it's easy for them to publish high-quality data: package up any data quality checks that you might have, and make it easy for the company to start becoming a data community, versus trying to centralize everything on a single platform.
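The "spin it up, run the job, spin it back down" pattern can be sketched directly against the EMR Serverless API. The application ID, role ARN, and script path below are hypothetical placeholders; in practice this would typically live inside an Airflow task or a packaged Terraform/CloudFormation module handed to product teams.

```python
# Minimal sketch: run an ephemeral Spark job on EMR Serverless and release capacity afterwards.
# Application ID, IAM role ARN, and S3 paths are hypothetical placeholders.
import time
import boto3

emr = boto3.client("emr-serverless", region_name="us-east-1")

APP_ID = "00example123"                                           # hypothetical application
ROLE_ARN = "arn:aws:iam::111122223333:role/example-emr-job-role"  # hypothetical role

emr.start_application(applicationId=APP_ID)

run = emr.start_job_run(
    applicationId=APP_ID,
    executionRoleArn=ROLE_ARN,
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://example-data-lake/jobs/curate_message_events.py",
            "sparkSubmitParameters": "--conf spark.executor.memory=4g",
        }
    },
)

# Poll until the run finishes, then stop the application so no idle capacity is left running.
while True:
    state = emr.get_job_run(applicationId=APP_ID, jobRunId=run["jobRunId"])["jobRun"]["state"]
    if state in ("SUCCESS", "FAILED", "CANCELLED"):
        break
    time.sleep(30)

emr.stop_application(applicationId=APP_ID)
print("job finished with state:", state)
```

Because nothing persistent is left running between jobs, a failed cluster is never something to page a central team about; the next run simply starts fresh.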
00:32:01
That's an amazing answer, Don. So as we inch closer to the end of this episode, let me leave you with one last question: where and how do you see yourself leveraging generative AI as a part of your day-to-day work, or in the broader functioning of the data teams, as you see things going forward from here?
00:32:23
Yeah, from my perspective I'm probably going to give a less than satisfactory answer on how we're using it today, although I think there's a lot of value there. On the engineering side of things, there are a couple of opportunities. Can we speed up development of data pipelines by leveraging generative AI to help build out the code? Do we have a lot of boilerplate code, or very similar code, that we can leverage generative AI to help us build? And then code optimization: time is money in the cloud, and the faster we can make our pipelines, the less money we spend there, and the more we can save to build more. So that's one aspect of it, just general code optimization across the board, even on the platform side. I think one of the more interesting ways we can leverage generative AI is really on the consumer side. If you look at SQL, SQL was developed in the seventies, right? It's still the best language for handling relational data. It's easy to understand, especially for analysts and maybe less engineering-savvy folks. But there's also a whole group of customers that aren't analysts, maybe executives or leaders in other areas of the org, who may not know SQL; they really just want to ask a question of the data, right? Like, hey, what was my revenue year over year, from quarter one of last year to quarter two of this year, whatever that question is. Where generative AI can play in there is in starting to move away from SQL a little bit and allowing those folks to just ask questions, and then have the AI trained to answer those questions without the need to call up an analyst and say, hey, can you run this query for me, or, I'm looking at this dashboard but it doesn't have the timeframe I really want to look at. I think ThoughtSpot and Ask Data are moving in that direction. Where I see a limitation with both of those solutions is that you're really building a Tableau dashboard on the back end, and the questions are limited to the data set that it has in front of it; there's a lot of heavy, old-school MicroStrategy-style data architecture mapping in the background to make that work. Where generative AI can move this to a whole other level is in making it flexible and training it on all of the data that's available, or on a much larger swath of the lake, so that you can ask any question. I think that's going to be super powerful. We already have a lot of very savvy analysts and data scientists and machine learning engineers, but when you start putting that level of data power into the hands of people who can operationalize it and take business action very quickly, you shorten that lead time. As soon as that executive knows the answer to the question, they can act on it immediately, versus waiting for data to be pulled and reports to be built and discussions to be had; you can shorten that timeframe by a ton. And I think that's going to be super powerful for businesses leveraging data, moving forward.
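As a toy illustration of that "ask a question in plain language" idea, the sketch below has a generative model translate a business question into SQL over a small, declared schema. The schema, prompt, and model name are illustrative assumptions rather than anything Twilio runs, and it uses the pre-1.0 openai SDK call style with an API key taken from the environment.

```python
# Toy sketch: translate a natural-language business question into SQL with a generative model.
# Schema, prompt, and model are illustrative assumptions; assumes OPENAI_API_KEY is set.
import openai

SCHEMA = (
    "Table curated.revenue_daily("
    "account_id VARCHAR, revenue_usd DOUBLE, revenue_date DATE)"
)

question = "What was revenue in Q1 of last year versus Q1 of this year?"

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": "You translate business questions into ANSI SQL for Presto. "
                       "Use only the tables described below and return SQL only.\n" + SCHEMA,
        },
        {"role": "user", "content": question},
    ],
    temperature=0,
)

generated_sql = response["choices"][0]["message"]["content"]
print(generated_sql)

# In a real deployment the generated SQL would be validated (allowed tables, row limits,
# cost guards) before being executed against Presto and rendered back to the person asking.
```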
00:35:51
Amazing. Thank you so much, Don, for such an insightful episode. There are a lot of things that I'm sure our listeners, and we ourselves, took away from this conversation. So thank you again for giving your time, Don.
00:36:03
Oh yeah. I really appreciate it. Thanks for having me on. This was fun. Thank you so much.