Nov 08, 2022 · 29 min

S01 E09 What the heck is Headless BI with Igor Lukanin, Head of Developer Relations at Cube

Headless BI is one of the new and emerging categories of the Modern Data Stack. Although the concept of "headless" has existed for quite a while in the form of Headless CMS, why is there a need for a Headless BI tool? Why should anyone care about Headless BI? To answer these questions, along with all the other technical complexities around Headless BI, we have Igor Lukanin from Cube, a Headless BI solution for building data apps.

Available on:
Google Podcasts
Amazon Music
Apple Podcasts

About the guest

Igor Lukanin
Head of Developer Relations

Igor Lukanin is the Head of Developer Relations at Cube. Cube is a headless BI tool that empowers businesses to build powerful, fast, and consistent data applications. At Cube, Igor leads the team responsible for the growth and adoption of the Cube platform. He has helped Cube's flagship open-source product earn 7,000 community members in Slack and 14,000 stars on GitHub while driving the company from seed to Series A in less than one year.

In this episode

  • The role of Head of Developer Relations.
  • What does the Cube platform do?
  • The evolution of Headless BI.
  • The rise of the semantic layer.
  • Different features of the Cube platform.


Hello everyone, welcome to another episode of the Modern Data Show. Today we have Igor Lukanin, who is the Head of Developer Relations at Cube. Cube is a headless BI tool that empowers businesses to build powerful, fast, and consistent data applications. At Cube, Igor leads the team responsible for the growth and adoption of the Cube platform. He has helped Cube's flagship open-source product earn 7,000 community members in Slack and 14,000 stars on GitHub while driving the company from seed to Series A in less than one year. Welcome to the show, Igor.
Hey, thanks for having me.
Perfect. So, Igor, let's just start with the basics. Tell us a little bit more about Cube and your role as Head of Developer Relations at Cube.
My job is basically making technology accessible to data engineers and application developers. Here at Cube, we have this mission to supply data engineers with tools that would help them build powerful, fast, performant data applications, and that requires onboarding data engineers and developers to a whole bunch of new concepts. What I'm doing on a daily basis, what my team is doing, is basically making sure that every data engineer or app developer has the tools to build modern data apps. We do that through blogging, participating in events, committing to open source, and doing whatever we can to spread the word about the modern data stack and how to build modern data apps.
And let's go a little bit deeper into Cube as a product. If you were to explain it to someone as a beginner, what does Cube as a platform, as a product, do? How would you explain that?
We refer to Cube as headless business intelligence, or just headless BI, and that might be quite an unusual term. Like, what do you mean by headless? If we try to pick an analogy from the software engineering world, the closest one would be other headless tools or headless technologies — say, a headless CMS. This is a content management system that helps you store all the information you might want to display on your website or in a mobile app, but that deliberately skips the task of visualizing or displaying the data, because that's done by a blogging platform or by a custom front-end application that you might want to build on top of it. The headless CMS wouldn't be doing that. Building on that analogy, headless BI is a business intelligence tool that deliberately refuses to take part in data presentation and visualization. And for a reason: right now we see a proliferation of tools in the data stack when we talk about data consumers. Just go to the Modern Data Stack repository and you'll meet full-fledged BI tools, you'll see data notebooks, or data workspaces as some folks call them, you'll even see things like spreadsheet-based BIs, and other stuff. Nowadays, if we take any company, even a really small one, we'll see that there are a lot of teams with distinct needs who want to use different tools. It's not the case that one size or one BI tool fits all needs, right? Data analysts would love to use tools like Jupyter, Deepnote, or Hex to crunch the data and dig for insights, while business users just have their dashboards. In the same company, product engineers might want to build a product using that data for the end users of the company. So we see folks with different needs who use their own tools to access the data, and the goal of a headless BI solution is to deliver the data to those apps.
And that's only a tiny bit of what a headless BI tool might do, because before the data is delivered to the data consumers, you can do a lot of valuable things with it. A couple of things that I would mention right now would be organizing the data into metrics definitions — or just doing data modeling, as we call it — and also making sure that data access is performant. If you ask me what's the single value that a headless BI tool provides, I would say consistency, because it takes a day or a week to build a data pipeline and deliver data to the consumer, but it takes weeks and months, or even more, to regain the trust of your business users and end users if your data pipeline is flawed. So it's crucial to deliver consistent data that just makes sense and is the same across different data consumers. It's also crucial to make sure that whatever BI tool you're using, the dashboard appears within seconds, not minutes, and that the front-end apps are performant and responsive, so the end users aren't frustrated using them. That's exactly it: consistency in terms of what data you're getting, and consistency in terms of the performance that you're getting. That's what a headless BI tool can bring.
So before we dive deeper into the technical components of a headless BI tool, let's talk a little bit about the evolution of headless BI. One of the most widely known commercial implementations, or early versions, of headless BI is something we can attribute to LookML, right? Looker was one of them — SAP BusinessObjects was also an implementation of headless BI — but from a popularity perspective, I don't think there is anything as popular as LookML in recent days. This concept has been around for a while, right? Even the concepts of headless CRM and headless CMS — all of this headless stuff has been there for a while. Why do you think there is a need for such a tool now? Why should anyone care about headless BI? You mentioned consistency, but from the whole modern data stack perspective, why should anyone care about headless BI now?
Yeah, that's a lot to unpack here, so let's just go one by one. First of all, I would like to recommend that everyone learn more about the historical perspective on headless BI, semantic layers, and metrics stores, that kind of stuff. I would recommend the recent blog post by Simon Späti from Airbyte — it's called The Rise of the Semantic Layer — and I believe it provides that historical perspective on the topic. What Looker did was one of the most impactful steps taken to popularize the semantic layer. I would say that for the current generation of data practitioners and data engineers, Looker is a way to define metrics, because it popularized a language, LookML, where you can declaratively define what you want to calculate on top of your data, what you can derive from it. That's a very catchy concept, because otherwise you need to resort to inappropriate means to define your metrics. We know that there's a whole category of tools, like Dremio, that in a way solve the same problem, allowing you to get access to diverse data in different data sources. You can certainly build something that can be called a semantic layer using those tools; however, it would lack the property that Looker provides, where metrics are defined declaratively in your data model. But yeah, the role of Looker is undeniable here. I've had numerous conversations with Looker users, and they were 100%, like 200%, happy with the way they can declare metrics and the way they can explore them, but totally unhappy with the way the data is visualized there. So it's an uncomfortable vendor lock-in: you definitely love part of what you're getting from the tool, but you hate how the data is accessed there.
And that's what a lot of tools in the headless BI category are trying to do right now: retain this ability to define metrics in a declarative way while providing the performance and the consistency. We see this rise of the semantic layer, and just a lot of buzz around headless BI tools. A couple of episodes back on this podcast, you were talking about the real-time data stores, right? Like Redpanda, Materialize, Kafka, et cetera. We have a lot of cloud-based data warehouses, like Snowflake, BigQuery, et cetera, where the data is stored, and a lot of other options to collect the data, to store it, to do the transformation. That's the first part. The second part is that we've got a lot of different means to consume the data. We've got a lot of users, a lot of different teams within every company, who would like to fulfill their own scenarios with those tools, with the data delivered from those data stores. And that creates a situation of explosion, right? Because one option is to build custom data pipelines delivering the data from data stores to particular data consumers — but then you can only wish that the metric definitions would be the same and wouldn't get out of sync while you evolve the data pipelines. The other option is to seek a way to make sure that you still have your data sources in place, you still have your data transformation tools, you still have your tools in the pipeline, you use all the consumers you'd like to, but you somehow maintain consistent metric definitions and ensure consistent performance of data access. That's exactly what headless BI tools can solve. And I believe — that's probably a debatable topic — that we'll be seeing more categories in the data stack emerging.
I wouldn't even refer to this as a modern data stack, because now it's questionable what is modern and what comes after what. But there will be a need for a glue, for a middle layer, to tie them all together, and that's where semantic layers and headless BI tools would fit perfectly.
One other question that comes to my mind: the term headless BI is conceptually very interesting, very promising. But despite the technical and conceptual promise of headless BI, it has yet to see a lot of commercial validation from the market. What are your thoughts on that? Is headless BI still early? And a follow-up question to that: dbt recently announced a semantic layer — what are your thoughts on that?
Cool, that's a cool question. Well, in terms of adoption and financial success, most of the information I have is about Cube, and I would say Cube users — whether they're self-hosting it or using Cube Cloud, where they can use Cube in a managed environment — from what I see, are pretty successful, and they don't doubt that they need headless BI in their data stacks. That's for sure. And some of the numbers mentioned, like the number of stars on GitHub, et cetera — those are obviously vanity metrics, but they're some sign of validation. As I said, I believe that headless BI is here to stay, and those data engineering teams who introduce such a tool to their stack don't wish to get rid of it. You mentioned the dbt Semantic Layer, and I believe every one of us was excited about their recent announcement at the Coalesce conference. We should all be grateful to the dbt team for what they're doing to socialize and popularize the whole concept of the semantic layer, because this concept is still, I would say, really narrowly popularized. I personally would love this way of thinking to become the norm, and I really enjoy seeing that it's catching on. Probably the last note that I want to make here is that I was really happy to see the dbt team sharing some things that are on their backlog, what they're thinking about. There were questions about the performance of accessing the data through the semantic layer, and there were questions about the support for different tools and different integrations. That was really great to hear, because I believe what they are building is pretty much consistent with what we have here at Cube, right? All those integrations and the different kinds of tools Cube supports. So I really see the different headless BI tools, the different semantic layers, converging to the same set of features, to the same solution, that would naturally fit in between the data consumers and the data sources.
You have done a lot of work around the various components of headless BI — I believe you have features and support around access control, caching, and more. Walk us through that: first of all, the technical complexities around these things, and how you have achieved all of that.
Yeah, it's a good question, and I can try to break down what we do at Cube with the product and present it as a number of layers. Cube as a product is available under an open-source license, which you can self-host and run in Docker or Kubernetes, or you can use a hosted version in Cube Cloud — but there are a lot of moving parts. So everything that I will be referring to down the road, these are more logical parts. It helps to think about Cube as a combination of four layers: the data modeling layer, the access control layer, the caching layer, and the API layer. They are even arranged in the order in which you would probably configure them to make sure the data from your data stores is delivered to data consumers. The data modeling layer — well, this is just another term for the metrics layer or semantic layer. I believe what Cube has in the data modeling layer is very close to what you would do if you define your metrics in Looker, or what you would do when you configure metrics in the dbt Semantic Layer. The data modeling layer is the same kind of thing. The goal here is to provide a way to declaratively define what kind of quantitative data, what kind of metrics, need to be calculated on top of the data that you have in your data warehouse, and Cube's data modeling language is conceptually really similar to Looker's. You define logical entities — we call them cubes — that contain measures and dimensions. Measures are basically what you want to calculate; they aggregate your data. Dimensions are the qualitative characteristics of the data that you break your metrics down by. You can define as many cubes as you want, you can define joins between them, and you can model your domain with cubes. Most of the time, a cube might be defined to reflect a single table in the data warehouse.
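The "define once, declaratively" idea Igor describes can be sketched in a few lines of Python. This is a toy semantic layer, not Cube's actual implementation: the cube name, fields, and the `compile_query` helper are all invented for illustration, but they show how declarative measure and dimension definitions can be compiled into SQL for any consumer.

```python
# A toy semantic layer: a cube with measures and dimensions, defined once,
# compiled to SQL for any consumer. Names ("orders", "revenue") are invented.

ORDERS_CUBE = {
    "sql": "SELECT * FROM orders",                 # base relation for the cube
    "measures": {
        "count":   {"type": "count"},
        "revenue": {"type": "sum", "sql": "amount"},
    },
    "dimensions": {
        "status": {"sql": "status"},
    },
}

def compile_query(cube, measures, dimensions):
    """Turn a declarative query (measure/dimension names) into SQL text."""
    agg = {"count": "COUNT(*)", "sum": "SUM({col})"}
    select = [f"{cube['dimensions'][d]['sql']} AS {d}" for d in dimensions]
    for m in measures:
        spec = cube["measures"][m]
        expr = agg[spec["type"]].format(col=spec.get("sql", ""))
        select.append(f"{expr} AS {m}")
    sql = f"SELECT {', '.join(select)} FROM ({cube['sql']}) AS base"
    if dimensions:
        sql += f" GROUP BY {', '.join(str(i + 1) for i in range(len(dimensions)))}"
    return sql

print(compile_query(ORDERS_CUBE, ["revenue"], ["status"]))
# SELECT status AS status, SUM(amount) AS revenue FROM (SELECT * FROM orders) AS base GROUP BY 1
```

The point of the sketch: every consumer that asks for "revenue by status" gets SQL generated from the same single definition, which is where the consistency Igor mentioned comes from.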
So you just define it as select star from a table in the data warehouse — but it doesn't need to be that. If you already have a data transformation layer in front of Cube, most of it might be just select star from a table; and if you don't, you can ask Cube to model the cubes on top of pretty much any SQL. That might get complex at times, but still, if it makes sense for your business domain, then you should feel free to do so. So that's data modeling: you make sure that you have your measures and dimensions, and you group them into entities, which are called cubes. Then you have the access control layer, and it makes sense to have it right next to data modeling, because it allows you to provide consistent access control. Regardless of who is accessing those metrics — a data analyst from a data notebook, or a CEO in a BI tool such as Metabase or Superset or Tableau — their query would need to pass through the access control layer. It would make sure that row-level security or role-based access is enforced. Cube provides tools to make sure that meta information is passed from the data consumers, so you can support all kinds of multi-tenancy scenarios, where you have different groups of users, or users with different rights. You can restrict access to certain metrics, or to certain rows within your tables in the data warehouse, for some groups of users or for some users in particular, and allow everything for the rest of them. Then there is the caching layer — and I'm not a fan of that title, because I prefer it to be called the acceleration layer. Its goal is to make sure that every query executed by Cube can be fulfilled within a set period of time. Most of the time that's no more than two or three hundred milliseconds. And Cube allows for high concurrency — maybe a thousand requests per second, no problem.
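The single-checkpoint access control Igor describes can be illustrated with a toy in Python. This is a sketch of the concept, not Cube's API: the context fields (`tenant_id`, `role`), the role table, and both helper functions are invented for illustration.

```python
# A toy illustration of row-level security and role-based access in a
# semantic layer: every query, from any consumer, passes through one
# checkpoint that enforces the caller's security context.

def apply_row_level_security(sql, security_context):
    """Wrap a compiled query with a tenant filter before it hits the warehouse."""
    tenant = security_context["tenant_id"]
    return f"SELECT * FROM ({sql}) AS q WHERE tenant_id = {tenant}"

def authorize_measures(requested, security_context, allowed_by_role):
    """Reject measures the caller's role may not see."""
    allowed = allowed_by_role[security_context["role"]]
    denied = [m for m in requested if m not in allowed]
    if denied:
        raise PermissionError(f"measures not allowed for this role: {denied}")
    return requested

ALLOWED = {"analyst": {"count", "revenue"}, "viewer": {"count"}}
ctx = {"tenant_id": 42, "role": "viewer"}

authorize_measures(["count"], ctx, ALLOWED)  # passes for a viewer
sql = apply_row_level_security(
    "SELECT status, COUNT(*) FROM orders GROUP BY 1", ctx)
print(sql)  # ... WHERE tenant_id = 42
```

Because the checkpoint sits below every API, a notebook, a dashboard, and a Postgres client all get the same restrictions enforced the same way.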
And there is amazing technology under the hood of that. It would probably take some time to really dig into it, but drawing on what's currently hyped in the data community, I would just say that the caching layer is built in a way that's really similar to how DuckDB works. Inside Cube there is a custom-built data store; it's columnar storage, and the only prominent difference from DuckDB is that in Cube's case it's distributed, so it allows you to parallelize the calculations when you have lots of data — but the rest is the same. It's a columnar store written in Rust. What Cube does, having your metrics defined in the data modeling layer, is preemptively and asynchronously cache the data that would be needed to fulfill requests, storing it in an intermediate format in Parquet files. And inside Cube Store there are, I would say, proven technologies in use: as I said, data is stored in Parquet files, we use the Apache Arrow format to do in-memory processing and transfer the data, and we use the Apache Arrow DataFusion library to do the query orchestration and query planning. So this is more like gluing together some well-known and proven pieces of technology in the data space, and that allows basically any query coming to Cube to be fulfilled within two or three hundred milliseconds. And here we come to the last piece — well, where these queries are coming from, right? The last piece is the API layer. And that's interesting. In the very beginning, Cube had only a single API, which was the REST API. That one basically allows you to access the data through HTTPS requests — something you would do when you are building a front-end app, or some kind of automation that just makes HTTPS requests.
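The "acceleration" idea — roll up data ahead of time, then serve queries from the rollup — can be sketched without any of the real machinery. In this toy, a plain Python dict stands in for the columnar pre-aggregation that Cube Store would keep in Parquet; all names and data are invented for illustration.

```python
# A toy sketch of pre-aggregation: raw rows are rolled up ahead of time
# (asynchronously, in the real system), and queries read the rollup instead
# of scanning raw data. A dict stands in for the columnar store here.

from collections import defaultdict

RAW_ORDERS = [  # pretend this lives in the warehouse
    {"status": "shipped",  "amount": 10},
    {"status": "shipped",  "amount": 15},
    {"status": "returned", "amount": 7},
]

def build_preaggregation(rows, dimension, measure):
    """Pre-compute SUM(measure) grouped by dimension."""
    rollup = defaultdict(int)
    for row in rows:
        rollup[row[dimension]] += row[measure]
    return dict(rollup)

# Built once and refreshed in the background; queries never touch RAW_ORDERS.
PREAGG = build_preaggregation(RAW_ORDERS, "status", "amount")

def query_revenue(status):
    return PREAGG.get(status, 0)  # cheap lookup instead of a warehouse scan

print(query_revenue("shipped"))  # 25
```

The constant-time lookup is why a dashboard backed by a pre-aggregation answers in milliseconds even when the underlying table is huge.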
But for more than a year already, Cube has also had a couple of other APIs. For front-end developers, there is GraphQL — Cube has a GraphQL API. But what I'm excited about is that, for more than a year already, Cube has also had a SQL API, and the SQL API that Cube provides is Postgres-compliant. It represents Cube as a Postgres database, and the cubes, and the measures and dimensions that you define in your data modeling layer, become available as Postgres tables and columns within those tables. It's really great to have this part compatible with what I would say is the most widespread, most popular wire protocol out there — the one of the most popular database out there, I mean Postgres — because that instantly gives you the ability to connect to Cube from whatever tool you have, if that tool supports Postgres. I remember when we first launched our SQL API, we tested it with Apache Superset, which is a really cool open-source BI platform, and it worked, right? So take Metabase, take Power BI, take Tableau, take whatever you have — if your tool can interface with Postgres, it can talk to Cube, and that means your data can be delivered from Cube to the tool. And as I said, all the access control configuration, and also whatever you configured in the caching layer, takes effect as well. So here we come to the understanding: if you have Cube in your data pipeline, you basically have universal connectivity to whatever data stores you have, you have consistent metric definitions and query performance, and you can connect it to literally every tool out there that you use to display or represent your data.
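The mechanics of a SQL API over a semantic layer can be sketched with a rough toy. Cube's SQL API does expose measures via a `MEASURE()` function, but everything else here is invented for illustration: the naive string rewrite, the measure table, and the use of `sqlite3` as a stand-in for the warehouse (the real API speaks the Postgres wire protocol and does proper query planning).

```python
# A rough sketch of the SQL API idea: a consumer writes ordinary SQL against
# logical "tables" (cubes), and the layer rewrites measure references into
# their aggregate definitions before executing. sqlite3 stands in for the
# warehouse; this string rewrite is a toy, not a real SQL parser.

import sqlite3

MEASURE_SQL = {"orders.count": "COUNT(*)", "orders.revenue": "SUM(amount)"}

def rewrite(query):
    """Replace MEASURE(name)-style references with their aggregate SQL."""
    for name, agg in MEASURE_SQL.items():
        query = query.replace(f"MEASURE({name})", agg)
    return query

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (status TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("shipped", 10), ("shipped", 15), ("returned", 7)])

# What a Postgres-speaking BI tool might send:
consumer_query = ("SELECT status, MEASURE(orders.revenue) "
                  "FROM orders GROUP BY 1 ORDER BY 1")
rows = conn.execute(rewrite(consumer_query)).fetchall()
print(rows)  # [('returned', 7), ('shipped', 25)]
```

The consumer never writes `SUM(amount)` itself — the aggregation logic stays in the one shared definition, which is what keeps every Postgres-capable tool consistent.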
Wow, that's super insightful, Igor, and I can totally relate to why the SQL interface could be so game-changing for you, because I see you have integrations with Hex, Deepnote, Streamlit, and all of these integrations are like native database integrations that you can just support off the shelf. Any new integrations, or any future integrations, that you are really excited about?
I'm still excited about what we recently launched. A few weeks ago we launched our integration with Kafka through ksqlDB. ksqlDB is really similar to Materialize in that it's a streaming database designed to work with Kafka. The integration with ksqlDB basically allows you to connect to a stream, or a Kafka topic, and calculate metrics which would be instantly up to date: the delay, or latency, between new data being posted to a topic and that data appearing on your dashboard or in the front-end app would be just seconds. And everything that we've been talking about — the data modeling, the access control — applies here as well. Why I'm excited about that in particular is because that integration also gave Cube the ability to enable a couple of other features. We are really happy to finally deliver an implementation of the Lambda architecture within Cube. I'm not sure everyone's familiar with this concept, but basically Lambda architecture is just being able to process both batch data and streaming data. So what Cube is now capable of doing is merging, in real time, while you query Cube, data from data warehouses and streaming databases — and that's completely transparent to data consumers. Also, well, not everyone has Kafka or Redpanda or streaming data stores in their pipelines, but probably a lot of folks have data warehouses or data lakes where some of the data is pre-transformed and pretty much not changing all the time, while fresh data keeps coming in on a real-time basis. What Cube can do is also merge the batch and streaming data from the same or different data warehouses. It gives you the ability to read only the last piece of data — say, the last five minutes, the last ten minutes of data in a data warehouse —
— and calculate the metrics while joining in the rest of the data that was already ingested and cached within Cube. So basically, with this you get the ability to query your billions or trillions of rows of data while paying only for what you use. The data warehouses are mostly consumption-based, right? You need to pay for what you read from them. But with this, you can access metrics calculated on all your data, including the most recent, real-time part of it, while paying only for accessing the last five or ten minutes. So this is a new scenario that's enabled by Cube. Speaking about new integrations, et cetera, I would certainly love to see more data engineers and data analysts trying to connect the tools that they have to Cube, getting some experience with that, and then sharing it in our Slack.
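The batch-plus-streaming merge Igor describes can be sketched as a toy. This is an illustration of the Lambda-architecture idea, not Cube's implementation: the rollup, the row shapes, and the ten-minute window are all invented for the example.

```python
# A toy sketch of a Lambda-architecture merge: a metric is answered by
# combining a cheap pre-aggregated batch layer with only the most recent raw
# rows, instead of rescanning the whole warehouse. All data is invented.

import time

NOW = time.time()

# Batch layer: already ingested and rolled up (e.g. cached as Parquet).
BATCH_REVENUE = {"shipped": 1_000_000, "returned": 50_000}

# Speed layer: only rows from the recent window are read from the source.
FRESH_ROWS = [
    {"status": "shipped", "amount": 25, "ts": NOW - 60},
    {"status": "shipped", "amount": 40, "ts": NOW - 120},
]

def revenue(status, fresh_window_sec=600):
    """Merge the pre-computed batch rollup with the real-time tail."""
    batch = BATCH_REVENUE.get(status, 0)
    fresh = sum(r["amount"] for r in FRESH_ROWS
                if r["status"] == status and r["ts"] >= NOW - fresh_window_sec)
    return batch + fresh  # one number comes back: transparent to the consumer

print(revenue("shipped"))  # 1000065
```

The consumer asks one question and gets one answer; whether the bytes came from the cached batch rollup or the last ten minutes of raw data is invisible to it, and only the small fresh window incurs warehouse scan cost.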
That would be perfect. I just saw on the website that you have recently introduced views for metrics management, and I think you'll be hosting a workshop on that as well.
Yeah, that's cool, but it might be a tough one to unpack, so let me approach it this way. One way to think about this — this is how our CTO refers to views — is that views make the data modeling layer that Cube has complete, enabling the very last scenarios, which might be, you know, not very frequent, but which we needed to support as well. In layman's terms, views allow you to create another level on top of the existing semantic layer. If you have cubes that contain measures and dimensions, you can create views on top of them that pick only certain measures and dimensions from the cubes that you have, and then you are able to organize them in a way that makes sense for particular use cases. So views allow you to make it even more semantic if you need to. Views are completely opt-in, so if you don't need them, don't use them. But if you really want to make sure that the data coming into your BI tool, or what's available in your data notebook, really makes sense at your domain level, you can model that with views in Cube. They also make it evident which metrics are available to which groups of users — so they're just another means of access control. And probably the last thing here, which is really something that's in the talks in the data community recently, is that views, in the Cube way, define the single-measure metric. There's a lot of conversation about what a metric is, right? Is it just some number that you can fetch from somewhere? When you break it down by some dimension, is it still the same metric or is it another one? When you change the time dimension, again, is it the same metric or just another one? With views, that can be really neatly organized in Cube.
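The curated-subset idea behind views can be sketched with one more toy. This is only an illustration of the concept: the cube contents, the `make_view` helper, and the "internal" member names are all invented, and real Cube views have more capabilities (renaming, joining members across cubes).

```python
# A toy sketch of "views": a view exposes only a curated subset of the
# measures and dimensions defined on a cube, so each audience sees a small,
# intentional surface. All names are invented for illustration.

ORDERS_CUBE = {
    "measures":   {"count", "revenue", "avg_discount"},
    "dimensions": {"status", "city", "internal_flag"},
}

def make_view(cube, measures, dimensions):
    """A view may only pick members that actually exist on the cube."""
    missing = ((set(measures) - cube["measures"])
               | (set(dimensions) - cube["dimensions"]))
    if missing:
        raise ValueError(f"unknown members: {sorted(missing)}")
    return {"measures": set(measures), "dimensions": set(dimensions)}

# An executive-facing view hides internal members like "internal_flag".
revenue_view = make_view(ORDERS_CUBE, measures=["revenue"],
                         dimensions=["status"])
print(sorted(revenue_view["dimensions"]))  # ['status']
```

Because the view is itself declared, it doubles as documentation of which metrics a given audience is meant to see — the access-control angle Igor mentions.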
And for those who are interested, I'll unpack this, show it, and demo it at an upcoming webinar that we have next week — so join if you'd like to learn more.
Perfect, perfect. Thank you so much, Igor, for such a lovely episode recording. There were a lot of things that we learned about headless BI today, so thank you very much for sharing all of that with us. It was such a pleasure having you on the show.
Amazing, thank you. And hopefully this will give data engineers and analysts and app developers a top-level view, and will allow all of us to build better data apps, I'm sure.
I'm sure it will. Thank you so much, Igor.