
10 tips for building an advanced data platform

Kirill Denisenko | Jan 23, 2023 | 20 min read

Build your data platform as a product
Use asset-based approach to data instead of pipelines
Manage your data platform infrastructure as code
Set moderately strict limits on data platform and relax them when needed
Partition and cluster tables
Consider when to use a SaaS, an open-source solution, or build your own
Create an internal data catalog
Create a monitoring center for the data platform
Keep things simple for data platform users
Periodically check your data inventory and do a clean-up

Tip #1: Build your data platform as a product

This, in my opinion, is the most impactful thing you can do for your data platform users.

When a company builds an application or a service, it builds a product. Such a product has:

Customers
Technical infrastructure
Product planning to define what to build
Marketing efforts to get customers
Customer support to handle problems and incidents

Now, what if we apply the same way of thinking to an internal data platform?

Customers: every data platform has users, who are most likely company employees
Technical infrastructure: you need resources and technologies to create a data platform
Product planning: building a data platform requires a lot of planning, such as the choice of tools, functionality and data warehouse design
Marketing: you need to promote your data platform’s functionality to onboard new users and explain added functionality to existing users
Customer support: any data platform team needs to respond to users’ requests and handle data-related incidents

If we can apply product characteristics to a data platform, then why not build the data platform as a product instead of creating a set of connected technologies that make sense only to data platform teams?

Consider the data platform an internal product, with your company as its target market and employees as its target audience. Create a name and logo for the data platform and advertise it internally under that name instead of the generic term “data platform” or the name of your data warehousing solution. Bring all your data tools under this name and refer to them as its functionality.

For example, if your data platform product is named “Helios” (Greek word for “sun”), then your Tableau Server is the data visualization functionality of Helios. From the perspective of a user, they will be using a “Helios dashboard that aggregates Helios data”. It’s much easier for everyone to remember one product name and get the idea of its functionality, rather than trying to figure out what different data technologies do and how they are interconnected. When using Google Sheets, you generally don’t think about what modules, resources and APIs it consists of, right?

I also recommend creating a landing page for your data platform product. A simple internal webpage with the data platform’s name, logo and a list of links to functional components is sufficient. For better access and visibility, assign it a simple internal URL (using the Helios example from above, something like internal-domain.com/helios or helios.internal-domain.com would look great). Create a user manual for your data platform product and put the link on the landing page.

The most important thing here is having a single entry point to all functionality of your data platform. No need to research countless internal wiki articles and ask teammates for the information — anyone can start using the data platform just by visiting a link.

At Eureka, we used this method to build a new modern data platform product that we named Metis.

Thanks to the creation of Metis, we now have a simple, fast, resilient, reliable, cost-efficient and easy-to-use internal data platform. Our data teams actively use the name “Metis” instead of “data platform” or “BigQuery” or “GCP”, and many employees who do not directly work with the data platform have at least heard about Metis, its functionality and role in the company.

Check out last year’s article for a (slightly technologically outdated) overview of Metis!

Tip #2: Use asset-based approach to data instead of pipelines

Whereas tip #1 is the most impactful thing you can do for your users, this one, in my opinion, is the most impactful thing you can do for yourself as a data platform engineer or manager.

Traditionally, when data engineers create data pipelines, they write instructions (code) that declare what operations to execute. Imagine a daily data pipeline in Airflow or a cron-scheduled shell script that gets data from source A, transforms it into file B, uploads the file data to table C, aggregates it into table D and runs a refresh via API on server E. We have a clearly defined process consisting of 5 operations and an expected output of this process.

What if this pipeline fails on step D? We need to fix the process and re-run it from the start.
What if one pipeline depends on an output from the other? Now we need to create a mechanism that connects those two pipelines. And if one of the data outputs in the second pipeline doesn’t actually depend on the result of the first pipeline, then we have an unnecessary blocker when the first pipeline fails.

As many data engineers know, dealing with such pipelines is a cumbersome process that costs a lot of time and effort and makes data engineering unnecessarily hard.

What if instead we put our focus not on what operations to execute, but what we want to exist? What if we embrace the declarative approach to data and start declaring data assets instead of pipelines?

Much like Terraform’s approach to infrastructure resources (discussed in tip #3), we can just define what data asset (a table in a data warehouse, a file, an ML model etc.) we want to create, how to create it and what dependencies on other data assets it has. This makes each data asset its own entity, so we can work with assets independently. When we want to materialize several data assets at once, we just need to specify which ones to materialize, without worrying about their relationships. Since dependencies are already defined, each data asset knows which other data assets to pull data from for processing, and which data assets to push processed data to.

Figure: each block is a separate data asset (Source)

Moreover, declared dependencies enable automatic data lineage and an operational data catalog without any extra effort. An operational data catalog is the one that the data platform team uses to work with data assets and their lifecycle.
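To make the idea more concrete, here is a minimal sketch in plain Python. It is not the API of Dagster or dbt: the asset names, the ASSETS dictionary and the materialize function are all made up for illustration. It only shows that once dependencies are declared, the execution order and the lineage of any selected asset can be derived automatically.

```python
from graphlib import TopologicalSorter

# Hypothetical asset declarations: name -> (upstream dependencies, build function).
# Each build function receives the already-materialized upstream values.
ASSETS = {
    "raw_orders": ((), lambda deps: [120, 80, 45]),
    "raw_users": ((), lambda deps: ["ann", "bob"]),
    "order_totals": (("raw_orders",), lambda deps: sum(deps["raw_orders"])),
    "user_count": (("raw_users",), lambda deps: len(deps["raw_users"])),
    "revenue_per_user": (
        ("order_totals", "user_count"),
        lambda deps: deps["order_totals"] / deps["user_count"],
    ),
}


def upstream_closure(selection):
    """Return the selected assets plus everything they depend on (their lineage)."""
    needed, stack = set(), list(selection)
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(ASSETS[name][0])
    return needed


def materialize(selection):
    """Build only the assets required for `selection`, in dependency order."""
    needed = upstream_closure(selection)
    graph = {name: set(ASSETS[name][0]) for name in needed}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        deps, build = ASSETS[name]
        results[name] = build({dep: results[dep] for dep in deps})
    return results


if __name__ == "__main__":
    # user_count does not depend on the orders branch, so a failure in
    # raw_orders would not block it.
    print(sorted(materialize(["user_count"])))
    print(materialize(["revenue_per_user"])["revenue_per_user"])
```

For simplicity this sketch rebuilds upstream assets in memory on every call. Real tools typically read previously materialized upstream data from storage instead.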

dbt is one popular tool that enables the asset-based approach in the data warehouse. Instead of creating data pipelines, you just write a SELECT SQL query for the table you want to create, give this table (model) a name, create an optional YAML configuration and specify what other tables it depends on. Then just schedule your jobs and let dbt do the rest. It is simple to create and maintain, and you get the data lineage out of the box.

But I believe the real breakthrough in this field is Dagster with its Software-Defined Assets approach. The general idea is similar to dbt’s, but it can be applied to any data asset. Whether it’s a table in a database, a data extract from an API, a file in a cloud storage bucket or a machine learning model, you just need to define it in Python code and specify its dependencies. Create a job with a selection of assets you want to materialize and add a schedule or a sensor. Dagster will automatically materialize the selected assets while preserving the order of dependencies. Moreover, you can import data assets from other tools, including dbt!

Dagster and dbt are probably the two most powerful tools that we have implemented in Metis. They replaced complex and fragile Airflow pipelines, custom shell scripts and a couple of detached data tools. We got clear observability over our data assets thanks to the automated operational data catalog, our daily data operations got much easier, and we visibly increased the reliability of our new data platform.

I really recommend spending some time studying Dagster and understanding the asset-based approach to data to see how it can improve your current data workflows. It was such a huge positive upgrade for us that it’s hard to imagine going back to previous ways of doing data engineering. There are detailed explanations of all features in docs and blog posts such as this one.

Tip #3: Manage your data platform infrastructure as code

If your tech stack allows it, it pays off to define data platform resources and automate their lifecycle in an Infrastructure as Code (IaC) tool, such as Terraform. The amount of clarity, simplicity and peace of mind this approach brings is difficult to overestimate.

It’s easy to start with the IaC approach to data platform resources. Spend some time studying the tool, configure a CI pipeline that runs the tool and import existing resources. Next time you need to add, modify or delete something in your data platform, just modify the code as needed, create a Pull Request, merge approved changes and confirm that the deployment was a success.
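The core of this workflow is comparing what is declared in code with what is actually deployed. The sketch below is a heavily simplified illustration of that comparison in Python. It is not how Terraform is implemented, and the resource names and the plan function are hypothetical.

```python
def plan(desired, actual):
    """Compare declared resources with deployed ones and list the required actions."""
    actions = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            actions.append(("create", name))
        elif name not in desired:
            actions.append(("delete", name))
        elif desired[name] != actual[name]:
            actions.append(("update", name))
    return actions


# Hypothetical resources declared in the repository.
DESIRED = {
    "dataset.analytics": {"location": "US", "readers": ["analysts"]},
    "dataset.sandbox": {"location": "US", "readers": ["everyone"]},
}

# Hypothetical resources that currently exist in the data platform.
ACTUAL = {
    "dataset.analytics": {"location": "US", "readers": []},
    "dataset.legacy": {"location": "US", "readers": ["analysts"]},
}

if __name__ == "__main__":
    for action, name in plan(DESIRED, ACTUAL):
        print(action, name)
```

Reviewing such a list of actions in a Pull Request before anything is applied is what makes changes to the platform predictable.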

Note that as with everything, balance is important. For example, most likely it won’t make sense to define a resource for absolutely every table that you want to create in your data warehouse. Instead, consider letting other data tools (such as dbt and Dagster) create table resources on the fly. Going too granular will add a huge operational overhead, while not putting all main resources in code makes the data platform less reliable, messier and prone to incidents. But if done right, such an IaC repository becomes both your data platform’s technical inventory and its management plane.

In the case of Metis, we define all technical infrastructure resources down to the dataset level, along with access controls, in Terraform, and keep BigQuery table definitions in our dbt repository. This creates a simple abstraction: when we need to add infrastructure or configure access controls, we do it in the Terraform repository. When we need to add data to Metis, we use either the dbt repository (for SQL-generated data) or the Dagster repository (for non-SQL-generated data).

Tip #4: Set moderately strict limits on data platform and relax them when needed

When you’re in charge of a data platform, you’re also in charge of data security, data access and data processing costs, among other things. To keep the data safe and costs down, make sure to set:

Access controls on who can access what data, and how
Cost controls on who can query the data and how much of it they can scan

As always, balance is important. Restricting your internal users too much will halt a lot of work, while giving them too much freedom to use the data platform creates security and financial risks.

I recommend starting with moderately strict limits and creating a simple way to relax them when needed. Changing current restrictions might need approvals from upper management or other stakeholders, so consider how to make that communication simpler as well.

Also consider how to apply the controls above. For example, you can control access to data on a data warehouse, dataset, table and/or column level. Consider which option makes the most sense for your organization and implement accordingly. In our case, we found that controlling access on a dataset level while protecting certain columns with column-based policy tags works well as a balanced solution.
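As a rough illustration of combining dataset-level access with column-level protection, here is a small Python sketch. It does not call any BigQuery API. The AccessPolicy class, the group names and the "pii" tag are assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    # dataset name -> groups allowed to read the dataset
    dataset_readers: dict = field(default_factory=dict)
    # (dataset, table, column) -> policy tag protecting that column
    column_tags: dict = field(default_factory=dict)
    # policy tag -> groups allowed to read columns carrying the tag
    tag_readers: dict = field(default_factory=dict)

    def readable_columns(self, group, dataset, table, columns):
        """Return the columns of a table that `group` is allowed to read."""
        if group not in self.dataset_readers.get(dataset, set()):
            return []  # no dataset-level access at all
        allowed = []
        for column in columns:
            tag = self.column_tags.get((dataset, table, column))
            if tag is None or group in self.tag_readers.get(tag, set()):
                allowed.append(column)
        return allowed


# Hypothetical setup: analysts and finance can read the dataset,
# but only finance can read the tagged email column.
POLICY = AccessPolicy(
    dataset_readers={"sales": {"analysts", "finance"}},
    column_tags={("sales", "orders", "email"): "pii"},
    tag_readers={"pii": {"finance"}},
)
ORDER_COLUMNS = ["order_id", "amount", "email"]

if __name__ == "__main__":
    for group in ("analysts", "finance", "marketing"):
        print(group, POLICY.readable_columns(group, "sales", "orders", ORDER_COLUMNS))
```

The dataset-level check keeps the number of rules small, and the tags cover the few columns that need extra protection.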

For cost controls, a moderate total daily data scan limit that users know about provides a good starting point. Separate on-demand workloads into their own quota pool and set its limit accordingly. On-demand workloads in a data platform are usually not critical, so the quota will protect your data platform budget. And when a critical use-case that needs to scan a lot of data comes up, you can always temporarily increase the quota limit.
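The quota logic itself is simple. Here is a minimal sketch in Python, assuming two hypothetical pools and made-up limits. In practice you would rely on the quota features of your data warehouse rather than on custom code; the sketch only shows the behavior described above.

```python
TB = 10**12  # bytes in a terabyte (decimal), used for readability


class DailyScanQuota:
    """Tracks how many bytes a pool of workloads has scanned today."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def charge(self, estimated_bytes):
        """Record a query if it fits into the remaining quota; return whether it may run."""
        if self.used_bytes + estimated_bytes > self.limit_bytes:
            return False
        self.used_bytes += estimated_bytes
        return True

    def raise_limit(self, new_limit_bytes):
        """Temporarily relax the limit, for example for an approved critical use-case."""
        self.limit_bytes = new_limit_bytes

    def reset(self):
        """Called once a day to start a new quota period."""
        self.used_bytes = 0


# Hypothetical pools: ad hoc queries get a tighter limit than scheduled jobs.
POOLS = {
    "on_demand": DailyScanQuota(limit_bytes=1 * TB),
    "scheduled": DailyScanQuota(limit_bytes=10 * TB),
}

if __name__ == "__main__":
    on_demand = POOLS["on_demand"]
    print(on_demand.charge(int(0.6 * TB)))  # fits into the quota
    print(on_demand.charge(int(0.5 * TB)))  # would exceed the 1 TB limit
    on_demand.raise_limit(2 * TB)           # approved temporary increase
    print(on_demand.charge(int(0.5 * TB)))  # now it fits
```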

When these controls are implemented right, your security and finance teams will be very grateful to you, and your data operations will be easier, safer and cheaper.

Tip #5: Partition and cluster tables

This tip is written based on BigQuery, but I believe other data warehouses have the same or similar functionality.

Partitioning and clustering tables is one method that is easy to implement and gives you huge performance and cost-saving benefits right away. When you have a lot of data in a table, there is no reason not to use either or both.

Partitioning

Partitioning allows you to logically split the data into different partitions based on a date or a number. By partitioning a table by day using a DATETIME column (for example, created_at), the data for, let’s say, December 18, 2022 (2022-12-18) will automatically land in its own partition. Then, when you want to retrieve the data for this day and specify the condition WHERE DATE(created_at) = DATE("2022-12-18"), the SQL engine knows that it needs to scan only the “2022-12-18” partition for data. Since only one partition of the whole table needs to be scanned, the query becomes much faster and much cheaper.

The same goes for uploading data into a table. No more time-consuming and costly UPDATE or MERGE operations: when you need to rewrite data for a specific date, just delete the partition and re-upload the data!
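To show why this is cheaper, here is a toy model of a day-partitioned table in Python. It is only an in-memory simulation with a made-up PartitionedTable class, not how BigQuery stores data, but it demonstrates both effects: a query for one day touches only that day’s rows, and rewriting a day means replacing a single partition.

```python
from collections import defaultdict
from datetime import date, datetime


class PartitionedTable:
    """Toy table that stores rows in one bucket per day of `created_at`."""

    def __init__(self):
        self.partitions = defaultdict(list)  # date -> list of rows

    def insert(self, row):
        self.partitions[row["created_at"].date()].append(row)

    def query_day(self, day):
        """Return (rows, rows_scanned) for one day; only that partition is read."""
        rows = self.partitions.get(day, [])
        return list(rows), len(rows)

    def full_scan_day(self, day):
        """Same result without pruning: every row in the table is inspected."""
        scanned, result = 0, []
        for rows in self.partitions.values():
            for row in rows:
                scanned += 1
                if row["created_at"].date() == day:
                    result.append(row)
        return result, scanned

    def replace_partition(self, day, rows):
        """Rewrite one day's data by dropping and reloading its partition."""
        self.partitions[day] = list(rows)


def build_demo_table():
    """Load 3, 2 and 4 rows for December 17, 18 and 19, 2022."""
    table = PartitionedTable()
    for day, row_count in [(17, 3), (18, 2), (19, 4)]:
        for i in range(row_count):
            table.insert({"id": f"{day}-{i}", "created_at": datetime(2022, 12, day, 10, i)})
    return table


if __name__ == "__main__":
    table = build_demo_table()
    target = date(2022, 12, 18)
    print(table.query_day(target)[1], "rows scanned with partition pruning")
    print(table.full_scan_day(target)[1], "rows scanned without it")
```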

If you have data in Cloud Storage, you can similarly partition it using a Hive partitioning layout.

You can also partition existing tables by temporarily stopping data ingestion into a table, moving all data into a newly created partitioned table, deleting the old table, and renaming the new table to the previous name.

Clustering

What if you have a lot of data in a single partition or a big table without a column to partition on, but still want to improve performance and reduce costs? Clustering allows you to further split data into clusters. Let’s say you have a table with purchase transactions per store in different cities, and your most common use case is to calculate total transactions per specific city per store. By clustering the table by city and store_name columns you automatically optimize all queries with WHERE city = X AND store_name = Y conditions, since now they scan only specific clusters of the table. Combine that with partitioning and now you have very efficient queries without rewriting any SQL!
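Both settings are declared when the table is created. As a sketch, the small Python helper below builds a BigQuery-style CREATE TABLE statement with PARTITION BY and CLUSTER BY clauses for the transactions example. The helper function, the table name shop.transactions and the column types are assumptions for illustration; check the DDL reference of your data warehouse before using the generated SQL.

```python
def create_table_ddl(table, columns, partition_column=None, cluster_columns=()):
    """Build a CREATE TABLE statement with optional daily partitioning and clustering."""
    if len(cluster_columns) > 4:
        # BigQuery allows at most four clustering columns per table.
        raise ValueError("at most four clustering columns are supported")
    column_sql = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    ddl = f"CREATE TABLE {table} (\n  {column_sql}\n)"
    if partition_column:
        # One partition per calendar day of the given date/time column.
        ddl += f"\nPARTITION BY DATE({partition_column})"
    if cluster_columns:
        ddl += "\nCLUSTER BY " + ", ".join(cluster_columns)
    return ddl


# Hypothetical schema for the purchase transactions example.
TRANSACTIONS_DDL = create_table_ddl(
    table="shop.transactions",
    columns=[
        ("created_at", "DATETIME"),
        ("city", "STRING"),
        ("store_name", "STRING"),
        ("amount", "NUMERIC"),
    ],
    partition_column="created_at",
    cluster_columns=("city", "store_name"),
)

if __name__ == "__main__":
    print(TRANSACTIONS_DDL)
```

The order of clustering columns matters: put the column you filter on most often first.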

Before Metis, we only had a manual partitioning method and no clustering configuration on our tables. By partitioning and clustering large and heavily used tables we achieved a huge boost in performance and huge cost savings. In one case, we managed to bring down the cost of a specific use-case by 99.8% and reduce query runtime by 75%!

That said, there is no need to partition and cluster absolutely every table. If you have a table with 1000 rows of data that you don’t expect to grow, then you won’t notice any benefits from partitioning or clustering. Identify the tables that will give you the most gains and start with them.

Tip #6: Consider when to use a SaaS, an open-source solution or build your own

There is no tool that fits all use-cases. Whether it’s a data warehouse solution, ETL or visualization tool, the choice of which one to use will depend on the goals, needs and current situation in your organization and data department.

Save your team and company from future troubles by thoroughly evaluating your situation and available tools in advance.

List your use-cases and requirements
Summarize advantages and disadvantages of each tool
Evaluate compatibility with your data and tech stack, consider usability
Narrow the list down to possible matches
Subscribe to the trial period of a SaaS service or install an open-source solution on a test machine, and thoroughly evaluate each tool on use-cases that are as close to real data volume and velocity as possible

Most likely you will need to make a trade-off, so choose the one that has the least impact. For example, existing ETL tools such as Fivetran or Airbyte might easily handle many simple batch data sync workflows, but they might not be able to process huge amounts of data fast enough. Research whether there is a solution that can handle both use-cases, evaluate whether it makes sense to add a custom solution, or go with a completely different option.

This is the part where it’s easy to go wrong and then suffer from the consequences of that choice when the tool already has too many dependencies in your stack. A well thought out choice from the start can save you from a costly and lengthy migration later.

Tip #7: Create an internal data catalog

The operational data catalog mentioned in tip #2 is very useful for the data platform team, but it’s not suitable to be shared with users. After all, it’s probably not a good idea to give users the keys to managing all data assets.

Instead, it’s a good idea to create a simple data catalog that focuses on describing the data you have and enables data self-discovery for your users. Having a centralized data catalog will greatly help in reducing the amount of “Where is X data?” questions, and will give a complete visual picture of all existing data in your company, thanks to the data lineage and descriptions of your data assets.

What is the difference between the two types of data catalogs?

Operational data catalog — describes the desired state of data assets based on code definitions.
User-facing data catalog — describes the actual state of data assets based on the scan of your current data inventory.

Getting started with a user-facing data catalog is a straightforward process:

Research available data cataloging solutions. There are plenty of options — find the one which fits your goals, requirements and tech stack the best.
Install or subscribe to the data catalog of your choice, and connect it to all your sources and data assets to ingest metadata about them.
Create a schedule or triggers to periodically scan and update the contents of the data catalog.

Don’t forget that this is an active catalog: it needs to be maintained, cleaned and categorized periodically. It might look like an additional operational burden on the data team (in some cases requiring putting on a completely new ‘data steward’ hat), but in reality it can greatly help to reduce operational overhead and the number of data-related questions. It can also improve overall data operations and knowledge about the state of data inside the company.

Tip #8: Create a monitoring center for the data platform

Do you have a screen that you can open and understand the current health of the data platform just at a glance? It’s common for infrastructure teams to have dashboards for checking an application’s or service’s infrastructure health, so it should also be common to do the same for data infrastructure.

Figure: sample dashboard for data processing only (Source). Usually you need to create a more encompassing monitoring center for data platform resources.

Creating a unified monitoring center for a data platform doesn’t take much time, but provides great observability, makes data operations much easier and helps prevent or recover faster from incidents.

Spend a couple of days learning a monitoring solution of your choice (we use the integrated Google Cloud Monitoring)
Spend a couple more days creating dashboards with metrics for your data platform resources such as pipelines and virtual machines, for example data ingestion rate, CPU/RAM usage of your Tableau Server VMs and data insert errors
Add thresholds and alerts that will automatically notify you when something is wrong

This can be done either manually or in code (with tools like Terraform). Now you not only have automated notifications when something goes wrong, but can also always get the current status (and peace of mind) just by opening the monitoring center’s dashboard.
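The threshold logic behind such alerts can be sketched in a few lines of Python. The metric names and limits below are made up for illustration, and a monitoring solution would evaluate such rules for you. The point is that every metric on the dashboard gets an explicit limit, and that a missing metric is treated as a problem too.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    direction: str  # "above": alert when value > limit; "below": alert when value < limit


def evaluate(thresholds, metrics):
    """Return a list of alert messages for the current metric values."""
    alerts = []
    for threshold in thresholds:
        value = metrics.get(threshold.metric)
        if value is None:
            alerts.append(f"{threshold.metric}: no data received")
        elif threshold.direction == "above" and value > threshold.limit:
            alerts.append(f"{threshold.metric}: {value} is above {threshold.limit}")
        elif threshold.direction == "below" and value < threshold.limit:
            alerts.append(f"{threshold.metric}: {value} is below {threshold.limit}")
    return alerts


# Hypothetical thresholds for the metrics mentioned in the list above.
THRESHOLDS = [
    Threshold("ingestion_rows_per_minute", 1000, "below"),
    Threshold("tableau_vm_cpu_percent", 90, "above"),
    Threshold("insert_errors_per_hour", 0, "above"),
]

if __name__ == "__main__":
    current = {"ingestion_rows_per_minute": 250, "tableau_vm_cpu_percent": 42}
    for alert in evaluate(THRESHOLDS, current):
        print(alert)
```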

Tip #9: Keep things simple for data platform users

Data platforms are complex structures, and the bigger the company, the more complex they get. Many stakeholders across different teams and departments directly or indirectly use the internal data platform. Make this complex structure as easy for them to use as possible!

There are many ways to simplify a data platform, and the right ones largely depend on the company, including its current tools, methods and approaches. The secret is to think from the perspective of your internal users, the same way a product’s UX is built from the perspective of a customer. Some ideas:

Have a clear, defined, easy to understand structure for organizing data. For example, what dataset is used for what data or purpose. Add necessary descriptions and/or simple documentation so users can understand what data is located where.
Create naming conventions for data assets. An often-mentioned example is adding dim_ or fact_ prefixes to tables if you are using a star schema design. Do the same for code, for example by having all source assets follow a naming pattern like <source>_<region>_<table_name> or similar.
Fewer tables often mean more productivity. Before Metis we had a lot of tables in a format like events_20221218 that were created with a date suffix for each day. Imagine how many tables you need to go through when you have a handful of such tables and a couple of years of data, each with a suffix!
We moved to single partitioned tables, i.e., an events table partitioned by day. It is much easier for users to use the data from just one table, and for the data management team to manage data assets.
Make it easy for users to contribute to the data platform. For example, before Metis, when our data analysts needed to add a new table to BigQuery or modify a current one, they had to modify an often complex data pipeline spanning several data tools, each requiring an understanding of how it works.
Now in Metis, they just need to fill in a table configuration YAML template, create a file with a SQL query, and put both files in an internal dbt Git repository. Everything else is automated. The only knowledge required is basic dbt concepts and commands, and BigQuery SQL.
Use visual explanations where possible. Create simple diagrams with a diagramming tool. This saves a lot of time for both you (explanations are faster, fewer questions) and data platform users (understanding is faster).
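Naming conventions and date-suffixed tables are easy to check automatically. Below is a minimal Python sketch, assuming the dim_/fact_ prefix convention and the events_20221218 suffix format mentioned in the list above. The regular expressions, function names and sample table names are my own assumptions, so adjust them to your conventions.

```python
import re
from collections import Counter

# Star schema convention: tables start with dim_ or fact_, lowercase snake case.
MODEL_PATTERN = re.compile(r"^(dim|fact)_[a-z][a-z0-9_]*$")
# Date-suffixed tables such as events_20221218.
SHARD_PATTERN = re.compile(r"^(?P<base>.+)_(?P<day>\d{8})$")


def naming_violations(table_names):
    """Return the tables that do not follow the dim_/fact_ convention."""
    return [name for name in table_names if not MODEL_PATTERN.match(name)]


def sharded_table_counts(table_names):
    """Count date-suffixed tables per base name; each base name is a candidate
    for consolidation into a single partitioned table."""
    counts = Counter()
    for name in table_names:
        match = SHARD_PATTERN.match(name)
        if match:
            counts[match.group("base")] += 1
    return dict(counts)


# Hypothetical table names.
WAREHOUSE_TABLES = ["dim_users", "fact_purchases", "Purchases_tmp"]
RAW_TABLES = ["events_20221217", "events_20221218", "clicks_20221218", "users"]

if __name__ == "__main__":
    print(naming_violations(WAREHOUSE_TABLES))
    print(sharded_table_counts(RAW_TABLES))
```

A check like this can run in CI for the repository where tables are defined, so users get feedback before a badly named table ever reaches the data warehouse.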

Each company is different, so get to know your internal users well and think about how you can make their data operations easier!

Tip #10: Check your data inventory periodically and clean it up

This is a universal tip for any process where you manage a collection of something — data assets in a data platform, personal notes, task lists, or just items in your fridge. Check the inventory of things that you have and remove the ones that are no longer needed. It’s normal to throw away an expired can of beans found in a fridge — why not do the same and delete all those test_<something> tables in the data warehouse that you created a couple of years ago and which are no longer needed?

As a data manager, create a list of data assets that don’t appear to be in use, and schedule periodic sessions with data platform users (or just ask them on Slack/Teams) to check whether those assets are still needed or can be deleted.

To keep your development environment clean and tidy, another useful approach is to attach a Time-To-Live (TTL) configuration to personal datasets or other resources to automatically delete stale tables. Just make sure that your data platform users know about it!
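Both approaches come down to comparing timestamps. Here is a small Python sketch, assuming you can export table metadata with creation and last-query times. The field names, table names and time windows are hypothetical.

```python
from datetime import datetime, timedelta


def expired_tables(tables, now, ttl):
    """Names of tables older than `ttl`, i.e. what a TTL policy would delete."""
    return [t["name"] for t in tables if now - t["created_at"] > ttl]


def review_candidates(tables, now, idle):
    """Names of tables nobody has queried for longer than `idle`."""
    return [t["name"] for t in tables if now - t["last_queried_at"] > idle]


NOW = datetime(2023, 1, 23)

# Hypothetical metadata for a personal (sandbox) dataset with a TTL.
PERSONAL_TABLES = [
    {"name": "test_join_experiment", "created_at": datetime(2021, 2, 1),
     "last_queried_at": datetime(2021, 2, 3)},
    {"name": "tmp_export", "created_at": datetime(2023, 1, 20),
     "last_queried_at": datetime(2023, 1, 20)},
]

# Hypothetical metadata for a shared dataset that is reviewed with users.
SHARED_TABLES = [
    {"name": "events", "created_at": datetime(2020, 6, 1),
     "last_queried_at": datetime(2023, 1, 22)},
    {"name": "campaign_2020_results", "created_at": datetime(2020, 9, 1),
     "last_queried_at": datetime(2021, 3, 1)},
]

if __name__ == "__main__":
    print(expired_tables(PERSONAL_TABLES, NOW, ttl=timedelta(days=30)))
    print(review_candidates(SHARED_TABLES, NOW, idle=timedelta(days=180)))
```

Note the difference: the TTL list is deleted automatically, while the review list is only a starting point for a conversation with users.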

I hope the tips above can help you create a more efficient, simpler and better data platform. They worked for us, and I believe they can work for many other companies. However, remember that each situation is different, so consider the current state of your data platform and the situation in your company before implementing these tips.

Bonus tip: No matter how technically good your data platform is, in the end, communication is the key to success. You must advertise the data platform and explain the benefits of using it to get the buy-in from users and make it a successful internal product. Good luck!

Originally posted here
