How to Create a Data Catalog: A Step-by-Step Guide

Etai
Jun 23, 2022 · 6 min read

I have always found it interesting to learn how to create an organized catalog of data. That interest turned into a passion when I realized how much time and effort a catalog could save me in my job. Creating a data catalog helps you organize the data you collect, making it easier to find what you need when you need it.

By the end of this post, we should’ve shed some light on the following questions:

Why do all data teams need data discovery tools?

What kind of data catalog should teams look for?

How do you get buy-in for a data catalog?

Which tools are here to stay?

Why do you need a data catalog?

Simple data cataloging starts with good organization. A data catalog is a collection of metadata and documentation that helps make sense of the data sprawl that exists in most growing companies. Setting up a data catalog is a simple process, but getting people to adopt it and making it part of your workflow is more difficult.

Even though it may seem like an easy task, getting different stakeholders to change their routines and start using a new tool can be very challenging. One of the delivery companies we spoke with shared an example of the problems a data catalog is meant to solve. At this company, it was difficult to get aligned on which tables were commonly used, which were joined, how they were used together and what the columns meant. Similarly, it's difficult to monitor the number of data assets that exist across different departments, especially when the number of resources grows at a faster rate than the number of people. Why is this the case?

Data is becoming more decentralized through concepts like the data mesh. As more teams outside of the data function start to use data in their day-to-day work, different tables, dashboards and definitions are being created at an almost exponential rate. Data catalogs are important because they help you organize your data, whether it is structured or unstructured. They help you identify what kind of data you have, how the different pieces relate to each other and where each piece is best stored, so that you can quickly find it when needed.

This happens to all data-driven companies as they grow both their data and their headcount. While it sounds like an easy problem that a single meeting could solve, aligning the business and data teams to eliminate confusion can be a very challenging task. That's why a data catalog can be one of the most valuable tools that a data team can use to start measuring their impact.

Below are the steps that teams need to take when creating a data catalog:

1. Gather sources from across the organization 

The first step data teams need to take is to collect the different resources that are scattered across different tools in the organization. This may require multiple meetings and stakeholders to come together and figure out which resources need to be in the catalog. This collection could be done in a spreadsheet with an ongoing list of all resources and how they connect.

It is common for teams to work with a tool that connects to multiple sources and collects the metadata from those sources into a central location. A data catalog tool makes this step much easier, which is why it is a vital component of any data-driven organization. With the volume of data collected by businesses increasing, properly organizing it is becoming increasingly important. Priority should always be given to ensuring that the data is useful, but we should also take into account the gains to be reaped through improved efficiency and collaboration.
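To make the idea of collecting metadata into a central location concrete, here is a minimal sketch in Python. It assumes a single SQLite database as the source; the table names, the `demo_warehouse` label and the `collect_metadata` helper are hypothetical and only for illustration. A real catalog tool would connect to many sources and capture far more detail.

```python
import sqlite3


def collect_metadata(conn, source_name):
    """Return one catalog row per column for every table in a SQLite database."""
    rows = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        for _cid, column, col_type, *_rest in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            rows.append(
                {
                    "source": source_name,
                    "table": table,
                    "column": column,
                    "type": col_type,
                }
            )
    return rows


# Hypothetical example source, created in memory so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")

catalog = collect_metadata(conn, "demo_warehouse")
for entry in catalog:
    print(entry)
```

The resulting list of rows is the same kind of inventory a team might otherwise keep by hand in a spreadsheet, and it can be regenerated whenever the source changes.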

2. Give each resource an owner

After data teams have identified all the resources from across the company that they would like to include in their data catalog, we recommend assigning ownership to each resource. Teams that we've worked with in the past have assigned ownership based on the source, schema or even domain. Teams that start assigning ownership should look for people who are familiar with the data they will be responsible for and are willing to help others who want to learn how to use it.

Ownership doesn't have to reside in the data team. Many product, operations and growth leaders will likely want to own certain pieces of company data knowledge. By empowering these leaders as data stewards, the data team can cut down on the repetitive questions they get from employees about how to use company data.
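As a rough illustration of assigning ownership by source or schema, the sketch below applies a small list of rules to each resource. The rule list, the owner names and the `Resource` fields are all assumptions made up for this example, not a prescribed structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    source: str
    schema: str
    name: str
    owner: Optional[str] = None


# Hypothetical rules, most specific first.
# Each rule is ((source, schema), owner); a schema of None matches any schema.
OWNER_RULES = [
    (("warehouse", "finance"), "finance-lead"),
    (("warehouse", None), "data-team"),
    (("dashboards", None), "growth-lead"),
]


def assign_owner(resource, rules=OWNER_RULES, default="unassigned"):
    """Return the owner from the first rule that matches the resource."""
    for (source, schema), owner in rules:
        if resource.source == source and schema in (None, resource.schema):
            return owner
    return default


resources = [
    Resource("warehouse", "finance", "invoices"),
    Resource("warehouse", "ops", "deliveries"),
    Resource("dashboards", "weekly", "signup_funnel"),
    Resource("spreadsheets", "misc", "vendor_list"),
]
for r in resources:
    r.owner = assign_owner(r)
    print(f"{r.source}.{r.schema}.{r.name} -> {r.owner}")
```

Anything that falls through to "unassigned" is a useful prompt for the next ownership conversation.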

3. Get support and sign off

Once these meetings conclude and owners are on the same page, have the owners sign off on their responsibilities. The owners should agree with the documentation and feel that the data team worked collaboratively with them to arrive at this ownership structure. One effective strategy is to involve the leadership team in the exercise early to make sure that their team leads are signing off on the owners of data. This way, leadership can see how widespread the understanding of data is across the company. If the leadership team sees the value of a data catalog, this can move at a much faster pace.

4. Integrate the catalog into your workflow

After data teams have received support for their data documentation process, they should look for ways to integrate the catalog into their workflow. This step is critical for maintenance and upkeep. Without prompts in the tools teammates already use, such as notifications in Slack, the catalog will likely be forgotten. By creating a process around the data catalog, teams can ensure that it is not left behind as the team grows.

One additional piece of context that teams should look for is the usage and adoption of the data catalog across the company. If a tool can display the way that employees are searching and using the knowledge base, it can be a great source of information for the data team as they continue to iterate on the information in this central repository. By ensuring that there are triggers and notifications that prompt users to enter the application, teams can make sure that they are taking advantage of the data catalog and improving it like a product. 
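One way to picture using search activity to drive notifications is the sketch below. It assumes a hypothetical search log (a list of resource names people looked up) and a dictionary of catalog descriptions, finds the most-searched entries that have no description, and builds a JSON message body with a `text` field, which is the basic shape Slack incoming webhooks accept. The sketch only builds the message; sending it to a webhook URL is left out.

```python
import json
from collections import Counter


def undocumented_hotspots(search_log, descriptions, top_n=3):
    """Return the most-searched resources that have no description yet."""
    counts = Counter(search_log)
    missing = [
        (name, hits)
        for name, hits in counts.most_common()
        if not descriptions.get(name)
    ]
    return missing[:top_n]


def build_reminder(hotspots):
    """Build a JSON message body with a single 'text' field."""
    lines = [
        f"- {name}: searched {hits} times, no description yet"
        for name, hits in hotspots
    ]
    text = "Catalog entries that need documentation:\n" + "\n".join(lines)
    return json.dumps({"text": text})


# Hypothetical data for illustration only.
search_log = ["orders", "orders", "customers", "orders", "refunds", "refunds"]
descriptions = {
    "orders": "",
    "customers": "One row per customer",
    "refunds": None,
}

hotspots = undocumented_hotspots(search_log, descriptions)
payload = build_reminder(hotspots)
print(payload)
```

Ranking by search count keeps the reminders focused on the entries people actually look for, which fits the idea of improving the catalog like a product.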

5. Maintain the data catalog

Although the documentation should be stable, it may need to change over time. One instance that might require documentation to change is when a new revenue stream is introduced or when the pricing of an existing revenue line changes. These changes traditionally come from the business team and might require the data team to implement the changes into the data catalog.
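A simple way to notice when documentation needs to change is to compare what the catalog says against what actually exists. The sketch below assumes both sides are available as dictionaries mapping a table name to a set of column names; the table and column names are hypothetical, chosen to mirror a new revenue stream being added.

```python
def diff_columns(documented, live):
    """Compare documented columns with live columns for each table.

    Both arguments map a table name to a set of column names.
    Returns only the tables where the two disagree.
    """
    report = {}
    for table in sorted(set(documented) | set(live)):
        doc_cols = documented.get(table, set())
        live_cols = live.get(table, set())
        undocumented = sorted(live_cols - doc_cols)  # exists, not in catalog
        stale = sorted(doc_cols - live_cols)  # in catalog, no longer exists
        if undocumented or stale:
            report[table] = {"undocumented": undocumented, "stale": stale}
    return report


# Hypothetical state before and after a new revenue stream is introduced.
documented = {"revenue": {"order_id", "amount"}}
live = {
    "revenue": {"order_id", "amount", "subscription_amount"},
    "pricing_plans": {"plan_id", "price"},
}

report = diff_columns(documented, live)
for table, changes in report.items():
    print(table, changes)
```

Running a check like this on a schedule turns catalog upkeep into a short list of concrete follow-ups for each owner.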

Documenting your data should be essential to any data-driven business process. It allows data practitioners to see how their work is used, by whom, and for what purpose. Just the act of documenting the data will help you understand what you have, where it is, and how it can be used. It may seem like an intimidating task at first, but I encourage you to start small, with a few datasets here and there, until the process becomes second nature. And once it does, you'll never look back!

Final thoughts

To sum it up, data catalogs are valuable for any organization, especially those growing quickly. No matter the size of the company, data documentation is an important part of running data operations well. Having a tool that can track, document, store and provide metadata insights for systems beyond just your current data stack will help you identify blind spots in your current architecture and integration points moving forward.

Teams that invest the time to get alignment using a data catalog can see major benefits in the long term as they make faster decisions as a team. Creating a data catalog is not a small undertaking, and the process requires patience and alignment with leadership. That's why we make it easier with Secoda.
