Data Cataloging

A data catalog acts as a central location for your data by scanning and mapping the metadata of each data system. It creates a living, single source of reference for finding and understanding your data.

What is a data catalog?

A data catalog is metadata and data management software that companies use to inventory and organize the data within their systems so that it’s easier for people to discover and understand. Data catalogs are commonly thought of as a data governance tool that provides insight into how data is used throughout an organization. Modern data catalogs, however, extend far beyond governance to become a key component of any modern data stack. Think of a data catalog like an operating system for your data and the applications that produce and consume data. Below, we’ll explore the most common use cases for an enterprise data catalog.

How does a data catalog fit into a modern data stack?

In a survey conducted this year by NewVantage Partners, only 24% of companies said they’re data-driven. That’s down from 37% in 2017. The primary reason these numbers have declined in each of the last four years is that data leaders are losing faith that their investments in big data are paying off. The modern data stack certainly makes it easier for companies to establish a cloud-native data architecture that’s scalable, innovative, and accessible. However, if data producers and consumers can’t find the data they need in a timely manner, or don’t understand the data when they find it, much of the promise of the modern data stack is lost. To become data-driven, your entire company must be able to use data to answer business questions with clarity, accuracy, and speed. A modern data catalog makes it easier to manage your data resources, steward your metadata, define key business terms, provide access to datasets, and make data more discoverable, trusted, and understood.

Data catalogs help you manage data resources

A great starter use case for data catalogs involves understanding what data assets you have within your organization, who owns them, where they live, and when they were last used or updated. A data catalog, particularly one powered by a knowledge graph, can help you map and organize your data resources. Data catalogs can connect to and crawl the other applications within your stack, pulling in data and associated metadata to provide a holistic picture of your data resources. Users can quickly and easily tag assets and ensure key terms are clearly defined in a business glossary that’s accessible to the entire organization.
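As a rough illustration of the inventory described above, the sketch below records what an asset is, who owns it, where it lives, and when it was last updated, then answers two simple questions about that inventory. The `CatalogEntry` schema, the helper functions, and the sample data are all hypothetical and are not taken from any particular catalog product.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CatalogEntry:
    """One data asset as a catalog might record it (hypothetical schema)."""
    name: str
    owner: str
    location: str
    last_updated: date
    tags: set[str] = field(default_factory=set)


def stale_assets(entries, as_of, max_age_days=90):
    """Return names of assets not updated within max_age_days of as_of."""
    return [e.name for e in entries if (as_of - e.last_updated).days > max_age_days]


def assets_by_owner(entries):
    """Group asset names by their owner."""
    grouped = {}
    for e in entries:
        grouped.setdefault(e.owner, []).append(e.name)
    return grouped


# Hypothetical sample data, for illustration only.
inventory = [
    CatalogEntry("orders", "sales-team", "warehouse.sales.orders", date(2024, 5, 1), {"revenue"}),
    CatalogEntry("legacy_leads", "marketing", "crm_export/leads.csv", date(2023, 1, 15)),
]

# A business glossary can be as simple as a term-to-definition mapping.
glossary = {"revenue": "Recognized income from completed orders."}
```

Calling `stale_assets(inventory, date(2024, 6, 1))` would flag only `legacy_leads`, which is the kind of "when was this last used or updated" question a catalog is meant to answer at a glance.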

Data catalogs help govern and steward your metadata

One of the great missteps companies make when it comes to data is focusing on “command and control.” That is, they make data unavailable to all but a few, either by limiting access or by managing data with a tool that only a handful of IT and governance people understand. This approach leads to a number of challenges outlined in the book “Winning with Data,” including:

  • Data breadlines: Data consumers make request after request to IT, which cannot keep up with the deadlines, forcing critical analysis to wait.
  • Data silos and rogue databases: To get around the red tape, data consumers often download data to their personal devices, creating security and version-control problems.
  • Data brawls: When data work isn’t transparent, people don’t trust it. This is especially true when people realize the analysis they’ve been working with for weeks may have originated from a rogue database.

Modern data catalogs can bring agility to data governance and stewardship. Catalogs can show who owns which data assets, when they were created, what analysis has been derived from them, and other resources that may be of interest. If you want to go deeper, catalogs can provide rich lineage for an asset and allow you to crowdsource metadata tags. And modern data catalogs don’t require users to wait on IT. Users who need access to a restricted dataset can request it directly from within the catalog and, depending on company policy, will be either granted or denied access. One key criterion is availability. To avoid the rogue database problem, select a data catalog that can be easily launched in the cloud and has an intuitive user experience so that anyone can use it, not just IT and governance teams. Modern data catalogs that fit in modern data stacks should make data discovery and fulfillment as easy as online shopping.
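The self-service access request described above could be sketched roughly as follows. This is a minimal illustration, assuming a simple role-based policy. The `AccessRequest` type, the `POLICY` table, and the `decide` function are hypothetical names, and a real catalog would typically add approval workflows and audit logging.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    """A user's request to read a restricted dataset (hypothetical shape)."""
    user: str
    role: str
    dataset: str


# Hypothetical company policy: which roles may read each restricted dataset.
POLICY = {
    "payroll": {"finance", "hr"},
    "web_events": {"analyst", "finance", "marketing"},
}


def decide(request, policy=POLICY):
    """Return 'granted' or 'denied' for a request checked against the policy."""
    allowed_roles = policy.get(request.dataset, set())
    return "granted" if request.role in allowed_roles else "denied"
```

Here `decide(AccessRequest("ana", "analyst", "web_events"))` is granted while the same analyst asking for `payroll` is denied. Datasets missing from the policy are denied by default, which is a deliberately conservative assumption.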

Data catalogs help you search and discover data

Modern data catalogs should help you find what you need when you need it, bringing a Google-like experience to data search and discovery. And because these new catalogs are no longer just the domain of IT and governance, they should also provide deep query, virtualization, and collaboration capabilities.

If a modern data catalog is the “operating system for your data,” then your data people need to be able to do actual work within the catalog; simple exploration isn’t enough. A modern data catalog should connect people with data resources wherever they exist, either via ingest or virtualization. Data consumers should be able to run federated queries using familiar languages like SQL and visualize the results in their BI tool of choice. That data and analysis should then become discoverable for everyone in the organization to share, reuse, and comment on. Because a catalog connects your entire data ecosystem, it’s a fundamental part of your modern data stack and essential for establishing a DataOps framework.
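To make the idea of a federated query concrete, the sketch below uses Python’s built-in `sqlite3` module and SQLite’s `ATTACH DATABASE` statement so that one SQL statement joins tables living in two separate databases. This is only a small stand-in: real federation engines query live, heterogeneous systems, and the table names and rows here are invented for illustration.

```python
import sqlite3

# Two separate in-memory databases stand in for two source systems.
conn = sqlite3.connect(":memory:")                      # "main" plays a CRM
conn.execute("ATTACH DATABASE ':memory:' AS billing")   # a second system

# Hypothetical tables and rows, one table per "system".
conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO main.customers VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex")])
conn.executemany("INSERT INTO billing.invoices VALUES (?, ?)",
                 [(1, 120.0), (1, 80.0), (2, 50.0)])

# A single SQL statement spans both databases.
rows = conn.execute("""
    SELECT c.name, SUM(i.amount) AS total
    FROM main.customers AS c
    JOIN billing.invoices AS i ON i.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

for name, total in rows:
    print(name, total)
```

The point of the sketch is that the person writing the query works in plain SQL and does not need to move either table first, which is the experience a catalog with virtualization aims to provide.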

Data catalogs help you understand and trust your data

A modern data catalog should document not just data definitions and ownership, but also the relationships between your data, metadata, people, and applications. This foundational understanding requires an underlying knowledge graph architecture.
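One way to picture the relationships a knowledge graph captures is as a set of (subject, relation, object) triples linking data assets, people, terms, and applications. The sketch below is a minimal in-memory illustration under that assumption. The triples, relation names, and the `downstream` helper are hypothetical, and a production catalog would use a proper graph store.

```python
from collections import defaultdict

# (subject, relation, object) triples; all names are hypothetical.
TRIPLES = [
    ("orders_table", "owned_by", "sales-team"),
    ("revenue_dashboard", "derived_from", "orders_table"),
    ("quarterly_report", "derived_from", "revenue_dashboard"),
    ("orders_table", "defines_term", "revenue"),
    ("bi_tool", "consumes", "revenue_dashboard"),
]


def build_index(triples):
    """Index triples as relation -> object -> list of subjects."""
    index = defaultdict(lambda: defaultdict(list))
    for subject, relation, obj in triples:
        index[relation][obj].append(subject)
    return index


def downstream(asset, triples=TRIPLES):
    """Return every asset derived, directly or indirectly, from `asset`."""
    derived = build_index(triples)["derived_from"]
    found, stack = [], [asset]
    while stack:
        current = stack.pop()
        for child in derived.get(current, []):
            if child not in found:
                found.append(child)
                stack.append(child)
    return found
```

Walking the `derived_from` edges from `orders_table` reaches both the dashboard and the report built on it, which is the kind of lineage question that is awkward to answer when definitions, owners, and assets are stored in unconnected lists.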

Knowledge graph-based data catalogs allow you to build a curated, connected data hub that fosters context-enabled data and analytics, a streamlined steward experience and workflow, and intuitive data health indicators from data profiling, sampling, and status processes. Add all of this to the ability to collaborate directly with subject matter experts within your data catalog and document the findings, and you have a data catalog built for a modern data stack and beyond.

Conclusion

Data catalogs should be an information radiator, collaboration hub, and operating system for the modern data stack. If you’re in the market for a data catalog, consider the following:

  • How quickly and easily can it deploy and roll out to your organization?
  • Is the catalog primarily an IT tool, or do you want to build a data-driven culture?
  • Do you understand how a catalog can deliver business value beyond data governance? Remember the use cases described above:
    • Data resource management
    • Agile data governance
    • Data search and discovery

Thaise Skogstad
Director of Product Marketing
data.world
