Data Discovery

Data discovery is the process of centralizing data and managing it from one place. The data discovery tool sits on top of the warehouse, BI tool and ETL layer, acting as an aggregator for your data.

Over the last few years, producing and storing data has become cheaper and easier. Organizations are now flooded with data in disparate data warehouses and data lakes, and they are starting to notice that these assets are becoming harder to discover and manage. For consumers outside of the data team, this complexity has made it increasingly difficult to understand what data exists, what to trust, and how to use it. Even with great data practices, many organizations still struggle to get value out of their data: up to 73% of all enterprise data goes unused. One of the big contributors to this problem is that organizations create data silos by not documenting and centralizing their data in a place where every employee can access the information.

While data visualization tools have made strides in bridging the gap between data producers and consumers, a piece of the puzzle is still missing. Data engineers and analysts report getting a steady stream of questions from employees about the quality, applicability, trustworthiness and relevance of the data, and answering these questions takes up a lot of their time. In some larger organizations, 30% of an analyst's time was spent finding, understanding and validating data. As organizations continue to invest in self-service, more people use data and more data is produced, so these problems become more painful for the data team. Organizations that want to solve these problems should consider investing in a data discovery tool.

What are data discovery tools?

Data discovery tools are built to centralize data and manage it from one place. Sitting on top of the warehouse, BI tool and ETL layer, they extract metadata from these siloed tools and let data consumers search through it without jumping between tools. They also document data automatically and let data teams add their own documentation, such as tags, issues, likes and bookmarks. By extracting metadata automatically, a data discovery tool should make data documentation simple and proactive. Most data discovery tools enrich the data with a social layer, which shows who uses the data most frequently and allows teams to collaborate in the tool. These social features help users navigate and organize the data in a way that is logical for their team. With a good data discovery tool, users can answer the following questions without a data analyst:

  • How do I use this data?
  • Who uses this data frequently?
  • What data should I use?
  • Can I trust this data?
  • When was this data last updated?
  • What are some similar resources?
  • Does this data contain sensitive information?
  • What does this column mean?
  • Where is this data from?
  • What does this data impact?
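
To make the idea of aggregated, searchable metadata more concrete, here is a minimal sketch in Python. It is not how any particular product works: the `TableMetadata` and `Catalog` classes, their field names, the `pii` tag convention and the sample entries are all assumptions invented for illustration. It only shows how metadata pulled from different tools could sit in one place and answer a few of the questions above (what a column means, who uses a resource most, whether it holds sensitive information).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """One catalog entry. Every field name here is hypothetical."""
    name: str
    source: str                      # tool the metadata was extracted from
    description: str = ""
    columns: dict[str, str] = field(default_factory=dict)    # column -> meaning
    tags: set[str] = field(default_factory=set)
    last_updated: str = ""           # ISO date string
    query_count: dict[str, int] = field(default_factory=dict)  # user -> queries


class Catalog:
    """In-memory stand-in for the central metadata store."""

    def __init__(self) -> None:
        self._entries: dict[str, TableMetadata] = {}

    def register(self, entry: TableMetadata) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> list[str]:
        """Return names of entries whose metadata mentions the term."""
        term = term.lower()
        hits = []
        for entry in self._entries.values():
            haystack = " ".join(
                [entry.name, entry.description,
                 *entry.columns.keys(), *entry.columns.values(), *entry.tags]
            ).lower()
            if term in haystack:
                hits.append(entry.name)
        return sorted(hits)

    def top_users(self, name: str, n: int = 3) -> list[str]:
        """Who uses this data most frequently?"""
        counts = self._entries[name].query_count
        return sorted(counts, key=lambda user: (-counts[user], user))[:n]

    def has_sensitive_data(self, name: str) -> bool:
        """Assumes sensitive resources carry a 'pii' tag."""
        return "pii" in self._entries[name].tags


# Made-up sample entries from two different tools in the stack.
catalog = Catalog()
catalog.register(TableMetadata(
    name="analytics.orders",
    source="warehouse",
    description="One row per customer order",
    columns={"order_id": "Unique order identifier",
             "email": "Customer email address"},
    tags={"pii", "finance"},
    last_updated="2021-03-01",
    query_count={"ana": 42, "ben": 7},
))
catalog.register(TableMetadata(
    name="bi.weekly_revenue",
    source="bi_tool",
    description="Revenue by week, built on analytics.orders",
    tags={"finance"},
))

print(catalog.search("email"))
print(catalog.top_users("analytics.orders"))
print(catalog.has_sensitive_data("analytics.orders"))
```

A real tool would populate these entries automatically from warehouse, BI and ETL metadata rather than by hand, and would use a proper search index instead of substring matching.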

Benefits of incorporating data discovery tools

A data discovery tool offers a few primary benefits. The first, and most frequently cited, is the reduced time spent on data discovery and management. When teams have a central place to search, manage and comment on data, the time they spend trying to understand what data means should decrease. Most data teams spend about 30% of their time on these tasks. With a data discovery tool, they can expect the time spent on discovery, documentation and management to decrease by up to 95%. This frees the data team to tackle new projects and work strategically with business teams on longer-term goals.

The second major benefit of adopting a data discovery tool is that employees are less likely to make mistakes by using the wrong data. While this may seem like a small problem, it is an extremely common, and anxiety-provoking, experience for many data teams. When employees can verify that the resource they are using is the correct one, data scientists, analysts and product managers can all confidently use data to make decisions. This creates an additional benefit: increased transparency around decision making. Teams are often confused about key metrics and how they are measured. With a good data discovery tool, knowledge that was once available to only a few employees becomes available across the organization.

Lastly, a data discovery tool benefits employee engagement. Teams that adopt one should be able to onboard new employees faster and offboard departing employees with less loss of tribal knowledge, because all the important information is documented and searchable in one central place. A new employee doesn't have to spend weeks booking meetings with stakeholders to learn the nuances of the naming conventions. Instead, they can spend time in the company's data discovery tool and learn the key information in days.

In short, the benefits of a data discovery tool are more efficient, transparent and self-sufficient teams. As teams continue to embrace remote work, data discovery tools become an important way to get everyone on the same page when they aren't in the same place.

Best practices of data discovery

Teams that are adopting a data discovery tool should consider a few important factors. The first is whether the tool creates a holistic picture of the data stack and makes it available to anyone looking for information. Instead of having multiple apps and tools open while trying to find the right information, the data discovery tool should become a central source of truth about your team's data. This means looking beyond the data warehouse and BI tool: a data discovery tool with integrations to Airflow, dbt, great_expectations, Amazon S3 and other critical pieces of the data stack can add more value.

Additionally, teams should adopt data discovery tools that are easy for everyone to use. The goal is to allow anyone to find data, so the tool should not overcomplicate the discovery process. Some tools do this by letting teams connect the data discovery tool to Slack, so users can stay updated about changes and new documentation, and even search for data, right in Slack. These kinds of features help data discovery tools bridge the gap between data producers and consumers.

Teams should evaluate data discovery tools along a few dimensions. The main ones are:

  • Number of integrations
  • Price
  • Amount of automated documentation
  • Governance functionality
  • Intuitiveness
  • Search functionality
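
One simple way to compare tools on these criteria is a weighted scoring matrix. The sketch below is only an illustration of that arithmetic: the weights, the 1 to 5 scores and the names `tool_a` and `tool_b` are made-up placeholders, not assessments of any real product, and each team would choose its own weights.

```python
# Criteria mirror the list above. Weights are hypothetical and sum to 1.
WEIGHTS = {
    "integrations": 0.25,
    "price": 0.15,                    # higher score = better value
    "automated_documentation": 0.20,
    "governance": 0.15,
    "intuitiveness": 0.15,
    "search": 0.10,
}


def weighted_score(scores, weights=WEIGHTS):
    """Weighted average of per-criterion scores (for example on a 1-5 scale)."""
    missing = weights.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


# Placeholder candidates with invented scores.
candidates = {
    "tool_a": {"integrations": 5, "price": 2, "automated_documentation": 4,
               "governance": 3, "intuitiveness": 4, "search": 4},
    "tool_b": {"integrations": 3, "price": 5, "automated_documentation": 3,
               "governance": 4, "intuitiveness": 3, "search": 3},
}

# Rank candidates from highest to lowest weighted score.
ranking = sorted(candidates,
                 key=lambda name: weighted_score(candidates[name]),
                 reverse=True)

for name in ranking:
    print(f"{name}: {weighted_score(candidates[name]):.2f}")
```

With these placeholder numbers, the heavier weight on integrations outweighs the price advantage of the second candidate; changing the weights to reflect a team's own priorities can reverse the ranking.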

Data discovery in 2021

Over the last few years, many large tech companies have open-sourced their data discovery tools. These tools have allowed more teams to consider adopting data discovery as part of their data stack. Once teams have set up their data infrastructure, data discovery will become a bigger priority. As I look ahead at the decade to come, I believe more companies will become data companies. As more teams look to unlock data in their workplace, data discovery will provide a necessary central hub that levels the playing field. I am excited about the future of data discovery and how tools like Secoda will help more teams become data-driven at an early stage while protecting their data.

Etai Mizrahi

Featured Companies

Here are some amazing companies in the data discovery space.

Castor is a collaborative, automated data discovery tool. With Castor, ...

Secoda is a collaborative workspace for data teams that makes it easy ...

Amundsen is a metadata driven application for improving the productivi ...

Metaphor is a search and discovery tool built for data scientists, dat ...