What Is Data Lineage?

Eitan ChazbaniSep 02, 202212 min read

Some pipeline tasks, probably running by Airflow, are scraping external data sources and collecting data from there.
Those tasks are saving the extracted data in the data lake (or warehouse or lakehouse).
Other tasks, probably SQL jobs orchestrated with DBT, are running transformation on the loaded data. They are querying raw data tables, enriching them, joining between tables, and creating business data – all ready to be used.
Dashboarding tools such as Tableau, Looker, or Power BI are being used on top of the business data and providing visibility to multiple stakeholders.

What’s the difference between data lineage and data provenance?

Data lineage and data provenance are often viewed as one and the same. While the two are closely related, there is a difference.

Whereas data lineage tracks data throughout the complete lifecycle, data provenance zooms in on the data origin. It provides insight into where data comes from and how it gets created by looking at important details like inputs, entities, systems, and processes for the data.

Data provenance can help with error tracking when understanding data lineage and can also help validate data quality.

Why is it important?

As businesses use more big data in more ways, having confidence in that data becomes increasingly important – just look at Elon Musk’s deal to buy Twitter for an example of trust in data gone wrong. Consumers of that data need to be able to trust in its completeness and accuracy and receive insights in a timely manner. This is where data lineage comes into play.

Data lineage instills this confidence by providing clear information about data origin and how data has moved and changed since then. In particular, it is important to key activities like:

Data governance: Understanding the details of who has viewed or touched data and how and when it was changed throughout its lifecycle is essential to good data governance. Data lineage provides that understanding to support everything from regulatory compliance to risk management around data breaches. This visibility also helps ensure data is handled in accordance with company policies.
Data science and data analytics: Data science and data analytics are critical functions for organizations that are using data within their business models, and powering strong data science and analytics programs requires a deep understanding of data. Once again, data lineage offers the necessary transparency into the data lifecycle to allow data scientists and analysts to work with the data and identify its evolutions over time. For instance, data lineage can help train (or re-train) data science models based on new data patterns.
IT operations: If teams need to introduce new software development processes, update business processes, or adjust data integrations, understanding any impact to data along the way – as well as where data might need to come from to support those processes – is essential. Data lineage not only delivers this visibility, but it can also reduce manual processes associated with teams tracking down this information or working through data silos.
Strategic decision making: Any organization that relies on data to power strategic business decisions must have complete trust that the data they’re using is accurate, complete, and timely. Data lineage can help instill that confidence by showing a clear picture of where data has come from and what happened to it as it moved from one point to another.
Diagnosing issues: Should issues arise with data in any way, teams need to be able to identify the cause of the problem quickly so that they can fix it. The visibility provided by data lineage can help make this possible by allowing teams to visualize the path data has taken, including who has touched it and how and when it changed.

What are common techniques?

There are several commonly used techniques for data lineage that collect and store information about data throughout its lifecycle to allow for a visual representation. These techniques include:

Pattern-based lineage: Evaluates metadata for patterns in tables, columns, and reports rather than relying on any code to perform data lineage. This technique focuses directly on the data (vs. algorithms), making it technology-agnostic; however, it is not always the most accurate technique.
Self-contained lineage: Tracks data movement and changes in a centralized system, like a data lake that contains data throughout its entire lifecycle. While this technique eliminates the need for any additional tools, it does have a major blind spot to anything that occurs outside of the environment at hand.
Lineage by data tagging: A transformation engine that tags every movement or change in data allows for lineage by data tagging. The system can then read those tags to visualize the data lineage. Similar to self-contained lineage, this technique only works for contained systems, as the tool used to create the tags will only be able to look within a single environment.
Lineage by parsing: An advanced form of data lineage that reads logic used to process data. Specifically, it provides end-to-end tracing by reverse engineering data transformation logic. This technique can get complicated quickly, as it requires an understanding of all the programming logic used throughout the data lifecycle (e.g. SQL, ETL, JAVA, XML, etc.).

What are data lineage best practices?

When it comes to introducing and managing data lineage, there are several best practices to keep in mind:

Automate data lineage extraction: Manual data lineage centered around spreadsheets is no longer an option. Capturing the dynamic nature of data in today’s business environments requires an automated solution that can keep up with the pace of data and reduce the errors associated with manual processes.
Bring metadata source into data lineage: Systems that handle data, like ETL software and database management tools, all create metadata – or data about the data they handle (meta, right?). Bringing this metadata source into data lineage is critical to gaining visibility into how data was used or changed and where it’s been throughout its lifecycle.
Communicate with metadata source owners: Staying in close communication with the teams that own metadata management tools is critical. This communication allows for verification of metadata (including its timeliness and accuracy) with the teams that know it best.
Progressively extract metadata and lineage: Progressive extraction – or extracting metadata and lineage in the same order as it moves through systems – makes it easier to do activities like mapping relationships, connections, and dependencies across the data and systems involved.
Progressively validate end-to-end data lineage: Validating data lineage is important to make sure everything is running as it should. Doing this validation progressively by starting with high-level system connections, moving to connected datasets, then elements, and finishing off with transformation documentation simplifies the process and allows it to flow more logically.
Introduce a data catalog: Data catalog software makes it possible to collect data lineage across sources and extract metadata, allowing for end-to-end data lineage.

What is end-to-end lineage vs. data at rest lineage?

When talking about lineage, most conversations usually tackle the scenario of data “in-the-warehouse,” which presumes everything is occurring in a contained data warehouse or data lake. In these cases, it monitors data executions that are performed on specific or multiple tables to extract the relationship within or among them.

At Databand, we refer to this as “data at rest lineage,” since it observes the data after it was already loaded into the warehouse.

This data at rest lineage can be troublesome for modern data organizations, which typically have a variety of stakeholders (think: data scientist, analyst, end customer), each of which has very specific outcomes they’re optimizing toward. As a result, they each have different technologies, processes, and priorities and are usually siloed from one another. Data at rest lineage that looks at data within a specific data warehouse or data lake typically doesn’t work across these silos or data integrations.

Instead, what organizations need is end-to-end data lineage, which looks at how data moves across data warehouses and data lakes to show the true, complete picture.

Consider the case of a data engineer who owns end-to-end processes within dozens of dags in different technologies. If that engineer encounters corrupted data, they want to know the root cause. They want to be able to proactively catch issues before they land on business dashboards and to track the health of the different sources on which they rely. Essentially, they want to be able to monitor the real flow of the data.

With this type of end-to-end lineage, they could see that a SQL query has introduced corrupted data to a column in a different table or that a DBT test failure has affected other analysts’ dashboards. In doing so, end-to-end lineage captures data in motion, resulting in a visual similar to the following:

What are the benefits of end-to-end data lineage?

Modern organizations need true end-to-end lineage because it’s no longer enough just to monitor a small part of the pipeline. While data at rest lineage is easy to integrate, it provides very low observability across the entire system.

Additionally, data at rest lineage is limited across development languages and technologies. If everything is SQL-based, that’s one thing. But the reality is, modern data teams will use a variety of languages and technologies for different needs that don’t get covered with the more siloed approach.

As if that wasn’t enough, most of the issues with data happen before it ever reaches the data warehouse, but data at rest lineage won’t capture those issues. If teams did have that visibility though, they could catch issues sooner and proactively protect business data from corruption.

End-to-end data lineage solves these challenges and delivers several notable benefits, including:

Clear visibility on impact: If there’s a schema change in the external API from which Python fetches data, teams need true end-to-end visibility to know which business dashboard will be affected. Gaining that visibility requires understanding the path of data in motion across environments and systems – something only end-to-end data lineage that tracks data in motion can provide.
Understanding of root cause: By the time an issue hits a table used by analysts, the problem is already well underway, stemming from further back in the data lifecycle. With data at rest lineage, it’s only possible to see what’s happening in that particular table, though – which isn’t helpful for identifying the cause of the issue. End-to-end lineage, on the other hand, can look across the complete lifecycle to provide clarity into the root cause of issues, wherever they turn up.
Ability to connect between pipelines and datasets: In a very complex environment where thousands of pipelines (or more!) are writing and reading data from thousands of datasets, the ability to identify which pipeline is working on a weekly, daily, or hourly bases and with which tables (or even specific columns within tables) is a true game-changer.

What should you look for in a data lineage tool?

As it becomes increasingly important, what should you look for in a data lineage tool?

Above all else, you need a tool that can power end-to-end data lineage (vs. data at rest lineage). You also need a solution that can automate the process, as manual data lineage simply won’t cut it anymore.

With those prerequisites in mind, other capabilities to consider when evaluating a data lineage tool include:

Alerts: Automated alerts should allow you to not just identify that an incident has occurred, but gain context on that incident before jumping into the details. This context might include high-level details like the data pipeline experiencing an issue and the severity of the issue.
View of affected datasets: The ability to see all of the datasets impacted by a particular issue in a single, birds-eye view is helpful to understanding the effect on operations and the severity of the issue.
Visual of data lineage: Visualizing data lineage by seeing a graph of relationships between the data pipeline experiencing the issue and its dependencies allows you to gain a deeper understanding of what’s happening and what’s affected as a result. The ability to click into tasks and see the dependencies and impact to each one for a given task provides even more clarity when it comes to issue resolution.
Debugging within tasks: Finally, the ability to see specific errors within specific tasks allows for quick debugging of issues for faster resolution.

Getting it right

Data lineage isn’t a new concept, but it is one that’s often misunderstood. However, as data becomes more critical to more areas of business, getting it right is increasingly important.

It requires an understanding of exactly what data lineage is and why it’s so important. Additionally, it requires a thoughtful approach to addressing data lineage that matches the needs of a modern data organization – which means true end-to-end data lineage. And finally, it requires the right tool to support this end-to-end lineage in an automated way.

Originally posted here

Similar Journal

How important is data lineage in terms of data observability ?

This is a long due post, predominantly because there is a lot of confusion around data lineage , data observability and the interdependencies.

mona rakibe4 min read

What is Data Observability?

Do you know the current status — quality, reliability, and uptime — of your data and data systems? Not last month or last week, but where they stand at this moment. As businesses grow, being able to confidently answer this question becomes more important. That’s because data needs to be clean, accurate, and up-to-date to be considered reliable for analysis and decision-making. This confidence comes through what’s known as data observability.

cody carmen4 min read

Getting Started with Data Observability

In the past years, organizations have been investing heavily to convert themselves into data-driven organizations with the objective to personalize customer experiences, optimize business processes, drive strategic business decisions, etc. As a result, modern data environments are constantly evolving and becoming more and more complex. In general, more data means more business insights that can lead to better decision-making. However, more data also means more complex data infrastructure, which can cause decreased data quality, a higher chance of data breaking, and consequently erosion of data trust within organizations and risk of not being compliant with regulations. The data observability category — which has quickly been developing during the past couple of years — aims to solve these challenges by enabling organizations to trust their data at all times. Although the category is relatively young, there are already a wide variety of players with different offerings and applying various technologies to solve data quality problems.

benedetta cittadin13 min read

What Is Data Reliability And How Observability Can Help ?

Data matters more than ever – we all know that. But at a time when being a data-driven business is so critical, how much can we trust data and what it tells us? That’s the question behind data reliability, which focuses on having complete and accurate data that people can trust. This article will explore everything you know about data reliability and the important role of data observability along the way, including:

Eitan Chazbani7 min read

+3 more

What is Data observability, do I need it?

I started my career as a first-generation analyst focusing on writing SQL scripts, learning R, and publishing dashboards. As things progressed, I graduated into Data Science and Data Engineering where my focus shifted to managing the life-cycle of ML models and data pipelines. 2022 is my 16th year in the data industry and I am still learning new ways to be productive and impactful. Today, I am now the head of a data science & data engineering function in one of the unicorns and I would like to share my findings and where I am heading next.

Jatin Solanki6 min read

What Is Data Lineage?

Jump To