This is a long-overdue post, mainly because there is a lot of confusion around data lineage, data observability, and their interdependencies.
As many of us know, Data Lineage is one of the most discussed topics today and so is data observability. This article covers the applicability of data lineage in Data Observability.
PS: If you are specifically looking to understand lineage better, these are a few of my favorites on this topic:
Building Data Lineage: Netflix & Leveraging Apache Spark for Data Lineage: Yelp
Data lineage tracks the changes and transformations impacting any data record. Data lineage tells us where data is coming from, where it is going, and what happens as it flows from data sources and pipeline workflows to downstream data marts and dashboards.
In short, it enables better data governance by giving you a more complete, top-down picture of your data and analytics ecosystem.
Data lineage is essential for understanding the health of the data because it provides transparency into the data's source, ownership, access, and transformations. These are excellent indicators of the reliability and trustworthiness of the data. However, unlike other data quality indicators such as Completeness, Validity, Accuracy, Consistency, Uniqueness, and Timeliness, Data lineage cannot be expressed as a metric.
The insights provided by Lineage enable data users to solve many kinds of problems encountered within mass quantities of interconnected data, such as data troubleshooting, impact analysis, discovery and trust, data privacy regulation (GDPR and PII mapping), data asset cleanup/technology migration, and data valuation.
Data Observability is a set of measures that can help predict, identify, and resolve data issues. This is often done by leveraging statistical analysis and machine learning.
Data Observability aims to reduce the mean time to detect (MTTD) and mean time to resolve (MTTR) data issues.
Image source: Telm.ai
Data lineage can be very effective in getting good Observability outcomes. The insights provided by Lineage enable data users to quickly find the root cause of issues, conduct impact analysis, and trigger remediation.
So your data observability should leverage lineage information.
Today, most data observability companies leverage Data Lineage to reduce mean time to resolve (MTTR) by doing two things:
Root cause analysis (RCA)
Once a monitoring system detects an issue, Lineage can help investigate the previous steps in the pipeline and any changes made to them. Specifically, if you are using data warehouse monitoring tools like Monte Carlo, Metaplane, etc., you can find table/view lineage that can help shrink the root cause analysis time.
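To make the idea concrete, here is a minimal sketch of walking lineage upstream from a failing asset. The table names and the `UPSTREAM` mapping are hypothetical, and this is not how any particular vendor implements it; it only assumes that table-level lineage is available as a simple "reads from" mapping.

```python
from collections import deque

# Hypothetical table-level lineage: each asset maps to the assets it reads from.
UPSTREAM = {
    "revenue_dashboard": ["orders_mart"],
    "orders_mart": ["orders_clean", "customers_clean"],
    "orders_clean": ["crm_raw_orders"],
    "customers_clean": ["crm_raw_customers"],
}


def upstream_candidates(asset, upstream=UPSTREAM):
    """Return every upstream asset, nearest first (breadth-first order)."""
    seen = {asset}
    ordered = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, []):
            if parent not in seen:
                seen.add(parent)
                ordered.append(parent)
                queue.append(parent)
    return ordered


if __name__ == "__main__":
    # Assets to inspect when the dashboard looks wrong, closest ones first.
    print(upstream_candidates("revenue_dashboard"))
```

Checking the nearest upstream assets first is one reasonable ordering, since it tends to narrow the search quickly; other orderings (for example, most recently changed first) are equally valid.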
Downstream Impact Analysis
Often Data Lineage can help find downstream impact and trigger a remediation flow. Remediation is by definition a reactive approach and can become a nightmare.
If you have a data mesh architecture with multiple consuming products using the data, leveraging lineage helps you identify which products are impacted and who owns those products/systems, so you can systematically notify those users. Another use case is using Lineage to identify the impacted dashboards.
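As a rough illustration of downstream impact analysis, the sketch below walks lineage in the other direction and groups impacted assets by owner so they can be notified. The `DOWNSTREAM` and `OWNERS` mappings and all names in them are hypothetical, assumed here only for the example.

```python
# Hypothetical lineage: each asset maps to the assets that consume it.
DOWNSTREAM = {
    "crm_raw_orders": ["orders_clean"],
    "orders_clean": ["orders_mart"],
    "orders_mart": ["revenue_dashboard", "churn_model"],
}

# Hypothetical ownership registry; assets missing here are reported as "unowned".
OWNERS = {
    "orders_mart": "analytics-eng",
    "revenue_dashboard": "finance-bi",
    "churn_model": "data-science",
}


def impacted_owners(asset, downstream=DOWNSTREAM, owners=OWNERS):
    """Map each owner to the sorted list of downstream assets affected by `asset`."""
    impacted = {}
    seen = {asset}
    stack = [asset]
    while stack:
        current = stack.pop()
        for child in downstream.get(current, []):
            if child in seen:
                continue
            seen.add(child)
            stack.append(child)
            impacted.setdefault(owners.get(child, "unowned"), []).append(child)
    return {owner: sorted(assets) for owner, assets in impacted.items()}


if __name__ == "__main__":
    for owner, assets in impacted_owners("crm_raw_orders").items():
        print(f"notify {owner}: {', '.join(assets)}")
```

Surfacing "unowned" assets explicitly is a deliberate choice in this sketch: an impacted asset with nobody to notify is itself worth knowing about.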
The above methods are definitely useful but still suboptimal for data observability.
Data Lineage for RCA and Impact analysis
Data Observability, by definition, should be proactive. Data engineers should use Lineage to check downstream impact before making any changes. For example: am I making schema changes? Let me automatically check the downstream implications.
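A schema-change pre-check could look something like the sketch below. It assumes column-level lineage is available as a mapping from each consumer to the columns it reads per source table; `COLUMN_USAGE` and every name in it are hypothetical.

```python
# Hypothetical column-level lineage: consumer -> source table -> columns it reads.
COLUMN_USAGE = {
    "orders_mart": {"orders_clean": {"order_id", "amount", "customer_id"}},
    "revenue_dashboard": {"orders_mart": {"amount", "order_date"}},
}


def breaking_consumers(table, dropped_columns, usage=COLUMN_USAGE):
    """Return direct consumers of `table` that read any of the dropped columns."""
    dropped = set(dropped_columns)
    broken = {}
    for consumer, sources in usage.items():
        hit = sources.get(table, set()) & dropped
        if hit:
            broken[consumer] = sorted(hit)
    return broken


if __name__ == "__main__":
    # Run before merging a change that drops columns from orders_clean.
    print(breaking_consumers("orders_clean", ["amount", "coupon_code"]))
```

A check like this could run in CI before a migration is merged. Note that it only looks at direct consumers; a fuller version would follow the lineage transitively.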
For data in motion, a full pipeline monitoring approach will help not only with mean time to resolve (MTTR) but also with mean time to detect (MTTD). Design a metric monitoring system that monitors every step in the data pipeline, so users get alerted when there is a drift or outlier in a specific step of the pipeline. Multiple data points will help detect issues even before they have downstream impact.
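One simple way to flag an outlier at each step is a z-score against that step's recent history. The sketch below is only an illustration under that assumption: the step names, row counts, and the threshold of 3 are made up, and real observability tools typically use more robust statistical or ML-based detectors.

```python
from statistics import mean, stdev


def is_outlier(history, latest, threshold=3.0):
    """Flag `latest` if it lies more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


def drifted_steps(metrics, threshold=3.0):
    """metrics maps step name -> (historical values, latest value)."""
    return [
        step
        for step, (history, latest) in metrics.items()
        if is_outlier(history, latest, threshold)
    ]


if __name__ == "__main__":
    # Hypothetical row counts per pipeline step.
    row_counts = {
        "crm_extract": ([1000, 1020, 980, 1010, 990], 400),
        "warehouse_load": ([1000, 1020, 980, 1010, 990], 1005),
    }
    print(drifted_steps(row_counts))
```

Because each step is checked independently, an alert points at a specific step rather than at the final table alone.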
Example: Get alerted that the CRM system got partial data at 2 pm PST. Such an alert can automatically be sent to the CRM system owner, so data engineers don't have to find out about these issues inside Snowflake or BigQuery and trace them back to the CRM system. Often this approach will also help orchestrate the data pipeline flow (see: circuit breaker pattern).
A good analogy would be monitoring vs. tracing and logs. If you have a lot of monitoring, you may not need tracing and log review, because tracking at every level will highlight the same problems, but sooner. This is because all methods highlight the same issues, just from different angles. So I would say that Lineage is a good tool for root cause analysis and impact analysis (MTTR), but your observability tools should focus more on mean time to detect (MTTD), i.e., be more proactive.
Data Lineage is a crucial aspect of Data reliability. However, to effectively reduce both the MTTD and the MTTR of data issues:
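The circuit breaker idea mentioned above can be sketched roughly as follows: run a check after each step and stop the pipeline before bad data flows further. The function, exception, and step names here are hypothetical, and in practice an orchestrator would be wired up to do this rather than a plain loop.

```python
class PipelineHalted(Exception):
    """Raised when a data check fails and downstream steps must not run."""


def run_with_circuit_breaker(steps, checks):
    """Run (name, function) steps in order; halt if a step's check fails.

    `checks` maps a step name to a predicate over that step's output.
    """
    data = None
    for name, step in steps:
        data = step(data)
        check = checks.get(name)
        if check is not None and not check(data):
            raise PipelineHalted(f"check failed after step '{name}'")
    return data


if __name__ == "__main__":
    warehouse = []
    steps = [
        ("crm_extract", lambda _: [{"id": 1}, {"id": 2}]),  # partial batch
        ("warehouse_load", lambda rows: warehouse.extend(rows) or rows),
    ]
    checks = {"crm_extract": lambda rows: len(rows) >= 100}  # assumed minimum volume
    try:
        run_with_circuit_breaker(steps, checks)
    except PipelineHalted as err:
        print(f"halted: {err}; rows loaded: {len(warehouse)}")
```

The point of the design is that the partial batch never reaches the warehouse, so there is nothing to clean up downstream.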
1: Leverage lineage to understand the downstream impact before making changes.
2: Monitor important data metrics at every step of the pipeline and leverage Lineage to identify exactly where (at which step of the pipeline) and which metric has drifted.
This approach enables users to reduce both time to detect and time to resolve data issues.
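Combining the two points, per-step monitoring plus lineage can narrow an incident down to where it started. The sketch below assumes you already know which steps have drifted and have a "reads from" lineage mapping; it returns the drifted steps that have no drifted ancestor. All names are hypothetical.

```python
def drift_origins(upstream, drifted):
    """Return drifted steps none of whose upstream ancestors also drifted."""
    drifted = set(drifted)

    def ancestors(step):
        seen = set()
        stack = list(upstream.get(step, []))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(upstream.get(node, []))
        return seen

    return sorted(step for step in drifted if not (ancestors(step) & drifted))


if __name__ == "__main__":
    # Hypothetical lineage: each step maps to the steps it reads from.
    lineage = {
        "orders_clean": ["crm_raw_orders"],
        "orders_mart": ["orders_clean"],
        "revenue_dashboard": ["orders_mart"],
    }
    print(drift_origins(lineage, {"orders_clean", "orders_mart", "revenue_dashboard"}))
```

When several steps alert at once, this keeps the focus on the most upstream one instead of treating each alert as a separate incident.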