Data Privacy and Governance

Data Privacy & Governance consists of the processes that ensure data integrity and data protection, and that give a company control over its data.

What is data governance?

Data governance is a measure of a company's control over its data

Data governance is a data management concept. It is a measure of the control a company has over its data. This control can be achieved through high-quality data, visibility on data pipelines, actionable rights management, and clear accountability. Data governance encompasses the people, processes, and tools required to create consistent and proper handling of a company's data. By consistent and proper handling of data, I mean ensuring availability, usability, consistency, understandability, data integrity, and data security.

The most comprehensive governance model (say, for a global bank) will have a robust data-governance council, often with C-suite leaders involved, to drive it; a high degree of automation with metadata recorded in an enterprise dictionary or data catalog; data lineage traced back to the source for many data elements; and a broader domain scope with ongoing prioritization as enterprise needs shift.

A good data governance and privacy model is a mix of people, process and software.

Data governance has a direct business impact.

Data governance isn't just that rusty process that companies have to deploy in order to comply with regulation. Of course, part of it is a legal obligation, and thank god, but clean governance can also drive significant business outcomes.

Here are the main goals of data governance:

When did data governance become a thing?

Timeline and key milestones in the space.

For the past twenty years, the challenge around data was to build an infrastructure to store and consume data efficiently and at scale. Producing data has become cheaper and easier over the years with the emergence of cloud data warehouses and transformation tools like dbt. Access to data has been democratized thanks to BI tools like Looker, Tableau or Metabase. Now, building nice dashboards is the new normal in Ops and Marketing teams. This gave rise to a new problem: decentralized, untrustworthy & irrelevant data and dashboards.

Even the most data-driven companies still struggle to get value from data - up to 73% of all enterprise data goes unused.

→ 1990-2010: emergence of the 1st regulation on data privacy

In the 1970s, the first data protection regulation in the world was passed in Hessen, Germany. Since then, data regulation has kept increasing. The 1990s marked the first regulations regarding data privacy with the EU directive on data protection.

Yet, compliance with regulation really became a worldwide challenge in the second half of the 2010s with the arrival of GDPR, which joined older sector-specific rules such as HIPAA and other regional regulations on personal data privacy. These regulations drove data governance for large enterprises and created an urgency to build tools to handle the new requirements.

→ 2010-2020: 1st tools to comply with regulation. C-level executives realize data governance is a strategic advantage to drive business value

With the increasing complexity of data resources/processes on one hand and the first fines for GDPR infringement on the other, companies started to build regulatory compliance processes. The 1st pieces of software to organize Governance and Privacy were born with companies like Alation and Collibra.

The challenge was simple: enforce traceability across the various data infrastructures in the company. Data governance was then a privilege of enterprise-level companies, the only ones able to afford those tools. On-premise data storage made it expensive to deploy this software. Indeed, companies like Alation and Collibra had to deploy technology specialists in the field to connect the data to their software. The first versions of data governance tools aimed at collecting and referencing data resources across the organization's departments.

There were several forces at play in this period. It became easier to collect data, cheaper to store it, and simpler to analyze it. This led to a Cambrian explosion in the number of data resources. As a result, large companies struggled to have visibility over the work done with data. Data was decentralized, untrustworthy & irrelevant. This chaos brought a new strategic dimension to data governance. More than a compliance obligation, data governance became a key lever to bring about business value.

→ 2020+: Towards an automated and actionable data governance

With the standardization of the cloud data stack, the paradigm changed. It is easier to connect to the data infrastructure and gather metadata. Where it took 6 months to deploy a data governance tool on a multitude of siloed on-premise data centers in 2012, it can take as little as 10 minutes in 2021 on the modern data stack (for example: Snowflake, Looker, and dbt).

This gave rise to new challenges: automation and collaboration. Data governance in Excel means manually maintaining 100+ fields on thousands of tables and dashboards. This is impossible. Data governance with a non-automated tool means maintaining 10+ fields on thousands of tables: this is time-consuming. Data governance with a fully automated tool means maintaining only 1 or 2 fields on thousands of tables (literally the table and column/field descriptions). For that last part of manual work, you want to leverage the community. Prioritize work based on data consumption (high documentation SLA for popular resources) and democratize usage through a friendly UX.
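Prioritizing documentation by consumption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Table class, the queries_last_30d metric and the sample tables are all hypothetical, and in practice the usage numbers would come from your warehouse's query logs.

```python
from dataclasses import dataclass


@dataclass
class Table:
    name: str
    queries_last_30d: int  # hypothetical usage metric, e.g. derived from query logs
    description: str = ""  # an empty description means "not documented yet"


def documentation_backlog(tables: list[Table], top_n: int = 3) -> list[str]:
    """Return the most-queried undocumented tables, most popular first."""
    undocumented = [t for t in tables if not t.description.strip()]
    ranked = sorted(undocumented, key=lambda t: t.queries_last_30d, reverse=True)
    return [t.name for t in ranked[:top_n]]


# Illustrative data only.
tables = [
    Table("orders", 1200),
    Table("users", 950, "One row per registered user."),
    Table("tmp_export", 3),
    Table("payments", 400),
]

backlog = documentation_backlog(tables, top_n=2)
print(backlog)
```

The popular but undocumented tables come first, so the small amount of manual work goes where it is read the most.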

Additionally, you want that data governance tool to be integrated into the rest of the data stack. Define something once and find it everywhere: whether this is a table definition, a tag, a KPI, a dashboard, access rights, or data quality results.

Data governance challenges are not the same for everyone

Governance use cases vary with industry needs and company size

There are two main drivers for data governance programs:

Level of regulation needed in the industry

Data regulation pushes the minimum bar for data governance processes higher. It requires businesses to add controls, reporting and documentation. These are needed to ensure transparency over processes that are sometimes unclear.

Level of complexity of the data assets

Strong governance becomes increasingly important with the exponential growth of data resources, tools and people in a company. The level of complexity increases with the scope of business operations (number of lines of business and geographies covered), the velocity of data creation, or the level of automation (decision-making, processes) based on data.

How do you set up a good data governance and privacy model?

Several bricks are needed to enforce data management

Data Architecture (Storage, Modeling, Visualization)

Before even talking about data governance, a company needs the basics: a good infrastructure to begin with. Based on business needs and the company's data maturity, the nature of the data architecture can change a lot. Regarding storage, do you go for on-premise or cloud? Data warehouse or data lake? Regarding modeling: Spark or dbt? In the data warehouse or in the BI tool? Real-time or batch? Regarding visualization: do you allow anyone to build dashboards, or data teams only? And so on.

Search and Discovery

The first level of any data governance strategy is making sure relevant people can find the relevant datasets to do their analysis or build their AI model. Companies that skip this step end up with a lot of questions on Slack and useless meetings with the engineering teams. They also end up with a lot of duplicate tables, analyses and dashboards. All of this takes valuable time away from the engineering resources needed to perform the next steps.

Metadata and Documentation

Once you can efficiently find the data, you need to understand it quickly in order to assess whether it is going to be useful. For example, you are looking at a dataset called "*active_users_revenue_2021*". There is a column "*payment*". Is this column in € or $? Has it been refreshed this morning, last week, or last year? Does it contain all the data on active users or just the ones in Europe? If I remove a column, will this break important dashboards for the marketing or finance team? And so on.
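To make those questions concrete, here is a minimal sketch of a metadata record that could answer them. The class names, fields and every value below (the currency, the refresh date, the scope, the dashboard names) are assumptions invented for illustration; they do not describe any particular catalog tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ColumnMeta:
    name: str
    description: str
    unit: Optional[str] = None  # e.g. "EUR" or "USD"


@dataclass
class TableMeta:
    name: str
    scope: str  # which population the rows cover
    last_refreshed: datetime
    columns: dict[str, ColumnMeta] = field(default_factory=dict)
    # Hypothetical dependency map: column name -> dashboards that read it.
    downstream: dict[str, list[str]] = field(default_factory=dict)

    def is_stale(self, now: datetime, max_age: timedelta) -> bool:
        """True when the last refresh is older than the allowed age."""
        return now - self.last_refreshed > max_age

    def dashboards_broken_by_dropping(self, column: str) -> list[str]:
        """Dashboards that would break if `column` were removed."""
        return sorted(self.downstream.get(column, []))


# Illustrative values only.
meta = TableMeta(
    name="active_users_revenue_2021",
    scope="active users, Europe only",
    last_refreshed=datetime(2021, 6, 1, 6, 0),
    columns={"payment": ColumnMeta("payment", "Amount paid per user", unit="EUR")},
    downstream={"payment": ["marketing_ltv", "finance_monthly_revenue"]},
)

now = datetime(2021, 6, 3, 6, 0)
print(meta.columns["payment"].unit)
print(meta.is_stale(now, max_age=timedelta(days=1)))
print(meta.dashboards_broken_by_dropping("payment"))
```

Each question from the paragraph above maps to one field or method: unit, last_refreshed, scope, and the downstream dependency map.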

Data Quality

Now that you have data, stored in scalable infrastructure, that everyone can find and understand, you need to trust that what is inside is of high quality. This is why so many data observability and reliability tools were born in the last five years. Data observability is the general concept of using automated monitoring, alerting, and triaging to eliminate data downtime. The two main approaches to data quality are: declarative (manually define thresholds and behavior) or ML-driven (detecting sudden changes in distribution).
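The two approaches can be contrasted with a small Python sketch. The declarative check uses a hand-set threshold. For the second approach, a simple z-score test stands in for a real ML model, which is a deliberate simplification. The function names, thresholds and row counts are all hypothetical.

```python
from statistics import mean, stdev


def check_declarative(null_rate: float, max_null_rate: float = 0.05) -> bool:
    """Declarative rule: pass while the null rate stays within a hand-set threshold."""
    return null_rate <= max_null_rate


def is_anomalous(history: list[float], today: float, z_threshold: float = 3.0) -> bool:
    """Statistical stand-in for the ML-driven approach.

    Flags today's value when it sits more than `z_threshold` standard
    deviations away from the mean of the recent history.
    """
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold


# Illustrative daily row counts for one table.
daily_row_counts = [1000, 1020, 990, 1010, 1005, 995, 1015]

print(check_declarative(null_rate=0.02))
print(check_declarative(null_rate=0.20))
print(is_anomalous(daily_row_counts, today=1008))
print(is_anomalous(daily_row_counts, today=200))
```

The declarative rule needs someone to choose and maintain the threshold, while the statistical check learns what "normal" looks like from history and only needs a sensitivity setting.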

Security and Access Rights

Some data might be more private or strategic than others. Let's say you are a bank: you don't want to give everyone in the company access to the transaction logs. You need to define access rights, and managing them efficiently can quickly become a struggle as the number and type of people working with data grow. Sometimes, you want to give someone access for a specific mission and not for anything else. What happens when one of your employees was in the finance department but moves to marketing? You need to manage these rights thoroughly and efficiently.
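A minimal sketch of these two situations, a department change and a time-boxed grant for a specific mission, could look like this in Python. The department-to-dataset mapping, the class names and the user are hypothetical; real warehouses and BI tools have their own permission models.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical mapping from department to the datasets it may read.
ROLE_DATASETS = {
    "finance": {"transaction_logs", "revenue"},
    "marketing": {"campaigns", "web_traffic"},
}


@dataclass
class Grant:
    user: str
    dataset: str
    expires: Optional[date] = None  # None means the grant never expires


class AccessControl:
    def __init__(self) -> None:
        self.department: dict[str, str] = {}
        self.grants: list[Grant] = []

    def set_department(self, user: str, department: str) -> None:
        # Moving departments replaces the role, so old role-based access disappears.
        self.department[user] = department

    def grant_temporary(self, user: str, dataset: str, expires: date) -> None:
        self.grants.append(Grant(user, dataset, expires))

    def can_read(self, user: str, dataset: str, today: date) -> bool:
        role = self.department.get(user, "")
        if dataset in ROLE_DATASETS.get(role, set()):
            return True
        # Otherwise fall back to explicit grants that have not expired yet.
        return any(
            g.user == user
            and g.dataset == dataset
            and (g.expires is None or today <= g.expires)
            for g in self.grants
        )


acl = AccessControl()
acl.set_department("alice", "finance")
before_move = acl.can_read("alice", "transaction_logs", date(2021, 3, 1))

acl.set_department("alice", "marketing")
after_move = acl.can_read("alice", "transaction_logs", date(2021, 3, 2))

# Time-boxed access for one specific mission.
acl.grant_temporary("alice", "transaction_logs", expires=date(2021, 3, 31))
during_mission = acl.can_read("alice", "transaction_logs", date(2021, 3, 15))
after_mission = acl.can_read("alice", "transaction_logs", date(2021, 4, 1))

print(before_move, after_move, during_mission, after_mission)
```

Tying access to the role rather than to the person means the move from finance to marketing revokes the old rights without anyone having to remember to do it, and the expiry date does the same for mission-based access.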

Compliance and Regulation

This one is self-explanatory. You need to list all assets and report on personal information and its usage to comply with regulation. For now, regulators mostly target enterprise companies, but it is just a question of time before smaller companies start receiving fines.

Where does data governance fit in the modern data stack?

Data governance brings trust from the raw data sources to domain experts' dashboards

The typical data flow is the following:

You collect data from various sources across your business. It can be product logs, marketing and website data, payment and sales logs, etc. You extract that information with tools like Fivetran, Stitch, or Airbyte.
You then store this data in a data warehouse (Snowflake, Redshift, BigQuery, and Firebolt, to name the most popular). The data warehouse is a place both to store your data and to transform it in order to refine it.
The trending transformation layer of the past 3 years is dbt. It lets you perform data transformations in SQL within the data warehouse while applying software engineering best practices.
Finally, the transformation helps you build your "data mart", the gold standard in terms of refined data. The visualization brick helps domain experts visualize this gold-level data to share insights throughout the whole organization.

These steps happen in different tools, with a high level of abstraction. It is hard to keep a bird's-eye view of what happens under the hood. This is what data governance brings to the table. You can see how the data flows, where the pipeline breaks, where risks lie, where to put your energy as a data manager, etc.
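Seeing "how the data flows" and what is affected when a pipeline breaks comes down to walking a lineage graph. Here is a minimal Python sketch under stated assumptions: the asset names and the LINEAGE mapping are made up for illustration, and a real tool would build this graph from warehouse, dbt and BI metadata.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets that read from it.
LINEAGE = {
    "fivetran.raw_payments": ["warehouse.stg_payments"],
    "fivetran.raw_product_logs": ["warehouse.stg_events"],
    "warehouse.stg_payments": ["dbt.mart_revenue"],
    "warehouse.stg_events": ["dbt.mart_revenue", "dbt.mart_engagement"],
    "dbt.mart_revenue": ["dashboard.finance_kpis"],
    "dbt.mart_engagement": ["dashboard.product_usage"],
}


def downstream_of(asset: str, lineage: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every asset affected if `asset` breaks."""
    seen: set[str] = set()
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


impacted = downstream_of("warehouse.stg_payments", LINEAGE)
print(impacted)
```

If the staging table for payments breaks, the walk returns the revenue data mart and the finance dashboard built on it, which tells a data manager exactly who to warn and where to look first.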