06/04/2020

What is Data Lineage and why should you pursue it?

Rob Thorne - Josh Fradd - Calum Conejo-Watt

What is Data Lineage?

Most organisations want to become data-centric and unleash the power of data to grow, scale and optimise their operations. But they face several obstacles in doing so.

It might be that data isn’t easily available and accessible. Or that Data Management capabilities are lacking, with architecture requiring simplification. In some cases, there may be a lack of central oversight ensuring alignment on firm-wide data projects, with immature Data Ownership and Governance. Many organisations simply don’t understand where their data comes from or where it flows to.

Whatever the issue, when data is not understood, its value cannot be realised. A crucial part of understanding data is figuring out where it has come from and where it has been. In other words, its journey through an organisation’s different systems, people and processes: who has been in contact with the data, how it has been altered along the way and where it is used. That’s where Data Lineage comes in. Put simply, Data Lineage is a representation of the Data Lifecycle for different units of data. It’s a key pillar of any Data Strategy – one that carries several benefits for organisations with the foresight to pursue it.

Why pursue Data Lineage?

Establishing a framework for Data Lineage allows organisations to pursue a data strategy with confidence.

First and foremost, establishing this framework helps to create a firm-wide standard for capturing data flow and a clear manual for documentation, giving everyone a point of reference around which to coalesce. But on a practical level, it also empowers Data Owners to trace the data they own back to source, facilitating good governance by documenting the output of root cause analysis. Ultimately you can’t effectively manage the data lifecycle without transparency, and data lineage provides that transparency.

It also promotes cross-team collaboration on data management. This works both ways – Data Producers can understand who is consuming their data, and Data Consumers can understand where information is coming from. This supports change initiatives by providing clarity over the impact of upstream changes. Data Lineage also helps Data Owners in doing their jobs, setting standards and prioritising Data Quality initiatives.
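As a minimal sketch of this two-way view, the Python below models lineage as a simple directed graph and walks it in both directions. The system names (crm, warehouse and so on) and the helper functions are hypothetical, chosen purely for illustration; they are not taken from any particular lineage platform.

```python
from collections import deque

# Hypothetical lineage edges: each pair means "data flows from producer to consumer".
EDGES = [
    ("crm", "warehouse"),
    ("billing", "warehouse"),
    ("warehouse", "risk_report"),
    ("warehouse", "finance_dashboard"),
]


def _reachable(start, neighbours):
    """Breadth-first walk returning every node reachable from `start`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def downstream(system, edges=EDGES):
    """Every system that consumes data from `system`, directly or indirectly."""
    forward = {}
    for src, dst in edges:
        forward.setdefault(src, []).append(dst)
    return _reachable(system, forward)


def upstream(system, edges=EDGES):
    """Every system that feeds data into `system`, directly or indirectly."""
    backward = {}
    for src, dst in edges:
        backward.setdefault(dst, []).append(src)
    return _reachable(system, backward)
```

Here a Data Producer responsible for crm could call downstream("crm") to see who is affected by a change, while a Data Consumer of risk_report could call upstream("risk_report") to see where its information originates.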

In doing so, Data Lineage promotes Data Governance by championing strong ownership of data and driving consistency, alignment and control across the organisation. This is extremely helpful (some might argue, essential) from a regulatory and legal standpoint. An effective Data Lineage framework enables organisations to trace Critical Data Elements and ensure that they have robust controls and processes in place to promote adherence to policies. Further to this, an effective Data Lineage framework will optimise control placement and design across the data journey, making sure that the various data sources are accurate and that downstream outputs are consistent. In short, it promotes good Data Quality practices.

And crucially, if you want to do cloud-based data management across an organisation, a focus on Data Lineage is essential. Cloud migration requires a blueprint of data flows into and out of a system, and without an effective Data Lineage framework, that blueprint is impossible to produce.

How to make Data Lineage happen?

Fortunately, a range of technology platforms offer organisations the ability to implement Data Management and Data Stewardship – including Data Lineage – without having to build proprietary functionality from scratch. One such platform is Collibra, but there are many others, including OvalEdge, Truedat and IBM Data Governance.

It must be said, however, that Data Lineage should be leveraged both by the technical teams working with these platforms and by the business: stakeholders, key decision-makers and wider data practitioners. Data Lineage is more than the outcome of technical platforms plotting accurate lineage curves; to be effective, it must also leverage logical data models which are driven by the business itself.

This is made easier by newer platforms and technology that automate the lineage process, reducing manual effort while increasing the accuracy of results.

Such value has, to an extent, been born out of necessity, for instance due to regulation such as BCBS 239. Much of the Data Lineage work undertaken over the past few years to meet such requirements has been, frankly, painful, not just because of how the work was initiated but because of the process itself. Technology-assisted lineage programmes and the establishment of clear business glossaries can lead to a better starting point and a clearer landscape from which to build value-led data programmes in the future.

The Strength of Collibra

Collibra is a great option for organisations pursuing Data Lineage. Its strength lies in its ability to provide detailed representations of relationships between data assets, using powerful “traceability” functionality. In our work with financial services clients at Mudano, we’ve found that it does a great job in representing relationships between people, processes and technologies. Collibra has also been recognised by Gartner as a leading platform in the lineage space.

Before integrating with a platform such as Collibra, it’s important to take a step back and construct a comprehensive Data Lineage framework. This should be composed of several elements including (but not limited to):

Outline the purpose – standardises the purpose and value of Data Lineage across the organisation.
Decomposition – identifies the correct starting point for Data Lineage.
Data Flow – covers the inter-application movement of unchanged data.
Data Creation – covers the intra-application transformation of data.
Transformation Categories – describes the movement of data (calculation, standardisation, flow, aggregation, selection etc).
Data Source – distinguishes between authoritative sources vs golden sources.
Data Taxonomy – covers the movement of data from the physical technical term, to the business attribute and entity.
Operating Model – covers the processes, who is involved, governance requirements and metadata tooling.
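The distinctions in the elements above can be sketched as a small data model. The Python below is an illustrative assumption rather than a prescribed schema or any platform's format: the class names, field names and example values are hypothetical, and the "unclassified" fallback is our own addition for combinations the list does not explicitly name.

```python
from dataclasses import dataclass
from enum import Enum


class TransformationCategory(Enum):
    """Categories named in the framework list (the list ends with 'etc')."""
    FLOW = "flow"
    CALCULATION = "calculation"
    STANDARDISATION = "standardisation"
    AGGREGATION = "aggregation"
    SELECTION = "selection"


@dataclass(frozen=True)
class LineageStep:
    """One hop in a lineage trail (hypothetical structure)."""
    source_app: str
    target_app: str
    physical_term: str       # technical field name (Data Taxonomy, physical level)
    business_attribute: str  # business glossary term (Data Taxonomy, business level)
    category: TransformationCategory

    @property
    def kind(self) -> str:
        """Classify the step as Data Flow or Data Creation.

        Data Flow: inter-application movement of unchanged data.
        Data Creation: intra-application transformation of data.
        Anything else is returned as 'unclassified' (an assumption).
        """
        crosses_apps = self.source_app != self.target_app
        unchanged = self.category is TransformationCategory.FLOW
        if crosses_apps and unchanged:
            return "data_flow"
        if not crosses_apps and not unchanged:
            return "data_creation"
        return "unclassified"


# Example: a balance copied between systems, then aggregated inside one system.
copy_step = LineageStep("ledger", "warehouse", "ACCT_BAL", "Account Balance",
                        TransformationCategory.FLOW)
sum_step = LineageStep("warehouse", "warehouse", "TOT_BAL", "Total Balance",
                       TransformationCategory.AGGREGATION)
```

With these definitions, copy_step.kind evaluates to "data_flow" and sum_step.kind to "data_creation". Agreeing a structure like this up front is one way to keep lineage captured by different teams consistent before it is loaded into a platform.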

It’s worth bearing in mind that Data Lineage is also required to enable federated data projects. We certainly don’t suggest undertaking mass lineage initiatives, but the process of capturing lineage should be mandated for all new initiatives that cross-functional data teams are involved in. When responsibility for data is handed back to business teams, the correct tooling and frameworks should be in place.

Data Lineage is an essential part of building a progressive and robust Data Strategy across any organisation. Ultimately, it provides the transparency needed to enable trust in data-driven decision-making and in responses to legislation. It also enables cost-cutting measures by streamlining and verifying architecture.

Accurate Data Lineage is necessary for an organisation to answer ‘yes’ with confidence when asked ‘Is your data accurate?’ If an organisation is able to trust the data it is using, then it can call upon that data to support decision-making both in BAU and during times of crisis such as the one we are facing now.

However, if there is a lack of accountability or transparency around Data Lineage then that trust cannot be provided. In this instance, data owners, producers and consumers should all be vocal in requesting that Data Lineage is undertaken because, ultimately, the entire organisation stands to benefit.