It is important to know what data is used where, both to gauge the impact of changing processes and definitions and to evolve certain sources without breaking the downstream apps that use their data.
There are two major methods in data lineage: designing the data flows up front on the one hand, and discovering the data lineage from existing data flows on the other. First we will go into these two methods, then we will talk about three tools that automate data lineage discovery, and finally we will explain how to visualize data lineage in the ArangoDB graph database in a very hands-on way.
When starting to build an analytics environment you will probably have a plan for how to integrate your different sources and what data is needed to calculate the KPIs you want to follow up on. This plan will contain a design of the data flows and will allow you to estimate the work needed to build the pipelines. The plan will be written down somewhere and serve as guiding documentation.
There are several challenges to deal with when designing a data pipeline up front:
· You may have a multitude of developers in multiple countries, speaking different languages and having different preferences when it comes to documenting this type of information.
· There can be different types of analytical applications, which makes it difficult to use just one method to document your data flow, because the logic they follow differs so much from one application to the next.
· Once you start building you may discover that other sources are needed, or that the data cannot be integrated the way you planned.
· Sometimes you will need to finish a critical strategic project very fast, cutting corners on the documentation and building up technical debt.
And so on. There are tons of reasons your upfront documentation will not be up to par.
That’s why you need a strong data governance team to ensure the documentation is complete, up to date and consistent across the different projects, and to keep a special eye on personal data that falls within the scope of data privacy regulations. This can be done in data catalogs and data dictionaries — if need be, in a spreadsheet.
Once the projects are completed, you may want to double-check that everything was implemented as described, but this is a huge job. Keeping the documentation up to date is no small feat either. That is where data lineage discovery comes in handy. Tools like Collibra, Manta and Octopai allow you to connect to the different sources where data is transformed.
· Collibra promises to extract and maintain lineage automatically from source systems, SQL dialects, ETL tools and BI tools with its native lineage harvesters. This would save 95% of the time typically spent manually documenting and maintaining lineage.
· Manta talks about data pipeline observability: mapping the entire environment to determine what data is stored and where. It promises instant and accurate impact analyses to accelerate software testing, show how planned changes will influence other parts of the environment, and speed up collaboration between data users.
· The use cases Octopai lists include predicting the impact of a process change, analyzing the impact of a broken process, discovering parallel processes performing the same tasks, and high-level visualization of data flows.
The promises and use cases of these tools with regard to data lineage are all quite similar. Collibra and Octopai combine data lineage with a data catalog, and Collibra additionally offers data privacy functionality, while Manta covers data lineage only. Tool selection is often more of an art than a science, and we will not cover it here in further detail.
To get a feeling for where the complexities lie in this type of exercise, I have tried to visualize the data lineage of a rather large Qlik Sense project in a graph database. Why a graph database? When we list everything that is relevant for data lineage, it largely comes down to just two things: the objects, and the relationships between the objects.
In graph database terms these are the documents and the edges. A third element, which is a function of the relationship, is also important: the kind of combinations and transformations that occur between one document and the other.
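To make this concrete, here is a minimal sketch of how a table and the data flow between two tables could be represented as a document and an edge. The collection names (`tables`, `loads`) and the attributes are hypothetical examples, not taken from the actual project:

```python
# A table in the data model becomes a document (vertex).
source_table = {"_key": "sales_raw", "layer": "extract"}
target_table = {"_key": "sales_clean", "layer": "transform"}

# The data flow between two tables becomes an edge. _from and _to reference
# documents as "<collection>/<key>", which also gives the edge its direction.
load_step = {
    "_from": "tables/sales_raw",
    "_to": "tables/sales_clean",
    # The kind of transformation can be stored as an attribute on the edge.
    "transformation": "deduplicate and rename fields",
}
```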
Before we go into further detail, we will set some boundaries on the scope of our exercise: the work of collecting and organizing the information about our data was done manually. We don’t have a SQL parser at our disposal to read the Qlik Sense load script or to extract the metadata from the proprietary Qlik QVD data files that are used in this project.
For now we have chosen to limit ourselves to the level of the different tables in the data model, and not to go down to the level of the fields. It is the fields that are used in the final dashboard to calculate the measures that make up the KPIs. The logical next step would of course be to see what kinds of aggregations, combinations and functions are applied at the field level. But since we are talking about more than 500 fields, and collecting and organizing them is manual labor, we haven’t gotten around to it yet.
We’ll explain in more detail how we did this technically, but to get a first taste of the results, here you can see how the graph looks when we load the documents and edges into the database:
As you can see, this is not very informative, and you really have to know what you are looking at to be able to make anything out of it. That’s why I have chosen to use only one of the facts for the documentation below.
How did I go about this task concretely and pragmatically? First of all, I installed the community edition of the open-source graph database ArangoDB on an Ubuntu 20.04 server in the Hetzner cloud.
Then I listed all documents and edges in different collections in Excel. I use different collections so that I can later assign a color per collection to make the graph more readable. In the example above, only one collection is used. A drawback of using different collections is that you need separate upload files per collection and per edge connecting the collections.
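For illustration, here is a minimal sketch of how such upload files could be loaded with the python-arango driver instead of through the web interface. The file names, collection names and credentials are assumptions:

```python
import csv

from arango import ArangoClient

# Connect to the database (host, database name and credentials are assumptions).
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("lineage", username="root", password="openSesame")

# One document collection per node type, one edge collection per relationship type.
if not db.has_collection("tables"):
    db.create_collection("tables")
if not db.has_collection("loads"):
    db.create_collection("loads", edge=True)

# Each collection gets its own upload file; the edge file has _from and _to columns.
with open("tables.csv", newline="") as f:
    db.collection("tables").import_bulk(list(csv.DictReader(f)))
with open("loads.csv", newline="") as f:
    db.collection("loads").import_bulk(list(csv.DictReader(f)))
```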
Now we can start defining the graph by adding the different collections and edge definitions. The vertices can be uploaded separately; the direction of a relationship between documents is indicated in the edge definition, and that is where we added it.
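Continuing the assumed names from the previous snippet, a sketch of creating the named graph and its edge definition with python-arango could look like this; the “from” and “to” vertex collections fix the direction:

```python
from arango import ArangoClient

db = ArangoClient(hosts="http://localhost:8529").db(
    "lineage", username="root", password="openSesame"
)

# Create the named graph once, then attach the edge definition:
# edges in "loads" run from documents in "tables" to documents in "tables".
graph = db.graph("lineage_graph") if db.has_graph("lineage_graph") else db.create_graph("lineage_graph")
if not graph.has_edge_definition("loads"):
    graph.create_edge_definition(
        edge_collection="loads",
        from_vertex_collections=["tables"],
        to_vertex_collections=["tables"],
    )
```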
Then I can visualize this graph, but I’ll need to adjust a couple of the default settings first to be sure that all nodes are displayed. For this graph I changed the search depth to 10 and the limit to 10,000. The resulting layout looks rather erratic and is determined by the weight the different nodes get based on the number of connections they have.
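The same exploration can also be done with an AQL traversal instead of the web interface. This is a sketch with the same depth and limit; the start vertex `tables/final_dashboard` is a hypothetical example:

```python
from arango import ArangoClient

db = ArangoClient(hosts="http://localhost:8529").db(
    "lineage", username="root", password="openSesame"
)

# Walk up to 10 hops in any direction from the start vertex, capped at 10,000 results.
cursor = db.aql.execute(
    """
    FOR v, e IN 1..10 ANY 'tables/final_dashboard' GRAPH 'lineage_graph'
        LIMIT 10000
        RETURN DISTINCT v._key
    """
)
print(list(cursor))
```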
You can also define weights on the edges. This allows you to use ArangoDB to calculate shortest paths, for example for route optimization problems. Translating this type of usage to our context, I could add the reload time of transforming one document into the other on the edges, and the file sizes on the documents.
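A weighted shortest-path query could then look like the sketch below, assuming a hypothetical `reload_seconds` attribute on the edges and the example vertices used earlier:

```python
from arango import ArangoClient

db = ArangoClient(hosts="http://localhost:8529").db(
    "lineage", username="root", password="openSesame"
)

# Find the path with the lowest total reload time between two tables.
cursor = db.aql.execute(
    """
    FOR v, e IN OUTBOUND SHORTEST_PATH
        'tables/sales_raw' TO 'tables/final_dashboard'
        GRAPH 'lineage_graph'
        OPTIONS { weightAttribute: 'reload_seconds', defaultWeight: 1 }
        RETURN v._key
    """
)
print(list(cursor))
```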
To make the graph visually more appealing and use it as a communication tool I manually dragged and dropped the nodes. This is something the more advanced tools do automatically. As far as I know you cannot save the layout of the graph in this way, so be sure to take a screenshot once you’re done.
Data lineage is a very powerful method to get a grip on your data pipelines, and in some industries it is indispensable for compliance with regulations. There are a couple of very good tools on the market that handle this in a more automated way, but they come at a price. If you don’t want to invest in tooling, make sure you have a process in place to govern your data so you can have peace of mind that everything is under control. This will also allow you to add use cases to your data environment more efficiently and to support more elaborate self-service data analysis for your end users.
I had been playing with the idea of using a graph database for data lineage for a while, and by finally building it I learned how to do this in a quick and pragmatic way. Even if the concepts are actually very simple, I didn’t really understand the power of a graph database before I tried it myself. You can read the documentation, but if you don’t play around with it you will never really get the hang of it. One drawback of this way of working is still the huge amount of manual labor that goes into collecting and organizing the relevant information.
Next steps will be to get to know the other functionalities of the database better, like queries, views, services and schemas, and to figure out whether the more advanced features like shortest paths can be applied in the context of data lineage. I hope you liked this content, and if you did, please follow me on LinkedIn.
All the best and good luck on your data journey!