What is a Data Pipeline?
Unipart Digital’s Data Pipeline is a tool designed to ingest data from a variety of data sources, validate the data, transform it, and finally either load it into a database or save it to output files. Through this process our Data Pipeline provides access to structured and accurate data sets suitable for further analytics and downstream business use. The tool is already used as a plug-in to Unipart Digital applications such as Dragonfly and UBIS, and has also been used to transform product data into a form suitable for upload to an e-commerce website.
Why did we build it?
Data is widely acknowledged as a valuable asset, and its appropriate maintenance and use are becoming integral to business success. However, collecting data and automating its flow into structured, meaningful formats is a complex task: it requires data engineering expertise as well as appropriate infrastructure to establish long-term usability.
When these conditions are not fulfilled, manual data exports, such as CSV files, remain prevalent. At Unipart, the need for a consistent Data Pipeline arose from various projects within the Data Science team, all of which required access to organised and accurate datasets. For each of those projects we had built a custom extract, transform and load (ETL) process. This approach increased development time and produced fragile implementations that carried an additional maintenance burden. Having an automated Data Pipeline that can be integrated with all of our Data Science products solves these problems.
It soon became apparent that the Data Pipeline could also be used for validating input data, transforming it and saving it to output files for further use: in other words, an extract, transform and save (ETS) process. The Data Pipeline has been applied to the problem of taking product data as input, validating it (for example, checking that the price fields are all valid), transforming it and saving it to output files with a more complex structure, suitable for uploading to an e-commerce website. This application enabled tens of thousands of additional products to be offered for sale online.
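The ETS flow described above can be sketched in a few lines. This is an illustrative example only, not the real Data Pipeline API: the function names (`validate_price`, `transform`, `ets`) and the field names (`sku`, `price`) are assumptions chosen to show the validate–transform–save shape.

```python
# Illustrative ETS sketch: validate product rows (e.g. price fields),
# transform the valid ones, and save them to an output file.
import csv
import io

def validate_price(row):
    """A row is valid if its 'price' field parses as a non-negative number."""
    try:
        return float(row["price"]) >= 0
    except (KeyError, ValueError):
        return False

def transform(row):
    """Normalise fields for upload: uppercase SKU, price to two decimals."""
    return {"sku": row["sku"].upper(), "price": f"{float(row['price']):.2f}"}

def ets(rows, out):
    """Validate, transform and save rows; return the number written."""
    writer = csv.DictWriter(out, fieldnames=["sku", "price"])
    writer.writeheader()
    written = 0
    for row in rows:
        if validate_price(row):
            writer.writerow(transform(row))
            written += 1
    return written

rows = [
    {"sku": "ab-1", "price": "9.5"},
    {"sku": "ab-2", "price": "oops"},  # fails validation, is skipped
]
buffer = io.StringIO()
ets(rows, buffer)  # writes the header plus the one valid row
```

Because validation happens inside the pipeline, the invalid row is caught at ingestion time rather than surfacing later as a bad product listing.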
How does it work?
Our Data Pipeline application is built up from a series of functions that process incoming data. By adopting the Data Pipeline, engineers gain access to common tools, built on an Application Programming Interface (API), that support ETL processes, validation, visualisation, configuration, local development and testing. Because these tools are built on an API, there is the opportunity to build multiple overlapping ETL ecosystems around the Data Pipeline libraries, rather than a set of one-off tools. This standardised, modular architecture can be re-used for all data projects.
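The "series of functions" idea can be sketched as a pipeline that is simply an ordered list of steps, each a function that takes records and returns records. This is a minimal sketch of the modular architecture, under the assumption that steps operate on dictionaries of field values; the actual Data Pipeline API will differ.

```python
# Minimal sketch: a pipeline is an ordered list of steps, each a
# function from an iterable of records to an iterable of records.
from typing import Callable, Iterable

Record = dict
Step = Callable[[Iterable[Record]], Iterable[Record]]

def run_pipeline(records, steps):
    """Thread the records through each step in order."""
    for step in steps:
        records = step(records)
    return list(records)

# Example steps (illustrative names): validate, then transform.
def drop_missing_qty(records):
    return (r for r in records if "qty" in r)

def qty_to_int(records):
    return ({**r, "qty": int(r["qty"])} for r in records)

data = [{"sku": "a", "qty": "3"}, {"sku": "b"}]
result = run_pipeline(data, [drop_missing_qty, qty_to_int])
# result == [{"sku": "a", "qty": 3}]
```

Because each step has the same shape, steps can be re-ordered, shared between projects, and tested in isolation, which is what makes the architecture re-usable across data projects.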
Ultimately, the Data Pipeline allows us to bring data into a single source of truth, making high-quality, verified data available to anyone who wants to use it.
Added Value of a Data Pipeline:
- Provides data accuracy: validation is automatic and any data anomalies are apparent as soon as validation fails.
- Ingestion of data from any Warehouse Management System, transformed to a common format for analysis: the Data Pipeline produces standardised data sets with standardised formats. This common process for data transfer allows common Data Science tools to be developed.
- Ease of maintenance of large datasets, e.g. historical or sensor data.
- Early warning indicators: enabling faster decisions and delivery of actionable insights in real time.
- Removal of duplication of effort in building bespoke data extraction and transformation modules.
- Operations Managers and Data Scientists are freed from the time-consuming, frustrating and error-prone task of maintaining custom data extraction and transformation.
- Data Discovery: data silos are removed and relevant information can be shared automatically across overlapping projects, and kept updated in a consistent fashion.
- Data lineage (provenance): seeing where the data has come from and the steps it has gone through gives confidence in the accuracy of any predictions. Provenance also helps with ‘explainability’ of predictions, and it allows selection of input data by source for specialised analysis. In a long-term project, domain knowledge is often lost, and without good data provenance users can easily misinterpret the meaning of the data.
- Data Governance: Personally Identifiable Information (PII) can be maintained in a consistent fashion within the Data Pipeline, and can be easily removed from a single location as part of a GDPR filter.
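The GDPR filter described in the last bullet can be sketched as one more pipeline step that strips a configured set of PII fields from every record in a single place. The field names below (`name`, `email`, `phone`) are illustrative assumptions, not the Data Pipeline's actual configuration.

```python
# Illustrative GDPR-style filter: remove configured PII fields
# from every record in one central pipeline step.
PII_FIELDS = {"name", "email", "phone"}

def strip_pii(records, pii_fields=PII_FIELDS):
    """Return records with any configured PII fields removed."""
    return [{k: v for k, v in r.items() if k not in pii_fields}
            for r in records]

orders = [{"order_id": 1, "email": "a@example.com", "total": 20.0}]
anonymised = strip_pii(orders)
# anonymised == [{"order_id": 1, "total": 20.0}]
```

Keeping the filter in one step means a change to data-governance policy is a one-line configuration change rather than an edit to every downstream consumer.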
Who Benefits from a Data Pipeline?
Many people in a business will benefit from a data pipeline – from Operations Managers, to Data Scientists, Business Intelligence experts, Business Analysts, and other stakeholders – anyone who wants to take advantage of their available data!