Cloud Acceleration

Machine Learning to overcome poor data quality

Jun 6, 6:11 AM

For several years now, data quality has been a regular topic in the specialized press. Numerous studies highlight alarming data quality deficiencies within companies, as well as their negative impact on the work of analysts and the decisions of managers. Beyond the "shock" figures often presented, why is this problem such a major pitfall for the digital development of companies? What is the current state of the curative and preventive solutions in place? And how can we use our data science know-how to (finally) reach the desired level of data quality?

CORPORATE DATA REPOSITORIES, THE FOUNDATION OF YOUR PROJECTS
At the heart of the data quality issue lie corporate data repositories, which evolve along with the company's activity: the inventory and description of products, customers, HR, resources, and so on. As pillars of a company's digital life, they anchor all of its data processing in its context and reality.
If these data repositories are incomplete or erroneous, all the data or digital initiatives that the organization undertakes, whether basic or elaborate, traditional or disruptive, will deliver incomplete or erroneous results, whatever the level of investment made in these various projects.

For example:

Inter-application flows will not allow your management applications to communicate efficiently;
Your e-commerce site will not be able to boost your commercial activity in an efficient and sustainable way;
Your data science or big data initiatives will not be able to revolutionize your analytical approach.

ENTERPRISE AWARENESS
Most companies are now aware of this fact and some have even initiated campaigns to improve the quality of their data repositories. These are generally based on governance processes between the IT department, which has access to the technical environments hosting the repositories, and the business experts, who have the knowledge to improve data quality.

Long and laborious, these approaches work because they "force" business experts to spend time on manual corrective tasks. However, they do not guarantee that quality will be maintained over time, as the actions are not carried out at the source when the data is created, but after the fact, when the data is already in the information system.

Some companies invest in data quality management tools, which aim to industrialize the business rules for data quality. Here again, this works well and the tools are mature. However, they can only correct data for which a business rule exists, and the data on which these rules are based must itself be accurate and of high quality.
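
To make this limitation concrete, here is a minimal sketch of a rule-based check. It is an illustration only and does not reproduce any particular data quality tool; the field names, the rules and the KNOWN_CATEGORIES reference list are all hypothetical. It shows that fields without a rule go unchecked, and that a rule relying on reference data is only as reliable as that data.

```python
from typing import Callable

# Hypothetical reference list: the "category" rule below is only as
# reliable as this data is accurate and complete.
KNOWN_CATEGORIES = {"beverage", "snack", "dairy"}

# Hypothetical business rules: one validity predicate per covered field.
RULES: dict[str, Callable[[object], bool]] = {
    "price": lambda v: isinstance(v, (int, float)) and v > 0,
    "category": lambda v: v in KNOWN_CATEGORIES,
}


def check_record(record: dict) -> dict:
    """Return the fields that break a rule and the fields no rule covers."""
    violations = [
        field for field, rule in RULES.items()
        if field in record and not rule(record[field])
    ]
    uncovered = [field for field in record if field not in RULES]
    return {"violations": violations, "uncovered": uncovered}


# Illustrative product record (made-up values).
product = {"sku": "A-100", "price": -4.5, "category": "snack", "weight_kg": None}
report = check_record(product)
print(report)
```

In this sketch the negative price is caught, but the missing weight is not, simply because no rule was written for that field.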

Even for companies that do take these steps, a significant portion of the data often remains uncorrected, and its quality is therefore too low to support the initiatives mentioned above.

DATA QUALITY FACTORY, THE SOLUTION THAT REVOLUTIONIZES DATA QUALITY
With this in mind, we have developed a solution in line with the market and our customers' expectations: Data Quality Factory. The objective of this non-intrusive, easy-to-use tool is to solve data quality problems by using the full power of machine learning algorithms specifically developed by our teams.

We designed this solution as a tool that massively analyzes the data in the processed repository, and provides the business analyst with predictions on missing data, detects potential anomalies that could not be detected otherwise, and indicates the actions to be taken to significantly accelerate the correction of the repository.
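
As a purely illustrative sketch of this idea, the example below uses a simple nearest-neighbour majority vote to suggest missing values and to flag values that disagree with similar records. This is a stand-in technique chosen for brevity, not the algorithms used in Data Quality Factory, which are not described in this article. The product fields, the sample repository, the value of k and the confidence threshold are all assumptions.

```python
from collections import Counter


def similarity(a: dict, b: dict, target: str) -> int:
    """Count the fields (other than target) on which two records agree."""
    return sum(
        1 for key, value in a.items()
        if key != target and value is not None and value == b.get(key)
    )


def suggest(record: dict, repository: list, target: str, k: int = 3):
    """Majority vote on `target` among the k most similar complete records."""
    complete = [r for r in repository
                if r is not record and r.get(target) is not None]
    neighbours = sorted(
        complete, key=lambda r: similarity(record, r, target), reverse=True
    )[:k]
    if not neighbours:
        return None, 0.0
    value, count = Counter(r[target] for r in neighbours).most_common(1)[0]
    return value, count / len(neighbours)


def review(repository: list, target: str, k: int = 3, min_confidence: float = 0.6):
    """Return (fills, anomalies) as lists of (index, suggested value, confidence)."""
    fills, anomalies = [], []
    for index, record in enumerate(repository):
        value, confidence = suggest(record, repository, target, k)
        if value is None or confidence < min_confidence:
            continue  # not enough agreement to suggest anything
        if record.get(target) is None:
            fills.append((index, value, confidence))      # missing value
        elif record[target] != value:
            anomalies.append((index, value, confidence))  # suspicious value
    return fills, anomalies


# Hypothetical product repository (made-up values).
repo = [
    {"brand": "Acme", "unit": "L", "shelf": "cold", "category": "dairy"},
    {"brand": "Acme", "unit": "L", "shelf": "cold", "category": "dairy"},
    {"brand": "Acme", "unit": "L", "shelf": "cold", "category": "dairy"},
    {"brand": "Bolt", "unit": "kg", "shelf": "ambient", "category": "snack"},
    {"brand": "Bolt", "unit": "kg", "shelf": "ambient", "category": "snack"},
    {"brand": "Bolt", "unit": "kg", "shelf": "ambient", "category": "snack"},
    {"brand": "Bolt", "unit": "kg", "shelf": "ambient", "category": "dairy"},
    {"brand": "Acme", "unit": "L", "shelf": "cold", "category": None},
]
fills, anomalies = review(repo, "category")
print("Suggested fills:", fills)
print("Possible anomalies:", anomalies)
```

The suggestions are only proposals for the business analyst to confirm or reject. The confidence threshold controls how much agreement among similar records is required before anything is proposed.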

In addition to corrective campaigns, the solution is also designed to be implemented in preventive mode, capitalizing on the learning acquired during the first correction campaigns. It can thus be integrated into the IT system to sustain data quality at the source, as soon as the data is created.

And the results are convincing! As part of a large-scale program at one of our customers, involving both the IT team and the business, we were able to accelerate the quality control of the product repository data, with a return on investment of 9 months of project time saved!