
Machine Learning to overcome data non-quality

For several years now, data quality has been a recurring topic in the trade press. Numerous studies document the alarming shortcomings of data quality within companies, as well as their negative impact on analysts' work and on management decisions. Beyond the "shock" figures often quoted, why does this issue remain a major stumbling block to companies' digital expansion? What is the current state of curative and preventive solutions? And how can data science expertise be leveraged to (finally) achieve the desired level of data quality?

 

Enterprise data repositories: fundamental building blocks for your project 

At the heart of the data quality issue are the company's data repositories, which evolve along with its business: the inventory and description of products, customers, HR, resources, and so on. As pillars of a company's digital life, they anchor all of its data processing in its context and reality.
If these data repositories are incomplete or erroneous, all the data or digital initiatives that the organization undertakes, whether basic or elaborate, traditional or disruptive, will deliver incomplete or erroneous results, whatever the level of investment made in these various projects.

For example:

  • Inter-application flows will not allow your management applications to communicate efficiently;
  • Your e-commerce site will not be able to boost your commercial activity in an efficient and sustainable way;
  • Your data science or big data initiatives will not be able to revolutionize your analytical approach.

A growing awareness on the part of companies

Most companies are now aware of this, and some have even launched campaigns to improve the quality of their data repositories. These are generally based on governance processes between the IT department, which has access to the technical environments hosting the repositories, and the business experts, who have the knowledge to improve data quality.

Long and laborious, these approaches work because they "force" business experts to spend time on manual corrective tasks. However, they do not guarantee that quality will be maintained over time, as the actions are not carried out at the source when the data is created, but after the fact, when the data is already in the information system.

Some companies invest in data quality management tools, which aim to industrialize the business rules for data quality. Here again, this works well, the tools are mature, but they can only correct data for which a business rule exists, and the data on which these rules are based must be accurate and of high quality.
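To make this limitation concrete, here is a minimal sketch of what such a business rule looks like in practice, assuming a pandas DataFrame as the repository (the column names and values are purely hypothetical). The rule can only flag violations of conditions that an expert has explicitly written down; any defect outside its scope goes undetected:

```python
import pandas as pd

# Illustrative product repository with typical quality defects
# (all column names and values here are hypothetical).
products = pd.DataFrame({
    "sku":       ["A-100", "A-101", "A-102", "A-103"],
    "weight_kg": [0.5, -2.0, None, 1.2],
    "category":  ["food", "food", None, "toys"],
})

# A business rule encodes one explicit expectation, e.g.:
# "weight must be present and strictly positive".
rule_violations = products[
    products["weight_kg"].isna() | (products["weight_kg"] <= 0)
]
print(rule_violations["sku"].tolist())  # SKUs needing manual correction
```

Note that the missing category on the third record is invisible to this rule: each defect type needs its own handwritten rule, which is exactly the scalability limit described above.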

Even in companies that do take these steps, a significant portion of the data often remains uncorrected, and therefore of a quality insufficient to address the challenges mentioned above.

Data Quality Factory, the solution that revolutionizes quality assurance

With this in mind, we have developed a solution in line with the market and our customers' expectations: Data Quality Factory. The aim of this non-intrusive tool, which is easy to set up and use, is to resolve data quality issues using the full power of machine learning algorithms specifically developed by our teams.

We designed this solution as a tool that massively analyzes the data in the processed repository, and provides the business analyst with predictions on missing data, detects potential anomalies that could not be detected otherwise, and indicates the actions to be taken to significantly accelerate the correction of the repository.
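The general principle of this kind of analysis can be sketched with off-the-shelf scikit-learn components. The data, column names, and models below are illustrative assumptions, not the actual algorithms behind Data Quality Factory: a classifier trained on complete records suggests values for missing fields, and an isolation forest surfaces records whose attribute profile is atypical:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest, RandomForestClassifier

# Hypothetical product repository: "category" is missing on one
# record, and one weight is clearly out of range.
df = pd.DataFrame({
    "weight_kg": [0.5, 0.6, 0.4, 12.0, 11.5, 0.55, 950.0],
    "category":  ["food", "food", "food", "toys", "toys", None, "toys"],
})
X = df[["weight_kg"]]  # attributes available on every record

# 1. Predict the missing category from records that are complete,
#    giving the analyst a suggestion to validate rather than a blank.
known = df["category"].notna()
clf = RandomForestClassifier(random_state=0)
clf.fit(X[known], df.loc[known, "category"])
df.loc[~known, "category"] = clf.predict(X[~known])

# 2. Flag records whose attributes are atypical, for expert review.
iso = IsolationForest(contamination=0.15, random_state=0).fit(X)
df["suspect"] = iso.predict(X) == -1  # True = potential anomaly
```

In this sketch the record with the missing category inherits the label of its nearest neighbors in attribute space, and the out-of-range weight is flagged without any handwritten rule, which is the key difference from the rule-based tools discussed earlier.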

In addition to corrective campaigns, the solution is also designed to be implemented in preventive mode, capitalizing on the learning acquired during the first correction campaigns. It is thus possible to integrate it into the IT system in order to perpetuate the quality of the data at the source, as soon as it is created.
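In preventive mode, a model trained on the corrected repository can sit behind a validation hook that scores each record at creation time. The sketch below is a hypothetical illustration of that integration pattern (function name, column, and threshold are all assumptions), not the product's actual interface:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Assumed setup: a detector trained once on the historical,
# already-corrected repository (hypothetical column name).
history = pd.DataFrame({"weight_kg": [0.5, 0.6, 0.4, 0.55, 0.45, 0.5]})
detector = IsolationForest(contamination=0.1, random_state=0).fit(history)

def check_at_creation(record: dict) -> bool:
    """Return True if the new record is consistent with the repository;
    False means it should be reviewed before being inserted."""
    score = detector.predict(pd.DataFrame([record]))[0]
    return bool(score == 1)

check_at_creation({"weight_kg": 0.52})  # consistent with history
check_at_creation({"weight_kg": 40.0})  # flagged for review
```

Placing the check at the point of entry is what sustains quality over time: suspect values are caught before they propagate through the information system, instead of being corrected after the fact.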

And the results are convincing! In a large-scale program for a customer, involving both the IT team and the business, we were able to accelerate the quality control of the product repository data, with a return on investment of 9 months of project time saved!
