By Tanguy le Nouvel, Director of Data Science Practice, Micropole
Stemming from the strong rise of machine learning and the intensive use of open source tools (such as R and Python), Data Science is in a way the extension of Data Mining to the new Big Data platforms.
If we take a closer look, we realize that most of the foundations of the algorithms cited as belonging to Data Science were defined a long time ago, whether for image processing, text processing or machine learning.
What has changed, however, is the coupling between almost unlimited computing power and the democratization of access to the latest generation of algorithms, which now makes it possible to process any type of information and deliver more predictions and recommendations in real time, sometimes with surgical precision. That said, although the field of possibilities has been greatly extended, many recently launched projects could have been handled without any problem 10 years ago on a desktop PC! So much the better if all this buzz around Big Data and Data Science has helped to raise awareness!
The other advantage of the new Big Data platforms is that they make it possible to gather all of the company's data sources (structured or not, Data Warehouse, Web, sensors, external data...) in a single environment. This significantly increases the productivity of Data Scientists and makes it possible to have a 360° view, which until now had remained theoretical for many companies.
While reconciling all of this data in a single environment is simplified, it should not be forgotten that each data science project requires a very specific data preparation and framing phase. Reconstructing individual histories and customer trajectories (growth, decline, instability of behavior, etc.) in an omnichannel context in order to predict an event (churn, subscription, household expansion, real estate project, etc.) is not something you can improvise if you have never done it before!
Indeed, most algorithms need to work on data tables that look nothing like the raw data dumped into data lakes. In most cases, these algorithms need tables where each row represents a distinct individual and each column a specific piece of information about that individual. However, most of the data dumped into data lakes is in transactional format. For a customer knowledge project, for example, this raw data must be transformed to summarize the situation of each customer before the event we are trying to model. These indicators will be based on the customer's profile as well as on his or her past behavior (cumulative and recent purchases, online or offline visits, purchasing path, responsiveness to marketing solicitations, consumer opinions, travel, affinity preferences, product use measured via sensors, etc.).
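To make the transformation described above concrete, here is a minimal sketch in plain Python. It is an illustration only, not a reference implementation: the `Transaction` fields, the indicator names, the 90-day "recent" window and the sample data are all assumptions made for the example. It turns a transactional log into one row per customer, using only what happened before the date of the event to be modeled.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Transaction:
    # Hypothetical transactional record; the field names are assumptions.
    customer_id: str
    day: date
    amount: float
    channel: str  # "online" or "store" (illustrative values)


def build_customer_table(transactions, cutoff, recent_days=90):
    """Return one row (a dict of indicators) per customer.

    Only transactions dated strictly before `cutoff` are used, so the
    indicators describe the customer's situation before the modeled event.
    """
    recent_start = cutoff - timedelta(days=recent_days)

    # Group the transactional rows by customer, dropping anything at or after the cutoff.
    by_customer = defaultdict(list)
    for t in transactions:
        if t.day < cutoff:
            by_customer[t.customer_id].append(t)

    rows = {}
    for customer_id, history in by_customer.items():
        last_day = max(t.day for t in history)
        rows[customer_id] = {
            "n_purchases": len(history),                       # cumulative activity
            "total_amount": sum(t.amount for t in history),    # cumulative spend
            "recent_amount": sum(t.amount for t in history
                                 if t.day >= recent_start),    # recent spend
            "days_since_last": (cutoff - last_day).days,       # recency
            "online_share": sum(t.channel == "online"
                                for t in history) / len(history),
        }
    return rows


# Made-up sample data, for illustration only.
transactions = [
    Transaction("A", date(2023, 1, 10), 40.0, "store"),
    Transaction("A", date(2023, 5, 20), 60.0, "online"),
    Transaction("A", date(2023, 7, 1), 99.0, "online"),  # after the cutoff: ignored
    Transaction("B", date(2023, 2, 15), 25.0, "store"),
]
table = build_customer_table(transactions, cutoff=date(2023, 6, 1))
for customer_id, indicators in sorted(table.items()):
    print(customer_id, indicators)
```

The important design choice is the cutoff: every indicator is computed from the history preceding the event, so the resulting table can be used to predict that event without leaking information from the future. In a real project the same logic would typically run on the Big Data platform itself rather than in memory.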
Therefore, even if you are the "king of programming", you will not get very far if you have never been confronted with the transformation of raw data into indicators that are potentially relevant for explaining or predicting the targeted event. Up until now, most of the effort in data mining projects has been devoted to data preparation. Nothing changes from this point of view with the arrival of Data Science.
Ultimately, this technological shift is a great opportunity for companies wishing to anticipate and predict key events in their business. It is also a great opportunity for data miners, who will be able to discover new approaches (machine learning) and new tools (R, Python, H2O...), which turn out to be very accessible.
And even if some Data Miners must have felt a bit lost in the face of such effervescence and the incredible accumulation of new environments, languages, packages and solutions that companies looking to recruit asked them to master, they can rest assured! These job descriptions correspond to the profiles of the pioneers of data science: these famous "12-legged sheep". They will gradually give way to two types of complementary profiles:
- Big Data architects, with a profile that is more IT than business: responsible for configuring and administering the Big Data platform, managing data flows, preparing data and automating its transformation to facilitate the work of the Data Scientist and the operational use of predictions or recommendations.
- Data Scientists, with a more statistical and business profile: in charge of making the link between business needs and data, and of transforming that data in order to analyse, synthesise, explain and predict certain events or behaviour. In a way, this is an extension of the data miner profile with, in addition, mastery of the R and Python languages and real agility in choosing the right language according to the specific needs of each study.
More generally, Big Data architectures change the way the various players collaborate. Whereas the Data Miner was confined to the end of the chain and was very rarely called upon upstream of projects, the Data Scientist will work with the Big Data architect from the start of the project to determine, depending on the use case, the best way to retrieve the data (API, JSON-type files, real-time processing of a data stream, etc.). The Data Scientist will thus provide input on the packages, libraries and algorithms he or she intends to use, the very choice of these algorithms being conditioned by the volume of data.
There is therefore a governance dimension involved in the work of the data scientist, due to his or her unique ability within the data lake to cross-reference all of the company's cross-functional data. This raises questions about security, respect for and protection of private data, handling of sensitive data, etc. In the future, the Data Scientist will therefore have to work with profiles such as the CISO (Chief Information Security Officer), but also the CDO (Chief Data Officer), who steers the organisation's data strategy and ambitions.
Given Big Data, the computing power of new platforms and the need to deliver more and more relevant predictions, prescriptions and recommendations, some of them in real time, the intensified use of Data Science in machine learning mode within operational processes is inevitable. But machine learning means a black box, and predictive analysis means relying only on the spectrum of past events to influence and guide the future. And yet, companies will always need to understand, create and experiment with new offerings, strategies and initiatives.
Companies will therefore have to be proactive and make massive use of the "test and learn" approach. This is how the classic statistical approach and data science will enable them to measure and identify their new growth levers.