Tanguy Le Nouvel, Director of Data Science at Micropole, shares his views on neo-data-scientists.
Kaggle: a site hosting data science competitions, submitted by companies and organizations and open to data scientists all over the planet. The data is anonymized, and the problems usually involve predicting an event from past data. Prizes generally range from €5,000 to €100,000, to be shared among the winners.
We are seeing a drift in practice: many neo-data-scientists consider the winning solutions from Kaggle contests to be the state of the art in data science.
Nobody takes an interest any more in understanding the subject, its stakes and its context; nor in the data, and therefore in all the relevant indicators that could have been built from it (the data is anonymized, already prepared and aggregated); nor in the constraints of industrializing the solution in the client's information system.
Not to mention the interpretation of the results, made impossible by the anonymization of the data and by the completely black-box nature of the algorithms and approaches used.
The end justifies the means! The $15,000 prize (on average per winner, estimated with the "ALL" method: à la louche, in other words a rough guess :)) can justify producing modeling strategies of unparalleled complexity if they let you climb a few places on the "leaderboard" (the online ranking, updated in real time with each participant's submission) and pocket the winnings.
But in the real world, these strategies turn out to be unworkable (cf. Netflix, BNP Paribas and all the "use cases" that will never be industrialized...). And Kaggle's neo-data-scientists are often at a loss when they have to evaluate the performance of their results themselves on real projects. And yes, when you are building a predictive model alone, there is no way to benchmark yourself against thousands of other contributors... and know whether you are on the "right" track or not.
By relegating discernment, critical thinking, business expertise and interpretation to the back burner in favor of all-out brute force (some winners propose solutions that aggregate the predictions of several thousand models), data scientists are shooting themselves in the foot and preparing for a jobless tomorrow. What value do they add compared with a powerful pseudo-intelligent machine? None!
And what about the Master 2 programs that ask their students, as a project, to run a data science contest on the anonymized data of a Kaggle competition? Help!
Faced with hermetic data scientists who build overcomplicated contraptions for marginal gains, and data engineers who jump into the breach at the first opportunity to justify industrializing them in Scala, foresight, pragmatism and agility will prove essential if we are not to destroy the benefits that companies could start reaping right now.
My teams and I are here to guarantee this!