Alexandre Gramfort
Scikit-learn is currently the most widely used open source library for machine learning applications. It is developed in Python (with Cython and C/C++) and, with over 1000 documentation pages, has become a major contributor to the democratization of machine learning for a large audience.
A detailed presentation and outline of the talks can be found here.
The program focused on the following subjects :
This first talk gave a general introduction and presentation of the sklearn project. The speaker, who is one of the top committers and a major contributor, told us about the beginning of the project and highlighted some of the reasons that made it so successful.
Some facts :
Scikit-learn was conceived to be domain agnostic (with the exception of text vectorization, which focuses on text analysis) and designed to perform some highly non-trivial tasks in a few lines of code.
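To give a flavour of what "a few lines of code" means in practice, here is a minimal sketch (our own illustration, not taken from the talk) of a cross-validated, grid-searched classification pipeline on a toy dataset :

```python
# A scaled SVM classifier with hyper-parameters tuned by cross-validation,
# on the built-in digits dataset.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10], "svc__gamma": [0.001, 0.01]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```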
Some quotes :
"Machine Learning is easy, there is scikit-learn" - Gaël Varoquaux
"But making scikit-learn was not easy !" - Anonymous scikit-learn developer
The Ingredients of success :
Technical reasons
Even more important reasons
Social reasons
Researchers’ contributions
Scaling the development of the scikit-learn ecosystem
Examples of scikit-learn on some use cases :
Our take on this talk :
Scikit-learn’s success certainly rests on the machine learning expertise of the team who developed it, but first and foremost on the fact that its earliest contributors and founders were good coders, convinced that code quality and maintainability were crucial assets for the project. It is also a case study of a hugely successful open source project built on a small budget, with the mindset of serving a community of diverse users and of democratizing machine learning.
The slides of this talk can be found here. This talk was initially given at PyData Berlin 2016.
NB : the information on slide 28 is outdated; distributed merge & group by are available today.
This talk began with an introduction questioning the real need for distributed predictive modeling today. The speaker based his reflection on an article (« Big RAM is eating big data » by S. Pafka) stating that, for the most part, dataset sizes are increasing by about 20% year on year, while the RAM of the largest EC2 instances is increasing by about 50% per year. So why do distributed computing at a time when you can do almost anything in memory ? This analysis was tempered by the fact that the study relied on KDnuggets surveys (conducted yearly since 2006) that could be biased, and that some datasets of several petabytes captured in the surveys do actually require distributed computation.
The talk then focused on the approaches for running predictive models. There are basically two ways : the “fast lane”, with distributed event stream processing for real-time applications, and the “slow lane”, based on distributed storage and offline distributed batch processing. There are several alternatives to do this, but the speaker focused mainly on the current Spark/Scala/Python paradigm and on Dask as an alternative. PySpark has two limitations : latency induced by the network architecture, and complex tracebacks due to the mix of Python and Scala code. There is no pure Python mode ! The alternative is to use Dask and distributed.
In summary, the paradigm is to wrap functions in delayed mode (a promise that the function will be executed in the future), then pass the delayed objects to the cluster for scheduled computation. This approach has lower overhead than the Hadoop/MapReduce framework. With Dask, the delayed evaluations can be computed in parallel (multiple threads on a single machine, or multiple Python processes running on several machines) or on a single machine in a single thread (sequential code, easier to debug).
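To make the idea concrete, here is a minimal sketch of the delayed paradigm (our own toy example; the scheduler names follow the current Dask API) :

```python
from dask import delayed

def load(i):
    # stand-in for reading a chunk of data
    return list(range(i * 1000, (i + 1) * 1000))

def process(chunk):
    # stand-in for some expensive computation
    return sum(chunk)

# Wrapping calls in `delayed` builds a task graph instead of executing them.
partials = [delayed(process)(delayed(load)(i)) for i in range(10)]
total = delayed(sum)(partials)

# Nothing has run yet; `compute` hands the graph to a scheduler.
print(total.compute(scheduler="threads"))      # multiple threads on one machine
print(total.compute(scheduler="synchronous"))  # single thread, easier to debug
# With dask.distributed, the same graph can be submitted to a cluster, e.g. :
# from dask.distributed import Client
# client = Client("scheduler-address:8786")
# print(total.compute())
```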
Our take on this talk :
Dask distributed seems to be a promising tool to distribute tasks on a cluster using python, with some interesting advantages over the current PySpark approach.
Out-of-core refers to data that does not fit in RAM. From Wikipedia :
“Out-of-core or external memory algorithms are algorithms that are designed to process data that is too large to fit into a computer’s main memory at one time. Such algorithms must be optimized to efficiently fetch and access data stored in slow bulk memory (auxiliary memory) such as hard drives or tape drives.”
What are the strategies to scale Scikit-learn computationally? The speaker presented examples of incremental learning :
Scikit-learn provides several estimators to handle out-of-core problems (basically, classes exposing a 'partial_fit' method). Rather than calling 'fit' once, you call 'partial_fit' on successive chunks of data.
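As an illustration, here is a minimal sketch of incremental learning with 'partial_fit' (the data chunks are synthetic, just to show the API) :

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
classes = np.array([0, 1])   # all classes must be declared on the first call
clf = SGDClassifier()

for _ in range(10):          # in practice, chunks would be streamed from disk
    X_chunk = rng.randn(1000, 20)
    y_chunk = (X_chunk[:, 0] > 0).astype(int)
    clf.partial_fit(X_chunk, y_chunk, classes=classes)

X_test = rng.randn(200, 20)
y_test = (X_test[:, 0] > 0).astype(int)
print(clf.score(X_test, y_test))
```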
Some algorithms recently added to the sklearn library that support the 'partial_fit' method :
Classification :
Clustering :
Other :
This link lists all available out-of-core methods in scikit-learn.
Scikit-learn also provides some tools that can be useful to deal with these problems :
For more information about out-of-core problems in scikit-learn, you can read this article.
For more information about feature extraction, you can read the related documentation.
A notebook about large scale text classification can be found on that page.
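One such tool is the HashingVectorizer : because hashing is stateless, text features can be extracted batch by batch without seeing the full corpus up front, which combines naturally with 'partial_fit'. A minimal sketch (the mini-batches below are made up for the example) :

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)
clf = SGDClassifier()

batches = [
    (["good movie", "great film"], [1, 1]),
    (["terrible plot", "bad acting"], [0, 0]),
]
for texts, labels in batches:        # in practice, batches would be streamed from disk
    X = vectorizer.transform(texts)  # no fit needed : hashing is stateless
    clf.partial_fit(X, labels, classes=[0, 1])

print(clf.predict(vectorizer.transform(["great plot"])))
```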
Our take on this talk :
Some new algorithms are now available that use incremental learning for training models with large datasets.
This talk was given by Vincent Feuillard, an R&D engineer in applied mathematics at Airbus Group Innovation. Unlike the other talks, this one was not especially focused on how to go to production with sklearn. It was rather the story of how the team set up a prototype for a specific use case : predictive (condition-based) maintenance using several signals from the airplane's Auxiliary Power Unit (APU).
Before the prototype, maintenance was based on engine health indicators defined by expert engineers. The idea behind the prototype was to approach the problem from a machine learning perspective and to have the results validated by experts from the AiRTHM team.
The Python stack they used for this project was :
In contrast with R, Python and scikit-learn proved better suited and easier to use for prototyping the pipeline from beginning to end, because the stack is better maintained, more stable and has a clear API. The main lesson learned is that feature engineering is the most important step when doing anomaly detection on functional data.
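As a purely illustrative sketch of that idea (not the team's actual pipeline), one can summarise each functional signal with a few hand-crafted features and feed them to an anomaly detector :

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
signals = rng.randn(500, 1000)   # 500 sensor time series of 1000 samples each

# Simple hand-crafted features summarising each curve
features = np.column_stack([
    signals.mean(axis=1),
    signals.std(axis=1),
    signals.max(axis=1) - signals.min(axis=1),
])

detector = IsolationForest(random_state=0)
detector.fit(features)
scores = detector.decision_function(features)  # low scores = potential anomalies
print(scores[:5])
```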
Our take on this talk :
The POC presented is a nice application of the Python/sklearn stack to an industrial business case. The speaker highlighted the multidisciplinary teamwork and the Agile organization of the project, with direct transfer of R&T development to Airbus operational support.
The workshop wrapped up with a comment from A. Gramfort, who explained that the project has grown to a size that is hard to maintain at the moment. Funding scikit-learn's development requires about 300-400 k€ per year, and so far this has been provided mainly by public funds, a situation that is not sustainable. Hence the founders are looking for alternative solutions. Since many companies, from startups to established industrial players, are currently prototyping with scikit-learn, and some contractors are willing to fund its development, there are prospects of creating an entity that could accept and manage donations, as done by the Wikipedia Foundation.
We warmly thank the speakers for their help and input with reviewing this article.