In this article I want to talk about the main difficulties of machine learning automation, its nature and advantages, and also consider a more flexible approach that allows you to get away from some of the shortcomings.
Automation, as defined by Mikell P. Groover, is a technology by which a process or procedure is performed with minimal human involvement. Automation has long been used to increase productivity, which often lowers the cost per unit of product. Automation methods, and their areas of application, are improving rapidly and over the past centuries have evolved from simple mechanisms to industrial robots. Automation now affects not only physical labor but also intellectual labor, reaching relatively new areas, including machine learning: automated machine learning (AutoML). Machine learning automation has already found its way into a number of commercial products (for example, Google AutoML, SAP AutoML, and others).


Disclaimer
This article does not claim to be definitive and reflects the author's personal view of the field.
Automated Machine Learning
Tasks in data processing and machine learning involve many factors that arise from the complexity of the system and complicate their solution. These include (according to Charles Sutton):
- The presence of uncertainty and ambiguity, which leads to a lack of a priori knowledge about the data and the desired dependencies. Thus, a research element is always present.
- "Death from a thousand cuts." In practice, when building a pipeline for data processing, analysis, and subsequent modeling, you have to make many decisions, large and small. For example: should the data be normalized? If so, by what method, and with what parameters? And so on.
- The presence of feedback loops resulting from uncertainty. The longer you are immersed in the task and the data, the more you learn about them. This creates the need to step back and revise the existing processing and analysis mechanisms.
- In addition, models obtained by machine learning algorithms are only approximations of reality, i.e., inherently inexact.
Thus, the process of building a full data processing and analysis pipeline can itself be considered a complex system, in the sense used by Peter Sloot.
On the one hand, these factors complicate both the solution of machine and deep learning problems and their automation. On the other hand, ever-growing and increasingly accessible computing power allows us to devote more resources to the task.
According to the widely used CRISP-DM standard, the life cycle of a data analysis project iterates over six main stages: business understanding, data understanding, data preparation, modeling, evaluation, and deployment (application). In practice, not all of these stages can be effectively automated today. Most works and existing libraries (h2o, auto-sklearn, autokeras) focus on automating modeling and, in part, evaluation. However, extending the approach toward automated data processing covers more stages (as is done, for example, in the Google AutoML service).

Formulation of the problem
Supervised machine learning tasks can be solved by various methods, most of which reduce to minimizing a loss function or maximizing a likelihood function in order to estimate model parameters from the available sample, the training dataset y_t:

θ̂_m = argmax_{θ_m}(L(y_t; θ_m)) or θ̂_m = argmin_{θ_m}(J(y_t; θ_m))

where θ_m are the trained model parameters (for example, the coefficients in the case of regression).

In order not to limit automation to modeling alone, the scope of the method can be extended to other stages of the pipeline: for example, to automate decisions about which data processing methods to apply, which model (or combination of models) to choose, and how to select near-optimal hyperparameters.

We illustrate this with a simple example in which a choice is made between two data processing methods (standard scaler and quantile scaler) and two models (random forest and neural network), including the selection of some of their hyperparameters. The selection structure can be represented as a tree:
Each choice made is a parameter of the system, and the tree itself becomes the space of possible parameters. This view of the problem lets us move up a level of abstraction and formulate the task of obtaining the final pipeline, including data processing methods, models, and their parameters, as minimization or maximization of a function:

ω̂ = argmax_ω(L(y_t, y_cv; ω)) or ω̂ = argmin_ω(J(y_t, y_cv; ω))

where ω are the non-learned parameters and y_cv is the hold-out validation set (the data used for cross-validation).

The main advantages of such learning automation include:
- Selection of a larger number of system parameters from a single entry point, within a single optimization process.
- Automation of the routine, sparing the researcher or developer from the "thousand cuts."
- "Democratization" of machine learning: automation allows many non-specialists to apply many of its methods.
However, automation is not without drawbacks:
- As the number of parameters grows, so does their space, sooner or later leading to a combinatorial explosion, which demands better algorithms and more computing resources.
- Fully automatic methods operate on the "black box" principle, which does not always yield a flexible solution and reduces control over the result.
- The parameter space ω is nonlinear and has a complex structure, which complicates the optimization process.
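To get a feel for the combinatorial growth, here is a back-of-the-envelope count for the toy tree above, using the hyperparameter option counts from the listing later in the article (the counts are taken from that example; any other choice node would multiply the total further):

```python
# Counting distinct configurations in the toy example: 2 scalers, a random
# forest with 3 x 3 hyperparameter options, and an MLP with 3 x 2 options.
scalers = 2
forest_configs = 3 * 3   # max_depth choices x n_estimators choices
mlp_configs = 3 * 2      # hidden_layer_sizes choices x learning_rate_init choices

# Every scaler can be paired with every model configuration.
total = scalers * (forest_configs + mlp_configs)
print(total)  # 30
```

Thirty pipelines from a handful of small choice nodes; each additional decision point multiplies this number again.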
From automation to semi-automation
Trying to preserve as many of the advantages as possible while avoiding a number of the shortcomings, and in particular seeking additional control over the solution, we arrived at an approach we call semi-auto ML. It is a relatively new phenomenon in the field, as a quick look at Google Trends indirectly suggests:
Achieving such a compromise can be loosely compared to the different gear-shifting methods in automobile transmissions (the shifting methods themselves, not their internal structure).

In the course of work on internal projects, we created a tool for semi-automatic machine learning based on a hybrid functional-declarative configuration system. This configuration approach uses not only standard data types but also functions from common machine and deep learning libraries. The tool automates the creation of simple data processing methods, basic feature engineering, and the selection of models and their hyperparameters, and can also run computations on a Spark or GPU cluster. The listing below formalizes the example given earlier in the article. It uses simple models from scikit-learn together with hyperopt (to which we even managed to make a small open source contribution) for parameter distributions and optimization:

from hyperopt import hp
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import RobustScaler, StandardScaler

search_space = {
    'preprocessing': {
        'scaler': hp.choice('scaler', [
            {
                'func': RobustScaler,
                'params': {
                    'quantile_range': (10, 90)
                }},
            {
                'func': StandardScaler,
                'params': {
                    'with_mean': True
                }}
        ]),
    },
    'model': hp.choice('model', [
        {
            'func': RandomForestClassifier,
            'params': {
                'max_depth': hp.choice('r_max_depth', [2, 5, 10]),
                'n_estimators': hp.choice('r_n_estimators', [5, 10, 50])
            }
        },
        {
            'func': MLPClassifier,
            'params': {
                'hidden_layer_sizes': hp.choice('hidden_layer_sizes', [(1,), (10,), (100,)]),
                'learning_rate_init': hp.choice('learning_rate_init', [0.1, 0.01])
            }
        },
    ])
}
Such a semi-automatic system, with its configuration mechanism, makes it possible to create pre-built standard scenarios for cases where, for example, a certain family of models is known to suit a class of problems well. Credit scoring may be one such case, although that step requires additional research across a wide range of similar tasks. Furthermore, the search mechanism can automatically maintain the balance in the bias-variance tradeoff by simultaneously taking into account the value of the optimized function on both the training and cross-validation samples.

Conclusion
A complete lack of automation is quite rare in practice, since even looping over the values of a single hyperparameter is already a step toward automation. At the same time, full automation of the entire pipeline-building process is also practically unattainable today. Accordingly, most modern projects apply automation approaches, consciously or not. Semi-automatic machine learning makes more efficient use of a researcher's or developer's resources by automating the routine, without taking away a significant part of the flexibility of the work. As we have seen, the proposed solution requires human participation to limit the space of possible system parameters. Moreover, standard scenarios built on top of the configuration system make it possible to apply not only partial automation but also full automation that requires no human involvement.