a little research for a pretty optimization
Hi to everyone! Let’s talk about tuning hyperparameters in ensemble learning (mostly, blending). In such ensembles, predictions from one machine learning model become predictors for another (next level). The figure below shows some variations of ensembles where the data is transferred from left to right. Such ensembles will also be named as pipelines or composite models (composite pipelines) in this post.
The use of such complex composite structures can reduce prediction error. Sometimes the models become more robust. However, there is a drawback: high efficiency comes at the price of “ensemble consumption” (much more time and computational resources are required for the ensemble than for stand-alone models). Also, the more complex the ensemble, the more difficult it is to configure. Let’s have a talk about how hyperparameters can be tuned in such ensembles.
Disclaimer: We are not discussing specific optimization techniques, such as Bayesian optimization, random search, or evolutionary algorithms (or any other algorithms). Here the discussion is about higher-level approaches to tuning hyperparameters independently of the algorithms used in the core. The examples in this post use Bayesian optimization using the hyperopt library, but it can be replaced with a more appropriate one (for you or us) if necessary.
How this problem is solved in the industry
There are relatively few materials on the topic of tuning ensembles with machine learning models. Let’s start with “Hyperparameter tuning for Ensemble of ML models (Simple Python Example)” (video). Then let’s continue studying the topic with “Blending Ensemble Machine Learning With Python” (here there is no hyperparameters tuning by the way), look through the Internet, and finally run through ready-made solutions of winners and participants of kaggle competitions (for example, Hyperparameter Tuning /Ensemble Methods).
From the items above we can summarize the following classical approach. First all single models of the first level are trained and tuned, — then the ensemble model(s). Then the process is repeated — iterative movement from left to right. The same approach is exploited in “Hyperparameter Tuning the Weighted Average Ensemble in Python”, where only the ensemble weights are tuned.
In the kaggle notebook “A Data Science Framework: To Achieve 99% Accuracy”, the scheme is the same: hyperparameters are tuned separately for each model. However, even champion cagglers point out that tuning hyperparameters in ensembles is non-trivial: “We could have experimented with more hyperparameters in our detection models as well. I really dislike hyperparameter tuning (never really developed a good strategy for it) and often try and compensate by ensembling different models together.” © [1st place] Solution Overview & Code.
How this problem is solved in science
Here is where there is a lot of information and approaches. It is in scientific papers, which usually do not have any links to software implementations of algorithms, nor to a repository with experiments (It’s a big problem, honestly).
The noteworthy approach is “do nothing” — with this approach, the ensemble weights (if it is a weighted ensemble) are not tuned at all. All first-level models are included in the ensemble with the same weights. However, this method is not always optimal (see Stacked ensemble combined with fuzzy matching for biomedical named entity recognition of diseases). Clearly, this approach cannot be classified as a hyperparameter tuning method, so let’s move on to the real methods.
Let’s start with the classical approach “first each model is tuned individually, then the ensemble weights”. This approach is described, for example, in Optimizing ensemble weights and hyperparameters of machine learning models for regression problems, where its optimality is criticized. Actually, even within the framework of this paper, the authors suggest a more promising alternative — to tune hyperparameters of models and ensemble weights simultaneously.
Hyperparameters tuning of ensemble models for software effort estimation uses exactly the same approach, only generalized for stacking, where the final predictions are combined by the model rather than by weighting. Since the search space turned out to be quite large in their case, the particle swarm method and genetic algorithm were used for optimization.
In general, there are quite a lot of variations on the several approaches described above, but there are only two fundamentally different methods: tuning the basic models individually and tuning all the models in an ensemble simultaneously.
Which approaches we finally identified
Note that we did all the above research in order to improve the hyperparameter tuning module in our open-source AutoML framework FEDOT. This is a typical example of R&D work: we wanted to improve the hyperparameter tuning and needed to choose one of the most appropriate approaches. In this case, it is necessary to 1) identify competing solutions, 2) implement prototype approaches, 3) evaluate and 4) select the best one (according to given criteria) and 5) integrate the selected approach into the project. When we reviewed the approaches, we implemented three hyperparameter tuning approaches within the selected branch. And then we conducted experiments to determine the most successful approach. We highlight two main criteria for a successful approach: the tuning algorithm must work quickly and accurately, i.e. the ensemble error after tuning hyperparameters must be less than with the default parameters.
We made the following three cuties (names made up by the author):
- isolated tuning;
- sequential tuning;
- simultaneous tuning.
Let’s now discuss each strategy separately. We will use animations to make it clearer. The first will be the isolated one (Animation 1).
The nodes in which the hyperparameters are tuned are highlighted in red, and the arrows show through which nodes the data is passed. The animation shows that only one node is tuned during each iteration. The error metric is measured by the predictions from that node. That means that in one iteration, the data is transferred only through the ancestor nodes of the tuned model. This strategy follows the principle of “one iteration, one model trained”.
Now let’s move on to analytics. One of the advantages of the above approach is its computational efficiency. Indeed, at each iteration, it is not necessary to train all the pipeline models again.
It is especially worth taking a closer look at the disadvantages. The first and most important limitation is the model-specificity of the algorithm — it works only on those nodes whose output can be matched to the target variable and obtain a metric, i.e., only on models. If there are preprocessing operations in the pipeline structure (e.g. feature selection algorithm or principal component analysis), this approach cannot be applied. Since the output of some nodes cannot be directly matched to the target variable for error estimation, such operations cannot be tuned. This problem can be partially solved by including additional rules: for example, it is possible to ignore such cases, or to tune preprocessing hyperparameters together with the parameters of the descendant models. However, these modifications take time to implement, so we decided to delay them until we were sure that the approach was the most promising — the prototype managed without such modifications. The second disadvantage is that the algorithm is greedy. At each step, we try to obtain a locally optimal configuration of hyperparameters for each subgraph of a large pipeline. There is no guarantee that in the chase for an optimum at each step, we won’t miss the global one for the entire pipeline.
The disadvantages of the first approach (at least some of them) can be overcome by the second, sequential tuning (Animation 2).
In this approach, the full data is now passed through the entire pipeline at each iteration (red arrows are lit for the full pipeline), although it is still only one operation that has its hyperparameters optimized.
Pros of the approach:
- It is possible to optimize a pipeline of any structure, including ones containing preprocessing operations;
- At each iteration, the output of the whole composite model is matched. Consequently, the model is optimized over the final, rather than intermediate, predictions.
But there are also disadvantages. Since we start from right to left, it turns out that during hyperparameters optimization of the previous models are tuned to one configuration of the subsequent models (descendants). And already at the next node, the descendant configuration changes to the new one. Perhaps this is not a big problem, because it is always possible to remain initial values of hyperparameters at descendant nodes if any other combinations do not lead to improvement of metrics.
However, by moving sequentially, we narrow the search space, since the previously optimized nodes cannot change the assigned hyperparameters. Thus, we can drive ourselves into a “non-optimal corner”, and although we can still optimize the final ensemble model, nothing will be improved. The analogy that comes to my mind with a cake is that people pass along the queue, biting off one piece at a time. The last person in this queue may get nothing — though he could possibly discover the secret of its cooking and then we would never have to stand in a queue for cakes…
Colleagues: So how to overcome the problems with previous approaches?
Me: Let’s just dump everything into a heap!
This is what we have done. Animation 3 shows that it is possible to optimize an entire pipeline as one big “black box”, where the set of hyperparameters of the pipeline equals the union of the sets of model parameters in its structure.
- Optimizes the value of the metric for the entire pipeline;
- The search space is not reduced during optimization — it is possible to change the hyperparameters of both the root model and the predecessor models at the same time.
Disadvantages arising from the advantages of the approach:
- The most computationally expensive method;
- Very large search space, especially for big ensembles.
Thus, the third approach seems most effective for the problem of finding optimal combinations of hyperparameters in an ensemble — let’s check it!
So, let’s choose the most appropriate approach for its integration. To do this, we set up the experiment as follows (Table 1). Important clarification: the final metrics are measured on a delayed sample, which the tuner did not have access to during the optimization.
The picture below (Figure 2) shows three different pipelines that were involved in the experiments.
The number of operations (models) in the pipelines varied from two (Pipeline A) to ten (Pipeline C). Explanation of model names: ridge — ridge regression, lasso — LASSO regression, knnreg — K-nearest neighbors for regression task, svr — support vector machine for regression task, rfr — random forest for regression, dtreg — decision tree for regression task, knn — K-nearest neighbors for classification task, rf — random forest for classification, dt — decision tree for classification task.
As mentioned above, the comparison will be made according to two criteria: execution time and metrics on the validation sample. With the metric on the validation sample, we will also introduce an additional criterion: the more stable (by 30 launches) the result is, the more preferable the method is. I.e., we will look not only at the average metric value but also at the variance. Besides, we will also check on which types of pipelines the considered approaches work better: simple linear A; relatively simple but branched B, and complex multilevel C. We will also draw conclusions about whether the amount of available computational resources (determined by the number of iterations) influences the efficiency of the approach.
Important information. Since we are not interested in how well Type A, B or C pipelines can predict, but rather in the efficiency of the parameter tuning algorithms, we will talk below about gains in metrics (rather than absolute values). For symmetric absolute percentage error (SMAPE), the delta will be calculated as the difference between the metric value before tuning and the metric value after. For classification, we will subtract from the metric after tuning the metric before hyperparameter tuning. Thus, the statement “the greater the delta, the better the algorithm” will be true for all tasks.
Disclaimer: The following is not intended to be a full-scale scientific study, but was done on a routine development process. We have used similar benchmarking during prototyping. So, if you want to write a scientific article in Q1-journal on such a subject, you will have to do some more experiments: calculate more metrics, take 10–15 datasets for each task, formalize hypotheses, perform statistical testing, etc.
So, the results of the metric improvement comparison for the regression task are shown in Figure 3.
In the figure, each point cloud represents a sample of thirty runs, i.e. thirty runs for each type of pipeline (A, B, C), for each approach (isolated, sequential, simultaneous), for each dataset (three regression), and for two options of allocated number of iterations for optimization (20 and 100). A total of 1620 runs of the tuning module for regression, and the same number for classification.
The results on the dataset “pol” stand out. Indeed, in the vast majority of cases, hyperparameter tuning led to worse prediction on the delayed sample. Well, even in the case where hyperparameter tuning ultimately leads to an increase in prediction error, it is still better to choose the algorithm that harms less than the others. If we look at more “adequate” launches, we can see that, on average, simultaneous tuning produces a more consistently high result: the variation in values is smaller and the cloud is positioned higher than the competitors.
Another observation is that the experiments use pipelines of varying degrees of branching, or complexity. Figure 3 shows that, as a rule, the more complex a pipeline is, the greater the search space for its tuning, and the greater the gain in metric that can be obtained.
Figure 4 shows the results for the classification datasets.
In general, the results for classification are the same as for regression, with a winner in terms of metric gain and variance — a simultaneous hyperparameters tuning.
Now let’s look more closely at some cases, namely runs of 20 iterations each (Figure 5) for the tables “Amazon_employee_access” (classification) and “cal_housing” (regression).
What can be seen from the figure? — First, it can be noticed that in the case of the improvement, classification metrics gradually slide “to the right” in the order of isolated tuning — sequential tuning — simultaneous. This indicates that the simultaneous approach allows better hyperparameter tuning than the sequential approach, which is more reliable than isolated tuning. It can also be seen that only the third approach achieves consistent improvement in metrics; with other algorithms, this is not guaranteed.
Now consider the distributions for Pipeline C when solving the regression problem: it can be seen that only in the case of simultaneous tuning does one noticeable high peak (mode) in the distribution stand out. In the other two cases, the distributions are either not unimodal (isolated tuning), or stretched almost along the entire x-axis. In other words, only in the case of simultaneous tuning was it possible to achieve high densities. It indicates that if we apply the third approach, we are likely to obtain the same high result as in the previous launch.
Now check the execution time of the approaches using the example of the regression tables in the “violin plot” (Figure 6). In such a plot, the kernel density estimates of the distribution are plotted on the sides relative to the central axis of each item. Thus, the locations of the expansions in such “violins” show the mode.
Let’s start with the obvious: the more iterations allocated, the longer the algorithm runs. Now note that, as expected, isolated tuning turned out to be a pretty fast algorithm.
Note, however, that in the case of sequential tuning, there is an opportunity for significant improvement. All operations in the pipelines that are not being trained with new hyperparameters during a given iteration and do not depend on the tuning model (e.g. located in the pipelines before it), cannot be trained, but simply call the predict method of the already trained model. Indeed, predictors are unchanged and hyperparameters of the model didn’t change either. It means that such operations can be cached and used to generate forecasts only. In the case of simultaneous tuning of hyperparameters, there is no possibility to perform such a modification.
Now that we have reached the final, I propose to conclude in the form of a summary table with metrics (Table 2).
In this paper, we looked at how hyperparameters can be tuned in ensembles of machine learning models. We created prototypes of each of the three approaches we identified to determine which is the most promising.Then we ran experiments on six different datasets.
The conclusions are that simultaneous tuning achieves greater metric gains and, at the same time, allows these gains to be achieved in a more stable way. The isolated approach was the fastest (in terms of average execution time). On average, sequential tuning was more efficient than isolated tuning but less efficient than simultaneous tuning. However, the execution time of sequential tuning can be reduced by using operation caching.
P.S. In our FEDOT framework, we decided to implement and support simultaneous and sequential tuning after all. The default for tuning is simultaneous strategy.
- Open-source AutoML repository of the FEDOT framework
- Branch with experiments
- NSS lab website with information about the research group
- Our new paper, which discusses the topic of hyperparameters optimization: Automated evolutionary approach for the design of composite machine learning pipelines
Hyperparameters Tuning for Machine Learning Model Ensembles was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.