# Hyperparameter search

This guide explains how to do hyperparameter search for machine learning on CSC's supercomputers. It is part of our Machine learning guide.

## Introduction

In machine learning (ML), one key component in model selection is to select a suitable set of model configurations for solving a given statistical problem. These model configurations are known as *hyperparameters*.

This is by no means an easy task. Statistical estimation problems assume that an optimal solution exists; however, in many real-world problems there is no guarantee that it does. For example, in biological and medical applications the variability of observations makes it nearly impossible to obtain optimal, separable decision boundaries, whereas the canonical seabass and salmon sorting problem does have a more or less optimally defined solution. In linguistics there may be statistically optimal solutions whose results are nevertheless semantically incomprehensible to humans. The lack of an optimal solution makes it difficult to give recommendations for setting the hyperparameters of an ML model for a given problem. For this reason, this document uses the term **search** instead of **optimization**, even though optimization appears more frequently in ML literature.

Some key points to consider when attempting hyperparameter search:

- the amount of available data
- choice of loss function (determines the type of estimator to be used)
- choice of solver

These points are conditioned on each other. Hyperparameters can be numerical (discrete or continuous) or categorical (type of regularization, activation functions, solvers), and many hyperparameters may have both linear and nonlinear effects on each other. Since ML models are applied to statistical problems, there is no substitute for good statistical knowledge of the problem at hand: it helps in setting possible hyperparameter options and their ranges, and in choosing the right type of estimator. For example, if all or almost all patterns are considered important to the problem at hand, parallel learning methods (such as neural networks) are preferable; if only some patterns are considered important, sequential statistical methods should be preferred. Choosing suitable hyperparameters can be done either manually or automatically. This documentation focuses on automated searching.

In order to do hyperparameter search, the data is randomly divided into three separate sets: *training*, *validation* and *test* sets (given that there is enough data; otherwise, the validation set can be omitted). Hyperparameters are searched using the training data and validated on the validation set. Once a suitable set of hyperparameters has been found, the completed model is evaluated against the test set.
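
The three-way split described above can be sketched with the Python standard library alone. The function name `split_dataset` and the 70/15/15 proportions are assumptions made for illustration, not a recommendation:

```python
import random

def split_dataset(samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Randomly split samples into training, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    training = shuffled[n_test + n_val:]
    return training, validation, test

# Toy data set: the integers 0..99 stand in for real observations
training, validation, test = split_dataset(range(100))
print(len(training), len(validation), len(test))
```

Setting `val_fraction = 0` gives the two-way split mentioned above for cases where there is not enough data for a separate validation set.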

One thing to keep in mind, however, is that a change in a hyperparameter defines a new model. *All* hyperparameter searches select the best model with respect to a given metric (probability of error, for example), which is statistically a questionable approach. Training **n** models on the *same* data set induces a dependency, which has to be considered when comparing different models. ANOVA or Bernoulli trials could be considered here when comparing different models on the same data.

## Search strategies

There are five main categories of searching for hyperparameters.

### Grid search

In *Grid search*, the hyperparameters of a model are considered to occupy a grid. Each grid point corresponds to one set of hyperparameter values, and every grid point is evaluated with respect to a given metric. Because Grid search searches the entire search space exhaustively, it becomes time consuming when the search space is large. It is therefore most appropriate when the number of hyperparameters is low and their ranges are limited, or when searching for suitable categorical hyperparameters, such as the type of regularizer and the type of solver. Grid search is easily parallelizable.
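
As a minimal sketch of the idea, the grid can be enumerated with `itertools.product`. The function `validation_error` is a hypothetical stand-in for training a model and measuring its error on the validation set, and the hyperparameter names and values are chosen only for illustration:

```python
from itertools import product

def validation_error(learning_rate, regularizer):
    """Hypothetical stand-in for training a model and returning its validation error."""
    penalty = {"l1": 0.05, "l2": 0.0}[regularizer]
    return (learning_rate - 0.1) ** 2 + penalty

grid = {
    "learning_rate": [0.001, 0.01, 0.1, 1.0],
    "regularizer": ["l1", "l2"],
}

# Every combination of values is one grid point (4 * 2 = 8 candidates here)
names = list(grid)
candidates = [dict(zip(names, values)) for values in product(*grid.values())]

# Exhaustive search: evaluate all candidates and keep the one with the lowest error
best = min(candidates, key=lambda params: validation_error(**params))
print(len(candidates), best)
```

The candidates do not depend on each other, which is why each one could be evaluated in a separate process or job.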

### Randomized search

*Randomized search* allows searching for hyperparameters from a larger search space. Given a set of hyperparameters and their lower and upper bounds, values are randomly selected from the given ranges. In theory, given a large search space, Randomized search can find a more suitable set of hyperparameters. Like Grid search, Randomized search is easily parallelizable.
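
A minimal sketch of drawing candidates from given ranges is shown below. Again, `validation_error` is a hypothetical stand-in for model training, and the ranges, the log-uniform sampling of the learning rate and the budget of 50 candidates are assumptions made for illustration:

```python
import math
import random

def validation_error(learning_rate, max_depth):
    """Hypothetical stand-in for training a model and returning its validation error."""
    return (math.log10(learning_rate) + 2) ** 2 + (max_depth - 4) ** 2

def sample(rng):
    """Draw one candidate from the search space."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, 0),  # log-uniform between 1e-5 and 1
        "max_depth": rng.randint(1, 8),             # integer between 1 and 8, ends included
    }

rng = random.Random(0)  # fixed seed for reproducibility
candidates = [sample(rng) for _ in range(50)]
best = min(candidates, key=lambda params: validation_error(**params))
print(best)
```

The number of candidates is set by the available budget and not by the size of the search space, which is what makes this approach usable for larger spaces.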

### Adaptive search

Grid and Randomized search do not take previously evaluated hyperparameter candidates into account, whereas *adaptive search*, or *Bayesian optimization*, uses previously evaluated hyperparameters to direct its search for a new set of hyperparameters. The search is done by training a *surrogate model* on a set of initial hyperparameter evaluations. After the surrogate model has been trained, an *acquisition function* is used to direct the search for new hyperparameters. This approach is sequential in nature: the search moves forward only after the set of hyperparameters proposed by the acquisition function has been evaluated, making adaptive search difficult to parallelize.
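
The loop of surrogate model and acquisition function can be sketched for a single hyperparameter using NumPy. This is an illustration under stated assumptions and not the implementation used by any particular library: the surrogate is a zero-mean Gaussian process with a squared-exponential kernel, the acquisition function is a lower confidence bound, and `validation_error`, the length scale, the factor 2.0 and the number of iterations are all chosen arbitrarily:

```python
import numpy as np

def validation_error(x):
    """Hypothetical stand-in for training a model with hyperparameter value x."""
    return (x - 0.3) ** 2

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of points."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

def gp_posterior(x_seen, y_seen, x_query, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean Gaussian process."""
    k_seen = rbf_kernel(x_seen, x_seen) + noise * np.eye(len(x_seen))
    k_query = rbf_kernel(x_seen, x_query)
    mean = k_query.T @ np.linalg.solve(k_seen, y_seen)
    variance = 1.0 - np.sum(k_query * np.linalg.solve(k_seen, k_query), axis=0)
    return mean, np.sqrt(np.clip(variance, 0.0, None))

candidates = np.linspace(0.0, 1.0, 101)  # values the hyperparameter may take
x_seen = np.array([0.0, 1.0])            # initial evaluations
y_seen = validation_error(x_seen)

for _ in range(10):
    # Surrogate model: fit the Gaussian process to all evaluations so far
    mean, std = gp_posterior(x_seen, y_seen, candidates)
    # Acquisition function: prefer low predicted error and high uncertainty
    acquisition = mean - 2.0 * std
    x_next = candidates[np.argmin(acquisition)]
    # Only now is the expensive evaluation done, one candidate at a time
    x_seen = np.append(x_seen, x_next)
    y_seen = np.append(y_seen, validation_error(x_next))

best_x = x_seen[np.argmin(y_seen)]
print(best_x)
```

Each iteration needs the result of the previous one before it can propose the next candidate, which is the sequential dependency described above.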

### Multifidelity search

*Multifidelity search* addresses the problem of searching large hyperparameter spaces, particularly for large neural network models. The search is divided into *low* and *high* fidelity evaluations: low fidelity involves a small subset of the training data, while high fidelity involves a large subset of the training data. The performance of each hyperparameter configuration in the low and high fidelity evaluations is recorded, and those which performed best are selected for the next evaluation. There are two popular multifidelity approaches: *successive halving* and *hyperband*. In successive halving, each evaluation iteration discards the "poorest performing" half of the hyperparameter configurations, which have a static budget, and keeps the better half. This process is continued until only the best set of hyperparameters remains. Hyperband addresses the static nature of successive halving by dynamically allocating the amount of budget to hyperparameter configurations during iterations, using successive halving as a subroutine for selecting suitable hyperparameters.
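
Successive halving can be sketched in a few lines. The function `validation_error` is a hypothetical stand-in in which a larger budget (for example more epochs or a larger subset of the training data) gives a lower error, and doubling the budget of the surviving configurations after each round is one common choice assumed here:

```python
import random

def validation_error(config, budget):
    """Hypothetical stand-in: error of a configuration trained with the given budget."""
    return (config["learning_rate"] - 0.1) ** 2 + 1.0 / budget

def successive_halving(configs, min_budget=1):
    """Repeatedly keep the better half of the configurations until one remains."""
    budget = min_budget
    while len(configs) > 1:
        ranked = sorted(configs, key=lambda c: validation_error(c, budget))
        configs = ranked[: max(1, len(configs) // 2)]  # discard the poorer half
        budget *= 2                                    # survivors get a larger budget
    return configs[0]

rng = random.Random(0)
configs = [{"learning_rate": rng.uniform(0.0, 1.0)} for _ in range(16)]
best = successive_halving(configs)
print(best)
```

With 16 starting configurations the rounds keep 8, 4, 2 and finally 1 configuration, so most of the budget is spent on the candidates that looked promising at low fidelity.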

### Metaheuristics

*Metaheuristics* are a set of search methods that can cope with difficult problem properties, such as nonconvex, noncontinuous and nonsmooth functions. In theory, metaheuristics can give better approximations to a global optimum (if a true optimum exists) for large scale models. For hyperparameter search, popular metaheuristic approaches are *evolutionary computation* and *particle swarm* approaches. In evolutionary computation, a set of initial, random hyperparameters is evaluated, then mutated and selected according to a fitness function. Particle swarm approaches work in a similar manner, but the search is done in a semi-collective manner: each candidate set of hyperparameters is evaluated individually, and information is then shared between all candidates to direct the search for a new set of hyperparameters.
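
The evaluate, select and mutate cycle of evolutionary computation can be sketched as follows. The `fitness` function is a hypothetical stand-in (higher is better), and the population size, Gaussian mutation and the choice to keep the parents in the next generation are assumptions made for illustration:

```python
import random

def fitness(config):
    """Hypothetical stand-in: higher is better, for example negative validation error."""
    return -((config["learning_rate"] - 0.1) ** 2)

def mutate(config, rng, scale=0.05):
    """Perturb the hyperparameter with Gaussian noise, clipped to the range [0, 1]."""
    value = config["learning_rate"] + rng.gauss(0.0, scale)
    return {"learning_rate": min(1.0, max(0.0, value))}

rng = random.Random(0)
population = [{"learning_rate": rng.uniform(0.0, 1.0)} for _ in range(10)]
initial_best = max(fitness(c) for c in population)

for _ in range(20):
    parents = sorted(population, key=fitness, reverse=True)[:5]  # selection
    children = [mutate(p, rng) for p in parents]                 # mutation
    population = parents + children                              # parents survive

best = max(population, key=fitness)
print(best)
```

Because the parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next.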

## Software for hyperparameter search

This documentation is biased towards the Python programming language. We present here software packages which have a clear API and documentation, and which are suitable for use in CSC's computing infrastructure.

Due to package dependencies, you should load the python-data module with the latest python version:

```
module load python-data/3.9-1
```

#### Scikit-learn

When Grid or Randomized search is a suitable option for hyperparameter search, **Scikit-learn** has implementations of both with cross-validation. Cross-validation is its own model selection process, and is highly dependent on the amount of available data and, for example, the number of folds to use (the number of folds and the train/test data split are dependent variables). Note that if you are testing a *larger number* of possible models, you will be at risk of getting an overfitted model.

```
# Scikit-learn
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

model = ... # Define model (any scikit-learn estimator)
hyperparams = { ... } # Dictionary of hyperparameters with their ranges and options
gridsearch = GridSearchCV( model, hyperparams, ... )
... # Define data splits (Xtrain, ytrain)
gridsearch.fit( Xtrain, ytrain )
```
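
For reference, here is a small self-contained example of both searches that can be run as is. The decision tree model, the digits data set, the hyperparameter values and the 80/20 split are assumptions chosen only to keep the example fast:

```python
from scipy.stats import randint
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=0)

# Grid search: every combination is evaluated with 3-fold cross-validation
hyperparams = {"max_depth": [2, 4, 8], "criterion": ["gini", "entropy"]}
gridsearch = GridSearchCV(DecisionTreeClassifier(random_state=0), hyperparams, cv=3)
gridsearch.fit(Xtrain, ytrain)
print(gridsearch.best_params_, gridsearch.score(Xtest, ytest))

# Randomized search: n_iter candidates are drawn from the given distribution
distributions = {"max_depth": randint(2, 16)}  # integers from 2 to 15
randomsearch = RandomizedSearchCV(DecisionTreeClassifier(random_state=0), distributions,
                                  n_iter=5, cv=3, random_state=0)
randomsearch.fit(Xtrain, ytrain)
print(randomsearch.best_params_, randomsearch.score(Xtest, ytest))
```

Both objects refit the best model on the whole training set by default, so `score` on the held-out test set evaluates the selected model.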

#### Optuna

**Optuna** is a model-agnostic library for hyperparameter search. In addition to Grid and Randomized search, Optuna implements Tree-structured Parzen estimator and CMA-ES sampling methods for searching hyperparameters, and you can also implement your own custom samplers. To make searching more efficient, multifidelity search can be done with Optuna, with Median or Threshold pruning as additional options for pruning the search space.

```
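# Optuna
import optuna

def objective( trial ):
    ... # Define model, pick hyperparameters with trial.suggest_*() and return the score

hyperparams = { ... } # Dictionary of hyperparameter names and the values to try
gridsearch = optuna.create_study( sampler = optuna.samplers.GridSampler( hyperparams ) )
gridsearch.optimize( objective )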
```

#### Scikit-optimize

**Scikit-optimize** implements adaptive search strategies and is built on top of scikit-learn. This makes it easy to apply to models built using scikit-learn.

```
# Scikit-optimize using Scikit-learn models
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from skopt import gp_minimize
from skopt.space import Real, Integer
from skopt.utils import use_named_args

# X and y are assumed to hold the training data
model = GradientBoostingClassifier() # Define GradientBoosting
hyperparams = [ Integer( 1, 5, name = 'max_depth' ),
                Real( 10 ** -5, 10 ** 0, 'log-uniform', name = 'learning_rate' )
              ]

@use_named_args( hyperparams )
def objective( **params ):
    model.set_params( **params )
    # The scorer returns the negated error, so flip the sign to get a value to minimize
    return -np.mean( cross_val_score( model, X, y, cv = 3, n_jobs = -1, scoring = 'neg_mean_absolute_error' ) )

# Execute Gaussian Process
model_gp = gp_minimize( objective, hyperparams, n_calls = 50, random_state = 123 )
```

#### Ray

If you would like to enable parallel model selection, *Ray* offers efficient wrappers for various Python packages. In case you need packages that are not included in the python-data module, use

```
pip install --user <package-name>
```

One such wrapper is the **tune-sklearn** package for scikit-learn. With tune-sklearn, you will get better parallelization performance using Ray than by using the **n_jobs = -1** argument within the GridSearchCV or RandomizedSearchCV functions. Here is an example using **TuneGridSearchCV** with scikit-learn models:

```
from ray.tune.sklearn import TuneGridSearchCV
from sklearn.linear_model import SGDClassifier
model = SGDClassifier()
# Note: scikit-learn 1.1 and later call the 'log' loss 'log_loss'
hyperparameter_grid = { "loss": [ 'hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron' ],
                        "max_iter": [ 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 ],
                        "n_iter_no_change": [ 5, 10, 15, 25, 40, 60 ]
                      }
tune_search = TuneGridSearchCV( model, hyperparameter_grid, early_stopping = True, max_iters = 10,
                                use_gpu = False, n_jobs = -1 )
... # Do your data split into train, validation and test sets
tune_search.fit( Xtrain, ytrain )
print( tune_search.best_params_ )
```

The **TuneSearchCV** function enables the use of Randomized and Adaptive search. By default, TuneSearchCV uses Randomized search; to switch to adaptive search, change the **search_optimization** argument. Instead of a fixed set of hyperparameters, use suitable ranges defined with NumPy or SciPy functions. Note that different *search_optimization* arguments may have package dependencies, which you have to install if they are missing.

```
from ray.tune.sklearn import TuneSearchCV
hyperparameter_grid = { ... } # Set hyperparameter ranges
# Randomized search
tune_randomized = TuneSearchCV( model, hyperparameter_grid, early_stopping = True,
                                search_optimization = 'random',
                                max_iters = 10, use_gpu = False, n_jobs = -1 )
# Adaptive (Bayesian) search
tune_bayes = TuneSearchCV( model, hyperparameter_grid, early_stopping = True,
                           search_optimization = 'bayesian',
                           max_iters = 10, use_gpu = False, n_jobs = -1 )
```

### Example using CSC's supercomputers

Here are case examples of doing model selection using Puhti and Mahti. Be sure to assign the number of CPUs **explicitly** in Puhti. In Mahti when you reserve a node, you automatically reserve the max number of CPUs. Also check the documentation for the partition names to use in both Puhti and Mahti.

```
#!/bin/bash
# example-slurm.sh
#SBATCH --partition=small
#SBATCH --nodes=1
#SBATCH --time=10:00:00
#SBATCH --ntasks=1
#SBATCH --mem=64G
#SBATCH --cpus-per-task=40
#SBATCH --account=projectname
module load python-data/3.9-1
set -xv
# Run the Python script (and any arguments) given on the sbatch command line
python3 "$@"
```

```
# hyperparameter_example.py
from ray.tune.sklearn import TuneGridSearchCV
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

model = SGDClassifier()
hyperparameter_grid = { "loss": [ 'hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron' ],
                        "max_iter": [ 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 ],
                        "n_iter_no_change": [ 5, 10, 15, 25, 40, 60 ]
                      }
tune_search = TuneGridSearchCV( model, hyperparameter_grid, early_stopping = True, max_iters = 10,
                                use_gpu = False, n_jobs = -1 )
X, y = load_digits( n_class = 10, return_X_y = True )
Xtrain, Xtest, ytrain, ytest = train_test_split( X, y, train_size = 0.90, test_size = .1, random_state = 0 ) # Split chosen arbitrarily
tune_search.fit( Xtrain, ytrain )
print( tune_search.best_params_ ) # Best hyperparameters found
print( tune_search.best_score_ ) # Cross-validation score of the best hyperparameters
print( tune_search.score( Xtest, ytest ) ) # Score of the selected model on the test set
```

```
sbatch example-slurm.sh hyperparameter_example.py
```

### More examples

More examples can be found at https://github.com/bilbrait/ml-guide-examples.

*Last edited Mon Dec 20 2021*