|
|
|
.. _exp-manager-label: |
|
|
|
Experiment Manager |
|
================== |
|
|
|
The NeMo Experiment Manager is included by default in all NeMo example scripts.
|
|
|
To use the Experiment Manager, simply call :class:`~nemo.utils.exp_manager.exp_manager` and pass in the PyTorch Lightning ``Trainer``.
|
|
|
.. code-block:: python |
|
|
|
exp_dir = exp_manager(trainer, cfg.get("exp_manager", None)) |
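
In a typical NeMo example script, this call sits right after the PyTorch Lightning ``Trainer`` is constructed. The following is a minimal sketch; the config layout and the ``MyNeMoModel`` class are illustrative placeholders rather than parts of a specific example script.

.. code-block:: python

   import pytorch_lightning as pl

   from nemo.core.config import hydra_runner
   from nemo.utils.exp_manager import exp_manager


   @hydra_runner(config_path="conf", config_name="config")
   def main(cfg):
       # Build the trainer from the config, then hand it to the experiment manager,
       # which sets up logging, checkpointing, and the experiment directory.
       trainer = pl.Trainer(**cfg.trainer)
       exp_log_dir = exp_manager(trainer, cfg.get("exp_manager", None))

       # MyNeMoModel is a placeholder for any NeMo model class.
       model = MyNeMoModel(cfg=cfg.model, trainer=trainer)
       trainer.fit(model)


   if __name__ == "__main__":
       main()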
|
|
|
The Experiment Manager is also configurable via YAML with Hydra.
|
|
|
.. code-block:: yaml
|
|
|
exp_manager: |
|
exp_dir: /path/to/my/experiments |
|
name: my_experiment_name |
|
create_tensorboard_logger: True |
|
create_checkpoint_callback: True |
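
Because the config is managed by Hydra, the same ``exp_manager`` fields can also be overridden from the command line when launching a training script. A minimal sketch (the script path is illustrative):

.. code-block:: bash

   # The script path is illustrative; point this at your own NeMo training script.
   python examples/your_training_script.py \
       exp_manager.exp_dir=/path/to/my/experiments \
       exp_manager.name=my_experiment_name \
       exp_manager.create_tensorboard_logger=True \
       exp_manager.create_checkpoint_callback=True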
|
|
|
Optionally, launch TensorBoard to view the training results in ``./nemo_experiments`` (by default). |
|
|
|
.. code-block:: bash |
|
|
|
tensorboard --bind_all --logdir nemo_experiments |
|
|
|
|
|
|
If ``create_checkpoint_callback`` is set to ``True``, then NeMo automatically creates checkpoints during training
using PyTorch Lightning's ``ModelCheckpoint`` callback.
We can configure the ``ModelCheckpoint`` via YAML or CLI.
|
|
|
.. code-block:: yaml |
|
|
|
   exp_manager:
     ...
     # Configure the PyTorch Lightning ModelCheckpoint via checkpoint_callback_params;
     # any ModelCheckpoint argument can be set here.
     checkpoint_callback_params:
       # save the best checkpoints based on this metric
       monitor: val_loss
       # choose how many total checkpoints to save
       save_top_k: 5
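
Because the checkpoint settings live under ``exp_manager`` in the Hydra config, the same ``ModelCheckpoint`` arguments can also be passed as command-line overrides using Hydra's dotted notation. A minimal sketch (the script path is illustrative):

.. code-block:: bash

   # The script path is illustrative; point this at your own NeMo training script.
   python examples/your_training_script.py \
       exp_manager.checkpoint_callback_params.monitor=val_loss \
       exp_manager.checkpoint_callback_params.save_top_k=5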
|
|
|
Resume Training |
|
--------------- |
|
|
|
We can also auto-resume training by configuring the ``exp_manager``. Being able to auto-resume is important for long training
runs that are preemptible or may be shut down before the training procedure has completed. To auto-resume training, set the following
via YAML or CLI:
|
|
|
.. code-block:: yaml |
|
|
|
exp_manager: |
|
... |
|
# resume training if checkpoints already exist |
|
resume_if_exists: True |
|
|
|
     # allow training to start even if no checkpoint exists yet
|
resume_ignore_no_checkpoint: True |
|
|
|
     # by default, experiments are versioned by datetime
     # we can set our own version with
     version: my_experiment_version
|
|
|
|
|
Experiment Loggers |
|
------------------ |
|
|
|
Alongside TensorBoard, NeMo also supports Weights and Biases, MLFlow, DLLogger, and ClearML. To use these loggers, simply set the following
via YAML or :class:`~nemo.utils.exp_manager.ExpManagerConfig`.
|
|
|
|
|
Weights and Biases (WandB) |
|
~~~~~~~~~~~~~~~~~~~~~~~~~~ |
|
|
|
.. _exp_manager_weights_biases-label: |
|
|
|
.. code-block:: yaml |
|
|
|
exp_manager: |
|
... |
|
create_checkpoint_callback: True |
|
create_wandb_logger: True |
|
wandb_logger_kwargs: |
|
name: ${name} |
|
project: ${project} |
|
entity: ${entity} |
|
<Add any other arguments supported by WandB logger here> |
|
|
|
|
|
MLFlow |
|
~~~~~~ |
|
|
|
.. _exp_manager_mlflow-label: |
|
|
|
.. code-block:: yaml |
|
|
|
exp_manager: |
|
... |
|
create_checkpoint_callback: True |
|
create_mlflow_logger: True |
|
mlflow_logger_kwargs: |
|
experiment_name: ${name} |
|
tags: |
|
<Any key:value pairs> |
|
save_dir: |
|
prefix: |
|
       artifact_location: null
       # provide run_id if resuming a previously started run
       run_id: null
|
|
|
DLLogger |
|
~~~~~~~~ |
|
|
|
.. _exp_manager_dllogger-label: |
|
|
|
.. code-block:: yaml |
|
|
|
exp_manager: |
|
... |
|
create_checkpoint_callback: True |
|
create_dllogger_logger: True |
|
dllogger_logger_kwargs: |
|
verbose: False |
|
stdout: False |
|
json_file: "./dllogger.json" |
|
|
|
ClearML |
|
~~~~~~~ |
|
|
|
.. _exp_manager_clearml-label: |
|
|
|
.. code-block:: yaml |
|
|
|
exp_manager: |
|
... |
|
create_checkpoint_callback: True |
|
create_clearml_logger: True |
|
clearml_logger_kwargs: |
|
       project: null # name of the project
       task: null # optional name of task
       connect_pytorch: False
       model_name: null # optional name of model
       tags: null # should be a list of str
       log_model: False # log model to clearml server
       log_cfg: False # log config to clearml server
       log_metrics: False # log metrics to clearml server
|
|
|
Exponential Moving Average |
|
-------------------------- |
|
|
|
.. _exp_manager_ema-label: |
|
|
|
NeMo supports using exponential moving average (EMA) for model parameters. This can be useful for improving model generalization |
|
and stability. To use EMA, simply set the following via YAML or :class:`~nemo.utils.exp_manager.ExpManagerConfig`. |
|
|
|
.. code-block:: yaml |
|
|
|
exp_manager: |
|
... |
|
# use exponential moving average for model parameters |
|
ema: |
|
enabled: True # False by default |
|
decay: 0.999 # decay rate |
|
cpu_offload: False # If EMA parameters should be offloaded to CPU to save GPU memory |
|
every_n_steps: 1 # How often to update EMA weights |
|
validate_original_weights: False # Whether to use original weights for validation calculation or EMA weights |
|
|
|
|
|
.. _nemo_multirun-label: |
|
|
|
Hydra Multi-Run with NeMo |
|
------------------------- |
|
|
|
When training neural networks, it is common to perform a hyperparameter search in order to improve the performance of a model
on some validation data. However, it can be tedious to manually prepare a grid of experiments and to manage all the checkpoints
and their metrics. To simplify such tasks, NeMo integrates with `Hydra Multi-Run support <https://hydra.cc/docs/tutorials/basic/running_your_app/multi-run/>`_ in order to provide a unified way to run a set of experiments, all
from the config.
|
|
|
There are certain limitations to this framework, which we list below: |
|
|
|
* All experiments are assumed to run on a single GPU; multi-GPU training for a single run (and therefore model-parallel models) is not supported as of now.
|
* NeMo Multi-Run currently supports only grid search over a set of hyperparameters; support for more advanced hyperparameter search strategies will be added eventually.
|
* **NeMo Multi-Run only supports running on one or more GPUs** and will not work if no GPU devices are present. |
|
|
|
Config Setup |
|
~~~~~~~~~~~~ |
|
|
|
In order to enable NeMo Multi-Run, we first update our YAML config with some information to let Hydra know that we expect to run multiple experiments from this one config -
|
|
|
.. code-block:: yaml |
|
|
|
# Required for Hydra launch of hyperparameter search via multirun |
|
defaults: |
|
- override hydra/launcher: nemo_launcher |
|
|
|
# Hydra arguments necessary for hyperparameter optimization |
|
hydra: |
|
# Helper arguments to ensure all hyper parameter runs are from the directory that launches the script. |
|
sweep: |
|
dir: "." |
|
subdir: "." |
|
|
|
# Define all the hyper parameters here |
|
sweeper: |
|
params: |
|
# Place all the parameters you wish to search over here (corresponding to the rest of the config) |
|
         # NOTE: Make sure that there are no spaces around the commas that separate the config params!
|
model.optim.lr: 0.001,0.0001 |
|
model.encoder.dim: 32,64,96,128 |
|
model.decoder.dropout: 0.0,0.1,0.2 |
|
|
|
# Arguments to the process launcher |
|
launcher: |
|
num_gpus: -1 # Number of gpus to use. Each run works on a single GPU. |
|
jobs_per_gpu: 1 # If each GPU has large memory, you can run multiple jobs on the same GPU for faster results (until OOM). |
|
|
|
|
|
Next, we will set up the config for the ``Experiment Manager``. When we perform a hyperparameter search, each run may take some time to
complete, and we want to avoid having to redo all experiments if a run ends prematurely (say, due to OOM or a timeout on the machine).
We therefore set up the experiment manager config such that every experiment has a unique "key", whose value corresponds to a single
resumable experiment.
|
|
|
Let us see how to set up such a unique "key" via the experiment name. Simply attach all the hyperparameter arguments to the experiment
name, as shown below -
|
|
|
.. code-block:: yaml |
|
|
|
exp_manager: |
|
exp_dir: null # Can be set by the user. |
|
|
|
     # Add a unique name containing all the hyperparameter arguments to allow continued training.
     # NOTE: It is necessary to add all hyperparameter arguments to the name!
     # This ensures successful restoration of model runs in case the HP search crashes.
|
     name: ${name}-lr-${model.optim.lr}-edim-${model.encoder.dim}-dropout-${model.decoder.dropout}
|
|
|
... |
|
checkpoint_callback_params: |
|
... |
|
       save_top_k: 1 # Don't save too many .ckpt files during HP search
       always_save_nemo: True # Also save the checkpoints as .nemo files for fast checking of results later
|
... |
|
|
|
     # We highly recommend using an experiment tracking tool to gather all the experiments in one location
|
create_wandb_logger: True |
|
wandb_logger_kwargs: |
|
project: "<Add some project name here>" |
|
|
|
     # HP search may crash for various reasons; it is best to attempt continuation in order to
     # resume from where the last failure occurred.
|
resume_if_exists: true |
|
resume_ignore_no_checkpoint: true |
|
|
|
|
|
Running a Multi-Run Config
|
~~~~~~~~~~~~~~~~~~~~~~~~~~ |
|
|
|
Once the config has been updated, we can run it just like any normal Hydra script -- with one special flag (``-m``)!
|
|
|
.. code-block:: bash |
|
|
|
python script.py --config-path=ABC --config-name=XYZ -m \ |
|
       trainer.max_steps=5000 \  # Any additional arg after -m will be passed to all the runs generated from the config!
|
... |
|
|
|
Tips and Tricks |
|
~~~~~~~~~~~~~~~ |
|
|
|
* Preserving disk space for a large number of experiments
|
|
|
Some models may have a large number of parameters, and saving a large number of checkpoints to physical storage drives can be
very expensive. For example, if you use the Adam optimizer, each PyTorch Lightning ".ckpt" file will actually be roughly 3x the
size of the model parameters alone, per checkpoint file! This can be exorbitant if you have multiple runs.
|
|
|
In the above config, we explicitly set ``save_top_k: 1`` and ``always_save_nemo: True`` - this limits the number of
ckpt files to just 1, and also saves a NeMo file (which contains just the model parameters without the optimizer state) that
can be restored immediately for further work.
|
|
|
We can further reduce the storage space by using NeMo utility functions to automatically delete either the
ckpt or the NeMo files after a training run has finished. This is sufficient if you are collecting results in an experiment
tracking tool and can simply rerun the best config after the search is finished.
|
|
|
.. code-block:: python |
|
|
|
# Import `clean_exp_ckpt` along with exp_manager |
|
from nemo.utils.exp_manager import clean_exp_ckpt, exp_manager |
|
|
|
@hydra_runner(...) |
|
def main(cfg): |
|
... |
|
|
|
# Keep track of the experiment directory |
|
exp_log_dir = exp_manager(trainer, cfg.get("exp_manager", None)) |
|
|
|
       # ... add any training code here as needed ...
|
|
|
# Add following line to end of the training script |
|
# Remove PTL ckpt file, and potentially also remove .nemo file to conserve storage space. |
|
clean_exp_ckpt(exp_log_dir, remove_ckpt=True, remove_nemo=False) |
|
|
|
|
|
* Debugging Multi-Run Scripts |
|
|
|
When running Hydra scripts, you may sometimes face config issues that crash the program. In NeMo Multi-Run, a crash in
any one run will **not** crash the entire program; we simply take note of it and move on to the next job. Once all
jobs are completed, we then raise the errors in the order that they occurred (the program will crash with the first error's
stack trace).
|
|
|
In order to debug Multi-Run, we suggest commenting out the full hyperparameter config set inside ``sweeper.params``
and instead running just a single experiment with the config, which would immediately raise the error.
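
For example, assuming the ``sweeper`` config shown earlier, the temporary change might look like the following sketch (the values are purely illustrative):

.. code-block:: yaml

   hydra:
     sweeper:
       params:
         # Temporarily comment out the full search space ...
         # model.optim.lr: 0.001,0.0001
         # model.encoder.dim: 32,64,96,128
         # model.decoder.dropout: 0.0,0.1,0.2
         # ... and debug with a single value instead.
         model.optim.lr: 0.001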
|
|
|
|
|
* Experiment name cannot be parsed by Hydra |
|
|
|
Sometimes our hyperparameters include PyTorch Lightning ``trainer`` arguments - such as the number of steps, the number of epochs,
or whether to use gradient accumulation. When we attempt to add these as keys to the experiment manager's name,
Hydra may complain that ``trainer.xyz`` cannot be resolved.
|
|
|
A simple solution is to finalize the Hydra config before you call ``exp_manager()``, as follows -
|
|
|
.. code-block:: python |
|
|
|
   from omegaconf import OmegaConf

   @hydra_runner(...)
|
def main(cfg): |
|
# Make any changes as necessary to the config |
|
cfg.xyz.abc = uvw |
|
|
|
       # Finalize the config; OmegaConf.resolve() resolves interpolations in place
       # and returns None, so do not reassign cfg here.
       OmegaConf.resolve(cfg)
|
|
|
# Carry on as normal by calling trainer and exp_manager |
|
trainer = pl.Trainer(**cfg.trainer) |
|
exp_log_dir = exp_manager(trainer, cfg.get("exp_manager", None)) |
|
... |
|
|
|
|
|
ExpManagerConfig |
|
---------------- |
|
|
|
.. autoclass:: nemo.utils.exp_manager.ExpManagerConfig |
|
:show-inheritance: |
|
:members: |
|
:member-order: bysource |
|
|