.. _nlp_model:

Model NLP
=========

The config file for NLP models contains three main sections:

- ``trainer``: contains the configs for PyTorch Lightning (PTL) training. For more information, refer to :doc:`../core/core` and the `PTL Trainer class API <https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html#trainer-class-api>`_.
- ``exp_manager``: the configs of the experiment manager. For more information, refer to :doc:`../core/core`.
- ``model``: contains the configs of the datasets, model architecture, tokenizer, optimizer, scheduler, etc.
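
A minimal sketch of how these three sections might be laid out in a YAML config file. The keys shown under ``trainer`` and ``exp_manager`` and all values are illustrative assumptions, not defaults taken from this page:

.. code-block:: yaml

    # Illustrative skeleton only; every value here is a placeholder.
    trainer:
      max_epochs: 3        # assumed example of a PTL Trainer argument
    exp_manager:
      exp_dir: null        # assumed example key; null means "not set"
    model:
      # datasets, model architecture, tokenizer, optimizer, scheduler, etc.
      language_model:
        pretrained_model_name: bert-base-uncased
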

The following sub-sections of the ``model`` section are shared among most of the NLP models:

- ``tokenizer``: specifies the tokenizer
- ``language_model``: specifies the underlying model to be used as the encoder
- ``optim``: the configs of the optimizer and scheduler. For more information, refer to :doc:`../core/core`.

The ``tokenizer`` and ``language_model`` sections have the following parameters:

+------------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+
| **Parameter**                                  | **Data Type**   | **Description**                                                                                              |
+------------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+
| **model.tokenizer.tokenizer_name**             | string          | Tokenizer name will be filled automatically based on ``model.language_model.pretrained_model_name``.         |
+------------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+
| **model.tokenizer.vocab_file**                 | string          | Path to tokenizer vocabulary.                                                                                |
+------------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+
| **model.tokenizer.tokenizer_model**            | string          | Path to tokenizer model (only for sentencepiece tokenizer).                                                  |
+------------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+
| **model.language_model.pretrained_model_name** | string          | Pre-trained language model name, for example: ``bert-base-cased`` or ``bert-base-uncased``.                  |
+------------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+
| **model.language_model.lm_checkpoint**         | string          | Path to the pre-trained language model checkpoint.                                                           |
+------------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+
| **model.language_model.config_file**           | string          | Path to the pre-trained language model config file.                                                          |
+------------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+
| **model.language_model.config**                | dictionary      | Config of the pre-trained language model.                                                                    |
+------------------------------------------------+-----------------+--------------------------------------------------------------------------------------------------------------+

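
As a rough illustration, the parameters above might appear in a YAML config as follows. This is a sketch, not a copy of any shipped config: the ``null`` values are placeholders meaning "not set", and the ``${...}`` interpolation is an assumed (OmegaConf-style) way to derive the tokenizer name from the language model name.

.. code-block:: yaml

    model:
      tokenizer:
        # Assumed interpolation: reuse the pre-trained model name as the tokenizer name.
        tokenizer_name: ${model.language_model.pretrained_model_name}
        vocab_file: null        # path to tokenizer vocabulary
        tokenizer_model: null   # path to tokenizer model (sentencepiece tokenizer only)
      language_model:
        pretrained_model_name: bert-base-uncased
        lm_checkpoint: null     # path to the pre-trained language model checkpoint
        config_file: null       # path to the pre-trained language model config file
        config: null            # dictionary with the language model config
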

The parameter **model.language_model.pretrained_model_name** can be one of the following:

- ``megatron-bert-345m-uncased``, ``megatron-bert-345m-cased``, ``biomegatron-bert-345m-uncased``, ``biomegatron-bert-345m-cased``, ``bert-base-uncased``, ``bert-large-uncased``, ``bert-base-cased``, ``bert-large-cased``
- ``distilbert-base-uncased``, ``distilbert-base-cased``
- ``roberta-base``, ``roberta-large``, ``distilroberta-base``
- ``albert-base-v1``, ``albert-large-v1``, ``albert-xlarge-v1``, ``albert-xxlarge-v1``, ``albert-base-v2``, ``albert-large-v2``, ``albert-xlarge-v2``, ``albert-xxlarge-v2``