Model Card: French-Continued Pretrained Language Models

Overview

Vigogne is a series of pretrained LLMs built by Zaion, a leader in conversational AI designed for customer experience (CX).

This model card documents the continual pretraining of five language models on French data to improve their proficiency in French (a minimal usage sketch in Python follows the list):

  • Llama-3.2-1B
  • Llama-3.2-3B
  • Llama-3.1-8B
  • Qwen2.5-1.5B
  • Qwen2.5-3B

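The released checkpoints can be loaded with the Hugging Face transformers library. The snippet below is a minimal sketch using the Qwen2.5-3B variant (moussaKam/Vigogne_Qwen2.5-3B); the prompt and generation settings are illustrative, not recommendations.

```python
# Minimal sketch: load a Vigogne checkpoint and generate a short completion.
# Prompt and sampling settings are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "moussaKam/Vigogne_Qwen2.5-3B"  # any of the five variants loads the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "La gastronomie française est réputée pour"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```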
Training Procedure

The training process consisted of three distinct phases:

Phase 1: Initial Pretraining

  • Data Source: The French subset of the FineWeb-2 corpus.
  • Learning Rate: A constant learning rate was used throughout this phase.

Phase 2: Annealing Phase

  • Learning Rate Scheduler: A cosine scheduler was applied to gradually decay the learning rate (an illustrative sketch of the two-phase schedule follows the list below).
  • Data Composition:
    • Subset of FineWeb-2: A portion of the French FineWeb-2 corpus used in Phase 1.
    • LLM-Rewritten Subset: A portion of the corpus rewritten using a Large Language Model (LLM).
    • French Magpie Dataset: A dataset curated specifically for this work, containing the response components of a French Magpie dataset.

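For illustration only, the learning-rate behaviour described above (constant during Phase 1, cosine decay during the annealing phase) can be sketched with standard PyTorch schedulers. The step counts and peak learning rate below are placeholders, not the values used to train Vigogne.

```python
# Illustrative sketch of the two-stage schedule: constant LR, then cosine annealing.
# Step counts and peak learning rate are placeholders, not the Vigogne training values.
import torch
from torch.optim.lr_scheduler import ConstantLR, CosineAnnealingLR, SequentialLR

params = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for model parameters
optimizer = torch.optim.AdamW(params, lr=3e-4)  # placeholder peak learning rate

phase1_steps = 10_000   # Phase 1: constant LR on French FineWeb-2
phase2_steps = 2_000    # Phase 2: cosine decay on the annealing mixture

scheduler = SequentialLR(
    optimizer,
    schedulers=[
        ConstantLR(optimizer, factor=1.0, total_iters=phase1_steps),
        CosineAnnealingLR(optimizer, T_max=phase2_steps),
    ],
    milestones=[phase1_steps],
)

for _ in range(phase1_steps + phase2_steps):
    optimizer.step()      # the actual training step would go here
    scheduler.step()
```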
Phase 3: Supervised Fine-Tuning (SFT)

In this phase, the annealed models were fine-tuned on meticulously curated in-house instruction data.

Evaluation Results

The models were evaluated on tasks from the French Bench suite, with results detailed in the table below:

| Model | Reading Comp | ARC Challenge | HellaSwag | Grammar | BoolQA | Vocab | Avg |
|---|---|---|---|---|---|---|---|
| CroissantLLMBase | 0.6197 | 0.2258 | 0.3918 | 0.7815 | 0.4887 | 0.7815 | 0.5481 |
| SmolLM2-1.7B | 0.5211 | 0.2592 | 0.3327 | 0.6134 | 0.5506 | 0.5966 | 0.4789 |
| Mistral-7B-v0.3 | 0.6619 | 0.3806 | 0.4729 | 0.7563 | 0.4943 | 0.7815 | 0.5912 |
| Lucie-7B | 0.6338 | 0.4097 | 0.4925 | 0.7983 | 0.5505 | 0.8151 | 0.6166 |
| Llama-3.2-1B | 0.5493 | 0.2387 | 0.3548 | 0.6891 | 0.5674 | 0.7563 | 0.5259 |
| Vigogne_Llama-3.2-1B | 0.6338 | 0.2814 | 0.4136 | 0.7647 | 0.5561 | 0.7983 | 0.5747 |
| Qwen2.5-1.5B | 0.5915 | 0.3045 | 0.3821 | 0.7563 | 0.7191 | 0.7479 | 0.5836 |
| Vigogne_Qwen2.5-1.5B | 0.6619 | 0.3122 | 0.4514 | 0.8403 | 0.5393 | 0.8067 | 0.6019 |
| Llama-3.2-3B | 0.6760 | 0.3550 | 0.4315 | 0.7731 | 0.5000 | 0.7899 | 0.5876 |
| Vigogne_Llama-3.2-3B | 0.6760 | 0.3669 | 0.4897 | 0.8403 | 0.6966 | 0.8403 | 0.6496 |
| Qwen2.5-3B | 0.5774 | 0.3567 | 0.4344 | 0.7563 | 0.8932 | 0.7983 | 0.6361 |
| Vigogne_Qwen2.5-3B | 0.6619 | 0.4080 | 0.4922 | 0.8151 | 0.7247 | 0.8235 | 0.6542 |
| Llama-3.1-8B | 0.7042 | 0.4174 | 0.4881 | 0.7815 | 0.4943 | 0.8067 | 0.6154 |
| Vigogne_Llama-3.1-8B | 0.6760 | 0.4148 | 0.5240 | 0.8067 | 0.7977 | 0.8235 | 0.6738 |

Reproducing Results

To replicate these results, install lm-evaluation-harness and run the following command:

lm_eval --model hf --model_args pretrained=MODEL --tasks TASK --batch_size auto

where MODEL is the Hugging Face identifier of the model to evaluate and TASK is one of the following: french_bench_reading_comp, french_bench_arc_challenge, french_bench_hellaswag, french_bench_grammar, french_bench_boolqa, or french_bench_vocab.
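
Alternatively, recent releases of lm-evaluation-harness (v0.4 and later) expose a Python entry point. The snippet below is a sketch of the same evaluation for a single model/task pair and assumes that API is available in your installed version.

```python
# Sketch: run one French Bench task through the lm-evaluation-harness Python API.
# Assumes lm-evaluation-harness v0.4+ (which exposes lm_eval.simple_evaluate).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=moussaKam/Vigogne_Qwen2.5-3B",
    tasks=["french_bench_grammar"],
    batch_size="auto",
)
print(results["results"]["french_bench_grammar"])
```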

Limitations

  • The models underwent only Supervised Fine-Tuning (SFT) as post-training. Further improvements could be made using additional post-training techniques such as Direct Preference Optimization (DPO). We leave this for future work.
  • The models are of limited capacity and might generate harmful or biased content, incorrect information, or generally unhelpful answers.

Ethical Considerations

Users should be aware of potential biases present in the training data, which may influence model outputs. It is recommended to deploy these models responsibly, especially in sensitive applications where fairness and accuracy are crucial.

Acknowledgement

This work was granted access to the HPC resources of IDRIS under the allocation 2024-GC011015467 made by GENCI.
