
From-scratch pretraining on English-only data: no synthetic data, no code, 3 epochs over 1 GB of data for the ~135M-parameter model.

Test network using Differential Transformer attention. Other than some alterations to the attention, namely 16 heads instead of 9 and the use of differential attention, this is the same setup as https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct
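For reference, a minimal sketch of differential attention (Ye et al., 2024), the mechanism this model swaps in for SmolLM2's grouped-query attention. The head count (16) comes from this card; the hidden size, lambda parameterization, and per-head RMSNorm follow the paper's defaults and SmolLM2-135M's config, and are assumptions about this checkpoint rather than confirmed values:

```python
# Sketch of differential attention: two softmax maps per head, with the second
# subtracted from the first using a learned scalar lambda. Hyperparameters here
# are assumptions (SmolLM2-135M hidden size 576, paper-default lambda schedule).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentialAttention(nn.Module):
    def __init__(self, hidden_size=576, num_heads=16, layer_idx=0):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads // 2  # each head holds two Q/K sub-heads
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.k_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        # Per-head RMSNorm on the differential output (the paper's "GroupNorm"); needs PyTorch >= 2.4.
        self.subln = nn.RMSNorm(2 * self.head_dim, eps=1e-5)
        # Learnable lambda, re-parameterized as in the paper, with a layer-dependent init.
        self.lambda_init = 0.8 - 0.6 * math.exp(-0.3 * layer_idx)
        self.lambda_q1 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
        self.lambda_k1 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
        self.lambda_q2 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
        self.lambda_k2 = nn.Parameter(torch.randn(self.head_dim) * 0.1)

    def forward(self, x, attn_mask=None):
        bsz, seq_len, _ = x.shape
        # Split Q and K into two groups per head; V keeps the full 2*head_dim width.
        q = self.q_proj(x).view(bsz, seq_len, 2 * self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(bsz, seq_len, 2 * self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(bsz, seq_len, self.num_heads, 2 * self.head_dim).transpose(1, 2)

        scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(self.head_dim)
        if attn_mask is not None:
            scores = scores + attn_mask
        probs = F.softmax(scores, dim=-1)
        probs = probs.view(bsz, self.num_heads, 2, seq_len, seq_len)

        # lambda = exp(lq1 . lk1) - exp(lq2 . lk2) + lambda_init
        lam = (torch.exp(self.lambda_q1 @ self.lambda_k1)
               - torch.exp(self.lambda_q2 @ self.lambda_k2)
               + self.lambda_init)

        # Differential attention: subtract the second softmax map from the first.
        attn = probs[:, :, 0] - lam * probs[:, :, 1]
        out = torch.matmul(attn, v)                        # (bsz, heads, seq, 2*head_dim)
        out = self.subln(out) * (1.0 - self.lambda_init)   # per-head norm + rescale
        out = out.transpose(1, 2).reshape(bsz, seq_len, -1)
        return self.o_proj(out)
```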

Scripts:

  • inference.py runs the model on a few test prompts.
  • test_train.py reproduces training with the exact configuration used for this model. Training data is expected in JSONL format, one {"text": "..."} object per line (see the sketch below).
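
A minimal sketch of the expected data layout; the file name and sample strings are illustrative, not part of the release:

```python
# Write a JSONL training file: one {"text": ...} object per line.
import json

examples = [
    {"text": "example text"},
    {"text": "another training document"},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```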

Notes:

The model appears to be very competent: it learned significantly faster than the GQA control and achieved a slightly better minimum loss. Runtime at this scale is roughly on par with the GQA/MHA control.

Training Metrics

Dataset Information

  • Training data per epoch: 1 GB
  • Total tokens trained: 48,261,120
  • No synthetic data

Training Results

  • Final Train Loss: 2.8485
  • Final Train Perplexity: 17.15
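
As a quick sanity check, assuming the reported perplexity is the exponential of the mean cross-entropy loss (small differences come from how the logged loss is averaged or smoothed):

```python
import math
# Perplexity = exp(mean cross-entropy loss); exp(2.8485) ≈ 17.3,
# consistent with the reported 17.15 up to loss averaging/smoothing.
print(math.exp(2.8485))  # ~17.26
```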

