# Darija-GPT: Small Multilingual Language Model (Darija Arabic)

## Model Description
This is a small multilingual language model based on a Transformer architecture (GPT-like). It is trained from scratch on a subset of Wikipedia data in Moroccan Darija (`ary`) for demonstration and experimentation.
## Architecture
- Transformer-based language model (decoder-only).
- Reduced model dimensions (`n_embd=768`, `n_head=12`, `n_layer=12`) for faster training and a smaller model size, making it suitable for resource-constrained environments (see the configuration sketch below).
- Uses a Byte-Pair Encoding (BPE) tokenizer trained on the same Wikipedia data.
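For reference, here is a minimal configuration sketch built with the Hugging Face `GPT2Config` class. The dimensions come from this card; the GPT-2-style implementation and the `vocab_size` value are assumptions, since the card does not state the exact code used.

```python
# Minimal sketch of a comparable decoder-only model (assumption: a GPT-2-style
# implementation; vocab_size is a placeholder, not the card's actual value).
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=32_000,  # placeholder; set to the trained BPE tokenizer's vocabulary size
    n_embd=768,         # embedding / hidden dimension (from this card)
    n_head=12,          # attention heads per layer
    n_layer=12,         # decoder blocks
)
model = GPT2LMHeadModel(config)  # randomly initialized; training from scratch starts here
```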
## Training Data

- Trained on a Wikipedia subset in the following language:
  - `ary` (Moroccan Arabic / Darija)
- The dataset is prepared and encoded to be efficient for training smaller models (a data-preparation sketch follows this list).
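As an illustration only, the sketch below shows one way such a corpus and its BPE tokenizer could be prepared with the `datasets` and `tokenizers` libraries. The dump name, vocabulary size, and special tokens are assumptions; this card does not document the exact preparation pipeline.

```python
# Hypothetical data-preparation sketch; dump name, vocab size, and special
# tokens are assumptions rather than the card's documented settings.
from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Assumes the wikimedia/wikipedia dump on the Hub exposes an `ary` configuration.
ds = load_dataset("wikimedia/wikipedia", "20231101.ary", split="train")

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=32_000,  # placeholder vocabulary size
                              special_tokens=["<unk>", "<s>", "</s>"])
tokenizer.train_from_iterator((row["text"] for row in ds), trainer=trainer)
tokenizer.save("darija_bpe_tokenizer.json")
```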
## Limitations
- Small Model: Parameter count is limited to approximately 30 million, resulting in reduced capacity compared to larger models.
- Limited Training Data: Trained on a subset of Wikipedia, which is relatively small compared to massive datasets used for state-of-the-art models.
- Not State-of-the-Art: Performance is not expected to be cutting-edge due to size and data limitations.
- Potential Biases: May exhibit biases from the Wikipedia training data and may not generalize perfectly to all Darija dialects or real-world text.
## Intended Use
- Primarily for research and educational purposes.
- Demonstrating language modeling in ary.
- As a starting point for further experimentation in low-resource NLP, model compression, or fine-tuning on specific Darija tasks (see the fine-tuning sketch after this list).
- For non-commercial use only.
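For the fine-tuning use case, here is a hedged sketch using the `transformers` Trainer API. The corpus file (`my_darija_corpus.txt`), hyperparameters, and padding setup are illustrative assumptions, not the authors' training recipe.

```python
# Hypothetical fine-tuning sketch; corpus file, hyperparameters, and padding
# setup are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("Duino/Darija-GPT")
model = AutoModelForCausalLM.from_pretrained("Duino/Darija-GPT")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # assumption: the tokenizer defines an EOS token

# "my_darija_corpus.txt" is a hypothetical plain-text file, one sample per line.
raw = load_dataset("text", data_files={"train": "my_darija_corpus.txt"})
tokenized = raw.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM objective
args = TrainingArguments(output_dir="darija-gpt-finetuned",
                         per_device_train_batch_size=8,
                         num_train_epochs=1,
                         learning_rate=5e-5)
trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"], data_collator=collator)
trainer.train()
```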
## How to Use

You can use this model with the `transformers` library from Hugging Face. Make sure you have `transformers` installed (`pip install transformers`).
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Duino/Darija-GPT")
model = AutoModelForCausalLM.from_pretrained("Duino/Darija-GPT")

prompt_text = "هذا نموذج لغوي صغير"  # Example prompt in Arabic/Darija
input_ids = tokenizer.encode(prompt_text, return_tensors="pt").to(model.device)

# Generate text (adjust max_new_tokens, temperature, top_p as needed).
# do_sample=True is needed for temperature/top_p to take effect.
output = model.generate(input_ids, max_new_tokens=50, do_sample=True, temperature=0.9, top_p=0.9)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print("Prompt:", prompt_text)
print("Generated text:", generated_text)
```
## Training Plot
This plot shows the training and validation loss curves over epochs.