# Darija-GPT: Small Multilingual Language Model (Darija Arabic)

## Model Description
This is a small multilingual language model based on a Transformer architecture (GPT-like). It is trained from scratch on a subset of Wikipedia data in Moroccan Darija (`ary`) for demonstration and experimentation.
## Architecture
- Transformer-based language model (decoder-only).
- Reduced model dimensions (`n_embd=768`, `n_head=12`, `n_layer=12`) for faster training and a smaller model size, making it suitable for resource-constrained environments (see the configuration sketch below).
- Uses a Byte-Pair Encoding (BPE) tokenizer trained on the same Wikipedia data.
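For reference, here is a minimal configuration sketch built with the Hugging Face `GPT2Config` class. The dimensions come from this card; the GPT-2-style implementation and the `vocab_size` value are assumptions, since the card does not state the exact code used.

```python
# Minimal sketch of a comparable decoder-only model (assumption: a GPT-2-style
# implementation; vocab_size is a placeholder, not the card's actual value).
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=32_000,  # placeholder; set to the trained BPE tokenizer's vocabulary size
    n_embd=768,         # embedding / hidden dimension (from this card)
    n_head=12,          # attention heads per layer
    n_layer=12,         # decoder blocks
)
model = GPT2LMHeadModel(config)  # randomly initialized; training from scratch starts here
```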
## Training Data

- Trained on a Wikipedia subset in the following language:
  - `ary` (Moroccan Arabic / Darija)
- The dataset is prepared and encoded to be efficient for training smaller models (a data-preparation sketch follows this list).
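As an illustration only, the sketch below shows one way such a corpus and its BPE tokenizer could be prepared with the `datasets` and `tokenizers` libraries. The dump name, vocabulary size, and special tokens are assumptions; this card does not document the exact preparation pipeline.

```python
# Hypothetical data-preparation sketch; dump name, vocab size, and special
# tokens are assumptions rather than the card's documented settings.
from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Assumes the wikimedia/wikipedia dump on the Hub exposes an `ary` configuration.
ds = load_dataset("wikimedia/wikipedia", "20231101.ary", split="train")

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=32_000,  # placeholder vocabulary size
                              special_tokens=["<unk>", "<s>", "</s>"])
tokenizer.train_from_iterator((row["text"] for row in ds), trainer=trainer)
tokenizer.save("darija_bpe_tokenizer.json")
```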
## Limitations
- Small Model: Parameter count is limited to approximately 30 million, resulting in reduced capacity compared to larger models.
- Limited Training Data: Trained on a subset of Wikipedia, which is relatively small compared to massive datasets used for state-of-the-art models.
- Not State-of-the-Art: Performance is not expected to be cutting-edge due to size and data limitations.
- Potential Biases: May exhibit biases from the Wikipedia training data and may not generalize perfectly to all Darija dialects or real-world text.
## Intended Use
- Primarily for research and educational purposes.
- Demonstrating language modeling in ary.
- As a starting point for further experimentation in low-resource NLP, model compression, or fine-tuning on specific Darija tasks (see the fine-tuning sketch after this list).
- For non-commercial use only.
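For the fine-tuning use case, here is a hedged sketch using the `transformers` Trainer API. The corpus file (`my_darija_corpus.txt`), hyperparameters, and padding setup are illustrative assumptions, not the authors' training recipe.

```python
# Hypothetical fine-tuning sketch; corpus file, hyperparameters, and padding
# setup are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("Duino/Darija-GPT")
model = AutoModelForCausalLM.from_pretrained("Duino/Darija-GPT")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # assumption: the tokenizer defines an EOS token

# "my_darija_corpus.txt" is a hypothetical plain-text file, one sample per line.
raw = load_dataset("text", data_files={"train": "my_darija_corpus.txt"})
tokenized = raw.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM objective
args = TrainingArguments(output_dir="darija-gpt-finetuned",
                         per_device_train_batch_size=8,
                         num_train_epochs=1,
                         learning_rate=5e-5)
trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"], data_collator=collator)
trainer.train()
```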
## How to Use

You can use this model with the `transformers` library from Hugging Face. Make sure you have `transformers` installed (`pip install transformers`).
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Duino/Darija-GPT")
model = AutoModelForCausalLM.from_pretrained("Duino/Darija-GPT")

prompt_text = "هذا نموذج لغوي صغير"  # Example prompt in Arabic/Darija
input_ids = tokenizer.encode(prompt_text, return_tensors="pt").to(model.device)

# Generate text (adjust max_new_tokens, temperature, top_p as needed).
# do_sample=True is needed for temperature/top_p to take effect.
output = model.generate(input_ids, max_new_tokens=50, do_sample=True, temperature=0.9, top_p=0.9)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

print("Prompt:", prompt_text)
print("Generated text:", generated_text)
```
## Training Plot
This plot shows the training and validation loss curves over epochs.