Kirundi Tokenizer

This is a SentencePiece tokenizer model trained for the Kirundi language. It can be used to tokenize Kirundi text for NLP tasks.

Model Details

  • Model type: SentencePiece
  • Vocabulary size: 32,000
  • Training corpus: A clean corpus of Kirundi text.

Training Data

The tokenizer was trained on a diverse corpus of Kirundi text collected from various sources. Before training, the data was preprocessed to remove unwanted characters and normalized for tokenization.

How to Use

import sentencepiece as spm

# Load the tokenizer
sp = spm.SentencePieceProcessor(model_file='kirundi.model')

# Tokenize text
text = "Ndakunda igihugu canje."
tokens = sp.encode(text, out_type=str)
print(tokens)

# Detokenize text
decoded_text = sp.decode(tokens)
print(decoded_text)