BPE Tokenizer for Nepali LLM

  • This repository contains a Byte Pair Encoding (BPE) tokenizer trained with the Hugging Face transformers package on roughly 30 GB of Nepali text (IRIISNEPAL/Nepali-Text-Corpus and nepberta-dataset).
  • The tokenizer has been optimized for handling Nepali text and is intended for use in language modeling and other natural language processing tasks.

Overview

  • Tokenizer Type: Byte Pair Encoding (BPE)
  • Vocabulary Size: 50,006
  • Datasets Used: IRIISNEPAL/Nepali-Text-Corpus and nepberta-dataset (~30GB combined)

Special tokens

<id>        <token>

0:        <|endoftext|>
1:        <|unk|>
50000:    <|begin_of_text|>
50001:    <|end_of_text|>
50002:    <|start_header_id|>
50003:    <|end_header_id|>
50004:    <|eot_id|>
50005:    '\n\n'
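
The table above can be restated as a plain mapping for quick sanity checks. This is only an illustrative sketch: the names `SPECIAL_TOKENS`, `VOCAB_SIZE`, and `TOKEN_TO_ID` are hypothetical and not part of the repository, and the check assumes that token IDs are zero-based and contiguous, so the highest ID equals the vocabulary size minus one.

```python
# Hypothetical mapping that restates the special-token table from this README.
SPECIAL_TOKENS = {
    0: "<|endoftext|>",
    1: "<|unk|>",
    50000: "<|begin_of_text|>",
    50001: "<|end_of_text|>",
    50002: "<|start_header_id|>",
    50003: "<|end_header_id|>",
    50004: "<|eot_id|>",
    50005: "\n\n",
}

VOCAB_SIZE = 50_006  # value stated in the Overview section

# Assumption: IDs are zero-based and contiguous, so the last ID is VOCAB_SIZE - 1.
highest_id = max(SPECIAL_TOKENS)
assert highest_id == VOCAB_SIZE - 1

# Reverse lookup: token string -> ID.
TOKEN_TO_ID = {token: idx for idx, token in SPECIAL_TOKENS.items()}
print(TOKEN_TO_ID["<|eot_id|>"])
```

With the loaded tokenizer, the same IDs could be cross-checked against `tokenizer.convert_tokens_to_ids`, which would confirm that the table matches the published vocabulary.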

Installation

The tokenizer requires the transformers library, which can be installed via pip:

pip install transformers

Usage

You can easily load the tokenizer using the following code:

from transformers import PreTrainedTokenizerFast

# Load the tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("Aananda-giri/NepaliBPE")

# Example usage
tokenizer.tokenize('राम ले भात खायो ।')
# ['राम</w>', 'ले</w>', 'भात</w>', 'खायो</w>', '।</w>']

tokenizer.encode('राम ले भात खायो ।')
# [1621, 285, 14413, 27675, 251]

tokenizer.decode([1621, 285, 14413, 27675, 251])
# राम ले भात खायो ।
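
In the `tokenize` output above, each token ends with `</w>`, which appears to mark the end of a word. Assuming that convention holds throughout the vocabulary, the sketch below groups tokens back into whole words without loading the tokenizer. The helper `tokens_to_words` is hypothetical and not part of this repository or of transformers.

```python
END_OF_WORD = "</w>"  # suffix seen in the tokenize() output above


def tokens_to_words(tokens):
    """Merge subword tokens into words, assuming '</w>' closes each word."""
    words, current = [], ""
    for token in tokens:
        if token.endswith(END_OF_WORD):
            # Strip the marker and finish the current word.
            current += token[: -len(END_OF_WORD)]
            words.append(current)
            current = ""
        else:
            # Token is a word-internal piece; keep accumulating.
            current += token
    if current:
        # Trailing pieces without a closing marker are kept as a final word.
        words.append(current)
    return words


tokens = ['राम</w>', 'ले</w>', 'भात</w>', 'खायो</w>', '।</w>']
print(" ".join(tokens_to_words(tokens)))
```

For ordinary use, `tokenizer.decode` remains the simpler route; a helper like this is mainly useful when inspecting how a sentence was split into subwords.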