---
{}
---

# BPE Tokenizer for Nepali LLM

This repository contains a Byte Pair Encoding (BPE) tokenizer trained with the Hugging Face `transformers` package on ~30 GB of Nepali text ([IRIISNEPAL/Nepali-Text-Corpus](https://huggingface.co/datasets/IRIISNEPAL/Nepali-Text-Corpus) + [nepberta-dataset](https://nepberta.github.io/)).

The tokenizer is optimized for Nepali text and is intended for language modeling and other natural language processing tasks.

## Overview

- **Tokenizer Type**: Byte Pair Encoding (BPE)
- **Vocabulary Size**: 50,006
- **Dataset Used**: [Nepali LLM Datasets](https://huggingface.co/datasets/Aananda-giri/nepali_llm_datasets)

## Special Tokens

```
0:     <|endoftext|>
1:     <|unk|>
50000: <|begin_of_text|>
50001: <|end_of_text|>
50002: <|start_header_id|>
50003: <|end_header_id|>
50004: <|eot_id|>
50005: '\n\n'
```

## Installation

The tokenizer requires the `transformers` library, which you can install via pip:

```bash
pip install transformers
```

## Usage

Load and use the tokenizer as follows:

```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("Aananda-giri/NepaliBPE")

# Example usage
tokenizer.tokenize('राम ले भात खायो ।')
# ['राम', 'ले', 'भात', 'खायो', '।']

tokenizer.encode('राम ले भात खायो ।')
# [1621, 285, 14413, 27675, 251]

tokenizer.decode([1621, 285, 14413, 27675, 251])
# 'राम ले भात खायो ।'
```
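## How BPE Merges Work

The merge procedure behind a BPE tokenizer can be illustrated with a small pure-Python sketch. This is a toy illustration only, not the actual training code (the real tokenizer was trained with Hugging Face tooling on the corpora listed above); the `bpe_train` helper and the tiny word list are hypothetical examples.

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE trainer: learn `num_merges` merge rules from a list of words.

    Each word starts as a sequence of single characters; on every step the
    most frequent adjacent symbol pair is merged into one new symbol.
    """
    # Word frequency table, with each word split into character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair as the next merge rule.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of the pair.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# 'रा' is the most frequent pair here, so it is merged first,
# then the new symbol 'रा' merges with 'म' to form 'राम'.
merges = bpe_train(["राम", "रामले", "भात"], 2)
# merges == [('र', 'ा'), ('रा', 'म')]
```

The real tokenizer applies the same idea at byte level over the full ~30 GB corpus until the vocabulary reaches 50,006 symbols.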