---
{}
---

# BPE Tokenizer for Nepali LLM

This repository contains a Byte Pair Encoding (BPE) tokenizer trained with the Hugging Face `transformers` package on ~30 GB of Nepali text ([IRIISNEPAL/Nepali-Text-Corpus](https://huggingface.co/datasets/IRIISNEPAL/Nepali-Text-Corpus) + [nepberta-dataset](https://nepberta.github.io/)).

The tokenizer is optimized for Nepali text and is intended for language modeling and other natural language processing tasks.

## Overview

- **Tokenizer Type**: Byte Pair Encoding (BPE)
- **Vocabulary Size**: 50,006
- **Dataset Used**: [Nepali LLM Datasets](https://huggingface.co/datasets/Aananda-giri/nepali_llm_datasets)

## Special Tokens

```
0:     <|endoftext|>
1:     <|unk|>
50000: <|begin_of_text|>
50001: <|end_of_text|>
50002: <|start_header_id|>
50003: <|end_header_id|>
50004: <|eot_id|>
50005: '\n\n'
```

## Installation

The tokenizer requires the `transformers` library, which you can install via pip:

```bash
pip install transformers
```

## Usage

Load and use the tokenizer as follows:

```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("Aananda-giri/NepaliBPE")

# Example usage
tokenizer.tokenize('राम ले भात खायो ।')
# ['राम', 'ले', 'भात', 'खायो', '।']

tokenizer.encode('राम ले भात खायो ।')
# [1621, 285, 14413, 27675, 251]

tokenizer.decode([1621, 285, 14413, 27675, 251])
# 'राम ले भात खायो ।'
```
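## How BPE Merges Work

The merge procedure behind a BPE tokenizer can be illustrated with a small pure-Python sketch. This is a toy illustration only, not the actual training code (the real tokenizer was trained with Hugging Face tooling on the corpora listed above); the `bpe_train` helper and the tiny word list are hypothetical examples.

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE trainer: learn `num_merges` merge rules from a list of words.

    Each word starts as a sequence of single characters; on every step the
    most frequent adjacent symbol pair is merged into one new symbol.
    """
    # Word frequency table, with each word split into character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair as the next merge rule.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of the pair.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# 'रा' is the most frequent pair here, so it is merged first,
# then the new symbol 'रा' merges with 'म' to form 'राम'.
merges = bpe_train(["राम", "रामले", "भात"], 2)
# merges == [('र', 'ा'), ('रा', 'म')]
```

The real tokenizer applies the same idea at byte level over the full ~30 GB corpus until the vocabulary reaches 50,006 symbols.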