---
title: LLMEval Dataset Parser
emoji: ⚡
colorFrom: green
colorTo: gray
sdk: docker
pinned: false
license: mit
short_description: A collection of parsers for LLM benchmark datasets
---
# LLMDataParser
**LLMDataParser** is a Python library that provides parsers for benchmark datasets used in evaluating Large Language Models (LLMs). It offers a unified interface for loading and parsing datasets such as **MMLU**, **GSM8K**, and others, simplifying dataset preparation for LLM evaluation through a consistent API.
**Spaces**: You can also try out the online demo on Hugging Face Spaces:
[LLMEval Dataset Parser Demo](https://huggingface.co/spaces/JeffYang52415/LLMEval-Dataset-Parser)
## Features
- **Unified Interface**: Consistent `DatasetParser` for all datasets.
- **Easy to Use**: Simple methods and built-in Python types.
- **Extensible**: Easily add support for new datasets.
- **Gradio**: Built-in Gradio interface for interactive dataset exploration and testing.
## Installation
### Option 1: Using pip
You can install the package directly with `pip`. Even though the project ships only a `pyproject.toml` (no `setup.py`), a standard `pip install` works.

1. **Clone the Repository**:

   ```bash
   git clone https://github.com/jeff52415/LLMDataParser.git
   cd LLMDataParser
   ```

2. **Install the Package with pip**:

   ```bash
   pip install .
   ```
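If you prefer not to clone the repository, `pip` can typically install straight from the Git URL as well (assuming the default branch builds cleanly from `pyproject.toml`):

```bash
pip install git+https://github.com/jeff52415/LLMDataParser.git
```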
### Option 2: Using Poetry
Poetry manages the virtual environment and dependencies automatically, so you don't need to create a separate environment (for example, with conda) first.

1. **Install Dependencies with Poetry**:

   ```bash
   poetry install
   ```

2. **Activate the Virtual Environment**:

   ```bash
   poetry shell
   ```
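Alternatively, you can run commands inside the Poetry-managed environment without activating a shell, for example to launch the Gradio demo described below:

```bash
poetry run python app.py
```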
## Available Parsers
- **MMLUDatasetParser**
- **MMLUProDatasetParser**
- **MMLUReduxDatasetParser**
- **TMMLUPlusDatasetParser**
- **GSM8KDatasetParser**
- **MATHDatasetParser**
- **MGSMDatasetParser**
- **HumanEvalDatasetParser**
- **HumanEvalDatasetPlusParser**
- **BBHDatasetParser**
- **MBPPDatasetParser**
- **IFEvalDatasetParser**
- **TWLegalDatasetParser**
- **TMLUDatasetParser**
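Each parser is registered under a short name in `ParserRegistry`. The snippet below is a minimal sketch that assumes the registry keys are lowercase dataset names such as `"mmlu"` and `"gsm8k"`; call `ParserRegistry.list_parsers()` to confirm the exact names available in your installed version.

```python
from llmdataparser import ParserRegistry

# print the exact registry keys available in this version of the library
print(ParserRegistry.list_parsers())

# fetch parsers by their registry names ("gsm8k" is an assumed key; verify with the list above)
mmlu_parser = ParserRegistry.get_parser("mmlu")
gsm8k_parser = ParserRegistry.get_parser("gsm8k")
```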
## Quick Start Guide
Here's a simple example demonstrating how to use the library:
```python
from llmdataparser import ParserRegistry

# list all available parsers
ParserRegistry.list_parsers()

# get a parser by its registry name
parser = ParserRegistry.get_parser("mmlu")

# load the underlying dataset
parser.load()  # optional: task_name, split

# parse the loaded data
parser.parse()  # optional: split_names

# inspect parser metadata
print(parser.task_names)
print(parser.split_names)
print(parser.get_dataset_description)
print(parser.get_huggingface_link)
print(parser.total_tasks)

# retrieve the parsed entries
data = parser.get_parsed_data
```
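`get_parsed_data` returns the parsed entries, which you can inspect like ordinary Python objects. The field names vary between datasets, so a safe first step is to look at a single entry before relying on specific attributes; the snippet below only assumes that `data` is an indexable collection:

```python
# number of parsed entries
print(len(data))

# inspect the first entry to see which fields this particular dataset exposes
first_entry = data[0]
print(type(first_entry))
print(first_entry)
```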
We also provide a Gradio demo for interactive testing:
```bash
python app.py
```
## Adding New Dataset Parsers
To add support for a new dataset, please refer to our detailed guide in [docs/adding_new_parser.md](docs/adding_new_parser.md). The guide includes:
- Step-by-step instructions for creating a new parser
- Code examples and templates
- Best practices and common patterns
- Testing guidelines
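As a rough orientation before reading the guide, a new parser generally follows the same load/parse shape as the built-in ones. The sketch below is illustrative only: the class name, method signatures, and the `get_parsed_data` property mirror the Quick Start above, but the real base class and registration mechanism are documented in [docs/adding_new_parser.md](docs/adding_new_parser.md) and may differ.

```python
# Illustrative sketch only -- follow docs/adding_new_parser.md for the real base class,
# registration mechanism, and naming conventions. Everything below is hypothetical.


class MyBenchmarkDatasetParser:
    """Hypothetical parser for a new benchmark dataset."""

    def load(self, task_name=None, split=None):
        # fetch or open the raw dataset here (e.g. with the Hugging Face `datasets` library)
        self._raw_records = []

    def parse(self, split_names=None):
        # convert raw records into the unified entry format used by the other parsers
        self._parsed = list(self._raw_records)

    @property
    def get_parsed_data(self):
        return self._parsed
```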
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
## Contact
For questions or support, please open an issue on GitHub or contact [[email protected]](mailto:[email protected]).