apply_lora_and_quantize / README.md

Steven10429

	---
	title: Apply Lora And Quantize
	emoji: 💬
	colorFrom: yellow
	colorTo: purple
	sdk: gradio
	sdk_version: 5.0.1
	app_file: app.py
	pinned: false
	license: mit
	short_description: apply_lora_and_quantize
	---

	# Model Converter for HuggingFace

	A tool for merging LoRA adapters into base Large Language Models (LLMs), quantizing the merged model, and uploading the results to the Hugging Face Hub.

	## Features

	- 🚀 Automatic system resource detection (CPU/GPU)
	- 🔄 Merge base models with LoRA adapters
	- 📊 Support for 4-bit and 8-bit quantization
	- ☁️ Automatic upload to the Hugging Face Hub

	## Requirements

	- Python 3.8+
	- CUDA-compatible GPU (optional, but recommended)
	- Hugging Face account and access token

	## Installation

	```bash
	pip install -r requirements.txt
	```

	## Configuration

	Create a `.env` file in the project root:
	```
	HF_TOKEN=your_huggingface_token
	```
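
As an illustration of how the token might be picked up at runtime, here is a minimal sketch using only the standard library. The helper names `read_env_file` and `get_hf_token` are hypothetical and not taken from the project; the actual script may use a package such as `python-dotenv` instead.

```python
import os
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_hf_token(env_path=".env"):
    """Prefer an exported HF_TOKEN, then fall back to the .env file."""
    token = os.environ.get("HF_TOKEN")
    if token:
        return token
    if Path(env_path).is_file():
        return read_env_file(env_path).get("HF_TOKEN")
    return None
```

Checking the environment first means a token exported in the shell (or set as a Space secret) takes precedence over the file.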

	## Usage

	Run the script:
	```bash
	python space_convert.py
	```

	You will be prompted to enter:
	1. Base model path (e.g., "Qwen/Qwen2.5-7B-Instruct")
	2. LoRA model path
	3. Target Hugging Face repository name

	The script will:
	1. Check available system resources
	2. Choose the optimal device (GPU/CPU)
	3. Merge the base model with LoRA
	4. Create 8-bit and 4-bit quantized versions
	5. Upload the merged and quantized models to the Hugging Face Hub
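
Steps 3 to 5 above can be sketched roughly as follows. This is not the actual contents of `space_convert.py`: it assumes the merge is done with `peft`, that quantization goes through `bitsandbytes` via `transformers` (which needs a CUDA GPU), and that uploads use `huggingface_hub`. The function names `merge_lora`, `load_quantized`, and `upload_folder_to_hub` are hypothetical.

```python
def merge_lora(base_model_id, lora_path, output_dir):
    """Merge a LoRA adapter into its base model and save the merged weights."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(
        base_model_id, torch_dtype=torch.float16
    )
    model = PeftModel.from_pretrained(base, lora_path)
    merged = model.merge_and_unload()  # folds the adapter into the base weights
    merged.save_pretrained(output_dir)
    AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_dir)
    return output_dir


def load_quantized(model_dir, bits):
    """Reload a saved model in 4-bit or 8-bit precision (assumes bitsandbytes)."""
    if bits not in (4, 8):
        raise ValueError(f"bits must be 4 or 8, got {bits}")
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    if bits == 4:
        config = BitsAndBytesConfig(load_in_4bit=True)
    else:
        config = BitsAndBytesConfig(load_in_8bit=True)
    return AutoModelForCausalLM.from_pretrained(
        model_dir, quantization_config=config, device_map="auto"
    )


def upload_folder_to_hub(folder, repo_id, token):
    """Create the target repository if needed and upload a local folder."""
    from huggingface_hub import HfApi

    api = HfApi(token=token)
    api.create_repo(repo_id=repo_id, exist_ok=True)
    api.upload_folder(folder_path=folder, repo_id=repo_id)
```

The heavy imports sit inside the functions so that the module can be imported, and arguments validated, on a machine without `torch` installed.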

	## Memory Requirements

	- 7B models: ~16 GB RAM/VRAM
	- 14B models: ~32 GB RAM/VRAM
	- Additional disk space: about 3x the model size
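
A back-of-the-envelope check of these figures: the sketch below estimates the size of the weights alone from the parameter count and the bits per parameter. The helper `estimate_weights_gb` is hypothetical, and reading the figures above as 16-bit weights plus some headroom is an assumption, not something the project states.

```python
def estimate_weights_gb(params_billion, bits_per_param):
    """Approximate size of the weights in GB (1 GB = 1e9 bytes).

    Covers weights only; activations and framework overhead are extra.
    """
    return params_billion * bits_per_param / 8


for size in (7, 14):
    for bits in (16, 8, 4):
        gb = estimate_weights_gb(size, bits)
        print(f"{size}B @ {bits}-bit: ~{gb:.1f} GB")
```

For 16-bit weights this gives 14 GB for a 7B model and 28 GB for a 14B model, which is in line with the ~16 GB and ~32 GB figures once some working memory is added.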

	## Note

	The script automatically handles:
	- Resource availability checks
	- Device selection
	- Error handling
	- Progress tracking
	- Model optimization
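
To make the resource check and device selection concrete, here is one way they could work. This is a sketch under assumptions rather than the script's actual logic: `choose_device` and `probe_resources` are hypothetical names, and free memory is assumed to be read through `torch` and `psutil`.

```python
def choose_device(required_gb, free_vram_gb, free_ram_gb):
    """Pick 'cuda' if the GPU has enough free memory, else 'cpu' if RAM suffices."""
    if free_vram_gb is not None and free_vram_gb >= required_gb:
        return "cuda"
    if free_ram_gb >= required_gb:
        return "cpu"
    raise MemoryError(
        f"need ~{required_gb} GB, but only {free_ram_gb:.1f} GB RAM is free"
    )


def probe_resources():
    """Return (free_vram_gb or None, free_ram_gb) for the current machine."""
    import psutil
    import torch

    free_ram_gb = psutil.virtual_memory().available / 1e9
    free_vram_gb = None
    if torch.cuda.is_available():
        free_bytes, _total_bytes = torch.cuda.mem_get_info()
        free_vram_gb = free_bytes / 1e9
    return free_vram_gb, free_ram_gb
```

Keeping the decision in a pure function separate from the hardware probe makes the selection rule easy to test without a GPU, for example `choose_device(16, None, 32)` returns `"cpu"`.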