Qwen2.5-0.5B-quantized.w8a16

Model Overview

  • Model Architecture: Qwen2
    • Input: Text
    • Output: Text
  • Model Optimizations:
    • Weight quantization: INT8
  • Intended Use Cases: Similar to Qwen2.5-0.5B, this is a base language model.
  • Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
  • Release Date: 10/09/2024
  • Version: 1.0
  • Model Developers: Neural Magic

Quantized version of Qwen2.5-0.5B. It achieves an OpenLLMv1 score of 43.9, compared to 44.0 for Qwen2.5-0.5B.

Model Optimizations

This model was obtained by quantizing the weights of Qwen2.5-0.5B to INT8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
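The arithmetic behind the ~50% figure can be sketched as follows (a rough estimate assuming roughly 0.5B parameters and counting weight storage only, ignoring the small per-channel scales and any unquantized tensors):

```python
# Back-of-the-envelope weight-storage estimate (hypothetical parameter count;
# Qwen2.5-0.5B has roughly 0.49B parameters).
params = 0.49e9

bf16_bytes = params * 2   # 16 bits per parameter
int8_bytes = params * 1   # 8 bits per parameter

print(f"BF16 weights: {bf16_bytes / 1e9:.2f} GB")           # ~0.98 GB
print(f"INT8 weights: {int8_bytes / 1e9:.2f} GB")           # ~0.49 GB
print(f"Reduction:    {1 - int8_bytes / bf16_bytes:.0%}")   # 50%
```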

Only the weights of the linear operators within transformer blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps between the INT8 and floating-point representations of the quantized weights. The GPTQ algorithm is applied for quantization, as implemented in the llm-compressor library.
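A minimal NumPy sketch of symmetric per-channel INT8 quantization is shown below. Note this illustrates only the round-to-nearest scheme described above, not the GPTQ algorithm itself, which additionally minimizes layer-wise reconstruction error; the function names are illustrative, not from llm-compressor.

```python
import numpy as np

def quantize_per_channel(w: np.ndarray):
    """Quantize a [out_features, in_features] weight matrix to INT8,
    with one symmetric scale per output channel (row)."""
    # Symmetric: each row's scale maps [-max|w|, +max|w|] onto [-127, 127].
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # Recover an approximation of the original floating-point weights.
    return q.astype(np.float32) * scales

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_per_channel(w)
w_hat = dequantize(q, s)
# Rounding error is bounded by half a quantization step per element.
print(np.abs(w - w_hat).max())
```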

Deployment

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/Qwen2.5-0.5B-quantized.w8a16"
number_gpus = 1
max_model_len = 8192

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Give me a short introduction to large language model."

llm = LLM(model=model_id, tensor_parallel_size=number_gpus, max_model_len=max_model_len)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
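For example, a server can be launched and queried as sketched below (port 8000 is the vLLM default; the exact CLI flags depend on your vLLM version):

```shell
# Launch an OpenAI-compatible server for this model.
vllm serve neuralmagic/Qwen2.5-0.5B-quantized.w8a16 --max-model-len 8192

# Then query the completions endpoint:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "neuralmagic/Qwen2.5-0.5B-quantized.w8a16", "prompt": "Hello", "max_tokens": 32}'
```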

Evaluation

The model was evaluated on the OpenLLMv1 benchmark, composed of MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande and TruthfulQA. Evaluation was conducted using lm-evaluation-harness and the vLLM engine.

Accuracy

Benchmark (OpenLLM v1)           Qwen2.5-0.5B    Qwen2.5-0.5B-quantized.w8a16 (this model)    Recovery
MMLU (5-shot)                    47.57           47.81                                        100.5%
ARC Challenge (25-shot)          34.90           34.90                                        100.0%
GSM-8k (5-shot, strict-match)    34.19           33.51                                         98.0%
Hellaswag (10-shot)              51.83           51.78                                         99.9%
Winogrande (5-shot)              55.80           55.49                                         99.4%
TruthfulQA (0-shot, mc2)         39.90           39.71                                         99.5%
Average                          44.0            43.9                                          99.6%
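The recovery column is simply the quantized score expressed as a percentage of the baseline score, which can be checked for a couple of rows:

```python
# Recovery = 100 * quantized_score / baseline_score (values from the table).
baseline = {"MMLU": 47.57, "GSM-8k": 34.19}
quantized = {"MMLU": 47.81, "GSM-8k": 33.51}

for task in baseline:
    recovery = 100 * quantized[task] / baseline[task]
    print(f"{task}: {recovery:.1f}%")   # MMLU: 100.5%, GSM-8k: 98.0%
```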

Reproduction

The results were obtained using the following command:

lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Qwen2.5-0.5B-quantized.w8a16",dtype=auto,max_model_len=4096,add_bos_token=True,tensor_parallel_size=1 \
  --tasks openllm \
  --batch_size auto