Llama-3.1-Nemotron-70B-Instruct-W8A8-dynamic

Model Overview

  • Model Architecture: Llama-3.1-Nemotron-70B-Instruct-HF
    • Input: Text
    • Output: Text
  • Model Optimizations:
    • Weight quantization: INT8
    • Activation quantization: INT8
  • Release Date: 2/12/2025
  • Version: 1.0
  • Model Developers: Elias Oenal

Quantized version of Llama-3.1-Nemotron-70B-Instruct-HF.

Model Optimizations

This model was obtained by quantizing the weights and activations of Llama-3.1-Nemotron-70B-Instruct-HF to INT8 (W8A8), making it ready for inference with vLLM. This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformer blocks are quantized; activation scales are computed dynamically at inference time.
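The ~50% saving follows directly from the byte widths involved. A back-of-the-envelope estimate for the 70.6B parameters (weights only, ignoring KV cache and activation memory):

```python
# BF16 stores 2 bytes per parameter; INT8 stores 1 byte.
PARAMS = 70.6e9

bf16_gb = PARAMS * 2 / 1e9  # weight footprint in BF16
int8_gb = PARAMS * 1 / 1e9  # weight footprint in INT8

print(f"BF16: {bf16_gb:.0f} GB, INT8: {int8_gb:.0f} GB "
      f"({int8_gb / bf16_gb:.0%} of original)")
# → BF16: 141 GB, INT8: 71 GB (50% of original)
```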

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend.
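A minimal offline-inference sketch using vLLM's Python API. The `tensor_parallel_size` and `max_model_len` values are illustrative assumptions, not tested settings; tune them for your hardware (the INT8 weights alone occupy roughly 71 GB, so multiple large GPUs are required):

```python
from vllm import LLM, SamplingParams

model_id = "EliasOenal/Llama-3.1-Nemotron-70B-Instruct-W8A8-dynamic"

# Illustrative values: adjust tensor_parallel_size to your GPU count
# and max_model_len to your memory budget.
llm = LLM(model=model_id, tensor_parallel_size=2, max_model_len=4096)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

messages = [
    {"role": "user", "content": "Explain weight quantization in one paragraph."}
]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```

vLLM also supports an OpenAI-compatible server (`vllm serve <model_id>`) if you prefer HTTP-based deployment over the in-process API.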

Creation

This model was created with llm-compressor, using the neuralmagic/LLM_compression_calibration dataset for calibration.
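For reference, a typical llm-compressor one-shot INT8 recipe looks like the sketch below. This is an assumed, representative recipe (SmoothQuant followed by GPTQ with the `W8A8` scheme), not necessarily the exact configuration used to produce this checkpoint:

```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Assumed recipe: smooth activation outliers, then quantize all Linear
# layers (except the output head) to the W8A8 scheme.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
    dataset="neuralmagic/LLM_compression_calibration",
    recipe=recipe,
    output_dir="Llama-3.1-Nemotron-70B-Instruct-W8A8-dynamic",
    max_seq_length=2048,           # illustrative calibration settings
    num_calibration_samples=512,
)
```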

Safetensors

  • Model size: 70.6B params
  • Tensor types: BF16, I8