Language Model Evaluation Results

Overview

This document presents the evaluation results of Llama-3.1-8B-Instruct-gptq-4bit using the Language Model Evaluation Harness on the ARC-Challenge benchmark.
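
The numbers below can be reproduced with the harness's Python API. The following is a minimal sketch, assuming lm-evaluation-harness v0.4+ and a GPTQ backend (e.g. auto-gptq) are installed and that the checkpoint name matches the repository cited at the end of this card:

import lm_eval

# Zero-shot ARC-Challenge evaluation of the 4-bit GPTQ checkpoint,
# mirroring the settings reported below (float16, batch size 1, single GPU).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit,"
        "dtype=float16"
    ),
    tasks=["arc_challenge"],
    num_fewshot=0,
    batch_size=1,
    device="cuda:0",
)
print(results["results"]["arc_challenge"])  # acc, acc_stderr, acc_norm, acc_norm_stderr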


πŸ“Š Evaluation Summary

Metric                                 | Value | Description                                    | Original (full precision)
Accuracy (acc,none)                    | 47.1% | Raw accuracy: percentage of correct answers.   | 53.1%
Standard Error (acc_stderr,none)       | 1.46% | Uncertainty in the accuracy estimate.          | 1.45%
Normalized Accuracy (acc_norm,none)    | 49.9% | Accuracy after dataset-specific normalization. | 56.8%
Standard Error (acc_norm_stderr,none)  | 1.46% | Uncertainty for the normalized accuracy.       | 1.45%

πŸ“Œ Interpretation:

  • The model correctly answered 47.1% of the questions.
  • After normalization, the accuracy slightly improves to 49.9%.
  • The standard error (~1.46%) indicates a small margin of uncertainty.

βš™οΈ Model Configuration

  • Model: Llama-3.1-8B-Instruct-gptq-4bit
  • Parameters: 1.05 billion (Quantized 4-bit model)
  • Source: Hugging Face (hf)
  • Precision: torch.float16
  • Hardware: NVIDIA A100 80GB PCIe
  • CUDA Version: 12.4
  • PyTorch Version: 2.6.0+cu124
  • Batch Size: 1
  • Evaluation Time: 365.89 seconds (~6 minutes)

πŸ“Œ Interpretation:

  • The evaluation was performed on a high-performance GPU (A100 80GB).
  • The model is 4-bit quantized, reducing memory usage but possibly affecting accuracy.
  • A single-sample batch size was used, which might slow evaluation speed.

πŸ“‚ Dataset Information

  • Dataset: AI2 ARC-Challenge
  • Task Type: Multiple Choice
  • Number of Samples Evaluated: 1,172
  • Few-shot Examples Used: 0 (Zero-shot setting)

πŸ“Œ Interpretation:

  • This benchmark assesses grade-school-level scientific reasoning.
  • Since no few-shot examples were provided, the model was evaluated in a pure zero-shot setting.

πŸ“ˆ Performance Insights

  • The "higher_is_better" flag confirms that higher accuracy is preferred.
  • The model's raw accuracy (47.1%) is moderate compared to state-of-the-art models (60–80% on ARC-Challenge).
  • Quantization Impact: The 4-bit quantized model might perform slightly worse than a full-precision version.
  • Zero-shot Limitation: Performance could improve with few-shot prompting (providing examples before testing).

πŸ“Œ Let us know if you need further analysis or model tuning! πŸš€

Citation

If you use this model in your research or project, please cite it as follows:

πŸ“Œ Dr. Wasif Masood (2024). 4bit Llama-3.1-8B-Instruct. Version 1.0.
Available at: https://huggingface.co/empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit

BibTeX:

@misc{rwmasood2024,
  author    = {Dr. Wasif Masood and Empirisch Tech GmbH},
  title     = {Llama-3.1-8B 4 bit quantized},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit},
  version   = {1.0},
  license   = {llama3.1},
  institution = {Empirisch Tech GmbH}
}