Language Model Evaluation Results

Overview

This document presents the evaluation results of Llama-3.1-8B-Instruct-gptq-4bit using the Language Model Evaluation Harness on the ARC-Challenge benchmark.
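
The numbers below can be reproduced with the harness's Python API. The following is a minimal sketch, assuming lm-evaluation-harness v0.4+ and a GPTQ backend (e.g. auto-gptq) are installed and that the checkpoint name matches the repository cited at the end of this card:

import lm_eval

# Zero-shot ARC-Challenge evaluation of the 4-bit GPTQ checkpoint,
# mirroring the settings reported below (float16, batch size 1, single GPU).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit,"
        "dtype=float16"
    ),
    tasks=["arc_challenge"],
    num_fewshot=0,
    batch_size=1,
    device="cuda:0",
)
print(results["results"]["arc_challenge"])  # acc, acc_stderr, acc_norm, acc_norm_stderr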


πŸ“Š Evaluation Summary

Metric                                 | Value | Description                                    | Original (full precision)
Accuracy (acc,none)                    | 47.1% | Raw accuracy: percentage of correct answers.   | 53.1%
Standard Error (acc_stderr,none)       | 1.46% | Uncertainty in the accuracy estimate.          | 1.45%
Normalized Accuracy (acc_norm,none)    | 49.9% | Accuracy after dataset-specific normalization. | 56.8%
Standard Error (acc_norm_stderr,none)  | 1.46% | Uncertainty for the normalized accuracy.       | 1.45%

πŸ“Œ Interpretation:

  • The model correctly answered 47.1% of the questions.
  • After normalization, the accuracy slightly improves to 49.9%.
  • The standard error (~1.46%) indicates a small margin of uncertainty.

βš™οΈ Model Configuration

  • Model: Llama-3.1-8B-Instruct-gptq-4bit
  • Parameters: 1.05 billion (Quantized 4-bit model)
  • Source: Hugging Face (hf)
  • Precision: torch.float16
  • Hardware: NVIDIA A100 80GB PCIe
  • CUDA Version: 12.4
  • PyTorch Version: 2.6.0+cu124
  • Batch Size: 1
  • Evaluation Time: 365.89 seconds (~6 minutes)

πŸ“Œ Interpretation:

  • The evaluation was performed on a high-performance GPU (A100 80GB).
  • The model is 4-bit quantized, reducing memory usage but possibly affecting accuracy.
  • A single-sample batch size was used, which might slow evaluation speed.

πŸ“‚ Dataset Information

  • Dataset: AI2 ARC-Challenge
  • Task Type: Multiple Choice
  • Number of Samples Evaluated: 1,172
  • Few-shot Examples Used: 0 (Zero-shot setting)

πŸ“Œ Interpretation:

  • This benchmark assesses grade-school-level scientific reasoning.
  • Since no few-shot examples were provided, the model was evaluated in a pure zero-shot setting.

πŸ“ˆ Performance Insights

  • The "higher_is_better" flag confirms that higher accuracy is preferred.
  • The model's raw accuracy (47.1%) is moderate compared to state-of-the-art models (60–80% on ARC-Challenge).
  • Quantization Impact: The 4-bit quantized model might perform slightly worse than a full-precision version.
  • Zero-shot Limitation: Performance could improve with few-shot prompting (providing examples before testing).

πŸ“Œ Let us know if you need further analysis or model tuning! πŸš€

Citation

If you use this model in your research or project, please cite it as follows:

πŸ“Œ Dr. Wasif Masood (2024). 4bit Llama-3.1-8B-Instruct. Version 1.0.
Available at: https://huggingface.co/empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit

BibTeX:

@misc{rwmasood2024,
  author    = {Dr. Wasif Masood and Empirisch Tech GmbH},
  title     = {Llama-3.1-8B 4 bit quantized},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit},
  version   = {1.0},
  license   = {llama3.1},
  institution = {Empirisch Tech GmbH}
}