# Language Model Evaluation Results
## Overview
This document presents the evaluation results of Llama-3.1-8B-Instruct-gptq-4bit
using the Language Model Evaluation Harness on the ARC-Challenge benchmark.
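For reproducibility, a run with this configuration can be expressed through the harness's Python API. The sketch below is an illustration, not necessarily the exact invocation used for this card; the model arguments simply mirror the configuration reported further down.

```python
# Minimal sketch: evaluating this checkpoint on ARC-Challenge with
# lm-evaluation-harness (pip install lm-eval). The exact invocation used
# for this card is an assumption; settings mirror those reported below.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face backend, as listed under "Source"
    model_args=(
        "pretrained=empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit,"
        "dtype=float16"
    ),
    tasks=["arc_challenge"],
    num_fewshot=0,   # zero-shot, matching this evaluation
    batch_size=1,
    device="cuda:0",
)
print(results["results"]["arc_challenge"])
```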
## 📊 Evaluation Summary
| Metric | Value | Description | Original (unquantized) |
|---|---|---|---|
| Accuracy (`acc,none`) | 47.1% | Raw accuracy: percentage of correct answers. | 53.1% |
| Standard Error (`acc_stderr,none`) | 1.46% | Uncertainty in the accuracy estimate. | 1.45% |
| Normalized Accuracy (`acc_norm,none`) | 49.9% | Accuracy after dataset-specific normalization. | 56.8% |
| Standard Error (`acc_norm_stderr,none`) | 1.46% | Uncertainty for the normalized accuracy. | 1.45% |
**📌 Interpretation:**
- The model correctly answered 47.1% of the questions.
- After normalization, the accuracy slightly improves to 49.9%.
- The standard error (~1.46%) indicates a small margin of uncertainty.
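To make that margin concrete: under the normal approximation, a 95% confidence interval spans about ±1.96 standard errors around the point estimate, i.e. roughly 44.2–50.0% for the raw accuracy. A quick check using the figures from the table above:

```python
# 95% confidence interval from the reported accuracy and standard error.
acc, stderr = 0.471, 0.0146   # acc,none and acc_stderr,none from the table

z = 1.96  # normal-approximation z-score for a 95% interval
low, high = acc - z * stderr, acc + z * stderr
print(f"95% CI: [{low:.3f}, {high:.3f}]")  # -> 95% CI: [0.442, 0.500]
```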
## ⚙️ Model Configuration

- **Model:** `Llama-3.1-8B-Instruct-gptq-4bit`
- **Parameters:** 1.05 billion (likely the count of the packed 4-bit weight tensors as reported by the harness; the underlying Llama-3.1 model has ~8 billion parameters)
- **Source:** Hugging Face (`hf`)
- **Precision:** `torch.float16`
- **Hardware:** NVIDIA A100 80GB PCIe
- **CUDA Version:** 12.4
- **PyTorch Version:** 2.6.0+cu124
- **Batch Size:** 1
- **Evaluation Time:** 365.89 seconds (~6 minutes)
**📌 Interpretation:**
- The evaluation ran on a high-performance GPU (NVIDIA A100 80GB PCIe).
- The model is 4-bit GPTQ-quantized, which reduces memory usage but can cost some accuracy (see the loading sketch after this list).
- A batch size of 1 keeps memory usage minimal but slows evaluation.
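For context, a GPTQ checkpoint like this one is normally loaded through `transformers`, which reads the quantization settings from the repository's config. A minimal sketch, assuming a GPTQ backend (e.g. auto-gptq or gptqmodel) is installed; this is not the card author's own loading code:

```python
# Sketch: loading the 4-bit GPTQ checkpoint. Requires a GPTQ backend
# (e.g. auto-gptq or gptqmodel) alongside transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # matches the precision reported above
    device_map="auto",          # places layers on the available GPU(s)
)

prompt = "Which gas do plants absorb during photosynthesis?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```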
## 📂 Dataset Information

- **Dataset:** AI2 ARC-Challenge
- **Task Type:** Multiple Choice
- **Number of Samples Evaluated:** 1,172
- **Few-shot Examples Used:** 0 (zero-shot setting)
**📌 Interpretation:**
- This benchmark assesses grade-school-level scientific reasoning.
- Since no few-shot examples were provided, the model was evaluated in a pure zero-shot setting.
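For orientation, the 1,172 samples correspond to the ARC-Challenge test split, which is publicly available through the `datasets` library. A small sketch for inspecting the data (field names follow the `allenai/ai2_arc` dataset on Hugging Face):

```python
# Sketch: inspecting the ARC-Challenge test split used in this evaluation.
from datasets import load_dataset

arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")
print(len(arc))             # 1172 multiple-choice questions
sample = arc[0]
print(sample["question"])   # grade-school-level science question
print(sample["choices"])    # e.g. {'text': [...], 'label': ['A', 'B', 'C', 'D']}
print(sample["answerKey"])  # the correct label
```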
## 🚀 Performance Insights

- The `higher_is_better` flag confirms that higher accuracy is preferred.
- The model's raw accuracy (47.1%) is moderate compared to state-of-the-art models (60–80% on ARC-Challenge).
- **Quantization Impact:** The 4-bit quantized model scores below the full-precision original (47.1% vs. 53.1% raw accuracy), the expected trade-off for the smaller memory footprint.
- **Zero-shot Limitation:** Performance could improve with few-shot prompting (providing worked examples before the test question); see the sketch below.
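A few-shot re-run only requires changing `num_fewshot` in the hypothetical invocation from the overview; 25-shot is a common ARC-Challenge convention (e.g. on the Open LLM Leaderboard):

```python
# Sketch: the same evaluation with few-shot prompting (this card's run
# was 0-shot; 25-shot is a common ARC-Challenge convention).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=25,
    batch_size=1,
)
print(results["results"]["arc_challenge"]["acc_norm,none"])
```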
Let us know if you need further analysis or model tuning!
## Citation
If you use this model in your research or project, please cite it as follows:
Dr. Wasif Masood (2024). *4bit Llama-3.1-8B-Instruct*. Version 1.0.
Available at: https://huggingface.co/empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit
BibTeX:
```bibtex
@dataset{rwmasood2024,
  author      = {Dr. Wasif Masood and Empirisch Tech GmbH},
  title       = {Llama-3.1-8B 4 bit quantized},
  year        = {2024},
  publisher   = {Hugging Face},
  url         = {https://huggingface.co/empirischtech/Meta-Llama-3.1-8B-Instruct-gptq-4bit},
  version     = {1.0},
  license     = {llama3.1},
  institution = {Empirisch Tech GmbH}
}
```