Overview

This document presents evaluation results for DeepSeek-R1-Distill-Llama-70B quantized to 4-bit with GPTQ, evaluated using the Language Model Evaluation Harness (lm-evaluation-harness) on the ARC-Challenge benchmark.
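
For reference, here is a minimal sketch of how such a run can be reproduced with the harness's Python API (a sketch, assuming lm-eval >= 0.4; the exact `model_args` may need adjusting for your environment):

```python
# Minimal reproduction sketch, assuming lm-eval >= 0.4 is installed.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit,"
        "dtype=float16"
    ),
    tasks=["arc_challenge"],
    num_fewshot=0,   # zero-shot, matching this evaluation
    batch_size=1,
)
print(results["results"]["arc_challenge"])  # acc,none / acc_norm,none / stderrs
```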


📊 Evaluation Summary

| Metric | 4-bit (this model) | 8-bit variant | Description |
|---|---|---|---|
| Accuracy (`acc,none`) | 21.2% | 21.2% | Raw accuracy: percentage of correct answers. |
| Standard Error (`acc_stderr,none`) | 1.19% | 1.2% | Uncertainty in the accuracy estimate. |
| Normalized Accuracy (`acc_norm,none`) | 25.4% | 25.2% | Accuracy after dataset-specific normalization. |
| Standard Error (`acc_norm_stderr,none`) | 1.27% | 1.3% | Uncertainty in the normalized accuracy estimate. |

📌 Interpretation:

  • The model correctly answered 21.2% of the questions.
  • After normalization, accuracy improves to 25.4%.
  • The standard errors (~1.19% raw, ~1.27% normalized) indicate a small margin of uncertainty; the sketch below converts this into a confidence interval.
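
As a quick illustration of what the standard error means, the 95% confidence interval is acc ± 1.96 × stderr (plain arithmetic on the values reported above):

```python
# 95% confidence interval from the reported raw accuracy and standard error.
acc, stderr = 0.212, 0.0119
low, high = acc - 1.96 * stderr, acc + 1.96 * stderr
print(f"95% CI for raw accuracy: [{low:.3f}, {high:.3f}]")  # [0.189, 0.235]
```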

⚙️ Model Configuration

  • Model: DeepSeek-R1-Distill-Llama-70B
  • Parameters: 70 billion
  • Quantization: 4-bit GPTQ
  • Source: Hugging Face (hf)
  • Precision: torch.float16
  • Hardware: NVIDIA A100 80GB PCIe
  • CUDA Version: 12.4
  • PyTorch Version: 2.6.0+cu124
  • Batch Size: 1
  • Evaluation Time: 365.89 seconds (~6 minutes)

📌 Interpretation:

  • The evaluation was performed on a high-performance GPU (A100 80GB).
  • The model is significantly larger than the previously evaluated 8B version; GPTQ 4-bit quantization reduces its memory footprint (a loading sketch follows below).
  • A batch size of 1 was used, which slows evaluation throughput.
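
A minimal loading sketch, assuming a recent transformers release with the optimum / auto-gptq (or gptqmodel) integration installed, so the GPTQ quantization config stored in the checkpoint is picked up automatically:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" spreads the quantized weights across available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
)
```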

📂 Dataset Information

  • Dataset: AI2 ARC-Challenge
  • Task Type: Multiple Choice
  • Number of Samples Evaluated: 1,172 (a loading snippet follows at the end of this section)
  • Few-shot Examples Used: 0 (zero-shot setting)

📌 Interpretation:

  • This benchmark assesses grade-school-level scientific reasoning.
  • Since no few-shot examples were provided, the model was evaluated in a pure zero-shot setting.
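
The benchmark data can be inspected directly with the datasets library; the ARC-Challenge test split holds the 1,172 questions evaluated here:

```python
from datasets import load_dataset

# Official AI2 ARC dataset on the Hub; "ARC-Challenge" is the hard subset.
arc = load_dataset("allenai/ai2_arc", "ARC-Challenge")
print(len(arc["test"]))            # 1172 multiple-choice questions
print(arc["test"][0]["question"])  # a grade-school science question
```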

📈 Performance Insights

  • The "higher_is_better" flag confirms that higher accuracy is preferred.
  • The model's raw accuracy (21.2%) is significantly lower compared to state-of-the-art models (60–80% on ARC-Challenge).
  • Quantization Impact: The 4-bit GPTQ quantization reduces memory usage but may also impact accuracy slightly.
  • Zero-shot Limitation: Performance could improve with few-shot prompting (providing examples before testing).
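
For example, a few-shot re-run is a one-parameter change to the harness call shown in the overview (a sketch; whether it helps this distilled model is untested here):

```python
import lm_eval

# Same evaluation as the zero-shot run, but with 5 in-context examples.
results_5shot = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit,"
        "dtype=float16"
    ),
    tasks=["arc_challenge"],
    num_fewshot=5,
    batch_size=1,
)
```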

📌 Let us know if you need further analysis or model tuning! 🚀
