Overview

This document presents evaluation results for DeepSeek-R1-Distill-Llama-70B quantized to 4-bit with GPTQ, evaluated using the Language Model Evaluation Harness (lm-evaluation-harness) on the ARC-Challenge benchmark.
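
For reference, here is a minimal sketch of how such a run can be reproduced with the harness's Python API (a sketch, assuming lm-eval >= 0.4; the exact `model_args` may need adjusting for your environment):

```python
# Minimal reproduction sketch, assuming lm-eval >= 0.4 is installed.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit,"
        "dtype=float16"
    ),
    tasks=["arc_challenge"],
    num_fewshot=0,   # zero-shot, matching this evaluation
    batch_size=1,
)
print(results["results"]["arc_challenge"])  # acc,none / acc_norm,none / stderrs
```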


📊 Evaluation Summary

| Metric | 4-bit (this model) | 8-bit variant | Description |
|---|---|---|---|
| Accuracy (`acc,none`) | 21.2% | 21.2% | Raw accuracy: percentage of correct answers. |
| Standard Error (`acc_stderr,none`) | 1.19% | 1.2% | Uncertainty in the accuracy estimate. |
| Normalized Accuracy (`acc_norm,none`) | 25.4% | 25.2% | Accuracy after dataset-specific normalization. |
| Standard Error (`acc_norm_stderr,none`) | 1.27% | 1.3% | Uncertainty in the normalized accuracy estimate. |

📌 Interpretation:

  • The model correctly answered 21.2% of the questions.
  • After normalization, accuracy improves to 25.4%.
  • The standard errors (~1.19% raw, ~1.27% normalized) indicate a small margin of uncertainty; the sketch below converts this into a confidence interval.
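
As a quick illustration of what the standard error means, the 95% confidence interval is acc ± 1.96 × stderr (plain arithmetic on the values reported above):

```python
# 95% confidence interval from the reported raw accuracy and standard error.
acc, stderr = 0.212, 0.0119
low, high = acc - 1.96 * stderr, acc + 1.96 * stderr
print(f"95% CI for raw accuracy: [{low:.3f}, {high:.3f}]")  # [0.189, 0.235]
```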

⚙️ Model Configuration

  • Model: DeepSeek-R1-Distill-Llama-70B
  • Parameters: 70 billion
  • Quantization: 4-bit GPTQ
  • Source: Hugging Face (hf)
  • Precision: torch.float16
  • Hardware: NVIDIA A100 80GB PCIe
  • CUDA Version: 12.4
  • PyTorch Version: 2.6.0+cu124
  • Batch Size: 1
  • Evaluation Time: 365.89 seconds (~6 minutes)

📌 Interpretation:

  • The evaluation was performed on a high-performance GPU (A100 80GB).
  • The model is significantly larger than the previously evaluated 8B version; GPTQ 4-bit quantization reduces its memory footprint (a loading sketch follows below).
  • A batch size of 1 was used, which slows evaluation throughput.
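
A minimal loading sketch, assuming a recent transformers release with the optimum / auto-gptq (or gptqmodel) integration installed, so the GPTQ quantization config stored in the checkpoint is picked up automatically:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" spreads the quantized weights across available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
)
```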

📂 Dataset Information

  • Dataset: AI2 ARC-Challenge
  • Task Type: Multiple Choice
  • Number of Samples Evaluated: 1,172 (a loading snippet follows at the end of this section)
  • Few-shot Examples Used: 0 (zero-shot setting)

📌 Interpretation:

  • This benchmark assesses grade-school-level scientific reasoning.
  • Since no few-shot examples were provided, the model was evaluated in a pure zero-shot setting.
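
The benchmark data can be inspected directly with the datasets library; the ARC-Challenge test split holds the 1,172 questions evaluated here:

```python
from datasets import load_dataset

# Official AI2 ARC dataset on the Hub; "ARC-Challenge" is the hard subset.
arc = load_dataset("allenai/ai2_arc", "ARC-Challenge")
print(len(arc["test"]))            # 1172 multiple-choice questions
print(arc["test"][0]["question"])  # a grade-school science question
```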

📈 Performance Insights

  • The "higher_is_better" flag confirms that higher accuracy is preferred.
  • The model's raw accuracy (21.2%) is significantly lower compared to state-of-the-art models (60–80% on ARC-Challenge).
  • Quantization Impact: The 4-bit GPTQ quantization reduces memory usage but may also impact accuracy slightly.
  • Zero-shot Limitation: Performance could improve with few-shot prompting (providing examples before testing).
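
For example, a few-shot re-run is a one-parameter change to the harness call shown in the overview (a sketch; whether it helps this distilled model is untested here):

```python
import lm_eval

# Same evaluation as the zero-shot run, but with 5 in-context examples.
results_5shot = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit,"
        "dtype=float16"
    ),
    tasks=["arc_challenge"],
    num_fewshot=5,
    batch_size=1,
)
```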

📌 Let us know if you need further analysis or model tuning! 🚀
