Model Card for HINT-lab/llama3-8b-final-ppo-c-v0.3

PPO-C (PPO with Calibrated Reward Calculation) is an RLHF algorithm designed to mitigate verbalized overconfidence in RLHF-trained large language models. PPO-C adjusts standard reward-model scores during PPO training: it maintains a running average of past reward scores as a dynamic threshold to classify responses, and adjusts each reward score based on the model's expressed verbalized confidence. Please refer to our preprint (Taming Overconfidence in LLMs: Reward Calibration in RLHF) and our repository for more details.
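For intuition, below is a minimal sketch of the reward-adjustment idea described above. The class name, the `alpha` smoothing factor, and the `scale` confidence weight are illustrative assumptions, not the exact formulation from the paper.

```python
# Minimal sketch of a PPO-C-style calibrated reward adjustment (illustrative only).
# `alpha` (threshold smoothing) and `scale` (confidence weight) are assumed
# hyperparameters, not values taken from the paper.

class CalibratedReward:
    def __init__(self, alpha: float = 0.99, scale: float = 1.0):
        self.alpha = alpha       # smoothing factor for the running-average threshold
        self.scale = scale       # weight of the confidence-based adjustment
        self.threshold = None    # dynamic threshold: running average of past rewards

    def __call__(self, reward: float, confidence: float) -> float:
        """Adjust a raw reward-model score using verbalized confidence in [0, 1]."""
        # Update the dynamic threshold with the new raw reward.
        if self.threshold is None:
            self.threshold = reward
        else:
            self.threshold = self.alpha * self.threshold + (1.0 - self.alpha) * reward

        # Above-threshold ("good") responses: high confidence raises the reward.
        # Below-threshold ("bad") responses: high confidence lowers the reward.
        if reward >= self.threshold:
            return reward + self.scale * confidence
        return reward - self.scale * confidence


calibrate = CalibratedReward()
print(calibrate(reward=0.7, confidence=0.9))  # at/above threshold: confidence adds to the score
print(calibrate(reward=0.1, confidence=0.9))  # below threshold: confidence is penalized
```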

Model Details

Model Description

We train OpenRLHF/Llama-3-8b-sft-mixture on our HINT-lab/prompt-collections-final-v0.3 prompt dataset with a vanilla reward model, OpenRLHF/Llama-3-8b-rm-mixture.
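A quick way to try the model is through the standard transformers API. The sketch below assumes the tokenizer ships a chat template; the prompt and generation settings are illustrative, not recommendations from the authors.

```python
# Illustrative loading/inference sketch; prompt and generation settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HINT-lab/llama3-8b-final-ppo-c-v0.3"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "What is the capital of France? State your confidence."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```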

Model Sources

Paper: Taming Overconfidence in LLMs: Reward Calibration in RLHF (preprint)

Model size: 8.03B params (Safetensors)
Tensor type: BF16
