CarperAI
/

openai_summarize_tldr_ppo

	---
	# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
	# Doc / guide: https://huggingface.co/docs/hub/model-cards
	{}
	---

	# Model Card for openai_summarize_tldr_ppo

	This model is GPT-J 6B fine-tuned on the TL;DR dataset with reinforcement learning from human feedback (RLHF), the
	technique also used to train ChatGPT.

	TL;DR is a summarization dataset, so the model is fine-tuned specifically for summarization.

	It is likely the first publicly available open-source LLM fine-tuned with RLHF, thanks to CarperAI.

	It aims to reproduce the results of OpenAI's original paper, [Learning to summarize from human feedback](https://arxiv.org/abs/2009.01325).

	# Model Details

	- Base Model: GPT-J 6B
	- Fine-Tuning Method: RLHF with PPO (Proximal Policy Optimization)
	- Fine-Tuning Dataset: TL;DR
	- Fine-Tuning Task: Summarization
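A minimal sketch of how the model might be loaded and prompted with the `transformers` library. Several details here are assumptions rather than facts stated in this card: the prompt layout (`SUBREDDIT` / `TITLE` / `POST` / `TL;DR:`), the use of the base GPT-J tokenizer from `EleutherAI/gpt-j-6b`, and the generation settings. The helper names `build_prompt` and `summarize` are hypothetical.

```python
MODEL_ID = "CarperAI/openai_summarize_tldr_ppo"  # repository of this model card
TOKENIZER_ID = "EleutherAI/gpt-j-6b"  # assumption: the base-model tokenizer is compatible


def build_prompt(subreddit: str, title: str, post: str) -> str:
    """Format a Reddit post as a TL;DR-style prompt (assumed layout)."""
    return f"SUBREDDIT: r/{subreddit}\nTITLE: {title}\nPOST: {post}\nTL;DR:"


def summarize(prompt: str, max_new_tokens: int = 50) -> str:
    """Generate a summary. Downloads the 6B-parameter checkpoint on first use."""
    # Imported lazily so build_prompt works without these packages installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Half precision only on GPU; fall back to float32 on CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=dtype)
    model.to(device)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding, chosen here for repeatability
            pad_token_id=tokenizer.eos_token_id,
        )
    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

Calling `summarize(build_prompt("AskReddit", "A title", "Post text ..."))` would then return the generated summary. Note that a 6B-parameter model needs roughly 12 GB of memory in half precision and about twice that in float32.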

	## Model Description

	- Developed by: Duy V. Phung, Ayush Thakur, Louis Castricato, Jonathan Tow, Alex Havrilla
	- Finetuned from model: GPT-J 6B

	## Model Sources

	- Repository: https://github.com/CarperAI/trlx/tree/main/examples/summarize_rlhf



	## Results

	Comparison of the supervised fine-tuned (SFT) baseline and the PPO model.

	__ROUGE scores__

	| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | Average |
	| --- | --- | --- | --- | --- |
	| SFT | 0.334 | 0.125 | 0.261 | 0.240 |
	| PPO | 0.323 | 0.109 | 0.238 | 0.223 |
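The Average column appears to be the unweighted mean of the three ROUGE scores. A small check under that assumption, using only the values from the table (the variable and function names are illustrative):

```python
# ROUGE-1, ROUGE-2, ROUGE-L as reported in the table above.
rouge = {
    "SFT": (0.334, 0.125, 0.261),
    "PPO": (0.323, 0.109, 0.238),
}


def average(scores):
    """Unweighted mean of the scores, rounded to three decimals."""
    return round(sum(scores) / len(scores), 3)


averages = {name: average(scores) for name, scores in rouge.items()}
for name, value in averages.items():
    print(f"{name}: {value:.3f}")
```

This reproduces the reported averages of 0.240 for SFT and 0.223 for PPO.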

	__Reward scores__

	| Model | Average Reward | Reward $\Delta$ |
	| --- | --- | --- |
	| SFT | 2.729 | -0.181 |
	| PPO | 3.291 | +0.411 |
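To put the reward numbers in perspective, the gap between the two models can be computed directly from the Average Reward column. This sketch uses only those two values and makes no assumption about how the Reward $\Delta$ column was derived:

```python
# Average reward-model scores as reported in the table above.
avg_reward = {"SFT": 2.729, "PPO": 3.291}

gain = avg_reward["PPO"] - avg_reward["SFT"]  # absolute improvement
relative_gain = gain / avg_reward["SFT"]      # improvement relative to SFT

print(f"absolute gain: {gain:.3f}")
print(f"relative gain: {relative_gain:.1%}")
```

PPO scores 0.562 higher than SFT on average reward, about a 20.6% increase, even though its ROUGE scores are slightly lower.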