---
license: mit
language:
- en
base_model:
- facebook/bart-large-cnn
pipeline_tag: summarization
model-index:
- name: summarize-reddit
  results:
  - task:
      type: summarization
    dataset:
      name: custom reddit posts
      type: custom
    metrics:
    - name: ROUGE-1
      type: ROUGE
      value:
        recall: 32.2
        precision: 22.03
        f1-score: 25
    - name: ROUGE-2
      type: ROUGE
      value:
        recall: 7.1
        precision: 4.9
        f1-score: 5.51
    - name: ROUGE-L
      type: ROUGE
      value:
        recall: 30.09
        precision: 20.5
        f1-score: 23.3
    - name: BERTScore
      type: BERTScore
      value:
        precision: 0.8704
        recall: 0.8517
        f1-score: 0.8609
    source:
      name: summarize-reddit
      url: https://huggingface.co/julsCadenas/summarize-reddit
---
# Reddit Summarization Model

This project uses a fine-tuned model to summarize Reddit posts and their comments. The model was trained on a dataset of 100 Reddit posts, with the goal of generating concise, meaningful summaries of the original posts and their associated comments.

You can access the source code and more information about this project on GitHub: GitHub Repository Link
## Model on Hugging Face

This project uses a fine-tuned version of the BART model from Facebook for summarizing Reddit posts and their comments. The original model, `facebook/bart-large-cnn`, is a pre-trained sequence-to-sequence model optimized for summarization tasks. It was fine-tuned on a custom Reddit dataset for this project.

- **Original Model:** `facebook/bart-large-cnn`
- **Fine-Tuned Model:** `julsCadenas/summarize-reddit`
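
As a quick, illustrative sketch (not taken from the project's code), the fine-tuned checkpoint can also be loaded directly from the Hub with the `transformers` Auto classes; the input text and generation parameters below are placeholder values, not the project's actual settings:

```python
# Illustrative sketch: load the fine-tuned checkpoint directly from the Hub.
# Generation parameters are placeholder values, not the project's settings.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("julsCadenas/summarize-reddit")
model = AutoModelForSeq2SeqLM.from_pretrained("julsCadenas/summarize-reddit")

text = "Full text of a Reddit post and its comments goes here..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)

summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=4,        # placeholder beam-search setting
    max_length=150,     # placeholder cap on summary length
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```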
## Installation

To get started, you need to install the required dependencies. You can do this by creating a virtual environment and installing the packages listed in `requirements.txt`.

Steps:

- Clone the repository:

  ```bash
  git clone https://github.com/your-username/reddit-summarizer.git
  cd reddit-summarizer
  ```

- Set up a virtual environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate  # On Windows, use 'venv\Scripts\activate'
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up your environment variables (if needed) by creating a `.env` file. You can refer to the sample `.env.example` for the necessary variables.
## Usage

- In `src/summarize.py`, the model should be initialized like this:

  ```python
  # src/summarize.py
  self.summarizer = pipeline(
      "summarization",
      model="julsCadenas/summarize-reddit",
      tokenizer="julsCadenas/summarize-reddit",
  )
  ```

- Add the URL of your preferred Reddit post in `main.py`.
- Run `src/main.py`.
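
For reference, a standalone version of the summarization call looks roughly like the sketch below; the input text and length parameters are illustrative, not the project's actual settings:

```python
# Illustrative standalone call; in the project this logic lives in src/summarize.py.
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="julsCadenas/summarize-reddit",
    tokenizer="julsCadenas/summarize-reddit",
)

post_text = "Full text of a Reddit post and its comments goes here..."

# max_length/min_length are placeholder values, not the project's settings.
result = summarizer(post_text, max_length=150, min_length=40, do_sample=False)
print(result[0]["summary_text"])
```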
## Formatted JSON Output

The model outputs its responses in JSON format, but the result may not be properly formatted. For instance, the raw output can look like this:
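
The snippet below is illustrative rather than verbatim: only the two top-level keys come from the `fix_json()` function further down, and the inner `summary` field is a hypothetical placeholder.

```json
{
    "post_summary": "{\"summary\": \"The post describes ...\"}",
    "comments_summary": "{\"summary\": \"Commenters mostly argue that ...\"}"
}
```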
You can see that the output contains escaped quotes within the values. This data should be properly formatted for easier consumption. To fix this, you can use the following function to clean and format the JSON:
```python
import json

def fix_json(raw_data, fixed_path):
    """Parse the JSON-encoded summary strings and save a properly formatted file."""
    if not isinstance(raw_data, dict):
        raise ValueError(f"Expected a dictionary, but got: {type(raw_data)}")
    try:
        # The summaries arrive as JSON strings, so decode them into objects.
        formatted_data = {
            "post_summary": json.loads(raw_data["post_summary"]),
            "comments_summary": json.loads(raw_data["comments_summary"]),
        }
    except json.JSONDecodeError as e:
        print("Error decoding JSON:", e)
        return
    with open(fixed_path, "w") as file:
        json.dump(formatted_data, file, indent=4)
    print(f"Formatted JSON saved to {fixed_path}")
```
After using the `fix_json()` function to clean and format the output, the data will look like this:
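
Continuing the illustrative example above (the inner `summary` field is still a hypothetical placeholder), the nested JSON strings are now parsed into objects:

```json
{
    "post_summary": {
        "summary": "The post describes ..."
    },
    "comments_summary": {
        "summary": "Commenters mostly argue that ..."
    }
}
```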
You can view the full notebook on formatting the output here.
## Model Evaluation
For a detailed evaluation of the model, including additional analysis and visualizations, refer to the evaluation notebook.
### BERTScore

The model's performance was evaluated using BERTScore (Precision, Recall, and F1).

#### Average BERTScores

| Metric        | Value  |
|---------------|--------|
| Precision (p) | 0.8704 |
| Recall (r)    | 0.8517 |
| F1 Score (f)  | 0.8609 |
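
For reference, per-summary BERTScore precision, recall, and F1 can be computed with the `bert-score` package roughly as sketched below; this is illustrative and may differ from the evaluation notebook's exact setup, and the example texts are placeholders:

```python
# Illustrative BERTScore computation; the evaluation notebook may differ.
from bert_score import score

predictions = ["Generated summary of a Reddit post ..."]           # model outputs
references = ["Reference summary written for the same post ..."]   # ground truth

# Returns one precision/recall/F1 value per (prediction, reference) pair.
P, R, F1 = score(predictions, references, lang="en")
print(P.mean().item(), R.mean().item(), F1.mean().item())
```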
#### Conclusion
- Precision is strong but can be improved by reducing irrelevant tokens.
- Recall needs improvement to capture more relevant content.
- F1 Score indicates a solid overall performance.
#### Improvements
- Focus on improving Recall.
- Perform Error Analysis to identify missed content.
- Fine-tune the model for better results.
### ROUGE
The following table summarizes the ROUGE scores (Recall, Precision, and F1) for three different metrics: ROUGE-1, ROUGE-2, and ROUGE-L. These values represent the mean scores across all summaries.
#### Average ROUGE Scores

| Metric        | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---------------|---------|---------|---------|
| Recall (r)    | 32.20   | 7.10    | 30.09   |
| Precision (p) | 22.03   | 4.90    | 20.50   |
| F1 Score (f)  | 25.00   | 5.51    | 23.30   |
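
For reference, per-summary precision, recall, and F1 for ROUGE-1/2/L can be obtained with the `rouge-score` package roughly as below; this is illustrative, and the evaluation notebook may use different options (for example, stemming) and texts:

```python
# Illustrative ROUGE computation; the evaluation notebook may differ.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "Reference summary written for the same post ..."   # ground truth
prediction = "Generated summary of a Reddit post ..."           # model output

# score(target, prediction) returns precision/recall/fmeasure per ROUGE variant.
scores = scorer.score(reference, prediction)
for name, s in scores.items():
    print(f"{name}: r={s.recall:.4f} p={s.precision:.4f} f={s.fmeasure:.4f}")
```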
#### Interpretation
- ROUGE-1: Shows higher recall and precision, indicating that the model is good at capturing single-word overlaps but could reduce irrelevant words.
- ROUGE-2: Exhibits lower recall and precision, indicating the model struggles with bigram relationships and context.
- ROUGE-L: Performs better than ROUGE-2 but still faces challenges with precision. It captures longer subsequences more effectively than bigrams.
#### Conclusion
- ROUGE-1: The model shows moderate performance but generates some irrelevant words (low precision).
- ROUGE-2: The model performs poorly, indicating difficulty in capturing bigram relationships.
- ROUGE-L: Slightly better than ROUGE-2, with some success in capturing longer sequences.
#### Improvements
- Focus on enhancing bigram overlap (ROUGE-2) and overall context understanding.
- Reduce irrelevant content for improved precision.
- Improve sequence coherence for better ROUGE-L scores.
### METEOR Score

| Statistic | METEOR Score |
|-----------|--------------|
| Mean      | 0.2079       |
| Min       | 0.0915       |
| Max       | 0.3216       |
| Std. Dev. | 0.0769       |
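
For reference, per-summary METEOR scores and the summary statistics above can be computed roughly as below with NLTK and NumPy; this is an illustrative sketch with placeholder texts, not the evaluation notebook's exact code:

```python
# Illustrative METEOR computation; the evaluation notebook may differ.
import nltk
import numpy as np
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet")  # lexical resource required by METEOR

predictions = ["Generated summary of a Reddit post ..."]           # model outputs
references = ["Reference summary written for the same post ..."]   # ground truth

# meteor_score expects tokenized input: a list of reference token lists
# plus the hypothesis tokens.
scores = [
    meteor_score([ref.split()], pred.split())
    for ref, pred in zip(references, predictions)
]
print(np.mean(scores), np.min(scores), np.max(scores), np.std(scores))
```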
#### Interpretation
- Mean: The average METEOR score indicates good performance in terms of word alignment and synonyms, but there is still room for improvement.
- Min: The lowest METEOR score suggests some summaries may not align well with the reference.
- Max: The highest METEOR score shows the model's potential for generating very well-aligned summaries.
- STD: The standard deviation indicates some variability in the model's performance across different summaries.
#### Conclusion
- The model's METEOR Score shows a generally solid performance in generating summaries that align well with reference content but still has variability in certain cases.
#### Improvements
- Focus on improving the alignment and synonym usage to achieve higher and more consistent METEOR scores across summaries.
## TLDR

### Comparison & Final Evaluation
- BERTScore suggests the model is good at generating relevant tokens (precision) but struggles with capturing all relevant content (recall).
- ROUGE-1 is decent, but ROUGE-2 and ROUGE-L show weak performance, particularly in terms of bigram relationships and sequence coherence.
- METEOR results show solid alignment, but there’s significant variability, especially with lower scores.
### Conclusion
- The model performs decently but lacks consistency, especially in bigram overlap (ROUGE-2) and capturing longer sequences (ROUGE-L). There’s room for improvement in recall and precision to make the summaries more relevant and coherent.
- Focus on improving recall, bigram relationships, and precision to achieve more consistent, high-quality summaries.