julsCadenas committed · Commit 67ff4a0 · verified · 1 Parent(s): 86a16f1

Update README.md

Files changed (1):
  1. README.md +72 -95
README.md CHANGED
@@ -37,7 +37,7 @@ model-index:
  url: https://huggingface.co/julsCadenas/summarize-reddit
  ---

- # Reddit Summarization Model

  This project uses a fine-tuned model for summarizing Reddit posts and their comments. The model has been trained using a dataset of 100 Reddit posts, and the goal is to generate concise and meaningful summaries of the original posts and the associated comments.
 
@@ -50,110 +50,87 @@ This project uses a fine-tuned version of the BART model from Facebook for summa
  - **Original Model:** [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn)
  - **Fine-Tuned Model:** [julsCadenas/summarize-reddit](https://huggingface.co/julsCadenas/summarize-reddit)

- ## **Usage**
-
- You can use the model to summarize Reddit posts and comments using the following code:
- ```python
- from transformers import pipeline
-
- class Summarize:
-     def __init__(self):
-         self.summarizer = pipeline(
-             "summarization",
-             model="julsCadenas/summarize-reddit",
-             tokenizer="julsCadenas/summarize-reddit",
-         )
-
-     def summarize(self, text, prompt):
-         inputs = f"{prompt}: {text}"
-         input_tokens = self.summarizer.tokenizer.encode(inputs, truncation=False)
-         input_len = len(input_tokens)
-         max_length = min(input_len * 2, 1024)  # adjust for your use case
-         min_length = max(32, input_len // 4)   # adjust for your use case
-         summary = self.summarizer(
-             inputs,
-             max_length=max_length,
-             min_length=min_length,
-             length_penalty=2.0,
-             num_beams=4,
-         )
-         return summary[0]['summary_text']
-
-     def process_data(self, response, prompt):
-         # 'response' is the two-element JSON list Reddit returns for a post's comments page:
-         # element 0 holds the post, element 1 holds the comment tree.
-         post_content = response[0]['data']['children'][0]['data'].get('selftext', '')
-         comments = []
-         for comment in response[1]['data']['children']:
-             if 'body' in comment['data']:
-                 comments.append(comment['data']['body'])
-         comments_all = ' '.join(comments)
-
-         post_summary = self.summarize(post_content, prompt)
-         comments_summary = self.summarize(comments_all, prompt)
-
-         return {
-             "post_summary": post_summary,
-             "comments_summary": comments_summary
-         }
- ```
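-
- For a quick sanity check, `summarize` can be called directly on a plain string (a hypothetical invocation of the class above; the sample text is made up):
- ```python
- # Hypothetical quick check of the Summarize class defined above.
- s = Summarize()
- text = "I have been trying to learn Python for a month but keep bouncing between tutorials without making progress."
- print(s.summarize(text, prompt="Summarize this Reddit post"))
- ```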
-
- You can also use a small script to clean up the JSON output:
- ```python
- import json
-
- def fix_json(jsonfile, path):
-     # The output arrives double-encoded: the wrapper is a JSON string, and each
-     # summary field is itself a JSON-encoded string, so decode each level.
-     fixed_json = json.loads(jsonfile)
-
-     fixed_json['post_summary'] = json.loads(fixed_json['post_summary'])
-     fixed_json['comments_summary'] = json.loads(fixed_json['comments_summary'])
-
-     print(json.dumps(fixed_json, indent=4))
-
-     with open(path, 'w') as file:
-         json.dump(fixed_json, file, indent=4)
- ```
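-
- For instance (a hypothetical input; real output files may differ):
- ```python
- # Each summary field is itself JSON-encoded, hence the escaped inner quotes.
- raw = '{"post_summary": "\\"OP asks how to start learning Python.\\"", "comments_summary": "\\"Commenters recommend small projects.\\""}'
- fix_json(raw, 'summary.json')
- ```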
-
- ## **Model Evaluation**
-
- ### **ROUGE-1 SCORES:**
- - **Recall (r)** = 57.66%
- - **Precision (p)** = 43.41%
- - **F1-Score (f)** = 49.53%
-
- ### **ROUGE-2 SCORES:**
- - **Recall (r)** = 29.30%
- - **Precision (p)** = 20.72%
- - **F1-Score (f)** = 24.27%
-
- ### **ROUGE-L SCORES:**
- - **Recall (r)** = 56.20%
- - **Precision (p)** = 42.30%
- - **F1-Score (f)** = 48.28%

  <br>

- **ROUGE-1:** also known as unigram overlap, measures the overlap of unigrams (individual words) between the generated summary and the reference summary. It calculates the proportion of words in the generated summary that are also present in the reference summary. Example: reference text “The cat is on the rug” vs. generated text “The dog is on the rug”: five of the six reference words match, so ROUGE-1 recall = 5/6 ≈ 0.83 (a small sketch below reproduces this computation).
-
- **ROUGE-2:** also known as bigram overlap, measures the overlap of bigrams (pairs of consecutive words) between the generated summary and the reference summary. It calculates the proportion of bigrams in the generated summary that are also present in the reference summary.
-
- **ROUGE-L:** measures the similarity between the word sequences of the generated summary and the reference summary using their longest common subsequence. Unlike ROUGE-1 and ROUGE-2, which use a simple word-count approach, ROUGE-L uses a sequence-matching approach.
-
- **Source:** https://fabianofalcao.medium.com/metrics-for-evaluating-summarization-of-texts-performed-by-transformers-how-to-evaluate-the-b3ce68a309c3
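-
- To make the arithmetic concrete, a minimal sketch (illustrative only, not the project’s evaluation code) that reproduces the toy example above:
- ```python
- # Counts unigram overlap for the toy example; matches the 5/6 recall above.
- from collections import Counter
-
- reference = "the cat is on the rug".split()
- generated = "the dog is on the rug".split()
-
- overlap = sum((Counter(reference) & Counter(generated)).values())  # 5 shared unigrams
- recall = overlap / len(reference)
- print(f"ROUGE-1 recall = {overlap}/{len(reference)} = {recall:.2f}")
- ```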
-
- ### **CONCLUSIONS**
- **ROUGE-1 SCORE:** The ROUGE-1 score indicates that the model is fairly good at capturing individual words (unigrams) from the reference summaries. With a recall of 57.66%, the model captures more than half of the relevant words from the reference summaries, which suggests that it is effectively capturing key content. The precision of 43.41% indicates that a significant portion of the generated summary’s words also appear in the reference, but some additional, irrelevant words may be included. The F1-score of 49.53% shows that, overall, there is a fairly good balance between recall and precision, although there is still room to improve both.
-
- **ROUGE-2 SCORE:** The ROUGE-2 score, which focuses on bigram overlap, is considerably lower than ROUGE-1. The recall of 29.30% indicates that the model captures roughly 30% of the bigrams from the reference summaries, a moderate result suggesting that the model may not be fully preserving the structural relationships between words. The precision of 20.72% suggests that the generated summaries include bigrams that are not present in the reference summaries. The F1-score of 24.27% is relatively low, which may indicate that the model needs improvement in capturing bigram patterns. This is common in summarization tasks, as producing high-quality bigram overlap is challenging.
-
- **ROUGE-L SCORE:** The ROUGE-L score, which focuses on the longest common subsequence (LCS), shows that the model captures the overall structure of the reference summaries quite well. The recall of 56.20% suggests that a large portion of the key order-preserving sequences from the reference summaries appear in the generated summaries, indicating good coherence. The precision of 42.30% shows that the model does well in maintaining relevant sequences but could further reduce redundant or non-informative ones. The F1-score of 48.28% indicates solid performance in preserving the flow and structure of the original text.
-
- ### **SUMMARY**
- - The ROUGE-1 score is strong, indicating that the model captures individual words well, which is important for summarizing the key points of Reddit posts. This suggests the model is effectively identifying the core content of the original discussions.
-
- - The ROUGE-2 score is relatively low, suggesting that the model struggles to preserve the structure and sequence of words, which is crucial for coherent summaries where sentence flow and the connection between ideas matter.
-
- - The ROUGE-L score shows that the model effectively captures meaningful sequences and maintains coherence in the summaries, a positive outcome for keeping the overall message and flow intact.
-
- While the model performs well in certain areas (particularly ROUGE-1 and ROUGE-L), there is room for improvement, especially in the ROUGE-2 score. Improving bigram overlap could enhance the fluency and structure of the output, leading to more readable and coherent summaries of Reddit posts.

  url: https://huggingface.co/julsCadenas/summarize-reddit
  ---

+ # **Reddit Summarization Model**

  This project uses a fine-tuned model for summarizing Reddit posts and their comments. The model has been trained using a dataset of 100 Reddit posts, and the goal is to generate concise and meaningful summaries of the original posts and the associated comments.

  - **Original Model:** [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn)
  - **Fine-Tuned Model:** [julsCadenas/summarize-reddit](https://huggingface.co/julsCadenas/summarize-reddit)

+ ## Installation
+
+ To get started, install the required dependencies: create a virtual environment and install the packages listed in `requirements.txt`.
+
+ ### **Steps:**
+
+ 1. Clone the repository:
+ ```bash
+ git clone https://github.com/your-username/reddit-summarizer.git
+ cd reddit-summarizer
+ ```
+ 2. Set up a virtual environment:
+ ```bash
+ python3 -m venv venv
+ source venv/bin/activate  # On Windows, use 'venv\Scripts\activate'
+ ```
+ 3. Install dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+ 4. Set up your environment variables (if needed) by creating a `.env` file; a sketch follows this list. You can refer to the sample `.env.example` for the necessary variables.
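+
+ A minimal sketch of what the `.env` file might look like — the variable names here are hypothetical, so check `.env.example` for the real ones:
+ ```bash
+ # Hypothetical variables; see .env.example in the repo for the actual names.
+ REDDIT_CLIENT_ID=your_client_id
+ REDDIT_CLIENT_SECRET=your_client_secret
+ REDDIT_USER_AGENT=reddit-summarizer/0.1
+ ```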
+
+ ### **Usage**
+
+ 1. In `src/summarize.py`, the model should be initialized like this:
+ ```python
+ # src/summarize.py
+ from transformers import pipeline
+
+ self.summarizer = pipeline(
+     "summarization",
+     model="julsCadenas/summarize-reddit",
+     tokenizer="julsCadenas/summarize-reddit",
+ )
+ ```
+ 2. Add the *URL* of your preferred Reddit post in `main.py`.
+ 3. Run `src/main.py`. A sketch of how these steps fit together follows below.
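+
+ A minimal sketch of that flow (hypothetical — the actual logic lives in `src/main.py`, and the module name, URL, and prompt below are made up):
+ ```python
+ # Hypothetical glue code; the repository's src/main.py is the reference.
+ import requests
+ from summarize import Summarize
+
+ url = "https://www.reddit.com/r/learnpython/comments/abc123/example_post/"
+
+ # Appending '.json' makes Reddit return the post and its comments as JSON.
+ data = requests.get(url.rstrip('/') + '.json',
+                     headers={'User-Agent': 'reddit-summarizer/0.1'}).json()
+
+ summaries = Summarize().process_data(data, prompt="Summarize this Reddit post")
+ print(summaries["post_summary"])
+ print(summaries["comments_summary"])
+ ```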

  <br>

+ # **Model Evaluation**
+
+ For a detailed evaluation of the model, including additional analysis and visualizations, refer to the [evaluation notebook](https://github.com/julsCadenas/summarize-reddit/blob/master/notebooks/eval.ipynb).
+
+ ## **BERTScore**
+
+ The model’s performance was evaluated using **BERTScore** (Precision, Recall, and F1).
+
+ ### **Average BERTScores**
+ | Metric            | Value  |
+ |-------------------|--------|
+ | **Precision (p)** | 0.8704 |
+ | **Recall (r)**    | 0.8517 |
+ | **F1 Score (f)**  | 0.8609 |
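+
+ As a reference for how such numbers can be produced, here is a minimal sketch using the `bert_score` package — illustrative only, not necessarily the exact setup behind the table above (see the evaluation notebook for that):
+ ```python
+ # Illustrative only; install with: pip install bert-score
+ from bert_score import score
+
+ candidates = ["OP asks how to start learning Python."]        # model summaries
+ references = ["The author wants advice on learning Python."]  # reference summaries
+
+ P, R, F1 = score(candidates, references, lang="en")
+ print(f"P={P.mean().item():.4f} R={R.mean().item():.4f} F1={F1.mean().item():.4f}")
+ ```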
+
+ ### **Conclusion**
+ - **Precision** is strong but can be improved by reducing irrelevant tokens.
+ - **Recall** needs improvement to capture more relevant content.
+ - **F1 Score** indicates a solid overall performance.
+
+ ### **Improvements**
+ - Focus on improving **Recall**.
+ - Perform **error analysis** to identify missed content.
+ - **Fine-tune** the model further for better results.
+
+ ## **ROUGE**
+
+ The following table summarizes the ROUGE scores (Recall, Precision, and F1) for three metrics: ROUGE-1, ROUGE-2, and ROUGE-L. The values are mean scores across all summaries, expressed as percentages.
+
+ ### **Average ROUGE Scores**
+ | Metric            | ROUGE-1 | ROUGE-2 | ROUGE-L |
+ |-------------------|---------|---------|---------|
+ | **Recall (r)**    | 32.20   | 7.10    | 30.09   |
+ | **Precision (p)** | 22.03   | 4.90    | 20.50   |
+ | **F1 Score (f)**  | 25.00   | 5.51    | 23.30   |
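+
+ For reference, a minimal sketch of computing these metrics with the `rouge` package — illustrative, not necessarily the project’s exact evaluation code (see the notebook above):
+ ```python
+ # Illustrative only; install with: pip install rouge
+ from rouge import Rouge
+
+ hypotheses = ["OP asks how to start learning Python."]
+ references = ["The author wants advice on learning Python."]
+
+ scores = Rouge().get_scores(hypotheses, references, avg=True)
+ # e.g. {'rouge-1': {'r': ..., 'p': ..., 'f': ...}, 'rouge-2': {...}, 'rouge-l': {...}}
+ print(scores)
+ ```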
+
+ ### **Interpretation**
+ - **ROUGE-1**: Shows the highest recall and precision of the three metrics, indicating that the model is good at capturing single-word overlaps but could reduce irrelevant words.
+ - **ROUGE-2**: Exhibits much lower recall and precision, indicating that the model struggles with bigram relationships and context.
+ - **ROUGE-L**: Performs better than ROUGE-2 but still faces challenges with precision; it captures longer subsequences more effectively than bigrams.
+
+ ### **Conclusion**
+ - **ROUGE-1**: The model shows moderate performance but generates some irrelevant words (low precision).
+ - **ROUGE-2**: The model performs poorly, indicating difficulty in capturing bigram relationships.
+ - **ROUGE-L**: Slightly better than ROUGE-2, with some success in capturing longer sequences.
+
+ ### **Improvements**
+ - Focus on enhancing **bigram overlap** (ROUGE-2) and overall **context understanding**.
+ - Reduce **irrelevant content** for improved **precision**.
+ - Improve **sequence coherence** for better **ROUGE-L** scores.