jinunyachhyon committed on
Commit b58106f · verified · 1 Parent(s): a0fd221

Update readme.md

Files changed (1)
  1. README.md +422 -102
README.md CHANGED
@@ -1,201 +1,521 @@
1
  ---
2
  library_name: transformers
3
+ tags: [summarization]
4
  ---
5
 
6
+ # BART
7
 
8
+ BART (Bidirectional and Auto-Regressive Transformers) is a transformer-based model architecture introduced by Facebook AI. It combines elements of both bidirectional (like BERT) and autoregressive (like GPT) models into a single architecture.
9
 
10
+ ## Bidirectional Meaning
11
+ The term "bidirectional" refers to the model's ability to process input sequences in both directions simultaneously during encoding.
12
 
13
+ Specifically, in a bidirectional model:
14
 
15
+ 1. **Forward Encoding:** The input sequence is processed from left to right (forward direction), with each token attending to all tokens that precede it in the sequence. This allows the model to capture contextual information from preceding tokens when encoding each token in the sequence.
16
 
17
+ Let's explain forward encoding in the context of a language model using a simplified example.
18
 
19
+ Imagine you have a language model tasked with predicting the next word in a sentence based on the words that came before it. Let's consider a simple example sentence:
20
 
21
+ ```
22
+ "The cat sat on the ____."
23
+ ```
24
 
25
+ In forward encoding:
26
+ - The language model processes the sentence from left to right, one word at a time.
27
+ - It starts by predicting the next word based on the words that have already been seen in the sentence.
28
+ - For example, given the words "The cat sat on the", the model predicts the next word ("table") based on the context provided by the preceding words.
29
+ - After predicting "table", it moves forward to predict the next word in the sequence, and so on, until the end of the sentence is reached.
 
 
30
 
31
+ In essence, forward encoding in a language model involves generating predictions for each subsequent word in a sentence based on the context provided by the words that precede it, moving forward through the sequence of words.
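To make this concrete, here is a small sketch (an illustration only, not part of BART itself) that asks a purely left-to-right model, GPT-2, to score the next word of the example sentence using nothing but the words to its left:

```python
# Illustration of left-to-right (forward) prediction with a causal LM (GPT-2).
# The model only sees the words to the left of the position it predicts.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The cat sat on the"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (1, sequence_length, vocab_size)

# The scores at the last position rank every candidate next word.
next_token_id = logits[0, -1].argmax().item()
print(prompt, "->", tokenizer.decode([next_token_id]))
```

Whichever word the model picks, the key point is that each prediction depends only on the tokens that precede it.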
32
 
33
+ 2. **Backward Encoding:** Similarly, the input sequence is also processed from right to left (backward direction), with each token attending to all tokens that follow it in the sequence. This enables the model to capture contextual information from subsequent tokens when encoding each token in the sequence.
34
 
35
+ Let's consider a different example to illustrate backward encoding in the context of a language model:
 
 
36
 
37
+ Imagine you have a language model trained to generate text based on a given prompt. Let's say the prompt is:
38
 
39
+ ```
40
+ "Once upon a time, there was a ____."
41
+ ```
42
 
43
+ In backward encoding:
44
+ - The language model processes the prompt from right to left, one word at a time.
45
+ - It starts by predicting the missing word based on the context provided by the words that follow it in the prompt.
46
+ - For example, given the right-hand context "barking in the yard", the model predicts the missing word ("dog") from the words that follow the blank rather than the words that precede it.
47
+ - The model processes the tokens from right to left, starting with "yard", "the", "in", "barking", and so on. It uses the context provided by these tokens to the right of the blank to predict the missing word.
48
+ - After predicting "dog", it moves backward to the previous position in the sequence, and so on, until the beginning of the prompt is reached.
49
 
50
+ In essence, backward encoding in a language model involves generating text by considering the context provided by the words that come after each word in the sequence, moving backward through the sequence of words.
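Because BART's encoder is bidirectional, it can combine the left and right context of a gap when reconstructing text. The sketch below (a rough illustration; the exact word the model fills in is not guaranteed) reuses the `facebook/bart-base` checkpoint to fill a `<mask>` token:

```python
# BART mask filling: the encoder reads the words on BOTH sides of <mask>,
# and the decoder regenerates the full sentence with the gap filled in.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

text = "Once upon a time, there was a <mask> barking in the yard."
inputs = tokenizer(text, return_tensors="pt")

output_ids = model.generate(inputs["input_ids"], max_length=30, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```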
51
 
52
+ ## Auto-regressive Meaning
53
+ An autoregressive decoder, like the one used in models such as GPT (Generative Pre-trained Transformer), generates output sequentially, predicting one token at a time based on the previously generated tokens. This means that the model generates the output sequence in an autoregressive manner, where each token is generated conditionally on the tokens that have been generated before it.
54
 
55
+ Similarly, BART (Bidirectional and Auto-Regressive Transformers) also employs an autoregressive decoder for tasks like text generation and summarization. In the autoregressive decoding process:
56
+ - The model predicts the next token in the sequence based on the tokens it has already generated.
57
+ - It generates tokens one-by-one, iterating through the sequence until it reaches the desired length or generates an end-of-sequence token.
58
 
59
+ In summary, when we say "autoregressive decoder like GPT for BART," we mean that BART employs a decoder component similar to GPT's, which generates output sequentially based on previously generated tokens. This decoder plays a crucial role in BART's ability to generate coherent and contextually relevant text for tasks like summarization.
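To see what generating "one token at a time" means in practice, here is a simplified greedy decoding loop (a sketch of the idea; `model.generate` implements this, plus beam search and other refinements, internally):

```python
# Simplified greedy autoregressive decoding with BART: each step feeds the
# tokens generated so far back into the decoder to pick the next token.
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

input_ids = tokenizer("Once upon a time, there was a <mask>.", return_tensors="pt").input_ids

decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])
with torch.no_grad():
    for _ in range(20):
        logits = model(input_ids=input_ids, decoder_input_ids=decoder_ids).logits
        next_id = logits[0, -1].argmax().reshape(1, 1)            # most likely next token
        decoder_ids = torch.cat([decoder_ids, next_id], dim=-1)   # append it and repeat
        if next_id.item() == model.config.eos_token_id:           # stop at end-of-sequence
            break

print(tokenizer.decode(decoder_ids[0], skip_special_tokens=True))
```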
60
 
61
+ ## BART Components
62
 
63
+ 1. **Encoder-Decoder Architecture**: BART follows the encoder-decoder architecture commonly used in sequence-to-sequence (seq2seq) models. The encoder processes the input sequence (text) bidirectionally, capturing contextual information from both directions. This bidirectional encoding helps BART understand the input text more comprehensively. The decoder then generates the output sequence autoregressively, one token at a time, based on the encoded input and previous tokens generated.
64
 
65
+ 2. **Bidirectional Encoder**: BART's encoder is similar to the encoder used in BERT (Bidirectional Encoder Representations from Transformers). It processes the input text bidirectionally, allowing it to capture contextual information from both preceding and succeeding tokens in the input sequence. This bidirectional encoding helps BART understand the relationships between different parts of the input text.
66
 
67
+ 3. **Autoregressive Decoder**: BART's decoder is similar to the decoder used in autoregressive models like GPT (Generative Pre-trained Transformer). It generates the output sequence autoregressively, predicting the next token in the sequence based on the previously generated tokens and the encoded input. This autoregressive decoding allows BART to generate coherent and contextually relevant output sequences.
68
 
69
+ 4. **Pre-training with Noising Function**: BART is pre-trained using a denoising autoencoding objective. During pre-training, input text is corrupted with an arbitrary noising function, such as masking, shuffling, or dropping tokens. The model is then trained to reconstruct the original text from the corrupted input. This pre-training strategy encourages the model to learn robust representations of the input text and improves its ability to handle noisy or imperfect input during fine-tuning and inference.
70
 
71
+ 5. **Fine-tuning for Text Generation and Comprehension**: BART is particularly effective when fine-tuned for text generation tasks such as summarization and translation. Its bidirectional encoder and autoregressive decoder make it well-suited for capturing contextual information and generating coherent and contextually relevant output sequences. Additionally, BART also performs well on comprehension tasks such as text classification and question answering, demonstrating its versatility and effectiveness across a range of natural language processing tasks.
72
 
73
+ In summary, BART is a transformer-based model architecture that combines bidirectional encoding with autoregressive decoding. It is pre-trained using a denoising autoencoding objective and is effective for both text generation and comprehension tasks.
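As a quick sanity check of this encoder-decoder structure, the two components can be inspected directly on the pre-trained checkpoint (a minimal sketch; the printed values depend on the checkpoint's configuration):

```python
# Inspect the encoder-decoder structure of a pre-trained BART checkpoint.
from transformers import BartModel

model = BartModel.from_pretrained("facebook/bart-base")

print("encoder layers:", model.config.encoder_layers)   # bidirectional encoder stack
print("decoder layers:", model.config.decoder_layers)   # autoregressive decoder stack
print("hidden size:", model.config.d_model)

encoder = model.get_encoder()   # encodes the input using context from both sides
decoder = model.get_decoder()   # generates output tokens left to right
print(type(encoder).__name__, type(decoder).__name__)
```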
74
 
75
+ ## Noising Functions in BART
76
+ Noising functions are used during the pre-training phase of models like BART to introduce noise or alterations to the input text, which helps the model learn to handle various types of noise and improve its robustness. Here are some common types of noising functions used in pre-training:
77
 
78
+ 1. **Masking**: In masking, random tokens in the input text are replaced with a special "mask" token. The model is then trained to predict the original tokens that were masked out. This helps the model learn to fill in missing or masked tokens, which can be useful for tasks like text generation and completion.
79
 
80
+ 2. **Shuffling**: Shuffling involves randomly reordering the tokens in the input text. The model is then trained to reconstruct the original order of the tokens. This helps the model learn the underlying structure and dependencies between tokens in the text, which can improve its ability to understand and generate coherent sequences.
81
 
82
+ 3. **Token Dropout**: Token dropout involves randomly removing tokens from the input text. The model is then trained to reconstruct the original text, even in the presence of missing tokens. This encourages the model to learn more robust representations of the text and improves its ability to handle missing or incomplete input.
83
 
84
+ 4. **Text Infilling**: In text infilling, segments of the input text are replaced with special "mask" tokens, similar to masking. However, instead of predicting the original tokens directly, the model is trained to generate plausible replacements for the masked segments. This helps the model learn to generate fluent and coherent text, even when parts of the input are missing or incomplete.
85
 
86
+ These are just a few examples of the types of noising functions used during pre-training. The goal of using these functions is to expose the model to a diverse range of noisy input conditions, which helps it learn more robust and generalizable representations of the text. By pre-training the model with these variations of input, it becomes better equipped to handle noisy or imperfect input during fine-tuning and inference stages.
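The toy sketch below (plain Python, not BART's actual pre-processing code) shows what masking, token dropout, and shuffling look like on a single sentence; during pre-training, the corrupted version is the encoder input and the original sentence is the reconstruction target:

```python
# Toy illustration of the noising functions described above.
import random

random.seed(0)
sentence = "the quick brown fox jumps over the lazy dog".split()

def mask_tokens(tokens, p=0.3, mask="<mask>"):
    """Replace a fraction of tokens with a mask token."""
    return [mask if random.random() < p else t for t in tokens]

def drop_tokens(tokens, p=0.2):
    """Randomly delete a fraction of tokens (token dropout)."""
    return [t for t in tokens if random.random() >= p]

def shuffle_tokens(tokens):
    """Randomly permute the token order (shuffling)."""
    shuffled = list(tokens)
    random.shuffle(shuffled)
    return shuffled

print("masked  :", " ".join(mask_tokens(sentence)))
print("dropped :", " ".join(drop_tokens(sentence)))
print("shuffled:", " ".join(shuffle_tokens(sentence)))
print("target  :", " ".join(sentence))  # what the model must reconstruct
```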
87
 
88
+ # BART - Base Model
89
+ The "BART base" model typically refers to the pre-trained BART model before any fine-tuning on downstream tasks has been applied.
90
 
91
+ The "base" variant of the BART model usually denotes a mid-sized architecture with a moderate number of parameters. This base model is pre-trained on large text corpora using denoising autoencoding objectives but has not been fine-tuned for specific tasks such as text summarization, translation, or question answering.
92
 
93
+ After pre-training, the BART base model can be further fine-tuned on downstream tasks by continuing training on task-specific datasets. Fine-tuning allows the model to adapt its pre-learned representations to the specific characteristics of the target task, resulting in improved performance on that task.
94
 
95
+ ## Experiment with BART Base Model for Text Summarization
96
 
97
+ ```python
98
+ # Import necessary libraries
99
+ from transformers import BartTokenizer, BartForConditionalGeneration
100
 
101
+ # Load pre-trained BART tokenizer
102
+ tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
103
 
104
+ # Load pre-trained BART model for conditional generation
105
+ model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
106
 
107
+ # Input text for summarization
108
+ input_text = """
109
+ Queenie: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?
110
+ Rebecca: I found it would be a good idea to get a check-up.
111
+ Queenie: Yes, well, you haven't had one for 5 years. You should have one every year.
112
+ Rebecca: I know. I figure as long as there is nothing wrong, why go see the doctor?
113
+ Queenie: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.
114
+ Rebecca: Ok.
115
+ Queenie: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?
116
+ Rebecca: Yes.
117
+ Queenie: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.
118
+ Rebecca: I've tried hundreds of times, but I just can't seem to kick the habit.
119
+ Queenie: Well, we have classes and some medications that might help. I'll give you more information before you leave.
120
+ Rebecca: Ok, thanks doctor.
121
+ """
122
 
123
+ # Tokenize input text
124
+ input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=1024, truncation=True)
125
 
126
+ # Generate summary
127
+ summary_ids = model.generate(input_ids, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
128
 
129
+ # Decode and print the generated summary
130
+ summary_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
131
+ print("Generated Summary:", summary_text)
132
+ ```
133
+ **Explanation:**
134
 
135
+ 1. **Import Libraries**:
136
+ - We import the necessary libraries from the `transformers` package, including `BartTokenizer` and `BartForConditionalGeneration`, which are required for tokenization and model loading.
137
 
138
+ 2. **Load Pre-trained BART Model**:
139
+ - We load the pre-trained BART tokenizer (`BartTokenizer.from_pretrained`) and BART model for conditional generation (`BartForConditionalGeneration.from_pretrained`) from the `"facebook/bart-base"` checkpoint.
140
 
141
+ 3. **Input Text for Summarization**:
142
+ - We define a multi-line string (`input_text`) containing a doctor-patient conversation between Queenie (Doctor Hawkins) and Rebecca (Mr. Smith). This serves as the input text for the summarization task.
143
 
144
+ 4. **Tokenize Input Text**:
145
+ - We tokenize the input text using the BART tokenizer (`tokenizer.encode`). The `return_tensors="pt"` parameter specifies that the tokenized inputs should be returned as PyTorch tensors.
146
 
147
+ 5. **Generate Summary**:
148
+ - We use the pre-trained BART model to generate a summary (`model.generate`) based on the tokenized input text (`input_ids`). Parameters such as `max_length`, `min_length`, `length_penalty`, `num_beams`, and `early_stopping` are provided to control the generation process.
149
 
150
+ 6. **Decode and Print Generated Summary**:
151
+ - We decode the generated summary (`tokenizer.decode`) to convert the token IDs back into human-readable text and print the result.
152
 
153
+ **Generated Summary:**
154
+ ```
155
+ Queenie: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?Rebecca: I found it would be a good idea to get a check-up. Have you had a heart attack in the past 5 years?Queenie (in a low voice): Yes, well, you haven't had one for 5 years. You should have one every year. Are you sure you don't have a heart disease, Mr?ReRebecca? I know. I figure as long as there is nothing wrong, why go see the doctor? Are you serious about your heart disease?Q: Yes, I am serious about my heart disease.Q: What can I do to help you?QQueenie
156
+ ```
157
 
158
+ **Issue Description:**
159
 
160
+ The generated summary closely resembles the input text and fails to provide a concise and informative summary of the conversation. Instead of condensing the information and highlighting the main points, the generated summary essentially replicates the input text without adding significant value or insight. As a result, it does not fulfill the purpose of summarization, which is to distill the key ideas and convey them concisely.
161
 
 
162
 
163
+ ## Fine-Tuning with Dialogue Datasets
164
 
165
+ **Dialog Dataset Used:**
166
+ - [Dialog ~ Summary](https://www.kaggle.com/datasets/farneetsingh/chat-conversations-with-summary) : A personally curated dataset, designed to include a variety of chat styles and topics, ensuring robustness and versatility in the model's performance.
167
+ - [DialogSum Dataset](https://www.kaggle.com/datasets/farneetsingh/dialogue-chat) : A specialized dataset for dialogue summarization, offering diverse conversational examples.
168
+ - [Samsum dataset for chat summarization](https://www.kaggle.com/datasets/akch2914/samsum-dataset-for-chat-summarization) : This dataset comprises scripted chat conversations with associated human-written summaries, providing a rich ground for training and validating summarization models.
169
+
170
+ ### Training Dataset
171
+
172
+ 1. **DialogSum Dataset:**
173
+ - **Description:** The DialogSum dataset consists of three subsets: train, val, and test. Each subset contains dialogs paired with one summary in JSONL format.
174
+ - **Conversion Process:**
175
+ - The train and val subsets were parsed and converted into a pandas DataFrame, with two columns: 'dialog' and 'summary'.
176
+ - Since the dialogs contain placeholders like #person1 and #person2, a mapping from placeholder names to actual person names was created using a predefined name list.
177
+ - The placeholders were replaced with actual person names in the 'dialog' column.
178
+ - **Outcome:** Three DataFrames were created, each containing dialogs paired with their corresponding summaries, suitable for model training.
179
+
180
+ 2. **Samsum Dataset:**
181
+ - **Description:** The Samsum dataset includes train, val, and test sets in JSON format, each containing dialogs and their summaries.
182
+ - **Conversion Process:**
183
+ - The train, val, and test sets were parsed and converted into pandas DataFrames, each with two columns: 'dialog' and 'summary'.
184
+ - **Outcome:** Three DataFrames were created, each containing dialogs paired with their corresponding summaries, suitable for model training.
185
+
186
+ 3. **Dialog ~ Summary Dataset:**
187
+ - **Description:** The Dialog ~ Summary dataset consists of a text dataset where each line contains a dialog followed by a summary separated by a tilde (~).
188
+ - **Conversion Process:**
189
+ - The text dataset was parsed, and each line was split at the tilde (~) to separate the dialog and summary components.
190
+ - The dialog and summary pairs were then stored in a pandas DataFrame with two columns: 'dialog' and 'summary'.
191
+ - **Outcome:** A DataFrame was created containing dialogs paired with their corresponding summaries, facilitating further analysis and model training.
192
+
193
+ After processing and consolidating multiple datasets, a training dataset was created by concatenating the following datasets:
194
+
195
+ 1. **DialogSum Dataset (Train and Validation):**
196
+ - **Number of Dialogs:** 12,460 (Train) + 500 (Validation)
197
+ - **Description:** The DialogSum dataset contains dialogs paired with one summary per dialog in JSONL format.
198
+ - **Total Dialogs:** 12,960
199
+
200
+ 2. **Samsum Dataset (Train and Validation):**
201
+ - **Number of Dialogs:** 14,732 (Train) + 818 (Validation)
202
+ - **Description:** The Samsum dataset includes dialogs and summaries in JSON format.
203
+ - **Total Dialogs:** 15,550
204
+
205
+ 3. **Dialog ~ Summary Dataset (Train):**
206
+ - **Number of Dialogs:** 909
207
+ - **Description:** The Dialog ~ Summary dataset consists of dialogs followed by summaries separated by a tilde (~).
208
+ - **Total Dialogs:** 909
209
+
210
+ **Total Training Dataset:**
211
+ - Total number of dialogs after concatenation: 12,960 (DialogSum) + 15,550 (Samsum) + 909 (Dialog ~ Summary) = 29,419 dialogs.
212
+
213
+ | Index | Dialogue | Summary |
214
+ |-------|----------------------------------------------------------|--------------------------------------------------------------|
215
+ | 0 | Miles: Hi, Mr. Smith. I'm Doctor Hawkins. Why ... | Mr. Smith's getting a check-up, and Doctor Hawkins is his doctor. |
216
+ | 1 | Alice: Hello Mrs. Parker, how have you been?\n... | Mrs. Parker takes Ricky for his vaccines, and Dr. Parker administers the vaccines. |
217
+ | 2 | Amelia: Excuse me, did you see a set of keys?\... | Amelia is looking for a set of keys and asks for help from nearby individuals. |
218
+ | 3 | Samuel: Why didn't you tell me you had a girlf... | Samuel's angry because Luna didn't tell Samuel she had a boyfriend. |
219
+ | 4 | Quinn: Watsup, ladies! Y'll looking'fine tonig... | Malik invites Nikki to dance, and Nikki agrees if her friends join too. |
220
+ | ... | ... | ... |
221
+ | 29414 | Ravi: I've been experimenting with cooking dif... | Ravi tells Mei about his culinary experiments and offers to share recipes. |
222
+ | 29415 | Sophie: I'm working on a project to clean up o... | Sophie discusses her project to clean up the local park and seeks volunteers. |
223
+ | 29416 | Neil: I've been exploring historical novels re... | Neil shares his interest in historical novels and recommends some titles to Mary. |
224
+ | 29417 | Grace: I started a blog about sustainable livi... | Grace mentions her new blog on sustainable living and invites feedback from friends. |
225
+ | 29418 | Lena: I've been learning sign language. It's a... | Lena talks about learning sign language, and Mr. Brown expresses appreciation for her efforts. |
226
+
227
+ The combined training dataset contains a total of 29,419 dialogs, with each dialog paired with its corresponding summary. This dataset is now ready for further preprocessing and model training to develop natural language processing models, such as text summarization models.
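A minimal sketch of the concatenation step is shown below; the three DataFrames stand in for the processed DialogSum, Samsum, and Dialog ~ Summary frames described above (their contents here are placeholders), each with 'dialogue' and 'summary' columns:

```python
# Combine the processed DataFrames into a single training DataFrame.
import pandas as pd

# Placeholder frames standing in for the real processed datasets.
df_dialogsum = pd.DataFrame({"dialogue": ["A: Hi!\nB: Hello."], "summary": ["A greets B."]})
df_samsum = pd.DataFrame({"dialogue": ["A: Lunch?\nB: Sure."], "summary": ["They plan lunch."]})
df_dialog_summary = pd.DataFrame({"dialogue": ["A: Bye.\nB: Bye."], "summary": ["They say goodbye."]})

train_df = pd.concat([df_dialogsum, df_samsum, df_dialog_summary], ignore_index=True)
print(len(train_df), train_df.columns.tolist())  # row count and ['dialogue', 'summary']
```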
228
+
229
+ ### Making Training Dataset Ready for Finetuning
230
+ ```python
231
+ class SummaryDataset(Dataset):
232
+ # Initialize the dataset with a tokenizer, data, and maximum token length
233
+ def __init__(self, tokenizer, data, max_length=512):
234
+ self.tokenizer = tokenizer # Tokenizer for encoding text
235
+ self.data = data # Data containing dialogues and summaries
236
+ self.max_length = max_length # Maximum length of tokens
237
+
238
+ # Return the number of items in the dataset
239
+ def __len__(self):
240
+ return len(self.data)
241
+
242
+ # Retrieve an item from the dataset by index
243
+ def __getitem__(self, idx):
244
+ item = self.data.iloc[idx] # Get the row at the specified index
245
+ dialogue = item['dialogue'] # Extract dialogue from the row
246
+ summary = item['summary'] # Extract summary from the row
247
+
248
+ # Encode the dialogue as input data for the model
249
+ source = self.tokenizer.encode_plus(
250
+ dialogue,
251
+ max_length=self.max_length,
252
+ padding='max_length',
253
+ return_tensors='pt',
254
+ truncation=True
255
+ )
256
+
257
+ # Encode the summary as target data for the model
258
+ target = self.tokenizer.encode_plus(
259
+ summary,
260
+ max_length=self.max_length,
261
+ padding='max_length',
262
+ return_tensors='pt',
263
+ truncation=True
264
+ )
265
+
266
+ # Return a dictionary containing input_ids, attention_mask, labels, and the original summary text
267
+ return {
268
+ 'input_ids': source['input_ids'].flatten(),
269
+ 'attention_mask': source['attention_mask'].flatten(),
270
+ 'labels': target['input_ids'].flatten(),
271
+ 'summary': summary
272
+ }
273
+ ```
274
+
275
+ **Tokenization Process Explanation:**
276
+
277
+ 1. **Initialization**: The `SummaryDataset` class is initialized with a tokenizer, data (containing dialogues and summaries), and a maximum token length.
278
+
279
+ 2. **Retrieve Data**: When an item is requested from the dataset (`__getitem__` method), the dialogue and summary from the dataset are retrieved based on the index.
280
+
281
+ 3. **Tokenization**: The dialogue and summary are tokenized separately using the tokenizer's `encode_plus` method.
282
+ - The `encode_plus` method tokenizes the text, adds special tokens (for BART these are `<s>` and `</s>`, the counterparts of BERT's `[CLS]` and `[SEP]`), and returns a dictionary containing the tokenized input_ids and attention_mask tensors.
283
+
284
+ 4. **Padding and Truncation**: The tokenized sequences are padded to the maximum token length and truncated if they exceed it. This ensures that all sequences have the same length.
285
+
286
+ 5. **Return**: The tokenized dialogue, attention_mask, and tokenized summary are returned as a dictionary along with the original summary text.
287
+
288
+ **Code Snippet and Documentation:**
289
+
290
+ ```python
291
+ # Encode the dialogue as input data for the model
292
+ source = self.tokenizer.encode_plus(
293
+ dialogue,
294
+ max_length=self.max_length,
295
+ padding='max_length',
296
+ return_tensors='pt',
297
+ truncation=True
298
+ )
299
 
300
+ # Encode the summary as target data for the model
301
+ target = self.tokenizer.encode_plus(
302
+ summary,
303
+ max_length=self.max_length,
304
+ padding='max_length',
305
+ return_tensors='pt',
306
+ truncation=True
307
+ )
308
+ ```
309
 
310
+ **Explanation:**
311
 
312
+ - `encode_plus` method is called on the tokenizer to tokenize the dialogue and summary separately.
313
+ - `dialogue` and `summary` are passed as input to tokenize.
314
+ - `max_length` specifies the maximum length of the tokenized sequences.
315
+ - `padding='max_length'` pads sequences to the maximum length specified.
316
+ - `return_tensors='pt'` returns PyTorch tensors.
317
+ - `truncation=True` truncates sequences that exceed the maximum length.
318
+ - The resulting tokenized sequences are stored in `source` and `target` dictionaries.
319
 
320
+ This code snippet tokenizes the dialogue and summary texts using the tokenizer, ensuring that they are formatted appropriately for input to the model during training. The tokenized sequences are then padded and truncated as necessary before being returned as tensors.
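For completeness, here is a hedged usage sketch of the `SummaryDataset` class defined above; the DataFrame contents and batch size are illustrative:

```python
# Build the dataset from a DataFrame with 'dialogue' and 'summary' columns
# and wrap it in a DataLoader for training.
import pandas as pd
from torch.utils.data import DataLoader
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

data = pd.DataFrame({
    "dialogue": ["Amy: Are we still on for dinner?\nBen: Yes, see you at 7."],
    "summary": ["Amy and Ben confirm dinner at 7."],
})

train_dataset = SummaryDataset(tokenizer, data, max_length=512)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

example = train_dataset[0]
print(example["input_ids"].shape, example["labels"].shape)  # both padded to 512 tokens
```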
321
 
322
+ **NOTE:**
323
 
324
+ The `encode` and `encode_plus` methods are both provided by the Hugging Face tokenizers library, used for encoding text inputs into numerical representations suitable for input to transformer-based models like BART.
325
 
326
+ Here's the difference between the two methods:
327
 
328
+ 1. **`encode` Method**:
329
+ - The `encode` method is used to encode a single text input.
330
+ - It takes the input text and converts it into a sequence of token IDs.
331
+ - The returned output is a list of token IDs representing the input text.
332
+ - Additional parameters like `max_length` and `truncation` can be specified to control the maximum length of the encoded sequence and whether or not truncation should be applied if the input text exceeds this length.
333
 
334
+ 2. **`encode_plus` Method**:
335
+ - The `encode_plus` method is used to encode multiple text inputs or to include additional information such as attention masks.
336
+ - In addition to encoding the input text, it also performs other tasks such as padding, truncation, and generating attention masks.
337
+ - It returns a dictionary containing the encoded inputs along with attention masks, token type IDs (for models that use segment embeddings), and other optional parameters.
338
+ - This method provides more flexibility and control compared to `encode`, as it allows for the inclusion of additional information and customization of the encoding process.
339
 
340
+ In summary, while both methods are used for encoding text inputs, `encode_plus` offers more functionality and control by providing additional features such as padding, truncation, and attention masks, making it suitable for more complex encoding tasks and scenarios involving multiple text inputs.
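A quick way to see the difference is to run both methods on the same sentence (a sketch; the exact token IDs depend on the tokenizer's vocabulary):

```python
# encode returns a plain list of IDs; encode_plus returns a dict-like object.
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
text = "Hello world"

ids = tokenizer.encode(text)
enc = tokenizer.encode_plus(text, max_length=8, padding="max_length", truncation=True)

print(ids)               # e.g. [0, 31414, 232, 2] -> <s> Hello world </s>
print(list(enc.keys()))  # ['input_ids', 'attention_mask']
print(enc["input_ids"])  # the same IDs, padded out to max_length
```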
341
 
342
+ **ATTENTION MASK IN `encode_plus`**
343
 
344
+ The attention mask is a binary tensor used in transformer-based models like BART to indicate which tokens should be attended to and which ones should be ignored during the self-attention mechanism.
345
 
346
+ In the `encode_plus` method of the Hugging Face tokenizers library, the attention mask is automatically generated based on the encoded input sequence. It has the same length as the input sequence and consists of 1s and 0s. Here's what the attention mask indicates:
347
 
348
+ - **1**: Tokens that should be attended to by the model.
349
+ - **0**: Tokens that should be ignored (masked) by the model.
350
 
351
+ The attention mask helps the model focus on the relevant tokens in the input sequence while ignoring the padding tokens. This is particularly important when dealing with input sequences of varying lengths, as it ensures that the model doesn't pay attention to padded tokens, which don't contain meaningful information.
 
 
 
 
352
 
353
+ For example, consider an input sequence `<s> Hello world </s> <pad>` (BART uses `<s>`, `</s>`, and `<pad>` where BERT would use `[CLS]`, `[SEP]`, and `[PAD]`). The attention mask for this sequence would be `[1, 1, 1, 1, 0]`, indicating that the model should attend to the first four tokens and ignore the padding token (`<pad>`).
354
 
355
+ In summary, the attention mask helps improve the efficiency and effectiveness of transformer-based models by guiding their attention mechanism to focus on the relevant parts of the input sequence while disregarding padding tokens.
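This behaviour is easy to verify directly (a small sketch; the exact number of real tokens depends on how the words are split):

```python
# Padding a short input produces trailing 0s in the attention mask.
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

enc = tokenizer.encode_plus(
    "Hello world",
    max_length=6,
    padding="max_length",
    truncation=True,
)
print(enc["input_ids"])       # real token IDs followed by pad token IDs
print(enc["attention_mask"])  # e.g. [1, 1, 1, 1, 0, 0]
```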
356
 
357
+ ### Fine-tune BART base model
358
 
359
+ **Training Arguments**
360
 
361
+ ```python
362
+ from transformers import TrainingArguments
363
 
364
+ # Define training arguments for the model
365
+ training_args = TrainingArguments(
366
+ output_dir='./results', # Directory to save model output and checkpoints
367
+ num_train_epochs=2, # Number of epochs to train the model
368
+ per_device_train_batch_size=8, # Batch size per device during training
369
+ per_device_eval_batch_size=8, # Batch size for evaluation
370
+ warmup_steps=500, # Number of warmup steps for learning rate scheduler
371
+ weight_decay=0.01, # Weight decay for regularization
372
+ logging_dir='./logs', # Directory to save logs
373
+ logging_steps=10, # Log metrics every specified number of steps
374
+ evaluation_strategy="epoch", # Evaluation is done at the end of each epoch
375
+ report_to='none' # Disables reporting to any online services (e.g., TensorBoard, WandB)
376
+ )
377
+ ```
378
 
379
+ - **`output_dir='./results'`**: Specifies the directory where model checkpoints and output files will be saved during training.
380
+
381
+ - **`num_train_epochs=2`**: Defines the number of epochs for training the model, indicating how many times the entire training dataset will be passed through the model.
382
 
383
+ - **`per_device_train_batch_size=8`**: Sets the batch size per device (e.g., GPU) during training, controlling the number of training samples processed simultaneously on each device.
384
 
385
+ - **`per_device_eval_batch_size=8`**: Defines the batch size per device for evaluation, indicating the number of samples evaluated simultaneously on each device during model evaluation.
386
 
387
+ - **`warmup_steps=500`**: Specifies the number of warmup steps for the learning rate scheduler, determining the initial optimization steps during which the learning rate increases gradually.
388
 
389
+ - **`weight_decay=0.01`**: Sets the weight decay coefficient for regularization, controlling the amount of regularization applied to the model's weights during optimization.
390
 
391
+ - **`logging_dir='./logs'`**: Defines the directory where logs, including training metrics and evaluation results, will be saved during training.
392
 
393
+ - **`logging_steps=10`**: Specifies the frequency at which training metrics will be logged, indicating how often (in number of steps) metrics will be recorded during training.
394
 
395
+ - **`evaluation_strategy="epoch"`**: Determines the evaluation strategy, specifying whether evaluation will be performed at the end of each epoch or at specified intervals of steps.
396
 
397
+ - **`report_to='none'`**: Specifies where training progress and results will be reported, with "none" indicating that reporting to any online services (e.g., TensorBoard, Weights & Biases) is disabled.
398
 
 
399
 
400
+ **Issue and Solution**
401
 
402
+ Issue:
403
+ ```
404
+ ImportError Traceback (most recent call last)
405
+ in <cell line: 1>()
406
+ ----> 1 training_args = TrainingArguments(output_dir="test-trainer")
407
 
408
+ 4 frames
409
+ /usr/local/lib/python3.10/dist-packages/transformers/training_args.py in _setup_devices(self)
410
+ 1670 if not is_sagemaker_mp_enabled():
411
+ 1671 if not is_accelerate_available(min_version="0.20.1"):
412
+ → 1672 raise ImportError(
413
+ 1673 "Using the Trainer with PyTorch requires accelerate>=0.20.1: Please run pip install transformers[torch] or pip install accelerate -U"
414
+ 1674 )
415
 
416
+ ImportError: Using the Trainer with PyTorch requires accelerate>=0.20.1: Please run pip install transformers[torch] or pip install accelerate -U
417
 
418
+ NOTE: If your import is failing due to a missing package, you can
419
+ manually install dependencies using either !pip or !apt.
420
 
421
+ To view examples of installing some common dependencies, click the
422
+ "Open Examples" button below.
423
+ ```
424
 
425
+ Solution:
426
+ Install or upgrade `accelerate` (for example, `pip install accelerate -U` or `pip install transformers[torch]`) and restart the runtime, as described in [TrainingArgument does not work on colab](https://discuss.huggingface.co/t/trainingargument-does-not-work-on-colab/43372/3).
427
 
428
+
429
+ **Training Process**
430
+
431
+ The code initializes a Trainer object with the specified model, training arguments, and datasets. It then starts the training process by calling the `train()` method on the Trainer object.
432
+
433
+ ```python
434
+ # Initializing the Trainer object
435
+ trainer = Trainer(
436
+ model=model, # The model to be trained
437
+ args=training_args, # Training arguments
438
+ train_dataset=train_dataset, # Training dataset
439
+ eval_dataset=eval_dataset # Evaluation dataset
440
+ )
441
+
442
+ # Starting the training process
443
+ trainer.train()
444
+ ```
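After training finishes, the fine-tuned weights can be persisted so the model is reusable for evaluation and inference. A hedged sketch, continuing from the `trainer` and `tokenizer` above (the directory name is illustrative):

```python
# Save the fine-tuned model and tokenizer to a local directory.
output_dir = "./bart-base-dialogue-summarizer"
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

# The checkpoint can later be reloaded like any pre-trained model:
# model = BartForConditionalGeneration.from_pretrained(output_dir)
```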
445
+
446
+ **Training Result**
447
+
448
+ | Epoch | Training Loss | Validation Loss |
449
+ |-------|---------------|-----------------|
450
+ | 1 | 0.095800 | 0.084215 |
451
+ | 2 | 0.075400 | 0.081112 |
452
+
453
+ This table provides a summary of the training and validation losses for each epoch during the training process.
454
+
455
+ ## Evaluation with ROUGE Score
456
+
457
+ | Metric | CI bound | Precision | Recall | F-Measure |
458
+ |----------|-----------|-----------|--------|-----------|
459
+ | rouge1 | low | 0.5203 | 0.4547 | 0.4632 |
460
+ | | mid | 0.5354 | 0.4689 | 0.4753 |
461
+ | | high | 0.5502 | 0.4824 | 0.4874 |
462
+ | rouge2 | low | 0.2507 | 0.2160 | 0.2205 |
463
+ | | mid | 0.2656 | 0.2292 | 0.2331 |
464
+ | | high | 0.2808 | 0.2428 | 0.2459 |
465
+ | rougeL | low | 0.4318 | 0.3784 | 0.3843 |
466
+ | | mid | 0.4465 | 0.3907 | 0.3964 |
467
+ | | high | 0.4613 | 0.4039 | 0.4090 |
468
+ | rougeLsum| low | 0.4324 | 0.3770 | 0.3830 |
469
+ | | mid | 0.4463 | 0.3903 | 0.3960 |
470
+ | | high | 0.4616 | 0.4031 | 0.4075 |
471
+
472
+ Now, let's interpret the scores:
473
+
474
+ - **Precision:** The proportion of n-grams in the generated summary that also appear in the reference summary. Higher precision indicates that less of the generated text is irrelevant to the reference.
475
+ - **Recall:** The proportion of n-grams in the reference summary that are captured by the generated summary. Higher recall indicates that more of the reference content is covered.
476
+ - **F-Measure:** It is the harmonic mean of precision and recall. It provides a balance between precision and recall. A higher F-measure indicates a better balance between precision and recall.
477
+
478
+ Interpreting the scores:
479
+ - Higher scores are generally considered better for all metrics.
480
+ - The low, mid, and high rows are not performance tiers; they are the lower bound, point estimate, and upper bound of a bootstrap confidence interval, so the mid value is the one normally reported.
481
+ - A score closer to 1 is desired for all metrics, indicating better performance.
482
+
483
+ Overall, based on these scores, the performance of the fine-tuned BART base model for summarization can be considered reasonably good (a mid ROUGE-1 F-measure of about 0.48). However, there is still room for improvement, particularly in capturing more relevant information (recall) while maintaining precision.
484
+
485
+ **NOTE - ROUGE Score:**
486
+
487
+ ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate the quality of summaries by comparing them to reference or gold-standard summaries. ROUGE scores measure the overlap of n-grams (contiguous sequences of n words) between the generated summary and the reference summary.
488
+
489
+ There are several variants of ROUGE metrics, including ROUGE-N, ROUGE-L, and ROUGE-W. Here's a brief explanation of each:
490
+
491
+ 1. **ROUGE-N (ROUGE-Ngram)**: This metric computes the overlap of n-grams between the generated summary and the reference summary. ROUGE-N scores are calculated for different values of n (e.g., unigrams, bigrams, trigrams) to capture different levels of phrase overlap. For example, if you calculate ROUGE-2 (bigrams), it counts how many pairs of adjacent words appear in both your summary and your friend's. If you both mention the same key phrases or use similar wording, you'll get a higher ROUGE-N score.
492
+
493
+ - Example: Let's say your summary includes the phrase "thrilling action scenes," and your friend's summary mentions "exciting action scenes." Both summaries contain the bigram "action scenes," so they'll get a higher ROUGE-2 score.
494
+
495
+ 2. **ROUGE-L (ROUGE-Longest Common Subsequence)**: This metric measures the longest common subsequence (LCS) between the generated summary and the reference summary. It considers the longest sequence of words that appears in both the generated and reference summaries.
496
+
497
+ - Example: If your summary says "The movie features amazing special effects," and your friend's summary says "Special effects in the movie are stunning," the longest common sequence is "special effects," so you'll get a high ROUGE-L score.
498
+
499
+ 3. **ROUGE-W (Weighted Longest Common Subsequence)**: This metric is similar to ROUGE-L, but it weights the longest common subsequence so that consecutive matching words count for more than the same words matched far apart.
500
+
501
+ - Example: If your summary says "The movie was great, with fantastic visuals," and your friend's summary says "Visuals in the movie were fantastic," ROUGE-W would give weight to the words "fantastic" and "visuals" appearing together in both summaries, even though the order is slightly different.
502
+
503
+ ROUGE scores are typically reported as F1 scores, which are harmonic means of precision and recall. A higher ROUGE score indicates better agreement between the generated summary and the reference summary, with perfect agreement resulting in a score of 1.0.
504
+
505
+ ROUGE scores are widely used in the evaluation of text summarization systems and other natural language processing tasks where automatic evaluation of generated text is required. They provide objective measures of summary quality that can be used to compare different summarization models or tuning parameters.
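For reference, a ROUGE score for a single generated/reference pair can be computed with the `rouge_score` package (a minimal sketch; the sentences are illustrative, and the low/mid/high values in the table above come from aggregating many such pairs with bootstrap resampling):

```python
# Compute ROUGE-1/2/L for one prediction against one reference.
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "Mr. Smith gets a check-up and the doctor advises him to quit smoking."
generated = "The doctor gives Mr. Smith a check-up and tells him to stop smoking."

scores = scorer.score(reference, generated)
for name, score in scores.items():
    print(f"{name}: P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")
```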
506
+
507
+ ## Inference
508
+
509
+ The test conversation:
510
+ ```
511
+ Web Developer (You): Hey, I just launched a new website with some exciting features. Would you like to check it out? Machine Learning Enthusiast: That sounds interesting! I'd love to see how you've integrated machine learning into it. Computer Science Student: Speaking of machine learning, have you heard about the latest breakthroughs in natural language processing? Science Enthusiast: Yes, I've been following those developments closely. It's amazing how AI is transforming language understanding. Mathematics Enthusiast: Absolutely! The mathematical foundations of deep learning play a crucial role in these advancements. News Enthusiast: By the way, did you catch the latest headlines? There's a lot happening in the world right now. Web Developer (You): I did! In fact, my website can recommend personalized news articles based on user preferences. Clinical Medical Assistant: That's impressive! Speaking of recommendations, have you worked on any projects related to healthcare? Machine Learning Enthusiast: Yes, I did a project on hybrid acoustic and facial emotion recognition, which could have applications in mental health. Computer Science Student: That's fascinating! It's incredible how our interests and expertise intersect across various fields of study and technology.
512
+ ```
513
+
514
+ Generated Summary:
515
+ ```
516
+ Web Developer (You) introduces the latest breakthroughs in natural language processing to Computer Science Student. Science Enthusiast thinks AI is transforming language understanding and the mathematical foundations of deep learning play a crucial role in these advancements. Web Developer's website can recommend personalized news articles based on user preferences.
517
+ ```
518
+
519
+ Overall, while the summary provides a condensed version of the conversation, it could be more comprehensive and more coherent.
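For completeness, the summary above could be reproduced with a snippet along these lines (a hedged sketch; the checkpoint path is illustrative and the generation settings mirror the earlier base-model example):

```python
# Summarize the test conversation with the fine-tuned checkpoint.
from transformers import BartTokenizer, BartForConditionalGeneration

checkpoint = "./bart-base-dialogue-summarizer"  # wherever the fine-tuned model was saved
tokenizer = BartTokenizer.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint)

conversation = "Web Developer (You): Hey, I just launched a new website ..."  # full test conversation from above

input_ids = tokenizer.encode(conversation, return_tensors="pt", max_length=1024, truncation=True)
summary_ids = model.generate(input_ids, max_length=150, min_length=40, num_beams=4, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```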
520
 
521