---
language: cs
license: cc-by-nc-sa-4.0
datasets:
- csTenTen17
---

# CzeGPT-2_headline_generator

CzeGPT-2_headline_generator is a Czech headline generator built upon the <a href="https://huggingface.co/MU-NLPC/CzeGPT-2">CzeGPT-2</a> model. It has the same architectural dimensions as GPT-2 small (12 layers, 12 heads, a 1024-token input/output window, and 768-dimensional embedding vectors), resulting in 124M trainable parameters. It was fine-tuned and evaluated on the <a href="https://aclanthology.org/L18-1551.pdf">SumeCzech</a> summarization dataset, which contains about 1M Czech news articles.
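
As a quick sanity check, the dimensions above can be read directly from the model config. A minimal sketch with Hugging Face Transformers, assuming the model ID `MU-NLPC/CzeGPT-2_headline_generator` (inferred from this card's name):

```python
# Minimal sketch: load the model and verify the dimensions listed above.
# The model ID is an assumption inferred from this card's name.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("MU-NLPC/CzeGPT-2_headline_generator")

print(model.config.n_layer)      # 12 layers
print(model.config.n_head)       # 12 attention heads
print(model.config.n_positions)  # 1024-token input/output window
print(model.config.n_embd)       # 768-dimensional embeddings
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")  # ~124M
```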

## Tokenizer

Alongside the model, we also provide a Czech-trained tokenizer (vocab and merges) with a vocabulary size of 50257, which was used during both pre-training and fine-tuning. It is a byte-level BPE tokenizer, as used in the original GPT-2 paper.
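
A minimal round-trip sketch of the tokenizer; the model ID is again an assumption inferred from this card's name (the vocab and merges can equally be loaded from the base <a href="https://huggingface.co/MU-NLPC/CzeGPT-2">CzeGPT-2</a> repository):

```python
# Minimal sketch: byte-level BPE tokenization round-trip on Czech text.
# The model ID is an assumption inferred from this card's name.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("MU-NLPC/CzeGPT-2_headline_generator")

text = "Vláda dnes schválila nový státní rozpočet."
ids = tokenizer.encode(text)

print(tokenizer.vocab_size)   # 50257
print(ids)                    # token ids
print(tokenizer.decode(ids))  # byte-level BPE round-trips back to the text
```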

## Training results

The model was evaluated on the *test* and *ood-test* partitions of the SumeCzech dataset and compared to the best summarizers evaluated on this benchmark so far (results taken from <a href="https://ufal.mff.cuni.cz/sumeczech">here</a>).
The headline generator is trained to decide on its own when to stop (by generating an `<|endoftext|>` token). If you want summaries of variable length, refer to our <a href="https://huggingface.co/MU-NLPC/CzeGPT-2_summarizer">summary generator</a>.

We manage to exceed the current state of the art on all standard metrics.

Test set

| Model | ROUGE<sub>RAW</sub>-1 | ROUGE<sub>RAW</sub>-2 | ROUGE<sub>RAW</sub>-L |
| :---: | :------: | :-----: | :-----: |
| CzeGPT-2 | **17.3**/**17.0**/**16.7** | **4.4**/**4.3**/**4.2** | **15.5**/**15.2**/**14.9** |
| First | 7.4/13.5/8.9 | 1.1/2.2/1.3 | 6.5/11.7/7.7 |
| TextRank | 6.0/16.5/8.3 | 0.8/2.3/1.1 | 5.0/13.8/6.9 |
| Tensor2Tensor | 8.8/7.0/7.5 | 0.8/0.6/0.7 | 8.1/6.5/7.0 |
| NE Density | 6.6/10.7/7.3 | 0.8/1.4/0.9 | 5.9/9.4/6.4 |
| Seq2Seq | 16.1/14.1/14.6 | 2.5/2.1/2.2 | 14.6/12.8/13.2 |
| Seq2Seq<sub>NER</sub> | 16.2/14.1/14.7 | 2.5/2.1/2.2 | 14.7/12.8/13.3 |

OOD test set

| Model | ROUGE<sub>RAW</sub>-1 | ROUGE<sub>RAW</sub>-2 | ROUGE<sub>RAW</sub>-L |
| :---: | :------: | :-----: | :-----: |
| CzeGPT-2 | **17.9**/**17.6**/**17.2** | **5.9**/**5.7**/**5.5** | **16.4**/**16.2**/**15.8** |
| First | 6.7/13.6/8.3 | 1.3/2.8/1.6 | 5.9/12.0/7.4 |
| TextRank | 5.8/16.9/8.1 | 1.1/3.4/1.5 | 5.0/14.5/6.9 |
| Tensor2Tensor | 6.3/5.1/5.5 | 0.5/0.4/0.4 | 5.9/4.8/5.1 |
| NE Density | 6.3/11.4/7.1 | 1.3/2.3/1.4 | 5.7/10.2/6.3 |
| Seq2Seq | 13.1/11.8/12.0 | 2.0/1.7/1.8 | 12.1/11.0/11.2 |
| Seq2Seq<sub>NER</sub> | 16.2/14.1/14.7 | 2.5/2.1/2.2 | 14.7/12.8/13.3 |

The numbers in the tables denote *precision/recall/F1-score*.

## Error Analysis

Since we believe the current standard ROUGE<sub>RAW</sub> metric is not well suited to the summarization task (even though it is the best we have at the moment), we also performed a manual error analysis of the generated summaries using human annotators. You can find more about the methodology and results in our paper, referenced at the bottom of this card.

## Running the predictions

The repository includes a simple Jupyter Notebook that can help with the first steps of using the model. (#TODO)
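
Until the notebook is added, the sketch below shows one plausible way to generate a headline. The model ID and the decoding settings are assumptions (this card does not specify the exact input format or generation parameters), so treat it as a starting point rather than the reference setup:

```python
# A plausible generation sketch, not the reference setup: the model ID is
# inferred from this card's name, and the decoding parameters are illustrative.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_id = "MU-NLPC/CzeGPT-2_headline_generator"
tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)

article = "Plný text českého novinového článku..."

# Truncate the article so the headline still fits into the 1024-token window.
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=896)

# The model decides on its own when to stop by emitting <|endoftext|>.
output = model.generate(
    **inputs,
    max_length=1024,
    num_beams=4,
    no_repeat_ngram_size=3,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
)

# Keep only the tokens generated after the article.
headline = tokenizer.decode(
    output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(headline)
```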

## Summary generator

See also our model fine-tuned for the <a href="https://huggingface.co/MU-NLPC/CzeGPT-2_summarizer">summary generation task</a>.

## How to cite

    @unpublished{hajek_horak2022,
        author = "Adam Hájek and Aleš Horák",
        title = "CzeGPT-2 – New Model for Czech Summarization Task",
        note = "preprint available at \url{https://openreview.net/forum?id=H43eQtxZefq}",
        month = "3",
        year = "2022",
    }