TIGER-Lab/TIGERScore-13B

@@ -24,9 +24,8 @@ The models are fine-tuned with the MetricInstruct dataset using the original Lla
 TIGERScore significantly surpasses traditional metrics, i.e. BLEU, ROUGE, BARTScore, and BLEURT, and emerging LLM-based metrics as reference-free metrics. Though our dataset was originally sourced from ChatGPT, our distilled model actually outperforms ChatGPT itself, which proves the effectiveness of our filtering strategy. On the unseen task of story generation, TIGERScore also demonstrates reasonable generalization capability.
-| Tasks→                                    | Summarization  | Translation    | Data2Text      | Long-form QA    | MathQA         | Inst-Fol       | Story-Gen      | Average        |
 |-------------------------------------------|----------------|----------------|----------------|-----------------|----------------|----------------|----------------|----------------|
-| Metrics↓ Datasets→                        | SummaEval      | WMT22-zh-en    | WebNLG2020     | ASQA+           | gsm8k          | LIMA+          | ROC            |                |
 | GPT-3.5-turbo (few-shot)                  | **38.50**      | 40.53          | 40.20          | 29.33           | **66.46**      | 23.20          | 4.77           | 34.71          |
 | GPT-4 (zero-shot)                         | 36.46          | **43.87**      | **44.04**      | **48.95**       | 51.71          | **58.53**      | **32.48**      | **45.15**      |
 | BLEU                                      | 11.98          | 19.73          | 33.29          | 11.38           | 21.12          | **46.61**      | -1.17          | 20.42          |
@@ -48,7 +47,6 @@ TIGERScore significantly surpasses traditional metrics, i.e. BLUE, ROUGE, BARTSc
 | TIGERScore-13B (ours)                     | 36.81          | 44.99          | **45.88**      | 46.22           | **23.32**      | **47.03**      | **46.36**      | **41.52**      |
 | Δ (ours - best reference-free)            | -2             | -3             | +12            | +5              | +9             | +14            | +13            | +16            |
 ## Formatting

 TIGERScore significantly surpasses traditional metrics, i.e. BLEU, ROUGE, BARTScore, and BLEURT, and emerging LLM-based metrics as reference-free metrics. Though our dataset was originally sourced from ChatGPT, our distilled model actually outperforms ChatGPT itself, which proves the effectiveness of our filtering strategy. On the unseen task of story generation, TIGERScore also demonstrates reasonable generalization capability.
+| Tasks→                                    | Summarization  | Translation    | Data2Text      | Long-form QA    | MathQA         | Instruction Following   | Story-Gen      | Average        |
 |-------------------------------------------|----------------|----------------|----------------|-----------------|----------------|----------------|----------------|----------------|
 | GPT-3.5-turbo (few-shot)                  | **38.50**      | 40.53          | 40.20          | 29.33           | **66.46**      | 23.20          | 4.77           | 34.71          |
 | GPT-4 (zero-shot)                         | 36.46          | **43.87**      | **44.04**      | **48.95**       | 51.71          | **58.53**      | **32.48**      | **45.15**      |
 | BLEU                                      | 11.98          | 19.73          | 33.29          | 11.38           | 21.12          | **46.61**      | -1.17          | 20.42          |
 | TIGERScore-13B (ours)                     | 36.81          | 44.99          | **45.88**      | 46.22           | **23.32**      | **47.03**      | **46.36**      | **41.52**      |
 | Δ (ours - best reference-free)            | -2             | -3             | +12            | +5              | +9             | +14            | +13            | +16            |
 ## Formatting