krutrim-admin committed · Commit ef469a2 · verified · 1 Parent(s): fb9bab0

updated eval tables

Files changed (1)
  1. README.md +21 -21
README.md CHANGED
@@ -93,32 +93,32 @@ We use the LM Evaluation Harness to evaluate our model on the En benchmarks task
 
 ### Indic Benchmarks
 
- | Benchmark | Metric | Krutrim-1-7B | MN-12B-Instruct | Krutrim-2-12B | llama-3.1-8B | llama-3.3-70B | Gemini-1.5 Flash | GPT-4o |
- |--------------------------------------------|------------|--------------|----------------|--------------|--------------|--------------|----------------|--------|
- | IndicSentiment (0-shot) | Accuracy | 0.65 | 0.70 | 0.95 | 0.05 | 0.96 | 0.99 | 0.98 |
- | IndicCOPA (0-shot) | Accuracy | 0.51 | 0.58 | 0.80 | 0.48 | 0.83 | 0.88 | 0.91 |
- | IndicXParaphrase (0-shot) | Accuracy | 0.67 | 0.74 | 0.88 | 0.75 | 0.87 | 0.89 | TBD |
- | IndicXNLI (0-shot) | Accuracy | 0.47 | 0.54 | 0.55 | 0.00 | TBD | TBD | TBD |
- | IndicQA (0-shot) | Bert Score | 0.90 | 0.90 | 0.91 | TBD | TBD | TBD | TBD |
- | CrossSumIN (1-shot) | chrF++ | 0.04 | 0.17 | 0.21 | 0.21 | 0.26 | 0.24 | TBD |
- | FloresIN Translation xx-en (1-shot) | chrF++ | 0.54 | 0.50 | 0.58 | 0.54 | 0.60 | 0.62 | 0.63 |
- | FloresIN Translation en-xx (1-shot) | chrF++ | 0.41 | 0.34 | 0.48 | 0.37 | 0.46 | 0.47 | 0.48 |
- | IN22 Translation xx-en (0-shot) | chrF++ | 0.50 | 0.48 | 0.57 | 0.49 | 0.58 | 0.55 | TBD |
- | IN22 Translation en-xx (0-shot) | chrF++ | 0.36 | 0.33 | 0.45 | 0.32 | 0.42 | 0.44 | TBD |
+ | Benchmark | Metric | Krutrim-1-7B | MN-12B-Instruct | Krutrim-2-12B | llama-3.3-70B | Gemini-1.5 Flash | GPT-4o |
+ |--------------------------------------------|------------|--------------|----------------|--------------|--------------|----------------|--------|
+ | IndicSentiment (0-shot) | Accuracy | 0.65 | 0.70 | 0.95 | 0.96 | 0.99 | 0.98 |
+ | IndicCOPA (0-shot) | Accuracy | 0.51 | 0.58 | 0.80 | 0.83 | 0.88 | 0.91 |
+ | IndicXParaphrase (0-shot) | Accuracy | 0.67 | 0.74 | 0.88 | 0.87 | 0.89 | TBD |
+ | IndicXNLI (0-shot) | Accuracy | 0.47 | 0.54 | 0.55 | TBD | TBD | 0.67 |
+ | IndicQA (0-shot) | Bert Score | 0.90 | 0.90 | 0.91 | TBD | TBD | TBD |
+ | CrossSumIN (1-shot) | chrF++ | 0.04 | 0.17 | 0.21 | 0.26 | 0.24 | TBD |
+ | FloresIN Translation xx-en (1-shot) | chrF++ | 0.54 | 0.50 | 0.58 | 0.60 | 0.62 | 0.63 |
+ | FloresIN Translation en-xx (1-shot) | chrF++ | 0.41 | 0.34 | 0.48 | 0.46 | 0.47 | 0.48 |
+ | IN22 Translation xx-en (0-shot) | chrF++ | 0.50 | 0.48 | 0.57 | 0.58 | 0.55 | 0.55 |
+ | IN22 Translation en-xx (0-shot) | chrF++ | 0.36 | 0.33 | 0.45 | 0.42 | 0.44 | 0.43 |
 
 
 ### BharatBench
 The existing Indic benchmarks are not natively in Indian languages, rather, they are translations of existing En benchmarks. They do not sufficiently capture the linguistic nuances of Indian languages and aspects of Indian culture. Towards that Krutrim released BharatBench - a natively Indic benchmark that encompasses the linguistic and cultural diversity of the Indic region, ensuring that the evaluations are relevant and representative of real-world use cases in India.
 
- | Benchmark | Metric | Krutrim-1-7B | MN-12B-Instruct | Krutrim-2-12B | llama-3.1-8B-Instruct | llama-3.1-70B-Instruct | Gemma-2-27B-Instruct | GPT-4o |
- |-------------------------------------|------------|--------------|-----------------|---------------|------------------------|------------------------|---------------------|--------|
- | Indian Cultural Context (0-shot) | Bert Score | 0.86 | 0.56 | 0.88 | 0.87 | 0.88 | 0.87 | 0.89 |
- | Grammar Correction (5-shot) | Bert Score | 0.96 | 0.94 | 0.98 | 0.95 | 0.98 | 0.96 | 0.97 |
- | Multi Turn (0-shot) | Bert Score | 0.88 | 0.87 | 0.91 | 0.88 | 0.90 | 0.89 | 0.92 |
- | Multi Turn Comprehension (0-shot) | Bert Score | 0.90 | 0.89 | 0.92 | 0.92 | 0.93 | 0.91 | 0.94 |
- | Multi Turn Translation (0-shot) | Bert Score | 0.85 | 0.87 | 0.92 | 0.89 | 0.91 | 0.91 | 0.92 |
- | Text Classification (5-shot) | Accuracy | 0.61 | 0.71 | 0.76 | 0.72 | 0.88 | 0.86 | 0.89 |
- | Named Entity Recognition (5-shot) | Accuracy | 0.31 | 0.51 | 0.53 | 0.55 | 0.61 | 0.65 | 0.65 |
+ | Benchmark | Metric | Krutrim-1-7B | MN-12B-Instruct | Krutrim-2-12B | llama-3.1-70B | Gemma-2-27B | GPT-4o |
+ |-------------------------------------|------------|--------------|-----------------|---------------|--------------|-------------|--------|
+ | Indian Cultural Context (0-shot) | Bert Score | 0.86 | 0.56 | 0.88 | 0.88 | 0.87 | 0.89 |
+ | Grammar Correction (5-shot) | Bert Score | 0.96 | 0.94 | 0.98 | 0.98 | 0.96 | 0.97 |
+ | Multi Turn (0-shot) | Bert Score | 0.88 | 0.87 | 0.91 | 0.90 | 0.89 | 0.92 |
+ | Multi Turn Comprehension (0-shot) | Bert Score | 0.90 | 0.89 | 0.92 | 0.93 | 0.91 | 0.94 |
+ | Multi Turn Translation (0-shot) | Bert Score | 0.85 | 0.87 | 0.92 | 0.91 | 0.91 | 0.92 |
+ | Text Classification (5-shot) | Accuracy | 0.61 | 0.71 | 0.76 | 0.88 | 0.86 | 0.89 |
+ | Named Entity Recognition (5-shot) | Accuracy | 0.31 | 0.51 | 0.53 | 0.61 | 0.65 | 0.65 |
 
 ### Qualitative Results
 Below are the results from manual evaluation of prompt-response pairs across languages and task categories. Scores are between 1-5 (higher the better). Model names were anonymised during the evaluation.
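
The hunk above notes that the En benchmarks were run with the LM Evaluation Harness. As a rough, unofficial sketch of what such a run could look like via the harness's Python API (the model id, dtype, and task name below are placeholders, not taken from this commit):

```python
# Hypothetical sketch of an LM Evaluation Harness run (lm-eval >= 0.4).
# The model id and task are placeholders; the actual task list used in the
# README is not specified in this diff.
import lm_eval

MODEL_ID = "krutrim-ai-labs/Krutrim-2-instruct"  # placeholder HF repo id

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face transformers backend
    model_args=f"pretrained={MODEL_ID},dtype=bfloat16",
    tasks=["hellaswag"],                          # example En task, not the exact suite
    num_fewshot=0,
    batch_size=8,
)

# Print per-task metrics (accuracy, normalised accuracy, etc.)
for task, metrics in results["results"].items():
    print(task, metrics)
```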
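
For the generation-style rows, the tables report chrF++ and Bert Score. The snippet below is a minimal, hypothetical illustration of computing both metrics with the sacrebleu and bert-score packages; the example sentences are invented, and the 0-1 scaling of chrF++ is an assumption about how the table values were normalised.

```python
# Hypothetical illustration of the two generation metrics in the tables above.
from sacrebleu.metrics import CHRF
from bert_score import score as bert_score

predictions = ["भारत की राजधानी नई दिल्ली है।"]  # invented model output
references  = ["भारत की राजधानी नई दिल्ली है।"]  # invented reference

# chrF++ = chrF with word bigrams (word_order=2). sacrebleu reports it on a
# 0-100 scale; dividing by 100 matches the 0-1 values the tables appear to use.
chrf = CHRF(word_order=2)
print("chrF++:", chrf.corpus_score(predictions, [references]).score / 100)

# BERTScore F1; for non-English text the package defaults to a multilingual model.
P, R, F1 = bert_score(predictions, references, lang="hi")
print("BERTScore F1:", F1.mean().item())
```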