---
language:
- en
- ja
library_name: transformers
pipeline_tag: text-generation
license: llama3.1
model_type: llama
---

# Llama3.1 Swallow

Our Swallow model has undergone continual pre-training from the [Llama 3.1 family](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f), primarily with the addition of Japanese language data. The Instruct versions use supervised fine-tuning (SFT). Links to other models can be found in the Swallow Model Index below.

# Model Release Updates

We are excited to share the release schedule for our latest models:
- **October 08, 2024**: Released the [Llama-3.1-Swallow-8B-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-v0.1), [Llama-3.1-Swallow-8B-Instruct-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1), [Llama-3.1-Swallow-70B-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-v0.1), and [Llama-3.1-Swallow-70B-Instruct-v0.1](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.1).

## Swallow Model Index

|Model|Llama-3.1-Swallow|Llama-3.1-Swallow-Instruct|
|---|---|---|
|8B| [Link](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-v0.1) | [Link](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.1) |
|70B| [Link](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-v0.1) | [Link](https://huggingface.co/tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.1) |

![logo](./logo.png)

This repository provides large language models developed by [Swallow-LLM](https://swallow-llm.github.io/).

## Model Details

* **Model type**: Please refer to the [Llama 3.1 MODEL_CARD](https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md) for details on the model architecture.
* **Language(s)**: Japanese, English
* **Library**: [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
* **Tokenizer**: Please refer to the [Llama 3.1 blog](https://ai.meta.com/blog/meta-llama-3-1) for details on the tokenizer.
* **Contact**: swallow[at]nlp.c.titech.ac.jp

## Model Performance

### Japanese tasks

|Model|JCom.|JEMHopQA|NIILC|JSQuAD|XL-Sum|MGSM|WMT20-en-ja|WMT20-ja-en|JMMLU|JHumanEval|Ja Avg|
|---|---|---|---|---|---|---|---|---|---|---|---|
| |4-shot|4-shot|4-shot|4-shot|1-shot|4-shot|4-shot|4-shot|5-shot|0-shot| |
| |EM acc|Char-F1|Char-F1|Char-F1|ROUGE-2|EM acc|BLEU|BLEU|EM acc|pass@1| |
| Gemma 2 27B IT | 0.9562 | 0.5413 | 0.5755 | 0.8832 | 0.1648 | 0.7000 | 0.2900 | 0.2500 | 0.6701 | 0.6293 | 0.5660 |
| Phi-3.5-MoE Instruct | 0.9321 | 0.4416 | 0.4920 | 0.9079 | 0.2255 | 0.7120 | 0.2575 | 0.2024 | 0.6447 | 0.4213 | 0.5237 |
| GRIN-MoE | 0.8606 | 0.4622 | 0.3943 | 0.8877 | 0.0302 | 0.6400 | 0.2300 | 0.1911 | 0.5696 | 0.4476 | 0.4713 |
| KARAKURI LM 70B Chat v0.1 | 0.8847 | 0.5139 | 0.5668 | 0.9096 | 0.1369 | 0.2800 | 0.2526 | 0.2095 | 0.4648 | 0.2354 | 0.4454 |
| Swallow-70b-instruct-v0.1 | 0.9231 | 0.5654 | 0.5751 | 0.9036 | 0.1861 | 0.4160 | 0.2619 | 0.2318 | 0.5727 | 0.2835 | 0.4919 |
| Llama 3 70B Instruct | 0.9419 | 0.6114 | 0.5506 | 0.9164 | 0.1912 | 0.7200 | 0.2708 | 0.2350 | 0.6789 | 0.6610 | 0.5777 |
| Llama 3.1 70B Instruct | 0.9482 | 0.6246 | 0.5781 | 0.9201 | 0.1772 | 0.7440 | 0.2805 | 0.2472 | 0.7323 | 0.6933 | 0.5945 |
| Llama 3 Youko 70B Instruct | 0.9526 | 0.6252 | 0.5853 | 0.9215 | 0.1983 | 0.7400 | 0.2633 | 0.2245 | 0.7170 | 0.6098 | 0.5838 |
| Llama-3.1-70B-Japanese-Instruct-2407 | 0.9562 | 0.6466 | 0.6602 | 0.9187 | 0.1564 | 0.7480 | 0.2901 | 0.2410 | 0.7227 | 0.6274 | 0.5967 |
| Llama 3 heron brain 70B v0.3 | 0.9660 | 0.6643 | 0.6817 | 0.9221 | 0.2611 | 0.7720 | 0.3093 | 0.2578 | 0.7077 | 0.6079 | 0.6150 |
| Llama 3 Swallow 70B Instruct | 0.9607 | 0.6188 | 0.6026 | 0.9236 | 0.1389 | 0.6560 | 0.2724 | 0.2532 | 0.6572 | 0.6000 | 0.5683 |
| Llama 3.1 Swallow 70B Instruct | 0.9598 | 0.6192 | 0.6605 | 0.9235 | 0.1938 | 0.7760 | 0.3123 | 0.2593 | 0.7117 | 0.4713 | 0.5887 |
| Qwen2-72B-Instruct | 0.9634 | 0.6268 | 0.5418 | 0.9210 | 0.1644 | 0.7840 | 0.2592 | 0.2327 | 0.7713 | 0.6909 | 0.5955 |
| Qwen2.5-72B-Instruct | 0.9696 | 0.5699 | 0.5811 | 0.7381 | 0.1706 | 0.8360 | 0.2269 | 0.2179 | 0.7899 | 0.6256 | 0.5726 |
| Mixtral-8x22B-Instruct-v0.1 | 0.9053 | 0.5001 | 0.4609 | 0.9186 | 0.2060 | 0.6760 | 0.2327 | 0.2313 | 0.6094 | 0.5787 | 0.5319 |

### English tasks

|Model|OpenBookQA|TriviaQA|HellaSwag|SQuAD2.0|XWINO|MMLU|GSM8K|BBH|HumanEval|En Avg|
|---|---|---|---|---|---|---|---|---|---|---|
| |4-shot|4-shot|4-shot|4-shot|4-shot|5-shot|4-shot|3-shot|0-shot| |
| |Acc|EM acc|Acc|EM acc|Acc|Acc|EM acc|CoT EM Acc|pass@1| |
| Gemma 2 27B IT | 0.4560 | 0.7660 | 0.6548 | 0.4012 | 0.9101 | 0.7624 | 0.8438 | 0.7876 | 0.6939 | 0.6973 |
| Phi-3.5-MoE Instruct | 0.4960 | 0.6746 | 0.6901 | 0.3174 | 0.8903 | 0.7872 | 0.8317 | 0.7618 | 0.5561 | 0.6673 |
| GRIN-MoE | 0.4660 | 0.7035 | 0.7046 | 0.3544 | 0.8976 | 0.7693 | 0.8287 | 0.7533 | 0.6841 | 0.6846 |
| KARAKURI LM 70B Chat v0.1 | 0.4100 | 0.6873 | 0.6315 | 0.3677 | 0.9049 | 0.5941 | 0.3882 | 0.5724 | 0.2305 | 0.5319 |
| Swallow-70b-instruct-v0.1 | 0.4440 | 0.7411 | 0.6567 | 0.3529 | 0.9166 | 0.6677 | 0.5095 | 0.6661 | 0.2835 | 0.5820 |
| Llama 3 70B Instruct | 0.4400 | 0.7999 | 0.6552 | 0.4024 | 0.9127 | 0.7992 | 0.9052 | 0.8326 | 0.7555 | 0.7225 |
| Llama 3.1 70B Instruct | 0.4300 | 0.8212 | 0.6621 | 0.3921 | 0.9157 | 0.8213 | 0.8764 | 0.8390 | 0.7915 | 0.7277 |
| Llama 3 Youko 70B Instruct | 0.4500 | 0.7973 | 0.6863 | 0.3914 | 0.9153 | 0.8055 | 0.8923 | 0.7814 | 0.6598 | 0.7088 |
| Llama-3.1-70B-Japanese-Instruct-2407 | 0.4220 | 0.8104 | 0.6481 | 0.3744 | 0.9170 | 0.8071 | 0.8893 | 0.8228 | 0.7463 | 0.7153 |
| Llama 3 heron brain 70B v0.3 | 0.4460 | 0.8107 | 0.6682 | 0.4085 | 0.9174 | 0.7898 | 0.8772 | 0.7586 | 0.6713 | 0.7053 |
| Llama 3 Swallow 70B Instruct | 0.4520 | 0.8174 | 0.6758 | 0.4050 | 0.9230 | 0.7883 | 0.8688 | 0.8152 | 0.6890 | 0.7150 |
| Llama 3.1 Swallow 70B Instruct | 0.4520 | 0.8148 | 0.6834 | 0.4012 | 0.9157 | 0.7855 | 0.8886 | 0.8486 | 0.5823 | 0.7080 |
| Qwen2-72B-Instruct | 0.4360 | 0.7588 | 0.6857 | 0.3913 | 0.9110 | 0.8391 | 0.8499 | 0.2436 | 0.6939 | 0.6455 |
| Qwen2.5-72B-Instruct | 0.4540 | 0.6764 | 0.7064 | 0.3550 | 0.8895 | 0.8478 | 0.9113 | 0.4027 | 0.6165 | 0.6511 |
| Mixtral-8x22B-Instruct-v0.1 | 0.4540 | 0.8265 | 0.7074 | 0.3927 | 0.9222 | 0.7733 | 0.8324 | 0.8306 | 0.7348 | 0.7193 |

## MT-Bench JA

|Model|coding|extraction|humanities|math|reasoning|roleplay|stem|writing|JMT Avg|
|---|---|---|---|---|---|---|---|---|---|
| Gemma 2 27B IT | 0.5467 | 0.6752 | 0.8386 | 0.6246 | 0.7201 | 0.7916 | 0.6787 | 0.8070 | 0.7103 |
| Phi-3.5-MoE Instruct | 0.5214 | 0.8106 | 0.6470 | 0.4415 | 0.5360 | 0.6712 | 0.5314 | 0.7304 | 0.6112 |
| GRIN-MoE | 0.5294 | 0.7224 | 0.5923 | 0.5467 | 0.4990 | 0.6030 | 0.5380 | 0.6839 | 0.5893 |
| KARAKURI LM 70B Chat v0.1 | 0.2804 | 0.5862 | 0.6240 | 0.2934 | 0.4183 | 0.5530 | 0.4859 | 0.5964 | 0.4797 |
| Swallow-70b-instruct-v0.1 | 0.3030 | 0.5500 | 0.5650 | 0.3483 | 0.3050 | 0.5420 | 0.4916 | 0.4630 | 0.4460 |
| Llama 3 70B Instruct | 0.5969 | 0.8410 | 0.7120 | 0.4481 | 0.4884 | 0.7117 | 0.6510 | 0.6900 | 0.6424 |
| Llama 3.1 70B Instruct | 0.5252 | 0.7846 | 0.7086 | 0.5063 | 0.6979 | 0.6888 | 0.6402 | 0.6653 | 0.6521 |
| Llama 3 Youko 70B Instruct | 0.6632 | 0.8387 | 0.8108 | 0.4655 | 0.7013 | 0.7778 | 0.7544 | 0.7662 | 0.7222 |
| Llama-3.1-70B-Japanese-Instruct-2407 | 0.6267 | 0.7525 | 0.7938 | 0.5750 | 0.5590 | 0.7725 | 0.7240 | 0.7180 | 0.6902 |
| Llama 3 heron brain 70B v0.3 | 0.3762 | 0.7892 | 0.7274 | 0.5589 | 0.5070 | 0.6662 | 0.6880 | 0.6996 | 0.6266 |
| Llama 3 Swallow 70B Instruct | 0.5269 | 0.7250 | 0.5690 | 0.4669 | 0.6121 | 0.6238 | 0.5533 | 0.5698 | 0.5809 |
| Llama 3.1 Swallow 70B Instruct | 0.5676 | 0.7859 | 0.7490 | 0.5437 | 0.6383 | 0.6870 | 0.6121 | 0.6540 | 0.6547 |
| Qwen2-72B-Instruct | 0.5699 | 0.7858 | 0.8222 | 0.5096 | 0.7032 | 0.7963 | 0.7728 | 0.8223 | 0.7228 |
| Qwen2.5-72B-Instruct | 0.7060 | 0.7866 | 0.8122 | 0.6968 | 0.6536 | 0.8301 | 0.8060 | 0.7841 | 0.7594 |
| Mixtral-8x22B-Instruct-v0.1 | 0.5061 | 0.7454 | 0.5978 | 0.4772 | 0.4760 | 0.5420 | 0.4679 | 0.6244 | 0.5546 |
| Llama 3.1 405B Instruct (deepinfra API) | 0.6464 | 0.8218 | 0.7150 | 0.5313 | 0.6447 | 0.7160 | 0.6737 | 0.6770 | 0.6782 |
| GPT-3.5 (gpt-3.5-turbo-0125) | 0.6851 | 0.7641 | 0.7414 | 0.5522 | 0.5128 | 0.7104 | 0.6266 | 0.7361 | 0.6661 |
| GPT-4o (gpt-4o-2024-05-13) | 0.7296 | 0.8540 | 0.8646 | 0.6641 | 0.6661 | 0.8274 | 0.8184 | 0.8085 | 0.7791 |

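A note on the summary columns: the Ja Avg, En Avg, and JMT Avg values above are the unweighted arithmetic means of the per-task (or per-category) scores in each row. The snippet below is an illustrative re-check only, not part of the evaluation code; the score list is copied from the Llama 3.1 Swallow 70B Instruct row of the Japanese tasks table.

```python
# Illustrative check only (not part of the evaluation code): the summary
# columns are plain means of the per-task scores in each row.
ja_scores = [0.9598, 0.6192, 0.6605, 0.9235, 0.1938,
             0.7760, 0.3123, 0.2593, 0.7117, 0.4713]  # Llama 3.1 Swallow 70B Instruct

ja_avg = sum(ja_scores) / len(ja_scores)
print(round(ja_avg, 4))  # 0.5887, matching the Ja Avg column
```
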
## Evaluation Benchmarks

### Japanese evaluation benchmarks

We used llm-jp-eval (v1.3.0), JP Language Model Evaluation Harness (commit #9b42d41), and Code Generation LM Evaluation Harness (commit #0261c52). The details are as follows:

- Multiple-choice question answering (JCommonsenseQA [Kurihara et al., 2022])
- Open-ended question answering (JEMHopQA [Ishii et al., 2024])
- Open-ended question answering (NIILC [Sekine, 2003])
- Machine reading comprehension (JSQuAD [Kurihara et al., 2022])
- Automatic summarization (XL-Sum [Hasan et al., 2021])
- Machine translation (WMT2020 ja-en [Barrault et al., 2020])
- Machine translation (WMT2020 en-ja [Barrault et al., 2020])
- Mathematical reasoning (MGSM [Shi et al., 2023])
- Academic exams (JMMLU [Yin et al., 2024])
- Code generation (JHumanEval [Sato et al., 2024])

### English evaluation benchmarks

We used the Language Model Evaluation Harness (v0.4.2) and Code Generation LM Evaluation Harness (commit #0261c52). The details are as follows:

- Multiple-choice question answering (OpenBookQA [Mihaylov et al., 2018])
- Open-ended question answering (TriviaQA [Joshi et al., 2017])
- Machine reading comprehension (SQuAD2 [Rajpurkar et al., 2018])
- Commonsense reasoning (XWINO [Tikhonov and Ryabinin, 2021])
- Natural language inference (HellaSwag [Zellers et al., 2019])
- Mathematical reasoning (GSM8K [Cobbe et al., 2021])
- Reasoning (BBH (BIG-Bench-Hard) [Suzgun et al., 2023])
- Academic exams (MMLU [Hendrycks et al., 2021])
- Code generation (HumanEval [Chen et al., 2021])

### MT-Bench JA

We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the instruction-following capabilities of the models.
We used the following settings:

- Implementation: FastChat [Zheng+, 2023] (commit #e86e70d0)
- Question: [Nejumi LLM-Leaderboard NEO, mtbench_ja_question_v3](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question/v3)
- Reference Answer: [Nejumi LLM-Leaderboard NEO, mtbench_ja_referenceanswer_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_referenceanswer/v1)
- Prompt for Judge: [Nejumi LLM-Leaderboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
- Judge: `gpt-4-1106-preview`
- Scoring: Absolute scale normalized to a 0-1 range, averaged over five runs (see the sketch below).

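To make the Scoring bullet concrete, the sketch below spells out the arithmetic as we read it: each reported category score is a judge score normalized to the 0-1 range and then averaged over five runs. The division by 10 is an assumption about the normalization (the judge natively scores 1-10); the five-run averaging is exactly what the bullet states, and the numbers are made up for illustration.

```python
# Illustrative sketch only, not the evaluation pipeline.
# Assumption: "normalized to a 0-1 range" is read here as judge_score / 10.
judge_scores = [6.5, 6.6, 6.4, 6.7, 6.5]            # one category, five runs (1-10 scale)

normalized = [s / 10 for s in judge_scores]         # assumed normalization to 0-1
category_score = sum(normalized) / len(normalized)  # average over the five runs
print(round(category_score, 4))                     # 0.654
```
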
## Usage

```sh
pip install vllm
```

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(
    model=model_name,
    tensor_parallel_size=4,  # shard the 70B model across 4 GPUs
)

sampling_params = SamplingParams(
    temperature=0.6, top_p=0.9, max_tokens=512, stop="<|eot_id|>"
)

message = [
    # System prompt: "You are a sincere and excellent Japanese assistant."
    {"role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。"},
    {
        "role": "user",
        # "Write a heartwarming story in which a swallow soaring through the sky
        # meets a llama standing in the grass, set in an autumn-leaved Tokyo park
        # with Tokyo Tower and skyscrapers in the background."
        "content": "東京の紅葉した公園で、東京タワーと高層ビルを背景に、空を舞うツバメと草地に佇むラマが出会う温かな物語を書いてください。",
    },
]
prompt = tokenizer.apply_chat_template(
    message, tokenize=False, add_generation_prompt=True
)

output = llm.generate(prompt, sampling_params)

print(output[0].outputs[0].text)
```

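The example above targets vLLM. The same chat-template flow also works with plain Hugging Face `transformers`; the sketch below is a minimal, untested variant rather than an official recipe from this card. It assumes `accelerate` is installed so that `device_map="auto"` can shard the 70B checkpoint across the available GPUs, and it reuses the prompt from the vLLM example.

```python
# Minimal transformers-only sketch (assumption: enough GPU memory for the
# 70B checkpoint when sharded automatically with device_map="auto").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tokyotech-llm/Llama-3.1-Swallow-70B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

message = [
    {"role": "system", "content": "あなたは誠実で優秀な日本人のアシスタントです。"},
    {"role": "user", "content": "東京の紅葉した公園で、東京タワーと高層ビルを背景に、空を舞うツバメと草地に佇むラマが出会う温かな物語を書いてください。"},
]
input_ids = tokenizer.apply_chat_template(
    message, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.6,
    top_p=0.9,
    do_sample=True,
)
# Strip the prompt tokens and decode only the newly generated text.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
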
## Training Datasets

### Instruction Tuning

The following datasets were used for instruction tuning.

- lmsys-chat-1m-synth-ja-wo-pii
  - Japanese translation of the lmsys-chat-1m dataset using DeepL, with synthetic instruction data created using the Llama-3.1-405B model.
  - 'wo-pii' indicates removal of personally identifiable information.

- filtered magpie-ultra
  - Subset of the [magpie-ultra](https://huggingface.co/datasets/argilla/magpie-ultra-v0.1) dataset, containing samples rated as 'average', 'good', or 'excellent'.

- gemma-magpie
  - A Japanese dataset.
  - Generated using prompts for specific category words.

## Risks and Limitations

The models released here are still in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.

## Acknowledgements

We thank Meta Research for releasing Llama 3.1 under an open license for others to build on.

Our project is supported by the [Large Generative AI Development Support Program](https://abci.ai/en/link/lfm_support_program.html) of the National Institute of Advanced Industrial Science and Technology.

## License

[META LLAMA 3.1 COMMUNITY LICENSE](https://www.llama.com/llama3_1/license/)

## Authors

Here are the team members:
- From [Tokyo Institute of Technology Okazaki Laboratory](https://www.nlp.c.titech.ac.jp/index.en.html), the following members:
  - [Naoaki Okazaki](https://www.chokkan.org/index.ja.html)
  - [Sakae Mizuki](https://s-mizuki-nlp.github.io/)
  - [Youmi Ma](https://www.nlp.c.titech.ac.jp/member/youmi.en.html)
  - [Koki Maeda](https://sites.google.com/view/silviase)
  - [Kakeru Hattori](https://aya-se.vercel.app/)
  - [Masanari Ohi](https://sites.google.com/view/masanariohi)
  - [Taihei Shiotani](https://github.com/inatoihs)
  - [Koshiro Saito](https://sites.google.com/view/koshiro-saito)
- From [Tokyo Institute of Technology YOKOTA Laboratory](https://www.rio.gsic.titech.ac.jp/en/index.html), the following members:
  - [Rio Yokota](https://twitter.com/rioyokota)
  - [Kazuki Fujii](https://twitter.com/okoge_kaz)
  - [Taishi Nakamura](https://twitter.com/Setuna7777_2)
  - [Takumi Okamoto](https://www.linkedin.com/in/takumi-okamoto)
  - [Ishida Shigeki](https://www.wantedly.com/id/reborn27)
- From [Artificial Intelligence Research Center, AIST, Japan](https://www.airc.aist.go.jp/en/teams/), the following members:
  - [Hiroya Takamura](https://sites.google.com/view/hjtakamura)

## How to cite

If you find our work helpful, please feel free to cite us.

```tex
@inproceedings{Fujii:COLM2024,
   title={Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities},
   author={Kazuki Fujii and Taishi Nakamura and Mengsay Loem and Hiroki Iida and Masanari Ohi and Kakeru Hattori and Hirai Shota and Sakae Mizuki and Rio Yokota and Naoaki Okazaki},
   booktitle="Proceedings of the First Conference on Language Modeling",
   series={COLM},
   pages="(to appear)",
   year="2024",
   month=oct,
   address={University of Pennsylvania, USA},
}

@inproceedings{Okazaki:COLM2024,
   title={Building a Large Japanese Web Corpus for Large Language Models},
   author={Naoaki Okazaki and Kakeru Hattori and Hirai Shota and Hiroki Iida and Masanari Ohi and Kazuki Fujii and Taishi Nakamura and Mengsay Loem and Rio Yokota and Sakae Mizuki},
   booktitle="Proceedings of the First Conference on Language Modeling",
   series={COLM},
   pages="(to appear)",
   year="2024",
   month=oct,
   address={University of Pennsylvania, USA},
}
```

### References

```tex
@misc{dubey2024llama3herdmodels,
   title={The Llama 3 Herd of Models},
   author={Abhimanyu Dubey and Abhinav Jauhri and Abhinav Pandey and Abhishek Kadian and Ahmad Al-Dahle and Aiesha Letman and Akhil Mathur and Alan Schelten and Amy Yang and Angela Fan et al.},
   year={2024},
   eprint={2407.21783},
   archivePrefix={arXiv},
   primaryClass={cs.AI},
   url={https://arxiv.org/abs/2407.21783},
}
```