License and Datasets used?
Hi,
What is the License for this model, and what datasets/sources were used to train this model?
Thanks.
Hi,
The License should be the same as Japanese Stable LM Instruct Gamma 7B's, namely apache-2.0. But I'm not very knowledgeable about licenses, so let's say it's generally advisable to use this model for personal use only. As for datasets, I used less than 1GB of web Fictions for fine-tuning.
Thanks!
Forgot to mention: the reason I'm interested in the datasets is that I'm trying to fine-tune a model specifically for Japanese-to-English web novel translation. I created a very high-quality sentence-aligned parallel dataset of web novel chapters, but its scale (~100 MB) wasn't enough for a good result, even starting from a Japanese-trained base model. So I'm first fine-tuning on a large corpus of non-parallel Japanese and English web novels, then doing another fine-tune with the parallel dataset on top. I started with public-domain classical literature (which vastly increased translation quality), but the quantity, quality, and relevance of that data weren't great (globis-university/aozorabunko-clean and ubaada/booksum-complete-cleaned), so I'm trying to integrate web novels as well.
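For anyone curious what the second stage looks like in practice, here's a minimal sketch of turning a sentence-aligned JA→EN pair into an instruction-style training string. The prompt template and the `ja`/`en` field names are my own assumptions for illustration, not the actual format used for this model:

```python
# Hypothetical formatter for a sentence-aligned JA->EN parallel dataset.
# The "ja"/"en" keys and the Alpaca-style template below are assumptions,
# not the format actually used in this thread.

def format_translation_example(pair: dict) -> str:
    """Turn one aligned sentence pair into a single training string."""
    return (
        "### Instruction:\n"
        "Translate the following Japanese web novel sentence into English.\n"
        f"### Input:\n{pair['ja']}\n"
        f"### Response:\n{pair['en']}"
    )

pairs = [
    {"ja": "彼は静かに扉を開けた。", "en": "He quietly opened the door."},
]
examples = [format_translation_example(p) for p in pairs]
print(examples[0])
```

Whatever template you pick, the important part is using the same one at inference time so the model sees prompts shaped like its training data.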