License and Datasets used?
Hi,
What is the License for this model, and what datasets/sources were used to train this model?
Thanks.
Hi,
The License should be the same as Japanese Stable LM Instruct Gamma 7B's, namely apache-2.0. But I'm not very knowledgeable about licenses, so let's say it's generally advisable to use this model for personal use only. As for datasets, I used less than 1GB of web Fictions for fine-tuning.
Thanks!
Forgot to mention: the reason I'm interested in the datasets is that I'm trying to fine-tune a model specifically for Japanese-to-English web novel translation. I created a very high-quality sentence-aligned parallel dataset of web novel chapters, but its scale (~100 MB) wasn't enough for a good result, even starting from a Japanese-trained base model. So I'm first fine-tuning on a large corpus of non-parallel Japanese and English web novels, then doing another fine-tune with the parallel dataset on top. I started with public-domain classical literature (which vastly increased translation quality), but the quantity, quality, and relevance of that data weren't great (globis-university/aozorabunko-clean and ubaada/booksum-complete-cleaned), so I'm trying to integrate web novels as well.
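For anyone curious what the second stage looks like in practice, here's a minimal sketch of turning a sentence-aligned JA→EN pair into an instruction-style training string. The prompt template and the `ja`/`en` field names are my own assumptions for illustration, not the actual format used for this model:

```python
# Hypothetical formatter for a sentence-aligned JA->EN parallel dataset.
# The "ja"/"en" keys and the Alpaca-style template below are assumptions,
# not the format actually used in this thread.

def format_translation_example(pair: dict) -> str:
    """Turn one aligned sentence pair into a single training string."""
    return (
        "### Instruction:\n"
        "Translate the following Japanese web novel sentence into English.\n"
        f"### Input:\n{pair['ja']}\n"
        f"### Response:\n{pair['en']}"
    )

pairs = [
    {"ja": "彼は静かに扉を開けた。", "en": "He quietly opened the door."},
]
examples = [format_translation_example(p) for p in pairs]
print(examples[0])
```

Whatever template you pick, the important part is using the same one at inference time so the model sees prompts shaped like its training data.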