---
license: other
license_name: mixed
license_link: https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/README.md
language: ja
pipeline_tag: text-classification
library_name: fasttext
---

# Swallow Education Classifier

[Japanese README](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/README_ja.md)

## Model summary

This repository contains fastText classifiers for judging the educational value of Japanese web pages. It includes two types of classifiers:

1. **Wiki-based classifier**: trained on Japanese Wikipedia articles in academic categories.
2. **LLM-based classifier**: trained on educational-value annotations generated by LLMs.

The Wiki-based classifier is distributed under the [CC BY-SA 4.0](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/CC_BY-SA_4.0.md) license. The LLM-based classifier is distributed under the same license as the LLM used for annotation ([Llama 3.1 Community License Agreement](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/LLAMA_3.1_COMMUNITY_LICENSE_AGREEMENT.md) or [Gemma Terms of Use](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/GEMMA_TERMS_OF_USE.md)).

These classifiers were used for quality filtering of the Swallow Corpus Version 2\*, which was used to train the [Llama 3.1 Swallow](https://huggingface.co/collections/tokyotech-llm/llama-31-swallow-66fd4f7da32705cadd1d5bc6) series. Our experiments demonstrated that educational quality filtering based on the classifier scores effectively enhanced the LLM’s Japanese knowledge, even under the same computational budget.

**NOTE**: These classifiers are designed for Japanese text. Their functionality and quality are not guaranteed for other languages, including English.

\* A large Japanese web corpus extracted from Common Crawl

### How to use

The Wiki-based classifier outputs a probability between 0 and 1 indicating how similar a given document is to Wikipedia content. The LLM-based classifier, on the other hand, treats a document as a 4-class classification problem and predicts one of four labels (0, 1, 2, or 3). The expected score, calculated as the sum of the class scores weighted by their predicted probabilities (and thus ranging from 0 to 3), can be used as an educational score.

```bash
pip install numpy==1.26.4 fasttext
```

```python
from huggingface_hub import hf_hub_download
import fasttext

# Example text
text = "Llama 3.1 Swallow\nLlama 3.1 SwallowはLlama 3.1の英語の能力を維持しながら、日本語の能力を強化した大規模言語モデル (8B, 70B) です。"
text = text.replace("\n", " ")

# If you use the Wiki-based classifier
model = fasttext.load_model(hf_hub_download("tokyotech-llm/edu-classifier", "wiki.bin"))
res = model.predict(text, k=-1)
## Use the positive prediction probability as the educational score
edu_score = res[1][0] if res[0][0] == "__label__pos" else 1 - res[1][0]

# If you use the LLM-based classifier
model = fasttext.load_model(
    hf_hub_download("tokyotech-llm/edu-classifier", "llm_llama.bin")
)
res = model.predict(text, k=-1)
## Use the weighted sum of the prediction probabilities as the educational score
edu_score = sum(int(label[-1]) * prob for label, prob in zip(res[0], res[1]))
```

### Best practice

Our research demonstrated that both classifiers are effective. However, we recommend the **LLM-based classifier** if you want to assign appropriate educational scores to a broader range of documents. The Wiki-based classifier, which is designed to detect content resembling academic Wikipedia articles, assigns scores close to 0 to most documents. In contrast, the LLM-based classifier computes scores based on a more general definition of educational value.
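To make this recommendation concrete, here is a minimal filtering sketch using the LLM-based score. The wrapper function `edu_score`, the `documents` list, and the threshold value are assumptions for illustration only; the threshold actually used to build the Swallow Corpus Version 2 is not specified here.

```python
from huggingface_hub import hf_hub_download
import fasttext

model = fasttext.load_model(
    hf_hub_download("tokyotech-llm/edu-classifier", "llm_llama.bin")
)

def edu_score(text: str) -> float:
    """Expected educational score in [0, 3] from the LLM-based classifier."""
    labels, probs = model.predict(text.replace("\n", " "), k=-1)
    # Labels look like "__label__2"; weight each class by its probability.
    return sum(int(label[-1]) * prob for label, prob in zip(labels, probs))

# Hypothetical inputs: replace with your own Japanese web documents.
documents = ["ドキュメント1", "ドキュメント2"]
THRESHOLD = 1.5  # illustrative cutoff, not the value used for Swallow Corpus v2
educational_docs = [doc for doc in documents if edu_score(doc) >= THRESHOLD]
```

The choice of threshold trades corpus size against average educational quality, so the right value depends on your token budget.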
## Training

Both classifiers were trained with fastText for 20 epochs on their respective training data. Character n-grams (_n_ = 2, 3) were used as features; word n-grams were not used, as they did not improve accuracy (see the appendix at the end of this README for a sketch of this setup).

### Wiki-based Classifier

We built this classifier by treating Wikipedia articles as positive examples of educational documents. Since not all articles are necessarily “educational” (for example, articles about individuals), we extracted 37,399 Japanese Wikipedia articles from [academic categories](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/utils/academic_categories_wiki_ja.tsv) as positive examples. As negative examples, we randomly sampled 37,399 documents from the Swallow Corpus Version 2.

### LLM-based Classifier

Inspired by [FineWeb-Edu](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier), we constructed the classifier through the following steps:

1. Collect 200,000 documents randomly extracted from the Swallow Corpus Version 2, plus an additional 31,059 manually selected web articles.
2. Evaluate the educational value of the documents from step 1 using [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) (or [Gemma 2 27B IT](https://huggingface.co/google/gemma-2-27b-it)) and a [custom prompt](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/utils/prompt.md). The evaluation is based on three criteria: (1) whether the topic is highly academic; (2) whether it provides deep insights or discussions; (3) whether it is easy to understand for a general audience. The educational value is scored on a 4-point Likert scale.
3. Train a fastText classifier using the automatically scored documents from step 2 as training data. The classifier predicts the probability of each class (0, 1, 2, or 3).

## Acknowledgments

This research is based on results obtained from the project JPNP18002 commissioned by the New Energy and Industrial Technology Development Organization (NEDO) and from the AIST policy-based budget project "R&D on Generative AI Foundation Models for the Physical Domain". In addition, the continual pre-training experiments for LLMs were supported by the “Support Program for Building Large Language Models” of the AI Bridging Cloud Infrastructure (ABCI), developed and operated by the National Institute of Advanced Industrial Science and Technology (AIST).

## Citation

The preprint can be downloaded [here](https://huggingface.co/tokyotech-llm/edu-classifier/resolve/main/swallow-corpus-v2.pdf) (in Japanese only).

```bibtex
@inproceedings{hattori-2025-swallow-v2,
    author = {服部 翔 and 岡崎 直観 and 水木 栄 and 藤井 一喜 and 中村 泰士 and 大井 聖也 and 塩谷 泰平 and 齋藤 幸史郎 and Youmi Ma and 前田 航希 and 岡本 拓己 and 石田 茂樹 and 横田 理央 and 高村 大也},
    title = {Swallowコーパスv2: 教育的な日本語ウェブコーパスの構築},
    booktitle = {言語処理学会第31回年次大会 (NLP2025)},
    year = {2025},
}
```
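## Appendix: Training sketch

For reference, the training recipe in the Training section corresponds roughly to the fastText invocation sketched below. This is a minimal sketch under stated assumptions, not the exact script used to build the released models: the input file name, the label format, and all hyperparameters other than the epoch count and character n-gram range are assumptions.

```python
import fasttext

# Assumed training file: one document per line, prefixed with a fastText
# label such as "__label__pos" / "__label__neg" (Wiki-based classifier) or
# "__label__0" ... "__label__3" (LLM-based classifier). "train.txt" is a
# placeholder name.
model = fasttext.train_supervised(
    input="train.txt",
    epoch=20,      # 20 epochs, as described in the Training section
    minn=2,        # character n-grams from n = 2 ...
    maxn=3,        # ... up to n = 3
    wordNgrams=1,  # no word n-grams; they did not improve accuracy
)
model.save_model("classifier.bin")
```

The resulting `.bin` file can then be loaded with `fasttext.load_model`, as in the usage example in the How to use section.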