Commit by s-mizuki-nlp (README.md): improved wording; fixed an incorrect description.

This repository contains fastText classifiers for judging the educational value of Japanese web pages. It includes two types of classifiers:

1. **Wiki-based classifier**: trained on Japanese Wikipedia text in academic categories.
2. **LLM-based classifier**: trained on educational-value annotations generated by LLMs.

The Wiki-based classifier is distributed under the [CC BY-SA 4.0](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/CC_BY-SA_4.0.md) license. The LLM-based classifier is distributed under the same license as the LLM used for annotation ([Llama 3.1 Community License Agreement](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/LLAMA_3.1_COMMUNITY_LICENSE_AGREEMENT.md) or [Gemma Terms of Use](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/GEMMA_TERMS_OF_USE.md)).

These classifiers were used for quality filtering of the Swallow Corpus Version 2\*, which was used to train the [Llama 3.1 Swallow](https://huggingface.co/collections/tokyotech-llm/llama-31-swallow-66fd4f7da32705cadd1d5bc6) series. Our experiments demonstrated that educational quality filtering based on the classifier scores effectively enhanced the LLM’s Japanese knowledge, even under the same computational budget.

**NOTE**: These classifiers are designed to work with Japanese text. Their functionality and quality are not guaranteed for non-Japanese languages, including English.

### How to use

The Wiki-based classifier outputs a probability between 0 and 1 indicating how similar a given document is to Wikipedia content. The LLM-based classifier, in contrast, treats scoring as a 4-class classification problem and predicts a label (0, 1, 2, or 3) for a given document. The expected score, i.e., the sum of the class labels weighted by their predicted probabilities (ranging from 0 to 3), can be used as an educational score.

```bash
pip install numpy==1.26.4 fasttext
```

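The expected-score computation described above can be sketched as follows. The `expected_edu_score` line matches the scoring code in this repository; the model filename and the commented loading steps are assumptions, so adjust them to the actual files distributed here.

```python
def expected_edu_score(labels, probs):
    """Expected educational score: probability-weighted sum of the class labels (0-3).

    `labels` and `probs` are what fastText's predict(..., k=-1) returns for the
    LLM-based classifier, with labels of the form "__label__0" .. "__label__3".
    """
    return sum([int(label[-1]) * prob for label, prob in zip(labels, probs)])


# Typical use (the model filename is an assumption, not the released file name):
#   import fasttext
#   model = fasttext.load_model("edu-classifier-llm.bin")
#   labels, probs = model.predict(text_without_newlines, k=-1)
#   score = expected_edu_score(labels, probs)

# Worked example with a hand-made prediction:
labels = ("__label__3", "__label__1", "__label__0", "__label__2")
probs = (0.5, 0.3, 0.1, 0.1)
# expectation = 0.5*3 + 0.3*1 + 0.1*0 + 0.1*2
print(expected_edu_score(labels, probs))
```

Note that fastText's `predict` rejects input containing newlines, so strip or replace them before scoring a document.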
### Best practice

In our research, we have demonstrated that both classifiers are effective. However, we recommend the **LLM-based classifier** if you want to assign appropriate educational scores to a broader range of documents. The Wiki-based classifier, designed to detect content resembling academic Wikipedia articles, assigns scores close to 0 to most documents. In contrast, the LLM-based classifier computes scores based on a more general definition of educational value.
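In practice, such scores are used to keep only documents above some cutoff. The sketch below shows that filtering pattern; the threshold value and the stand-in scorer are illustrative assumptions, not the settings used for the Swallow Corpus.

```python
# Hypothetical filtering pass: keep documents whose expected educational
# score reaches a threshold. THRESHOLD is an assumption for illustration.
THRESHOLD = 1.5

def filter_corpus(docs, score_fn, threshold=THRESHOLD):
    """Yield only documents whose score meets the threshold."""
    for doc in docs:
        if score_fn(doc) >= threshold:
            yield doc

# Demo with a stand-in scorer in place of the fastText model:
docs = ["doc-a", "doc-b", "doc-c"]
scores = {"doc-a": 2.4, "doc-b": 0.3, "doc-c": 1.9}
print(list(filter_corpus(docs, scores.get)))  # ['doc-a', 'doc-c']
```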

## Training

Inspired by [FineWeb-Edu](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier), we constructed the classifier through the following steps:

1. Use 200,000 documents randomly extracted from the Swallow Corpus Version 2 and an additional 31,059 manually selected web articles.
2. Evaluate the educational value of the documents from step 1 using [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) (or [Gemma 2 27B IT](https://huggingface.co/google/gemma-2-27b-it)) and a [custom prompt](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/utils/prompt.md). The evaluation is based on three criteria: (1) whether the topic is highly academic, (2) whether it provides deep insights or discussion, and (3) whether it is easy for a general audience to understand. The educational value is scored on a 4-point Likert scale.
3. Train a fastText classifier on the automatically scored documents from step 2. The classifier predicts the probability of each class (0, 1, 2, or 3).
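Step 3 can be sketched as below. fastText's supervised mode takes a text file with one `__label__<class> <text>` line per document, where the class is the LLM-assigned score from step 2. The sample rows and the commented hyperparameters are illustrative assumptions, not the settings used for the released classifier.

```python
# Build fastText supervised training data: one "__label__<score> <text>" line
# per document. The two rows below are made-up examples for illustration.
rows = [
    (3, "大学レベルの数学の講義ノートを詳しく解説する記事"),
    (0, "商品の宣伝が中心で教育的価値の低いページ"),
]
train_lines = [f"__label__{score} {text}" for score, text in rows]

with open("edu_train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(train_lines) + "\n")

# Training itself (requires the fasttext package; hyperparameters are illustrative):
#   import fasttext
#   model = fasttext.train_supervised(input="edu_train.txt", epoch=5, lr=0.5)
#   model.save_model("edu-classifier.bin")
print(train_lines[0])
```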
81 |
|
82 |
## Acknowledgments
|
83 |
```
booktitle = {言語処理学会第31回年次大会 (NLP2025)},
year = {2025},
}
```