Commit 57eca30 (verified) by s-mizuki-nlp · Parent: 65e1c88

Improved wordings, fixed to correct description.

Files changed (1): README.md (+7 −7)
README.md CHANGED

@@ -16,11 +16,11 @@ library_name: fasttext
 This repository contains fastText classifiers for judging the educational value of Japanese web pages. It includes two types of classifiers:

 1. **Wiki-based classifier**: trained on Japanese Wikipedia text in academic categories.
-2. **LLM-based classifier**: trained on annotations generated by LLMs.
+2. **LLM-based classifier**: trained on educational value annotations generated by LLMs.

 The Wiki-based classifier is distributed under the [CC BY-SA 4.0](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/CC_BY-SA_4.0.md) license. The LLM-based classifier is distributed under the same license as the LLM used for annotation ([Llama 3.1 Community License Agreement](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/LLAMA_3.1_COMMUNITY_LICENSE_AGREEMENT.md) or [Gemma Terms of Use](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/GEMMA_TERMS_OF_USE.md)).

-These classifiers were employed as quality-filtering for the Swallow Corpus Version 2\*, which was used to train the [Llama 3.1 Swallow](https://huggingface.co/collections/tokyotech-llm/llama-31-swallow-66fd4f7da32705cadd1d5bc6) series. Our experiments demonstrated that applying filtering based on the classifier’s scores enabled more effective improvements in the LLM’s Japanese knowledge, even with the same computational resources.
+These classifiers were employed for quality filtering of the Swallow Corpus Version 2\*, which was used to train the [Llama 3.1 Swallow](https://huggingface.co/collections/tokyotech-llm/llama-31-swallow-66fd4f7da32705cadd1d5bc6) series. Our experiments demonstrated that educational quality filtering based on the classifier scores effectively enhanced the LLM’s Japanese knowledge under the same computational budget.

 **NOTE**: These classifiers are designed to work with Japanese text. Their functionality and quality are not guaranteed for non-Japanese languages, including English.

@@ -28,7 +28,7 @@ These classifiers were employed as quality-filtering for the Swallow Corpus Vers

 ### How to use

-The Wiki-based classifier outputs a probability between 0 and 1, indicating how likely a given document resembles Wikipedia content. On the other hand, the LLM-based classifier predicts a label in four labels (0, 1, 2, or 3) of a given document, i.e., a 4-class classification problem for a document. An expectation of scores (ranging from 0 to 3) can be used for an education score.
+The Wiki-based classifier outputs a probability between 0 and 1, indicating how similar a given document is to Wikipedia content. The LLM-based classifier, in contrast, treats scoring as a 4-class classification problem and predicts one of four labels (0, 1, 2, or 3) for a given document. The expectation of the scores, calculated as the sum of the class scores weighted by their predicted probabilities (ranging from 0 to 3), can be used as an educational score.

 ```bash
 pip install numpy==1.26.4 fasttext

@@ -61,7 +61,7 @@ edu_score = sum([int(label[-1]) * prob for label, prob in zip(res[0], res[1])])

 ### Best practice

-In our research, we have demonstrated that both classifiers were effective. However, we recommend using the **LLM-based classifier** if you want to assign appropriate education scores to a broader range of documents. Because the Wiki-based classifier is specialized in detecting documents similar to Wikipedia articles, it tends to assign scores close to 0 for most documents. In contrast, the LLM-based classifier can compute scores based on a more general definition of educational value.
+In our research, we have demonstrated that both classifiers are effective. However, we recommend the **LLM-based classifier** if you want to assign appropriate educational scores to a broader range of documents. The Wiki-based classifier, designed to detect content resembling academic Wikipedia articles, assigns scores close to 0 to most documents. In contrast, the LLM-based classifier computes scores based on a more general definition of educational value.

 ## Training

@@ -76,8 +76,8 @@ We built this classifier by treating Wikipedia articles as positive examples of
 Inspired by [FineWeb-Edu](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier), we constructed the classifier through the following steps:

 1. Use 200,000 documents randomly extracted from the Swallow Corpus Version 2 and an additional 31,059 web articles that were manually selected.
-2. Evaluate the educational value of the documents from step 1 using [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) (or [Gemma 2 27B IT](https://huggingface.co/google/gemma-2-27b-it)) and a [custom prompt](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/utils/prompt.md). The evaluation is based on three criteria: (1) Whether the topic is highly academic. (2) Whether it provides deep insights or discussions. (3) Whether it is easy to understand for a general audience. The educational value is scored on a 3-point scale.
-3. Train a fastText classifier using the automatically scored documents from step 2 as training data. The classifier predicts the educational score as one of four classes (0, 1, 2, or 3).
+2. Evaluate the educational value of the documents from step 1 using [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) (or [Gemma 2 27B IT](https://huggingface.co/google/gemma-2-27b-it)) and a [custom prompt](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/utils/prompt.md). The evaluation is based on three criteria: (1) whether the topic is highly academic, (2) whether it provides deep insights or discussions, and (3) whether it is easy to understand for a general audience. The educational value is scored on a 4-point Likert scale.
+3. Train a fastText classifier using the automatically scored documents from step 2 as training data. This classifier predicts the probability of each class (0, 1, 2, or 3).

 ## Acknowledgments

@@ -94,4 +94,4 @@ The preprint can be downloaded [here](https://huggingface.co/tokyotech-llm/edu-c
 booktitle = {言語処理学会第31回年次大会 (NLP2025)},
 year = {2025},
 }
-```
+```
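The score-expectation step visible in the diff context (`edu_score = sum([int(label[-1]) * prob for label, prob in zip(res[0], res[1])])`) can be sketched as a standalone helper. The `__label__<k>` strings follow fastText's label convention; the prediction values below are made-up illustrations, not real model output:

```python
def expected_edu_score(labels, probs):
    """Expectation of the educational score: sum of class value x predicted probability.

    `labels` follow fastText's `__label__<k>` convention with k in {0, 1, 2, 3},
    as in the README's own snippet, which reads the class from the last character.
    """
    return sum(int(label[-1]) * prob for label, prob in zip(labels, probs))


# Hypothetical full-distribution prediction (the shape model.predict(text, k=-1) returns):
labels = ("__label__2", "__label__1", "__label__0", "__label__3")
probs = (0.50, 0.30, 0.15, 0.05)

print(expected_edu_score(labels, probs))  # 2*0.5 + 1*0.3 + 0*0.15 + 3*0.05 = 1.45
```

Because the expectation weights every class, it distinguishes a confident "2" from a document whose probability mass is split between 0 and 3, which a hard argmax label would not.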
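Training step 3 relies on fastText's supervised input format, one `__label__<k> <text>` example per line. A minimal sketch of preparing such a line — the helper name and the file name `train.txt` are illustrative, not the authors' actual pipeline:

```python
def to_fasttext_line(score: int, text: str) -> str:
    """Format one training example as '__label__<k> <text>' (fastText supervised format)."""
    # fastText reads one example per line, so internal newlines must be collapsed;
    # splitting on all whitespace and rejoining also normalizes stray runs of spaces.
    return f"__label__{score} " + " ".join(text.split())


line = to_fasttext_line(2, "日本語の教育的な\n文書の例")
print(line)  # __label__2 日本語の教育的な 文書の例

# Actual training (requires the fasttext package and a file of such lines):
# import fasttext
# model = fasttext.train_supervised(input="train.txt")
```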