Commit by s-mizuki-nlp (README.md): improved wording; fixed an incorrect description.

This repository contains fastText classifiers for judging the educational value of Japanese web pages. It includes two types of classifiers:

1. **Wiki-based classifier**: trained on Japanese Wikipedia text in academic categories.
2. **LLM-based classifier**: trained on educational-value annotations generated by LLMs.

The Wiki-based classifier is distributed under the [CC BY-SA 4.0](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/CC_BY-SA_4.0.md) license. The LLM-based classifier is distributed under the same license as the LLM used for annotation ([Llama 3.1 Community License Agreement](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/LLAMA_3.1_COMMUNITY_LICENSE_AGREEMENT.md) or [Gemma Terms of Use](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/GEMMA_TERMS_OF_USE.md)).

These classifiers were used for quality filtering of the Swallow Corpus Version 2\*, which was used to train the [Llama 3.1 Swallow](https://huggingface.co/collections/tokyotech-llm/llama-31-swallow-66fd4f7da32705cadd1d5bc6) series. Our experiments demonstrated that educational quality filtering based on the classifier scores effectively enhanced the LLM’s Japanese knowledge, even under the same computational budget.

**NOTE**: These classifiers are designed to work with Japanese text. Their functionality and quality are not guaranteed for non-Japanese languages, including English.

### How to use

The Wiki-based classifier outputs a probability between 0 and 1 indicating how similar a given document is to Wikipedia content. The LLM-based classifier, in contrast, treats scoring as a 4-class classification problem and predicts a label (0, 1, 2, or 3) for a given document. The expected score, i.e., the sum of the class labels weighted by their predicted probabilities (ranging from 0 to 3), can be used as an educational score.

```bash
pip install numpy==1.26.4 fasttext
```

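The expected-score computation described above can be sketched as follows. The `expected_edu_score` line matches the scoring code in this repository; the model filename and the commented loading steps are assumptions, so adjust them to the actual files distributed here.

```python
def expected_edu_score(labels, probs):
    """Expected educational score: probability-weighted sum of the class labels (0-3).

    `labels` and `probs` are what fastText's predict(..., k=-1) returns for the
    LLM-based classifier, with labels of the form "__label__0" .. "__label__3".
    """
    return sum([int(label[-1]) * prob for label, prob in zip(labels, probs)])


# Typical use (the model filename is an assumption, not the released file name):
#   import fasttext
#   model = fasttext.load_model("edu-classifier-llm.bin")
#   labels, probs = model.predict(text_without_newlines, k=-1)
#   score = expected_edu_score(labels, probs)

# Worked example with a hand-made prediction:
labels = ("__label__3", "__label__1", "__label__0", "__label__2")
probs = (0.5, 0.3, 0.1, 0.1)
# expectation = 0.5*3 + 0.3*1 + 0.1*0 + 0.1*2
print(expected_edu_score(labels, probs))
```

Note that fastText's `predict` rejects input containing newlines, so strip or replace them before scoring a document.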
### Best practice

In our research, we have demonstrated that both classifiers are effective. However, we recommend the **LLM-based classifier** if you want to assign appropriate educational scores to a broader range of documents. The Wiki-based classifier, designed to detect content resembling academic Wikipedia articles, assigns scores close to 0 to most documents. In contrast, the LLM-based classifier computes scores based on a more general definition of educational value.
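In practice, such scores are used to keep only documents above some cutoff. The sketch below shows that filtering pattern; the threshold value and the stand-in scorer are illustrative assumptions, not the settings used for the Swallow Corpus.

```python
# Hypothetical filtering pass: keep documents whose expected educational
# score reaches a threshold. THRESHOLD is an assumption for illustration.
THRESHOLD = 1.5

def filter_corpus(docs, score_fn, threshold=THRESHOLD):
    """Yield only documents whose score meets the threshold."""
    for doc in docs:
        if score_fn(doc) >= threshold:
            yield doc

# Demo with a stand-in scorer in place of the fastText model:
docs = ["doc-a", "doc-b", "doc-c"]
scores = {"doc-a": 2.4, "doc-b": 0.3, "doc-c": 1.9}
print(list(filter_corpus(docs, scores.get)))  # ['doc-a', 'doc-c']
```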

## Training

Inspired by [FineWeb-Edu](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier), we constructed the classifier through the following steps:

1. Use 200,000 documents randomly extracted from the Swallow Corpus Version 2 and an additional 31,059 manually selected web articles.
2. Evaluate the educational value of the documents from step 1 using [Llama 3.1 70B Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) (or [Gemma 2 27B IT](https://huggingface.co/google/gemma-2-27b-it)) and a [custom prompt](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/utils/prompt.md). The evaluation is based on three criteria: (1) whether the topic is highly academic, (2) whether it provides deep insights or discussion, and (3) whether it is easy for a general audience to understand. The educational value is scored on a 4-point Likert scale.
3. Train a fastText classifier on the automatically scored documents from step 2. The classifier predicts the probability of each class (0, 1, 2, or 3).
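Step 3 can be sketched as below. fastText's supervised mode takes a text file with one `__label__<class> <text>` line per document, where the class is the LLM-assigned score from step 2. The sample rows and the commented hyperparameters are illustrative assumptions, not the settings used for the released classifier.

```python
# Build fastText supervised training data: one "__label__<score> <text>" line
# per document. The two rows below are made-up examples for illustration.
rows = [
    (3, "大学レベルの数学の講義ノートを詳しく解説する記事"),
    (0, "商品の宣伝が中心で教育的価値の低いページ"),
]
train_lines = [f"__label__{score} {text}" for score, text in rows]

with open("edu_train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(train_lines) + "\n")

# Training itself (requires the fasttext package; hyperparameters are illustrative):
#   import fasttext
#   model = fasttext.train_supervised(input="edu_train.txt", epoch=5, lr=0.5)
#   model.save_model("edu-classifier.bin")
print(train_lines[0])
```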
81 |
|
82 |
## Acknowledgments
|
83 |
```
booktitle = {言語処理学会第31回年次大会 (NLP2025)},
year = {2025},
}
```