---
title: Submission Oriaz
emoji: 🔥
colorFrom: yellow
colorTo: green
sdk: docker
pinned: true
---
# Benchmark using different techniques
## General Information
### Intended Use
- **Primary intended uses**: Baseline comparison for climate disinformation classification models
- **Primary intended users**: Researchers and developers participating in the Frugal AI Challenge
- **Out-of-scope use cases**: Not intended for production use or real-world classification tasks
### Training Data
The model uses the QuotaClimat/frugalaichallenge-text-train dataset (a loading sketch follows this list):
- Size: ~6000 examples
- Split: 80% train, 20% test
- 8 categories of climate disinformation claims
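
As a reference point, loading the dataset and reproducing the 80/20 split might look like the sketch below. The dataset identifier is the one named above; the assumption that the raw data ships as a single `train` split is mine.

```python
# Minimal sketch: load the challenge dataset and reproduce the 80/20 split.
from datasets import load_dataset

# Identifier as named in this card; assumes the raw data ships a single "train" split.
dataset = load_dataset("QuotaClimat/frugalaichallenge-text-train")
splits = dataset["train"].train_test_split(test_size=0.2, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
print(len(train_ds), len(test_ds))
```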
#### Labels
0. No relevant claim detected
1. Global warming is not happening
2. Not caused by humans
3. Not bad or beneficial
4. Solutions harmful/unnecessary
5. Science is unreliable
6. Proponents are biased
7. Fossil fuels are needed
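
For convenience, the same label ids can be written as a plain Python mapping:

```python
# Label ids from this card mapped to their claim categories.
LABELS = {
    0: "No relevant claim detected",
    1: "Global warming is not happening",
    2: "Not caused by humans",
    3: "Not bad or beneficial",
    4: "Solutions harmful/unnecessary",
    5: "Science is unreliable",
    6: "Proponents are biased",
    7: "Fossil fuels are needed",
}
```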
### Environmental Impact
Environmental impact is tracked using CodeCarbon, measuring:
- Carbon emissions during inference
- Energy consumption during inference
This tracking helps establish a baseline for the environmental impact of model deployment and inference.
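
A minimal sketch of this kind of tracking with CodeCarbon, assuming the tracker wraps the inference call (the `final_emissions_data` attribute used for the energy figure is an assumption; check the CodeCarbon version in use):

```python
# Minimal sketch: measure emissions and energy of an inference pass with CodeCarbon.
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="frugal-ai-inference")
tracker.start()

# ... run model inference on the test set here ...

emissions_kg = tracker.stop()  # stop() returns emissions in kg CO2eq
energy_kwh = tracker.final_emissions_data.energy_consumed  # total energy in kWh (attribute assumed)

print(f"Emissions: {emissions_kg * 1000:.2f} gCO2eq")
print(f"Energy:    {energy_kwh * 1000:.2f} Wh")
```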
### Ethical Considerations
- Dataset contains sensitive topics related to climate disinformation
- Environmental impact is tracked to promote awareness of AI's carbon footprint
## ML model for Climate Disinformation Classification
### Model Description
The goal is to find the best classical ML model for classifying vectorized (embedded) quotes as climate change disinformation.
### Performance
#### Metrics (measured on an NVIDIA T4 small GPU)
- **Accuracy**: ~69-72%
- **Environmental Impact**:
  - Emissions tracked in gCO2eq (~0.7 g)
  - Energy consumption tracked in Wh (~1.8 Wh)
#### Model Architecture
Classical ML models need numeric inputs, so the quotes must be embedded first. I used the *MTEB Leaderboard* on Hugging Face to find the embedding model with the best trade-off between performance and parameter count.
I then chose the "dunzhang/stella_en_400M_v5" model as the embedder: it has the 7th best performance score with only 400M parameters.
Once the quotes are embedded, the dataset is 6091 samples x 1024 features, which is then split into train and test sets (70% / 30%).
Using the TPOT classifier search, the best model on this data turned out to be a logistic regression, as sketched below.
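
As an illustration, the pipeline above could be reproduced roughly as follows. This is a sketch, not the exact submission code: it assumes the embedder loads through sentence-transformers (this model may require `trust_remote_code=True`), assumes the dataset columns are named `quote` and `label`, and directly fits the logistic regression that the TPOT search converged on.

```python
# Sketch of the pipeline: embed the quotes, split 70/30, fit a logistic regression.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

ds = load_dataset("QuotaClimat/frugalaichallenge-text-train")["train"]

# Embed ~6091 quotes into 1024-dimensional vectors.
# Column names "quote"/"label" are assumptions; adjust to the dataset schema.
embedder = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True)
X = embedder.encode(ds["quote"], batch_size=64, show_progress_bar=True)
y = ds["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# TPOT's search selected a logistic regression; fit it directly here.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```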
Here is the resulting confusion matrix:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/66169e1ce557753f30eab31b/tfAcfFu3Cnc9XJ00ixrWB.png)
### Limitations
- The embedding phase takes ~30 seconds for 1,800 quotes. It could be optimized, which would have a real impact on carbon emissions.
- It is hard to get above 70% accuracy with "simple" ML models.
- Textual data carries interpretation subtleties that small models cannot capture.
## BERT model for Climate Disinformation Classification
### Model Description
Fine-tuned BERT model for climate disinformation classification.
### Performance
#### Metrics (measured on an NVIDIA T4 small GPU)
- **Accuracy**: ~90%
- **Environmental Impact**:
  - Emissions tracked in gCO2eq (~0.25 g)
  - Energy consumption tracked in Wh (~0.7 Wh)
#### Model Architecture
Fine-tuning of the "bert-uncased" model with a 70% train / 15% eval / 15% test split.
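
A condensed sketch of this fine-tuning setup. It assumes the standard `bert-base-uncased` checkpoint, dataset columns `quote`/`label` with integer label ids, and generic hyperparameters; the exact submission settings are not reproduced here.

```python
# Sketch: fine-tune a BERT checkpoint for 8-way classification with a 70/15/15 split.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=8)

ds = load_dataset("QuotaClimat/frugalaichallenge-text-train")["train"]

def tokenize(batch):
    # Column name "quote" is an assumption; "label" must hold integer ids 0-7.
    return tokenizer(batch["quote"], truncation=True, padding="max_length", max_length=256)

ds = ds.map(tokenize, batched=True)

# 70% train, then split the remaining 30% in half for eval and test.
first = ds.train_test_split(test_size=0.3, seed=42)
rest = first["test"].train_test_split(test_size=0.5, seed=42)
train_ds, eval_ds, test_ds = first["train"], rest["train"], rest["test"]

args = TrainingArguments(output_dir="bert-climate", num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
print(trainer.evaluate(test_ds))
```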
### Limitations
- Not optimized; inference on CPU still needs to be tested (see the sketch after this list).
- Small models have limitations: accuracy usually sits around 70-80%, and it is hard to go beyond that just by tuning hyperparameters.
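
One simple way to test CPU inference on the fine-tuned checkpoint (the model path is illustrative):

```python
# Sketch: run the fine-tuned classifier on CPU via the transformers pipeline.
from transformers import pipeline

# "bert-climate" is an illustrative path to the fine-tuned checkpoint.
classifier = pipeline("text-classification", model="bert-climate", device=-1)  # device=-1 forces CPU
print(classifier("An example quote to classify."))
```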
# Contact
*LinkedIn*: Mattéo GIRARDEAU
*Email*: [email protected]