SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders
Abstract
Diffusion models, while powerful, can inadvertently generate harmful or undesirable content, raising significant ethical and safety concerns. Recent machine unlearning approaches offer potential solutions but often lack transparency, making it difficult to understand the changes they introduce to the base model. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to remove unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a feature selection method that enables precise interventions on model activations to block targeted content while preserving overall performance. Evaluation on the competitive UnlearnCanvas benchmark for object and style unlearning highlights SAeUron's state-of-the-art performance. Moreover, we show that a single SAE can remove multiple concepts simultaneously and that, in contrast to other methods, SAeUron mitigates the risk of generating unwanted content even under adversarial attack. Code and checkpoints are available at: https://github.com/cywinski/SAeUron.
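The abstract's recipe (encode diffusion-model activations with an SAE, suppress the features selected for the target concept, then decode back into the activation stream) can be sketched as follows. This is a minimal illustration under stated assumptions: the ReLU tied-bias SAE form, the function names, and the negative rescaling of concept features are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Encode activations into sparse features (assumed ReLU SAE encoder)."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_decode(f, W_dec, b_dec):
    """Reconstruct activations from SAE features (assumed linear decoder)."""
    return f @ W_dec + b_dec

def ablate_concept(x, W_enc, b_enc, W_dec, b_dec, concept_idx, scale=-1.0):
    """Intervene on activations: rescale the SAE features selected for the
    unwanted concept (a negative scale steers away from it), then decode.
    The intervention is local to the selected features, so unrelated
    features, and hence overall generation quality, are left untouched."""
    f = sae_encode(x, W_enc, b_enc)
    f[:, concept_idx] *= scale
    return sae_decode(f, W_dec, b_dec)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k, n = 8, 32, 4  # activation dim, SAE dictionary size, batch size
    x = rng.normal(size=(n, d))
    W_enc, b_enc = rng.normal(size=(d, k)), rng.normal(size=k)
    W_dec, b_dec = rng.normal(size=(k, d)), rng.normal(size=d)

    # Pick a feature that actually fires on this batch to stand in for a
    # concept feature found by the paper's (importance-based) selection step.
    f = sae_encode(x, W_enc, b_enc)
    concept_idx = [int(np.argmax(f.sum(axis=0)))]

    edited = ablate_concept(x, W_enc, b_enc, W_dec, b_dec, concept_idx)
    print(edited.shape)  # same shape as the input activations
```

In the actual pipeline such an intervention would run inside the denoising loop (e.g. via a forward hook on a cross-attention block), replacing that block's output with the edited reconstruction at each selected timestep.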
Community
unlearning in diffusion models using sparse autoencoders
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Efficient Fine-Tuning and Concept Suppression for Pruned Diffusion Models (2024)
- DebiasDiff: Debiasing Text-to-image Diffusion Models with Self-discovering Latent Attribute Directions (2024)
- EraseAnything: Enabling Concept Erasure in Rectified Flow Transformers (2024)
- Slot-Guided Adaptation of Pre-trained Diffusion Models for Object-Centric Learning and Compositional Generation (2025)
- AdvAnchor: Enhancing Diffusion Model Unlearning with Adversarial Anchors (2024)
- AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders (2025)
- Dataset Augmentation by Mixing Visual Concepts (2024)