Abstract
In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) where nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, theoretical and empirical, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance grows exponentially with model depth, which undesirably drives the derivative of the deep Transformer blocks toward an identity matrix, so that these blocks barely contribute to training. To resolve this training pitfall, we propose LayerNorm Scaling, which scales the output of the layer normalization inversely by the square root of the layer's depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Our experimental results, spanning model sizes from 130M to 1B, demonstrate that LayerNorm Scaling significantly enhances LLM pre-training performance compared to Pre-LN. Moreover, this improvement carries over seamlessly to supervised fine-tuning. All of these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training.
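To make the idea concrete, below is a minimal sketch of a Pre-LN Transformer block with the proposed scaling, assuming a PyTorch-style implementation; the module and argument names (`ScaledPreLN`, `PreLNBlockWithLNS`, `layer_index`) are illustrative and not taken from the authors' code.

```python
# Minimal sketch of LayerNorm Scaling (LNS) as described in the abstract.
# Assumption: a standard Pre-LN block where LayerNorm precedes each sublayer;
# names here are illustrative, not the paper's reference implementation.
import math
import torch
import torch.nn as nn

class ScaledPreLN(nn.Module):
    """LayerNorm whose output is scaled by 1/sqrt(layer_index)."""

    def __init__(self, hidden_size: int, layer_index: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        # layer_index is the 1-based depth l; the 1/sqrt(l) factor damps
        # the output variance of deeper layers.
        self.scale = 1.0 / math.sqrt(layer_index)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x) * self.scale

class PreLNBlockWithLNS(nn.Module):
    """One Transformer block: x + sublayer(scaled_LN(x)) for attention and MLP."""

    def __init__(self, hidden_size: int, num_heads: int, layer_index: int):
        super().__init__()
        self.ln_attn = ScaledPreLN(hidden_size, layer_index)
        self.ln_mlp = ScaledPreLN(hidden_size, layer_index)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln_attn(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.ln_mlp(x))
        return x

# Example: the 12th block of a model with hidden size 512 and 8 heads.
block = PreLNBlockWithLNS(hidden_size=512, num_heads=8, layer_index=12)
```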
Community
We introduce LayerNorm Scaling, a simple yet effective modification to mitigate the "Curse of Depth" by stabilizing deep layers in LLMs, thereby improving their contribution and enhancing the overall model quality.
Seems like a "no downsides" replacement for Pre-Layer Normalization. They swap Pre-LN for "LayerNorm Scaling", which scales the output of the Layer Normalization by a factor inversely proportional to the square root of the layer's depth. This prevents excessive variance growth with depth, stabilizes gradient flow, enhances the contribution of deeper Transformer layers during training, and reduces layer-wise output variance, which together yields lower training loss and faster convergence compared to vanilla Pre-LN.
TL;DR: scale the LN output variance down as you go deeper in the model. It makes training more stable. Kinda like the Eiffel Tower?
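To make the variance argument in these comments tangible, here is a deliberately simplified toy (my assumption, not an experiment from the paper): each block adds an i.i.d. unit-variance update to the residual stream, so the growth is only linear here rather than the exponential growth the paper analyzes, but the damping effect of the 1/sqrt(l) factor is still visible.

```python
# Toy illustration (not the paper's experiment): track residual-stream variance
# over depth when each block adds a unit-variance update, with and without the
# 1/sqrt(l) LayerNorm Scaling factor.
import math
import torch

torch.manual_seed(0)
depth, dim = 32, 256
x_plain = torch.randn(dim)
x_scaled = x_plain.clone()

for l in range(1, depth + 1):
    update = torch.randn(dim)                     # stand-in for sublayer(LN(x))
    x_plain = x_plain + update                    # vanilla Pre-LN: variance keeps growing
    x_scaled = x_scaled + update / math.sqrt(l)   # LNS: deeper updates are damped

print(f"variance without scaling:      {x_plain.var().item():.1f}")
print(f"variance with 1/sqrt(l) scale: {x_scaled.var().item():.1f}")
```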
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN (2024)
- DReSS: Data-driven Regularized Structured Streamlining for Large Language Models (2025)
- QPruner: Probabilistic Decision Quantization for Structured Pruning in Large Language Models (2024)
- Peri-LN: Revisiting Layer Normalization in the Transformer Architecture (2025)
- TrimLLM: Progressive Layer Dropping for Domain-Specific LLMs (2024)
- Tensor Product Attention Is All You Need (2025)
- Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs (2024)