arxiv:2502.01637

Scaling Embedding Layers in Language Models

Published on Feb 3 · Submitted by akhaliq on Feb 4

Authors:

Da Yu, Yangsibo Huang, Pritish Kamath, Daogao Liu, Chiyuan Zhang

Abstract

We propose SCONE (Scalable, Contextualized, Offloaded, N-gram Embedding), a method for extending input embedding layers to enhance language model performance as layer size scales. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent n-grams. These embeddings provide contextualized representation for each input token and are learned with a separate model during training. During inference, they are precomputed and stored in off-accelerator memory with minimal impact on inference speed. SCONE enables two new scaling strategies: increasing the number of cached n-gram embeddings and scaling the model used to learn them, all while maintaining fixed inference-time FLOPS. We show that scaling both aspects allows SCONE to outperform a 1.9B parameter baseline across diverse corpora, while using only half the inference-time FLOPS.
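The inference-time lookup described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the `max_n` parameter, the longest-match rule, and the choice to use a cached n-gram embedding in place of the plain token embedding (falling back to the token embedding when nothing matches) are all hypothetical. A Python dict stands in for the precomputed table that the abstract says lives in off-accelerator memory.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]
NGram = Tuple[int, ...]


def longest_cached_ngram(
    tokens: Sequence[int], pos: int, table: Dict[NGram, Vector], max_n: int
) -> Optional[NGram]:
    """Return the longest cached n-gram (n >= 2) ending at `pos`, or None."""
    # Try the longest candidate first, then shrink down to bigrams.
    for n in range(min(max_n, pos + 1), 1, -1):
        ngram = tuple(tokens[pos - n + 1 : pos + 1])
        if ngram in table:
            return ngram
    return None


def contextual_input_embeddings(
    tokens: Sequence[int],
    token_emb: Dict[int, Vector],
    ngram_table: Dict[NGram, Vector],
    max_n: int = 3,
) -> List[Vector]:
    """Pick one input embedding per token position.

    Assumption: a cached n-gram embedding is used when one matches,
    otherwise the ordinary token embedding is used.
    """
    out: List[Vector] = []
    for pos, tok in enumerate(tokens):
        ngram = longest_cached_ngram(tokens, pos, ngram_table, max_n)
        out.append(ngram_table[ngram] if ngram is not None else token_emb[tok])
    return out
```

Because the output vocabulary is untouched and each position costs only a few table lookups, enlarging `ngram_table` in this sketch adds memory but no extra accelerator compute, which is the property the abstract emphasizes.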

Community

akhaliq

Paper submitter 2 days ago

Jellyfish0538

Paper author 2 days ago

Thanks a lot for sharing!

A concurrent work, Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling, also decouples the input embedding layer from the decoding layer. That said, we take a completely different approach, leading to notable differences in pros and cons. See our related work section for a more detailed discussion!