Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding
Abstract
Accelerating inference in Large Language Models (LLMs) is critical for real-time interactions, as they have been widely incorporated into real-world services. Speculative decoding, a fully algorithmic solution, has gained attention for improving inference speed by drafting and verifying tokens, thereby generating multiple tokens in a single forward pass. However, current drafting strategies usually require significant fine-tuning or have inconsistent performance across tasks. To address these challenges, we propose Hierarchy Drafting (HD), a novel lossless drafting approach that organizes various token sources into multiple databases in a hierarchical framework based on temporal locality. In the drafting step, HD sequentially accesses multiple databases to obtain draft tokens from the highest to the lowest locality, ensuring consistent acceleration across diverse tasks and minimizing drafting latency. Our experiments on Spec-Bench using LLMs with 7B and 13B parameters demonstrate that HD outperforms existing database drafting methods, achieving robust inference speedups across model sizes, tasks, and temperatures.
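To make the drafting step concrete, here is a minimal sketch of how tiered, locality-ordered draft databases might be queried. The tier contents, key length, and all class/variable names are illustrative assumptions, not the paper's exact implementation.

```python
from typing import Dict, List, Sequence, Tuple

class HierarchicalDrafter:
    """Illustrative sketch of hierarchy drafting: several n-gram databases are
    queried in order of assumed temporal locality, and the first hit supplies
    the draft continuation."""

    def __init__(self, tiers: List[Dict[Tuple[int, ...], List[int]]], draft_len: int = 5):
        # tiers are ordered from highest to lowest locality, e.g.
        # [current-generation history, prompt/context, global corpus statistics]
        self.tiers = tiers
        self.draft_len = draft_len

    def draft(self, context_ids: Sequence[int], key_len: int = 2) -> List[int]:
        key = tuple(context_ids[-key_len:])
        for tier in self.tiers:              # highest locality first
            continuation = tier.get(key)
            if continuation:                 # first hit wins, keeping drafting latency low
                return continuation[: self.draft_len]
        return []                            # no hit: fall back to standard decoding
```

In this sketch the tiers could hold, for example, n-grams from the ongoing generation, the prompt, and a static corpus; querying stops at the first hit, so a miss in a high-locality tier only costs a dictionary lookup.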
Community
We propose a novel speculative decoding method that leverages diverse data sources without any parameter updates.
I'm paraphrasing here, but TL;DR: let's try branch prediction with LLMs. Run real-time drafting and verification of tokens, store them in a tiered top-K or LRU database, then pull them as needed. A small model generates the draft tokens, and the large model (using a specially parallelized attention mask) verifies them. This gets about a 1.5x speedup at T=1.0. The tiered database approach is the game changer, as it balances draft-token hits against misses.
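For the verification half mentioned above, here is a minimal sketch of lossless verification under greedy decoding (the general case uses rejection sampling); the function name and tensor shapes are assumptions for illustration, not the paper's code.

```python
import torch

def verify_greedy(target_logits: torch.Tensor, draft_ids: list[int]) -> list[int]:
    """Keep the longest prefix of the draft that matches the target model's own
    argmax predictions (computed in one parallel forward pass), so the final
    output is identical to plain greedy decoding."""
    accepted = []
    predictions = target_logits.argmax(dim=-1)       # target's choice at each draft position
    for i, draft_token in enumerate(draft_ids):
        if predictions[i].item() == draft_token:
            accepted.append(draft_token)             # draft agrees with target: accept and continue
        else:
            accepted.append(predictions[i].item())   # first mismatch: take the target token and stop
            break
    return accepted
```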
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree (2024)
- M2R2: Mixture of Multi-Rate Residuals for Efficient Transformer Inference (2025)
- EasySpec: Layer-Parallel Speculative Decoding for Efficient Multi-GPU Utilization (2025)
- Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment (2025)
- HADES: Hardware Accelerated Decoding for Efficient Speculation in Large Language Models (2024)
- AdaEAGLE: Optimizing Speculative Decoding via Explicit Modeling of Adaptive Draft Structures (2024)
- Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference (2025)