Blazing-Fast Code Editing via Multi-Layer Speculation

Published February 15, 2025

🚀 We propose Blazedit, an extremely simple yet general speculative decoding method that accelerates whole-file code editing by up to 7.7x across a comprehensive set of editing scenarios. Our repository is available at ise-uiuc/blazedit, and we look forward to further optimizations and integrations.


Background

Code Editing

Large Language Models (LLMs) are now broadly used in software engineering. Applications such as code completion have been widely integrated into modern IDEs. More recently, code editing has emerged as a more fundamental way for LLMs to assist developers. For example, given just a short user prompt, LLMs can make automated edits to fix bugs or improve code quality across large code files.

However, applying LLMs to edit large files is not easy. There are two main types of code editing formats:

  1. Whole-file Edit: Generating a new file from scratch --- while this is effective, it's both expensive and slow due to massive token generation, making it unsuitable for interactive programming.
  2. Diff-based Edit: Generating a compact diff --- this is more efficient but challenging for LLMs trained on plain code text.

In this article, we focus on optimizing whole-file code editing.


Speculative Decoding

Speculative decoding is a well-known technique that accelerates LLM generation by trading extra computation for lower latency. In regular autoregressive decoding, each token requires a full forward pass of the LLM, whose speed is bounded by memory bandwidth. Speculative decoding speeds up regular decoding in two steps:

  • Drafting: speculating a sequence of draft tokens using a cheaper method, such as a smaller LLM or looking up tokens from the prompt
  • Validation: validating the whole sequence of draft tokens in one model forward pass through token prefilling

Since multiple tokens can be generated in one forward pass, decoding speed can be significantly improved.
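
To make these two steps concrete, here is a minimal greedy-acceptance sketch of draft-and-verify decoding. The `draft_next` and `target_next` callables are stand-ins for real model forward passes; for clarity the validation step is emulated position by position, whereas a real implementation scores all draft positions in a single prefill pass:

```python
def speculative_decode(prompt, target_next, draft_next, k=5, max_new=256):
    """Greedy speculative decoding sketch: draft k tokens, then verify."""
    tokens = list(prompt)
    n_generated = 0
    while n_generated < max_new:
        # Drafting: the cheap draft method proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # Validation: in a real system, one target forward pass scores all
        # k draft positions at once; here we emulate it position by position.
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            accepted.append(expected)
            if expected != draft[i]:
                break  # first mismatch: keep the target's token, drop the rest
        else:
            # All k drafts were accepted; the same pass yields one bonus token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
        n_generated += len(accepted)
    return tokens
```

Since the accepted prefix always matches what greedy decoding of the target model would produce, the output is unchanged; only the number of target forward passes shrinks.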

More specifically, two drafting methods are available in Huggingface Transformers🤗 that do not require modifying the original models:

  • Assisted Decoding proposes draft tokens by calling a smaller LLM. The current default implementation (as of Feb 13, 2025) also dynamically adjusts the draft length to optimize the acceptance rate.
  • Prompt Lookup Decoding (PLD) proposes draft tokens by copying a span of the prompt whose preceding n-gram matches the most recently generated tokens, which is almost free. This strategy is very effective in code editing, where the post-edit code often largely overlaps with the pre-edit code. Both methods can be enabled as shown below.
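
Both strategies are exposed through the standard `generate` API without any model changes. The snippet below is a minimal illustration; the model pair and the `edit_prompt` are placeholder choices, not the configuration used in our experiments:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
target = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-0.5B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)

# Placeholder editing prompt: an instruction plus the pre-edit file contents.
edit_prompt = "Fix the bug and output the whole file:\n" + open("app.py").read()
inputs = tokenizer(edit_prompt, return_tensors="pt").to(target.device)

# Assisted decoding: a smaller LLM proposes the draft tokens.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=4096)

# Prompt lookup decoding: drafts are copied from matching n-grams in the prompt.
out = target.generate(**inputs, prompt_lookup_num_tokens=10, max_new_tokens=4096)
```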

Blazedit: Multi-Layer Speculative Decoding

We propose Blazedit, a simple framework that combines assisted decoding and PLD to further accelerate whole-file code editing.

*(Figure: overview of the Blazedit multi-layer speculative decoding framework)*

We begin by discussing the limitations of existing speculative decoding methods for code editing.

  • High Overhead in the Draft Model: A draft model can generate meaningful draft tokens during real edits instead of simply copying, leading to higher acceptance rates. Nonetheless, draft generation is still autoregressive and thus incurs non-negligible overhead, especially when the draft length is long.
  • Low Acceptance Rate in PLD: PLD is efficient as the cost of drafting is negligible. However, the "copying" mechanism can lead to a very low acceptance rate in the validation step when the target model is making real edits.

Blazedit addresses these limitations with an elegant multi-layer speculative decoding strategy. At a high level, like assisted decoding, Blazedit uses a draft model to propose draft tokens that the target model validates, which yields good acceptance rates. Meanwhile, Blazedit uses PLD to accelerate the draft model itself, reducing the overhead of draft-model generation. Specifically, the PLD step is performed multiple times to accumulate draft tokens before invoking a target-model forward pass. This lets the draft model propose an adaptive number of draft tokens, which optimizes the target-model acceptance rate (see the sketch after this list):

  • It detects copy-intensive scenarios when the PLD layer achieves a high acceptance rate, in which case the draft model proposes more draft tokens.
  • It detects edit-intensive scenarios when the PLD layer achieves a low acceptance rate, in which case the draft model proposes fewer draft tokens.
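
To make the control flow concrete, here is a schematic paraphrase of this multi-layer loop in plain Python. The helpers `pld_draft`, `draft_model_verify`, and `target_model_verify` are hypothetical stand-ins, not the actual Blazedit API; the `micro_draft_tokens` and `max_num_run` knobs mirror the configuration shown at the end of this article:

```python
def blazedit_step(tokens, micro_draft_tokens=80, max_num_run=4):
    """One Blazedit iteration: accumulate PLD-accelerated draft tokens,
    then validate them with a single target-model forward pass.
    (pld_draft, draft_model_verify, target_model_verify are hypothetical.)"""
    draft = []
    for _ in range(max_num_run):
        # Inner PLD layer: near-free drafts copied from the pre-edit code.
        candidates = pld_draft(tokens + draft, n=micro_draft_tokens)
        # The draft model validates the PLD candidates in one cheap pass.
        accepted = draft_model_verify(tokens + draft, candidates)
        draft.extend(accepted)
        if len(accepted) < len(candidates):
            # Low PLD acceptance => edit-intensive region: stop early and
            # hand a shorter, higher-quality draft to the target model.
            break
    # Outer layer: one expensive target forward pass validates all drafts.
    return target_model_verify(tokens, draft)
```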

A more detailed algorithm description is shown below:

*(Figure: detailed description of the Blazedit algorithm)*

Evaluation

We benchmark Blazedit and baselines on a set of code editing scenarios. Specifically, we ask each method to edit 90 source files of around 5K characters, with expected edit ratios ranging evenly from 10% to 90%. To demonstrate the best possible performance, we grid-search the best configuration for each method on A100 GPUs.

We first report the end-to-end characters per second and speedups when running the instruction-tuned variants of Qwen2.5-Coder-32B and DeepSeekCoder-33B. We see a significant speedup over regular decoding, up to 7.7x for DeepSeekCoder-33B, as well as a 1.15-1.17x speedup over the best baseline (PLD). Blazedit also improves the worst case (90th percentile) by 1.29-1.43x over PLD.

Throughput in characters per second (higher is better); the last two columns compare Blazedit against regular decoding and the strongest baseline (PLD), respectively.

| Target Model | Metric | Regular | Assisted | PLD | Ours | Speedup (vs. Regular) | Speedup (vs. PLD) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-Coder-32B | Avg. | 74.6 | 134.2 | 379.3 | 434.8 | 5.8x | 1.15x |
| Qwen2.5-Coder-32B | P90 | 60.7 | 100.3 | 130.7 | 169.0 | 2.8x | 1.29x |
| DeepSeekCoder-33B | Avg. | 55.3 | 123.4 | 364.2 | 424.5 | 7.7x | 1.17x |
| DeepSeekCoder-33B | P90 | 45.1 | 97.8 | 120.9 | 173.4 | 3.8x | 1.43x |

The plot below shows the distribution of characters per second against the edit ratio for each method. Overall, both Blazedit and PLD are sensitive to the edit ratio: a smaller edit ratio leads to faster generation. The distribution of Blazedit data points is overall better than that of PLD. In contrast, assisted decoding is largely indifferent to the edit ratio, despite using an adaptive draft window to achieve high acceptance rates, suggesting that the overhead of draft-model generation is a major bottleneck.

*(Figure: characters per second vs. edit ratio for each method)*

The figure below presents a case study on a randomly selected editing sample. The X-axis shows the target-model forward-pass steps (fewer is better) and the Y-axis shows the number of proposed (gray) and accepted (colored) tokens. Compared to PLD, Blazedit adaptively adjusts the draft token schedule to optimize acceptance rates. Compared to dynamic assisted decoding, draft-model generation in Blazedit is much faster thanks to the PLD layer.

*(Figure: proposed vs. accepted draft tokens per target-model forward step)*

Looking Forward

Currently, users can use Blazedit in two ways:

  1. Monkey Patch. Simply follow this script to hijack the draft model generation.
  2. Customized Transformers fork. Install the forked branch and pass a blazedit_config argument to the generate method:
```python
model_output = target_model.generate(
    input_ids,
    generation_config=generation_config,
    assistant_model=draft_model,
    # to trigger the custom candidate generation
    blazedit_config=dict(
        micro_draft_tokens=80, max_num_run=4, max_matching_ngram_size=10
    ),
)
```

As a next step, we look forward to implementing and integrating Blazedit in other inference frameworks. We will also release a paper soon which includes more details and experiments.

While we have demonstrated the effectiveness of multi-layer speculation using PLD and assisted decoding, it would also be interesting to explore more advanced combinations of speculation methods across layers.
