Support for older GPUs (pre-Ampere)

#4 opened by qwq38b

Environment:

  • GPU: NVIDIA TITAN Xp
  • CUDA Version: 12.2
  • Python Version: 3.12
  • Transformers Version: 4.48.1

Issue:
Currently, the model only supports FlashAttention2, which requires Ampere or newer GPUs. When running on older GPUs, it fails with the error:
"RuntimeError: FlashAttention only supports Ampere GPUs or newer"

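For context, FlashAttention2 requires GPUs with compute capability 8.0 or higher (Ampere), while the TITAN Xp is a Pascal-generation card (compute capability 6.1). A fallback decision could be guarded by a capability check along these lines; this is only a sketch, and the helper name is hypothetical:

```python
import torch

def flash_attention_supported() -> bool:
    # Hypothetical helper: FlashAttention2 needs compute capability >= 8.0 (Ampere or newer).
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8
```
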
Feature Request:

  1. Add support for a standard attention mechanism as a fallback when FlashAttention2 is not available
  2. Make the attention implementation configurable through the model config (see the loading sketch after this list)

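If such a fallback existed, the natural user-facing switch would be the standard Transformers `attn_implementation` argument, which the custom modeling code could honor. A rough sketch of what loading could then look like (the repo ID below is a placeholder, not the actual model path):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<baichuan-model-repo>"  # placeholder; substitute the actual repo ID

# Ask for standard attention instead of FlashAttention2. This only helps if the
# remote modeling code maps "eager" to a working attention class.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="eager",
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
```
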
Rationale:

  • Not all users have access to the latest GPU hardware
  • Standard attention support would significantly increase the model's accessibility
  • Most other open-source models provide fallback options for different hardware configurations

The Baichuan_ATTENTION_CLASSES mapping in the current implementation suggests that different attention mechanisms were intended to be supported, but the 'eager' mode is not properly implemented.
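For illustration, the missing 'eager' path is small in principle; a plain scaled dot-product attention along these lines could back that entry (the function name and tensor shapes are assumptions, not the model's actual code):

```python
import math

import torch
import torch.nn.functional as F

def eager_attention(query, key, value, attention_mask=None, dropout_p=0.0, training=False):
    # query/key/value assumed to be shaped (batch, num_heads, seq_len, head_dim)
    attn_weights = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(query.size(-1))
    if attention_mask is not None:
        # Additive mask, e.g. large negative values on masked (future) positions.
        attn_weights = attn_weights + attention_mask
    attn_weights = F.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_weights = F.dropout(attn_weights, p=dropout_p, training=training)
    return torch.matmul(attn_weights, value)
```

Alternatively, `torch.nn.functional.scaled_dot_product_attention` (PyTorch 2.x) also runs on pre-Ampere GPUs, so an "sdpa" option would be another reasonable fallback.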

Would it be possible to add a proper standard attention implementation for broader hardware compatibility?
