Support for older GPUs (pre-Ampere)
#4 opened by qwq38b
Environment:
- GPU: NVIDIA TITAN Xp
- CUDA Version: 12.2
- Python Version: 3.12
- Transformers Version: 4.48.1
Issue:
Currently, the model only supports FlashAttention2, which requires Ampere or newer GPUs. When running on older GPUs, it fails with the error:
"RuntimeError: FlashAttention only supports Ampere GPUs or newer"
Feature Request:
- Add support for a standard attention mechanism as a fallback when FlashAttention2 is not available
- Make the attention implementation configurable through the model config (see the sketch after this list)
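For comparison, most Transformers models let the caller pick the implementation at load time, roughly like this (a sketch of the desired behavior; whether a trust_remote_code model honors it depends on its custom modeling code):

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "org/model-name"  # placeholder for this repository's checkpoint

# Desired behavior: fall back to a plain PyTorch implementation on pre-Ampere GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="eager",  # or "sdpa" where a recent torch build is available
    trust_remote_code=True,
)
```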
Rationale:
- Not all users have access to the latest GPU hardware
- Standard attention support would significantly increase the model's accessibility
- Most other open-source models provide fallback options for different hardware configurations
The Baichuan_ATTENTION_CLASSES mapping in the current implementation suggests that multiple attention mechanisms were intended, but the 'eager' mode is not properly implemented.
Would it be possible to add a proper standard attention implementation for broader hardware compatibility?
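As a rough illustration, here is a minimal standard scaled-dot-product attention forward in plain PyTorch, the kind of code an 'eager' entry in Baichuan_ATTENTION_CLASSES could dispatch to (shapes and argument names are assumptions based on typical Llama-style attention modules, not this repository's code):

```python
import math
import torch
import torch.nn.functional as F


def eager_attention(query, key, value, attention_mask=None, dropout_p=0.0, training=False):
    """Plain PyTorch attention sketch.

    Assumed shapes (not taken from this repository's modeling code):
      query/key/value: (batch, num_heads, seq_len, head_dim)
      attention_mask:  additive mask broadcastable to (batch, num_heads, q_len, kv_len)
    """
    head_dim = query.size(-1)
    # Scaled dot-product scores, then additive masking (e.g. causal mask).
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(head_dim)
    if attention_mask is not None:
        scores = scores + attention_mask
    # Softmax in float32 for stability, then cast back to the input dtype.
    probs = F.softmax(scores, dim=-1, dtype=torch.float32).to(query.dtype)
    probs = F.dropout(probs, p=dropout_p, training=training)
    return torch.matmul(probs, value)
```

This is slower and more memory-hungry than FlashAttention2, but it runs on Pascal-era cards such as the TITAN Xp.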