Support for older GPUs (pre-Ampere)

#4 opened by qwq38b

Environment:

  • GPU: NVIDIA TITAN Xp
  • CUDA Version: 12.2
  • Python Version: 3.12
  • Transformers Version: 4.48.1

Issue:
Currently, the model only supports FlashAttention2, which requires Ampere or newer GPUs. When running on older GPUs, it fails with the error:
"RuntimeError: FlashAttention only supports Ampere GPUs or newer"

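For context, FlashAttention2 requires GPUs with compute capability 8.0 or higher (Ampere), while the TITAN Xp is a Pascal-generation card (compute capability 6.1). A fallback decision could be guarded by a capability check along these lines; this is only a sketch, and the helper name is hypothetical:

```python
import torch

def flash_attention_supported() -> bool:
    # Hypothetical helper: FlashAttention2 needs compute capability >= 8.0 (Ampere or newer).
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8
```
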
Feature Request:

  1. Add support for a standard attention mechanism as a fallback when FlashAttention2 is not available
  2. Make the attention implementation configurable through the model config (see the loading sketch after this list)

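If such a fallback existed, the natural user-facing switch would be the standard Transformers `attn_implementation` argument, which the custom modeling code could honor. A rough sketch of what loading could then look like (the repo ID below is a placeholder, not the actual model path):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<baichuan-model-repo>"  # placeholder; substitute the actual repo ID

# Ask for standard attention instead of FlashAttention2. This only helps if the
# remote modeling code maps "eager" to a working attention class.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="eager",
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
```
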
Rationale:

  • Not all users have access to the latest GPU hardware
  • Standard attention support would significantly increase the model's accessibility
  • Most other open-source models provide fallback options for different hardware configurations

The Baichuan_ATTENTION_CLASSES mapping in the current implementation suggests that different attention mechanisms were intended to be supported, but the 'eager' mode is not properly implemented.
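For illustration, the missing 'eager' path is small in principle; a plain scaled dot-product attention along these lines could back that entry (the function name and tensor shapes are assumptions, not the model's actual code):

```python
import math

import torch
import torch.nn.functional as F

def eager_attention(query, key, value, attention_mask=None, dropout_p=0.0, training=False):
    # query/key/value assumed to be shaped (batch, num_heads, seq_len, head_dim)
    attn_weights = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(query.size(-1))
    if attention_mask is not None:
        # Additive mask, e.g. large negative values on masked (future) positions.
        attn_weights = attn_weights + attention_mask
    attn_weights = F.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_weights = F.dropout(attn_weights, p=dropout_p, training=training)
    return torch.matmul(attn_weights, value)
```

Alternatively, `torch.nn.functional.scaled_dot_product_attention` (PyTorch 2.x) also runs on pre-Ampere GPUs, so an "sdpa" option would be another reasonable fallback.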

Would it be possible to add a proper standard attention implementation for broader hardware compatibility?
