Has anybody gotten a quantized version to run with vLLM?

#24 opened by alecauduro

I'm not having any luck getting the quantized versions (Unsloth or AWQ) to work with vLLM.

I did a W8A8 quantization of the abliterated version and ran inference with vLLM; everything worked fine on a dual-card 2080 Ti 22G setup.
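
If it helps anyone reproduce this, the setup was roughly the sketch below. The checkpoint path and max_model_len are placeholders rather than exact values from my run; vLLM reads the W8A8 (compressed-tensors) scheme from the checkpoint's config, so no explicit quantization argument should be needed.

```python
from vllm import LLM, SamplingParams

# Hypothetical local path to the W8A8 export; point this at your own checkpoint.
MODEL = "/models/Mistral-Small-24B-Instruct-2501-abliterated-W8A8"

llm = LLM(
    model=MODEL,
    tensor_parallel_size=2,  # split the 24B model across the two 2080 Ti 22G cards
    max_model_len=8192,      # illustrative; lower it if the KV cache doesn't fit
)

params = SamplingParams(temperature=0.15, max_tokens=128)
out = llm.generate(["Explain in one paragraph what W8A8 quantization does."], params)
print(out[0].outputs[0].text)
```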

stelterlab/Mistral-Small-24B-Instruct-2501-AWQ worked for me on a 4090.
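
In case anyone wants to reproduce that, here is a minimal sketch of loading it on a single 24 GB card; the context length and memory fraction are only illustrative values, adjust them to your workload.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="stelterlab/Mistral-Small-24B-Instruct-2501-AWQ",
    quantization="awq",           # normally auto-detected from the repo's config
    max_model_len=16384,          # illustrative; leaves headroom for the KV cache on 24 GB
    gpu_memory_utilization=0.90,
)

print(llm.generate(["Hello!"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```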

I was able to get it running; I was missing the --enforce-eager parameter.
Now I'm trying to figure out why function calling doesn't work.
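
In case it helps, the requests I'm sending look roughly like this, using the OpenAI-compatible client against the vLLM server. The port, model name, and get_weather tool are illustrative placeholders, and this assumes the server was started with tool calling enabled.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Illustrative tool schema, not something specific to this model.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="stelterlab/Mistral-Small-24B-Instruct-2501-AWQ",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the weather in Paris right now?"},
    ],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```

The exception below comes from the follow-up request that includes the tool result.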

```
mistral_common.exceptions.InvalidMessageStructureException: Unexpected role 'system' after role 'tool'
```

OK, it was just a matter of changing the message order so the system prompt comes first. Exceptional local model!
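
For anyone else who hits this: in the follow-up request that carries the tool result, keep the system prompt as the very first message. Something like the list below (ids and contents are made up), passed to the same client.chat.completions.create(..., tools=tools) call as before:

```python
# Wrong order (what triggered the exception): the system message was placed
# after the "tool" message. Correct order: system prompt first, then the
# original user turn, the assistant's tool call, and the tool result.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Paris right now?"},
    {"role": "assistant", "content": None, "tool_calls": [{
        "id": "a1b2c3d4e",  # short alphanumeric id; mistral_common is picky about tool-call id format
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
    }]},
    {"role": "tool", "tool_call_id": "a1b2c3d4e", "content": '{"temp_c": 18, "sky": "clear"}'},
]
```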
