NOTE: DEPRECATED, OTHERS DO THIS BETTER NOW
LLaMA 65B converted to GGML via llama.cpp, then quantized to 4-bit.
The legacy files are for llama.cpp builds older than https://github.com/ggerganov/llama.cpp/pull/1508; the regular files are faster but do not work on those older versions.
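For reference, a minimal sketch of the conversion and quantization steps described above, assuming llama.cpp's tooling from around that time (script names and paths are illustrative):

```sh
# From the llama.cpp repo root: convert the original PyTorch weights to a
# 16-bit GGML file. The path to the original 65B weights is a placeholder.
python3 convert.py /path/to/LLaMA/65B/

# Quantize the resulting f16 GGML file down to 4-bit (q4_0).
./quantize /path/to/LLaMA/65B/ggml-model-f16.bin ggml-LLaMa-65B-q4_0.bin q4_0
```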
I recommend the following settings as a good starting point:
main.exe -m ggml-LLaMa-65B-q4_0.bin -n -1 -t 32 -c 2048 --temp 0.7 --repeat_penalty 1.2 --mirostat 2 --interactive-first --color
Be aware that LLaMA is a base text-generation model, not a conversational one, so you will have to prompt it differently than, for example, Vicuna or ChatGPT, as sketched below.
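For instance, instead of asking a question directly, frame the prompt as the opening of a text for the model to continue. A sketch with an illustrative prompt (flags as in the recommended settings above):

```sh
# Base models complete text rather than follow instructions, so the prompt
# is written as the start of a document to be continued. The prompt text
# here is only an example.
main.exe -m ggml-LLaMa-65B-q4_0.bin -n 256 -t 32 -c 2048 --temp 0.7 --repeat_penalty 1.2 -p "Below is a detailed, step-by-step explanation of how quicksort works."
```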