Reasoning Models
At a reasonable price, I can rent 3 A40s and test a Q6_K 9B model in ~1.2 minutes and a 72B model in ~9.2 minutes. To keep eval times down, many of the test prompts have the LLM answer multiple questions in a single response and instruct it to output nothing except its answers. Letting models explain each answer in a couple of paragraphs, as they normally would, could multiply the testing time by 4-5x. This would be an especially big issue with the political test, which has 288 questions.
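For reference, the question-packing looks roughly like the sketch below. This is a simplified illustration, not the exact prompt format used on the leaderboard:

```python
# Rough sketch of packing several questions into one prompt and telling the
# model to answer only (illustrative; the real test prompts differ).
def build_batched_prompt(questions: list[str]) -> str:
    header = (
        "Answer the following questions. Respond with ONLY the answer to each "
        "question, one per line, numbered to match. Do not explain your answers.\n\n"
    )
    body = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    return header + body

print(build_batched_prompt([
    "What is the capital of France?",
    "Is water wet? (yes/no)",
]))
```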
I don’t see how I could feasibly benchmark reasoning models. Even allowing them to think for just two paragraphs would likely be too time-consuming given the total number of questions across all of the benchmarks, and most reasoning models think for far longer than that, sometimes as much as 5,000 words (30-40 paragraphs). Testing a single reasoning model on every question would either take many hours or require paying a lot more for better GPUs. There’s also the question of how long to let a model think before deciding it’s never going to stop.
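One crude way to operationalize that cutoff would be a hard token cap, scoring the question as a miss if the model is still thinking when it hits the cap. A minimal sketch of that idea, assuming llama-cpp-python as the backend (the model path, token budget, and function name are placeholders):

```python
from llama_cpp import Llama

# Placeholder model path and context size.
llm = Llama(model_path="model.gguf", n_ctx=8192)

def answer_with_thinking_cap(prompt: str, max_think_tokens: int = 2048):
    # Cap total generated tokens; if the cap is hit, the model was still
    # "thinking" and the question gets scored as unanswered.
    out = llm.create_completion(prompt, max_tokens=max_think_tokens, temperature=0.0)
    choice = out["choices"][0]
    if choice["finish_reason"] == "length":
        return None  # ran out of budget mid-thought
    return choice["text"]
```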
If anyone has a solution to this, I’d love to hear it, but as of now the leaderboard doesn’t really support reasoning models.
It seems I might be able to reduce eval times enough by switching from llama-cpp-python to Tabby or Aphrodite to take advantage of batching. Wish I'd looked into this sooner.
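A rough sketch of what that could look like, assuming the new backend exposes an OpenAI-compatible completions endpoint (both TabbyAPI and Aphrodite do). Firing the eval prompts concurrently lets the server batch them; the base URL, model name, and concurrency limit below are placeholders:

```python
import asyncio
from openai import AsyncOpenAI

# Placeholder endpoint and credentials for a local TabbyAPI/Aphrodite server.
client = AsyncOpenAI(base_url="http://localhost:5000/v1", api_key="dummy")
semaphore = asyncio.Semaphore(16)  # limit in-flight requests

async def run_prompt(prompt: str) -> str:
    async with semaphore:
        resp = await client.completions.create(
            model="local-model", prompt=prompt, max_tokens=256, temperature=0.0
        )
        return resp.choices[0].text

async def run_all(prompts: list[str]) -> list[str]:
    # Concurrent requests let the backend batch them into shared forward passes.
    return await asyncio.gather(*(run_prompt(p) for p in prompts))

# answers = asyncio.run(run_all(test_prompts))
```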