Update README.md
README.md CHANGED
@@ -23,8 +23,8 @@ quantized_by: jartine
 # LLaMA 3.2 3B - llamafile
 
 This is a large language model that was released by Meta on 2024-09-25.
-This edition of LLaMA
-
+This edition of LLaMA packs a lot of quality into a size small enough to
+run on computers with 8GB+ of RAM. See also its sister model
 [Llama-3.2-1B-Instruct-llamafile](https://huggingface.co/Mozilla/Llama-3.2-1B-Instruct-llamafile).
 
 - Model creator: [Meta](https://huggingface.co/meta-llama/)

@@ -83,13 +83,20 @@ Having **trouble?** See the ["Gotchas"
 section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
 of the README.
 
-This model has a max context window size of 128k tokens. By default, a
-context window size of 8192 tokens is used. You can use the maximum
-context size by passing the `-c 0` flag.
-
 On Windows there's a 4GB limit on executable sizes. This means you
 should download the Q6\_K llamafile.
 
+## Context Window
+
+This model has a max context window size of 128k tokens. By default, a
+context window size of 8192 tokens is used, which for Q6\_K requires
+3.4GB of RSS RAM in addition to the 2.8GB of memory needed by the
+weights. You can ask llamafile to use the maximum context size by
+passing the `-c 0` flag, which for LLaMA 3.2 is 131072 tokens and
+requires 16.4GB of RSS RAM. That's big enough for a small book. If you
+want to be able to have a conversation with your book, you can use the
+`-f book.txt` flag.
+
 ## GPU Acceleration
 
 On GPUs with sufficient RAM, the `-ngl 999` flag may be passed to use
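The hunk above documents three flags: `-c 0` for the full 131072-token context window, `-f` for reading a prompt from a file, and `-ngl 999` for GPU offload. A minimal command-line sketch of how they might be combined is shown below; the llamafile filename is assumed from this repository's naming convention and is not stated in the diff itself.

```sh
# Filename assumed from this repository's naming convention.
chmod +x Llama-3.2-3B-Instruct.Q6_K.llamafile        # not needed on Windows

# Run with the default 8192-token context window.
./Llama-3.2-3B-Instruct.Q6_K.llamafile

# Run with the maximum 131072-token context, GPU offload, and a book as the prompt file.
./Llama-3.2-3B-Instruct.Q6_K.llamafile -c 0 -ngl 999 -f book.txt
```

Per the figures in the hunk, Q6\_K needs roughly 2.8GB + 3.4GB ≈ 6.2GB of RAM at the default window and 2.8GB + 16.4GB ≈ 19.2GB at the full window.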
@@ -133,11 +140,19 @@ Here's how fast you can expect these llamafiles to go on flagship CPUs.
 | Intel Core i9-14900K (alderlake) | Llama-3.2-3B-Instruct.F16 | 6.72 GiB | tg16 | 14.48 |
 | Intel Core i9-14900K (alderlake) | Llama-3.2-3B-Instruct.Q6\_K | 2.76 GiB | pp512 | 223.48 |
 | Intel Core i9-14900K (alderlake) | Llama-3.2-3B-Instruct.Q6\_K | 2.76 GiB | tg16 | 27.50 |
+| Raspberry Pi 5 Model B Rev 1.0 (+fp16+dotprod) | Llama-3.2-3B-Instruct.BF16 | 6.72 GiB | pp512 | 10.10 |
+| Raspberry Pi 5 Model B Rev 1.0 (+fp16+dotprod) | Llama-3.2-3B-Instruct.BF16 | 6.72 GiB | tg16 | 1.50 |
 | Raspberry Pi 5 Model B Rev 1.0 (+fp16+dotprod) | Llama-3.2-3B-Instruct.F16 | 6.72 GiB | pp512 | 17.31 |
 | Raspberry Pi 5 Model B Rev 1.0 (+fp16+dotprod) | Llama-3.2-3B-Instruct.F16 | 6.72 GiB | tg16 | 1.67 |
 | Raspberry Pi 5 Model B Rev 1.0 (+fp16+dotprod) | Llama-3.2-3B-Instruct.Q6\_K | 2.76 GiB | pp512 | 15.79 |
 | Raspberry Pi 5 Model B Rev 1.0 (+fp16+dotprod) | Llama-3.2-3B-Instruct.Q6\_K | 2.76 GiB | tg16 | 4.03 |
 
+We see from these benchmarks that the Q6\_K weights are usually the best
+choice, since they're both very high quality and always fast. In some
+cases, the full-quality BF16 and F16 might go faster or slower,
+depending on your hardware platform. F16 is particularly good, for
+example, on GPU and Raspberry Pi 5+. BF16 shines on AMD Zen 4+.
+
 ## About llamafile
 
 llamafile is a new format introduced by Mozilla Ocho on Nov 20th 2023.
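A note on reading the table in the last hunk: the `pp512` and `tg16` labels follow llama.cpp's llama-bench convention (prompt processing over a 512-token prompt and generation of 16 tokens), and the final column reports tokens per second. A row could in principle be reproduced with the benchmark tool that ships alongside llamafile; the invocation below is a hedged sketch that assumes a llama-bench-compatible `llamafile-bench` binary with the same flag names and an assumed weights filename, neither of which is stated in this diff.

```sh
# Hypothetical invocation; flag names mirror llama.cpp's llama-bench and are
# assumptions here, as is the weights filename.
llamafile-bench -m Llama-3.2-3B-Instruct.Q6_K.llamafile -p 512 -n 16
```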