Update README.md
README.md CHANGED
@@ -23,8 +23,8 @@ quantized_by: jartine
 # LLaMA 3.2 3B - llamafile
 
 This is a large language model that was released by Meta on 2024-09-25.
-This edition of LLaMA
-
+This edition of LLaMA packs a lot of quality into a size small enough to
+run on computers with 8GB+ of RAM. See also its sister model
 [Llama-3.2-1B-Instruct-llamafile](https://huggingface.co/Mozilla/Llama-3.2-1B-Instruct-llamafile).
 
 - Model creator: [Meta](https://huggingface.co/meta-llama/)

@@ -83,13 +83,20 @@ Having **trouble?** See the ["Gotchas"
 section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
 of the README.
 
-This model has a max context window size of 128k tokens. By default, a
-context window size of 8192 tokens is used. You can use the maximum
-context size by passing the `-c 0` flag.
-
 On Windows there's a 4GB limit on executable sizes. This means you
 should download the Q6\_K llamafile.
 
+## Context Window
+
+This model has a max context window size of 128k tokens. By default, a
+context window size of 8192 tokens is used, which for Q6\_K requires
+3.4GB of RSS RAM in addition to the 2.8GB of memory needed by the
+weights. You can ask llamafile to use the maximum context size by
+passing the `-c 0` flag, which for LLaMA 3.2 is 131072 tokens and
+requires 16.4GB of RSS RAM. That's big enough for a small book. If you
+want to be able to have a conversation with your book, you can use the
+`-f book.txt` flag.
+
 ## GPU Acceleration
 
 On GPUs with sufficient RAM, the `-ngl 999` flag may be passed to use
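The hunk above documents three flags: `-c 0` for the full 131072-token context window, `-f` for reading a prompt from a file, and `-ngl 999` for GPU offload. A minimal command-line sketch of how they might be combined is shown below; the llamafile filename is assumed from this repository's naming convention and is not stated in the diff itself.

```sh
# Filename assumed from this repository's naming convention.
chmod +x Llama-3.2-3B-Instruct.Q6_K.llamafile        # not needed on Windows

# Run with the default 8192-token context window.
./Llama-3.2-3B-Instruct.Q6_K.llamafile

# Run with the maximum 131072-token context, GPU offload, and a book as the prompt file.
./Llama-3.2-3B-Instruct.Q6_K.llamafile -c 0 -ngl 999 -f book.txt
```

Per the figures in the hunk, Q6\_K needs roughly 2.8GB + 3.4GB ≈ 6.2GB of RAM at the default window and 2.8GB + 16.4GB ≈ 19.2GB at the full window.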
@@ -133,11 +140,19 @@ Here's how fast you can expect these llamafiles to go on flagship CPUs.
 | Intel Core i9-14900K (alderlake) | Llama-3.2-3B-Instruct.F16 | 6.72 GiB | tg16 | 14.48 |
 | Intel Core i9-14900K (alderlake) | Llama-3.2-3B-Instruct.Q6\_K | 2.76 GiB | pp512 | 223.48 |
 | Intel Core i9-14900K (alderlake) | Llama-3.2-3B-Instruct.Q6\_K | 2.76 GiB | tg16 | 27.50 |
+| Raspberry Pi 5 Model B Rev 1.0 (+fp16+dotprod) | Llama-3.2-3B-Instruct.BF16 | 6.72 GiB | pp512 | 10.10 |
+| Raspberry Pi 5 Model B Rev 1.0 (+fp16+dotprod) | Llama-3.2-3B-Instruct.BF16 | 6.72 GiB | tg16 | 1.50 |
 | Raspberry Pi 5 Model B Rev 1.0 (+fp16+dotprod) | Llama-3.2-3B-Instruct.F16 | 6.72 GiB | pp512 | 17.31 |
 | Raspberry Pi 5 Model B Rev 1.0 (+fp16+dotprod) | Llama-3.2-3B-Instruct.F16 | 6.72 GiB | tg16 | 1.67 |
 | Raspberry Pi 5 Model B Rev 1.0 (+fp16+dotprod) | Llama-3.2-3B-Instruct.Q6\_K | 2.76 GiB | pp512 | 15.79 |
 | Raspberry Pi 5 Model B Rev 1.0 (+fp16+dotprod) | Llama-3.2-3B-Instruct.Q6\_K | 2.76 GiB | tg16 | 4.03 |
 
+We see from these benchmarks that the Q6\_K weights are usually the best
+choice, since they're both very high quality and always fast. In some
+cases, the full-quality BF16 and F16 might go faster or slower,
+depending on your hardware platform. F16 is particularly good, for
+example, on GPU and Raspberry Pi 5+. BF16 shines on AMD Zen 4+.
+
 ## About llamafile
 
 llamafile is a new format introduced by Mozilla Ocho on Nov 20th 2023.
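A note on reading the table in the last hunk: the `pp512` and `tg16` labels follow llama.cpp's llama-bench convention (prompt processing over a 512-token prompt and generation of 16 tokens), and the final column reports tokens per second. A row could in principle be reproduced with the benchmark tool that ships alongside llamafile; the invocation below is a hedged sketch that assumes a llama-bench-compatible `llamafile-bench` binary with the same flag names and an assumed weights filename, neither of which is stated in this diff.

```sh
# Hypothetical invocation; flag names mirror llama.cpp's llama-bench and are
# assumptions here, as is the weights filename.
llamafile-bench -m Llama-3.2-3B-Instruct.Q6_K.llamafile -p 512 -n 16
```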