jartine committed
Commit 507ec5f · verified · 1 Parent(s): f3724f3

Update README.md

Files changed (1): README.md (+21 -6)
README.md CHANGED

@@ -23,8 +23,8 @@ quantized_by: jartine
  # LLaMA 3.2 3B - llamafile
  
  This is a large language model that was released by Meta on 2024-09-25.
- This edition of LLaMA fits a lot of quality into a size small enough to
- comfortably run on modern laptops with 8GB+ of RAM. See also its sister model
+ This edition of LLaMA packs a lot of quality into a size small enough to
+ run on computers with 8GB+ of RAM. See also its sister model
  [Llama-3.2-1B-Instruct-llamafile](https://huggingface.co/Mozilla/Llama-3.2-1B-Instruct-llamafile).
  
  - Model creator: [Meta](https://huggingface.co/meta-llama/)
@@ -83,13 +83,20 @@ Having **trouble?** See the ["Gotchas"
  section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
  of the README.
  
- This model has a max context window size of 128k tokens. By default, a
- context window size of 8192 tokens is used. You can use the maximum
- context size by passing the `-c 0` flag.
-
  On Windows there's a 4GB limit on executable sizes. This means you
  should download the Q6\_K llamafile.
  
+ ## Context Window
+
+ This model has a max context window size of 128k tokens. By default, a
+ context window size of 8192 tokens is used, which for Q6\_K requires
+ 3.4GB of RSS RAM in addition to the 2.8GB of memory needed by the
+ weights. You can ask llamafile to use the maximum context size by
+ passing the `-c 0` flag, which for LLaMA 3.2 is 131072 tokens and
+ requires 16.4GB of RSS RAM. That's big enough for a small book. If you
+ want to have a conversation with your book, you can use the
+ `-f book.txt` flag.
+
  ## GPU Acceleration
  
  On GPUs with sufficient RAM, the `-ngl 999` flag may be passed to use
@@ -133,11 +140,19 @@ Here's how fast you can expect these llamafiles to go on flagship CPUs.
  | Intel Core i9-14900K (alderlake) | Llama-3.2-3B-Instruct.F16 | 6.72 GiB | tg16 | 14.48 |
  | Intel Core i9-14900K (alderlake) | Llama-3.2-3B-Instruct.Q6\_K | 2.76 GiB | pp512 | 223.48 |
  | Intel Core i9-14900K (alderlake) | Llama-3.2-3B-Instruct.Q6\_K | 2.76 GiB | tg16 | 27.50 |
+ | Raspberry Pi 5 Model B Rev 1.0 (+fp16+dotprod) | Llama-3.2-3B-Instruct.BF16 | 6.72 GiB | pp512 | 10.10 |
+ | Raspberry Pi 5 Model B Rev 1.0 (+fp16+dotprod) | Llama-3.2-3B-Instruct.BF16 | 6.72 GiB | tg16 | 1.50 |
  | Raspberry Pi 5 Model B Rev 1.0 (+fp16+dotprod) | Llama-3.2-3B-Instruct.F16 | 6.72 GiB | pp512 | 17.31 |
  | Raspberry Pi 5 Model B Rev 1.0 (+fp16+dotprod) | Llama-3.2-3B-Instruct.F16 | 6.72 GiB | tg16 | 1.67 |
  | Raspberry Pi 5 Model B Rev 1.0 (+fp16+dotprod) | Llama-3.2-3B-Instruct.Q6\_K | 2.76 GiB | pp512 | 15.79 |
  | Raspberry Pi 5 Model B Rev 1.0 (+fp16+dotprod) | Llama-3.2-3B-Instruct.Q6\_K | 2.76 GiB | tg16 | 4.03 |
  
+ We see from these benchmarks that the Q6\_K weights are usually the best
+ choice, since they're both very high quality and always fast. In some
+ cases, the full-quality BF16 and F16 weights might go faster or slower,
+ depending on your hardware platform. For example, F16 is particularly
+ good on GPUs and Raspberry Pi 5+, while BF16 shines on AMD Zen 4+.
+
  ## About llamafile
  
  llamafile is a new format introduced by Mozilla Ocho on Nov 20th 2023.
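
For readers skimming this commit, here is a minimal usage sketch of the flags documented above (`-c 0`, `-f`, and `-ngl 999`). The filename is an assumption based on the `Llama-3.2-3B-Instruct.Q6_K` entry in the benchmark table; substitute the actual llamafile shipped in this repository.

```sh
# Minimal sketch, not part of the commit itself. The filename below is
# assumed from the Q6_K weights named in the benchmark table.
chmod +x Llama-3.2-3B-Instruct.Q6_K.llamafile

# Default run: 8192-token context window.
./Llama-3.2-3B-Instruct.Q6_K.llamafile

# -c 0 requests the model's maximum context window (131072 tokens for
# LLaMA 3.2), which needs about 16.4GB of RSS RAM on top of the weights.
./Llama-3.2-3B-Instruct.Q6_K.llamafile -c 0

# -f loads a text file into the context window so you can chat about it;
# -ngl 999 offloads all layers to a GPU with sufficient RAM.
./Llama-3.2-3B-Instruct.Q6_K.llamafile -c 0 -f book.txt -ngl 999
```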