Update README.md
README.md
…where N_N is the average number of bits per parameter.

- 5_9 is comfortable on 12 GB cards
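As a rough back-of-the-envelope check, the average bits per parameter maps to an approximate file size as sketched below. The roughly 12 billion transformer parameters assumed here is not stated in this README and is only an assumption for illustration.

```python
# Rough size estimate only: average bits per parameter -> gigabytes,
# assuming ~12e9 transformer parameters (an assumption, not from this README)
params = 12e9
for bits in (5.9, 8.4, 9.6):
    print(f"{bits:>4} bits/param ≈ {params * bits / 8 / 1e9:.1f} GB")
# 5.9 bits/param comes out around 8.8 GB, which leaves headroom on a 12 GB card
```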
## How is this optimised?

The process for optimisation is as follows:

- 240 prompts used for popular Flux images at civit.ai were run through the full Flux.1-dev model with randomised resolution and step count.
- For a randomly selected step in the inference, the hidden states before and after the layer stack were captured.
- For each layer in turn, and for each quantization:
  - A single layer was quantized
  - The initial hidden states were processed by the modified layer stack
  - The error (MSE) in the final hidden state was calculated
- This gives a 'cost' for each possible layer quantization: how much it differs from the full model
- An optimised quantization is one that gives the desired reduction in size for the smallest total cost (a sketch of this selection is shown after this list)
- A series of recipes for optimisation has been created from the calculated costs
- The various 'in' blocks, the final layer blocks, and all normalization scale parameters are stored in float32
- Different quantizations of different parts of a layer gave significantly worse results
- Leaving bias in 16 bit made no relevant difference
- Costs were evaluated for the original Flux.1-dev model; they are assumed to be essentially the same for finetunes
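The selection step referred to above can be made concrete with a small sketch. This is not the repository's optimiser: the cost table, parameter counts, and approximate bits-per-parameter figures are invented, and the greedy "smallest extra error per bit saved" rule is just one plausible way to reach a size target at a small total cost.

```python
# Illustrative sketch only (not this repository's optimiser). It assumes
# costs[layer][quant] holds the measured MSE for quantizing that single layer,
# params_per_layer[layer] holds its parameter count, and APPROX_BITS gives rough
# effective bits per parameter for each cast -- all of these are assumptions.
APPROX_BITS = {"BF16": 16.0, "Q8_0": 8.5, "Q5_1": 6.0, "Q4_1": 5.0}

def build_recipe(costs, params_per_layer, target_bits):
    """Greedily downgrade layers until the average bits per parameter reaches the target."""
    recipe = {layer: "BF16" for layer in costs}          # start with no quantization
    total_params = sum(params_per_layer.values())

    def avg_bits():
        return sum(APPROX_BITS[recipe[l]] * params_per_layer[l] for l in recipe) / total_params

    while avg_bits() > target_bits:
        best = None                                      # (extra cost per bit saved, layer, cast)
        for layer, quant_costs in costs.items():
            current = recipe[layer]
            current_cost = quant_costs.get(current, 0.0) # BF16 has no measured error
            for quant, cost in quant_costs.items():
                bits_saved = (APPROX_BITS[current] - APPROX_BITS[quant]) * params_per_layer[layer]
                if bits_saved <= 0:
                    continue                             # not a downgrade from the current cast
                candidate = ((cost - current_cost) / bits_saved, layer, quant)
                if best is None or candidate < best:
                    best = candidate
        if best is None:
            break                                        # every layer is already at the smallest cast
        recipe[best[1]] = best[2]
    return recipe

# Toy example with invented costs: two layers, one million parameters each, 6-bit target
costs = {0: {"Q8_0": 0.01, "Q5_1": 0.05, "Q4_1": 0.20},
         1: {"Q8_0": 0.02, "Q5_1": 0.03, "Q4_1": 0.08}}
print(build_recipe(costs, {0: 1_000_000, 1: 1_000_000}, target_bits=6.0))
```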
## Details

The optimisation recipes are as follows (layers 0-18 are the double_block_layers, 19-56 are the single_block_layers):

```python
CONFIGURATIONS = {
    "9_6" : {
        'casts': [
            {'layers': '0-10', 'castto': 'BF16'},
            {'layers': '11-14, 54', 'castto': 'Q8_0'},
            {'layers': '15-36, 39-53, 55', 'castto': 'Q5_1'},
            {'layers': '37-38, 56', 'castto': 'Q4_1'},
        ]
    },
    "9_2" : {
        'casts': [
            {'layers': '0-8, 10, 12', 'castto': 'BF16'},
            {'layers': '9, 11, 13-21, 49-54', 'castto': 'patch:flux1-dev-Q6_K.gguf'},
            {'layers': '22-34, 41-48, 55', 'castto': 'patch:flux1-dev-Q5_K_S.gguf'},
            {'layers': '35-40', 'castto': 'patch:flux1-dev-Q4_K_S.gguf'},
            {'layers': '56', 'castto': 'Q4_1'},
        ]
    },
    "8_4" : {
        'casts': [
            {'layers': '0-4, 10', 'castto': 'BF16'},
            {'layers': '5-9, 11-14', 'castto': 'Q8_0'},
            {'layers': '15-35, 41-55', 'castto': 'Q5_1'},
            {'layers': '36-40, 56', 'castto': 'Q4_1'},
        ]
    },
    "7_4" : {
        'casts': [
            {'layers': '0-2', 'castto': 'BF16'},
            {'layers': '5, 7-12', 'castto': 'Q8_0'},
            {'layers': '3-4, 6, 13-33, 42-55', 'castto': 'Q5_1'},
            {'layers': '34-41, 56', 'castto': 'Q4_1'},
        ]
    },
    "5_9" : {
        'casts': [
            {'layers': '0-25, 27-28, 44-54', 'castto': 'Q5_1'},
            {'layers': '26, 29-43, 55-56', 'castto': 'Q4_1'},
        ]
    },
    "5_1" : {
        'casts': [
            {'layers': '0-56', 'castto': 'Q4_1'},
        ]
    },
}
```
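As a hypothetical illustration of how such a recipe could be consumed (this helper is not part of the repository), the 'layers' strings can be expanded into explicit layer indices. The example assumes the CONFIGURATIONS dict above is in scope.

```python
def expand_layers(spec: str) -> list[int]:
    """Expand a recipe 'layers' string such as '15-36, 39-53, 55' into layer indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-"))
            indices.extend(range(start, end + 1))
        else:
            indices.append(int(part))
    return indices

# Map every layer index to its cast for one recipe, e.g. the "5_9" recipe above
cast_for_layer = {}
for cast in CONFIGURATIONS["5_9"]["casts"]:
    for layer in expand_layers(cast["layers"]):
        cast_for_layer[layer] = cast["castto"]

assert cast_for_layer[26] == "Q4_1" and cast_for_layer[44] == "Q5_1"
```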
## Speed?

On an A40 (so plenty of VRAM), with everything except the model kept identical, the time taken to generate an image (30 steps, deis sampler) was:

- 5_1 => 40.1s
- 5_9 => 55.4s
- 6_9 => 52.1s
- 7_4 => 49.7s
- 7_6 => 43.6s
- 8_4 => 46.8s
- 9_2 => 42.8s
- 9_6 => 48.2s

For comparison:

- bfloat16 (default) =>
- fp8_e4m3fn =>
- fp8_e5m2 =>