ChrisGoringe committed
Commit f436e3f · verified · 1 Parent(s): b24d330

Update README.md

Files changed (1):
  1. README.md +23 -57
README.md CHANGED
@@ -29,17 +29,37 @@ where N_N is the average number of bits per parameter.
  - 5_9 is comfortable on 12 GB cards
  ```

+ ## Speed?
+
+ On an A40 (plenty of VRAM), with everything except the model identical, the time taken to generate an image (30 steps, deis sampler) was:
+
+ - 5_1 => 40.1s
+ - 5_9 => 55.4s
+ - 6_9 => 52.1s
+ - 7_4 => 49.7s
+ - 7_6 => 43.6s
+ - 8_4 => 46.8s
+ - 9_2 => 42.8s
+ - 9_6 => 48.2s
+
+ For comparison:
+ - bfloat16 (default) =>
+ - fp8_e4m3fn =>
+ - fp8_e5m2 =>
+

  ## How is this optimised?

  The process for optimisation is as follows:

  - 240 prompts used for flux images popular at civit.ai were run through the full Flux.1-dev model with randomised resolution and step count.
  - For a randomly selected step in the inference, the hidden states before and after the layer stack were captured.
- - For each layer in turn, and for each of the Q8_0, Q5_1 and Q4_1 quantizations:
+ - For each layer in turn, and for each quantization:
  - A single layer was quantized
- - The initial hidden states were processed by the modified layer stack
+ - The initial hidden states were processed by the modified layer stack
  - The error (MSE) in the final hidden state was calculated
- - This gives a 'cost' for each possible layer quantization
+ - This gives a 'cost' for each possible layer quantization - how much the result differs from the full model
  - An optimised quantization is one that gives the desired reduction in size for the smallest total cost
  - A series of recipes for optimization has been created from the calculated costs
  - the various 'in' blocks, the final layer blocks, and all normalization scale parameters are stored in float32
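
To make the cost measurement described in the list above concrete, here is a minimal, self-contained sketch of the per-layer procedure: quantize one layer, push the captured hidden states through the whole stack, and take the MSE against the full-precision result. The `Linear` stack and the round-to-nearest `fake_quant` are illustrative stand-ins for Flux layers and the GGUF quantizations; none of these names come from the repository.

```python
import torch

def layer_cost(stack, idx, quant, hidden_in, hidden_ref):
    """MSE cost of quantizing just layer `idx`, leaving the rest untouched."""
    saved = {k: v.clone() for k, v in stack[idx].state_dict().items()}
    stack[idx].load_state_dict({k: quant(v) for k, v in saved.items()})
    h = hidden_in
    for layer in stack:                  # run the modified layer stack
        h = layer(h)
    stack[idx].load_state_dict(saved)    # restore the original weights
    return torch.nn.functional.mse_loss(h, hidden_ref).item()

# Toy stand-ins: a stack of Linear layers, and round-to-nearest as a fake
# 'quantization' (the real measurement used Q8_0 / Q5_1 / Q4_1 and friends).
torch.manual_seed(0)
stack = torch.nn.ModuleList([torch.nn.Linear(64, 64) for _ in range(8)])
fake_quant = lambda t: torch.round(t * 8) / 8

with torch.no_grad():
    x = torch.randn(4, 64)               # stands in for a captured hidden state
    ref = x
    for layer in stack:
        ref = layer(ref)                 # full-precision final hidden state
    costs = [layer_cost(stack, i, fake_quant, x, ref) for i in range(len(stack))]
print(costs)
```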
@@ -50,57 +70,3 @@ The process for optimisation is as follows:
  - Different quantizations of different parts of a layer gave significantly worse results
  - Leaving bias in 16 bit made no relevant difference
  - Costs were evaluated for the original Flux.1-dev model. They are assumed to be essentially the same for finetunes
-
- ## Details
-
- The optimisation recipes are as follows (layers 0-18 are the double_block_layers, 19-56 are the single_block_layers)
-
- ```python
-
- CONFIGURATIONS = {
-     "9_6" : {
-         'casts': [
-             {'layers': '0-10', 'castto': 'BF16'},
-             {'layers': '11-14, 54', 'castto': 'Q8_0'},
-             {'layers': '15-36, 39-53, 55', 'castto': 'Q5_1'},
-             {'layers': '37-38, 56', 'castto': 'Q4_1'},
-         ]
-     },
-     "9_2" : {
-         'casts': [
-             {'layers': '0-8, 10, 12', 'castto': 'BF16'},
-             {'layers': '9, 11, 13-21, 49-54', 'castto': 'patch:flux1-dev-Q6_K.gguf'},
-             {'layers': '22-34, 41-48, 55', 'castto': 'patch:flux1-dev-Q5_K_S.gguf'},
-             {'layers': '35-40', 'castto': 'patch:flux1-dev-Q4_K_S.gguf'},
-             {'layers': '56', 'castto': 'Q4_1'},
-         ]
-     },
-     "8_4" : {
-         'casts': [
-             {'layers': '0-4, 10', 'castto': 'BF16'},
-             {'layers': '5-9, 11-14', 'castto': 'Q8_0'},
-             {'layers': '15-35, 41-55', 'castto': 'Q5_1'},
-             {'layers': '36-40, 56', 'castto': 'Q4_1'},
-         ]
-     },
-     "7_4" : {
-         'casts': [
-             {'layers': '0-2', 'castto': 'BF16'},
-             {'layers': '5, 7-12', 'castto': 'Q8_0'},
-             {'layers': '3-4, 6, 13-33, 42-55', 'castto': 'Q5_1'},
-             {'layers': '34-41, 56', 'castto': 'Q4_1'},
-         ]
-     },
-     "5_9" : {
-         'casts': [
-             {'layers': '0-25, 27-28, 44-54', 'castto': 'Q5_1'},
-             {'layers': '26, 29-43, 55-56', 'castto': 'Q4_1'},
-         ]
-     },
-     "5_1" : {
-         'casts': [
-             {'layers': '0-56', 'castto': 'Q4_1'},
-         ]
-     },
- }
- ```
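
For illustration, a recipe like those in the removed block above could be consumed as follows. `parse_layers` and `cast_for` are hypothetical helpers written for this note, not functions from the repository.

```python
def parse_layers(spec: str) -> set[int]:
    """Expand a 'layers' string such as '15-36, 39-53, 55' into indices."""
    out = set()
    for part in spec.split(','):
        part = part.strip()
        if '-' in part:
            lo, hi = part.split('-')
            out.update(range(int(lo), int(hi) + 1))
        else:
            out.add(int(part))
    return out

def cast_for(config: dict, layer: int) -> str:
    """Return the cast a recipe assigns to a layer index (0-56)."""
    for cast in config['casts']:
        if layer in parse_layers(cast['layers']):
            return cast['castto']
    raise ValueError(f"layer {layer} is not covered by the recipe")

# e.g. cast_for(CONFIGURATIONS['8_4'], 37) == 'Q4_1'
```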
 
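The selection step ("the desired reduction in size for the smallest total cost") is a multiple-choice knapsack problem over the measured per-layer costs. A simple greedy sketch, not necessarily the optimiser actually used here: start from the most precise option everywhere, and repeatedly downgrade the layer whose next step adds the least cost per byte saved.

```python
def choose_quants(costs, sizes, budget):
    """Greedy sketch. costs[i][q] is the measured MSE cost of casting layer i
    to option q, sizes[i][q] its resulting size in bytes; options for each
    layer are ordered from most to least precise, and each step is assumed
    to strictly shrink the layer. Returns an option index per layer whose
    total size fits the budget."""
    choice = [0] * len(costs)             # everything starts most precise
    total = sum(s[0] for s in sizes)
    while total > budget:
        best, best_rate = None, None
        for i, q in enumerate(choice):
            if q + 1 == len(costs[i]):
                continue                  # layer already at its smallest option
            saved = sizes[i][q] - sizes[i][q + 1]
            added = costs[i][q + 1] - costs[i][q]
            rate = added / saved          # cost added per byte saved
            if best_rate is None or rate < best_rate:
                best, best_rate = i, rate
        if best is None:
            break                         # cannot shrink any further
        total -= sizes[best][choice[best]] - sizes[best][choice[best] + 1]
        choice[best] += 1
    return choice
```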