Update README.md

---
license: mit
---

# Densely Connected Convolutional Networks

Apart from the academic report, this README mainly discusses the technical issues we encountered and how we handled them.

## Building the latest TensorFlow against CUDA 12.6 and cuDNN 9

We experienced **significant caveats on TensorFlow 2.4**: intolerably slow, silent model compilation, poor default gradient optimization, easy NaN pitfalls, confusing programming interfaces, and so on.

Therefore, we moved to at least TensorFlow 2.10, which brings a much better out-of-the-box experience. While TF 2.10 is capable of doing the job, we wanted to explore how TensorFlow has evolved since it dropped native Windows CUDA support (just after 2.10). To harness the full power of the GPU, the newest (compatible) toolkit is preferred. The Tesla T4 has a [compute capability](https://developer.nvidia.com/cuda-gpus) of **7.5**, which is well supported by recent releases: [CUDA 12.6 and cuDNN 9](https://docs.nvidia.com/deeplearning/cudnn/latest/reference/support-matrix.html).

Following the build guide, we installed **clang 17** (with LLVM and lld) and **GCC 14** (with libstdc++) with C++17 enabled, plus [Bazel 6.5.0](https://github.com/tensorflow/tensorflow/blob/master/.bazelversion). The Bazel version cannot easily be changed: in 6.5.0 [Bazel modified some key APIs](https://releases.bazel.build/6.5.0/release/index.html) (`label`, see the 6th bullet point) that TensorFlow uses widely, so switching to either an older or a newer version invalidates the build files TensorFlow provides.
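
A quick way to confirm that the expected toolchain is on the `PATH` before configuring (the version numbers shown are the ones we target):

```bash
clang --version   # expect clang 17.x
gcc --version     # expect gcc (GCC) 14.x
bazel --version   # expect bazel 6.5.0
```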

```bash
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
```

Run the configure script.

```bash
chmod +x ./configure
./configure
```

Through trial and error, we found that the most recent TensorFlow build supports **CUDA 12.6.1 and cuDNN 9.5.0**, each one minor version behind the newest release (as of 11/15/24).

Then call Bazel to build. We didn't have much knowledge of Bazel, but it behaves like `cmake`: a build tool driven by the given configuration.

```bash
bazel build //tensorflow/tools/pip_package:wheel --repo_env=WHEEL_NAME=tensorflow --config=cuda --config=cuda_wheel
```

During the build, a problem might emerge:

```
__float128 is not supported on this target
```

This error is raised while building CUDA-related files. Normally it should not happen, since the `__STRICT_ANSI__` guard implied by specifying strict C++17 (`-std=c++17`) should have disabled the float128 code paths; but as [this thread](https://github.com/llvm/llvm-project/issues/97866) explains, GCC's headers are the main cause of the error. On the other hand, float128 is barely needed for normal CUDA and TensorFlow usage, so we can (temporarily) disable the related pieces of the standard library by hand, changing:

```cpp
// /usr/include/c++/14.2.1/limits
...
#if defined(_GLIBCXX_USE_FLOAT128)
// We either need Q literal suffixes, or IEEE double.
#if ! defined(__STRICT_ANSI__) || defined(_GLIBCXX_DOUBLE_IS_IEEE_BINARY64)
...
```

to:

```cpp
#if ! defined(_GLIBCXX_USE_FLOAT128)
// We either need Q literal suffixes, or IEEE double.
#if ! defined(__STRICT_ANSI__) || defined(_GLIBCXX_DOUBLE_IS_IEEE_BINARY64)
```

Simply inverting the macro condition prevents the float128-related declarations from being expanded, and the build then succeeded. Installing the built wheel with pip finishes the process.

```bash
pip install bazel-bin/tensorflow/tools/pip_package/wheel_house/tensorflow-version-tags.whl
```

```python
import tensorflow as tf
tf.__version__
```

should return `2.19.0`.
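
A quick sanity check that the freshly built wheel actually sees the GPU and was compiled against the intended toolkit (a minimal sketch; the exact build-info keys may vary across versions):

```python
import tensorflow as tf

# Should list the Tesla T4 as a physical GPU device.
print(tf.config.list_physical_devices("GPU"))

# The build info records the CUDA/cuDNN versions the wheel was compiled against.
print(tf.sysconfig.get_build_info())
```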

## TensorFlow's Changes Regarding XLA

### XLA

Recent versions of TensorFlow encourage (and mostly default to) [XLA](https://openxla.org/xla) (Accelerated Linear Algebra) to statically build and train the model. XLA is a compiler technology that simplifies and merges computational graphs to reduce the number of ops executed on compute-intensive hardware.

Basically, generating the computational graph involves **compilation**, a paramount field of study in computer science. XLA leverages **LLVM** as its compilation backend (lexer, parser, IR, etc.) and defines an operation set called **StableHLO**:

```
func.func @main(
  %image: tensor<28x28xf32>,
  %weights: tensor<784x10xf32>,
  %bias: tensor<1x10xf32>
) -> tensor<1x10xf32> {
  %0 = "stablehlo.reshape"(%image) : (tensor<28x28xf32>) -> tensor<1x784xf32>
  %1 = "stablehlo.dot"(%0, %weights) : (tensor<1x784xf32>, tensor<784x10xf32>) -> tensor<1x10xf32>
  %2 = "stablehlo.add"(%1, %bias) : (tensor<1x10xf32>, tensor<1x10xf32>) -> tensor<1x10xf32>
  %3 = "stablehlo.constant"() {value = dense<0.0> : tensor<1x10xf32>} : () -> tensor<1x10xf32>
  %4 = "stablehlo.maximum"(%2, %3) : (tensor<1x10xf32>, tensor<1x10xf32>) -> tensor<1x10xf32>
  "func.return"(%4): (tensor<1x10xf32>) -> ()
}
```

It reads like a mix of assembly (in its register and function notation) and [ONNX](https://onnx.ai/onnx/intro/concepts.html#input-output-node-initializer-attributes) (a direct representation of inputs, operations and outputs).
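
To see what TensorFlow actually hands to XLA for a small computation like the one above, the compiler IR can be dumped from a jit-compiled `tf.function` (a sketch for inspection only; the emitted text is plain HLO rather than StableHLO):

```python
import tensorflow as tf

@tf.function(jit_compile=True)
def dense_layer(image, weights, bias):
    # reshape -> matmul -> add bias -> relu, mirroring the StableHLO listing above
    return tf.nn.relu(tf.matmul(tf.reshape(image, [1, 784]), weights) + bias)

image = tf.ones([28, 28])
weights = tf.ones([784, 10])
bias = tf.ones([1, 10])

print(dense_layer.experimental_get_compiler_ir(image, weights, bias)(stage="hlo"))
```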

### TensorFlow

On TensorFlow's side there are some noticeable changes. Behaviourally, on the first training and the first validation epoch (training and testing correspond to different computational graphs), the program compiles the model with XLA, introducing a noticeable lag. Programmatically, [XLA is insensitive to random seeds](https://tensorflow.google.cn/xla/known_issues?hl=zh-cn) (this page does not seem to have an English counterpart, so translate it if you want to read it), so [stateless randomness](https://tensorflow.google.cn/guide/random_numbers#stateless_rngs) has to be introduced. TensorFlow also strongly recommends stateless RNGs (random number generators) over `tf.random` in a much wider scope.

Consider a stateless dropout:

```python
import tensorflow as tf
from keras.layers import Layer  # base-class import assumed by the snippets in this README


class Dropout(Layer):
    def __init__(self, rate: float, seed: list[int]):
        super().__init__()
        self.rate = rate
        self.seed = seed  # a pair of integers, e.g. [1, 0]

    def call(self, x, training=False):
        if training:
            # public alias of the stateless dropout op
            return tf.nn.experimental.stateless_dropout(x, rate=self.rate, seed=self.seed)
        return x
```

With a given global seed such as `[1, 0]`, results stay as consistent as possible; however, exact reproducibility is still not guaranteed across different hardware.
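
A quick usage sketch of the layer above (note that a fixed per-layer seed drops the same positions on every call, which is exactly what makes the compiled graph reproducible):

```python
import tensorflow as tf

layer = Dropout(rate=0.2, seed=[1, 0])
x = tf.ones([4, 8])

y1 = layer(x, training=True)
y2 = layer(x, training=True)
# Same seed, same input -> identical dropout mask on both calls.
assert bool(tf.reduce_all(y1 == y2))
```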

## Datasets

The original paper uses CIFAR-10, CIFAR-100 and SVHN for its small-image training and benchmarks. The CIFARs are directly available from Keras (bundled with TensorFlow):

```python
import keras

# `cifar` selects the benchmark variant (10 or 100 classes)
if cifar == 10:
    (x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
elif cifar == 100:
    (x_train, y_train), (x_test, y_test) = keras.datasets.cifar100.load_data()
```
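
For reference, a common preprocessing step for these benchmarks is channel-wise standardization (a sketch; the exact preprocessing used in our experiments may differ):

```python
import numpy as np

# Per-channel mean/std computed on the training split, applied to both splits.
mean = x_train.mean(axis=(0, 1, 2), keepdims=True)
std = x_train.std(axis=(0, 1, 2), keepdims=True)
x_train = (x_train.astype("float32") - mean) / std
x_test = (x_test.astype("float32") - mean) / std
```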

SVHN, meanwhile, is available on [Hugging Face](https://huggingface.co/datasets/ufldl-stanford/svhn), uploaded by Stanford's UFLDL lab (the official source):

```python
from datasets import load_dataset

# https://huggingface.co/datasets/ufldl-stanford/svhn
ds_train, ds_test = load_dataset(
    "ufldl-stanford/svhn", "cropped_digits", split=["train", "test"]
)
```
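
To feed these splits into the same training code as the CIFAR arrays, they can be converted to NumPy (a sketch assuming the `image`/`label` columns of the `cropped_digits` configuration):

```python
import numpy as np

# Return NumPy data instead of PIL images when accessing columns.
ds_train = ds_train.with_format("numpy")
ds_test = ds_test.with_format("numpy")

x_train = np.stack(ds_train["image"])    # (N, 32, 32, 3) uint8
y_train = np.asarray(ds_train["label"])  # digits 0-9
x_test = np.stack(ds_test["image"])
y_test = np.asarray(ds_test["label"])
```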

## Gradient Checkpointing

DenseNets are hungry for memory: each layer concatenates all previous feature maps onto its input, so the features end up scattered across device memory. By rearranging the memory layout so that these feature maps are consecutive, we can read them as a whole at a significantly lower cost. This is achieved with a technique called [gradient checkpointing](https://arxiv.org/abs/1604.06174v2): instead of keeping the activations of a memory-hungry critical section, its forward pass is rerun during backprop.

In a nutshell, gradient checkpointing trades *some* time for *much more* space, especially in this DenseNet case.

```python
from functools import partial

from keras.layers import Layer  # assumed base-class import
from tensorflow import recompute_grad


class SingleDenseBlock(Layer):
    ...

    def _call(self, x, training):
        # plain forward pass: bottleneck layer, then dense layer
        y = self.bottleneck_layer(x, training=training)
        y = self.dense_layer(y, training=training)
        return y

    def _me_call(self, x, training):
        # memory-efficient path: activations inside _call are not kept;
        # they are recomputed during backprop
        return recompute_grad(partial(self._call, training=training))(x)

    def call(self, x, training=False):
        if self.memory_efficient:
            return self._me_call(x, training)
        return self._call(x, training)
```
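
For intuition, here is a standalone sketch of what `tf.recompute_grad` does, independent of the block above (illustrative shapes and variables only):

```python
import tensorflow as tf

w1 = tf.Variable(tf.random.normal([784, 256]))
w2 = tf.Variable(tf.random.normal([256, 10]))

@tf.recompute_grad
def block(x):
    h = tf.nn.relu(tf.matmul(x, w1))  # this intermediate is recomputed in the backward pass
    return tf.matmul(h, w2)

x = tf.ones([32, 784])
with tf.GradientTape() as tape:
    loss = tf.reduce_sum(block(x))
grads = tape.gradient(loss, [w1, w2])  # gradients still flow to the captured variables
```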