VQFR: Blind Face Restoration with Vector-Quantized Dictionary and Parallel Decoder

Yuchao Gu, Xintao Wang, Liangbin Xie, Chao Dong, Gen Li, Ying Shan, Ming-Ming Cheng

VQFR: There are six resolution levels, i.e., f ∈ {1, 2, 4, 8, 16, 32}, and the quantization operation is conducted on the feature level of f = 32. Each level of the encoder contains two residual blocks, and each level of the texture branch in the decoder contains three residual blocks. Each level of the main branch in the decoder contains one texture warping module and one residual block. We use bilinear upsampling/downsampling followed by a 1×1 convolution to change resolutions. VQFR has 76.3M parameters (1.07 TFLOPs) and takes 0.36 s to process a 512×512 image on an NVIDIA A100.
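As an illustration of the resolution-change operation described above (bilinear resize followed by a 1×1 convolution), a minimal PyTorch sketch is given below; the class name, channel arguments, and interpolation settings are assumptions for illustration, not the authors' released code.

```python
import torch.nn as nn
import torch.nn.functional as F

class ResizeConv(nn.Module):
    """Sketch: change spatial resolution with bilinear resampling,
    then adjust channels with a 1x1 convolution (assumed layout)."""

    def __init__(self, in_channels, out_channels, scale):
        super().__init__()
        self.scale = scale  # e.g. 2.0 to upsample, 0.5 to downsample
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        # Bilinear up/downsample to the target resolution level.
        x = F.interpolate(x, scale_factor=self.scale,
                          mode='bilinear', align_corners=False)
        # 1x1 convolution to map to the channel width of that level.
        return self.conv(x)
```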

Texture Warping Module (TWM): We use a 3×3 convolution with 32 output channels to extract input features from the degraded face. We then resize this feature to match each resolution level (f ∈ {1, 2, 4, 8, 16, 32}). The detailed architecture of the TWM is shown in Table 2. At each resolution level, an offset convolution generates offsets from the concatenation of the texture feature and the degraded-input feature. The offsets and the texture feature are then fed into a deformable convolution, which outputs the warped feature; a minimal sketch of this flow follows below.
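The following sketch illustrates the offset-prediction and deformable-warping flow described above, using torchvision's DeformConv2d. The class name, kernel size, and channel arguments are assumptions for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class TextureWarpingModuleSketch(nn.Module):
    """Sketch of one TWM level: predict offsets from the concatenated
    texture feature and degraded-input feature, then warp the texture
    feature with a deformable convolution (assumed hyperparameters)."""

    def __init__(self, feat_channels, cond_channels=32, kernel_size=3):
        super().__init__()
        # Offset conv: 2 coordinates (x, y) per kernel sampling position.
        self.offset_conv = nn.Conv2d(
            feat_channels + cond_channels,
            2 * kernel_size * kernel_size,
            kernel_size=3, padding=1)
        self.deform_conv = DeformConv2d(
            feat_channels, feat_channels,
            kernel_size=kernel_size, padding=kernel_size // 2)

    def forward(self, texture_feat, degraded_feat):
        # Offsets are predicted from texture + degraded-input features.
        offset = self.offset_conv(
            torch.cat([texture_feat, degraded_feat], dim=1))
        # The texture feature is warped by the deformable convolution.
        return self.deform_conv(texture_feat, offset)
```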