Optimization of Linear Quantization for General and Effective Low Bit-Width Network Compression

Wenxin Yang

Xiaoli Zhi and Weiqin Tong

Resumen

Current edge devices for neural networks such as FPGA, CPLD, and ASIC can support low bit-width computing to improve the execution latency and energy efficiency, but traditional linear quantization can only maintain the inference accuracy of neural networks at a bit-width above 6 bits. Different from previous studies that address this problem by clipping the outliers, this paper proposes a two-stage quantization method. Before converting the weights into fixed-point numbers, this paper first prunes the network by unstructured pruning and then uses the K-means algorithm to cluster the weights in advance to protect the distribution of the weights. To solve the instability problem of the K-means results, the PSO (particle swarm optimization) algorithm is exploited to obtain the initial cluster centroids. The experimental results on baseline deep networks such as ResNet-50, Inception-v3, and DenseNet-121 show the proposed optimized quantization method can generate a 5-bit network with an accuracy loss of less than 5% and a 4-bit network with only 10% accuracy loss as compared to 8-bit quantization. By quantization and pruning, this method reduces the model bit-width from 32 to 4 and the number of neurons by 80%. Additionally, it can be easily integrated into frameworks such as TensorRt and TensorFlow-Lite for low bit-width network quantization.

Palabras claves

deep neural network - quantized neural networks - particle swarm optimization - clustering

Acceso

PÁGINAS

pp. 0 - 0

NÚMERO

Volumen: 16 Parte: 1 (2023)

MATERIAS

INGENIERÍA Y CONSTRUCCIÓN CIVIL
TECNOLOGÍA

DOI

https://doi.org/10.3390/a16010031

Optimization of Linear Quantization for General and Effective Low Bit-Width Network Compression

Revistas destacadas