Convolutional neural networks (CNNs) are widely used and have achieved great success in computer vision and speech processing applications. However, deploying large-scale CNN models in embedded systems is subject to constraints on computation and memory. In this paper, an optimized block-floating-point (BFP) arithmetic is adopted in our accelerator for efficient inference of deep neural networks. The feature maps and model parameters are represented in 16-bit and 8-bit formats, respectively, in the off-chip memory, which reduces memory and off-chip bandwidth requirements by 50% and 75% compared to the 32-bit FP counterpart. The proposed 8-bit BFP arithmetic, with optimized rounding and shifting-operation-based quantization schemes, improves energy and hardware efficiency by three times. A CNN model can be deployed in our accelerator without retraining, at the cost of an accuracy loss of no more than 0.12%. The proposed reconfigurable accelerator, with three parallelism dimensions, ping-pong off-chip DDR3 memory access, and an optimized on-chip buffer group, is implemented on the Xilinx VC709 evaluation board. Our accelerator achieves a performance of 760.83 GOP/s and an efficiency of 82.88 GOP/s/W at a 200-MHz working frequency, significantly outperforming previous accelerators.
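To illustrate the core idea behind block-floating-point arithmetic, the sketch below quantizes a block of FP32 values so that they share a single exponent while each keeps a short signed mantissa. This is a minimal reference model only: the function names, the max-magnitude exponent selection, and the round-to-nearest policy are assumptions for illustration, not the paper's exact rounding and shifting scheme.

```python
import numpy as np

def bfp_quantize(block, mantissa_bits=8):
    """Quantize a block of FP32 values to block-floating-point form:
    one shared exponent per block, short signed-integer mantissas.
    Illustrative sketch; not the paper's exact quantization scheme."""
    max_abs = np.max(np.abs(block))
    if max_abs == 0:
        return np.zeros(block.shape, dtype=np.int32), 0
    # Shared exponent chosen from the largest magnitude in the block
    shared_exp = int(np.floor(np.log2(max_abs)))
    # Scale so every mantissa fits the signed range of `mantissa_bits` bits
    scale = 2.0 ** (shared_exp - (mantissa_bits - 2))
    lo, hi = -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1
    mantissas = np.clip(np.round(block / scale), lo, hi).astype(np.int32)
    return mantissas, shared_exp

def bfp_dequantize(mantissas, shared_exp, mantissa_bits=8):
    """Recover approximate FP values from mantissas and the shared exponent."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 2))
    return mantissas.astype(np.float64) * scale
```

Because all values in a block share one exponent, multiply-accumulate units can operate on the small integer mantissas and apply the exponent once per block, which is the source of the hardware-efficiency gains that BFP accelerators exploit.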