Convolutional neural networks (CNNs) require numerous computations and external memory accesses. Frequent accesses to off-chip memory slow processing and increase power dissipation. Moreover, storing all the data of a CNN layer in on-chip buffers requires a large share of the SRAM resources of a field-programmable gate array (FPGA) chip. This thesis presents three contributions that optimize off-chip memory access while minimizing the on-chip buffer requirement.
For real-time object detection with high throughput and power efficiency, this work presents a Tera-OPS streaming hardware accelerator implementing a YOLO (You Only Look Once) CNN. The parameters of the YOLO CNN are retrained and quantized on the PASCAL VOC dataset using binary weights and flexible low-bit activations. The binary weights allow the entire network model to be stored in the Block RAMs of an FPGA, which aggressively reduces off-chip accesses and thereby significantly improves performance. In the proposed design, all convolutional layers are fully pipelined for enhanced hardware utilization. The reduced number of DRAM accesses also lowers DRAM power consumption. Implemented on a VC707 FPGA, this CNN achieves a throughput of 1.877 TOPS at 200 MHz with batch processing while consuming 18.29 W of on-chip power, which is the best power efficiency compared with previous research.
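As a rough check on the figures above, the short Python sketch below derives the power efficiency implied by the reported throughput and on-chip power, and shows why 1-bit weights shrink model storage. The function names and the one-million-weight example are hypothetical and are not taken from the thesis; the efficiency figure covers on-chip power only, as reported.

```python
def gops_per_watt(ops_per_second: float, watts: float) -> float:
    """Power efficiency in GOPS/W from raw throughput (OPS) and power (W)."""
    return ops_per_second / watts / 1e9


def weight_storage_bytes(num_weights: int, bits_per_weight: int) -> int:
    """Bytes needed to store num_weights at the given precision (rounded up)."""
    return (num_weights * bits_per_weight + 7) // 8


# Reported figures: 1.877 TOPS at 18.29 W of on-chip power.
efficiency = gops_per_watt(1.877e12, 18.29)
print(f"{efficiency:.2f} GOPS/W")  # about 102.62 GOPS/W

# Hypothetical model size, chosen only to illustrate the 32x storage ratio
# between 32-bit floating-point weights and binary weights.
n = 1_000_000
print(weight_storage_bytes(n, 32), weight_storage_bytes(n, 1))
```

The 32-fold reduction in weight storage is what makes it plausible to keep a whole model in on-chip Block RAM; whether a given network actually fits depends on its parameter count and the device, which this sketch does not model.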
Although the characteristics of the different layers in a CNN are frequently quite different, previous hardware designs have employed common optimization schemes for all of them. This thesis proposes a layer-specific design that employs different organizations optimized for the different layers, together with a layer-specific mixed data flow. The mixed data flow aims to minimize off-chip access while demanding minimal on-chip memory (BRAM) resources of an FPGA device. The experiments demonstrate that the proposed scheme significantly outperforms previous works in terms of throughput, off-chip access, and on-chip memory requirement.
A recent state-of-the-art family of CNNs called EfficientNet/EfficientDet has achieved impressive accuracy for classification, detection, and segmentation while being much more compact than previous CNNs. However, their latency on general-purpose processors, such as CPUs and GPUs, tends to be very high because of non-optimized data accesses, which contribute to hardware under-utilization. This work presents an end-to-end framework for efficient CNN accelerator design. A coarse-grained block-wise data reuse scheme and shared MAC arrays are proposed to keep resource utilization high even for the compact networks mainly used in mobile and edge applications. From the network architecture and the quantized model exported from TensorFlow, the accelerator optimizer generates FPGA configurations and layer-wise inference code that uses efficient data reuse and optimized on-chip and off-chip memory access. The accelerator design implemented on the Xilinx KCU1500 FPGA card significantly outperforms the NVIDIA RTX 2080 Ti, Titan Xp, and GTX 1080 Ti for EfficientNet inference. Compared with the RTX 2080 Ti, the proposed design is 1.35-2.33× faster and 6.7-7.9× more power efficient. Compared with a baseline in which the weights, inputs, and outputs of each layer are accessed from off-chip memory exactly once, the proposed end-to-end framework reduces DRAM access by 47.8-84.8% for RetinaNet, YOLOv3, ResNet152, and EfficientNet. Compared with the layer-wise data reuse design of a previous work, which had the largest reduction in DRAM access among previous works for VGGNet inference, the proposed scheme reduces DRAM access by 26.3% with a similar buffer size.
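The baseline used in the DRAM comparison above can be sketched as a simple traffic count in which every layer moves its inputs, weights, and outputs across the off-chip interface exactly once. The Python sketch below is an illustrative model under that assumption only; the `ConvLayer` type, the function names, and the layer shape are hypothetical and do not come from the thesis.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ConvLayer:
    """Hypothetical description of one convolutional layer."""
    in_h: int
    in_w: int
    in_c: int
    out_h: int
    out_w: int
    out_c: int
    kernel: int  # square kernel size, e.g. 3 for a 3x3 convolution


def baseline_dram_elements(layers: Iterable[ConvLayer]) -> int:
    """Off-chip traffic (in elements) when each layer reads its inputs and
    weights once and writes its outputs once."""
    total = 0
    for layer in layers:
        inputs = layer.in_h * layer.in_w * layer.in_c
        weights = layer.kernel * layer.kernel * layer.in_c * layer.out_c
        outputs = layer.out_h * layer.out_w * layer.out_c
        total += inputs + weights + outputs
    return total


def reduction_percent(baseline: int, optimized: int) -> float:
    """Relative reduction in DRAM access versus the baseline, in percent."""
    return 100.0 * (baseline - optimized) / baseline


# Illustrative single layer: 8x8x3 input, 3x3 kernel, 8x8x16 output.
layers = [ConvLayer(8, 8, 3, 8, 8, 16, 3)]
print(baseline_dram_elements(layers))  # 192 + 432 + 1024 = 1648
```

Under this model, reuse across layers lowers traffic by keeping an intermediate feature map on chip, so that its write by one layer and its read by the next no longer reach DRAM; `reduction_percent` expresses the saving in the same form as the percentages quoted above.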