Scalable CNN accelerator design with a coarse-grained block-wise data reuse

Multilingual Abstract

Convolutional neural networks (CNNs) require numerous computations and external memory accesses. Frequent accesses to off-chip memory cause slow processing and large power dissipation. Moreover, storing all the data of a CNN layer in on-chip buffers requires a large share of the SRAM resources of a field-programmable gate array (FPGA) chip. This thesis presents three contributions that optimize off-chip memory access while minimizing the on-chip buffer requirement.
For real-time object detection with high throughput and power efficiency, this work presents a Tera-OPS streaming hardware accelerator implementing a YOLO (You-Only-Look-Once) CNN. The parameters of the YOLO CNN are retrained and quantized on the PASCAL VOC dataset using binary weights and flexible low-bit activations. The binary weights allow the entire network model to be stored in the Block RAMs of an FPGA, which aggressively reduces off-chip accesses and thereby yields a significant performance enhancement. In the proposed design, all convolutional layers are fully pipelined for enhanced hardware utilization. The reduced number of DRAM accesses also lowers DRAM power consumption. Implemented on a VC707 FPGA, this CNN achieves a throughput of 1.877 TOPS at 200 MHz with batch processing while consuming 18.29 W of on-chip power, which is the best power efficiency among the previous works compared.
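As a rough check on the figures above, the power efficiency implied by the reported throughput and on-chip power can be worked out directly. The sketch below is an illustration only: the variable names are made up here, and the result covers on-chip power alone, since the abstract does not report DRAM power.

```python
# Figures reported in the abstract (VC707 FPGA, batch processing).
throughput_tops = 1.877   # tera-operations per second
clock_mhz = 200           # clock frequency in MHz
on_chip_power_w = 18.29   # on-chip power in watts

# Power efficiency in GOPS/W (on-chip power only).
gops_per_watt = throughput_tops * 1000 / on_chip_power_w

# Average number of operations completed per clock cycle.
ops_per_cycle = throughput_tops * 1e12 / (clock_mhz * 1e6)

print(f"{gops_per_watt:.1f} GOPS/W, {ops_per_cycle:.0f} ops/cycle")
```

Under these figures the design delivers roughly 102.6 GOPS/W and completes about 9,385 operations per clock cycle on average.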
Although the characteristics of the different layers in a CNN are frequently quite different, previous hardware designs have employed common optimization schemes for all of them. This thesis proposes a layer-specific design that employs different organizations optimized for the different layers. The proposed design employs a layer-specific mixed dataflow that aims to minimize off-chip access while demanding minimal on-chip memory (BRAM) resources of an FPGA device. The experiments demonstrate that the proposed scheme significantly outperforms previous works in terms of throughput, off-chip access, and on-chip memory requirement.
A recent state-of-the-art family of CNNs called EfficientNet/EfficientDet has achieved impressive accuracy for classification, detection, and segmentation while being much more compact than previous CNNs. However, their latency on general-purpose processors, such as CPUs and GPUs, tends to be very high due to non-optimized data accesses, which contribute to hardware under-utilization. This work presents an end-to-end framework for efficient CNN accelerator design. A coarse-grained block-wise data reuse scheme and shared MAC arrays are proposed to achieve high resource utilization even for the compact networks mainly used in mobile/edge applications. From the network architecture and quantized model from TensorFlow, the accelerator optimizer generates FPGA configurations and layer-wise inference code that uses efficient data reuse and optimized on/off-chip memory access. The accelerator design implemented on the Xilinx KCU1500 FPGA card significantly outperforms the NVIDIA RTX 2080 Ti, Titan Xp, and GTX 1080 Ti for EfficientNet inference. Compared to the RTX 2080 Ti, the proposed design is 1.35-2.33 times faster and 6.7-7.9 times more power efficient. Compared to the baseline, in which the weights, inputs, and outputs are accessed from the off-chip memory exactly once per layer, the proposed end-to-end framework reduces DRAM access by 47.8-84.8% for RetinaNet, YOLOv3, ResNet152, and EfficientNet. Compared to the layer-wise data reuse design in a previous work, which had the largest reduction in DRAM access among previous works for VGGNet inference, the proposed scheme reduces DRAM access by 26.3% with a similar buffer size.
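The baseline described above, where each layer reads its weights and inputs from DRAM and writes its outputs back exactly once, can be compared with block-wise reuse in a small model. The following is a simplified sketch and not the thesis's optimizer: it assumes that inside a block of consecutive layers the intermediate feature maps stay on chip, it ignores shortcut connections and tiling, and the `Layer` class, function names, and layer sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    weights: int  # weight size (arbitrary units, e.g. KB)
    inputs: int   # input feature-map size
    outputs: int  # output feature-map size


def baseline_access(layers):
    """Every layer moves its weights, inputs, and outputs through DRAM once."""
    return sum(l.weights + l.inputs + l.outputs for l in layers)


def blockwise_access(layers, blocks):
    """Each block loads all its weights, reads only the first layer's input,
    and writes only the last layer's output (assumed on-chip intermediates)."""
    total = 0
    for block in blocks:
        total += sum(layers[i].weights for i in block)
        total += layers[block[0]].inputs + layers[block[-1]].outputs
    return total


def peak_intermediate(layers, blocks):
    """Largest intermediate feature map that must be buffered on chip."""
    return max((layers[i].outputs for b in blocks for i in b[:-1]), default=0)


# Hypothetical three-layer network; sizes are illustrative only.
layers = [Layer(10, 100, 80), Layer(20, 80, 60), Layer(30, 60, 40)]

baseline = baseline_access(layers)
partial = blockwise_access(layers, [[0, 1], [2]])
fused_all = blockwise_access(layers, [[0, 1, 2]])
reduction_pct = 100 * (baseline - fused_all) / baseline

print(baseline, partial, fused_all, round(reduction_pct, 1))
```

In this toy model, larger blocks cut more DRAM traffic but raise the on-chip buffer needed for intermediates, which is the trade-off a data reuse optimizer has to balance.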

Korean Abstract

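Convolutional neural networks (CNNs) require a large number of computations and external memory accesses. However, a large number of external memory accesses leads to high power consumption and lowers the processing speed of the network. In addition, storing all the data of a CNN layer in an on-chip buffer consumes a large amount of the SRAM resources of a field-programmable gate array (FPGA) chip. This thesis proposes three hardware design methods that optimize off-chip memory access while minimizing on-chip buffer consumption.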
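This thesis proposes a Tera-OPS streaming hardware accelerator that runs a YOLO (You-Only-Look-Once) CNN for real-time object detection while achieving high throughput and power efficiency. The parameters of the YOLO CNN consist of binarized weights and variable-bit activations, and were retrained and quantized using the PASCAL VOC dataset. Because the binarized weights allow the weights of the entire network model to be stored in the Block RAMs of the FPGA, off-chip memory accesses are reduced, which yields a large performance improvement. The proposed design supports fully pipelined operation to raise the hardware efficiency of all convolutional layers. The reduced number of DRAM accesses lowers the DRAM power consumption of the proposed design. When implemented on a VC707 FPGA, the design delivers a throughput of 1.877 TOPS at 200 MHz with batch processing while consuming 18.29 W of on-chip power. This is the highest power efficiency in comparison with previous research.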
Although each layer of a CNN has different characteristics, existing hardware designs apply a common optimization scheme to all CNN layers. This thesis proposes a layer-specific design that applies a different organization to layers with different characteristics. The proposed design applies a layer-specific mixed dataflow, which aims to minimize off-chip memory access while consuming minimal on-chip memory (BRAM) of the FPGA device. Experiments demonstrate that this approach outperforms previous research in terms of throughput, off-chip access, and on-chip memory consumption.
EfficientNet/EfficientDet, one of the recent state-of-the-art CNN families, achieves excellent accuracy in classification, detection, and segmentation even though it is more compact than previous CNNs. On general-purpose processors such as CPUs and GPUs, however, the data accesses of these networks are not optimized, so latency is very high, which leads to hardware under-utilization. This thesis proposes an end-to-end framework for efficient CNN accelerator design. The proposed coarse-grained block-wise data reuse scheme and shared MAC arrays enable efficient use of hardware resources even for the small networks mainly used in mobile/edge applications. From a network architecture and quantized model created in TensorFlow, the framework generates an FPGA configuration and layer-wise inference code that apply efficient data reuse and optimized on/off-chip memory access. The accelerator is implemented on a Xilinx KCU1500 FPGA card and outperforms the NVIDIA RTX 2080 Ti, Titan Xp, and GTX 1080 Ti for EfficientNet inference. In particular, compared with the RTX 2080 Ti, the proposed design is 1.35-2.33 times faster and 6.7-7.9 times more power efficient. Compared with a baseline in which the weights, inputs, and outputs of each layer are accessed from off-chip memory exactly once, the proposed framework reduces the number of DRAM accesses by 47.8-84.8% for RetinaNet, YOLOv3, ResNet152, and EfficientNet. Compared with the layer-wise data reuse design that shows the largest DRAM access reduction among previous works for VGGNet inference, the proposed scheme reduces DRAM access by 26.3% with a similar buffer size.

Table of Contents

Chapter 1: Streaming CNN hardware accelerator design with a row-based weight reuse scheme
1.1 Introduction
1.2 Background
1.2.1 Conventional CNN
1.2.2 Quantization of CNN using 1-bit weight + low-bit activation
1.3 Hardware-centric quantization
1.4 Row-based weight reuse scheme for streaming hardware accelerator
1.5 The proposed streaming hardware accelerator
1.5.1 Overview of the accelerator
1.5.2 The accelerator design with row-based weight reuse scheme
1.5.3 Batch processing
1.6 Experimental results
1.6.1 Low-bit quantization
1.6.2 Comparison of the proposed row-based weight reuse scheme with the frame-based weight reuse scheme
1.6.3 Accelerator implementation
1.7 Conclusion
Chapter 2: Streaming CNN hardware accelerator design with a layer-specific mixed dataflow and mixed precision
2.1 Introduction
2.2 Related works about CNN accelerator
2.3 Layer-specific mixed dataflow design
2.4 Layer-specific mixed precision training
2.4.1 Motivation of intra-layer mixed precision training
2.4.2 Coarse-grained intra-layer mixed precision quantization
2.5 The hardware architecture with mixed precision
2.6 Experimental results
2.6.1 Coarse-grained mixed precision quantization
2.6.2 Accelerator design with mixed precision
2.6.3 Accelerator design with mixed dataflow
2.6.4 Accelerator design with mixed dataflow and mixed precision
2.7 Related works
2.8 Conclusion
Chapter 3: From Tensorflow to FPGA-based CNN accelerator design with a coarse-grained block-wise data reuse
3.1 Introduction
3.2 The proposed framework
3.2.1 Overview of the framework
3.2.2 Architecture of the accelerator
3.3 Coarse-grained data reuse optimizer
3.3.1 On/off-chip memory access with shortcut data reuse
3.3.2 Coarse-grained data reuse optimization
3.4 Experimental results
3.4.1 Single cut-point optimization
3.4.2 Multiple cut-point optimization
3.4.3 Minimum buffer requirement to satisfy the DRAM access constraints
3.4.4 Scalability and power efficiency for edge inference
3.5 Related works
3.6 Conclusion
Bibliography
Abstract in Korean
Acknowledgement
