Jaewook Lee, Yoonah Paik, Chang Hyun Kim, Wonjun Lee, Seon Wook Kim. The Institute of Electronics and Information Engineers (IEIE), 2019 IEIE Conference, Vol.2019 No.11
Deep learning techniques have been applied to various fields, such as image recognition, natural language processing, and computer vision. Accordingly, deep learning accelerators are attracting attention for their execution efficiency in terms of speed and power, and deep learning compilers help programmers develop optimized code. In this paper, we review TVM, an open-source deep learning compiler, and analyze its performance using GoogLeNet. We also compare the performance of VTA, a neural network accelerator based on TVM, against a CPU using the quantized ResNet-18.
Analysis of DRAM Throughput According to Device Address Mapping of the Deep Learning VTA Accelerator
Jaewook Lee, Yoonah Paik, Seok Young Kim, Seon Wook Kim. The Institute of Electronics and Information Engineers (IEIE), 2020 IEIE Conference, Vol.2020 No.11
Many hardware platforms have emerged to process deep learning algorithms efficiently, and deep learning compiler frameworks have also appeared to optimize their tensor operation graphs. However, such optimization alone has a limited effect on performance because memory operations occupy a large part of the inference time. Therefore, it is essential to analyze the DRAM performance of the accelerator. In this paper, we analyze DRAM throughput based on the address mapping of the memory controller. We developed a memory trace extraction system for our target hardware platform, VTA. By running Ramulator with traces from this system, we analyzed the DRAM throughput and the row-buffer hit ratio for each address mapping and identified optimization opportunities.
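The core quantity in the abstract above, the row-buffer hit ratio under a given address mapping, can be illustrated with a small pure-Python sketch. The bit layouts and mapping names below are illustrative assumptions, not the paper's actual configuration or Ramulator's implementation.

```python
# Hypothetical sketch: estimate the DRAM row-buffer hit ratio of a memory
# trace under two address mappings. Bank/row bit positions are assumptions.

def row_of(addr, mapping):
    """Extract (bank, row) from a physical address for a given mapping."""
    if mapping == "row_bank_col":       # row bits above bank bits (assumed)
        bank = (addr >> 10) & 0x7       # 8 banks, 1 KiB column range (assumed)
        row = addr >> 13
    else:                               # "bank_row_col": bank bits on top
        row = (addr >> 10) & 0x7FFF     # 32K rows (assumed)
        bank = addr >> 25
    return bank, row

def row_hit_ratio(trace, mapping):
    """Count accesses whose row is already open in the target bank."""
    open_row = {}                       # bank -> currently open row
    hits = 0
    for addr in trace:
        bank, row = row_of(addr, mapping)
        if open_row.get(bank) == row:
            hits += 1                   # row-buffer hit: no activate needed
        open_row[bank] = row            # otherwise precharge + activate
    return hits / len(trace)

# A strided trace: sequential reads from two separate buffers.
trace = [i * 64 for i in range(1024)] + [(1 << 26) + i * 64 for i in range(1024)]
for m in ("row_bank_col", "bank_row_col"):
    print(m, round(row_hit_ratio(trace, m), 3))
```

Replaying an extracted trace through such a model for each candidate mapping makes the sensitivity of locality to the bit layout visible, which is the kind of comparison the paper performs with Ramulator at full DRAM-timing fidelity.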
A Technique for Accelerating String Search Applications Using Processing-in-Memory
Kiyong Kwon, Yoonah Paik, Seon Wook Kim. The Institute of Electronics and Information Engineers (IEIE), 2020 IEIE Conference, Vol.2020 No.8
Various devices connected over high-speed networks continuously generate data, so the volume of data to be processed keeps growing significantly. In particular, demand for the pattern matching used in many data applications is increasing. However, existing CPU-based systems are limited in the execution parallelism and memory bandwidth they can exploit. To overcome these limitations, Processing-in-Memory (PIM) systems that perform operations inside memory have been proposed. This paper proposes a method that effectively accelerates a string search application on a PIM research platform. The PIM execution is up to 1.183 times faster than CPU-only execution.
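The parallelism that PIM exposes for string search can be mimicked in plain Python by treating each "memory bank" as a worker that scans only its own partition of the text. This partitioning scheme and the function names are assumptions for illustration, not the paper's actual design.

```python
# Illustrative sketch of PIM-style partitioned string search: each "bank"
# scans its own chunk of the text concurrently. Match start indices decide
# which bank reports a match, so no occurrence is double-counted.
from concurrent.futures import ThreadPoolExecutor

def search_chunk(text, pattern, start, end):
    """Find all occurrences whose start index lies in [start, end)."""
    hits, i = [], text.find(pattern, start)
    while i != -1 and i < end:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits

def pim_style_search(text, pattern, banks=4):
    """Split the text into per-bank chunks and scan them concurrently."""
    step = (len(text) + banks - 1) // banks
    ranges = [(b * step, min((b + 1) * step, len(text))) for b in range(banks)]
    with ThreadPoolExecutor(max_workers=banks) as pool:
        parts = pool.map(lambda r: search_chunk(text, pattern, *r), ranges)
    return sorted(i for part in parts for i in part)

text = "abracadabra " * 1000
print(len(pim_style_search(text, "abra")))
```

In real PIM hardware each bank would scan its local rows without moving data to the CPU; the chunked decomposition above shows why the workload partitions cleanly across banks.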
Optimizing TensorFlow Performance by Reconstructing the Convolution Routine
Minseong Kim, Kyu Hyun Choi, Yoonah Paik, Seon Wook Kim. The Institute of Electronics and Information Engineers (IEIE), 2021, IEIE Transactions on Smart Processing & Computing, Vol.10 No.2
Using deep learning, we can build computational models composed of multiple processing layers that learn representations of data. Convolutional neural networks (CNNs) have been widely adopted to achieve significant performance in image recognition and classification. TensorFlow, an open-source deep learning framework from Google, uses profiling to select one convolution algorithm, from among several available, as the core of a CNN to deliver the best performance in terms of execution time and memory usage. However, the profiling overhead is considerable, because TensorFlow executes and profiles all the available algorithms to make the selection every time an application is launched. We observe that memory usage overshoots during profiling, which limits data parallelism and thus fails to deliver maximum performance. In this paper, we present a novel profiling method that reduces this overhead by storing the profile result from the first run and reusing it from the second run on. Using Inception-V3, we achieved up to 1.12 times and 1.11 times higher throughput compared to vanilla TensorFlow and TensorFlow with XLA JIT compilation, respectively, without losing accuracy.
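The profile-once-and-reuse idea described above can be sketched in plain Python. The function and cache-file names here are assumptions for illustration; TensorFlow's actual autotuning machinery works on cuDNN convolution algorithms, not arbitrary callables.

```python
# Minimal sketch of profile-result caching: time each candidate once on the
# first run, persist the winner to disk, and skip profiling on later runs.
import json
import os
import time

def profile_once(candidates, cache_file="profile_cache.json"):
    """Return the fastest candidate's name, reusing a cached choice if present."""
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)["best"]          # later runs: no re-profiling
    timings = {}
    for name, fn in candidates.items():          # first run: time each algorithm
        start = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - start
    best = min(timings, key=timings.get)
    with open(cache_file, "w") as f:
        json.dump({"best": best, "timings": timings}, f)
    return best

# Two interchangeable "convolution algorithms" standing in for the real ones.
candidates = {
    "naive": lambda: sum(i * i for i in range(200_000)),
    "builtin": lambda: sum(map(lambda i: i * i, range(200_000))),
}
print(profile_once(candidates))
```

The point of the design is that the expensive step (running every algorithm, with its memory-usage overshoot) happens only once per model/hardware pair; subsequent launches pay only a file read.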