Recently, in-memory big data processing frameworks such as Apache Spark and Apache Ignite have emerged to accelerate workloads that require frequent data reuse. With effective in-memory caching, these frameworks eliminate most of the I/O operations that would otherwise be necessary for communication between producer and consumer tasks. However, the benefit of in-memory caching and computation is nullified by expensive data spills and garbage collection (GC) if the memory footprint exceeds the available memory. In Spark, for example, such a scenario can trigger a significant amount of GC activity, which can account for nearly 50% of execution time and incur more than a 2x system slowdown. Therefore, a primary challenge for in-memory computing frameworks is to carefully tune memory usage to achieve optimal performance.
This thesis presents three techniques to reduce memory pressure in in-memory MapReduce frameworks. First, we introduce WASP, a workload-aware task scheduler and partitioner, which jointly optimizes both the number of data partitions (Npartitions) and the number of tasks per executor (Nthreads) at runtime by considering workload characteristics (RDD graph and input data) and the execution environment. WASP first analyzes the DAG structure of a given workload and uses an analytical model to predict an optimal setting of Npartitions and Nthreads for each stage based on both workload and platform parameters. Taking this as input, the GC-aware task scheduler further optimizes Nthreads during task execution via runtime monitoring of individual tasks. Thus, WASP maximizes CPU utilization while minimizing the overhead of data spills and GCs. Second, we introduce e-spill, an eager spill mechanism, which dynamically finds the optimal spill-threshold by monitoring GC time at runtime and thereby prevents expensive GC overhead. e-spill maintains a feedback loop between the master node and the worker nodes and gradually increases the spill-threshold until it reaches the optimal point without incurring substantial GCs. The proposed e-spill achieves robust performance for shuffle-heavy workloads without requiring any workload-dependent tuning parameters. Finally, we present SSDStreamer, an SSD-based caching system that delivers performance competitive with in-memory caching at a fraction of its cost. Instead of using DRAM as the primary cache, SSDStreamer uses it as a stream buffer for coarse-grained prefetching from a large SSD cache built on top of a lightweight user-space I/O stack. SSDStreamer delivers robust performance regardless of the working set size, as only the first request in a stream misses in DRAM while subsequent requests hit thanks to effective prefetching.
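The e-spill feedback loop described above can be illustrated with a minimal sketch. This is not the actual implementation or Spark API; the function name, parameters (`gc_time_fraction`, `threshold`, `gc_budget`, `step`), and constants are all hypothetical, chosen only to show the shape of the loop: the master gradually raises the spill-threshold while worker-reported GC time stays low, and backs off toward eager spilling once GC becomes expensive.

```python
# Hypothetical sketch of one iteration of the e-spill feedback loop.
# All names and constants are illustrative, not the real e-spill/Spark API.

def tune_spill_threshold(gc_time_fraction, threshold, max_threshold,
                         gc_budget=0.1, step=0.05):
    """Return the next spill-threshold (fraction of executor memory).

    gc_time_fraction: fraction of task time spent in GC, as reported
                      by the worker nodes to the master.
    threshold:        current spill-threshold.
    """
    if gc_time_fraction < gc_budget and threshold < max_threshold:
        # GC is cheap: spill later, i.e., keep more data in memory.
        return min(threshold + step, max_threshold)
    if gc_time_fraction >= gc_budget:
        # GC is getting expensive: back off toward eager spilling.
        return max(threshold - step, 0.1)
    # At the optimum: GC is cheap and the threshold is already maximal.
    return threshold
```

In this sketch the threshold converges to the largest value whose GC cost stays within the budget, mirroring the abstract's description of gradually increasing the spill-threshold "until it reaches the optimal point without substantial GCs."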
We integrate these techniques into Apache Spark and evaluate their performance on a 5-node cluster and on a cluster of virtual machines (VMs) on Amazon Elastic Compute Cloud (EC2), using data analytics workloads from the Intel HiBench suite. Our evaluation demonstrates that the three proposed techniques provide robust performance improvements over a baseline configured according to the Spark Tuning Guidelines with a state-of-the-art multi-level caching policy.