Recently, in-memory big data processing frameworks such as Apache Spark and Apache Ignite have emerged to accelerate workloads that require frequent data reuse. With effective in-memory caching, these frameworks eliminate most of the I/O operations that would otherwise be necessary for communication between producer and consumer tasks. However, the benefit of in-memory caching and computation is nullified by expensive data spills and garbage collection (GC) if the memory footprint exceeds the available memory. In Spark, for example, such a scenario can trigger a significant amount of GC activity, which can account for nearly 50% of execution time and incur more than a 2x system slowdown. Therefore, a primary challenge for in-memory computing frameworks is to carefully tune memory usage to achieve optimal performance.
This thesis presents three techniques to reduce memory pressure in in-memory MapReduce frameworks. First, we introduce WASP, a workload-aware task scheduler and partitioner, which jointly optimizes both the number of data partitions (Npartitions) and the number of tasks per executor (Nthreads) at runtime by considering workload characteristics (RDD graph and input data) and the execution environment. WASP first analyzes the DAG structure of a given workload and uses an analytical model to predict an optimal setting of Npartitions and Nthreads for each stage based on both workload and platform parameters. Taking this as input, the GC-aware task scheduler further optimizes Nthreads during task execution via runtime monitoring of individual tasks. Thus, WASP maximizes CPU utilization while minimizing the overhead of data spills and GCs. Second, we introduce e-spill, an eager spill mechanism, which dynamically finds the optimal spill-threshold by monitoring GC time at runtime and thereby prevents expensive GC overhead. e-spill maintains a feedback loop between the master node and the worker nodes and gradually increases the spill-threshold until it reaches the optimal point without incurring substantial GCs. The proposed e-spill achieves robust performance for shuffle-heavy workloads without requiring any workload-dependent tuning parameters. Finally, we present SSDStreamer, an SSD-based caching system that delivers performance competitive with in-memory caching at a fraction of its cost. Instead of using DRAM as the primary cache, SSDStreamer uses it as a stream buffer for coarse-grained prefetching from a large SSD cache built on top of a lightweight user-space I/O stack. SSDStreamer delivers robust performance regardless of the working set size, as only the first request in a stream misses in DRAM while subsequent requests hit thanks to effective prefetching.
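The e-spill feedback loop described above can be illustrated with a minimal sketch. This is not the actual implementation or Spark API; the function name, parameters (`gc_time_fraction`, `threshold`, `gc_budget`, `step`), and constants are all hypothetical, chosen only to show the shape of the loop: the master gradually raises the spill-threshold while worker-reported GC time stays low, and backs off toward eager spilling once GC becomes expensive.

```python
# Hypothetical sketch of one iteration of the e-spill feedback loop.
# All names and constants are illustrative, not the real e-spill/Spark API.

def tune_spill_threshold(gc_time_fraction, threshold, max_threshold,
                         gc_budget=0.1, step=0.05):
    """Return the next spill-threshold (fraction of executor memory).

    gc_time_fraction: fraction of task time spent in GC, as reported
                      by the worker nodes to the master.
    threshold:        current spill-threshold.
    """
    if gc_time_fraction < gc_budget and threshold < max_threshold:
        # GC is cheap: spill later, i.e., keep more data in memory.
        return min(threshold + step, max_threshold)
    if gc_time_fraction >= gc_budget:
        # GC is getting expensive: back off toward eager spilling.
        return max(threshold - step, 0.1)
    # At the optimum: GC is cheap and the threshold is already maximal.
    return threshold
```

In this sketch the threshold converges to the largest value whose GC cost stays within the budget, mirroring the abstract's description of gradually increasing the spill-threshold "until it reaches the optimal point without substantial GCs."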
We integrate these techniques into Apache Spark and evaluate their performance on a 5-node cluster and on a cluster of virtual machines (VMs) on Amazon Elastic Compute Cloud (EC2), using data analytics workloads from the Intel HiBench suite. Our evaluation demonstrates that the three proposed techniques provide robust performance improvements over a baseline configured according to the Spark Tuning Guidelines with a state-of-the-art multi-level caching policy.