Hash join is one of the core algorithms in databases management systems. If a hash join cannot complete in one-pass because the available memory is insufficient (i.e. hash table overflow), however, it may incur a few sequential writes and excessive ra...
Hash join is one of the core algorithms in databases management systems. If a hash join cannot complete in one-pass because the available memory is insufficient (i.e. hash table overflow), however, it may incur a few sequential writes and excessive random reads. With harddisk as the tempoary storage for hash joins, the I/O time would be dominated by slow random reads in its probing phase. Meanwhile, flash memory based SSDs (flash SSDs) are becoming popular, and we will witness in the foreseeable future that flash SSDs replace harddisks in enterprise databases. In contrast to harddisk, flash SSD without any mechanical component has fast latency in random reads, and thus it can boost hash join performance. In this paper, we investigate several important and practical issues when flash SSD is used as tempoary storage for hash join. First, we reveal the I/O patterns of hash join in detail and explain why flash SSD can outperform harddisk by more than an order of magnitude. Second, we present and analyze the impact of cluster size (i.e. I/O unit in hash join) on performance. Finally, we emperically demonstrate that, while a commerical query optimizer is error-prone in predicting the execution time with harddisk as temporary storage, it can precisely estimate the execution time with flash SSD. In summary, we show that, when used as temporary storage for hash join, flash SSD will provide more reliable cost estimation as well as fast performance.