Automating distributed tiered storage management in cluster computing
Journal
Proceedings of the VLDB Endowment
Date Issued
2020
Author(s)
DOI
10.14778/3357377.3357381
Abstract
Data-intensive platforms such as Hadoop and Spark are routinely used to process massive amounts of data residing on distributed file systems like HDFS. Increasing memory sizes and new hardware technologies (e.g., NVRAM, SSDs) have recently led to the introduction of storage tiering in such settings. However, users are now burdened with the additional complexity of managing the multiple storage tiers and the data residing on them while trying to optimize their workloads. In this paper, we develop a general framework for automatically moving data across the available storage tiers in distributed file systems. Moreover, we employ machine learning for tracking and predicting file access patterns, which we use to decide when and which data to move up or down the storage tiers for increasing system performance. Our approach uses incremental learning to dynamically refine the models with new file accesses, allowing them to naturally adjust and adapt to workload changes over time. Our extensive evaluation using realistic workloads derived from Facebook and CMU traces compares our approach with several other policies and showcases significant benefits in terms of both workload performance and cluster efficiency.
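As a rough illustration of the idea summarized in the abstract (an incrementally updated model predicts file accesses, and the prediction drives moves up or down the tiers), the following is a minimal sketch only. It is not the paper's framework or model: the online logistic regression, the two features (normalized recency and frequency), the tier names, and the thresholds are all assumptions chosen for illustration.

```python
import math


class OnlineAccessModel:
    """Hypothetical online logistic regression over per-file features.

    Assumption: features are scaled to [0, 1]; this is a stand-in for
    whatever learner a real system would use.
    """

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        """Estimated probability that the file is accessed soon."""
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, accessed):
        """Incremental step: one SGD update on the log loss for one observation."""
        err = self.predict(x) - (1.0 if accessed else 0.0)
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


def decide(prob, tier, tiers=("HDD", "SSD", "MEMORY"), up=0.7, down=0.3):
    """Return the target tier: move up if likely hot, down if likely cold.

    The tier order and both thresholds are illustrative assumptions.
    """
    i = tiers.index(tier)
    if prob >= up and i < len(tiers) - 1:
        return tiers[i + 1]
    if prob <= down and i > 0:
        return tiers[i - 1]
    return tier


if __name__ == "__main__":
    model = OnlineAccessModel(n_features=2)
    # Features: [time since last access, access frequency], both normalized.
    for _ in range(500):
        model.update([0.1, 0.9], accessed=True)    # recently and often used
        model.update([0.9, 0.1], accessed=False)   # stale and rarely used
    print(decide(model.predict([0.1, 0.9]), "SSD"))
    print(decide(model.predict([0.9, 0.1]), "SSD"))
```

Because `update` is called once per observed access (or non-access), the model keeps adjusting as the workload shifts, which is the property the abstract attributes to incremental learning.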
File(s)
Name
3357377.3357381.pdf
Size
974.4 KB
Format
Adobe PDF
Checksum (MD5)
a6eefb24d7e41d8e9c8a7a040d383da1

