Trident: Task Scheduling over Tiered Storage Systems in Big Data Platforms
Journal
Proceedings of the VLDB Endowment
Date Issued
May 2021
Author(s)
DOI
10.14778/3461535.3461545
Abstract
The recent advancements in storage technologies have popularized the use of tiered storage systems in data-intensive compute
clusters. The Hadoop Distributed File System (HDFS), for example, now supports storing data in memory, SSDs, and HDDs, while
OctopusFS and hatS offer fine-grained storage tiering solutions.
However, the task schedulers of big data platforms (such as Hadoop
and Spark) will assign tasks to available resources only based on
data locality information, and completely ignore the fact that local
data is now stored on a variety of storage media with different
performance characteristics. This paper presents Trident, a principled task scheduling approach that is designed to make optimal
task assignment decisions based on both locality and storage tier
information. Trident formulates task scheduling as a minimum
cost maximum matching problem in a bipartite graph and uses a
standard solver for finding the optimal solution. In addition, Trident utilizes two novel pruning algorithms for bounding the size
of the graph, while still guaranteeing optimality. Trident is implemented in both Spark and Hadoop, and evaluated extensively using
a realistic workload derived from Facebook traces as well as an
industry-validated benchmark, demonstrating significant benefits
in terms of application performance and cluster efficiency.
clusters. The Hadoop Distributed File System (HDFS), for example, now supports storing data in memory, SSDs, and HDDs, while
OctopusFS and hatS offer fine-grained storage tiering solutions.
However, the task schedulers of big data platforms (such as Hadoop
and Spark) will assign tasks to available resources only based on
data locality information, and completely ignore the fact that local
data is now stored on a variety of storage media with different
performance characteristics. This paper presents Trident, a principled task scheduling approach that is designed to make optimal
task assignment decisions based on both locality and storage tier
information. Trident formulates task scheduling as a minimum
cost maximum matching problem in a bipartite graph and uses a
standard solver for finding the optimal solution. In addition, Trident utilizes two novel pruning algorithms for bounding the size
of the graph, while still guaranteeing optimality. Trident is implemented in both Spark and Hadoop, and evaluated extensively using
a realistic workload derived from Facebook traces as well as an
industry-validated benchmark, demonstrating significant benefits
in terms of application performance and cluster efficiency.
File(s)![Thumbnail Image]()
Name
p1570-herodotou.pdf
Size
5.41 MB
Format
Adobe PDF
Checksum (MD5)
953d40e861b6da8ec7640dd0c42da7f3

