TY - GEN
T1 - EA2S2
T2 - 26th International Conference on Computer Communications and Networks, ICCCN 2017
AU - Wang, Teng
AU - Wang, Jiayin
AU - Nguyen, Son Nam
AU - Yang, Zhengyu
AU - Mi, Ningfang
AU - Sheng, Bo
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/9/14
Y1 - 2017/9/14
N2 - Big data processing frameworks such as Hadoop have been widely adopted to process a large volume of data. A lot of prior work has focused on the allocation of resources and the execution order of jobs/tasks to improve the performance in a homogeneous cluster. In this paper, we investigate storage layer design in a heterogeneous system considering a new type of bundled jobs where the input data and associated application jobs are submitted in a bundle. Our goal is to break the barrier between resource management and the underlying storage layer, and improve data locality, an important performance factor for resource management, from the aspect of storage system. We develop a sampling-based randomized algorithm for the network file system to determine the placement of input data blocks. The main idea is to query a selected set of candidate nodes, and estimate their workload at run time combining centralized and per-node information. The node with the smallest workload is selected to host the data block. Our evaluation is based with system implementation and comprehensive experiments on NSF CloudLab platforms. We have also conducted simulation for large-scale clusters. The results show significant performance improvements in terms of execution time and data locality.
AB - Big data processing frameworks such as Hadoop have been widely adopted to process a large volume of data. A lot of prior work has focused on the allocation of resources and the execution order of jobs/tasks to improve the performance in a homogeneous cluster. In this paper, we investigate storage layer design in a heterogeneous system considering a new type of bundled jobs where the input data and associated application jobs are submitted in a bundle. Our goal is to break the barrier between resource management and the underlying storage layer, and improve data locality, an important performance factor for resource management, from the aspect of storage system. We develop a sampling-based randomized algorithm for the network file system to determine the placement of input data blocks. The main idea is to query a selected set of candidate nodes, and estimate their workload at run time combining centralized and per-node information. The node with the smallest workload is selected to host the data block. Our evaluation is based with system implementation and comprehensive experiments on NSF CloudLab platforms. We have also conducted simulation for large-scale clusters. The results show significant performance improvements in terms of execution time and data locality.
UR - http://www.scopus.com/inward/record.url?scp=85032261795&partnerID=8YFLogxK
U2 - 10.1109/ICCCN.2017.8038371
DO - 10.1109/ICCCN.2017.8038371
M3 - Conference contribution
AN - SCOPUS:85032261795
T3 - 2017 26th International Conference on Computer Communications and Networks, ICCCN 2017
BT - 2017 26th International Conference on Computer Communications and Networks, ICCCN 2017
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 31 July 2017 through 3 August 2017
ER -