EA2S2: An efficient application-aware storage system for big data processing in heterogeneous clusters

Teng Wang, Jiayin Wang, Son Nam Nguyen, Zhengyu Yang, Ningfang Mi, Bo Sheng

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

7 Citations (Scopus)

Abstract

Big data processing frameworks such as Hadoop have been widely adopted to process large volumes of data. Much prior work has focused on resource allocation and the execution order of jobs and tasks to improve performance in homogeneous clusters. In this paper, we investigate storage layer design in a heterogeneous system, considering a new type of bundled job in which the input data and the associated application jobs are submitted together. Our goal is to break the barrier between resource management and the underlying storage layer, and to improve data locality, an important performance factor for resource management, from the storage system side. We develop a sampling-based randomized algorithm for the network file system to determine the placement of input data blocks. The main idea is to query a selected set of candidate nodes and estimate their workload at run time by combining centralized and per-node information. The node with the smallest workload is selected to host the data block. Our evaluation is based on a system implementation and comprehensive experiments on NSF CloudLab platforms. We have also conducted simulations of large-scale clusters. The results show significant performance improvements in terms of execution time and data locality.
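
The placement idea in the abstract (sample a few candidate nodes, estimate each one's workload, host the block on the least loaded) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `place_block`, the `sample_size` parameter, and the additive estimator (centrally tracked load plus a per-node report) are all assumptions made for illustration, since the abstract does not specify how the two sources of information are combined.

```python
import random
from typing import Callable, Dict, Optional, Sequence


def place_block(
    nodes: Sequence[str],
    central_load: Dict[str, float],
    query_local_load: Callable[[str], float],
    sample_size: int = 3,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick a host for one data block from a random sample of nodes (hypothetical helper)."""
    if not nodes:
        raise ValueError("cluster has no nodes")
    rng = rng or random.Random()
    # Sample a small candidate set instead of probing the whole cluster.
    candidates = rng.sample(list(nodes), min(sample_size, len(nodes)))

    def estimate(node: str) -> float:
        # Assumed estimator: centrally tracked load plus the node's own report.
        return central_load.get(node, 0.0) + query_local_load(node)

    # Host the block on the candidate with the smallest estimated workload.
    return min(candidates, key=estimate)


# Toy cluster: with sample_size equal to the cluster size, every node is a candidate.
central = {"n1": 4.0, "n2": 1.0, "n3": 2.5}
local = {"n1": 0.5, "n2": 0.2, "n3": 3.0}
print(place_block(list(central), central, local.__getitem__, sample_size=3))  # prints n2
```

Sampling keeps the number of per-node queries small and constant as the cluster grows, at the cost of sometimes missing the globally least loaded node.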

Original language: English
Title of host publication: 2017 26th International Conference on Computer Communications and Networks, ICCCN 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781509029914
DOIs: https://doi.org/10.1109/ICCCN.2017.8038371
State: Published - 14 Sep 2017
Event: 26th International Conference on Computer Communications and Networks, ICCCN 2017 - Vancouver, Canada
Duration: 31 Jul 2017 → 3 Aug 2017

Other

Other: 26th International Conference on Computer Communications and Networks, ICCCN 2017
Country: Canada
City: Vancouver
Period: 31/07/17 → 3/08/17

Fingerprint

Storage System
Data Locality
Resource Management
Workload
Vertex of a graph
Sampling
File System
Heterogeneous Systems
Randomized Algorithms
Execution Time
Placement
Bundle
Experiments
Query
Resources
Big data
Node
Evaluation
Estimate
Experiment

Cite this

Wang, T., Wang, J., Nguyen, S. N., Yang, Z., Mi, N., & Sheng, B. (2017). EA2S2: An efficient application-aware storage system for big data processing in heterogeneous clusters. In 2017 26th International Conference on Computer Communications and Networks, ICCCN 2017 [8038371] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICCCN.2017.8038371
Wang, Teng ; Wang, Jiayin ; Nguyen, Son Nam ; Yang, Zhengyu ; Mi, Ningfang ; Sheng, Bo. / EA2S2 : An efficient application-aware storage system for big data processing in heterogeneous clusters. 2017 26th International Conference on Computer Communications and Networks, ICCCN 2017. Institute of Electrical and Electronics Engineers Inc., 2017.
@inproceedings{a99dac5e096347c786e4b0edfad2e69d,
title = "EA2S2: An efficient application-aware storage system for big data processing in heterogeneous clusters",
abstract = "Big data processing frameworks such as Hadoop have been widely adopted to process a large volume of data. A lot of prior work has focused on the allocation of resources and the execution order of jobs/tasks to improve the performance in a homogeneous cluster. In this paper, we investigate storage layer design in a heterogeneous system considering a new type of bundled jobs where the input data and associated application jobs are submitted in a bundle. Our goal is to break the barrier between resource management and the underlying storage layer, and improve data locality, an important performance factor for resource management, from the aspect of storage system. We develop a sampling-based randomized algorithm for the network file system to determine the placement of input data blocks. The main idea is to query a selected set of candidate nodes, and estimate their workload at run time combining centralized and per-node information. The node with the smallest workload is selected to host the data block. Our evaluation is based with system implementation and comprehensive experiments on NSF CloudLab platforms. We have also conducted simulation for large-scale clusters. The results show significant performance improvements in terms of execution time and data locality.",
author = "Teng Wang and Jiayin Wang and Nguyen, {Son Nam} and Zhengyu Yang and Ningfang Mi and Bo Sheng",
year = "2017",
month = "9",
day = "14",
doi = "10.1109/ICCCN.2017.8038371",
language = "English",
booktitle = "2017 26th International Conference on Computer Communications and Networks, ICCCN 2017",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

Wang, T, Wang, J, Nguyen, SN, Yang, Z, Mi, N & Sheng, B 2017, EA2S2: An efficient application-aware storage system for big data processing in heterogeneous clusters. in 2017 26th International Conference on Computer Communications and Networks, ICCCN 2017, 8038371, Institute of Electrical and Electronics Engineers Inc., 26th International Conference on Computer Communications and Networks, ICCCN 2017, Vancouver, Canada, 31/07/17. https://doi.org/10.1109/ICCCN.2017.8038371

EA2S2 : An efficient application-aware storage system for big data processing in heterogeneous clusters. / Wang, Teng; Wang, Jiayin; Nguyen, Son Nam; Yang, Zhengyu; Mi, Ningfang; Sheng, Bo.

2017 26th International Conference on Computer Communications and Networks, ICCCN 2017. Institute of Electrical and Electronics Engineers Inc., 2017. 8038371.


TY - GEN

T1 - EA2S2

T2 - An efficient application-aware storage system for big data processing in heterogeneous clusters

AU - Wang, Teng

AU - Wang, Jiayin

AU - Nguyen, Son Nam

AU - Yang, Zhengyu

AU - Mi, Ningfang

AU - Sheng, Bo

PY - 2017/9/14

Y1 - 2017/9/14

N2 - Big data processing frameworks such as Hadoop have been widely adopted to process a large volume of data. A lot of prior work has focused on the allocation of resources and the execution order of jobs/tasks to improve the performance in a homogeneous cluster. In this paper, we investigate storage layer design in a heterogeneous system considering a new type of bundled jobs where the input data and associated application jobs are submitted in a bundle. Our goal is to break the barrier between resource management and the underlying storage layer, and improve data locality, an important performance factor for resource management, from the aspect of storage system. We develop a sampling-based randomized algorithm for the network file system to determine the placement of input data blocks. The main idea is to query a selected set of candidate nodes, and estimate their workload at run time combining centralized and per-node information. The node with the smallest workload is selected to host the data block. Our evaluation is based with system implementation and comprehensive experiments on NSF CloudLab platforms. We have also conducted simulation for large-scale clusters. The results show significant performance improvements in terms of execution time and data locality.

AB - Big data processing frameworks such as Hadoop have been widely adopted to process a large volume of data. A lot of prior work has focused on the allocation of resources and the execution order of jobs/tasks to improve the performance in a homogeneous cluster. In this paper, we investigate storage layer design in a heterogeneous system considering a new type of bundled jobs where the input data and associated application jobs are submitted in a bundle. Our goal is to break the barrier between resource management and the underlying storage layer, and improve data locality, an important performance factor for resource management, from the aspect of storage system. We develop a sampling-based randomized algorithm for the network file system to determine the placement of input data blocks. The main idea is to query a selected set of candidate nodes, and estimate their workload at run time combining centralized and per-node information. The node with the smallest workload is selected to host the data block. Our evaluation is based with system implementation and comprehensive experiments on NSF CloudLab platforms. We have also conducted simulation for large-scale clusters. The results show significant performance improvements in terms of execution time and data locality.

UR - http://www.scopus.com/inward/record.url?scp=85032261795&partnerID=8YFLogxK

U2 - 10.1109/ICCCN.2017.8038371

DO - 10.1109/ICCCN.2017.8038371

M3 - Conference contribution

AN - SCOPUS:85032261795

BT - 2017 26th International Conference on Computer Communications and Networks, ICCCN 2017

PB - Institute of Electrical and Electronics Engineers Inc.

ER -

Wang T, Wang J, Nguyen SN, Yang Z, Mi N, Sheng B. EA2S2: An efficient application-aware storage system for big data processing in heterogeneous clusters. In 2017 26th International Conference on Computer Communications and Networks, ICCCN 2017. Institute of Electrical and Electronics Engineers Inc. 2017. 8038371 https://doi.org/10.1109/ICCCN.2017.8038371