AutoPath

Harnessing parallel execution paths for efficient resource allocation in multi-stage big data frameworks

Han Gao, Zhengyu Yang, Janki Bhimani, Teng Wang, Jiayin Wang, Bo Sheng, Ningfang Mi

Research output: Chapter in Book/Report/Conference proceedingConference contributionResearchpeer-review

20 Citations (Scopus)

Abstract

Due to the flexibility of data operations and scalability of in- memory cache, Spark has revealed the potential to become the standard distributed framework to replace Hadoop for data-intensive processing in both industry and academia. However, we observe that the built-in scheduling algorithms in Spark (i.e., FIFO and FAIR) are not optimized for the applications with multiple parallel and independent branches in stages. Specifically, the child stage needs to wait and collect data from all its parent branches, but this wait has no guaranteed upper bound since it is tightly coupled with each branch's workload characteristic, stage order, and their corresponding allocated computing resource. To address this challenge, we investigate a superior solution which ensures all branches acquire suitable resources according to their workload demand in order to let the finish time of each branch be as close as possible. Based on this, we propose a novel scheduling policy, named AutoPath, which can effectively reduce the overall makespan of such kind of applications by detecting and leveraging the parallel path, and adaptively assigning computing resources based on the estimated workload demands during runtime. We implemented the new scheduling scheme in Spark v1.5.0 and evaluated it with selected representative workloads. The experiments demonstrate that our new scheduler effectively reduces the makespan and improves resource utilizations for these applications, compared to the current FIFO and FAIR schedulers.

Original languageEnglish
Title of host publication2017 26th International Conference on Computer Communications and Networks, ICCCN 2017
PublisherInstitute of Electrical and Electronics Engineers Inc.
ISBN (Electronic)9781509029914
DOIs
StatePublished - 14 Sep 2017
Event26th International Conference on Computer Communications and Networks, ICCCN 2017 - Vancouver, Canada
Duration: 31 Jul 20173 Aug 2017

Publication series

Name2017 26th International Conference on Computer Communications and Networks, ICCCN 2017

Other

Other26th International Conference on Computer Communications and Networks, ICCCN 2017
CountryCanada
CityVancouver
Period31/07/173/08/17

Fingerprint

Electric sparks
Resource Allocation
Resource allocation
Branch
Workload
Path
Scheduling
Resources
Cache memory
Scheduler
Scheduling algorithms
Scalability
No-wait
Scheduling Policy
Computing
Scheduling Algorithm
Processing
Flexibility
Framework
Big data

Keywords

  • Resource management
  • Scheduling
  • Spark
  • Task assignment
  • Workload evaluation & estimation

Cite this

Gao, H., Yang, Z., Bhimani, J., Wang, T., Wang, J., Sheng, B., & Mi, N. (2017). AutoPath: Harnessing parallel execution paths for efficient resource allocation in multi-stage big data frameworks. In 2017 26th International Conference on Computer Communications and Networks, ICCCN 2017 [8038381] (2017 26th International Conference on Computer Communications and Networks, ICCCN 2017). Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/ICCCN.2017.8038381
Gao, Han ; Yang, Zhengyu ; Bhimani, Janki ; Wang, Teng ; Wang, Jiayin ; Sheng, Bo ; Mi, Ningfang. / AutoPath : Harnessing parallel execution paths for efficient resource allocation in multi-stage big data frameworks. 2017 26th International Conference on Computer Communications and Networks, ICCCN 2017. Institute of Electrical and Electronics Engineers Inc., 2017. (2017 26th International Conference on Computer Communications and Networks, ICCCN 2017).
@inproceedings{a61b8c10c9bf4477bcc4e476fac6b73d,
title = "AutoPath: Harnessing parallel execution paths for efficient resource allocation in multi-stage big data frameworks",
abstract = "Due to the flexibility of data operations and scalability of in- memory cache, Spark has revealed the potential to become the standard distributed framework to replace Hadoop for data-intensive processing in both industry and academia. However, we observe that the built-in scheduling algorithms in Spark (i.e., FIFO and FAIR) are not optimized for the applications with multiple parallel and independent branches in stages. Specifically, the child stage needs to wait and collect data from all its parent branches, but this wait has no guaranteed upper bound since it is tightly coupled with each branch's workload characteristic, stage order, and their corresponding allocated computing resource. To address this challenge, we investigate a superior solution which ensures all branches acquire suitable resources according to their workload demand in order to let the finish time of each branch be as close as possible. Based on this, we propose a novel scheduling policy, named AutoPath, which can effectively reduce the overall makespan of such kind of applications by detecting and leveraging the parallel path, and adaptively assigning computing resources based on the estimated workload demands during runtime. We implemented the new scheduling scheme in Spark v1.5.0 and evaluated it with selected representative workloads. The experiments demonstrate that our new scheduler effectively reduces the makespan and improves resource utilizations for these applications, compared to the current FIFO and FAIR schedulers.",
keywords = "Resource management, Scheduling, Spark, Task assignment, Workload evaluation & estimation",
author = "Han Gao and Zhengyu Yang and Janki Bhimani and Teng Wang and Jiayin Wang and Bo Sheng and Ningfang Mi",
year = "2017",
month = "9",
day = "14",
doi = "10.1109/ICCCN.2017.8038381",
language = "English",
series = "2017 26th International Conference on Computer Communications and Networks, ICCCN 2017",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
booktitle = "2017 26th International Conference on Computer Communications and Networks, ICCCN 2017",

}

Gao, H, Yang, Z, Bhimani, J, Wang, T, Wang, J, Sheng, B & Mi, N 2017, AutoPath: Harnessing parallel execution paths for efficient resource allocation in multi-stage big data frameworks. in 2017 26th International Conference on Computer Communications and Networks, ICCCN 2017., 8038381, 2017 26th International Conference on Computer Communications and Networks, ICCCN 2017, Institute of Electrical and Electronics Engineers Inc., 26th International Conference on Computer Communications and Networks, ICCCN 2017, Vancouver, Canada, 31/07/17. https://doi.org/10.1109/ICCCN.2017.8038381

AutoPath : Harnessing parallel execution paths for efficient resource allocation in multi-stage big data frameworks. / Gao, Han; Yang, Zhengyu; Bhimani, Janki; Wang, Teng; Wang, Jiayin; Sheng, Bo; Mi, Ningfang.

2017 26th International Conference on Computer Communications and Networks, ICCCN 2017. Institute of Electrical and Electronics Engineers Inc., 2017. 8038381 (2017 26th International Conference on Computer Communications and Networks, ICCCN 2017).

Research output: Chapter in Book/Report/Conference proceedingConference contributionResearchpeer-review

TY - GEN

T1 - AutoPath

T2 - Harnessing parallel execution paths for efficient resource allocation in multi-stage big data frameworks

AU - Gao, Han

AU - Yang, Zhengyu

AU - Bhimani, Janki

AU - Wang, Teng

AU - Wang, Jiayin

AU - Sheng, Bo

AU - Mi, Ningfang

PY - 2017/9/14

Y1 - 2017/9/14

N2 - Due to the flexibility of data operations and scalability of in- memory cache, Spark has revealed the potential to become the standard distributed framework to replace Hadoop for data-intensive processing in both industry and academia. However, we observe that the built-in scheduling algorithms in Spark (i.e., FIFO and FAIR) are not optimized for the applications with multiple parallel and independent branches in stages. Specifically, the child stage needs to wait and collect data from all its parent branches, but this wait has no guaranteed upper bound since it is tightly coupled with each branch's workload characteristic, stage order, and their corresponding allocated computing resource. To address this challenge, we investigate a superior solution which ensures all branches acquire suitable resources according to their workload demand in order to let the finish time of each branch be as close as possible. Based on this, we propose a novel scheduling policy, named AutoPath, which can effectively reduce the overall makespan of such kind of applications by detecting and leveraging the parallel path, and adaptively assigning computing resources based on the estimated workload demands during runtime. We implemented the new scheduling scheme in Spark v1.5.0 and evaluated it with selected representative workloads. The experiments demonstrate that our new scheduler effectively reduces the makespan and improves resource utilizations for these applications, compared to the current FIFO and FAIR schedulers.

AB - Due to the flexibility of data operations and scalability of in- memory cache, Spark has revealed the potential to become the standard distributed framework to replace Hadoop for data-intensive processing in both industry and academia. However, we observe that the built-in scheduling algorithms in Spark (i.e., FIFO and FAIR) are not optimized for the applications with multiple parallel and independent branches in stages. Specifically, the child stage needs to wait and collect data from all its parent branches, but this wait has no guaranteed upper bound since it is tightly coupled with each branch's workload characteristic, stage order, and their corresponding allocated computing resource. To address this challenge, we investigate a superior solution which ensures all branches acquire suitable resources according to their workload demand in order to let the finish time of each branch be as close as possible. Based on this, we propose a novel scheduling policy, named AutoPath, which can effectively reduce the overall makespan of such kind of applications by detecting and leveraging the parallel path, and adaptively assigning computing resources based on the estimated workload demands during runtime. We implemented the new scheduling scheme in Spark v1.5.0 and evaluated it with selected representative workloads. The experiments demonstrate that our new scheduler effectively reduces the makespan and improves resource utilizations for these applications, compared to the current FIFO and FAIR schedulers.

KW - Resource management

KW - Scheduling

KW - Spark

KW - Task assignment

KW - Workload evaluation & estimation

UR - http://www.scopus.com/inward/record.url?scp=85024839187&partnerID=8YFLogxK

U2 - 10.1109/ICCCN.2017.8038381

DO - 10.1109/ICCCN.2017.8038381

M3 - Conference contribution

T3 - 2017 26th International Conference on Computer Communications and Networks, ICCCN 2017

BT - 2017 26th International Conference on Computer Communications and Networks, ICCCN 2017

PB - Institute of Electrical and Electronics Engineers Inc.

ER -

Gao H, Yang Z, Bhimani J, Wang T, Wang J, Sheng B et al. AutoPath: Harnessing parallel execution paths for efficient resource allocation in multi-stage big data frameworks. In 2017 26th International Conference on Computer Communications and Networks, ICCCN 2017. Institute of Electrical and Electronics Engineers Inc. 2017. 8038381. (2017 26th International Conference on Computer Communications and Networks, ICCCN 2017). https://doi.org/10.1109/ICCCN.2017.8038381