TY - GEN
T1 - AutoPath
T2 - 26th International Conference on Computer Communications and Networks, ICCCN 2017
AU - Gao, Han
AU - Yang, Zhengyu
AU - Bhimani, Janki
AU - Wang, Teng
AU - Wang, Jiayin
AU - Sheng, Bo
AU - Mi, Ningfang
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/9/14
Y1 - 2017/9/14
N2 - Due to the flexibility of data operations and scalability of in-memory cache, Spark has revealed the potential to become the standard distributed framework to replace Hadoop for data-intensive processing in both industry and academia. However, we observe that the built-in scheduling algorithms in Spark (i.e., FIFO and FAIR) are not optimized for the applications with multiple parallel and independent branches in stages. Specifically, the child stage needs to wait and collect data from all its parent branches, but this wait has no guaranteed upper bound since it is tightly coupled with each branch's workload characteristic, stage order, and their corresponding allocated computing resource. To address this challenge, we investigate a superior solution which ensures all branches acquire suitable resources according to their workload demand in order to let the finish time of each branch be as close as possible. Based on this, we propose a novel scheduling policy, named AutoPath, which can effectively reduce the overall makespan of such kind of applications by detecting and leveraging the parallel path, and adaptively assigning computing resources based on the estimated workload demands during runtime. We implemented the new scheduling scheme in Spark v1.5.0 and evaluated it with selected representative workloads. The experiments demonstrate that our new scheduler effectively reduces the makespan and improves resource utilizations for these applications, compared to the current FIFO and FAIR schedulers.
AB - Due to the flexibility of data operations and scalability of in-memory cache, Spark has revealed the potential to become the standard distributed framework to replace Hadoop for data-intensive processing in both industry and academia. However, we observe that the built-in scheduling algorithms in Spark (i.e., FIFO and FAIR) are not optimized for the applications with multiple parallel and independent branches in stages. Specifically, the child stage needs to wait and collect data from all its parent branches, but this wait has no guaranteed upper bound since it is tightly coupled with each branch's workload characteristic, stage order, and their corresponding allocated computing resource. To address this challenge, we investigate a superior solution which ensures all branches acquire suitable resources according to their workload demand in order to let the finish time of each branch be as close as possible. Based on this, we propose a novel scheduling policy, named AutoPath, which can effectively reduce the overall makespan of such kind of applications by detecting and leveraging the parallel path, and adaptively assigning computing resources based on the estimated workload demands during runtime. We implemented the new scheduling scheme in Spark v1.5.0 and evaluated it with selected representative workloads. The experiments demonstrate that our new scheduler effectively reduces the makespan and improves resource utilizations for these applications, compared to the current FIFO and FAIR schedulers.
KW - Resource management
KW - Scheduling
KW - Spark
KW - Task assignment
KW - Workload evaluation & estimation
UR - http://www.scopus.com/inward/record.url?scp=85024839187&partnerID=8YFLogxK
U2 - 10.1109/ICCCN.2017.8038381
DO - 10.1109/ICCCN.2017.8038381
M3 - Conference contribution
AN - SCOPUS:85024839187
T3 - 2017 26th International Conference on Computer Communications and Networks, ICCCN 2017
BT - 2017 26th International Conference on Computer Communications and Networks, ICCCN 2017
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 31 July 2017 through 3 August 2017
ER -