New Scheduling Algorithms for Improving Performance and Resource Utilization in Hadoop YARN Clusters

Yi Yao, Han Gao, Jiayin Wang, Bo Sheng, Ningfang Mi

Research output: Contribution to journalArticle

1 Citation (Scopus)

Abstract

The MapReduce framework has become the defacto scheme for scalable semi-structured and un-structured data processing in recent years. The Hadoop ecosystem has evolved into its second generation, Hadoop YARN, which adopts fine-grained resource management schemes for job scheduling. Nowadays, fairness and efficiency are two main concerns in YARN resource management because resources in YARN are shared and contended by multiple applications. However, the current scheduling in YARN does not yield the optimal resource arrangement, unnecessarily causing idle resources and inefficient scheduling. It omits the dependency between tasks which is extremely crucial for the efficiency of resource utilization as well as heterogeneous job features in real application environments. We thus propose a new YARN scheduler which can effectively reduce the makespan (i.e., the total execution time) of a batch of MapReduce jobs in Hadoop YARN clusters by leveraging the information of requested resources, resource capacities and dependency between tasks. For accommodating heterogeneity in MapReduce jobs, we also extend our scheduler by further considering the job iteration information in the scheduling decisions. We implemented the new scheduling algorithm as a pluggable scheduler in YARN and evaluated it with a set of classic MapReduce benchmarks. The experimental results demonstrate that our YARN scheduler effectively reduces the makespans and improves resource utilizations.

Original languageEnglish
JournalIEEE Transactions on Cloud Computing
DOIs
StateAccepted/In press - 1 Jan 2019

Fingerprint

Scheduling algorithms
Scheduling
Ecosystems

Cite this

@article{09539ac7ddf94919a7716c4064a6f051,
title = "New Scheduling Algorithms for Improving Performance and Resource Utilization in Hadoop YARN Clusters",
abstract = "The MapReduce framework has become the defacto scheme for scalable semi-structured and un-structured data processing in recent years. The Hadoop ecosystem has evolved into its second generation, Hadoop YARN, which adopts fine-grained resource management schemes for job scheduling. Nowadays, fairness and efficiency are two main concerns in YARN resource management because resources in YARN are shared and contended by multiple applications. However, the current scheduling in YARN does not yield the optimal resource arrangement, unnecessarily causing idle resources and inefficient scheduling. It omits the dependency between tasks which is extremely crucial for the efficiency of resource utilization as well as heterogeneous job features in real application environments. We thus propose a new YARN scheduler which can effectively reduce the makespan (i.e., the total execution time) of a batch of MapReduce jobs in Hadoop YARN clusters by leveraging the information of requested resources, resource capacities and dependency between tasks. For accommodating heterogeneity in MapReduce jobs, we also extend our scheduler by further considering the job iteration information in the scheduling decisions. We implemented the new scheduling algorithm as a pluggable scheduler in YARN and evaluated it with a set of classic MapReduce benchmarks. The experimental results demonstrate that our YARN scheduler effectively reduces the makespans and improves resource utilizations.",
author = "Yi Yao and Han Gao and Jiayin Wang and Bo Sheng and Ningfang Mi",
year = "2019",
month = "1",
day = "1",
doi = "10.1109/TCC.2019.2894779",
language = "English",
journal = "IEEE Transactions on Cloud Computing",
issn = "2168-7161",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

New Scheduling Algorithms for Improving Performance and Resource Utilization in Hadoop YARN Clusters. / Yao, Yi; Gao, Han; Wang, Jiayin; Sheng, Bo; Mi, Ningfang.

In: IEEE Transactions on Cloud Computing, 01.01.2019.

Research output: Contribution to journalArticle

TY - JOUR

T1 - New Scheduling Algorithms for Improving Performance and Resource Utilization in Hadoop YARN Clusters

AU - Yao, Yi

AU - Gao, Han

AU - Wang, Jiayin

AU - Sheng, Bo

AU - Mi, Ningfang

PY - 2019/1/1

Y1 - 2019/1/1

N2 - The MapReduce framework has become the defacto scheme for scalable semi-structured and un-structured data processing in recent years. The Hadoop ecosystem has evolved into its second generation, Hadoop YARN, which adopts fine-grained resource management schemes for job scheduling. Nowadays, fairness and efficiency are two main concerns in YARN resource management because resources in YARN are shared and contended by multiple applications. However, the current scheduling in YARN does not yield the optimal resource arrangement, unnecessarily causing idle resources and inefficient scheduling. It omits the dependency between tasks which is extremely crucial for the efficiency of resource utilization as well as heterogeneous job features in real application environments. We thus propose a new YARN scheduler which can effectively reduce the makespan (i.e., the total execution time) of a batch of MapReduce jobs in Hadoop YARN clusters by leveraging the information of requested resources, resource capacities and dependency between tasks. For accommodating heterogeneity in MapReduce jobs, we also extend our scheduler by further considering the job iteration information in the scheduling decisions. We implemented the new scheduling algorithm as a pluggable scheduler in YARN and evaluated it with a set of classic MapReduce benchmarks. The experimental results demonstrate that our YARN scheduler effectively reduces the makespans and improves resource utilizations.

AB - The MapReduce framework has become the defacto scheme for scalable semi-structured and un-structured data processing in recent years. The Hadoop ecosystem has evolved into its second generation, Hadoop YARN, which adopts fine-grained resource management schemes for job scheduling. Nowadays, fairness and efficiency are two main concerns in YARN resource management because resources in YARN are shared and contended by multiple applications. However, the current scheduling in YARN does not yield the optimal resource arrangement, unnecessarily causing idle resources and inefficient scheduling. It omits the dependency between tasks which is extremely crucial for the efficiency of resource utilization as well as heterogeneous job features in real application environments. We thus propose a new YARN scheduler which can effectively reduce the makespan (i.e., the total execution time) of a batch of MapReduce jobs in Hadoop YARN clusters by leveraging the information of requested resources, resource capacities and dependency between tasks. For accommodating heterogeneity in MapReduce jobs, we also extend our scheduler by further considering the job iteration information in the scheduling decisions. We implemented the new scheduling algorithm as a pluggable scheduler in YARN and evaluated it with a set of classic MapReduce benchmarks. The experimental results demonstrate that our YARN scheduler effectively reduces the makespans and improves resource utilizations.

UR - http://www.scopus.com/inward/record.url?scp=85060500575&partnerID=8YFLogxK

U2 - 10.1109/TCC.2019.2894779

DO - 10.1109/TCC.2019.2894779

M3 - Article

AN - SCOPUS:85060500575

JO - IEEE Transactions on Cloud Computing

JF - IEEE Transactions on Cloud Computing

SN - 2168-7161

ER -