Abstract
With the rise of big data, more and more users will launch computing systems to process a large volume of data in various applications. A Scheduling algorithm is crucial to the performance of the processing platforms, especially when they are concurrently executing a batch of jobs. Such jobs usually represent multiple stages. Each stage produces the intermediate data which will be piped to the next stage for further processing. However, the scheduling problem in a big data computing system is different from the traditional multi-stage job scheduling problem as for any two consecutive stages, the later stage usually starts before the former stage is finished to 'shuffle' the intermediate data. In this paper, we consider MapReduce/Hadoop as a representative computing system and develop a new strategy named OMO, Optimize MapReduce Overlap with a Good Start (Reduce) and a Good Finish (Map). A MapReduce job contains two consecutive phases: map and reduce. Our general target is to optimize the internal overlap between these two phases. There are two new techniques included in our solution, Lazy start of reduce tasks and Batch finish of map tasks, which aim to approach an effective alignment of the two phases based on the characteristics of the MapReduce process. OMO has been implemented on the Hadoop system with extensive experiments for performance evaluation. The results show that OMO's performance is superior in terms of total completion time (i.e., makespan) of a batch of jobs.
Original language | English |
---|---|
Article number | 9456865 |
Pages (from-to) | 88805-88819 |
Number of pages | 15 |
Journal | IEEE Access |
Volume | 9 |
DOIs | |
State | Published - 2021 |
Keywords
- Hadoop scheduling
- MapReduce jobs
- Reduced makespan
- Resource management