Author: Suyash Mishra
Date of Publication: 7th September 2017
Abstract: The volume of data used in today's enterprises is growing at an exponential rate. This growth has created a need to process and analyze large volumes of data quickly to support business decision making. MapReduce is considered the core processing engine of Hadoop and is widely used to meet the continuously increasing demands that massive data sets place on computing resources. Its highly scalable design allows parallel and distributed processing across multiple computing nodes. This paper surveys various scheduling methodologies and identifies which are most appropriate for improving MapReduce processing. It also identifies the scaling and processing limitations of each scheduling method, along with the situations to which each is best suited. MapReduce is used mainly for short jobs, which require low response times. The current Hadoop implementation assumes that the underlying computing nodes in a cluster are homogeneous, with the same processing capability and memory. As a result, Hadoop's scheduler suffers severe performance degradation in heterogeneous environments. In such environments, Longest Approximate Time to End (LATE) scheduling can be the most efficient of the scheduling methods compared. Various studies have shown that LATE improves Hadoop response times by approximately a factor of two in clusters.
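To make the idea behind LATE concrete, the following is a minimal Python sketch of its core heuristic as it is commonly described: estimate each running task's remaining time from its progress rate, then choose the task expected to finish farthest in the future as the candidate for speculative re-execution. The names `RunningTask`, `estimated_time_left`, and `pick_speculation_candidate` are hypothetical and are not part of Hadoop. The sketch also leaves out the additional thresholds and caps that a real scheduler would apply.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunningTask:
    """Hypothetical record of a running task (not a Hadoop class)."""
    task_id: str
    progress: float   # progress score in [0, 1]
    elapsed_s: float  # seconds since the task started


def estimated_time_left(task: RunningTask) -> float:
    """Estimate remaining seconds as (1 - progress) / progress_rate."""
    if task.progress <= 0 or task.elapsed_s <= 0:
        # No measurable progress yet: treat the task as the slowest.
        return float("inf")
    progress_rate = task.progress / task.elapsed_s
    return (1.0 - task.progress) / progress_rate


def pick_speculation_candidate(tasks: List[RunningTask]) -> Optional[RunningTask]:
    """Return the unfinished task with the longest estimated time to end."""
    unfinished = [t for t in tasks if t.progress < 1.0]
    if not unfinished:
        return None
    return max(unfinished, key=estimated_time_left)


if __name__ == "__main__":
    tasks = [
        RunningTask("A", progress=0.9, elapsed_s=90.0),
        RunningTask("B", progress=0.3, elapsed_s=90.0),
        RunningTask("C", progress=0.5, elapsed_s=20.0),
    ]
    candidate = pick_speculation_candidate(tasks)
    print(candidate.task_id if candidate else "none")
```

In this example task B has made the least progress for its elapsed time, so it has the longest estimated time to end and would be the one considered for a speculative copy. Ranking by estimated remaining time, not by raw progress, is what lets the heuristic cope with nodes of different speeds.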