Abstract
In this study, we investigated the problem of scheduling streaming applications on a heterogeneous cluster and, building on our previous work, developed the maximum throughput scheduler algorithm (MT-Scheduler) for streaming applications. The algorithm uses dynamic programming to map the application topology onto the heterogeneous distributed system according to the application's computing and data transfer requirements and the capacity of the underlying cluster resources. It maximizes system throughput by identifying the computing or transfer bottleneck and minimizing the time incurred there. The MT-Scheduler supports applications structured as directed acyclic graphs. We conducted experiments using three Storm microbenchmark topologies in both simulated and real Apache Storm environments. For the performance evaluation, we compared the MT-Scheduler with a simulated round-robin scheduler and the default Storm scheduler. The results indicated that the MT-Scheduler outperforms the default round-robin approach in both average system latency and throughput.
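To make the bottleneck idea concrete, the following is a minimal sketch and not the MT-Scheduler itself. It assumes a much simpler setting than the abstract describes: a linear pipeline of operators (not a general directed acyclic graph) is split into contiguous groups that are placed on the first nodes of a linear chain, a node's time is the total work of its group divided by its speed, and a link's time is the data crossing it divided by its bandwidth. The function name `min_bottleneck` and the parameters `work`, `data`, `speed`, and `bandwidth` are hypothetical names introduced only for this illustration.

```python
def min_bottleneck(work, data, speed, bandwidth):
    """Dynamic-programming sketch under the assumptions stated above.

    work[i]      -- compute demand of operator i (hypothetical units)
    data[i]      -- data sent from operator i to operator i + 1
    speed[j]     -- processing speed of node j
    bandwidth[j] -- bandwidth of the link between node j and node j + 1
    Returns (bottleneck_time, cut_positions).
    """
    n, m = len(work), len(speed)
    prefix = [0.0]
    for w in work:
        prefix.append(prefix[-1] + w)

    inf = float("inf")
    # best[i][j]: smallest bottleneck time when the first i operators
    # occupy exactly the first j nodes, each node holding at least one.
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    cut = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][1] = prefix[i] / speed[0]

    for j in range(2, m + 1):
        for i in range(j, n + 1):
            for k in range(j - 1, i):  # operators k..i-1 go on node j-1
                compute = (prefix[i] - prefix[k]) / speed[j - 1]
                transfer = data[k - 1] / bandwidth[j - 2]
                candidate = max(best[k][j - 1], transfer, compute)
                if candidate < best[i][j]:
                    best[i][j] = candidate
                    cut[i][j] = k

    # Using fewer nodes is allowed; pick the node count with the best value.
    used = min(range(1, m + 1), key=lambda j: best[n][j])
    bounds, i = [], n
    for j in range(used, 1, -1):  # walk back through the recorded cuts
        i = cut[i][j]
        bounds.append(i)
    return best[n][used], sorted(bounds)


bottleneck, cuts = min_bottleneck([4, 2, 6], [1, 3], [2, 3], [1])
throughput = 1.0 / bottleneck  # one item leaves the pipeline per bottleneck period
```

In this toy instance the best split places the first operator on the first node and the other two on the second, so the slowest stage takes 8/3 time units and the sustained throughput is 0.375 items per time unit. Lowering the time of any other stage would not change the throughput, which is why the search targets the bottleneck alone.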
Cite this article
Al-Sinayyid, A., Zhu, M. Job scheduler for streaming applications in heterogeneous distributed processing systems. J Supercomput 76, 9609–9628 (2020). https://doi.org/10.1007/s11227-020-03223-z
DOI: https://doi.org/10.1007/s11227-020-03223-z