Abstract
In-memory caching of intermediate data and active combining of data in shuffle buffers have been shown to be very effective in minimizing the recomputation and I/O cost in big data processing systems such as Spark and Flink. However, it has also been widely reported that these techniques create a large number of long-living data objects in the heap. These objects may quickly saturate the garbage collector, especially when handling a large dataset, and hence limit the scalability of the system. To eliminate this problem, we propose a lifetime-based memory management framework which, by automatically analyzing the user-defined functions and data types, obtains the expected lifetime of the data objects and then allocates and releases memory space accordingly to minimize the garbage collection overhead. In particular, we present Deca, a concrete implementation of our proposal on top of Spark, which transparently decomposes objects, groups those with similar lifetimes into byte arrays, and releases their space all at once when their lifetimes come to an end. When systems are processing very large data, Deca also provides field-oriented memory pages to ensure high compression efficiency. Extensive experimental studies using both synthetic and real datasets show that, compared with Spark, Deca is able to (1) reduce the garbage collection time by up to 99.9%, (2) reduce the memory consumption by up to 46.6% and the storage space by 23.4%, (3) achieve 1.2× to 22.7× speedup in execution time in cases without data spilling and 16× to 41.6× speedup in cases with data spilling, and (4) deliver performance comparable to that of domain-specific systems.
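The idea of decomposing objects into byte arrays and releasing them together can be illustrated with a minimal sketch. This is not Deca's implementation: the class `LifetimePage`, its methods, and the fixed two-field record layout are all hypothetical names and assumptions chosen for illustration. The sketch stores the raw fields of many records in one backing array, so the garbage collector tracks a single object instead of one object per record, and dropping the page frees every record at once.

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical illustration (not Deca's API): a page that keeps the raw
 * fields of many (x, y) records in one byte array instead of one heap
 * object per record.
 */
final class LifetimePage {
    // Assumed record layout: two double fields, stored back to back.
    private static final int RECORD_BYTES = 2 * Double.BYTES;

    private ByteBuffer buffer; // single backing array for all records
    private int count;         // number of records appended so far

    LifetimePage(int capacityRecords) {
        this.buffer = ByteBuffer.allocate(capacityRecords * RECORD_BYTES);
    }

    /** Appends a record's fields; returns its index, or -1 if the page is full. */
    int append(double x, double y) {
        if (buffer == null) {
            throw new IllegalStateException("page already released");
        }
        if (buffer.remaining() < RECORD_BYTES) {
            return -1;
        }
        buffer.putDouble(x).putDouble(y);
        return count++;
    }

    /** Reads field x of record i directly from the byte array. */
    double x(int i) {
        return buffer.getDouble(checkIndex(i) * RECORD_BYTES);
    }

    /** Reads field y of record i directly from the byte array. */
    double y(int i) {
        return buffer.getDouble(checkIndex(i) * RECORD_BYTES + Double.BYTES);
    }

    int size() {
        return count;
    }

    /** Ends the shared lifetime: all records become unreachable together. */
    void release() {
        buffer = null;
        count = 0;
    }

    private int checkIndex(int i) {
        if (buffer == null) {
            throw new IllegalStateException("page already released");
        }
        if (i < 0 || i >= count) {
            throw new IndexOutOfBoundsException("record " + i);
        }
        return i;
    }
}
```

In this sketch a cached partition would own one or more such pages and call `release()` when the cached data is no longer needed, which is the point at which the records' common lifetime ends.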
Index Terms
- Deca: A Garbage Collection Optimizer for In-Memory Data Processing