CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance

  • Regular Paper
Journal of Computer Science and Technology

Abstract

This paper presents CirroData, a high-performance SQL-on-Hadoop system designed for Big Data analytics workloads. CirroData is a home-grown, enterprise-level online analytical processing (OLAP) system backed by more than seven years of research and development (R&D) experience, and we share with the community the design details that allow it to achieve high performance. The paper discusses multiple optimization techniques, whose effectiveness and efficiency have been demonstrated in our customers’ daily usage. We present benchmark-level studies as well as several real application case studies of CirroData. Our evaluations show that CirroData can outperform various types of counterpart database systems, such as “Spark+Hive”, “Spark+HBase”, Impala, DB-X/Y, Greenplum, HAWQ, and others. On the standard TPC-H queries, CirroData achieves up to 4.99x speedup compared with Greenplum, HAWQ, and Spark. Application-level evaluations demonstrate that CirroData outperforms “Spark+Hive” and “Spark+HBase” by up to 8.4x and 38.8x, respectively. For some application workloads, CirroData also achieves speedups of up to 20x, 100x, 182.5x, 92.6x, and 55.5x compared with Greenplum, DB-X, Impala, DB-Y, and HAWQ, respectively.

Corresponding author

Correspondence to Zheng-Hao Jin.

Cite this article

Jin, ZH., Shi, H., Hu, YX. et al. CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance. J. Comput. Sci. Technol. 35, 194–208 (2020). https://doi.org/10.1007/s11390-020-9536-z

  • DOI: https://doi.org/10.1007/s11390-020-9536-z
