CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance

Jin, Zheng-Hao; Shi, Haiyang; Hu, Ying-Xin; Zha, Li; Lu, Xiaoyi

doi:10.1007/s11390-020-9536-z

Regular Paper
Published: 17 January 2020

Journal of Computer Science and Technology, Volume 35, pages 194–208 (2020)

Zheng-Hao Jin¹,
Haiyang Shi²,
Ying-Xin Hu¹,
Li Zha³,⁴ &
Xiaoyi Lu²

Abstract

This paper presents CirroData, a high-performance SQL-on-Hadoop system designed for Big Data analytics workloads. CirroData is a home-grown, enterprise-level online analytical processing (OLAP) system backed by more than seven years of research and development (R&D) experience, and we share with the community the design details that allow it to achieve high performance. The paper discusses multiple optimization techniques, whose effectiveness and efficiency have been confirmed by our customers’ daily usage. It also presents benchmark-level studies and several real application case studies of CirroData. Our evaluations show that CirroData can outperform various counterpart database systems in the community, such as “Spark+Hive”, “Spark+HBase”, Impala, DB-X/Y, Greenplum, HAWQ, and others. CirroData achieves up to 4.99x speedup over Greenplum, HAWQ, and Spark on the standard TPC-H queries. Application-level evaluations demonstrate that CirroData outperforms “Spark+Hive” and “Spark+HBase” by up to 8.4x and 38.8x, respectively. In addition, for some application workloads CirroData achieves speedups of up to 20x, 100x, 182.5x, 92.6x, and 55.5x over Greenplum, DB-X, Impala, DB-Y, and HAWQ, respectively.


Author information

Authors and Affiliations

Business-Intelligence of Oriental Nations Corporation Ltd., Beijing, 100102, China
Zheng-Hao Jin & Ying-Xin Hu
Department of Computer Science and Engineering, The Ohio State University, Ohio, 43210, U.S.A.
Haiyang Shi & Xiaoyi Lu
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Li Zha
University of Chinese Academy of Sciences, Beijing, 101408, China
Li Zha

Corresponding author

Correspondence to Zheng-Hao Jin.

Electronic supplementary material

ESM 1 (PDF 550 kb)

About this article

Cite this article

Jin, ZH., Shi, H., Hu, YX. et al. CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance. J. Comput. Sci. Technol. 35, 194–208 (2020). https://doi.org/10.1007/s11390-020-9536-z

Received: 15 July 2019
Revised: 14 October 2019
Published: 17 January 2020
Issue Date: January 2020
DOI: https://doi.org/10.1007/s11390-020-9536-z
