Abstract
Virtually all of today’s Big Data systems are passive in nature, responding to queries posted by their users. Instead, we are working to shift Big Data platforms from passive to active. In our view, a Big Active Data (BAD) system should continuously and reliably capture Big Data while enabling timely and automatic delivery of relevant information to a large pool of interested users, as well as supporting retrospective analyses of historical information. While various scalable streaming query engines have been created, their active behavior is limited to a (relatively) small window of the incoming data. To this end, we have created a BAD platform that combines ideas and capabilities from both Big Data and Active Data (e.g., publish/subscribe, streaming engines). It supports complex subscriptions that consider not only newly arrived items but also their relationships to past, stored data. Further, it can provide actionable notifications by enriching the subscription results with other useful data. Our platform extends an existing open-source Big Data Management System, Apache AsterixDB, with an active toolkit. The toolkit contains features to rapidly ingest semistructured data, share execution pipelines among users, manage user data subscriptions at scale, and actively monitor the state of the data to produce individualized information for each user. This paper describes the features and design of our current BAD data platform and demonstrates its ability to scale without sacrificing query capabilities or result individualization.
Notes
A distributed variant of Postgres, from Greenplum, provided database triggers in an earlier version. However, triggers have been removed from the current version because of their unreliable behavior in a distributed setting [62].
References
Abadi, D.J., Ahmad, Y., Balazinska, M., Çetintemel, U., Cherniack, M., Hwang, J., Lindner, W., Maskey, A., Rasin, A., Ryvkina, E., Tatbul, N., Xing, Y., Zdonik, S.B. The design of the borealis stream processing engine. In: CIDR 2005, Second Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4–7, 2005, Online Proceedings, pp. 277–289. www.cidrdb.org (2005)
Abadi, D.J., Carney, D., Çetintemel, U., Cherniack, M., Convey, C., Lee, S., Stonebraker, M., Tatbul, N., Zdonik, S.B.: Aurora: a new model and architecture for data stream management. VLDB J. 12(2), 120–139 (2003)
Agrawal, P., Silberstein, A., Cooper, B.F., Srivastava, U., Ramakrishnan, R. Asynchronous view maintenance for VLSD databases. In: Çetintemel, U., Zdonik, S.B., Kossmann, D., Tatbul, N. (eds.) Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29–July 2, 2009, pp. 179–192. ACM (2009)
Alexandrov, A., Bergmann, R., Ewen, S., Freytag, J., Hueske, F., Heise, A., Kao, O., Leich, M., Leser, U., Markl, V., Naumann, F., Peters, M., Rheinländer, A., Sax, M.J., Schelter, S., Höger, M., Tzoumas, K., Warneke, D.: The stratosphere platform for big data analytics. VLDB J. 23(6), 939–964 (2014)
Alkowaileet, W.Y., Alsubaiee, S., Carey, M.J., Li, C., Ramampiaro, H., Sinthong, P., Wang, X. End-to-end machine learning with Apache AsterixDB. In: Schelter, S., Seufert, S., Kumar, A. (eds.) Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning, DEEM@SIGMOD 2018, Houston, TX, USA, June 15, 2018, pp. 6:1–6:10. ACM (2018)
Alsubaiee, S., Altowim, Y., Altwaijry, H., Behm, A., Borkar, V.R., Bu, Y., Carey, M.J., Cetindil, I., Cheelangi, M., Faraaz, K., Gabrielova, E., Grover, R., Heilbron, Z., Kim, Y., Li, C., Li, G., Ok, J.M., Onose, N., Pirzadeh, P., Tsotras, V.J., Vernica, R., Wen, J., Westmann, T.: AsterixDB: a scalable, open source BDMS. PVLDB 7(14), 1905–1916 (2014)
Alsubaiee, S., Behm, A., Borkar, V.R., Heilbron, Z., Kim, Y., Carey, M.J., Dreseler, M., Li, C.: Storage management in AsterixDB. PVLDB 7(10), 841–852 (2014)
Amazon SNS. https://aws.amazon.com/sns/
Apache AsterixDB. https://asterixdb.apache.org
Apache Flink. https://flink.apache.org
Apache Hadoop. http://hadoop.apache.org
Apache HBase. http://hbase.apache.org/
Apache Spark. http://spark.apache.org
Arasu, A., Babcock, B., Babu, S., Datar, M., Ito, K., Motwani, R., Nishizawa, I., Srivastava, U., Thomas, D., Varma, R., Widom, J.: STREAM: the stanford stream data manager. IEEE Data Eng. Bull. 26(1), 19–26 (2003)
Armbrust, M., Das, T., Torres, J., Yavuz, B., Zhu, S., Xin, R., Ghodsi, A., Stoica, I., Zaharia, M. Structured Streaming: a declarative API for real-time applications in Apache Spark. In: Das, G., Jermaine, C.M., Bernstein, P.A. (eds.) Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10–15, 2018, pp. 601–613. ACM (2018)
Atikoglu, B., Xu, Y., Frachtenberg, E., Jiang, S., Paleczny, M. Workload analysis of a large-scale key-value store. In: Harrison, P.G., Arlitt, M.F., Casale, G. (eds.) ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’12, London, United Kingdom, June 11–15, 2012, pp. 53–64. ACM (2012)
Babu, S., Widom, J.: Continuous queries over data streams. SIGMOD Rec. 30(3), 109–120 (2001)
BAD project. https://github.com/apache/asterixdb-bad
Bainomugisha, E., Carreton, A.L., Cutsem, T.V., Mostinckx, S., Meuter, W.D.: A survey on reactive programming. ACM Comput. Surv. 45(4), 52:1–52:34 (2013)
Bamba, B., Liu, L., Yu, P.S., Zhang, G., Doo, M. Scalable processing of spatial alarms. In: Sadayappan, P., Parashar, M., Badrinath, R., Prasanna, V.K. (eds.) High Performance Computing-HiPC 2008, 15th International Conference, Bangalore, India, December 17-20, 2008. Proceedings, volume 5374 of Lecture Notes in Computer Science, pp. 232–244. Springer (2008)
Borkar, V.R., Bu, Y., Carman Jr., E.P., Onose, N., Westmann, T., Pirzadeh, P., Carey, M.J., Tsotras, V.J. Algebricks: a data model-agnostic compiler backend for big data languages. In: Ghandeharizadeh, S., Barahmand, S., Balazinska, M., Freedman, M.J. (eds.) Proceedings of the Sixth ACM Symposium on Cloud Computing, SoCC 2015, Kohala Coast, Hawaii, USA, August 27–29, 2015, pp. 422–433. ACM (2015)
Borkar, V.R., Carey, M.J., Grover, R., Onose, N., Vernica, R. Hyracks: a flexible and extensible foundation for data-intensive computing. In: Abiteboul, S., Böhm, K., Koch, C., Tan, K. (eds.) Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11–16, 2011, Hannover, Germany, pp. 1151–1162. IEEE Computer Society (2011)
Borkar, D., Mayuram, R., Sangudi, G., Carey, M.J. Have your data and query it too: from key-value caching to big data management. In: Özcan, F., Koutrika, G., Madden, S. (eds.) Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26–July 01, 2016, pp. 239–251. ACM (2016)
Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., Tzoumas, K.: Apache Flink™: Stream and batch processing in a single engine. IEEE Data Eng. Bull. 38(4), 28–38 (2015)
Chandrasekaran, S., Cooper, O., Deshpande, A., Franklin, M.J., Hellerstein, J.M., Hong, W., Krishnamurthy, S., Madden, S., Raman, V., Reiss, F., Shah, M.A. Telegraphcq: continuous dataflow processing for an uncertain world. In: CIDR 2003, First Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 5–8, 2003, Online Proceedings. www.cidrdb.org (2003)
Chen, J., DeWitt, D.J., Tian, F., Wang, Y. Niagaracq: a scalable continuous query system for internet databases. In: Chen, W., Naughton, J.F., Bernstein, P.A. (eds.) Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16–18, 2000, Dallas, Texas, USA, pp. 379–390. ACM (2000)
Chintapalli, S., Dagit, D., Evans, B., Farivar, R., Graves, T., Holderbaugh, M., Liu, Z., Nusbaum, K., Patil, K., Peng, B., Poulosky, P. Benchmarking streaming computation engines: storm, Flink and Spark Streaming. In: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPS Workshops 2016, Chicago, IL, USA, May 23–27, 2016, pp. 1789–1792. IEEE Computer Society (2016)
Chirkova, R., Yang, J.: Materialized views. Found. Trends Databases 4(4), 295–405 (2012)
Couchbase. http://www.couchbase.com/
Dalvi, B.B., Kshirsagar, M., Sudarshan, S.: Keyword search on external memory data graphs. PVLDB 1(1), 1189–1204 (2008)
Dayal, U., Blaustein, B.T., Buchmann, A.P., Chakravarthy, U.S., Hsu, M., Ledin, R., McCarthy, D.R., Rosenthal, A., Sarin, S.K., Carey, M.J., Livny, M., Jauhari, R.: The HiPAC project: combining active databases and timing constraints. SIGMOD Rec. 17(1), 51–70 (1988)
Dean, J., Ghemawat, S. Mapreduce: Simplified data processing on large clusters. In: Brewer, E.A., Chen, P. (eds.) 6th Symposium on Operating System Design and Implementation (OSDI 2004), San Francisco, California, USA, December 6–8, 2004, pp. 137–150. USENIX Association (2004)
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., Vogels, W. Dynamo: amazon’s highly available key-value store. In: Bressoud, T.C., Kaashoek, M.F. (eds.) Proceedings of the 21st ACM Symposium on Operating Systems Principles 2007, SOSP 2007, Stevenson, Washington, USA, October 14–17, 2007, pp. 205–220. ACM (2007)
Desta, M.S., Hyytiä, E., Keränen, A., Kärkkäinen, T., Ott, J. Evaluating (geo) content sharing with the ONE simulator. In: Nikoletseas, S.E., Rumín, Á C. (eds.) MobiWac’13, Proceedings of the 11th ACM International Symposium on Mobility Management and Wireless Access, Barcelona, Spain, November 3–8, 2013, pp. 37–40. ACM (2013)
Dindar, N., Güç, B., Lau, P., Özal, A., Soner, M., Tatbul, N. Dejavu: declarative pattern matching over live and archived streams of events. In: Çetintemel, U., Zdonik, S.B., Kossmann, D., Tatbul, N. (eds.) Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29–July 2, 2009, pp. 1023–1026. ACM (2009)
Escriva, R., Wong, B., Sirer, E.G. Hyperdex: a distributed, searchable key-value store. In: Eggert, L., Ott, J., Padmanabhan, V.N., Varghese, G. (eds.) ACM SIGCOMM 2012 Conference, SIGCOMM ’12, Helsinki, Finland, August 13–17, 2012, pp. 25–36. ACM (2012)
Eugster, P.T., Felber, P., Guerraoui, R., Kermarrec, A.: The many faces of publish/subscribe. ACM Comput. Surv. 35(2), 114–131 (2003)
Gedik, B., Andrade, H., Wu, K., Yu, P.S., Doo, M. SPADE: the systems declarative stream processing engine. In: Wang, J.T. (ed.) Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10–12, 2008, pp. 1123–1134. ACM (2008)
Golab, L., Johnson, T., Shkapenyuk, V. Scheduling updates in a real-time stream warehouse. In: Ioannidis, Y.E., Lee, D.L., Ng, R.T. (eds.) Proceedings of the 25th International Conference on Data Engineering, ICDE 2009, March 29 2009–April 2 2009, Shanghai, China, pp. 1207–1210. IEEE Computer Society (2009)
Goldberg, D., Nichols, D.A., Oki, B.M., Terry, D.B.: Using collaborative filtering to weave an information tapestry. Commun. ACM 35(12), 61–70 (1992)
Grover, R., Carey, M.J. Data ingestion in AsterixDB. In: Alonso, G., Geerts, F., Popa, L., Barceló, P., Teubner, J., Ugarte, M., den Bussche, J.V., Paredaens, J. (eds.) Proceedings of the 18th International Conference on Extending Database Technology, EDBT 2015, Brussels, Belgium, March 23–27, 2015, pp. 605–616. OpenProceedings.org (2015)
Hanson, E.N., Carnes, C., Huang, L., Konyala, M., Noronha, L., Parthasarathy, S., Park, J.B., Vernon, A. Scalable trigger processing. In: Kitsuregawa, M., Papazoglou, M.P., Pu, C. (eds.) Proceedings of the 15th International Conference on Data Engineering, Sydney, Australia, March 23–26, 1999, pp. 266–275. IEEE Computer Society (1999)
Hanson, E.N.: The design and implementation of the ariel active database rule system. IEEE Trans. Knowl. Data Eng. 8(1), 157–172 (1996)
Jafarpour, H., Hore, B., Mehrotra, S., Venkatasubramanian, N. Subscription subsumption evaluation for content-based publish/subscribe systems. In: Issarny, V., Schantz, R.E. (eds.) Middleware 2008, ACM/IFIP/USENIX 9th International Middleware Conference, Leuven, Belgium, December 1–5, 2008, Proceedings, volume 5346 of Lecture Notes in Computer Science, pp. 62–81. Springer (2008)
Jin, Y., Strom, R.E. Relational subscription middleware for internet-scale publish-subscribe. In: Jacobsen, H. (ed.) Proceedings of the 2nd International Workshop on Distributed Event-Based Systems, DEBS: Sunday, June 8th, 2003, p. 2003. ACM, San Diego, California, USA (in conjunction with SIGMOD/PODS) (2003)
Keränen, A., Ott, J., Kärkkäinen, T. The ONE simulator for DTN protocol evaluation. In: Dalle, O., Wainer, G.A., Perrone, L.F., Stea, G. (eds.) Proceedings of the 2nd International Conference on Simulation Tools and Techniques for Communications, Networks and Systems, SimuTools 2009, Rome, Italy, March 2–6, 2009, p. 55. ICST/ACM (2009)
Kiran, M., Murphy, P., Monga, I., Dugan, J., Baveja, S.S. Lambda architecture for cost-effective batch and speed big data processing. In: 2015 IEEE International Conference on Big Data, Big Data 2015, Santa Clara, CA, USA, October 29–November 1, 2015, pp. 2785–2792. IEEE Computer Society (2015)
Krämer, J., Seeger, B. PIPES: a public infrastructure for processing and exploring streams. In: Weikum, G., König, A.C., Deßloch, S. (eds.) Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, June 13–18, 2004, pp. 925–926. ACM (2004)
Kreps, J., Narkhede, N., Rao, J., et al.: Kafka: a distributed messaging system for log processing. Proc. NetDB 11, 1–7 (2011)
Lee, K., Liu, L., Palanisamy, B., Yigitoglu, E.: Road network-aware spatial alarms. IEEE Trans. Mob. Comput. 15(1), 188–201 (2016)
Li, M., Ye, F., Kim, M., Chen, H., Lei, H. A scalable and elastic publish/subscribe service. In: 25th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2011, Anchorage, Alaska, USA, 16–20 May, 2011, Conference Proceedings, pp. 1254–1265. IEEE (2011)
Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: Distributed graphlab: a framework for machine learning in the cloud. PVLDB 5(8), 716–727 (2012)
Luo, C., Carey, M.J.: Efficient data ingestion and query processing for LSM-based storage systems. PVLDB 12(5), 531–543 (2019)
Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G. Pregel: a system for large-scale graph processing. In: Elmagarmid, A.K., Agrawal, D. (eds.) Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6–10, 2010, pp. 135–146. ACM (2010)
Markowetz, A., Yang, Y., Papadias, D. Keyword search on relational data streams. In: Chan, C.Y., Ooi, B.C., Zhou, A. (eds.) Proceedings of the ACM SIGMOD International Conference on Management of Data, Beijing, China, June 12–14, 2007, pp. 605–616. ACM (2007)
Milo, T., Zur, T., Verbin, E. Boosting topic-based publish-subscribe systems with dynamic clustering. In: Chan, C.Y., Ooi, B.C., Zhou, A. (eds.) Proceedings of the ACM SIGMOD International Conference on Management of Data, Beijing, China, June 12–14, 2007, pp. 749–760. ACM (2007)
MongoDB. http://www.mongodb.org/
Moro, M.M., Bakalov, P., Tsotras, V.J. Early profile pruning on xml-aware publish/subscribe systems. In: Koch, C., Gehrke, J., Garofalakis, M.N., Srivastava, D., Aberer, K., Deshpande, A., Florescu, D., Chan, C.Y., Ganti, V., Kanne, C., Klas, W., Neuhold, E.J. (eds.) Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23–27, 2007, pp. 866–877. ACM (2007)
Nikolic, M., Elseidy, M., Koch, C. LINVIEW: incremental view maintenance for complex analytical queries. In: Dyreson, C.E., Li, F., Özsu, M.T. (eds.) International Conference on Management of Data, SIGMOD 2014, Snowbird, UT, USA, June 22–27, 2014, pp. 253–264. ACM (2014)
ONE Simulator. https://akeranen.github.io/the-one/
Pig Website. http://hadoop.apache.org/pig
Pivotal Greenplum. https://gpdb.docs.pivotal.io/4300/pdf/GPDB43RefGuide.pdf
Qader, M.A., Hristidis, V. DualDB: an efficient LSM-based publish/subscribe storage system. In: Proceedings of the 29th International Conference on Scientific and Statistical Database Management, Chicago, IL, USA, June 27–29, 2017, pp. 24:1–24:6. ACM (2017)
Quass, D., Widom, J. On-line warehouse view maintenance. In: Peckham, J. (ed.) SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13–15, 1997, Tucson, Arizona, USA, pp. 393–404. ACM Press (1997)
Saigaonkar, S., Rao, M., Mantha, S. Publish subscribe system based on ontology and XML filtering. In: 2011 3rd International Conference on Computer Research and Development, Vol. 1, pp. 154–158. IEEE (2011)
Stonebraker, M., Rowe, L.A. The design of postgres. In: Zaniolo, C. (ed.) Proceedings of the 1986 ACM SIGMOD International Conference on Management of Data, Washington, DC, USA, May 28–30, 1986, pp. 340–355. ACM Press (1986)
Thakur, G.S., Bhaduri, B.L., Piburn, J.O., Sims, K.M., Stewart, R.N., Urban, M.L. Planetsense: a real-time streaming and spatio-temporal analytics platform for gathering geo-spatial intelligence from open source data. In: Bao, J., Sengstock, C., Ali, M.E., Huang, Y., Gertz, M., Renz, M., Sankaranarayanan, J. (eds.) Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, Bellevue, WA, USA, November 3–6, 2015, pp. 11:1–11:4. ACM (2015)
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive - A warehousing solution over a map-reduce framework. PVLDB 2(2), 1626–1629 (2009)
Uddin, M.Y.S., Venkatasubramanian, N. Edge caching for enriched notifications delivery in big active data. In: 38th IEEE International Conference on Distributed Computing Systems, ICDCS 2018, Vienna, Austria, July 2–6, 2018, pp. 696–705. IEEE Computer Society (2018)
United States Geological Survey, ShakeCast (2014). earthquake.usgs.gov/research/software/shakecast/
Wang, X., Carey, M.J.: An IDEA: an ingestion framework for data enrichment in AsterixDB. PVLDB 12(11), 1485–1498 (2019)
Wang, X., Zhang, W., Zhang, Y., Lin, X., Huang, Z.: Top-k spatial-keyword publish/subscribe over sliding window. VLDB J. 26(3), 301–326 (2017)
Widom, J., Cochrane, R., Lindsay, B.G. Implementing set-oriented production rules as an extension to starburst. In: Lohman, G.M., Sernadas, A., Camps, R. (eds.) 17th International Conference on Very Large Data Bases, September 3–6, 1991, Barcelona, Catalonia, Spain, Proceedings, pp. 275–285. Morgan Kaufmann (1991)
Yan, D., Bu, Y., Tian, Y., Deshpande, A., Cheng, J. Big graph analytics systems. In: Özcan, F., Koutrika, G., Madden, S. (eds.) Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26–July 01, 2016, pp. 2241–2243. ACM (2016)
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Gribble, S.D., Katabi, D. (eds.) Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2012, San Jose, CA, USA, April 25–27, 2012, pp. 15–28. USENIX Association (2012)
Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M.J., Ghodsi, A., Gonzalez, J., Shenker, S., Stoica, I.: Apache Spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016)
Zhao, Y., Kim, K., Venkatasubramanian, N. DYNATOPS: a dynamic topic-based publish/subscribe architecture. In: Chakravarthy, S., Urban, S.D., Pietzuch, P.R., Rundensteiner, E.A. (eds.) The 7th ACM International Conference on Distributed Event-Based Systems, DEBS ’13, Arlington, TX, USA, June 29–July 03, 2013, pp. 75–86. ACM (2013)
Acknowledgements
This research was partially supported by NSF Grants IIS-1447826, IIS-1447720, IIS-1838222, IIS-1838248, CNS-1924694, and CNS-1925610.
Cite this article
Jacobs, S., Wang, X., Carey, M.J. et al. BAD to the bone: Big Active Data at its core. The VLDB Journal 29, 1337–1364 (2020). https://doi.org/10.1007/s00778-020-00616-7
DOI: https://doi.org/10.1007/s00778-020-00616-7