TextBenDS: a Generic Textual Data Benchmark for Distributed Systems

Information Systems Frontiers

Abstract

Extracting top-k keywords and documents using weighting schemes is a popular technique in text mining and machine learning for a range of analysis and retrieval tasks. The weights are usually computed in the data preprocessing step, because updating them and keeping track of every modification performed on the dataset is costly. Furthermore, calculation errors are introduced when only subsets of the dataset are analyzed, i.e., wrong weights are computed, because weighting schemes use the number of documents to score keywords and documents. Therefore, in a Big Data context, it is crucial to lower the runtime of computing weighting schemes without hindering the analysis process or the accuracy of the machine learning algorithms. To address this requirement for the task of computing top-k keywords and documents (which largely relies on weighting schemes), it is customary to design benchmarks that compare weighting schemes within various configurations of distributed frameworks and database management systems. Thus, we propose TextBenDS, a generic document-oriented benchmark for storing textual data and constructing weighting schemes. Our benchmark offers a generic data model designed with a multidimensional approach for storing text documents. We also propose using aggregation queries of various complexities and selectivities to construct term weighting schemes, which are then used to extract top-k keywords and documents. We evaluate the computing performance of the queries on several distributed environments set within the Apache Hadoop ecosystem. Our experimental results show, for example, that MongoDB has the best overall performance, while Spark’s execution time remains almost constant regardless of the weighting scheme.
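To make the dependence on the number of documents concrete, the following is a minimal Python sketch, not the benchmark's implementation. It assumes TF-IDF (raw term frequency times log(N / df)) as a representative weighting scheme, since the abstract does not name the schemes used; the function names `tf_idf` and `top_k_keywords` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df).

    Assumption: TF-IDF stands in for the weighting schemes of the benchmark.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights


def top_k_keywords(weights, k):
    """Sum each term's weight over all documents and return the k heaviest terms."""
    totals = Counter()
    for doc_weights in weights.values():
        for term, w in doc_weights.items():
            totals[term] += w
    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus of already tokenized documents.
docs = {
    "d1": ["spark", "hadoop", "spark"],
    "d2": ["spark", "mongodb"],
    "d3": ["hive", "hadoop"],
}

full = tf_idf(docs)
subset = tf_idf({d: docs[d] for d in ("d1", "d2")})

# The same term in the same document gets a different weight once the
# document count changes, so weights precomputed on the full dataset
# cannot be reused for a subset.
print(full["d1"]["spark"], subset["d1"]["spark"])
print(top_k_keywords(full, 1))
```

In this sketch, restricting the corpus to `d1` and `d2` makes "spark" occur in every remaining document, so its weight in `d1` drops to zero. This is the kind of discrepancy the abstract attributes to analyzing only a subset of the dataset.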

Notes

  1. Source code https://github.com/cipriantruica/TextBenDS

Acknowledgements

This research was funded by grant No. PN-III-P1-1.2-PCCDI-2017-0734.

Author information

Corresponding author

Correspondence to Ciprian-Octavian Truică.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Truică, CO., Apostol, ES., Darmont, J. et al. TextBenDS: a Generic Textual Data Benchmark for Distributed Systems. Inf Syst Front 23, 81–100 (2021). https://doi.org/10.1007/s10796-020-09999-y

  • DOI: https://doi.org/10.1007/s10796-020-09999-y
