ScalaParBiBit: scaling the binary biclustering in distributed-memory systems

Cluster Computing

Abstract

Biclustering is a data mining technique that finds groups of rows and columns that are highly correlated in a 2D dataset. Although several software applications perform biclustering, most of them suffer from a high computational complexity that prevents their use on large datasets. In this work we present ScalaParBiBit, a parallel tool to find biclusters in binary data, which are quite common in many research fields such as text mining, marketing or bioinformatics. ScalaParBiBit takes advantage of the special characteristics of these binary datasets, as well as of an efficient parallel algorithm and implementation, to accelerate the biclustering procedure in distributed-memory systems. The experimental evaluation shows that our tool is significantly faster and more scalable than the state-of-the-art tool ParBiBit on a cluster with 32 nodes and 768 cores. Our tool, together with its reference manual, is freely available at https://github.com/fraguela/ScalaParBiBit.
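
To make the notion of a bicluster in binary data concrete, the following is a minimal sequential sketch in Python. It is not taken from ScalaParBiBit and may differ from the algorithm the tool implements. It assumes a simple bit-pattern approach: each row is packed into an integer, the bitwise AND of two rows gives a candidate column pattern, and every row that contains that pattern joins the bicluster. The function names `encode_row` and `find_biclusters`, and the thresholds `min_rows` and `min_cols`, are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def encode_row(bits):
    """Pack a list of 0/1 values into an int; column 0 becomes the most significant bit."""
    word = 0
    for b in bits:
        word = (word << 1) | (1 if b else 0)
    return word


def find_biclusters(matrix, min_rows=2, min_cols=2):
    """Return (rows, cols) pairs where every listed row has a 1 in every listed column.

    Illustrative assumption: candidate patterns come from ANDing each pair of rows.
    """
    n_cols = len(matrix[0])
    words = [encode_row(row) for row in matrix]
    seen = set()    # patterns already expanded, to avoid duplicate biclusters
    result = []
    for i, j in combinations(range(len(words)), 2):
        pattern = words[i] & words[j]
        # Skip repeated patterns and patterns with too few columns set.
        if pattern in seen or bin(pattern).count("1") < min_cols:
            continue
        seen.add(pattern)
        # A row belongs to the bicluster if it contains every bit of the pattern.
        rows = [r for r, w in enumerate(words) if (w & pattern) == pattern]
        if len(rows) >= min_rows:
            cols = [c for c in range(n_cols) if (pattern >> (n_cols - 1 - c)) & 1]
            result.append((rows, cols))
    return result


example = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
]
biclusters = find_biclusters(example)
for rows, cols in biclusters:
    print("rows", rows, "cols", cols)
```

In this toy matrix, rows 0, 1 and 3 all have ones in columns 0 and 1, so they form one bicluster. Packing rows into machine words lets a single AND compare many columns at once, which is one plausible reading of the "special characteristics" of binary datasets mentioned in the abstract. The pairwise loop is quadratic in the number of rows, which hints at why parallel and distributed implementations are attractive for large inputs.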

Data availability

The application developed in this manuscript, together with building and usage instructions, as well as the datasets used in the experiments, are publicly available under an open source license at https://github.com/fraguela/ScalaParBiBit.

Acknowledgements

This research was supported by the Ministry of Science and Innovation of Spain (TIN2016-75845-P and PID2019-104184RB-I00, AEI/FEDER/EU, 10.13039/501100011033), and by the Xunta de Galicia, co-funded by the European Regional Development Fund (ERDF), under the Consolidation Programme of Competitive Reference Groups (ED431C 2017/04). We also acknowledge the support from the Centro Singular de Investigación de Galicia “CITIC”, funded by Xunta de Galicia and the European Union (European Regional Development Fund - Galicia 2014-2020 Program), by grant ED431G 2019/01. We also acknowledge the Centro de Supercomputación de Galicia (CESGA) for the usage of their resources.

Author information

Corresponding author

Correspondence to Basilio B. Fraguela.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Fraguela, B.B., Andrade, D. & González-Domínguez, J. ScalaParBiBit: scaling the binary biclustering in distributed-memory systems. Cluster Comput 24, 2249–2268 (2021). https://doi.org/10.1007/s10586-021-03261-z

  • DOI: https://doi.org/10.1007/s10586-021-03261-z
