A review of alignment based similarity measures for web usage mining

Artificial Intelligence Review

Abstract

To understand user behavior in web-based applications, web usage mining applies unsupervised learning techniques to discover hidden patterns in web data that capture user browsing on web sites. For this purpose, web session clustering has been among the most popular approaches for grouping users whose similar browsing patterns reflect a common interest. The quality of a web session clustering implementation depends significantly on the measure used to evaluate the similarity of sessions. An efficient approach to evaluating session similarity is sequence alignment, the task of determining the similarity of elements between sequences. In this paper, we review and compare sequence alignment-based measures for web sessions, and also discuss sequence similarity measures that are not alignment-based. This review also provides a perspective on the sequence similarity measures that manipulate web sessions in the usage clustering process.
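As an illustration of what an alignment-based session similarity can look like, the following minimal sketch computes a global alignment score in the style of Needleman and Wunsch (1970) between two sessions represented as lists of page identifiers. It is not the measure proposed or evaluated in the paper: the function names, the scoring values (match = 1, mismatch = -1, gap = -1), and the normalization by the longer session length are all assumptions chosen for brevity.

```python
def alignment_score(session_a, session_b, match=1, mismatch=-1, gap=-1):
    """Global alignment score between two page sequences (assumed scoring)."""
    n, m = len(session_a), len(session_b)
    # prev[j]: best score for aligning the pages processed so far in
    # session_a with the first j pages of session_b.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * gap] + [0] * m
        for j in range(1, m + 1):
            pair = match if session_a[i - 1] == session_b[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + pair,  # align the two pages with each other
                prev[j] + gap,       # page of session_a aligned with a gap
                curr[j - 1] + gap,   # page of session_b aligned with a gap
            )
        prev = curr
    return prev[m]


def session_similarity(session_a, session_b):
    """Score divided by the longer session length (assumed normalization)."""
    longest = max(len(session_a), len(session_b))
    if longest == 0:
        return 1.0  # two empty sessions are treated as identical
    return alignment_score(session_a, session_b) / longest
```

With these assumed scores, `session_similarity(["home", "products", "cart", "checkout"], ["home", "products", "checkout"])` evaluates to 0.5: three matched pages and one gap give a score of 2, divided by the longer length of 4. A clustering algorithm could then consume such pairwise values, although the choice of scoring scheme and normalization would need to follow the measures the review actually compares.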

Acknowledgements

The authors would like to thank the Beampulse company for providing the datasets used in the experiments. They would also like to thank VIET and Campus France for funding this research.

Author information

Corresponding author

Correspondence to Germain Forestier.

About this article

Cite this article

Luu, VT., Forestier, G., Weber, J. et al. A review of alignment based similarity measures for web usage mining. Artif Intell Rev 53, 1529–1551 (2020). https://doi.org/10.1007/s10462-019-09712-9

  • DOI: https://doi.org/10.1007/s10462-019-09712-9
