Abstract
Sequential data are important in many real-world location-based services. In this paper, we study the problem of sequence matching. Specifically, we want to identify the sequences most similar to a given sequence under the three most commonly used preference-aware similarity measures, i.e., Fagin’s intersection metric, Kendall’s tau, and Spearman’s footrule. We first analyze the properties of these three preference-aware similarity measures, revealing the connection between them and set intersection. Then, we build an index structure, which is essentially a doubly linked list, to facilitate efficient sequence matching. Lower and upper bounds are derived to support prefix-based filtering. Experiments on various datasets show that our proposed method outperforms the baselines by a large margin.
Notes
Some other work may simply denote \(\mathcal {U}\) as the set {1, 2,⋯ ,n} (see, for example, [5]). These two representations are equivalent in the sense that we can assign a distinct integer ID to each item. In this work, we choose to avoid using such integer IDs so as to minimize any possible confusion between items and their ranks.
For the sake of clarity, we overload the symbol F to compute the distance between top-m lists and suspend the use of the asterisk (∗), which indicates the Hausdorff nature. In addition, F∗ defined in Eq. 2 is a metric whereas F in Eq. 3 is NOT. This is because in Eq. 3 the universe \(\mathcal {U}\) is considered as π ∪ σ, which is not true in general. However, compared to the (true) Hausdorff distance F∗, F in Eq. 3 is preferable in some aspects. For example, consider again σ1 and σ2, the top-3 lists of fruits in Section 3.1: it is more intuitive to compute the distance based merely on what they have in their lists; it is less intuitive (although it still makes some sense) to alter the distance value whenever, say, a new fruit is appended to the universe.
Those items outside of π ∩ σ are not important, for they have no impact on the distance value. Therefore, in this sense, it does not matter whether σ′∖ π = σ ∖ π.
Note that ℓ is well-defined since \(\sigma (v_{0}) - 0 = 0 \le z - s\) and \(\sigma (v_{s+1}) - (s + 1) = m - s \ge z - s\).
The number of 1’s in a binary string is known as the Hamming weight of that string. It can be efficiently computed using one of the bitwise tricks known as sideways addition [68]. If the operand y is expected to be sparse (i.e., y contains only a few 1’s), then the sideways addition can be done by repeatedly applying y ← y & (y − 1) until y = 0 [69].
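The loop described in this note can be sketched in Python as follows. This is a minimal illustration, assuming y is a non-negative integer; the function name hamming_weight_sparse is hypothetical and does not appear in the paper.

```python
def hamming_weight_sparse(y: int) -> int:
    """Count the 1 bits of a non-negative integer y.

    Each iteration applies y <- y & (y - 1), which clears the lowest
    set bit, so the loop runs once per 1 bit rather than once per bit.
    """
    if y < 0:
        raise ValueError("y must be non-negative")  # assumption of this sketch
    count = 0
    while y != 0:
        y &= y - 1  # clear the lowest set bit
        count += 1
    return count
```

The number of iterations equals the Hamming weight itself, which is why the loop suits sparse operands; for dense operands a table-based or parallel sideways addition is usually the better choice.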
This DBLP citation network is publicly available at http://arnetminer.org/citation
The Jester dataset is available at http://www.ieor.berkeley.edu/~goldberg/jester-data/
References
Rentfrow PJ, Gosling SD (2003) The do re mi’s of everyday life: the structure and personality correlates of music preferences. J Pers Soc Psychol 84(6):1236–1256
Chausson O (2010) Assessing the impact of gender and personality on film preferences. Technical report, myPersonality Project, University of Cambridge
Cantador I, Fernández-Tobías I, Bellogín A (2013) Relating personality types with user preferences in multiple entertainment domains. In: EMPIRE
Diaconis P, Graham RL (1977) Spearman’s footrule as a measure of disarray. J Royal Statistical Soc Series B (Methodol) 39(2):262–268
Critchlow DE (1984) Metric methods for analyzing partially ranked data. Technical Report 225, Dept of Statistics, Stanford University
Salama IA, Quade D (1990) A note on Spearman’s footrule. Comm Statistics 19(2):591–601
Fagin R, Kumar R, Sivakumar D (2003) Comparing top-k lists. SIAM J Discrete Math 17(1):134–160
Wu S, Crestani F (2003) Methods for ranking information retrieval systems without relevance judgements. In: SAC
Webber W, Moffat A, Zobel J (2010) A similarity measure for indefinite rankings. TOIS 28(4):1–34
Adomavicius G, Tuzhilin A (2005) Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. TKDE 17(6):734–749
Konstas I, Stathopoulos V, Jose JM (2009) On social networks and collaborative recommendation. In: SIGIR
Shang S, Chen L, Wei Z, Jensen CS, Zheng K, Kalnis P (2017) Trajectory similarity join in spatial networks. In: PVLDB
Yue X, Xi M, Chen B, Gao M, He Y, Xu J (2019) A revocable group signatures scheme to provide privacy-preserving authentications. Mobile Networks and Applications
Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD
Pal K, Michel S (2016) Efficient similarity search across top-k lists under the Kendall’s tau distance. In: SSDBM
Berchtold S, Ertl B, Keim DA, Kriegel H-P, Seidl T (1998) Fast nearest neighbor search in high-dimensional space. In: ICDE
Roussopoulos N, Kelley S, Vincent F (1995) Nearest neighbor queries. In: SIGMOD
Hjaltason GR, Samet H (1999) Distance browsing in spatial databases. TODS 24(2):265–318
Sharifzadeh M, Shahabi C (2010) Vor-tree: R-trees with Voronoi diagrams for efficient processing of spatial nearest neighbor queries. PVLDB 3(1-2):1231–1242
Liu T, Moore AW, Gray A (2006) New algorithms for efficient high-dimensional nonparametric classification. JMLR 7:1135–1158
Sproull RF (1991) Refinements to nearest-neighbor searching in k-dimensional trees. Algorithmica 6:579–589
Beygelzimer A, Kakade S, Langford J (2006) Cover trees for nearest neighbors. In: ICML
Filho RFS, Traina A, Traina C Jr., Faloutsos C (2001) Similarity search without tears: the OMNI-family of all-purpose access methods. In: ICDE
Jagadish HV, Ooi BC, Tan K-L, Yu C, Zhang R (2005) iDistance: an adaptive B+-tree based indexing method for nearest neighbor search. TODS 30(2):364–397
Venkateswaran J, Lachwani D, Kahveci T, Jermaine C (2006) Reference-based indexing of sequence databases. In: VLDB
Spearman C (1904) The proof and measurement of association between two things. Am J Psychol 15(1):72–101
Kendall M (1948) Rank correlation methods. Charles Griffin and Co.
Jurman G, Merler S, Barla A, Paoli S, Galea A, Furlanello C (2008) Algebraic stability indicators for ranked lists in molecular profiling. Bioinformatics 24(2):258–264
Jurman G, Riccadonna S, Visintainer R, Furlanello C (2009) Canberra distance on ranked lists. In: Adv ranking NIPS 09 Workshop, Whistler, Canada
Jurman G, Riccadonna S, Visintainer R, Furlanello C (2012) Algebraic comparison of partial lists in bioinformatics. PLoS One 7(5):e36540
Chen J, Li Y, Feng L (2012) A new weighted Spearman’s footrule as a measure of distance between rankings. arXiv:1207.2541v2 [cs.DM]
Bartholdi JJ III, Tovey CA, Trick MA (1989) Voting schemes for which it can be difficult to tell who won the election. Soc Choice Welfare 8(2):157–165
Dwork C, Kumar R, Naor M, Sivakumar D (2001) Rank aggregation methods for the Web. In: WWW
Ailon N (2007) Aggregation of partial rankings, p-ratings and top-m lists. In: SODA
Sculley D (2007) Rank aggregation for similar items. In: SDM
Fang Q, Feng J, Ng W (2011) Identifying differentially-expressed genes via weighted rank aggregation. In: ICDM
Liu Y-T, Liu T-Y, Qin T, Ma Z-M, Li H (2007) Supervised rank aggregation. In: WWW
Klementiev A, Roth D, Small K (2008) Unsupervised rank aggregation with distance-based models. In: ICML
Fagin R, Kumar R, Sivakumar D (2003) Efficient similarity search and classification via rank aggregation. In: SIGMOD
Witten IH, Moffat A, Bell TC (1999) Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. Morgan Kaufmann, Burlington
Sanders P, Transier F (2007) Intersection in integer inverted indices. In: ALENEX
Mirzazadeh M (2004) Adaptive comparison-based algorithms for evaluating set queries. Master’s thesis, University of Waterloo
Bille P, Pagh A, Pagh R (2007) Fast evaluation of union-intersection expressions. In: ISAAC
Blelloch GE, Reid-Miller M (1998) Fast set operations using treaps. In: SPAA
Ding B, König AC (2011) Fast set intersection in memory. In: VLDB
Shang S, Ding R, Bo Y, Xie K, Zheng K, Kalnis P (2012) User oriented trajectory search for trip recommendation. In: EDBT
Cao X, Chen L, Cong G, Xiao X (2012) Keyword-aware optimal route search. In: PVLDB
Cao X, Chen L, Cong G, Jensen CS, Qu Q, Skovsgaard A, Wu D, Yiu ML (2012) Spatial keyword querying. In: ER
Cao X, Chen L, Cong G, Guan J, Phan N-T, Xiao X (2013) KORS: Keyword-aware optimal route search system. In: ICDE
Han J, Wen J-R (2013) Mining frequent neighborhood patterns in a large labeled graph. In: CIKM
Han J, Wen J-R, Pei J (2014) Within-network classification using radius-constrained neighborhood patterns. In: CIKM
Han J, Zheng K, Sun A, Shang S, Wen J-R (2016) Discovering neighborhood pattern queries by sample answers in knowledge base. In: ICDE
Shang S, Ding R, Zheng K, Jensen CS, Kalnis P, Zhou X (2014) Personalized trajectory matching in spatial networks. VLDB J 23(3):449–468
Shang S, Chen L, Wei Z, Jensen CS, Wen J-R, Kalnis P (2016) Collective travel planning in spatial networks. TKDE 28(5):1132–1146
Shang S, Chen L, Jensen CS, Wen J-R, Kalnis P (2017) Searching trajectories by regions of interest. TKDE 29(7):1549–1562
Shang S, Chen L, Zheng K, Jensen CS, Wei Z, Kalnis P (2018) Parallel trajectory to location join. TKDE, online first
Chen L, Cui Y, Cong G, Cao X (2014) SOPS: A system for efficient processing of spatial-keyword publish/subscribe. In: PVLDB
Chen L, Cong G, Cao X, Tan K-L (2015) Temporal spatial-keyword top-k publish/subscribe. In: ICDE
Chen L, Cong G (2015) Diversity-aware top-k publish/subscribe for text stream. In: SIGMOD
Chen Z, Cong G, Zhang Z, Tom ZJ, Chen L (2017) Distributed publish/subscribe query processing on the spatio-textual data stream. In: ICDE
Chen L, Shang S, Zhang Z, Cao X, Jensen CS, Kalnis P (2018) Location-aware top-k term publish/subscribe. In: ICDE
Li M, Chen L, Cong G, Gu Y, Yu G (2016) Efficient processing of location-aware group preference queries. In: CIKM
An L, Wang W, Shang S, Li Q, Zhang X (2018) Efficient task assignment in spatial crowdsourcing with worker and task privacy protection. GeoInformatica 22(2):335–362
Chen L, Cong G, Cao X (2013) An efficient query indexing mechanism for filtering geo-textual data. In: SIGMOD
Zhao K, Liu Y, Yuan Q, Chen L, Chen Z, Cong G (2016) Towards personalized maps: mining user preferences from geo-textual data. In: PVLDB
Li X, Cheng Y, Cong G, Chen L (2017) Discovering pollution sources and propagation patterns in urban area. In: KDD
Zhao K, Chen L, Cong G (2016) Topic exploration in spatio-temporal document collections. In: SIGMOD
Knuth DE (2009) Bitwise Tricks & Techniques; Binary Decision Diagrams, volume 4, fascicle 1 of The Art of Computer Programming, chapter 7. Addison-Wesley
Wegner P (1960) A technique for counting ones in a binary computer. CACM 3(5):322
Tang J, Zhang D, Yao L (2007) Social network extraction of academic researchers. In: ICDM
Tang J, Zhang J, Yao L, Li J, Li Z, Su Z (2008) Arnetminer: Extraction and mining of academic social networks. In: KDD
Tang J, Yao L, Zhang D, Zhang J (2010) A combination approach to web user profiling. ACM TKDD 5(1):1–44
Tang J, Zhang J, Jin R, Zi Y, Cai K, Li Z, Su Z (2011) Topic level expertise search over heterogeneous networks. Machine Learning Journal 82(2):211–237
Tang J, Fong ACM, Bo W, Zhang J (2012) A unified probabilistic framework for name disambiguation in digital library. TKDE 24(6):975–987
Goldberg K, Roeder T, Gupta D, Perkins C (2001) Eigentaste: a constant time collaborative filtering algorithm. J Inform Retrieval 4:133–151
Cite this article
Wang, H., Lu, Z. Preference-aware sequence matching for location-based services. Geoinformatica 24, 107–131 (2020). https://doi.org/10.1007/s10707-019-00370-1
DOI: https://doi.org/10.1007/s10707-019-00370-1