Domain-based Latent Personal Analysis and its use for impersonation detection in social media

User Modeling and User-Adapted Interaction

Abstract

Zipf’s law defines an inverse proportion between a word’s ranking in a given corpus and its frequency in it, roughly dividing the vocabulary into frequent words and infrequent ones. Here, we stipulate that within a domain an author’s signature can be derived from, in loose terms, the author’s missing popular words and frequently used infrequent words. We devise a method, termed Latent Personal Analysis (LPA), for finding domain-based attributes for entities in a domain: their distance from the domain and their signature, which determines how they most differ from the domain. We identify the most suitable distance metric for the method among several and construct the distances and personal signatures for authors, the domain’s entities. The signature consists of both over-used terms (compared to the average) and missing popular terms. We validate the correctness and power of the signatures in identifying users and set existence conditions. We test LPA in several domains, both textual and non-textual. We then demonstrate the use of the method in explainable authorship attribution: we define algorithms that utilize LPA to identify two types of impersonation in social media: (1) authors with sockpuppet (multiple) accounts and (2) front-user accounts, operated by several authors. We validate the algorithms and employ them over a large-scale dataset obtained from a social media site with over 4000 users. We corroborate these results using temporal rate analysis. LPA can further be used to devise personal attributes in a wide range of scientific domains in which the constituents have a long-tail distribution of elements.

Availability of code, data, and materials

Code and data are available from http://scan.haifa.ac.il/data.

Notes

  1. The information regarding the book is taken from Ben-Tovim (2008) and https://en.m.wikipedia.org/wiki/Robinson_Crusoe.

References

  • Abbasi, A., Chen, H.: Writeprints: a stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Trans. Inf. Syst. (TOIS) 26(2), 1–29 (2008)

  • Akmajian, A., Farmer, A.K., Bickmore, L., Demers, R.A., Harnish, R.M.: Linguistics: An Introduction to Language and Communication. MIT Press, Cambridge (2017)

  • Alon, U., Mokryn, O., Hershberg, U.: Using domain based latent personal analysis of B cell clone diversity patterns to identify novel relationships between the B cell clone populations in different tissues. Front. Immunol. 12, 642673 (2021)

  • Argamon, S., Koppel, M., Pennebaker, J.W., Schler, J.: Automatically profiling the author of an anonymous text. Commun. ACM 52(2), 119–123 (2009)

  • Baker, L.D., McCallum, A.K.: Distributional clustering of words for text classification. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 96–103 (1998)

  • Barabasi, A.L.: The origin of bursts and heavy tails in human dynamics. Nature 435(7039), 207 (2005)

  • Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science 286, 509–512 (1999)

  • Barbon, S., Igawa, R.A., Zarpelão, B.B.: Authorship verification applied to detection of compromised accounts on online social networks. Multimed. Tools Appl. 76(3), 3213–3233 (2017)

  • Ben-Shoshan, H., Mokryn, O.: Activemap: Visual analysis of temporal activity in social media sites. In: Proceedings of the 23rd International Conference on Intelligent User Interfaces Companion, pp. 1–2 (2018)

  • Ben-Tovim, R.: Robinson Crusoe, Wittgenstein, and the return to society. Philos. Lit. 32(2), 278–292 (2008)

  • Bigi, B.: Using Kullback–Leibler distance for text categorization. In: European Conference on Information Retrieval, pp. 305–319. Springer (2003)

  • Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

  • Brinegar, C.S.: Mark Twain and the Quintus Curtius Snodgrass letters: a statistical test of authorship. J. Am. Stat. Assoc. 58(301), 85–96 (1963)

  • Brown, R., McNeill, D.: The “tip of the tongue” phenomenon. J. Verbal Learn. Verbal Behav. 5(4), 325–337 (1966)

  • Burrows, J.F.: Word-patterns and story-shapes: the statistical analysis of narrative style. Lit. Linguist. Comput. 2(2), 61–70 (1987)

  • Calude, A.S., Pagel, M.: How do we use language? Shared patterns in the frequency of word use across 17 world languages. Philos. Trans. R. Soc. B Biol. Sci. 366(1567), 1101–1107 (2011)

  • Cao, N., Lu, L., Lin, Y.R., Wang, F., Wen, Z.: Socialhelix: visual analysis of sentiment divergence in social media. J. Vis. 18(2), 221–235 (2015)

  • Chen, S., Chen, S., Wang, Z., Liang, J., Yuan, X., Cao, N., Wu, Y.: D-map: visual analysis of ego-centric information diffusion patterns in social media. In: 2016 IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 41–50. IEEE (2016)

  • Clough, P.: Plagiarism in natural and programming languages: an overview of current tools and technologies. Citeseer (2000)

  • Cohen, J.: Things I have learned (so far). Am. Psychol. 45(12), 1304 (1990)

  • Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391–407 (1990)

  • Dumais, S.T.: Latent semantic analysis. Annu. Rev. Inf. Sci. Technol. 38(1), 188–230 (2004)

  • Ferrer-i Cancho, R., Solé, R.V.: Least effort and the origins of scaling in human language. Proc. Natl. Acad. Sci. 100(3), 788–791 (2003)

  • Ferrer-i Cancho, R., Vitevitch, M.S.: The origins of Zipf’s meaning-frequency law. J. Assoc. Inf. Sci. Technol. 69(11), 1369–1379 (2018)

  • Ferrara, E., Varol, O., Davis, C., Menczer, F., Flammini, A.: The rise of social bots. Commun. ACM 59(7), 96–104 (2016)

  • Ferraz Costa, A., Yamaguchi, Y., Juci Machado Traina, A., Traina, Jr. C., Faloutsos, C.: Rsc: mining and modeling temporal activity in social media. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 269–278. ACM (2015)

  • Freud, S.: Negation. Int. J. Psycho-Anal. 6, 367–371 (1925)

  • Griffiths, T.L., Steyvers, M.: A probabilistic approach to semantic representation. In: Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 24 (2002)

  • Hahn, M., Jurafsky, D., Futrell, R.: Universals of word order reflect optimization of grammars for efficient communication. Proc. Natl. Acad. Sci. 117(5), 2347–2353 (2020)

  • Hofmann, T.: Probabilistic latent semantic analysis (2013). arXiv preprint arXiv:1301.6705

  • Holmes, D.I.: The evolution of stylometry in humanities scholarship. Lit. Linguist. Comput. 13(3), 111–117 (1998)

  • Hu, X., Wang, Y., Wu, Q.: Multiple authors detection: a quantitative analysis of Dream of the Red Chamber. Adv. Adapt. Data Anal. 6(04), 1450012 (2014)

  • Iqbal, F., Binsalleeh, H., Fung, B.C., Debbabi, M.: A unified data mining solution for authorship analysis in anonymous textual communications. Inf. Sci. 231, 98–112 (2013)

  • Johnson, B., Shneiderman, B.: Tree-maps: a space-filling approach to the visualization of hierarchical information structures. In: Proceedings of the 2nd Conference on Visualization’91, pp. 284–291. IEEE Computer Society Press (1991)

  • Juola, P., et al.: Authorship attribution. Found. Trends® Inf. Retrieval 1(3), 233–334 (2008)

  • Kaplan, A.M., Haenlein, M.: Users of the world, unite! The challenges and opportunities of social media. Bus. Horiz. 53(1), 59–68 (2010)

  • Kietzmann, J.H., Hermkens, K., McCarthy, I.P., Silvestre, B.S.: Social media? Get serious! Understanding the functional building blocks of social media. Bus. Horiz. 54(3), 241–251 (2011)

  • Kilgarriff, A., Rose, T.: Measures for corpus similarity and homogeneity. In: Proceedings of the Third Conference on Empirical Methods for Natural Language Processing, pp. 46–52 (1998)

  • Koppel, M., Winter, Y.: Determining if two documents are written by the same author. J. Assoc. Inf. Sci. Technol. 65(1), 178–187 (2014)

  • Koppel, M., Schler, J., Argamon, S.: Computational methods in authorship attribution. J. Am. Soc. Inf. Sci. Technol. 60(1), 9–26 (2009)

  • Koppel, M., Schler, J., Argamon, S.: Authorship attribution in the wild. Lang. Resour. Eval. 45(1), 83–94 (2011)

  • Koppel, M., Schler, J., Argamon, S.: Authorship attribution: what’s easy and what’s hard? J. Law Econ. Policy 39(2006), 317–331 (2013)

  • Krippendorff, K.: Content Analysis: An Introduction to Its Methodology. Sage Publications, Thousand Oaks (2018)

  • Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)

  • Kumar, S., Cheng, J., Leskovec, J., Subrahmanian, V.: An army of me: sockpuppets in online discussion communities. In: Proceedings of the 26th International Conference on World Wide Web, pp. 857–866 (2017)

  • Mosteller, F., Wallace, D.L.: Inference and disputed authorship: the Federalist (1964)

  • Narayanan, A., Paskov, H., Gong, N.Z., Bethencourt, J., Stefanov, E., Shin, E.C.R., Song, D.: On the feasibility of internet-scale author identification. In: 2012 IEEE Symposium on Security and Privacy, pp. 300–314. IEEE (2012)

  • Neal, T., Sundararajan, K., Fatima, A., Yan, Y., Xiang, Y., Woodard, D.: Surveying stylometry techniques and applications. ACM Comput. Surv. (CSUR) 50(6), 86 (2018)

  • Newman, M.E.: Power laws, Pareto distributions and Zipf’s law. Contemp. Phys. 46(5), 323–351 (2005)

  • Ntoulas, A., Najork, M., Manasse, M., Fetterly, D.: Detecting spam web pages through content analysis. In: Proceedings of the 15th International Conference on World Wide Web, pp. 83–92. ACM (2006)

  • Piantadosi, S.T.: Zipf’s word frequency law in natural language: a critical review and future directions. Psychon. Bull. Rev. 21(5), 1112–1130 (2014)

  • Price, D.J.D.S.: Networks of scientific papers. Science 149, 510–515 (1965)

  • Rocha, A., Scheirer, W.J., Forstall, C.W., Cavalcante, T., Theophilo, A., Shen, B., Carvalho, A.R., Stamatatos, E.: Authorship attribution for social media forensics. IEEE Trans. Inf. Forensics Secur. 12(1), 5–33 (2016)

  • Schreck, T., Keim, D.: Visual analysis of social media data. Computer 46(5), 68–75 (2013)

  • Schwartz, H.A., Eichstaedt, J.C., Kern, M.L., Dziurzynski, L., Ramones, S.M., Agrawal, M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M.E., et al.: Personality, gender, and age in the language of social media: the open-vocabulary approach. PLoS ONE 8(9), e73791 (2013a)

  • Schwartz, R., Tsur, O., Rappoport, A., Koppel, M.: Authorship attribution of micro-messages. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1880–1891 (2013b)

  • Shrestha, P., Sierra, S., González, F.A., Montes, M., Rosso, P., Solorio, T.: Convolutional neural networks for authorship attribution of short texts. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 669–674 (2017)

  • Snijders, T.A.: The statistical evaluation of social network dynamics. Sociol. Methodol. 31(1), 361–395 (2001)

  • Steiger, B., Schmitz, R.: Computer implemented methods for visualizing correlations between blood glucose data and events and apparatuses thereof. US Patent App. 13/603,853 (2014)

  • Stein, B., Lipka, N., Prettenhofer, P.: Intrinsic plagiarism analysis. Lang. Resour. Eval. 45(1), 63–82 (2011)

  • Traxler, M., Gernsbacher, M.A.: Handbook of Psycholinguistics. Elsevier, Amsterdam (2011)

  • Van Dijck, J.: Users like you? Theorizing agency in user-generated content. Media Cult. Soc. 31(1), 41–58 (2009)

  • Vani, K., Gupta, D.: Unmasking text plagiarism using syntactic-semantic based natural language processing techniques: comparisons, analysis and challenges. Inf. Process. Manag. 54(3), 408–432 (2018)

  • Viswanath, B., Bashir, M.A., Crovella, M., Guha, S., Gummadi, K.P., Krishnamurthy, B., Mislove, A.: Towards detecting anomalous user behavior in online social networks. In: 23rd Usenix Security Symposium, pp. 223–238 (2014)

  • Vosoughi, S., Roy, D., Aral, S.: The spread of true and false news online. Science 359(6380), 1146–1151 (2018)

  • Wang, G., Wilson, C., Zhao, X., Zhu, Y., Mohanlal, M., Zheng, H., Zhao, B.Y.: Serf and turf: crowdturfing for fun and profit. In: Proceedings of the 21st International Conference on World Wide Web, pp. 679–688. ACM (2012)

  • Webber, W., Moffat, A., Zobel, J.: A similarity measure for indefinite rankings. ACM Trans. Inf. Syst. 28, 1–38 (2010)

  • West, G.B., Brown, J.H., Enquist, B.J.: A general model for the origin of allometric scaling laws in biology. Science 276(5309), 122–126 (1997)

  • Zanette, D.H., Manrubia, S.C.: Vertical transmission of culture and the distribution of family names. Physica A 295(1–2), 1–8 (2001)

  • Zhang, C., Wu, X., Niu, Z., Ding, W.: Authorship identification from unstructured texts. Knowl.-Based Syst. 66, 99–111 (2014)

  • Zheng, R., Li, J., Chen, H., Huang, Z.: A framework for authorship identification of online messages: writing-style features and classification techniques. J. Am. Soc. Inf. Sci. Technol. 57(3), 378–393 (2006)

  • Zipf, G.: Human Behavior and the Principle of Least Effort. Addison Wesley, Cambridge (1949)

Acknowledgements

The authors would like to thank Tom Atkins and Uri Alon for their help. We would also like to thank David Bodoff and Einat Minkov for interesting discussions and helpful remarks.

Funding

The research was partially funded by the Magnet Infomedia Consortium, Israel.

Author information

Corresponding author

Correspondence to Osnat Mokryn.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Distance metrics

A distance metric d on a set X is a function \(d:X\times X\rightarrow [0,\infty )\). I.e., it receives two elements in the set and gives the distance between them as a real, non-negative number. To be a true metric, such a function needs to fulfill the following four criteria:

  • Non-negativity \(d(x,y)\ge 0\). The distance between any two elements is greater or equal to zero.

  • Symmetry \(d(x,y)=d(y,x)\). Distance is independent of the starting point: for any two elements x, y in the set, the distance from x to y is the same as the distance from y to x.

  • Identity of indiscernibles \(d(x,y)=0\leftrightarrow x=y\). Two elements have a zero distance from each other if and only if they are the same element.

  • Triangle inequality \(d(x,z)\le d(x,y)+d(y,z)\). The shortest path is a direct one: given two elements x, z in the set, it is never shorter to go through a third one, y.
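The four criteria above can be spot-checked numerically on a finite sample of points. The following is a minimal sketch, not part of the original article: the helper name `check_metric_axioms` and the tolerance `tol` are illustrative assumptions. Passing such a check does not prove that a function is a metric, but a failure is a genuine counterexample.

```python
from itertools import product


def check_metric_axioms(d, points, tol=1e-12):
    """Empirically test the four metric criteria for d on a finite sample.

    Returns a dict mapping each criterion name to True or False.
    (Hypothetical helper, written for illustration only.)
    """
    results = {
        "non_negativity": True,
        "symmetry": True,
        "identity_of_indiscernibles": True,
        "triangle_inequality": True,
    }
    for x, y in product(points, repeat=2):
        dxy = d(x, y)
        if dxy < -tol:
            results["non_negativity"] = False
        if abs(dxy - d(y, x)) > tol:
            results["symmetry"] = False
        # d(x, y) = 0 must hold exactly when x and y are the same element.
        if (abs(dxy) <= tol) != (x == y):
            results["identity_of_indiscernibles"] = False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:
            results["triangle_inequality"] = False
    return results


# Absolute difference on the reals is a metric. Squared difference is not:
# going 0 -> 2 directly costs 4, but 0 -> 1 -> 2 costs only 2.
sample = [0.0, 1.0, 2.0, 3.5]
abs_result = check_metric_axioms(lambda a, b: abs(a - b), sample)
sq_result = check_metric_axioms(lambda a, b: (a - b) ** 2, sample)
```

The squared-difference example shows why the triangle inequality is a separate requirement: a function can satisfy the first three criteria and still fail the fourth.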

We show here that each of our selected metrics, as described in Sect. 3.2.1, is a distance metric by the definition given in Section 8. Specifically, we show that the distance metrics discussed in Sect. 3.2.1 all satisfy the first three of these properties: non-negativity, symmetry, and identity of indiscernibles.

1.1 RBD distance metric

Non-negativity We first notice that \(\sum _{d=1}^\infty p^{d-1}\) is the sum of the geometric progression \(p^{d-1}\) and is therefore equal to \(\frac{1}{1-p}\). Since \(A_d \le 1\), we get

$$\begin{aligned} \sum _{d=1}^\infty p^{d-1}\cdot A_d\le \frac{1}{1-p} \end{aligned}$$

and therefore

$$\begin{aligned} (1-p)\sum _{d=1}^\infty p^{d-1}\cdot A_d\le (1-p)\frac{1}{1-p} = 1 \end{aligned}$$

As p is in the range [0, 1), which is required for the geometric sum above to converge, and \(A_d\) is non-negative, we also get

$$\begin{aligned} (1-p)\sum _{d=1}^\infty p^{d-1}\cdot A_d\ge 0 \end{aligned}$$

Hence RBO is always in the range [0, 1]. As \(RBD=1-RBO\) it is also in that range.

Symmetry For two lists, \(V_1,V_2\), \(A_d\) is defined by the intersection of the lists over the first d terms. This is a symmetrical property—the intersection of S with T is the same as the intersection of T with S. p is an independent parameter, and we therefore have \(RBO(V_1,V_2,p)=RBO(V_2,V_1,p)\), and hence this also holds for RBD.

Identity of indiscernibles Let us consider \(V_1=V_2\), that is, for every term d, \(V_1(d)=V_2(d)\). Then, \(\forall d, X_d=d\) and \(A_d=1\). We then have

$$\begin{aligned} \sum _{d=1}^\infty p^{d-1}\cdot A_d=\sum _{d=1}^\infty p^{d-1} \end{aligned}$$

As before, this is the sum of the geometric progression \(p^{d-1}\) and is equal to \(\frac{1}{1-p}\). Therefore, for \(V_1=V_2\) we get \(RBO(V_1,V_2,p)=1\) and \(RBD(V_1,V_2,p)=0\).

Now, consider two distinct lists, that is, for some term d, \(V_1(d)\ne V_2(d)\). Taking the smallest such d, the first \(d-1\) terms of the two lists coincide while the d-th terms differ, so we have \(X_d<d\) and \(A_d<1\). We therefore have

$$\begin{aligned} \sum _{d=1}^\infty p^{d-1}\cdot A_d<\frac{1}{1-p} \end{aligned}$$

and therefore \(RBO(V_1,V_2,p)<1\) and \(RBD(V_1,V_2,p)>0\).
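The RBO sum used in this proof can be evaluated directly for finite ranked lists. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function names `rbo` and `rbd` and the default `p=0.9` are hypothetical, the lists are assumed to contain distinct items, and the infinite tail beyond the shorter list's depth k is filled by assuming the agreement stays at \(A_k\). With that tail, identical lists yield an RBO of 1, as in the proof.

```python
def rbo(list1, list2, p=0.9):
    """Rank-biased overlap of two ranked lists of distinct items, 0 < p < 1.

    A_d = X_d / d, where X_d is the number of items shared by the first d
    entries of both lists. The sum is evaluated to depth
    k = min(len(list1), len(list2)); the tail d > k assumes (an assumption
    of this sketch) that the agreement stays at A_k.
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    k = min(len(list1), len(list2))
    if k == 0:
        raise ValueError("both lists must be non-empty")
    seen1, seen2 = set(), set()
    overlap = 0          # X_d
    weighted_sum = 0.0   # running sum of p^(d-1) * A_d
    agreement = 0.0      # A_d
    for d in range(1, k + 1):
        a, b = list1[d - 1], list2[d - 1]
        if a == b:
            overlap += 1
        else:
            # Each new item counts if the other list has already shown it.
            overlap += (a in seen2) + (b in seen1)
        seen1.add(a)
        seen2.add(b)
        agreement = overlap / d
        weighted_sum += p ** (d - 1) * agreement
    return (1 - p) * weighted_sum + p ** k * agreement


def rbd(list1, list2, p=0.9):
    """Rank-biased distance, defined as 1 - RBO."""
    return 1.0 - rbo(list1, list2, p)
```

For example, `rbd(["a", "b", "c"], ["b", "a", "c"])` is positive because the lists already disagree at depth 1, while swapping the two arguments leaves the value unchanged.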

1.2 Cosine Similarity

Non-negativity As we are dealing with frequency vectors, all coordinates are non-negative. All vectors are therefore in the first orthant and the angles between them are in the range \([0,\pi /2]\) radians. Therefore, \(\cos \theta\) is in the range [0, 1] and so is \(1-\cos \theta\).

Symmetry Both the dot product and the standard multiplication are commutative operations and therefore \(D(V_1,V_2)=\frac{V_1\cdot V_2}{\Vert V_1 \Vert \times \Vert V_2 \Vert }=\frac{V_2\cdot V_1}{\Vert V_2 \Vert \times \Vert V_1 \Vert }=D(V_2,V_1)\), and therefore, also \(1-D(V_1,V_2)=1-D(V_2,V_1)\).

Identity of indiscernibles First, we note that for vectors of length n, the standard dot product \(V_1\cdot V_2\) is defined as \(\sum _{i=1}^n V_1(i)\times V_2(i)\) and the standard Euclidean norm of a vector V is defined as \(\Vert V\Vert =\sqrt{\sum _{i=1}^n V(i)^2}\). Assume \(V_1=V_2\), i.e., for every i \(V_1(i)=V_2(i)\). We then have \(V_1\cdot V_2=\sum _{i=1}^n V_1(i)\times V_2(i) = \sum _{i=1}^n V_1(i)\times V_1(i) = \sum _{i=1}^n V_1(i)^2\). We also have \(\Vert V_1 \Vert = \Vert V_2\Vert\) and therefore \(\Vert V_1\Vert \times \Vert V_2\Vert =\Vert V_1\Vert ^2=\sqrt{\sum _{i=1}^n V_1(i)^2}^2=\sum _{i=1}^n V_1(i)^2\). Therefore, for \(V_1=V_2\) we have \(D(V_1,V_2)=\frac{V_1\cdot V_2}{\Vert V_1 \Vert \times \Vert V_2 \Vert }=\frac{\sum _{i=1}^n V_1(i)^2}{\sum _{i=1}^n V_1(i)^2}=1\) and \(1-D(V_1,V_2)=0\).

Assume \(1-D(V_1,V_2)=0\), that is \(D(V_1,V_2)=1\). As we have seen when proving non-negativity, this implies that the angle \(\theta\) between \(V_1\) and \(V_2\) is zero, so each vector is a positive scalar multiple of the other. Since both vectors are normalized frequency vectors, i.e., \(\Vert V_1\Vert =\Vert V_2\Vert =1\), this means they are the same vector.
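The role of normalization in the last step can be seen in a small numerical sketch. This is an illustration only: the helper names `normalize` and `cosine_distance` are hypothetical, and normalizing counts so that they sum to one is an assumption about how frequency vectors are built.

```python
import math


def normalize(counts):
    """Turn raw non-negative counts into relative frequencies (sum to 1)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [c / total for c in counts]


def cosine_distance(v1, v2):
    """Return 1 - cos(theta) for two equal-length, non-zero vectors."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm1 * norm2)


# Raw count vectors [1, 2] and [2, 4] point in the same direction, so their
# cosine distance is (numerically) zero although the vectors differ.
raw_distance = cosine_distance([1, 2], [2, 4])
# After normalization both become the same frequency vector.
same_after_normalizing = normalize([1, 2]) == normalize([2, 4])
```

On raw counts, a zero cosine distance only says the vectors are parallel; it is the normalization that turns "parallel" into "identical".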

1.3 L1 norm

Non-negativity It’s enough to prove non-negativity for each element of the sum, but that is assured by the absolute value.

Symmetry Again, it’s enough to show symmetry for each element of the sum. We note that \((V_1(x)-V_2(x))=-(V_2(x)-V_1(x))\) and therefore \(|V_1(x)-V_2(x)|=|V_2(x)-V_1(x)|\).

Identity of indiscernibles Assume \(V_1=V_2\), i.e., for every \(x\in X\) we have \(V_1(x)=V_2(x)\). Then \(V_1(x)-V_2(x)=0\) and we have \(L1(V_1,V_2)=0\).

Assume \(V_1\ne V_2\), i.e., for some \(x\in X\) we have \(V_1(x)\ne V_2(x)\). For that x we have \(\big |V_1(x)-V_2(x)\big |>0\) and as all elements in the sum are non-negative we have \(L1(V_1,V_2)>0\).
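As a small illustration of the sum discussed above, the sketch below computes the L1 distance between two frequency vectors stored as dictionaries from terms to frequencies. The function name `l1_distance` and the convention that a term absent from one vector has frequency zero are assumptions of this sketch.

```python
def l1_distance(v1, v2):
    """Sum of absolute frequency differences over all terms in either vector.

    v1 and v2 map terms to frequencies; a missing term is treated as 0.0
    (an assumption of this sketch).
    """
    terms = set(v1) | set(v2)
    return sum(abs(v1.get(x, 0.0) - v2.get(x, 0.0)) for x in terms)


author = {"a": 0.5, "b": 0.5}
domain = {"a": 0.25, "b": 0.75}
distance = l1_distance(author, domain)
```

Each term contributes a non-negative amount, so a single term with differing frequencies is enough to make the distance positive.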

1.4 KL divergence

Non-negativity It is enough to show that each element in the sum is non-negative. For every \(x\in X\) either \(V_1(x)<V_2(x)\), \(V_1(x)>V_2(x)\) or \(V_1(x)=V_2(x)\). If \(V_1(x)<V_2(x)\), then \(V_1(x)-V_2(x)<0\) and \(\log \frac{V_1(x)}{V_2(x)}<0\). Therefore, \(\big [V_1(x)-V_2(x)\big ]\log \frac{V_1(x)}{V_2(x)}>0\). If \(V_1(x)>V_2(x)\), then \(V_1(x)-V_2(x)>0\) and \(\log \frac{V_1(x)}{V_2(x)}>0\) and again \(\big [V_1(x)-V_2(x)\big ]\log \frac{V_1(x)}{V_2(x)}>0\). Lastly, if \(V_1(x)=V_2(x)\) then \(V_1(x)-V_2(x)=0\) and \(\log \frac{V_1(x)}{V_2(x)}=0\) and we have \(\big [V_1(x)-V_2(x)\big ]\log \frac{V_1(x)}{V_2(x)}=0\).

Symmetry We note that \((V_1(x)-V_2(x))=-(V_2(x)-V_1(x))\) and likewise \(\log \frac{V_1(x)}{V_2(x)}=-\log \frac{V_2(x)}{V_1(x)}\) and therefore \((V_1(x)-V_2(x))\log \frac{V_1(x)}{V_2(x)}=(V_2(x)-V_1(x))\log \frac{V_2(x)}{V_1(x)}\). Again, this is true for each element in the sum and therefore holds for the entire sum.

Identity of indiscernibles Assume \(V_1=V_2\), i.e., for every \(x\in X\) we have \(V_1(x)=V_2(x)\). Then \(V_1(x)-V_2(x)=0\), \(\frac{V_1(x)}{V_2(x)}=1\) and so \(\log \frac{V_1(x)}{V_2(x)}=0\), and we have \(\big [V_1(x)-V_2(x)\big ]\log \frac{V_1(x)}{V_2(x)}=0\).

Assume \(V_1\ne V_2\), i.e., for some \(x\in X\) we have \(V_1(x)\ne V_2(x)\). For that x we have \(\big [V_1(x)-V_2(x)\big ]\log \frac{V_1(x)}{V_2(x)}>0\) and as all elements in the sum are non-negative we have \(D(V_1\Vert V_2)=\sum _{x\in X}\Big [\big [V_1(x)-V_2(x)\big ]\log \frac{V_1(x)}{V_2(x)}\Big ]>0\).
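The sum analyzed in this subsection, \(\sum _{x\in X}\big [V_1(x)-V_2(x)\big ]\log \frac{V_1(x)}{V_2(x)}\), equals the sum of the two directed KL divergences between \(V_1\) and \(V_2\), which is why it is symmetric. The sketch below evaluates it for dictionary-valued frequency vectors. The function name `symmetric_kl` is hypothetical, and the sketch assumes that both vectors are defined on the same terms with strictly positive frequencies; how zero frequencies are handled is not specified here and is left to the caller.

```python
import math


def symmetric_kl(v1, v2):
    """Evaluate sum over x of (v1[x] - v2[x]) * log(v1[x] / v2[x]).

    Assumes (for this sketch) identical term sets and strictly positive
    frequencies, so that every logarithm is defined.
    """
    if v1.keys() != v2.keys():
        raise ValueError("both vectors must be defined on the same terms")
    total = 0.0
    for x in v1:
        a, b = v1[x], v2[x]
        if a <= 0 or b <= 0:
            raise ValueError("frequencies must be strictly positive")
        total += (a - b) * math.log(a / b)
    return total


author = {"a": 0.5, "b": 0.5}
domain = {"a": 0.25, "b": 0.75}
divergence = symmetric_kl(author, domain)
```

In each term the difference and the logarithm share the same sign, so no term can be negative, which mirrors the case analysis in the non-negativity argument.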

Appendix 2: Datasets details

Books dataset The Books dataset comprises 30 English-language books, taken from Project Gutenberg’s most popular books list. Each book is divided into chapters, varying from 7 to 61 chapters per book. We included only chapters that contain more than 150 words and omitted words that appear fewer than five times in the book to avoid the extreme consequences of a very long tail. Each text snippet is labeled with an author’s name and belongs to a book, and more specifically to a chapter in the book. In the Validation section, we use this subdivision to validate whether our method is able to distinguish between two texts written by the same person and two texts written by different people. We also test its ability to distinguish between a text written by one person and a text written by several, by collecting a number of chapters from different books into one virtual book. Table 8 lists the book names and relevant statistics.
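The two filtering rules described for the Books dataset can be sketched as follows. This is an illustrative reading, not the authors' preprocessing code: the function name `filter_book` is hypothetical, chapters are assumed to be already tokenized into lists of words, and counting word occurrences over the retained chapters only is an assumption, since the order of the two steps is not stated.

```python
from collections import Counter


def filter_book(chapters, min_chapter_words=150, min_word_count=5):
    """Apply the two filters to one book given as a list of token lists.

    1. Keep only chapters with more than min_chapter_words tokens.
    2. Drop words occurring fewer than min_word_count times in the book
       (counted over the kept chapters, an assumption of this sketch).
    """
    kept = [tokens for tokens in chapters if len(tokens) > min_chapter_words]
    counts = Counter(token for tokens in kept for token in tokens)
    return [
        [token for token in tokens if counts[token] >= min_word_count]
        for tokens in kept
    ]


# Toy example with small thresholds so the effect is visible.
toy_book = [["a", "a", "a", "b"], ["a", "c"]]
filtered = filter_book(toy_book, min_chapter_words=2, min_word_count=3)
```

In the toy example the second chapter is dropped for being too short, and the rare word "b" is removed from the first chapter.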

IMDb reviews dataset IMDb (Internet Movie Database) is among the world’s most popular and authoritative sources for movie, TV and celebrity content. It offers a searchable database of more than 185 million data items, including more than 3.5 million films. This dataset contains 1,406,000 movie reviews, spanning the period of July 1998–June 2016. We use the text of these movie reviews as our dataset for demonstrating the applications of our method. We define a user/author (contributor) as a registered person who published a movie review on IMDb. Each author is identified by a unique user ID and alias. Each review contains a text, a timestamp, and an author ID. The IMDb dataset as originally obtained contained 467,961 users. To have a large enough sample of text for each user, we extracted only users who published at least 30 reviews; 3969 users met this criterion. We defined \(n=30\) (number of different reviews per user) as our lower threshold to achieve statistical inference. The choice of \(n = 30\) for a boundary between small and large samples is already used as a rule of thumb in many research areas: “The number 30 seems to have arisen from the understanding that with fewer than 30 cases, you were dealing with small samples that required specialized handling with small-sample statistics instead of the critical-ratio approach we have been taught” (Cohen 1990).
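The user-selection rule above (keep only authors with at least 30 reviews) can be sketched as below. The record layout is hypothetical: each review is assumed to be a dictionary with an `author_id` key, and the function name `active_authors` is illustrative rather than taken from the released code.

```python
from collections import Counter


def active_authors(reviews, min_reviews=30):
    """Return the set of author IDs with at least min_reviews reviews.

    reviews is an iterable of dicts with an 'author_id' key
    (a hypothetical layout assumed for this sketch).
    """
    counts = Counter(review["author_id"] for review in reviews)
    return {author for author, n in counts.items() if n >= min_reviews}


# Toy data: one author with 30 reviews, one with 29.
toy_reviews = [{"author_id": "u1"}] * 30 + [{"author_id": "u2"}] * 29
selected = active_authors(toy_reviews)
```

The threshold is inclusive, matching the wording "at least 30 reviews".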

Table 8 Books datasets characteristics
Table 9 IMDb dataset characteristics

Cite this article

Mokryn, O., Ben-Shoshan, H. Domain-based Latent Personal Analysis and its use for impersonation detection in social media. User Model User-Adap Inter 31, 785–828 (2021). https://doi.org/10.1007/s11257-021-09295-7

  • DOI: https://doi.org/10.1007/s11257-021-09295-7
