Abstract
Gender information is no longer a mandatory input when registering for an account at many leading Internet companies. However, predicting demographic information such as gender and age remains an important task, especially for mitigating unintentional gender/age bias in recommender systems. It is therefore necessary to infer the gender of users who did not provide this information during registration. We consider the problem of predicting the gender of registered users based on their declared names. By analyzing the first names of 100M+ users, we found that gender can be classified very effectively using the composition of the name strings. We propose a number of character-based machine learning models, and demonstrate that they infer the gender of users with much higher accuracy than baseline models. Moreover, we show that using last names in addition to first names further improves classification performance.
Notes
E.g., if a user read an article about the department store Macy’s, a categorical variable wiki_Macy’s is added to the list of features describing the user.
As an example, in SSA Data, around 21% of people with the name “Avery” are male. In a regression setting, we can fit a model to predict a value of 0.21 given this name. On the other hand, in a binary classification setting, we seek to predict a value of 0 for this name.
Given a predominantly female name with a male probability of \(p < 0.5\), and given k randomly selected people with this name, the probability that more than half of these people are male is \(f(p, k) = \sum _{i=\lfloor k/2\rfloor + 1}^{k} \binom{k}{i} p^i(1-p)^{k-i}\). For \(p = 0.3\), we found that \(f(p,k) < 0.05\) when \(k \ge 16\). For \(p=0.4\), we found that \(f(p,k) < 0.05\) when \(k \ge 66\).
References
3000 most common words in English. https://www.ef.edu/english-resources/english-vocabulary/top-3000-words/ (2020). [Online; accessed March 22, 2020]
SP 500 Companies (2020). https://datahub.io/core/s-and-p-500-companies. [Online; accessed March 22, 2020]
Social Security Administration: National data on the relative frequency of given names in the population of U.S. births where the individual has a social security number (tabulated based on social security records as of march 3, 2019). http://www.ssa.gov/oact/babynames/names.zip
Al Zamal F, Liu W, Ruths D (2012) Homophily and latent attribute inference: Inferring latent attributes of twitter users from neighbors. In: Sixth International AAAI Conference on Weblogs and Social Media
Ambekar A, Ward C, Mohammed J, Male S, Skiena S (2009) Name-ethnicity classification from open sources. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge Discovery and Data Mining, pp. 49–58. ACM
Beretta V, Maccagnola D, Cribbin T, Messina E (2015) An interactive method for inferring demographic attributes in twitter. In: Proceedings of the 26th ACM Conference on Hypertext & Social Media, pp. 113–122. ACM
Brown E (2017) Gender inference from character sequences in multinational first names. https://towardsdatascience.com/name2gender-introduction-626d89378fb0#408a
Burger JD, Henderson J, Kim G, Zarrella G (2011) Discriminating gender on twitter. In: Proceedings of the conference on empirical methods in natural language processing, pp. 1301–1309. Association for Computational Linguistics
Chen P, Sun Z, Bing L, Yang W (2017) Recurrent attention network on memory for aspect sentiment analysis. In: Proceedings of the 2017 conference on empirical methods in natural language processing, pp. 452–461
Ciot M, Sonderegger M, Ruths D (2013) Gender inference of twitter users in non-english contexts. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1136–1145
Google Cloud Content Categories (2019). https://cloud.google.com/natural-language/docs/categories
Culotta A, Kumar NR, Cutler J (2015) Predicting the demographics of twitter users from website traffic data. In: AAAI, pp. 72–78
Culotta A, Ravi NK, Cutler J (2016) Predicting twitter user demographics using distant supervision from website traffic data. J Artif Intell Res 55:389–408
Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
Grbovic M, Radosavljevic V, Djuric N, Bhamidipati N, Nagarajan A (2015) Gender and interest targeting for sponsored post advertising at tumblr. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pp. 1819–1828. ACM, New York, NY, USA. https://doi.org/10.1145/2783258.2788616
Han S, Hu Y, Skiena S, Coskun B, Liu M, Qin H, Perez J (2017) Generating look-alike names for security challenges. In: Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec ’17, pp. 57–67. ACM, New York, NY, USA. https://doi.org/10.1145/3128572.3140441
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Computation 9(8):1735–1780
Karako C, Manggala P (2018) Using image fairness representations in diversity-based re-ranking for recommendations. In: Adjunct Publication of the 26th Conference on User Modeling, Adaptation and Personalization, pp. 23–28. ACM
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
Knowles R, Carroll J, Dredze M (2016) Demographer: Extremely simple name demographics. In: Proceedings of the First Workshop on NLP and Computational Social Science, pp. 108–113
Kokkos A, Tzouramanis T (2014) A robust gender inference model for online social networks and its application to linkedin and twitter. First Monday 19(9)
Liu W, Al Zamal F, Ruths D (2012) Using social media to infer gender composition of commuter populations. In: Sixth International AAAI Conference on Weblogs and Social Media
Liu W, Ruths D (2013) What’s in a name? using first names as features for gender inference in twitter. In: Analyzing Microtext: AAAI 2013 Spring Symposium, pp. 10–16. AAAI, Palo Alto, CA, USA
Lu F (2018) The 11 Most Beautiful Chinese Names and What They Mean. https://bit.ly/2yGSNO7
Ludu PS (2014) Inferring gender of a twitter user using celebrities it follows. arXiv preprint arXiv:1405.6667
Merler M, Cao L, Smith JR (2015) You are what you tweet...pic! gender prediction based on semantic analysis of social media images. In: 2015 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. In: Proceedings of Workshop at ICLR
Mueller J, Stumme G (2016) Gender inference using statistical name characteristics in twitter. In: Proceedings of the 3rd Multidisciplinary International Social Networks Conference on SocialInformatics, Data Science 2016, p. 47. ACM
Otterbacher J (2010) Inferring gender of movie reviewers: exploiting writing style, content and metadata. In: Proceedings of the 19th ACM international conference on Information and knowledge management, pp. 369–378. ACM
Pennacchiotti M, Popescu AM (2011) A machine learning approach to twitter user classification. In: Fifth International AAAI Conference on Weblogs and Social Media
Rao D, Yarowsky D (2010) Detecting latent user properties in social media. In: Proc. of the NIPS MLSN Workshop, pp. 1–7. Citeseer
Sakaki S, Miura Y, Ma X, Hattori K, Ohkuma T (2014) Twitter user gender inference using combined analysis of text and image processing. In: Proceedings of the Third Workshop on Vision and Language, pp. 54–61
Wang S, Manning CD (2012) Baselines and bigrams: Simple, good sentiment and topic classification. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, ACL ’12, pp. 90–94. Association for Computational Linguistics, Stroudsburg, PA, USA
Wang Y, Huang M, Zhao L, et al. (2016) Attention-based lstm for aspect-level sentiment classification. In: Proceedings of the 2016 conference on empirical methods in natural language processing, pp. 606–615
Wikipedia: Andrea. https://en.wikipedia.org/wiki/Andrea [Online; accessed March 22, 2020]
Wikipedia: Toni. https://en.wikipedia.org/wiki/Toni [Online; accessed March 22, 2020]
Wikipedia: Unisex name. https://en.wikipedia.org/wiki/Unisex_name [Online; accessed March 22, 2020]
Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K, Klingner J, Shah A, Johnson M, Liu X, Kaiser Ł, Gouws S, Kato Y, Kudo T, Kazawa H, Stevens K, Kurian G, Patil N, Wang W, Young C, Smith J, Riesa J, Rudnick A, Vinyals O, Corrado G, Hughes M, Dean J (2016) Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR arXiv:1609.08144
Yao S, Huang B (2017) Beyond parity: Fairness objectives for collaborative filtering. In: Advances in Neural Information Processing Systems, pp. 2921–2930
Ye J, Han S, Hu Y, Coskun B, Liu M, Qin H, Skiena S (2017) Nationality classification using name embeddings. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM ’17, pp. 1897–1906. ACM, New York, NY, USA. https://doi.org/10.1145/3132847.3133008
Zhang X, Zhao J, LeCun Y (2015) Character-level convolutional networks for text classification. In: Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pp. 649–657. MIT Press, Cambridge, MA, USA
Zhou X, Wan X, Xiao J (2016) Attention-based lstm network for cross-lingual sentiment classification. In: Proceedings of the 2016 conference on empirical methods in natural language processing, pp. 247–256
Acknowledgements
The authors would like to thank the reviewers for their very thoughtful reviews of earlier versions of this paper.
Responsible editor: Ira Assent, Carlotta Domeniconi, Aristides Gionis, Eyke Hüllermeier.
A Appendix
A.1 Fusion model that combines content information and names for gender prediction
We have seen in Sect. 4 (Table 4) that the content-based model is inferior to the name-based models. However, it is reasonable to expect that because content consumption captures behavioral signals, it may offer useful information that complements the name-based models. This is especially true if the user has a unisex first name, for which the gender cannot be predicted with certainty from the first name alone.
We conducted a study to investigate a simple ensemble model. We took a sample of 1M users from Yahoo Full Name Data. For each user, we predicted the male probability using the first name and the NBLR model. For the same user, we predicted the probability based on the content the user consumed, using the CONTENT model.
We experimented with the following ways of fusing the results of the content and name based models:
Logistic regression: we use \(P(\text {M}|\text {first})\) and \(P(\text {M}|\text {content})\) as features and declared gender as labels to fit a logistic regression model.
Logistic regression with logits: we first convert \(P(\text {M}|\text {first})\) and \(P(\text {M}|\text {content})\) to logits, then fit a logistic regression model. Specifically, we convert the probability to logits via the function \(p\rightarrow \ln (p/(1-p))\).
Xgboost: we use \(P(\text {M}|\text {first})\) and \(P(\text {M}|\text {content})\) as features and declared gender as labels to fit an xgboost model.
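The logistic-regression fusion variants can be sketched in a few lines. The following is a minimal illustration of the logits variant only, with synthetic probabilities standing in for the real NBLR and CONTENT outputs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the two model outputs on 1,000 users; in the
# real setting these would be P(M|first) from NBLR and P(M|content)
# from the CONTENT model, with declared gender as the label (1 = male).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
p_first = np.clip(0.7 * y + 0.3 * rng.random(1000), 1e-6, 1 - 1e-6)
p_content = np.clip(0.6 * y + 0.4 * rng.random(1000), 1e-6, 1 - 1e-6)

def to_logit(p):
    # p -> ln(p / (1 - p)), the conversion used by the "with logits" variant
    return np.log(p / (1.0 - p))

X = np.column_stack([to_logit(p_first), to_logit(p_content)])
fusion = LogisticRegression().fit(X, y)
p_fused = fusion.predict_proba(X)[:, 1]  # fused P(M) per user
```

The plain logistic-regression variant is identical except that the raw probabilities are used as features; the xgboost variant replaces the final estimator.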
We performed 5-fold cross-validation to derive the average results in Table 8. We do not report the standard deviations since they are quite small (under 0.0006). From the table, we see that the fusion model using xgboost improves AUC by 0.008 and accuracy by 0.012, confirming our belief that combining content and names improves gender prediction. Logistic regression performs slightly worse than xgboost. Logistic regression with logits does not perform as well as plain logistic regression, although it is still better than the name-based model alone.
A.2 Dictionary look-up vs model prediction
For users with popular names, it is unnecessary to use a name-based machine learning model to predict their genders. Instead, a simple dictionary lookup is sufficient – it avoids any error that might be introduced in the modeling process. On the other hand, for unpopular names, a model-based method is more suitable: when a name appears only a few times, statistically we have low confidence in deciding the gender from those few occurrences. Furthermore, for names that have never appeared in the data before, a machine learning model is the only choice.
Therefore, when deploying in real-world applications, we suggest a hierarchical approach to gender classification: if a name appears \(\ge k\) times in the population of users with declared gender, we decide the gender by dictionary lookup; for all other names, we apply the machine learning models. We reason about a suitable choice of k next.
In general, we want to pick a name frequency k above which we can trust the labels in the lookup dictionary with high confidence. Suppose that a name is 30%/70% male/female in a very large population (e.g., the world). If there are k of our users with that name, drawn randomly from that population, then a simple calculation using the binomial distribution shows that we need \(k \ge 16\) to be sure that the name will be labelled as female with \(95\%\) or higher probability (see footnote 3). For a name that is 40%/60% male/female, we need \(k\ge 66\). Based on these considerations, we choose \(k = 40\). With that name frequency, we know from Table 2 that a dictionary lookup covers the gender of 86% of users. The gender of the remaining users can be inferred by the machine learning models. We emphasize that dictionary lookup is proposed only for real-world applications. For the purpose of understanding our models on names of different frequencies, all results reported in this paper are based on the machine learning models. We have seen in Table 4 that our proposed models perform well for both popular and unpopular names when compared with the baselines.
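The binomial calculation behind footnote 3 is easy to reproduce; a short sketch (the function name is ours):

```python
from math import comb

def prob_majority_male(p, k):
    """f(p, k): probability that strictly more than half of k randomly
    drawn people with a name of male probability p are male."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

# For a 30%/70% male/female name, k = 16 keeps the mislabeling
# probability below 5%; for a 40%/60% name, k = 66 is needed.
print(prob_majority_male(0.3, 16))  # about 0.026
print(prob_majority_male(0.4, 66))
```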
A.3 Understanding the models
In the previous subsection, we have seen that NBLR, LSTM and char-BERT all perform very similarly. Normally, a linear model such as NBLR is easier to understand than nonlinear models. By looking at the weights for the features, one can understand why a sample is predicted as positive or negative.
However, in the case of predicting gender from the name string, the features are character n-grams extracted from the same name, hence they are not independent. This feature correlation can give rise to negative weights for some features associated with male names, compensated by larger positive weights for other correlated features (as a reminder, we use label 1 for male and 0 for female). Similarly, positive weights may be observed for features associated with female names. This makes interpreting the model from feature weights difficult. For example, when looking at the female name “_anna_” (we add the prefix and suffix “_” to a name, see Sect. 3.2), we found that the 5-gram “anna_” has a positive weight of 0.36, which may be a surprise, as one would expect that a name ending with “anna” should be female. However, “anna_” contains the substring “nna_”, which has a negative weight of \(-1.19\). When combining the weights of all 17 character n-grams, together with the intercept, we end up with a total weight of \(-9.01\) and a male probability close to 0. Thus, looking at feature weights can be confusing and even misleading.
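The scoring itself is just a sum of n-gram weights passed through a sigmoid. A toy sketch of that mechanism (the n-gram range and the two weights are illustrative assumptions, not the learned NBLR parameters):

```python
import math

def char_ngrams(name, n_min=2, n_max=5):
    """Character n-grams of the name padded with boundary markers
    (the exact n-gram range used in the paper may differ)."""
    s = f"_{name}_"
    return [s[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(s) - n + 1)]

# Illustrative weights only; the learned model has one weight per n-gram.
weights = {"anna_": 0.36, "nna_": -1.19}
intercept = 0.0

def male_probability(name):
    # Linear score = intercept + sum of weights of the name's n-grams,
    # mapped to a probability by the logistic (sigmoid) function.
    score = intercept + sum(weights.get(g, 0.0) for g in char_ngrams(name))
    return 1.0 / (1.0 + math.exp(-score))
```

With these toy weights, “anna” scores \(0.36 - 1.19 = -0.83\), giving a male probability of about 0.30; only the full set of learned weights yields the value near 0 reported above.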
Instead, it is more instructive to apply one of our models to real datasets and see how it behaves. We first apply NBLR to classify Chinese names. Note that the Chinese language has up to 100K characters, of which about 7000 are commonly used. A person’s given name can consist of any one or two characters, so the number of unique given names could be in the millions (\(7000\times 7000=49\text {M}\)). It is difficult to collect enough instances of each of these possible variations and their associated genders. This is where a character-based model is especially well suited. As a sanity check, we looked at a list of 11 names from Lu (2018) that are considered especially beautiful because of their poetic connotation or phonetic harmony. Table 9 shows the male probabilities for each of these names, together with a “note” column giving their likely genders and an explanation, taken from Lu (2018). We observe that our classifier does assign higher scores to names that are definitely male than to those that are female.
It is also helpful to understand the NBLR classifier by looking at English words that score high as male or female. We first take the 3,000 most common English words (3000 most common words in English 2020), filter out the 443 words that are found in SSA Data (such as “Rose”), then list the 11 highest-scoring male/female words in Table 10 (a). Interestingly, many of them make sense, e.g., “miss” or “women” are female words due to their meanings. However, we notice that the majority of these 3,000 words are found in Yahoo Data as names. Next, we take the 236,736 words in nltk.corpus.words and filter out any names in either SSA Data or Yahoo Data, ending up with 169,902 words. The highest-scoring masculine/feminine words are given in Table 10 (b). Many of the words in the “feminine” column of Table 10 (b) do look very feminine; for example, many end in “ia” (e.g., “anadenia”). Words on the right-hand side of the table are classified as masculine, likely due to their endings, such as “*boy”. Incidentally, “knappan” is a word in Malayalam (a Dravidian language primarily spoken in the south Indian state of Kerala) that means “a good for nothing guy”.
Table 10 (c) gives the most feminine and masculine company names among the S&P 500 companies (SP 500 Companies 2020). The results are also interesting. Corporations with feminine names include a number of ladies’ fashion companies (Tiffany, Ulta, Macy’s), while many with masculine names are in the engineering or energy sectors (e.g., Duke Energy, General Motors, General Electric).
Overall, the performance of NBLR on these three datasets illustrates that the model generalizes well to rare or even unseen strings, and to non-English languages.
Cite this article
Hu, Y., Hu, C., Tran, T. et al. What’s in a name? – gender classification of names with character based machine learning models. Data Min Knowl Disc 35, 1537–1563 (2021). https://doi.org/10.1007/s10618-021-00748-6