
What’s in a name? – gender classification of names with character based machine learning models

Data Mining and Knowledge Discovery

Abstract

Gender information is no longer a mandatory input when registering for an account at many leading Internet companies. However, predicting demographic information such as gender and age remains an important task, especially for mitigating unintentional gender/age bias in recommender systems. It is therefore necessary to infer the gender of those users who did not provide this information during registration. We consider the problem of predicting the gender of registered users based on their declared name. By analyzing the first names of 100M+ users, we found that gender can be classified very effectively using the composition of the name strings. We propose a number of character based machine learning models, and demonstrate that our models are able to infer the gender of users with much higher accuracy than baseline models. Moreover, we show that using last names in addition to first names improves classification performance further.


Notes

  1. For example, if a user reads an article about the department store Macy’s, a categorical variable wiki_Macy’s is added to the list of features describing the user.

  2. As an example, in SSA Data, around 21% of people with the name “Avery” are male. In a regression setting, we can fit a model to predict a value of 0.21 given this name. On the other hand, in a binary classification setting, we seek to predict a value of 0 for this name.

  3. Given a predominantly female name with a male probability of \(p < 0.5\), and given k randomly selected people with this name, the probability that more than half of these people are male is \(f(p, k) = \sum _{i=\lfloor k/2\rfloor + 1}^{k} C_k^i p^i(1-p)^{k-i}\). For \(p = 0.3\), we found that \(f(p,k) < 0.05\) when \(k \ge 16\). For \(p=0.4\), we found that \(f(p,k) < 0.05\) when \(k \ge 66\).
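The regression-vs-classification distinction in footnote 2 can be sketched as follows; the function name and the illustrative counts are ours, not from the paper:

```python
def gender_targets(male_count, female_count):
    """From per-name gender counts (SSA-style), derive both training targets."""
    p_male = male_count / (male_count + female_count)
    regression_target = p_male                 # fit the male fraction directly
    binary_label = 1 if p_male >= 0.5 else 0   # 1 = male, 0 = female
    return regression_target, binary_label

# "Avery": ~21% male in SSA Data (illustrative counts)
print(gender_targets(21, 79))  # (0.21, 0)
```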

References

  • 3000 most common words in english. https://www.ef.edu/english-resources/english-vocabulary/top-3000-words/ (2020). [Online; accessed March 22, 2020]

  • SP 500 Companies (2020). https://datahub.io/core/s-and-p-500-companies. [Online; accessed March 22, 2020]

  • Social Security Administration: National data on the relative frequency of given names in the population of U.S. births where the individual has a social security number (tabulated based on social security records as of March 3, 2019). http://www.ssa.gov/oact/babynames/names.zip

  • Al Zamal F, Liu W, Ruths D (2012) Homophily and latent attribute inference: Inferring latent attributes of twitter users from neighbors. In: Sixth International AAAI Conference on Weblogs and Social Media

  • Ambekar A, Ward C, Mohammed J, Male S, Skiena S (2009) Name-ethnicity classification from open sources. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge Discovery and Data Mining, pp. 49–58. ACM

  • Beretta V, Maccagnola D, Cribbin T, Messina E (2015) An interactive method for inferring demographic attributes in twitter. In: Proceedings of the 26th ACM Conference on Hypertext & Social Media, pp. 113–122. ACM

  • Brown E (2017) Gender inference from character sequences in multinational first names. https://towardsdatascience.com/name2gender-introduction-626d89378fb0#408a

  • Burger JD, Henderson J, Kim G, Zarrella G (2011) Discriminating gender on twitter. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1301–1309. Association for Computational Linguistics

  • Chen P, Sun Z, Bing L, Yang W (2017) Recurrent attention network on memory for aspect sentiment analysis. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 452–461

  • Ciot M, Sonderegger M, Ruths D (2013) Gender inference of twitter users in non-english contexts. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1136–1145

  • Google Cloud Content Categories (2019). https://cloud.google.com/natural-language/docs/categories

  • Culotta A, Kumar NR, Cutler J (2015) Predicting the demographics of twitter users from website traffic data. In: AAAI, pp. 72–78

  • Culotta A, Ravi NK, Cutler J (2016) Predicting twitter user demographics using distant supervision from website traffic data. J Artif Intell Res 55:389–408


  • Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

  • Grbovic M, Radosavljevic V, Djuric N, Bhamidipati N, Nagarajan A (2015) Gender and interest targeting for sponsored post advertising at tumblr. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pp. 1819–1828. ACM, New York, NY, USA. https://doi.org/10.1145/2783258.2788616

  • Han S, Hu Y, Skiena S, Coskun B, Liu M, Qin H, Perez J (2017) Generating look-alike names for security challenges. In: Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec ’17, pp. 57–67. ACM, New York, NY, USA. https://doi.org/10.1145/3128572.3140441

  • Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Computation 9(8):1735–1780

  • Karako C, Manggala P (2018) Using image fairness representations in diversity-based re-ranking for recommendations. In: Adjunct Publication of the 26th Conference on User Modeling, Adaptation and Personalization, pp. 23–28. ACM

  • Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980

  • Knowles R, Carroll J, Dredze M (2016) Demographer: Extremely simple name demographics. In: Proceedings of the First Workshop on NLP and Computational Social Science, pp. 108–113

  • Kokkos A, Tzouramanis T (2014) A robust gender inference model for online social networks and its application to linkedin and twitter. First Monday 19(9)

  • Liu W, Al Zamal F, Ruths D (2012) Using social media to infer gender composition of commuter populations. In: Sixth International AAAI Conference on Weblogs and Social Media

  • Liu W, Ruths D (2013) What’s in a name? Using first names as features for gender inference in twitter. In: Analyzing Microtext, AAAI 2013 Spring Symposium, pp. 10–16. AAAI, Palo Alto, CA, USA

  • Lu F (2018) The 11 Most Beautiful Chinese Names and What They Mean. https://bit.ly/2yGSNO7

  • Ludu PS (2014) Inferring gender of a twitter user using celebrities it follows. arXiv preprint arXiv:1405.6667

  • Merler M, Cao L, Smith JR (2015) You are what you tweet...pic! gender prediction based on semantic analysis of social media images. In: 2015 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE

  • Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. In: Proceedings of Workshop at ICLR

  • Mueller J, Stumme G (2016) Gender inference using statistical name characteristics in twitter. In: Proceedings of the 3rd Multidisciplinary International Social Networks Conference on SocialInformatics, Data Science 2016, p. 47. ACM

  • Otterbacher J (2010) Inferring gender of movie reviewers: exploiting writing style, content and metadata. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pp. 369–378. ACM

  • Pennacchiotti M, Popescu AM (2011) A machine learning approach to twitter user classification. In: Fifth International AAAI Conference on Weblogs and Social Media

  • Rao D, Yarowsky D (2010) Detecting latent user properties in social media. In: Proc. of the NIPS MLSN Workshop, pp. 1–7. Citeseer

  • Sakaki S, Miura Y, Ma X, Hattori K, Ohkuma T (2014) Twitter user gender inference using combined analysis of text and image processing. In: Proceedings of the Third Workshop on Vision and Language, pp. 54–61

  • Wang S, Manning CD (2012) Baselines and bigrams: Simple, good sentiment and topic classification. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, ACL ’12, pp. 90–94. Association for Computational Linguistics, Stroudsburg, PA, USA

  • Wang Y, Huang M, Zhao L, et al. (2016) Attention-based LSTM for aspect-level sentiment classification. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 606–615

  • Wikipedia: Andrea. https://en.wikipedia.org/wiki/Andrea [Online; accessed March 22, 2020]

  • Wikipedia: Toni. https://en.wikipedia.org/wiki/Toni [Online; accessed March 22, 2020]

  • Wikipedia: Unisex name. https://en.wikipedia.org/wiki/Unisex_name [Online; accessed March 22, 2020]

  • Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K, Klingner J, Shah A, Johnson M, Liu X, Kaiser Ł, Gouws S, Kato Y, Kudo T, Kazawa H, Stevens K, Kurian G, Patil N, Wang W, Young C, Smith J, Riesa J, Rudnick A, Vinyals O, Corrado G, Hughes M, Dean J (2016) Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR arXiv:1609.08144

  • Yao S, Huang B (2017) Beyond parity: Fairness objectives for collaborative filtering. In: Advances in Neural Information Processing Systems, pp. 2921–2930

  • Ye J, Han S, Hu Y, Coskun B, Liu M, Qin H, Skiena S (2017) Nationality classification using name embeddings. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM ’17, pp. 1897–1906. ACM, New York, NY, USA. https://doi.org/10.1145/3132847.3133008

  • Zhang X, Zhao J, LeCun Y (2015) Character-level convolutional networks for text classification. In: Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pp. 649–657. MIT Press, Cambridge, MA, USA

  • Zhou X, Wan X, Xiao J (2016) Attention-based LSTM network for cross-lingual sentiment classification. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 247–256


Acknowledgements

The authors would like to thank the reviewers for their very thoughtful reviews of earlier versions of this paper.

Author information


Corresponding author

Correspondence to Yifan Hu.

Additional information

Responsible editor: Ira Assent, Carlotta Domeniconi, Aristides Gionis, Eyke Hüllermeier.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A Appendix

A.1 Fusion model that combines content information and names for gender prediction

We have seen in Sect. 4 (Table 4) that the content based model is inferior to the name based models. However, it is reasonable to expect that because content consumption by users captures behavior signals, it may offer useful information that can complement the name based models. This is especially true if the user has a unisex first name, for which it is not possible to predict with certainty the gender of the user by first name alone.

We conducted a study to investigate a simple ensemble model. We took a sample of 1M users from Yahoo Full Name Data. For each user, we predicted the male probability using the first name and the NBLR model. For the same user, we predicted the probability based on the content the user consumed, using the CONTENT model.

Fig. 5 Fusion model that combines name and content information to predict gender

We experimented with the following ways of fusing the results of the content and name based models:

Logistic regression: we use \(P(\text {M}|\text {first})\) and \(P(\text {M}|\text {content})\) as features and declared gender as labels to fit a logistic regression model.

Logistic regression with logits: we first convert \(P(\text {M}|\text {first})\) and \(P(\text {M}|\text {content})\) to logits, then fit a logistic regression model. Specifically, we convert the probability to logits via the function \(p\rightarrow \ln (p/(1-p))\).

Xgboost: we use \(P(\text {M}|\text {first})\) and \(P(\text {M}|\text {content})\) as features and declared gender as labels to fit an xgboost model.

Table 8 Performance comparison of 3 fusion models, vs the baseline name and content models

We did a 5-fold cross-validation to derive the average results in Table 8. We do not report the standard deviation since it is quite small (under 0.0006). From the table, we see that the fusion model using xgboost improves AUC by 0.008 and accuracy by 0.012, confirming our belief that combining content and names improves gender prediction. Logistic regression performed slightly worse than xgboost. Logistic regression with logits does not perform as well as plain logistic regression, although it is still better than the name based model alone.
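The logistic-regression-with-logits variant can be sketched as follows; the variable names and the synthetic scores below are illustrative stand-ins for the NBLR and CONTENT model outputs, not the paper's data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def to_logits(p, eps=1e-6):
    """Map probabilities to log-odds, p -> ln(p / (1 - p)), clipped to avoid infinities."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def fit_fusion(p_first, p_content, gender):
    """Fuse the two base-model male probabilities with a logistic regression on logits."""
    X = np.column_stack([to_logits(p_first), to_logits(p_content)])
    return LogisticRegression().fit(X, gender)

# Synthetic illustration only: 1 = male, 0 = female.
rng = np.random.default_rng(0)
gender = rng.integers(0, 2, size=1000)
p_first = np.clip(0.2 + 0.6 * gender + rng.normal(0, 0.15, size=1000), 0, 1)
p_content = np.clip(0.3 + 0.4 * gender + rng.normal(0, 0.20, size=1000), 0, 1)
fusion = fit_fusion(p_first, p_content, gender)
```

The plain logistic-regression variant drops the logit transform, and the xgboost variant would swap `LogisticRegression` for an `xgboost.XGBClassifier` on the raw probabilities.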

A.2 Dictionary look-up vs model prediction

For users with popular names, it is unnecessary to use a name-based machine learning model to predict their genders. Instead, a simple dictionary lookup is sufficient, and it avoids any error that might be introduced in the modeling process. On the other hand, for unpopular names, a model-based method is more suitable: when a name appears only a few times, we have low statistical confidence in deciding the gender based on those few occurrences. Furthermore, for names that have never appeared before in the data, a machine learning model is the only choice.

Therefore, when deploying in real world applications, we suggest a hierarchical approach to gender classification: if a name appears \(\ge k\) times in the population of users with declared gender, we decide the gender based on a dictionary lookup; for other names, we apply the machine learning based models. We discuss a suitable choice of k next.

In general, we want to pick a name frequency k, above which we can trust the labels in the lookup dictionary with high confidence. Suppose that a name is 30%/70% male/female in a very large population (e.g., the world). If there are k of our users with that name, drawn from that population randomly, then based on a simple calculation using the binomial distribution, we need \(k \ge 16\) to be sure that the name will be labelled as female with \(95\%\) or higher probability (see footnote 3). For a name that is 40%/60% male/female, we need \(k\ge 66\). Based on the above considerations, we choose \(k = 40\). With that name frequency, we know from Table 2 that a dictionary lookup can cover the gender of 86% of users. The gender of the remaining users can be inferred by the machine learning models. We emphasize that dictionary lookup is proposed only for real-world applications. For the purpose of understanding our models on names of different frequencies, all results reported in this paper are based on the machine learning models. We have seen in Table 4 that our proposed models performed well for both popular and unpopular names when compared with the baselines.
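The binomial calculation behind the choice of k can be reproduced directly. A minimal sketch, where `misleading_label_prob` is our name for the footnote's \(f(p, k)\):

```python
from math import comb

def misleading_label_prob(p, k):
    """P(more than half of k sampled people are male) when the true male rate is p < 0.5.

    This is the chance that a dictionary built from k samples would assign
    the wrong majority gender to the name.
    """
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

# Claims from footnote 3:
assert misleading_label_prob(0.3, 16) < 0.05
assert misleading_label_prob(0.4, 66) < 0.05
```

For small samples the risk is substantial: with only 10 users bearing a 40%-male name, the mislabeling probability is still well above 5%.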

A.3 Understanding the models

In the previous subsection, we have seen that NBLR, LSTM and char-BERT all perform very similarly. Normally, a linear model such as NBLR is easier to understand than nonlinear models. By looking at the weights for the features, one can understand why a sample is predicted as positive or negative.

However, in the case of predicting gender from the name string, the features are character n-grams extracted from the same name, hence they are not independent. This feature correlation can give rise to negative weights for some features that are associated with male names, compensated by larger positive weights for other correlated features (as a reminder, we use label 1 for male and 0 for female); similarly, positive weights may be observed for features associated with female names. This makes model interpretation from feature weights difficult. For example, when looking at the female name “_anna_” (we add the prefix and suffix “_” to a name, see Sect. 3.2), we found that the 5-gram “anna_” has a positive weight of 0.36, which may come as a surprise, since one would expect a name ending in “anna” to be female. However, “anna_” contains the substring “nna_”, which has a negative weight of \(-1.19\). When combining the weights of all 17 character n-grams, together with the intercept, we end up with a total weight of \(-9.01\) and a male probability close to 0. Thus, looking at individual feature weights can be confusing and even misleading.
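The n-gram features of a padded name can be enumerated as follows. This is a sketch consistent with the “_anna_” example above (17 distinct grams); the maximum gram length of 5 is an assumption inferred from the “anna_” 5-gram:

```python
def char_ngrams(name, n_max=5):
    """Distinct character n-grams (lengths 1..n_max) of a boundary-padded name."""
    s = "_" + name.lower() + "_"
    return {s[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(s) - n + 1)}

grams = char_ngrams("anna")
# 17 distinct grams, including the competing features "anna_" and "nna_"
```

Under NBLR, the predicted male log-odds is the intercept plus the sum of the learned weights over these grams, which is how the total of \(-9.01\) arises despite the positive weight on “anna_”.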

Instead, it is more instructive to apply one of our models to real datasets and see how it behaves. We first apply NBLR to classify Chinese names. Note that the Chinese language has up to 100K characters, of which about 7000 are commonly used. A person’s given name could consist of any one or two characters. Therefore, the number of unique given names could be in the tens of millions (\(7000\times 7000=49\text {M}\)). It is difficult to collect enough instances of each of these possible variations and their associated gender. This is where a character-based model is especially well suited. As a sanity check, we looked at a list of 11 names from Lu (2018) that are considered especially beautiful because of their poetic connotation or phonetic harmony. Table 9 shows the male probability for each of these names, together with a “note” column giving their likely genders and an explanation, taken from Lu (2018). We observe that our classifier does assign higher scores to names that are definitely male than to those that are female.

Table 9 “Eleven Beautiful Chinese Names” from Lu (2018). P(M) is the probability of male, predicted by NBLR. The “label” column gives the likely gender, and the “note” column gives a description, both based on Lu (2018). Three of the eleven have frequency \(\ge 40\) and their genders (in brackets) could have been looked up
Table 10 Words with a high female/male score among: (a) 3,000 most frequent English words that are not in SSA Data; (b) 236,736 words in nltk.corpus.words that are neither in SSA Data nor in Yahoo Data. (c) S&P 500 companies

It is also helpful to understand the NBLR classifier by looking at English words that score highly as male or female. We first take the 3,000 most common English words (3000 most common words in english 2020), filter out the 443 words that are found in SSA Data (such as “Rose”), then list the 11 highest-scored male/female words in Table 10 (a). Interestingly, many of them make sense, e.g., “miss” or “women” are female words due to their meanings. However, we notice that the majority of these 3,000 words are found in Yahoo Data as names. Next, we take the 236,736 words in nltk.corpus.words and filter out any names in either SSA Data or Yahoo Data, ending up with 169,902 words. The highest-scoring masculine/feminine words are given in Table 10 (b). Many of the words in the “feminine” column of Table 10 (b) do look very feminine; for example, many end in “ia” (e.g., “anadenia”). Words on the right-hand side of the table are classified as masculine, likely due to their endings, such as “*boy”. Incidentally, “knappan” is a word in Malayalam (a Dravidian language primarily spoken in the south Indian state of Kerala) that means “a good for nothing guy”.

Table 10 (c) gives the most feminine and masculine company names among the S&P 500 companies (SP 500 Companies 2020). The results are also interesting. Corporations with feminine names include a number of ladies’ fashion companies (Tiffany, Ulta, Macy’s), while many with masculine names are in the engineering or energy sectors (e.g., Duke Energy, General Motors, General Electric).

Overall, the performance of NBLR on these three datasets illustrates that the model can extend well to rare or even unseen strings, and to non-English languages.


Cite this article

Hu, Y., Hu, C., Tran, T. et al. What’s in a name? – gender classification of names with character based machine learning models. Data Min Knowl Disc 35, 1537–1563 (2021). https://doi.org/10.1007/s10618-021-00748-6
