Mining, analyzing, and modeling text written on mobile devices

K. Vertanen; P.O. Kristensson

doi:10.1017/S1351324919000548

Published online by Cambridge University Press: 10 October 2019

K. Vertanen*: Michigan Technological University, Houghton, MI, USA
P.O. Kristensson: University of Cambridge, Cambridge, UK
*Corresponding author. Email: vertanen@mtu.edu

Abstract

We present a method for mining the web for text entered on mobile devices. Using searching, crawling, and parsing techniques, we locate text that can be reliably identified as originating from 300 mobile devices. This includes 341,000 sentences written on iPhones alone. Our data enables a richer understanding of how users type “in the wild” on their mobile devices. We compare text and error characteristics of different device types, such as touchscreen phones, phones with physical keyboards, and tablet computers. Using our mined data, we train language models and evaluate these models on mobile test data. A mixture model trained on our mined data, Twitter, blog, and forum data predicts mobile text better than baseline models. Using phone and smartwatch typing data from 135 users, we demonstrate our models improve the recognition accuracy and word predictions of a state-of-the-art touchscreen virtual keyboard decoder. Finally, we make our language models and mined dataset available to other researchers.

Keywords

Language resources; Corpus linguistics; Statistical methods; Text data mining

Type: Article
Information: Natural Language Engineering, Volume 27, Issue 1, January 2021, pp. 1–33

DOI: https://doi.org/10.1017/S1351324919000548
Copyright: © Cambridge University Press 2019
