-
A machine translation mechanism of Brazilian Portuguese to Libras with syntactic-semantic adequacy Nat. Lang. Eng. (IF 1.465) Pub Date : 2021-02-01 Manuella Aschoff C. B. Lima; Tiago Maritan U. de Araújo; Rostand E. O. Costa; Erickson S. Oliveira
Deaf people communicate naturally using visual-spatial languages, called sign languages (SLs). Although SLs are recognized as languages in many countries, the problems Deaf people face in accessing information remain. As a result, they have difficulties exercising their citizenship and accessing information in SLs, which usually leads to delays in language and knowledge acquisition. Some scientific
-
GPT-3: What’s it good for? Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-12-15 Robert Dale
GPT-3 made the mainstream media headlines this year, generating far more interest than we’d normally expect of a technical advance in NLP. People are fascinated by its ability to produce apparently novel text that reads as if it was written by a human. But what kind of practical applications can we expect to see, and can they be trusted?
-
Recent advances in processing negation Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-12-17 Roser Morante; Eduardo Blanco
Negation is a complex linguistic phenomenon present in all human languages. It can be seen as an operator that transforms an expression into another expression whose meaning is in some way opposed to the original expression. In this article, we survey previous work on negation with an emphasis on computational approaches. We start by defining negation and two important concepts: the scope and focus of negation
-
Exploiting native language interference for native language identification Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-11-26 Ilia Markov; Vivi Nastase; Carlo Strapparava
Native language identification (NLI)—the task of automatically identifying the native language (L1) of persons based on their writings in the second language (L2)—is based on the hypothesis that characteristics of L1 will surface and interfere in the production of texts in L2 to the extent that L1 is identifiable. We present an in-depth investigation of features that model a variety of linguistic phenomena
-
Natural language processing for similar languages, varieties, and dialects: A survey Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-11-20 Marcos Zampieri; Preslav Nakov; Yves Scherrer
There has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim of improving the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with a focus on computational methods for processing similar
-
Automatic classification of participant roles in cyberbullying: Can we detect victims, bullies, and bystanders in social media text? Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-11-18 Gilles Jacobs; Cynthia Van Hee; Véronique Hoste
Successful prevention of cyberbullying depends on the adequate detection of harmful messages. Given the impossibility of human moderation on the Social Web, intelligent systems are required to identify clues of cyberbullying automatically. Much work on cyberbullying detection focuses on detecting abusive language without analyzing the severity of the event or the participants involved. Automatic analysis
-
Comparison of rule-based and neural network models for negation detection in radiology reports Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-11-18 D. Sykes; A. Grivas; C. Grover; R. Tobin; C. Sudlow; W. Whiteley; A. Mcintosh; H. Whalley; B. Alex
Using natural language processing, it is possible to extract structured information from raw text in the electronic health record (EHR) at reasonably high accuracy. However, the accurate distinction between negated and non-negated mentions of clinical terms remains a challenge. EHR text includes cases where diseases are stated not to be present or only hypothesised, meaning a disease can be mentioned
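On the rule-based side of such a comparison, the classic ingredient is a trigger-and-scope matcher in the NegEx tradition: a negation cue negates target terms within a fixed window. The sketch below is an illustrative baseline of that kind, not the system evaluated in the paper; the trigger list and window size are assumptions.

```python
# Illustrative single-word negation triggers and a fixed scope window;
# real rule-based systems use curated multi-word trigger lists and
# syntax-aware scope rules.
NEG_TRIGGERS = {"no", "not", "without", "denies"}
SCOPE_WINDOW = 6  # tokens after a trigger treated as negated

def negated_terms(sentence: str, target_terms: set[str]) -> set[str]:
    """Return the target terms that fall inside a (crude) negation scope."""
    tokens = sentence.lower().replace(",", " ").split()
    negated = set()
    for i, tok in enumerate(tokens):
        if tok in NEG_TRIGGERS:
            scope = tokens[i + 1 : i + 1 + SCOPE_WINDOW]
            negated.update(t for t in target_terms if t in scope)
    return negated

print(negated_terms("CT head shows no acute infarct or haemorrhage",
                    {"infarct", "haemorrhage"}))
# -> {'infarct', 'haemorrhage'}
```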
-
Improving sentiment analysis with multi-task learning of negation Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-11-11 Jeremy Barnes; Erik Velldal; Lilja Øvrelid
Sentiment analysis is directly affected by compositional phenomena in language that act on the prior polarity of the words and phrases found in the text. Negation is the most prevalent of these phenomena, and in order to correctly predict sentiment, a classifier must be able to identify negation and disentangle the effect that its scope has on the final polarity of a text. This paper proposes a multi-task
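Multi-task setups of this kind typically share an encoder between a sentence-level sentiment head and a token-level negation-scope head, trained with a joint loss. The PyTorch sketch below shows that pattern only; all dimensions and the tag inventory are illustrative assumptions rather than the authors' configuration.

```python
import torch
import torch.nn as nn

class SharedSentimentNegation(nn.Module):
    """Shared BiLSTM encoder with a sentiment head and a negation-scope head.
    All sizes are illustrative assumptions, not the paper's configuration."""
    def __init__(self, vocab_size=10_000, emb_dim=100, hidden=128,
                 n_polarities=3, n_scope_tags=3):  # scope tags: O / cue / in-scope
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.sentiment_head = nn.Linear(2 * hidden, n_polarities)  # sentence level
        self.scope_head = nn.Linear(2 * hidden, n_scope_tags)      # token level

    def forward(self, token_ids):
        states, _ = self.encoder(self.embed(token_ids))             # (B, T, 2H)
        sentiment_logits = self.sentiment_head(states.mean(dim=1))  # (B, C)
        scope_logits = self.scope_head(states)                      # (B, T, S)
        return sentiment_logits, scope_logits

model = SharedSentimentNegation()
x = torch.randint(0, 10_000, (2, 12))        # batch of 2 sentences, 12 tokens
sent_logits, scope_logits = model(x)
print(sent_logits.shape, scope_logits.shape)  # [2, 3] and [2, 12, 3]
```

Training would minimize a weighted sum of the two cross-entropy losses, so the negation-scope signal shapes the representation the sentiment head sees.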
-
A note on constituent parsing for Korean Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-11-10 Mija Kim; Jungyeul Park
This study deals with widespread issues in constituent parsing for Korean, including quantitative and qualitative error analyses of parsing results. Previous treebank grammars have been accepted as interpretable across the various annotation schemes, whereas recent parsers turn out to be much harder for humans to interpret. This paper, therefore, intends to find the concrete typology of
-
A systematic review of unsupervised approaches to grammar induction Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-10-27 Vigneshwaran Muralidaran; Irena Spasić; Dawn Knight
This study systematically reviews existing approaches to unsupervised grammar induction in terms of their theoretical underpinnings, practical implementations and evaluation. Our motivation is to identify the influence of functional-cognitive schools of grammar on language processing models in computational linguistics. This is an effort to fill any gap between the theoretical school and the computational
-
Unsupervised Arabic dialect segmentation for machine translation Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-09-23 Wael Salloum; Nizar Habash
Resource-limited and morphologically rich languages pose many challenges to natural language processing tasks. Their highly inflected surface forms inflate the vocabulary size and increase sparsity in an already scarce data situation. In this article, we present an unsupervised learning approach to vocabulary reduction through morphological segmentation. We demonstrate its value in the context of machine
-
Improved feature decay algorithms for statistical machine translation Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-09-22 Alberto Poncelas; Gideon Maillette de Buy Wenniger; Andy Way
In machine-learning applications, data selection is of crucial importance if good runtime performance is to be achieved. In a scenario where the test set is accessible when the model is being built, training instances can be selected so they are the most relevant for the test set. Feature Decay Algorithms (FDA) are a technique for data selection that has demonstrated excellent performance in a number
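The mechanism behind FDA can be stated compactly: score each candidate training sentence by the test-set n-grams it covers, select greedily, and decay the value of an n-gram each time a sentence containing it is picked, so later selections add new coverage. A minimal sketch under simplifying assumptions (unigram/bigram features, a constant decay factor, length-normalized scores):

```python
from collections import Counter

def ngrams(tokens, n_max=2):
    """All 1..n_max-grams of a token list."""
    return [tuple(tokens[i:i + n]) for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def fda_select(train_sents, test_sents, k, decay=0.5):
    """Greedy FDA-style data selection: score training sentences by the
    test-set n-grams they cover, and decay an n-gram's value each time a
    sentence containing it is selected. Unigram/bigram features, a constant
    decay factor, and length normalization are simplifying assumptions."""
    value = Counter()
    for sent in test_sents:
        for g in set(ngrams(sent.split())):
            value[g] = 1.0
    pool, selected = list(train_sents), []
    for _ in range(min(k, len(pool))):
        best = max(pool, key=lambda s: sum(value[g] for g in ngrams(s.split()))
                   / max(len(s.split()), 1))
        selected.append(best)
        pool.remove(best)
        for g in ngrams(best.split()):
            value[g] *= decay  # the feature decay step
    return selected

print(fda_select(["the cat sat", "dogs bark loudly", "the cat ran"],
                 ["the cat is here"], k=2))
# -> ['the cat sat', 'the cat ran']
```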
-
Cluster-based mention typing for named entity disambiguation Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-08-20 Arda Çelebi; Arzucan Özgür
An entity mention in text such as “Washington” may correspond to many different named entities such as the city “Washington D.C.” or the newspaper “Washington Post.” The goal of named entity disambiguation (NED) is to identify the mentioned named entity correctly among all possible candidates. If the type (e.g., location or person) of a mentioned entity can be correctly predicted from the context,
-
Benchmarks and goals Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-08-10 Kenneth Ward Church
Benchmarks can be a useful step toward the goals of the field (when the benchmark is on the critical path), as demonstrated by the GLUE benchmark and deep nets such as BERT and ERNIE. The case for other benchmarks such as MUSE and WN18RR is less well established. Hopefully, these benchmarks are on a critical path toward progress on bilingual lexicon induction (BLI) and knowledge graph completion (KGC)
-
Incorporating word embeddings in unsupervised morphological segmentation Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-07-10 Ahmet Üstün; Burcu Can
We investigate the use of semantic information for morphological segmentation, since words that are derived from each other will remain semantically related. We use mathematical models such as maximum likelihood estimate (MLE) and maximum a posteriori estimate (MAP) by incorporating semantic information obtained from dense word vector representations. Our approach does not require any annotated data
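The semantic signal can be pictured as a similarity test between a word and a candidate stem: if the two stay close in embedding space, the split is plausible. The toy sketch below uses hard thresholding purely for illustration; in an MLE/MAP formulation the similarity would instead contribute to the likelihood or prior. The vectors and the threshold are invented.

```python
import numpy as np

# Toy embeddings; in practice these come from pretrained dense word vectors.
vectors = {
    "books": np.array([0.9, 0.1, 0.3]),
    "book":  np.array([0.88, 0.12, 0.31]),
    "boo":   np.array([0.1, 0.9, 0.2]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def plausible_split(word, boundary, threshold=0.7):
    """Accept a stem+suffix split only if the stem stays semantically close
    to the full word. The threshold is an illustrative assumption."""
    stem = word[:boundary]
    if word not in vectors or stem not in vectors:
        return False
    return cosine(vectors[word], vectors[stem]) >= threshold

print(plausible_split("books", 4))  # book + s  -> True
print(plausible_split("books", 3))  # boo  + ks -> False
```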
-
Automatic generation of lexica for sentiment polarity shifters Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-07-09 Marc Schulder; Michael Wiegand; Josef Ruppenhofer
Alleviating pain is good and abandoning hope is bad. We instinctively understand how words like alleviate and abandon affect the polarity of a phrase, inverting or weakening it. When these words are content words, such as verbs, nouns, and adjectives, we refer to them as polarity shifters. Shifters are a frequent occurrence in human language and an important part of successfully modeling negation in
-
Focus of negation: Its identification in Spanish Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-07-08 Mariona Taulé; Montserrat Nofre; Mónica González; Maria Antònia Martí
This article describes the criteria for identifying the focus of negation in Spanish. This work involved an in-depth linguistic analysis of the focus of negation through which we identified some 10 different types of criteria that account for a wide variety of constructions containing negation. These criteria account for all the cases that appear in the NewsCom corpus and were assessed in the annotation
-
Towards syntax-aware token embeddings Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-07-08 Diana Nicoleta Popa; Julien Perez; James Henderson; Eric Gaussier
Distributional semantic word representations are at the basis of most modern NLP systems. Their usefulness has been proven across various tasks, particularly as inputs to deep learning models. Beyond that, much work investigated fine-tuning the generic word embeddings to leverage linguistic knowledge from large lexical resources. Some work investigated context-dependent word token embeddings motivated
-
Negation detection for sentiment analysis: A case study in Spanish Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-07-07 Salud María Jiménez-Zafra; Noa P. Cruz-Díaz; Maite Taboada; María Teresa Martín-Valdivia
Accurate negation identification is one of the most important tasks in the context of sentiment analysis. In order to correctly interpret the sentiment value of a particular expression, we need to identify whether it is in the scope of negation. While much of the work on negation detection has focused on English, we have seen recent developments that provide accurate identification of negation in other
-
Linguistic knowledge-based vocabularies for Neural Machine Translation Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-07-02 Noe Casas; Marta R. Costa-jussà; José A. R. Fonollosa; Juan A. Alonso; Ramón Fanlo
Neural Networks applied to Machine Translation need a finite vocabulary to express textual information as a sequence of discrete tokens. The currently dominant subword vocabularies exploit statistically-discovered common parts of words to achieve the flexibility of character-based vocabularies without delegating the whole learning of word formation to the neural network. However, they trade this for
-
Supervised learning for the detection of negation and of its scope in French and Brazilian Portuguese biomedical corpora Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-06-30 Clément Dalloux; Vincent Claveau; Natalia Grabar; Lucas Emanuel Silva Oliveira; Claudia Maria Cabral Moro; Yohan Bonescki Gumiel; Deborah Ribeiro Carvalho
Automatic detection of negated content is often a prerequisite in information extraction systems in various domains. In the biomedical domain especially, this task is important because negation plays a central role. In this work, two main contributions are proposed. First, we work with languages that have been poorly addressed up to now: Brazilian Portuguese and French. Thus, we developed new corpora
-
Learning from noisy out-of-domain corpus using dataless classification Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-06-17 Yiping Jin; Dittaya Wanvarie; Phu T. V. Le
In real-world applications, text classification models often suffer from a lack of accurately labelled documents. The available labelled documents may also be out of domain, making the trained model not able to perform well in the target domain. In this work, we mitigate the data problem of text classification using a two-stage approach. First, we mine representative keywords from a noisy out-of-domain
-
Neural machine translation of low-resource languages using SMT phrase pair injection Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-06-17 Sukanta Sen; Mohammed Hasanuzzaman; Asif Ekbal; Pushpak Bhattacharyya; Andy Way
Neural machine translation (NMT) has recently shown promising results on publicly available benchmark datasets and is being rapidly adopted in various production systems. However, it requires a high-quality, large-scale parallel corpus, and it is not always possible to have a sufficiently large corpus, as building one requires time, money, and professionals. Hence, many existing large-scale parallel corpora are limited
-
Natural language generation: The commercial state of the art in 2020 Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-06-10 Robert Dale
It took a while, but natural language generation is now an established commercial software category. It’s commented upon frequently in both industry media and the mainstream press, and businesses are willing to pay hard cash to take advantage of the technology. We look at who’s active in the space, the nature of the technology that’s available today and where things might go in the future.
-
A clustering framework for lexical normalization of Roman Urdu Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-06-10 Abdul Rafae Khan; Asim Karim; Hassan Sajjad; Faisal Kamiran; Jia Xu
Roman Urdu is an informal form of the Urdu language written in Roman script, which is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges during automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora, which includes a phonetic algorithm
-
Improving speech emotion recognition based on acoustic words emotion dictionary Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-06-10 Wang Wei; Xinyi Cao; He Li; Lingjie Shen; Yaqin Feng; Paul A. Watters
To improve speech emotion recognition, a U-AWED feature model is proposed based on an acoustic words emotion dictionary (AWED). The method models emotional information at the acoustic-word level for different emotion classes. The top-listed words in each emotion class are selected to generate the AWED vector. Then, the U-AWED model is constructed by combining utterance-level acoustic features with the AWED
-
Imparting interpretability to word embeddings while preserving semantic structure Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-06-09 Lütfi Kerem Şenel; İhsan Utlu; Furkan Şahinuç; Haldun M. Ozaktas; Aykut Koç
As a ubiquitous method in natural language processing, word embeddings are extensively employed to map semantic properties of words into a dense vector representation. They capture semantic and syntactic relations among words, but the vectors corresponding to the words are only meaningful relative to each other. Neither the vector nor its dimensions have any absolute, interpretable meaning. We introduce
-
Computational generation of slogans Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-06-03 Khalid Alnajjar; Hannu Toivonen
In advertising, slogans are used to enhance the recall of the advertised product by consumers and to distinguish it from others in the market. Creating effective slogans is a resource-consuming task for humans. In this paper, we describe a novel method for automatically generating slogans, given a target concept (e.g., car) and an adjectival property to express (e.g., elegant) as input. Additionally
-
Universal Lemmatizer: A sequence-to-sequence model for lemmatizing Universal Dependencies treebanks Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-05-27 Jenna Kanerva; Filip Ginter; Tapio Salakoski
In this paper, we present a novel lemmatization method based on a sequence-to-sequence neural network architecture and morphosyntactic context representation. In the proposed method, our context-sensitive lemmatizer generates the lemma one character at a time based on the surface form characters and its morphosyntactic features obtained from a morphological tagger. We argue that a sliding window context
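The core idea, generating the lemma character by character from the surface characters plus morphosyntactic tags, can be pictured from the shape of the input and output sequences alone. A small sketch of one plausible serialization; the tag format and the Finnish example are illustrative assumptions, not necessarily the paper's exact encoding.

```python
def seq2seq_source(form: str, feats: dict) -> list:
    """Serialize a (surface form, morphosyntactic features) pair into the
    character-plus-tag sequence a context-sensitive lemmatizer would read.
    The tag format here is an assumption, not the paper's exact encoding."""
    return list(form) + [f"{k}={v}" for k, v in sorted(feats.items())]

source = seq2seq_source("kissoja", {"UPOS": "NOUN", "Case": "Par", "Number": "Plur"})
print(source)
# ['k', 'i', 's', 's', 'o', 'j', 'a', 'Case=Par', 'Number=Plur', 'UPOS=NOUN']
# The decoder would then emit the lemma one character at a time:
# ['k', 'i', 's', 's', 'a']  (Finnish 'kissoja' -> lemma 'kissa')
```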
-
Temporally anchored spatial knowledge: Corpora and experiments Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-05-20 Alakananda Vempala; Eduardo Blanco
This article presents a two-step methodology to annotate temporally anchored spatial knowledge on top of OntoNotes. We first generate potential knowledge using semantic roles or syntactic dependencies and then crowdsource annotations to validate the potential knowledge. The resulting annotations indicate how long entities are or are not located somewhere and temporally anchor this spatial information
-
Compression versus traditional machine learning classifiers to detect code-switching in varieties and dialects: Arabic as a case study Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-05-05 Taghreed Tarmom; William Teahan; Eric Atwell; Mohammad Ammar Alsalka
The occurrence of code-switching in online communication, when a writer switches among multiple languages, presents a challenge for natural language processing tools, since they are designed for texts written in a single language. To answer the challenge, this paper presents detailed research on ways to detect code-switching in Arabic text automatically. We compare the prediction by partial matching
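Compression-based classifiers such as prediction by partial matching (PPM) rest on the intuition that a text compresses best against data from its own class. The sketch below illustrates that intuition with zlib as a stand-in compressor; it is not an actual PPM model and not the system evaluated in the paper.

```python
import zlib

def compressed_size(text: str) -> int:
    return len(zlib.compress(text.encode("utf-8"), level=9))

def classify_by_compression(sample: str, class_corpora: dict[str, str]) -> str:
    """Assign `sample` to the class whose corpus it extends most cheaply:
    the smallest increase in compressed size when the sample is appended.
    zlib stands in for PPM here purely for illustration."""
    def extra_bytes(corpus):
        return compressed_size(corpus + " " + sample) - compressed_size(corpus)
    return min(class_corpora, key=extra_bytes)

corpora = {
    "arabic": "السلام عليكم كيف الحال اليوم",
    "mixed":  "yeah والله it was nice اليوم at work",
}
print(classify_by_compression("والله it was fun", corpora))  # likely 'mixed'
```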
-
Spoken Arabic dialect recognition using X-vectors Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-05-04 Abualsoud Hanani; Rabee Naser
This paper describes our automatic dialect identification system for recognizing four major Arabic dialects, as well as Modern Standard Arabic. We adapted the X-vector framework, which was originally developed for speaker recognition, to the task of Arabic dialect identification (ADI). The ADI VarDial 2018 and VarDial 2017 training and development sets were used to train and test all of our ADI systems
-
Is your document novel? Let attention guide you. An attention-based model for document-level novelty detection Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-04-24 Tirthankar Ghosal; Vignesh Edithal; Asif Ekbal; Pushpak Bhattacharyya; Srinivasa Satya Sameer Kumar Chivukula; George Tsatsaronis
Detecting whether a document contains sufficient new information to be deemed novel is of immense significance in this age of data duplication. Existing techniques for document-level novelty detection mostly perform at the lexical level and are unable to address semantic-level redundancy. These techniques usually rely on handcrafted features extracted from the documents in a rule-based or
-
Sentiment analysis in Turkish: Supervised, semi-supervised, and unsupervised techniques Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-04-17 Cem Rıfkı Aydın; Tunga Güngör
Although many studies on sentiment analysis have been carried out for widely spoken languages, this topic is still immature for Turkish. Most of the works in this language focus on supervised models, which necessitate comprehensive annotated corpora. There are a few unsupervised methods, and they utilize sentiment lexicons either built by translating from English lexicons or created based on corpora
-
Effective multi-dialectal arabic POS tagging Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-04-14 Kareem Darwish; Mohammed Attia; Hamdy Mubarak; Younes Samih; Ahmed Abdelali; Lluís Màrquez; Mohamed Eldesouki; Laura Kallmeyer
This work introduces robust multi-dialectal part-of-speech tagging trained on an annotated data set of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two different sequence tagging approaches. The first uses conditional random fields (CRFs), while the second combines word- and character-based representations in a deep neural network with stacked layers
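For the CRF branch of such a comparison, the usual recipe is per-token feature dictionaries (word identity plus character prefixes and suffixes, which matter for dialectal Arabic morphology) fed to a linear-chain CRF. A hedged sketch with sklearn-crfsuite; the feature set and the tiny transliterated toy data are illustrative, not the paper's.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(tokens, i):
    """Word identity plus character prefixes/suffixes and neighbors;
    an illustrative feature set, not the one used in the paper."""
    w = tokens[i]
    return {
        "word": w,
        "prefix2": w[:2],
        "suffix2": w[-2:],
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }

def sent2features(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]

# Tiny toy example (Latin transliteration stands in for Arabic script).
X_train = [sent2features(["ana", "baheb", "elqahwa"])]
y_train = [["PRON", "VERB", "NOUN"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict([sent2features(["ana", "baheb", "elshay"])]))
```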
-
Emerging trends: Subwords, seriously? Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-04-07 Kenneth Ward Church
Subwords have become very popular, but the BERT and ERNIE tokenizers often produce surprising results. Byte pair encoding (BPE) trains a dictionary with a simple information theoretic criterion that sidesteps the need for special treatment of unknown words. BPE is more about training (populating a dictionary of word pieces) than inference (parsing an unknown word into word pieces). The parse at inference
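The two phases the column contrasts can be made concrete in a few lines: training populates the dictionary by repeatedly merging the most frequent adjacent symbol pair, and inference greedily re-applies those merges to a new word. A minimal illustration of standard frequency-based BPE, not a production tokenizer.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(word_freqs, num_merges):
    """Learn BPE merges from a word-frequency dict; a minimal illustration."""
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(sym, best): f for sym, f in vocab.items()}
    return merges

def encode(word, merges):
    """Greedy inference: apply the learned merges in training order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

merges = train_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=4)
print(encode("lowly", merges))  # -> ('low', 'l', 'y', '</w>')
```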
-
Text classification with semantically enriched word embeddings Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-04-06 N. Pittaras; G. Giannakopoulos; G. Papadakis; V. Karkaletsis
The recent breakthroughs in deep neural architectures across multiple machine learning fields have led to the widespread use of deep neural models. These learners are often applied as black-box models that ignore or insufficiently utilize a wealth of preexisting semantic information. In this study, we focus on the text classification task, investigating methods for augmenting the input to deep neural
-
Investigating translated Chinese and its variants using machine learning Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-04-03 Hai Hu; Sandra Kübler
Translations are generally assumed to share universal features that distinguish them from texts that are originally written in the same language. Thus, we can argue that these translations constitute their own variety of a language, often called translationese. However, translations are also influenced by their source languages and thus show different characteristics depending on the source language
-
Syntax-ignorant N-gram embeddings for dialectal Arabic sentiment analysis Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-03-16 Hala Mulki; Hatem Haddad; Mourad Gridach; Ismail Babaoğlu
Arabic sentiment analysis models have recently employed compositional paragraph or sentence embedding features to represent the informal Arabic dialectal content. These embeddings are mostly composed via ordered, syntax-aware composition functions and learned within deep neural network architectures. With the differences in syntactic structure and word order among the Arabic dialects, a sentiment
-
Fine-grained analysis of language varieties and demographics Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-03-10 Francisco Rangel; Paolo Rosso; Wajdi Zaghouani; Anis Charfi
The rise of social media empowers people to interact and communicate with anyone anywhere in the world. The possibility of being anonymous avoids censorship and enables freedom of expression. Nevertheless, this anonymity might lead to cybersecurity issues, such as opinion spam, sexual harassment, incitement to hatred or even terrorism propaganda. In such cases, there is a need to know more about the
-
Classification of regional and genre varieties of Chinese: A correspondence analysis approach based on comparable balanced corpora Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-03-09 Renkui Hou; Chu-Ren Huang
This paper proposes a robust text classification and correspondence analysis approach to identification of similar languages. In particular, we propose to use the readily available information of clauses and word length distribution to model similar languages. The modeling and classification are based on the hypothesis that languages are self-adaptive complex systems and hence can be classified by
-
Nonuniform language in technical writing: Detection and correction Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-03-06 Weibo Wang; Aminul Islam; Abidalrahman Moh’d; Axel J. Soto; Evangelos E. Milios
Technical writing in professional environments, such as user manual authoring, requires the use of uniform language. Nonuniform language refers to sentences in a technical document that are intended to have the same meaning within a similar context, but use different words or writing style. Addressing this nonuniformity problem requires the performance of two tasks. The first task, which we named nonuniform
-
Unsupervised modeling anomaly detection in discussion forums posts using global vectors for text representation Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-03-04 Paweł Cichosz
Anomaly detection can be seen as an unsupervised learning task in which a predictive model created on historical data is used to detect outlying instances in new data. This work addresses a possibly promising but relatively uncommon application of anomaly detection to text data. Two English-language and one Polish-language Internet discussion forums devoted to psychoactive substances received from home-grown
-
Learning to rank for multi-label text classification: Combining different sources of information Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-02-18 Hosein Azarbonyad; Mostafa Dehghani; Maarten Marx; Jaap Kamps
Efficiently exploiting all sources of information, such as labeled instances, class representations, and the relations among them, has a high impact on the performance of Multi-Label Text Classification (MLTC) systems. Most current approaches use labeled documents as the primary source of information for MLTC. We investigate the effectiveness of different sources of information—such as the labeled
-
Constrained BERT BiLSTM CRF for understanding multi-sentence entity-seeking questions Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-02-13 Danish Contractor; Barun Patra; Mausam; Parag Singla
We present the novel task of understanding multi-sentence entity-seeking questions (MSEQs), that is, the questions that may be expressed in multiple sentences, and that expect one or more entities as an answer. We formulate the problem of understanding MSEQs as a semantic labeling task over an open representation that makes minimal assumptions about schema or ontology-specific semantic vocabulary.
-
Transfer learning for Turkish named entity recognition on noisy text Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-01-28 Emre Kağan Akkaya; Burcu Can
In this article, we investigate using deep neural networks with different word representation techniques for named entity recognition (NER) on Turkish noisy text. We argue that valuable latent features for NER can, in fact, be learned without using any hand-crafted features and/or domain-specific resources such as gazetteers and lexicons. In this regard, we utilize character-level, character n-gram-level
-
Two approaches to compilation of bilingual multi-word terminology lists from lexical resources Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-01-28 Branislava Šandrih; Cvetana Krstev; Ranka Stanković
In this paper, we present two approaches and the implemented system for bilingual terminology extraction that rely on an aligned bilingual domain corpus, a terminology extractor for a target language, and a tool for chunk alignment. The two approaches differ in the way terminology for the source language is obtained: the first relies on an existing domain terminology lexicon, while the second one uses
-
It all starts with entities: A Salient entity topic model Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-11-22 Chuan Wu; Evangelos Kanoulas; Maarten de Rijke
Entities play an essential role in understanding textual documents, regardless of whether the documents are short, such as tweets, or long, such as news articles. In short textual documents, all entities mentioned are usually considered equally important because of the limited amount of information. In long textual documents, however, not all entities are equally important: some are salient and others
-
Keyword extraction: Issues and methods Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-11-11 Nazanin Firoozeh; Adeline Nazarenko; Fabrice Alizon; Béatrice Daille
Due to the considerable growth of the volume of text documents on the Internet and in digital libraries, manual analysis of these documents is no longer feasible. Having efficient approaches to keyword extraction in order to retrieve the ‘key’ elements of the studied documents is now a necessity. Keyword extraction has been an active research field for many years, covering various applications in Text
-
Mining, analyzing, and modeling text written on mobile devices Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-10-10 K. Vertanen; P.O. Kristensson
We present a method for mining the web for text entered on mobile devices. Using searching, crawling, and parsing techniques, we locate text that can be reliably identified as originating from 300 mobile devices. This includes 341,000 sentences written on iPhones alone. Our data enables a richer understanding of how users type “in the wild” on their mobile devices. We compare text and error characteristics
-
Uncovering the language of wine experts Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-09-23 Ilja Croijmans; Iris Hendrickx; Els Lefever; Asifa Majid; Antal Van Den Bosch
Talking about odors and flavors is difficult for most people, yet experts appear to be able to convey critical information about wines in their reviews. This seems to be a contradiction, and wine expert descriptions are frequently received with criticism. Here, we propose a method for probing the language of wine reviews, and thus offer a means to enhance current vocabularies, and as a by-product question
-
Word sense disambiguation using implicit information Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-09-13 Goonjan Jain; D.K. Lobiyal
Humans proficiently interpret the true sense of an ambiguous word by establishing association among words in a sentence. The complete sense of text is also based on implicit information, which is not explicitly mentioned. The absence of this implicit information is a significant problem for a computer program that attempts to determine the correct sense of ambiguous words. In this paper, we propose
-
Learning keyphrases from corpora and knowledge models Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-09-10 R. Silveira; V. Furtado; V. Pinheiro
Keyphrase extraction systems traditionally use classification algorithms and do not consider the fact that some of the keyphrases may not be found in the text, which reduces the accuracy of such algorithms a priori. In this work, we propose to improve the accuracy of these systems with inferential mechanisms that use a knowledge representation model, including symbolic models of knowledge bases and distributional
-
Tackling challenges of neural purchase stage identification from imbalanced twitter data Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-08-15 Heike Adel; Francine Chen; Yan-Ying Chen
Twitter and other social media platforms are often used for sharing interest in products. The identification of purchase decision stages, such as in the AIDA model (Awareness, Interest, Desire, and Action), can enable more personalized e-commerce services and a finer-grained targeting of advertisements than predicting purchase intent only. In this paper, we propose and analyze neural models for identifying
-
Domain bias in distinguishing Flemish and Dutch subtitles Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-08-15 Hans van Halteren
This paper describes experiments in which I tried to distinguish between Flemish and Netherlandic Dutch subtitles, as originally proposed in the VarDial 2018 Dutch–Flemish Subtitle task. However, rather than using all data as a monolithic block, I divided them into two non-overlapping domains and then investigated how the relation between training and test domains influences the recognition quality
-
Using linguistically defined specific details to detect deception across domains Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-08-01 Nikolai Vogler; Lisa Pearl
Current automatic deception detection approaches tend to rely on cues that are based either on specific lexical items or on linguistically abstract features that are not necessarily motivated by the psychology of deception. Notably, while approaches relying on such features can do well when the content domain is similar for training and testing, they suffer when content changes occur. We investigate
-
Measuring diachronic language distance using perplexity: Application to English, Portuguese, and Spanish Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-07-24 José Ramom Pichel Campos; Pablo Gamallo Otero; Iñaki Alegria Loinaz
The objective of this work is to establish a corpus-driven methodology for automatically quantifying diachronic language distance between chronological periods of several languages. We apply a perplexity-based measure to written text representing different historical periods of three languages: European English, European Portuguese, and European Spanish. For this purpose, we have built historical corpora for
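A perplexity-based distance of this kind can be sketched by training a simple character n-gram language model on one period and measuring its perplexity on text from another, then symmetrizing. The n-gram order and add-alpha smoothing below are illustrative assumptions, not the paper's configuration.

```python
import math
from collections import Counter

def char_ngram_counts(text, n):
    padded = " " * (n - 1) + text
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def perplexity(train_text, test_text, n=3, alpha=1.0):
    """Perplexity of an add-alpha-smoothed character n-gram model trained on
    train_text, evaluated on test_text. Order and smoothing are illustrative."""
    ngrams = char_ngram_counts(train_text, n)
    contexts = char_ngram_counts(train_text, n - 1)
    vocab_size = len(set(train_text) | set(test_text) | {" "})
    padded = " " * (n - 1) + test_text
    log_prob, count = 0.0, 0
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        p = (ngrams[gram] + alpha) / (contexts[gram[:-1]] + alpha * vocab_size)
        log_prob += math.log(p)
        count += 1
    return math.exp(-log_prob / count)

def language_distance(period_a, period_b, n=3):
    """Symmetrized perplexity-based distance between two text samples."""
    return 0.5 * (perplexity(period_a, period_b, n) +
                  perplexity(period_b, period_a, n))

print(language_distance("thou art my brother in armes",
                        "you are my brother in arms"))
```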
-
Detecting light verb constructions across languages Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-07-15 István Nagy T.; Anita Rácz; Veronika Vincze
Light verb constructions (LVCs) are verb and noun combinations in which the verb has lost its meaning to some degree and the noun is used in one of its original senses, typically denoting an event or an action. They exhibit special linguistic features, especially when regarded in a multilingual context. In this paper, we focus on the automatic detection of LVCs in raw text in four different languages
Contents have been reproduced by permission of the publishers.