Current journal: Natural Language Engineering
  • Cluster-based mention typing for named entity disambiguation
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-08-20
    Arda Çelebi; Arzucan Özgür

    An entity mention in text such as “Washington” may correspond to many different named entities such as the city “Washington D.C.” or the newspaper “Washington Post.” The goal of named entity disambiguation (NED) is to identify the mentioned named entity correctly among all possible candidates. If the type (e.g., location or person) of a mentioned entity can be correctly predicted from the context,

  • Benchmarks and goals
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-08-10
    Kenneth Ward Church

    Benchmarks can be a useful step toward the goals of the field (when the benchmark is on the critical path), as demonstrated by the GLUE benchmark, and deep nets such as BERT and ERNIE. The case for other benchmarks such as MUSE and WN18RR is less well established. Hopefully, these benchmarks are on a critical path toward progress on bilingual lexicon induction (BLI) and knowledge graph completion (KGC)

  • Incorporating word embeddings in unsupervised morphological segmentation
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-07-10
    Ahmet Üstün; Burcu Can

    We investigate the usage of semantic information for morphological segmentation since words that are derived from each other will remain semantically related. We use mathematical models such as maximum likelihood estimate (MLE) and maximum a posteriori estimate (MAP) by incorporating semantic information obtained from dense word vector representations. Our approach does not require any annotated data

  • Automatic generation of lexica for sentiment polarity shifters
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-07-09
    Marc Schulder; Michael Wiegand; Josef Ruppenhofer

    Alleviating pain is good and abandoning hope is bad. We instinctively understand how words like alleviate and abandon affect the polarity of a phrase, inverting or weakening it. When these words are content words, such as verbs, nouns, and adjectives, we refer to them as polarity shifters. Shifters are a frequent occurrence in human language and an important part of successfully modeling negation in

  • Focus of negation: Its identification in Spanish
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-07-08
    Mariona Taulé; Montserrat Nofre; Mónica González; Maria Antònia Martí

    This article describes the criteria for identifying the focus of negation in Spanish. This work involved an in-depth linguistic analysis of the focus of negation through which we identified some 10 different types of criteria that account for a wide variety of constructions containing negation. These criteria account for all the cases that appear in the NewsCom corpus and were assessed in the annotation

  • Towards syntax-aware token embeddings
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-07-08
    Diana Nicoleta Popa; Julien Perez; James Henderson; Eric Gaussier

    Distributional semantic word representations are at the basis of most modern NLP systems. Their usefulness has been proven across various tasks, particularly as inputs to deep learning models. Beyond that, much work investigated fine-tuning the generic word embeddings to leverage linguistic knowledge from large lexical resources. Some work investigated context-dependent word token embeddings motivated

  • Negation detection for sentiment analysis: A case study in Spanish
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-07-07
    Salud María Jiménez-Zafra; Noa P. Cruz-Díaz; Maite Taboada; María Teresa Martín-Valdivia

    Accurate negation identification is one of the most important tasks in the context of sentiment analysis. In order to correctly interpret the sentiment value of a particular expression, we need to identify whether it is in the scope of negation. While much of the work on negation detection has focused on English, we have seen recent developments that provide accurate identification of negation in other

  • Linguistic knowledge-based vocabularies for Neural Machine Translation
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-07-02
    Noe Casas; Marta R. Costa-jussà; José A. R. Fonollosa; Juan A. Alonso; Ramón Fanlo

    Neural Networks applied to Machine Translation need a finite vocabulary to express textual information as a sequence of discrete tokens. The currently dominant subword vocabularies exploit statistically-discovered common parts of words to achieve the flexibility of character-based vocabularies without delegating the whole learning of word formation to the neural network. However, they trade this for

  • Supervised learning for the detection of negation and of its scope in French and Brazilian Portuguese biomedical corpora
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-06-30
    Clément Dalloux; Vincent Claveau; Natalia Grabar; Lucas Emanuel Silva Oliveira; Claudia Maria Cabral Moro; Yohan Bonescki Gumiel; Deborah Ribeiro Carvalho

    Automatic detection of negated content is often a prerequisite in information extraction systems in various domains. In the biomedical domain especially, this task is important because negation plays an important role. In this work, two main contributions are proposed. First, we work with languages which have been poorly addressed up to now: Brazilian Portuguese and French. Thus, we developed new corpora

  • Learning from noisy out-of-domain corpus using dataless classification
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-06-17
    Yiping Jin; Dittaya Wanvarie; Phu T. V. Le

    In real-world applications, text classification models often suffer from a lack of accurately labelled documents. The available labelled documents may also be out of domain, making the trained model not able to perform well in the target domain. In this work, we mitigate the data problem of text classification using a two-stage approach. First, we mine representative keywords from a noisy out-of-domain

  • Neural machine translation of low-resource languages using SMT phrase pair injection
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-06-17
    Sukanta Sen; Mohammed Hasanuzzaman; Asif Ekbal; Pushpak Bhattacharyya; Andy Way

    Neural machine translation (NMT) has recently shown promising results on publicly available benchmark datasets and is being rapidly adopted in various production systems. However, it requires high-quality large-scale parallel corpus, and it is not always possible to have sufficiently large corpus as it requires time, money, and professionals. Hence, many existing large-scale parallel corpus are limited

  • Natural language generation: The commercial state of the art in 2020
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-06-10
    Robert Dale

    It took a while, but natural language generation is now an established commercial software category. It’s commented upon frequently in both industry media and the mainstream press, and businesses are willing to pay hard cash to take advantage of the technology. We look at who’s active in the space, the nature of the technology that’s available today and where things might go in the future.

  • A clustering framework for lexical normalization of Roman Urdu
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-06-10
    Abdul Rafae Khan; Asim Karim; Hassan Sajjad; Faisal Kamiran; Jia Xu

    Roman Urdu is an informal form of the Urdu language written in Roman script, which is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges during automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora, which includes a phonetic algorithm

  • Improving speech emotion recognition based on acoustic words emotion dictionary
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-06-10
    Wang Wei; Xinyi Cao; He Li; Lingjie Shen; Yaqin Feng; Paul A. Watters

    To improve speech emotion recognition, a U-acoustic words emotion dictionary (AWED) features model is proposed based on an AWED. The method models emotional information from acoustic words level in different emotion classes. The top-list words in each emotion are selected to generate the AWED vector. Then, the U-AWED model is constructed by combining utterance-level acoustic features with the AWED

  • Imparting interpretability to word embeddings while preserving semantic structure
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-06-09
    Lütfi Kerem Şenel; İhsan Utlu; Furkan Şahinuç; Haldun M. Ozaktas; Aykut Koç

    As a ubiquitous method in natural language processing, word embeddings are extensively employed to map semantic properties of words into a dense vector representation. They capture semantic and syntactic relations among words, but the vectors corresponding to the words are only meaningful relative to each other. Neither the vector nor its dimensions have any absolute, interpretable meaning. We introduce

  • Computational generation of slogans
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-06-03
    Khalid Alnajjar; Hannu Toivonen

    In advertising, slogans are used to enhance the recall of the advertised product by consumers and to distinguish it from others in the market. Creating effective slogans is a resource-consuming task for humans. In this paper, we describe a novel method for automatically generating slogans, given a target concept (e.g., car) and an adjectival property to express (e.g., elegant) as input. Additionally

  • Universal Lemmatizer: A sequence-to-sequence model for lemmatizing Universal Dependencies treebanks
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-05-27
    Jenna Kanerva; Filip Ginter; Tapio Salakoski

    In this paper, we present a novel lemmatization method based on a sequence-to-sequence neural network architecture and morphosyntactic context representation. In the proposed method, our context-sensitive lemmatizer generates the lemma one character at a time based on the surface form characters and its morphosyntactic features obtained from a morphological tagger. We argue that a sliding window context

  • Temporally anchored spatial knowledge: Corpora and experiments
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-05-20
    Alakananda Vempala; Eduardo Blanco

    This article presents a two-step methodology to annotate temporally anchored spatial knowledge on top of OntoNotes. We first generate potential knowledge using semantic roles or syntactic dependencies and then crowdsource annotations to validate the potential knowledge. The resulting annotations indicate how long entities are or are not located somewhere and temporally anchor this spatial information

  • Compression versus traditional machine learning classifiers to detect code-switching in varieties and dialects: Arabic as a case study
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-05-05
    Taghreed Tarmom; William Teahan; Eric Atwell; Mohammad Ammar Alsalka

    The occurrence of code-switching in online communication, when a writer switches among multiple languages, presents a challenge for natural language processing tools, since they are designed for texts written in a single language. To answer the challenge, this paper presents detailed research on ways to detect code-switching in Arabic text automatically. We compare the prediction by partial matching

  • Spoken Arabic dialect recognition using X-vectors
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-05-04
    Abualsoud Hanani; Rabee Naser

    This paper describes our automatic dialect identification system for recognizing four major Arabic dialects, as well as Modern Standard Arabic. We adapted the X-vector framework, which was originally developed for speaker recognition, to the task of Arabic dialect identification (ADI). The training and development ADI VarDial 2018 and VarDial 2017 were used to train and test all of our ADI systems

  • Is your document novel? Let attention guide you. An attention-based model for document-level novelty detection
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-04-24
    Tirthankar Ghosal; Vignesh Edithal; Asif Ekbal; Pushpak Bhattacharyya; Srinivasa Satya Sameer Kumar Chivukula; George Tsatsaronis

    Detecting, whether a document contains sufficient new information to be deemed as novel, is of immense significance in this age of data duplication. Existing techniques for document-level novelty detection mostly perform at the lexical level and are unable to address the semantic-level redundancy. These techniques usually rely on handcrafted features extracted from the documents in a rule-based or

  • Sentiment analysis in Turkish: Supervised, semi-supervised, and unsupervised techniques
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-04-17
    Cem Rıfkı Aydın; Tunga Güngör

    Although many studies on sentiment analysis have been carried out for widely spoken languages, this topic is still immature for Turkish. Most of the works in this language focus on supervised models, which necessitate comprehensive annotated corpora. There are a few unsupervised methods, and they utilize sentiment lexicons either built by translating from English lexicons or created based on corpora

  • Effective multi-dialectal Arabic POS tagging
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-04-14
    Kareem Darwish; Mohammed Attia; Hamdy Mubarak; Younes Samih; Ahmed Abdelali; Lluís Màrquez; Mohamed Eldesouki; Laura Kallmeyer

    This work introduces robust multi-dialectal part of speech tagging trained on an annotated data set of Arabic tweets in four major dialect groups: Egyptian, Levantine, Gulf, and Maghrebi. We implement two different sequence tagging approaches. The first uses conditional random fields (CRFs), while the second combines word- and character-based representations in a deep neural network with stacked layers

  • Emerging trends: Subwords, seriously?
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-04-07
    Kenneth Ward Church

    Subwords have become very popular, but the BERT and ERNIE tokenizers often produce surprising results. Byte pair encoding (BPE) trains a dictionary with a simple information theoretic criterion that sidesteps the need for special treatment of unknown words. BPE is more about training (populating a dictionary of word pieces) than inference (parsing an unknown word into word pieces). The parse at inference

  • Text classification with semantically enriched word embeddings
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-04-06
    N. Pittaras; G. Giannakopoulos; G. Papadakis; V. Karkaletsis

    The recent breakthroughs in deep neural architectures across multiple machine learning fields have led to the widespread use of deep neural models. These learners are often applied as black-box models that ignore or insufficiently utilize a wealth of preexisting semantic information. In this study, we focus on the text classification task, investigating methods for augmenting the input to deep neural

  • Investigating translated Chinese and its variants using machine learning
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-04-03
    Hai Hu; Sandra Kübler

    Translations are generally assumed to share universal features that distinguish them from texts that are originally written in the same language. Thus, we can argue that these translations constitute their own variety of a language, often called translationese. However, translations are also influenced by their source languages and thus show different characteristics depending on the source language

  • Syntax-ignorant N-gram embeddings for dialectal Arabic sentiment analysis
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-03-16
    Hala Mulki; Hatem Haddad; Mourad Gridach; Ismail Babaoğlu

    Arabic sentiment analysis models have recently employed compositional paragraph or sentence embedding features to represent the informal Arabic dialectal content. These embeddings are mostly composed via ordered, syntax-aware composition functions and learned within deep neural network architectures. With the differences in the syntactic structure and words’ order among the Arabic dialects, a sentiment

  • Fine-grained analysis of language varieties and demographics
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-03-10
    Francisco Rangel; Paolo Rosso; Wajdi Zaghouani; Anis Charfi

    The rise of social media empowers people to interact and communicate with anyone anywhere in the world. The possibility of being anonymous avoids censorship and enables freedom of expression. Nevertheless, this anonymity might lead to cybersecurity issues, such as opinion spam, sexual harassment, incitement to hatred or even terrorism propaganda. In such cases, there is a need to know more about the

  • Classification of regional and genre varieties of Chinese: A correspondence analysis approach based on comparable balanced corpora
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-03-09
    Renkui Hou; Chu-Ren Huang

    This paper proposes a robust text classification and correspondence analysis approach to identification of similar languages. In particular, we propose to use the readily available information of clauses and word length distribution to model similar languages. The modeling and classification are based on the hypothesis that languages are self-adaptive complex systems and hence can be classified by

  • Nonuniform language in technical writing: Detection and correction
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-03-06
    Weibo Wang; Aminul Islam; Abidalrahman Moh’d; Axel J. Soto; Evangelos E. Milios

    Technical writing in professional environments, such as user manual authoring, requires the use of uniform language. Nonuniform language refers to sentences in a technical document that are intended to have the same meaning within a similar context, but use different words or writing style. Addressing this nonuniformity problem requires the performance of two tasks. The first task, which we named nonuniform

  • Unsupervised modeling anomaly detection in discussion forums posts using global vectors for text representation
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-03-04
    Paweł Cichosz

    Anomaly detection can be seen as an unsupervised learning task in which a predictive model created on historical data is used to detect outlying instances in new data. This work addresses possibly promising but relatively uncommon application of anomaly detection to text data. Two English-language and one Polish-language Internet discussion forums devoted to psychoactive substances received from home-grown

  • Learning to rank for multi-label text classification: Combining different sources of information
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-02-18
    Hosein Azarbonyad; Mostafa Dehghani; Maarten Marx; Jaap Kamps

    Efficiently exploiting all sources of information such as labeled instances, classes’ representation, and relations of them has a high impact on the performance of Multi-Label Text Classification (MLTC) systems. Most of the current approaches use labeled documents as the primary source of information for MLTC. We investigate the effectiveness of different sources of information— such as the labeled

  • Constrained BERT BiLSTM CRF for understanding multi-sentence entity-seeking questions
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-02-13
    Danish Contractor; Barun Patra; Mausam; Parag Singla

    We present the novel task of understanding multi-sentence entity-seeking questions (MSEQs), that is, the questions that may be expressed in multiple sentences, and that expect one or more entities as an answer. We formulate the problem of understanding MSEQs as a semantic labeling task over an open representation that makes minimal assumptions about schema or ontology-specific semantic vocabulary.

  • Transfer learning for Turkish named entity recognition on noisy text
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-01-28
    Emre Kağan Akkaya; Burcu Can

    In this article, we investigate using deep neural networks with different word representation techniques for named entity recognition (NER) on Turkish noisy text. We argue that valuable latent features for NER can, in fact, be learned without using any hand-crafted features and/or domain-specific resources such as gazetteers and lexicons. In this regard, we utilize character-level, character n-gram-level

  • Two approaches to compilation of bilingual multi-word terminology lists from lexical resources
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2020-01-28
    Branislava Šandrih; Cvetana Krstev; Ranka Stanković

    In this paper, we present two approaches and the implemented system for bilingual terminology extraction that rely on an aligned bilingual domain corpus, a terminology extractor for a target language, and a tool for chunk alignment. The two approaches differ in the way terminology for the source language is obtained: the first relies on an existing domain terminology lexicon, while the second one uses

  • It all starts with entities: A Salient entity topic model
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-11-22
    Chuan Wu; Evangelos Kanoulas; Maarten de Rijke

    Entities play an essential role in understanding textual documents, regardless of whether the documents are short, such as tweets, or long, such as news articles. In short textual documents, all entities mentioned are usually considered equally important because of the limited amount of information. In long textual documents, however, not all entities are equally important: some are salient and others

  • Keyword extraction: Issues and methods
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-11-11
    Nazanin Firoozeh; Adeline Nazarenko; Fabrice Alizon; Béatrice Daille

    Due to the considerable growth of the volume of text documents on the Internet and in digital libraries, manual analysis of these documents is no longer feasible. Having efficient approaches to keyword extraction in order to retrieve the ‘key’ elements of the studied documents is now a necessity. Keyword extraction has been an active research field for many years, covering various applications in Text

  • Natural discourse reference generation reduces cognitive load in spoken systems
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2011-07-01
    E Campana; M K Tanenhaus; J F Allen; R Remington

    The generation of referring expressions is a central topic in computational linguistics. Natural referring expressions - both definite references like 'the baseball cap' and pronouns like 'it' - are dependent on discourse context. We examine the practical implications of context-dependent referring expression generation for the design of spoken systems. Currently, not all spoken systems have the goal

  • Mining, analyzing, and modeling text written on mobile devices
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-10-10
    K. Vertanen; P.O. Kristensson

    We present a method for mining the web for text entered on mobile devices. Using searching, crawling, and parsing techniques, we locate text that can be reliably identified as originating from 300 mobile devices. This includes 341,000 sentences written on iPhones alone. Our data enables a richer understanding of how users type “in the wild” on their mobile devices. We compare text and error characteristics

  • Uncovering the language of wine experts
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-09-23
    Ilja Croijmans; Iris Hendrickx; Els Lefever; Asifa Majid; Antal Van Den Bosch

    Talking about odors and flavors is difficult for most people, yet experts appear to be able to convey critical information about wines in their reviews. This seems to be a contradiction, and wine expert descriptions are frequently received with criticism. Here, we propose a method for probing the language of wine reviews, and thus offer a means to enhance current vocabularies, and as a by-product question

  • Word sense disambiguation using implicit information
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-09-13
    Goonjan Jain; D.K. Lobiyal

    Humans proficiently interpret the true sense of an ambiguous word by establishing association among words in a sentence. The complete sense of text is also based on implicit information, which is not explicitly mentioned. The absence of this implicit information is a significant problem for a computer program that attempts to determine the correct sense of ambiguous words. In this paper, we propose

  • Learning keyphrases from corpora and knowledge models
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-09-10
    R. Silveira; V. Furtado; V. Pinheiro

    Extraction keyphrase systems traditionally use classification algorithms and do not consider the fact that part of the keyphrases may not be found in the text, reducing the accuracy of such algorithms a priori. In this work, we propose to improve the accuracy of these systems with inferential mechanisms that use a knowledge representation model, including symbolic models of knowledge bases and distributional

  • Tackling challenges of neural purchase stage identification from imbalanced Twitter data
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-08-15
    Heike Adel; Francine Chen; Yan-Ying Chen

    Twitter and other social media platforms are often used for sharing interest in products. The identification of purchase decision stages, such as in the AIDA model (Awareness, Interest, Desire, and Action), can enable more personalized e-commerce services and a finer-grained targeting of advertisements than predicting purchase intent only. In this paper, we propose and analyze neural models for identifying

  • Domain bias in distinguishing Flemish and Dutch subtitles
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-08-15
    Hans van Halteren

    This paper describes experiments in which I tried to distinguish between Flemish and Netherlandic Dutch subtitles, as originally proposed in the VarDial 2018 Dutch–Flemish Subtitle task. However, rather than using all data as a monolithic block, I divided them into two non-overlapping domains and then investigated how the relation between training and test domains influences the recognition quality

  • Using linguistically defined specific details to detect deception across domains
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-08-01
    Nikolai Vogler; Lisa Pearl

    Current automatic deception detection approaches tend to rely on cues that are based either on specific lexical items or on linguistically abstract features that are not necessarily motivated by the psychology of deception. Notably, while approaches relying on such features can do well when the content domain is similar for training and testing, they suffer when content changes occur. We investigate

  • Measuring diachronic language distance using perplexity: Application to English, Portuguese, and Spanish
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-07-24
    José Ramom Pichel Campos; Pablo Gamallo Otero; Iñaki Alegria Loinaz

    The objective of this work is to set a corpus-driven methodology to quantify automatically diachronic language distance between chronological periods of several languages. We apply a perplexity-based measure to written text representing different historical periods of three languages: European English, European Portuguese, and European Spanish. For this purpose, we have built historical corpora for

  • Detecting light verb constructions across languages
    Nat. Lang. Eng. (IF 1.465) Pub Date : 2019-07-15
    István Nagy T.; Anita Rácz; Veronika Vincze

    Light verb constructions (LVCs) are verb and noun combinations in which the verb has lost its meaning to some degree and the noun is used in one of its original senses, typically denoting an event or an action. They exhibit special linguistic features, especially when regarded in a multilingual context. In this paper, we focus on the automatic detection of LVCs in raw text in four different languages

Contents have been reproduced by permission of the publishers.