  • ChoCo: a multimodal corpus of the Choctaw language
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-31
    Jacqueline Brixey, Ron Artstein

    This article presents a general use corpus for Choctaw, an American indigenous language (ISO 639-2: cho, endonym: Chahta). The corpus contains audio, video, and text resources, with many texts also translated into English. The Oklahoma Choctaw and the Mississippi Choctaw variants of the language are represented in the corpus. The data set provides documentation support for this threatened language, and

  • Developing computational infrastructure for the CorCenCC corpus: The National Corpus of Contemporary Welsh
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-29
    Dawn Knight, Fernando Loizides, Steven Neale, Laurence Anthony, Irena Spasić

    CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes—National Corpus of Contemporary Welsh) is the first comprehensive corpus of Welsh designed to be reflective of language use across communication types, genres, speakers, language varieties (regional and social) and contexts. This article focuses on the computational infrastructure that we have designed to support data collection for CorCenCC, and the subsequent

  • Some consideration on expressive audiovisual speech corpus acquisition using a multimodal platform
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-26
    Sara Dahmani, Vincent Colotte, Slim Ouni

    In this paper, we present a multimodal acquisition setup that combines different motion-capture systems. This system is mainly aimed at recording an expressive audiovisual corpus in the context of audiovisual speech synthesis. When dealing with speech recording, the standard optical motion-capture systems fail in tracking the articulators finely, especially the inner mouth region, due to the disappearing

  • Development and evaluation of an Urdu treebank (CLE-UTB) and a statistical parser
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-18
    Toqeer Ehsan, Sarmad Hussain

    A number of natural language processing tools for Urdu language processing have been developed in the past few years for word segmentation, part of speech tagging, chunking, named entity recognition and parsing. Corpora, especially treebanks, are essential data resources for language processing. This work presents the development and evaluation of an Urdu treebank, the CLE-UTB and a statistical parser

  • Semantics-aware typographical choices via affective associations
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-08
    Tugba Kulahcioglu, Gerard de Melo

    With the tens of thousands of fonts that are now readily available, it is non-trivial to select the most suitable font for a given use case. Considering the impact of the choice of font on human perception of the text, there is a strong need for semantic font search and recommendation. Aiming to fulfill this need, we induce a typographical lexicon providing associations between words and fonts. For

  • Building referring expression corpora with and without feedback
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-08
    Danillo da Silva Rocha, Ivandré Paraboni

    The design of data collection experiments involving human participants is a common task in Referring Expression Generation (REG) and related fields. Many (or most) REG data collection tasks are implemented by making use of a human–computer (e.g., web-based) communicative setting, in which participants do not have any particular addressee in mind and do not receive any feedback regarding the appropriateness

  • Evaluating human corrections in a computer-assisted speaker diarization system
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-06
    Pierre-Alexandre Broux, Simon Petitrenaud, Sylvain Meignier, Jean Carrive, David Doukhan

    In this paper, we present a framework to evaluate the human corrections of a speaker diarization system. We propose four elementary actions to correct the diarization (“Create a boundary”, “Delete a boundary”, “Create a speaker label” and “Change the speaker label”) and we propose an automaton to simulate the correction sequence. A metric is described to evaluate the correction cost. The framework

  • Evaluating cross-lingual textual similarity on dictionary alignment problem
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-06-29
    Yiğit Sever, Gönenç Ercan

    Bilingual or even polylingual word embeddings created many possibilities for tasks involving multiple languages. While some tasks like cross-lingual information retrieval aim to satisfy users’ multilingual information needs, some enable transferring valuable information from resource-rich languages to resource-poor ones. In any case, it is important to build and evaluate methods that operate in a cross-lingual

  • ViMs: a high-quality Vietnamese dataset for abstractive multi-document summarization
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-06-25
    Nhi-Thao Tran, Minh-Quoc Nghiem, Nhung T. H. Nguyen, Ngan Luu-Thuy Nguyen, Nam Van Chi, Dien Dinh

    Automatic text summarization is important in this era due to the exponential growth of documents available on the Internet. In the Vietnamese language, VietnameseMDS is the only publicly available dataset for this task. Although the dataset has 199 clusters, there are only three documents in each cluster, which is small compared to typical datasets in English. This motivates us to construct ViMs—a

  • C2SI corpus: a database of speech disorder productions to assess intelligibility and quality of life in head and neck cancers
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-06-15
    Virginie Woisard, Corine Astésano, Mathieu Balaguer, Jérôme Farinas, Corinne Fredouille, Pascal Gaillard, Alain Ghio, Laurence Giusti, Imed Laaridh, Muriel Lalain, Benoît Lepage, Julie Mauclair, Olivier Nocaudie, Julien Pinquier, Gilles Pouchoulin, Michèle Puech, Danièle Robert, Vincent Roger

    Within the framework of the Carcinologic Speech Severity Index (C2SI) INCa Project, we collected a large database of French speech recordings aiming at validating Disorder Severity Indexes. Such a database will be useful for measuring the impact of oral and pharyngeal cavity cancer on speech production. It will make it possible to assess patients’ quality of life after treatment. The database is composed of

  • Writer’s uncertainty identification in scientific biomedical articles: a tool for automatic if-clause tagging
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-06-11
    Paolo Omero, Massimiliano Valotto, Riccardo Bellana, Ramona Bongelli, Ilaria Riccioni, Andrzej Zuczkowski, Carlo Tasso

    In a previous study, we manually identified seven categories (verbs, non-verbs, modal verbs in the simple present, modal verbs in the conditional mood, if, uncertain questions, and epistemic future) of Uncertainty Markers (UMs) in a corpus of 80 articles from the British Medical Journal randomly sampled from a 167-year period (1840–2007). The UMs detected on the basis of an epistemic stance approach

  • Language resources for Maghrebi Arabic dialects’ NLP: a survey
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-04-25
    Jihene Younes, Emna Souissi, Hadhemi Achour, Ahmed Ferchichi

    Diglossia is one of the main characteristics of the Arabic language. In Arab countries, there are three forms of Arabic that co-exist: Classical Arabic (CA), which is mainly used in the Quran and in several classical literary texts, Modern Standard Arabic (MSA), which descends from CA and is used as the official language, and various regional colloquial varieties of Arabic that are usually referred to as Arabic

  • Mapping languages: the Corpus of Global Language Use
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-04-08
    Jonathan Dunn

    This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping. First, the corpus provides a representation of where national varieties of major languages are used (e.g., English, Arabic, Russian) together with consistently collected data for each variety. Second, the paper evaluates a language identification model that supports

  • A multi-platform dataset for detecting cyberbullying in social media
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-04-06
    David Van Bruwaene, Qianjia Huang, Diana Inkpen

    Recent work on cyberbullying detection relies on using machine learning models with text and metadata in small datasets, mostly drawn from single social media platforms. Such models have succeeded in predicting cyberbullying when dealing with posts containing the text and the metadata structure as found on the platform. Instead, we develop a multi-platform dataset that consists purely of the text from

  • Fake opinion detection: how similar are crowdsourced datasets to real data?
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-03-28
    Tommaso Fornaciari, Leticia Cagnina, Paolo Rosso, Massimo Poesio

    Identifying deceptive online reviews is a challenging task for Natural Language Processing (NLP). Collecting corpora for the task is difficult, because normally it is not possible to know whether reviews are genuine. A common workaround involves collecting (supposedly) truthful reviews online and adding them to a set of deceptive reviews obtained through crowdsourcing services. Models trained this

  • Automatic dialect identification system for Kannada language using single and ensemble SVM algorithms
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-11-21
    Nagaratna B. Chittaragi, Shashidhar G. Koolagudi

    In this paper, an automatic dialect identification (ADI) system is proposed by extracting spectral and prosodic features for the Kannada language. A new dialect dataset is collected from native speakers of Kannada (a Dravidian language). This dataset includes five distinct dialects of Kannada representing five geographical regions of Karnataka state. Investigation of the significance

  • What's missing in geographical parsing?
    Lang. Resour. Eval. Pub Date : 2018-01-01
    Milan Gritta, Mohammad Taher Pilehvar, Nut Limsopatham, Nigel Collier

    Geographical data can be obtained by converting place names from free-format text into geographical coordinates. The ability to geo-locate events in textual reports represents a valuable source of information in many real-world applications such as emergency responses, real-time social media geographical event analysis, understanding location instructions in auto-response systems and more. However

  • Investigating the cross-lingual translatability of VerbNet-style classification
    Lang. Resour. Eval. Pub Date : 2018-01-01
    Olga Majewska, Ivan Vulić, Diana McCarthy, Yan Huang, Akira Murakami, Veronika Laippala, Anna Korhonen

    VerbNet, the most extensive online verb lexicon currently available for English, has proved useful in supporting a variety of NLP tasks. However, its exploitation in multilingual NLP has been limited by the fact that such classifications are available for few languages only. Since manual development of VerbNet is a major undertaking, researchers have recently translated VerbNet classes from English to

  • Annotating patient clinical records with syntactic chunks and named entities: the Harvey Corpus
    Lang. Resour. Eval. Pub Date : 2016-08-30
    Aleksandar Savkov, John Carroll, Rob Koeling, Jackie Cassell

    The free text notes typed by physicians during patient consultations contain valuable information for the study of disease and treatment. These notes are difficult to process by existing natural language analysis tools since they are highly telegraphic (omitting many words), and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction

  • A massively parallel corpus: the Bible in 100 languages
    Lang. Resour. Eval. Pub Date : 2015-09-01
    Christos Christodouloupoulos, Mark Steedman

    We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English

  • Domain adaptation of statistical machine translation with domain-focused web crawling
    Lang. Resour. Eval. Pub Date : 2015-06-30
    Pavel Pecina, Antonio Toral, Vassilis Papavassiliou, Prokopis Prokopidis, Aleš Tamchyna, Andy Way, Josef van Genabith

    In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for automatic acquisition of monolingual and parallel text and their exploitation for system training, tuning, and testing in a phrase-based SMT framework

  • The Hebrew CHILDES corpus: transcription and morphological analysis
    Lang. Resour. Eval. Pub Date : 2014-11-25
    Aviad Albert, Brian MacWhinney, Bracha Nir, Shuly Wintner

    We present a corpus of transcribed spoken Hebrew that reflects spoken interactions between children and adults. The corpus is an integral part of the CHILDES database, which distributes similar corpora for over 25 languages. We introduce a dedicated transcription scheme for the spoken Hebrew data that is sensitive to both the phonology and the standard orthography of the language. We also introduce

  • Building and evaluating resources for sentiment analysis in the Greek language
    Lang. Resour. Eval.
    Adam Tsakalidis, Symeon Papadopoulos, Rania Voskaki, Kyriaki Ioannidou, Christina Boididou, Alexandra I Cristea, Maria Liakata, Yiannis Kompatsiaris

    Sentiment lexicons and word embeddings constitute well-established sources of information for sentiment analysis in online social media. Although their effectiveness has been demonstrated in state-of-the-art sentiment analysis and related tasks in the English language, such publicly available resources are much less developed and evaluated for the Greek language. In this paper, we tackle the problems

  • Spanish corpora for sentiment analysis: a survey
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-05-31
    María Navas-Loro, Víctor Rodríguez-Doncel

    Corpora play an important role when training machine learning systems for sentiment analysis. However, Spanish is underrepresented in these corpora, as most primarily include English texts. This paper describes 20 Spanish-language text corpora—collected to support different tasks related to sentiment analysis, ranging from polarity to emotion categorization. We present a brand-new framework for the

  • Restoring Arabic vowels through omission-tolerant dictionary lookup
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-25
    Alexis Amid Neme, Sébastien Paumier

    Vowels in Arabic are optional orthographic symbols written as diacritics above or below letters. In Arabic texts, typically more than 97 percent of written words do not explicitly show any of the vowels they contain; that is to say, depending on the author, genre and field, less than 3 percent of words include any explicit vowel. Although numerous studies have been published on the issue of restoring

  • A hybrid approach for paraphrase identification based on knowledge-enriched semantic heuristics
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-16
    Muhidin Mohamed, Mourad Oussalah

    In this paper, we propose a hybrid approach for sentence paraphrase identification. The proposal addresses the problem of evaluating sentence-to-sentence semantic similarity when the sentences contain a set of named-entities. The essence of the proposal is to distinguish the computation of the semantic similarity of named-entity tokens from the rest of the sentence text. More specifically, this is

  • TED Multilingual Discourse Bank (TED-MDB): a parallel corpus annotated in the PDTB style
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-06
    Deniz Zeyrek, Amália Mendes, Yulia Grishina, Murathan Kurfalı, Samuel Gibbon, Maciej Ogrodniczuk

    TED-Multilingual Discourse Bank, or TED-MDB, is a multilingual resource where TED-talks are annotated at the discourse level in 6 languages (English, Polish, German, Russian, European Portuguese, and Turkish) following the aims and principles of PDTB. We explain the corpus design criteria, which have three main features: the linguistic characteristics of the languages involved, the interactive nature

  • DZDC12: a new multipurpose parallel Algerian Arabizi–French code-switched corpus
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-01
    Kheireddine Abainia

    Algeria’s socio-linguistic situation is known as a complex phenomenon involving several historical, cultural and technological factors. However, there are three languages that are mainly spoken in Algeria (Arabic, Tamazight and French) and they can be mixed in the same sentence (code-switching). Moreover, there are several varieties of dialects that differ from one region to another and sometimes within

  • In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-03-26
    Ayla Rigouts Terryn, Véronique Hoste, Els Lefever

    Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a large need

  • DEMoS: an Italian emotional speech corpus
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-02-22
    Emilia Parada-Cabaleiro, Giovanni Costantini, Anton Batliner, Maximilian Schmitt, Björn W. Schuller

    We present DEMoS (Database of Elicited Mood in Speech), a new, large database with Italian emotional speech: 68 speakers, some 9k speech samples. As Italian is under-represented in speech emotion research, for a comparison with the state-of-the-art, we model the ‘big 6 emotions’ and guilt. Besides making this database available for research, our contribution is three-fold: First, we employ a variety

