Current journal: Language Resources and Evaluation
  • MEmoFC: introducing the Multilingual Emotional Football Corpus
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-10-16
    Nadine Braun, Chris van der Lee, Lorenzo Gatti, Martijn Goudbeek, Emiel Krahmer

    This paper introduces a new corpus of paired football match reports, the Multilingual Emotional Football Corpus (MEmoFC), which has been manually collected from English, German, and Dutch websites of individual football clubs to investigate the way different emotional states (e.g. happiness for winning and disappointment for losing) are realized in written language. In addition to the reports, it

    Updated: 2020-10-17
  • Investigating the effects of gender, dialect, and training size on the performance of Arabic speech recognition
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-10-12
    Eiman Alsharhan, Allan Ramsay

    Research in Arabic automatic speech recognition (ASR) is constrained by datasets of limited size, and of highly variable content and quality. Arabic-language resources vary in the attributes that affect language resources in other languages (noise, channel, speaker, genre), but also vary significantly in the dialect and level of formality of the spoken Arabic they capture. Many languages suffer similar

    Updated: 2020-10-12
  • The B-Subtle framework: tailoring subtitles to your needs
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-10-11
    Miguel Ventura, Jessica Veiga, Luisa Coheur, Sandra Gama

    Large amounts of subtitles, from movies and TV shows, can be easily found on the web, for free, in almost every language. Several corpora, built from subtitles, with different annotations and purposes, are currently available. Considering that new sets of subtitles are constantly being released, we propose B-Subtle, an open source framework that allows the automatic creation of corpora constituted

    Updated: 2020-10-11
  • Arabic real time entity resolution using inverted indexing
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-10-07
    Marwah Alian, Ghazi Al-Naymat, Banda Ramadan

    Arabic datasets that have two or more records for the same world entity (i.e. person, object, etc.) make institutions suffer from low quality and degraded performance due to duplication in their Arabic datasets without having any mechanism for detecting these duplicates. The operation that distinguishes records for the same real-world entity is called Entity Resolution (ER). It is considered as a tool

    Updated: 2020-10-07
  • Resources and benchmark corpora for hate speech detection: a systematic review
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-09-30
    Fabio Poletto, Valerio Basile, Manuela Sanguinetti, Cristina Bosco, Viviana Patti

    Hate Speech in social media is a complex phenomenon, whose detection has recently gained significant traction in the Natural Language Processing community, as attested by several recent review works. Annotated corpora and benchmarks are key resources, considering the vast number of supervised approaches that have been proposed. Lexica play an important role as well for the development of hate speech

    Updated: 2020-09-30
  • The KAS corpus of Slovenian academic writing
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-09-24
    Tomaž Erjavec, Darja Fišer, Nikola Ljubešić

    The paper presents the KAS corpus of Slovenian academic writing, which consists of almost 65,000 B.A./B.Sc., 16,000 M.A./M.Sc. and 1600 Ph.D. theses (5 million pages or 1.7 billion tokens) gathered from the digital libraries of Slovenian higher education institutions via the Slovenian Open Science portal. We discuss the compilation, meta-data, annotation, and distribution of the corpus, which is made

    Updated: 2020-09-24
  • The Natural Stories corpus: a reading-time corpus of English texts containing rare syntactic constructions
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-09-04
    Richard Futrell, Edward Gibson, Harry J. Tily, Idan Blank, Anastasia Vishnevetsky, Steven T. Piantadosi, Evelina Fedorenko

    It is now a common practice to compare models of human language processing by comparing how well they predict behavioral and neural measures of processing difficulty, such as reading times, on corpora of rich naturalistic linguistic materials. However, many of these corpora, which are based on naturally-occurring text, do not contain many of the low-frequency syntactic constructions that are often

    Updated: 2020-09-05
  • ChoCo: a multimodal corpus of the Choctaw language
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-31
    Jacqueline Brixey, Ron Artstein

    This article presents a general use corpus for Choctaw, an American indigenous language (ISO 639-2: cho, endonym: Chahta). The corpus contains audio, video, and text resources, with many texts also translated in English. The Oklahoma Choctaw and the Mississippi Choctaw variants of the language are represented in the corpus. The data set provides documentation support for this threatened language, and

    Updated: 2020-08-01
  • Developing computational infrastructure for the CorCenCC corpus: The National Corpus of Contemporary Welsh
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-29
    Dawn Knight, Fernando Loizides, Steven Neale, Laurence Anthony, Irena Spasić

    CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes—National Corpus of Contemporary Welsh) is the first comprehensive corpus of Welsh designed to be reflective of language use across communication types, genres, speakers, language varieties (regional and social) and contexts. This article focuses on the computational infrastructure that we have designed to support data collection for CorCenCC, and the subsequent

    Updated: 2020-07-29
  • Some consideration on expressive audiovisual speech corpus acquisition using a multimodal platform
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-26
    Sara Dahmani, Vincent Colotte, Slim Ouni

    In this paper, we present a multimodal acquisition setup that combines different motion-capture systems. This system is mainly aimed for recording expressive audiovisual corpus in the context of audiovisual speech synthesis. When dealing with speech recording, the standard optical motion-capture systems fail in tracking the articulators finely, especially the inner mouth region, due to the disappearing

    Updated: 2020-07-26
  • Development and evaluation of an Urdu treebank (CLE-UTB) and a statistical parser
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-18
    Toqeer Ehsan, Sarmad Hussain

    A number of natural language processing tools for Urdu language processing have been developed in the past few years for word segmentation, part of speech tagging, chunking, named entity recognition and parsing. Corpora, especially treebanks, are essential data resources for language processing. This work presents the development and evaluation of an Urdu treebank, the CLE-UTB and a statistical parser

    Updated: 2020-07-24
  • Semantics-aware typographical choices via affective associations
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-08
    Tugba Kulahcioglu, Gerard de Melo

    With the tens of thousands of fonts that are now readily available, it is non-trivial to select the most suitable font for a given use case. Considering the impact of the choice of font on human perception of the text, there is a strong need for semantic font search and recommendation. Aiming to fulfill this need, we induce a typographical lexicon providing associations between words and fonts. For

    Updated: 2020-07-24
  • Building referring expression corpora with and without feedback
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-08
    Danillo da Silva Rocha, Ivandré Paraboni

    The design of data collection experiments involving human participants is a common task in Referring Expression Generation (REG) and related fields. Many (or most) REG data collection tasks are implemented by making use of a human–computer (e.g., web-based) communicative setting, in which participants do not have any particular addressee in mind and do not receive any feedback regarding the appropriateness

    Updated: 2020-07-24
  • Evaluating human corrections in a computer-assisted speaker diarization system
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-06
    Pierre-Alexandre Broux, Simon Petitrenaud, Sylvain Meignier, Jean Carrive, David Doukhan

    In this paper, we present a framework to evaluate the human corrections of a speaker diarization system. We propose four elementary actions to correct the diarization (“Create a boundary”, “Delete a boundary”, “Create a speaker label” and “Change the speaker label”) and we propose an automaton to simulate the correction sequence. A metric is described to evaluate the correction cost. The framework

    Updated: 2020-07-24
  • Evaluating cross-lingual textual similarity on dictionary alignment problem
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-06-29
    Yiğit Sever, Gönenç Ercan

    Bilingual or even polylingual word embeddings created many possibilities for tasks involving multiple languages. While some tasks like cross-lingual information retrieval aim to satisfy users’ multilingual information needs, some enable transferring valuable information from resource-rich languages to resource-poor ones. In any case, it is important to build and evaluate methods that operate in a cross-lingual

    Updated: 2020-07-24
  • ViMs: a high-quality Vietnamese dataset for abstractive multi-document summarization
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-06-25
    Nhi-Thao Tran, Minh-Quoc Nghiem, Nhung T. H. Nguyen, Ngan Luu-Thuy Nguyen, Nam Van Chi, Dien Dinh

    Automatic text summarization is important in this era due to the exponential growth of documents available on the Internet. In the Vietnamese language, VietnameseMDS is the only publicly available dataset for this task. Although the dataset has 199 clusters, there are only three documents in each cluster, which is small compared to typical datasets in English. This motivates us to construct ViMs—a

    Updated: 2020-07-24
  • C2SI corpus: a database of speech disorder productions to assess intelligibility and quality of life in head and neck cancers
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-06-15
    Virginie Woisard, Corine Astésano, Mathieu Balaguer, Jérôme Farinas, Corinne Fredouille, Pascal Gaillard, Alain Ghio, Laurence Giusti, Imed Laaridh, Muriel Lalain, Benoît Lepage, Julie Mauclair, Olivier Nocaudie, Julien Pinquier, Gilles Pouchoulin, Michèle Puech, Danièle Robert, Vincent Roger

    Within the framework of the Carcinologic Speech Severity Index (C2SI) INCa Project, we collected a large database of French speech recordings aiming at validating Disorder Severity Indexes. Such a database will be useful for measuring the impact of oral and pharyngeal cavity cancer on speech production. It will permit to assess patients’ quality of life after treatment. The database is composed of

    Updated: 2020-07-24
  • Writer’s uncertainty identification in scientific biomedical articles: a tool for automatic if-clause tagging
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-06-11
    Paolo Omero, Massimiliano Valotto, Riccardo Bellana, Ramona Bongelli, Ilaria Riccioni, Andrzej Zuczkowski, Carlo Tasso

    In a previous study, we manually identified seven categories (verbs, non-verbs, modal verbs in the simple present, modal verbs in the conditional mood, if, uncertain questions, and epistemic future) of Uncertainty Markers (UMs) in a corpus of 80 articles from the British Medical Journal randomly sampled from a 167-year period (1840–2007). The UMs detected on the base of an epistemic stance approach

    Updated: 2020-07-24
  • Language resources for Maghrebi Arabic dialects’ NLP: a survey
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-04-25
    Jihene Younes, Emna Souissi, Hadhemi Achour, Ahmed Ferchichi

    Diglossia is one of the main characteristics of Arabic language. In Arab countries, there are three forms of Arabic that co-exist: Classical Arabic (CA) which is mainly used in the Quran and in several classical literary texts, Modern Standard Arabic (MSA) that descends from CA and used as official language, and various regional colloquial varieties of Arabic that are usually referred to as Arabic

    Updated: 2020-04-25
  • Mapping languages: the Corpus of Global Language Use
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-04-08
    Jonathan Dunn

    This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping. First, the corpus provides a representation of where national varieties of major languages are used (e.g., English, Arabic, Russian) together with consistently collected data for each variety. Second, the paper evaluates a language identification model that supports

    Updated: 2020-04-08
  • A multi-platform dataset for detecting cyberbullying in social media
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-04-06
    David Van Bruwaene, Qianjia Huang, Diana Inkpen

    Recent work on cyberbullying detection relies on using machine learning models with text and metadata in small datasets, mostly drawn from single social media platforms. Such models have succeeded in predicting cyberbullying when dealing with posts containing the text and the metadata structure as found on the platform. Instead, we develop a multi-platform dataset that consists purely of the text from

    Updated: 2020-04-06
  • Fake opinion detection: how similar are crowdsourced datasets to real data?
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-03-28
    Tommaso Fornaciari, Leticia Cagnina, Paolo Rosso, Massimo Poesio

    Identifying deceptive online reviews is a challenging task for Natural Language Processing (NLP). Collecting corpora for the task is difficult, because normally it is not possible to know whether reviews are genuine. A common workaround involves collecting (supposedly) truthful reviews online and adding them to a set of deceptive reviews obtained through crowdsourcing services. Models trained this

    Updated: 2020-03-28
  • Comparing web-crawled and traditional corpora
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-03-19
    Václav Cvrček; Zuzana Komrsková; David Lukeš; Petra Poukarová; Anna Řehořková; Adrian Jan Zasina; Vladimír Benko

    Using a multi-dimensional (MD) analysis of register variability, the study compares two corpora of Czech: Koditex, a “traditional” corpus carefully designed using various sources with rich metadata, and Araneum Bohemicum Maximum, a web-crawled corpus with an opportunistic composition representative of the “searchable” web. Both types of corpora are projected onto the space induced by the MD model,

    Updated: 2020-03-19
  • NorthEuraLex: a wide-coverage lexical database of Northern Eurasia
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-11-30
    Johannes Dellert; Thora Daneyko; Alla Münch; Alina Ladygina; Armin Buch; Natalie Clarius; Ilja Grigorjew; Mohamed Balabel; Hizniye Isabella Boga; Zalina Baysarova; Roland Mühlenbernd; Johannes Wahle; Gerhard Jäger

    This article describes the first release version of a new lexicostatistical database of Northern Eurasia, which includes Europe as the most well-researched linguistic area. Unlike in other areas of the world, where databases are restricted to covering a small number of concepts as far as possible based on often sparse documentation, good lexical resources providing wide coverage of the lexicon are

    Updated: 2019-11-30
  • Automatic dialect identification system for Kannada language using single and ensemble SVM algorithms
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-11-21
    Nagaratna B. Chittaragi; Shashidhar G. Koolagudi

    In this paper, an automatic dialect identification (ADI) system is proposed by extracting spectral and prosodic features for Kannada language. A new dialect dataset is collected from native speakers of Kannada language (A Dravidian language). This dataset includes five distinct dialects of Kannada language representing five geographical regions of Karnataka state. Investigation of the significance

    Updated: 2019-11-21
  • Reproduction, replication, analysis and adaptation of a term alignment approach
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-11-18
    Andraž Repar; Matej Martinc; Senja Pollak

    In this paper, we look at the issue of reproducibility and replicability in bilingual terminology alignment (BTA). We propose a set of best practices for reproducibility and replicability of NLP papers and analyze several influential BTA papers from this perspective. Next, we present our attempts at replication and reproduction, where we focus on a bilingual terminology alignment approach described

    Updated: 2019-11-18
  • Prosodic word boundary detection from Bengali continuous speech
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-11-13
    Tanmay Bhowmik; Shyamal Kumar Das Mandal

    Detection of word boundaries in continuous speech is a tedious process due to the absence of a definite pause or silence in the word boundary position. Thus, continuous speech recognition is a very challenging task. However, the prosodic word boundaries, unlike the written word boundaries, can be predicted using the prosodic parameters of continuous speech. This paper proposes a method for detecting

    Updated: 2019-11-13
  • A pragmatic guide to geoparsing evaluation
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-09-19
    Milan Gritta; Mohammad Taher Pilehvar; Nigel Collier

    Empirical methods in geoparsing have thus far lacked a standard evaluation framework describing the task, metrics and data used to compare state-of-the-art systems. Evaluation is further made inconsistent, even unrepresentative of real world usage by the lack of distinction between the different types of toponyms, which necessitates new guidelines, a consolidation of metrics and a detailed toponym

    Updated: 2019-09-19
  • The Corpus of American Danish: a language resource of spoken immigrant Danish in North and South America
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-08-23
    Karoline Kühl; Jan Heegård Petersen; Gert Foget Hansen

    This paper describes the ‘Corpus of American Danish’ (CoAmDa), a newly established corpus of spoken immigrant Danish in North and South America. The CoAmDa amounts to approx. 1.7 million tokens, making it one of the largest corpora of heritage language at present. With regard to text type, the CoAmDa is a non-standard multilingual spoken language resource as Danish is mixed with American English, Canadian

    Updated: 2019-08-23
  • A Finnish news corpus for named entity recognition
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-08-01
    Teemu Ruokolainen; Pekka Kauppinen; Miikka Silfverberg; Krister Lindén

    We present a corpus of Finnish news articles with a manually prepared named entity annotation. The corpus consists of 953 articles (193,742 word tokens) with six named entity classes (organization, location, person, product, event, and date). The articles are extracted from the archives of Digitoday, a Finnish online technology news source. The corpus is available for research purposes. We present

    Updated: 2019-08-01
  • The undergraduate learner translator corpus: a new resource for translation studies and computational linguistics
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-07-24
    Reem F. Alfuraih

    Around the world, a growing interest has been seen in learner translator corpora, which are invaluable resources for teaching and research. This paper introduces a new resource to support researchers from different interdisciplinary areas such as computational linguistics, descriptive translation studies, computer-aided translation technology, Arabic machine translation applications, cognitive science

    Updated: 2019-07-24
  • Computational text analysis within the Humanities: How to combine working practices from the contributing fields?
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-06-26
    Jonas Kuhn

    This position paper is based on a keynote presentation at the COLING 2016 Workshop on Language Technology for Digital Humanities in Osaka, Japan. It departs from observations about working practices in Humanities disciplines following a hermeneutic tradition of text interpretation versus the method-oriented research strategies in Computational Linguistics (CL). The respective praxeological traditions

    Updated: 2019-06-26
  • Historical corpora meet the digital humanities: the Jerusalem Corpus of Emergent Modern Hebrew
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-06-04
    Aynat Rubinstein

    The paper describes the creation of the first open access multi-genre historical corpus of Emergent Modern Hebrew, made possible by implementation of digital humanities methods in the process of corpus curation, encoding, and dissemination. Corpus contents originate in the Ben-Yehuda Project, an open access repository of Hebrew literature online, and in digital images curated from the collections of

    Updated: 2019-06-04
  • Digitization of data for a historical medical dictionary
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-06-04
    Juhani Norri; Marko Junkkari; Timo Poranen

    What are known as specialized or specialist dictionaries are much more than lists of words and their definitions with occasional comments on things such as synonymy and homonymy. That is to say, a particular specialist term may be associated with many other concepts, including quotations, different senses, etymological categories, semantic categories, superordinate and subordinate terms in the terminological

    Updated: 2019-06-04
  • Spanish corpora for sentiment analysis: a survey
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-05-31
    María Navas-Loro; Víctor Rodríguez-Doncel

    Corpora play an important role when training machine learning systems for sentiment analysis. However, Spanish is underrepresented in these corpora, as most primarily include English texts. This paper describes 20 Spanish-language text corpora—collected to support different tasks related to sentiment analysis, ranging from polarity to emotion categorization. We present a brand-new framework for the

    Updated: 2019-05-31
  • Automatic detection and correction of discourse marker errors made by Spanish native speakers in Portuguese academic writing
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-05-06
    Lianet Sepúlveda-Torres; Magali Sanches Duran; Sandra Maria Aluísio

    Discourse markers are words and expressions (such as: firstly, then, for example, because, as a result, likewise, in comparison, in contrast) that explicitly state the relational structure of the information in the text, i.e. signalling a sequential relationship between the current message and the previous discourse. Using these markers improves the cohesion and coherence of texts, facilitating reading

    Updated: 2019-05-06
  • Dialogue analysis: a case study on the New Testament
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-05-02
    Chak Yan Yeung; John Lee

    There has been much research on the nature of dialogues in the Bible. While the research literature abounds with qualitative analyses on these dialogues, they are rarely corroborated on statistics from the entire text. In this article, we leverage a corpus of annotated direct speech in the New Testament, as well as recent advances in automatic speaker and listener identification, to present a quantitative

    Updated: 2019-05-02
  • Restoring Arabic vowels through omission-tolerant dictionary lookup
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-25
    Alexis Amid Neme; Sébastien Paumier

    Vowels in Arabic are optional orthographic symbols written as diacritics above or below letters. In Arabic texts, typically more than 97 percent of written words do not explicitly show any of the vowels they contain; that is to say, depending on the author, genre and field, less than 3 percent of words include any explicit vowel. Although numerous studies have been published on the issue of restoring

    Updated: 2019-04-25
  • Building the first comprehensive machine-readable Turkish sign language resource: methods, challenges and solutions
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-17
    Gülşen Eryiğit; Cihat Eryiğit; Serpil Karabüklü; Meltem Kelepir; Aslı Özkul; Tuğba Pamay; Dilara Torunoğlu-Selamet; Hatice Köse

    This article describes the procedures employed during the development of the first comprehensive machine-readable Turkish Sign Language (TiD) resource: a bilingual lexical database and a parallel corpus between Turkish and TiD. In addition to sign language specific annotations (such as non-manual markers, classifiers and buoys) following the recently introduced TiD knowledge representation (Eryiğit

    Updated: 2019-04-17
  • Exploiting languages proximity for part-of-speech tagging of three French regional languages
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-16
    Pierre Magistry; Anne-Laure Ligozat; Sophie Rosset

    This paper presents experiments in part-of-speech tagging of low-resource languages. It addresses the case when no labeled data in the targeted language and no parallel corpus are available. We only rely on the proximity of the targeted language to a better-resourced language. We conduct experiments on three French regional languages. We try to exploit this proximity with two main strategies: delexicalization

    Updated: 2019-04-16
  • A hybrid approach for paraphrase identification based on knowledge-enriched semantic heuristics
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-16
    Muhidin Mohamed; Mourad Oussalah

    In this paper, we propose a hybrid approach for sentence paraphrase identification. The proposal addresses the problem of evaluating sentence-to-sentence semantic similarity when the sentences contain a set of named-entities. The essence of the proposal is to distinguish the computation of the semantic similarity of named-entity tokens from the rest of the sentence text. More specifically, this is

    Updated: 2019-04-16
  • Studying the history of the Arabic language: language technology and a large-scale historical corpus
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-12
    Yonatan Belinkov; Alexander Magidow; Alberto Barrón-Cedeño; Avi Shmidman; Maxim Romanov

    Arabic is a widely-spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. Therefore, studying the history of the language has so far been mostly limited to manual analyses on a small scale. In this work, we present a large-scale historical corpus of the written Arabic language, spanning 1400 years. We describe our

    Updated: 2019-04-12
  • Approaching terminological ambiguity in cross-disciplinary communication as a word sense induction task: a pilot study
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-12
    Julie Mennes; Ted Pedersen; Els Lefever

    Cross-disciplinary communication is often impeded by terminological ambiguity. Hence, cross-disciplinary teams would greatly benefit from using a language technology-based tool that allows for the (at least semi-) automated resolution of ambiguous terms. Although no such tool is readily available, an interesting theoretical outline of one does exist. The main obstacle for the concrete realization of

    Updated: 2019-04-12
  • Digitising Swiss German: how to process and study a polycentric spoken language
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-11
    Yves Scherrer; Tanja Samardžić; Elvira Glaser

    Swiss dialects of German are, unlike many dialects of other standardised languages, widely used in everyday communication. Despite this fact, automatic processing of Swiss German is still a considerable challenge due to the fact that it is mostly a spoken variety and that it is subject to considerable regional variation. This paper presents the ArchiMob corpus, a freely available general-purpose corpus

    Updated: 2019-04-11
  • From 0 to 10 million annotated words: part-of-speech tagging for Middle High German
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-08
    Sarah Schulz; Nora Ketschik

    By building a part-of-speech (POS) tagger for Middle High German, we investigate strategies for dealing with a low resource, diverse and non-standard language in the domain of natural language processing. We highlight various aspects such as the data quantity needed for training and the influence of data quality on tagger performance. Since the lack of annotated resources poses a problem for training

    Updated: 2019-04-08
  • Beyond lexical frequencies: using R for text analysis in the digital humanities
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-08
    Taylor Arnold; Nicolas Ballier; Paula Lissón; Lauren Tilton

    This paper presents a combination of R packages—user contributed toolkits written in a common core programming language—to facilitate the humanistic investigation of digitised, text-based corpora. Our survey of text analysis packages includes those of our own creation (cleanNLP and fasttextM) as well as packages built by other research groups (stringi, readtext, hyphenatr, quanteda, and hunspell).

    Updated: 2019-04-08
  • TED Multilingual Discourse Bank (TED-MDB): a parallel corpus annotated in the PDTB style
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-06
    Deniz Zeyrek; Amália Mendes; Yulia Grishina; Murathan Kurfalı; Samuel Gibbon; Maciej Ogrodniczuk

    TED-Multilingual Discourse Bank, or TED-MDB, is a multilingual resource where TED-talks are annotated at the discourse level in 6 languages (English, Polish, German, Russian, European Portuguese, and Turkish) following the aims and principles of PDTB. We explain the corpus design criteria, which has three main features: the linguistic characteristics of the languages involved, the interactive nature

    Updated: 2019-04-06
  • DZDC12: a new multipurpose parallel Algerian Arabizi–French code-switched corpus
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-01
    Kheireddine Abainia

    Algeria’s socio-linguistic situation is known as a complex phenomenon involving several historical, cultural and technological factors. However, there are three languages that are mainly spoken in Algeria (Arabic, Tamazight and French) and they can be mixed in the same sentence (code-switching). Moreover, there are several varieties of dialects that differ from one region to another and sometimes within

    Updated: 2019-04-01
  • Capturing and measuring thematic relatedness
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-03-27
    Magdalena Kacmajor; John D. Kelleher

    In this paper we explain the difference between two aspects of semantic relatedness: taxonomic and thematic relations. We notice the lack of evaluation tools for measuring thematic relatedness, identify two datasets that can be recommended as thematic benchmarks, and verify them experimentally. In further experiments, we use these datasets to perform a comprehensive analysis of the performance of an

    Updated: 2019-03-27
  • In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-03-26
    Ayla Rigouts Terryn; Véronique Hoste; Els Lefever

    Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a large need

    Updated: 2019-03-26
  • Constructing two Vietnamese corpora and building a lexical database
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-03-21
    Hien Pham; Benjamin V. Tucker; R. Harald Baayen

    Corpus-based research has formed the backbone of linguistic research in recent decades. Large text corpora are used for solving various kinds of linguistic problems, including those of quantitative linguistics, cognitive linguistics, and psycholinguistics. This paper reports the creation of two corpora of contemporary Vietnamese. It also describes the construction of these two equally sized Vietnamese

    Updated: 2019-03-21
  • Geoparsing historical and contemporary literary text set in the City of Edinburgh
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-02-26
    Beatrice Alex; Claire Grover; Richard Tobin; Jon Oberlander

    While a reasonable amount of work has gone into automatically geoparsing text at the city or higher levels of granularity for different types of texts in different domains, there is relatively little research on geoparsing fine-grained locations such as buildings, green spaces and street names in text. This paper reports on how the Edinburgh Geoparser performs on this task for different types of literary

    Updated: 2019-02-26
  • DEMoS: an Italian emotional speech corpus
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-02-22
    Emilia Parada-Cabaleiro; Giovanni Costantini; Anton Batliner; Maximilian Schmitt; Björn W. Schuller

    We present DEMoS (Database of Elicited Mood in Speech), a new, large database with Italian emotional speech: 68 speakers, some 9 k speech samples. As Italian is under-represented in speech emotion research, for a comparison with the state-of-the-art, we model the ‘big 6 emotions’ and guilt. Besides making available this database for research, our contribution is three-fold: First, we employ a variety

    Updated: 2019-02-22
  • Air traffic control communication (ATCC) speech corpora and their use for ASR and TTS development
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-02-19
    Luboš Šmídl; Jan Švec; Daniel Tihelka; Jindřich Matoušek; Jan Romportl; Pavel Ircing

    The paper introduces the motivation for creating dedicated speech corpora of air traffic control communication, describes in detail the process of preparation of corpora for both automatic speech recognition and text-to-speech synthesis, presents an illustrative example of speech recognition system developed using the automatic speech recognition corpora and finally describes the technical aspects

    Updated: 2019-02-19
  • Argumentation in the 2016 US presidential elections: annotated corpora of television debates and social media reaction
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-02-09
    Jacky Visser; Barbara Konat; Rory Duthie; Marcin Koszowy; Katarzyna Budzynska; Chris Reed

    In this paper we present US2016, the largest publicly available set of corpora of annotated dialogical argumentation. The annotation covers argumentative relations, dialogue acts and pragmatic features. The corpora comprise transcriptions of television debates leading up to the 2016 US presidential elections, and reactions to the debates on Reddit. These two constitutive parts of the corpora are integrated

    Updated: 2019-02-09
  • Token-based spelling variant detection in Middle Low German texts
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-02-09
    Fabian Barteld; Chris Biemann; Heike Zinsmeister

    In this paper we present a pipeline for the detection of spelling variants, i.e., different spellings that represent the same word, in non-standard texts. For example, in Middle Low German texts ‘in’ and ‘ihn’ (among others) are potential spellings of a single word, the personal pronoun ‘him’. Spelling variation is usually addressed by normalization, in which non-standard variants are mapped to a corresponding

    Updated: 2019-02-09
  • Vector space explorations of literary language
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-02-09
    Andreas van Cranenburgh; Karina van Dalen-Oskam; Joris van Zundert

    Literary novels are said to distinguish themselves from other novels through conventions associated with literariness. We investigate the task of predicting the literariness of novels as perceived by readers, based on a large reader survey of contemporary Dutch novels. Previous research showed that ratings of literariness are predictable from texts to a substantial extent using machine learning, suggesting

    Updated: 2019-02-09
  • The South African directory enquiries (SADE) name corpus
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-02-06
    Jan W. F. Thirion; Charl van Heerden; Oluwapelumi Giwa; Marelie H. Davel

    We present the design and development of a South African directory enquiries corpus. It contains audio and orthographic transcriptions of a wide range of South African names produced by first-language speakers of four languages, namely Afrikaans, English, isiZulu and Sesotho. Useful as a resource to understand the effect of name language and speaker language on pronunciation, this is the first corpus

    Updated: 2019-02-06
  • From Lexical Functional Grammar to enhanced Universal Dependencies
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-02-04
    Adam Przepiórkowski; Agnieszka Patejuk

    The paper describes the conversion of an LFG treebank of Polish into enhanced Universal Dependencies, and—more generally—identifies the kinds of information lost in translation from LFG to UD. The paper also presents the resulting UD treebank of Polish and compares it to the previous UD treebank of Polish.

    Updated: 2019-02-04
  • Emilia: a speech corpus for Argentine Spanish text to speech synthesis
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-02-02
    Humberto M. Torres; Jorge A. Gurlekian; Diego A. Evin; Christian G. Cossio Mercado

    This paper introduces Emilia, a speech corpus created to build a female voice in Spanish spoken in Buenos Aires for the Aromo text-to-speech system. Aromo is a unit selection text-to-speech system, which employs diphones as units of synthesis. The key requirements and design criteria for Emilia were: to synthesize any text in Spanish into high-quality speech with a minimum corpus size. The text corpus

    Updated: 2019-02-02
Contents have been reproduced by permission of the publishers.