Current journal: Language Resources and Evaluation
  • ChoCo: a multimodal corpus of the Choctaw language
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-31
    Jacqueline Brixey, Ron Artstein

    This article presents a general use corpus for Choctaw, an American indigenous language (ISO 639-2: cho, endonym: Chahta). The corpus contains audio, video, and text resources, with many texts also translated in English. The Oklahoma Choctaw and the Mississippi Choctaw variants of the language are represented in the corpus. The data set provides documentation support for this threatened language, and

    Updated: 2020-08-01
  • Developing computational infrastructure for the CorCenCC corpus: The National Corpus of Contemporary Welsh
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-29
    Dawn Knight, Fernando Loizides, Steven Neale, Laurence Anthony, Irena Spasić

    CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes—National Corpus of Contemporary Welsh) is the first comprehensive corpus of Welsh designed to be reflective of language use across communication types, genres, speakers, language varieties (regional and social) and contexts. This article focuses on the computational infrastructure that we have designed to support data collection for CorCenCC, and the subsequent

    Updated: 2020-07-29
  • Some consideration on expressive audiovisual speech corpus acquisition using a multimodal platform
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-26
    Sara Dahmani, Vincent Colotte, Slim Ouni

    In this paper, we present a multimodal acquisition setup that combines different motion-capture systems. This system is mainly aimed for recording expressive audiovisual corpus in the context of audiovisual speech synthesis. When dealing with speech recording, the standard optical motion-capture systems fail in tracking the articulators finely, especially the inner mouth region, due to the disappearing

    Updated: 2020-07-26
  • Development and evaluation of an Urdu treebank (CLE-UTB) and a statistical parser
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-18
    Toqeer Ehsan, Sarmad Hussain

    A number of natural language processing tools for Urdu language processing have been developed in the past few years for word segmentation, part of speech tagging, chunking, named entity recognition and parsing. Corpora, especially treebanks, are essential data resources for language processing. This work presents the development and evaluation of an Urdu treebank, the CLE-UTB and a statistical parser

    Updated: 2020-07-24
  • Semantics-aware typographical choices via affective associations
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-08
    Tugba Kulahcioglu, Gerard de Melo

    With the tens of thousands of fonts that are now readily available, it is non-trivial to select the most suitable font for a given use case. Considering the impact of the choice of font on human perception of the text, there is a strong need for semantic font search and recommendation. Aiming to fulfill this need, we induce a typographical lexicon providing associations between words and fonts. For

    Updated: 2020-07-24
  • Building referring expression corpora with and without feedback
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-08
    Danillo da Silva Rocha, Ivandré Paraboni

    The design of data collection experiments involving human participants is a common task in Referring Expression Generation (REG) and related fields. Many (or most) REG data collection tasks are implemented by making use of a human–computer (e.g., web-based) communicative setting, in which participants do not have any particular addressee in mind and do not receive any feedback regarding the appropriateness

    Updated: 2020-07-24
  • Evaluating human corrections in a computer-assisted speaker diarization system
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-06
    Pierre-Alexandre Broux, Simon Petitrenaud, Sylvain Meignier, Jean Carrive, David Doukhan

    In this paper, we present a framework to evaluate the human corrections of a speaker diarization system. We propose four elementary actions to correct the diarization (“Create a boundary”, “Delete a boundary”, “Create a speaker label” and “Change the speaker label”) and we propose an automaton to simulate the correction sequence. A metric is described to evaluate the correction cost. The framework

    Updated: 2020-07-24
  • Evaluating cross-lingual textual similarity on dictionary alignment problem
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-06-29
    Yiğit Sever, Gönenç Ercan

    Bilingual or even polylingual word embeddings created many possibilities for tasks involving multiple languages. While some tasks like cross-lingual information retrieval aim to satisfy users’ multilingual information needs, some enable transferring valuable information from resource-rich languages to resource-poor ones. In any case, it is important to build and evaluate methods that operate in a cross-lingual

    Updated: 2020-07-24
  • ViMs: a high-quality Vietnamese dataset for abstractive multi-document summarization
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-06-25
    Nhi-Thao Tran, Minh-Quoc Nghiem, Nhung T. H. Nguyen, Ngan Luu-Thuy Nguyen, Nam Van Chi, Dien Dinh

    Automatic text summarization is important in this era due to the exponential growth of documents available on the Internet. In the Vietnamese language, VietnameseMDS is the only publicly available dataset for this task. Although the dataset has 199 clusters, there are only three documents in each cluster, which is small compared to typical datasets in English. This motivates us to construct ViMs—a

    Updated: 2020-07-24
  • C2SI corpus: a database of speech disorder productions to assess intelligibility and quality of life in head and neck cancers
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-06-15
    Virginie Woisard, Corine Astésano, Mathieu Balaguer, Jérôme Farinas, Corinne Fredouille, Pascal Gaillard, Alain Ghio, Laurence Giusti, Imed Laaridh, Muriel Lalain, Benoît Lepage, Julie Mauclair, Olivier Nocaudie, Julien Pinquier, Gilles Pouchoulin, Michèle Puech, Danièle Robert, Vincent Roger

    Within the framework of the Carcinologic Speech Severity Index (C2SI) INCa Project, we collected a large database of French speech recordings aiming at validating Disorder Severity Indexes. Such a database will be useful for measuring the impact of oral and pharyngeal cavity cancer on speech production. It will permit to assess patients’ quality of life after treatment. The database is composed of

    Updated: 2020-07-24
  • Writer’s uncertainty identification in scientific biomedical articles: a tool for automatic if-clause tagging
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-06-11
    Paolo Omero, Massimiliano Valotto, Riccardo Bellana, Ramona Bongelli, Ilaria Riccioni, Andrzej Zuczkowski, Carlo Tasso

    In a previous study, we manually identified seven categories (verbs, non-verbs, modal verbs in the simple present, modal verbs in the conditional mood, if, uncertain questions, and epistemic future) of Uncertainty Markers (UMs) in a corpus of 80 articles from the British Medical Journal randomly sampled from a 167-year period (1840–2007). The UMs detected on the base of an epistemic stance approach

    Updated: 2020-07-24
  • Language resources for Maghrebi Arabic dialects’ NLP: a survey
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-04-25
    Jihene Younes, Emna Souissi, Hadhemi Achour, Ahmed Ferchichi

    Diglossia is one of the main characteristics of Arabic language. In Arab countries, there are three forms of Arabic that co-exist: Classical Arabic (CA) which is mainly used in the Quran and in several classical literary texts, Modern Standard Arabic (MSA) that descends from CA and used as official language, and various regional colloquial varieties of Arabic that are usually referred to as Arabic

    Updated: 2020-04-25
  • Mapping languages: the Corpus of Global Language Use
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-04-08
    Jonathan Dunn

    This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping. First, the corpus provides a representation of where national varieties of major languages are used (e.g., English, Arabic, Russian) together with consistently collected data for each variety. Second, the paper evaluates a language identification model that supports

    Updated: 2020-04-08
  • A multi-platform dataset for detecting cyberbullying in social media
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-04-06
    David Van Bruwaene, Qianjia Huang, Diana Inkpen

    Recent work on cyberbullying detection relies on using machine learning models with text and metadata in small datasets, mostly drawn from single social media platforms. Such models have succeeded in predicting cyberbullying when dealing with posts containing the text and the metadata structure as found on the platform. Instead, we develop a multi-platform dataset that consists purely of the text from

    Updated: 2020-04-06
  • Fake opinion detection: how similar are crowdsourced datasets to real data?
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-03-28
    Tommaso Fornaciari, Leticia Cagnina, Paolo Rosso, Massimo Poesio

    Identifying deceptive online reviews is a challenging task for Natural Language Processing (NLP). Collecting corpora for the task is difficult, because normally it is not possible to know whether reviews are genuine. A common workaround involves collecting (supposedly) truthful reviews online and adding them to a set of deceptive reviews obtained through crowdsourcing services. Models trained this

    Updated: 2020-03-28
  • Automatic dialect identification system for Kannada language using single and ensemble SVM algorithms
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-11-21
    Nagaratna B. Chittaragi, Shashidhar G. Koolagudi

    In this paper, an automatic dialect identification (ADI) system is proposed by extracting spectral and prosodic features for Kannada language. A new dialect dataset is collected from native speakers of Kannada language (a Dravidian language). This dataset includes five distinct dialects of Kannada language representing five geographical regions of Karnataka state. Investigation of the significance

    Updated: 2019-11-21
  • What's missing in geographical parsing?
    Lang. Resour. Eval. Pub Date : 2018-01-01
    Milan Gritta, Mohammad Taher Pilehvar, Nut Limsopatham, Nigel Collier

    Geographical data can be obtained by converting place names from free-format text into geographical coordinates. The ability to geo-locate events in textual reports represents a valuable source of information in many real-world applications such as emergency responses, real-time social media geographical event analysis, understanding location instructions in auto-response systems and more. However

    Updated: 2019-11-01
  • Investigating the cross-lingual translatability of VerbNet-style classification.
    Lang. Resour. Eval. Pub Date : 2018-01-01
    Olga Majewska, Ivan Vulić, Diana McCarthy, Yan Huang, Akira Murakami, Veronika Laippala, Anna Korhonen

    VerbNet-the most extensive online verb lexicon currently available for English-has proved useful in supporting a variety of NLP tasks. However, its exploitation in multilingual NLP has been limited by the fact that such classifications are available for few languages only. Since manual development of VerbNet is a major undertaking, researchers have recently translated VerbNet classes from English to

    Updated: 2019-11-01
  • Annotating patient clinical records with syntactic chunks and named entities: the Harvey Corpus.
    Lang. Resour. Eval. Pub Date : 2016-08-30
    Aleksandar Savkov, John Carroll, Rob Koeling, Jackie Cassell

    The free text notes typed by physicians during patient consultations contain valuable information for the study of disease and treatment. These notes are difficult to process by existing natural language analysis tools since they are highly telegraphic (omitting many words), and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction

    Updated: 2019-11-01
  • A massively parallel corpus: the Bible in 100 languages.
    Lang. Resour. Eval. Pub Date : 2015-09-01
    Christos Christodouloupoulos, Mark Steedman

    We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English

    Updated: 2019-11-01
  • Domain adaptation of statistical machine translation with domain-focused web crawling.
    Lang. Resour. Eval. Pub Date : 2015-06-30
    Pavel Pecina, Antonio Toral, Vassilis Papavassiliou, Prokopis Prokopidis, Aleš Tamchyna, Andy Way, Josef van Genabith

    In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for automatic acquisition of monolingual and parallel text and their exploitation for system training, tuning, and testing in a phrase-based SMT framework

    Updated: 2019-11-01
  • The Hebrew CHILDES corpus: transcription and morphological analysis.
    Lang. Resour. Eval. Pub Date : 2014-11-25
    Aviad Albert, Brian MacWhinney, Bracha Nir, Shuly Wintner

    We present a corpus of transcribed spoken Hebrew that reflects spoken interactions between children and adults. The corpus is an integral part of the CHILDES database, which distributes similar corpora for over 25 languages. We introduce a dedicated transcription scheme for the spoken Hebrew data that is sensitive to both the phonology and the standard orthography of the language. We also introduce

    Updated: 2019-11-01
  • Building and evaluating resources for sentiment analysis in the Greek language.
    Lang. Resour. Eval. Pub Date : null
    Adam Tsakalidis, Symeon Papadopoulos, Rania Voskaki, Kyriaki Ioannidou, Christina Boididou, Alexandra I Cristea, Maria Liakata, Yiannis Kompatsiaris

    Sentiment lexicons and word embeddings constitute well-established sources of information for sentiment analysis in online social media. Although their effectiveness has been demonstrated in state-of-the-art sentiment analysis and related tasks in the English language, such publicly available resources are much less developed and evaluated for the Greek language. In this paper, we tackle the problems

    Updated: 2019-11-01
  • Spanish corpora for sentiment analysis: a survey
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-05-31
    María Navas-Loro, Víctor Rodríguez-Doncel

    Corpora play an important role when training machine learning systems for sentiment analysis. However, Spanish is underrepresented in these corpora, as most primarily include English texts. This paper describes 20 Spanish-language text corpora—collected to support different tasks related to sentiment analysis, ranging from polarity to emotion categorization. We present a brand-new framework for the

    Updated: 2019-05-31
  • Restoring Arabic vowels through omission-tolerant dictionary lookup
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-25
    Alexis Amid Neme, Sébastien Paumier

    Vowels in Arabic are optional orthographic symbols written as diacritics above or below letters. In Arabic texts, typically more than 97 percent of written words do not explicitly show any of the vowels they contain; that is to say, depending on the author, genre and field, less than 3 percent of words include any explicit vowel. Although numerous studies have been published on the issue of restoring

    Updated: 2019-04-25
  • A hybrid approach for paraphrase identification based on knowledge-enriched semantic heuristics
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-16
    Muhidin Mohamed, Mourad Oussalah

    In this paper, we propose a hybrid approach for sentence paraphrase identification. The proposal addresses the problem of evaluating sentence-to-sentence semantic similarity when the sentences contain a set of named-entities. The essence of the proposal is to distinguish the computation of the semantic similarity of named-entity tokens from the rest of the sentence text. More specifically, this is

    Updated: 2019-04-16
  • TED Multilingual Discourse Bank (TED-MDB): a parallel corpus annotated in the PDTB style
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-06
    Deniz Zeyrek, Amália Mendes, Yulia Grishina, Murathan Kurfalı, Samuel Gibbon, Maciej Ogrodniczuk

    TED-Multilingual Discourse Bank, or TED-MDB, is a multilingual resource where TED-talks are annotated at the discourse level in 6 languages (English, Polish, German, Russian, European Portuguese, and Turkish) following the aims and principles of PDTB. We explain the corpus design criteria, which has three main features: the linguistic characteristics of the languages involved, the interactive nature

    Updated: 2019-04-06
  • DZDC12: a new multipurpose parallel Algerian Arabizi–French code-switched corpus
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-01
    Kheireddine Abainia

    Algeria’s socio-linguistic situation is known as a complex phenomenon involving several historical, cultural and technological factors. However, there are three languages that are mainly spoken in Algeria (Arabic, Tamazight and French) and they can be mixed in the same sentence (code-switching). Moreover, there are several varieties of dialects that differ from one region to another and sometimes within

    Updated: 2019-04-01
  • In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-03-26
    Ayla Rigouts Terryn, Véronique Hoste, Els Lefever

    Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a large need

    Updated: 2019-03-26
  • DEMoS: an Italian emotional speech corpus
    Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-02-22
    Emilia Parada-Cabaleiro, Giovanni Costantini, Anton Batliner, Maximilian Schmitt, Björn W. Schuller

    We present DEMoS (Database of Elicited Mood in Speech), a new, large database with Italian emotional speech: 68 speakers, some 9 k speech samples. As Italian is under-represented in speech emotion research, for a comparison with the state-of-the-art, we model the ‘big 6 emotions’ and guilt. Besides making available this database for research, our contribution is three-fold: First, we employ a variety

    Updated: 2019-02-22
Contents have been reproduced by permission of the publishers.