-
Modelling multi-level prosody and spectral features using deep neural network for an automatic tonal and non-tonal pre-classification-based Indian language identification system Lang. Resour. Eval. (IF 1.014) Pub Date : 2021-01-20 Chuya China Bhanja, Mohammad Azharuddin Laskar, Rabul Hussain Laskar
In this paper an attempt has been made to prepare an automatic tonal and non-tonal pre-classification-based Indian language identification (LID) system using multi-level prosody and spectral features. Languages are first categorized into tonal and non-tonal groups, and then, from among the languages of the respective groups, individual languages are identified. The system uses syllable, word (tri-syllable)
-
Annotating affective dimensions in user-generated content Lang. Resour. Eval. (IF 1.014) Pub Date : 2021-01-19 Luna De Bruyne, Orphée De Clercq, Véronique Hoste
In an era where user-generated content becomes ever more prevalent, reliable methods to judge emotional properties of these kinds of complex texts are needed, for example for developing corpora in machine learning contexts. In this study, we focus on Dutch Twitter messages, a genre which is high in emotional content and frequently investigated in the field of computational linguistics. We compare three
-
The LRE Map: what does it tell us about the last decade of our field? Lang. Resour. Eval. (IF 1.014) Pub Date : 2021-01-15 Riccardo Del Gratta, Sara Goggi, Gabriella Pardelli, Nicoletta Calzolari
The LRE Map of Language Resources was introduced at LREC 2010. Its intended purpose was: “to shed light on the vast amount of resources that represent the background of the research presented at LREC” (Calzolari et al. in: Calzolari et al. (eds) Proceedings of the seventh international conference on language resources and evaluation (LREC’10). European Language Resources Association (ELRA), Valletta
-
TuLeD (Tupían lexical database): introducing a database of a South American language family Lang. Resour. Eval. (IF 1.014) Pub Date : 2021-01-13 Fabrício Ferraz Gerardi, Stanislav Reichert, Carolina Coelho Aragon
The last two decades witnessed a rapid growth of publicly accessible online language resources. This has allowed for valuable data on lesser known languages to become available. Such resources provide linguists with opportunities for advancing their research. Yet despite the proliferation of lexical and morphological databases, the ca. 456 languages spoken in South America are poorly represented, particularly
-
Representing variation in a spoken corpus of an endangered dialect: the case of Torlak Lang. Resour. Eval. (IF 1.014) Pub Date : 2021-01-09 Teodora Vuković
The paper presents a spoken corpus of the endangered Torlak dialect from the Timok area of Southeast Serbia. This dialect expresses a great deal of variation in the use of non-standard features under the influence of standard Serbian (SSr). Accounting for this variation, a specific methodology has been selected for collection, sampling, transcription and annotation. Between 2015 and 2017, semi-structured
-
LDC-IL: The Indian repository of resources for language technology Lang. Resour. Eval. (IF 1.014) Pub Date : 2021-01-03 Narayan Choudhary
This paper introduces the Government of India Initiative on linguistic data creation in Indian languages. The Linguistic Data Consortium for Indian Languages (LDC-IL) is a fully funded Government of India scheme established in 2007 to cater to the needs of linguistic resources required for the development of language technology in Indian languages. LDC-IL worked silently for more than a decade with
-
Live blog summarization Lang. Resour. Eval. (IF 1.014) Pub Date : 2021-01-02 P. V. S. Avinesh, Maxime Peyrard, Christian M. Meyer
Live blogs are an increasingly popular news format to cover breaking news and live events in online journalism. Online news websites around the world are using this medium to give their readers a minute by minute update on an event. Good summaries enhance the value of the live blogs for a reader, but are often not available. In this article, (a) we first define the task of summarizing a live blog,
-
AI2D-RST: a multimodal corpus of 1000 primary school science diagrams Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-12-05 Tuomo Hiippala, Malihe Alikhani, Jonas Haverinen, Timo Kalliokoski, Evanfiya Logacheva, Serafina Orekhova, Aino Tuomainen, Matthew Stone, John A. Bateman
This article introduces AI2D-RST, a multimodal corpus of 1000 English-language diagrams that represent topics in primary school natural sciences, such as food webs, life cycles, moon phases and human physiology. The corpus is based on the Allen Institute for Artificial Intelligence Diagrams (AI2D) dataset, a collection of diagrams with crowdsourced descriptions, which was originally developed to support
-
DiaBLa: a corpus of bilingual spontaneous written dialogues for machine translation Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-11-23 Rachel Bawden, Eric Bilinski, Thomas Lavergne, Sophie Rosset
We present a new English–French dataset for the evaluation of Machine Translation (MT) for informal, written bilingual dialogue. The test set contains 144 spontaneous dialogues (5700+ sentences) between native English and French speakers, mediated by one of two neural MT systems in a range of role-play settings. The dialogues are accompanied by fine-grained sentence-level judgments of MT quality, produced
-
Orthographic features for emotion classification in Chinese in informal short texts Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-11-23 I-Hsuan Chen, Yunfei Long, Qin Lu, Chu-Ren Huang
Informal short texts on the web are rich in emotions as they often reflect unfiltered immediate reactions to breaking news events. The emotion density, however, stands in contrast to its poverty of linguistic contexts and features for emotion classification. This paper tackles that challenge by proposing orthographic features based on orthographic code mixing and code-switching for both non-ML and
-
Correction to: Development and evaluation of an Urdu treebank (CLE-UTB) and a statistical parser Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-11-17 Toqeer Ehsan, Sarmad Hussain
In the original publication of the article, the column headers of Tables 17 and 18 were published incorrectly. The corrected versions of Tables 17 and 18 are provided with this Correction.
-
Morphological analysis and disambiguation for Breton Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-11-16 Francis M. Tyers, Nick Howell
In this paper we present an extended description of two resources for natural language processing of Breton, a morphological analyser and constraint grammar-based disambiguator. The constraint grammar was developed using a novel methodology by a linguist and a language consultant creating rules to solve specific errors in disambiguation in a machine translation system. In addition we introduce a new
-
Current limitations in cyberbullying detection: On evaluation criteria, reproducibility, and data scarcity Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-11-16 Chris Emmery, Ben Verhoeven, Guy De Pauw, Gilles Jacobs, Cynthia Van Hee, Els Lefever, Bart Desmet, Véronique Hoste, Walter Daelemans
The detection of online cyberbullying has seen an increase in societal importance, popularity in research, and available open data. Nevertheless, while computational power and affordability of resources continue to increase, the access restrictions on high-quality data limit the applicability of state-of-the-art techniques. Consequently, much of the recent research uses small, heterogeneous datasets
-
Improvement of sentiment analysis via re-evaluation of objective words in SenticNet for hotel reviews Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-10-22 Chihli Hung, Wan-Rong Wu, Hsien-Ming Chou
In order to extract the correct sentiment polarity from word of mouth (WOM), a wide-scale and well-organized sentiment lexicon is generally beneficial. SenticNet is one such lexicon. However, it consists of a high proportion of objective words, which are generally considered to be of little use for sentiment classification due to their ambiguity and lack of sentiments. In the literature, there is a
-
MEmoFC: introducing the Multilingual Emotional Football Corpus Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-10-16 Nadine Braun, Chris van der Lee, Lorenzo Gatti, Martijn Goudbeek, Emiel Krahmer
This paper introduces a new corpus of paired football match reports, the Multilingual Emotional Football Corpus (MEmoFC), which has been manually collected from English, German, and Dutch websites of individual football clubs to investigate the way different emotional states (e.g. happiness for winning and disappointment for losing) are realized in written language. In addition to the reports, it
-
Investigating the effects of gender, dialect, and training size on the performance of Arabic speech recognition Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-10-12 Eiman Alsharhan, Allan Ramsay
Research in Arabic automatic speech recognition (ASR) is constrained by datasets of limited size, and of highly variable content and quality. Arabic-language resources vary in the attributes that affect language resources in other languages (noise, channel, speaker, genre), but also vary significantly in the dialect and level of formality of the spoken Arabic they capture. Many languages suffer similar
-
The B-Subtle framework: tailoring subtitles to your needs Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-10-11 Miguel Ventura, Jessica Veiga, Luisa Coheur, Sandra Gama
Large amounts of subtitles, from movies and TV shows, can be easily found on the web, for free, in almost every language. Several corpora, built from subtitles, with different annotations and purposes, are currently available. Considering that new sets of subtitles are constantly being released, we propose B-Subtle, an open source framework that allows the automatic creation of corpora constituted
-
Arabic real time entity resolution using inverted indexing Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-10-07 Marwah Alian, Ghazi Al-Naymat, Banda Ramadan
Arabic datasets that have two or more records for the same real-world entity (e.g. person, object, etc.) make institutions suffer from low quality and degraded performance due to duplication in their Arabic datasets without having any mechanism for detecting these duplicates. The operation that identifies records for the same real-world entity is called Entity Resolution (ER). It is considered a tool
-
Resources and benchmark corpora for hate speech detection: a systematic review Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-09-30 Fabio Poletto, Valerio Basile, Manuela Sanguinetti, Cristina Bosco, Viviana Patti
Hate Speech in social media is a complex phenomenon, whose detection has recently gained significant traction in the Natural Language Processing community, as attested by several recent review works. Annotated corpora and benchmarks are key resources, considering the vast number of supervised approaches that have been proposed. Lexica play an important role as well for the development of hate speech
-
The KAS corpus of Slovenian academic writing Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-09-24 Tomaž Erjavec, Darja Fišer, Nikola Ljubešić
The paper presents the KAS corpus of Slovenian academic writing, which consists of almost 65,000 B.A./B.Sc., 16,000 M.A./M.Sc. and 1600 Ph.D. theses (5 million pages or 1.7 billion tokens) gathered from the digital libraries of Slovenian higher education institutions via the Slovenian Open Science portal. We discuss the compilation, meta-data, annotation, and distribution of the corpus, which is made
-
The Natural Stories corpus: a reading-time corpus of English texts containing rare syntactic constructions Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-09-04 Richard Futrell, Edward Gibson, Harry J. Tily, Idan Blank, Anastasia Vishnevetsky, Steven T. Piantadosi, Evelina Fedorenko
It is now a common practice to compare models of human language processing by comparing how well they predict behavioral and neural measures of processing difficulty, such as reading times, on corpora of rich naturalistic linguistic materials. However, many of these corpora, which are based on naturally-occurring text, do not contain many of the low-frequency syntactic constructions that are often
-
ChoCo: a multimodal corpus of the Choctaw language Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-31 Jacqueline Brixey, Ron Artstein
This article presents a general use corpus for Choctaw, an American indigenous language (ISO 639-2: cho, endonym: Chahta). The corpus contains audio, video, and text resources, with many texts also translated in English. The Oklahoma Choctaw and the Mississippi Choctaw variants of the language are represented in the corpus. The data set provides documentation support for this threatened language, and
-
Developing computational infrastructure for the CorCenCC corpus: The National Corpus of Contemporary Welsh Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-29 Dawn Knight, Fernando Loizides, Steven Neale, Laurence Anthony, Irena Spasić
CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes—National Corpus of Contemporary Welsh) is the first comprehensive corpus of Welsh designed to be reflective of language use across communication types, genres, speakers, language varieties (regional and social) and contexts. This article focuses on the computational infrastructure that we have designed to support data collection for CorCenCC, and the subsequent
-
Some consideration on expressive audiovisual speech corpus acquisition using a multimodal platform Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-26 Sara Dahmani, Vincent Colotte, Slim Ouni
In this paper, we present a multimodal acquisition setup that combines different motion-capture systems. This system is mainly aimed at recording an expressive audiovisual corpus in the context of audiovisual speech synthesis. When dealing with speech recording, the standard optical motion-capture systems fail in tracking the articulators finely, especially the inner mouth region, due to the disappearing
-
Development and evaluation of an Urdu treebank (CLE-UTB) and a statistical parser Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-18 Toqeer Ehsan, Sarmad Hussain
A number of natural language processing tools for Urdu language processing have been developed in the past few years for word segmentation, part of speech tagging, chunking, named entity recognition and parsing. Corpora, especially treebanks, are essential data resources for language processing. This work presents the development and evaluation of an Urdu treebank, the CLE-UTB and a statistical parser
-
Semantics-aware typographical choices via affective associations Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-08 Tugba Kulahcioglu, Gerard de Melo
With the tens of thousands of fonts that are now readily available, it is non-trivial to select the most suitable font for a given use case. Considering the impact of the choice of font on human perception of the text, there is a strong need for semantic font search and recommendation. Aiming to fulfill this need, we induce a typographical lexicon providing associations between words and fonts. For
-
Building referring expression corpora with and without feedback Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-08 Danillo da Silva Rocha, Ivandré Paraboni
The design of data collection experiments involving human participants is a common task in Referring Expression Generation (REG) and related fields. Many (or most) REG data collection tasks are implemented by making use of a human–computer (e.g., web-based) communicative setting, in which participants do not have any particular addressee in mind and do not receive any feedback regarding the appropriateness
-
Evaluating human corrections in a computer-assisted speaker diarization system Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-07-06 Pierre-Alexandre Broux, Simon Petitrenaud, Sylvain Meignier, Jean Carrive, David Doukhan
In this paper, we present a framework to evaluate the human corrections of a speaker diarization system. We propose four elementary actions to correct the diarization (“Create a boundary”, “Delete a boundary”, “Create a speaker label” and “Change the speaker label”) and we propose an automaton to simulate the correction sequence. A metric is described to evaluate the correction cost. The framework
-
Evaluating cross-lingual textual similarity on dictionary alignment problem Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-06-29 Yiğit Sever, Gönenç Ercan
Bilingual or even polylingual word embeddings created many possibilities for tasks involving multiple languages. While some tasks like cross-lingual information retrieval aim to satisfy users’ multilingual information needs, some enable transferring valuable information from resource-rich languages to resource-poor ones. In any case, it is important to build and evaluate methods that operate in a cross-lingual
-
ViMs: a high-quality Vietnamese dataset for abstractive multi-document summarization Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-06-25 Nhi-Thao Tran, Minh-Quoc Nghiem, Nhung T. H. Nguyen, Ngan Luu-Thuy Nguyen, Nam Van Chi, Dien Dinh
Automatic text summarization is important in this era due to the exponential growth of documents available on the Internet. In the Vietnamese language, VietnameseMDS is the only publicly available dataset for this task. Although the dataset has 199 clusters, there are only three documents in each cluster, which is small compared to typical datasets in English. This motivates us to construct ViMs—a
-
C2SI corpus: a database of speech disorder productions to assess intelligibility and quality of life in head and neck cancers Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-06-15 Virginie Woisard, Corine Astésano, Mathieu Balaguer, Jérôme Farinas, Corinne Fredouille, Pascal Gaillard, Alain Ghio, Laurence Giusti, Imed Laaridh, Muriel Lalain, Benoît Lepage, Julie Mauclair, Olivier Nocaudie, Julien Pinquier, Gilles Pouchoulin, Michèle Puech, Danièle Robert, Vincent Roger
Within the framework of the Carcinologic Speech Severity Index (C2SI) INCa Project, we collected a large database of French speech recordings aiming at validating Disorder Severity Indexes. Such a database will be useful for measuring the impact of oral and pharyngeal cavity cancer on speech production. It will make it possible to assess patients’ quality of life after treatment. The database is composed of
-
Writer’s uncertainty identification in scientific biomedical articles: a tool for automatic if-clause tagging Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-06-11 Paolo Omero, Massimiliano Valotto, Riccardo Bellana, Ramona Bongelli, Ilaria Riccioni, Andrzej Zuczkowski, Carlo Tasso
In a previous study, we manually identified seven categories (verbs, non-verbs, modal verbs in the simple present, modal verbs in the conditional mood, if, uncertain questions, and epistemic future) of Uncertainty Markers (UMs) in a corpus of 80 articles from the British Medical Journal randomly sampled from a 167-year period (1840–2007). The UMs detected on the basis of an epistemic stance approach
-
Language resources for Maghrebi Arabic dialects’ NLP: a survey Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-04-25 Jihene Younes, Emna Souissi, Hadhemi Achour, Ahmed Ferchichi
Diglossia is one of the main characteristics of the Arabic language. In Arab countries, there are three forms of Arabic that co-exist: Classical Arabic (CA), which is mainly used in the Quran and in several classical literary texts, Modern Standard Arabic (MSA), which descends from CA and is used as the official language, and various regional colloquial varieties of Arabic that are usually referred to as Arabic
-
Mapping languages: the Corpus of Global Language Use Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-04-08 Jonathan Dunn
This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping. First, the corpus provides a representation of where national varieties of major languages are used (e.g., English, Arabic, Russian) together with consistently collected data for each variety. Second, the paper evaluates a language identification model that supports
-
A multi-platform dataset for detecting cyberbullying in social media Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-04-06 David Van Bruwaene, Qianjia Huang, Diana Inkpen
Recent work on cyberbullying detection relies on using machine learning models with text and metadata in small datasets, mostly drawn from single social media platforms. Such models have succeeded in predicting cyberbullying when dealing with posts containing the text and the metadata structure as found on the platform. Instead, we develop a multi-platform dataset that consists purely of the text from
-
Fake opinion detection: how similar are crowdsourced datasets to real data? Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-03-28 Tommaso Fornaciari, Leticia Cagnina, Paolo Rosso, Massimo Poesio
Identifying deceptive online reviews is a challenging task for Natural Language Processing (NLP). Collecting corpora for the task is difficult, because normally it is not possible to know whether reviews are genuine. A common workaround involves collecting (supposedly) truthful reviews online and adding them to a set of deceptive reviews obtained through crowdsourcing services. Models trained this
-
Comparing web-crawled and traditional corpora Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-03-19 Václav Cvrček; Zuzana Komrsková; David Lukeš; Petra Poukarová; Anna Řehořková; Adrian Jan Zasina; Vladimír Benko
Using a multi-dimensional (MD) analysis of register variability, the study compares two corpora of Czech: Koditex, a “traditional” corpus carefully designed using various sources with rich metadata, and Araneum Bohemicum Maximum, a web-crawled corpus with an opportunistic composition representative of the “searchable” web. Both types of corpora are projected onto the space induced by the MD model,
-
The Multilingual Student Translation corpus: a resource for translation teaching and research Lang. Resour. Eval. (IF 1.014) Pub Date : 2020-01-25 Sylviane Granger, Marie-Aude Lefer
The Multilingual Student Translation (MUST) corpus is a corpus of translations produced by foreign language learners or trainee translators collected collaboratively by a large number of partner teams internationally. The corpus represents a prime example of community sourcing, as the data are collected and shared by the members of the MUST network. Two key characteristics of the corpus are that it
-
NorthEuraLex: a wide-coverage lexical database of Northern Eurasia Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-11-30 Johannes Dellert; Thora Daneyko; Alla Münch; Alina Ladygina; Armin Buch; Natalie Clarius; Ilja Grigorjew; Mohamed Balabel; Hizniye Isabella Boga; Zalina Baysarova; Roland Mühlenbernd; Johannes Wahle; Gerhard Jäger
This article describes the first release version of a new lexicostatistical database of Northern Eurasia, which includes Europe as the most well-researched linguistic area. Unlike in other areas of the world, where databases are restricted to covering a small number of concepts as far as possible based on often sparse documentation, good lexical resources providing wide coverage of the lexicon are
-
Automatic dialect identification system for Kannada language using single and ensemble SVM algorithms Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-11-21 Nagaratna B. Chittaragi; Shashidhar G. Koolagudi
In this paper, an automatic dialect identification (ADI) system is proposed by extracting spectral and prosodic features for the Kannada language. A new dialect dataset is collected from native speakers of Kannada (a Dravidian language). This dataset includes five distinct dialects of Kannada representing five geographical regions of Karnataka state. Investigation of the significance
-
Reproduction, replication, analysis and adaptation of a term alignment approach Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-11-18 Andraž Repar; Matej Martinc; Senja Pollak
In this paper, we look at the issue of reproducibility and replicability in bilingual terminology alignment (BTA). We propose a set of best practices for reproducibility and replicability of NLP papers and analyze several influential BTA papers from this perspective. Next, we present our attempts at replication and reproduction, where we focus on a bilingual terminology alignment approach described
-
Prosodic word boundary detection from Bengali continuous speech Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-11-13 Tanmay Bhowmik; Shyamal Kumar Das Mandal
Detection of word boundaries in continuous speech is a tedious process due to the absence of a definite pause or silence in the word boundary position. Thus, continuous speech recognition is a very challenging task. However, the prosodic word boundaries, unlike the written word boundaries, can be predicted using the prosodic parameters of continuous speech. This paper proposes a method for detecting
-
A pragmatic guide to geoparsing evaluation Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-09-19 Milan Gritta; Mohammad Taher Pilehvar; Nigel Collier
Empirical methods in geoparsing have thus far lacked a standard evaluation framework describing the task, metrics and data used to compare state-of-the-art systems. Evaluation is further made inconsistent, even unrepresentative of real world usage by the lack of distinction between the different types of toponyms, which necessitates new guidelines, a consolidation of metrics and a detailed toponym
-
The Corpus of American Danish: a language resource of spoken immigrant Danish in North and South America Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-08-23 Karoline Kühl; Jan Heegård Petersen; Gert Foget Hansen
This paper describes the ‘Corpus of American Danish’ (CoAmDa), a newly established corpus of spoken immigrant Danish in North and South America. The CoAmDa amounts to approx. 1.7 million tokens, making it one of the largest corpora of heritage language at present. With regard to text type, the CoAmDa is a non-standard multilingual spoken language resource as Danish is mixed with American English, Canadian
-
A Finnish news corpus for named entity recognition Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-08-01 Teemu Ruokolainen; Pekka Kauppinen; Miikka Silfverberg; Krister Lindén
We present a corpus of Finnish news articles with a manually prepared named entity annotation. The corpus consists of 953 articles (193,742 word tokens) with six named entity classes (organization, location, person, product, event, and date). The articles are extracted from the archives of Digitoday, a Finnish online technology news source. The corpus is available for research purposes. We present
-
The undergraduate learner translator corpus: a new resource for translation studies and computational linguistics Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-07-24 Reem F. Alfuraih
Around the world, a growing interest has been seen in learner translator corpora, which are invaluable resources for teaching and research. This paper introduces a new resource to support researchers from different interdisciplinary areas such as computational linguistics, descriptive translation studies, computer-aided translation technology, Arabic machine translation applications, cognitive science
-
Computational text analysis within the Humanities: How to combine working practices from the contributing fields? Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-06-26 Jonas Kuhn
This position paper is based on a keynote presentation at the COLING 2016 Workshop on Language Technology for Digital Humanities in Osaka, Japan. It departs from observations about working practices in Humanities disciplines following a hermeneutic tradition of text interpretation versus the method-oriented research strategies in Computational Linguistics (CL). The respective praxeological traditions
-
Historical corpora meet the digital humanities: the Jerusalem Corpus of Emergent Modern Hebrew Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-06-04 Aynat Rubinstein
The paper describes the creation of the first open access multi-genre historical corpus of Emergent Modern Hebrew, made possible by implementation of digital humanities methods in the process of corpus curation, encoding, and dissemination. Corpus contents originate in the Ben-Yehuda Project, an open access repository of Hebrew literature online, and in digital images curated from the collections of
-
Digitization of data for a historical medical dictionary Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-06-04 Juhani Norri; Marko Junkkari; Timo Poranen
What are known as specialized or specialist dictionaries are much more than lists of words and their definitions with occasional comments on things such as synonymy and homonymy. That is to say, a particular specialist term may be associated with many other concepts, including quotations, different senses, etymological categories, semantic categories, superordinate and subordinate terms in the terminological
-
Spanish corpora for sentiment analysis: a survey Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-05-31 María Navas-Loro; Víctor Rodríguez-Doncel
Corpora play an important role when training machine learning systems for sentiment analysis. However, Spanish is underrepresented in these corpora, as most primarily include English texts. This paper describes 20 Spanish-language text corpora—collected to support different tasks related to sentiment analysis, ranging from polarity to emotion categorization. We present a brand-new framework for the
-
Automatic detection and correction of discourse marker errors made by Spanish native speakers in Portuguese academic writing Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-05-06 Lianet Sepúlveda-Torres; Magali Sanches Duran; Sandra Maria Aluísio
Discourse markers are words and expressions (such as: firstly, then, for example, because, as a result, likewise, in comparison, in contrast) that explicitly state the relational structure of the information in the text, i.e. signalling a sequential relationship between the current message and the previous discourse. Using these markers improves the cohesion and coherence of texts, facilitating reading
-
Dialogue analysis: a case study on the New Testament Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-05-02 Chak Yan Yeung; John Lee
There has been much research on the nature of dialogues in the Bible. While the research literature abounds with qualitative analyses on these dialogues, they are rarely corroborated on statistics from the entire text. In this article, we leverage a corpus of annotated direct speech in the New Testament, as well as recent advances in automatic speaker and listener identification, to present a quantitative
-
Restoring Arabic vowels through omission-tolerant dictionary lookup Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-25 Alexis Amid Neme; Sébastien Paumier
Vowels in Arabic are optional orthographic symbols written as diacritics above or below letters. In Arabic texts, typically more than 97 percent of written words do not explicitly show any of the vowels they contain; that is to say, depending on the author, genre and field, less than 3 percent of words include any explicit vowel. Although numerous studies have been published on the issue of restoring
-
Building the first comprehensive machine-readable Turkish sign language resource: methods, challenges and solutions Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-17 Gülşen Eryiğit; Cihat Eryiğit; Serpil Karabüklü; Meltem Kelepir; Aslı Özkul; Tuğba Pamay; Dilara Torunoğlu-Selamet; Hatice Köse
This article describes the procedures employed during the development of the first comprehensive machine-readable Turkish Sign Language (TiD) resource: a bilingual lexical database and a parallel corpus between Turkish and TiD. In addition to sign language specific annotations (such as non-manual markers, classifiers and buoys) following the recently introduced TiD knowledge representation (Eryiğit
-
Exploiting languages proximity for part-of-speech tagging of three French regional languages Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-16 Pierre Magistry; Anne-Laure Ligozat; Sophie Rosset
This paper presents experiments in part-of-speech tagging of low-resource languages. It addresses the case where neither labeled data in the targeted language nor a parallel corpus is available. We rely only on the proximity of the targeted language to a better-resourced language. We conduct experiments on three French regional languages. We try to exploit this proximity with two main strategies: delexicalization
-
A hybrid approach for paraphrase identification based on knowledge-enriched semantic heuristics Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-16 Muhidin Mohamed; Mourad Oussalah
In this paper, we propose a hybrid approach for sentence paraphrase identification. The proposal addresses the problem of evaluating sentence-to-sentence semantic similarity when the sentences contain a set of named-entities. The essence of the proposal is to distinguish the computation of the semantic similarity of named-entity tokens from the rest of the sentence text. More specifically, this is
-
Studying the history of the Arabic language: language technology and a large-scale historical corpus Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-12 Yonatan Belinkov; Alexander Magidow; Alberto Barrón-Cedeño; Avi Shmidman; Maxim Romanov
Arabic is a widely spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. Therefore, studying the history of the language has so far been mostly limited to manual analyses on a small scale. In this work, we present a large-scale historical corpus of the written Arabic language, spanning 1400 years. We describe our
-
Approaching terminological ambiguity in cross-disciplinary communication as a word sense induction task: a pilot study Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-12 Julie Mennes; Ted Pedersen; Els Lefever
Cross-disciplinary communication is often impeded by terminological ambiguity. Hence, cross-disciplinary teams would greatly benefit from using a language technology-based tool that allows for the (at least semi-) automated resolution of ambiguous terms. Although no such tool is readily available, an interesting theoretical outline of one does exist. The main obstacle for the concrete realization of
-
Digitising Swiss German: how to process and study a polycentric spoken language Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-11 Yves Scherrer; Tanja Samardžić; Elvira Glaser
Swiss dialects of German are, unlike many dialects of other standardised languages, widely used in everyday communication. Despite this, automatic processing of Swiss German is still a considerable challenge because it is mostly a spoken variety and is subject to considerable regional variation. This paper presents the ArchiMob corpus, a freely available general-purpose corpus
-
From 0 to 10 million annotated words: part-of-speech tagging for Middle High German Lang. Resour. Eval. (IF 1.014) Pub Date : 2019-04-08 Sarah Schulz; Nora Ketschik
By building a part-of-speech (POS) tagger for Middle High German, we investigate strategies for dealing with a low-resource, diverse and non-standard language in the domain of natural language processing. We highlight various aspects such as the data quantity needed for training and the influence of data quality on tagger performance. Since the lack of annotated resources poses a problem for training