NERO: A Biomedical Named-entity (Recognition) Ontology with a Large, Annotated Corpus Reveals Meaningful Associations Through Text Embedding

Kanix Wang; Robert Stevens; Halima Alachram; Yu Li; Larisa Soldatova; Ross King; Sophia Ananiadou; Maolin Li; Fenia Christopoulou; Jose Luis Ambite; Sahil Garg; Ulf Hermjakob; Daniel Marcu; Emily Sheng; Tim Beißbarth; Edgar Wingender; Aram Galstyan; Xin Gao; Brendan Chambers; Bohdan B. Khomtchouk; James A. Evans; Andrey Rzhetsky

doi:10.1101/2020.11.05.368969

Abstract

Machine reading is essential for unlocking valuable knowledge contained in the millions of existing biomedical documents. Over the last two decades¹,², the most dramatic advances in machine reading have followed in the wake of critical corpus development³. Large, well-annotated corpora have been associated with punctuated advances in machine-reading methodology and automated knowledge-extraction systems, in the same way that ImageNet⁴ was fundamental for developing machine vision techniques. This study contributes six components to an advanced named-entity analysis tool for biomedicine: (a) a new Named-Entity Recognition Ontology (NERO) developed specifically for describing entities in biomedical texts, which accounts for diverse levels of ambiguity and bridges the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named-entity classes; (c) pictographs for all named entities, to ease the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf tools for automated named-entity recognition; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

https://pypi.org/project/NERO-nlp/

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.