A document-level information extraction pipeline for layered cathode materials for sodium-ion batteries

Gou, Yuxiao; Zhang, Yiping; Zhu, Jian; Shu, Yidan

doi:10.1038/s41597-024-03196-1

Analysis
Open access
Published: 11 April 2024

Scientific Data volume 11, Article number: 372 (2024)

Abstract

Natural language processing techniques enable the extraction of valuable information from large amounts of published literature for applications of data science and technology, e.g., machine learning in the field of materials science. Nevertheless, the automated extraction of data from full-text documents remains a complex task. We propose a document-level natural language processing pipeline for literature extraction of comprehensive information on layered cathode materials for sodium-ion batteries. The pipeline enhances entity recognition with contextual supplementary information while capturing the article structure. A heuristic multi-level relation extraction algorithm is then employed to extract experimental parameters and complex performance relationships. We successfully extracted a comprehensive dataset containing 5265 records from 1747 documents, encompassing essential information such as chemical composition, synthesis parameters, and electrochemical properties. By implementing our pipeline, we have made significant progress in overcoming the challenges associated with data scarcity in battery informatics. The extracted datasets provide a valuable resource for further research and development in the field of layered cathode materials.

Introduction

Sodium-ion batteries (SIBs) have garnered significant attention because their working principles are similar to those of lithium-ion batteries and because sodium resources are abundant, cheap, and widely distributed. Various types of materials, including transition metal oxides^1,2, Prussian blue compounds³, and polyanion-type compounds⁴, have been studied as cathode materials for SIBs. Among these, layered transition metal oxides (LTMOs), including NaxTMO2 (TM = Fe, Ni, Co, Mn, etc.), have become the most promising cathode materials for SIBs owing to their large specific capacity, high ionic conductivity, and feasible preparation conditions. This has led to extensive research efforts aimed at developing SIBs as a viable alternative to lithium-ion batteries for large-scale energy storage systems.

In recent years, the number of published research papers dedicated to LTMOs as cathode materials for SIBs has increased significantly. However, many of these papers still rely on traditional trial-and-error methods for material development, which can be time-consuming. One potential way to accelerate the design and development of new materials is to adopt a systematic “materials-by-design” approach based on data science techniques, with which researchers can analyze large amounts of data and extract insights to guide the design of new materials. This approach has the potential to significantly speed up the discovery of new cathode materials for SIBs. However, the non-machine-readable formats of publications lead to data scarcity in the field of battery materials informatics. This scarcity hinders the training of property predictors, because relevant data must be curated manually and laboriously from the literature. Ling⁵ provided a comprehensive summary of the various types of datasets that are applicable to battery informatics studies. However, there is still a lack of datasets specifically dedicated to the cycling and rate performance of layered cathode materials for SIBs. Therefore, methods for extracting data automatically, rapidly, and accurately have become increasingly necessary.

The application of natural language processing (NLP) techniques to automatically extract data on organic and inorganic chemical substances from articles in the fields of chemistry and materials science has shown promising results^6,7,8,9. Information extraction (IE) from written text is an extensively studied area within NLP. The dominant paradigm of information extraction today is “self-supervised learning” with transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT)¹⁰, SciBERT¹¹ and BatteryBERT¹². These models are pre-trained on massive corpora of unlabeled text to learn contextual embeddings and are then fine-tuned through supervised learning on specific datasets, enabling the domain-specific models to surpass the performance of BERT in context-aware NLP applications such as text classification, named entity recognition (NER), and question answering¹³. A common workflow for text classification, NER and relation extraction is to feed labelled inputs to BERT and use the output vector embedding of each sentence or word, along with the corresponding labels, as inputs to a task-specific machine learning model that learns to predict these labels. In practice, due to constraints such as the size of the labelled data and the complexity of the task, a mixture of model-based and rule-based processing is adopted for a given task to balance accuracy and computational efficiency¹⁴. For example, ChemDataExtractor version 2.1 incorporates a NER system that combines BERT, Conditional Random Field (CRF), rules, and dictionaries. This combined approach achieves results that are comparable to the state of the art for recognizing both organic and inorganic entities¹⁵.

In battery-related studies, El-Bousiydy¹⁶ et al. extracted ten characteristics related to the electrochemical performance of lithium-ion batteries (LIBs) (mass loading, porosity, thickness, surface area, electrode composition, electrolyte volume, electrolyte composition, separator, voltage cut-off, and voltage range) and provided a general perspective. Kononova¹⁷ et al. used an automated extraction pipeline to extract data on inorganic chemical substances, including cathode materials for LIBs, solely from the solid-state synthesis passages of scientific publications in order to create a database of “codified recipes” for solid-state synthesis. Huang¹⁸ et al. modified ChemDataExtractor to automatically generate information from more than 229,000 battery-related papers. They constructed a large database covering five material properties: capacity, voltage, conductivity, Coulombic efficiency, and energy. In this database, rate capacity and cycling capacity are not separated, and as a result currents given in the form of “C-rates” cannot be standardized. In summary, the majority of previous work utilizing NLP techniques in materials science focused on constructing individual databases of either material synthesis parameters or properties by extracting from abstracts or partial texts. However, to the best of our knowledge, limited efforts have been made to comprehensively extract synthesis parameters and performance data from full texts and to build a systematic database covering both the synthesis and the performance of materials, specifically for battery cathode materials.

In this work, we construct a generic pipeline for extracting both material properties and synthesis parameters of cathode materials for SIBs. To the best of our knowledge, this work represents the first comprehensive method for extracting the synthesis parameters and properties of sodium layered cathode materials from literature data. Each instance in the dataset covers not only the chemical composition and synthesis parameters, such as sintering temperature and sintering time, but also functional properties such as cycling capacity, number of cycles, rate capacity, test current, and test voltage range. In total, 1747 articles on sodium cathode materials were filtered (see Document classification for detailed filtering strategies) from a corpus of 63447 articles from the publishers Elsevier and Royal Society of Chemistry (RSC), and a dataset of 5265 instances was extracted automatically. For the extraction of complex relations, including materials, synthesis parameters, and cycling and rate performance, our method achieves composite F1 scores of 81.14% and 82.67% on articles from Elsevier and RSC, respectively.

Results

As Fig. 1 shows, our automated text mining pipeline for transition metal layered cathode materials involves several stages as follows:

Firstly, we use document classifiers to select scientific articles in extensible markup language (XML) and hypertext markup language (HTML) formats that cover the topics we are interested in. The purpose of document classification is to accurately identify the topic of each article (see Document classification).

Secondly, document preprocessing, comprising paragraph classification and text preprocessing, is performed on the original archive corpus in order to generate a complete and standardized document record and filter out irrelevant information. Paragraph classification isolates paragraphs on different topics for the different downstream tasks. Text preprocessing unifies the description of chemical expressions and related properties (see Document preprocessing).

Thirdly, NER is performed, which includes chemical named entity recognition (CNER), chemical supplementary information extraction (CSIE), electrochemical property extraction, and synthesis parameter extraction. The NER method aims to extract the named entities, properties, and property values of cathode materials from English texts. CSIE refers to the process of enriching the information associated with identified chemical entities and unifying the sequence of these entities. This process primarily involves identifying abbreviation definitions and determining variable values (see Named entity recognition).

Fourthly, relation extraction is carried out. For electrochemical property relation extraction, sentence classification is used to identify the target sentences associated with different electrochemical properties. Relation extraction yields specific tuple relations of element contents and properties, and interdependency parsing is used to find links between specific materials and their property data fragments.

Finally, the extracted tuple entities containing the digital object identifier (DOI) of the article, the year of publication, the layered cathode material entity and its abbreviation, property and property values are automatically compiled into a highly structured format to form the layered cathode materials for SIBs database.

Corpus of papers

We found that 29 manually annotated articles carrying the label “Sodium layered oxide cathode” in the original dataset were not successfully identified; they were added to the final corpus, resulting in a total of 1747 articles. Fig. 2 shows the distribution of research topics of articles in the sodium layered cathode material corpus. The largest group, pure experimental synthesis research, comprises 515 articles (about 29.48%). Materials modified by doping and by coating are studied in 484 articles (about 27.70%) and 212 articles (about 12.14%), respectively. It is worth noting that 346 articles (about 19.81%) combine the research with computational science, which indicates that computational science plays an important role in the comprehensive research of sodium layered cathode materials. The corpus of this study contains 1702 articles on sodium layered cathode materials published from 2010 to June 2023 by Elsevier and RSC (the difference from the 1747 articles above arises because the dataset also contains articles published before 2010). Fig. 3 shows the trend in the number of articles on sodium layered cathode materials published by different publishers over the past 14 years. In general, with the rise of SIBs, research on layered cathode materials for SIBs has received increasing attention from scholars.

Named entity recognition

In the literature on layered cathode materials, the relationships between chemical entities and their properties to be extracted can be presented as a complex ternary tuple <chemical name, properties, physical parameters>. As shown in Fig. 4, chemical names encompass various forms of cathode materials for SIBs, such as chemical formulae (e.g., Na_0.9Mn_0.52Fe_0.28Cu_0.2O₂ and NaNi_0.5-xMn_0.3Ti_0.2Sb_xO₂), abbreviations (e.g., CFM-Cu and NMTSb_x), or pronoun phrases (e.g., this material, or even this sample when x = 0.15). The properties comprise the name, value, and unit of the material’s property. Physical parameters encompass both synthesis parameters and the test condition parameters used when measuring electrochemical properties. The synthesis parameters include the sintering temperature and sintering time of the material, which are crucial factors in its fabrication and directly affect the physical properties and performance of the synthesized material. The test condition parameters, such as current density and voltage range, describe how the material responds to different levels of electrical current and voltage and are therefore essential for understanding its electrochemical performance. Considering both kinds of parameters is crucial for a comprehensive analysis of the material’s physical characteristics and its suitability for specific applications. In this work, a total of 1747 documents on layered cathode materials for SIBs were obtained through automatic document classification. After document preprocessing, a NER system based on the CNER module of ChemDataExtractor version 2.1 and heuristic rules was explored. Here, we describe the NER system, which provides a sequence of entities for subsequent relation extraction, as follows.

Chemical named entity recognition and chemical supplementary information extraction

CNER is a fundamental task in IE of material literature^19,20. However, accurate location of the entity boundaries by chemical entity recognition models is challenging due to the wide variety of materials, the flexible formation of chemical entities, and the limited availability of labeled data for model training. In the case of research papers focusing on layered cathode material systems, the presence of chemical formulas with stoichiometric and elemental variables, as well as chemical formula abbreviations (e.g., NaNi_0.5-xMn_0.3Ti_0.2Sb_xO₂, O3-NaNi_0.45Mn_0.3Ti_0.2M_0.05O₂, and NNMO), often leads to increased uncertainty in subsequent relation extraction. Therefore, CSIE becomes particularly important to accurately identify the values represented by these variables and the definitions of the abbreviated formulas.

To address this challenge, a hybrid CNER post-processing method is proposed in this study. This method combines the domain knowledge of cathode materials with the text2chem (https://github.com/CederGroupHub/text2chem) toolkit, a regex-based text parser developed by Kononova et al. that converts chemical terms and entities into chemical data structures. It builds upon the CNER module of ChemDataExtractor to take advantage of the supplementary information provided in parentheses after the entities (e.g., “(x = 0.03,0.05,0.07)” and “(M = Nb/Mo/Cr)”). The proposed approach aims to provide a sequence of entities consisting of defined elements for subsequent relation extraction tasks.

Identify the values of the variables

The tokens labeled as chemical entities by ChemDataExtractor are parsed with text2chem to cover the multiple specific materials that may correspond to a chemical formula with variables. Specifically, when stoichiometric variables such as x, y, and z, or element variables such as M (metal) and TM (transition metal), are found in the chemical entity, multiple stoichiometries or elements are extracted from a single material mention. For instance, “O3-NaNi_0.45Mn_0.3Ti_0.2M_0.05O₂ (M = Nb/Mo/Cr)” is converted to “O3-NaTi_0.2Nb_0.05Mn_0.3Ni_0.45O₂”, “O3-NaTi_0.2Mo_0.05Mn_0.3Ni_0.45O₂”, and “O3-NaTi_0.2Cr_0.05Mn_0.3Ni_0.45O₂”. Similarly, “NaNi_0.5-xMn_0.3Ti_0.2Sb_xO₂ (x = 0.03, 0.05, 0.07)” is converted to “NaTi_0.2Mn_0.3Ni_0.47Sb_0.03O₂”, “NaTi_0.2Mn_0.3Ni_0.45Sb_0.05O₂”, and “NaTi_0.2Mn_0.3Ni_0.43Sb_0.07O₂”.
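The variable expansion described above can be illustrated with a minimal sketch. It does not use text2chem or ChemDataExtractor; the function names (`parse_variable_note`, `eval_amount`, `expand_formula`), the plain-ASCII formula notation, and the restriction to the variables x, y, z and simple sums and differences are all assumptions made for illustration. Unlike the converted formulae quoted above, this sketch keeps the original element order.

```python
import re

# Element symbol followed by an optional amount expression such as "0.5-x".
# Limitation of this sketch: a one-letter element directly followed by a
# variable (e.g. "Vx") would be misread as a two-letter symbol.
TOKEN = re.compile(r"([A-Z][a-z]?)([0-9.+\-xyz]*)")


def parse_variable_note(note):
    """Split a note like 'x = 0.03, 0.05, 0.07' or 'M = Nb/Mo/Cr'."""
    name, values = note.split("=", 1)
    return name.strip(), [v.strip() for v in re.split(r"[,/]", values) if v.strip()]


def eval_amount(expr, values):
    """Evaluate a sum/difference of numbers and variables, e.g. '0.5-x'."""
    if not expr:
        return 1.0  # no explicit amount means one atom per formula unit
    total = 0.0
    for sign, term in re.findall(r"([+-]?)(\d*\.?\d+|[xyz])", expr):
        number = values[term] if term in values else float(term)
        total += -number if sign == "-" else number
    return round(total, 4)  # rounding hides binary floating-point noise


def expand_formula(formula, note):
    """Return one concrete formula per value listed in the note."""
    name, raw_values = parse_variable_note(note)
    # Keep a phase prefix such as "O3-" or "P2-" out of the element parsing.
    phase, body = re.match(r"^([OP]\d-)?(.*)$", formula).groups()
    phase = phase or ""
    results = []
    for raw in raw_values:
        if name.islower():
            # Stoichiometric variable: substitute the number and evaluate.
            parts = []
            for element, expr in TOKEN.findall(body):
                amount = eval_amount(expr, {name: float(raw)})
                parts.append(element + ("" if amount == 1 else f"{amount:g}"))
            results.append(phase + "".join(parts))
        else:
            # Element variable: replace "M" but leave symbols like "Mn" alone.
            pattern = rf"{re.escape(name)}(?![a-z])"
            results.append(phase + re.sub(pattern, raw, body))
    return results


stoichiometric = expand_formula("NaNi0.5-xMn0.3Ti0.2SbxO2", "x = 0.03, 0.05, 0.07")
elemental = expand_formula("O3-NaNi0.45Mn0.3Ti0.2M0.05O2", "M = Nb/Mo/Cr")
```

Treating lowercase names as stoichiometric variables and everything else as element variables is a simplification; a production parser would need to handle coefficients such as 2x and nested notes.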

Abbreviation definitions

Chemical formula abbreviations are related to full chemical formulae based on their adjacency to parentheses and naming verbs. This approach is inspired by the algorithm for identifying abbreviation definitions in biomedical text by Schwartz and Hearst²¹. Usually, in an article, a chemical formula and its abbreviation appear in two types of scenarios: (a) long-form ‘(’ short-form ‘)’ and (b) long-form … naming-verb … short-form. The main difference is the possible location of the short-form, with pattern (a) in parentheses and pattern (b) after the naming verb. Here, the long-form refers to chemical entities, the short-form indicates the chemical formula abbreviation, and the naming-verb includes “marked”, “denoted”, “named”, etc. In practice, short-form and long-form are often interchangeable.

In detail, the process of extracting abbreviations and their definitions from cathode material text involves two main subtasks. The first subtask is the extraction of <long-form, short-form> pair candidates from the sentence. Both the short-form and long-form are derived from tokens labeled as chemical entities by ChemDataExtractor. Tokens successfully parsed by text2chem are marked as long-form, while tokens with two consecutive uppercase letters are marked as short-form. Once these steps are completed, a list of short-form candidate words for the long-form is generated. The second subtask is to choose the appropriate subset of words. The main idea is to count the uppercase characters shared by the long-form and the short-form. If at least two uppercase characters are shared and the uppercase letters in the short-form all appear in the long-form, a successful match is determined. For example, given the input string “O3-NaNi_0.45Mn_0.3Ti_0.2M_0.05O₂ (M = Nb/Mo/Cr, abbreviated as NMTNb, NMTMo and NMTCr, respectively)”, the following mapping relationships are returned in a list: (‘NMTNb’, ‘O3-NaTi_0.2Nb_0.05Mn_0.3Ni_0.45O_2.0’), (‘NMTMo’, ‘O3-NaTi_0.2Mo_0.05Mn_0.3Ni_0.45O_2.0’), (‘NMTCr’, ‘O3-NaTi_0.2Cr_0.05Mn_0.3Ni_0.45O_2.0’).

This algorithm is applied to all sentences in paragraphs to produce a list of mappings between chemical formula abbreviations and their corresponding chemical formula. These mappings are then used to transform chemical formula abbreviations in the literature extraction records into more informative chemical formulas corresponding to elemental compositions.

Synthesis parameter and electrochemical property extraction

There are two categories of information that are the most important to describe layered cathode materials for SIBs: electrochemical performance and synthesis method. The former is usually evaluated using cycling performance and rate performance. The latter can be covered using sintering temperature and sintering time because layered cathode materials are typically synthesized using a simple high-temperature solid-phase method²².

The paragraphs from which synthesis parameters and electrochemical properties are extracted are obtained from the preceding paragraph classification process. To cover the properties mentioned in the sentences as comprehensively as possible, we have customized multiple matching rules in the form of regular expressions, as summarized in Supplementary Table 1. In particular, three common units of current density, namely C, mA g-1, and A g-1, are considered in this study. To ensure consistency, conversion expressions between different units of current density that may appear in the article are taken into account. For example, conversions such as “1 C = 150 mAg-1”, “20 mAg-1 (0.1 C)”, and “0.2 C (50 mAg-1)” are used to standardize the units of current density to mAg-1. To precisely determine capacity retention, we introduce the property descriptors “retention” and “capacity”. For the number of cycles, descriptors such as “first” and “initial” are considered to indicate the initial cycle. Subsequently, the character sequence covered by the aforementioned patterns is tokenized to extract the property values.
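The unit standardization can be sketched as follows. This is a simplified illustration that only handles an explicit definition of the form “1 C = N mAg-1”; the parenthetical forms quoted above, the regular expression, and the function names are not taken from Supplementary Table 1 and are assumptions.

```python
import re

# Explicit C-rate definition such as "1 C = 150 mAg-1". The lookbehind
# prevents "0.1 C = 15 mAg-1" from being read as a definition of 1 C.
C_DEFINITION = re.compile(r"(?<![\d.])1\s*C\s*=\s*(\d+(?:\.\d+)?)\s*mA\s*g-1")


def find_c_definition(text):
    """Return the current density (mA g-1) that the text assigns to 1 C."""
    match = C_DEFINITION.search(text)
    return float(match.group(1)) if match else None


def to_mA_per_g(value, unit, one_c=None):
    """Convert a current density to mA g-1; None if a C-rate is undefined."""
    unit = unit.replace(" ", "")
    if unit == "mAg-1":
        return float(value)
    if unit == "Ag-1":
        return float(value) * 1000.0
    if unit == "C":
        return None if one_c is None else float(value) * one_c
    raise ValueError(f"unsupported current density unit: {unit!r}")


one_c = find_c_definition("The cells were cycled at 0.2 C (1 C = 150 mAg-1).")
converted = to_mA_per_g(0.2, "C", one_c)
```

Returning `None` for a C-rate without a known definition keeps the record rather than guessing a conversion factor, which matches the observation later in the text that some records retain C as the unit.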

The final property extraction is recorded as a key-value pair, where the key consists of the property name with its unit, and the value represents the corresponding property value. To illustrate the property extraction process, consider the following example sentence: “After 300 cycles, a discharge capacity of 47 mAh g-1 is obtained, only equal to 66% of the initial capacity.”²³. Based on the provided sentence, the extracted property values can be recorded as a key-value pair in the following format “{‘cycle’: [300, 1], ‘capacity: mAhg-1’: [47], ‘retention: %’: [66]}”.
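A minimal sketch of how such a key-value record could be produced for the example sentence is given below. The three regular expressions are illustrative stand-ins, not the rules from Supplementary Table 1, and the function names are hypothetical.

```python
import re

# Illustrative patterns only; the pipeline's real rules are more extensive.
CYCLE = re.compile(
    r"(\d+)(?:st|nd|rd|th)?\s+cycles?\b|\b(initial|first)\b", re.IGNORECASE
)
CAPACITY = re.compile(r"(\d+(?:\.\d+)?)\s*mA\s*h\s*g-1")
RETENTION = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def number(text):
    """Keep integers as int so that '47' becomes 47 rather than 47.0."""
    return float(text) if "." in text else int(text)


def extract_properties(sentence):
    """Return a dict keyed by 'property: unit' with lists of values."""
    record = {}
    # An explicit cycle count is used as-is; 'initial'/'first' map to cycle 1.
    cycles = [int(m.group(1)) if m.group(1) else 1 for m in CYCLE.finditer(sentence)]
    if cycles:
        record["cycle"] = cycles
    capacities = [number(m.group(1)) for m in CAPACITY.finditer(sentence)]
    if capacities:
        record["capacity: mAhg-1"] = capacities
    lowered = sentence.lower()
    # A percentage counts as retention only next to one of the descriptors.
    if "retention" in lowered or "capacity" in lowered:
        retentions = [number(m.group(1)) for m in RETENTION.finditer(sentence)]
        if retentions:
            record["retention: %"] = retentions
    return record


sentence = (
    "After 300 cycles, a discharge capacity of 47 mAh g-1 is obtained, "
    "only equal to 66% of the initial capacity."
)
record = extract_properties(sentence)
```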

Relation extraction

Extracting multiple relationships from cathode material papers is a challenging task. It involves the extraction of two types of property relationships, as shown in Fig. 4. The first type encompasses electrochemical property relationships, which can be further classified into cycling performance property relations and rate performance property relations. They are represented as the complex tuples <cathode material, number of cycles, capacity/retention, current density, voltage range> and <cathode material, capacity, current density, voltage range>, respectively. The second type pertains to synthesis parameter relationships, which can be represented as a binary tuple <sintering temperature, sintering time>. Multiple cathode materials, their specified properties, and the corresponding property values are often reported in separate sentences. Alternatively, multiple properties and their corresponding values may be reported for a single cathode material. These complexities pose obstacles for relation extraction, especially when working with a limited corpus of cathode materials for SIBs. Supervised hierarchical relation extraction algorithms require more than approximately 70,000 labelled sentence samples²⁴. Even semi-supervised methods require a certain number of labelled samples as seeds to start learning. It is worth noting that most relation extraction systems focus primarily on extracting relations at the sentence level and do not consider relations that span multiple sentences or paragraphs²⁵. In this study, we present a novel order- and distance-based algorithm that does not rely on labelled samples for relation extraction. Our proposed approach considers relationships across sentences. In feature-based relation extraction methods, the number of words and the word sequences between entities and properties can be used as the main syntactic features²⁶. Thus, the number and order of entity occurrences, along with the distance between entities, provide a basis for assessing relation dependencies.

Specifically, the algorithm we propose starts from the target sentence, where the target sentence for different tasks of relation extraction is determined by the combination of entities appearing in the sentence. Sentences mentioning number of cycles and capacity/retention are used for cycling performance property relation extraction, while those mentioning capacity and current density but not number of cycles are used for rate performance property relation extraction. Once the target sentence is identified, the relationship tuples for extraction are determined. Subsequently, matching commences from entities with stronger dependency relationships, considering the quantities of the entity sequences, denoted as n1 and n2.
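The selection of target sentences can be expressed as a small decision rule. The sketch below assumes that the NER stage yields a dictionary mapping entity types to lists of values; the key names and the function name are hypothetical.

```python
def target_task(entities):
    """Decide which relation extraction task a sentence feeds, if any.

    `entities` maps entity types found in the sentence to lists of values.
    The key names used here are assumptions made for this sketch.
    """
    has_cycle = bool(entities.get("cycle"))
    has_capacity = bool(entities.get("capacity"))
    has_retention = bool(entities.get("retention"))
    has_current = bool(entities.get("current_density"))

    # Number of cycles together with capacity or retention: cycling performance.
    if has_cycle and (has_capacity or has_retention):
        return "cycling"
    # Capacity and current density without a number of cycles: rate performance.
    if has_capacity and has_current and not has_cycle:
        return "rate"
    return None


cycling_example = target_task({"cycle": [300, 1], "capacity": [47], "retention": [66]})
rate_example = target_task({"capacity": [120, 95], "current_density": [20, 200]})
```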

1. Shortest Character Distance: This handles cases where a specific entity is absent from the sentence (n1 = 0 or n2 = 0). When an entity such as the cathode material or voltage range is missing, it is filled with the entity in the preceding text that is closest, by character distance, to the index of the sentence head.
2. Sequential Matching: This handles cases where n1 = n2 in the sentence; entities are paired in the order in which the two entity sequences appear. In particular, when either n1 or n2 equals 1, the single entity is replicated so that n1 = n2, following a greedy rule. The remaining cases where n1 is not equal to n2 are not treated in detail here, as they occur less frequently due to the precise nature of battery literature.

Once the matching of the two entities is completed, several tuples with associated relationships are obtained. These tuples are treated as new entity sequences and are matched again with other entities in the target tuple, repeating the process until the target tuple is filled as comprehensively as possible.
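The two matching rules and the repeated extension of tuples can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: the function names, the representation of mentions as (character offset, value) pairs, and the example values are all hypothetical, and unequal sequence lengths are simply left unmatched.

```python
def sequential_match(seq_a, seq_b):
    """Pair two entity sequences in order of appearance.

    A sequence of length 1 is replicated to the length of the other one
    (greedy rule); other unequal lengths are not handled in this sketch.
    """
    if len(seq_a) == 1 and len(seq_b) > 1:
        seq_a = seq_a * len(seq_b)
    elif len(seq_b) == 1 and len(seq_a) > 1:
        seq_b = seq_b * len(seq_a)
    if len(seq_a) != len(seq_b):
        return []
    return list(zip(seq_a, seq_b))


def nearest_preceding(mentions, sentence_start):
    """Pick the mention before the sentence head with the smallest distance.

    `mentions` is a list of (character_offset, value) pairs.
    """
    before = [(sentence_start - off, val) for off, val in mentions if off < sentence_start]
    if not before:
        return None
    return min(before, key=lambda pair: pair[0])[1]


def extend_tuples(tuples, entities, mentions=(), sentence_start=0):
    """Append one more entity type to every partial tuple."""
    if not entities:
        # Entity missing from the target sentence: fall back to the context.
        fallback = nearest_preceding(mentions, sentence_start)
        entities = [] if fallback is None else [fallback]
    pairs = sequential_match(tuples, entities)
    if not pairs:
        return [t + (None,) for t in tuples]
    return [t + (e,) for t, e in pairs]


# Hypothetical target sentence: two cycle counts, two retentions, one
# material abbreviation, and no voltage range in the sentence itself.
step1 = sequential_match([100, 200], [90, 80])
step2 = extend_tuples(step1, ["NMTNb"])
voltage_mentions = [(120, "2.0-4.0 V"), (900, "1.5-4.3 V")]
step3 = extend_tuples(step2, [], mentions=voltage_mentions, sentence_start=500)
```

Each call to `extend_tuples` treats the tuples built so far as a new entity sequence, which mirrors the repeated matching described above.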

For matching cycling performance property relation, the process begins with the number of cycles and capacity/retention, and then sequentially matches with cathode materials, voltage range, and current density.

For matching rate performance property relation, matching starts with current density and capacity, and then sequentially matches with cathode materials and voltage range.

Fig. 5 illustrates a complete information extraction process, where sentences S1, S2, and S3 are dispersed throughout the paper²⁷ with certain contextual relationships, and S3 is the target sentence.

In the NER stage, CSIE establishes linkages for contextual cathode material entities and their abbreviations. The identified entity sequence in S3, including the number of cycles and capacity/retention, provides the descriptive intent for the sentence, used for subsequent extraction of cycling performance property relation.

In the relation extraction stage depicted in Fig. 5, numerical values are used to describe the lengths of entity sequences. In Step 1, matching is performed for the relationships involving the number of cycles and capacity/retention. It is worth noting that, because capacity retention is a comparative quantity, it cannot be matched with the initial cycle; hence, in this step, both entity sequences have the same length and sequential matching is executed. In Step 2, the tuples matched in the first step are further matched with the cathode material entities, establishing linkages with the preceding supplementary information, and sequential matching is performed based on the quantity relationships. In Step 3, matching with voltage ranges is performed based on the results of the second step. Since the voltage range entity sequence is missing in S3, the shortest character distance rule links to the voltage range extracted from S2, followed by sequential matching based on the greedy approach. Step 4 follows a similar logic, resulting in the complete target tuple.

In the synthesis parameter binary property relation, which can be considered the most basic scenario for this algorithm, a greedy sequential matching strategy is used to match two sequences of entities, as depicted in Fig. 6. This strategy aims to generate the <sintering temperature, sintering time> pairs that are present in the experimental paragraphs of the article. Ultimately, the pair with the highest sintering temperature in the binary relation is selected as the synthesis parameter of the material investigated in the article.
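The binary case can be sketched in a few lines. The function names and the example temperatures and times are hypothetical; the pairing reuses the same greedy sequential rule, and the selection step takes the pair with the highest sintering temperature.

```python
def pair_sintering(temperatures_c, times_h):
    """Pair sintering temperatures with sintering times in order of mention."""
    if len(temperatures_c) == 1 and len(times_h) > 1:
        temperatures_c = temperatures_c * len(times_h)
    elif len(times_h) == 1 and len(temperatures_c) > 1:
        times_h = times_h * len(temperatures_c)
    if len(temperatures_c) != len(times_h):
        return []  # unequal lengths are left unmatched in this sketch
    return list(zip(temperatures_c, times_h))


def select_synthesis(pairs):
    """Choose the pair with the highest sintering temperature, if any."""
    return max(pairs, key=lambda pair: pair[0]) if pairs else None


# Hypothetical experimental paragraph: pre-calcination at 500 degrees C for
# 5 h, followed by sintering at 900 degrees C for 12 h.
pairs = pair_sintering([500, 900], [5, 12])
chosen = select_synthesis(pairs)
```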

Dataset overview

Using our pipeline, we successfully extracted 5265 property records of layered cathode materials for SIBs from a total of 1747 articles. Among these records, 3557 correspond to cycling performance properties, while 1708 pertain to rate performance properties.

Fig. 7 visually represents the distribution and percentage of missing values in the records, where each column corresponds to a specific property. Fig. 7a,b illustrate the distribution of missing values in the cycling and rate performance data among records of materials, respectively. In this representation, each row represents a single record, and a blank cell indicates a missing value for the property corresponding to that column. Notably, to show the data distribution more clearly, the sintering temperature and sintering time are merged into the “Synthesis parameters” column, and the upper and lower test voltages are merged into the “Voltage range” column. The current density property is presented in units of mAg-1 or C, depending on the specific record. For cycling performance, each row contains the retention or capacity value of the material after a specific number of cycles. For rate performance, each row represents the material’s capacity at a particular current density. Fig. 7c,d provide an overview of the proportion and count of missing properties in the cycling and rate performance records, respectively. In the cycling performance records, there is a notably high number of missing values for the current density attribute. This can be attributed to the dispersed nature of the descriptions of test condition parameters for cycling performance in the literature, particularly when current density is described. This observation underscores the challenge posed by the diverse ways in which current density tests are described in natural language in battery literature, which complicates accurate extraction. Specifically, challenges may arise from differences in units, formatting, or terminology across sources. Consequently, it becomes crucial for future work to develop robust extraction methods that can handle these diverse patterns and ensure accurate and reliable information extraction from battery literature.

Among the 5265 unique data records, a total of 592 unique layered oxide cathode materials were identified. Fig. 8 provides an overview of the chemical distributions of layered oxides with the general formula NaxTMO₂, where TM represents transition metals. These transition metals can be single elements or mixtures of two, three, or more elements. Additionally, Fig. 8 specifies the distribution of the top seven representative materials for each type of layered oxide material, which facilitates the identification of the most popular layered cathode materials in recent studies. For example, the figure highlights the most popular single transition metal layered cathode materials, such as NaxCrO₂, NaxMnO₂, and NaxFeO₂. It also shows the distribution of binary transition metal layered cathode materials such as Na_xMn_1-yM_yO₂, where M represents elements such as Ni, Fe, or Co. Furthermore, the figure illustrates the distribution of ternary transition metal layered cathode materials, including Na_xMn_1-y-zM_yNi_zO₂, where M can be Co, Fe, or Zn. Lastly, it shows the distribution of multi-transition metal Mn-based layered cathode materials that have been studied extensively in recent years.

Model Performance

Multi-label text classification

Table 1 presents the combined Precision, Recall, F1 score, as well as Micro average and Macro average results of the model’s binary document classification for each label on the test set. The binary classification results indicate that the model achieves the highest overall performance on the “Experiment” label, with an F1 score of 95.86%. However, the model shows a relatively poorer performance on the “Coating” label, with an F1 score of 74.33%. This discrepancy can be attributed to differences in the model’s learning effect due to the uneven distribution of training samples.

Table 1 Evaluation of multi-label text classification model.

Specifically, the “Coating” label has a limited number of training samples, resulting in a weaker performance from the model on this label. In contrast, the “Experiment” label benefits from a larger number of training samples, allowing the model to learn more effectively and perform better in its prediction (F1 score of 95.86%). This distribution of labeled data also confirms that experimental synthesis is the predominant research approach in the field of traditional materials.

Furthermore, the aggregate metric results show that the F1 scores for all models reach 85%. Additionally, the Micro average exceeds the Macro average on all indicators. This can be understood from Eqs. (8), (9), and (10): the Macro average, being a simple arithmetic mean over labels, does not account for the sample distribution, so its scores can be skewed by labels whose values differ markedly from the rest.
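The difference between the two averages can be made concrete with a small worked example. The sketch below uses the standard definitions of micro and macro averaging (assumed here to correspond to Eqs. (8) to (10), which are not reproduced in this section) and hypothetical true positive, false positive, and false negative counts for a frequent label and a rare label; the macro F1 is taken as the mean of the per-label F1 scores.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_average(counts):
    """Unweighted mean of the per-label scores."""
    scores = [prf(*label_counts) for label_counts in counts.values()]
    return tuple(sum(score[i] for score in scores) / len(scores) for i in range(3))


def micro_average(counts):
    """Scores computed from counts pooled over all labels."""
    tp = sum(label_counts[0] for label_counts in counts.values())
    fp = sum(label_counts[1] for label_counts in counts.values())
    fn = sum(label_counts[2] for label_counts in counts.values())
    return prf(tp, fp, fn)


# Hypothetical (TP, FP, FN) counts: one frequent label that is classified
# well and one rare label that is classified poorly.
counts = {"Experiment": (90, 5, 5), "Coating": (6, 3, 3)}
macro = macro_average(counts)
micro = micro_average(counts)
```

With these illustrative counts the pooled (micro) scores are dominated by the frequent, well-classified label, while the macro scores give the rare label equal weight, so the micro F1 is higher than the macro F1.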

In conclusion, the model can be successfully utilized for the screening of articles on layered cathode materials for SIBs and for analyzing the research methodologies employed in these articles.

Paragraph classification

In this study, a total of 50 documents were randomly selected from the document collection. The categorized paragraphs within these documents were manually checked. The performance of the categorization process was evaluated using precision, recall, and F1 score metrics, as demonstrated in Table 2.

Table 2 Precision, Recall and F1 Score of the paragraph classification.
