J Korean Med Sci. 2021 Sep 06;36(35):e224. English.
Published online Aug 02, 2021.
© 2021 The Korean Academy of Medical Sciences.
Original Article

Word Embedding Reveals Cyfra 21-1 as a Biomarker for Chronic Obstructive Pulmonary Disease

Jeongwon Heo,1,2 Da Hye Moon,1,2 Yoonki Hong,1,2 So Hyeon Bak,3 Jeeyoung Kim,2,4 Joo Hyun Park,2,5 Byoung-Doo Oh,6 Yu-Seop Kim,6 and Woo Jin Kim1,2,4
    • 1Department of Internal Medicine, Kangwon National University Hospital, Chuncheon, Korea.
    • 2Department of Internal Medicine, School of Medicine, Kangwon National University, Chuncheon, Korea.
    • 3Department of Radiology, School of Medicine, Kangwon National University, Chuncheon, Korea.
    • 4Environmental Health Center, Kangwon National University Hospital, Chuncheon, Korea.
    • 5Department of Internal Medicine, Soonchunyang University Bucheon Hospital, Bucheon, Korea.
    • 6Department of Convergence Software, Hallym University, Chuncheon, Korea.
Received February 18, 2021; Accepted July 25, 2021.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Although patients with chronic obstructive pulmonary disease (COPD) experience high morbidity and mortality worldwide, few biomarkers are available for COPD. Here, we analyzed potential biomarkers for the diagnosis of COPD by using word embedding.

Methods

To determine which biomarkers are likely to be associated with COPD, we selected 26 respiratory disease-related biomarkers and measured their degrees of similarity to COPD by word embedding. Similarity was inferred from word embedding models trained on a large medical corpus, and the biomarkers with the highest similarity were identified. We used Word2Vec, Canonical Correlation Analysis, and Global Vector for word embedding. We then evaluated the associations of the selected biomarkers with COPD parameters in a cohort of patients with COPD.

Results

Cytokeratin 19 fragment (Cyfra 21-1) was selected because of its high similarity and its significant correlation with the COPD phenotype. Serum Cyfra 21-1 levels did not differ significantly between patients with COPD and controls (4.3 ± 5.9 vs. 3.9 ± 3.6 ng/mL, P = 0.611). The emphysema index was significantly correlated with the serum Cyfra 21-1 level (correlation coefficient = 0.219, P = 0.015).

Conclusion

Word embedding may be used for the discovery of biomarkers for COPD, and Cyfra 21-1 may be used as a biomarker for emphysema. Additional studies are needed to validate Cyfra 21-1 as a biomarker for COPD.

Keywords
Chronic Obstructive Pulmonary Disease; Biomarker; Word Embedding; Cyfra 21-1

INTRODUCTION

Chronic obstructive pulmonary disease (COPD) is a pulmonary disorder characterized by irreversible airflow limitation, with a partially reversible component.1 COPD is caused by airway and pulmonary parenchymal damage, which can be prevented and treated. Notably, COPD has a high prevalence and mortality rate worldwide, such that it represents a very large socioeconomic burden.2, 3, 4 The Global Burden of Disease, revised in 2010, reported that 300 million people worldwide have COPD and that COPD causes 3 million deaths worldwide each year.5, 6 The reported prevalence of COPD differs significantly among studies; however, according to research using the Global Initiative for Chronic Obstructive Lung Disease (GOLD) guidelines and a ratio of forced expiratory volume in 1 sec to forced vital capacity (FEV1/FVC) < 0.7, the overall prevalence is 10%.

A biomarker is a measurable substance or molecule that reflects the state and course of a disease.7 Suitable biomarkers can be used to identify patients early and improve the diagnosis of acute or chronic diseases.8, 9 FEV1 is relatively easy to determine in patients with COPD; it is also highly repeatable and is used to track health outcomes. However, FEV1 changes little during treatment, does not reflect disease activity, and is not adequately related to mortality and hospitalization; thus, it is not an ideal biomarker.10 Accordingly, there is an ongoing effort to identify meaningful biomarkers in patients with COPD, but no ideal biomarkers have yet been identified.

Word embedding is a method for learning a vector representation of every word in a given corpus. Because the vectors encode word semantics, similarities among words can be measured and further inferences can be drawn through vector calculations. In the present study, we investigated the relationships of COPD with its potential biomarkers by using word embedding models, such as Canonical Correlation Analysis, Word2vec, and Global Vector (GloVe).10, 11

A few studies have attempted to identify new biomarkers for COPD from the literature.12 Word embedding offers the possibility of identifying new biomarkers that may exhibit robust associations with COPD, in addition to known biomarkers. However, word embedding alone cannot verify the usefulness of a biomarker, because its results are not supported by clinical trials. In the present study, potential biomarkers were therefore selected using word embedding, then measured in patients with COPD and assessed to determine their correlations with various COPD parameters.

METHODS

Cohort

Patients and controls were selected from the COPD in the Dusty Areas study cohort, which was established to observe the clinical outcomes of patients living near cement plants. The design of this cohort has been detailed in previous studies.13, 14 Registered patients were assessed by medical interview, physical examination, spirometry, laboratory testing, and computed tomography. COPD was defined as FEV1/FVC < 0.7 after bronchodilator inhalation.15 COPD grade was defined as grade 2 (50% ≤ FEV1 < 80%), grade 3 (30% ≤ FEV1 < 50%), or grade 4 (FEV1 < 30%), according to the 2017 Global Initiative for Chronic Obstructive Lung Disease rating system.
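The spirometric criteria above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the cohort's actual software. The function names are hypothetical, and the grade 1 branch (FEV1 ≥ 80% of predicted) is an assumption taken from the usual GOLD scheme, because the text lists only grades 2 to 4.

```python
def has_copd(post_bd_fev1_l: float, post_bd_fvc_l: float) -> bool:
    """Criterion from the text: post-bronchodilator FEV1/FVC < 0.7."""
    return post_bd_fev1_l / post_bd_fvc_l < 0.7


def gold_grade(fev1_percent_predicted: float) -> int:
    """Map FEV1 (% of predicted) to a GOLD grade.

    Grades 2 to 4 use the thresholds quoted in the text. Grade 1
    (FEV1 >= 80%) is not listed there and is an assumption.
    """
    if fev1_percent_predicted >= 80:
        return 1
    if fev1_percent_predicted >= 50:
        return 2
    if fev1_percent_predicted >= 30:
        return 3
    return 4
```

For example, a hypothetical participant with a post-bronchodilator FEV1 of 1.5 L and FVC of 2.5 L has a ratio of 0.6 and meets the COPD definition; an FEV1 of 65% of predicted corresponds to grade 2.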

Patients and study design

The COPD and control groups were matched for age, sex, and smoking history. In total, 130 participants were enrolled by one-to-one matching in the cohort: 65 patients with COPD and 65 controls (Table 1). The blood concentrations of biomarkers identified by word embedding were examined. Eight participants (two in the COPD group and six in the control group) were excluded because of insufficient blood samples. Thus, 63 patients with COPD and 59 controls were included in the analysis.

Table 1
Basic characteristics of the participants in this study

The following variables were recorded for the included patients: age, sex, height, weight, smoking history, modified Medical Research Council Dyspnea Scale (mMRC) score, COPD Assessment Test (CAT) score, emphysema index (EI), and mean wall area% (MWA%).

Word embedding and Cyfra 21-1 measurements

To determine the degrees of similarity between COPD and candidate biomarkers, 26 biomarkers known to be associated with respiratory diseases were selected (Supplementary Table 1).16, 17 Because it was difficult to set criteria for selecting candidates when the aim was to find new biomarkers of COPD, we searched the literature for biomarkers related to other lung diseases and chronic diseases, and randomly selected 26 of them for word embedding. Information related to the biomarkers was extracted from Google searches. Similarities between words were measured, and vector computations on the vectorized semantics were performed to enable additional inferences. Relationships between COPD and the respective biomarkers were investigated using the Canonical Correlation Analysis, Word2vec, and GloVe word-embedding models. In place of a clinical evaluation, the titles and abstracts of papers retrieved from Google Scholar were analyzed and quantified to estimate the performance of the word embedding models.12
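As a rough illustration of how similarity between a disease term and candidate biomarkers can be scored once each word has a vector, the sketch below ranks candidates by cosine similarity. Cosine similarity is the usual choice for word vectors, but the text does not state which measure was used, so it is an assumption here. The three-dimensional vectors and the names `toy_copd`, `marker_a`, `marker_b`, and `marker_c` are invented for the example and are not taken from the trained models.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(target, candidates):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(target, vec))
              for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors, invented for illustration only (real embeddings have
# e.g. 100 dimensions and come from a trained model).
toy_copd = [1.0, 0.0, 1.0]
toy_markers = {
    "marker_a": [1.0, 0.0, 1.0],  # same direction as the target
    "marker_b": [1.0, 1.0, 0.0],  # partly aligned with the target
    "marker_c": [0.0, 1.0, 0.0],  # orthogonal to the target
}
ranking = rank_by_similarity(toy_copd, toy_markers)
```

In this toy setting `marker_a` ranks first and `marker_c` last; with real embeddings the same ranking step would be applied to the vector of "COPD" and the vectors of the 26 candidate biomarkers.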

Serum Cyfra 21-1 concentrations were determined using the Cyfra 21-1 enzyme-linked immunosorbent assay kit (Demeditec Diagnostics GmbH, Kiel, Germany), in accordance with the manufacturer’s protocol. Enzyme-linked immunosorbent assay plates were analyzed using a microplate reader (Molecular Devices, Sunnyvale, CA, USA).

Quantitative analysis of CT was performed using an Aview® system (Coreline Soft Inc., Seoul, Korea). The emphysema index was calculated as the volume fraction (%) of the lung below −950 HU at full inspiration. MWA%, the mean of the wall area percentage [defined as wall area (WA)/(WA + lumen area) × 100], was measured near the origin of the right apical and left apicoposterior segmental bronchi.
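The two CT measures defined above reduce to simple ratios, sketched here under stated assumptions. The emphysema index function assumes that all lung voxels have the same volume, so that the volume fraction equals the voxel fraction, and that lung segmentation has already been done. The function names are hypothetical and this is not the Aview implementation.

```python
def emphysema_index(lung_hu_values, threshold_hu=-950):
    """Percentage of lung voxels with attenuation below threshold_hu."""
    voxels = list(lung_hu_values)
    if not voxels:
        raise ValueError("no lung voxels supplied")
    low = sum(1 for hu in voxels if hu < threshold_hu)
    return 100.0 * low / len(voxels)


def wall_area_percent(wall_area, lumen_area):
    """WA% = WA / (WA + lumen area) x 100 for one airway cross-section."""
    return 100.0 * wall_area / (wall_area + lumen_area)


def mean_wall_area_percent(measurements):
    """Average WA% over (wall_area, lumen_area) pairs, e.g. two bronchi."""
    values = [wall_area_percent(wa, la) for wa, la in measurements]
    return sum(values) / len(values)
```

For instance, if two of four toy voxels lie below −950 HU the index is 50%, and airway measurements of (30, 20) and (20, 20) in arbitrary area units give WA% values of 60 and 50, for a mean of 55.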

Statistical analysis

Pearson's χ2 test and the t-test were used to compare characteristics between the COPD and control groups. Correlation analysis was used to determine the relationships of the respective variables with the serum Cyfra 21-1 concentration. Multivariate logistic regression analysis was used to adjust for variables including age, sex, smoking, and body mass index (BMI), so that the correlation between serum Cyfra 21-1 concentration and EI could be assessed independently. A P value < 0.05 was considered statistically significant. Data analysis was performed using SAS software, version 9.4 (SAS Institute, Cary, NC, USA).
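The correlation step can be illustrated with a small sketch. The study used SAS; the Python below is only a stand-in that computes a Pearson correlation coefficient and the t statistic commonly used to test it against zero. The text does not say which correlation coefficient was used, so Pearson is an assumption, and the function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two samples of equal length, n >= 3")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))
```

As a plausibility check, taking the reported coefficient of 0.219 for the emphysema index and assuming that all 122 analyzed participants contributed (the text does not give the sample size for this test), `t_statistic(0.219, 122)` is about 2.46 on 120 degrees of freedom, which is in line with the reported P value of 0.015.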

Ethical statement

Written informed consent was given by each participant. The study was reviewed by the Institutional Review Board (IRB) of Kangwon National University Hospital (IRB No. KNUH-2012-06-007).

RESULTS

Table 1 summarizes the characteristics of the 130 participants in this study. The proportion of men in the cohort was twofold greater than the proportion of women (67.7% vs. 32.3%). The mean age was 71.8 years, and the mean BMI was 23.81 kg/m2.

Table 2 shows the markers with the highest similarity values to COPD in the Word2vec 100 and GloVe 100 dimensions. ‘Title_both’ is the frequency at which the biomarker (e.g., ‘CC-16’) appears together with COPD in the title of a medical paper, and ‘Abs_both’ is the frequency at which the biomarker appears together with COPD in the abstract of a medical paper. The Clara cell secretory protein (recommended by Word2vec) and adiponectin and leptin (recommended by GloVe) have been the subject of a large amount of research related to COPD. In contrast, eotaxin-1 and Cyfra 21-1 (recommended by Word2vec) and carcinoembryonic antigen and serum amyloid A (recommended by GloVe) have not. Notably, Cyfra 21-1 had a high similarity to COPD but has attracted minimal research interest.
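The ‘Title_both’ and ‘Abs_both’ counts described above can be sketched as a simple co-occurrence tally. The matching rule used here (case-insensitive substring search) and the dictionary layout of each paper record are assumptions made for the illustration; the text does not describe how the terms were matched.

```python
def cooccurrence_counts(papers, biomarker, disease="COPD"):
    """Count papers mentioning both terms in the title, and in the abstract."""
    marker = biomarker.lower()
    target = disease.lower()
    title_both = 0
    abs_both = 0
    for paper in papers:
        title = paper.get("title", "").lower()
        abstract = paper.get("abstract", "").lower()
        if marker in title and target in title:
            title_both += 1
        if marker in abstract and target in abstract:
            abs_both += 1
    return {"Title_both": title_both, "Abs_both": abs_both}


# Invented records, for illustration only.
toy_papers = [
    {"title": "CC-16 in COPD",
     "abstract": "We measured CC-16 in patients with COPD."},
    {"title": "Serum CC-16 levels",
     "abstract": "CC-16 was lower in COPD than in controls."},
    {"title": "Leptin and asthma",
     "abstract": "No COPD patients were included."},
]
toy_counts = cooccurrence_counts(toy_papers, "CC-16")
```

A biomarker with high embedding similarity but low counts of this kind, as reported for Cyfra 21-1, is one that the literature has rarely studied together with COPD.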

Table 2
Biomarkers exhibiting highest similarities with chronic obstructive pulmonary disease

Table 3 shows the demographic and clinical characteristics of the matched COPD and control groups. Significant differences between the two groups were observed in the mMRC score, CAT score, post-bronchodilator (post-BD) FEV1, EI, and MWA%. The mMRC and CAT scores were higher in the COPD group than in the control group; the EI and MWA% were also significantly higher in the COPD group. No differences were found in the serum levels of interleukin-6 or interleukin-8. No significant difference was detected in the serum level of Cyfra 21-1 (4.3 ± 5.9 vs. 3.9 ± 3.6 ng/mL, P = 0.611) between the COPD and control groups.

Table 3
Basic characteristics of participants in COPD and control groups

Correlations of demographic and clinical characteristics with serum Cyfra 21-1 concentration were analyzed (Table 4). BMI (P = 0.044) and EI (P = 0.015) were significantly correlated with serum Cyfra 21-1 concentration. Multivariate regression analysis confirmed these correlations: a negative correlation (B = −10.282, P = 0.030) between Cyfra 21-1 and BMI and a positive correlation (B = 8.44, P = 0.046) between Cyfra 21-1 and EI (Supplementary Table 2).

Table 4
Correlations of cytokeratin 19 fragment with clinical and demographic variables

DISCUSSION

In this study, we selected potential biomarkers using word-embedding techniques. We compared the serum concentrations of Cyfra 21-1 between a COPD group and a matched control group in the COPD in the Dusty Areas cohort, with the aim of characterizing Cyfra 21-1 as a potential biomarker for COPD. We found a higher EI and lower BMI in patients with a high serum concentration of Cyfra 21-1. Notably, we found that the serum Cyfra 21-1 concentration was not associated with other COPD clinical indicators, such as dyspnea or pulmonary function.

Word embedding is a method for representing the words of a text as numerical vectors. Word2vec was developed by Mikolov, who implemented continuous bag-of-words and Skip-Gram word embedding.10 Words are embedded based on their meaning in the context of surrounding words. A sufficient number of documents is collected in the target field; embedding is then performed in a low-dimensional space using a representative technique (e.g., Word2vec). The embedded space reproduces the contextual meaning of the corresponding vocabulary; if a particular word is very positive, it can be inferred that words near it are also likely to have a positive meaning.18 Various studies have used word embedding; recent research on cancer and biomarkers has provided a new method to study potential biomarkers.19
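To make "the context of surrounding words" concrete, the sketch below lists the (center word, context word) pairs that a Skip-Gram model is trained on: each word is paired with its neighbours inside a fixed window, and the model learns vectors that predict the context word from the center word (continuous bag-of-words does the reverse). The window size and the function name are illustrative choices, not settings reported in this study.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) training pairs for a tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Invented sentence, for illustration only.
toy_sentence = ["serum", "marker", "elevated", "in", "emphysema"]
toy_pairs = skipgram_pairs(toy_sentence, window=1)
```

Words that repeatedly share context words across many such pairs end up with nearby vectors, which is why a biomarker and a disease that are discussed in similar contexts can score as similar even when they rarely appear in the same paper.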

COPD is a disease with high prevalence and mortality rates worldwide.20 Pulmonary function is relatively easy to measure in patients with COPD, and the measurements are repeatable and easy to track; however, it is not an ideal biomarker because it does not reflect disease activity and is not associated with acute hospitalization, mortality, or quality of life.21 Ideal biomarkers improve patient care by allowing early detection of clinical conditions, thereby improving the diagnosis of acute or chronic diseases; they also enable prediction of patient risk, selection of the optimal treatment for a given patient, and monitoring of disease progression and of the response to treatment.22 Many studies have been conducted to identify ideal biomarkers of COPD.23, 24, 25

Cyfra 21-1 is a cytokeratin 19 fragment that was identified as a marker of lung cancer in 1993. It has been repeatedly validated as a marker for non-small cell lung cancer.26, 27 Using the Word2vec technique, the serum tumor marker Cyfra 21-1 was found to lie close to COPD in the embedding space.12 In our study, serum Cyfra 21-1 levels were significantly correlated with BMI and EI values. In a study of the clinical utility of the tumor marker Cyfra 21-1 in health assessments, the factors affecting Cyfra 21-1 were age, sex, smoking status, blood pressure, diabetes, and hepatitis B infection; no analysis of height or weight was performed in that study.27

To the best of our knowledge, the present study is the first to reveal associations of Cyfra 21-1 with BMI and the EI. Thus, in addition to its role as a marker for cancer, Cyfra 21-1 may serve as a marker for benign pulmonary disease. Several studies have shown increased Cyfra 21-1 levels in patients with interstitial pulmonary fibrosis, collagen disease-associated pulmonary fibrosis, and autoimmune pulmonary alveolar proteinosis.22, 26, 27

Our study had some limitations. First, we did not consider additional variables associated with Cyfra 21-1 (e.g., hepatitis B infection, diabetes, and blood pressure), which were identified in previous studies. In particular, the serum Cyfra 21-1 level is known to be an adjunct in the diagnosis and prognosis of lung squamous cell carcinoma.28 Additionally, Cyfra 21-1 has recently been reported to be elevated in idiopathic pulmonary fibrosis and head and neck squamous cell carcinoma.29 In our cohort, too few patients with COPD had these underlying diseases for their effects to be analyzed (Supplementary Table 3).30 Second, the word embedding study was designed to review other potential biomarkers in addition to Cyfra 21-1; however, only one marker was measured because of research cost limitations. Third, COPD is a heterogeneous disease caused by a combination of small airway disease (chronic bronchitis) and parenchymal destruction (emphysema).31 However, our cohort contained no information on the type of COPD, so an analysis by type was not possible. Finally, in the analysis by GOLD group, only 2 patients had stage 3 or higher disease, which was insufficient for analysis (Supplementary Table 4).

To our knowledge, our study is the first to identify a biomarker for COPD using word embedding. We measured serum Cyfra 21-1 concentrations in the COPD and control groups. Correlations of BMI and EI with Cyfra 21-1, which were not identified in previous studies, were confirmed. Active research regarding the use of word embedding in various medical fields is expected in the future. Cyfra 21-1, BMI, and EI should be assessed in patients at risk of COPD, in addition to the conventional parameters of age, diabetes, cancer, and infection.

In conclusion, Cyfra 21-1 concentration was correlated with EI and BMI. Word embedding is expected to be involved in the discovery of new biomarkers for COPD and other medical conditions.

SUPPLEMENTARY MATERIALS

Supplementary Table 1

Twenty-six selected biomarkers

Supplementary Table 2

Multivariate stepwise regression analysis of the BMI and emphysema index

Supplementary Table 3

Analysis according to comorbidity in COPD and control groups

Supplementary Table 4

Characteristics of chronic obstructive pulmonary disease patients by Global Initiative for Chronic Obstructive Lung Disease stage

Notes

Funding: This work was supported by an Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2019-0-01343, Regional strategic industry convergence security core talent training business).

Disclosure: The authors have no potential conflicts of interest to disclose.

Footnote: The English in this document has been checked by at least two professional editors, both native speakers of English. For a certificate, please see: http://www.textcheck.com/certificate/U4pQlx

Author Contributions:

  • Conceptualization: Heo J, Moon DH, Hong Y, Bak SH, Park JH, Kim YS, Kim WJ.

  • Data curation: Heo J, Kim J, Oh BD, Kim YS.

  • Formal analysis: Heo J, Kim J, Oh BD, Kim YS.

  • Funding acquisition: Kim WJ.

  • Investigation: Heo J, Bak SH, Kim J, Park JH, Oh BD, Kim YS.

  • Methodology: Heo J, Moon DH, Hong Y, Bak SH, Kim J, Park JH, Oh BD, Kim YS, Kim WJ.

  • Software: Oh BD, Kim YS.

  • Supervision: Heo J, Hong Y, Bak SH, Park JH, Kim YS.

  • Validation: Heo J, Kim YS.

  • Visualization: Heo J, Kim YS.

  • Writing - original draft: Heo J, Park JH, Oh BD, Kim YS, Kim WJ.

  • Writing - review & editing: Heo J, Moon DH, Hong Y, Bak SH, Park JH, Kim YS, Kim WJ.

References

    1. Kim HK, Lee SD. Pathophysiology of chronic obstructive pulmonary disease. Tuberc Respir Dis (Seoul) 2005;59(1):5–13.
    2. Lopez AD, Shibuya K, Rao C, Mathers CD, Hansell AL, Held LS, et al. Chronic obstructive pulmonary disease: current burden and future projections. Eur Respir J 2006;27(2):397–412.
    3. Mathers CD, Loncar D. Projections of global mortality and burden of disease from 2002 to 2030. PLoS Med 2006;3(11):e442.
    4. Kim C, Kim Y, Yang DW, Rhee CK, Kim SK, Hwang YI, et al. Direct and indirect costs of chronic obstructive pulmonary disease in Korea. Tuberc Respir Dis (Seoul) 2019;82(1):27–34.
    5. Vos T, Flaxman AD, Naghavi M, Lozano R, Michaud C, Ezzati M, et al. Years lived with disability (YLDs) for 1160 sequelae of 289 diseases and injuries 1990–2010: a systematic analysis for the Global Burden of Disease Study 2010. Lancet 2012;380(9859):2163–2196.
    6. Abubakar I, Tillmann T, Banerjee A. GBD 2013 Mortality and Causes of Death Collaborators. Global, regional, and national age-sex specific all-cause and cause-specific mortality for 240 causes of death, 1990–2013: a systematic analysis for the Global Burden of Disease Study 2013. Lancet 2015;385(9963):117–171.
    7. Cazzola M, MacNee W, Martinez FJ, Rabe KF, Franciosi LG, Barnes PJ, et al. Outcomes for COPD pharmacological trials: from lung function to biomarkers. Eur Respir J 2008;31(2):416–469.
    8. Morrow DA, de Lemos JA. Benchmarks for the assessment of novel cardiovascular biomarkers. Circulation 2007;115(8):949–952.
    9. Singh D. Blood eosinophil counts in chronic obstructive pulmonary disease: a biomarker of inhaled corticosteroid effects. Tuberc Respir Dis (Seoul) 2020;83(3):185–194.
    10. Sin DD. Chronic obstructive pulmonary disease: reactive past, preventive future. Proc Am Thorac Soc 2009;6(6):523–523.
    11. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv. 2013.
    12. Yoon BH, Kim YS. Correlation analysis of chronic obstructive pulmonary disease (COPD) and its biomarkers using the word embeddings; Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers); Taipei, Taiwan; IJCNLP. 2017. pp. 337–342.
    13. Hahm CR, Lim MN, Kim HY, Hong SH, Han SS, Lee SJ, et al. Implications of the pulmonary artery to ascending aortic ratio in patients with relatively mild chronic obstructive pulmonary disease. J Thorac Dis 2016;8(7):1524–1531.
    14. Koo HK, Hong Y, Lim MN, Yim JJ, Kim WJ. Relationship between plasma matrix metalloproteinase levels, pulmonary function, bronchodilator response, and emphysema severity. Int J Chron Obstruct Pulmon Dis 2016;11:1129–1137.
    15. Vogelmeier CF, Criner GJ, Martinez FJ, Anzueto A, Barnes PJ, Bourbeau J, et al. Global strategy for the diagnosis, management, and prevention of chronic obstructive lung disease 2017 report. GOLD executive summary. Am J Respir Crit Care Med 2017;195(5):557–582.
    16. Ahn J, Cho J. Current serum lung cancer biomarkers. J Mol Biomark Diagn 2013;4:2.
    17. Srikanthan K, Feyh A, Visweshwar H, Shapiro JI, Sodhi K. Systematic review of metabolic syndrome biomarkers: a panel for early detection, management, and risk stratification in the West Virginian population. Int J Med Sci 2016;13(1):25–38.
    18. Fabbri LM, Hurd S. Global strategy for the diagnosis, management and prevention of COPD: 2003 update. Eur Respir J 2003;22(1):1–2.
    19. Nakayama M, Satoh H, Ishikawa H, Fujiwara M, Kamma H, Ohtsuka M, et al. Cytokeratin 19 fragment in patients with nonmalignant respiratory diseases. Chest 2003;123(6):2001–2006.
    20. Arai T, Inoue Y, Sugimoto C, Inoue Y, Nakao K, Takeuchi N, et al. CYFRA 21-1 as a disease severity marker for autoimmune pulmonary alveolar proteinosis. Respirology 2014;19(2):246–252.
    21. Han Y, Heo Y, Hong Y, Kwon SO, Kim WJ. Correlation between physical activity and lung function in dusty areas: results from the chronic obstructive pulmonary disease in dusty areas (CODA) cohort. Tuberc Respir Dis (Seoul) 2019;82(4):311–318.
    22. Stieber P, Bodenmüller H, Banauch D, Hasholzner U, Dessauer A, Ofenloch-Hähnle B, et al. Cytokeratin 19 fragments: a new marker for non-small-cell lung cancer. Clin Biochem 1993;26(4):301–304.
    23. Zemans RL, Jacobson S, Keene J, Kechris K, Miller BE, Tal-Singer R, et al. Multiple biomarkers predict disease severity, progression and mortality in COPD. Respir Res 2017;18(1):117.
    24. Kim DK, Cho MH, Hersh CP, Lomas DA, Miller BE, Kong X, et al. Genome-wide association analysis of blood biomarkers in chronic obstructive pulmonary disease. Am J Respir Crit Care Med 2012;186(12):1238–1247.
    25. Brusselle G, Pavord ID, Landis S, Pascoe S, Lettis S, Morjaria N, et al. Blood eosinophil levels as a biomarker in COPD. Respir Med 2018;138:21–31.
    26. Park SY, Lee JG, Kim J, Park Y, Lee SK, Bae MK, et al. Preoperative serum CYFRA 21-1 level as a prognostic factor in surgically treated adenocarcinoma of lung. Lung Cancer 2013;79(2):156–160.
    27. Kim J, Jung H, Kim D, Lee S, Kim M, Park K. Lack of clinical utility for CYFRA 21-1 in medical screening. Korean J Fam Pract 2018;8(1):73–79.
    28. Wieskopf B, Demangeat C, Purohit A, Stenger R, Gries P, Kreisman H, et al. Cyfra 21-1 as a biologic marker of non-small cell lung cancer. Evaluation of sensitivity, specificity, and prognostic role. Chest 1995;108(1):163–169.
    29. Simpson JK, Maher TM, Bentley J, Braybrooke R, Carter P, Costa MJ, et al. CYFRA-21-1 as a biomarker with prognostic potential in idiopathic pulmonary fibrosis: an analysis of the PROFILE cohort. Am J Respir Crit Care Med 2017;195:A6791.
    30. Joo H, Park J, Lee SD, Oh YM. Comorbidities of chronic obstructive pulmonary disease in Koreans: a population-based study. J Korean Med Sci 2012;27(8):901–906.
    31. Park TS, Lee JS, Seo JB, Hong Y, Yoo JW, Kang BJ, et al. KOLD Study Group. Study design and outcomes of Korean obstructive lung disease (KOLD) cohort study. Tuberc Respir Dis (Seoul) 2014;76(4):169–174.
