Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD)

Jiang, Xiangying; Ringwald, Martin; Blake, Judith; Shatkay, Haggit

doi:10.1093/database/baaa043

The authors thank Dr. Kathleen F. McCoy for pointing out an error in the formulae used for calculating the Utility measures in the original version of the publication “Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD)” by Xiangying Jiang, Martin Ringwald, Judith Blake and Hagit Shatkay (Database, Volume 2017, bax017). As a consequence, two of the tables and two of the figures are also corrected.

In the publication [1], the formulae of utility measures (Page 6, Right Column, Lines 1-2) were provided as:

$$\begin{align*} UTIL-10&=\frac{10\times TP-FP}{10\times(TP+FP)}\\ UTIL-20 &= \frac{20 \times TP-FP}{20\times(TP+FP)}\end{align*}$$

The correct formulae should be as follows (where FP in the denominator is replaced by FN):

$$\begin{align*} UTIL-10&=\frac{10\times TP-FP}{10\times(TP+FN)}\\ UTIL-20 &= \frac{20 \times TP-FP}{20\times(TP+\textrm{FN})}\end{align*}$$

Accordingly the phrase on Page 6, Right Column, Lines 3-4: “where TP is the number of true positives, FP is the number of false positives” should be replaced as follows: “where TP is the number of true positives, FP is the number of false positives, and FN denotes false negatives.”

These errors affect Table 2 and Table 3, as well as Figure 1 and Figure 2.

The rightmost two columns in Table 2 in the original publication [1] were:

Utility-10	Utility-20
.876(.006)	.881(.005)
.895(.007)	.899(.007)
.875(.008)	.881(.008)
.894(.009)	.899(.009)
.896	.900

Open in new tab

Utility-10	Utility-20
.876(.006)	.881(.005)
.895(.007)	.899(.007)
.875(.008)	.881(.008)
.894(.009)	.899(.009)
.896	.900

Open in new tab

The values based on the corrected formulae are:

Utility-10	Utility-20
.945(.003)	.951(.003)
.912(.005)	.917(.005)
.945(.006)	.951(.006)
.912(.009)	.917(.008)
.916	.921

Open in new tab

Utility-10	Utility-20
.945(.003)	.951(.003)
.912(.005)	.917(.005)
.945(.006)	.951(.006)
.912(.009)	.917(.008)
.916	.921

Open in new tab

The complete corrected Table 2 is as follows:

Table 2

Classification evaluation measures for our classifiers on the GXD dataset using different cross-validation settings. NB denotes Na1ve Bayes classifier; RF denotes Random Forest classifier. The suffix 5 indicates using 5 complete runs of 5-fold cross validation; the suffix 10 indicates using 10 complete runs of 10-fold cross validation. RF-H-and-H represents using half of the GXD dataset for training and the other half for testing. Standard deviation is shown in parentheses.

Classifiers	Precision	Recall	F-measure	Accuracy	Utility-10	Utility-20
NB5	.892(.005)	.957(.003)	.923(.003)	.917(.004)	.945(.003)	.951(.003)
RF5	.908(.006)	.921(.005)	.915(.004)	.912(.005)	.912(.005)	.917(.005)
NB10	.891(.007)	.957(.006)	.923(.004)	.917(.005)	.945(.006)	.951(.006)
RF10	.908(.008)	.922(.008)	.915(.006)	.912(.007)	.912(.009)	.917(.008)
RF-H-and-H	.905	.925	.915	.913	.916	.921

Classifiers	Precision	Recall	F-measure	Accuracy	Utility-10	Utility-20
NB5	.892(.005)	.957(.003)	.923(.003)	.917(.004)	.945(.003)	.951(.003)
RF5	.908(.006)	.921(.005)	.915(.004)	.912(.005)	.912(.005)	.917(.005)
NB10	.891(.007)	.957(.006)	.923(.004)	.917(.005)	.945(.006)	.951(.006)
RF10	.908(.008)	.922(.008)	.915(.006)	.912(.007)	.912(.009)	.917(.008)
RF-H-and-H	.905	.925	.915	.913	.916	.921

Open in new tab

Table 2

Classification evaluation measures for our classifiers on the GXD dataset using different cross-validation settings. NB denotes Na1ve Bayes classifier; RF denotes Random Forest classifier. The suffix 5 indicates using 5 complete runs of 5-fold cross validation; the suffix 10 indicates using 10 complete runs of 10-fold cross validation. RF-H-and-H represents using half of the GXD dataset for training and the other half for testing. Standard deviation is shown in parentheses.

Classifiers	Precision	Recall	F-measure	Accuracy	Utility-10	Utility-20
NB5	.892(.005)	.957(.003)	.923(.003)	.917(.004)	.945(.003)	.951(.003)
RF5	.908(.006)	.921(.005)	.915(.004)	.912(.005)	.912(.005)	.917(.005)
NB10	.891(.007)	.957(.006)	.923(.004)	.917(.005)	.945(.006)	.951(.006)
RF10	.908(.008)	.922(.008)	.915(.006)	.912(.007)	.912(.009)	.917(.008)
RF-H-and-H	.905	.925	.915	.913	.916	.921

Classifiers	Precision	Recall	F-measure	Accuracy	Utility-10	Utility-20
NB5	.892(.005)	.957(.003)	.923(.003)	.917(.004)	.945(.003)	.951(.003)
RF5	.908(.006)	.921(.005)	.915(.004)	.912(.005)	.912(.005)	.917(.005)
NB10	.891(.007)	.957(.006)	.923(.004)	.917(.005)	.945(.006)	.951(.006)
RF10	.908(.008)	.922(.008)	.915(.006)	.912(.007)	.912(.009)	.917(.008)
RF-H-and-H	.905	.925	.915	.913	.916	.921

Open in new tab

Figure 1, which depicts these values showing the performance of our classifiers, was:

Open in new tab Download slide

It is updated as follows, where the histograms depicting Utility-10 and Utility-20 have been corrected:

Open in new tab Download slide

In addition, the rightmost two columns in Table 3 in the original publication were:

Utility-10	Utility-20
.866(.024)	.872(.023)
.765(.018)	.776(.017)
.820(.019)	.828(.028)
.846(.021)	.853(.020)
.870(0.19)	.876(0.18)
.869(0.20)	.875(0.19)

Open in new tab

Utility-10	Utility-20
.866(.024)	.872(.023)
.765(.018)	.776(.017)
.820(.019)	.828(.028)
.846(.021)	.853(.020)
.870(0.19)	.876(0.18)
.869(0.20)	.875(0.19)

Open in new tab

The values based on the corrected formulae are:

Utility-10	Utility-20
.647(.027)	.652(.027)
.746(.024)	.757(.024)
.751(.024)	.758(.024)
.745(.018)	.752(.018)
.805(.018)	.810(.018)
.817(.015)	.823(.015)

Open in new tab

Utility-10	Utility-20
.647(.027)	.652(.027)
.746(.024)	.757(.024)
.751(.024)	.758(.024)
.745(.018)	.752(.018)
.805(.018)	.810(.018)
.817(.015)	.823(.015)

Open in new tab

The complete corrected Table 3 is as follows:

Table 3

Classification results obtained over the GXD-caption dataset using different set of features. AB indicates using features form titles/abstracts only. CAP indicates using features from captions alone. AB&CAP indicates using features from both captions and titles/abstracts. NB denotes Naáve Bayes; RF denotes Random Forest classifier. Standard deviation is shown in parentheses. The best performance in each metric is indicated using Boldface.

Classifiers	Precision	Recall	F-measure	Accuracy	Utility-10	Utility-20
NB_AB	.874(.023)	.656(.027)	.749(.022)	.783(.017)	.647(.027)	.652(.027)
RF_AB	.779(.016)	.768(.024)	.773(.017)	.802(.015)	.746(.024)	.757(.024)
NB_CAP	.831(.018)	.766(.024)	.797(.019)	.808(.017)	.751(.024)	.758(.024)
RF_CAP	.855(.019)	.758(.017)	.804(.015)	.817(.014)	.745(.018)	.752(.018)
NB_AB&CAP	.877(.018)	.816(.018)	.846(.014)	.853(.013)	.805(.018)	.810(.018)
RF_AB&CAP	.876(.019)	.829(.015)	.852(.013)	.858(.012)	.817(.015)	.823(.015)

Classifiers	Precision	Recall	F-measure	Accuracy	Utility-10	Utility-20
NB_AB	.874(.023)	.656(.027)	.749(.022)	.783(.017)	.647(.027)	.652(.027)
RF_AB	.779(.016)	.768(.024)	.773(.017)	.802(.015)	.746(.024)	.757(.024)
NB_CAP	.831(.018)	.766(.024)	.797(.019)	.808(.017)	.751(.024)	.758(.024)
RF_CAP	.855(.019)	.758(.017)	.804(.015)	.817(.014)	.745(.018)	.752(.018)
NB_AB&CAP	.877(.018)	.816(.018)	.846(.014)	.853(.013)	.805(.018)	.810(.018)
RF_AB&CAP	.876(.019)	.829(.015)	.852(.013)	.858(.012)	.817(.015)	.823(.015)

Open in new tab

Table 3

Classification results obtained over the GXD-caption dataset using different set of features. AB indicates using features form titles/abstracts only. CAP indicates using features from captions alone. AB&CAP indicates using features from both captions and titles/abstracts. NB denotes Naáve Bayes; RF denotes Random Forest classifier. Standard deviation is shown in parentheses. The best performance in each metric is indicated using Boldface.

Classifiers	Precision	Recall	F-measure	Accuracy	Utility-10	Utility-20
NB_AB	.874(.023)	.656(.027)	.749(.022)	.783(.017)	.647(.027)	.652(.027)
RF_AB	.779(.016)	.768(.024)	.773(.017)	.802(.015)	.746(.024)	.757(.024)
NB_CAP	.831(.018)	.766(.024)	.797(.019)	.808(.017)	.751(.024)	.758(.024)
RF_CAP	.855(.019)	.758(.017)	.804(.015)	.817(.014)	.745(.018)	.752(.018)
NB_AB&CAP	.877(.018)	.816(.018)	.846(.014)	.853(.013)	.805(.018)	.810(.018)
RF_AB&CAP	.876(.019)	.829(.015)	.852(.013)	.858(.012)	.817(.015)	.823(.015)

Classifiers	Precision	Recall	F-measure	Accuracy	Utility-10	Utility-20
NB_AB	.874(.023)	.656(.027)	.749(.022)	.783(.017)	.647(.027)	.652(.027)
RF_AB	.779(.016)	.768(.024)	.773(.017)	.802(.015)	.746(.024)	.757(.024)
NB_CAP	.831(.018)	.766(.024)	.797(.019)	.808(.017)	.751(.024)	.758(.024)
RF_CAP	.855(.019)	.758(.017)	.804(.015)	.817(.014)	.745(.018)	.752(.018)
NB_AB&CAP	.877(.018)	.816(.018)	.846(.014)	.853(.013)	.805(.018)	.810(.018)
RF_AB&CAP	.876(.019)	.829(.015)	.852(.013)	.858(.012)	.817(.015)	.823(.015)

Open in new tab

Figure 2, shows a comparison of classification results obtained using the different sets of features:

Open in new tab Download slide

It is updated as follows, where the histograms depicting Utility-10 and Utility-20 have been corrected:

Open in new tab Download slide

All other parts of the paper including all the conclusions remain unaltered by these corrections.

Reference

Jiang

X.

,

Ringwald

M.

,

Blake

J.

,

Shatkay

H

. (

2017

).

Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD)

.

Database

,

2017

,

bax017

.

Google Scholar

Crossref

WorldCat

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Download all slides

Month:	Total Views:
June 2020	140
July 2020	85
August 2020	68
September 2020	100
October 2020	69
November 2020	131
December 2020	46
January 2021	38
February 2021	17
March 2021	56
April 2021	8
May 2021	10
June 2021	6
July 2021	19
August 2021	8
September 2021	7
October 2021	15
November 2021	14
December 2021	3
January 2022	11
February 2022	2
April 2022	17
May 2022	8
June 2022	6
July 2022	10
August 2022	8
September 2022	39
October 2022	3
November 2022	6
December 2022	1
January 2023	5
February 2023	3
March 2023	4
April 2023	6
May 2023	17
June 2023	25
July 2023	7
August 2023	13
September 2023	2
October 2023	14
November 2023	10
December 2023	13
January 2024	16
February 2024	49
March 2024	15
April 2024	12

Article Contents

Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD)

Reference

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD)

Reference

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only