1 Introduction

Positive and unlabelled (PU) learning focuses on the setting where the data contains labelled positive examples and unlabelled ones. Unlabelled examples can be either positive or negative. In this setting, the true class label \(Y\in \{0,1\}\) is not observed directly. We only observe the surrogate variable \(S\in \{0,1\}\), which indicates whether an example is labelled (and thus positive; \(S=1\)) or unlabelled (\(S=0\)). This type of data naturally arises in many applications (see Bekker and Davis 2020; Jaskie and Spanias 2019 for reviews). Below we describe some illustrative examples. As a first example, consider the problem of predicting disease based on patient characteristics. Medical databases usually list only diagnosed diseases. However, many diseases, such as hypertension or diabetes, are often undiagnosed (Walley 2018). Therefore, the absence of the diagnosis does not mean that the patient does not have the disease in question. Consequently, we can distinguish three groups of patients: patients with the diagnosed disease (\(S=1\) and thus \(Y=1\)); patients without diagnosed disease who have the disease (\(S=0\) and \(Y=1\)) and finally patients without diagnosed disease who really do not have the disease (\(S=0, Y=0\)). Importantly, it is not possible to make a distinction between the second and the third group using observed data. Secondly, PU data occur frequently in text classification problems (Liu et al. 2003; Fung et al. 2006; Li and Liu 2003). For example, when classifying web page preferences, some web pages can be bookmarked as positive (\(S=1\)) by the user whereas all other pages are treated as unlabelled (\(S=0\)). Among unlabelled pages (\(S=0\)), one can find both positive and negative pages. Thirdly, PU data stems from under-reporting (Sechidis et al. 2017) which frequently happens in survey data, and it refers to the situation when some respondents fail to answer a question truthfully. Under-reporting may occur, e.g. when the question concerns dangerous or unlawful behaviour such as taking illicit drugs (Bahorik et al. 2014; Chen et al. 2006). So in surveys, one group of respondents may admit such behaviours truthfully (\(S=1\implies Y=1\)) and the other group do not (\(S=0\)). The second group consists of respondents who have engaged in dangerous behaviours but do not report them (\(Y=1, S=0\)) and those who have nothing to report (\(Y=0, S=0\)). Other examples include modelling wildlife habitat selection (Ward et al. 2009; Pearce and Boyce 2006), detection of causative genes for various human diseases (Yang et al. 2014) and predicting drug-target interactions (Lan et al. 2016).

There are many partial observability schemes related to PU learning (we refer to Section 8 in Bekker and Davis (2020) for detailed discussion, see also Menon et al. (2015)). Semi-supervised learning is a general related scenario in which the goal is to learn from labelled and unlabelled data but, in contrast to PU learning, labeled examples from both classes are assumed to be present in the data (Chapelle et al. 2010). PU setting can be also seen as a special case of more general problem of learning from noisy labels (Natarajan et al. 2013; Frenay and Verleysen 2014) when labels are incorrectly assigned. In such general scenario, value of the true class variable Y can be flipped with some probability, i.e. instead Y we observe \(S=1-Y\). Probabilities of incorrect assignment are: \(\omega _1:=P(S=0|Y=1)\) and \(\omega _2:=P(S=1|Y=0)\). Obviously, this problem reduces to PU setting for \(\omega _2=0\). Even more general scenario is learning from two contaminated distributions (Scott et al. 2013). Finally, PU learning can be also seen as a special case of ’coarse data’ analysis (Heitjan and Rubin 1991; Couso et al. 2017), which covers situations where one does not have access to the exact value of the class variable Y, but only to some subset of the possible values of Y that contains it.

It is known that in PU learning, a classifier can be successfully learned if the class prior \(\alpha =P(Y=1)\) is available (Bekker and Davis 2020; Elkan and Noto 2008). More precisely, knowledge of the class prior can be used in three ways. The first approach is represented by so-called post-processing methods which first train classifier using S as a surrogate variable for Y (which is called non-traditional classification or naive classification) and then modify output probabilities using the class prior (Elkan and Noto 2008). The second approach are pre-processing methods that weigh the examples using the class prior (Steinberg and Cardell 1992; Lancaster and Imbens 1996; Kiryo et al. 2017). We refer to Bekker and Davis (2020) (Section 5.3.2) for a description of a general empirical risk minimization framework in which the weights of observations depending on \(\alpha \) are determined for any loss function. In the third approach, class prior is incorporated into learning algorithms. A representative algorithm from this group is POSC4.5 (Denis et al. 2005), which is PU tree learning method.

The class prior is usually not known (except from situations when, for example, disease prevalence is known or can be learnt from other studies) and therefore the problem of its estimation from PU data has attracted significant attention (Elkan and Noto 2008; Jain et al. 2016; Plessis et al. 2017; Bekker and Davis 2018). There are three key contributions of the present paper concerning this problem. First, we formally analyse the problem of prior estimation; we show that in general this problem is ill-defined in non-parametric setting, i.e. class prior is not identifiable without some assumptions on conditional distribution of Y given X. We show however that the class prior becomes identifiable when we impose mild semi-parametric model assumptions on conditional distribution of Y given X. Secondly, we analyse in detail the most popular existing methods: the classical method EN proposed by Elkan and Noto (2008), TIcE algorithm (Bekker and Davis 2018), Partial Matching (Plessis et al. 2017), KM estimators (Ramaswamy et al. 2016) and MLR (Jaskie et al. 2020). We formally show that in some situations, some of the above methods underestimate label frequency \(c=P(S=1|Y=1)\) and thus overestimate class prior. Finally, we consider the method based on logistic regression which allows to estimate label frequency and parameters of the logistic model simultaneously. The method (called JOINT method) was proposed in recent work (Teisseyre et al. 2020), in the more general context of estimation of the posterior for PU data. Its main limitation lies in necessity of the optimization of the non-concave log-likelihood function. Teisseyre et al. (2020) used simple gradient method, which may fail in some situations. Here we propose a novel procedure, called CD+MM, that combines cyclic coordinate descent (CD) and minimization-majorization (MM) algorithms and allows for more stable and efficient optimization. The method is based on a simple but consequential fact that the log-likelihood treated as function of logistic parameters is bounded from below by a concave function. Indeed, the experiments indicate that CD+MM outperforms other methods, including JOINT method, with respect to estimation error of the class prior.

This paper is organized as follows. In Sect. 2 we introduce notation and basic assumptions. In Sect. 3 we discuss identifiability of class prior; in Sect. 4 we analyse the existing methods; Sect. 5.1 introduces the novel method, Sect. 5.2 compares it with MLR method (Jaskie et al. 2020), Sect. 6 summarizes the results of numerical experiments and Sect. 7 concludes the paper.

2 Notation and assumptions

We first introduce basic notations. Let \(X \in {\mathcal {X}}\) be a feature vector, \(Y\in \{0,1\}\) be a true class label and \(S\in \{0,1\}\) an indicator of whether an example is labelled (\(S=1\)) or not (\(S=0\)). We assume that there is some unknown distribution P(YXS) such that \((y_i,x_i,s_i), i=1,\ldots ,n\) is i.i.d. sample drawn from it and data \((x_i,s_i),i=1,\ldots ,n\), is observed. Only positive examples (\(Y=1\)) can be labelled, i.e. \(P(S=1|X,Y=0)=0\). Thus we know that \(Y=1\) when \(S=1\) but when \(S=0\), Y can be either 1 or 0 . As the aim is to learn the distribution of (XY) and we only observe samples from distribution of (XS), where \(S=Y\) with a certain probability, this is a partial observability scenario similar e.g. to a right censoring scheme when Y is observed provided its value is smaller than value of a censoring variable. In this work we also adopt a commonly used assumption called SCAR (Selected Completely At Random) which states that labelled examples are selected randomly from a set of positives examples, independently from X, i.e.

$$\begin{aligned} P(S=1|Y=1,X) = P(S=1|Y=1). \end{aligned}$$
(1)

Parameter \(c:=P(S=1|Y=1)\) is called the label frequency and plays an important role in PU learning. In particular it is closely related to the class prior \(\alpha =P(Y=1)\) Elkan and Noto (2008), i.e. we have

$$\begin{aligned} \alpha =P(Y=1) = P(S=1)/c. \end{aligned}$$
(2)

The probability \(P(S=1)\) can be directly estimated as the fraction of labelled examples. In the proposed method we first estimate c and then use (2) to estimate \(\alpha \), the similar approach was used e.g. in Bekker and Davis (2018). Under SCAR assumption we have the following property

$$\begin{aligned} P(S=1|X)=cP(Y=1|X) , \end{aligned}$$
(3)

which will be directly used in the proposed method.

In the paper we also take advantage of a representation of variable (XS) when SCAR assumption is valid, introduced recently in Teisseyre et al. (2020). Namely, we have shown that S can be represented as

$$\begin{aligned} S=Y\cdot \varepsilon ,\,\, \mathrm{where}\,\,\varepsilon \perp (X,Y)\,\, \mathrm{and}\,\varepsilon \sim Bern(1,p) \end{aligned}$$
(4)

(with \(\perp \) denoting independence), for a certain \(0<p<1\) and where Bern(1, p) stands for Bernoulli distribution. Indeed, we have \(S=Y\varepsilon \perp X\) given Y, as \(\varepsilon \perp (X,Y)\) implies that \(\varepsilon \perp X\) given Y. Moreover,

$$\begin{aligned} P(S=1|Y=1)=P(Y\varepsilon =1|Y=1)=P(\varepsilon =1)=p. \end{aligned}$$

Thus probability of success \(P(\varepsilon =1)\) coincides with c. Moreover, it easily follows from the fact that \(\varepsilon \perp X\) given \(Y=1\) that \(P(X|Y=1)=P(X|S=1)\) and thus conditional distribution of X given \(Y=1\) is identifiable.

We stress that the scenario we consider here, namely that i.i.d. sample \((x_i,s_i)\) is observed, called single sample scenario should be distinguished from case-control scenario when one sample is drawn from distribution \(P(X|Y=1)\) and the second one is drawn independently from P(X). The differences between these two schemes are discussed in (Bekker and Davis 2020).

3 Identifiability of class prior

We start by stating an intuitive fact that neither c or \(\alpha \) is identifiable for a PU single sample scenario given a full knowledge of the distribution of (XS), if no assumptions on distribution of (XY) are imposed. This was already noticed for PU case control scenario in Ward et al. (2009), see also Scott (2015). In the present case of single sample scenario it follows from noting that distribution of X is mixture of two distributions

$$\begin{aligned} p(x)= \alpha cp_l(x) + (1-\alpha c)p_u(x), \end{aligned}$$

where \(p_l(\cdot )\) and \(p_u(\cdot )\) are densities or distribution mass functions of conditional distributions of X given \(S=1\) and \(S=0\), respectively. Note that \(\alpha c=P(S=1)\). We have that \(p_l(\cdot )\) coincides with density of X given \(Y=1\) (see Sect. 2), whereas \(p_u\) equals

$$\begin{aligned} p_u(x)=\frac{p(x)-\alpha cp_l(x)}{1-\alpha c}=\frac{p(x)-\alpha cp(x|Y=1)}{1-\alpha c}. \end{aligned}$$
(5)

Thus mixing proportion is \(\alpha c\) and distributions \(P(X|S=i)\) of labelled and unlabelled mixture components depend solely on distributions P(X), \(P(X|Y=i)\) and \(\alpha c\). This means that changing \(\alpha \) and c in such a way that their product is constant we obtain the same distribution (XS), however the situations when \(\alpha \) is small and c is large or otherwise are very different. Thus neither \(\alpha \) or c is identifiable.

The situation changes dramatically when we impose semi-parametric model assumptions on conditional distribution of Y given X. We consider so called single index model (see e.g. Ichimura 1993) and prove directly that the parameters of the corresponding model for aposteriori probability of Y are identifiable. This in turn implies Fisher consistency of the likelihood method (part (ii)) which is the theoretical underpinning of empirical likelihood optimisation considered in this paper. In Remark 1 below we discuss that the considered model satisfies the positive function condition which is one of the identifiability conditions considered in Bekker and Davis (2020).

Theorem 1

(i) Let f(s) be an arbitrary strictly monotone response function with values in [0, 1] such that \(P(Y=1|X=x)=f(\beta ^Tx)\), for some \(\beta =(\beta _0,\beta _1,\ldots ,\beta _p)^T\). In addition assume that there exists \(i\ne 0\), such that \(\beta _i\ne 0\) (i.e. Y and X are not independent). If

$$\begin{aligned} c f( \beta ^Tx) = {\tilde{c}} f({{\tilde{\beta }}}^Tx) \end{aligned}$$

for all \(x\in R^{p+1}\) where \({{\tilde{\beta }}}=({{\tilde{\beta }}}_0,{{\tilde{\beta }}}_1,\ldots ,{{\tilde{\beta }}}_p)\), then \(c={\tilde{c}}\) and \(\beta ={{\tilde{\beta }}}\).

(ii) Assume that (XY) is as in (i), \(c=P(S=1|Y=1)\) and let \(p_{{\tilde{c}},{{\tilde{\beta }}}}(x,1)=P(X=x, S=1)={\tilde{c}} f({{\tilde{\beta }}}^Tx)p(x), p_{{\tilde{c}},{{\tilde{\beta }}}}(x,0)=P(X=x, S=0)=1-p_{{\tilde{c}},{{\tilde{\beta }}}}(x,1)\). Then the maximiser of the expected loglikelihood

$$\begin{aligned} {\mathbb {E}}_{X,S} \{\log p_{{\tilde{c}},{{\tilde{\beta }}}}(X,S)\} \end{aligned}$$

with respect to \(({\tilde{c}}, {{\tilde{\beta }}})\) is unique and equals \((c,\beta )\).

Part (i) claims that when \(P(Y=1|x)\) is logistic then parameters determining \(P(S=1|x)\), that is c and \(\beta \) are uniquely determined. This is proved by assuming the contrary, checking the stated equality for the specific values of x and showing that it forces two possibly different sets of parameters to coincide. In part (ii) it is shown that when the expected likelihood is maximised, the maximisers correspond to the true parameters of generating mechanism, which are uniquely determined in view of (i).

Proof

(i) Note that \(\beta _0\) is the intercept corresponding to the first coordinate of x which is equal 1. We first consider a situation when there exists an index \(i\ne 0\) such that \(\beta _i\ne {{\tilde{\beta }}}_i\). Plugging the value \(x:=(1,0,\ldots ,0,({{\tilde{\beta }}}_0- \beta _0)/(\beta _i-{{\tilde{\beta }}}_i),0,\ldots ,0)\) (with \((i+1)\)th coordinate equal \(({{\tilde{\beta }}}_0- \beta _0)/(\beta _i-{{\tilde{\beta }}}_i)\)) into the assumed functional equality we obtain \(c={\tilde{c}}\) as \(f(\beta ^Tx)=f({{\tilde{\beta }}}^Tx)\ne 0\). Taking \(x:=(1,0,\ldots ,0)\) yields \(\beta _0={{\tilde{\beta }}}_0\). Considering \(x=(1,0,\ldots ,1,0,\ldots ,0)\), with the first and the \((i+1)\)th coordinate equal to 1, we obtain \(\beta _i = {{\tilde{\beta }}}_i\), a contradiction. Thus it is enough to consider the equality

$$\begin{aligned} c f( \beta _0 + \beta _{-0}^Tx_{-0}) = {\tilde{c}} f({{\tilde{\beta }}}_0 + \beta _{-0}^Tx_{-0}), \end{aligned}$$

where \(\beta =(\beta _0,\beta _{-0}^T)^T\). It follows from assumption that we can take coordinate \(x_i\) such that \(\beta _i\ne 0\) and \(i>0\). Now considering the sequence \(x^{(n)}=(1,0,\ldots ,0,sign(\beta _i)\ \times n,0,\ldots ,0)\), with \(sign(\beta _i)\ \times n\) at the \((i+1)\)th place, we obtain \({\tilde{c}} f(\infty )= c f(\infty )\) and thus \({\tilde{c}}=c \) again. Taking \(x=(1,0,\ldots ,0)\) yields now \(\beta _0={{\tilde{\beta }}}_0\) and thus in a view of the first part of the proof \(\beta ={{\tilde{\beta }}}\).

Part (ii) follows from properties of Kullback–Leibler divergence which imply that for any \(X=x\) the conditional expected value \({\mathbb {E}}_{S|X=x} \{\log p_{{\tilde{c}},{{\tilde{\beta }}}}(x,S)\}\) is maximised by success probability equal \(cf(\beta ^Tx)\) in view of Information Inequality (see Cover and Thomas (2006), Theorem 2.6.3). Then it follows from (i) that c and \(\beta \) are uniquely defined. \(\square \)

Some remarks are in order.

Remark 1

The setting where \(P(Y=1|X=x)= f(\beta ^Tx)\) and f is some unknown monotone response function is very flexible semi-parametric model. However, its assumptions imply that the positive function property is valid which ensures identifiability (see Bekker and Davis 2020, Section 3.3 for taxonomy of identifiability assumptions and (Ramaswamy et al. 2016), Definition 9 for the precise statement of the positive function property called there separability condition with margin \(\alpha \)). Namely, letting \(A:=\{x:\beta ^Tx>t\}\) where \(t\in R\) is chosen such that \(f(t)\ge 1/2 +\varepsilon \) for some \(\varepsilon >0\) and h(x) being the indicator function of A, it is easy to see that

$$\begin{aligned} {\mathbb {E}}_{X|Y=1} h(X)\ge \Big (\frac{1}{2}+\varepsilon \Big )\frac{P(X\in A)}{P(Y=1)}\ge \Big (\frac{1}{2}-\varepsilon \Big )\frac{P(X\in A)}{P(Y=1)}\ge {\mathbb {E}}_{X|Y=0} h(X). \end{aligned}$$

Thus separability condition with margin \(\alpha \) holds with \(\alpha =\varepsilon P(X\in A)/P(Y=1)\) and \(\beta = P(X\in A)/(2P(Y=1))\).

Remark 2

Observe that in the above Theorem we assume that Y and X are not independent. Note that this is obviously necessary. Indeed, when Y is independent from X, it is not possible to infer anything about Y (and S) using X. In such situation the class prior cannot be identified using solely knowledge of S. Part (ii) states that when the model for (XY) is correctly specified then log-likelihood method based on (XS) is Fisher consistent (Li and Duan 1989). In particular, the above result holds true for the logistic response function and also for f being the cumulative distribution function of the standard normal distribution which corresponds to the probit model. It is known that fitting logistic model is robust to misspecification of the response function under parametric assumptions on distribution of X (Li and Duan 1989; Mielniczuk and Teisseyre 2016) what suggests that estimation of c will also enjoy this property. This will be investigated for artificially generated data in Sect. 6. In Sect. 5.1 we focus on the case when the conditional distribution of Y given X is logistic.

4 Existing methods of class prior estimation: theoretical analysis

In this section we review and discuss all available (best to our knowledge) non-parametric methods of class prior estimation except methods based on parametric models including the new proposal which are discussed in the next section.

4.1 Elkan-Noto estimator

We first consider Elkan–Noto estimator of c (cf. Elkan and Noto (2008)) denoted by \(e_1\) on p. 214 of their paper. It is introduced under tacit separability assumption stating that supports of conditional distributions of X given \(Y=1\) and \(Y=0\) are disjoint. The estimator is defined as follows

$$\begin{aligned} {\hat{c}}_{EN}=\frac{1}{|{{\mathcal {A}}}|} \sum _{i\in {{\mathcal {A}}}} {\hat{P}}(S=1|X_i), \end{aligned}$$
(6)

where \({{\mathcal {A}}}\) is a labelled part of a test sample \({{\mathcal {T}}}\), namely \({{\mathcal {A}}}=\{i: (X_i,S_i)\in {{\mathcal {T}}}:\, S_i=1\}\) and estimator \({\hat{P}}\) of P is based on training sample \({{\mathcal {U}}}\) independent of \({{\mathcal {T}}}\). We show below that limit of \({\hat{c}}_{EN}\) is c under the separability condition but differs from it if the condition is not satisfied.

Lemma 1

Assume that estimator \({\hat{P}}(S=1|X=\cdot )\) is uniformly consistent i.e.

$$\begin{aligned} \sup _x |{\hat{P}}(S=1|X=x) - P(S=1|X=x)|\rightarrow 0 \end{aligned}$$
(7)

in probability then

$$\begin{aligned} {\hat{c}}_{EN} \rightarrow \frac{{\mathbb {E}}_X P^2(S=1|X)}{P(S=1)} \end{aligned}$$
(8)

in probability when sample size \(n=|{{\mathcal {T}}}|\) tends to infinity. When the separability assumption holds, the limit in (8) equals c, i.e. \({\hat{c}}_{EN}\) is consistent.

Intuitively, the proof exploits the idea that since \({\hat{c}}_{EN}\) is given as an average of random variables, Law of Large Numbers allows us to study its limit. Before we prove the result, we discuss its consequences. First, we note that if Y is independent of X and thus S is independent of X, we have that the limit in (8) equals \(P(S=1)\le P(S=1)/P(Y=1)=c\). Moreover, considering representation \(S=Y\varepsilon \), discussed in Sect. 2, it is easy to see that \(c=P(Y\varepsilon =1|Y=1)=P(\varepsilon =1)\) whereas

$$\begin{aligned} \frac{{\mathbb {E}}_X P^2(S=1|X)}{P(S=1)}=\frac{P^2(\varepsilon =1){\mathbb {E}}_X P^2(Y=1|X)}{P(Y=1) P(\varepsilon =1)}= P(\varepsilon =1)\frac{{\mathbb {E}}_X P^2(Y=1|X)}{P(Y=1)}. \end{aligned}$$

Thus the estimator is not consistent in general and multiplicative bias is \({{\mathbb {E}}P^2(Y=1|X)}/{P(Y=1)}\). Note that it holds

$$\begin{aligned} \frac{{\mathbb {E}}_X P^2(Y=1|X)}{P(Y=1)}\le \frac{{\mathbb {E}}_X P(Y=1|X)}{P(Y=1)}= \frac{P(Y=1)}{P(Y=1)}=1 \end{aligned}$$

with the inequality above being strict in discrete case if for some x there is \(0<P(Y=1|X=x)<1\). In the separability case, when we have \(P(Y=0|X=x)=1\) or \(P(Y=1|X=x)=1\) for any x, inequality above becomes equality and the limit in (8) is c. When the separability condition does not hold, the bias of \({\hat{c}}_{EN}\) is always negative and increases with c. In particular, it is easily checked that in the case of Example 1 discussed below the multiplicative bias equals 0.68.

We also note that consistency assumption (7) in the Lemma is satisfied for discrete finite X when \(|{{\mathcal {U}}}|\rightarrow \infty \) in view of weak Law of Large Numbers and in general case, as S is binary, from uniform consistency of nonparametric regression estimators (cf. e.g. Bierens 1983, for uniform consistency of kernel estimators).

Proof

From consistency assumption (7) it easily follows that the limit of \({\hat{c}}_{EN}\) is the same as the limit

$$\begin{aligned} \frac{1}{|{{\mathcal {A}}}|}\sum _{i\in {{\mathcal {A}}}} P(S=1|X_i)=\frac{n}{n_1}\frac{1}{n}\sum _{i\in {{\mathcal {T}}}}P(S=1|X_i)I\{S_i=1\}=:I_1\times I_2, \end{aligned}$$
(9)

where \(n_1={|{{\mathcal {A}}}|}\). Obviously \(I_1=n/n_1\rightarrow P(S=1)^{-1}\) in probability and in view of the weak Law of Large Numbers

$$\begin{aligned} I_2 \rightarrow {\mathbb {E}}_{(X,S)}\Big ( P(S=1|X)I\{S=1\}\Big ). \end{aligned}$$

We compute the above limit for discrete X, in the general case proof is similar.

$$\begin{aligned} {\mathbb {E}}_{(X,S)} \Big (P(S=1|X)I\{S=1\}\Big )= & {} \sum _{x \in {{\mathcal {X}}}}\frac{P(S=1, X=x)}{P(X=x)}P(S=1,X=x)\nonumber \\= & {} \sum _{x \in {{\mathcal {X}}}}\frac{P^2(S=1, X=x)}{P^2(X=x)}P(X=x)\nonumber \\= & {} {\mathbb {E}}P^2(S=1|X). \end{aligned}$$
(10)

From convergences of \(I_1\) and \(I_2\) the result readily follows. \(\square \)

4.2 TIcE estimator

We now discuss recent estimator of c introduced in Bekker and Davis (2018). It relies on the observation that SCAR assumption i.e. conditional independence of X and S given Y implies that for any \(A\subseteq {{\mathcal {X}}}\)

$$\begin{aligned} c=P(S=1|Y=1)= P(S=1|X\in A, Y=1) \end{aligned}$$
(11)

and the right hand side above is equal

$$\begin{aligned} \frac{P(S=1, X\in A, Y=1)}{P(X\in A, Y=1)}=\frac{P(S=1|X\in A)}{P(Y=1|X\in A)}. \end{aligned}$$
(12)

Thus if A is such a set that \(P(Y=1|X\in A)\approx 1\) (known as positive subdomain assumption and A is called an anchor set in Bekker and Davis (2020)) then c may be estimated as a fraction of observations with \(S=1\) such that their concomitant X falls into A. This is an essence of the proposed method in which A is found using induction tree built on the training sample and \(P(S=1|X\in A)\) is estimated using testing sample. Denote by \({\hat{c}}_{TIcE}\) the resulting estimator (TIcE standing for ’Tree Induction for c Estimation’). Such an approach will not yield satisfactory results if \(P(Y=1|X\in A)\) is bounded away from 1 for any set A. Consider the following simple example.

Example 1

Let (XY) be such that both X and Y take values 0, 1 with probability 1/2 and

$$\begin{aligned} P(Y=1|X=1)=P(Y=0|X=0)=0.8. \end{aligned}$$

Thus the anchor set A in this case (taken as a set maximising \(P(Y=1|X\in A))\) equals \(A=\{X=1\}\) and for any S such that \(S\perp X|Y\) the corresponding value of \(c=c(S)\) will satisfy

$$\begin{aligned} c=\frac{P(S=1|X=1)}{P(Y=1|X=1)} = \frac{10}{8}P(S=1|X=1). \end{aligned}$$

Divide data into training (i.e. ’tree’ using Authors’ terminology) and testing (i.e. ’estimation’ ) data as in Bekker and Davis (2020) using (default) proportion 1:4 and let n be the total number of observations, \(T_{est}=(4/5)n\) be the size of estimation data. Moreover, using notation from Bekker and Davis (2018), let \(T_{est}(X=1)\) be the expected number of observations from estimation data with \(X=1\) and

$$\begin{aligned} L_{est}(X=1)=P(S=1|X=1)T_{est}(X=1)=\frac{8}{10}\times \frac{1}{2}\times \frac{4}{5}\times n\times c \end{aligned}$$

the expected number of observations from estimation data with \(X=1\) and \(S=1\). Using value of \(\delta \) defined by equation (8) on p. 2714 in Bekker and Davis (2018), we can obtain ’ideal’ value of \({\hat{c}}_{TIcE}\) from the equation

$$\begin{aligned} {\hat{c}}:=\frac{L_{est}(X=1)}{T_{est}(X=1)} - \varepsilon , \end{aligned}$$
(13)

where the correction (error) term equals to \(\varepsilon =({\hat{c}}(1-{\hat{c}})(1-\delta )/\delta T_{est}(X=1))^{1/2}\) and is derived from the one-sided Chebyschev inequality. The ’ideal’ value is the output value of the algorithm (cf. definition of \(c_{low}\) in Algorithm 1 of Bekker and Davis 2018) when an estimated anchor set \(\{S_e: a^*=v\}\) in the algorithm is replaced by the true anchor set \(\{X=1\}\). Note that value of \(\delta \) is given by (8) in Bekker and Davis 2018). Figure 1 shows values of ’ideal’ \({\hat{c}}_{TIcE}\) obtained from (13) (red dots) and actual values of \({\hat{c}}_{TIcE}\) (light red dots) for 100 trials and for c ranging from 0 to 1 and \(n=10^3\) and \(n=10^4\). The performances of other methods discussed in the paper is also shown for comparison. KM2 estimator is discussed in Sect. 4.4 whereas MLR, JOINT and CD+MM methods are introduced in Sect. 5. It is seen that \({\hat{c}}_{TIcE}\) underestimates true c with pronounced bias for larger c and that bias is not negligible even for \(n=10^4\). The situation does not improve even with ’ideal’ \({\hat{c}}_{TIcE}\) when the anchor set is assumed known. The reason for this behaviour is, that for any A, probability \(P(Y=1|X\in A)\) is significantly smaller than 1. The Elkan–Noto estimator performs similarly, on the other hand KM2 defined below has the smaller bias and the CD+MM method proposed in this paper is approximately unbiased in this case.

Fig. 1
figure 1

Example 1. Values of \({\hat{c}}_{TIcE}\) (light red dots) for 100 trials and ’ideal’ \({\hat{c}}_{TIcE}\) (red dots) for \(n=10^3, 10^4\) (colour figure online)

4.3 Partial matching

PE estimator proposed in du Plessis and Sugiyama (2014) is based on a partial matching in terms of the Pearson divergence defined as \(PE(p(x),q(x))=\int (p(x)-q(x))^2/p(x)\,dx\) and is approximately unbiased under separability assumption. Namely, it is proposed to find the minimiser of \(PE(a\times f(x|Y=1),f(x))\) over \(a>0\) and then consider its estimator as an estimator of \(\alpha \). Note that \(\alpha f(x|Y=1)\approx f(x)\) only for such x that \((1-\alpha )f(x|Y=0)\approx 0\). It is easily derivable that the minimiser equals

$$\begin{aligned} \left( \int \frac{P^2(x|Y=1)}{p(x)}\,dx \right) ^{-1}= \frac{P(Y=1)}{1 - (1-P(Y=1))A, } \end{aligned}$$
(14)

where \(A=\int {p(x|Y=1)p(x|Y=0)}/{p(x)}\, dx\) and the equality follows from (5). This quantity is estimable as both p(x) and \(p(x|Y=1)=p(x|S=1)\) are observable. The estimator is approximately unbiased under the separability assumption i.e. when the class-conditional densities have disjoint supports, as then \(A=0\). Otherwise, it is positively biased and thus suffers from intrinsic bias problem. We also note that the minimiser equals

$$\begin{aligned} \Bigg (\frac{\int p(x)P^2(Y=1|X=x)}{P^2(Y=1)}\,dx\Bigg )^{-1} =\frac{P^2(Y=1)}{{\mathbb {E}}P^2(Y=1|X)}=\alpha \Bigg \{\frac{{\mathbb {E}}P^2(Y=1|X)}{P(Y=1)}\Bigg \}^{-1} \end{aligned}$$

which shows that estimation of the minimiser will lead to biased estimator of \(\alpha \). Note also that the multiplicative constant above is the reciprocal of the multiplicative bias of \({\hat{c}}_{EN}\). In view of the correspondence between \(\alpha \) and c (cf. (2)) this suggests close correspondence between PE estimator and Elkan-Noto estimator.

4.4 KM estimators (Ramaswamy et al. (2016))

Denote by F and H the distribution functions of X given \(S=0\) and \(S=1\), respectively and note that (5) implies that F can be represented as a mixture

$$\begin{aligned} F(x)= \frac{\alpha -\alpha c}{1-\alpha c} H(x)+ \frac{1-\alpha }{1-\alpha c} G(x), \end{aligned}$$

where G is distribution of X given \(Y=0\). Following (Ramaswamy et al. 2016), denote the mixing proportion by \(\kappa ^*=(\alpha -\alpha c)/(1-\alpha c)\) and note that estimator of \(\alpha \) may be obtained from the equality \(\alpha =\kappa ^*(1-P(S=1)) + P(S=1)\) once \(\kappa ^*\) is estimated, as S is observable. The problem of estimating \(\kappa ^*\) is approached by transforming PU data into Reproducing Kernel Hilbert Space (RKHS) \({{\mathcal {H}}}\) by appropriate function \(\phi \) and solving this problem in \({{\mathcal {H}}}\). More specifically, it is shown that \(\kappa ^*\) can be recovered by truncating either the distance function \(d(\lambda )=\inf _{w\in {{\mathcal {C}}}} ||\lambda \phi ({\hat{F}}) + (1-\lambda )\phi ({\hat{H}}) - w||_{{\mathcal {H}}}\), where the set \({{\mathcal {C}}}\) consists of convex combinations of transformed data points and \(||\cdot ||_{{{\mathcal {H}}}}\) denotes the norm of \({{\mathcal {H}}}\), or the gradient of \(d(\lambda )\). In this way two estimators KM1 and KM2 are obtained, which are shown to be consistent under appropriate conditions (Theorems 12 and 13 in Ramaswamy et al. 2016). The advantage of the results is that they hold provided the positive function condition is valid, which is weaker than the anchor set condition (see Ramaswamy et al. 2016, Section 4). Disadvantage is that the formal consistency result holds only when permissible truncation thresholds for d or its gradient depend on the unknown \(\kappa ^*\) which we want to estimate. Thus the studied versions of estimators are not necessarily consistent. Theoretical comparison of KM1 and KM2 with other proposals seems out of reach at the moment. The relevant numerical analysis is provided in the following section. We stress that KM1 and KM2 are based on fully nonparametric approach whereas the method proposed here relies on the assumption that distribution of Y given X is logistic.

5 Estimating the class prior via logistic regression

5.1 CD+MM algorithm: description and its properties

In the following we introduce CD+MM algorithm which attempts to optimize log-likelihood function of PU data. We compare it with JOINT method introduced in Teisseyre et al. (2020) and discuss why the new method is beneficial. For standard supervised scenario, the logistic regression involves optimizing log-likelihood function

$$\begin{aligned} \sum _{i=1}^{n}[y_i\log (\sigma (x_i^{T}b))+(1-y_i)\log (1-\sigma (x_i^{T}b))], \end{aligned}$$
(15)

with respect to \(b=(b_0,\ldots ,b_p)\), where \(\sigma (t):=\exp (t)/(1+\exp (t))\) is sigmoid logistic function and posterior probability \(P(Y=1|X=x)\) is assumed equal to \(\sigma (x^{T}b^*)\) for a certain \(b^*\in R^{p+1}\). Obviously, for PU data, this approach is not feasible as we do not observe Y and the loss function has to be based on S. To tackle the above problem, we can use equality \(P(S=1|X)=cP(Y=1|X)\). In this context we mention (Jaskie et al. 2020) who used modified logistic function to approximate \(P(S=1|X)\). We write the observed log-likelihood function

$$\begin{aligned} L(c,b)=\sum _{i=1}^{n}[s_i\log (c\sigma (x_i^{T}b))+(1-s_i)\log (1-c\sigma (x_i^{T}b))]. \end{aligned}$$
(16)

The main idea of the proposed approach is to maximise the above function simultaneously with respect to b and c. This has been already investigated in Teisseyre et al. (2020) as JOINT Method which relied on simple gradient algorithm to maximise L(cb) where it is shown that such approach works on par or better than other competitors for estimating aposteriori probability \(P(Y=1|X=x)\). However, it is known that L(cb) is not a concave function jointly in both arguments (see Song and Raskutti 2020, p. 5). This can be seen by noting that the sum in (16) over \(s_i=0\) is not concave as a function of b and can dominate the remaining sum over \(s_i=1\) (which is concave) to that effect that \(L(c,\cdot )\) will not be concave.Footnote 1 Thus, despite established good performance of JOINT Method, it may fail to find a global maximum of L(cb) by using gradient search. Below we introduce a different method of searching for maximizer of L(cb) and show this gives improvement for the problem of estimating class prior. It is based on Minorization–Maximisation (MM) algorithm (see e.g. Lange 2010) in which a (non-concave) criterion function is bounded from below by a concave function at each step of iteration procedure. Under mild conditions it is shown that the maximizers of the lower bounds in the consecutive iterations yield a non-decreasing sequence of criterion function values.

Define function \(L_{b}(c):=L(c,b)\), to be profile log-likelihood function i.e. the log-likelihood treated as a function of c for fixed b. Below we prove its concavity by showing that its second derivative is non-positive.

Lemma 2

Function \(L_{b}(c)\) is concave with respect to c.

Proof

The proof follows from a simple calculation showing that

$$\begin{aligned} \frac{\partial ^2}{\partial ^2c}L_{b}(c)=-\sum _{i=1}^n \Big (\frac{s_i}{c^2} +\frac{(1-s_i)\sigma ^2(x_i^Tb)}{(1-c\sigma (x_i^Tb))^2}\Big ) \le 0 \end{aligned}$$

which implies that \(L_{b}(\cdot )\) is concave. \(\square \)

Define function \(L_{c}(b):=L(c,b)\), i.e. profile log-likelihood function for fixed c. Moreover, let X be a \(n\times p\) matrix of features, whose ith row is \(x_i^T\); \(v(x_i^T b):=\sigma (x_i^T b)(1-\sigma (x_i^T b))\), \(\varSigma \) is \(n\times n\) diagonal matrix with

$$\begin{aligned} \frac{v(x_i^T b)}{\sigma (x_i^Tb)(1-c\sigma (x_1^T b))}= \frac{1 - \sigma (x_i^T b)}{1-c\sigma (x_i^T b)}, \end{aligned}$$

\(i=1,\ldots ,n\) on the diagonal; \(\mathbf {s}=(s_1,\ldots ,s_n)^T\) and \(\mathbf {p}=(c\sigma (x_1^T b),\ldots ,c\sigma (x_n^T b))^T\). Gradient of \(L_{c}(b)\) with respect to b is of the form

$$\begin{aligned} \nabla L_{c}(b)=X^T \varSigma (\mathbf {s}-\mathbf {p}). \end{aligned}$$
(17)

We consider function

$$\begin{aligned} \varPsi _{c}(b,b^0):=L_{c}(b^0)+(b-b^0)^T \nabla L_{c}(b^0) - \frac{1}{8}(b-b^0)^T X^{T}X (b-b^0). \end{aligned}$$

Note that \(\varPsi _{c}(\cdot ,b^0)\) is concave. The following Lemma gives the lower bound for function \(L_{c}(b)\). The lower bound will serve as a proxy to be maximised in MM algorithm and it is obtained by bounding from below the second term in Taylor expansion of \(L_c(b)\).

Lemma 3

The following inequality holds: \(L_c(b)\ge \varPsi (b,b^{0})\) for any vector \(b^0\).

Proof

We denote by \(L_{i,c}(b)\) the ith summand of (16) treated as a function of b. We first calculate a form of a second derivative of \(L_{i,c}(b)\) using (17). Namely, noting that \(\sigma '(t)=\sigma (t)(1-\sigma (t))= v(t)\), we have for \(s_i=1\) and \(x_i=(x_{i1},\ldots ,x_{ip})^T\)

$$\begin{aligned} \frac{\partial ^2}{\partial b_j\partial b_k}L_{i,c}(b)= -x_{ij}x_{ik}v(x_i^Tb) \end{aligned}$$

and for \(s_i=0\), using the fact that \(v'(t)=v(t)(1-2\sigma (t))\) we have

$$\begin{aligned} \frac{\partial ^2}{\partial b_j \partial b_k}L_{i,c}(b)= & {} -cx_{ij}x_{ik}\Big [\frac{v(x_i^Tb)(1-2\sigma (x_i^Tb))(1-c \sigma (x_i^Tb))+ cv^2(x_i^Tb)}{(1-c\sigma ^2(x_i^Tb))^2}\Big ]\\= & {} -cx_{ij}x_{ik}{v(x_i^T b)}\frac{(c\sigma ^2(x_i^Tb)-2\sigma (x_i^Tb) +1)}{(1-c\sigma ^2(x_i^Tb))^2}. \end{aligned}$$

Observe that

$$\begin{aligned} \frac{c(c\sigma ^2(x_i^Tb)-2\sigma (x_i^Tb) +1)}{(1-c\sigma ^2(x_i^Tb))^2}=\frac{(1-c\sigma (x_i^Tb))^2 +c -1}{(1-c\sigma ^2(x_i^Tb))^2} \le 1, \end{aligned}$$

as \(c\le 1\) and \(0\le \sigma (s)\le 1\). Thus denoting by \(H(b)=\nabla ^2 L_c(b)=\left( \frac{\partial ^2}{\partial b_j\partial b_k}L_c(b)\right) _{j,k}\) Hessian of \(L_{c}(b)\) with respect to b and taking into account the inequality \(v(t)\le 1/4\) we have that

$$\begin{aligned} H(b)\ge -\frac{1}{4} X^TX, \end{aligned}$$
(18)

where \('\ge '\) above denotes matrix ordering (\(A\ge B\) when \(A-B\) is a positive semi-definite matrix). For (18) we additionally used \(X^T\varDelta X \le X^TX\) when \(\varDelta \) is a diagonal matrix with all elements on the diagonal not larger then 1. Taylor expanding \(L_c(b)\) around \(b_0\) we have

$$\begin{aligned} L_{c}(b)= L_{c}(b^0)+(b-b^0)^T \nabla L_{c}(b^0) + \frac{1}{2}(b-b^0)^T H(\tilde{b}^*)(b-b^0) \end{aligned}$$

where \(\tilde{b}^*\) belongs to the interval \([b^0,b]\). Using inequality (18) for the Hessian with H(b) replaced by \(H(b^*)\) it follows that

$$\begin{aligned} (b-b^0)^T H(\tilde{b}^*)(b-b^0) \ge -\frac{1}{4} (b-b^0)^TX^TX (b-b^0), \end{aligned}$$

what combined with the previous inequality gives the proof of the Lemma. \(\square \)

figure a

The result below shows that in the consecutive iterations \(t=1,2,\ldots \) the values of \(L_{c^0}(\hat{b}^{t})\) form a nondecreasing sequence. This in practical terms means its convergence to the local minimum.

Theorem 2

Let \({\hat{b}}^0\in R^p\) and \(\hat{b}^{t}=\arg \max _{b}\varPsi _{c^0}(b,\hat{b}^{t-1})\) for \(t\ge 1\). Then

$$\begin{aligned} L_{c^0}(\hat{b}^{t-1})\le L_{c^0}(\hat{b}^{t}). \end{aligned}$$

Proof

Proof of the Theorem follows from properties of Minorization-Maximization (MM) algorithm (see e.g. inequality (5.69) in Hastie et al. (2015) applied to the negative loglikelihood). Indeed, in view of Lemma 3 we have \(L_c(b)\ge \varPsi (b,b^{0})\), moreover \(L_c(b_0)= \varPsi (b^{0},b^{0})\) from the definition of \(\varPsi (b,b^{0})\) and \(\varPsi (\cdot ,b^{0})\) is concave. \(\square \)

Theorem 2 justifies Algorithm 1 which for a given value of \(c^0\) and \(\hat{b}^{t-1}\) yields \(\hat{b}^{t}\) such that \(L_{c^0}(\hat{b}^{t-1})\le L_{c^0}(\hat{b}^{t})\) by maximizing \(\varPsi _{c^0}(\cdot ,\hat{b}^{t-1})\). Note here that maximization of \(\varPsi _{c^0}(\cdot ,\hat{b}^{t-1})\) is fast as it is simple maximization of the quadratic function. This provides very plausible justification for applying MM algorithm in this case.

We now describe a novel algorithm CD+MM which combines MM algorithm with Cyclic Coordinate Descent (CD), see Algorithm 2 for the pseudo-code. The algorithm works as follows. We cyclically iterate through b and c, one at a time, maximizing the objective function with respect to each coordinate at a time. After \((t-1)^{\mathrm{th}}\) iteration in which \(\hat{b}^{t-1}\) is obtained, \({\hat{c}}^t\) is sought by maximising function \(L_{\hat{b}^{t-1}}(c)\) with respect to c (note that in the view of Lemma 2 this a concave function of c) and then \(\hat{b}^{t}\) is obtained by maximising \(L_{\hat{c}^{t}}(b)\), which is done using Algorithm 1.

figure b

5.2 MLR estimator and its comparision with JOINT method

The idea of MLR estimator (Jaskie et al. 2020) is as follows. First note that \(c\le \max _x P(S=1|x)\) (and equality holds when \(\max _{x}P(Y=1|x)=1\))) and thus c can be estimated as \(\hat{c}=\max _{x}\hat{P}(S=1|x)\), where \(\hat{P}(S=1|x)\) is some estimator of posterior probability. In MLR method, the following parametric model is used to estimate \(P(S=1|x)\)

$$\begin{aligned} g_{MLR}(x,b,\gamma )=\frac{1}{1+b^2 + \exp (-\gamma ^Tx)}, \end{aligned}$$
(19)

where \(b>0\) and \(\gamma \in R^{p+1}\). Estimator \(({\hat{b}},{{\hat{\gamma }}})\) is obtained using gradient-based algorithms. Then noting that \(\max _x g_{MLR}(x,b,\gamma )=1/(1+b^2)\) one considers \({\hat{c}}=1/(1+{\hat{b}}^2)\). Assume now, similarly to JOINT method, that aposteriori probability \(P(Y=1|x)\) is logistic, i.e. \(P(Y=1|x)=\sigma (\beta ^{T}x)\). The following Lemma clarifies the relation between \((c,\beta )\) and \((b,\gamma )\).

Lemma 4

Assume that \(P(Y=1|x)=\sigma (\beta ^Tx)\), \(\beta _i\ne 0\) for at least one \(i\ge 1\) and for certain \((b,\gamma ^T)^T\) and all \(x\in R^{p+1}\) it holds that

$$\begin{aligned} P(S=1|x)=c\sigma (\beta ^Tx)= g_{MLR}(x,b,\gamma ). \end{aligned}$$
(20)

Then we have

$$\begin{aligned} \gamma _0=\ln c +\beta _0,\quad \gamma _{-0}= \beta _{-0},\quad c=\frac{1}{1+b^2}, \end{aligned}$$
(21)

where \(\beta ^T=(\beta _0,\beta _{-0}^T)\) and \(\gamma ^T=(\gamma _0,\gamma _{-0}^T)\).

Proof

Similarly to the Proof of Theorem 1, by choosing appropriate values of x we deduce the relations between parameters of the two competing models.

The proof easily follows from (20) after noting that for any \(i\ge 1\) such that \(\beta _i\ne 0\) (existence of such i is guaranteed by assumptions) and taking \(x_n=(0,\ldots ,-(\beta _i)\times n,0\ldots ,0)^T\), where non-zero element is placed at \((i+1)\)th coordinate, it implies that \(P(S=1|x_n)\) is equivalent to \(c\exp (\beta _0 -|\beta _i|\times n)\) when n tends to infinity. Comparison with the rate of convergence of RHS of (21) for \(x_n\) yields that \(\beta _i=\gamma _i\) and \(c\exp (\beta _0)=\exp (\gamma _0)\). If \(\beta _i=0\) by similar reasoning we obtain \(\gamma _i=0\). Thus we obtain the first two desired equalities. The last one follows immediately. \(\square \)

We note that although (21) suggests that \(g_{MLR}(x,b,\gamma )\) yields equivalent parametrisation of \(P(S=1|X)\) to that given by \((c,\beta )\), this is not true. This is due to subtle but crucial difference. Namely, in contrast to c and \(\beta \) which are independent parameters, b and \(\gamma \) are not algebraically independent in view of equality \(\gamma _0=-\ln (1+b^2) +\beta _0\), where \(\beta _0\) is an unknown constant. This entails that b and \(\gamma \) may not be treated as independent parameters while performing gradient-based optimization. In particular, the derivative of \(g_{MLR}(x,b,\gamma )\) with respect to b is not \(-2b/(1+b^2 + \exp (-\gamma ^Tx))^2\). Inadequate estimation may be expected in particular for small c when the absolute value of \(\gamma _0\) in the view of (21) becomes large. This is indeed confirmed by our numerical analysis in the next section.

6 Experiments

In the experiments we compare the performance of the proposed method CD+MM with the following methods: JOINT method (Teisseyre et al. 2020), TIcE (Bekker and Davis 2018), EN (Elkan and Noto 2008), MLR Jaskie et al. (2020), KM1, KM2 (Ramaswamy et al. 2016). We do not show the results for Partial Matching method (du Plessis and Sugiyama 2014) due to similarity with EN estimator discussed above. The source code of our method is available at https://github.com/teisseyrep/PU_class_prior.

Fig. 2
figure 2

Density functions of distributions \(X|Y=0\) and \(X|Y=1\) for simulation models 1 and 2

Fig. 3
figure 3

Estimation error \(|\alpha -{\hat{\alpha }}|\) (averaged over 100 simulations) wrt c for simulation model 1. Parameters: \(b=1,2\), \(\alpha =0.1,0.25,0.5\) and \(n=5000\)

6.1 Simulation models

We generate artificial data as follows. First Y is drawn from Bernoulli distribution with \(\alpha =P(Y=1)\). We consider \(\alpha =0.1,0.25,0.5\). Observed binary variable S is generated in such a way that \(P(S=1|Y=1)=c\), where c is treated as a parameter ranging from 0.1 to 0.9 and \(P(S=1|Y=0)=0\). Then \(X_1\) is generated using conditional distributions described below. We consider 2 scenarios:

  • Simulation model 1 \(X_1\) is generated using conditional distributions \(X_1|Y=0\sim N(0,1)\) and \(X_1|Y=1\sim N(b,1)\), where b is a parameter.

  • Simulation model 2 \(X_1\) is generated using conditional distributions \(X_1|Y=0\sim N(0,1)\) and \(X_1|Y=1\sim 0.25 N(0,1)+0.75 N(b,1)\).

Figure 2 shows density functions corresponding to conditional distributions of \(X_1|Y=0\) and \(X_1|Y=1\) for the simulation models. Parameter b controls the dependence strength between \(X_1\) and Y. We consider two values \(b=1\) and \(b=2\). Larger value of b corresponds to stronger dependence between \(X_1\) and Y. In order to make a task more challenging, we also add spurious noise variables \(X_2,\ldots ,X_{10}\), generated from N(0, 1), independently from Y and let \(X=(X_1,\ldots ,X_{10})\). Note that model 1 corresponds to logistic regression model, whereas in model 2 the dependence between Y and \(X_1\) cannot be described by logistic model. Thus the first model favours the proposed method which is based on logistic model, whereas model 2 allows to analyse the robustness of the proposed method.

Fig. 4
figure 4

Estimation error \(|\alpha -{\hat{\alpha }}|\) (averaged over 100 simulations) wrt c for simulation model 2. Parameters: \(b=1,2\), \(\alpha =0.1,0.25,0.5\) and \(n=5000\)

Figures 3 and 4 show estimation errors \(|\alpha -{\hat{\alpha }}|\) (averaged over 100 simulations) wrt c for simulation models (for better presentation we do not show the curves for KM1 as it worked systematically much worse than KM2). Observe that CD+MM achieves small averaged errors for both models and almost all parameter settings. The JOINT method and MLR usually work worse than CD+MM, whereas TIcE, EN and KM2 work poorly in most cases. The averaged estimation error for CD+MM decreases with c. We observed the largest estimation errors for small \(\alpha \) and small c, which is due to the fact that for this setting we observe very few labelled observations. Indeed, for sample size \(n=5000\), \(\alpha =0.1\) and \(c=0.1\) we have on average only 50 labelled examples. For simulation model 1, the estimation error of CD+MM increases when b decreases. This suggests that, when the dependence between Y and \(X_1\) is weak,s then estimation of \(\alpha \) is more challenging. In extreme case \(b=0\), Y and \(X_1\) are independent and in such case \(\alpha \) is not identifiable, see the discussion in Sect. 3. Advantage of CD+MM and JOINT methods over competitors is larger for Model 1 than for Model 2. This is understandable since, in the first case the logistic model for which these methods are designed, is well specified. Good performance of CD+MM and JOINT in the case of Model 2 indicates that the methods are robust against departures from the logistic model.

Fig. 5
figure 5

Distribution of \({\hat{\alpha }}\) wrt the true parameter c, for simulation model 1. The horizontal line corresponds to the true \(\alpha \)

Figures  5 and 6 show empirical distributions of \({\hat{\alpha }}\) in form of boxplots against the true parameter c for simulation models 1 and 2. First, it is clearly seen that EN, TIcE, KM1 and KM2 overestimate \(\alpha \), which is a result of their underestimation of c and agrees with theoretical analysis of Sect. 4. Secondly, we observe large variance for JOINT method, especially for small \(\alpha \) and small b. This suggests that simple gradient optimization used in JOINT method may be insufficient; the algorithm probably is getting stuck in local minima. The variance of CD+MM is much lower, when compared to JOINT method, which indicates that the proposed optimization procedure based on MM-algorithm allows to find optimal solutions more frequently which results in more stable estimation. We also stress that KM methods are the most computationally expensive among studied methods, especially for larger samples sizes. For example, when dimension of feature vector X is 10 and sample size is \(n=2000\), KM works about 2 times slower than CD+MM, whereas for sample size \(n=5000\), KM works about 30 times slower than CD+MM (assuming that for CD+MM, the maximal numbers of iterations are 300 and 50, for CD and MM algorithms, respectively; for KM we used default settings).

Finally, we performed convergence analysis of the three methods based on parametric modelling: JOINT method, MLR and the proposed CD+MM. Recall, that in CD+MM, in each step of cyclic coordinate descent algorithm 2 we perform iterative MM algorithm 1. To make a comparison between CD+MM and two remaining methods fair, for CD+MM algorithm, we analyze the total number of iterations, i.e. the number of iterations in cyclic coordinate descent Algorithm 2 multiplied by the number of iterations of the MM algorithm. In this experiment, we take only 10 iterations of MM algorithm for each step. Figure 7 shows how the value of log-likelihood changes with the number of iterations for simulation models 1 and 2 for \(c=0.3,0.5,0.7\). In all considered cases, CD+MM achieves larger value of the loglikelihood than two remaining methods, within 100 first iterations. Interestingly, the curves for JOINT method and MLR are similar. The plateau of the curves corresponding to JOINT method and MLR may indicate the problem of non-convergence to the global optimum discussed above.

Fig. 6
figure 6

Distribution of \({\hat{\alpha }}\) wrt the true parameter c, for simulation model 2. The horizontal line corresponds to the true \(\alpha \)

Fig. 7
figure 7

Convergence analysis. Log-likelihood function with respect to the number of iterations for logistic regression-based methods: JOINT METHOD, CD+MM and MLR

6.2 Benchmark datasets

We use 8 popular benchmark datasets from UCI Machine Learning Repository and one that was used for the IJCNN 2001 neural network competition (Prokhorov 2001). A short summary of each dataset can be found in Table 1. They were chosen to represent various characteristics of data (number of observations, number of features and the fraction of positive examples). To adjust these data to our problem, we created PU datasets from the labelled datasets, the positive examples were selected to be labelled with label frequencies \(c= 0.1,0.2,\ldots ,0.9\). For each label frequency c, we generated 100 PU datasets labelling randomly elements having \(Y = 1\) with probability c and then averaged the results over 100 repetitions. The true class prior for each dataset was estimated as the number of positive examples divided by the number of examples. All numerical features were scaled between 0 and 1 with the standard transformation \( (x-\min (x))/(\max (x)-\min (x))\). Such transformation was recommended for TIcE algorithm (Bekker and Davis 2018). Due to computational cost of KM1 and KM2, for these methods, as in Ramaswamy et al. (2016) and Bekker and Davis (2018), we subsampled two largest datasets (mushroom and ijcnn2001) choosing \(n=2000\) observations and averaged the results of 5 such trials for each experiment. Figure  8 shows averaged values of \(|{{\hat{\alpha }}}-\alpha |\) and Fig. 9 empirical distributions of \({\hat{\alpha }}\) against c. In terms of an error \(|\alpha - {\hat{\alpha }}|\) the method CD+MM achieves the most accurate results for five datasets (credit-g, diabetes, heart-c, mushroom, spambase) for all or almost all c values with the JOINT or MLR method being the second best (see Fig. 8). In two cases the KM2 works best (vote and wdbc). The MLR method works well on average except for small values of \(\alpha \) and two data sets: credit-g and ijcnn2001 (in the latter case it behaved very erratically and due to this it has been removed from the respective plot). It can be seen from Fig. 9 that for both CD+MM and JOINT method \({\hat{\alpha }}\) is usually less variable than EN and TIcE and the estimation error decreases with c. For EN and TIcE underestimation of c results in overestimated \(\alpha \) for BreastCancer, credit-g, diabetes, heart-c, spambase and mushroom datasets. On the other hand, the CD+MM and MLR methods tend to underestimate \(\alpha \) for small values of c.

Table 1 Summary statistics of benchmark datasets
Fig. 8
figure 8

Estimation error \(|\alpha -{\hat{\alpha }}|\) (averaged over 100 simulations) wrt c for benchmark datasets

Fig. 9
figure 9

Distribution of \({\hat{\alpha }}\) wrt the true parameter c, for benchmark datasets. The horizontal line corresponds to the true \(\alpha \). The plot for MLR on ijcnn2001 data is omitted to its erratic behaviour

6.3 Experiment on clinical dataset MIMIC

We performed an experiment on large clinical database MIMIC III (Johnson et al. 2016). The database contains information on 33166 patients treated in intensive care units (ICU) who are diagnosed according to the coding scheme ICD-9. Patients are diagnosed with various diseases, among which we consider 5 diseases: hypertension, kidney and liver disease, diabetes and chronic pulmonary obstructive disease (copd). The above families of diseases were already investigated in previous studies (Zufferey et al. 2015; Teisseyre et al. 2019; Teisseyre 2020). Similarly as for benchmark datasets, the true prevalence \(\alpha \) is computed as a fraction of patients in the database with the given disease. Table 2 shows the values of \(\alpha \) for the considered diseases. Note that these values do not match the prevalences of the diseases in the population, which is due to the fact that the considered database is related to the ICU patients and thus it cannot be treated as a representative sample from the population. The original dataset consists of 308 features which correspond to certain blood and diagnostic tests (e.g. Glucose, Sodium, etc.), administrative information (e.g. sex, age, marital status) and medical scores used to track a person’s status during the stay in an intensive care unit (e.g. Braden score used to assess a risk of developing a pressure ulcer). The list of all features can be found at https://home.ipipan.waw.pl/p.teisseyre/PUBLICATIONS/parcc/parcc_supplement.pdf. For each disease, we select 30 features using a simple filter based on mutual information, i.e. we first calculate the mutual information between the given disease and the features and then select 30 features corresponding to largest values of mutual information.

To create PU datasets from the completely labelled datasets, the positive examples are selected to be labelled with label frequencies \(c=0.1,0.2,\ldots ,0.9\). For each label frequency c we generated 100 PU datasets labelling randomly elements having \(Y=1\) with probability c and then averaged the results over 100 repetitions. The above scheme corresponds to the situation when actually occurring disease is diagnosed with probability c.

Figure 10 shows distributions of \(\hat{c}\) wrt the true parameter c. Observe that EN and TIcE underestimate label frequency c, for all 5 considered datasets. The same is true for KM1 and KM2, though they work better than EN and TIcE. The proposed methods work much better for all datasets except liver, for which they show the same tendency to underestimate c as EN and TIcE but to lesser degree. Figure 11 shows distributions of \({\hat{\alpha }}\) wrt the true parameter c. Underestimation of c results in considerable overestimation of \(\alpha \) in the case of both EN and TIcE. The proposed method CD+MM gives fully satisfactory results for diabetes and kidney. Although, CD+MM slightly overestimates \(\alpha \) for copd and liver, its error is still much lower than for EN and TIcE. For hypertension, CD+MM underestimates \(\alpha \) for small c and overestimates \(\alpha \) for large c. Finally, for CD+MM we observe smaller errors than for JOINT method, which indicates that the proposed optimization procedure based on MM-algorithm allows to improve the estimation accuracy. Figure 12 shows estimation error \(|\alpha -{\hat{\alpha }}|\) (averaged over 100 simulations) wrt c. We observe the smallest averaged estimation errors for CD+MM for all cases but two with \(c=0.1\). CD+MM outperforms MLR, which usually works on par with JOINT method, but in some situations is unstable (for example liver disease and small c).

Table 2 Summary statistics of MIMIC III database
Fig. 10
figure 10

Distribution of \(\hat{c}\) wrt the true parameter c, for MIMIC-III datasets

Fig. 11
figure 11

Distribution of \({\hat{\alpha }}\) wrt the true parameter c, for MIMIC-III datasets. The horizontal line corresponds to the true \(\alpha \)

Fig. 12
figure 12

Estimation error \(|\alpha -{\hat{\alpha }}|\) (averaged over 100 simulations) wrt c for MIMIC III datasets

7 Conclusions and future work

In this paper we analysed different methods of class prior estimation in positive unlabelled learning. We showed that class prior probability is not identifiable for a PU single sample scenario given a full knowledge of distribution of (XS) if no assumptions on distribution of (XY) are imposed. The class prior becomes identifiable when we impose mild semi-parametric model assumptions on conditional distribution of Y given X. We formally show that in some situations, some of the existing algorithms tend to underestimate label frequency c and overestimate class prior probability. This property is confirmed by the numerical experiments. The proposed approach, based on logistic regression, involves simultaneous estimation of label frequency c and model parameters. In order to account for the non-concavity of the likelihood function, a novel optimization procedure, called CD+MM, is proposed in this paper, which is a combination of cyclic coordinate descent and Minorization-Maximization algorithms. In the method, we iteratively optimize profile log-likelihood functions. The experiments, performed on artificial and benchmark datasets as well as on large clinical database MIMIC, indicate that the proposed method CD+MM achieves lower estimation errors of the class prior than other considered methods, for most of the datasets and parameter settings. Indirectly, this indicates that CD+MM is robust to departures from parametric setting of a logistic model from which it has been derived. For the benchmark datasets, KM2 was very competitive, however it was computationally very costly. Moreover, CD+MM is significantly less variable than related JOINT method which uses simple gradient optimization of the likelihood function. It follows from experiments, that class prior estimation becomes more challenging when label frequency is small and the dependence between class variable Y and feature vector X is weak.

Since the performance of the methods based on logistic regression (JOINT method and CD+MM) seems promising, we believe that this approach is worth pursuing. It would be of interest to develop modifications of the method for number of predictors p larger than sample size n using e.g. regularised version of logistic regression. Such version would be particularly useful to deal with high-dimensional data. Moreover, finding concave lower bound of L(cb) both in c and b would lead to another potentially interesting modification of the proposed method.