1 Introduction

How well do international organizations (IOs) perform? The literature on this core question of IO research has made huge progress in recent years, thanks in part to the increasing availability of independent evaluationsFootnote 1 published by both international organizations and national governments. Some of these evaluations summarize the performance of evaluated activities in standardized numeric ratings, which has enabled researchers to compile major data collections on the performance of IO activities (e.g., Honig et al. 2022). Nevertheless, the rich information contained in the actual evaluation reports – often several hundred pages long – has not yet informed comparative IO research. What is more, standardized performance ratings exist primarily for single projects and for activities implemented by organizations in the field of multilateral financial aid and development assistance. This implies that evaluations of broader thematic topics or institutional activities, as well as those published by organizations in other policy fields, have not yet been considered in IO performance research. Organizations of the United Nations (UN) system, for example, publish around 750 evaluation reports per year, but the majority of these organizations do not publish performance scores.Footnote 2

Against this backdrop, this article breaks new ground by introducing and validating a method suited to extract performance measures from the text of an evaluation report. This provides researchers with an alternative way to measure the performance of an IO activity for organizations that publish performance scores, and it offers new insights into organizations that publish evaluation reports without such scores.

First, we introduce the new IOEval dataset that contains sentence-level text of 1,082 evaluation reports from nine major UN-system IOsFootnote 3 produced between 2012 and 2021. These evaluations assess four different types of IO activities, namely projects, programs, institutional and thematic activities. These activities take place at the country, regional, or global level.

Second, we introduce a procedure that for the first time enables extracting quantitative performance measures from the text of evaluation reports. Our rationale considers the structure of written reports, which typically break down broader evaluation questions into sub-questions that are then individually assessed. Single positive or negative assessments take place at the level of sentences. We expect that the more positive assessments a report contains, the more positive the performance of the evaluated activity.

Given that the dataset contains close to one million sentences, we use deep learning for sentence classification. Based on 10,296 hand-coded sentences from 180 reports selected at random, we fine-tune a pre-trained language model (BERT) that ultimately allows classifying sentences as containing a positive or negative assessment of the evaluated activity (or neutral descriptive text). This procedure enables computing the share of positive assessment sentences per report, our performance measure.

We apply three strategies to demonstrate the validity of the novel performance measure (Adcock & Collier, 2001). First, content validation at the report and sentence level shows that the share of positive sentences in a report adequately reflects the underlying concept of performance. Second, we conduct convergent validation by applying our measurement procedure to 661 evaluation reports by the World Bank which were not in the original training data. In addition to the text, these reports also contain a human-provided performance score. We find a strong positive correlation between our own measure and the scores provided by the World Bank evaluators. Lastly, construct validation finds that our measure yields theoretically expected results: Comparing the performance of projects (narrow goals, short time frame, often at sub-national level) and programs (bundle of projects, long-term goals, often at national or regional level), we confirm that the former perform better than the latter across all IOs.

Overall, this article contributes to the study of international organizations by providing a validated, reliable, replicable, and easily scalable performance measure based on the full text contained in IO evaluation reports. We publish the IOEval dataset, which enables comparative research on the performance of 1,082 evaluated project, program, institutional or thematic activities at the country, regional or global level by nine UN-system IOs. This helps expand the study of IO performance – both its causes and consequences – beyond organizations for which human-provided performance scores already exist. As we also publish the language model’s algorithm, anybody can replicate our procedure and expand the IOEval dataset in the future.

In the following, we first situate our research in the context of studies on IO performance, and then proceed along the structure set out above, ending with a discussion on limitations and opportunities for future research.

2 IO performance and evaluation

In this section, we define the latent concept of organizational performance, discuss how organizations use evaluation for performance measurement, and report to what extent the existing IO literature has used such performance data.

In its simplest terms, the performance of a public organization has been defined as its ability to achieve pre-defined goals (Heinrich, 2012). The Oxford dictionaryFootnote 4 defines performance as “[t]he accomplishment or carrying out of something commanded or undertaken”. As Lipson (2010: 256) put it, it is about “an organization’s use of its resources, technology, and relationships with its organizational environment to achieve collective goals”.

Public sector organizations have long attempted to measure their performance “along a set of key indicators” (Poister et al., 2015: 7). But since not all questions of organizational goal accomplishment can be answered at the level of indicators, organizations also use evaluation to “analyz[e] the performance data” (Poister et al., 2015: 9). Evaluations therefore contain analyses of a broad range of data sources to assess the performance of an organization in a specific activity.

Academic research, too, has attempted to measure IO performance, employing a broad range of concepts and definitions of the term (for an overview, see Gutner & Thompson, 2010). Some studies address performance at the level of institutional design choices, staffing practices or management processes (Graham, 2014; Heinzel, 2021). Others capture performance at the level of outputs, for instance by counting and characterizing IO governing body resolutions (Sommerer et al., 2021; Tallberg et al., 2016). Some IOs, most notably in the field of development assistance, publish performance metrics for single projects which have been used extensively as a reference for IO performance (Heinzel, 2022; Honig, 2020; Honig et al., 2022; Lall, 2017). Lastly, there are also attempts to measure performance at the level of societal outcomes, examining the extent to which IOs contribute to managing the global economy (e.g. Parízek, 2020), fostering good governance (e.g. Honig & Weaver, 2019), or protecting human rights (e.g. Lebovic & Voeten, 2009).

Evaluation, as defined by the UN system (see introduction), can attempt to measure performance at any one of these levels, depending on the exact evaluation research question. The UN Evaluation Group’s classification distinguishes four main evaluation types. Evaluations can measure the performance of 1) single projects; 2) programs, which typically contain a bundle of projects; 3) institutional processes or activities, such as human resources, public relations, or procurement; and 4) broader thematic questions, such as an IO’s support to youth or gender equality.Footnote 5 When assessing one of these activities, evaluators consider performance along all or several of the UN Evaluation Group’s six evaluation criteria: the relevance, coherence, effectiveness, efficiency, impact, and sustainability of an IO activity (UNEG, 2016: 10).

IO evaluations are typically conducted or supervised by an IO’s central evaluation unit. These are established as independent entities within the organization (hence they are sometimes called independent evaluation units). Which IO activities are evaluated is not random, but subject to stakeholder consultations which usually involve IO management and member state representatives in the IO governing bodies (Eckhard & Jankauskas, 2019). The aim is to generate “the most relevant, useful and timely information” about a wide spectrum of IO activities (UNEG, 2016: 21). Sometimes a decentralized operational unit commissions the evaluations, with the central evaluation unit providing guidance and oversight. The actual evaluation research is conducted by a team of experts, either internal or external consultants, or a mix of both. The process of drafting the final evaluation report involves feedback and stakeholder consultation.

Some IOs also publish a standardized performance rating alongside their evaluation reports. Using this data, Honig et al. (2022) published the hitherto most comprehensive dataset covering project ratings for more than 20,000 projects by 12 bilateral and multilateral aid agencies. Scholars can easily use such standardized performance ratings for their analyses. For instance, Honig et al. (2022) studied the link between transparency and performance for more than 20,000 projects, Denizer et al. (2013) utilized 6,000 World Bank project evaluations to study micro and macro correlates of aid project outcomes, and Bulman et al. (2017) looked into 3,797 World Bank and 1,322 Asian Development Bank projects (see also Buntaine & Parks, 2013; Dreher et al., 2013; Feeny & Vuong, 2017; Geli et al., 2014).

So far, however, mostly organizations in the field of multilateral financial aid and development assistance have published standardized performance ratings. This means that there is still a lack of comparative insights for other policy fields, such as health, humanitarian aid or social policy. In this regard, two key obstacles currently prevent scholars from unleashing the potential that IO evaluation reports offer. First, there is no comprehensive empirical basis upon which scholars could build their analyses, since internal IO evaluation reports are scattered between IOs, timeframes, and evaluation types and levels. Second, given that many IOs do not provide standardized ratings for each evaluation, scholars lack analytical tools to extract performance measures from large numbers of reports. Both reasons explain why thousands of internal evaluations from the UN system remain neglected in IO research.

To be sure, evaluations of IO activities can also be conducted externally, such as those produced by donors who seek to scrutinize how and to what end IOs employ their financial contributions. For example, a bilateral donor or multilateral development bank may sponsor a project by a UN agency and may publish the resulting evaluation data. However, such external donor evaluations usually only assess those projects which are funded by or of interest to the evaluating donor. This “eye of the beholder” problem (Gutner & Thompson, 2010: 233) biases the available performance data, considering that it is primarily Western donors who provide the bulk of funding for IO activities.Footnote 6 Also, these reports are usually not public.

Therefore, the remainder of this article introduces a novel dataset of internal IO evaluation reports along with a method suited to extract performance measures from their text.

3 The IOEval dataset

The UN Evaluation Group maintains a database with over 20,000 evaluation reports (as of 2023) published by its 21 member organizations. But the database does not store all reports: documents are often missing, or links point to reports in other databases that are not accessible. We therefore hand-collected reports, which is labor-intensive. Thus, for this first version of the dataset, we limited the selection to reports from nine major UN system IOs: ILO, UNDP, UNICEF, FAO, UNESCO, WHO, IOM, UNHCR, and UN WOMEN.Footnote 7 We chose these IOs to obtain a diverse set of organizations while ensuring that each published a large number of reports: First, these IOs vary in their policy fields, staff size and constellations, as well as budgetary scale and scope (see Appendix Table A I.1). Second, each IO published between 43 and 244 reports between 2012 and 2021. Third, as members of the UN Evaluation Group, these IOs are subject to the same system-wide evaluation norms and guidelines (UNEG, 2016), making their evaluation reports comparable. To verify that the definition of evaluation and the evaluation criteria indeed matched the UN Evaluation Group standards, we restricted the data collection for each organization to the period for which we could access their evaluation policies.

We compiled the dataset in several rounds: First, we web-scraped the UN Evaluation Group’s data repository, which, however, left gaps due to broken links and missing PDFs; second, we manually collected additional reports from individual IO websites or requested reports directly from evaluation units (approximately 74% of the whole dataset). All PDF documents were converted into raw text using Optical Character Recognition. The raw text was cleaned by applying standard procedures of natural language processing (e.g., removal of special characters and numbers) and split into sentences. The final IOEval dataset includes a total of 1,082 evaluation reports published from 2012 to 2021 and 995,743 distinct sentences, each indexed by its order in the original report. For further details on data collection and cleaning see Appendix I.
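For illustration, a minimal preprocessing sketch of this kind, assuming common open-source libraries (pdf2image, pytesseract, nltk) and a hypothetical file name rather than the authors' actual pipeline, could look as follows:

```python
# A minimal preprocessing sketch: OCR a PDF report, clean the raw text,
# and split it into sentences. Library choices and the file path are
# illustrative assumptions, not the authors' actual code.
import re

import nltk
import pytesseract
from nltk.tokenize import sent_tokenize
from pdf2image import convert_from_path

nltk.download("punkt", quiet=True)  # sentence-splitting model


def pdf_to_sentences(pdf_path: str) -> list[str]:
    # 1) Render each PDF page to an image and run OCR on it.
    pages = convert_from_path(pdf_path, dpi=300)
    raw_text = " ".join(pytesseract.image_to_string(page) for page in pages)

    # 2) Basic cleaning: drop numbers and special characters, collapse whitespace.
    cleaned = re.sub(r"[^A-Za-z.,;:!?()'\" -]", " ", raw_text)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()

    # 3) Split the cleaned text into sentences, keeping their original order.
    return sent_tokenize(cleaned)


sentences = pdf_to_sentences("example_report.pdf")  # hypothetical file name
print(len(sentences), "sentences extracted")
```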

In addition, the IOEval dataset also includes metadata variables at the level of reports: report title, publication date, evaluation type (project, program, institutional or thematic), evaluation level (country (specifying its name), regional, global), and commissioning unit (centralized or decentralized). At the sentence level, we specify to which text section a sentence belongs (executive summary, main text, appendix). See Table A II.1 in the Appendix for further details and examples.

4 Measuring performance based on evaluation reports

To extract a performance measure from the text-based reports, we consider the structure of the text. Each report typically breaks down a broader evaluation question into smaller sub-questions that are analyzed and assessed, sometimes with a mix of research methods. The structure can also follow the UNEG’s six evaluation criteria as introduced above. The main location of individual assessments is the analytical or findings section of a report, where the different sub-questions are raised, discussed, and answered. In addition, summaries of the main findings are provided in the executive summary, introduction, recommendation section, and conclusion. It is plausible to expect that the more positive or negative assessments a report contains, the more positive or negative the overall judgement about the evaluated activity should be.

This logic can be extended further to the level of sentences. Evaluation reports in our dataset contain on average 850 sentences. Around half of these sentences contain no assessments: they are descriptive and provide information on the evaluation’s background, structure or methodology. The other half contain assessments, i.e., either positive or negative judgements about the evaluated activity. Our central claim is that the more positive (or negative) sentence-level assessments a report contains, the more positive (or negative) the overall performance of the evaluated activity. An important limitation in this regard is that the measure treats all sentences equally: it is quite possible that some sentences contain several judgements or are more important than others. We account for this in the validation below.

The IOEval dataset contains close to one million sentences, which exceeds the scope of what can reasonably be classified by human coders. To classify sentences as containing a positive or negative assessment (or neutral descriptive text), we therefore draw on recent breakthroughs in natural language processing, specifically deep learning-based contextualized language models. In particular, we employ the state-of-the-art language model BERT (Bidirectional Encoder Representations from Transformers)Footnote 8 developed by Google in 2018 (Devlin et al., 2019). Such models can be fine-tuned to more specific applications, including classification tasks (see Zhuang et al., 2021), an approach which has become the standard for deep learning in natural language processing (Huo & Iwaihara, 2020; Sun et al., 2019). For instance, scholars have used fine-tuned BERT models to classify textual review and social media (e.g., Twitter) data (Chiorrini et al., 2021; Pota et al., 2021). To our knowledge, however, no prior work has applied a fine-tuned BERT model to lengthy political report data.

Fine-tuning requires input data. We used manually coded (labeled) text from 180 evaluation reports selected at random from the nine IOs in the IOEval dataset. After establishing that sentences in executive summaries do not differ systematically from sentences in the main text of the reports, we coded all sentences in the executive summaries. This enabled us to cover a broader variety of reports (and therefore topics and activities) than an approach that would have manually coded full reports. Overall, the 180 executive summaries contained 10,296 sentences. The coding involved three coders who first coded the same set of reports to establish a common understanding. Once the inter-coder agreement exceeded 80%, coding proceeded individually, with weekly meetings held to clarify questions and maintain inter-coder agreement. Each sentence was labeled as a positive or negative assessment of the evaluand, or as neutral when it contained no performance judgement.Footnote 9 Below we give examples of typical sentences:

  • Positive assessment sentence: “Effective management and efficient allocation and use of resources have contributed to the achievement of results grounded in EU normative standards…” (UN WOMEN, 2020: 16).

  • Negative assessment sentence: “However, the lack of an institutional host or anchor for the awareness-raising campaign strategy … casts some doubts with regards to the sustainability of the campaign” (IOM, 2019: 3).

  • Descriptive sentence (neutral): “Two distinct training approaches … were compared in this evaluation – the foundational and the enhanced ECD kit interventions” (UNICEF, 2018: 8).

To implement the fine-tuning, all sentences were tokenized using the BERT tokenizer, and the tokens were converted to their identifiers according to BERT’s vocabulary. We fine-tuned the model on around 90% of the labeled data (the rest was used for validation, see below). The transfer-learning step took approximately half an hour for four epochs on a Google Colab graphics processing unit. The fine-tuned BERT outputs three values, i.e., logits, which are normalized using the softmax function to retrieve prediction probabilities for a positive, negative or neutral assessment. To determine the predicted class, we took the maximum among these three probability values; in other words, the class label with the highest probability is the predicted label for the particular input sentence. For more details on the manual coding rules and procedures, see Appendix III.
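For illustration, the following is a minimal sketch of such a fine-tuning and prediction workflow using the Hugging Face transformers library; the model variant ("bert-base-uncased"), hyperparameters, label ordering, and toy training sentences are assumptions, not the authors' published code.

```python
# A minimal, hedged sketch of fine-tuning BERT for three-class sentence
# classification and predicting labels via softmax and argmax.
import torch
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)

LABELS = {0: "negative", 1: "neutral", 2: "positive"}  # assumed label order

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=3)


class SentenceDataset(torch.utils.data.Dataset):
    """Wraps hand-coded sentences and integer labels for the Trainer."""

    def __init__(self, sentences, labels):
        # Tokenize sentences and map tokens to ids in BERT's vocabulary.
        self.enc = tokenizer(sentences, truncation=True, padding=True,
                             max_length=128)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item


# Toy placeholders; the paper uses ~90% of the 10,296 hand-coded sentences.
train_sentences = ["The project met its objectives.",
                   "The campaign lacked an institutional anchor.",
                   "Two training approaches were compared."]
train_labels = [2, 0, 1]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-ioeval", num_train_epochs=4,
                           per_device_train_batch_size=16),
    train_dataset=SentenceDataset(train_sentences, train_labels),
)
trainer.train()


def predict(sentence: str):
    """Return (label, probability) for a single input sentence."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True,
                       max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits              # three raw scores
    probs = torch.softmax(logits, dim=-1).squeeze()  # softmax -> probabilities
    idx = int(torch.argmax(probs))                   # highest probability wins
    return LABELS[idx], float(probs[idx])


print(predict("The project was efficiently implemented."))
```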

Overall, the language model enabled us to classify the 995,743 sentences comprised in the 1,082 reports of the IOEval dataset, predicting for each sentence whether it contains a positive or negative assessment of the IO activity under evaluation – or neutral text. For each evaluation report, we then computed the share of positive sentences among all assessment sentences (positive and negative, excluding neutral). This yields a continuous measure ranging from 0 to 1, whereby 1 denotes that 100% of the sentence-level assessments in a report are positive. Figure 1 provides an overview of the resulting IOEval dataset, showing year coverage and positive assessment share at the report level for each IO.
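As an illustration, the following minimal pandas sketch (with hypothetical file and column names) aggregates sentence-level predictions into this report-level measure:

```python
# A minimal sketch of the report-level aggregation: the share of positive
# sentences among all assessment sentences (neutral sentences are excluded).
# File and column names are illustrative assumptions.
import pandas as pd

# sentences.csv is assumed to hold one row per classified sentence,
# with columns "report_id" and "label" in {"positive", "negative", "neutral"}.
sentences = pd.read_csv("sentences.csv")

assessments = sentences[sentences["label"] != "neutral"]
positive_share = (
    assessments.assign(is_positive=assessments["label"].eq("positive"))
    .groupby("report_id")["is_positive"]
    .mean()                      # positives / (positives + negatives), in [0, 1]
    .rename("positive_share")
)
print(positive_share.head())
```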

Fig. 1 IO performance at report level. The figure plots individual reports in our dataset and their share of positive findings (y-axis) for IOs over time (x-axis)

5 Validation

The performance of IO activities is a latent concept, and as such its true values are unknown (Kellstedt et al., 1993). Certain frameworks have been proposed to estimate the validity of novel measures for latent concepts (Adcock & Collier, 2001; see also Lührmann et al., 2020; Weidmann & Schutte, 2017). Following these, we report findings from content, convergent, and construct validation of our novel performance measure.

5.1 Content validation

Content validation examines whether a measure adequately captures the content of the underlying concept (Adcock & Collier, 2001). In this case, we address the “adequacy of content” (Adcock & Collier, 2001: 538) both quantitatively and qualitatively.

First, we quantitatively assess the extent to which the algorithm’s sentence predictions match a human coder’s judgement of the evaluated activity’s performance. For that, our human coders manually labeled approximately 2,000 sentences as positive, negative, or neutral assessments, drawn from randomly selected evaluation reports’ executive summaries which were not in the original training data. The hand-coded sentences were equally distributed among the three code dimensions (666 sentences each for positive, negative, and neutral assessments). Then, the same sentences were classified by the algorithm and the results were compared. Supporting the model’s validity, its accuracy (i.e., the human-algorithm coding agreement) on this test data reaches 89% (87% for positive assessment sentences, 90% for negative assessment sentences, and 92% for neutral sentences).
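A minimal sketch of such an agreement check, with toy placeholder labels standing in for the roughly 2,000 hand-coded test sentences, could look as follows:

```python
# A minimal sketch of computing human-model agreement on held-out test
# sentences using scikit-learn. The label lists are toy placeholders.
from sklearn.metrics import accuracy_score, confusion_matrix

human_labels = ["positive", "negative", "neutral", "positive"]  # toy values
model_labels = ["positive", "negative", "neutral", "negative"]  # toy values

print("Overall accuracy:", accuracy_score(human_labels, model_labels))

# Per-class agreement: diagonal of the row-normalized confusion matrix.
classes = ["positive", "negative", "neutral"]
cm = confusion_matrix(human_labels, model_labels, labels=classes,
                      normalize="true")
for cls, class_acc in zip(classes, cm.diagonal()):
    print(f"{cls}: {class_acc:.2f}")
```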

In addition, for each predicted sentence the algorithm returns a probability score which shows how confident the model is about the allocated code (positive, negative, neutral). The histogram in Fig. 2 shows the probability distribution for the predicted labels. The chart reveals that most class labels are predicted with a high probability (≥ 95%), indicating high model confidence. This yields no indication of a systematic bias. In the Appendix (Tables A IV.1–2), we show example sentences for the rare cases when the model’s probability for a prediction did not exceed 50%.
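For illustration only, a histogram of this kind could be produced as in the following sketch, with toy placeholder probability values:

```python
# A minimal sketch of plotting the model-confidence distribution shown in
# Fig. 2: a histogram of the highest softmax probability per classified
# sentence. The probability values below are toy placeholders.
import matplotlib.pyplot as plt

predicted_probs = [0.99, 0.98, 0.97, 0.95, 0.62, 0.48]  # toy values

plt.hist(predicted_probs, bins=20, range=(1 / 3, 1.0))  # 1/3 is the lowest possible maximum over three classes
plt.xlabel("Probability of the predicted label")
plt.ylabel("Number of sentences")
plt.show()
```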

Fig. 2 Probability scores for the predicted label (x-axis) for all test data sentences (y-axis, number of sentences)

As the second approach to assessing content validity, we employ content analysis of extreme-case reports. This aims to establish whether reports that contain a very high or low share of positive sentences indeed represent cases of very successful or unsuccessful IO activities. We selected four reports located at the far ends of our scale from the group of executive summaries we coded manually when training the language model. Below, we demonstrate that these executive summaries indeed contain accounts of very successful or unsuccessful activities.

Regarding the two most negative reports, UNESCO’s evaluation of its field reform in Africa (UNESCO, 2015) has only 16% positive sentences in the executive summary, compared to 84% negative sentences.Footnote 10 This indicates very poor performance. Manual inspection shows that this is also reflected in the content of the executive summary, which explains that “achievements thus far have been limited” (5), that the field reform “was not complemented by a strategy (…) or a robust implementation plan with clear targets and deliverables” (5), and that the “overall leadership, monitoring and oversight over the reform was ambivalent, uncoordinated and uneven” (6). Next, the evaluation of UNHCR’s efforts to phase down its presence in Angola, Botswana and Namibia also received only a small share (13%) of positive sentences (UNHCR, 2018). Here, evaluators strongly criticized UNHCR’s lack of strategy and guidance for the phase-down process. We quote two exemplary sentences below:

“Nevertheless, when the 2013 decision [on the phasedown, the authors] was made, the intended outcomes were formulated only in terms of office structures and presences. The decision did not include a transparent analysis of underlying assumptions and preconditions that could have guided field offices; as a result, appropriate strategies, with clear indicators, operational milestones and roadmaps were not developed (…); and could not be used to support the review of progress in subsequent years” (iv).

By contrast, two reports at the other end of the scale, with the highest share of positive sentences, paint a completely different picture. The executive summary of the FAO’s evaluation of its project on water use and management in the Sana Basin contains 100% positive assessment sentences. The report indeed consistently emphasizes the “competence and determination of the project team”, saying that it “met its objectives” and “provided an effective model” for future regulation (FAO, 2018). Similarly, the IOM’s evaluation of its project on human security also contains only positive assessments. In the report, evaluators state that the project was “relevant”, “effective”, and “efficiently implemented both in terms of operations and financially” (IOM, 2018). Overall, this illustrative content analysis of extreme cases indicates that a very low or high share of positive sentences adequately reflects reports describing activities with highly negative or positive performance assessments.

Overall, these findings demonstrate that the language model classifies the content of sentences with very high accuracy and that the share of positive sentences adequately represents the performance-related content of reports. Given that the predicted probabilities for each sentence are known (the model’s confidence), the uncertainty of the estimation can be incorporated as a control variable in future analyses.

5.2 Convergent validation

Convergent validation examines how a novel measure correlates with an established measure of a similar concept (Adcock & Collier, 2001). In this case, we compare the text-based performance measure with an exogenous performance metric of IO activities. The World Bank (WB) offers such alternative data in the form of a manual outcome rating for projects provided by the organization’s Independent Evaluation Group (IEG). Their standardized rating procedure aims to provide coherent and consistent performance scores that allow comparison over time and between countries (IEG, 2022). If the shares of positive assessments, as identified by our model, correlate with the IEG ratings (in the same sample of WB reports), then our confidence that these shares indeed reflect project performance increases. Note that there is a debate about the validity of the IEG performance scores (e.g., Malik & Stone, 2018), which is why we expect no perfect correlation.

To compare our model results with the IEG ratings, we collected a sample of WB evaluation reports which all contain standardized IEG performance ratings. We focus on the most general metric, termed “Outcome”, which measures “the extent to which the operation’s major relevant objectives were achieved, or are expected to be achieved, efficiently” (IEG, 2014: 5). Hence, it generally aligns with the evaluation objectives specified by the UN Evaluation Group. There are six possible ratings for this metric: “Highly Satisfactory”, “Satisfactory”, “Moderately Satisfactory”, “Moderately Unsatisfactory”, “Unsatisfactory”, and “Highly Unsatisfactory”.

We collected available WB reports published between 2012 and 2021, in line with the timeframe used for the compilation of our main IOEval dataset. The WB dataset consists of 661 reports in total, comprising two different report types: 473 are so-called Implementation Completion and Results Report Reviews (ICRRs), which the IEG drafts for all WB projects based on interviews and document analysis; 189 are Project Performance Assessment Reports (PPARs), which are conducted for 20% of all ICRRs but involve much more in-depth research.Footnote 11 The report types thus vary in length, scope, and how long after project completion they were conducted but, crucially, they use the same rating framework. After applying the same data-preprocessing steps as above, the reports were separated into their composite sentences (255,732 in total)Footnote 12 and fed into our language model. In turn, the share of positive assessments per evaluation report was calculated and correlated with the IEG ratings.

Figure 3 shows at the report level (represented by dots on the plot) how the share of positive assessments per report (y-axis) corresponds with the associated IEG ratings (x-axis). Density estimates show that the reports associated with each WB rating class are approximately normally distributed around progressively higher median points and that these distributions are statistically distinct (Figure A V.3 in the Appendix). Clearly visible at this level is a steady positive correlation between the two variables, with each higher rating resulting in a distribution centered around a higher median point for the positive assessment share variable, save for the lowest IEG outcome rating level of “Highly Unsatisfactory”. There is also a strong positive correlation (Spearman’s rank correlationFootnote 13: r(648) = 0.76, p = 2.2e-16).
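The correlation itself can be reproduced with a few lines of code; the following sketch assumes hypothetical file and column names rather than the published replication files:

```python
# A minimal sketch of the convergent-validation correlation: Spearman's rank
# correlation between the report-level positive assessment share and the
# ordinal IEG outcome rating (coded 1-6). File and column names are assumed.
import pandas as pd
from scipy.stats import spearmanr

wb = pd.read_csv("wb_reports.csv")  # columns "positive_share", "ieg_outcome" assumed

rho, p_value = spearmanr(wb["positive_share"], wb["ieg_outcome"])
print(f"Spearman's rho = {rho:.2f}, p = {p_value:.3g}")
```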

Fig. 3 Share of positive assessment (y-axis) across IEG ratings (x-axis) by report. Note that the IEG Outcome Rating is an ordinal variable with no values in between; the points have been spread randomly to improve interpretability

Whilst this initial examination supports the convergent validity of our novel measure with the manual metric of performance by the IEG, it also shows that the distribution of the lowest rating level does not follow the expected trend. However, statistical power analysis shows that the number of reports rated as “Highly Unsatisfactory” is too small to give an accurate representation of the distribution for this rating level.Footnote 14 While we still include this underpowered rating level in the main analysis, separate models in the Appendix exclude this level and show that the differences are negligible (see Table A VI.2).

To further investigate to what extent each IEG rating category matches our text-based measure, we report findings from an ordinary least squares (OLS) regression analysis in Table 1. Model 1 shows the bivariate association between the IEG ratings, taken as a continuous variable between 1 (‘Highly Unsatisfactory’) and 6 (‘Highly Satisfactory’), and our continuous positive assessment share measure. We find that each higher level of IEG rating is associated with a statistically significant mean increase in the predicted positive assessment share of a report. The finding also holds for the other model specifications reported below.

Table 1 Models of IEG outcome rating and average assessment relationship. IEG Outcome treated as continuous value (1–6). Restricted: neutral sentences removed; 95% confidence intervals displayed in brackets below the coefficients (alternative models reported in Appendix VI yield highly comparable results)

In Model 1, which includes no other control variables, each additional rating level is associated with an increase of 0.117 in the share of positive assessments at the 99% confidence level. With an R2-value of 0.58, the measure explains substantial variation in the dependent variable. This supports the claim that there is a consistent positive relationship between our novel measure and the manual measure produced by the IEG. This convergence, in turn, supports the argument that our text-based performance measure is aligned with the IEG measure.
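For illustration, a minimal sketch of such a bivariate OLS (hypothetical file and column names, not the published replication code) could look as follows:

```python
# A minimal sketch of Model 1: regressing the positive assessment share on
# the IEG outcome rating treated as a continuous 1-6 variable, via the
# statsmodels formula API. File and column names are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

wb = pd.read_csv("wb_reports.csv")  # columns "positive_share", "ieg_outcome" assumed

model1 = smf.ols("positive_share ~ ieg_outcome", data=wb).fit()
print(model1.summary())  # slope on ieg_outcome and the R-squared

# Fixed effects as in Model 3 could be added with categorical terms, e.g.:
# smf.ols("positive_share ~ ieg_outcome + C(year) + C(country)", data=wb).fit()
```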

A limitation, also visible in Fig. 3 above, is that there is overlap between the rating levels, especially for the ratings “Satisfactory” and “Highly Satisfactory”. This is also reflected in the R2-value, showing that in Model 1 the IEG scores account for approximately 58% of the variation in the positive assessment share, and not more (see also the prediction plot in Appendix VI, Figure A VI.2). Thus, reports with a relatively high share of positive assessments indicate high performance but map only partially onto a specific IEG rating category. It is important to stress that this does not undermine the positive correlation between both measurements detected above. But it shows that our novel measure does not allow predicting each IEG score in a deterministic way.

The reason is that the performance of IO activities is a latent concept: there is no certainty that either the IEG scores or our own sentence-based measure are fully accurate. On the one hand, the IEG measure could be biased. One study that investigated the IEG outcome scores by re-rating performance based on a second reading of the evaluation reports found some deviations (Malik & Stone, 2018; see also Weaver, 2010). On the other hand, our text-based measure could also be a source of bias. Evaluators could be discouraged from reporting concrete negative examples because they know that reports will be published with their names on them. In our sample, IEG evaluations are on average slightly more positive than UN evaluation reports (Appendix V, Figure A V.1). But whether the remaining gap between both measures is caused by our text-based measurement or the IEG’s measurement remains open for further investigation, as highlighted in the discussion.

To investigate what other factors could influence the relationship, we add further covariates to the regression analysis. First, Model 2 investigates whether there may be a systematic flaw in our text-based input data. As mentioned above, there are two types of IEG evaluation reports: the shorter and more standardized Implementation Completion and Results Report Reviews (ICRRs), and the much longer and more in-depth Project Performance Assessment Reports (PPARs). It is possible that the length or accuracy of these report types accounts for the remaining gap between the two measures. In Model 2, the main coefficient remains statistically significant, and there is only a slight difference (-0.042) between the two report types in our sample. The R2 remains close to that of Model 1, which suggests that the type of report does not drive the gap.

A second possibility is that there are biases on the side of the IEG’s human coders. For example, a group of IEG employees who provide the scores for a range of reports from a given country or at a given time could systematically deliver more or less positive performance grades. In Model 3, we therefore add fixed effects for the year of report completion and for the country/region in which the evaluation was conducted (see Appendix VI for the full table). The R2 figure increases substantially with the inclusion of these controls, meaning that the model accounts for just under 70% of the variation in the dependent variable, while the coefficients remain highly comparable to Model 1.

Overall, our results thus support the proposition that our measure for performance tracks that of the human coders at the IEG.

5.3 Construct validation

The third step is construct validation, i.e., testing whether a measure yields theoretically expected results. For that, we draw on the different types of evaluations in the IOEval dataset. These types refer to the different IO activities being assessed: projects, programs, institutional, or thematic activities. For program- and project-type activities, the management literature and the literature on IO performance yield the expectation that projects, with their much narrower and more attainable goals, should on average have higher performance scores than the more ambitious and complex program activities.

In order to achieve long-term policy goals, modern public organizations employ strategic management frameworks that structure their work along a hierarchical logic, stretching from abstract goals to concrete action (Maylor et al., 2006; Poister et al., 2015): At the top is the organizational strategy, outlining a vision with respect to a policy field. Programs specify mid- and long-term goals and objectives. Projects are concrete activities with specific goals that are carried out within a short time frame. This has implications for the success chances of projects and programs, as a study by Shao et al. (2012: 46) found: “project success is focused on project deliverables, whereas program success is concerned with delivering benefits and strategies.”

The idea of hierarchically structured policy activities has also shaped how the international community approaches major policy change, such as the fight against climate change or the pursuit of the sustainable development goals. Broader goals (and indicators) are agreed upon politically. Organizations and other actors then design policy programs and project activities to enable transformation pathways towards these goals. Such a hierarchical logic of programs and projects also structures the internal management of most UN system organizations. For example, the UNDP’s operational policy specifiesFootnote 15 that “UNDP’s results are outlined in Country Programmes, Regional Programmes and the Global Programme. […]. Programmes are operationally implemented through projects with multi-year or annual work plans.”

Success chances are not equally distributed between projects and programs. To account for this, the IO performance literature has used the metaphor of a pyramid. Gutner and Thompson (2010: 236), for instance, expect good performance to “trickle up,” with “success at each lower stage serving as building blocks for success as we move up the pyramid.” Thereby, the success chances are higher for projects, with their more attainable goals. Programs, by contrast, depend on the success of their downstream projects. This has also been shown empirically in the aid effectiveness literature. In their study of 1,600 Asian Development Bank projects and programs, Feeny and Vuong (2017: 329) find that “projects are more likely to be successful than programs”. For the construct validation, the expectation therefore is that, due to their more limited scope and objectives, IO activities at the project level should on average perform better than activities at the program level.

Figure 4 shows the difference in means of positive assessments between project- and program-level activities for the UN organizations in the IOEval dataset. Program-level activities contain on average 53% positive assessments, whereas projects contain on average 59% positive assessments. This difference in the average share of positive assessments is statistically significant, as displayed in Table 2 as well as in Appendix VII. The aid effectiveness literature highlights that most variation in performance ratings can be explained by country-level determinants (Denizer et al., 2013). Unsurprisingly, including year, country, and IO fixed effects therefore reduces the difference between both groups. But in line with Feeny and Vuong (2017: 329), project reports remain 0.038 points (3.8 percentage points) more positive at a p-value below 0.01.
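For illustration, the following minimal sketch (hypothetical file and column names, not the published replication code) estimates the project-program difference with fixed effects:

```python
# A minimal sketch of the construct-validation comparison: positive assessment
# share regressed on evaluation type (program as the base level) with year,
# country, and IO fixed effects. File and column names are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

reports = pd.read_csv("ioeval_reports.csv")  # columns assumed: positive_share,
                                             # eval_type, year, country, io
sub = reports[reports["eval_type"].isin(["project", "program"])]

fe_model = smf.ols(
    "positive_share ~ C(eval_type, Treatment(reference='program'))"
    " + C(year) + C(country) + C(io)",
    data=sub,
).fit()
print(fe_model.params.filter(like="eval_type"))  # project-vs-program difference
```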

Fig. 4 Difference in means of the share of positive assessments between program level activities (left y-axis) and project level activities (right y-axis) displayed in boxplots

Table 2 Models of evaluation type rating and average assessment relationship. Program evaluations used as base level for evaluation type. 95% confidence intervals displayed in brackets below the coefficients

Consistent with the theoretical expectation, we therefore find that projects, which have more attainable goals, perform better than programs. Being able to reproduce such an established expectation based on our proposed performance measure thus serves as construct validation. Furthermore, the repeated finding of differences in outcomes for projects and programs, in combination with our data set, opens a new and interesting route for research: Efforts to achieve long term policy goals, such as climate change or sustainable development, could benefit from an in-depth understanding of how certain combinations of programs and projects enable or obstruct transformation pathways.

6 Discussion

To summarize, we combined three validation strategies to scrutinize the validity of our novel measure of the performance of IO activities. Content analysis of extreme-case reports finds that a very high or low share of positive sentences corresponds with reports that present IO activities as very successful or unsuccessful. Furthermore, we show that our language model classifies sentences with an accuracy of 89%. Our performance measure also converges strongly with an alternative metric offered by the World Bank’s IEG. And it yields theoretically expected results. We therefore argue that these analyses constitute a suitable validation under the content, convergent, and construct validation framework utilized in previous research (Adcock & Collier, 2001; Lührmann et al., 2020; Weidmann & Schutte, 2017). We maintain our initial proposition on the text-based performance measure, stating that the more positive assessments an evaluation report contains, the more positive the performance of the evaluated IO activity. An advantage compared to existing approaches is that our measure takes the full evaluation report into consideration, and it produces a continuous rather than a categorical measure.

Although the language model classifies sentences with a very high accuracy and although the report-level measure correlates highly with alternative performance scores, there are remaining limitations.

First, the process of generating the share of positive assessments has several opportunities for introducing error, such as when sentences are separated incorrectly or inaccurately classified. This could lead to particularly positive or negative parts of text being subsumed in other text or just going undetected. Moreover, there is currently no weighting process applied to the sentences. A long sentence that contains lots of comments on major aspects of a project contributes as much weight to the positive assessment share as a short sentence that pertains to something of minor significance to the overall project. Future refinements of the language model could however improve its accuracy, for instance by weighting sentences.

Secondly, regarding convergent validation, the positive assessment shares of individual reports overlap to some extent across IEG rating levels, causing a gap in the correlation. However, this gap is reduced once control variables are added to the model. Hence, it might emerge due to certain biases in how the IEG assigns scores to its project evaluation reports (Malik & Stone, 2018; see also Weaver, 2010), although future research should scrutinize this claim. Nevertheless, considering that the positive assessment shares are distributed normally around means that correlate well with the IEG measure, we deem the underlying convergent validation unaffected.

A third question concerns the extent to which our measure applies to unseen evaluation reports. The language model was trained on the UN reports (excluding the World Bank), and we show that it also performs well in predicting the World Bank’s IEG outcome ratings. This suggests that there is sufficient consistency between the language used by the IEG and UN evaluators. However, these individuals form a relatively homogenous epistemic community of experts who work on evaluation in international politics. There may be other evaluation cultures in other types of organizations or policy fields. While we are optimistic that generalization is possible, at least as long as evaluations follow the UNEG (or OECD-DAC) criteria, further research is needed to substantiate this claim.

Given these remaining limitations, we suggest that our text-based performance score should not necessarily be understood as superior to previous numeric performance scores, such as those of the IEG. In that sense, we do not claim that it captures the ‘true value’ of the latent concept of performance more accurately than a human coder does. However, given the way it is constructed, based on full evaluation reports and a classification of sentence-level assessments, it offers an alternative data source or, when no performance ratings exist (e.g., for the nine UN system IOs in our dataset), a valid novel source on the performance of IO activities. The key advantage is that it provides a highly reliable and replicable measure that can be applied consistently to any evaluation report (bearing in mind the above limitations). It ensures transitivity, meaning that a report containing more positive assessment sentences on an activity is also rated higher on the performance metric than a report with fewer positive assessment sentences. And our measure is scalable, enabling researchers to easily expand the existing IOEval dataset with new evaluation reports.

7 Conclusion: Areas for application and future research

As its main contribution, this article develops an original method to extract performance information from text-based evaluation reports by classifying sentence-level assessments as positive, negative, or neutral, and by calculating the share of positive assessments per evaluation report. We demonstrated this method’s validity for IOs in the UN system by means of content, convergent, and construct validation. Moreover, we publish a novel dataset of IO evaluation reports, with performance measures on 1,082 evaluated project, program, thematic, and institutional activities from nine UN system IOs. It contains cleaned text at the level of close to one million sentences, as well as the probability values for our classification of sentences as positive or negative assessments, or neutral descriptive information. We also publish the language model, which means that anybody can expand the IOEval dataset.

This offers a range of exciting opportunities for future research and practitioner application. First, the dataset and the model enhance our ability to study the causes of IO performance. Existing studies have already pointed to relevant performance-affecting factors like transparency (Honig et al., 2022; Marchesi & Masi, 2021), level of control and autonomy (Honig, 2019; Lall, 2017), unilateral donor influence (Watkins, 2022), or decentralization of IO staff (Honig, 2020) and their competence (Bulman et al., 2017; Heinzel, 2022; Heinzel & Liese, 2021). However, these studies focus mainly on the field of international and bilateral development assistance, broadly defined. Fewer insights exist for organizations in other fields, such as humanitarian aid, health, and social policy. By treating our model’s performance score as a dependent variable, scholars can explore factors explaining the successes and failures of IO activities for an extended set of organizations and policy fields. Furthermore, in addition to project performance scores, reports in the IOEval dataset also cover activities at the program, institutional, and thematic level. Factors explaining performance can be explored using our metadata variables, which account for a range of contextual factors (year, type of activity, and report level: country, regional, or global).

New insights can also be gained by applying additional methods of (computational) text analysis to our text corpus. Keyword-in-context or topic modelling analyses are just some of the simpler methods that enable extracting information on the factors that account for IO performance (see, for instance, Cormier & Manger, 2022). Especially for the work of the UN system IOs at the country level, this presents a rich source of data for comparative analyses.

To be sure, our data allows comparing performance tendencies both within and between IOs, yet this should be done with caution. So far, we lack information on the underlying evaluation case selection. While project and program evaluations are oftentimes part of the regular project management life-cycle, organizations may commission institutional or thematic evaluations precisely for areas where their performance is weaker. The evaluation data should therefore not be used at the aggregate IO level in the sense of a general IO performance score. Comparisons between organizations are hence possible for certain types of activities and only when the lack of information about evaluation case selection is acknowledged as a limitation.

Second, the dataset and the model also allow studying the consequences of IO performance. For example, existing research has asked how IOs affect states’ domestic spending (Stubbs et al., 2020), conflict recurrence, or economic recovery (Flores & Nooruddin, 2009). Treating our performance score as an independent variable, scholars could similarly explore further impacts of IO performance, for instance regarding IO funding patterns (see Patz & Goetz, 2019) or IO survival and member state contestation (see Borzyskowski & Vabulas, 2019; Eilstrup-Sangiovanni, 2020).

Lastly, policymakers, too, can use the novel text-based performance metric. To the extent that their organization uses quantitative scores, they can investigate potential gaps between performance metrics as an additional quality check. But as we argue above, most IOs do not even have such performance information. In these cases, our model could be used to identify outlier reports, for example, extremely negatively or positively evaluated activities. Such insights should contribute to learning and accountability for evaluation systems in IOs. After all, the average evaluation in the UN system costs around 500,000 USD (OIOS, 2019) which certainly warrants some attention as to the quality and impact of their findings.

As Gutner and Thompson (2010: 234) write, “understanding and explaining the performance of international organizations is uniquely difficult – and uniquely interesting”. We thus hope that the introduced dataset and the language model algorithm – both of which we make available to practitioners and the academic community – help to overcome at least some of these difficulties.

8 Author contribution statement

Steffen Eckhard and Vytautas Jankauskas led the development of the research design, conceptualization, and theory (50/50%). Elena Leuschner and Ian Burton led the text analysis (50/50%). Tilman Kerl and Rita Sevastjanova led the training of the BERT language model (50/50%). The order of authors reflects the significance of the authors’ contributions.

9 Data availability statement

All data generated and analyzed during the current study, including the language model developed as part of this study, are available at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/0SI2VX.