
Taking PISA Seriously: How Accurate are Low-Stakes Exams?

Journal of Labor Research

Abstract

PISA is seen as the gold standard for evaluating educational outcomes worldwide. Yet, because it is a low-stakes exam, students may not take it seriously, resulting in downward-biased scores and inaccurate rankings. This paper provides a method to identify and account for non-serious behavior in low-stakes exams by leveraging information in the computer-based assessments of PISA 2015. Our method corrects for non-serious behavior by fully imputing scores for items not taken seriously. We compare the scores and rankings calculated by our method to those calculated by giving zero points to skipped items, as well as to those calculated by treating skipped items at the end of the exam as if they were not administered, which is the procedure followed by PISA. We show that a country can improve its ranking by up to 15 places by encouraging its own students to take the exam seriously, and that the PISA approach corrects for only about half of the bias generated by non-seriousness.


Fig. 1

Note: Data Source: 2015 PISA Cognitive item dataset. Time spent on each question (by all students who face the question and spend some time on it, whether or not they answer it) is standardized to have mean zero and variance one. For each position in a cluster, the median standardized time of the questions in that position is calculated. The y-axis depicts the median standardized time spent on items at each position

Fig. 2

Note: Data Source: 2015 PISA Cognitive item dataset. The score on each question, 0, 0.5 or 1, is standardized so the overall score has mean zero and variance one. Items that are not reached or are missing are dropped from the sample; no-response items are assigned a score of 0. For each position in a cluster, the average standardized score of all questions in that position is calculated. The y-axis depicts the mean standardized score of items at each position
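For concreteness, the computations behind Figs. 1 and 2 can be sketched in a few lines of pandas; the data frame and column names below are illustrative stand-ins, not PISA's actual variable names.

```python
import pandas as pd

# One row per (student, item) event; toy values, illustrative column names.
df = pd.DataFrame({
    "position": [1, 2, 1, 2, 3, 3],                    # order within the cluster
    "time":     [45.0, 80.0, 50.0, 70.0, 30.0, 42.0],  # seconds spent on the item
    "score":    [1.0, 0.0, 0.5, 1.0, 0.0, 1.0],        # 0, 0.5 or 1
})

# Standardize each variable to mean zero and variance one overall.
for col in ["time", "score"]:
    df[col + "_std"] = (df[col] - df[col].mean()) / df[col].std()

# Fig. 1: median standardized time by position within a cluster.
print(df.groupby("position")["time_std"].median())

# Fig. 2: mean standardized score by position within a cluster.
print(df.groupby("position")["score_std"].mean())
```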

Fig. 3

Note: Data Source: 2015 PISA Cognitive item dataset. The residual time spent for each student and question is obtained by regressing time spent on each item on the type of question (multiple choice or open-ended), the position within a cluster, and the position of the cluster, and taking the residuals. Time spent is conditional on having answered the question. The y-axis depicts the mean residual time plotted against the difficulty of the items, measured by the fraction of students who answered the question correctly. The green line is for non-serious students and the red line is for serious students

Fig. 4

Note: Data Source: 2015 PISA Cognitive item dataset. The residual time spent for each student and question is obtained by regressing time spent on each item on the type of question (multiple choice or open-ended), the position within a cluster, and the position of the cluster, and taking the residuals. Time spent is conditional on having answered the question. The y-axis depicts the mean residual time plotted against the difficulty of the items, measured by the fraction of students who answered the question correctly. The red line is for serious students, while the black line is for non-serious students who satisfy criterion 3, i.e., missing-item students

Fig. 5

Note: Data Source: 2015 PISA Cognitive item dataset. The residual time spent for each student and question is obtained by regressing time spent on each item on the type of question (multiple choice or open-ended), the position within a cluster, and the position of the cluster, and taking the residuals. Time spent is conditional on having answered the question. The y-axis depicts the mean residual time plotted against the difficulty of the items, measured by the fraction of students who answered the question correctly. The red line is for serious students, while the black line is for non-serious students excluding missing-item students
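The residualization described in the notes to Figs. 3-5 amounts to one regression and a residual extraction; a minimal sketch, assuming illustrative column names rather than the actual PISA variable names:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy item-level data; `open_ended` is 1 for open-ended items, 0 for multiple choice.
df = pd.DataFrame({
    "time":        [45.0, 80.0, 50.0, 70.0, 30.0, 42.0, 55.0, 61.0],
    "open_ended":  [0, 1, 0, 1, 0, 0, 1, 0],
    "position":    [1, 2, 3, 4, 1, 2, 3, 4],   # position within the cluster
    "cluster_pos": [1, 1, 1, 1, 2, 2, 2, 2],   # position of the cluster in the exam
})

# Regress time spent on question type, within-cluster position and cluster
# position, then keep the residuals, as in the figure notes above.
fit = smf.ols("time ~ open_ended + C(position) + C(cluster_pos)", data=df).fit()
df["time_residual"] = fit.resid
```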

Fig. 6

Note: The top panel plots the scatter plot and regression line of ln(ans) against ln(yns), showing the contribution of non-serious students’ ability to the increased fraction correct. The middle panel plots the relationship between ln(ens) and ln(yns) for every country, showing the contribution of the extent of non-seriousness. The bottom panel plots the relationship between ln(pns) and ln(yns) for every country, showing the contribution of the proportion of non-serious students


Notes

  1. Other well-known low-stakes tests include the Trends in International Mathematics and Science Study (TIMSS) and the Progress in International Reading Literacy Study (PIRLS). PISA assesses whether students can apply what they have learned to solve “real world” problems. PIRLS and TIMSS are grade-based (4th and 8th graders) and curriculum oriented.

  2. Previous work in this field (Zamarro et al. 2019; Huang et al. 2012) has used the term “careless answering/responding” instead of “non-seriousness”.

  3. One item is one question; we use the words “item” and “question” interchangeably in the paper.

  4. See the article in The Guardian, May 6, 2014, entitled “OECD and Pisa tests are damaging education worldwide-academics”, Retrieved from the following link: https://www.theguardian.com/education/2014/may/06/oecd-pisa-tests-damaging-education-academics

  5. See the article in the National, Sept 25, 2017, entitled “Abu Dhabi pupils prepare for Pisa 2018”, Retrieved from the following link: https://www.thenational.ae/uae/abu-dhabi-pupils-prepare-for-pisa-2018-1.661627

  6. In the appendix, we provide some data suggesting that the results we see in the science section are likely to be correlated with, and magnified in, the reading and math sections, as non-seriousness seems more prevalent in math and reading than in science.

  7. One might argue that students do not understand that it is better to guess than to skip. However, if this were the only reason for skipping, then skipping behavior should not be related to the position of the item, which it clearly is, as shown below. One might also argue that since this is a computer-based test, students cannot go back to answer skipped items as they might on a paper test; if students do not realize this, they may skip inadvertently. But since students would quickly learn that they cannot go back, even if they did not know it to begin with, skipping should then be less prevalent in the second cluster than in the first. Again, the opposite is true.

  8. Note that the observable characteristics also include several proxies for non-cognitive skills, such as test anxiety and achievement motivation. See Table 16 for the full list of variables used in the imputation.

  9. SENA is short for Skipped at the End Not Administered, which is the procedure followed by PISA.

  10. One of their measures of effort is the extent to which performance falls when a question occurs later in the exam. Another is the extent to which questions are skipped in the survey that students have to fill out, and a third is the extent of carelessness in filling out the survey.

  11. Our estimates below also show that China seems to be less affected by non-serious behavior than the US.

  12. In the 2012 PISA exam, 32 countries/regions were invited to complete both a paper and a computer version of the mathematics test. By 2015, however, 58 countries had moved to a computer-based assessment. Jerrim (2016) and Jerrim et al. (2018) find that taking the PISA exam in a computer-based mode negatively affects students’ performance in many countries.

  13. For countries that choose to implement the assessment of financial literacy, an additional 60 minutes is required.

  14. For more detail, see Chapter 2 of the PISA 2015 Technical Report, OECD (2015b).

  15. One complex multiple choice question includes several yes-or-no questions.

  16. Some countries also administer parent and teacher questionnaires.

  17. Note that in the imputation we impute both multiple choice and open response questions. Our logic is that if a student did not answer a question because he did not know the answer, the imputation procedure is likely to assign that question a score of zero.

  18. There may have been technical issues that prevented them from taking the exam. In any case, there is no way for their responses to be imputed as there is no information.

  19. Roughly 60 minutes are allocated to the two science clusters, which together contain an average of 31 questions.

  20. This is similar to Kuhfeld and Soland (2019) in which a student is flagged as disengaged if over 10% of his or her item responses were rapid.

  21. Note that students who skipped open response questions in the middle of the exam, even when they spent very little time on them, were not classified as non-serious, though they could equally well have been. Such open questions, which are both unanswered and given too little time, account for only 0.7% of all questions, so we are not worried that this will affect our results.

  22. This is indeed an issue, as high-ability students (those with high scores) have a higher fraction correct on too-little-time items than on normal-time ones, while the opposite is true for low-ability students.

  23. To calculate time spent on two clusters, we add time spent in positions 1 and 2, or in positions 3 and 4.

  24. We did not plot time spent on the last 3 items for missing-item students because they miss these items by definition.

  25. Note that students satisfying criterion 3 have on average 15 more minutes left.

  26. To do so we regress time spent on each item on type of question (multiple choice or open ended), position within a cluster and position of the cluster. We then remove the effect of question type, position and cluster to get the residual for each student and question. We plot the residuals for correct and incorrect answers for serious and non-serious students. We do not include individual fixed effects in the regression as we wish to see how serious and non-serious students differ in their responses.

  27. Serious students spend 19.5 minutes per cluster while non-serious ones spend 17.8 minutes per cluster.

  28. We impute too-little-time items only for students who satisfy criterion 4.

  29. We include both multiple choice and open response items in criterion 3 (missing items) and criterion 4 (too-little-time items).

  30. This choice is unlikely to make a difference, as only 0.7% of students have less than 1 minute left, and 3% have less than 5 minutes left.

  31. If this were not so, there would be no similar individuals/items/schools to impute from.

  32. In the imputation, we categorize partial credit answers as wrong answers for simplicity. On average, only 8% of the questions on a student’s exam allow partial credit.

  33. There are 36 random numbers in total which determine the specific science clusters assigned to students. Moreover, students have science clusters either in the first two sessions or in the last two sessions. Therefore in total there are 72 groups within which students answer the same questions in the same order.

  34. See page 148 of OECD (2015b), chapter 9.

  35. This is also the practice used by Gneezy et al. (2019).

  36. To quote PISA (page 149 of OECD (2015a)):

    “Omitted responses prior to a valid response are treated as incorrect responses; whereas, omitted responses at the end of each of the two one-hour test sessions in both PBA and CBA are treated as not reached/not administered.”

  37. These numbers differ slightly from those in the original working paper, as we use sampling weights for each student in this version but did not in the earlier one. The ranks do not change across versions.

  38. Recall that non-serious items include non-reached, no-response and missing items, as well as too-little-time items when a student spends too little time on at least three items and the fraction correct on little-time items is lower than on normal-time ones. Here we also include open response items that are non-reached or no-response.

  39. These three regression lines add up to the 45° line.

  40. Imputed number correct is calculated by taking the mean of ten draws of number correct.

  41. Note that they do not account for skipped items in the middle of the exam or for too-little-time items.

  42. We used the 5 questions in ST118 for the anxiety variable and the 5 questions in ST119 for the ambition variable.

  43. We calculate the stakes of standardized tests given in school as follows. In the school questionnaire, school principals were asked whether the school used standardized tests for 11 different purposes. We assign each purpose a stake between 1 and 3 and sum the stakes for each school. We then sort countries by their mean stakes and mark the top 36 countries as high-stakes countries and the remaining 36 as low-stakes ones.

  44. Our results are robust to using fraction correct on items that are answered seriously as a measure of ability.

  45. See column 1 and the row for time on classes and out-of-school science learning.

  46. For each cluster, the value plotted at each level of question difficulty is the mean of the predicted probabilities at that level of difficulty.

References

  • Attali Y, Neeman Z, Schlosser A (2011) Rise to the challenge or not give a damn: Differential performance in high vs. low stakes tests

  • Azmat G, Calsamiglia C, Iriberri N (2016) Gender differences in response to big stakes. J Eur Econ Assoc 14(6):1372–1400

  • Azur M J, Stuart E A, Frangakis C, Leaf P J (2011) Multiple imputation by chained equations: What is it and how does it work? Int J Methods Psych Res 20(1):40–49

  • Baumert J, Demmrich A (2001) Test motivation in the assessment of student skills: The effects of incentives on motivation and performance. Eur J Psychol Educ 16(3):441

  • Borghans L, Schils T (2012) The leaning tower of PISA: Decomposing achievement test scores into cognitive and noncognitive components. Unpublished manuscript

  • Borgonovi F, Biecek P (2016) An international comparison of students’ ability to endure fatigue and maintain motivation during a low-stakes test. Learn Individ Differ 49:128–137

  • Butler J, Adams R J (2007) The impact of differential investment of student effort on the outcomes of international studies. J Appl Measur 8(3):279–304

  • Cole J S, Bergin D A, Whittaker T A (2008) Predicting student achievement for low stakes tests with effort and task value. Contemp Educ Psychol 33(4):609–624

  • Duckworth A L, Quinn P D, Lynam D R, Loeber R, Stouthamer-Loeber M (2011) Role of test motivation in intelligence testing. Proc Natl Acad Sci 108(19):7716–7720

  • Eklöf H (2010) Skill and will: Test-taking motivation and assessment quality. Assess Educ Principles Policy Practice 17(4):345–356

  • Eklöf H, Pavešič B J, Grønmo L S (2014) A cross-national comparison of reported effort and mathematics performance in TIMSS Advanced. Appl Meas Educ 27(1):31–45

  • Finn B (2015) Measuring motivation in low-stakes assessments. ETS Res Rep Ser 2015(2):1–17

  • Gneezy U, List J A, Livingston J A, Qin X, Sadoff S, Xu Y (2019) Measuring success in education: The role of effort on the test itself. Amer Econ Rev Insights 1(3):291–308

  • Hanushek E A, Wößmann L (2006) Does educational tracking affect performance and inequality? Differences-in-differences evidence across countries. Econ J 116(510):C63–C76

  • Hanushek E A, Link S, Woessmann L (2013) Does school autonomy make sense everywhere? Panel estimates from PISA. J Dev Econ 104:212–232

  • Huang J L, Curran P G, Keeney J, Poposki E M, DeShon R P (2012) Detecting and deterring insufficient effort responding to surveys. J Bus Psychol 27(1):99–114

  • Jacob B A (2005) Accountability, incentives and behavior: The impact of high-stakes testing in the Chicago public schools. J Publ Econ 89(5-6):761–796

  • Jalava N, Joensen J S, Pellas E (2015) Grades and rank: Impacts of non-financial incentives on test performance. J Econ Behav Organ 115:161–196

  • Jerrim J (2016) PISA 2012: How do results for the paper and computer tests compare? Assess Educ Principles Policy Practice 23(4):495–518

  • Jerrim J, Micklewright J, Heine J-H, Salzer C, McKeown C (2018) PISA 2015: How big is the ‘mode effect’ and what has been done about it? Oxf Rev Educ 44(4):476–493

  • Krosnick J A, Narayan S, Smith W R (1996) Satisficing in surveys: Initial evidence. New Dir Eval 1996(70):29–44

  • Kuhfeld M, Soland J (2019) Using assessment metadata to quantify the impact of test disengagement on estimates of educational effectiveness. J Res Educ Effect:1–29

  • Kuhfeld M, Soland J (2020) Using assessment metadata to quantify the impact of test disengagement on estimates of educational effectiveness. J Res Educ Effect 13(1):147–175

  • Lavy V (2015) Do differences in schools’ instruction time explain international achievement gaps? Evidence from developed and developing countries. Econ J 125(588):F397–F424

  • Leys C, Ley C, Klein O, Bernard P, Licata L (2013) Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. J Exp Soc Psychol 49(4):764–766

  • Lounkaew K (2013) Explaining urban–rural differences in educational achievement in Thailand: Evidence from PISA literacy data. Econ Educ Rev 37:213–225

  • OECD (2015a) PISA 2015 results (Volume I): Excellence and equity in education. Technical Report, OECD

  • OECD (2015b) PISA 2015 technical report. Technical Report, OECD

  • Penk C, Richter D (2017) Change in test-taking motivation and its relationship to test performance in low-stakes assessments. Educ Assess Eval Account 29(1):55–79

  • Pintrich P R, De Groot E V (1990) Motivational and self-regulated learning components of classroom academic performance. J Educ Psychol 82(1):33

  • Prince Edward Island (2002) Preparing students for PISA (mathematical literacy): Teacher’s handbook. Technical Report, Prince Edward Island

  • Schafer J L, Graham J W (2002) Missing data: Our view of the state of the art. Psychol Methods 7:147–177

  • Schnipke D L, Scrams D J (1997) Modeling item response times with a two-state mixture model: A new method of measuring speededness. J Educ Meas 34(3):213–232

  • Wise S L, DeMars C E (2005a) Low examinee effort in low-stakes assessment: Problems and potential solutions. Educ Assess 10(1):1–17

  • Wise S L, Kong X (2005b) Response time effort: A new measure of examinee motivation in computer-based tests. Appl Meas Educ 18(2):163–183

  • Wise S L (2006a) An investigation of the differential effort received by items on a low-stakes computer-based test. Appl Meas Educ 19(2):95–114

  • Wise S L, DeMars C E (2006b) An application of item response time: The effort-moderated IRT model. J Educ Meas 43(1):19–38

  • Wise S L, Ma L (2012) Setting response time thresholds for a CAT item pool: The normative threshold method. In: Annual meeting of the National Council on Measurement in Education, Vancouver, Canada

  • Wise S L, Soland J, Bo Y (2020) The (non)impact of differential test taker engagement on aggregated scores. Int J Test 20(1):57–77

  • Wolf L F, Smith J K (1995) The consequence of consequence: Motivation, anxiety, and test performance. Appl Meas Educ 8(3):227–242

  • Zamarro G, Hitt C, Mendez I (2019) When students don’t care: Reexamining international differences in achievement and student effort. J Hum Cap 13(4):000–000


Acknowledgments

We are grateful to participants at the Econometric Society World Congress in 2020, the Econometric Society meetings in Shanghai, China in 2018, the International Association of Applied Econometrics Conference in Cyprus in 2019, the Conference of the European Society for Population Economics (ESPE) in Bath, UK in 2019 and the 9th ifo Dresden Workshop on Labor Economics and Social Policy in 2019. We would particularly like to thank Joris Pinkse, Keisuke Hirano, and Kim Ruhl for their comments and suggestions and Meghna Bramhachari for help in proofreading. We owe special thanks to colleagues at the OECD for answering our numerous questions about the data. Huacong Liu was instrumental in our working on this project, and we thank her for all her help. We are responsible for all errors.


Corresponding author

Correspondence to Pelin Akyol.


Appendix

This appendix delves into more detail on a number of peripheral facts and issues. In the first part, we present some non-causal regressions on which students are non-serious and which questions tend to be taken non-seriously.

In the body of the paper we use the data on questions in the Science clusters only. Our reason for doing so is that all students take two Science clusters, but may not be tested in Math or Reading. One might ask whether the patterns in other parts of the exam are similar. As a check, we look at the fraction of non-serious items across subjects in the third part of the Appendix. Their similarity reassures us that our focus on the Science clusters is warranted.

In the fourth part of the Appendix, we discuss in more detail the behavior patterns of serious and non-serious students in terms of time spent and accuracy of response as a function of question position.

In the fifth part, we discuss the exact variables we use in the imputation procedure. In the sixth part, we explain some details behind the decomposition for partially serious students and present the results for them.

A.1 What Drives Being Non-serious?

We have seen in Section 3 that serious and non-serious students behave very differently. The next question is, what factors correlate with being non-serious? We explore this at two levels. First, we look at the correlates of an individual being non-serious; then we look at the correlates of a question not being taken seriously.

Table 7 Summary statistics

A.1.1 Summary Statistics

In this section we define the various factors that are potentially correlated with non-serious behavior. Table 7 gives descriptive statistics for these factors. Scores on the component parts of the exam (reading, math and science) are scaled so that the mean is 500 and the standard deviation is 100 across all OECD countries. Clearly, OECD countries do better than average, as the mean math and science scores overall are 464 and 474, respectively. Students are on average in the 10th grade, and half the students are female. The variable “anxiety” is an index we constructed from the questions on this subject (ranked from 1 to 4 by strength of agreement, where 1 is strongly disagree and 4 is strongly agree) by taking a simple average of the responses. The median is 2.8, suggesting a fair degree of anxiety on the part of students. Similarly for “ambition”, where the median response is 3.2.Footnote 42 The variable skipping class/arriving late adds up the responses to the three questions in ST062 about skipping class, its intensity, and arriving late. A 1 is never in the last two weeks, a 2 is 1 or 2 times, a 3 is 3 or 4 times, and a 4 is 5 or more times. On average, such behavior exists but is not endemic.

The median time spent learning out of school is 16 hours per week, while time spent learning in school is 27 hours per week. Thus students spend more than 40 hours a week on school-related work. The standard deviations are roughly 15 and 11, suggesting that a fair number of students spend well over 60 to 70 hours a week on such work. Standardized test frequency and teacher-developed test frequency are the responses to question SC034. A response of 1 means there were no such tests and a response of 5 means tests were given more than monthly. The median value is 2, i.e., tests were given 1-2 times a year. The variable “stakes of standardized (teacher-developed) tests” comes from the answers to SC035. The question is composed of 11 yes/no sub-questions (a yes is coded as 1 and a no as 0) regarding the purpose of these tests. We label each purpose as low, medium or high stakes for the students, giving it a weight of 1, 2 or 3, respectively. Of the 11 sub-questions, 5 are low, 3 are medium and 3 are high stakes. We then add up these weighted responses to get our index. As the maximum value the index can take is 20, the medians of 10 and 13 suggest the stakes are high, especially for teacher-developed tests.
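As an illustration of the stakes index just described, here is a minimal sketch; the yes/no answers and the assignment of the 11 purposes to low/medium/high stakes are made up, and only the 1/2/3 weighting and the summation mirror the text.

```python
import pandas as pd

# Hypothetical yes/no answers (1/0) to the 11 sub-questions of SC035 for three schools.
answers = pd.DataFrame(
    [[1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0],
     [0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1],
     [1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1]],
    columns=[f"purpose_{k}" for k in range(1, 12)],
)

# 5 low-stakes purposes (weight 1), 3 medium (weight 2), 3 high (weight 3);
# the maximum attainable index is therefore 5*1 + 3*2 + 3*3 = 20.
weights = pd.Series([1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3], index=answers.columns)

stakes_index = answers.mul(weights, axis=1).sum(axis=1)
print(stakes_index)  # one index value per school
```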

A.1.2 Who is Non-serious?

The factors that correlate with a student being non-serious are explored in Table 8. Column 1 shows the results for all countries. The dependent variable is 1 if the student is non-serious. In columns 1 to 3, being non-serious is defined as meeting at least one of criteria 1, 2 or 4; in column 4, it is defined as meeting criterion 3. We make this distinction because the patterns explored in the previous section differ across these two groups. We also look separately at high-stakes countries, ones where the standardized tests given in school are high stakes,Footnote 43 and at low-stakes countries, as the patterns in the two might differ. If, for example, students are fed up with exams in high-stakes countries but not in low-stakes countries, we might expect a higher probability of being non-serious in PISA in high-stakes countries. One might want to run these regressions country by country, but with 58 countries this would be overkill, as it is not the main object of this paper. Also note that we are not claiming any causal effects, merely pointing out some correlations in the data. A sketch of this type of regression appears below.
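The regressions in Table 8 take roughly the following form; the sketch below is a toy linear probability model with placeholder regressors and made-up data, not our exact specification.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy student-level data; the variable names stand in for the regressors
# discussed in the text (math score, ESCS, grade, gender, and so on).
students = pd.DataFrame({
    "non_serious": [1, 0, 0, 1, 0, 1, 0, 0],
    "math_score":  [420.0, 510.0, 560.0, 450.0, 600.0, 430.0, 530.0, 490.0],
    "escs":        [0.3, -0.5, 1.1, 0.8, 0.2, -1.0, 0.0, 0.6],
    "female":      [0, 1, 1, 0, 1, 0, 0, 1],
    "grade":       [10, 10, 11, 9, 10, 9, 11, 10],
})

# Linear probability model of being non-serious on student characteristics,
# with heteroskedasticity-robust standard errors.
lpm = smf.ols("non_serious ~ math_score + escs + female + grade",
              data=students).fit(cov_type="HC1")
print(lpm.summary())
```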

Table 8 Factors related to being non-serious

To begin with, we ask whether better students are more or less likely to be non-serious. Columns 1-3 suggest that higher math scores (a proxy for ability) are associated with being less likely to be non-serious, except when we use criterion 3, suggesting that students with missing items are a different breed.Footnote 44 Students with high socioeconomic status (ESCS) and in lower grades are more likely to be non-serious; again, the signs in column 4 are reversed. This suggests that poor but able students are non-serious in the sense of criterion 3, while the rest are non-serious in the sense of criteria 1, 2 or 4.

Students from richer countries are more likely to be non-serious, though the relationship is an inverted U with a turning point at about $33,000 in per capita GDP. This pattern is again reversed in column 4, where the relationship is U-shaped with a turning point at about $38,500.

Gender matters: women are less likely to be non-serious in columns 1-3, but more likely to be non-serious (by quitting in the middle of the exam) in column 4, suggesting that women “blow off” the exam in different ways than men. As might be expected, being anxious or ambitious is associated with being less likely to be non-serious, while being undisciplined, i.e., having a pattern of skipping class or arriving late, is associated with being non-serious.

One might speculate that students who are over-worked and over-tested, especially with high-stakes exams, suffer test fatigue and passively resist taking yet another test, and are therefore more likely not to take PISA seriously. There is some evidence in favor of this. First, countries with high-stakes exams do seem to make students work harder: on average, students spend 1.3 hours more per week in class and 3.1 hours more on out-of-school learning in all subjects in high-stakes countries relative to low-stakes ones. Working harder seems to be associated with not taking PISA seriously. In column 1, spending more time on studies out of school is significant for all countries together, but the effect seems to come from high-stakes countries. Time spent in school is positive but not significant for all countries together, but is significant for low-stakes countries.Footnote 45 Having more tests (standardized or teacher-developed) does correlate positively with being non-serious overall, though the coefficients are not significant. This might be because the effects differ in high-stakes and low-stakes countries: having more standardized tests raises the likelihood of being non-serious in high-stakes countries (column 2) but does the opposite in low-stakes ones (column 3).

When teacher-developed tests are given, raising the stakes seems to make students more likely to be serious, not less, suggesting that such testing may be less likely to result in test fatigue. Students from better schools, as reflected in the log of the school science score, are also less likely to be non-serious in low-stakes countries, but more likely to be non-serious in high-stakes countries. This makes sense if better schools push students harder in high-stakes countries, resulting in fatigue. In Tables 9 and 10, we present correlates of each non-seriousness criterion for cutoff levels of 10% (as defined in Section 3) and 6%, respectively. The results are consistent across cutoff levels.

Our results here should be seen as preliminary as there is no causation implied, merely correlation. The patterns described above are suggestive and might be worth exploring in future work.

Table 9 Factors related to being non-serious for each criterion (for cutoff level of 10%)
Table 10 Factors related to being non-serious for each criterion (for cutoff level of 6%)

A.1.3 Which Questions are Not Taken Seriously?

We define non-serious questions as those that were not reached, received no response, were missing, or on which too little time was spent. Non-reached and no-response items were looked at by the student, who then chose not to answer despite having time left. Had he taken the exam seriously, he would have answered to the best of his ability. The student did not even look at missing items, despite having time left; in general, students have ample time to finish the exam, and not even bothering to read a question is again an indication of non-seriousness. One might argue that no-response items, i.e., those skipped in the middle of the exam, should be treated differently because this was a computer-based exam and students could not go back. Assuming they knew this, choosing to skip again indicates the question was not taken seriously. Questions on which too little time was spent (as explained in criterion 4 for defining non-serious students) are those where the response time is below a country-specific threshold and where the proportion correct is lower than that on normal-time items for the same person. The latter condition prevents us from mistakenly labeling a question as non-serious when the student in fact knew the answer immediately and therefore spent little time on it.
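Our reading of the too-little-time rule can be summarized in a short function; the column names, toy data and exact details below are assumptions for illustration, with only the three-item minimum and the fraction-correct comparison taken from the text.

```python
import pandas as pd

def too_little_time_items(items: pd.DataFrame, threshold: float) -> pd.Series:
    """Flag a student's too-little-time items (a sketch of criterion 4).

    `items` holds one student's answered items with columns `time` (seconds)
    and `correct` (0/1); `threshold` is the country-specific response-time
    cutoff. Fast items count as non-serious only if there are at least three
    of them and their fraction correct is below that of normal-time items.
    """
    fast = items["time"] < threshold
    if fast.sum() < 3:
        return pd.Series(False, index=items.index)
    if items.loc[fast, "correct"].mean() < items.loc[~fast, "correct"].mean():
        return fast
    return pd.Series(False, index=items.index)

# Example: three rushed items answered worse than the rest are flagged.
student = pd.DataFrame({"time":    [5.0, 4.0, 6.0, 60.0, 75.0, 50.0],
                        "correct": [0,   0,   1,   1,    1,    0]})
print(too_little_time_items(student, threshold=10.0))
```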

We explore the effects of question characteristics on the probability of a question being skipped, i.e., being non-reached or no-response, and likewise on the probability of too little time being spent on a question. In both cases we run a linear probability model with individual fixed effects as well as question characteristics (see the sketch below). Figure 7 shows the predicted probability of skipping a question and of spending too little time on a question for each cluster as a function of the difficulty of the question.Footnote 46 In all clusters, as the difficulty of the question increases, the probability of skipping increases, though there is a slight decrease as questions become very difficult (top panel). In the bottom panel, the probability of spending too little time is roughly flat: first increasing, then decreasing, and finally increasing again. Students seem to try to answer when the question is easy, but as it gets difficult, they seem to give up. There are also differences between clusters: consistent with the “fatigue” hypothesis, questions are more likely to be taken non-seriously in the second and fourth clusters.
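The fixed-effects linear probability model can be estimated via the within transformation; a minimal sketch with toy data and a single illustrative regressor:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy item-level data; `skip` is 1 if the question was skipped.
df = pd.DataFrame({
    "student":    [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "skip":       [0, 0, 1, 0, 1, 1, 0, 0, 0],
    "difficulty": [0.2, 0.5, 0.8, 0.3, 0.6, 0.9, 0.1, 0.4, 0.7],
})

# Individual fixed effects via the within transformation: demean the outcome
# and the regressors by student, then run OLS (no intercept) on the result.
cols = ["skip", "difficulty"]
demeaned = df[cols] - df.groupby("student")[cols].transform("mean")
fe_lpm = smf.ols("skip ~ difficulty - 1", data=demeaned).fit()
print(fe_lpm.params)
```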

Fig. 7

Pr(skip) and Pr(spend too little time) w.r.t. cluster and difficulty. Predicted probabilities are obtained from a linear probability model with individual fixed effects as well as question characteristics such as cluster, sequence, difficulty and the type of the question. For each cluster, the predicted probability at each level of question difficulty takes the mean value of the predicted probability at that level of difficulty. In the figure, lowess-smoothed lines are presented

Table 11 Factors affecting Pr(Skip) and Pr(Spend too little time) (Individual Characteristics)

In Fig. 8, we explore whether question type affects the probability of skipping or of spending too little time as a function of question order. For all questions, the probability of skipping rises with order, or sequence, within a cluster, jumps down at the beginning of a new cluster, and more so after the break, which is consistent with “fatigue”. The curve for complex multiple choice questions lies between those for open response and simple multiple choice questions. This makes sense, as it is easy to guess an answer to a simple multiple choice question, so such questions are less likely to be skipped.

Fig. 8

Pr(skip) and Pr(spend too little time) w.r.t. sequence and the type of the question. Predicted probabilities are obtained from a linear probability model with individual fixed effects as well as question characteristics such as cluster, sequence, difficulty, and the type of the question. For each question type, the mean value of the predicted probability at each order is presented

Non-serious behavior in terms of spending too little time falls weakly with order within a cluster for all question types. However, there is a large jump up at the beginning of the second and fourth clusters. This pattern suggests that, for open response questions at least, students substitute towards skipping as the exam proceeds, with a reset at each new cluster. Hence we see a fall with sequence within a cluster and a jump up in each new cluster. While skipping is more likely for open response questions, spending too little time is less likely for them relative to other question types.

To understand the effects of individual characteristics on the probability of skipping or of spending too little time, we regress the estimated individual fixed effects from our linear probability model on individual characteristics; see Table 11. The results are in line with those in Table 8.

So far we have run the choice regressions as if the choices were independent. However, the appropriate model is a multinomial choice model, as the student has three mutually exclusive and exhaustive options for each question: skip, answer with too little time, or answer with normal time. We used the linear probability model because it allowed us to incorporate individual fixed effects, which we could not do with a logit. With a logit, we can control for individual characteristics, but as we are unlikely to have information on all relevant characteristics, we might have omitted variable bias.
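A sketch of such a three-alternative model, using statsmodels' MNLogit on synthetic data (the regressor names are placeholders; the baseline alternative is answering with normal time):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Outcome coding: 0 = answer with normal time (baseline), 1 = skip,
# 2 = answer with too little time. All values below are synthetic.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "difficulty": rng.uniform(0, 1, n),
    "position":   rng.integers(1, 18, n),
    "open_ended": rng.integers(0, 2, n),
})
choice = rng.integers(0, 3, n)

mnl = sm.MNLogit(choice, sm.add_constant(X)).fit(disp=False)
print(mnl.params)  # one coefficient column per non-baseline alternative
```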

Table 12 presents the results of a logit regression where the baseline choice is spending normal time answering the item. In the regression, we control for the question characteristics and the individual characteristics used in the previous tables. The first and second columns show the factors affecting the probability of skipping and the probability of spending too little time, respectively. The position within a cluster is positively correlated with the probability of skipping and negatively correlated with the probability of spending too little time, consistent with students switching from spending too little time to skipping as the exam progresses. A question in the second, third or fourth cluster is more likely to be skipped than one in the first cluster, and this likelihood is much higher in the second and fourth clusters, as they are the last clusters in each science session. Open response and complex multiple choice questions move students towards skipping and away from spending too little time. As the difficulty of a question increases, students become more likely both to skip it and to spend too little time on it. The coefficients on individual characteristics are roughly in line with those in Table 8. The math score of the student is negatively correlated with both the probability of skipping and the probability of spending too little time. Female students are less likely to skip or to spend too little time, and ambitious students are less likely to skip. Consistent with our previous findings, students from richer countries are more likely to skip and to spend too little time, though the relationship is an inverted U with a turning point at about $43,000 in per capita GDP. We control for standardized and teacher-developed test frequencies to investigate whether students are fed up with testing and as a result do not take PISA seriously. We find that as the frequency of standardized tests increases, students’ likelihood of skipping and of spending too little time increases significantly, which is consistent with the “fatigue” effect. Teacher-developed tests do the exact opposite, suggesting that students view the two kinds of tests very differently.

Table 12 Factors affecting Pr(Skip) and Pr(Spend too little time) (Logit results)

A.2 Fraction of Non-serious items Across Subjects

Table 13 shows the fraction of no-response items and the fraction of non-reached items for science, reading and math. The fractions of no-response items for the reading and math tests are a bit higher on average than for science. Moreover, the fractions of no-response and non-reached items are highly correlated across subjects. For example, the correlation between the fraction of no-response items in science and in reading is 0.98, showing that non-seriousness is common across the subjects of the test, as might be expected.

Table 13 Fraction of non-serious items across subjects
Table 14 Time per science cluster (minutes)

A.3 Time Spent, Accuracy and Position

Table 14 shows time per science cluster across positions for serious and non-serious students. Time spent on a cluster falls with the position of the cluster and then jumps back up after the break at the end of cluster 2, more so for non-serious students. There is substantial heterogeneity among non-serious students according to the criterion used. Students with no-response or too-little-time items, not surprisingly, spend less time per cluster than serious students regardless of cluster position. The opposite holds for those with non-reached or missing items, but only for the first and third clusters; for the second and fourth clusters, their time spent is 30-40% less than that of serious students. It is also worth noting that for these students time is still not a constraint: on average they have more than 15 minutes left. This suggests that “fatigue” sets in faster for non-serious students.

The upper part of Table 15 shows the proportion correct on all items (not just answered ones) across positions. Serious students have a higher proportion correct than each category of non-serious students. Accuracy falls in the second cluster compared to the first, more so for non-serious students, reminiscent of the patterns for time spent. However, non-serious students will have a lower proportion correct on all items by definition, as they skip many items. To measure their accuracy, we should instead divide by the number of answered questions, as done in the lower part of Table 15. The numbers show that even with this correction non-serious students have lower accuracy than serious ones. In addition, the fall in accuracy across clusters is now similar (around 2%) for both serious and non-serious students. This is consistent with non-serious students’ performance dropping substantially in the second cluster primarily because they skip more items there.

Table 15 Proportion correct in science clusters

A.4 Variables Used in Imputation

The PISA data contain a rich array of information from the student and school questionnaires in the survey. In the imputation we use variables constructed from these surveys by PISA, choosing the variables that seem relevant. A list of the variables used is given in Table 16. Binary variables are clearly identified; all others are continuous indices. Details are available in the PISA technical report, OECD (2015b), Chapter 16. The imputation also uses the individual’s scores on all other items and other students’ scores on all items, as in standard MICE imputations. We also include country fixed effects in the imputations.
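For readers who want to replicate the spirit of the procedure, a chained-equations imputation can be run in a few lines; the sketch below uses scikit-learn's IterativeImputer on toy data and omits the refinements described in the paper (categorical item scores, ten draws, country fixed effects).

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Rows are students; columns are item scores plus a background index.
# np.nan marks the non-serious items to be imputed. Toy values only.
scores = np.array([
    [1.0, 0.0,    1.0,    0.8],
    [0.0, np.nan, 1.0,   -0.3],
    [1.0, 1.0,    np.nan, 1.1],
    [0.0, 0.0,    0.0,   -0.9],
])

# Chained-equations imputation in the spirit of MICE (Azur et al. 2011).
imputer = IterativeImputer(sample_posterior=True, random_state=0)
print(imputer.fit_transform(scores).round(2))
```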

Table 16 Variables used in imputation

A.5 Decomposition for Partially Serious Students

We call fully serious those students who neither skip items nor spend too little time on any item. These fully serious students, together with what we call partially serious students, make up what we have termed serious students. For fully serious students, the number correct is the same before and after imputation by definition. The increase in fraction correct for serious students (Ys) therefore comes only from imputations for partially serious students, who skipped a few items or spent too little time on a small enough number of items that they were not classified as non-serious. Let PS denote the set of partially serious students. Next we decompose Ys into its component parts.

$$ \begin{aligned} Y_{s} &= \frac{\sum_{i\in S} I_{i}}{\sum_{i\in S\cup NS} T_{i}} \\ &= \frac{\sum_{i\in PS} I_{i}}{\sum_{i\in PS} NI_{i}} \cdot \frac{\sum_{i\in PS} NI_{i}}{\sum_{i\in PS} T_{i}} \cdot \frac{\sum_{i\in PS} T_{i}}{\sum_{i\in S\cup NS} T_{i}} \\ &= A_{ps} E_{ps} P_{ps} \end{aligned} $$

Aps is the increase in the fraction correct for non-serious items among partially serious students. Eps is the fraction of non-serious items among all items for partially serious students, which measures the degree of non-seriousness. Pps approximately measures the proportion of partially serious students in a country as partially serious students on average have the same number of total items as other students. The values of Yps, Aps, Eps and Pps for each country are provided in Table 17.

Table 17 Decomposed factors for partially serious students

Similar to the decomposition for non-serious students, we divide each factor by its geometric mean across countries and get

$$ y_{ps}=\frac{Y_{ps}}{\bar{Y}_{ps}}=\left( \frac{A_{ps}}{\bar{A}_{ps}}\right) \left( \frac{E_{ps}}{\bar{E}_{ps}}\right) \left( \frac{P_{ps}}{\bar{P}_{ps}} \right) =a_{ps}e_{ps}p_{ps} $$
(7)

Taking the logarithm of both sides of Eq. 7 gives:

$$ \ln (y_{ps})=\ln a_{ps}+\ln e_{ps}+\ln p_{ps} $$
(8)

Next we regress \(\ln a_{ps}\), \(\ln e_{ps}\) and \(\ln p_{ps}\) separately on \(\ln y_{ps}\), that is,

$$ \begin{aligned} \ln a_{ps} &= \alpha_{2}\ln y_{ps}+\epsilon_{a} \\ \ln e_{ps} &= \beta_{2}\ln y_{ps}+\epsilon_{e} \\ \ln p_{ps} &= \gamma_{2}\ln y_{ps}+\epsilon_{p}. \end{aligned} $$

Let the OLS estimates be denoted \(\hat{\alpha}_{2}\), \(\hat{\beta}_{2}\), \(\hat{\gamma}_{2}\). As for non-serious students, one can show that \(\hat{\alpha}_{2}+\hat{\beta}_{2}+\hat{\gamma}_{2}=1\), and these coefficients measure the contributions of partially serious students’ ability, extent of non-seriousness and proportion to a country’s increase in fraction correct. Figure 9 plots the scatter plots and regression lines for partially serious students.
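The adding-up property is easy to verify numerically: since \(\ln y_{ps}\) is the sum of its three components, the no-intercept OLS slopes must sum to exactly one. A short sketch with simulated country-level data:

```python
import numpy as np

# Simulate ln(a), ln(e), ln(p) for 50 countries and build ln(y) as their sum.
rng = np.random.default_rng(1)
ln_a, ln_e, ln_p = rng.normal(size=(3, 50))
ln_y = ln_a + ln_e + ln_p

def slope(x: np.ndarray, y: np.ndarray) -> float:
    """OLS slope of a regression of x on y without an intercept."""
    return float(x @ y / (y @ y))

alpha2, beta2, gamma2 = (slope(v, ln_y) for v in (ln_a, ln_e, ln_p))
# OLS is linear in the regressand, so the three slopes sum to one exactly
# (up to floating-point error).
print(alpha2 + beta2 + gamma2)  # ~1.0
```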

Fig. 9

yps Versus its Components for Partially Serious Students


Cite this article

Akyol, P., Krishna, K. & Wang, J. Taking PISA Seriously: How Accurate are Low-Stakes Exams? J Labor Res 42, 184–243 (2021). https://doi.org/10.1007/s12122-021-09317-8
